


Salient Object Detection: A Discriminative Regional Feature Integration Approach

Huaizu Jiang, Zejian Yuan, Ming-Ming Cheng, Yihong Gong, Nanning Zheng, and Jingdong Wang

Abstract—Salient object detection has been attracting a lot of interest, and recently various heuristic computational models have been designed. In this paper, we formulate saliency map computation as a regression problem. Our method, which is based on multi-level image segmentation, uses a supervised learning approach to map the regional feature vector to a saliency score. Saliency scores across multiple levels are finally fused to produce the saliency map. The contributions are two-fold. One is that we propose a discriminative regional feature integration approach for salient object detection. Compared with existing heuristic models, our proposed method is able to automatically integrate high-dimensional regional saliency features and choose discriminative ones. The other is that, by investigating standard generic region properties as well as two widely studied concepts for salient object detection, i.e., regional contrast and backgroundness, our approach significantly outperforms state-of-the-art methods on six benchmark datasets. Meanwhile, we demonstrate that our method runs as fast as most existing algorithms.


1 INTRODUCTION

Visual saliency has been a fundamental problem in neuroscience, psychology, neural systems, and computer vision for a long time. It was originally defined as the task of predicting eye fixations on images [2]. Recently it has been extended to identifying a region [3], [4] containing the salient object, known as salient object detection or salient region detection. Applications of salient object detection include object detection and recognition [5], [6], image compression [7], image cropping [8], photo collage [9], [10], dominant color detection [11], [12], and so on.

The study of human visual systems suggests that saliency is related to the uniqueness, rarity, and surprise of a scene, characterized by primitive features like color, texture, shape, etc. Recently a lot of effort has been made to design various heuristic algorithms to compute saliency [13]–[21]. Built upon the feature integration theory [2], [22], almost all of these approaches compute conspicuity (feature) maps from different saliency cues and then combine them to form the final saliency map. Hand-crafted integration rules, however, are fragile and generalize poorly. For instance, in a recent survey [23], none of the algorithms consistently outperforms the others on the benchmark data sets. Though some learning-based salient object detection algorithms have been proposed [19], [24], [25], the potential of supervised learning has not been deeply investigated.

• H. Jiang, Z. Yuan, Y. Gong and N. Zheng are with Xi'an Jiaotong University. M.M. Cheng is with Oxford University. J. Wang is with Microsoft Research Asia.

• A preliminary version of this work appeared at CVPR [1].
• Project website: jianghz.com/drfi.

In this paper, we formulate salient object detection as a regression problem, learning a regressor that directly maps the regional feature vector to a saliency score. Our approach consists of three main steps. The first is multi-level segmentation, which decomposes the image into multiple segmentations. Second, we conduct a region saliency computation step with a Random Forest regressor that maps the regional features to a saliency score. Last, a saliency map is computed by fusing the saliency maps across multiple levels of segmentation.

The key contributions lie in the second step, region saliency computation. First, unlike most existing algorithms that heuristically compute saliency maps from various features and combine them to get the final saliency map, which we call saliency integration, we learn a Random Forest regressor that directly maps the feature vector of each region to a saliency score, which we call discriminative regional feature integration (DRFI). This is a principled approach in image classification [26], but rarely studied in salient object detection. Second, by investigating standard generic region properties and two widely studied concepts in salient object detection, i.e., regional contrast and backgroundness, our proposed approach consistently outperforms state-of-the-art algorithms on all six benchmark data sets with large margins. Rather than heuristically hand-crafting special features, it turns out that the learned regressor is able to automatically integrate features and pick out discriminative ones for saliency. Even though the regressor is trained on a small set of images, it demonstrates good generalization ability to other data sets.

The rest of this paper is organized as follows. Sec. 2 introduces related work and discusses its differences from our proposed method. The saliency computation framework is presented in Sec. 3. Sec. 4 describes the regional saliency features adopted in this paper. Sec. 5 presents the learning framework of our approach. Empirical analysis of our proposed method and comparisons with other algorithms are presented in Sec. 6. Finally, Sec. 7 discusses and concludes this paper.

2 RELATED WORK

Salient object detection, stemming from eye fixation prediction, aims to separate the entire salient object from the background. Since the pioneering work of Itti et al. [2], it has attracted more and more research interest in computer vision, driven by applications such as content-aware image resizing [8], picture collage [10], etc. In the following, we focus on salient object detection (segmentation) and briefly review existing algorithms. A comprehensive survey can be found in a recent work [23]. A literature review of eye fixation prediction can be found in [27], which also includes some analysis of salient object detection. We simply divide existing algorithms into two categories, unsupervised and supervised, according to whether the ground-truth annotations of salient objects are used.

Unsupervised approaches. Most salient object detection algorithms characterize the uniqueness of a scene as salient regions following the center-surround contrast framework [2], where different kinds of features are combined according to the feature integration theory [22]. The multi-scale pixel contrast is studied in [19], [28]. The discriminant center-surround hypothesis is analyzed in [16], [29]. Color histograms, computed to represent the center and the surround, are used to evaluate the center-surround dissimilarity [19]. An information-theoretic perspective is introduced to yield a sound mathematical formulation, computing the center-surround divergence based on feature statistics [18]. A cost-sensitive SVM is trained to measure the separability of a center region w.r.t. its surroundings [30]. The uniqueness can also be captured in a global scope by comparing a patch to its k nearest neighbors [17] or by its distance to the average patch over the image along the principal component axes [31].

The center-surround difference framework has also been investigated to compute saliency from a region-based image representation. Multi-level image segmentation is adopted for salient object detection based on local regional contrast [32]. The global regional contrast is studied in [15], [21] as well. To further enhance the performance, saliency maps on hierarchical segmentations are computed and finally combined through a tree model via dynamic programming [33]. Both color and textural global uniqueness are investigated in [34], [35].

Recently, Cheng et al. [36] propose a soft image abstraction using a Gaussian Mixture Model (GMM), where each pixel maintains a probability of belonging to each region instead of a single hard region label, to better compute the saliency. The global uniqueness can also be captured with the low-rank matrix recovery framework [37]–[39]: the low-rank matrix corresponds to the background regions, while sparse noises are indications of salient regions. A submodular salient object detection algorithm is presented in [40], where superpixels are gradually grouped to form potential salient regions by iteratively optimizing a submodular facility location problem. The Bayesian framework is introduced for salient object detection in [41], [42]. A partial differential equation (PDE) is also introduced for salient object detection in a recent work [43].

In addition to capturing the uniqueness, many other priors have also been proposed for saliency computation. The center prior, i.e., that the salient object usually lies in the center of an image, is investigated in [32], [44]. Object priors, such as the connectivity prior [45], concavity context [20], auto-context cue [46], backgroundness prior [47]–[50], generic objectness prior [51]–[53], and background connectivity prior [38], [54], [55], are also studied for saliency computation. Example-based approaches, searching for images similar to the input, are developed for salient object detection [8], [56]. The depth cue is leveraged for saliency analysis derived from stereoscopic image pairs in [57] and from a depth camera (e.g., Kinect) in [58]. Li et al. [59] adopt a light field camera for salient object detection. Besides, spectral analysis in the frequency domain is used to detect salient regions [13].

Supervised approaches. Inspired by the feature integration theory, some approaches focus on learning the linear fusion weights of saliency features. Liu et al. [19] propose to learn the linear fusion weights of saliency features in a Conditional Random Field (CRF) framework. Recently, the large-margin framework was adopted to learn the weights in [60]. Due to the highly non-linear nature of the saliency mechanism, a linear mapping might not perfectly capture the characteristics of saliency. In [24], a mixture of linear Support Vector Machines (SVMs) is adopted to partition the feature space into a set of sub-regions that are linearly separable, using a divide-and-conquer strategy. Alternatively, a Boosted Decision Tree (BDT) is learned to get an initial saliency map, which is further refined using a high-dimensional color transform [61]. In [25], generic regional properties are investigated for salient object detection. Li et al. [62] propose to generate a saliency map by adaptively averaging object proposals [63] with their foreground probabilities, which are learned from eye fixation features using a Random Forest regressor. Additionally, Wang et al. [64] learn a Random Forest to directly localize the salient object in thumbnail images. In [65], a saliency map is used to guide the sampling of sliding windows for object category recognition, and is learned online during the classification process.

Our proposed discriminative regional feature integration (DRFI) approach is a supervised salient object detection algorithm. Compared with unsupervised methods, our approach extends the contrast value used in existing algorithms to a contrast vector to represent a region. More importantly, instead of designing heuristic integration rules, our approach is able to automatically combine high-dimensional saliency features in a data-driven fashion and pick out the discriminative ones. Compared with existing supervised methods, our method learns a highly non-linear combination of saliency features and does not require any assumption about the feature space. The approaches most similar to ours are perhaps [25] and [61]. [25] is a light touch on discriminative feature integration without a deep investigation, as it only considers the regional property descriptor. In [61], the learned saliency map is only used as a pre-processing step to provide a coarse estimation of salient and background regions, while our approach directly outputs the saliency map.

It is noted that some supervised learning approaches exist to predict eye fixations [66], [67]. Their features, e.g., the local energy of the steerable pyramid filters [68] in [66] and the perceptual Gestalt grouping cues in [67], seem to be more suitable for eye fixation prediction, while our approach is specifically designed for salient object detection. We also note that discriminative feature fusion has been studied in image classification [69], which learns adaptive weights of features according to the classification task to better distinguish one class from others. Instead, our approach integrates three types of regional features in a discriminative strategy for saliency regression on multiple segmentations.

3 IMAGE SALIENCY COMPUTATION

The pipeline of our approach consists of three main steps: multi-level segmentation, which decomposes an image into regions; regional saliency computation, which maps the features extracted from each region to a saliency score; and multi-level saliency fusion, which combines the saliency maps over all the levels of segmentation to get the final saliency map. The whole process is illustrated in Fig. 1.

Multi-level segmentation. Given an image I, we represent it by a set of M-level segmentations S = {S_1, S_2, ..., S_M}, where each segmentation S_m is a decomposition of the image I. We apply the graph-based image segmentation approach [70] to generate multiple segmentations using M groups of different parameters.
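As a concrete illustration, the following sketch generates such a multi-level decomposition. It assumes the graph-based segmentation of [70] as implemented by scikit-image's `felzenszwalb` function; the scale values are illustrative placeholders, not the paper's actual parameter groups.

```python
# Sketch of multi-level segmentation, assuming the graph-based method of [70]
# as implemented in scikit-image. Parameter values are illustrative only.
from skimage import io
from skimage.segmentation import felzenszwalb

def multi_level_segmentation(image, scales=(50, 100, 200, 400, 800)):
    """Decompose an image into M segmentations S = {S_1, ..., S_M}.

    Each element is a label map; increasing the scale parameter yields
    coarser groupings of regions.
    """
    return [felzenszwalb(image, scale=s, sigma=0.8, min_size=20)
            for s in scales]

if __name__ == "__main__":
    img = io.imread("example.jpg")          # hypothetical input image
    segmentations = multi_level_segmentation(img)
    for m, seg in enumerate(segmentations, 1):
        print(f"level {m}: {seg.max() + 1} regions")
```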

Fig. 1. The framework of our proposed discriminative regional feature integration (DRFI) approach. (Block diagram: an input image is passed through multi-level image segmentation, multi-level saliency computation, and multi-level saliency fusion; annotated training images drive the learned components.)

Due to the limitations of low-level cues, none of the current segmentation algorithms can reliably segment the salient object. Therefore, we resort to multi-level segmentation for robustness. In Sec. 5.1, we will further demonstrate how to utilize multi-level segmentation to generate a large amount of training samples.

Regional saliency computation. In our approach, we predict a saliency score for each region, which is jointly represented by three types of features: regional contrast, regional property, and regional backgroundness, described in Sec. 4. For now, we denote the feature as a vector x. The feature x is passed into a Random Forest regressor f, yielding a saliency score. The Random Forest regressor is learnt from the regions of the training images and integrates the features in a discriminative strategy. The learning procedure is given in Sec. 5.

Multi-level saliency fusion. After region saliency computation, each region has a saliency value. For each level, we assign the saliency value of each region to its contained pixels. As a result, we generate M saliency maps {A_1, A_2, ..., A_M} and then fuse them together, A = g(A_1, ..., A_M), to get the final saliency map A, where g is a combinator function introduced in Sec. 5.3.
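A minimal sketch of the resulting inference loop is given below, assuming a trained regressor `f` and a feature extractor `extract_region_features` standing in for the descriptors of Sec. 4; both names are hypothetical.

```python
import numpy as np

def compute_saliency_map(image, segmentations, extract_region_features, f,
                         weights=None):
    """Predict a per-level saliency map for each segmentation and fuse them.

    `extract_region_features` and the regressor `f` are placeholders for
    the 93-D regional descriptor (Sec. 4) and the Random Forest (Sec. 5).
    """
    h, w = image.shape[:2]
    level_maps = []
    for seg in segmentations:                     # one label map per level
        a_map = np.zeros((h, w), dtype=np.float64)
        for r in np.unique(seg):
            x = extract_region_features(image, seg, r)   # feature vector
            a_map[seg == r] = f(x)                       # regional saliency
        level_maps.append(a_map)
    # Fusion g: weighted (or plain) average of the per-level maps.
    if weights is None:
        weights = np.full(len(level_maps), 1.0 / len(level_maps))
    fused = sum(wm * am for wm, am in zip(weights, level_maps))
    return fused / max(fused.max(), 1e-12)        # normalize to [0, 1]
```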

4 REGIONAL SALIENCY FEATURES

In this section, we present three types of regional saliency features, leading to a 93-dimensional feature vector for each region.

4.1 Regional contrast descriptor

A region is likely to be thought salient if it is different from the others. Unlike most existing approaches, which compute contrast values, e.g., the distances of region features like color and texture, and then combine them directly to form a saliency score, our approach computes a contrast descriptor: a vector representing the differences of the feature vectors of regions.


Color and texture features | dim | difference definition | dim | Contrast | Backgroundness
a_1: average RGB values | 3 | d(a_1^{R_i}, a_1^S) | 3 | c_1 ~ c_3 | b_1 ~ b_3
h_1: RGB histogram | 256 | χ²(h_1^{R_i}, h_1^S) | 1 | c_4 | b_4
a_2: average HSV values | 3 | d(a_2^{R_i}, a_2^S) | 3 | c_5 ~ c_7 | b_5 ~ b_7
h_2: HSV histogram | 256 | χ²(h_2^{R_i}, h_2^S) | 1 | c_8 | b_8
a_3: average L*a*b* values | 3 | d(a_3^{R_i}, a_3^S) | 3 | c_9 ~ c_11 | b_9 ~ b_11
h_3: L*a*b* histogram | 256 | χ²(h_3^{R_i}, h_3^S) | 1 | c_12 | b_12
r: absolute response of the LM filters | 15 | d(r^{R_i}, r^S) | 15 | c_13 ~ c_27 | b_13 ~ b_27
h_4: max response histogram of the LM filters | 15 | χ²(h_4^{R_i}, h_4^S) | 1 | c_28 | b_28
h_5: histogram of the LBP feature | 256 | χ²(h_5^{R_i}, h_5^S) | 1 | c_29 | b_29

Fig. 2. Color and texture features describing the visual characteristics of a region, used to compute the regional feature vector. Here $d(\mathbf{x}^1, \mathbf{x}^2) = (|x^1_1 - x^2_1|, \cdots, |x^1_d - x^2_d|)$, where $d$ is the number of elements in the vectors $\mathbf{x}^1$ and $\mathbf{x}^2$, and $\chi^2(\mathbf{h}^1, \mathbf{h}^2) = \sum_{i=1}^{b} \frac{2(h^1_i - h^2_i)^2}{h^1_i + h^2_i}$, with $b$ being the number of histogram bins. The last two columns denote the symbols for the regional contrast and backgroundness descriptors. (In the definition of a feature, S corresponds to R_j for the regional contrast descriptor and to B for the regional backgroundness descriptor, respectively.)

To compute the contrast descriptor, we describe each region R_i ∈ S_m by a feature vector including color and texture features, denoted by v_{R_i}. The detailed description is given in Fig. 2. For color features, we consider the RGB, HSV, and L*a*b* color spaces. For texture features, we adopt the LBP feature [71] and the responses of the LM filter bank [72].

As suggested in previous works [15], [21], the regional contrast value $x^c_k$ derived from the k-th feature channel is computed by checking R_i against all other regions,

$$x^c_k(R_i) = \sum_{j=1}^{N_m} \alpha_j w_{ij} D_k(\mathbf{v}_{R_i}, \mathbf{v}_{R_j}), \qquad (1)$$

where $D_k(\mathbf{v}_{R_i}, \mathbf{v}_{R_j})$ captures the difference of the k-th channel of the feature vectors $\mathbf{v}_{R_i}$ and $\mathbf{v}_{R_j}$. Specifically, the difference is computed as the χ² distance for histogram features and as the absolute difference for the other features. $w_{ij} = e^{-\|\mathbf{p}^m_i - \mathbf{p}^m_j\|^2 / 2\sigma_s^2}$ is a spatial weighting term, where $\mathbf{p}_i$ and $\mathbf{p}_j$ are the mean positions of R_i and R_j, respectively, and σ_s controls the strength of the spatial weighting effect; we empirically set it to 1.0 in our implementation. α_j is introduced to account for the irregular shapes of regions and is defined as the normalized area of the region R_j. N_m is the number of regions in S_m. As a result, we get a 29-dimensional feature vector. The details of the regional contrast descriptor are given in Fig. 2.
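A direct transcription of Eq. (1) for the non-histogram channels might look as follows; histogram channels would use the χ² distance of Fig. 2 instead of the absolute difference. All array names are illustrative.

```python
import numpy as np

def regional_contrast(features, positions, areas, sigma_s=1.0):
    """Eq. (1): x_k^c(R_i) = sum_j alpha_j * w_ij * D_k(v_Ri, v_Rj).

    features  : (N, K) per-region feature values (non-histogram channels)
    positions : (N, 2) mean region positions, normalized to [0, 1]
    areas     : (N,)   normalized region areas (the alpha_j weights)
    Returns an (N, K) array of contrast values, one per region and channel.
    """
    # Spatial weights w_ij = exp(-||p_i - p_j||^2 / (2 * sigma_s^2))
    d2 = ((positions[:, None, :] - positions[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (2.0 * sigma_s ** 2))
    # D_k as the absolute per-channel difference |v_Ri - v_Rj|
    diff = np.abs(features[:, None, :] - features[None, :, :])  # (N, N, K)
    # Weighted sum over all other regions j (the j = i term is zero).
    return np.einsum("j,ij,ijk->ik", areas, w, diff)
```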

4.2 Regional backgroundness descriptor

A few algorithms attempt to make use of the characteristics of the background (e.g., homogeneous color or texture) to heuristically determine whether a region is background, e.g., [47]. In contrast, our algorithm extracts a set of features and adopts a supervised learning approach to determine the background degree (and accordingly the saliency degree) of a region.

It has been observed that background identification depends on the whole image context: image regions with similar appearances might belong to the background in one image but to the salient object in another. It is therefore not enough to merely use the property features to check whether a region belongs to the background or the salient object.

Therefore, we extract a pseudo-background region and compute the backgroundness descriptor for each region with the pseudo-background region as a reference. The pseudo-background region B is defined as the 15-pixel-wide narrow border region of the image. To verify this definition, we made a simple survey on the MSRA-B data set with 5000 images and found that 98% of the pixels in the border area belong to the background. The backgroundness value $x^b_k$ of the region R_i on the k-th feature channel is then defined as

$$x^b_k(R_i) = D_k(\mathbf{v}_{R_i}, \mathbf{v}_B). \qquad (2)$$

We again get a 29-dimensional feature vector. See the details in Fig. 2.
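For illustration, the sketch below builds the 15-pixel pseudo-background mask and evaluates Eq. (2) for a single histogram channel with the χ² distance from Fig. 2; the grayscale feature and bin count are arbitrary stand-ins for the channels of Fig. 2.

```python
import numpy as np

def chi_square(h1, h2, eps=1e-12):
    """Chi-squared histogram distance from Fig. 2."""
    return np.sum(2.0 * (h1 - h2) ** 2 / (h1 + h2 + eps))

def backgroundness(image, region_mask, border=15, bins=16):
    """Eq. (2) for one histogram channel: D_k(v_Ri, v_B), where B is
    the 15-pixel-wide image border (the pseudo-background region)."""
    h, w = image.shape[:2]
    bg = np.ones((h, w), dtype=bool)
    bg[border:h - border, border:w - border] = False   # keep only the border
    gray = image.mean(axis=2)            # toy single-channel feature
    h_r, _ = np.histogram(gray[region_mask], bins=bins,
                          range=(0, 255), density=True)
    h_b, _ = np.histogram(gray[bg], bins=bins, range=(0, 255), density=True)
    return chi_square(h_r, h_b)
```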

4.3 Regional property descriptor

Additionally, we consider the generic properties of a region, including appearance and geometric features. These features are extracted independently from each region, as in the feature extraction for image labeling [73]. The appearance features attempt to describe the distribution of colors and textures in a region, which can characterize common properties of the salient object and the background. For example, the background usually has a homogeneous color distribution or a similar texture pattern. The geometric features include the size and position of a region, which may be useful to describe the spatial distribution of the salient object and the background. For example, the salient object tends to be placed near the center of the image, while the background usually scatters over the entire image. Finally, we obtain a 35-dimensional regional property descriptor. The details are given in Fig. 3.

In summary, we obtain a 93-dimensional (2 × 29 + 35) feature vector for each region. Fig. 4 shows visualizations of the most important features for each kind of regional feature descriptor.


description | notation | dim
average normalized x coordinate | p_1 | 1
average normalized y coordinate | p_2 | 1
10th percentile of the normalized x coord. | p_3 | 1
10th percentile of the normalized y coord. | p_4 | 1
90th percentile of the normalized x coord. | p_5 | 1
90th percentile of the normalized y coord. | p_6 | 1
normalized perimeter | p_7 | 1
aspect ratio of the bounding box | p_8 | 1
variances of the RGB values | p_9 ~ p_11 | 3
variances of the L*a*b* values | p_12 ~ p_14 | 3
variances of the HSV values | p_15 ~ p_17 | 3
variance of the response of the LM filters | p_18 ~ p_32 | 15
variance of the LBP feature | p_33 | 1
normalized area | p_34 | 1
normalized area of the neighbor regions | p_35 | 1

Fig. 3. The regional property descriptor. (The abbreviation coord. indicates coordinates.)


5 LEARNING

In this section, we introduce how to learn a Random Forest that maps the feature vector of each region to a saliency score. Learning the multi-level saliency fusion weights is also presented.

5.1 Generating training samples

We use supervised multi-level segmentation to generate training samples. We first learn a similarity score for each pair of adjacent regions, reflecting the probability that the two regions both belong to the salient object or both to the background. Similar regions are then grouped together in a hierarchical way. The training samples for the saliency regressor are the confident regions in this grouping hierarchy.

By learning the similarity score, we hope that regions from the same object (or background) are more likely to be grouped together. Specifically, given an over-segmentation of an image, we connect each region with its spatially neighboring regions, forming a set of pairs P = {(R_i, R_j)}, and learn the probability p(a_i = a_j), where a_i is the saliency label of the region R_i. Such a set of pairs is divided into two parts: a positive part P+ = {(R_i, R_j) | a_i = a_j} and a negative part P− = {(R_i, R_j) | a_i ≠ a_j}. Following [74], each region pair is described by a set of features including the regional saliency features of the two regions (2 × 93 features), the feature contrast of the two regions (similar to the regional contrast descriptor, 29 features), and the geometry features of the superpixel boundary between the two regions (similar to p_1 ~ p_7 in Fig. 3, 7 features). Given these 222-dimensional features, we learn a boosted decision tree classifier to estimate the similarity score of each adjacent region pair.
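A sketch of assembling the 222-dimensional pair descriptor and fitting the similarity classifier is shown below. scikit-learn's `GradientBoostingClassifier` is used as a stand-in for the boosted decision tree of [74], and the training arrays are placeholders.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def pair_feature(x_i, x_j, contrast_ij, boundary_geom_ij):
    """222-D pair descriptor: two 93-D regional descriptors, a 29-D
    feature contrast, and 7 boundary geometry features (cf. Sec. 5.1)."""
    return np.concatenate([x_i, x_j, contrast_ij, boundary_geom_ij])

# Placeholder training data: one row per adjacent region pair,
# label 1 if both regions share the same saliency label (P+), else 0 (P-).
X_pairs = np.random.rand(1000, 222)
y_pairs = np.random.randint(0, 2, 1000)

similarity_clf = GradientBoostingClassifier(n_estimators=200, max_depth=3)
similarity_clf.fit(X_pairs, y_pairs)
# p(a_i = a_j) for a new pair:
p_same = similarity_clf.predict_proba(X_pairs[:1])[0, 1]
```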

Based on the learned similarity of adjacent regions, we produce multi-level segmentations $\{\mathcal{S}^t_1, \mathcal{S}^t_2, \ldots, \mathcal{S}^t_{M_t}\}$ to gather a large amount of training samples. Specifically, denote by $\mathcal{S}^t_0$ the over-segmentation of the image generated using the graph-based image segmentation algorithm [70]. The regions in $\mathcal{S}^t_0$ are represented by a weighted graph that connects the spatially neighboring regions. The weight of each edge is the learned similarity of the two adjacent superpixels. Similar to the pixel-wise grouping in [70], pairs of regions are sequentially merged in order of decreasing edge weight. We change the tolerance degree for small regions, i.e., the parameter k of the approach [70] (see the details in [70]), to generate the segmentations from $\mathcal{S}^t_1$ to $\mathcal{S}^t_{M_t}$. To avoid too-fine groupings, we discard $\mathcal{S}^t_i$ if $|\mathcal{S}^t_i| / |\mathcal{S}^t_0| > 0.6$, where $|\cdot|$ denotes the number of superpixels.

Fig. 4. Illustration of the most important features. From top to bottom: input images, the most important contrast feature (c_12), the most important backgroundness feature (b_12), the most important property feature (p_5), and the saliency map of our approach (DRFIs) produced on a single-level segmentation. A brighter area indicates a larger feature value (and thus a larger saliency value according to c_12, b_12, and the saliency map).

Given a set of training images with ground-truth annotations and their multi-level segmentations, we can collect a large number of confident regions R = {R^(1), R^(2), ..., R^(Q)} and the corresponding saliency scores A = {a^(1), a^(2), ..., a^(Q)} to learn a Random Forest saliency regressor. Only confident regions are kept for training, since some regions may contain pixels from both the salient object and the background. A region is considered confident if the number of pixels belonging to either the salient object or the background exceeds 80% of the total number of pixels in the region; its saliency score is set to 1 or 0 accordingly. In our experiments we find that few regions, around 6% of all training examples, are unconfident, and we discard them from training.
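The 80% rule can be written as a small helper; `region_mask` and `gt_mask` are assumed to be boolean pixel masks of the region and the ground-truth salient object, respectively.

```python
import numpy as np

def confident_region_label(region_mask, gt_mask, thresh=0.8):
    """Return 1.0 / 0.0 for a confident salient / background region,
    or None for an unconfident region that should be discarded
    (about 6% of regions in the paper's experiments)."""
    frac_salient = gt_mask[region_mask].mean()   # fraction of salient pixels
    if frac_salient >= thresh:
        return 1.0
    if frac_salient <= 1.0 - thresh:
        return 0.0
    return None                                  # mixed region: discard
```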

One benefit of generating multi-level segmentations is that a large amount of training samples can be gathered. In Sec. 6.3, we empirically analyze different settings of M_t and validate our motivation for generating training samples based on multi-level image segmentation. Additionally, with the guidance of the learned similarity metric, there might be only a few large regions left in the high-level segmentation, which are helpful for the Random Forest regressor to learn object-level properties. However, the learned similarity is hard to generalize across datasets; this is why our approach did not perform best on the SED2 dataset in our previous version [1]. To this end, we adopt unsupervised multi-level segmentation in the testing phase, which is also more efficient as it requires no learned similarity score.

5.2 Learning the regional saliency regressor

Our aim is to learn a regional saliency estimator from a set of training examples. As mentioned above, each region is described by a feature vector x ∈ R^d composed of the regional contrast, regional property, and regional backgroundness descriptors (i.e., d = 93). From the training data X = {x^(1), x^(2), ..., x^(Q)} and the saliency scores A = {a^(1), a^(2), ..., a^(Q)}, we learn a Random Forest regressor f_r : R^d → R that maps the feature vector of each region to a saliency score.

A Random Forest saliency regressor is an ensemble of T decision trees, where each tree consists of split and leaf nodes. Each split node stores a feature index f and a threshold τ. Given a feature vector x, each split node makes a decision based on the pair (f, τ): if x(f) < τ, the sample traverses to the left child, otherwise to the right child. When a leaf node is reached, its stored prediction value is returned. The final prediction of the forest is the average of the predictions over all the decision trees.
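The prediction rule just described amounts to the following traversal; the node layout is an illustrative choice rather than the paper's data structure.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    f: int = -1                      # feature index at a split node
    tau: float = 0.0                 # threshold at a split node
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    value: float = 0.0               # stored prediction at a leaf

def tree_predict(node, x):
    """Route x down the tree: left if x[f] < tau, else right."""
    while node.left is not None:     # leaf nodes have no children
        node = node.left if x[node.f] < node.tau else node.right
    return node.value

def forest_predict(trees, x):
    """The forest's prediction is the average over all T trees."""
    return sum(tree_predict(t, x) for t in trees) / len(trees)
```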

Training a Random Forest regressor amounts to independently building each decision tree. For each tree, the training samples are randomly drawn with replacement, X_t = {x^(t_1), x^(t_2), ..., x^(t_Q)}, A_t = {a^(t_1), a^(t_2), ..., a^(t_Q)}, where t_i ∈ [1, Q], i ∈ [1, Q]. Constructing a tree consists of finding the pair (f, τ) for each split node and the prediction value for each leaf node. Starting from the root node, m features F_m are randomly chosen without replacement from the full feature vector. The best split is found among these features F_m by maximizing the splitting criterion

$$(f^*, \tau^*) = \arg\max_{f \in \mathcal{F}_m, \tau} \left( \frac{\big(\sum_{t_i \in \mathcal{D}_l} a^{(t_i)}\big)^2}{|\mathcal{D}_l|} + \frac{\big(\sum_{t_i \in \mathcal{D}_r} a^{(t_i)}\big)^2}{|\mathcal{D}_r|} - \frac{\big(\sum_{t_i \in \mathcal{D}} a^{(t_i)}\big)^2}{|\mathcal{D}|} \right), \qquad (3)$$

where $\mathcal{D}_l = \{(\mathbf{x}^{(t_i)}, a^{(t_i)}) \mid \mathbf{x}^{(t_i)}(f) < \tau\}$, $\mathcal{D}_r = \{(\mathbf{x}^{(t_i)}, a^{(t_i)}) \mid \mathbf{x}^{(t_i)}(f) \geq \tau\}$, and $\mathcal{D} = \mathcal{D}_l \cup \mathcal{D}_r$. This splitting procedure is repeated until $|\mathcal{D}| < 5$, at which point a leaf node is created. The prediction value of the leaf node is the average of the saliency scores of the training samples falling in it. We empirically examine the settings of the parameters T and m in Sec. 6.3.
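To make Eq. (3) concrete, the sketch below scores one candidate pair (f, τ) and searches over the sampled features and thresholds; the exhaustive threshold search is a simplification.

```python
import numpy as np

def split_score(X, a, f, tau):
    """Score of a candidate split (f, tau) from Eq. (3):
    (sum_Dl a)^2/|Dl| + (sum_Dr a)^2/|Dr| - (sum_D a)^2/|D|."""
    left = X[:, f] < tau
    right = ~left
    if not left.any() or not right.any():
        return -np.inf                       # degenerate split
    def term(mask):
        return a[mask].sum() ** 2 / mask.sum()
    return term(left) + term(right) - a.sum() ** 2 / len(a)

def best_split(X, a, feature_subset):
    """Search the m sampled features F_m and candidate thresholds."""
    best = (-np.inf, None, None)
    for f in feature_subset:
        for tau in np.unique(X[:, f]):
            s = split_score(X, a, f, tau)
            if s > best[0]:
                best = (s, f, tau)
    return best[1], best[2]                  # (f*, tau*)
```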

Learning a saliency regressor automatically integrates the features and discovers the most discriminative ones. Additionally, during the training of the Random Forest, the feature importance can be estimated simultaneously; refer to the supplementary material for more details. Fig. 6 presents the 60 most important features.
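In practice, such importance estimates are available directly from standard implementations; the sketch below uses scikit-learn's `RandomForestRegressor` as a stand-in for the paper's own forest, with placeholder training data and the hyper-parameters T = 200, m = 15, and the |D| < 5 stopping rule mapped to the corresponding arguments.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Placeholder training data: Q regions with 93-D descriptors.
X = np.random.rand(5000, 93)
a = np.random.rand(5000)

forest = RandomForestRegressor(n_estimators=200,   # T = 200 trees
                               max_features=15,    # m = 15 sampled features
                               min_samples_split=5)
forest.fit(X, a)

# Impurity-based importances, one value per feature, summing to 1.
order = np.argsort(forest.feature_importances_)[::-1]
print("top-10 feature indices:", order[:10])
```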

5.3 Learning the multi-level saliency fusor

Given the multi-level saliency maps {A_1, A_2, ..., A_M} for an image, our aim is to learn a combinator g(A_1, A_2, ..., A_M) that fuses them into the final saliency map A. Such a problem has already been addressed in existing methods, e.g., the conditional random field solution [19]. In our implementation, we find that a linear combinator, $A = \sum_{m=1}^{M} w_m A_m$, performs well, with the weights learned by a least-squares estimator, i.e., by minimizing the sum of the losses $\|A - \sum_{m=1}^{M} w_m A_m\|_F^2$ over all the training images. In practice, we found that a plain average of the multi-level saliency maps performs nearly as well as the weighted average.
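The least-squares fusion weights can be obtained with a single linear solve; in the sketch below each per-level map is flattened into a column of a design matrix, so `np.linalg.lstsq` minimizes the summed Frobenius losses. Variable names are illustrative.

```python
import numpy as np

def fit_fusion_weights(level_maps_per_image, gt_maps):
    """Solve min_w sum_images || A - sum_m w_m A_m ||_F^2.

    level_maps_per_image : list of lists, M maps of shape (H, W) per image
    gt_maps              : list of ground-truth maps of shape (H, W)
    """
    cols, targets = [], []
    for maps, gt in zip(level_maps_per_image, gt_maps):
        cols.append(np.stack([m.ravel() for m in maps], axis=1))  # (HW, M)
        targets.append(gt.ravel())
    P = np.vstack(cols)                 # stacked design matrix
    t = np.concatenate(targets)
    w, *_ = np.linalg.lstsq(P, t, rcond=None)
    return w                            # fusion weights w_1 ... w_M
```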

6 EXPERIMENTS

In this section, we empirically analyze our proposed approach. Comparisons with state-of-the-art methods on benchmark data sets are also presented.

6.1 Data sets

We evaluate the performance over five data sets that are widely used in salient object detection and segmentation.

MSRA-B.1 This data set [19] includes 5000 images, originally labeled with rectangles by nine users, each drawing a bounding box around what they considered the most salient object. There is a large variation among images, including natural scenes, animals, indoor and outdoor scenes, etc. We manually segment the salient object (contour) within the user-drawn rectangle to obtain binary masks. The ASD data set [13] is a subset (with binary masks) of MSRA-B, and thus we no longer make evaluations on it.

iCoSeg.2 This is a publicly available co-segmentation data set [75], including 38 groups of 643 images in total. Each image comes with a pixel-wise ground-truth annotation, which may contain one or multiple salient objects. In this paper, we use it to evaluate the performance of salient object detection.

SED.3 This data set [76] contains two subsets: SED1, which has 100 images containing only one salient object, and SED2, which has 100 images containing exactly two salient objects. Pixel-wise ground-truth annotations for the salient objects in both SED1 and SED2 are provided. We only make evaluations on SED2. Similar to the larger MSRA-B dataset, only a single salient object exists in each image of SED1, where state-of-the-art performance was reported in our previous version [1]. Additionally, evaluations on SED2 may help us check the adaptability of salient object detection algorithms to multiple-object cases.

ECSSD.4 To overcome the weakness of existing data sets such as ASD, in which background structures are primarily simple and smooth, a new data set, the Extended Complex Scene Saliency Dataset (ECSSD), was proposed recently in [77]. It contains 1000 images with diversified patterns in both foreground and background, where many semantically meaningful but structurally complex images are available. Binary masks for the salient objects were produced by 5 subjects.

DUT-OMRON.5 Similarly, this dataset is also introduced to evaluate salient object detection algorithms on images with more than a single salient object and relatively complex backgrounds. It contains 5,168 high-quality natural images, where each image is resized to have a maximum side length of 400 pixels. Annotations are available in the form of both bounding boxes and pixel-wise binary object masks. Furthermore, eye fixation annotations are also provided, making this dataset suitable for simultaneously evaluating salient object localization and detection models as well as fixation prediction models.

We randomly sample 3000 images from the MSRA-B data set to train our model. Five-fold cross-validation is run to select the parameters. The remaining 2000 images are used for testing. Rather than training a model for each data set, we use the model trained on the MSRA-B data set and test it on the others. This tests the adaptability of a model trained on one data set to different data sets and avoids overfitting the model to a specific one.

1. http://research.microsoft.com/en-us/um/people/jiansun/
2. http://chenlab.ece.cornell.edu/projects/touch-coseg
3. http://www.wisdom.weizmann.ac.il/~vision/Seg_Evaluation_DB/
4. http://www.cse.cuhk.edu.hk/leojia/projects/hsaliency
5. http://ice.dlut.edu.cn/lu/dut-omron/homepage.htm

6.2 Evaluation Metrics

We evaluate the performance using the measures adopted in [23], based on the overlap between the ground-truth annotation and the saliency prediction, including the PR (precision-recall) curve, the ROC (receiver operating characteristic) curve, and the AUC (area under ROC curve) score. Precision corresponds to the percentage of salient pixels correctly assigned, and recall is the fraction of detected salient pixels belonging to the salient object in the ground truth.

For a grayscale saliency map, whose pixel values are in the range [0, 255], we vary the threshold from 0 to 255 to obtain a series of salient object segmentations. The PR curve is created by computing the precision and recall values at each threshold. The ROC curve is generated analogously from the true positive rates and false positive rates obtained during the calculation of the PR curve.
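A sketch of this thresholding procedure: for each threshold in [0, 255] the saliency map is binarized and compared against the ground truth, giving one (precision, recall) and one (FPR, TPR) point per threshold.

```python
import numpy as np

def pr_and_roc(saliency, gt, eps=1e-12):
    """Sweep thresholds 0..255 over a grayscale saliency map.

    saliency : (H, W) uint8 saliency map
    gt       : (H, W) boolean ground-truth mask
    Returns (precision, recall, fpr, tpr), each of length 256.
    """
    prec, rec, fpr, tpr = [], [], [], []
    n_pos, n_neg = gt.sum(), (~gt).sum()
    for t in range(256):
        pred = saliency >= t
        tp = np.logical_and(pred, gt).sum()
        fp = np.logical_and(pred, ~gt).sum()
        prec.append(tp / (tp + fp + eps))
        rec.append(tp / (n_pos + eps))
        fpr.append(fp / (n_neg + eps))
        tpr.append(tp / (n_pos + eps))
    return tuple(np.asarray(v) for v in (prec, rec, fpr, tpr))
```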

6.3 Parameters Analysis

In this section, we empirically analyze the performance of salient object detection against the settings of the parameters during both the training and testing phases. Since we want to test the cross-dataset generalization ability of our approach, we run five-fold cross-validation on the training set. The parameter settings are thus blind to the other testing data, and fair comparisons with other approaches can be conducted. Average AUC scores resulting from cross-validation under different parameter settings are plotted in Fig. 5.

Training parameters analysis. There are three parameters during training: the number of segmentations M_t used to generate training samples, and the number of trees T and the number of randomly chosen features m used when training the Random Forest regressor.

A larger number of segmentations leads to a larger amount of training data. As a classifier usually benefits from a greater quantity of training samples, we can observe from Fig. 5(a) that the performance steadily increases as M_t becomes larger. We finally set M_t = 48, generating around 1.7 million samples to train our regional Random Forest saliency regressor.

As shown in Fig. 5(b), the performance of our approach is higher with more trees in the Random Forest saliency regressor. The more trees there are, the less variance there is among the decision trees, and thus the better the performance that can be achieved. Though the performance keeps increasing as more trees are adopted, we set T = 200 trees to balance efficiency and effectiveness.

When splitting each node during the construction of a decision tree, only m randomly chosen features are considered. Intuitively, on the one hand, increasing m gives each node a greater chance to select more discriminative features. On the other hand, a larger m reduces the variance between decision trees. For instance, suppose m is set to the dimension of the feature vector, i.e., all of the features can be seen during splitting; then most likely the same most discriminative feature will be chosen at each split node. Consequently, nearly identical decision trees are built, which cannot complement each other well and thus may result in inferior performance. According to Fig. 5(c), we empirically set m = 15, where the performance is best.


Fig. 5. Empirical analysis of parameters in terms of AUC scores, based on five-fold cross-validation of the training set. From left to right: (a) AUC scores versus the number of segmentations used to generate training samples, (b)(c) AUC scores versus the number of decision trees and the number of randomly chosen features at each node in the Random Forest saliency regressor, and (d) AUC scores versus the number of segmentations used in generating saliency maps.

Fig. 6. The 60 most important regional saliency features given by the Random Forest regressor, occupying around 90% of the energy of the total features. There are 5 contrast features, 20 backgroundness features, and 35 property features. In decreasing order of importance: b12, b4, b8, p5, p3, p12, p6, p1, p26, p2, p17, p28, p10, p35, p4, p9, p34, b11, b1, b2, b29, p11, p25, b21, b6, b28, p29, b23, b5, p8, p7, b9, b10, c12, p27, p13, b3, p33, b7, p20, p31, p22, c4, p15, p21, p16, p14, b20, b24, c21, c8, p30, p19, b19, p23, p32, b26, p24, p18, c23. See Fig. 2 and Fig. 3 for the description of the features.

Testing parameters analysis. One can see in Fig. 5(d) that the AUC scores of the saliency maps increase as more levels of segmentation are adopted. The reason is that with more levels of segmentation, there are more likely to exist confident regions that cover most of (or even the entire) object. However, a larger number of segmentations introduces more computational burden. Therefore, to balance efficiency and effectiveness, we set M to 15 segmentations in our experiments.

6.4 Feature Importance

Our approach uses a wide variety of features. In this section, we empirically analyze the usefulness of these regional saliency features. Fig. 6 shows the rank of the 60 most important regional features produced during the training of the Random Forest regressor, which occupy around 90% of the energy of the total features.

The feature rank indicates that the property descriptor is the most critical one in our feature set (occupying 35 of the top 60 features). The reason might be that salient objects share some common properties, validating our motivation to exploit the generic regional properties for salient object detection that are widely studied in other tasks such as image classification [26]. For example, the high rank of the geometric features p5, p3, and p6 might correspond to the compositional bias of salient objects. The importance of the variance features p12 and p26, on the other hand, might be related to the background properties. Among the contrast-based descriptors, the regional contrast descriptor is the least important, since it might be affected by cluttered scenes; it is less important than the regional backgroundness descriptor, which is in some sense more robust. Moreover, we also observe that color features are much more discriminative than texture features.

To further validate the importance of the features across different data sets, we train classifiers by removing each kind of feature descriptor on each benchmark data set (the testing set of MSRA-B). AUC scores of the resulting saliency maps are shown in Fig. 7. As can be seen, removing some feature descriptors does not necessarily lead to a performance decrease. Consistent with the feature rank given by the Random Forest regressor, the regional contrast descriptor is the least important one on most of the benchmark data sets, as the smallest performance drop is observed with its removal on most of the data sets. The regional property descriptor still plays the most important role on MSRA-B, ECSSD, and DUT-OMRON. Since there are multiple salient objects per image in SED2 and iCoSeg, the common properties learned from the training data, where only a single salient object exists in most of the images, may not perform the best.


Fig. 7. Feature importance across different data sets. For each data set, we report the AUC scores of saliency maps obtained by removing each kind of descriptor, to see the performance drop. Additionally, we also report the performance using only the top 60 features shown in Fig. 6. (Bars per data set: No Backgroundness, No Contrast, No Property, Top 60, Full; data sets: MSRA-B, iCoSeg, ECSSD, DUT-OMRON, SED2, DUT-OMRON*.)

The backgroundness descriptor performs well on MSRA-B, SED2, and iCoSeg. However, it plays the least important role on the ECSSD data set; its removal even leads to better performance, which indicates that the pseudo-background assumption might not always hold well. Finally, instead of considering all 93 features, we also train using only the top 60 features. Surprisingly, this feature vector performs as well as the entire feature descriptor, and even slightly better on DUT-OMRON, implying that some features contribute little.

We visualize the most important feature of each descriptor in Fig. 4. As we can see, even the most powerful backgroundness feature provides far from accurate information about salient objects. By integrating all of this weak information, much better saliency maps can be achieved. Note that here we do not adopt the multi-level fusion enhancement. Another advantage of our approach is the automatic fusion of features. For example, the rules for employing geometric features are discovered from the training data instead of being heuristically defined as in previous approaches [32], which might generalize poorly.

6.5 Performance Comparison

We report both quantitative and qualitative comparisons of our approach with state-of-the-art methods. To save space, we only consider the top four models ranked in the survey [23], SVO [51], CA [17], CB [32], and RC [15], and the recently-developed methods SF [21], LRK [78], HS [33], GMR [48], PCA [31], MC [50], DSR [49], and RBD [55], which are not covered in [23]. Note that we compare our approach with the extended version of RC. In total, we make comparisons with 12 approaches. Additionally, we also report the performance of our DRFI approach with a single level (DRFIs).

Method | MSRA-B | iCoSeg | ECSSD | DUT-OMRON | SED2 | DUT-OMRON*
SVO | 0.899 | 0.861 | 0.799 | 0.866 | 0.834 | 0.793
CA | 0.860 | 0.837 | 0.738 | 0.815 | 0.854 | 0.760
CB | 0.930 | 0.852 | 0.819 | 0.831 | 0.825 | 0.624
RC | 0.937 | 0.880 | 0.833 | 0.859 | 0.840 | 0.679
SF | 0.917 | 0.911 | 0.777 | 0.803 | 0.872 | 0.715
LRK | 0.925 | 0.908 | 0.810 | 0.859 | 0.881 | 0.758
HS | 0.930 | 0.882 | 0.829 | 0.860 | 0.820 | 0.735
GMR | 0.942 | 0.902 | 0.834 | 0.853 | 0.831 | 0.646
PCA | 0.938 | 0.895 | 0.817 | 0.887 | 0.903 | 0.776
MC | 0.951 | 0.898 | 0.849 | 0.887 | 0.863 | 0.715
DSR | 0.956 | 0.921 | 0.856 | 0.899 | 0.895 | 0.776
RBD | 0.945 | 0.941 | 0.840 | 0.894 | 0.873 | 0.779
DRFIs | 0.954 | 0.944 | 0.858 | 0.910 | 0.902 | 0.804
DRFI | 0.971 | 0.968 | 0.875 | 0.931 | 0.933 | 0.822

Fig. 8. AUC: area under the ROC curve (larger is better). The best three results per column were highlighted with red, green, and blue fonts, respectively.

Quantitative comparison. Quantitative comparisons are shown in Fig. 8, Fig. 9, and Fig. 10. As can be seen, our approach (DRFI) consistently outperforms the others on all benchmark data sets with large margins in terms of AUC scores, PR curves, and ROC curves. Specifically, it improves over the best-performing state-of-the-art algorithm by 1.57%, 2.66%, 2.34%, 3.45%, and 3.21% according to the AUC scores on MSRA-B, iCoSeg, ECSSD, DUT-OMRON, and SED2, respectively.

Our single-level version (DRFIs) performs best on iCoSeg, ECSSD, and DUT-OMRON as well. It improves over the best-performing state-of-the-art method by 0.32%, 0.23%, and 1.22% in terms of AUC scores on these three data sets, respectively. It is slightly worse (but still among the top 3 models) on the MSRA-B and SED2 data sets. Such improvement is substantial considering the already high performance of state-of-the-art algorithms. More importantly, though the Random Forest regressor is trained on MSRA-B, it performs best on other challenging data sets like ECSSD and DUT-OMRON.

With the multi-level enhancement, the performance of our approach can be further improved. For instance, it improves by 1.22% on MSRA-B and 1.78% on DUT-OMRON.

Qualitative comparison. We also provide qualitative comparisons of the different methods in Fig. 11. As can be seen, our approach (shown in Fig. 11 (n)(o)) deals well with challenging cases where the background is cluttered. For example, in the first two rows, other approaches may be distracted by the textures in the background, while our method almost completely highlights the entire salient object. It is also worth pointing out that our approach performs well when the object touches the image border, e.g., the first and the last third rows in Fig. 11, even though this violates the pseudo-background assumption. With the multi-level enhancement, even more appealing results can be achieved.


Fig. 9. Quantitative comparisons of saliency maps produced by different approaches on different data sets in terms of PR curves. See the supplemental materials for more evaluations. (Six panels: MSRA, iCoSeg, ECSSD, DUT-OMRON, SED2, and DUT-OMRON*; each plots precision versus recall for SVO, CA, CB, RC, SF, LRK, HS, GMR, PCA, MC, DSR, RBD, DRFIs, and DRFI.)

Fig. 10. Quantitative comparison of saliency maps produced by different approaches on different data sets in terms of ROC curves. See the supplemental materials for more evaluations. (Six panels: MSRA, iCoSeg, ECSSD, DUT-OMRON, SED2, and DUT-OMRON*; each plots true positive rate versus false positive rate for the same 14 methods as in Fig. 9.)


6.6 Robustness Analysis

As suggested by Fig. 6 and Fig. 7, the regional backgroundness and property descriptors, especially the geometric properties, play important roles in our approach. In natural images, the pseudo-background assumption may not hold well. Additionally, the distribution of salient objects may differ from that of our training set. It is natural to ask whether our approach can still perform well in these challenging cases.


Fig. 11. Visual comparison of the saliency maps. From left to right: (a) input, (b) SVO, (c) CA, (d) CB, (e) RC, (f) SF, (g) LRK, (h) HS, (i) GMR, (j) PCA, (k) MC, (l) DSR, (m) RBD, (n) DRFIs, (o) DRFI. Our method (DRFI) consistently generates better saliency maps.

To this end, we select 635 images from the DUT-OMRON dataset (we call this subset DUT-OMRON*) in which salient objects touch the image border and are far from the image center. Quantitative comparisons with state-of-the-art approaches are presented in Fig. 8, Fig. 9, and Fig. 10. See our project website jianghz.com/drfi and the supplementary material for more details.

Not surprisingly, the performance of all approaches declines. But our approach DRFI still significantly outperforms the other methods in terms of PR curves, ROC curves, and AUC scores. Even with a single level, our approach DRFIs performs slightly better than the others (ranked second best in terms of AUC scores). Specifically, DRFIs and DRFI are better than the top-performing method by around 2.22% and 1.36% (compared with 2.20% and 1.26% on the whole DUT-OMRON data set) according to the AUC scores.

6.7 Efficiency

Since the computation on each level of the multiple segmentations is independent, we use multi-threading to accelerate our C++ code. Fig. 12 summarizes the running times of the different approaches, tested on the MSRA-B data set with a typical 400 × 300 image on a PC with an Intel i5 2.50GHz CPU and 8GB memory; 8 threads are used for acceleration. As we can see, our approach runs as fast as most existing approaches. If equipped as a pre-processing step of an application, e.g., picture collage, our approach will not harm the user experience.

For training, it takes around 24 hours with around 1.7 million training samples. As each decision tree is trained independently of the others, parallel computing techniques can also be used to accelerate training.


Method | SVO | CA | CB | RC | SF | LRK | HS
Time (s) | 56.5 | 52.3 | 1.40 | 0.138 | 0.210 | 11.5 | 0.365
Code | M+C | M+C | M+C | C | C | M+C | EXE

Method | GMR | PCA | MC | DSR | RBD | DRFIs | DRFI*
Time (s) | 1.16 | 2.07 | 0.129 | 4.19 | 0.267 | 0.183 | 0.418
Code | M+C | M+C | M+C | M+C | M | C | C

Fig. 12. Comparison of running times. M indicates the code is written in MATLAB, and EXE corresponds to an executable. (*8 threads are used for acceleration.)

7 DISCUSSIONS AND FUTURE WORK

7.1 Unsupervised vs. Supervised

As data-driven approaches, especially supervised learning methods, dominate other fields of computer vision, it is somewhat surprising that the potential of supervised salient object detection remains relatively under-exploited. The main research efforts in salient object detection still concentrate on developing heuristic rules to combine saliency features. Note that we are not saying heuristic models are useless. We instead favor supervised approaches for the following two advantages. On one hand, supervised learning approaches can automatically fuse different kinds of saliency features, which is valuable especially when facing high-dimensional feature vectors. It is nearly infeasible for humans, even domain experts, to design rules to integrate the 93-dimensional feature vector of this paper. For example, the sixth most important feature p12 (variance of the L* values of a region) seems rather obscure for salient object detection. Yet the integration rules discovered from training samples indicate that it is highly discriminative, more so than the traditional regional contrast features.
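As a sketch of how such discriminative features can be surfaced automatically (again illustrative Python on dummy data rather than our C++ pipeline), a trained random forest exposes per-feature importance scores that can be ranked directly:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.random((5_000, 93))   # dummy 93-d regional descriptors
y = rng.random(5_000)         # dummy saliency scores

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)

# Rank the features by their learned importance; on real training
# data this is how an obscure-looking descriptor (e.g., the L*
# variance p12) can be revealed to be highly discriminative.
order = np.argsort(forest.feature_importances_)[::-1]
for idx in order[:10]:
    print(f"feature {idx:2d}: importance "
          f"{forest.feature_importances_[idx]:.4f}")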

On the other hand, data-driven approaches generally have much better generalization ability than heuristic methods. In a recent survey of salient object detection [23], none of the existing unsupervised algorithms consistently outperforms the others on all benchmark data sets, since different pre-defined heuristic rules favor different settings of the data set (e.g., the number of objects, center bias, etc.). Leveraging the large amount of training samples (nearly two million), our learned regional saliency regressor performs almost the best over all six benchmark data sets. Even though it is trained on a single data set, it performs better than the others on challenging cases which are significantly different from the training set.

One potential reason that learning-based approaches are not favored for salient object detection might be related to efficiency. Though a lot of time is required to train a classifier, testing time is more likely to be the major concern of a system than the offline training time. Once trained, the classifier can be used as an off-the-shelf tool. In this paper, we demonstrate that a learning-based approach can run as fast as some heuristic methods.

Fig. 13. Failure cases of our approach.

7.2 Limitations of Our Approach

Since our approach mainly considers the regional contrast and backgroundness features, it may fail on cluttered scenes. See Fig. 13 for an illustration. In the first column, high saliency values are assigned to the textured areas in the background as they are distinct in terms of either contrast or background features. The salient object in the second column has a similar color to the background and occupies a large portion of the image, making it challenging to generate a good detection result. For the third column, it is not fair to say that our approach completely fails. The flag is indeed salient. However, as the statue violates the pseudo-background assumption and also occupies a large portion of the image, it is difficult to generate an appealing saliency map using our approach.

7.3 Conclusion and Future Work

In this paper, we address the salient object detection problem using a discriminative regional feature integration approach. The success of our approach stems from the utilization of a supervised learning algorithm: we learn a Random Forest regressor that automatically integrates a high-dimensional regional feature descriptor to predict the saliency score and automatically discovers the most discriminative features. Experimental results validate that, compared with traditional approaches that heuristically compute saliency maps from different types of features, our approach achieves significantly better performance on the benchmark data sets.

Our approach is closely related to the image labeling method [26]. The goal is to assign pre-defined labels (geometric categories in [26]; object or background in salient object detection) to the pixels. It needs further study to investigate the connection between image labeling and salient object detection, and whether the two problems are essentially equivalent. Utilizing data-driven image labeling approaches for salient object detection is also worth exploring in the future.

Additionally, there exist some obvious directions to further improve our approach.

• Incorporating more saliency features. In this paper, we consider only the contrast, backgroundness, and generic property features of a region. By considering more saliency features, better detection results can be expected. For example, the background connectivity prior [55] can be incorporated to relax the pseudo-background assumption. Additionally, the spatial distribution prior [19], [21], the focusness prior [52], the diverse density score [53] based on generic objectness, and the graph-based manifold ranking score [48] can also be integrated.

• Better fusion strategy. We simply investigate the linear fusion of saliency maps, without any post-optimization step; a sketch of this fusion is given after this list. As future work, we can utilize the optimization steps of other approaches to enhance the performance. For example, we can run saliency detection on hierarchical segmentations and fuse them as suggested in [77]. The optimization method proposed in [55] is also applicable.

• Integrating more cues. A recent trend in salient object detection is to integrate more cues in addition to traditional RGB data. Our approach naturally extends to cues such as depth from RGB-D input, temporal consistency in video sequences, and saliency co-occurrence for co-salient object detection.
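The linear fusion mentioned in the second bullet can be sketched as follows (illustrative Python; the uniform weights are placeholders for the combination weights learned offline in our pipeline):

import numpy as np

def fuse_levels(level_maps, weights=None):
    # Linearly combine per-level saliency maps into the final map.
    # Uniform weights are a placeholder; in the real pipeline the
    # combination weights are learned from training data.
    maps = np.stack([np.asarray(m, dtype=np.float64)
                     for m in level_maps])
    if weights is None:
        weights = np.full(len(level_maps), 1.0 / len(level_maps))
    fused = np.tensordot(weights, maps, axes=1)
    # Rescale to [0, 1] for display and thresholding.
    fused -= fused.min()
    if fused.max() > 0:
        fused /= fused.max()
    return fused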

ACKNOWLEDGEMENTS

This work was supported in part by the National Basic Research Program of China under Grant No. 2015CB351703 and 2012CB316400, and the National Natural Science Foundation of China under Grant No. 91120006.

REFERENCES

[1] H. Jiang, J. Wang, Z. Yuan, Y. Wu, N. Zheng, and S. Li, “Salient object detection: A discriminative regional feature integration approach,” in IEEE CVPR, 2013, pp. 2083–2090.

[2] L. Itti, C. Koch, and E. Niebur, “A model of saliency-based visual attention for rapid scene analysis,” IEEE TPAMI, 1998.

[3] Y.-F. Ma and H.-J. Zhang, “Contrast-based image attention analysis by using fuzzy growing,” in ACM Multimedia, 2003.

[4] T. Liu, J. Sun, N.-N. Zheng, X. Tang, and H.-Y. Shum, “Learning to detect a salient object,” in CVPR, 2007, pp. 1–8.

[5] C. Kanan and G. W. Cottrell, “Robust classification of objects, faces, and flowers using natural image statistics,” in CVPR, 2010, pp. 2472–2479.

[6] D. Walther and C. Koch, “Modeling attention to salient proto-objects,” Neural Networks, vol. 19, no. 9, pp. 1395–1407, 2006.

[7] L. Itti, “Automatic foveation for video compression using a neurobiological model of visual attention,” IEEE TIP, 2004.

[8] L. Marchesotti, C. Cifarelli, and G. Csurka, “A framework for visual saliency detection with applications to image thumbnailing,” in ICCV, 2009, pp. 2232–2239.

[9] S. Goferman, A. Tal, and L. Zelnik-Manor, “Puzzle-like collage,” Comput. Graph. Forum, vol. 29, no. 2, pp. 459–468, 2010.

[10] J. Wang, L. Quan, J. Sun, X. Tang, and H.-Y. Shum, “Picture collage,” in CVPR (1), 2006, pp. 347–354.

[11] P. Wang, D. Zhang, J. Wang, Z. Wu, X.-S. Hua, and S. Li, “Color filter for image search,” in ACM Multimedia, 2012.

[12] P. Wang, D. Zhang, G. Zeng, and J. Wang, “Contextual dominant color name extraction for web image search,” in ICME Workshops, 2012, pp. 319–324.

[13] R. Achanta, S. S. Hemami, F. J. Estrada, and S. Susstrunk, “Frequency-tuned salient region detection,” in CVPR, 2009.

[14] A. Borji and L. Itti, “Exploiting local and global patch rarities for saliency detection,” in CVPR, 2012, pp. 478–485.

[15] M.-M. Cheng, N. J. Mitra, X. Huang, P. H. S. Torr, and S.-M. Hu, “Global contrast based salient region detection,” IEEE TPAMI, 2014.

[16] D. Gao, V. Mahadevan, and N. Vasconcelos, “The discriminant center-surround hypothesis for bottom-up saliency,” in NIPS, 2007.

[17] S. Goferman, L. Zelnik-Manor, and A. Tal, “Context-aware saliency detection,” in CVPR, 2010, pp. 2376–2383.

[18] D. A. Klein and S. Frintrop, “Center-surround divergence of feature statistics for salient object detection,” in ICCV, 2011.

[19] T. Liu, Z. Yuan, J. Sun, J. Wang, N. Zheng, X. Tang, and H.-Y. Shum, “Learning to detect a salient object,” IEEE TPAMI, vol. 33, no. 2, pp. 353–367, 2011.

[20] Y. Lu, W. Zhang, H. Lu, and X. Xue, “Salient object detection using concavity context,” in ICCV, 2011, pp. 233–240.

[21] F. Perazzi, P. Krahenbuhl, Y. Pritch, and A. Hornung, “Saliency filters: Contrast based filtering for salient region detection,” in CVPR, 2012, pp. 733–740.

[22] A. Treisman and G. Gelade, “A feature-integration theory of attention,” Cognitive Psychology, vol. 12, no. 1, pp. 97–136, 1980.

[23] A. Borji, D. N. Sihite, and L. Itti, “Salient object detection: A benchmark,” in ECCV (2), 2012, pp. 414–429.

[24] P. Khuwuthyakorn, A. Robles-Kelly, and J. Zhou, “Object of interest detection by saliency learning,” in ECCV, 2010.

[25] P. Mehrani and O. Veksler, “Saliency segmentation based on learning and graph cut refinement,” in BMVC, 2010.

[26] D. Hoiem, A. A. Efros, and M. Hebert, “Recovering surface layout from an image,” IJCV, vol. 75, no. 1, pp. 151–172, 2007.

[27] A. Borji and L. Itti, “State-of-the-art in visual attention modeling,” IEEE TPAMI, 2013.

[28] F. Liu and M. Gleicher, “Region enhanced scale-invariant saliency detection,” in ICME, 2006, pp. 1477–1480.

[29] D. Gao and N. Vasconcelos, “Bottom-up saliency is a discriminant process,” in ICCV, 2007, pp. 1–6.

[30] X. Li, Y. Li, C. Shen, A. R. Dick, and A. van den Hengel, “Contextual hypergraph modeling for salient object detection,” in ICCV, 2013, pp. 3328–3335.

[31] R. Margolin, A. Tal, and L. Zelnik-Manor, “What makes a patch distinct?” in CVPR, 2013.

[32] H. Jiang, J. Wang, Z. Yuan, T. Liu, N. Zheng, and S. Li, “Automatic salient object segmentation based on context and shape prior,” in BMVC, 2011.

[33] Q. Yan, L. Xu, J. Shi, and J. Jia, “Hierarchical saliency detection,” in CVPR, 2013.

[34] C. Scharfenberger, A. Wong, K. Fergani, J. S. Zelek, and D. A. Clausi, “Statistical textural distinctiveness for salient region detection in natural images,” in CVPR, 2013, pp. 979–986.

[35] K. Shi, K. Wang, J. Lu, and L. Lin, “PISA: Pixelwise image saliency by aggregating complementary appearance contrast measures with spatial priors,” in CVPR, 2013, pp. 2115–2122.

[36] M.-M. Cheng, J. Warrell, W.-Y. Lin, S. Zheng, V. Vineet, and N. Crook, “Efficient salient region detection with soft image abstraction,” in ICCV, 2013, pp. 1529–1536.

[37] X. Shen and Y. Wu, “A unified approach to salient object detection via low rank matrix recovery,” in CVPR, 2012.

[38] W. Zou, K. Kpalma, Z. Liu, J. Ronsin et al., “Segmentation driven low-rank matrix recovery for saliency detection,” in BMVC, 2013, pp. 1–13.

[39] H. Peng, B. Li, R. Ji, W. Hu, W. Xiong, and C. Lang, “Salient object detection via low-rank and structured sparse matrix decomposition,” in AAAI, 2013.

[40] Z. Jiang and L. S. Davis, “Submodular salient region detection,” in CVPR, 2013, pp. 2043–2050.

[41] E. Rahtu, J. Kannala, M. Salo, and J. Heikkila, “Segmenting salient objects from images and videos,” in ECCV (5), 2010, pp. 366–379.

[42] Y. Xie, H. Lu, and M.-H. Yang, “Bayesian saliency via low and mid level cues,” IEEE TIP, vol. 22, no. 5, pp. 1689–1698, 2013.

[43] R. Liu, J. Cao, G. Zhong, Z. Lin, S. Shan, and Z. Su, “Adaptive partial differential equation learning for visual saliency detection,” in CVPR, 2014.

[44] P. Wang, J. Wang, G. Zeng, J. Feng, H. Zha, and S. Li, “Salient object detection for searched web images via global saliency,” in CVPR, 2012, pp. 3194–3201.

[45] S. Vicente, V. Kolmogorov, and C. Rother, “Graph cut based image segmentation with connectivity priors,” in CVPR, 2008.


[46] L. Wang, J. Xue, N. Zheng, and G. Hua, “Automatic salient object extraction with contextual cue,” in ICCV, 2011.

[47] Y. Wei, F. Wen, W. Zhu, and J. Sun, “Geodesic saliency using background priors,” in ECCV (3), 2012, pp. 29–42.

[48] C. Yang, L. Zhang, H. Lu, X. Ruan, and M.-H. Yang, “Saliency detection via graph-based manifold ranking,” in CVPR, 2013.

[49] X. Li, H. Lu, L. Zhang, X. Ruan, and M.-H. Yang, “Saliency detection via dense and sparse reconstruction,” in ICCV, 2013.

[50] B. Jiang, L. Zhang, H. Lu, C. Yang, and M.-H. Yang, “Saliency detection via absorbing Markov chain,” in ICCV, 2013.

[51] K.-Y. Chang, T.-L. Liu, H.-T. Chen, and S.-H. Lai, “Fusing generic objectness and visual saliency for salient object detection,” in ICCV, 2011, pp. 914–921.

[52] P. Jiang, H. Ling, J. Yu, and J. Peng, “Salient region detection by UFO: Uniqueness, focusness and objectness,” in ICCV, 2013.

[53] Y. Jia and M. Han, “Category-independent object-level saliency detection,” in ICCV, 2013.

[54] J. Zhang and S. Sclaroff, “Saliency detection: A Boolean map approach,” in ICCV, 2013, pp. 153–160.

[55] W. Zhu, S. Liang, Y. Wei, and J. Sun, “Saliency optimization from robust background detection,” in CVPR, 2014.

[56] M. Wang, J. Konrad, P. Ishwar, K. Jing, and H. A. Rowley, “Image saliency: From intrinsic to extrinsic context,” in CVPR, 2011, pp. 417–424.

[57] Y. Niu, Y. Geng, X. Li, and F. Liu, “Leveraging stereopsis for saliency analysis,” in CVPR, 2012, pp. 454–461.

[58] K. Desingh, K. M. Krishna, D. Rajan, and C. Jawahar, “Depth really matters: Improving visual salient region detection with depth,” in BMVC, 2013.

[59] N. Li, J. Ye, Y. Ji, H. Ling, and J. Yu, “Saliency detection on light fields,” in CVPR, 2014.

[60] S. Lu, V. Mahadevan, and N. Vasconcelos, “Learning optimal seeds for diffusion-based salient object detection,” in CVPR, 2014.

[61] J. Kim, D. Han, Y.-W. Tai, and J. Kim, “Salient region detection via high-dimensional color transform,” in CVPR, 2014.

[62] Y. Li, X. Hou, C. Koch, J. M. Rehg, and A. L. Yuille, “The secrets of salient object segmentation,” in CVPR, 2014.

[63] J. Carreira and C. Sminchisescu, “Constrained parametric min-cuts for automatic object segmentation,” in CVPR, 2010, pp. 3241–3248.

[64] P. Wang, J. Wang, G. Zeng, J. Feng, H. Zha, and S. Li, “Salient object detection for searched web images via global saliency,” in CVPR, 2012, pp. 3194–3201.

[65] F. Moosmann, D. Larlus, and F. Jurie, “Learning saliency maps for object categorization,” in ECCVW, 2006.

[66] T. Judd, K. A. Ehinger, F. Durand, and A. Torralba, “Learning to predict where humans look,” in ICCV, 2009, pp. 2106–2113.

[67] Y. Lu, W. Zhang, C. Jin, and X. Xue, “Learning attention map from images,” in CVPR, 2012, pp. 1067–1074.

[68] E. P. Simoncelli and W. T. Freeman, “The steerable pyramid: A flexible architecture for multi-scale derivative computation,” in ICIP (3), 1995, pp. 444–447.

[69] B. Fernando, E. Fromont, D. Muselet, and M. Sebban, “Discriminative feature fusion for image classification,” in CVPR, 2012, pp. 3434–3441.

[70] P. F. Felzenszwalb and D. P. Huttenlocher, “Efficient graph-based image segmentation,” IJCV, vol. 59, no. 2, 2004.

[71] M. Heikkila, M. Pietikainen, and C. Schmid, “Description of interest regions with local binary patterns,” Pattern Recognition, vol. 42, no. 3, pp. 425–436, 2009.

[72] T. K. Leung and J. Malik, “Representing and recognizing the visual appearance of materials using three-dimensional textons,” IJCV, vol. 43, no. 1, pp. 29–44, 2001.

[73] D. Hoiem, A. A. Efros, and M. Hebert, “Geometric context from a single image,” in ICCV, 2005, pp. 654–661.

[74] H. Jiang, Y. Wu, and Z. Yuan, “Probabilistic salient object contour detection based on superpixels,” in ICIP, 2013, pp. 3069–3072.

[75] D. Batra, A. Kowdle, D. Parikh, J. Luo, and T. Chen, “iCoseg: Interactive co-segmentation with intelligent scribble guidance,” in IEEE CVPR, 2010, pp. 3169–3176.

[76] S. Alpert, M. Galun, R. Basri, and A. Brandt, “Image segmentation by probabilistic bottom-up aggregation and cue integration,” in CVPR, 2007.

[77] Q. Yan, L. Xu, J. Shi, and J. Jia, “Hierarchical saliency detection,” in CVPR, 2013, pp. 1155–1162.

[78] X. Shen and Y. Wu, “A unified approach to salient object detection via low rank matrix recovery,” in CVPR, 2012.