
Learning Spatial Context: Using Stuff to Find Things

Anonymous ECCV submission
Paper ID 892

Abstract. The sliding window approach of detecting rigid objects (such as cars) is predicated on the belief that the object can be identified from the appearance in a defined region around the object. Other types of objects of amorphous spatial extent (e.g., trees, sky), however, are more naturally classified based on texture or color. In this paper, we seek to combine recognition of these two types of objects into a system that leverages "context" toward improving detection. In particular, we cluster regions of the image based on their ability to serve as context for the detection of objects. Rather than providing an explicit training set with region labels, our method automatically groups regions based on both their appearance and their relationships to the detections in the image. We show that our things and stuff (TAS) context model produces meaningful clusters that are readily interpretable, and helps improve our detection ability over state-of-the-art detectors. We present results on object detection in images from the PASCAL VOC 2005/2006 datasets and on the task of overhead car detection in satellite images.

1 Introduction

Recognizing objects in an image requires combining many different signals from the raw image data. Figure 1 shows an example satellite image of a street scene, where we may want to identify all of the cars. From a human perspective, there are two primary signals that we leverage. The first is the local appearance in the window near the potential car. The second is our knowledge that cars appear on roads. This second signal is a form of contextual knowledge, and our goal in this paper is to capture this idea in a rigorous probabilistic model.

Recent papers have demonstrated that boosted object detectors can be effective at detecting monolithic object classes, such as cars [1] or faces [2]. Similarly, several works on multiclass segmentation have shown that regions in images can effectively be classified based on their color or texture properties [3]. These two lines of work have made significant progress on the problems of identifying "things" and "stuff," respectively. The important differentiation between these two classes of visual objects is summarized in Forsyth et al. [4] as:

    The distinction between materials — "stuff" — and objects — "things" — is particularly important. A material is defined by a homogeneous or repetitive pattern of fine-scale properties, but has no specific or distinctive spatial extent or shape. An object has a specific size and shape.

Fig. 1. (Left) An aerial photograph. (Center) Detected cars in the image (solid green = true detections, dashed red = false detections). (Right) Finding "stuff" such as buildings, by classifying regions, shown delineated by red boundaries.

Recent work has also shown that classifiers targeting particular classes of things or stuff can benefit from the proper use of contextual cues. The use of context can be split into a number of categories. Scene-Thing context allows scene-level information, such as the scale or the "gist" [5], to determine prior location probabilities for the various objects. Stuff-Stuff context captures the notion that sky occurs above sea and road below building [6]. Thing-Thing context considers the co-occurrence of objects, such as the notion that a tennis racket is more likely to occur with a tennis ball than with a lemon [7]. Finally, Stuff-Thing context allows the texture regions (e.g., the roads and buildings in Figure 1) to add predictive power to the detection of objects (e.g., the cars in Figure 1). We focus on this fourth type of context. Figure 2 shows an example of this context in the case of satellite imagery.

In this paper, we present a probabilistic model linking the detection of things to the unsupervised classification of stuff. Our method can be viewed as an attempt to cluster "stuff," represented by coherent image regions, into clusters that are both visually similar and best able to provide context for the detectable "things" in the image. Cluster labels for the image regions are probabilistically linked to the detection window labels through the use of region-detection "relationships," which encode their relative spatial locations. We call this model the things and stuff (TAS) context model because it links these two components into a coherent whole. The graphical representation of the TAS model is depicted in Figure 3. At training time, this model leverages supervised (groundtruth) detection labels to learn parameters using the Expectation-Maximization (EM) algorithm. For test instances, both the region labels and detection labels are inferred from the model.

We present results of the TAS method on diverse datasets. Using the PASCAL Visual Object Classes challenge datasets from 2005 and 2006 (VOC2005 and VOC2006), we utilize one of the top competitors as our baseline detector [10], and demonstrate that our TAS method improves the performance of detecting cars, bicycles and motorbikes in street scenes and cows and sheep in rural scenes (see Figure 4). In addition, we consider a very different dataset of satellite images from Google Earth, of the city and suburbs of Brussels, Belgium. The task here is to identify the cars from above; see Figure 2. For clarity, the satellite data is used as the running example throughout the paper, but all descriptions apply equally to the other datasets.

Fig. 2. Example detections from the satellite dataset that demonstrate context. Classifying using local appearance only, we might think that both windows at left are cars. However, when seen in context, the bottom detection is unlikely to be an actual car.

2 Related Work

The role of context in object recognition has become an important topic, due both to the psychological basis of context in the human visual system [11] and to the striking algorithmic improvements that "visual context" has provided [12].

The word "context" has been attached to many different ideas. One of the simplest forms is co-occurrence context. The work of Rabinovich et al. [7] demonstrates the use of this context, where the presence of a certain object class in an image probabilistically influences the presence of a second class. The context of Torralba et al. [12] assumes that certain objects occur more frequently in certain rooms, as monitors tend to occur in offices. While these methods achieve excellent results when many different object classes are labeled per image, they are unable to leverage unsupervised data for contextual object recognition.

In addition to co-occurrence context, many approaches take into account the spatial relationships between objects. At the local descriptor level, Wolf et al. [13] detect objects using a descriptor with a large capture range, allowing the detection of the object to be influenced by surrounding image features. Because these methods use only the raw features for context, however, they cannot obtain a holistic view of an entire scene. Similarly, Fink and Perona [14] use the output of boosted detectors for other classes as additional features for the detection of a given class. This allows the inclusion of signal beyond the raw features, but requires that all "parts" of a scene be supervised. In contrast, Murphy et al. [5] use a global feature known as the "gist" to learn statistical priors on the locations of various objects within the context of the specific scene. The gist descriptor is excellent at predicting the scene type and large structures in the scene, but cannot handle the local interactions present in the satellite data, for example.

Another approach to modeling spatial relationships is to use a Markov Random Field (MRF) or variant (CRF, DRF) [8, 9] to encode the preferences for certain spatial relationships. These techniques offer a great deal of flexibility in the formulation of the affinity function and all the standard benefits of a graphical model formulation (e.g., well-known learning and inference techniques). Singhal et al. [6] also use similar concepts to the MRF formulation to aggregate decisions across the image. These methods, however, suffer from two drawbacks. First, they tend to require a large amount of annotation in the training set. Second, they put things and stuff on the same footing, representing both as "sites" in the MRF. Our method requires less annotation and allows detections and image regions to be represented in their (different) natural spaces.

Perhaps the most ambitious attempts to use context involve modeling the scene of an image holistically. Torralba [1], for instance, uses global image features to "prime" the detector with the likely presence/absence of objects, the likely locations, and the likely scales. The work of Hoiem and Efros [15] takes this one level further by explicitly modeling the 3D layout of the scene. This allows the natural use of scale and location constraints (e.g., things closer to the camera are larger). Their approach, however, is tailored to street scenes, and requires domain-specific modeling. The specific form of their priors would be useless in the case of satellite images, for example.

3 Things and Stuff (TAS) Context Model

Our probabilistic context model builds on two standard components that are commonly used in the literature. The first is sliding window object detection, and the second is unsupervised image region clustering. A common method for finding "things" is to slide a window over the image, score each window's match to the model, and return the highest matching such windows. We denote the features in the ith candidate window by W_i, the presence/absence of the target class in that window by T_i (T for "thing"), and assume that what we learn in our detector is a conditional probability P(T_i | W_i) from the window features to the probability that the window contains the object; this probability can be derived from most standard classifiers, such as the highly successful boosting approaches [1]. The set of windows included in the model can, in principle, include all windows in the image. We, however, limit ourselves to the top scoring detection windows according to the detector (i.e., all windows above some low threshold, chosen in order to capture most of the true positives).

The second component we build on involves clustering coherent regions of the image into groups based on appearance. We begin by segmenting the image into regions, known as superpixels, using the normalized cut algorithm of Ren and Malik [16]. For each region j, we extract a feature vector F_j that includes color and texture features. For our stuff model, we will use a generative model where each region has a hidden class, and the features are generated from a Gaussian distribution with parameters depending on the class. Denote by S_j (S for "stuff") the (hidden) class label of the jth region. We assume that the features are derived from a standard naive Bayes model, where we have a probability distribution P(F_j | S_j) of the image features given the class label.

In order to relate the detector for "things" (T's) to the clustering model over "stuff" (S's), we need to develop an intuitive representation for the relationships between these units. Human knowledge in this area comes in sentences like "cars drive on roads," or "cars park 20 feet from buildings." We introduce variables R_ij that represent the relationship between candidate detection i and region j. The values of R_ij indicate different relationships, such as "detection i is in region j," or "detection i is about 100 pixels away from region j."

We can now link our component models into a single coherent probabilistic things and stuff (TAS) model, as depicted in the plate model of Figure 3(a).

Fig. 3. The TAS model. The plate representation (a) gives a compact visualization of the model, which unrolls into a "ground" network (b) for any particular image. (Plate-model variables: W_i: window, T_i: object presence, S_j: region label, F_j: region features, R_ij: relationship; plates range over the N candidate windows and the J image regions.)

Probabilistic influence flows between the detection window labels and the image region labels through the v-structures that are activated by observing the relationship variables. For a particular input image, this plate model unrolls into a "ground" network that encodes a distribution over the detections and regions of the image. Figure 3(b) shows a toy example of a ground network for the image of Figure 1. It is interesting to note the similarities between TAS and the MRF approaches in the literature. In effect, the relationship variables link the detections and the regions into a probabilistic web where signals at one point in the web (say the strong appearance of a particular region) influence a detection in another part of the image, which in turn influences the label of a different region in yet another location. In the example of Figure 3(b), if region 3 has strong road appearance, it might influence T_1 through the relationship R_13. This in turn will influence the labeling of region 1 through the relationship R_11. Because our method uses a generative model for the region labels, training can be performed very effectively, even when the S_j variables are unobserved (see below).

All variables in the TAS model are discrete except for the feature F_j variables. This allows for simple table conditional probability distributions (CPDs) for all discrete nodes in this Bayesian network. The probability distribution over these variables decomposes according to:

    P(T, S, F, R, W) = \prod_i P(W_i) P(T_i | W_i) \prod_j P(S_j) P(F_j | S_j) \prod_{i,j} P(R_{ij} | T_i, S_j).

One of the main benefits of the TAS approach is its simplicity and modularity. It allows us to "plug in" any sliding window detector and any generative approach for region clustering (e.g., [3]).
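As a concrete illustration of this factorization (a minimal sketch, not the authors' implementation), the following snippet evaluates the log of the joint probability for a complete assignment. The parameter layout used here, a cluster prior `pi`, per-cluster Gaussian parameters `mu`/`cov`, and a table CPD `rel` indexed by (T_i, S_j, R_ij), is a hypothetical choice made only for this sketch; the later sketches reuse the same layout.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_joint(T, S, F, R, W_prob, params):
    """Log of the TAS factorization:
    sum_i log P(T_i | W_i) + sum_j [log P(S_j) + log P(F_j | S_j)]
    + sum_ij log P(R_ij | T_i, S_j).
    (P(W_i) is constant with respect to T and S, so it is omitted.)
    T: (N,) binary window labels; S: (J,) region cluster labels
    F: (J, D) region features;    R: (N, J) relationship indices
    W_prob: (N,) detector probabilities P(T_i = 1 | W_i)
    params: dict with 'pi' (K,), 'mu' (K, D), 'cov' (K, D, D),
            and 'rel' (2, K, num_rels) table CPD for P(R | T, S)."""
    lp = 0.0
    # Detector terms P(T_i | W_i)
    lp += np.sum(np.log(np.where(T == 1, W_prob, 1.0 - W_prob)))
    # Region terms P(S_j) P(F_j | S_j)
    for j, s in enumerate(S):
        lp += np.log(params['pi'][s])
        lp += multivariate_normal.logpdf(F[j], params['mu'][s], params['cov'][s])
    # Relationship terms P(R_ij | T_i, S_j)
    for i in range(len(T)):
        for j, s in enumerate(S):
            lp += np.log(params['rel'][T[i], s, R[i, j]])
    return lp
```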

4 Learning and Inference in the TAS Model

Because TAS unrolls into a Bayesian network for each image, we can use standard learning and inference methods. In particular, we learn the parameters of our model using the Expectation-Maximization (EM) [17] algorithm and perform inference using Gibbs sampling, a standard variant of MCMC sampling [18].

Learning the Model with EM. At learning time, we have a set of images with annotated labels for our target object class(es). We first train the base detectors using this set, and select as our candidate windows all detections above a threshold; we thus obtain a set of candidate detection windows W_1 ... W_N along with their groundtruth labels T_1 ... T_N. We also have a set of regions and a feature vector F_j for each, and an R_ij relationship variable for every window-region pair. Our goal is to learn parameters for the TAS model.

One option is to learn in two phases: we first use an existing clustering method (such as K-means or Mixture-of-Gaussian clustering) to assign the S_j variables in the training data; we then learn parameters for the TAS model with fully observed data. This approach is attractive as it allows the first part of the learning to be fully unsupervised, enabling us to use more images than exist in our labeled dataset. We call the model learned with this method Pre-Clustered TAS, but while the resulting clusters are likely to be visually coherent, they do not exploit the spatial relationships between the regions and the object detections.

We therefore propose a joint training method where we perform EM in the full model; this model is called the Full TAS model. The EM algorithm iterates between using probabilistic inference to derive a soft completion of the hidden variables (E-step) and finding maximum-likelihood parameters relative to this soft completion (M-step). The E-step here is particularly easy: at training time, only the S_j's are unobserved; moreover, because the T variables are observed, the S_j's are conditionally independent of each other. Thus, the inference step turns into a simple computation for each S_j separately, a process which can be performed in linear time. The M-step for table CPDs can be performed easily in closed form. To provide a good starting point for EM, we initialize the cluster assignments using the K-means algorithm. EM is guaranteed to converge to a local maximum of the likelihood function of the observed data.
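To illustrate why the E-step is simple when the T_i's are observed, here is a sketch of the per-region posterior computation for one image, reusing the hypothetical parameter layout from the earlier sketch. The M-step would then use these responsibilities as soft counts for the table CPDs and the Gaussian cluster parameters.

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step_region_posteriors(T, F, R, params):
    """E-step for one training image: since the T_i's are observed, each
    hidden S_j has the closed-form posterior
      q(S_j = s) is proportional to
      P(S_j = s) P(F_j | S_j = s) * prod_i P(R_ij | T_i, S_j = s).
    Returns a (J, K) matrix of responsibilities."""
    K = len(params['pi'])
    J = F.shape[0]
    log_q = np.zeros((J, K))
    for s in range(K):
        log_q[:, s] = np.log(params['pi'][s])
        log_q[:, s] += multivariate_normal.logpdf(F, params['mu'][s], params['cov'][s])
        for i, t in enumerate(T):                 # relationship evidence
            log_q[:, s] += np.log(params['rel'][t, s, R[i]])
    log_q -= log_q.max(axis=1, keepdims=True)     # normalize in log space
    q = np.exp(log_q)
    return q / q.sum(axis=1, keepdims=True)
```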

Inference with Gibbs Sampling. At test time, our system must determine which windows in a new image contain the target object. We observe the candidate detection windows (W_i's, extracted by thresholding the base detector output), the features of each image region (F_j's), and the relationships (R_ij's). Our task is to find the probability that each window contains the object:

    P(T | F, R, W) = \sum_{S} P(T, S | F, R, W).    (1)

Unfortunately, this expression involves a summation over an exponential set of values for the S vector of variables. We solve the inference problem approximately using a Gibbs sampling [18] MCMC method. We begin with some assignment to the variables. Then, in each Gibbs iteration we first resample all of the S's and then resample all of the T's according to the following two probabilities:

    P(S_j | T, F, R, W) \propto P(S_j) P(F_j | S_j) \prod_i P(R_{ij} | T_i, S_j)    (2)

    P(T_i | S, F, R, W) \propto P(T_i | W_i) \prod_j P(R_{ij} | T_i, S_j).    (3)

These sampling steps can be performed efficiently, as the T_i variables are conditionally independent given the S's and the S_j variables are conditionally independent given the T's. In the last Gibbs iteration for each sample, rather than resampling T, we compute the posterior probability over T given our current S samples, and use these distributional particles for our estimate of the probability in (1).
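The Gibbs updates of equations (2) and (3) could be implemented along the following lines. This is again only a sketch with the same hypothetical parameter layout as above; the random initialization and the final "distributional particle" step mirror the description in the text.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gibbs_infer(W_prob, F, R, params, n_samples=10, n_iters=5, seed=0):
    """Approximate P(T_i = 1 | F, R, W) from Eq. (1) by Gibbs sampling:
    resample S via Eq. (2) and T via Eq. (3); on the last iteration of each
    sample, keep the conditional P(T_i | S, W) instead of a hard draw."""
    rng = np.random.default_rng(seed)
    N, J = R.shape
    K = len(params['pi'])
    # Region feature log-likelihoods log P(F_j | S_j = s), shape (J, K).
    feat_ll = np.column_stack([
        multivariate_normal.logpdf(F, params['mu'][s], params['cov'][s])
        for s in range(K)])
    T_post = np.zeros(N)
    for _ in range(n_samples):
        T = (rng.random(N) < W_prob).astype(int)   # random initialization
        S = rng.integers(0, K, size=J)
        for it in range(n_iters):
            for j in range(J):                     # Eq. (2): resample each S_j
                logp = np.log(params['pi']) + feat_ll[j]
                for i in range(N):
                    logp += np.log(params['rel'][T[i], :, R[i, j]])
                p = np.exp(logp - logp.max())
                S[j] = rng.choice(K, p=p / p.sum())
            # Eq. (3): conditional for each T_i given the current S.
            log1 = np.log(W_prob) + np.array(
                [np.log(params['rel'][1, S, R[i]]).sum() for i in range(N)])
            log0 = np.log(1.0 - W_prob) + np.array(
                [np.log(params['rel'][0, S, R[i]]).sum() for i in range(N)])
            p1 = 1.0 / (1.0 + np.exp(log0 - log1))
            if it == n_iters - 1:
                T_post += p1                       # distributional particle
            else:
                T = (rng.random(N) < p1).astype(int)
    return T_post / n_samples
```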

5 Experimental Results

In order to evaluate the ability of the TAS model to learn and utilize context, we perform experiments on three datasets that differ along several axes. The first two datasets are from the PASCAL Visual Object Classes challenges 2005 and 2006 [19]. The scenes are urban and rural, indoor and outdoor, and there is a great deal of scale and shape variation amongst the objects. The third is a set of satellite images acquired from Google Earth. The goal in these images is to detect cars from above. Because of the impoverished visual information, there are many false positives when a sliding window detector is applied. In this case, context provides a filtering mechanism to remove the false positives. Because these two applications are different, and in order to demonstrate the flexibility of our approach, we use different detectors for each. In all experiments, we allow S a cardinality of K = 10 (results were robust to a range of K between 5 and 20), and use 42 features for the image regions that represent color, texture, and shape [20]. For each cluster, we keep a mean and full covariance matrix over these features. A small regularization (10^-6 I) is added to the covariance to ensure positive semi-definiteness.
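As a small illustration of this regularization, a weighted (EM-style) update of the cluster Gaussians might look like the sketch below. Only the full covariance over the 42-dimensional region features and the 10^-6 I ridge come from the text; the responsibility weighting and the interface are hypothetical choices for this sketch.

```python
import numpy as np

def update_cluster_gaussians(F, resp, reg=1e-6):
    """Weighted mean and full covariance per cluster, with a small ridge
    (reg * I) added so the covariance stays positive (semi-)definite.
    F: (J, D) region features; resp: (J, K) cluster responsibilities."""
    J, D = F.shape
    K = resp.shape[1]
    mu = np.zeros((K, D))
    cov = np.zeros((K, D, D))
    for s in range(K):
        w = resp[:, s]
        w_sum = w.sum() + 1e-12
        mu[s] = (w[:, None] * F).sum(axis=0) / w_sum
        diff = F - mu[s]
        cov[s] = (w[:, None] * diff).T @ diff / w_sum + reg * np.eye(D)
    return mu, cov
```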

PASCAL VOC Datasets. For these experiments, we used four classes from the VOC2005 data, and two classes from the VOC2006 data. The VOC2005 dataset consists of 2232 images, manually annotated with bounding boxes for four image classes: cars, people, motorbikes, and bicycles. We use the "train+val" set (684 images) for training, and the "test2" set (859 images) for testing. The VOC2006 dataset contains 5304 images, manually annotated with 12 classes, of which we use the cow and sheep classes. We train on the "trainval" set (2618 images) and test on the "test" set (2686 images). To compare with the results of the challenges, we adopted as our detector the HOG (histogram of oriented gradients) detector of Dalal and Triggs [10]. This detector uses an SVM and therefore outputs a score margin_i in (-inf, +inf), which we convert into a probability by learning a logistic regression function for P(T_i | margin_i). We also plot the precision-recall curve using the code provided in the challenge toolkit.
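The margin-to-probability conversion is essentially a one-dimensional logistic regression on the SVM outputs (a Platt-style calibration). A sketch using scikit-learn, with hypothetical variable names:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def calibrate_margins(train_margins, train_labels):
    """Fit P(T_i = 1 | margin_i) as a 1-D logistic regression on the
    detector's raw SVM margins."""
    lr = LogisticRegression()
    lr.fit(np.asarray(train_margins).reshape(-1, 1), train_labels)
    return lr

# Usage sketch:
#   model = calibrate_margins(margins, labels)
#   probs = model.predict_proba(new_margins.reshape(-1, 1))[:, 1]
```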

Fig. 4. (a,b) Example training detections from the bicycle class. The related image regions are colored by their most likely cluster label. In both examples, the blue region that is horizontally offset from the detection belongs to cluster #3. (c) 16 of the top scoring regions for cluster #3, ranked by P(F | S = 3) (likelihood of image features). This cluster corresponds to "roads" or "bushes" as things that are gray/green and occur next to cars. (d) A case where context helped find a true detection. (e,f) Two examples where incorrect detections are filtered out by context.

Because these images are taken parallel to the ground plane, we use relationships that capture axis-aligned interactions at distances relative to the size of the candidate detected object. We therefore use the following relationships: R_ij = 1 (IN) for the region j closest to the center of the detection window; R_ij = 2, 3, 4, 5 for the regions that are one bounding box width to the LEFT of, to the RIGHT of, ABOVE, and BELOW the window; and R_ij = 6 (NONE) for all other regions.
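To make this encoding concrete, the sketch below assigns R_ij from detection boxes and region centroids. The exact geometric tests (centroid bands of half a box width/height around the one-box-width offsets, and a single closest region marked IN) are our own simplifications of the description above.

```python
import numpy as np

# Relationship codes, following the text.
IN, LEFT, RIGHT, ABOVE, BELOW, NONE = 1, 2, 3, 4, 5, 6

def voc_relationships(boxes, centroids):
    """boxes: (N, 4) detection windows as (x_min, y_min, x_max, y_max).
    centroids: (J, 2) region centroids as (x, y).
    Returns an (N, J) array of relationship codes R_ij."""
    boxes = np.asarray(boxes, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    N, J = len(boxes), len(centroids)
    R = np.full((N, J), NONE, dtype=int)
    for i, (x0, y0, x1, y1) in enumerate(boxes):
        w, h = x1 - x0, y1 - y0
        dx = centroids[:, 0] - (x0 + x1) / 2.0
        dy = centroids[:, 1] - (y0 + y1) / 2.0
        horiz = np.abs(dy) < h / 2                 # within the window's vertical band
        vert = np.abs(dx) < w / 2                  # within the window's horizontal band
        R[i, horiz & (np.abs(dx + w) < w / 2)] = LEFT    # ~one box width to the left
        R[i, horiz & (np.abs(dx - w) < w / 2)] = RIGHT
        R[i, vert & (np.abs(dy + h) < h / 2)] = ABOVE    # image y grows downward
        R[i, vert & (np.abs(dy - h) < h / 2)] = BELOW
        R[i, np.argmin(dx ** 2 + dy ** 2)] = IN    # region closest to the window center
    return R
```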

Figure 4 (top row) shows example bicycle detection candidates, and the relationships they have to the regions in their images. These examples suggest the type of context that might be learned. For example, the region beside both detections (colored blue) belongs to cluster #3, which looks visually like a road or bush cluster. The learned values of the model parameters also indicate that being to the left or right of this cluster increases the probability of a window containing a bicycle (e.g., by about 33% in the case where R_ij is RIGHT).

We performed a single run of EM learning to convergence, which takes around 2 hours on an Intel Dual Core 1.9 GHz machine with 2 GB of memory. We run separate experiments for each class, though in principle it would be possible to learn a single joint model over all classes. By separating the classes, we are able to isolate the contextual contribution from the stuff, rather than between the different types of things present in the images. For our MCMC inference, we found that, due to the strength of the baseline detectors, the Markov chain converged fairly rapidly; we achieved very good results using merely 10 MCMC samples, where each is initialized randomly and then undergoes 5 Gibbs iterations. The entire inference process takes about one second per image.

Fig. 5. Precision-recall curves for the VOC2005 and VOC2006 classes: (a) Cars (2005), (b) Motorbikes (2005), (c) People (2005), (d) Bicycles (2005), (e) Cows (2006), (f) Sheep (2006). Each plot compares the TAS Model, the Base Detectors, and the corresponding INRIA-Dalal or INRIA-Douze challenge entry.

The bottom row of Figure 4 shows some detections that were corrected using context. We show one example where a true bicycle was discovered using context, and two examples where false positives were filtered out by our model. These examples demonstrate the type of information that is being leveraged by TAS. In the first example, the sky above and the road beside the window give a signal that this detection is at ground level, and is therefore likely to be a bicycle.

Figure 5 shows the full recall-precision curve for each class. For (a-d) we compare to the 2005 INRIA-Dalal challenge entry, and for (e,f) we compare to the 2006 INRIA-Douze entry, both of which used the HOG detector. We also show the curve produced by our Base Detector alone (see the note below). Finally, we plot the curves produced by our TAS Model, trained using full EM, which scores windows using the probability of (1). The model trained using the Pre-Clustered approach performed similarly. From these curves, we see that the TAS model provided a significant improvement in accuracy for all but the "people" and "sheep" classes. We believe the lack of improvement for people is due to the wide variation of backgrounds in these images, including streets, grass, forests, deserts, etc. With no strong context cues to latch onto, TAS is unable to improve on the base HOG detector, which was in fact originally optimized to detect people. For sheep, TAS provides an improvement at the low recall rates only. This may be due to the wide scale variation present in the sheep dataset.

Note: Differences in PR curves between our base detector and the INRIA-Dalal/INRIA-Douze results come from the use of slightly different training windows and parameters. We selected algorithm parameters based on the numbers in Everingham et al. [19] and by looking at results on the "train+val" set. We note that a slight change in parameters raised our "people" results far above the INRIA-Dalal challenge 3 results to the level of their challenge 4 results, where they trained on their own dataset. Also, INRIA-Dalal did not report results for "bicycle."

Satellite Images. The second dataset is a set of 30 images extracted from Google Earth. The images are color, and of size 792 x 636, and contain 1319 manually labeled cars. The average car window is approximately 45 x 45 pixels, and all windows are scaled to these dimensions for training. We used 5-fold cross-validation, and the results below report the mean performance across the folds.

Here, we use a patch-based boosted detector very similar to that of Torralba [1]. We use 50 rounds of boosting with two-level decision trees over patch cross-correlation features that were computed for 15,000–20,000 rectangular patches of various aspect ratios and widths from 4 pixels up to 22. Patches are extracted from the intensity and gradient magnitude images. We learned a single detector using every positive window, rather than learning detectors for different car orientations. As above, we convert the boosting score into a probability using logistic regression. For training the TAS model, we used 10 random restarts of EM, selecting the parameters that provided the best likelihood of the observed data. For inference, we need to account for the fact that our detectors are much weaker, and so more samples are necessary to adequately capture the posterior. We utilize 100 samples, where each sample undergoes 5 iterations.
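A rough sketch of such a detector pipeline is given below: random rectangular patches are sampled from training windows, each window is described by the maximum normalized cross-correlation of every patch within it, and 50 rounds of boosting over depth-2 trees are run with scikit-learn's AdaBoostClassifier (recent versions). The feature definition and all parameters here are our guesses at the spirit of [1], not the authors' implementation; windows are assumed to be 2-D grayscale or gradient magnitude arrays.

```python
import numpy as np
from skimage.feature import match_template
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def sample_patches(windows, n_patches=500, min_w=4, max_w=22):
    """Crop random rectangular patches from random training windows."""
    patches = []
    for _ in range(n_patches):
        win = windows[rng.integers(len(windows))]
        h = rng.integers(min_w, max_w + 1)
        w = rng.integers(min_w, max_w + 1)
        y = rng.integers(0, win.shape[0] - h + 1)
        x = rng.integers(0, win.shape[1] - w + 1)
        patches.append(win[y:y + h, x:x + w].copy())
    return patches

def cc_features(window, patches):
    """Feature k = max normalized cross-correlation of patch k in the window."""
    return np.array([match_template(window, p).max() for p in patches])

def train_boosted_detector(windows, labels, patches, rounds=50):
    """Boosted two-level decision trees over patch cross-correlation features."""
    X = np.stack([cc_features(w, patches) for w in windows])
    clf = AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=2), n_estimators=rounds)
    clf.fit(X, labels)
    return clf
```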

Because the in-plane rotation of these images is arbitrary, we need to use relationships that are more sophisticated than axis-aligned offsets. We observe that many regions are elongated along roads in the images. Thus, we can use the region shape to define a local coordinate system that roughly aligns with the roads. For every candidate detection window, we find its containing region, determine the major and minor axes of that region, and re-project the locations of all other image regions into this new coordinate system. We then encode the relationships (in the new coordinate system) of IN (R = 1), 100 pixels ABOVE (R = 2), BELOW (R = 3), LEFT (R = 4), RIGHT (R = 5), and NONE (R = 6). (Here 100 pixels represents 2–3 car lengths, which is approximately the range at which road-car and building-car interactions occur.)
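One way to realize this road-aligned encoding (a sketch only; the PCA-based axis estimate, the 50-pixel tolerance, and all names are our own choices) is to estimate the containing region's major and minor axes from its pixel coordinates, rotate every region centroid into that frame, and bin the rotated offsets at roughly 100 pixels:

```python
import numpy as np

IN, ABOVE, BELOW, LEFT, RIGHT, NONE = 1, 2, 3, 4, 5, 6

def region_axes(region_yx):
    """Major/minor axes of a region, estimated (our choice) as the
    eigenvectors of the covariance of its pixel coordinates.
    region_yx: (P, 2) array of (row, col) pixel coordinates."""
    pts = region_yx - region_yx.mean(axis=0)
    _, vecs = np.linalg.eigh(np.cov(pts.T))
    return vecs[:, ::-1]            # columns: major axis, then minor axis

def satellite_relationships(det_center, containing_idx, region_pixels,
                            centroids, dist=100.0, tol=50.0):
    """Assign R_ij for one candidate detection in the satellite images.
    det_center: (row, col) of the window center; containing_idx: index of
    the region containing it; region_pixels: list of (P, 2) pixel-coordinate
    arrays, one per region; centroids: (J, 2) region centroids."""
    axes = region_axes(region_pixels[containing_idx])      # road-aligned frame
    local = (np.asarray(centroids) - np.asarray(det_center)) @ axes
    along, across = local[:, 0], local[:, 1]               # coords in that frame
    R = np.full(len(centroids), NONE, dtype=int)
    band = np.abs(across) < tol
    R[band & (np.abs(along - dist) < tol)] = RIGHT          # ~100 px along the road
    R[band & (np.abs(along + dist) < tol)] = LEFT
    band = np.abs(along) < tol
    R[band & (np.abs(across - dist) < tol)] = BELOW         # ~100 px across the road
    R[band & (np.abs(across + dist) < tol)] = ABOVE
    R[containing_idx] = IN                                   # the detection's own region
    return R
```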

Figure 6 shows some clusters that are learned with the context model. Eight of the ten learned clusters are shown, visualized by presenting 16 of the image regions that rank highest with respect to P(F | S). These clusters have a clear interpretation: cluster #4, for instance, represents the roofs of houses and cluster #6 trees and water regions. With each cluster, we also show the probability that a candidate window contains a car given that it is IN (R = 1) this region. The parameters provide interesting insights. Clusters #7 and #8 are road clusters, and both give a nearby window an 80% chance of being a car. Clusters #1 and #6, however, which represent forest and grass areas, drive the probability of nearby detections down below 2%. Figure 7 shows an example image with the detections scored by the detector only, and by the TAS model. Many of the false positives that are not near roads are filtered out by the model.

Fig. 6. Clusters learned by the context model on the satellite image dataset. Each cluster shows 16 of the training image regions that are most likely to be in the cluster based on P(F | S). For each cluster, we also show the probability that a high-scoring window (according to the detector) in a region labeled by that cluster contains a car:

    Cluster #1: P(car | "In") = 0.02        Cluster #5: P(car | "In") = 0.78
    Cluster #2: P(car | "In") = 0.79        Cluster #6: P(car | "In") = 0.01
    Cluster #3: P(car | "In") = 0.23        Cluster #7: P(car | "In") = 0.79
    Cluster #4: P(car | "In") = 0.12        Cluster #8: P(car | "In") = 0.81

Here, there are many detections per image, so we plot the recall versus the number of false detections per image in Figure 8. We compare the Base Detectors to the TAS Model trained using the Full TAS learning method. We found that the Full TAS learning outperformed the Pre-Clustered TAS learning, raising the recall rate by an average of 2.2% for any given FPPI, and by a maximum of 4.7% at 1.2 FPPI. This demonstrates that some of the benefit from the model comes from the joint learning phase. The curves verify that context indeed improves our results by filtering out false positives.

Fig. 7. An example satellite image, with detections found by the base detector (a), and by the TAS model (c) with a threshold of 0.15. The TAS model successfully filters out many of the false positives that occur far away from roads, but many of the remaining false positives are "in context," and thus cannot be fixed. Region labels (b) show that houses (in purple) are correctly grouped, and able to provide context for the cars.

Fig. 8. A plot of recall rate vs. false positives per image for the satellite data. The results here are averaged across 5 folds, and show a significant improvement from using TAS over the base detectors.

6 Discussion and Future Directions

In this paper, we have presented the TAS model, a probabilistic framework that captures the contextual information between "stuff" and "things" by linking discriminative detection of objects with unsupervised clustering of image regions. Importantly, the method does not require extensive labeling of image regions; standard labeling of object bounding boxes suffices for learning a model of the appearance of stuff regions and their contextual cues. We have demonstrated that the TAS model improves the performance even of strong base classifiers, including one of the top performing detectors in the PASCAL challenge.

The flexibility of the TAS model provides several important benefits. The model can accommodate almost any choice of object detector that produces a score for a window that is monotonically increasing with the likelihood of the object being in the window. It is also flexible to many classes of region features, coupled with any generative model over these features. For instance, we might want to pre-cluster the regions into so-called visual words, and then use a cluster-dependent multinomial distribution over these words [20]. These characteristics also make the model applicable to a wide range of problems in object detection.

Because the image region clusters are learned in an unsupervised fashion, they are able to capture a wide range of possible concepts. While a human might label the regions in one way (say trees and buildings), the automatic learning procedure might find a more contextually relevant grouping. For instance, the TAS model might split buildings into two categories: apartments, which often have cars parked near them, and factories, which rarely co-occur with cars.

As discussed in Section 2, recent work has amply demonstrated the importance of context in computer vision. The type of context modeled by the TAS framework is a natural complement for many of the other types of context in the literature. In particular, while many other forms of context can relate known objects that have been labeled in the data, our model can extract the signals present in the unlabeled part of the data.

A major limitation of the TAS model is that it captures only 2D context. This issue also affects our ability to determine the appropriate scale for the contextual relationships. It would be interesting to integrate a TAS-like definition of context into an approach that attempts some level of 3D reconstruction, such as the work of Hoiem and Efros [15] or of Saxena et al. [21], allowing us to utilize 3D context and simultaneously address the issue of scale.

References

[1] Torralba, A.: Contextual priming for object detection. IJCV 53, 2003.
[2] Viola, P., Jones, M.: Robust real-time face detection. ICCV, 2001.
[3] Shotton, J., Winn, J., Rother, C., Criminisi, A.: TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. ECCV, 2006.
[4] Forsyth, D.A., Malik, J., Fleck, M.M., Greenspan, H., Leung, T.K., Belongie, S., Carson, C., Bregler, C.: Finding pictures of objects in large collections of images. Object Representation in Computer Vision, 1996.
[5] Murphy, K., Torralba, A., Freeman, W.: Using the forest to see the trees: a graphical model relating features, objects and scenes. NIPS, 2003.
[6] Singhal, A., Luo, J., Zhu, W.: Probabilistic spatial context models for scene content understanding. CVPR, 2003.
[7] Rabinovich, A., Vedaldi, A., Galleguillos, C., Wiewiora, E., Belongie, S.: Objects in context. ICCV, 2007.
[8] Kumar, S., Hebert, M.: A hierarchical field framework for unified context-based classification. ICCV, 2005.
[9] Carbonetto, P., de Freitas, N., Barnard, K.: A statistical model for general contextual object recognition. ECCV, 2004.
[10] Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. CVPR, 2005.
[11] Oliva, A., Torralba, A.: The role of context in object recognition. Trends in Cognitive Sciences, 2007.
[12] Torralba, A., Murphy, K., Freeman, W., Rubin, M.: Context-based vision system for place and object recognition. ICCV, 2003.
[13] Wolf, L., Bileschi, S.: A critical view of context. IJCV 69, 2006.
[14] Fink, M., Perona, P.: Mutual boosting for contextual inference. NIPS, 2003.
[15] Hoiem, D., Efros, A.A., Hebert, M.: Putting objects in perspective. CVPR, 2006.
[16] Ren, X., Malik, J.: Learning a classification model for segmentation. ICCV, 2003.
[17] Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. JRSS, 1977.
[18] Geman, S., Geman, D.: Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. Readings in Computer Vision: Issues, Problems, Principles, and Paradigms, 1987, 564–584.
[19] Everingham, M., et al.: The 2005 PASCAL visual object classes challenge. MLCW, 2005.
[20] Barnard, K., Duygulu, P., Forsyth, D., de Freitas, N., Blei, D.M., Jordan, M.I.: Matching words and pictures. JMLR 3, 2003.
[21] Saxena, A., Sun, M., Ng, A.Y.: Learning 3-D scene structure from a single still image. ICCV, 2007.