learning spatial context: 1 using stuff to find things...
TRANSCRIPT
![Page 1: Learning Spatial Context: 1 Using Stuff to Find Things 2ai.stanford.edu/~gaheitz/Research/CONTEXT/ForDaphne/context.pdf · 101 dom Field (MRF) or variant (CRF,DRF) [8,9] to encode](https://reader034.vdocument.in/reader034/viewer/2022050200/5f543f02b45e7659296301be/html5/thumbnails/1.jpg)
Learning Spatial Context:1 1
Using Stuff to Find Things2 2
Anonymous ECCV submission3 3
Paper ID 8924 4
Abstract. The sliding window approach of detecting rigid objects (such5 5
as cars) is predicated on the belief that the object can be identified6 6
from the appearance in a defined region around the object. Other types7 7
of objects of amorphous spatial extent (e.g., trees, sky), however, are8 8
more naturally classified based on texture or color. In this paper, we9 9
seek to combine recognition of these two types of objects into a system10 10
that leverages “context” toward improving detection. In particular, we11 11
cluster regions of the image based on their ability to serve as context12 12
for the detection of objects. Rather than providing an explicit training13 13
set with region labels, our method automatically groups regions based14 14
on both their appearance and their relationships to the detections in the15 15
image. We show that our things and stuff (TAS) context model produces16 16
meaningful clusters that are readily interpretable, and helps improve our17 17
detection ability over state-of-the-art detectors. We present results on18 18
object detection in images from the PASCAL VOC 2005/2006 datasets19 19
and on the task of overhead car detection in satellite images.20 20
1 Introduction21 21
Recognizing objects in an image requires combining many different signals from22 22
the raw image data. Figure 1 shows an example satellite image of a street scene,23 23
where we may want to identify all of the cars. From a human perspective, there24 24
are two primary signals that we leverage. The first is the local appearance in the25 25
window near the potential car. The second is our knowledge that cars appear26 26
on roads. This second signal is a form of contextual knowledge, and our goal in27 27
this paper is to capture this idea in a rigorous probabilistic model.28 28
Recent papers have demonstrated that boosted object detectors can be effec-29 29
tive at detecting monolithic object classes, such as cars [1] or faces [2]. Similarly,30 30
several works on multiclass segmentation have shown that regions in images31 31
can effectively be classified based on their color or texture properties [3]. These32 32
two lines of work have made significant progress on the problems of identifying33 33
“things” and “stuff,” respectively. The important differentiation between these34 34
two classes of visual objects is summarized in Forsyth et al. [4] as:35 35
The distinction between materials — “stuff” — and objects — “things”36 36
— is particularly important. A material is defined by a homogeneous or37 37
repetitive pattern of fine-scale properties, but has no specific or distinc-38 38
tive spatial extent or shape. An object has a specific size and shape.39 39
![Page 2: Learning Spatial Context: 1 Using Stuff to Find Things 2ai.stanford.edu/~gaheitz/Research/CONTEXT/ForDaphne/context.pdf · 101 dom Field (MRF) or variant (CRF,DRF) [8,9] to encode](https://reader034.vdocument.in/reader034/viewer/2022050200/5f543f02b45e7659296301be/html5/thumbnails/2.jpg)
2 ECCV-08 submission ID 892
Fig. 1. (Left) An aerial photograph. (Center) Detected cars in the image (solid green= true detections, dashed red = false detections). (Right) Finding “stuff” such asbuildings, by classifying regions, shown delineated by red boundaries.
Recent work has also shown that classifiers targeting particular classes of40 40
things or stuff can benefit from the proper use of contextual cues. The use of41 41
context can be split into a number of categories. Scene-Thing context allows42 42
scene-level information, such as the scale or the “gist” [5], to determine prior43 43
location probabilities for the various objects. Stuff-Stuff context captures the no-44 44
tion that sky occurs above sea and road below building [6]. Thing-Thing context45 45
considers the co-occurrence of objects such as the notion that a tennis racket is46 46
more likely to occur with a tennis ball than with a lemon [7]. Finally Stuff-Thing47 47
context allows the texture regions (e.g., the roads and buildings in Figure 1) to48 48
add predictive power to the detection of objects (e.g., the cars in Figure 1). We49 49
focus on this fourth type of context. Figure 2 shows an example of this context50 50
in the case of satellite imagery.51 51
In this paper, we present a probabilistic model linking the detection of things52 52
to the unsupervised classification of stuff. Our method can be viewed as an at-53 53
tempt to cluster “stuff,” represented by coherent image regions, into clusters54 54
that are both visually similar and best able to provide context for the detectable55 55
“things” in the image. Cluster labels for the image regions are probabilistically56 56
linked to the detection window labels through the use of region-detection “rela-57 57
tionships,” which encode their relative spatial locations. We call this model the58 58
things and stuff (TAS) context model because it links these two components into59 59
a coherent whole. The graphical representation of the TAS model is depicted in60 60
Figure 3. At training time, this model leverages supervised (groundtruth) de-61 61
tection labels to learn parameters using the Expectation-Maximization (EM)62 62
algorithm. For test instances, both the region labels and detection labels are63 63
inferred from the model.64 64
We present results of the TAS method on diverse datasets. Using the Pascal65 65
Visual Object Classes challenge datasets from 2005 and 2006 (VOC2005 and66 66
VOC2006), we utilize one of the top competitors as our baseline detector [10],67 67
and demonstrate that our TAS method improves the performance of detecting68 68
cars, bicycles and motorbikes in street scenes and cows and sheep in rural scenes69 69
(see Figure 4). In addition, we consider a very different dataset of satellite images70 70
from Google Earth, of the city and suburbs of Brussels, Belgium. The task here71 71
is to identify the cars from above; see Figure 2. For clarity, the satellite data is72 72
used as the running example throughout the paper, but all descriptions apply73 73
equally to the other datasets.74 74
![Page 3: Learning Spatial Context: 1 Using Stuff to Find Things 2ai.stanford.edu/~gaheitz/Research/CONTEXT/ForDaphne/context.pdf · 101 dom Field (MRF) or variant (CRF,DRF) [8,9] to encode](https://reader034.vdocument.in/reader034/viewer/2022050200/5f543f02b45e7659296301be/html5/thumbnails/3.jpg)
ECCV-08 submission ID 892 3
Fig. 2. Example detections from the satellite dataset that demonstrate context. Clas-sifying using local appearance only, we might think that both windows at left are cars.However, when seen in context, the bottom detection is unlikely to be an actual car.
2 Related Work75 75
The role of context in object recognition has become an important topic, due76 76
both to the psychological basis of context in the human visual system [11] and to77 77
the striking algorithmic improvements that “visual context” has provided [12].78 78
The word “context” has been attached to many different ideas. One of the79 79
simplest forms is co-occurrence context. The work of Rabinovich et al. [7] demon-80 80
strates the use of this context, where the presence of a certain object class in an81 81
image probabilistically influences the presence of a second class. The context of82 82
Torralba et al. [12] assumes that certain objects occur more frequently in cer-83 83
tain rooms, as monitors tend to occur in offices. While these methods achieve84 84
excellent results when many different object classes are labeled per image, they85 85
are unable to leverage unsupervised data for contextual object recognition.86 86
In addition to co-occurrence context, many approaches take into account87 87
the spatial relationships between objects. At the local descriptor level, Wolf et88 88
al. [13] detect objects using a descriptor with a large capture range, allowing the89 89
detection of the object to be influenced by surrounding image features. Because90 90
these methods use only the raw features for context, however, they cannot obtain91 91
a holistic view of an entire scene. Similarly, Fink and Perona [14] use the output92 92
of boosted detectors for other classes as additional features for the detection of93 93
a given class. This allows the inclusion of signal beyond the raw features, but94 94
requires that all “parts” of a scene be supervised. In contrast, Murphy et al. [5]95 95
use a global feature known as the “gist” to learn statistical priors on the locations96 96
of various objects within the context of the specific scene. The gist descriptor97 97
is excellent at predicting the scene type and large structures in the scene, but98 98
cannot handle the local interactions present in the satellite data, for example.99 99
Another approach to modeling spatial relationships is to use a Markov Ran-100 100
dom Field (MRF) or variant (CRF,DRF) [8, 9] to encode the preferences for cer-101 101
tain spatial relationships. These techniques offer a great deal of flexibility in the102 102
formulation of the affinity function and all the standard benefits of a graphical103 103
model formulation (e.g., well-known learning and inference techniques). Singhal104 104
et al. [6] also use similar concepts to the MRF formulation to aggregate de-105 105
cisions across the image. These methods, however, suffer from two drawbacks.106 106
First, they tend to require a large amount of annotation in the training set. Sec-107 107
ond, they put things and stuff on the same footing, representing both as “sites”108 108
![Page 4: Learning Spatial Context: 1 Using Stuff to Find Things 2ai.stanford.edu/~gaheitz/Research/CONTEXT/ForDaphne/context.pdf · 101 dom Field (MRF) or variant (CRF,DRF) [8,9] to encode](https://reader034.vdocument.in/reader034/viewer/2022050200/5f543f02b45e7659296301be/html5/thumbnails/4.jpg)
4 ECCV-08 submission ID 892
in the MRF. Our method requires less annotation and allows detections and109 109
image regions to be represented in their (different) natural spaces.110 110
Perhaps the most ambitious attempts to use context involves the attempt to111 111
model the scene of an image holistically. Torralba [1], for instance, uses global112 112
image features to “prime” the detector with the likely presence/absence of ob-113 113
jects, the likely locations, and the likely scales. The work of Hoiem and Efros [15]114 114
takes this one level further by explicitly modeling the 3D layout of the scene.115 115
This allows the natural use of scale and location constraints (e.g., things closer116 116
to the camera are larger). Their approach, however, is tailored to street scenes,117 117
and requires domain-specific modeling. The specific form of their priors would118 118
be useless in the case of satellite images, for example.119 119
3 Things and Stuff (TAS) Context Model120 120
Our probabilistic context model builds on two standard components that are121 121
commonly used in the literature. The first is sliding window object detection,122 122
and the second is unsupervised image region clustering. A common method for123 123
finding “things” is to slide a window over the image, score each window’s match124 124
to the model, and return the highest matching such windows. We denote the125 125
features in the ith candidate window by Wi, the presence/absence of the target126 126
class in that window by Ti (T for “thing”), and assume that what we learn in127 127
our detector is a conditional probability P (Ti | Wi) from the window features128 128
to the probability that the window contains the object; this probability can be129 129
derived from most standard classifiers, such as the highly successful boosting130 130
approaches [1]. The set of windows included in the model can, in principle,131 131
include all windows in the image. We, however, limit ourselves to looking at the132 132
top scoring detection windows according to the detector (i.e., all windows above133 133
some low threshold, chosen in order to capture most of the true positives).134 134
The second component we build on involves clustering coherent regions of the135 135
image into groups based on appearance. We begin by segmenting the image into136 136
regions, known as superpixels, using the normalized cut algorithm of Ren and137 137
Malik [16]. For each region j, we extract a feature vector F j that includes color138 138
and texture features. For our stuff model, we will use a generative model where139 139
each region has a hidden class, and the features are generated from a Gaussian140 140
distribution with parameters depending on the class. Denote by Sj (S for “stuff”)141 141
the (hidden) class label of the jth region. We assume that the features are derived142 142
from a standard naive Bayes model, where we have a probability distribution143 143
P (F j | Sj) of the image features given the class label.144 144
In order to relate the detector for “things” (T ’s) to the clustering model over145 145
“stuff” (S’s), we need to develop an intuitive representation for the relationships146 146
between these units. Human knowledge in this area comes in sentences like “cars147 147
drive on roads,” or “cars park 20 feet from buildings.” We introduce variables148 148
Rij that represent the relationship between candidate detection i and region149 149
j. The values of Rij indicate different relationships, such as: “detection i is in150 150
region j”, or “detection i is about 100 pixels away from region j”.151 151
We can now link our component models into a single coherent probabilistic152 152
things and stuff (TAS) model, as depicted in the plate model of Figure 3(a).153 153
![Page 5: Learning Spatial Context: 1 Using Stuff to Find Things 2ai.stanford.edu/~gaheitz/Research/CONTEXT/ForDaphne/context.pdf · 101 dom Field (MRF) or variant (CRF,DRF) [8,9] to encode](https://reader034.vdocument.in/reader034/viewer/2022050200/5f543f02b45e7659296301be/html5/thumbnails/5.jpg)
ECCV-08 submission ID 892 5
RijTi Sj
Fj
ImageWindow
Wi
Wi: Window
Ti: Object Presence
Sj: Region Label
Fj: Region Features
Rij: Relationship
N
J
(a) TAS plate model
T1
S1
S2
S3
S4
S5
T2
T3
R21 = “Above”
R31 = “Left”
R13 = “In”
R33 = “In”
R11 = “Left”
CandidateWindows
ImageRegions
(b) TAS ground network
Fig. 3. The TAS model. The plate representation (a) gives a compact visualization ofthe model, which unrolls into a “ground” network (b) for any particular image.
Probabilistic influence flows between the detection window labels and the image154 154
region labels through the v-structures that are activated by observing the rela-155 155
tionship variables. For a particular input image, this plate model unrolls into a156 156
“ground” network that encodes a distribution over the detections and regions of157 157
the image. Figure 3(b) shows a toy example of a ground network for the image of158 158
Figure 1. It is interesting to note the similarities between TAS and the MRF ap-159 159
proaches in the literature. In effect, the relationship variables link the detections160 160
and the regions into a probabilistic web where signals at one point in the web161 161
(say the strong appearance of a particular region) influence a detection in an-162 162
other part of the image, which in turn influences the label of a different region in163 163
yet another location. In the example of Figure 3(b), if region 3 has strong road164 164
appearance, it might influence T1 through the relationship R13. This in turn165 165
will influence the labeling of region 1 through the relationship R11. Because our166 166
method uses a generative model for the region labels, training can be performed167 167
very effectively, even when the Sj variables are unobserved (see below).168 168
All variables in the TAS model are discrete except for the feature F j vari-ables. This allows for simple table conditional probability distributions (CPDs)for all discrete nodes in this Bayesian network. The probability distribution overthese variables decomposes according to:
P (TSFRW ) =∏
i
P (Wi)P (Ti | Wi)∏
j
P (Sj)P (F j | Sj)∏
ij
P (Rij | Ti, Sj).
169 169One of the main benefits of the TAS approach is its simplicity and modularity.170 170
It allows us to “plug in” any sliding window detector and any generative approach171 171
for region clustering (e.g., [3]).172 172
![Page 6: Learning Spatial Context: 1 Using Stuff to Find Things 2ai.stanford.edu/~gaheitz/Research/CONTEXT/ForDaphne/context.pdf · 101 dom Field (MRF) or variant (CRF,DRF) [8,9] to encode](https://reader034.vdocument.in/reader034/viewer/2022050200/5f543f02b45e7659296301be/html5/thumbnails/6.jpg)
6 ECCV-08 submission ID 892
4 Learning and Inference in the TAS Model173 173
Because TAS unrolls into a Bayesian network for each image, we can use standard174 174
learning and inference methods. In particular, we learn the parameters of our175 175
model using the Expectation-Maximization (EM) [17] algorithm and perform176 176
inference using Gibbs sampling, a standard variant of MCMC sampling [18].177 177
Learning the Model with EM. At learning time, we have a set of images178 178
with annotated labels for our target object class(es). We first train the base179 179
detectors using this set, and select as our candidate windows all detections above180 180
a threshold; we thus obtain a set of candidate detection windows W1 . . .WN181 181
along with their groundtruth labels T1 . . . TN . We also have a set of regions and182 182
a feature vector Fj for each, and an Rij relationship variable for every window-183 183
region pair. Our goal is to learn parameters for the TAS model.184 184
One option is to learn in two phases: we first use an existing clustering method185 185
(such as K-means or Mixture-of-Gaussian clustering) to assign the Sj variables186 186
in the training data; we then learn parameters for the TAS model with fully ob-187 187
served data. This approach is attractive as it allows the first part of the learning188 188
to be fully unsupervised, enabling us to use more images than exist in our labeled189 189
dataset. We call the model learned with this method Pre-Clustered TAS, but190 190
while the resulting clusters are likely to be visually coherent, they do not exploit191 191
the spatial relationships between the regions and the object detections.192 192
We therefore propose a joint training method where we perform EM in the193 193
full model; this model is called the Full TAS model. The EM algorithm iterates194 194
between using probabilistic inference to derive a soft completion of the hidden195 195
variables (E-step) and finding maximum-likelihood parameters relative to this196 196
soft completion (M-step). The E-step here is particularly easy: at training time,197 197
only the Sj ’s are unobserved; moreover, because the T variables are observed,198 198
the Sj ’s are conditionally independent of each other. Thus, the inference step199 199
turns into a simple computation for each Sj separately, a process which can be200 200
performed in linear time. The M-step for table CPDs can be performed easily in201 201
closed form. To provide a good starting point for EM, we initialize the cluster202 202
assignments using the K-means algorithm. EM is guaranteed to converge to a203 203
local maximum of the likelihood function of the observed data.204 204
Inference with Gibbs Sampling. At test time, our system must determinewhich windows in a new image contain the target object. We observe the can-didate detection windows (Wi’s, extracted by thresholding the base detectoroutput), the features of each image region (F j ’s), and the relationships (Rij ’s).Our task is to find the probability that each window contains the object:
P (T | F , R, W ) =∑
S
P (T ,S | F , R,W ) (1)
Unfortunately, this expression involves a summation over an exponential set ofvalues for the S vector of variables. We solve the inference problem approxi-mately using a Gibbs sampling [18] MCMC method. We begin with some as-signment to the variables. Then, in each Gibbs iteration we first resample all of
![Page 7: Learning Spatial Context: 1 Using Stuff to Find Things 2ai.stanford.edu/~gaheitz/Research/CONTEXT/ForDaphne/context.pdf · 101 dom Field (MRF) or variant (CRF,DRF) [8,9] to encode](https://reader034.vdocument.in/reader034/viewer/2022050200/5f543f02b45e7659296301be/html5/thumbnails/7.jpg)
ECCV-08 submission ID 892 7
the S’s and then resample all the T ’s according to the following two probabilities:
P (Sj | T , F , R, W ) ∝ P (Sj)P (Fj | Sj)∏
i
P (Rij | Ti, Sj) (2)
P (Ti | S, F , R, W ) ∝ P (Ti | Wi)∏
j
P (Rij | Ti, Sj). (3)
These sampling steps can be performed efficiently, as the Ti variables are con-205 205
ditionally independent given the S’s and the Si’s are conditionally independent206 206
given the T ’s. In the last Gibbs iteration for each sample, rather than resampling207 207
T , we compute the posterior probability over T given our current S samples,208 208
and use these distributional particles for our estimate of the probability in (1).209 209
5 Experimental Results210 210
In order to evaluate the ability of the TAS model to learn and utilize context,211 211
we perform experiments on three datasets that differ along several axes. The212 212
first two datasets are from the PASCAL Visual Object Classes challenges 2005213 213
and 2006[19]. The scenes are urban and rural, indoor and outdoor, and there is214 214
a great deal of scale and shape variation amongst the objects. The third is a set215 215
of satellite images acquired from Google Earth. The goal in these images is to216 216
detect cars from above. Because of the impoverished visual information, there217 217
are many false positives when a sliding window detector is applied. In this case,218 218
context provides a filtering mechanism to remove the false positives. Because219 219
these two applications are different, and in order to demonstrate the flexibility220 220
of our approach, we use different detectors for each. In all experiments, we allow221 221
S a cardinality of K = 101, and use 42 features for the image regions that222 222
represent color, texture, and shape [20]. For clusters, we keep a mean and full223 223
covariance matrix over these features. A small regularization (10−6I) is added224 224
to the covariance to ensure positive semi-definiteness.225 225
PASCAL VOC Datasets. For these experiments, we used four classes from226 226
the VOC2005 data, and two classes from the VOC2006 data. The VOC2005227 227
dataset consists of 2232 images, manually annotated with bounding boxes for228 228
four image classes: cars, people, motorbikes, and bicycles. We use the “train+val”229 229
set (684 images) for training, and the “test2” set (859 images) for testing. The230 230
VOC2006 dataset contains 5304 images, manually annotated with 12 classes, of231 231
which we use the cow and sheep classes. We train on the “trainval” set (2618232 232
images) and test on the “test” set (2686 images). To compare with the results233 233
of the challenges, we adopted as our detector the HOG (histogram of oriented234 234
gradients) detector of Dalal and Triggs [10]. This detector uses an SVM and235 235
therefore outputs a score margini ∈ (−∞,+∞), which we convert into a proba-236 236
bility by learning a logistic regression function for P (Ti | margini). We also plot237 237
the precision-recall curve using the code provided in the challenge toolkit.238 238
Because these images are taken parallel to the ground plane, we use relation-239 239
ships that capture axis-aligned interactions at distances relative to size of the240 240
1 Results were robust to a range of K between 5 and 20.
![Page 8: Learning Spatial Context: 1 Using Stuff to Find Things 2ai.stanford.edu/~gaheitz/Research/CONTEXT/ForDaphne/context.pdf · 101 dom Field (MRF) or variant (CRF,DRF) [8,9] to encode](https://reader034.vdocument.in/reader034/viewer/2022050200/5f543f02b45e7659296301be/html5/thumbnails/8.jpg)
8 ECCV-08 submission ID 892
(a) (b) (c)
(d) (e) (f)
Fig. 4. (a,b) Example training detections from the bicycle class. The related imageregions are colored by their most likely cluster label. In both examples, the blue regionthat is horizontally offset from the detection belongs to cluster #3. (c) 16 of the topscoring regions for cluster #3, ranked by P (F | S = 3) (likelihood of image features).This cluster corresponds to “roads” or “bushes” as things that are gray/green andoccur next to cars. (d) A case where context helped find a true detection. (e,f) Twoexamples where incorrect detections are filtered out by context.
candidate detected object. We therefore use the following relationships: Rij = 1241 241
(IN) for the region j closest to the center of the detection window; Rij = 2, 3, 4, 5242 242
for the regions that are one bounding box width to the LEFT of, to the RIGHT243 243
of, ABOVE, and BELOW the window; Rij = 6 (NONE) for all other regions.244 244
Figure 4 (top row) shows example bicycle detection candidates, and the re-245 245
lationships they have to the regions in their images. These examples suggest246 246
the type of context that might be learned. For example, the region beside both247 247
detections (colored blue) belongs to cluster #3, which looks visually like a road248 248
or bush cluster. The learned values of the model parameters also indicate that249 249
being to the left or right of this cluster increases the probability of a window250 250
containing a bicycle (e.g., by about 33% in the case where Rij is RIGHT).251 251
We performed a single run of EM learning to convergence, which takes around252 252
2 hours on an Intel Dual Core 1.9 GHz machine with 2 GB of memory. We run253 253
separate experiments for each class, though in principle it would be possible to254 254
learn a single joint model over all classes. By separating the classes, we are able255 255
to isolate the contextual contribution from the stuff, rather than between the dif-256 256
ferent types of things present in the images. For our MCMC inference, we found257 257
that, due to the strength of the baseline detectors, the Markov chain converged258 258
fairly rapidly; we achieved very good results using merely 10 MCMC samples,259 259
![Page 9: Learning Spatial Context: 1 Using Stuff to Find Things 2ai.stanford.edu/~gaheitz/Research/CONTEXT/ForDaphne/context.pdf · 101 dom Field (MRF) or variant (CRF,DRF) [8,9] to encode](https://reader034.vdocument.in/reader034/viewer/2022050200/5f543f02b45e7659296301be/html5/thumbnails/9.jpg)
ECCV-08 submission ID 892 9
0.1 0.2 0.3 0.4 0.5
0.2
0.4
0.6
0.8
1
Recall Rate
Pre
cisi
on
TAS ModelBase DetectorsINRIA-Dalal
0.1 0.2 0.3 0.4 0.5 0.6
0.2
0.4
0.6
0.8
1
Recall Rate
Pre
cisi
on
0.1 0.2 0.3 0.4 0.5
0.2
0.4
0.6
0.8
1TAS ModelBase DetectorsINRIA-Douze
Recall Rate
Pre
cisi
on
(a) Cars (2005) (b) Motorbikes (2005) (e) Cows (2006)
0.1 0.2 0.3 0.4 0.5 0.6
0.2
0.4
0.6
0.8
1
Recall Rate
Pre
cisi
on
0.1 0.2 0.3 0.4
0.2
0.4
0.6
0.8
1
Recall Rate
Pre
cisi
on
0.1 0.2 0.3 0.4 0.5
0.2
0.4
0.6
0.8
1
Recall Rate
Pre
cisi
on
(c) People (2005) (d) Bicycles (2005) (f) Sheep (2006)
Fig. 5. Precision-recall curves for the VOC2005 and VOC2006 classes.
where each is initialized randomly and then undergoes 5 Gibbs iterations. The260 260
entire inference process takes about one second per image.261 261
The bottom row of Figure 4 shows some detections that were corrected using262 262
context. We show one example where a true bicycle was discovered using context,263 263
and two examples where false positives were filtered out by our model. These264 264
examples demonstrate the type of information that is being leveraged by TAS.265 265
In the first example, the sky above, and the road beside the window give a signal266 266
that this detection is at ground level, and is therefore likely to be a bicycle.267 267
Figure 5 shows the full recall-precision curve for each class. For (a-d) we268 268
compare to the 2005 INRIA-Dalal challenge entry, and for (e,f) we compare to269 269
the 2006 INRIA-Douze entry, both of which used the HOG detector. We also270 270
show the curve produced by our Base Detector alone2. Finally, we plot the271 271
curves produced by our TAS Model, trained using full EM, which scores win-272 272
dows using the probability of (1). The model trained using the Pre-Clustered273 273
approach performed similarly. From these curves, we see that the TAS model274 274
provided a significant improvement in accuracy for all but the “people” and275 275
“sheep” classes. We believe the lack of improvement for people is due to the276 276
wide variation of backgrounds in these images, including streets, grass, forests,277 277
2 Differences in PR curves between our base detector and the INRIA-Dalal/INRIA-Douze results come from the use of slightly different training windows and param-eters. We selected algorithm parameters based on the numbers in Everingham etal. [19] and by looking at results on the “train+val” set. We note that a slightchange in parameters raised our “people” results far above the INRIA-Dalal chal-lenge 3 results to the level of their challenge 4 results, where they trained on theirown dataset. Also, INRIA-Dalal did not report results for “bicycle.”
![Page 10: Learning Spatial Context: 1 Using Stuff to Find Things 2ai.stanford.edu/~gaheitz/Research/CONTEXT/ForDaphne/context.pdf · 101 dom Field (MRF) or variant (CRF,DRF) [8,9] to encode](https://reader034.vdocument.in/reader034/viewer/2022050200/5f543f02b45e7659296301be/html5/thumbnails/10.jpg)
10 ECCV-08 submission ID 892
deserts, etc. With no strong context cues to latch onto, TAS is unable to improve278 278
on the base HOG detector, which was in fact originally optimized to detect peo-279 279
ple. For sheep, TAS provides an improvement at the low recall rates only. This280 280
may be due to the wide scale variation present in the sheep dataset.281 281
Satellite Images. The second dataset is a set of 30 images extracted from282 282
Google Earth. The images are color, and of size 792 × 636, and contain 1319283 283
manually labeled cars. The average car window is approximately 45× 45 pixels,284 284
and all windows are scaled to these dimensions for training. We used 5-fold cross-285 285
validation, and results below report the mean performance across the folds.286 286
Here, we use a patch-based boosted detector very similar to that of Tor-287 287
ralba [1]. We use 50 rounds of boosting with two level decision trees over patch288 288
cross-correlation features that were computed for 15,000–20,000 rectangular patches289 289
of various aspect ratios and widths from 4 pixels up to 22. Patches are extracted290 290
from the intensity and gradient magnitude images. We learned a single detec-291 291
tor using every positive window, rather than learning detectors for different car292 292
orientations. As above, we convert the boosting score into a probability using293 293
logistic regression. For training the TAS model, we used 10 random restarts of294 294
EM, selecting the parameters that provided the best likelihood of the observed295 295
data. For inference, we need to account for the fact that our detectors are much296 296
weaker, and so more samples are necessary to adequately capture the posterior.297 297
We utilize 100 samples, where each sample undergoes 5 iterations.298 298
Because the in-plane rotation of these images is arbitrary, we need to use299 299
relationships that are more sophisticated than axis-aligned offsets. We observe300 300
that many regions are elongated along roads in the images. Thus, we can use301 301
the region shape to define a local coordinate system that roughly aligns with the302 302
roads. For every candidate detection window, we its containing region, determine303 303
the major and minor axis of that region, and re-project the locations of all other304 304
image regions into this new coordinate system. We then encode the relationships305 305
(in the new coordinate system) of IN (R = 1), 100 pixels ABOVE (R = 2),306 306
BELOW (R = 3), LEFT (R = 4), RIGHT (R = 5), and NONE (R = 6)3.307 307
Figure 6 shows some clusters that are learned with the context model. Eight308 308
of the ten learned clusters are shown, visualized by presenting 16 of the image309 309
regions that rank highest with respect to P (F | S). These clusters have a clear310 310
interpretation: cluster #4, for instance, represents the roofs of houses and cluster311 311
#6 trees and water regions. With each cluster, we also show the probability that312 312
a candidate window contains a car given that it is IN (R = 1) this region. The313 313
parameters provide interesting insights. Clusters #7 and #8 are road clusters,314 314
and both give a nearby window a 80% chance of being a car. Clusters #1 and #7,315 315
however, which represent forest and grass areas drive the probability of nearby316 316
detections down below 2%. Figure 7 shows an example image with the detections317 317
scored by the detector only, and by the TAS model. Many of the false positives318 318
that are not near roads are filtered out by the model.319 319
3 100 pixels represents 2–3 car lengths, which is approximately the range at whichroad-car and building-car interactions occur.
![Page 11: Learning Spatial Context: 1 Using Stuff to Find Things 2ai.stanford.edu/~gaheitz/Research/CONTEXT/ForDaphne/context.pdf · 101 dom Field (MRF) or variant (CRF,DRF) [8,9] to encode](https://reader034.vdocument.in/reader034/viewer/2022050200/5f543f02b45e7659296301be/html5/thumbnails/11.jpg)
ECCV-08 submission ID 892 11
Cluster #1 Cluster #2 Cluster #3 Cluster #4P (car | “In′′) = 0.02 P (car | “In′′) = 0.79 P (car | “In′′) = 0.23 P (car | “In′′) = 0.12
Cluster #5 Cluster #6 Cluster #7 Cluster #8P (car | “In′′) = 0.78 P (car | “In′′) = 0.01 P (car | “In′′) = 0.79 P (car | “In′′) = 0.81
Fig. 6. Clusters learned by the context model on the satellite image dataset. Eachcluster shows 16 of the training image regions that are most likely to be in the clusterbased on P (F | S). For each cluster, we also show the probability that a high-scoringwindow (according to the detector) in a region labeled by that cluster contains a car.
Here, there are many detections per image, so we plot the recall versus the320 320
number of false detections per image in Figure 8. We compare the Base De-321 321
tectors to the TAS Model trained using the Full TAS learning method.322 322
We found that the Full TAS learning outperformed the Pre-Clustered TAS323 323
learning, raising the recall rate by an average of 2.2% for any given FPPI, and324 324
a max of 4.7% for 1.2 FPPI. This demonstrates that some of the benefit from325 325
the model comes from the joint learning phase. The curves verify that context326 326
indeed improves our results by filtering out false positives.327 327
6 Discussion and Future Directions328 328
In this paper, we have presented the TAS model, a probabilistic framework that329 329
captures the contextual information between “stuff” and “things”, by linking330 330
discriminative detection of objects with unsupervised clustering of image regions.331 331
Importantly, the method does not require extensive labeling of image regions;332 332
standard labeling of object bounding boxes suffices for learning a model of the333 333
appearance of stuff regions and their contextual cues. We have demonstrated334 334
that the TAS model improves the performance even of strong base classifiers,335 335
including one of the top performing detectors in the PASCAL challenge.336 336
The flexibility of the TAS model provides several important benefits. The337 337
model can accommodate almost any choice of object detector that produces a338 338
score for a window that is monotonically increasing with the likelihood of the339 339
object being in the window. It is also flexible to many classes of region features340 340
![Page 12: Learning Spatial Context: 1 Using Stuff to Find Things 2ai.stanford.edu/~gaheitz/Research/CONTEXT/ForDaphne/context.pdf · 101 dom Field (MRF) or variant (CRF,DRF) [8,9] to encode](https://reader034.vdocument.in/reader034/viewer/2022050200/5f543f02b45e7659296301be/html5/thumbnails/12.jpg)
12 ECCV-08 submission ID 892
(a) Base Detectors (b) Region Labels (c) TAS Detections
Fig. 7. An example satellite image, with detections found by the base detector (a), andby the TAS model (c) with a threshold of 0.15. The TAS model successfully filters outmany of the false positives that occur far away from roads, but many of the remainingfalse positives are “in context,” and thus cannot be fixed. Region labels (b) show thathouses (in purple) are correctly grouped, and able to provide context for the cars.
Fig. 8. A plot of recall rate vs. false posi-tives per image for the satellite data. Theresults here are averaged across 5 folds, andshow a significant improvement from usingTAS over the base detectors.
0 40 80 120 160
0.2
0.4
0.6
0.8
1
False Positives Per Image
Rec
all R
ate
Base DetectorTAS Model
coupled with any generative model over these features. For instance, we might341 341
want to pre-cluster the regions into so-called visual words, and then use a cluster-342 342
dependent multinomial distribution over these words [20]. These characteristics343 343
also make the model applicable to a wide range of problems in object detection.344 344
Because the image region clusters are learned in an unsupervised fashion,345 345
they are able to capture a wide range of possible concepts. While a human might346 346
label the regions in one way (say trees and buildings), the automatic learning347 347
procedure might find a more contextually relevant grouping. For instance, the348 348
TAS model might split buildings into two categories: apartments, which often349 349
have cars parked near them, and factories, which rarely co-occur with cars.350 350
As discussed in Section 2, recent work has amply demonstrated the impor-351 351
tance of context in computer vision. The type of context modeled by the TAS352 352
framework is a natural complement for many of the other types of context in353 353
the literature. In particular, while many other forms of context can relate known354 354
objects that have been labeled in the data, our model can extract the signals355 355
present in the unlabeled part of the data.356 356
![Page 13: Learning Spatial Context: 1 Using Stuff to Find Things 2ai.stanford.edu/~gaheitz/Research/CONTEXT/ForDaphne/context.pdf · 101 dom Field (MRF) or variant (CRF,DRF) [8,9] to encode](https://reader034.vdocument.in/reader034/viewer/2022050200/5f543f02b45e7659296301be/html5/thumbnails/13.jpg)
ECCV-08 submission ID 892 13
A major limitation of the TAS model is that it captures only 2D context. This357 357
issue also affects our ability to determine the appropriate scale for the contextual358 358
relationships. It would be interesting to integrate a TAS-like definition of context359 359
into an approach that attempts some level of 3D reconstruction, such as the work360 360
of Hoiem and Efros [15] or of Saxena et al. [21], allowing us to utilize 3D context,361 361
and simultaneously address the issue of scale.362 362
References363 363
[1] Torralba, A. Contextual priming for object detection. IJCV 53, 2003.364 364
[2] Viola, P., Jones, M.: Robust real-time face detection. ICCV, 2001.365 365
[3] Shotton, J., Winn, J., Rother, C., Criminisi, A. Textonboost: Joint appearance,366 366
shape and context modeling for multi-class object recognition and segmentation.367 367
ECCV, 2006.368 368
[4] Forsyth, D.A., Malik, J., Fleck, M.M., Greenspan, H., Leung, T.K., Belongie, S.,369 369
Carson, C., Bregler, C. Finding pictures of objects in large collections of images.370 370
Object Representation in Computer Vision, 1996371 371
[5] Murphy, K., Torralba, A., Freeman, W. Using the forest to see the tree: a graphical372 372
model relating features, objects and the scenes NIPS, 2003.373 373
[6] Singhal, A., Luo, J., Zhu, W. Probabilistic spatial context models for scene content374 374
understanding. CVPR, 2003.375 375
[7] Rabinovich, A, Vedaldi, A., Galleguillos, C., Wiewiora, E., Belongie, S. Objects376 376
in context. ICCV, 2007.377 377
[8] Kumar, S., Hebert, M.: A hierarchical field framework for unified context-based378 378
classification. ICCV, 2005.379 379
[9] Carbonetto, P., de Freitas, N., Barnard, K. A statistical model for general con-380 380
textual object recognition ECCV, 2004.381 381
[10] Dalal, N., Triggs, B. Histograms of oriented gradients for human detection. CVPR,382 382
2005.383 383
[11] Oliva, A., Torralba, A. The role of context in object recognition. Trends Cogn384 384
Sci, 2007.385 385
[12] Torralba, A., Murphy, K., Freeman, W., Rubin, M.: Context-based vision system386 386
for place and object recognition ICCV, 2003.387 387
[13] Wolf, L., Bileschi, S. A critical view of context. IJCV 69, 2006.388 388
[14] Fink, M., Perona, P. Mutual boosting for contextual inference. NIPS, 2003.389 389
[15] Hoiem, D., Efros, A.A., Hebert, M. Putting objects in perspective. CVPR, 2006.390 390
[16] Ren, X., Malik, J. Learning a classification model for segmentation. ICCV, 2003.391 391
[17] Dempster, A.P., Laird, N.M., Rubin, D.B. Maximum likelihood from incomplete392 392
data via the em algorithm. JRSS, 1977.393 393
[18] Geman, S., Geman, D. Stochastic relaxation, gibbs distributions, and the bayesian394 394
restoration of images. Readings in computer vision: issues, problems, principles,395 395
and paradigms. (1987) 564–584396 396
[19] Everingham, M., et al. The 2005 pascal visual object classes challenge. MLCW,397 397
2005.398 398
[20] Barnard, K., Duygulu, P., Forsyth, D., de Freitas, N., Blei, D.M., Jordan, M.I.399 399
Matching words and pictures. JMLR 3, 2003.400 400
[21] Saxena, A., Sun, M., Ng, A.Y. Learning 3-d scene structure from a single still401 401
image. ICCV, 2007.402 402