1
Integrating Vision Models for Holistic Scene
Understanding
Geremy Heitz
CS223B, March 4th, 2009
2
Scene/Image Understanding
What’s happening in these pictures?
3
Human View of a “Scene”
“A car passes a bus on the road,
while people walk past a building.”
ROAD
BUILDING
CAR
BUS
PEOPLE WALKING
4
Computer View of a “Scene”
BUILDING
ROAD
STREET SCENE
Can we integrate all of these subtasks, so that whole > sum of parts?
5
Outline
Overview: Integrating Vision Models
CCM: Cascaded Classification Models [Heitz et al., NIPS 2008a]
Learning Spatial Context: TAS, Things and Stuff [Heitz & Koller, ECCV 2008]
Future Directions
6
Image/Scene Understanding
“a man and a dog are walking
on a sidewalk in front of a building”
Man
Dog
Backpack
Cigarette
Primitives: Objects, Parts, Surfaces, Regions
  Established techniques address these in isolation; reasoning over image statistics.
Interactions: Context, Actions
  Complex web of relations, well represented by graphical models.
Scene Descriptions
  Reasoning over more abstract entities.
Building
Sidewalk
7
Why will integration help?
What is this object?
8
More Context
Context is key!
9
Outline
Overview: Integrating Vision Models
CCM: Cascaded Classification Models [Heitz et al., NIPS 2008a]
Learning Spatial Context: TAS, Things and Stuff
Future Directions
10
Human View of a “Scene”
ROAD
BUILDING
CAR
BUS
PEOPLE WALKING
Scene Categorization
Object Detection
Region Labelling
Depth Reconstruction
Surface Orientations
Boundary/Edge Detection
Outlining/Refined Localization
Occlusion Reasoning
...
11
Intrinsic Images [Barrow and Tenenbaum, 1978], [Tappen et al., 2005]
Hoiem et al., “Closing the Loop in Scene Interpretation” , 2008
We want to focus more on “semantic” classes
We want to be flexible in using outside models
We want an extendable framework, not one engineered for a particular set of tasks
Related Work
[Figure: intrinsic image decomposition, image = reflectance + shading]
12
How Should We Integrate?

Single joint model over all variables
  Pros: Tighter interactions, more designer control
  Cons: Need expertise in each of the subtasks

Simple, flexible combination of existing models
  Pros: State-of-the-art models, easier to extend; limited “black-box” interface to components
  Cons: Missing some of the modeling power
DETECTION [Dalal & Triggs, 2005]
REGION LABELING [Gould et al., 2007]
DEPTH RECONSTRUCTION [Saxena et al., 2007]
13
Cascaded Classification Models

[Diagram: image features fDET, fREG, fREC feed the independent models DET0, REG0, REC0 (object detection, region labeling, 3D reconstruction); their outputs feed the context-aware models DET1, REG1, REC1.]
14
Integrated Model for Scene Understanding
Object Detection
Multi-class Segmentation
Depth Reconstruction
Scene Categorization

I’ll show you these.
15
Basic Object Detection
Classes: Car, Person, Motorcycle, Boat, Sheep, Cow

Detection window W is accepted when Score(W) > 0.5.
16
Base Detector: HOG
[Dalal & Triggs, CVPR 2005]

HOG Detector: feature vector X, SVM classifier
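The HOG-plus-SVM pipeline above can be sketched as follows. This is a reduced sketch, not the Dalal & Triggs implementation: real HOG adds block normalization across cells and soft binning, and the weights `w`, `b` would come from SVM training. All names here are illustrative.

```python
import numpy as np

def hog_features(patch, cell=8, nbins=9):
    """Toy HOG: per-cell histograms of gradient orientations, weighted by
    gradient magnitude. (Real HOG adds block normalization and interpolation.)"""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)  # unsigned orientation in [0, pi)
    H, W = patch.shape
    feats = []
    for i in range(0, H - cell + 1, cell):
        for j in range(0, W - cell + 1, cell):
            a = ang[i:i + cell, j:j + cell].ravel()
            m = mag[i:i + cell, j:j + cell].ravel()
            hist, _ = np.histogram(a, bins=nbins, range=(0, np.pi), weights=m)
            feats.append(hist / (np.linalg.norm(hist) + 1e-6))
    return np.concatenate(feats)

def svm_score(x, w, b):
    """Linear SVM decision value; the slide accepts W when Score(W) > 0.5."""
    return float(w @ x + b)
```

For a 64x64 window with 8x8 cells and 9 bins this gives an 8*8*9 = 576-dimensional feature vector X.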
17
Context-Aware Object Detection

From base detector: log score D(W)
From scene category: MAP category, marginals (e.g. scene type: urban)
From region labels: how much of each label is in a window adjacent to W (e.g. % of “road” below W)
From depths: mean and variance of depths in W, estimate of “true” object size

Final classifier: P(Y) = Logistic(Φ(W))
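The context-aware classifier above can be sketched directly: stack the base log score with scene, region, and depth features into Φ(W) and apply a logistic model. The exact feature set and the names `context_features`, `detect`, `theta` are illustrative assumptions, not the paper's definitions.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def context_features(base_log_score, scene_marginals, region_fracs,
                     depth_mean, depth_var):
    """Phi(W): base detector log score stacked with scene-category marginals,
    per-label region fractions near W, and depth statistics inside W."""
    return np.concatenate([[base_log_score], scene_marginals,
                           region_fracs, [depth_mean, depth_var]])

def detect(phi, theta):
    """Context-aware detector: P(Y = 1 | W) = Logistic(theta . Phi(W))."""
    return logistic(theta @ phi)
```

With zero weights the detector is maximally uncertain (0.5); training would fit `theta` on held-out detections.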
18
Multi-class Segmentation CRF Model
Label each pixel as one of {‘grass’, ‘road’, ‘sky’, etc.}
Conditional random field (CRF) over superpixels:
  Singleton potentials: log-linear function of boosted detector scores for each class
  Pairwise potentials: affinity of classes appearing together, conditioned on (x, y) location within the image
[Gould et al., IJCV 2007]
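The CRF's scoring function, and a crude decoder, can be sketched as below. The ICM-style `greedy_decode` loop is a stand-in for the stronger inference actually used with such models, and all potential values here are toy numbers, not learned parameters.

```python
import numpy as np

def crf_log_score(labels, unary, edges, pairwise):
    """Log-score of a superpixel labeling under the CRF.
    unary[i, c]: singleton potential (log-linear in boosted detector scores);
    pairwise[c, c2]: affinity of neighboring classes; edges: (i, j) pairs."""
    score = sum(unary[i, labels[i]] for i in range(len(labels)))
    score += sum(pairwise[labels[i], labels[j]] for i, j in edges)
    return score

def greedy_decode(unary, edges, pairwise, iters=5):
    """ICM-style decoding: sweep superpixels, flipping each label to its
    locally best value given its neighbors."""
    labels = unary.argmax(axis=1)
    n, C = unary.shape
    nbrs = {i: [] for i in range(n)}
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    for _ in range(iters):
        for i in range(n):
            scores = unary[i] + np.array(
                [sum(pairwise[c, labels[j]] for j in nbrs[i]) for c in range(C)])
            labels[i] = scores.argmax()
    return labels
```

A strong diagonal in `pairwise` encourages smooth labelings, which is the role the location-conditioned affinities play in the slide's model.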
19
Context-Aware Multi-class Seg.
Additional feature: relative location map
(“Where is the grass?”)
20
Depth Reconstruction CRF
[Saxena et al., PAMI 2008]
Label each pixel with its distance from the camera
Conditional random field (CRF) over superpixels, with continuous variables
Models depth as a linear function of features, with pairwise smoothness constraints
http://make3d.stanford.edu
21
Depth Reconstruction with Context
The depth model is treated as a BLACK BOX that produces d*.

Context constraints from region labels:
  GRASS: grass is horizontal
  SKY: sky is far away

Reoptimize depths with the new constraints:
  dCCM = argmin α||d - d*|| + β||d - dCONTEXT||
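If both penalty terms are taken as squared Euclidean norms (an assumption; the slide writes plain norms), the reoptimization has a simple closed form: each depth is a convex combination of the black-box depth and the context-implied depth.

```python
import numpy as np

def reoptimize_depths(d_star, d_context, alpha=1.0, beta=0.5):
    """Minimize alpha*||d - d*||^2 + beta*||d - d_context||^2 over d.
    Both terms are quadratic, so the minimizer is the weighted average."""
    return (alpha * d_star + beta * d_context) / (alpha + beta)
```

With beta = 0 the context is ignored and d = d*; larger beta pulls depths toward the constraints (grass horizontal, sky far away).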
22
Training
I: image; f: image features; Ŷ: output labels

Training regimes:

Independent:   Ŷ0 = argmax P(Y0 | f)
Ground (ground-truth input):   Ŷ1 = argmax P(Y1 | f, Y*other)

[Diagram: independent models ŶD0, ŶS0, ŶZ0 computed from features fD, fS, fZ; context-aware models ŶD1, ŶS1, ŶZ1 computed from the features plus the ground-truth labels Y*S, Y*Z of the other tasks.]
23
Training
CCM training regime:
  Later models can ignore the mistakes of previous models
  Training realistically emulates the testing setup
  Allows disjoint datasets

K-CCM: a CCM with K levels of classifiers

CCM:   Ŷ1 = argmax P(Y1 | f, Ŷ0other)

[Diagram: level-0 outputs ŶD0, ŶS0, ŶZ0 feed the level-1 models ŶD1, ŶS1, ŶZ1.]
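The CCM training regime can be sketched with toy logistic-regression tasks: at level k each task trains on its own features plus the level k-1 predictions (Ŷ0, not ground truth) of the other tasks, which is what lets later models correct earlier mistakes. The two-task setup and the names `train_ccm`, `fit_logistic` are illustrative, not from the paper.

```python
import numpy as np

def fit_logistic(X, y, iters=200, lr=0.5):
    """Plain gradient-ascent logistic regression, standing in for a subtask model."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ w, -30, 30)))
        w += lr * X.T @ (y - p) / len(y)
    return w

def predict(X, w):
    return (X @ w > 0).astype(float)

def train_ccm(features, targets, K=2):
    """K-CCM: at level k, each task trains on its features plus the level k-1
    predictions of the *other* tasks."""
    tasks = list(features)
    models, preds = [], {}
    for k in range(K):
        def with_context(t):
            X = features[t]
            if k > 0:  # append other tasks' previous-level predictions
                ctx = [preds[o][:, None] for o in tasks if o != t]
                X = np.hstack([X] + ctx)
            return X
        level = {t: fit_logistic(with_context(t), targets[t]) for t in tasks}
        preds = {t: predict(with_context(t), level[t]) for t in tasks}
        models.append(level)
    return models, preds
```

Because training consumes predicted (noisy) context rather than ground truth, the regime emulates the test-time setup, as the slide notes.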
24
Experiments
DS1: 422 images, fully labeled
  Categorization, detection, multi-class segmentation
  5-fold cross validation

DS2: 1745 images, disjoint labels
  Detection, multi-class segmentation, 3D reconstruction
  997 train, 748 test
25
CCM Results – DS1
[Results plots: CAR, PEDESTRIAN, MOTORBIKE, BOAT detection; scene CATEGORIES; REGION LABELS.]
26
CCM Results – DS2
Detection and depth error:

        Car    Person  Bike   Boat   Sheep  Cow    Depth
INDEP   0.357  0.267   0.410  0.096  0.319  0.395  16.7m
2-CCM   0.364  0.272   0.410  0.212  0.289  0.415  15.4m

Region labels:

        Tree   Road   Grass  Water  Sky    Building  FG
INDEP   0.541  0.702  0.859  0.444  0.924  0.436     0.828
2-CCM   0.581  0.692  0.860  0.565  0.930  0.489     0.819
Boats
27
Example Results: INDEPENDENT vs. CCM
28
Example Results
[Figures: independent objects and regions vs. CCM objects and regions.]
29
Understanding the man
“a man, a dog, a sidewalk, a building”
30
Outline
Overview: Integrating Vision Models
CCM: Cascaded Classification Models
Learning Spatial Context: TAS, Things and Stuff [Heitz & Koller, ECCV 2008]
Future Directions
31
Things vs. Stuff
Stuff (n): Material defined by a homogeneous or repetitive pattern of fine-scale properties, but has no specific or distinctive spatial extent or shape.
(REGIONS)
Thing (n): An object with a specific size and shape.
(DETECTIONS)
From: Forsyth et al. Finding pictures of objects in large collections of images. Object Representation in Computer Vision, 1996.
32
Cascaded Classification Models
[Diagram, repeated from slide 13: image features fDET, fREG, fREC feed the independent models DET0, REG0, REC0; their outputs feed the context-aware models DET1, REG1, REC1.]
33
CCMs vs. TAS

CCM: feedforward
[Diagram: features fDET, fREG feed DET0, REG0, whose outputs feed DET1, REG1.]

TAS: modeled jointly
[Diagram: features fDET, fREG feed DET and REG, coupled through relationship variables.]
34
Satellite Detection Example
FALSE POSITIVE
TRUE POSITIVE
35
Stuff-Thing Context
Stuff-thing context is based on spatial relationships.

Intuition:
  Trees = no cars
  Houses = cars nearby
  Road = cars here

“Cars drive on roads” “Cows graze on grass” “Boats sail on water”

Goal: unsupervised
36
Things
Detection: Ti ∈ {0,1}; Ti = 1 means candidate window Wi contains a positive detection.

P(Ti) = Logistic(score(Wi))
37
Stuff
Coherent image regions: coarse “superpixels”
Feature vector Fj in R^n
Cluster label Sj in {1…C}
Stuff model: naïve Bayes (Sj → Fj)
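The unsupervised stuff clustering can be sketched as EM for a diagonal-covariance Gaussian mixture over region features, i.e. naive Bayes with continuous features and a hidden cluster label Sj. The deterministic extreme-point initialization and the fixed iteration count are simplifications for illustration.

```python
import numpy as np

def em_cluster(F, C=2, iters=30):
    """EM for a naive-Bayes (diagonal Gaussian) mixture over region features F.
    Cluster labels S_j are hidden, as in TAS's 'contextual clustering'."""
    n, d = F.shape
    order = np.argsort(F[:, 0])
    mu = F[order[[0, -1]]].astype(float)     # init means at extreme points
    var = np.ones((C, d))
    pi = np.full(C, 1.0 / C)
    for _ in range(iters):
        # E-step: responsibilities p(S_j = c | F_j)
        logp = (-0.5 * (((F[:, None] - mu) ** 2) / var).sum(-1)
                - 0.5 * np.log(var).sum(-1) + np.log(pi))
        logp -= logp.max(axis=1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate means, variances, mixing weights
        nk = r.sum(0) + 1e-9
        mu = (r.T @ F) / nk[:, None]
        var = (r.T @ (F ** 2)) / nk[:, None] - mu ** 2 + 1e-6
        pi = nk / n
    return r.argmax(axis=1), mu
```

The learned means `mu` are what make the parameters "readily interpretable": each cluster center summarizes one stuff type (trees, houses, road, ...).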
38
Relationships
Descriptive relations: “near”, “above”, “in front of”, etc.
Choose a set R = {r1 … rK}
Rijk = 1: detection i and region j have relation k

Example: with clusters S72 = trees, S4 = houses, S10 = road, the relation R1,10,in = 1 says detection T1 lies in the road region.
39
Unrolled Model
[Unrolled model: candidate windows T1–T3 and image regions S1–S5, with relations such as R2,1,above = 0, R3,1,left = 1, R1,3,near = 0, R3,3,in = 1, R1,1,left = 1.]
40
Learning the Parameters
Assume we know R. Sj is hidden; everything else is observed.
Expectation-Maximization: “contextual clustering”
Parameters are readily interpretable.

[Plate model: Ti (supervised in the training set), Wi (always observed), Rijk, Sj (always hidden), Fj; N windows, J regions, K relations.]
41
Which Relationships to Use?
Rijk = spatial relationship between candidate i and region j:
  Rij1 = candidate in region
  Rij2 = candidate closer than 2 bounding boxes (BBs) to region
  Rij3 = candidate closer than 4 BBs to region
  Rij4 = candidate farther than 8 BBs from region
  Rij5 = candidate 2 BBs left of region
  Rij6 = candidate 2 BBs right of region
  Rij7 = candidate 2 BBs below region
  Rij8 = candidate more than 2 and less than 4 BBs from region
  …
  RijK = candidate near region boundary

How do we avoid overfitting?
42
Learning the TAS Relations
Intuition: a “detached” Rijk is an inactive relationship.

Structural EM iterates:
  Learn parameters
  Decide which edge to toggle
  Evaluate with l(T | F, W, R) (requires inference)
  Better results than using the standard E[l(T, S, F, W, R)]

[Model fragment: Ti, Sj, Fj with relation variables Rij1 … RijK.]
43
Inference
Goal: compute P(T | F, W, R)

Block Gibbs sampling: easy to sample the Ti’s given the Sj’s, and vice versa.
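The alternation above can be sketched on a toy TAS-like joint with a single relation type, P(T, S) ∝ exp(Σi Ti(base_i + Σj Rij·w[Sj]) + Σj F_loglik[j, Sj]): given S all Ti are conditionally independent, and given T all Sj are, so each block is sampled in one shot. All parameters here are illustrative, not learned TAS parameters.

```python
import numpy as np

def block_gibbs(base, R, w, F_loglik, n_sweeps=200, burn=50, seed=0):
    """Block Gibbs sampler for the toy joint described above.
    base[i]: detector log-odds for T_i; R[i, j]: relation indicator;
    w[c]: detection/cluster compatibility; F_loglik[j, c]: region appearance.
    Returns Rao-Blackwellized estimates of P(T_i = 1)."""
    rng = np.random.default_rng(seed)
    N, J = R.shape
    C = F_loglik.shape[1]
    S = rng.integers(0, C, size=J)
    marg = np.zeros(N)
    for sweep in range(n_sweeps):
        # Block 1: all T_i at once, independent given S
        p = 1.0 / (1.0 + np.exp(-(base + R @ w[S])))
        T = (rng.random(N) < p).astype(float)
        # Block 2: all S_j at once, independent given T
        pull = T @ R                              # sum_i T_i R_ij, shape (J,)
        lp = F_loglik + pull[:, None] * w[None, :]
        lp -= lp.max(axis=1, keepdims=True)
        probs = np.exp(lp)
        probs /= probs.sum(axis=1, keepdims=True)
        S = np.array([rng.choice(C, p=probs[j]) for j in range(J)])
        if sweep >= burn:
            marg += p
    return marg / (n_sweeps - burn)
```

When a related region's cluster is car-friendly (large w[c]) the detection marginal rises above the base detector's; when it is car-hostile the marginal drops, which is exactly the satellite-image behavior shown on the next slides.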
44
Learned Satellite Clusters
45
Results - Satellite
Prior:Detector Only
Posterior:Detections
Posterior:Region Labels
46
Discovered Context - Bicycles
Bicycles
Cluster #3
47
TAS Results – Bicycles
Examples
Discover “true positives”
Remove “false positives”
48
Results – VOC 2005
[Plots: TAS vs. base detector]
49
Understanding the man
“a man and a dog on a sidewalk,
in front of a building ”
50
Outline
Overview: Integrating Vision Models
CCM: Cascaded Classification Models
Learning Spatial Context: TAS, Things and Stuff
Future Directions
51
Shape models for segmentation
We have a good deformable shape model (LOOPS) for outlining objects.
We have good models for segmenting objects.
Let’s combine them: add terms encouraging landmarks to lie on segmentation boundaries.

Ben Packer is working on this…

[Figure: outline, segmentation, joint outline, joint segmentation; landmark and segmentation-mask variables.]
52
Refined Segmentation
Our segmentation only knows about pixel “classes”. What about objects?

Steve Gould is working on this…

[Model: region class, region appearance, pixel/region assignment, pixel appearance.]
53
Full TAS-like Integration
[Model: relations Rijk coupling Ti and Sj with depths, occlusion edges, surface edges, and shape models.]