1
Integrating Vision Models for Holistic Scene
Understanding
Geremy Heitz
CS223B, March 4th, 2009
2
Scene/Image Understanding
What’s happening in these pictures?
3
Human View of a “Scene”
“A car passes a bus on the road,
while people walk past a building.”
ROAD
BUILDING
CAR
BUS
PEOPLE WALKING
4
Computer View of a “Scene”
BUILDING
ROAD
STREET SCENE
Can we integrate all of these subtasks, so that whole > sum of parts?
5
Outline
Overview: Integrating Vision Models
CCM: Cascaded Classification Models [Heitz et al., NIPS 2008a]
Learning Spatial Context: TAS, Things and Stuff [Heitz & Koller, ECCV 2008]
Future Directions
6
Image/Scene Understanding
“a man and a dog are walking
on a sidewalk in front of a building”
Man
Dog
Backpack
Cigarette
Primitives: Objects, Parts, Surfaces, Regions
  Established techniques address these in isolation; reasoning over image statistics.
Interactions: Context, Actions
  Complex web of relations, well represented by graphical models.
Scene Descriptions
  Reasoning over more abstract entities.
Building
Sidewalk
7
Why will integration help?
What is this object?
8
More Context
Context is key!
9
Outline
Overview: Integrating Vision Models
CCM: Cascaded Classification Models [Heitz et al., NIPS 2008a]
Learning Spatial Context: TAS, Things and Stuff
Future Directions
10
Human View of a “Scene”
ROAD
BUILDING
CAR
BUS
PEOPLE WALKING
Scene Categorization
Object Detection
Region Labelling
Depth Reconstruction
Surface Orientations
Boundary/Edge Detection
Outlining/Refined Localization
Occlusion Reasoning
...
11
Intrinsic Images [Barrow and Tenenbaum, 1978], [Tappen et al., 2005]
Hoiem et al., “Closing the Loop in Scene Interpretation” , 2008
We want to focus more on “semantic” classes
We want to be flexible in using outside models
We want an extendable framework, not one engineered for a particular set of tasks
Related Work
[Figure: intrinsic image decomposition, image = reflectance + shading]
12
How Should We Integrate?

Single joint model over all variables
  Pros: Tighter interactions, more designer control
  Cons: Need expertise in each of the subtasks

Simple, flexible combination of existing models
  Pros: State-of-the-art models, easier to extend; limited “black-box” interface to components
  Cons: Missing some of the modeling power
DETECTION [Dalal & Triggs, 2005]
REGION LABELING [Gould et al., 2007]
DEPTH RECONSTRUCTION [Saxena et al., 2007]
13
Cascaded Classification Models

[Diagram: image features fDET, fREG, fREC feed the independent models DET0, REG0, REC0 (object detection, region labeling, 3D reconstruction); their outputs feed the context-aware models DET1, REG1, REC1.]
14
Integrated Model for Scene Understanding
Object Detection
Multi-class Segmentation
Depth Reconstruction
Scene Categorization

I’ll show you these.
15
Basic Object Detection
Classes: Car, Person, Motorcycle, Boat, Sheep, Cow

Detection window W is accepted when Score(W) > 0.5.
16
Base Detector: HOG
[Dalal & Triggs, CVPR 2005]

HOG Detector: feature vector X, SVM classifier
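The HOG-plus-SVM pipeline above can be sketched as follows. This is a reduced sketch, not the Dalal & Triggs implementation: real HOG adds block normalization across cells and soft binning, and the weights `w`, `b` would come from SVM training. All names here are illustrative.

```python
import numpy as np

def hog_features(patch, cell=8, nbins=9):
    """Toy HOG: per-cell histograms of gradient orientations, weighted by
    gradient magnitude. (Real HOG adds block normalization and interpolation.)"""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)  # unsigned orientation in [0, pi)
    H, W = patch.shape
    feats = []
    for i in range(0, H - cell + 1, cell):
        for j in range(0, W - cell + 1, cell):
            a = ang[i:i + cell, j:j + cell].ravel()
            m = mag[i:i + cell, j:j + cell].ravel()
            hist, _ = np.histogram(a, bins=nbins, range=(0, np.pi), weights=m)
            feats.append(hist / (np.linalg.norm(hist) + 1e-6))
    return np.concatenate(feats)

def svm_score(x, w, b):
    """Linear SVM decision value; the slide accepts W when Score(W) > 0.5."""
    return float(w @ x + b)
```

For a 64x64 window with 8x8 cells and 9 bins this gives an 8*8*9 = 576-dimensional feature vector X.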
17
Context-Aware Object Detection

From base detector: log score D(W)
From scene category: MAP category, marginals (e.g. scene type: urban)
From region labels: how much of each label is in a window adjacent to W (e.g. % of “road” below W)
From depths: mean and variance of depths in W, estimate of “true” object size

Final classifier: P(Y) = Logistic(Φ(W))
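The context-aware classifier above can be sketched directly: stack the base log score with scene, region, and depth features into Φ(W) and apply a logistic model. The exact feature set and the names `context_features`, `detect`, `theta` are illustrative assumptions, not the paper's definitions.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def context_features(base_log_score, scene_marginals, region_fracs,
                     depth_mean, depth_var):
    """Phi(W): base detector log score stacked with scene-category marginals,
    per-label region fractions near W, and depth statistics inside W."""
    return np.concatenate([[base_log_score], scene_marginals,
                           region_fracs, [depth_mean, depth_var]])

def detect(phi, theta):
    """Context-aware detector: P(Y = 1 | W) = Logistic(theta . Phi(W))."""
    return logistic(theta @ phi)
```

With zero weights the detector is maximally uncertain (0.5); training would fit `theta` on held-out detections.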
18
Multi-class Segmentation CRF Model
Label each pixel as one of {‘grass’, ‘road’, ‘sky’, etc.}
Conditional random field (CRF) over superpixels:
  Singleton potentials: log-linear function of boosted detector scores for each class
  Pairwise potentials: affinity of classes appearing together, conditioned on (x, y) location within the image
[Gould et al., IJCV 2007]
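The CRF's scoring function, and a crude decoder, can be sketched as below. The ICM-style `greedy_decode` loop is a stand-in for the stronger inference actually used with such models, and all potential values here are toy numbers, not learned parameters.

```python
import numpy as np

def crf_log_score(labels, unary, edges, pairwise):
    """Log-score of a superpixel labeling under the CRF.
    unary[i, c]: singleton potential (log-linear in boosted detector scores);
    pairwise[c, c2]: affinity of neighboring classes; edges: (i, j) pairs."""
    score = sum(unary[i, labels[i]] for i in range(len(labels)))
    score += sum(pairwise[labels[i], labels[j]] for i, j in edges)
    return score

def greedy_decode(unary, edges, pairwise, iters=5):
    """ICM-style decoding: sweep superpixels, flipping each label to its
    locally best value given its neighbors."""
    labels = unary.argmax(axis=1)
    n, C = unary.shape
    nbrs = {i: [] for i in range(n)}
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    for _ in range(iters):
        for i in range(n):
            scores = unary[i] + np.array(
                [sum(pairwise[c, labels[j]] for j in nbrs[i]) for c in range(C)])
            labels[i] = scores.argmax()
    return labels
```

A strong diagonal in `pairwise` encourages smooth labelings, which is the role the location-conditioned affinities play in the slide's model.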
19
Context-Aware Multi-class Seg.
Additional feature: relative location map
(“Where is the grass?”)
20
Depth Reconstruction CRF
[Saxena et al., PAMI 2008]
Label each pixel with its distance from the camera
Conditional random field (CRF) over superpixels, with continuous variables
Models depth as a linear function of features, with pairwise smoothness constraints
http://make3d.stanford.edu
21
Depth Reconstruction with Context
The depth model is treated as a BLACK BOX that produces d*.

Context constraints from region labels:
  GRASS: grass is horizontal
  SKY: sky is far away

Reoptimize depths with the new constraints:
  dCCM = argmin α||d - d*|| + β||d - dCONTEXT||
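If both penalty terms are taken as squared Euclidean norms (an assumption; the slide writes plain norms), the reoptimization has a simple closed form: each depth is a convex combination of the black-box depth and the context-implied depth.

```python
import numpy as np

def reoptimize_depths(d_star, d_context, alpha=1.0, beta=0.5):
    """Minimize alpha*||d - d*||^2 + beta*||d - d_context||^2 over d.
    Both terms are quadratic, so the minimizer is the weighted average."""
    return (alpha * d_star + beta * d_context) / (alpha + beta)
```

With beta = 0 the context is ignored and d = d*; larger beta pulls depths toward the constraints (grass horizontal, sky far away).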
22
Training
I: image; f: image features; Ŷ: output labels

Training regimes:

Independent:   Ŷ0 = argmax P(Y0 | f)
Ground (ground-truth input):   Ŷ1 = argmax P(Y1 | f, Y*other)

[Diagram: independent models ŶD0, ŶS0, ŶZ0 computed from features fD, fS, fZ; context-aware models ŶD1, ŶS1, ŶZ1 computed from the features plus the ground-truth labels Y*S, Y*Z of the other tasks.]
23
Training
CCM training regime:
  Later models can ignore the mistakes of previous models
  Training realistically emulates the testing setup
  Allows disjoint datasets

K-CCM: a CCM with K levels of classifiers

CCM:   Ŷ1 = argmax P(Y1 | f, Ŷ0other)

[Diagram: level-0 outputs ŶD0, ŶS0, ŶZ0 feed the level-1 models ŶD1, ŶS1, ŶZ1.]
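The CCM training regime can be sketched with toy logistic-regression tasks: at level k each task trains on its own features plus the level k-1 predictions (Ŷ0, not ground truth) of the other tasks, which is what lets later models correct earlier mistakes. The two-task setup and the names `train_ccm`, `fit_logistic` are illustrative, not from the paper.

```python
import numpy as np

def fit_logistic(X, y, iters=200, lr=0.5):
    """Plain gradient-ascent logistic regression, standing in for a subtask model."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ w, -30, 30)))
        w += lr * X.T @ (y - p) / len(y)
    return w

def predict(X, w):
    return (X @ w > 0).astype(float)

def train_ccm(features, targets, K=2):
    """K-CCM: at level k, each task trains on its features plus the level k-1
    predictions of the *other* tasks."""
    tasks = list(features)
    models, preds = [], {}
    for k in range(K):
        def with_context(t):
            X = features[t]
            if k > 0:  # append other tasks' previous-level predictions
                ctx = [preds[o][:, None] for o in tasks if o != t]
                X = np.hstack([X] + ctx)
            return X
        level = {t: fit_logistic(with_context(t), targets[t]) for t in tasks}
        preds = {t: predict(with_context(t), level[t]) for t in tasks}
        models.append(level)
    return models, preds
```

Because training consumes predicted (noisy) context rather than ground truth, the regime emulates the test-time setup, as the slide notes.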
24
Experiments
DS1: 422 images, fully labeled
  Categorization, detection, multi-class segmentation
  5-fold cross validation

DS2: 1745 images, disjoint labels
  Detection, multi-class segmentation, 3D reconstruction
  997 train, 748 test
25
CCM Results – DS1
[Results plots: CAR, PEDESTRIAN, MOTORBIKE, BOAT detection; scene CATEGORIES; REGION LABELS.]
26
CCM Results – DS2
Detection and depth error:

        Car    Person  Bike   Boat   Sheep  Cow    Depth
INDEP   0.357  0.267   0.410  0.096  0.319  0.395  16.7m
2-CCM   0.364  0.272   0.410  0.212  0.289  0.415  15.4m

Region labels:

        Tree   Road   Grass  Water  Sky    Building  FG
INDEP   0.541  0.702  0.859  0.444  0.924  0.436     0.828
2-CCM   0.581  0.692  0.860  0.565  0.930  0.489     0.819
Boats
27
Example Results: INDEPENDENT vs. CCM
28
Example Results
[Figures: independent objects and regions vs. CCM objects and regions.]
29
Understanding the man
“a man, a dog, a sidewalk, a building”
30
Outline
Overview: Integrating Vision Models
CCM: Cascaded Classification Models
Learning Spatial Context: TAS, Things and Stuff [Heitz & Koller, ECCV 2008]
Future Directions
31
Things vs. Stuff
Stuff (n): Material defined by a homogeneous or repetitive pattern of fine-scale properties, but has no specific or distinctive spatial extent or shape.
(REGIONS)
Thing (n): An object with a specific size and shape.
(DETECTIONS)
From: Forsyth et al. Finding pictures of objects in large collections of images. Object Representation in Computer Vision, 1996.
32
Cascaded Classification Models
[Diagram, repeated from slide 13: image features fDET, fREG, fREC feed the independent models DET0, REG0, REC0; their outputs feed the context-aware models DET1, REG1, REC1.]
33
CCMs vs. TAS

CCM: feedforward
[Diagram: features fDET, fREG feed DET0, REG0, whose outputs feed DET1, REG1.]

TAS: modeled jointly
[Diagram: features fDET, fREG feed DET and REG, coupled through relationship variables.]
34
Satellite Detection Example
FALSE POSITIVE
TRUE POSITIVE
35
Stuff-Thing Context
Stuff-thing context is based on spatial relationships.

Intuition:
  Trees = no cars
  Houses = cars nearby
  Road = cars here

“Cars drive on roads” “Cows graze on grass” “Boats sail on water”

Goal: unsupervised
36
Things
Detection: Ti ∈ {0,1}; Ti = 1 means candidate window Wi contains a positive detection.

P(Ti) = Logistic(score(Wi))
37
Stuff
Coherent image regions: coarse “superpixels”
Feature vector Fj in R^n
Cluster label Sj in {1…C}
Stuff model: naïve Bayes (Sj → Fj)
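The unsupervised stuff clustering can be sketched as EM for a diagonal-covariance Gaussian mixture over region features, i.e. naive Bayes with continuous features and a hidden cluster label Sj. The deterministic extreme-point initialization and the fixed iteration count are simplifications for illustration.

```python
import numpy as np

def em_cluster(F, C=2, iters=30):
    """EM for a naive-Bayes (diagonal Gaussian) mixture over region features F.
    Cluster labels S_j are hidden, as in TAS's 'contextual clustering'."""
    n, d = F.shape
    order = np.argsort(F[:, 0])
    mu = F[order[[0, -1]]].astype(float)     # init means at extreme points
    var = np.ones((C, d))
    pi = np.full(C, 1.0 / C)
    for _ in range(iters):
        # E-step: responsibilities p(S_j = c | F_j)
        logp = (-0.5 * (((F[:, None] - mu) ** 2) / var).sum(-1)
                - 0.5 * np.log(var).sum(-1) + np.log(pi))
        logp -= logp.max(axis=1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate means, variances, mixing weights
        nk = r.sum(0) + 1e-9
        mu = (r.T @ F) / nk[:, None]
        var = (r.T @ (F ** 2)) / nk[:, None] - mu ** 2 + 1e-6
        pi = nk / n
    return r.argmax(axis=1), mu
```

The learned means `mu` are what make the parameters "readily interpretable": each cluster center summarizes one stuff type (trees, houses, road, ...).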
38
Relationships
Descriptive relations: “near”, “above”, “in front of”, etc.
Choose a set R = {r1 … rK}
Rijk = 1: detection i and region j have relation k

Example: with clusters S72 = trees, S4 = houses, S10 = road, the relation R1,10,in = 1 says detection T1 lies in the road region.
39
Unrolled Model
[Unrolled model: candidate windows T1–T3 and image regions S1–S5, with relations such as R2,1,above = 0, R3,1,left = 1, R1,3,near = 0, R3,3,in = 1, R1,1,left = 1.]
40
Learning the Parameters
Assume we know R. Sj is hidden; everything else is observed.
Expectation-Maximization: “contextual clustering”
Parameters are readily interpretable.

[Plate model: Ti (supervised in the training set), Wi (always observed), Rijk, Sj (always hidden), Fj; N windows, J regions, K relations.]
41
Which Relationships to Use?
Rijk = spatial relationship between candidate i and region j:
  Rij1 = candidate in region
  Rij2 = candidate closer than 2 bounding boxes (BBs) to region
  Rij3 = candidate closer than 4 BBs to region
  Rij4 = candidate farther than 8 BBs from region
  Rij5 = candidate 2 BBs left of region
  Rij6 = candidate 2 BBs right of region
  Rij7 = candidate 2 BBs below region
  Rij8 = candidate more than 2 and less than 4 BBs from region
  …
  RijK = candidate near region boundary

How do we avoid overfitting?
42
Learning the TAS Relations
Intuition: a “detached” Rijk is an inactive relationship.

Structural EM iterates:
  Learn parameters
  Decide which edge to toggle
  Evaluate with l(T | F, W, R) (requires inference)
  Better results than using the standard E[l(T, S, F, W, R)]

[Model fragment: Ti, Sj, Fj with relation variables Rij1 … RijK.]
43
Inference
Goal: compute P(T | F, W, R)

Block Gibbs sampling: easy to sample the Ti’s given the Sj’s, and vice versa.
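The alternation above can be sketched on a toy TAS-like joint with a single relation type, P(T, S) ∝ exp(Σi Ti(base_i + Σj Rij·w[Sj]) + Σj F_loglik[j, Sj]): given S all Ti are conditionally independent, and given T all Sj are, so each block is sampled in one shot. All parameters here are illustrative, not learned TAS parameters.

```python
import numpy as np

def block_gibbs(base, R, w, F_loglik, n_sweeps=200, burn=50, seed=0):
    """Block Gibbs sampler for the toy joint described above.
    base[i]: detector log-odds for T_i; R[i, j]: relation indicator;
    w[c]: detection/cluster compatibility; F_loglik[j, c]: region appearance.
    Returns Rao-Blackwellized estimates of P(T_i = 1)."""
    rng = np.random.default_rng(seed)
    N, J = R.shape
    C = F_loglik.shape[1]
    S = rng.integers(0, C, size=J)
    marg = np.zeros(N)
    for sweep in range(n_sweeps):
        # Block 1: all T_i at once, independent given S
        p = 1.0 / (1.0 + np.exp(-(base + R @ w[S])))
        T = (rng.random(N) < p).astype(float)
        # Block 2: all S_j at once, independent given T
        pull = T @ R                              # sum_i T_i R_ij, shape (J,)
        lp = F_loglik + pull[:, None] * w[None, :]
        lp -= lp.max(axis=1, keepdims=True)
        probs = np.exp(lp)
        probs /= probs.sum(axis=1, keepdims=True)
        S = np.array([rng.choice(C, p=probs[j]) for j in range(J)])
        if sweep >= burn:
            marg += p
    return marg / (n_sweeps - burn)
```

When a related region's cluster is car-friendly (large w[c]) the detection marginal rises above the base detector's; when it is car-hostile the marginal drops, which is exactly the satellite-image behavior shown on the next slides.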
44
Learned Satellite Clusters
45
Results - Satellite
Prior:Detector Only
Posterior:Detections
Posterior:Region Labels
46
Discovered Context - Bicycles
Bicycles
Cluster #3
47
TAS Results – Bicycles
Examples
Discover “true positives”
Remove “false positives”
48
Results – VOC 2005
[Plots: TAS vs. base detector]
49
Understanding the man
“a man and a dog on a sidewalk,
in front of a building ”
50
Outline
Overview: Integrating Vision Models
CCM: Cascaded Classification Models
Learning Spatial Context: TAS, Things and Stuff
Future Directions
51
Shape models for segmentation
We have a good deformable shape model (LOOPS) for outlining objects.
We have good models for segmenting objects.
Let’s combine them: add terms encouraging landmarks to lie on segmentation boundaries.

Ben Packer is working on this…

[Figure: outline, segmentation, joint outline, joint segmentation; landmark and segmentation-mask variables.]
52
Refined Segmentation
Our segmentation only knows about pixel “classes”. What about objects?

Steve Gould is working on this…

[Model: region class, region appearance, pixel/region assignment, pixel appearance.]
53
Full TAS-like Integration
[Model: relations Rijk coupling Ti and Sj with depths, occlusion edges, surface edges, and shape models.]