Post on 17-Dec-2015
Learning and Inference in Vision: from Features to Scene Understanding
Jonathan Huang, Tomasz Malisiewicz
MLD Student Research Symposium, 2009
[Figure: scene labeled Road, Sky, Trees, Bridge, Sign, Car]
Huge datasets
PASCAL Visual Object Classes (VOC) Challenge dataset
~15,000 annotated images, ~35,000 annotated object instances, 20 object classes, with segmentations and bounding boxes
LabelMe dataset
~11,845 static images, >100,000 labeled polygons
Outline
I. Recognizing single object classes (Jon)
II. Scene understanding with multiple classes (Tomasz)
Recognition task #1: Find all markers
Geometric Variability
Recognition task #2: Find all cats
Object recognition is often hard due to:
Variation within an object class
Viewpoint, scale, and illumination variability (images from Flickr)
From Pixels to Visual Features
[Figure: Scene -> imaging -> Pixels -> low-level Features -> higher-level inference -> "car"]
Local Visual Features
Images are high dimensional! A 640 (width) x 480 (height) image has 640 x 480 = 307,200 pixels.
Compute image statistics in a region (e.g., estimate the distribution of image gradient orientations)
Key ideas in feature design
Be invariant to stuff you don’t care about…
while not being too invariant
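A minimal sketch of such a region statistic, assuming a grayscale NumPy array. The choices of 8 unsigned orientation bins, magnitude weighting, and L2 normalization are illustrative assumptions, not taken from the slides. The checks at the end show the invariance trade-off: a brightness offset or contrast change leaves the normalized histogram unchanged, while rotating the structure by 90 degrees does change it.

```python
import numpy as np

def orientation_histogram(patch, n_bins=8):
    """Magnitude-weighted histogram of unsigned gradient orientations in one region."""
    patch = np.asarray(patch, dtype=float)
    gy, gx = np.gradient(patch)                        # derivatives along rows (y) and columns (x)
    magnitude = np.hypot(gx, gy)
    orientation = np.mod(np.arctan2(gy, gx), np.pi)    # unsigned orientation in [0, pi)
    bins = np.minimum((orientation / np.pi * n_bins).astype(int), n_bins - 1)
    hist = np.bincount(bins.ravel(), weights=magnitude.ravel(), minlength=n_bins)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist           # L2 normalization cancels contrast scaling

# A 16x16 region (256 pixel values) is summarized by just 8 numbers.
rng = np.random.default_rng(0)
patch = rng.random((16, 16))
h = orientation_histogram(patch)

# Invariant to a brightness offset and a contrast change ...
assert np.allclose(h, orientation_histogram(3.0 * patch + 10.0))

# ... but not too invariant: vertical vs. horizontal stripes give different histograms.
stripes = np.tile(np.sin(np.arange(16) / 2.0), (16, 1))
assert not np.allclose(orientation_histogram(stripes), orientation_histogram(stripes.T))
```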
Object classification
Inference: What object class is this?
Learning: What does each object class look like?
Cow or horse?
Let’s look at a simpler example first…
Document classification analogy
John Terry scored on a header to lift Chelsea to a 1-0 victory over Manchester United and extend the Blues’ Premier League lead to 5 points. Chelsea had been frustrated by Manchester United for 76 minutes, but took advantage of a free kick awarded when Darren Fletcher fouled Ashley Cole. Brian Ching scored six minutes into overtime and the Houston Dynamo advanced to Major League Soccer’s Western ...
In the Senate, where proposals differ substantially from the House-passed measure on issues like a government-run plan and how to pay for coverage, the bill is stalled while budget analysts assess its overall costs. The slim margin in the House — the bill passed with just two votes to spare, and 39 Democrats opposed it — suggests even greater challenges in the Senate, where the majority leader, ...
??? ???
Classify each document as sports or politics
Bag-of-words models for text classification
“Much of the meaning behind written language is preserved even when the ordering of the individual words is lost.” [El-Arini et al., ’09]
[Figure: a “bag” of words (image: Sue Ann)]
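To make the analogy concrete, here is a small sketch of bag-of-words text classification. The slides do not name a classifier; multinomial naive Bayes with add-one smoothing is swapped in as one common choice, and the four training snippets are toy data adapted from the articles above. Because only word counts are used, shuffling a document cannot change the prediction.

```python
import math
import random
from collections import Counter

def bag_of_words(text):
    """Lowercased word counts; the word order is thrown away."""
    return Counter(text.lower().split())

class NaiveBayes:
    """Multinomial naive Bayes over bag-of-words counts, with add-one smoothing."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(bag_of_words(doc))
        self.vocab = set().union(*self.word_counts.values())
        return self

    def log_posterior(self, bag, c):
        """Unnormalized log p(c | doc) = log p(c) + sum_w count(w) * log p(w | c)."""
        total = sum(self.word_counts[c].values())
        score = math.log(self.doc_counts[c] / sum(self.doc_counts.values()))
        for word, count in bag.items():
            if word in self.vocab:                      # ignore words never seen in training
                p = (self.word_counts[c][word] + 1) / (total + len(self.vocab))
                score += count * math.log(p)
        return score

    def predict(self, doc):
        bag = bag_of_words(doc)
        return max(self.classes, key=lambda c: self.log_posterior(bag, c))

docs = [
    "Terry scored a header to lift Chelsea to victory",
    "Ching scored in overtime and the Dynamo advanced",
    "the Senate bill is stalled while budget analysts assess costs",
    "the House passed the bill with two votes to spare",
]
labels = ["sports", "sports", "politics", "politics"]
model = NaiveBayes().fit(docs, labels)

query = "Chelsea scored in overtime"
shuffled = query.split()
random.Random(0).shuffle(shuffled)
# Same bag of words, hence the same prediction.
assert bag_of_words(" ".join(shuffled)) == bag_of_words(query)
print(model.predict(query), model.predict(" ".join(shuffled)))   # sports sports
```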
Document classification analogy
but to on Darren awarded Fletcher advanced Ashley lift over to 1-0 scored advantage Major for lead 76 Chelsea Premier to Terry League John Houston the kick Chelsea took United points. free minutes fouled United been frustrated overtime Manchester six a when League a extend victory Ching 5 and to and Western Manchester Brian Cole. Dynamo Soccer’s by a minutes, Blues’ the had header into of scored ...
the margin how In on majority 39 costs. with measure slim overall — to like opposed suggests challenges pay even substantially stalled government-run where the issues votes it the where bill for spare, from bill and a Senate, analysts coverage, in — the Democrats greater differ two proposals budget its House assess while Senate, to in just the leader and the plan passed the is House-passed The ...
??? ???
Visual words (discretization)
Words are discrete; visual features are typically continuous…
Discretization via clustering/vector quantization
Visual words
[Sivic et al., ‘05]
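A sketch of the discretization step, assuming each image has already been turned into a set of local descriptor vectors (rows of a NumPy array). The slide only says "clustering/vector quantization"; k-means with k-means++ seeding is used here as one standard choice. Each cluster center acts as a visual word, and an image becomes a normalized histogram of visual-word counts, analogous to a bag of words for text.

```python
import numpy as np

def quantize(descriptors, centers):
    """Vector quantization: index of the nearest center (visual word) for each descriptor."""
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def build_vocabulary(descriptors, k, n_iters=20, seed=0):
    """k-means (Lloyd's algorithm) with k-means++ seeding; returns k visual words."""
    X = np.asarray(descriptors, dtype=float)
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(1, k):                                  # k-means++: favor far-away points
        d2 = ((X[:, None, :] - np.array(centers)[None]) ** 2).sum(axis=2).min(axis=1)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    centers = np.array(centers)
    for _ in range(n_iters):
        labels = quantize(X, centers)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:                           # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return centers

def bag_of_visual_words(descriptors, centers):
    """Normalized histogram of visual-word occurrences for one image."""
    words = quantize(np.asarray(descriptors, dtype=float), centers)
    counts = np.bincount(words, minlength=len(centers))
    return counts / counts.sum()

# Toy 2-D "descriptors" drawn around three well-separated modes.
rng = np.random.default_rng(1)
blobs = [rng.normal(loc=m, scale=0.1, size=(50, 2)) for m in ([0, 0], [5, 5], [0, 5])]
vocab = build_vocabulary(np.vstack(blobs), k=3)
print(bag_of_visual_words(blobs[0], vocab))   # all mass on the word for the first mode
```

Real systems use high-dimensional descriptors and vocabularies of hundreds or thousands of words; the 2-D toy data only keeps the example easy to check.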
Object classification with bag of words
[Sivic et al., ‘05]
Performance on the Caltech 101 dataset with a linear SVM on bag-of-words vectors:
[Figure: example results for Faces, Airplanes, Cars]
[Csurka et al., ‘04]
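The slide reports a linear SVM on bag-of-words vectors but does not say how it was trained. As a hedged sketch, the code below swaps in Pegasos-style stochastic subgradient descent on the regularized hinge loss; `lam` (the regularization strength), the epoch count, and the toy three-word histograms are illustrative assumptions.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, n_epochs=100, seed=0):
    """Primal linear SVM via Pegasos-style stochastic subgradient descent.
    Minimizes lam/2 * ||w||^2 + mean_i max(0, 1 - y_i * (w . x_i)), labels y_i in {-1, +1}.
    No bias term, for brevity; append a constant feature to X if one is needed."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    t = 0
    for _ in range(n_epochs):
        for i in rng.permutation(len(X)):
            t += 1
            eta = 1.0 / (lam * t)                   # decreasing step size
            violated = y[i] * (X[i] @ w) < 1        # inside the margin or misclassified
            w *= 1.0 - eta * lam                    # step on the regularizer
            if violated:
                w += eta * y[i] * X[i]              # subgradient step on the hinge loss
    return w

def predict(w, X):
    return np.where(X @ w >= 0, 1, -1)

# Toy bag-of-visual-words histograms: class +1 uses word 0 a lot, class -1 uses word 2.
X = np.array([[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.9, 0.05, 0.05],
              [0.1, 0.1, 0.8], [0.2, 0.1, 0.7], [0.05, 0.15, 0.8]])
y = np.array([1, 1, 1, -1, -1, -1])
w = train_linear_svm(X, y)
print(predict(w, X))
```

A linear model keeps both training and testing cheap: classifying an image costs one dot product between its histogram and `w`.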
Object Detection Problem
Detection: Locate all the faces in this image.
Classification: Is this a face, or not a face?
Face detection via a series of classifications (a.k.a. sliding window brain damage)
[Figure: sliding window detection results, with a false detection and missed faces marked]
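A single-scale sketch of detection as a series of classifications. `score_fn` is a hypothetical stand-in for any trained window classifier (for example, a linear SVM on a window descriptor); real detectors also scan an image pyramid to handle scale. Because neighboring windows around a true object all score highly, a greedy non-maximum suppression step (a common post-processing choice, not something the slide specifies) keeps only the locally best window.

```python
import numpy as np

def sliding_window_detect(image, score_fn, win=(24, 24), stride=4, threshold=0.0):
    """Classify every window position; return (score, row, col) for windows above threshold."""
    h, w = win
    detections = []
    for r in range(0, image.shape[0] - h + 1, stride):
        for c in range(0, image.shape[1] - w + 1, stride):
            s = float(score_fn(image[r:r + h, c:c + w]))
            if s > threshold:
                detections.append((s, r, c))
    return detections

def overlap_iou(a, b, win):
    """Intersection-over-union of two equally sized windows given as (score, row, col)."""
    h, w = win
    inter = max(0, h - abs(a[1] - b[1])) * max(0, w - abs(a[2] - b[2]))
    return inter / (2 * h * w - inter)

def non_max_suppression(detections, win, max_iou=0.3):
    """Greedy NMS: keep the best-scoring window, drop any window overlapping a kept one too much."""
    kept = []
    for det in sorted(detections, reverse=True):
        if all(overlap_iou(det, k, win) <= max_iou for k in kept):
            kept.append(det)
    return kept

# Toy example: a bright 24x24 "object" on a dark background, and a hypothetical
# classifier that scores a window by its mean intensity.
image = np.zeros((64, 64))
image[20:44, 28:52] = 1.0
raw = sliding_window_detect(image, lambda patch: patch.mean() - 0.5)
print(len(raw), non_max_suppression(raw, win=(24, 24)))
```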
The need for… capturing spatial relationships
One approach: Create a more descriptive (complicated) feature
Histograms of Oriented Gradients (HOG) features
[Figure: original image -> estimated image gradients (gradient magnitudes, gradient orientations) -> image subdivided into cells -> histogrammed gradients in each cell]
[Dalal & Triggs, ‘06]
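A simplified sketch of the HOG pipeline in the figure, assuming a grayscale NumPy image. The 8x8-pixel cells, 9 unsigned orientation bins, and 2x2-cell blocks are commonly used settings; the hard (non-interpolated) binning and plain L2 block normalization are simplifications of the full descriptor. For a 64-wide, 128-tall person window this yields 7 x 15 blocks of 36 values, i.e. 3780 dimensions.

```python
import numpy as np

def hog_cells(image, cell=8, n_bins=9):
    """Per-cell histograms of unsigned gradient orientation, weighted by gradient magnitude."""
    image = np.asarray(image, dtype=float)
    gy, gx = np.gradient(image)                      # estimated image gradients
    mag = np.hypot(gx, gy)                           # gradient magnitudes
    ang = np.mod(np.arctan2(gy, gx), np.pi)          # gradient orientations in [0, pi)
    bins = np.minimum((ang / np.pi * n_bins).astype(int), n_bins - 1)
    n_r, n_c = image.shape[0] // cell, image.shape[1] // cell
    hist = np.zeros((n_r, n_c, n_bins))
    for i in range(n_r):                             # histogram the gradients in each cell
        for j in range(n_c):
            rows = slice(i * cell, (i + 1) * cell)
            cols = slice(j * cell, (j + 1) * cell)
            hist[i, j] = np.bincount(bins[rows, cols].ravel(),
                                     weights=mag[rows, cols].ravel(), minlength=n_bins)
    return hist

def hog_descriptor(image, cell=8, n_bins=9, block=2, eps=1e-6):
    """Concatenate overlapping block x block groups of cells, each L2-normalized."""
    h = hog_cells(image, cell, n_bins)
    feats = []
    for i in range(h.shape[0] - block + 1):
        for j in range(h.shape[1] - block + 1):
            v = h[i:i + block, j:j + block].ravel()
            feats.append(v / np.sqrt(np.sum(v ** 2) + eps ** 2))
    return np.concatenate(feats)

window = np.random.default_rng(0).random((128, 64))   # rows x cols of a 64-wide, 128-tall window
print(hog_cells(window).shape, hog_descriptor(window).shape)   # (16, 8, 9) (3780,)
```

Normalizing each block over its neighboring cells is what gives the descriptor its robustness to local contrast changes, while the cell grid keeps coarse spatial layout.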
People Tracking with HOG features
[Figure: tracking results, annotated “better”]
Modeling Spatial Relationships with Deformable Part Based Models
Spring-based models: Parts prefer low-energy configurations
[Fischler & Elschlager, ’73], [Ramanan et al., ’07], [Felzenszwalb et al., ’05, ’09], [Kumar et al., ’09]
Parts-Based Model
Vertices: local appearance
Edges: spatial relationships
Goal: Assign model parts to image regions, preserving both local appearance and spatial relationships
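Parts-based models: Inference Problem
Inference problem: What is the best-scoring assignment f of model parts to image regions?
f^* = argmax_f [ sum_{v in V} a_v(f_v) + sum_{(u,v) in E} s_{uv}(f_u, f_v) ]
Here a_v(f_v) is the local appearance term (how well part v matches the image at f_v) and s_{uv}(f_u, f_v) is the pairwise spatial relationship term.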
Inference is NP-hard for general graphs
For trees, belief propagation (max-product, i.e., dynamic programming) finds the exact solution in polynomial time
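For a tree-structured model the maximization can be done exactly by passing max-sum messages (max-product belief propagation in log space) from the leaves to the root and then backtracking. The sketch below assumes each part takes one of a small number of discrete locations and that `unary` and `pairwise` are given as Python functions; both are hypothetical stand-ins for the local appearance and spatial relationship terms. The cost is O(|V| L^2) for L locations, versus L^|V| for brute force.

```python
import itertools

def best_tree_assignment(n_locations, children, unary, pairwise, root=0):
    """Exact argmax_f of sum_v unary(v, f_v) + sum_{(u,v) in E} pairwise(u, v, f_u, f_v)
    on a tree, by max-sum dynamic programming. children maps a part to its child parts."""
    locs = range(n_locations)
    subtree = {}      # subtree[v][l]: best score of v's subtree when part v sits at location l
    backptr = {}      # backptr[(c, l)]: best location of child c when its parent sits at l

    def pass_up(v):
        score = [unary(v, l) for l in locs]
        for c in children.get(v, []):
            pass_up(c)
            for l in locs:
                best_lc = max(locs, key=lambda lc: pairwise(v, c, l, lc) + subtree[c][lc])
                backptr[(c, l)] = best_lc
                score[l] += pairwise(v, c, l, best_lc) + subtree[c][best_lc]
        subtree[v] = score

    pass_up(root)
    f = {root: max(locs, key=lambda l: subtree[root][l])}
    stack = [root]
    while stack:                                   # backtrack from the root to the leaves
        v = stack.pop()
        for c in children.get(v, []):
            f[c] = backptr[(c, f[v])]
            stack.append(c)
    return f, subtree[root][f[root]]

# Hypothetical 4-part model on 5 discrete 1-D positions: part 0 is the root,
# parts 1 and 2 attach to it, part 3 attaches to part 1.
children = {0: [1, 2], 1: [3]}
appearance = {0: [0, 2, 5, 1, 0], 1: [3, 0, 1, 4, 2], 2: [1, 1, 0, 6, 0], 3: [0, 3, 0, 0, 5]}
rest_offset = {1: -1, 2: 1, 3: -1}                 # preferred displacement from the parent
unary = lambda v, l: appearance[v][l]
pairwise = lambda u, v, lu, lv: -(lv - lu - rest_offset[v]) ** 2   # spring: quadratic penalty

f, score = best_tree_assignment(5, children, unary, pairwise)

edges = [(0, 1), (0, 2), (1, 3)]
def total(a):
    return sum(unary(v, a[v]) for v in range(4)) + sum(pairwise(u, v, a[u], a[v]) for u, v in edges)
brute = max(total(a) for a in itertools.product(range(5), repeat=4))
print(f, score, brute)   # the DP score matches the brute-force optimum
```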
Parts based models - Learning Problem
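Linear models: score(x) = w^T Φ(x), where Φ(x) stacks the local appearance and pairwise spatial relationship terms
Convex max-margin objective:
min_{w,ξ} (1/2)||w||^2 + C sum_i ξ_i  s.t.  y_i w^T Φ(x_i) ≥ 1 - ξ_i,  ξ_i ≥ 0  (y_i = +1 for positive, -1 for negative examples)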
Positive examples on one side
Negative examples on the other
[Kumar et al., ’09]
Learning linear models: Find weight vectors that best separate positive and negative examples, for example:
Person deformable part model
Root filter (8x8 resolution)
Part filter (4x4 resolution)
Quadratic spatial configuration model
[Felzenszwalb et al., ’09], [Ramanan et al., ’09]
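The person model above is star-shaped: a root filter plus parts, each tied to the root by a quadratic spring. Below is a brute-force sketch of scoring one root position, assuming the filter response maps have already been computed and, for simplicity, live on the same grid (the actual model evaluates parts at twice the root's resolution). The `anchors` (each part's ideal offset from the root) and the deformation weights `(a, b)` are hypothetical parameters; in practice a generalized distance transform computes the same inner maxima for all root positions much faster.

```python
import numpy as np

def star_model_score(root_resp, part_resps, anchors, deform, root_pos):
    """Root response at root_pos plus, for each part, the best trade-off between the
    part's filter response and a quadratic penalty for moving away from its anchor."""
    r0, c0 = root_pos
    total = float(root_resp[r0, c0])
    rows, cols = np.indices(root_resp.shape)
    for resp, (ar, ac), (a, b) in zip(part_resps, anchors, deform):
        dr = rows - (r0 + ar)                     # vertical displacement from the anchor
        dc = cols - (c0 + ac)                     # horizontal displacement from the anchor
        total += float(np.max(resp - (a * dr**2 + b * dc**2)))
    return total

# One root and one part on a 10x10 grid (hypothetical responses).
root_resp = np.zeros((10, 10))
root_resp[4, 4] = 2.0
part_resp = np.zeros((10, 10))
part_resp[6, 5] = 3.0                             # strong part evidence, one column off its anchor
loose = star_model_score(root_resp, [part_resp], anchors=[(2, 0)], deform=[(0.5, 0.5)], root_pos=(4, 4))
stiff = star_model_score(root_resp, [part_resp], anchors=[(2, 0)], deform=[(10.0, 10.0)], root_pos=(4, 4))
print(loose, stiff)   # 4.5 2.0: a stiff spring ignores the displaced part evidence
```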
Part II: Scene Understanding with Multiple Classes
Goal: Predict many different objects in a single image
[Figure: scene labeled Car, Fire Hydrant, Building, Fence, Sidewalk, Tree]
Wait...
• What’s wrong with just learning a different sliding window classifier for each object type in the world?
The image as seen from an object detector’s point of view
Relationships between objects make recognition possible
Antonio Torralba. The Context Challenge. http://web.mit.edu/torralba/www/carsAndFacesInContext.html
Objects as the “Parts” of a Scene
Key Challenge in Scene Understanding: Modeling relationships between objects from different categories
[Figure: deformable part model vs. scene model]
Fixed-Extent “Things” vs. Free-Form “Stuff”
[Figure: scene labeled Building, Fence, Sidewalk, Car, Fire Hydrant, Tree]
Things have a well-defined shape. A part of a car is not a car.
Stuff is free-form and mostly defined by color/texture. A part of a building is still a building.
3 Types of Scene Models
Pixel-based Window-based Segment-based
Pixel-based Scene Understanding
Unable to reason about instances
Only limited notion of context
TextonBoost: Joint Appearance, Shape and Context Modeling for Multi-class Object Recognition and Segmentation. Shotton et al. ECCV 2006
Produces Segmentation
Works well on “stuff”
Pixel-wise Conditional Random Fields (TextonBoost)
• Inference: y^* = argmax_y p(y|x)
• Training: Use boosting to learn the unary potentials
• Future direction: higher-order cliques
TextonBoost: Joint Appearance, Shape and Context Modeling for Multi-class Object Recognition and Segmentation. Shotton et al. ECCV 2006
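Since p(y|x) is proportional to exp(-E(y)), finding y^* means minimizing an energy made of per-pixel unary costs plus pairwise smoothness costs. The sketch below assumes a 4-connected grid with a Potts pairwise term and unary costs supplied as an array (in TextonBoost these would come from the boosted classifier). It uses iterated conditional modes (ICM), a simple local method swapped in for illustration; ICM only finds a local optimum, and graph-cut methods are the usual stronger choice for this kind of energy.

```python
import numpy as np

def icm_labeling(unary, smoothness=1.0, n_sweeps=10):
    """Approximate argmin_y sum_i unary[i, y_i] + smoothness * sum_{i~j} [y_i != y_j]
    on a 4-connected pixel grid, by iterated conditional modes.
    unary: (H, W, K) array of per-pixel label costs (e.g., negative log-probabilities)."""
    H, W, K = unary.shape
    y = unary.argmin(axis=2)                       # start from the unary-only labeling
    for _ in range(n_sweeps):
        changed = False
        for r in range(H):
            for c in range(W):
                cost = unary[r, c].copy()
                for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    if 0 <= rr < H and 0 <= cc < W:
                        cost += smoothness * (np.arange(K) != y[rr, cc])   # Potts penalty
                best = int(cost.argmin())
                if best != y[r, c]:
                    y[r, c] = best                 # each change lowers the total energy
                    changed = True
        if not changed:                            # local minimum reached
            break
    return y

# Toy 8x8 image: left half is label 0 ("road"), right half label 1 ("sky").
truth = np.zeros((8, 8), dtype=int)
truth[:, 4:] = 1
unary = np.stack([truth != 0, truth != 1], axis=2).astype(float)
unary[3, 2] = [0.6, 0.0]                           # one pixel whose unary term is wrong
y = icm_labeling(unary)
print((unary.argmin(axis=2) != truth).sum(), (y != truth).sum())   # 1 0
```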
Window-based Scene Understanding
Often not possible to model “stuff” using windows.
Window assumption also questionable for some “things.”
Possible to model interactions between object instances.
Discriminative models for multi-class object layout. Desai et al. ICCV 2009
Object Recognition by Scene Alignment. Russell et al. NIPS 2007
Discriminative models for multi-class object layout
• Inference via Greedy Forward Search
• Training
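A sketch of greedy forward search over candidate windows, under a simplified score: each switched-on window contributes its detector score `unary[i]`, and each pair of switched-on windows contributes an interaction score `pairwise[i][j]` (for example, negative for duplicate detections of the same object, positive for objects that tend to appear together). In Desai et al.'s model the pairwise terms come from learned weights on the relative layout of the two windows; here they are simply given, hypothetical numbers.

```python
def greedy_forward_search(unary, pairwise):
    """Greedily switch on the candidate window that most increases
    score(S) = sum_{i in S} unary[i] + sum_{i<j in S} pairwise[i][j]  (pairwise symmetric),
    stopping as soon as no remaining candidate increases the score."""
    selected = []
    remaining = list(range(len(unary)))
    while remaining:
        gains = [unary[i] + sum(pairwise[i][j] for j in selected) for i in remaining]
        best = max(range(len(remaining)), key=gains.__getitem__)
        if gains[best] <= 0:
            break
        selected.append(remaining.pop(best))
    return selected

# Candidates: 0 = person (strong), 1 = near-duplicate of the person, 2 = weak bicycle.
unary = [1.0, 0.9, -0.2]
pairwise = [[0.0, -2.0, 0.5],      # duplicate windows suppress each other,
            [-2.0, 0.0, 0.5],      # while a person next to a bicycle is rewarded
            [0.5, 0.5, 0.0]]
print(greedy_forward_search(unary, pairwise))   # [0, 2]
```

With all pairwise scores set to zero the same search would keep the duplicate and drop the bicycle, which is the behavior of independent per-class detectors.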
Window-based results
Region-Based Scene Understanding
Use a segmentation algorithm to extract stable regions
Use a CRF to label those segments
Problem: Hard to get object segments.
Problem: Inference is difficult for fully connected models.
Region-Based CRF
• Training: Bag of words with a nearest-neighbor classifier
• Maximum likelihood training of pairwise potentials
Object Categorization using Co-Occurrence, Location and Appearance. Galleguillos et al. CVPR 2008.
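A toy sketch of why co-occurrence context helps and why fully connected inference is expensive. The slide mentions maximum-likelihood training of the pairwise potentials; that training is not reproduced here. As a simple stand-in, the pairwise potential is the log of a smoothed co-occurrence frequency counted from training annotations, and inference enumerates every joint labeling, which is only feasible for a handful of segments. All labels and scores below are hypothetical.

```python
import itertools
import math

def cooccurrence_potentials(training_label_sets, labels, alpha=1.0):
    """Pairwise potential = log of the smoothed fraction of training images containing both labels."""
    n = len(training_label_sets)
    return {(a, b): math.log((sum(a in s and b in s for s in training_label_sets) + alpha)
                             / (n + 2 * alpha))
            for a in labels for b in labels}

def label_segments(unary, potentials, labels, weight=1.0):
    """Exact MAP labeling of a fully connected CRF over segments by exhaustive search.
    unary[s][label] is a log-score for labeling segment s; there are len(labels) ** n_segments
    joint labelings, which is why exact inference is only feasible for a few segments."""
    n = len(unary)
    def score(assignment):
        total = sum(unary[s][assignment[s]] for s in range(n))
        total += weight * sum(potentials[(assignment[i], assignment[j])]
                              for i in range(n) for j in range(i + 1, n))
        return total
    return max(itertools.product(labels, repeat=n), key=score)

labels = ["cow", "grass", "boat", "water"]
training = [{"cow", "grass"}] * 3 + [{"boat", "water"}] * 3       # toy training annotations
pots = cooccurrence_potentials(training, labels)
# Segment 0 clearly looks like grass; segment 1 is ambiguous but slightly prefers "boat".
unary = [{"cow": -5.0, "grass": 0.0, "boat": -5.0, "water": -5.0},
         {"cow": -1.0, "grass": -5.0, "boat": -0.9, "water": -5.0}]
print(label_segments(unary, pots, labels, weight=0.0))   # ('grass', 'boat'): no context
print(label_segments(unary, pots, labels, weight=1.0))   # ('grass', 'cow'): context fixes it
```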
Spatial Relations
Segmentation-Based Results
[Figure panels: Input image | No context | With context]
Object Categorization using Co-Occurrence, Location and Appearance. Galleguillos et al. CVPR 2008.
Model Granularity vs. Object Type
Object type \ Granularity     Pixels   Windows   Regions
Things (car, cow, person)     :-(      :-)       :-/
Stuff (road, sky, tree)       :-)      :-(       :-)
Scene Understanding Recap
• Rich object-object interactions are important for scene understanding.
• Different underlying assumptions (pixel vs. window vs. region) are better suited for different types of objects (“stuff” vs. “things”)
• Many of the techniques for single class object recognition (e.g., part based models) are relevant for scene understanding
Thanks!
Image Classification
Sliding Window based Object Detection
Modeling Spatial Relationships between parts
Modeling Spatial Relationships between objects