TRANSCRIPT
Stanford CS223B Computer Vision, Winter 2006
Lecture 14: Object Detection and
Classification Using Machine Learning
Gary Bradski, Intel, StanfordCAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado
“Who will be strong and stand with me? Beyond the barricade, Is there a world you long to see?”
-- Enjolras, "Do You Hear the People Sing?", Les Misérables
This guy is wearing a haircut called a "Mullet"
Fast, accurate and general object recognition …
Find the Mullets…
Rapid Learning and Generalization
Approaches to Recognition
A 2x2 layout along two axes – features (local vs. global) and relations (geometric vs. non-geometric) – with example approaches:
– Patches / Ullman
– Histograms / Schiele
– HMAX / Poggio
– Constellation / Perona
– Eigen Objects / Turk
– Shape models
– MRF / Freeman, Murphy
We'll see a few of these …
Eigenfaces: Find a new coordinate system that best captures the scatter of the data. Eigenvectors point in the directions of scatter, ordered by the magnitude of the eigenvalues. We can typically prune the number of eigenvectors to a few dozen.
Global
Eigenfaces, the algorithm
The database: each of the M training faces is flattened into a column vector of length N^2, e.g.
  a = (a_1, a_2, ..., a_{N^2})^T,  b = (b_1, ..., b_{N^2})^T,  ...,  h = (h_1, ..., h_{N^2})^T
[slide credit: Alexander Roth]
Assumptions: square images with W = H = N; M is the number of images in the database; P is the number of persons in the database.
Global
Eigenfaces, the algorithm
We compute the average face
  m = (1/M) (a + b + ... + h),  with M = 8 here; componentwise m_i = (a_i + b_i + ... + h_i) / M for i = 1 ... N^2.
Then subtract it from the training faces:
  a_m = a - m,  b_m = b - m,  c_m = c - m,  d_m = d - m,
  e_m = e - m,  f_m = f - m,  g_m = g - m,  h_m = h - m
[slide credit: Alexander Roth]
Global
Eigenfaces, the algorithm
Now we build the matrix A, which is N^2 by M:
  A = [a_m  b_m  c_m  d_m  e_m  f_m  g_m  h_m]
and the covariance matrix, which is N^2 by N^2:
  C = A A^T
[slide credit: Alexander Roth]
Find the eigenvalues of the covariance matrix C = A A^T
– The matrix is very large
– The computational effort is very big
We are interested in at most M eigenvalues
– We can reduce the dimension of the matrix
Global
Eigenvalue Theorem
Define C = A A^T, of dimension N^2 by N^2, and L = A^T A, of dimension M by M (e.g., 8 by 8).
Let v be an eigenvector of L:  L v = lambda v.  Then A v is an eigenvector of C:  C (A v) = lambda (A v).
Proof:  C (A v) = (A A^T)(A v) = A (A^T A v) = A (L v) = A (lambda v) = lambda (A v).
[slide credit: Alexander Roth]
Global
This vast dimensionality reduction is what makes the whole thing work.
Eigenfaces, the algorithm
Compute another matrix, which is M by M:
  L = A^T A
Find the M eigenvalues and eigenvectors of L
– Eigenvectors of C and L are equivalent
Build matrix V from the eigenvectors of L
[slide credit: Alexander Roth]
Eigenvectors of C are linear combinations of the image space with the eigenvectors of L:
  U = A V
Eigenvectors represent the variation in the faces.
Global
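As a concrete illustration of the steps above, here is a minimal NumPy sketch of eigenfaces training using the small L = A^T A matrix; the function and variable names (train_eigenfaces, faces, mean_face) are illustrative, not from the slides.

```python
import numpy as np

def train_eigenfaces(faces, k=None):
    """faces: M x N^2 array, one flattened training face per row (illustrative names)."""
    mean_face = faces.mean(axis=0)              # the average face m
    A = (faces - mean_face).T                   # N^2 x M matrix of centered faces
    L = A.T @ A                                 # small M x M matrix instead of N^2 x N^2
    eigvals, V = np.linalg.eigh(L)              # eigenvectors of L
    order = np.argsort(eigvals)[::-1]           # sort by decreasing eigenvalue
    V = V[:, order[:k] if k else order]
    U = A @ V                                   # eigenvectors of C = A A^T (the eigenfaces)
    U /= np.linalg.norm(U, axis=0)              # normalize each eigenface
    return mean_face, U
```

Projections of the training faces would then be Omega_i = U.T @ (face_i - mean_face), as on the next slide.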
Eigenfaces, the algorithm
Compute for each face its projection onto the face space:
  Omega_1 = U^T a_m,  Omega_2 = U^T b_m,  ...,  Omega_8 = U^T h_m
Compute the between-class threshold:
  theta = (1/2) max{ ||Omega_i - Omega_j|| }   for i, j = 1 ... M
[slide credit: Alexander Roth]
Global
Example
Photobook, MIT
[Note: sharper]
Example set Eigenfaces
Normalized Eigenfaces
Global
Eigenfaces, the algorithm in use To recognize a face, subtract the average face from it
  r_m = r - m = (r_1 - m_1, r_2 - m_2, ..., r_{N^2} - m_{N^2})^T
[slide credit: Alexander Roth]
Compute its projection onto the face space:
  Omega = U^T r_m
Compute the distance in the face space between the face and all known faces:
  epsilon_i^2 = ||Omega - Omega_i||^2   for i = 1 ... M
Distinguish between:
– It is not a face:   the distance to the face space, ||r_m - U Omega||, exceeds the threshold theta
– It is a new face:   ||r_m - U Omega|| < theta  and  epsilon_i >= theta for all i = 1, ..., M
– It is a known face: ||r_m - U Omega|| < theta  and  min{epsilon_i} < theta
Global
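A minimal sketch of the recognition step just described, assuming mean_face, the eigenface matrix U, and the known projections come from a training sketch like the one above; using a single threshold theta for both tests is a simplification.

```python
import numpy as np

def recognize(r, mean_face, U, known_omegas, theta):
    """Classify a flattened image r against known face projections (illustrative sketch)."""
    r_m = r - mean_face                                    # subtract the average face
    omega = U.T @ r_m                                      # projection onto the face space
    dffs = np.linalg.norm(r_m - U @ omega)                 # distance from the face space
    dists = np.linalg.norm(known_omegas - omega, axis=1)   # distances to the known faces
    if dffs > theta:
        return "not a face"
    if dists.min() < theta:
        return "known face #%d" % dists.argmin()
    return "new face"
```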
Beyond uses in recognition, Eigen “backgrounds” can be very effective for background subtraction.
Eigenfaces, the algorithm: problems with eigenfaces – spurious "scatter"
– Different illumination
– Different head pose
– Different alignment
– Different facial expression
[slide credit: Alexander Roth]
Fisherfaces may beat eigenfaces: developed in 1997 by P. Belhumeur et al., based on Fisher's LDA. Faster than eigenfaces in some cases, has lower error rates, works well even under different illumination, and works well even with different facial expressions.
Global
Global/local feature mix
[image credit: Kevin Murphy]
Global-noGeo
Global works OK, still used, but local now seems to outperform.
Recent mix of local and global:
– Use global features to bias local features with no internal geometric dependencies: Murphy, Torralba & Freeman (03)
Use local features to find objects [Global-noGeo]
[Figure: local-feature pipeline – image convolved (*) with a filter bank, normalized correlation with a patch, Gaussian weighting within the object bounding box; training data: x positive, O negative]
Global feature: Back to neural nets: Propagate Mixture Density Networks*
[Figure: final output vs. iteration]
Uses "boosted random fields" to learn graph structure
[slide credit: Kevin Murphy]
Global-noGeo
* C. M. Bishop. Mixture density networks. Technical Report NCRG 4288, Neural ComputingResearch Group, Department of Computer Science, Aston University, 1994
Features used: steerable pyramid transform using 4 orientations and 2 scales; the image is divided into a 4x4 grid and the average energy is computed in each channel, yielding 128 features; PCA down to 80.
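To make the arithmetic concrete (4 orientations x 2 scales = 8 channels, averaged over a 4x4 grid gives 128 numbers), here is a hedged sketch of such a global context feature; it substitutes simple oriented gradient energy for the steerable pyramid, so only the pooling layout follows the slide, and PCA to 80 dimensions would be applied afterwards.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def global_context_feature(img):
    """128-D global feature: 4 orientations x 2 scales, mean energy on a 4x4 grid (sketch)."""
    img = np.asarray(img, dtype=float)
    feats = []
    for sigma in (1, 2):                                        # 2 scales via blurring
        gy, gx = np.gradient(gaussian_filter(img, sigma))
        for theta in (0, np.pi / 4, np.pi / 2, 3 * np.pi / 4):  # 4 orientations
            resp = np.abs(np.cos(theta) * gx + np.sin(theta) * gy)  # oriented energy proxy
            h, w = resp.shape
            for i in range(4):                                  # 4x4 spatial grid
                for j in range(4):
                    cell = resp[i * h // 4:(i + 1) * h // 4, j * w // 4:(j + 1) * w // 4]
                    feats.append(cell.mean())                   # average energy per cell
    return np.asarray(feats)                                    # 2 * 4 * 16 = 128 features
```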
Example of context focus
The algorithm knows where to focus for objects
Global-noGeo
[image credit: Kevin Murphy]
Results: performance is boosted by knowing context
Global-noGeo
[image credit: Kevin Murphy]
Completely Local: Color Histograms. Swain and Ballard '91 took the normalized r,g,b color histogram of objects:
and noted the tolerance to 3D rotation, partial occlusions etc:
Local-noGeo
[image credit: Swain & Ballard]
Color Histogram Matching Objects were recognized based on their histogram intersection:
Yielding excellent results over 30 objects:
The problem is, color varies markedly with lighting …
Local-noGeo
[image credit: Swain & Ballard]
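A minimal sketch of Swain & Ballard-style matching on a normalized chromaticity histogram; the bin count and the r,g normalization details are assumptions for illustration.

```python
import numpy as np

def rg_histogram(img, bins=16):
    """Normalized r,g chromaticity histogram of an H x W x 3 image (illustrative sketch)."""
    rgb = img.reshape(-1, 3).astype(float)
    s = rgb.sum(axis=1) + 1e-8
    r, g = rgb[:, 0] / s, rgb[:, 1] / s            # chromaticities, robust to overall brightness
    hist, _, _ = np.histogram2d(r, g, bins=bins, range=[[0, 1], [0, 1]])
    return hist / hist.sum()                       # normalize to sum to 1

def histogram_intersection(h1, h2):
    """Match score in [0, 1]: sum of elementwise minima of two normalized histograms."""
    return np.minimum(h1, h2).sum()
```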
Local-noGeo
Schiele and Crowley used derivative-type features instead:
And a probabilistic matching rule:
Local Feature Histogram Matching
[image credit: Schiele & Crowley]
• For multiple objects:
Again with impressive performance results, much more tolerant to lighting:
Problem is: Histograms suffer exponential blow up with number of features
Local Feature Histogram Results [Local-noGeo]
[image credit: Schiele & Crowley]
30 of 100 objects
Local features, for example:– Lowe’s SIFT– Malik’s Shape Context– Poggio’s HMAX– von der Malsburg’s Gabor Jets– Yokono’s Gaussian Derivative Jets
Adding patches thereof seems to work great, but they are of high dimensionality.
Idea: Encode in Hierarchy: – Overview some techniques...
Local Features
Convolutional Neural Networks – Yann LeCun
Broke all the HIPs (Human Interaction Proofs) from Yahoo, MSN, E-Bay …
Local-Hierarchy
[image credit: LeCun]
Fragment Based Hierarchy Shimon Ullman
Top down and bottom up hierarchy
http://www.wisdom.weizmann.ac.il/~vision/research.html See also Perona’s group work on hierarchical feature models of objects http://www.vision.caltech.edu/html-files/publications.html
Local-Hierarchy
[image credit: Ullman et al]
Perona's Constellation Model
From: Rob Fergus http://www.robots.ox.ac.uk/%7Efergus/
Feature detector results: Bayesian decision based. The shape model: the mean location is indicated by the cross, with the ellipse showing the uncertainty in location; the number by each part is the probability of that part being present.
The appearance model closest to the mean of the appearance density of each part.
Recognition result:
See also Perona's group work on hierarchical feature models of objects: http://www.vision.caltech.edu/html-files/publications.html
Local-Hierarchy
[image credit: Perona et al]
Jojic and Frey
Scene description as hierarchy of sprites
Local-Hierarchy
[image credit: Jojic et al]
Jeff Hawkins, Dileep George Modular hierarchical spatial temporal memory
[Figure: hierarchy and module diagrams; results – templates, good classifications, bad classifications; in (D), out (E)]
Local-Hierarchy
[image credit: George, Hawkins]
Peter Bock’s ALISAAn explicit Cognitive Model
Local-Hierarchy
Histogram based
[image credit: Bock et al]
ALISA Labeling 2 Scenes [Local-Hierarchy]
[image credit: Bock et al]
Local-Hierarchy
HMAX from the "Standard Model" – Maximilian Riesenhuber and Tomaso Poggio
Basic building blocks in the object recognition hierarchy
Modulated by attention
Pick this up momentarily; first, a little on trees and boosting … [image credit: Riesenhuber et al]
Machine Learning – Many Techniques; Libraries from Intel
Bayesian Networks library: PNL
Statistical Learning library: MLL
Techniques span modeless vs. model-based and unsupervised vs. supervised (key: optimized / implemented / not implemented), including:
– K-means, decision trees, agglomerative clustering, spectral clustering, K-NN, dependency nets, boosted decision trees
– Multi-layer perceptron, SVM, BayesNets (classification, parameter fitting, inference, structure learning), tree distributions
– Kernel density estimation, histogram density estimation, PCA, physical models, influence diagrams, logistic regression
– Kalman filter, HMM, adaptive filters, radial basis functions, naïve Bayes, ARTMAP, Gaussian fitting, associative nets, ART, Kohonen map
– Random forests, MART, CART, diagnostic Bayesnets
(focus)
Machine Learning
INPUT → f → OUTPUT
Example uses of prediction:
– Insurance risk prediction
– Parameters that impact yields
– Gene classification by function
– Topics of a document
– . . .
Find a function that describes given data and predicts unknown data.
[Figure: fits of y = f(x) – underfit, just right, overfit]
Learn a model/function that maps input to output.
Specific example: prediction, using a decision tree.
Binary Recursive Decision Trees – Leo Breiman's "CART"*
Data set
[Figure: maximal-purity splits of y = f(x); perfect purity, but… overfit (vs. underfit)]
*Classification And Regression Tree
At each level: find the variable (predictor) and its threshold
– that splits the data into 2 groups
– with maximal purity within each group
All variables/predictors are considered at every level.
Data of different types, each containing a vector of "predictors"
Data set: just right
Prune to avoid overfitting using a complexity-cost measure.
Binary Recursive Decision Trees – Leo Breiman's "CART"*
At each level: find the variable (predictor) and its threshold
– that splits the data into 2 groups
– with maximal purity within each group
All variables/predictors are considered at every level.
[Figure: y = f(x), overfit]
Consider a Face Detector via Decision Stumps
Data set
maximal purity splits: Thresh = N
For each rectangle combination region: Find the threshold
– That splits the data into 2 groups (face, non-face)
– With maximal purity within each group
Face and non-face data that the features can be tried on.
A bar detector works well for the "nose" – a face-detecting stump. It doesn't detect cars.
Consider a tree "stump" – just one split. It selects the single most discriminative feature …
See the Appendix for Viola and Jones's feature generator: Integral Images
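A minimal sketch of training one decision stump by exhaustive threshold search over a single feature column, with a weighted error so it can be reused inside boosting; the names and the brute-force search are illustrative.

```python
import numpy as np

def train_stump(feature_values, labels, weights):
    """Best threshold/polarity for one feature column; labels are 0/1, weights sum to 1."""
    best = (np.inf, None, 1)                               # (error, threshold, polarity)
    for thresh in np.unique(feature_values):
        for polarity in (1, -1):
            pred = (polarity * feature_values >= polarity * thresh).astype(int)
            err = weights[pred != labels].sum()            # weighted misclassification error
            if err < best[0]:
                best = (err, thresh, polarity)
    return best                                            # the maximal-purity split
```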
We use “Boosting” to Select a “Forest of Stumps”
Given example images (x_1, y_1), ..., (x_n, y_n) where y_i = 0, 1 for negative and positive examples respectively.
Initialize weights w_{1,i} = 1/(2m), 1/(2l) for training example i, where m and l are the number of negatives and positives respectively.
For t = 1 ... T:
  1) Normalize the weights so that w_t is a distribution.
  2) For each feature j, train a classifier h_j and evaluate its error epsilon_j with respect to w_t.
  3) Choose the classifier h_t with the lowest error epsilon_t.
  4) Update the weights: w_{t+1,i} = w_{t,i} beta_t^{1 - e_i}, where e_i = 0 if x_i is classified correctly, 1 otherwise, and beta_t = epsilon_t / (1 - epsilon_t).
The final strong classifier is:
  h(x) = 1 if sum_{t=1}^{T} alpha_t h_t(x) >= (1/2) sum_{t=1}^{T} alpha_t, and 0 otherwise, where alpha_t = log(1 / beta_t).
Each stump is a selected feature plus a split threshold
Gentle Boost:
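A compact sketch of the discrete AdaBoost loop above (not the GentleBoost variant), selecting one stump per round from a pool of feature columns; train_stump is the illustrative helper sketched earlier, and the feature-matrix layout is an assumption.

```python
import numpy as np

def adaboost(features, labels, T):
    """features: n x J matrix of stump inputs; labels are 0/1 (illustrative sketch)."""
    m, l = (labels == 0).sum(), (labels == 1).sum()
    w = np.where(labels == 0, 1.0 / (2 * m), 1.0 / (2 * l))     # initial weights
    strong = []
    for _ in range(T):
        w = w / w.sum()                                         # 1) normalize weights
        trials = [train_stump(features[:, j], labels, w) + (j,) for j in range(features.shape[1])]
        err, thresh, pol, j = min(trials)                       # 2)-3) lowest weighted error
        beta = err / (1 - err + 1e-12)
        pred = (pol * features[:, j] >= pol * thresh).astype(int)
        w = w * beta ** (pred == labels)                        # 4) downweight correct examples
        strong.append((j, thresh, pol, np.log(1.0 / (beta + 1e-12))))  # alpha_t = log(1/beta_t)
    return strong

def strong_classify(strong, x):
    """Final strong classifier: weighted stump vote against half of the total alpha."""
    score = sum(a * (p * x[j] >= p * th) for j, th, p, a in strong)
    return int(score >= 0.5 * sum(a for _, _, _, a in strong))
```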
For efficient calculation, form a Detection Cascade
A boosted cascade is assembled such that at each node, non-object regions stop further processing.
If the detection rate of each node is high (~99.9%), at the cost of a high false positive rate (say 50% of everything detected as "object"), and if the nodes are independent,
then the overall detection and false positive rates are
  detect = prod_{i=1}^{n} d_i   and   falsePos = prod_{i=1}^{n} f_i.
If so, then for a 20-node cascade we get detect ≈ 0.98 and falsePos ≈ 9.6e-7.
Rapid Object Detection using a Boosted Cascade of Simple Features - Viola, Jones (2001)
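A quick check of the cascade arithmetic, plus a sketch of the early-exit evaluation that makes cascades fast; the stage classifiers are assumed to be callables returning True for "possibly object".

```python
def cascade_rates(d_i=0.999, f_i=0.5, n=20):
    """Overall rates of an n-node cascade with independent nodes: detect = d_i^n, falsePos = f_i^n."""
    return d_i ** n, f_i ** n          # ~0.98 detection and ~1e-6 false positives for the defaults

def cascade_detect(stages, window):
    """Run the nodes in order; a rejection at any node stops processing of this window (sketch)."""
    for stage in stages:
        if not stage(window):          # non-object regions exit at the first failing node
            return False
    return True
```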
Improvements to Cascade
J. Wu, J. M. Rehg, and M. D. Mullin just do one Boosting round, then select from the feature pool as needed:
Kobi Levi and Yair Weiss just used better features (gradient histograms) to cut training needs by an order of magnitude.
Let’s focus on better features and descriptors …
Viola, Jones Wu, Rehg, Mullin
[image credit: Wu et al]
The Standard Model of Visual Cortex – Biologically Motivated Features
S1 layer: Gabor filters at 4 orientations
C1 layer: local spatial max
Intermediate layer: dictionary of patches of C1
S2 layer: radial basis fit to its patch template over the whole image
C2 layer: max S2 response (.8 .4 .9 .2 .6)
Classifier (SVM, Boosting, …)
Thomas Serre, Lior Wolf and Tomaso Poggio used the model of the human visual cortex developed in Riesenhuber’s lab:
First 5 chosen features from Boosting
[image credit: Serre et al]
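A hedged sketch of the S2/C2 idea described above: each stored C1 patch is compared, via a radial-basis response, against every position of a C1-like map, and the maximum response becomes one entry of the image-level feature vector; the patch dictionary, the bandwidth sigma, and the brute-force scan are assumptions.

```python
import numpy as np

def c2_features(c1_map, patch_dict, sigma=1.0):
    """Max radial-basis (S2) response of each stored C1 patch over the whole map (sketch)."""
    H, W = c1_map.shape
    feats = []
    for patch in patch_dict:                              # dictionary of patches of C1
        ph, pw = patch.shape
        best = 0.0
        for i in range(H - ph + 1):                       # fit the patch at every position
            for j in range(W - pw + 1):
                d2 = np.sum((c1_map[i:i + ph, j:j + pw] - patch) ** 2)
                best = max(best, np.exp(-d2 / (2 * sigma ** 2)))   # S2: radial basis response
        feats.append(best)                                # C2: max S2 response
    return np.asarray(feats)                              # one entry per dictionary patch
```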
The Standard Model of Visual Cortex – Biologically Motivated Features
Results in state of the art/top performance:
[image credit: Serre et al]
Seems to handily beat SIFT features:
Yokonos’ Generalization toThe Standard Model of Visual Cortex
Used Gaussian derivatives: 3 orders x 3 scales x 4 orientations = 36 base features:
Similar to Standard Model’s Gabor base filters.
[image credit: Yokono et al]
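A sketch of the 3 orders x 3 scales x 4 orientations arithmetic, steering axis-aligned Gaussian derivatives to each orientation; the scale values and the use of scipy.ndimage are assumptions, not Yokono's implementation.

```python
import numpy as np
from math import comb
from scipy.ndimage import gaussian_filter

def directional_derivative(img, sigma, n, theta):
    """n-th order Gaussian derivative steered to angle theta (axis 0 is y, axis 1 is x)."""
    c, s = np.cos(theta), np.sin(theta)
    resp = np.zeros_like(img, dtype=float)
    for k in range(n + 1):             # binomial expansion of (c*d/dx + s*d/dy)^n
        resp += comb(n, k) * c ** (n - k) * s ** k * gaussian_filter(img, sigma, order=(k, n - k))
    return resp

def gaussian_derivative_jet(img, y, x, scales=(1, 2, 4)):
    """36-D jet at pixel (y, x): 3 orders x 3 scales x 4 orientations (scale values assumed)."""
    img = np.asarray(img, dtype=float)
    return np.asarray([directional_derivative(img, s, n, t)[y, x]
                       for s in scales
                       for n in (1, 2, 3)
                       for t in (0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)])
```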
Yokonos’ Generalization toThe Standard Model of Visual Cortex
Created a local spatial jet, oriented to the gradient at the largest scale at the center pixel:
Since the Gabor filter has ringing spatial extent, this is still approximately similar to the standard model.
[image credit: Yokono et al]
Yokonos’ Generalization toThe Standard Model of Visual Cortex
Full system:
[image credit: Yokono et al]
~S1, C1: features memorized from positive samples at Harris corner interest points.
~S2: the dictionary of learned features is measured (normalized cross correlation) against all interest points in the image.
~C2: the maximum normalized cross correlation scores are arranged in a feature vector.
Classifier: again, SVM, Boosting, …
Yokonos’ Generalization toThe Standard Model of Visual Cortex
Excellent Results:
[image credit: Yokono et al]
CBCL Database ROC curve for 1200 Stumps:
SVM with 1 to 5 training images beats other techniques:
Yokonos’ Generalization toThe Standard Model of Visual Cortex
Excellent Results:
[image credit: Yokono et al]
AIBO Dog in articulated poses:
Some features chosen:
ROC Curve:
Brash Claim
Performance is in the high 90% range under lighting, articulation, scale and 3D rotation. – The classifier inside humans is unlikely to be much more accurate.
We are not that far from raw human level performance. – By 2015 I predict.
The base classifier is embedded in a larger system that makes it more reliable:
– Attention
– Color constancy features
– Context
– Temporal filtering
– Sensor fusion
Back to Kevin Murphy: Context:
[slide credit: Kevin Murphy]
Missing
We know there is a keyboard present in this scene even if we cannot see it clearly.
We know there is no keyboard present in this scene
… even if there is one indeed.[slide credit: Kevin Murphy]
Missing: Context
Attention
Change blindness
Missing
Farm Truck
Call for a Program: Generalize Standard Model Even Further
Detect:– DOG– Harris Corner
Descriptors:– SIFT– Steerable– Gabor
Dictionary:– All descriptors– Subset– Clustered
Image Level Scoring:– Histogram– Max Correlation– Max Probability
Classifier:– SVM– Boosting– K-NN …
Embedding:– Attention, active vision– Context: Scene, 3D inference– Sensor fusion/association– Motion
Research Framework
Local
Global
Call for a Program: Generalize Standard Model Even Further
Ashutosh Saxena, Chung and Ng learned depth using local features in an MRF (similar to Kevin Murphy).
Ashutosh also has a robot picking up novel objects from local features. Together with active vision, active manipulation, context – Now is a good time for vision systems!
[image credit: Saxena et al] Apply to "Stanley II" and to STAIR
Summary: Mix local with global – Generalize Standard Model Even Further
Detect:– DOG– Harris Corner
Descriptors:– SIFT– Steerable– Gabor
Dictionary:– All descriptors– Subset– Clustered
Image Level Scoring:– Histogram– Max Correlation– Max Probability
Classifier:– SVM– Boosting– K-NN …
Embedding:– Attention, active vision– Context: Scene, 3D inference– Sensor fusion/association– Motion
Research Framework
Local
Global
Bibliography for this lecture. Papers for this lecture:
1. R. Fergus, P. Perona and A. Zisserman, "Object Class Recognition by Unsupervised Scale-Invariant Learning", CVPR 03.
2. M. Turk, A. Pentland, "Eigenfaces for Recognition", Journal of Cognitive Neuroscience, Vol. 3, No. 1, 1991.
3. Serre, T., L. Wolf and T. Poggio. Object Recognition with Features Inspired by Visual Cortex. In: Proceedings of 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Computer Society Press, San Diego, June 2005.
4. Jerry Jun Yokono & Tomaso Poggio, "Boosting a Biologically Inspired Local Descriptor for Geometry-free Face and Full Multi-view 3D Object Recognition", AI Memo 2005-023, CBCL Memo 254, July 2005.
5. J. Wu, J. M. Rehg, and M. D. Mullin, “Learning a Rare Event Detection Cascade by Direct Feature Selection” Proc. Advances in Neural Information Processing Systems 16 (NIPS*2003), MIT Press, 2004
6. J. Wu, M. D. Mullin, and J. M. Rehg, “Linear Asymmetric Classifier for Face Detection”, International Conference on Machine Learning (ICML 05), pages 993-1000, Bonn, Germany, August 2005
7. Kobi Levi and Yair Weiss, “Learning Object Detection from a Small Number of Examples: The Importance of Good Features” International Conference on Computer Vision and Pattern Recognition (CVPR) 2004.
8. P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proc. CVPR, pages 511–518, 2001.
9. B. Schiele and JL Crowley. Probabilistic object recognition using multidimensional receptive field histograms. submitted to ICPR'96
10. M. J. Swain and D. H. Ballard, "Color Indexing," International Journal of Computer Vision, vol. 7, pp. 11-32, 1991.
11. Antonio Torralba, Kevin Murphy and William Freeman , “Contextual Models for Object Detection using Boosted Random Fields ”, NIPS 2004.
12. Kevin Murphy, Antonio Torralba, Daniel Eaton, William Freeman, “Object detection and localization using local and global features”, Sicily workshop on object recognition, 2005
13. M. Riesenhuber and T. Poggio. How visual cortex recognizes objects: The tale of the standard model. The Visual Neurosciences, 2:1640–1653, 2003.
14. A. Saxena, S.H. Chung, A.Y. Ng, “Learning depth from Single Monocular Images”, NIPS 2005
Feature set generators
Backup Slides
3 rectangular feature types:
• two-rectangle feature type (horizontal/vertical)
• three-rectangle feature type
• four-rectangle feature type
Using a 24x24 pixel base detection window, with all the possible combinations of horizontal and vertical location and scale of these feature types, the full set has 49,396 features.
The motivation for using rectangular features, as opposed to more expressive steerable filters, is their extreme computational efficiency.
Paul Viola and Michael Jones www.cs.ucsd.edu/classes/fa01/cse291/ViolaJones.ppt ICCV 2001 Workshop on Statistical and Computation Theories of Vision
Integral Images -- a Feature Set Generator
Define an “Integral image” Def: The integral image at location (x,y), is the sum of the pixel values above and to the left of (x,y), inclusive.
Using the following two recurrences, where i(x,y) is the pixel value of original image at the given location and s(x,y) is the cumulative column sum, we can calculate the integral image representation of the
image in a single pass.
[Figure: integral image – origin (0,0) at top left, point (x,y); the shaded region above and to the left is the sum]
Paul Viola and Michael Jones www.cs.ucsd.edu/classes/fa01/cse291/ViolaJones.ppt ICCV 2001 Workshop on Statistical and Computation Theories of Vision
s(x,y) = s(x,y-1) + i(x,y)
ii(x,y) = ii(x-1,y) + s(x,y)
Allows rapid evaluation of rectangular features
Using the integral image representation one can compute the value of any rectangular sum in constant time.
For example, we can compute the integral sum inside rectangle D as:
ii(4) + ii(1) – ii(2) – ii(3)
As a result: two-, three-, and four-rectangular features can be computed with 6, 8 and 9 array references respectively.
Paul Viola and Michael Jones www.cs.ucsd.edu/classes/fa01/cse291/ViolaJones.ppt ICCV 2001 Workshop on Statistical and Computation Theories of Vision
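A short sketch of the recurrences and the constant-time rectangle sum; NumPy's cumulative sums implement the single pass.

```python
import numpy as np

def integral_image(img):
    """ii(x, y) = sum of pixel values above and to the left of (x, y), inclusive (one pass)."""
    return np.cumsum(np.cumsum(np.asarray(img, dtype=float), axis=0), axis=1)

def rect_sum(ii, top, left, bottom, right):
    """Rectangle sum from four array references: ii(4) + ii(1) - ii(2) - ii(3)."""
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]
    if left > 0:
        total -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total
```

On the 4x4 example below, rect_sum(integral_image(img), 1, 1, 3, 2) returns 43.0.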
Integral Image Example
Image:
  0 8 6 1
  1 5 9 0
  0 7 5 0
  2 8 9 2

Building the integral image left to right, top to bottom (can calculate in one pass):
  0  8  -  -      0  8 14  -      0  8 14 15
  1 14  -  -      1 14 29  -      1 14 29 30
  1  -  -  -      1 21 41  -      1 21 41 42
  3  -  -  -      3 31 60  -      3 31 60 63

Integral image:
  0  8 14 15
  1 14 29 30
  1 21 41 42
  3 31 60 63
Integral Image Example

Image:            Integral image:
  0 8 6 1           0  8 14 15
  1 5 9 0           1 14 29 30
  0 7 5 0           1 21 41 42
  2 8 9 2           3 31 60 63

Find the sum of the shaded region (rows 2-4, columns 2-3):
  directly:                5 + 9 + 7 + 5 + 8 + 9 = 43
  via the integral image:  60 + 0 - (14 + 3) = 43