TRANSCRIPT
Stanford CS223B Computer Vision, Winter 2006
Lecture 14: Object Detection and
Classification Using Machine Learning
Gary Bradski, Intel, StanfordCAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado
“Who will be strong and stand with me? Beyond the barricade, Is there a world you long to see?”
-- Enjolras, "Do You Hear the People Sing?", Les Misérables
This guy is wearing a haircut called a "Mullet"
Fast, accurate and general object recognition …
Find the Mullets…
Rapid Learning and Generalization
Approaches to Recognition
A 2x2 layout along two axes – features (local vs. global) and relations (geometric vs. non-geometric) – with example approaches:
– Patches / Ullman
– Histograms / Schiele
– HMAX / Poggio
– Constellation / Perona
– Eigen Objects / Turk
– Shape models
– MRF / Freeman, Murphy
We'll see a few of these …
Eigenfaces: Find a new coordinate system that best captures the scatter of the data. Eigenvectors point in the directions of scatter, ordered by the magnitude of the eigenvalues. We can typically prune the number of eigenvectors to a few dozen.
Global
Eigenfaces, the algorithm
The database: each of the M training faces is flattened into a column vector of length N^2, e.g.
  a = (a_1, a_2, ..., a_{N^2})^T,  b = (b_1, ..., b_{N^2})^T,  ...,  h = (h_1, ..., h_{N^2})^T
[slide credit: Alexander Roth]
Assumptions: square images with W = H = N; M is the number of images in the database; P is the number of persons in the database.
Global
Eigenfaces, the algorithm
We compute the average face
  m = (1/M) (a + b + ... + h),  with M = 8 here; componentwise m_i = (a_i + b_i + ... + h_i) / M for i = 1 ... N^2.
Then subtract it from the training faces:
  a_m = a - m,  b_m = b - m,  c_m = c - m,  d_m = d - m,
  e_m = e - m,  f_m = f - m,  g_m = g - m,  h_m = h - m
[slide credit: Alexander Roth]
Global
Eigenfaces, the algorithm
Now we build the matrix A, which is N^2 by M:
  A = [a_m  b_m  c_m  d_m  e_m  f_m  g_m  h_m]
and the covariance matrix, which is N^2 by N^2:
  C = A A^T
[slide credit: Alexander Roth]
Find the eigenvalues of the covariance matrix C = A A^T
– The matrix is very large
– The computational effort is very big
We are interested in at most M eigenvalues
– We can reduce the dimension of the matrix
Global
Eigenvalue Theorem
Define C = A A^T, of dimension N^2 by N^2, and L = A^T A, of dimension M by M (e.g., 8 by 8).
Let v be an eigenvector of L:  L v = lambda v.  Then A v is an eigenvector of C:  C (A v) = lambda (A v).
Proof:  C (A v) = (A A^T)(A v) = A (A^T A v) = A (L v) = A (lambda v) = lambda (A v).
[slide credit: Alexander Roth]
Global
This vast dimensionality reduction is what makes the whole thing work.
Eigenfaces, the algorithm
Compute another matrix, which is M by M:
  L = A^T A
Find the M eigenvalues and eigenvectors of L
– Eigenvectors of C and L are equivalent
Build matrix V from the eigenvectors of L
[slide credit: Alexander Roth]
Eigenvectors of C are linear combinations of the image space with the eigenvectors of L:
  U = A V
Eigenvectors represent the variation in the faces.
Global
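As a concrete illustration of the steps above, here is a minimal NumPy sketch of eigenfaces training using the small L = A^T A matrix; the function and variable names (train_eigenfaces, faces, mean_face) are illustrative, not from the slides.

```python
import numpy as np

def train_eigenfaces(faces, k=None):
    """faces: M x N^2 array, one flattened training face per row (illustrative names)."""
    mean_face = faces.mean(axis=0)              # the average face m
    A = (faces - mean_face).T                   # N^2 x M matrix of centered faces
    L = A.T @ A                                 # small M x M matrix instead of N^2 x N^2
    eigvals, V = np.linalg.eigh(L)              # eigenvectors of L
    order = np.argsort(eigvals)[::-1]           # sort by decreasing eigenvalue
    V = V[:, order[:k] if k else order]
    U = A @ V                                   # eigenvectors of C = A A^T (the eigenfaces)
    U /= np.linalg.norm(U, axis=0)              # normalize each eigenface
    return mean_face, U
```

Projections of the training faces would then be Omega_i = U.T @ (face_i - mean_face), as on the next slide.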
Eigenfaces, the algorithm
Compute for each face its projection onto the face space:
  Omega_1 = U^T a_m,  Omega_2 = U^T b_m,  ...,  Omega_8 = U^T h_m
Compute the between-class threshold:
  theta = (1/2) max{ ||Omega_i - Omega_j|| }   for i, j = 1 ... M
[slide credit: Alexander Roth]
Global
Example
Photobook, MIT
[Note: sharper]
Example set Eigenfaces
Normalized Eigenfaces
Global
Eigenfaces, the algorithm in use To recognize a face, subtract the average face from it
  r_m = r - m = (r_1 - m_1, r_2 - m_2, ..., r_{N^2} - m_{N^2})^T
[slide credit: Alexander Roth]
Compute its projection onto the face space:
  Omega = U^T r_m
Compute the distance in the face space between the face and all known faces:
  epsilon_i^2 = ||Omega - Omega_i||^2   for i = 1 ... M
Distinguish between:
– It is not a face:   the distance to the face space, ||r_m - U Omega||, exceeds the threshold theta
– It is a new face:   ||r_m - U Omega|| < theta  and  epsilon_i >= theta for all i = 1, ..., M
– It is a known face: ||r_m - U Omega|| < theta  and  min{epsilon_i} < theta
Global
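A minimal sketch of the recognition step just described, assuming mean_face, the eigenface matrix U, and the known projections come from a training sketch like the one above; using a single threshold theta for both tests is a simplification.

```python
import numpy as np

def recognize(r, mean_face, U, known_omegas, theta):
    """Classify a flattened image r against known face projections (illustrative sketch)."""
    r_m = r - mean_face                                    # subtract the average face
    omega = U.T @ r_m                                      # projection onto the face space
    dffs = np.linalg.norm(r_m - U @ omega)                 # distance from the face space
    dists = np.linalg.norm(known_omegas - omega, axis=1)   # distances to the known faces
    if dffs > theta:
        return "not a face"
    if dists.min() < theta:
        return "known face #%d" % dists.argmin()
    return "new face"
```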
Beyond uses in recognition, Eigen “backgrounds” can be very effective for background subtraction.
Eigenfaces, the algorithm: problems with eigenfaces – spurious "scatter"
– Different illumination
– Different head pose
– Different alignment
– Different facial expression
[slide credit: Alexander Roth]
Fisherfaces may beat eigenfaces: developed in 1997 by P. Belhumeur et al., based on Fisher's LDA. Faster than eigenfaces in some cases, has lower error rates, works well even under different illumination, and works well even with different facial expressions.
Global
Global/local feature mix
[image credit: Kevin Murphy]
Global-noGeo
Global works OK, still used, but local now seems to outperform.
Recent mix of local and global:
– Use global features to bias local features with no internal geometric dependencies: Murphy, Torralba & Freeman (03)
Use local features to find objects [Global-noGeo]
[Figure: local-feature pipeline – image convolved (*) with a filter bank, normalized correlation with a patch, Gaussian weighting within the object bounding box; training data: x positive, O negative]
Global feature: Back to neural nets: Propagate Mixture Density Networks*
[Figure: final output vs. iteration]
Uses "boosted random fields" to learn graph structure
[slide credit: Kevin Murphy]
Global-noGeo
* C. M. Bishop. Mixture density networks. Technical Report NCRG 4288, Neural ComputingResearch Group, Department of Computer Science, Aston University, 1994
Features used: steerable pyramid transform using 4 orientations and 2 scales; the image is divided into a 4x4 grid and the average energy is computed in each channel, yielding 128 features; PCA down to 80.
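To make the arithmetic concrete (4 orientations x 2 scales = 8 channels, averaged over a 4x4 grid gives 128 numbers), here is a hedged sketch of such a global context feature; it substitutes simple oriented gradient energy for the steerable pyramid, so only the pooling layout follows the slide, and PCA to 80 dimensions would be applied afterwards.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def global_context_feature(img):
    """128-D global feature: 4 orientations x 2 scales, mean energy on a 4x4 grid (sketch)."""
    img = np.asarray(img, dtype=float)
    feats = []
    for sigma in (1, 2):                                        # 2 scales via blurring
        gy, gx = np.gradient(gaussian_filter(img, sigma))
        for theta in (0, np.pi / 4, np.pi / 2, 3 * np.pi / 4):  # 4 orientations
            resp = np.abs(np.cos(theta) * gx + np.sin(theta) * gy)  # oriented energy proxy
            h, w = resp.shape
            for i in range(4):                                  # 4x4 spatial grid
                for j in range(4):
                    cell = resp[i * h // 4:(i + 1) * h // 4, j * w // 4:(j + 1) * w // 4]
                    feats.append(cell.mean())                   # average energy per cell
    return np.asarray(feats)                                    # 2 * 4 * 16 = 128 features
```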
Example of context focus
The algorithm knows where to focus for objects
Global-noGeo
[image credit: Kevin Murphy]
Results: performance is boosted by knowing context
Global-noGeo
[image credit: Kevin Murphy]
Completely Local: Color Histograms. Swain and Ballard '91 took the normalized r,g,b color histogram of objects:
and noted the tolerance to 3D rotation, partial occlusions etc:
Local-noGeo
[image credit: Swain & Ballard]
Color Histogram Matching Objects were recognized based on their histogram intersection:
Yielding excellent results over 30 objects:
The problem is, color varies markedly with lighting …
Local-noGeo
[image credit: Swain & Ballard]
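A minimal sketch of Swain & Ballard-style matching on a normalized chromaticity histogram; the bin count and the r,g normalization details are assumptions for illustration.

```python
import numpy as np

def rg_histogram(img, bins=16):
    """Normalized r,g chromaticity histogram of an H x W x 3 image (illustrative sketch)."""
    rgb = img.reshape(-1, 3).astype(float)
    s = rgb.sum(axis=1) + 1e-8
    r, g = rgb[:, 0] / s, rgb[:, 1] / s            # chromaticities, robust to overall brightness
    hist, _, _ = np.histogram2d(r, g, bins=bins, range=[[0, 1], [0, 1]])
    return hist / hist.sum()                       # normalize to sum to 1

def histogram_intersection(h1, h2):
    """Match score in [0, 1]: sum of elementwise minima of two normalized histograms."""
    return np.minimum(h1, h2).sum()
```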
Local-noGeo
Schiele and Crowley used derivative-type features instead:
And a probabilistic matching rule:
Local Feature Histogram Matching
[image credit: Schiele & Crowley]
• For multiple objects:
Again with impressive performance results, much more tolerant to lighting:
Problem is: Histograms suffer exponential blow up with number of features
Local Feature Histogram Results [Local-noGeo]
[image credit: Schiele & Crowley]
30 of 100 objects
Local features, for example:– Lowe’s SIFT– Malik’s Shape Context– Poggio’s HMAX– von der Malsburg’s Gabor Jets– Yokono’s Gaussian Derivative Jets
Adding patches thereof seems to work great, but they are of high dimensionality.
Idea: Encode in Hierarchy: – Overview some techniques...
Local Features
Convolutional Neural Networks – Yann LeCun
Broke all the HIPs (Human Interaction Proofs) from Yahoo, MSN, E-Bay …
Local-Hierarchy
[image credit: LeCun]
Fragment Based Hierarchy Shimon Ullman
Top down and bottom up hierarchy
http://www.wisdom.weizmann.ac.il/~vision/research.html See also Perona’s group work on hierarchical feature models of objects http://www.vision.caltech.edu/html-files/publications.html
Local-Hierarchy
[image credit: Ullman et al]
Perona's Constellation Model
From: Rob Fergus http://www.robots.ox.ac.uk/%7Efergus/
Feature detector results: Bayesian decision based. The shape model: the mean location is indicated by the cross, with the ellipse showing the uncertainty in location; the number by each part is the probability of that part being present.
The appearance model closest to the mean of the appearance density of each part.
Recognition result:
See also Perona's group work on hierarchical feature models of objects: http://www.vision.caltech.edu/html-files/publications.html
Local-Hierarchy
[image credit: Perona et al]
Jojic and Frey
Scene description as hierarchy of sprites
Local-Hierarchy
[image credit: Jojic et al]
Jeff Hawkins, Dileep George Modular hierarchical spatial temporal memory
[Figure: hierarchy and module diagrams; results – templates, good classifications, bad classifications; in (D), out (E)]
Local-Hierarchy
[image credit: George, Hawkins]
Peter Bock’s ALISAAn explicit Cognitive Model
Local-Hierarchy
Histogram based
[image credit: Bock et al]
ALISA Labeling 2 Scenes [Local-Hierarchy]
[image credit: Bock et al]
Local-Hierarchy
HMAX from the "Standard Model" – Maximilian Riesenhuber and Tomaso Poggio
Basic building blocks in the object recognition hierarchy
Modulated by attention
Pick this up momentarily; first, a little on trees and boosting … [image credit: Riesenhuber et al]
Machine Learning – Many Techniques; Libraries from Intel
Bayesian Networks library: PNL
Statistical Learning library: MLL
Techniques span modeless vs. model-based and unsupervised vs. supervised (key: optimized / implemented / not implemented), including:
– K-means, decision trees, agglomerative clustering, spectral clustering, K-NN, dependency nets, boosted decision trees
– Multi-layer perceptron, SVM, BayesNets (classification, parameter fitting, inference, structure learning), tree distributions
– Kernel density estimation, histogram density estimation, PCA, physical models, influence diagrams, logistic regression
– Kalman filter, HMM, adaptive filters, radial basis functions, naïve Bayes, ARTMAP, Gaussian fitting, associative nets, ART, Kohonen map
– Random forests, MART, CART, diagnostic Bayesnets
(focus)
Machine Learning
INPUT → f → OUTPUT
Example uses of prediction:
– Insurance risk prediction
– Parameters that impact yields
– Gene classification by function
– Topics of a document
– . . .
Find a function that describes given data and predicts unknown data.
[Figure: fits of y = f(x) – underfit, just right, overfit]
Learn a model/function that maps input to output.
Specific example: prediction, using a decision tree.
Binary Recursive Decision Trees – Leo Breiman's "CART"*
Data set
[Figure: maximal-purity splits of y = f(x); perfect purity, but… overfit (vs. underfit)]
*Classification And Regression Tree
At each level: find the variable (predictor) and its threshold
– that splits the data into 2 groups
– with maximal purity within each group
All variables/predictors are considered at every level.
Data of different types, each containing a vector of "predictors"
Data set: just right
Prune to avoid overfitting using a complexity-cost measure.
Binary Recursive Decision Trees – Leo Breiman's "CART"*
At each level: find the variable (predictor) and its threshold
– that splits the data into 2 groups
– with maximal purity within each group
All variables/predictors are considered at every level.
[Figure: y = f(x), overfit]
Consider a Face Detector via Decision Stumps
Data set
maximal purity splits: Thresh = N
For each rectangle combination region: Find the threshold
– That splits the data into 2 groups (face, non-face)
– With maximal purity within each group
Face and non-face data that the features can be tried on.
A bar detector works well for the "nose" – a face-detecting stump. It doesn't detect cars.
Consider a tree "stump" – just one split. It selects the single most discriminative feature …
See the Appendix for Viola and Jones's feature generator: Integral Images
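A minimal sketch of training one decision stump by exhaustive threshold search over a single feature column, with a weighted error so it can be reused inside boosting; the names and the brute-force search are illustrative.

```python
import numpy as np

def train_stump(feature_values, labels, weights):
    """Best threshold/polarity for one feature column; labels are 0/1, weights sum to 1."""
    best = (np.inf, None, 1)                               # (error, threshold, polarity)
    for thresh in np.unique(feature_values):
        for polarity in (1, -1):
            pred = (polarity * feature_values >= polarity * thresh).astype(int)
            err = weights[pred != labels].sum()            # weighted misclassification error
            if err < best[0]:
                best = (err, thresh, polarity)
    return best                                            # the maximal-purity split
```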
We use “Boosting” to Select a “Forest of Stumps”
Given example images (x_1, y_1), ..., (x_n, y_n) where y_i = 0, 1 for negative and positive examples respectively.
Initialize weights w_{1,i} = 1/(2m), 1/(2l) for training example i, where m and l are the number of negatives and positives respectively.
For t = 1 ... T:
  1) Normalize the weights so that w_t is a distribution.
  2) For each feature j, train a classifier h_j and evaluate its error epsilon_j with respect to w_t.
  3) Choose the classifier h_t with the lowest error epsilon_t.
  4) Update the weights: w_{t+1,i} = w_{t,i} beta_t^{1 - e_i}, where e_i = 0 if x_i is classified correctly, 1 otherwise, and beta_t = epsilon_t / (1 - epsilon_t).
The final strong classifier is:
  h(x) = 1 if sum_{t=1}^{T} alpha_t h_t(x) >= (1/2) sum_{t=1}^{T} alpha_t, and 0 otherwise, where alpha_t = log(1 / beta_t).
Each stump is a selected feature plus a split threshold
Gentle Boost:
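A compact sketch of the discrete AdaBoost loop above (not the GentleBoost variant), selecting one stump per round from a pool of feature columns; train_stump is the illustrative helper sketched earlier, and the feature-matrix layout is an assumption.

```python
import numpy as np

def adaboost(features, labels, T):
    """features: n x J matrix of stump inputs; labels are 0/1 (illustrative sketch)."""
    m, l = (labels == 0).sum(), (labels == 1).sum()
    w = np.where(labels == 0, 1.0 / (2 * m), 1.0 / (2 * l))     # initial weights
    strong = []
    for _ in range(T):
        w = w / w.sum()                                         # 1) normalize weights
        trials = [train_stump(features[:, j], labels, w) + (j,) for j in range(features.shape[1])]
        err, thresh, pol, j = min(trials)                       # 2)-3) lowest weighted error
        beta = err / (1 - err + 1e-12)
        pred = (pol * features[:, j] >= pol * thresh).astype(int)
        w = w * beta ** (pred == labels)                        # 4) downweight correct examples
        strong.append((j, thresh, pol, np.log(1.0 / (beta + 1e-12))))  # alpha_t = log(1/beta_t)
    return strong

def strong_classify(strong, x):
    """Final strong classifier: weighted stump vote against half of the total alpha."""
    score = sum(a * (p * x[j] >= p * th) for j, th, p, a in strong)
    return int(score >= 0.5 * sum(a for _, _, _, a in strong))
```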
For efficient calculation, form a Detection Cascade
A boosted cascade is assembled such that at each node, non-object regions stop further processing.
If the detection rate of each node is high (~99.9%), at the cost of a high false positive rate (say 50% of everything detected as "object"), and if the nodes are independent,
then the overall detection and false positive rates are
  detect = prod_{i=1}^{n} d_i   and   falsePos = prod_{i=1}^{n} f_i.
If so, then for a 20-node cascade we get detect ≈ 0.98 and falsePos ≈ 9.6e-7.
Rapid Object Detection using a Boosted Cascade of Simple Features - Viola, Jones (2001)
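A quick check of the cascade arithmetic, plus a sketch of the early-exit evaluation that makes cascades fast; the stage classifiers are assumed to be callables returning True for "possibly object".

```python
def cascade_rates(d_i=0.999, f_i=0.5, n=20):
    """Overall rates of an n-node cascade with independent nodes: detect = d_i^n, falsePos = f_i^n."""
    return d_i ** n, f_i ** n          # ~0.98 detection and ~1e-6 false positives for the defaults

def cascade_detect(stages, window):
    """Run the nodes in order; a rejection at any node stops processing of this window (sketch)."""
    for stage in stages:
        if not stage(window):          # non-object regions exit at the first failing node
            return False
    return True
```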
Improvements to Cascade
J. Wu, J. M. Rehg, and M. D. Mullin just do one Boosting round, then select from the feature pool as needed:
Kobi Levi and Yair Weiss just used better features (gradient histograms) to cut training needs by an order of magnitude.
Let’s focus on better features and descriptors …
Viola, Jones Wu, Rehg, Mullin
[image credit: Wu et al]
The Standard Model of Visual Cortex – Biologically Motivated Features
S1 layer: Gabor filters at 4 orientations
C1 layer: local spatial max
Intermediate layer: dictionary of patches of C1
S2 layer: radial basis fit to its patch template over the whole image
C2 layer: max S2 response (.8 .4 .9 .2 .6)
Classifier (SVM, Boosting, …)
Thomas Serre, Lior Wolf and Tomaso Poggio used the model of the human visual cortex developed in Riesenhuber’s lab:
First 5 chosen features from Boosting
[image credit: Serre et al]
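A hedged sketch of the S2/C2 idea described above: each stored C1 patch is compared, via a radial-basis response, against every position of a C1-like map, and the maximum response becomes one entry of the image-level feature vector; the patch dictionary, the bandwidth sigma, and the brute-force scan are assumptions.

```python
import numpy as np

def c2_features(c1_map, patch_dict, sigma=1.0):
    """Max radial-basis (S2) response of each stored C1 patch over the whole map (sketch)."""
    H, W = c1_map.shape
    feats = []
    for patch in patch_dict:                              # dictionary of patches of C1
        ph, pw = patch.shape
        best = 0.0
        for i in range(H - ph + 1):                       # fit the patch at every position
            for j in range(W - pw + 1):
                d2 = np.sum((c1_map[i:i + ph, j:j + pw] - patch) ** 2)
                best = max(best, np.exp(-d2 / (2 * sigma ** 2)))   # S2: radial basis response
        feats.append(best)                                # C2: max S2 response
    return np.asarray(feats)                              # one entry per dictionary patch
```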
The Standard Model of Visual Cortex – Biologically Motivated Features
Results in state of the art/top performance:
[image credit: Serre et al]
Seems to handily beat SIFT features:
Yokonos’ Generalization toThe Standard Model of Visual Cortex
Used Gaussian derivatives: 3 orders x 3 scales x 4 orientations = 36 base features:
Similar to Standard Model’s Gabor base filters.
[image credit: Yokono et al]
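A sketch of the 3 orders x 3 scales x 4 orientations arithmetic, steering axis-aligned Gaussian derivatives to each orientation; the scale values and the use of scipy.ndimage are assumptions, not Yokono's implementation.

```python
import numpy as np
from math import comb
from scipy.ndimage import gaussian_filter

def directional_derivative(img, sigma, n, theta):
    """n-th order Gaussian derivative steered to angle theta (axis 0 is y, axis 1 is x)."""
    c, s = np.cos(theta), np.sin(theta)
    resp = np.zeros_like(img, dtype=float)
    for k in range(n + 1):             # binomial expansion of (c*d/dx + s*d/dy)^n
        resp += comb(n, k) * c ** (n - k) * s ** k * gaussian_filter(img, sigma, order=(k, n - k))
    return resp

def gaussian_derivative_jet(img, y, x, scales=(1, 2, 4)):
    """36-D jet at pixel (y, x): 3 orders x 3 scales x 4 orientations (scale values assumed)."""
    img = np.asarray(img, dtype=float)
    return np.asarray([directional_derivative(img, s, n, t)[y, x]
                       for s in scales
                       for n in (1, 2, 3)
                       for t in (0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)])
```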
Yokonos’ Generalization toThe Standard Model of Visual Cortex
Created a local spatial jet, oriented to the gradient at the largest scale at the center pixel:
Since the Gabor filter has ringing spatial extent, this is still approximately similar to the standard model.
[image credit: Yokono et al]
Yokonos’ Generalization toThe Standard Model of Visual Cortex
Full system:
[image credit: Yokono et al]
~S1, C1: features memorized from positive samples at Harris corner interest points.
~S2: the dictionary of learned features is measured (normalized cross correlation) against all interest points in the image.
~C2: the maximum normalized cross correlation scores are arranged in a feature vector.
Classifier: again, SVM, Boosting, …
Yokonos’ Generalization toThe Standard Model of Visual Cortex
Excellent Results:
[image credit: Yokono et al]
CBCL Database ROC curve for 1200 Stumps:
SVM with 1 to 5 training images beats other techniques:
Yokonos’ Generalization toThe Standard Model of Visual Cortex
Excellent Results:
[image credit: Yokono et al]
AIBO Dog in articulated poses:
Some features chosen:
ROC Curve:
Brash Claim
Performance is in the high 90% range under lighting, articulation, scale and 3D rotation. – The classifier inside humans is unlikely to be much more accurate.
We are not that far from raw human level performance. – By 2015 I predict.
The base classifier is embedded in a larger system that makes it more reliable:
– Attention
– Color constancy features
– Context
– Temporal filtering
– Sensor fusion
Back to Kevin Murphy: Context:
[slide credit: Kevin Murphy]
Missing
We know there is a keyboard present in this scene even if we cannot see it clearly.
We know there is no keyboard present in this scene
… even if there is one indeed.[slide credit: Kevin Murphy]
Missing: Context
Attention
Change blindness
Missing
Farm Truck
Call for a Program: Generalize Standard Model Even Further
Detect:– DOG– Harris Corner
Descriptors:– SIFT– Steerable– Gabor
Dictionary:– All descriptors– Subset– Clustered
Image Level Scoring:– Histogram– Max Correlation– Max Probability
Classifier:– SVM– Boosting– K-NN …
Embedding:– Attention, active vision– Context: Scene, 3D inference– Sensor fusion/association– Motion
Research Framework
Local
Global
Call for a Program: Generalize Standard Model Even Further
Ashutosh Saxena, Chung and Ng learned depth using local features in an MRF (similar to Kevin Murphy).
Ashutosh also has a robot picking up novel objects from local features. Together with active vision, active manipulation, context – Now is a good time for vision systems!
[image credit: Saxena et al] Apply to "Stanley II" and to STAIR
Summary: Mix local with global – Generalize Standard Model Even Further
Detect:– DOG– Harris Corner
Descriptors:– SIFT– Steerable– Gabor
Dictionary:– All descriptors– Subset– Clustered
Image Level Scoring:– Histogram– Max Correlation– Max Probability
Classifier:– SVM– Boosting– K-NN …
Embedding:– Attention, active vision– Context: Scene, 3D inference– Sensor fusion/association– Motion
Research Framework
Local
Global
Bibliography for this lecture. Papers for this lecture:
1. R. Fergus, P. Perona and A. Zisserman, "Object Class Recognition by Unsupervised Scale-Invariant Learning", CVPR 03.
2. M. Turk, A. Pentland, "Eigenfaces for Recognition", Journal of Cognitive Neuroscience, Vol. 3, No. 1, 1991.
3. Serre, T., L. Wolf and T. Poggio. Object Recognition with Features Inspired by Visual Cortex. In: Proceedings of 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Computer Society Press, San Diego, June 2005.
4. Jerry Jun Yokono & Tomaso Poggio, "Boosting a Biologically Inspired Local Descriptor for Geometry-free Face and Full Multi-view 3D Object Recognition", AI Memo 2005-023, CBCL Memo 254, July 2005.
5. J. Wu, J. M. Rehg, and M. D. Mullin, “Learning a Rare Event Detection Cascade by Direct Feature Selection” Proc. Advances in Neural Information Processing Systems 16 (NIPS*2003), MIT Press, 2004
6. J. Wu, M. D. Mullin, and J. M. Rehg, “Linear Asymmetric Classifier for Face Detection”, International Conference on Machine Learning (ICML 05), pages 993-1000, Bonn, Germany, August 2005
7. Kobi Levi and Yair Weiss, “Learning Object Detection from a Small Number of Examples: The Importance of Good Features” International Conference on Computer Vision and Pattern Recognition (CVPR) 2004.
8. P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proc. CVPR, pages 511–518, 2001.
9. B. Schiele and JL Crowley. Probabilistic object recognition using multidimensional receptive field histograms. submitted to ICPR'96
10. M. J. Swain and D. H. Ballard, "Color Indexing," International Journal of Computer Vision, vol. 7, pp. 11-32, 1991.
11. Antonio Torralba, Kevin Murphy and William Freeman , “Contextual Models for Object Detection using Boosted Random Fields ”, NIPS 2004.
12. Kevin Murphy, Antonio Torralba, Daniel Eaton, William Freeman, “Object detection and localization using local and global features”, Sicily workshop on object recognition, 2005
13. M. Riesenhuber and T. Poggio. How visual cortex recognizes objects: The tale of the standard model. The Visual Neurosciences, 2:1640–1653, 2003.
14. A. Saxena, S.H. Chung, A.Y. Ng, “Learning depth from Single Monocular Images”, NIPS 2005
Feature set generators
Backup Slides
3 rectangular feature types:
• two-rectangle feature type (horizontal/vertical)
• three-rectangle feature type
• four-rectangle feature type
Using a 24x24 pixel base detection window, with all the possible combinations of horizontal and vertical location and scale of these feature types, the full set has 49,396 features.
The motivation for using rectangular features, as opposed to more expressive steerable filters, is their extreme computational efficiency.
Paul Viola and Michael Jones www.cs.ucsd.edu/classes/fa01/cse291/ViolaJones.ppt ICCV 2001 Workshop on Statistical and Computation Theories of Vision
Integral Images -- a Feature Set Generator
Define an “Integral image” Def: The integral image at location (x,y), is the sum of the pixel values above and to the left of (x,y), inclusive.
Using the following two recurrences, where i(x,y) is the pixel value of original image at the given location and s(x,y) is the cumulative column sum, we can calculate the integral image representation of the
image in a single pass.
[Figure: integral image – origin (0,0) at top left, point (x,y); the shaded region above and to the left is the sum]
Paul Viola and Michael Jones www.cs.ucsd.edu/classes/fa01/cse291/ViolaJones.ppt ICCV 2001 Workshop on Statistical and Computation Theories of Vision
s(x,y) = s(x,y-1) + i(x,y)
ii(x,y) = ii(x-1,y) + s(x,y)
Allows rapid evaluation of rectangular features
Using the integral image representation one can compute the value of any rectangular sum in constant time.
For example, we can compute the integral sum inside rectangle D as:
ii(4) + ii(1) – ii(2) – ii(3)
As a result: two-, three-, and four-rectangular features can be computed with 6, 8 and 9 array references respectively.
Paul Viola and Michael Jones www.cs.ucsd.edu/classes/fa01/cse291/ViolaJones.ppt ICCV 2001 Workshop on Statistical and Computation Theories of Vision
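A short sketch of the recurrences and the constant-time rectangle sum; NumPy's cumulative sums implement the single pass.

```python
import numpy as np

def integral_image(img):
    """ii(x, y) = sum of pixel values above and to the left of (x, y), inclusive (one pass)."""
    return np.cumsum(np.cumsum(np.asarray(img, dtype=float), axis=0), axis=1)

def rect_sum(ii, top, left, bottom, right):
    """Rectangle sum from four array references: ii(4) + ii(1) - ii(2) - ii(3)."""
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]
    if left > 0:
        total -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total
```

On the 4x4 example below, rect_sum(integral_image(img), 1, 1, 3, 2) returns 43.0.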
Integral Image Example
Image:
  0 8 6 1
  1 5 9 0
  0 7 5 0
  2 8 9 2

Building the integral image left to right, top to bottom (can calculate in one pass):
  0  8  -  -      0  8 14  -      0  8 14 15
  1 14  -  -      1 14 29  -      1 14 29 30
  1  -  -  -      1 21 41  -      1 21 41 42
  3  -  -  -      3 31 60  -      3 31 60 63

Integral image:
  0  8 14 15
  1 14 29 30
  1 21 41 42
  3 31 60 63
Integral Image Example

Image:            Integral image:
  0 8 6 1           0  8 14 15
  1 5 9 0           1 14 29 30
  0 7 5 0           1 21 41 42
  2 8 9 2           3 31 60 63

Find the sum of the shaded region (rows 2-4, columns 2-3):
  directly:                5 + 9 + 7 + 5 + 8 + 9 = 43
  via the integral image:  60 + 0 - (14 + 3) = 43