A probabilistic model for recursive factorized image features. Sergey Karayev, Mario Fritz, Sanja Fidler, Trevor Darrell


Page 1: A probabilistic model for recursive factorized image features ppt

A probabilistic model for recursive factorized image features

Sergey Karayev, Mario Fritz, Sanja Fidler, Trevor Darrell

Page 2: A probabilistic model for recursive factorized image features ppt

Outline

• Motivation:

✴Distributed coding of local image features

✴Hierarchical models

✴Bayesian inference

• Our model: Recursive LDA

• Evaluation

Page 3: A probabilistic model for recursive factorized image features ppt

Local Features

• Gradient energy histograms, binned by orientation and grid position within local patches (sketched below).

• Coded and used in bag-of-words or spatial-model classifiers.
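A minimal sketch of such a descriptor (not the authors' code): it bins gradient magnitudes of a grayscale patch by orientation and grid cell, SIFT-style. The patch array, grid size, and orientation count are assumed inputs.

```python
import numpy as np

def patch_descriptor(patch, grid=4, n_orient=8):
    """Histogram of gradient energy by orientation and grid cell for one patch.

    A minimal sketch of a SIFT-like local feature: `patch` is a 2-D float
    array; the result is a (grid, grid, n_orient) array of summed gradient
    magnitudes, flattened into a single descriptor vector.
    """
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    # Quantize gradient orientation into n_orient bins over [0, 2*pi).
    ori = np.mod(np.arctan2(gy, gx), 2 * np.pi)
    obin = np.minimum((ori / (2 * np.pi) * n_orient).astype(int), n_orient - 1)

    h, w = patch.shape
    desc = np.zeros((grid, grid, n_orient))
    ys = np.minimum(np.arange(h) * grid // h, grid - 1)   # row -> grid cell
    xs = np.minimum(np.arange(w) * grid // w, grid - 1)   # col -> grid cell
    for i in range(h):
        for j in range(w):
            desc[ys[i], xs[j], obin[i, j]] += mag[i, j]
    return desc.reshape(-1)  # e.g. 4 * 4 * 8 = 128 dimensions

# Example: descriptor of a random 16x16 patch.
d = patch_descriptor(np.random.rand(16, 16))
print(d.shape)  # (128,)
```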

Page 4: A probabilistic model for recursive factorized image features ppt

Feature Coding

• Traditionally vector quantized as visual words.

• But coding as a mixture of components, such as in sparse coding, is empirically better (Yang et al. 2009).

Local codes vs. Dense codes (picture from Bruno Olshausen):

✴Dense codes (ASCII): high combinatorial capacity (2^N), but difficult to read out.

✴Sparse, distributed codes: decent combinatorial capacity (~N^K), and still easy to read out.

✴Local codes (grandmother cells): low combinatorial capacity (N), but easy to read out.

Page 5: A probabilistic model for recursive factorized image features ppt

Another motivation: additive image formation

Fritz et al. An Additive Latent Feature Model for Transparent Object Recognition. NIPS (2009).

Page 6: A probabilistic model for recursive factorized image features ppt

Outline

• Motivation:

✴Distributed coding of local image features

✴Hierarchical models

✴Bayesian inference

• Our model: Recursive LDA

• Evaluation

Page 7: A probabilistic model for recursive factorized image features ppt

Hierarchies

• Biological evidence for increasing spatial support and complexity along the visual pathway.

• Local features are not robust to ambiguities; higher layers can help resolve them.

• Efficient parametrization is possible due to sharing of lower-layer components.

Page 8: A probabilistic model for recursive factorized image features ppt

Past Work

• HMAX models (Riesenhuber and Poggio 1999, Mutch and Lowe 2008)

• Convolutional networks (Ranzato et al. 2007, Ahmed et al. 2009)

• Deep Belief Nets (Hinton 2007, Lee et al. 2009)

• Hyperfeatures (Agarwal and Triggs 2008)

• Fragment-based hierarchies (Ullman 2007)

• Stochastic grammars (Zhu and Mumford 2006)

• Compositional object representations (Fidler and Leonardis 2007, Zhu et al. 2008)

Page 9: A probabilistic model for recursive factorized image features ppt

HMAX

[Excerpt from Riesenhuber and Poggio (1999), Fig. 2: the model is a hierarchy of layers alternating linear 'S' units (template matching) and nonlinear 'C' pooling units; the MAX operation, which drives each pooling cell by its best-matching afferent, provides invariance to position and scale while preserving pattern specificity. Fig. 3 compares MAX and SUM pooling against the highly nonlinear shape tuning of IT cells.]

Riesenhuber and Poggio. Hierarchical models of object recognition in cortex. Nature Neuroscience (1999).

Page 10: A probabilistic model for recursive factorized image features ppt

Convolutional Deep Belief Nets

Stacked layers, each consisting of feature extraction, transformation, and pooling.

[Excerpt from Lee et al. (2009): the convolutional RBM shares a bank of K filters W^k across all hidden units in group k, with P(v, h) = (1/Z) exp(−E(v, h)) and energy

E(v, h) = −Σ_k h^k • (W̃^k ∗ v) − Σ_k b_k Σ_{i,j} h^k_{ij} − c Σ_{i,j} v_{ij}.

Block Gibbs sampling uses the conditionals

P(h^k_{ij} = 1 | v) = σ((W̃^k ∗ v)_{ij} + b_k),
P(v_{ij} = 1 | h) = σ((Σ_k W^k ∗ h^k)_{ij} + c),

where σ is the sigmoid function. CRBMs are stacked into a DBN-like architecture via probabilistic max-pooling: each C × C block of detection units is tied to one pooling unit, so that at most one detection unit in the block is on and the pooling unit is on if and only if some detection unit is on. This gives max-pooling-like behavior while keeping the model generative, supporting both bottom-up and top-down inference.]

Lee et al. Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations. Proceedings of the 26th Annual International Conference on Machine Learning (2009)

Page 11: A probabilistic model for recursive factorized image features ppt

Hyperfeatures

[Fig. 1 from Agarwal and Triggs: constructing a hyperfeature stack. The level-0 pyramid is built by computing a local image descriptor for each patch in a multiscale pyramid; these are vector quantized against the level-0 codebook, and local histograms of codebook memberships over position-scale neighborhoods form the level-1 feature vectors. The process repeats at higher levels, and the per-level global histograms are fed to a classifier. Unlike the neocognitron, CNNs, and HMAX, hyperfeatures are trained without supervision and have no convolution weights to learn; they are literally "features of (local sets of) features", based on local co-occurrence rather than numerical geometry.]

Agarwal and Triggs. Multilevel Image Coding with Hyperfeatures. International Journal of Computer Vision (2008).

Page 12: A probabilistic model for recursive factorized image features ppt

Compositional Representations

[Figure excerpts from Fidler and Leonardis (2007): reconstructions of the learned part hierarchy (layer L2 and L3 parts, and L4 and L5 parts for faces, cars, and mugs), and bottom-up detection of L4 and L5 parts, with active top-layer parts traced back down to the image.]

Fidler and Leonardis. Towards Scalable Representations of Object Categories: Learning a Hierarchy of Parts. CVPR (2007)

Page 13: A probabilistic model for recursive factorized image features ppt

Outline

• Motivation:

✴Distributed coding of local image features

✴Hierarchical models

✴Bayesian inference

• Our model: Recursive LDA

• Evaluation

Page 14: A probabilistic model for recursive factorized image features ppt

Bayesian inference

• The human visual cortex deals with inherently ambiguous data.

• Role of priors and inference (Lee and Mumford 2003).


Kersten et al. Object perception as Bayesian inference. Annual Reviews (2004)

Page 15: A probabilistic model for recursive factorized image features ppt

• But most hierarchical approaches do both learning and inference only from the bottom up.

Page 16: A probabilistic model for recursive factorized image features ppt

[Diagram: feed-forward stacking. The L0 representation (T0 topics) is learned from patches; its activations are then inferred and used, in a separate step, to learn the L1 representation (T1 topics), whose activations are inferred in turn.]

Page 17: A probabilistic model for recursive factorized image features ppt

[Diagram: the recursive alternative. The L0 (T0 topics) and L1 (T1 topics) layers form a joint representation, learned together, with joint inference over both layers.]

Page 18: A probabilistic model for recursive factorized image features ppt

What we would like

• Distributed coding of local features in a hierarchical model that allows full inference.

Page 19: A probabilistic model for recursive factorized image features ppt

Outline

• Motivation:

✴Distributed coding of local image features

✴Hierarchical models

✴Bayesian inference

• Our model: Recursive LDA

• Evaluation

Page 20: A probabilistic model for recursive factorized image features ppt

Our model: RLDA

• Based on Latent Dirichlet Allocation (LDA).

• Multiple layers, with increasing spatial support.

• Learns representation jointly across layers.

[Diagram (repeated from the earlier slide): joint representation over L0 (T0 topics) and L1 (T1 topics), with joint inference across both layers.]

Page 21: A probabilistic model for recursive factorized image features ppt

Latent Dirichlet Allocation

• Bayesian multinomial mixture model, originally formulated for text analysis.

[Graphical model (Griffiths' notation): hyperparameters α and β; T topic-word multinomials φ; per-document mixing proportions θ; a topic assignment z and word w for each of the N_d words in each of the D documents.]

Blei et al. Latent Dirichlet allocation. Journal of Machine Learning Research (2003)

Page 22: A probabilistic model for recursive factorized image features ppt

Latent Dirichlet Allocation

Corpus-wide, the multinomial distributions over words (topics) are sampled:

• φ ∼ Dir(β)

For each document d ∈ {1, . . . , D}, mixing proportions θ(d) are sampled according to:

• θ(d) ∼ Dir(α)

And the N_d words w are sampled according to:

• z ∼ Mult(θ(d)): sample a topic given the document-topic mixing proportions

• w ∼ Mult(φ(z)): sample a word given the topic and the topic-word multinomials
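A minimal sketch of this generative process in NumPy; the sizes T, V, D, N_d and the hyperparameter values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

T, V, D, N_d = 16, 100, 5, 50      # topics, vocabulary size, documents, words per document
alpha, beta = 0.1, 0.01            # symmetric Dirichlet hyperparameters

# Corpus-wide: sample the topic-word multinomials phi ~ Dir(beta).
phi = rng.dirichlet(np.full(V, beta), size=T)        # shape (T, V)

corpus = []
for d in range(D):
    # Per document: sample mixing proportions theta ~ Dir(alpha).
    theta = rng.dirichlet(np.full(T, alpha))          # shape (T,)
    words = []
    for _ in range(N_d):
        z = rng.choice(T, p=theta)                    # z ~ Mult(theta)
        w = rng.choice(V, p=phi[z])                   # w ~ Mult(phi[z])
        words.append(w)
    corpus.append(words)

print(len(corpus), len(corpus[0]))   # 5 documents of 50 words each
```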

Page 23: A probabilistic model for recursive factorized image features ppt

LDA in vision

• Past work has applied LDA to visual words, with topics being distributions over them.

[Figures from Fei-Fei and Perona (2005). Figure 5: internal structure of the model learned for each scene category — per-category distributions over the 40 intermediate themes (left) and over codewords (right), with the appearance of 10 of the top 20 most likely codewords. Figure 6: example test images per category, with patches belonging to the most significant codewords superimposed; incorrectly categorized images tend to contain fewer of the model's significant codewords.]

Fei-Fei and Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR (2005)

Page 24: A probabilistic model for recursive factorized image features ppt

LDA-SIFT

[Excerpt from Lowe, "Distinctive Image Features from Scale-Invariant Keypoints", Fig. 7 and Sec. 6.1: the keypoint descriptor accumulates Gaussian-weighted gradient magnitudes into orientation histograms over 4 × 4 subregions with 8 orientation bins each, giving a 4 × 4 × 8 = 128-element vector that is normalized to reduce the effects of illumination change; trilinear interpolation distributes each gradient sample into adjacent histogram bins to avoid boundary effects.]

[Slide figure: the descriptor's histogram energy is read off as token counts over a vocabulary of words.]

Page 25: A probabilistic model for recursive factorized image features ppt

How training works

• (quantization, extracting patches, inference illustration)

Page 26: A probabilistic model for recursive factorized image features ppt

Topics

[Visualization of a subset of the 1024 learned topics, each shown as its SIFT representation and as an average image.]

Page 27: A probabilistic model for recursive factorized image features ppt

Stacking two layers of LDA

[Diagram: the feed-forward scheme — the L0 representation (T0 topics) is learned from patches and its activations inferred; the L1 representation (T1 topics) is then learned on those activations.]
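A minimal sketch of this feed-forward stacking (the FLDA baseline evaluated later), using scikit-learn's variational LDA as a stand-in for the paper's Gibbs-sampled LDA; `l0_counts` (per-patch token counts) and `group_patches` (a spatial aggregation helper) are assumed inputs, not from the paper.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

def quantize(acts, tokens_per_doc=100):
    """Turn continuous topic activations into integer pseudo-counts for the next layer."""
    acts = acts / acts.sum(axis=1, keepdims=True)
    return np.rint(acts * tokens_per_doc).astype(int)

def train_flda(l0_counts, group_patches, T0=128, T1=128):
    # Layer 0: learn topics over the base vocabulary (e.g. quantized SIFT bins).
    lda0 = LatentDirichletAllocation(n_components=T0, learning_method="batch")
    lda0.fit(l0_counts)
    acts0 = lda0.transform(l0_counts)            # per-patch T0 activations

    # Layer 1: treat spatially grouped L0 activations as the next layer's "words".
    l1_counts = group_patches(quantize(acts0))   # hypothetical spatial aggregation
    lda1 = LatentDirichletAllocation(n_components=T1, learning_method="batch")
    lda1.fit(l1_counts)
    return lda0, lda1
```

Note that this sketch drops the within-patch spatial grid that the actual model keeps, and each layer is learned greedily from the bottom up, which is exactly the limitation the recursive model removes.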

Page 28: A probabilistic model for recursive factorized image features ppt

Recursive LDA

[Figure 2: (a) Recursive LDA model concept — an L1 patch over spatial grid X1 aggregates the topic activations of L0 patches, each of which is a distribution of V words over spatial grid X0; per-topic spatial distributions χ(z1,·) ∈ R^X1 and χ(z0,·) ∈ R^X0 and mixture parameters φ accompany each layer. (b) Graphical model for RLDA, with hyperparameters α, β0, β1, per-document proportions θ, assignments z1 and z0, positions x1 and x0, and observed words w.]

3. Recursive LDA Model

In contrast to previous approaches to latent factor modeling, we formulate a layered approach that derives progressively higher-level spatial distributions based on the underlying latent activations of the lower layer.

For clarity, we will first derive this model for two layers (L0 and L1) as shown in Figure 2a, but will show how it generalizes to an arbitrary number of layers. For illustration purposes, the reader may visualize a particular instance of our model that observes SIFT descriptors (such as the L0 patch in Figure 2a) as discrete spatial distributions on the L0 layer. In this particular case, the L0 layer models the distribution of words from a vocabulary of size V = 8 over a spatial grid X0 of size 4 × 4. The vocabulary represents the 8 gradient orientations used by the SIFT descriptor, which we interpret as words. The frequency of these words is then the histogram energy in the corresponding bin of the SIFT descriptor.

The mixture model of T0 components is parameterized by multinomial parameters φ0 ∈ R^(T0×X0×V); in our particular example φ0 ∈ R^(T0×(4×4)×8). The L1 layer aggregates the mixing proportions obtained at layer L0 over a spatial grid X1 to an L1 patch. In contrast to the L0 layer, L1 models a spatial distribution over L0 components. The mixture model of T1 components at layer L1 is parameterized by multinomial parameters φ1 ∈ R^(T1×X1×T0).

The spatial grid is considered to be deterministic at each layer, and the position variables x for each word are observed. However, the distribution of words/topics over the grid is not uniform and may vary across different components. We thus have to introduce a spatial (multinomial) distribution χ at each layer, which is computed from the mixture distribution φ. This is needed to define a full generative model.

The base model for a single layer (with T1 = 0) is equivalent to LDA, which is therefore a special case of our nested approach.

3.1. Generative Process

Given symmetric Dirichlet priors α, β0, β1 and a fixed choice of the numbers of mixture components T1 and T0 for layers L1 and L0 respectively, we define the following generative process, which is also illustrated in Figure 2b. Mixture distributions are sampled globally according to:

• φ1 ∼ Dir(β1) and φ0 ∼ Dir(β0): sample L1 and L0 multinomial parameters

• χ1 ← φ1 and χ0 ← φ0: compute spatial distributions from mixture distributions

For each document d ∈ {1, . . . , D}, top-level mixing proportions θ(d) are sampled according to:

• θ(d) ∼ Dir(α): sample top-level mixing proportions

For each document d, N(d) words w are sampled according to:

• z1 ∼ Mult(θ(d)): sample L1 mixture component

• x1 ∼ Mult(χ1(z1,·)): sample spatial position on L1 given z1

• z0 ∼ Mult(φ1(z1,x1,·)): sample L0 mixture component given z1 and x1 from L1

• x0 ∼ Mult(χ0(z0,·)): sample spatial position on L0 given z0

• w ∼ Mult(φ0(z0,x0,·)): sample word given z0 and x0

According to the proposed generative process, the joint distribution of the model parameters given the hyperparameters…

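A minimal sketch of the two-layer generative process above, with toy sizes (not the authors' code). One detail the excerpt leaves open is exactly how χ is computed from φ; here each topic's φ is treated as a joint multinomial over (position, sub-symbol) and χ as its spatial marginal, which is one plausible reading.

```python
import numpy as np

rng = np.random.default_rng(0)

V, X0, X1 = 8, 16, 4          # words (orientations), L0 grid cells, L1 grid cells
T0, T1 = 6, 4                 # topics per layer (toy sizes)
alpha, beta0, beta1 = 0.5, 0.1, 0.1

# Each topic is a joint multinomial over (grid position, sub-symbol);
# the spatial distribution chi is taken as its marginal over positions (assumption).
phi0 = rng.dirichlet(np.full(X0 * V, beta0), size=T0).reshape(T0, X0, V)
phi1 = rng.dirichlet(np.full(X1 * T0, beta1), size=T1).reshape(T1, X1, T0)
chi0 = phi0.sum(axis=2)       # (T0, X0)
chi1 = phi1.sum(axis=2)       # (T1, X1)

def sample_document(n_words=100):
    theta = rng.dirichlet(np.full(T1, alpha))                  # theta ~ Dir(alpha)
    doc = []
    for _ in range(n_words):
        z1 = rng.choice(T1, p=theta)                           # L1 topic
        x1 = rng.choice(X1, p=chi1[z1])                        # position on L1 grid
        z0 = rng.choice(T0, p=phi1[z1, x1] / chi1[z1, x1])     # L0 topic given (z1, x1)
        x0 = rng.choice(X0, p=chi0[z0])                        # position on L0 grid
        w = rng.choice(V, p=phi0[z0, x0] / chi0[z0, x0])       # word given (z0, x0)
        doc.append((x1, x0, w))
    return doc

print(sample_document(5))
```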

Page 29: A probabilistic model for recursive factorized image features ppt

[The Recursive LDA model concept figure (Figure 2a), shown full-slide: the L1 patch over grid X1 aggregates L0 topic activations; each L0 patch is a distribution of V words over grid X0.]

Page 30: A probabilistic model for recursive factorized image features ppt

Inference scheme

• Gibbs sampling: sequentially update each random variable with all others held constant (a minimal sketch follows).

• Linear topic response is used for initialization.
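To illustrate the kind of sequential update meant here, a minimal collapsed Gibbs sampler for single-layer LDA (the recursive model's sampler additionally resamples the paired z1, z0 assignments; this sketch is not the authors' code):

```python
import numpy as np

def gibbs_lda(docs, V, T, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampler for single-layer LDA.

    docs: list of lists of word ids in [0, V). Returns the topic-word count matrix.
    """
    rng = np.random.default_rng(seed)
    z = [[rng.integers(T) for _ in doc] for doc in docs]   # random initial assignments
    ndt = np.full((len(docs), T), alpha)                   # doc-topic counts + prior
    ntw = np.full((T, V), beta)                            # topic-word counts + prior
    nt = np.full(T, V * beta)                              # topic totals + prior
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            ndt[d, t] += 1
            ntw[t, w] += 1
            nt[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # Remove the word's current assignment, then resample it with
                # every other assignment held constant.
                ndt[d, t] -= 1
                ntw[t, w] -= 1
                nt[t] -= 1
                p = ndt[d] * ntw[:, w] / nt
                t = rng.choice(T, p=p / p.sum())
                z[d][i] = t
                ndt[d, t] += 1
                ntw[t, w] += 1
                nt[t] += 1
    return ntw - beta

# Example: two tiny documents over a 5-word vocabulary, 2 topics.
print(gibbs_lda([[0, 0, 1, 1], [3, 4, 4, 3]], V=5, T=2, iters=50))
```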

Page 31: A probabilistic model for recursive factorized image features ppt

Outline

• Motivation:

✴Distributed coding of local image features

✴Hierarchical models

✴Value of Bayesian inference

• Our model: Recursive LDA

• Evaluation

Page 32: A probabilistic model for recursive factorized image features ppt

Evaluation

• 16px SIFT descriptors, extracted densely every 6px; each descriptor's maximum value is normalized to 100 tokens (see the sketch after this list).

• Three conditions:

✴Single-layer LDA

✴Feed-forward two-layer LDA (FLDA)

✴Recursive two-layer LDA (RLDA)
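A minimal sketch of turning one dense SIFT descriptor into the integer token counts LDA consumes, under the normalization stated in the first bullet; the 128 descriptor bins play the role of the vocabulary. Illustrative only.

```python
import numpy as np

def sift_to_tokens(desc, max_tokens=100):
    """Convert a 128-dim SIFT descriptor into integer word counts.

    Each of the 128 bins (4x4 grid cells x 8 orientations) is a "word";
    the descriptor is scaled so its maximum bin maps to `max_tokens`.
    """
    desc = np.asarray(desc, dtype=float)
    if desc.max() <= 0:
        return np.zeros(desc.shape, dtype=int)
    return np.rint(desc / desc.max() * max_tokens).astype(int)

# Example: one dense-grid descriptor becomes one LDA "document".
tokens = sift_to_tokens(np.random.rand(128) * 255)
print(tokens.sum(), tokens.max())    # total tokens, max bin = 100
```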

Page 33: A probabilistic model for recursive factorized image features ppt

RLDA > FLDA > LDA


Table 2: Results for different implementations of our model with 128 components at L0 and 128 components at L1. Columns give Caltech-101 accuracy with 15 and 30 training examples.

Model   Basis size   Layer(s) used   15             30
LDA     128          "bottom"        52.3 ± 0.5%    58.7 ± 1.1%
RLDA    128t/128b    bottom          55.2 ± 0.3%    62.6 ± 0.9%
LDA     128          "top"           53.7 ± 0.4%    60.5 ± 1.0%
FLDA    128t/128b    top             55.4 ± 0.5%    61.3 ± 1.3%
RLDA    128t/128b    top             59.3 ± 0.3%    66.0 ± 1.2%
FLDA    128t/128b    both            57.8 ± 0.8%    64.2 ± 1.0%
RLDA    128t/128b    both            61.9 ± 0.3%    68.3 ± 0.7%

[Figure 4: Comparison of classification rates on Caltech-101 (average % correct vs. number of training examples) of the one-layer model (LDA-SIFT) with the feed-forward (FLDA) and full generative (RLDA) two-layer models, all trained with 128 topics at L0 and 128 at L1. (a) 128-topic models: LDA-SIFT vs. FLDA vs. RLDA. (b) 128-topic models: top-layer features vs. stacked L0+L1 features for FLDA and RLDA, the latter performing best.]

Both two-layer models improved on the one-layer LDA-SIFT model trained with 128 topics, obtaining respective classification rates of 61.3% and 66.0%, compared to the LDA-SIFT result of 58.7%. Detailed results across different numbers of training examples are shown in Figure 4a. The curves show that the RLDA model consistently outperforms the feed-forward FLDA model, which in turn outperforms the one-layer LDA-SIFT model. This is due to the fact that RLDA has learned more complex spatial structures (the learned vocabularies are presented in Figure 5).

To show that the increased performance of the second layer is not only due to a larger spatial support of the features, we also ran the baseline LDA on the bigger 16 × 16 grid, which is the size of the spatial support of the RLDA and FLDA models. The performance, reported in Table 2 under LDA "top", is 60.6%, which is lower than that of FLDA and RLDA and further demonstrates the necessity of having multi-layered representations.

We also tested the classification performance of stacking the L0 and L1 features, which contains more information (spatially larger as well as smaller features are taken into account this way) and is standard practice for hierarchical models [17]. The results are presented in Table 2 (under "both") and Figure 4b, showing that using information from both layers facilitates discrimination.

We further tested the performance of the first layer (L0) obtained within the RLDA model and compared it with the single-layer LDA-SIFT model. Interestingly, the L0 layer learned with the RLDA model outperforms the purely bottom-up single-layer model (62.6% vs. 58.7%), demonstrating that learning the layers jointly is beneficial for both the bottom and the top layers. However, for the RLDA model with 1024 L1 and 128 L0 components, the bottom (L0) layer performed only slightly better than the L0 layer of the 128 L1 / 128 L0 RLDA model, achieving 62.7% vs. 62.6%, which seems to be the saturation point for the 128-topic bottom layer.

Following this evaluation, we learned two-layer models with 1024 topics in the top layer and 128 in the bottom layer…

• additional layer increases performance

• full inference increases performance

Page 34: A probabilistic model for recursive factorized image features ppt

RLDA > FLDA > LDA

[Figure 4 plots repeated (x-axis: number of training examples; y-axis: average % correct). Left: RLDA (top), FLDA (top), LDA ("top"), and LDA ("bottom"), all 128t/128b. Right: top-layer-only vs. stacked L0+L1 features for FLDA and RLDA.]

• additional layer increases performance

• full inference increases performance

• using both layers increases performance

Page 35: A probabilistic model for recursive factorized image features ppt

RLDA vs. other hierarchies


p(w, z0, z1 | x0, x1, α, β0, β1)
  = ∫θ ∫φ1 ∫φ0 ∫χ1 ∫χ0 p(w, z1, z0, φ1, φ0, θ, χ1, χ0 | x1, x0, α, β1, β0) dχ0 dχ1 dφ0 dφ1 dθ   (2)
  = ∫θ ∫φ1 ∫φ0 p(w, z1, z0, φ1, φ0, θ | x1, x0, α, β1, β0) dφ0 dφ1 dθ   (3)
  = [∫θ p(θ | α) p(z1 | θ) dθ] · [∫φ1 p(φ1 | β1) p(z0 | φ1, z1, x1) dφ1] · [∫φ0 p(w | φ0, z0, x0) p(φ0 | β0) dφ0]   (4)
      (top layer)                  (intermediate layer)                     (evidence layer)

p(w, z_{1,…,L} | x_{1,…,L}, α, β_{1,…,L})
  = [∫θ p(θ | α) p(zL | θ) dθ] · ∏_{l=1}^{L−1} [∫φl p(φl | βl) p(z_{l−1} | φl, zl, xl) dφl] · [∫φ0 p(w | φ0, z0, x0) p(φ0 | β0) dφ0]   (5)
      (top layer L)                   (layer l)                                                (evidence layer)

Figure 3: RLDA conditional probabilities for a two-layer model, which is then generalized to an L-layer model.

Table 1: Comparison of the top-performing RLDA with 128 components at L0 and 1024 components at layer L1 to other hierarchical approaches. "bottom" denotes the L0 features of the RLDA model, "top" the L1 features, and "both" stacked features from the L0 and L1 layers. Columns give Caltech-101 accuracy with 15 and 30 training examples.

Model                      Layer(s) used   15             30
RLDA (1024t/128b)          bottom          56.6 ± 0.8%    62.7 ± 0.5%
RLDA (1024t/128b)          top             66.7 ± 0.9%    72.6 ± 1.2%
RLDA (1024t/128b)          both            67.4 ± 0.5%    73.7 ± 0.8%
Sparse-HMAX [21]           top             51.0%          56.0%
CNN [15]                   bottom          –              57.6 ± 0.4%
CNN [15]                   top             –              66.3 ± 1.5%
CNN + Transfer [2]         top             58.1%          67.2%
CDBN [17]                  bottom          53.2 ± 1.2%    60.5 ± 1.1%
CDBN [17]                  both            57.7 ± 1.5%    65.4 ± 0.4%
Hierarchy-of-parts [8]     both            60.5%          66.5%
Ommer and Buhmann [23]     top             –              61.3 ± 0.9%

Caltech-101 evaluations. A spatial pyramid with 3 layers of square grids of 1, 2, and 4 cells each was always constructed on top of our features. Guided by the best practices outlined in a recent comparison of different pooling and factorization functions [5], we used max pooling for the spatial pyramid aggregation. For classification, we used a linear SVM, following the state-of-the-art results of Yang et al. [31]. Caltech-101 is a dataset comprised of 101 object categories, with differing numbers of images per category [7]. Following standard procedure, we use 30 images per category for training and the rest for testing. Mean accuracy and standard deviation are reported for 8 runs over different splits of the data, normalized per class.

We first tested a one-layer LDA over SIFT features. For 128 components we obtain a classification accuracy of 58.7%. Performance increases as we increase the number of components, demonstrating the lack of discrimination of the baseline LDA-SIFT model: for 1024 topics the performance is 68.8%, and 70.4% if we use 2048 topics.

The main emphasis of our experiments is to evaluate the contribution of additional layers to classification performance. We tested the FLDA and RLDA models (described in 4.1) in the same regime as described above.

As an initial experiment, we constrained the number of topics to 128 and trained three models: single-layer, FLDA, and RLDA. The two-layer models were trained with 128 topics on the bottom layer, corresponding to factorized SIFT descriptors, and 128 on the top. Only the top-layer topics were used for classification, so the effective total number of topics in the two-layer models was the same as for the single-layer model. We present the results in Table 2. Both two-layer models improved on the one-layer LDA-SIFT model…
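The evaluation protocol in the excerpt (3-level spatial pyramid over topic activations, max pooling, linear SVM) can be sketched as follows; per-patch activations and normalized patch centers are assumed inputs, and this is illustrative rather than the authors' pipeline.

```python
import numpy as np
from sklearn.svm import LinearSVC

def spatial_pyramid_max(acts, xy, levels=(1, 2, 4)):
    """Max-pool per-patch topic activations over a spatial pyramid.

    acts: (n_patches, n_topics) activations; xy: (n_patches, 2) patch centers
    normalized to [0, 1). Returns one concatenated feature vector per image.
    """
    feats = []
    for g in levels:
        cells = np.minimum((xy * g).astype(int), g - 1)   # grid cell index per patch
        for cx in range(g):
            for cy in range(g):
                m = (cells[:, 0] == cx) & (cells[:, 1] == cy)
                pooled = acts[m].max(axis=0) if m.any() else np.zeros(acts.shape[1])
                feats.append(pooled)
    return np.concatenate(feats)     # (1 + 4 + 16) * n_topics dimensions

# Classification with a linear SVM over the pooled features, as in the excerpt.
# X = np.stack([spatial_pyramid_max(a, p) for a, p in image_activations])  # hypothetical data
# clf = LinearSVC().fit(X, labels)
```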


Page 36: A probabilistic model for recursive factorized image features ppt

RLDA vs. single-feature state-of-the-art

• RLDA: 73.7%

• Sparse-Coding Spatial Pyramid Matching: 73.2% (Yang et al. CVPR 2009)

• SCSPM with "macrofeatures" and denser sampling: 75.7% (Boureau et al. CVPR 2010)

• Locality-constrained Linear Coding: 73.4% (Wang et al. CVPR 2010)

• Saliency sampling + NBNN: 78.5% (Kanan and Cottrell, CVPR 2010)

Page 37: A probabilistic model for recursive factorized image features ppt

Bottom and top layers

[Visualization of the learned bottom- and top-layer vocabularies for FLDA 128t/128b and RLDA 128t/128b.]

Page 38: A probabilistic model for recursive factorized image features ppt

Conclusions

• Presented a Bayesian hierarchical approach to modeling sparsely coded visual features of increasing complexity and spatial support.

• Showed the value of full inference.

Page 39: A probabilistic model for recursive factorized image features ppt

Future directions

• Extend hierarchy to object level.

• Direct discriminative component.

• Non-parametrics.

• Sparse coding + LDA.