fcv bio cv_cottrell

Unsupervised learning of visual representations and their use in object & face recognition. Gary Cottrell, Chris Kanan, Honghao Shan, Lingyun Zhang, Matthew Tong, Tim Marks.

Upload: zukun

Post on 14-Jul-2015


TRANSCRIPT

Page 1: Fcv bio cv_cottrell

Unsupervised learning of visual representations and their use in object & face recognition

Gary Cottrell

Chris Kanan, Honghao Shan, Lingyun Zhang, Matthew Tong, Tim Marks

Page 2: Fcv bio cv_cottrell

Collaborators

Honghao Shan, Chris Kanan

Page 3: Fcv bio cv_cottrell

Collaborators

Tim Marks, Matt Tong, Lingyun Zhang

Page 4: Fcv bio cv_cottrell

Efficient Encoding of the World

Sparse Principal Components Analysis: a model of unsupervised learning for early perceptual processing (Honghao Shan)

The model embodies three constraints:

• Keep as much information as possible
• While trying to equalize the neural responses
• And minimizing the connections.
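The three constraints can be read as three terms of a single loss function. The toy numpy sketch below is illustrative only, not Shan's actual SPCA: the exact penalty forms and their weights are assumptions. Reconstruction error keeps information, a variance-equalization penalty evens out the responses, and an L1 penalty on the weights prunes connections.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 "patches" of 16 pixels (rows = samples), zero-mean.
X = rng.standard_normal((200, 16))
X -= X.mean(axis=0)

n_units = 8                                   # number of model neurons
A = rng.standard_normal((16, n_units)) * 0.1  # connection weights
lam, eta = 0.1, 0.01                          # sparsity weight, learning rate

def loss(A):
    S = X @ A                            # neural responses
    recon = S @ A.T                      # linear reconstruction
    err = np.mean((X - recon) ** 2)      # (1) keep information
    equal = np.var(S.var(axis=0))        # (2) equalize response variances
    sparse = lam * np.abs(A).mean()      # (3) minimize connections
    return err + equal + sparse

def grad(A, eps=1e-4):
    """Finite-difference gradient; a real implementation would be analytic."""
    g = np.zeros_like(A)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            Ap = A.copy(); Ap[i, j] += eps
            Am = A.copy(); Am[i, j] -= eps
            g[i, j] = (loss(Ap) - loss(Am)) / (2 * eps)
    return g

l0 = loss(A)
for _ in range(30):
    A -= eta * grad(A)   # plain gradient descent on the combined loss
print(loss(A) < l0)      # True: the combined loss decreases
```

In practice one would use analytic gradients or a proper solver; the point is only how the three stated constraints combine into one objective.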

Page 5: Fcv bio cv_cottrell

Efficient Encoding of the world leads to magno- and parvo-cellular response properties…

[Figure: filters learned from grayscale images, color images, and video cubes; axes show spatial extent vs. temporal extent. Persistent, small filters resemble midget cells; transient, large filters resemble parasol cells.]

This suggests that these cell types exist because they are useful for efficiently encoding the temporal dynamics of the world.

Page 6: Fcv bio cv_cottrell

Efficient Encoding of the world leads to gammatone filters as in auditory nerves: using exactly the same algorithm, applied to speech, environmental sounds, etc.:

Page 7: Fcv bio cv_cottrell

Efficient Encoding of the world

A single unsupervised learning algorithm leads to:

• Model cells with properties similar to those found in the retina, when applied to natural videos
• Model cells with properties similar to those found in the auditory nerve, when applied to natural sounds

One small step towards a unified theory of temporal processing.

Page 8: Fcv bio cv_cottrell

Unsupervised Learning of Hierarchical Representations (RICA 2.0; cf. Shan et al., NIPS 19)

Recursive ICA (RICA 1.0; Shan et al., 2008): alternately compress and expand the representation using PCA and ICA.

• ICA was modified by a component-wise nonlinearity.
• Receptive fields expand at each ICA layer.
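The compress-expand alternation can be sketched in a few lines of numpy. This is a schematic, not the published implementation: the minimal FastICA below is complete rather than overcomplete (so "expansion" here just means re-representing the compressed code), and the absolute value stands in for the gaussianizing nonlinearity described on the following slides.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 64))   # stand-in for image patches

def pca_compress(X, k):
    """Compress to the top-k principal components, whitened."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T / s[:k] * np.sqrt(len(X))

def fastica(Z, n_iter=100):
    """Minimal symmetric FastICA with a tanh nonlinearity (whitened input)."""
    n = Z.shape[1]
    W = np.linalg.qr(rng.standard_normal((n, n)))[0]   # random orthogonal init
    for _ in range(n_iter):
        G = np.tanh(Z @ W)
        W_new = Z.T @ G / len(Z) - W * (1 - G ** 2).mean(axis=0)
        U, _, Vt = np.linalg.svd(W_new)
        W = U @ Vt                                     # symmetric decorrelation
    return Z @ W

def rica_layer(X, k):
    Z = pca_compress(X, k)   # compress with PCA
    S = fastica(Z)           # re-expand with ICA
    return np.abs(S)         # placeholder for the component-wise nonlinearity

layer1 = rica_layer(X, 32)   # receptive fields grow as layers stack
layer2 = rica_layer(layer1, 16)
print(layer2.shape)          # (500, 16)
```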

Page 9: Fcv bio cv_cottrell

Unsupervised Learning of Hierarchical Representations (RICA 2.0; cf. Shan et al., NIPS 19)

ICA was modified by a component-wise nonlinearity. Think of ICA as a generative model: each pixel is the sum of many independent random variables, and is therefore approximately Gaussian.

Hence ICA prefers its inputs to be Gaussian-distributed.

We apply an inverse cumulative Gaussian to the absolute value of the ICA components to "gaussianize" them.
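A concrete sketch of this gaussianizing step, using only the standard library and numpy. The empirical CDF here is a stand-in assumption for whatever distribution the model actually fits to the ICA responses:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)
s = rng.laplace(size=10000)    # ICA outputs are sparse / heavy-tailed

a = np.abs(s)                  # discard sign: keep response strength
# Empirical CDF of |s|, mapped through the inverse Gaussian CDF:
ranks = a.argsort().argsort()          # 0 .. n-1
u = (ranks + 0.5) / len(a)             # strictly inside (0, 1)
g = np.array([NormalDist().inv_cdf(p) for p in u])

# Strong responses (large |s|) land in the positive tail,
# weak ones in the negative tail; the result is standard normal.
print(round(g.mean(), 2), round(g.std(), 2))   # 0.0 1.0
```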

Page 10: Fcv bio cv_cottrell

Unsupervised Learning of Hierarchical Representations (RICA 2.0; cf. Shan et al., NIPS 19)

Strong responses, either positive or negative, are mapped to the positive tail of the Gaussian; weak ones, to the negative tail; ambiguous ones, to the center.

Page 11: Fcv bio cv_cottrell

Unsupervised Learning of Hierarchical Representations (RICA 2.0; cf. Shan et al., NIPS 19)

RICA 2.0: replace PCA by SPCA.

Page 12: Fcv bio cv_cottrell

Unsupervised Learning of Hierarchical Representations (RICA 2.0; cf. Shan et al., NIPS 19)

RICA 2.0 results: a multiple-layer system with

• Center-surround receptive fields at the first layer
• Simple edge filters at the second (ICA) layer
• Spatial pooling of orientations at the third (SPCA) layer
• V2-like response properties at the fourth (ICA) layer

Page 13: Fcv bio cv_cottrell

Unsupervised Learning of Hierarchical Representations (RICA 2.0; cf. Shan et al., NIPS 19)

V2-like response properties at the fourth (ICA) layer.

These maps show strengths of connections to layer-1 ICA filters. Warm and cold colors are strong +/- connections, gray is weak connections, and orientation corresponds to layer-1 orientation.

The left-most column displays two model neurons that show uniform orientation preference for layer-1 ICA features.

The middle column displays model neurons that have non-uniform/varying orientation preference for layer-1 ICA features.

The right column displays two model neurons that have location preference, but no orientation preference, for layer-1 ICA features.

The left two columns are consistent with Anzai, Peng, & Van Essen (2007). The right-hand column is a prediction.

Page 14: Fcv bio cv_cottrell

Unsupervised Learning of Hierarchical Representations (RICA 2.0; cf. Shan et al., NIPS 19)

Dimensionality reduction & expansion might be a general strategy of information processing in the brain. The first step removes noise and reduces complexity; the second step captures the statistical structure.

We showed that retinal ganglion cells and V1 complex cells may be derived from the same learning algorithm, applied to pixels in one case and to V1 simple cell outputs in the other.

This highly simplified model of early vision is the first one that learns the RFs of all early visual layers using a consistent theory: the efficient coding theory.

We believe it could serve as a basis for more sophisticated models of early vision.

An obvious next step is to train and thus make predictions about higher layers.

Page 15: Fcv bio cv_cottrell

Nice, but is it useful?

We showed in Shan & Cottrell (CVPR 2008) that we could achieve state-of-the-art face recognition with the non-linear ICA features and a simple softmax output.

We showed in Kanan & Cottrell (CVPR 2010) that we could achieve state-of-the-art face and object recognition with a system that used an ICA-based salience map, simulated fixations, non-linear ICA features, and a kernel-density memory.

Here I briefly describe the latter.

Page 16: Fcv bio cv_cottrell

One reason why this might be a good idea…

Our attention is automatically drawn to interesting regions in images.

Our salience algorithm is automatically drawn to interesting regions in images.

These are useful locations for discriminating one object (face, butterfly) from another.

Page 17: Fcv bio cv_cottrell

Main Idea

Training phase (learning object appearances):

Use the salience map to decide where to look. (We use the ICA salience map.)

Memorize these samples of the image, with labels (Bob, Carol, Ted, or Alice). (We store the (compressed) ICA feature values.)

Page 18: Fcv bio cv_cottrell

Main Idea

Testing phase (recognizing objects we have learned):

Now, given a new face, use the salience map to decide where to look.

Compare new image samples to stored ones; the closest ones in memory get to vote for their label.

Page 19: Fcv bio cv_cottrell

[Figure: stored memories of Bob, stored memories of Alice, and new fragments.]

Result: 7 votes for Alice, only 3 for Bob. It's Alice!

Page 20: Fcv bio cv_cottrell

Voting

The voting process is based on Bayesian updating (with naïve Bayes).

The size of the vote depends on the distance from the stored sample, using kernel density estimation.

Hence NIMBLE: NIM with Bayesian Likelihood Estimation.
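A minimal sketch of this voting scheme. The feature dimensions, kernel bandwidth, and two-class memory below are hypothetical stand-ins; NIMBLE's actual fragments are ICA features and its density estimator is more refined. Each fixation contributes a log kernel-density "vote" per class, accumulated naïve-Bayes style:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stored memory: labeled feature fragments from training fixations.
mem = {
    "Alice": rng.normal(loc=0.0, scale=1.0, size=(30, 8)),
    "Bob":   rng.normal(loc=2.0, scale=1.0, size=(30, 8)),
}

def log_kde(f, samples, h=1.0):
    """Log kernel-density estimate of P(fixation | class)."""
    d2 = ((samples - f) ** 2).sum(axis=1)
    w = np.exp(-d2 / (2 * h * h))      # closer memories vote harder
    return np.log(w.mean() + 1e-300)   # guard against log(0)

def classify(fixations):
    """Naive-Bayes (sum of log-likelihoods) over a sequence of fixations."""
    logp = {k: 0.0 for k in mem}       # uniform prior over classes
    for f in fixations:
        for k, samples in mem.items():
            logp[k] += log_kde(f, samples)
    return max(logp, key=logp.get)

# Ten simulated fixations drawn from "Alice"-like features:
test_fix = rng.normal(loc=0.0, scale=1.0, size=(10, 8))
print(classify(test_fix))   # Alice
```

Because the vote is a log-likelihood rather than a hard nearest-neighbor count, distant memories contribute almost nothing while close ones dominate, which is the kernel-density weighting the slide describes.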

Page 21: Fcv bio cv_cottrell

Overview of the system

The ICA features do double duty:

• They are combined to make the salience map, which is used to decide where to look.
• They are stored to represent the object at that location.

Page 22: Fcv bio cv_cottrell

NIMBLE vs. Computer Vision

Compare this to (most, not all!) computer vision systems: one pass over the image, and global features.

Image → Global Features → Global Classifier → Decision

Page 23: Fcv bio cv_cottrell


Page 24: Fcv bio cv_cottrell

[Figure: belief after 1 fixation vs. belief after 10 fixations.]

Page 25: Fcv bio cv_cottrell

Robust Vision

Human vision works in multiple environments; our basic features (neurons!) don't change from one problem to the next.

We tune our parameters so that the system works well on Bird and Butterfly datasets, and then apply the system unchanged to faces, flowers, and objects.

This is very different from standard computer vision systems, which are (usually) tuned to a particular domain.

Page 26: Fcv bio cv_cottrell

Caltech 101: 101 different categories

AR dataset: 120 different people, with varying lighting, expression, and accessories

Page 27: Fcv bio cv_cottrell

Flowers: 102 different flower species

Page 28: Fcv bio cv_cottrell

~7 fixations required to achieve at least 90% of maximum performance

Page 29: Fcv bio cv_cottrell

So, we created a simple cognitive model that uses simulated fixations to recognize things. But it isn't that complicated.

How does it compare to approaches in computer vision?

Page 30: Fcv bio cv_cottrell

Caveats:

• As of mid-2010.
• Only comparing to single-feature-type approaches (no "Multiple Kernel Learning" (MKL) approaches).
• Still superior to MKL with very few training examples per category.

Page 31: Fcv bio cv_cottrell

[Figure: performance vs. number of training examples (1, 5, 15, 30).]

Page 32: Fcv bio cv_cottrell

[Figure: performance vs. number of training examples (1, 2, 3, 6, 8).]

Page 33: Fcv bio cv_cottrell


Page 34: Fcv bio cv_cottrell

Again, best among single-feature-type systems; and with 1 training instance, better than all systems.

Page 35: Fcv bio cv_cottrell

More neurally and behaviorally relevant gaze control and fixation integration: people don't randomly sample images.

A foveated retina.

Comparison with human eye movement data during recognition/classification of faces, objects, etc.

Page 36: Fcv bio cv_cottrell

A biologically-inspired, fixation-based approach can work well for image classification.

Fixation-based models can match, and even exceed, some of the best models in computer vision.

…Especially when you don't have a lot of training images.

Page 37: Fcv bio cv_cottrell

Software and paper available at www.chriskanan.com

For more details, email: [email protected]

This work was supported by the NSF (grant #SBE-0542013) to the Temporal Dynamics of Learning Center.

Page 38: Fcv bio cv_cottrell

Thanks!

Page 39: Fcv bio cv_cottrell

Sparse Principal Components Analysis

We minimize:

Subject to the following constraint:
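The two equations on this slide were images and did not survive into the transcript. As a hedged sketch only (the exact objective and constraint in the SPCA model may differ), a sparse-PCA-style formulation consistent with the three constraints from slide 4 would be:

```latex
% Hypothetical reconstruction, not the slide's actual equations.
\min_{A}\;\; \mathbb{E}\!\left[\,\lVert \mathbf{x} - A\mathbf{s} \rVert^{2}\,\right]
\;+\; \lambda \sum_{i,j} \lvert A_{ij} \rvert,
\qquad \mathbf{s} = A^{\mathsf T}\mathbf{x},
\qquad \text{subject to } \operatorname{Var}(s_i) = \operatorname{Var}(s_j)\ \ \forall\, i,j.
```

Here the \(\lambda\) term plays the connection-minimizing role that slides 42 and 43 attribute to \(\lambda\).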

Page 40: Fcv bio cv_cottrell

The SPCA model as a neural net…

It is Aᵀ that is mostly 0…

Page 41: Fcv bio cv_cottrell

Results

…suggesting the 1/f power spectrum of images is where this is coming from…

Page 42: Fcv bio cv_cottrell

Results

The role of λ: recall this reduces the number of connections…

Page 43: Fcv bio cv_cottrell

Results

The role of λ: higher λ means fewer connections, which alters the contrast sensitivity function (CSF).

This matches recent data on malnourished kids and their CSFs: lower sensitivity at low spatial frequencies, but slightly better at high frequencies than normal controls…

Page 44: Fcv bio cv_cottrell

NIMBLE represents its beliefs using probability distributions.

Simple nearest-neighbor density estimation is used to estimate:

P(fixation_t | Category = k)

Evidence is combined across fixations over time using Bayesian updating.

Page 45: Fcv bio cv_cottrell
