TRANSCRIPT
Efficient Learning of Sparse Representations with an Energy-Based Model
Marc’Aurelio Ranzato, Christopher Poultney, Sumit Chopra, Yann Le Cun
Presented by Pascal Lamblin
February 14th, 2007
1 Introduction
   Pre-processors and feature extractors
   Coding and decoding
2 The Model
   Energy Function and Architecture
   The Sparsifying Logistic
3 Learning
4 Experiments
   Feature extraction
   Initialization of a Convolutional Neural Net
   Hierarchical Extension: Learning Topographic Maps
If you were expecting to read nonsense here, too bad for you
Unsupervised Learning of Representations
Methods like PCA, ICA, wavelet decompositions, ...
Usually, dimensionality is reduced
Not necessary: sparse overcomplete representations
Improved separability of classes
Better interpretation (sum of basic components)
Biological parallel (early visual areas)
Indeed, I will remain perfectly serious
Usual Architecture
output
↑ decoder
code
↑ encoder
input
Usually, an encoder and a decoder (possibly sharing parameters)
Architecture of auto-encoders, restricted Boltzmann machines, PCA, ...
Sometimes, the encoder or decoder is absent (e.g., replaced by a sampling or minimization procedure)
Here we present a model with an encoder and a decoder
I wouldn't want to risk distracting the audience
Learning Procedure
Usually (PCA, auto-encoders, ...), we minimize a reconstruction error criterion
Here, we also want sparsity in the code: another constraint
Use of a Sparsifying Logistic module, between code and decoder
Hard to learn through backprop only: optimize a global energy function, which also depends on the codes
Iterative coordinate descent optimization (like EM)
But by this point, I think some people are already lost
Notation and Components
The input: an image patch X, as a vector
The encoder: a set of linear filters, the rows of W_C
The code: a vector Z
The Sparsifying Logistic: transforms Z into Z̄
The sparse code vector: Z̄, with components in [0, 1]
The decoder: a set of reverse linear filters, the columns of W_D
So there's no harm in asking them a little riddle
Energy of the System
We want to minimize the global energy of the system, a function of the model's parameters W_C and W_D, the free parameter Z, and the input X:

E(X, Z, W_C, W_D) = E_C(X, Z, W_C) + E_D(X, Z, W_D)

Code prediction energy:

E_C(X, Z, W_C) = (1/2) ‖Z − W_C X‖²

Reconstruction energy:

E_D(X, Z̄, W_D) = (1/2) ‖X − W_D Z̄‖²

There is no hard equality constraint between Z and W_C X, nor between X and W_D Z̄:

Z ≠ W_C X,  W_D Z̄ ≠ X
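As a reading aid, here is a minimal numpy sketch of these two terms (array shapes and function name are my own assumptions; the paper gives no code):

    import numpy as np

    def energy(X, Z, Zbar, W_C, W_D):
        """Global energy E = E_C + E_D for one input patch.

        X    : input patch flattened to shape (n,)
        Z    : code vector, shape (m,), the free parameter
        Zbar : sparse code in [0, 1]^m, output of the Sparsifying Logistic
        W_C  : encoder filters, shape (m, n), one filter per row
        W_D  : decoder filters, shape (n, m), one basis vector per column
        """
        E_C = 0.5 * np.sum((Z - W_C @ X) ** 2)     # code prediction energy
        E_D = 0.5 * np.sum((X - W_D @ Zbar) ** 2)  # reconstruction energy
        return E_C + E_D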
Careful, hold on tight: what goes "Toin! Toin!"?
Cool Figure
[Figure: architecture of the energy-based model]
While you think it over, let's cut to a commercial break
In Theory
Let's consider the k-th training sample:

z̄_i(k) = η e^(β z_i(k)) / ζ_i(k),  where  ζ_i(k) = η e^(β z_i(k)) + (1 − η) ζ_i(k − 1)
Like a weighted softmax applied through time
High values of β make the values more binary
High values of η increase the "firing rate"
The Sparsifying Logistic enforces sparsity across the examples for each individual component; there is no sparsity constraint between the units of a single code.
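A hedged numpy sketch of this update, one sample at a time (the default values of η and β are placeholders, not the paper's settings):

    import numpy as np

    def sparsifying_logistic(z_k, zeta_prev, eta=0.02, beta=1.0):
        """Training-time Sparsifying Logistic for the k-th sample.

        z_k       : raw code vector z(k)
        zeta_prev : running denominator zeta(k-1), same shape as z_k
        Returns (zbar(k), zeta(k)): sparse code in [0, 1] and updated state.
        """
        num = eta * np.exp(beta * z_k)
        zeta_k = num + (1.0 - eta) * zeta_prev  # zeta(k) update
        zbar_k = num / zeta_k                   # zbar_i(k) = eta e^(beta z_i(k)) / zeta_i(k)
        return zbar_k, zeta_k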
Ragoutoutou! The stew for my doggy, mmm, I'm crazy about it!
In Practice
z̄_i(k) = η e^(β z_i(k)) / ζ_i(k)
       = 1 / (1 + ((1 − η) ζ_i(k − 1)) / (η e^(β z_i(k))))
       = [1 + ((1 − η)/η) ζ_i(k − 1) e^(−β z_i(k))]^(−1)

We learn ζ_i across the training set, then fix it
A logistic function with fixed gain and learnt bias
This version of the Sparsifying Logistic module is deterministic, and does not depend on the ordering of the samples.
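Equivalently, once ζ_i is frozen this is an ordinary logistic; a small sketch (the bias formula follows from the last line above, and the defaults are illustrative):

    import numpy as np

    def fixed_sparsifying_logistic(z, zeta_fixed, eta=0.02, beta=1.0):
        """Deterministic form: logistic with gain beta and learnt bias.

        zeta_fixed : per-unit zeta_i estimated on the training set, then frozen.
        The bias is b_i = -log(((1 - eta) / eta) * zeta_i), so that
        sigma(beta * z_i + b_i) equals the expression derived above.
        """
        bias = -np.log(((1.0 - eta) / eta) * zeta_fixed)
        return 1.0 / (1.0 + np.exp(-(beta * z + bias)))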
Let's keep things relaxed
Learning Procedure
We want to minimize:

E(W_C, W_D, Z^1, ..., Z^P) = Σ_{i=1}^{P} [ E_C(X^i, Z^i, W_C) + E_D(X^i, Z^i, W_D) ]

by the procedure:

{W_C*, W_D*} = argmin_{W_C, W_D} ( min_{Z^1, ..., Z^P} E(W_C, W_D, Z^1, ..., Z^P) )
1 Find the optimal Z^i, given W_C and W_D
2 Update the weights W_C and W_D, given the Z^i found at step 1, in order to minimize the energy
3 Iterate until convergence
Two cows are grazing in a field
Online Version
We consider only one sample X at a time. The cost to minimize is

C = E_C(X, Z, W_C) + E_D(X, Z, W_D)
1 Initialize Z with Z_init = W_C X
2 Minimize C w.r.t. Z by gradient descent, starting from Z_init
3 Compute the gradients of C w.r.t. W_C and W_D, and perform one gradient step
We iterate over all samples, until convergence.
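A compact sketch of this online loop in numpy (learning rates, step counts, initialization scale, and the η, β values are my own placeholder choices, not the paper's):

    import numpy as np

    def train_online(patches, m, n_z_steps=10, lr_z=0.1, lr_w=0.01,
                     eta=0.02, beta=1.0, seed=0):
        """One pass of the online procedure over `patches` (shape (P, n))."""
        rng = np.random.default_rng(seed)
        n = patches.shape[1]
        W_C = rng.normal(scale=0.1, size=(m, n))  # encoder filters
        W_D = rng.normal(scale=0.1, size=(n, m))  # decoder filters
        zeta = np.ones(m)                         # Sparsifying Logistic state

        for X in patches:
            # 1. initialize the code with the encoder prediction
            Z = W_C @ X
            # 2. minimize C = E_C + E_D w.r.t. Z by gradient descent
            for _ in range(n_z_steps):
                num = eta * np.exp(beta * Z)
                zbar = num / (num + (1.0 - eta) * zeta)   # sparse code
                grad_ED = -W_D.T @ (X - W_D @ zbar)       # dE_D / dzbar
                dzbar_dZ = beta * zbar * (1.0 - zbar)     # logistic derivative (zeta held fixed)
                Z -= lr_z * ((Z - W_C @ X) + grad_ED * dzbar_dZ)
            # update the running denominator with the optimized code
            num = eta * np.exp(beta * Z)
            zeta = num + (1.0 - eta) * zeta
            zbar = num / zeta
            # 3. one gradient step on the weights, Z held fixed
            W_C += lr_w * np.outer(Z - W_C @ X, X)        # decreases E_C
            W_D += lr_w * np.outer(X - W_D @ zbar, zbar)  # decreases E_D
        return W_C, W_D, zeta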
One says to the other: "Doesn't all this mad cow business worry you?"
So, What Happens?
Only a few steps of gradient descent are necessary to minimize over Z
At the end of the process, even Z_init = W_C X is accurate enough
So E_C(X, Z, W_C) = (1/2) ‖Z − W_C X‖² is minimized
The reconstruction errors from Z_init are also low
So E_D(X, Z̄, W_D) = (1/2) ‖X − W_D Z̄‖² is also minimized
The minimization procedure manages to minimize both energy terms
Imposing the hard constraint W_C X = Z does not work, because of the saturation of the sparsifying module
An L1 penalty term is added to W_C, and an L2 penalty term to W_D
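As a sketch, such penalty terms would contribute the following extra step to each weight update (the λ values and the function name are hypothetical, not from the paper):

    import numpy as np

    def penalty_step(W_C, W_D, lr=0.01, lam1=1e-4, lam2=1e-4):
        """Extra gradient step for the regularizers (hypothetical lambdas).

        L1 on the encoder: subgradient of lam1 * sum(|W_C|).
        L2 on the decoder: gradient of (lam2 / 2) * ||W_D||_F^2.
        """
        W_C -= lr * lam1 * np.sign(W_C)
        W_D -= lr * lam2 * W_D
        return W_C, W_D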
And the other replies: "Not at all, can't you see I'm a rabbit!"
Natural Image Patches
12×12 patches from the Berkeley segmentation data set
Codes of length 200
30 minutes on a 2 GHz processor for 200 filters on 100,000 12×12 patches
[Figure: filters learnt by the decoder]
Spatially localized filters, similar to Gabor wavelets, like the receptive fields of V1 neurons
W_C and W_Dᵀ end up very close after the optimization
Here you can pretend to take an interest in the talk: there are pictures
On MNIST Digit Recognition Data Set
Input is the whole 28×28 image (not a patch)
Codes of length 196
[Figure: some encoder filters, and an example digit reconstructed as a weighted sum of parts (coefficients 1 and 0.8)]
Stroke detectors are learnt
Reconstruction: sum of a few “parts”
More pictures, to help you hold on until the end
On MNIST
Train filters on 5×5 image patches
Codes of length 50
Initialize a network with 50 features on layers 1 and 2, 50 on layers 3 and 4, 200 on layer 5, and 10 output units.
Misclassification    Random init.    Pre-training
No distortions       0.70%           0.60%
Distortions          0.49%           0.39%
All right, the suspense has gone on long enough
Natural Image Patches
12×12 patches from the Berkeley segmentation data set
Codes of length 400
Close filters learn similar weights

[Figure: hierarchical architecture — the input X goes through the encoder W_c (code prediction energy E_c, Euclidean distance) and the decoder W_d (reconstruction energy E_d, Euclidean distance); the code Z passes through the Sparsifying Logistic, and a convolution with a fixed kernel K links code level 1 to code level 2]

K =
  0.08  0.12  0.08
  0.12  0.23  0.12
  0.08  0.12  0.08
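The kernel K above couples neighbouring code units; a hedged sketch of that step (the grid shape and wrap-around borders are my assumptions, and scipy is used for the convolution):

    import numpy as np
    from scipy.signal import convolve2d

    # fixed 3x3 neighbourhood kernel from the slide
    K = np.array([[0.08, 0.12, 0.08],
                  [0.12, 0.23, 0.12],
                  [0.08, 0.12, 0.08]])

    def neighbourhood_pool(code_map):
        """Spread each unit's activity over its neighbours with K.

        code_map : codes laid out on a 2-D grid, e.g. (20, 20) for 400 units.
        Feeding the pooled map to the Sparsifying Logistic couples nearby
        units, so that close filters learn similar weights.
        """
        return convolve2d(code_map, K, mode="same", boundary="wrap")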
So the answer is...
Conclusion
Energy-based model for unsupervised learning of sparse overcomplete representations
Fast and accurate processing after learning
Sparsification of each unit across the dataset seems easier than sparsification of each example across the code units
Can be extended to a non-linear encoder and decoder
The sparse code can be used as input to another feature extractor
A "tanard"! (a duck with a cold) >◦ /
Questions?
Hopefully not...
Thank you for your attention