TRANSCRIPT
Efficient Learning of Sparse Representations with an Energy-Based Model
Marc’Aurelio Ranzato, Christopher Poultney, Sumit Chopra, Yann Le Cun
Presented by Pascal Lamblin
February 14th, 2007
1 Introduction
   Pre-processors and feature extractors
   Coding and decoding
2 The Model
   Energy Function and Architecture
   The Sparsifying Logistic
3 Learning
4 Experiments
   Feature extraction
   Initialization of a Convolutional Neural Net
   Hierarchical Extension: Learning Topographic Maps
If you were expecting to read nonsense here, too bad for you
Unsupervised Learning of Representations
Methods like PCA, ICA, wavelet decompositions, ...
Usually, dimensionality is reduced
Not necessary: sparse overcomplete representations
Improved separability of classes
Better interpretation (sum of basic components)
Biological parallel (early visual areas)
Indeed, I will remain perfectly serious
Usual Architecture
output
↑ decoder
code
↑ encoder
input
Usually, an encoder and a decoder (possibly sharing parameters)
Architecture of auto-encoders, restricted Boltzmann machines, PCA, ...
Sometimes, the encoder or decoder is absent (e.g., replaced by a sampling or minimization procedure)
Here we present a model with an encoder and a decoder
I wouldn't want to risk distracting the audience
Learning Procedure
Usually (PCA, auto-encoders, ...), we minimize a reconstruction error criterion
Here, we also want sparsity in the code: another constraint
Use of a Sparsifying Logistic module, between code and decoder
Hard to learn through backprop only: optimize a global energy function, which also depends on the codes
Iterative coordinate descent optimization (like EM)
But by this point, I think some people are already lost
Notation and Components
The input: an image patch X, as a vector
The encoder: a set of linear filters, the rows of W_C
The code: a vector Z
The Sparsifying Logistic: transforms Z into Z̄
The sparse code vector: Z̄, with components in [0, 1]
The decoder: a set of reverse linear filters, the columns of W_D
So there's no harm in asking them a little riddle
Energy of the System
We want to minimize the global energy of the system, a function of the model's parameters W_C and W_D, the free parameter Z, and the input X:

E(X, Z, W_C, W_D) = E_C(X, Z, W_C) + E_D(X, Z, W_D)

Code prediction energy:

E_C(X, Z, W_C) = (1/2) ‖Z − W_C X‖²

Reconstruction energy:

E_D(X, Z̄, W_D) = (1/2) ‖X − W_D Z̄‖²

There is no hard equality constraint between Z and W_C X, nor between X and W_D Z̄:

Z ≠ W_C X,  W_D Z̄ ≠ X
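As a reading aid, here is a minimal numpy sketch of these two terms (array shapes and function name are my own assumptions; the paper gives no code):

    import numpy as np

    def energy(X, Z, Zbar, W_C, W_D):
        """Global energy E = E_C + E_D for one input patch.

        X    : input patch flattened to shape (n,)
        Z    : code vector, shape (m,), the free parameter
        Zbar : sparse code in [0, 1]^m, output of the Sparsifying Logistic
        W_C  : encoder filters, shape (m, n), one filter per row
        W_D  : decoder filters, shape (n, m), one basis vector per column
        """
        E_C = 0.5 * np.sum((Z - W_C @ X) ** 2)     # code prediction energy
        E_D = 0.5 * np.sum((X - W_D @ Zbar) ** 2)  # reconstruction energy
        return E_C + E_D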
Careful, hold on tight: what goes "Toin! Toin!"?
Cool Figure
[Figure: architecture of the energy-based model]
While you think it over, let's cut to a commercial break
In Theory
Let's consider the k-th training sample:

z̄_i(k) = η e^(β z_i(k)) / ζ_i(k),  where  ζ_i(k) = η e^(β z_i(k)) + (1 − η) ζ_i(k − 1)
Like a weighted softmax applied through time
High values of β make the values more binary
High values of η increase the "firing rate"
The Sparsifying Logistic enforces sparsity across the examples for each individual component; there is no sparsity constraint between the units of a single code.
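A hedged numpy sketch of this update, one sample at a time (the default values of η and β are placeholders, not the paper's settings):

    import numpy as np

    def sparsifying_logistic(z_k, zeta_prev, eta=0.02, beta=1.0):
        """Training-time Sparsifying Logistic for the k-th sample.

        z_k       : raw code vector z(k)
        zeta_prev : running denominator zeta(k-1), same shape as z_k
        Returns (zbar(k), zeta(k)): sparse code in [0, 1] and updated state.
        """
        num = eta * np.exp(beta * z_k)
        zeta_k = num + (1.0 - eta) * zeta_prev  # zeta(k) update
        zbar_k = num / zeta_k                   # zbar_i(k) = eta e^(beta z_i(k)) / zeta_i(k)
        return zbar_k, zeta_k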
Ragoutoutou! The stew for my doggy, mmm, I'm crazy about it!
In Practice
z̄_i(k) = η e^(β z_i(k)) / ζ_i(k)
       = 1 / (1 + ((1 − η) ζ_i(k − 1)) / (η e^(β z_i(k))))
       = [1 + ((1 − η)/η) ζ_i(k − 1) e^(−β z_i(k))]^(−1)

We learn ζ_i across the training set, then fix it
A logistic function with fixed gain and learnt bias
This version of the Sparsifying Logistic module is deterministic, and does not depend on the ordering of the samples.
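Equivalently, once ζ_i is frozen this is an ordinary logistic; a small sketch (the bias formula follows from the last line above, and the defaults are illustrative):

    import numpy as np

    def fixed_sparsifying_logistic(z, zeta_fixed, eta=0.02, beta=1.0):
        """Deterministic form: logistic with gain beta and learnt bias.

        zeta_fixed : per-unit zeta_i estimated on the training set, then frozen.
        The bias is b_i = -log(((1 - eta) / eta) * zeta_i), so that
        sigma(beta * z_i + b_i) equals the expression derived above.
        """
        bias = -np.log(((1.0 - eta) / eta) * zeta_fixed)
        return 1.0 / (1.0 + np.exp(-(beta * z + bias)))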
Let's keep things relaxed
Learning Procedure
We want to minimize:

E(W_C, W_D, Z^1, ..., Z^P) = Σ_{i=1}^{P} [ E_C(X^i, Z^i, W_C) + E_D(X^i, Z^i, W_D) ]

by the procedure:

{W_C*, W_D*} = argmin_{W_C, W_D} ( min_{Z^1, ..., Z^P} E(W_C, W_D, Z^1, ..., Z^P) )
1 Find the optimal Z^i, given W_C and W_D
2 Update the weights W_C and W_D, given the Z^i found at step 1, in order to minimize the energy
3 Iterate until convergence
Two cows are grazing in a field
Online Version
We consider only one sample X at a time. The cost to minimize is

C = E_C(X, Z, W_C) + E_D(X, Z, W_D)
1 Initialize Z with Z_init = W_C X
2 Minimize C w.r.t. Z by gradient descent, starting from Z_init
3 Compute the gradients of C w.r.t. W_C and W_D, and perform one gradient step
We iterate over all samples, until convergence.
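A compact sketch of this online loop in numpy (learning rates, step counts, initialization scale, and the η, β values are my own placeholder choices, not the paper's):

    import numpy as np

    def train_online(patches, m, n_z_steps=10, lr_z=0.1, lr_w=0.01,
                     eta=0.02, beta=1.0, seed=0):
        """One pass of the online procedure over `patches` (shape (P, n))."""
        rng = np.random.default_rng(seed)
        n = patches.shape[1]
        W_C = rng.normal(scale=0.1, size=(m, n))  # encoder filters
        W_D = rng.normal(scale=0.1, size=(n, m))  # decoder filters
        zeta = np.ones(m)                         # Sparsifying Logistic state

        for X in patches:
            # 1. initialize the code with the encoder prediction
            Z = W_C @ X
            # 2. minimize C = E_C + E_D w.r.t. Z by gradient descent
            for _ in range(n_z_steps):
                num = eta * np.exp(beta * Z)
                zbar = num / (num + (1.0 - eta) * zeta)   # sparse code
                grad_ED = -W_D.T @ (X - W_D @ zbar)       # dE_D / dzbar
                dzbar_dZ = beta * zbar * (1.0 - zbar)     # logistic derivative (zeta held fixed)
                Z -= lr_z * ((Z - W_C @ X) + grad_ED * dzbar_dZ)
            # update the running denominator with the optimized code
            num = eta * np.exp(beta * Z)
            zeta = num + (1.0 - eta) * zeta
            zbar = num / zeta
            # 3. one gradient step on the weights, Z held fixed
            W_C += lr_w * np.outer(Z - W_C @ X, X)        # decreases E_C
            W_D += lr_w * np.outer(X - W_D @ zbar, zbar)  # decreases E_D
        return W_C, W_D, zeta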
One says to the other: "Doesn't all this mad cow business worry you?"
So, What Happens?
Only a few steps of gradient descent are necessary to minimize over Z
At the end of the process, even Z_init = W_C X is accurate enough
So E_C(X, Z, W_C) = (1/2) ‖Z − W_C X‖² is minimized
The reconstruction errors from Z_init are also low
So E_D(X, Z̄, W_D) = (1/2) ‖X − W_D Z̄‖² is also minimized
The minimization procedure manages to minimize both energy terms
Imposing the hard constraint W_C X = Z does not work, because of the saturation of the sparsifying module
An L1 penalty term is added to W_C, and an L2 penalty term to W_D
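As a sketch, such penalty terms would contribute the following extra step to each weight update (the λ values and the function name are hypothetical, not from the paper):

    import numpy as np

    def penalty_step(W_C, W_D, lr=0.01, lam1=1e-4, lam2=1e-4):
        """Extra gradient step for the regularizers (hypothetical lambdas).

        L1 on the encoder: subgradient of lam1 * sum(|W_C|).
        L2 on the decoder: gradient of (lam2 / 2) * ||W_D||_F^2.
        """
        W_C -= lr * lam1 * np.sign(W_C)
        W_D -= lr * lam2 * W_D
        return W_C, W_D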
And the other replies: "Not at all, can't you see I'm a rabbit!"
Natural Image Patches
12×12 patches from the Berkeley segmentation data set
Codes of length 200
30 minutes on a 2 GHz processor for 200 filters on 100,000 12×12 patches
[Figure: filters learnt by the decoder]
Spatially localized filters, similar to Gabor wavelets, like the receptive fields of V1 neurons
W_C and W_Dᵀ end up very close after the optimization
Here you can pretend to take an interest in the talk: there are pictures
On MNIST Digit Recognition Data Set
Input is the whole 28×28 image (not a patch)
Codes of length 196
[Figure: some encoder filters, and an example digit reconstructed as a weighted sum of parts (coefficients 1 and 0.8)]
Stroke detectors are learnt
Reconstruction: sum of a few “parts”
More pictures, to help you hold on until the end
On MNIST
Train filters on 5×5 image patches
Codes of length 50
Initialize a network with 50 features on layers 1 and 2, 50 on layers 3 and 4, 200 on layer 5, and 10 output units.
Misclassification    Random init.    Pre-training
No distortions       0.70%           0.60%
Distortions          0.49%           0.39%
All right, the suspense has gone on long enough
Natural Image Patches
12×12 patches from the Berkeley segmentation data set
Codes of length 400
Close filters learn similar weights

[Figure: hierarchical architecture — the input X goes through the encoder W_c (code prediction energy E_c, Euclidean distance) and the decoder W_d (reconstruction energy E_d, Euclidean distance); the code Z passes through the Sparsifying Logistic, and a convolution with a fixed kernel K links code level 1 to code level 2]

K =
  0.08  0.12  0.08
  0.12  0.23  0.12
  0.08  0.12  0.08
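The kernel K above couples neighbouring code units; a hedged sketch of that step (the grid shape and wrap-around borders are my assumptions, and scipy is used for the convolution):

    import numpy as np
    from scipy.signal import convolve2d

    # fixed 3x3 neighbourhood kernel from the slide
    K = np.array([[0.08, 0.12, 0.08],
                  [0.12, 0.23, 0.12],
                  [0.08, 0.12, 0.08]])

    def neighbourhood_pool(code_map):
        """Spread each unit's activity over its neighbours with K.

        code_map : codes laid out on a 2-D grid, e.g. (20, 20) for 400 units.
        Feeding the pooled map to the Sparsifying Logistic couples nearby
        units, so that close filters learn similar weights.
        """
        return convolve2d(code_map, K, mode="same", boundary="wrap")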
So the answer is...
Conclusion
Energy-based model for unsupervised learning of sparse overcomplete representations
Fast and accurate processing after learning
Sparsification of each unit across the dataset seems easier than sparsification of each example across the code units
Can be extended to a non-linear encoder and decoder
The sparse code can be used as input to another feature extractor
A "tanard"! (a duck with a cold) >◦ /
Questions?
Hopefully not...
Thank you for your attention