
Deep Neural Networks and Representation Learning (Réseaux de Neurones Profonds, Apprentissage de Représentations)

Thierry Artières

ECM, LIF-AMU

January 15, 2018


1 Main architectures
  Dense Architectures
  Autoencoders
  Convolutional NNs

2 Learning
  SGD
  Optimization variants
  Learning DNNs

3 Deep architectures
  Very deep Models
  What makes DNN work?


Outline

1 Main architectures
  Dense Architectures
  Autoencoders
  Convolutional NNs

2 Learning

3 Deep architectures


Dense Architectures

Dense architecture


Autoencoders


Principal Component Analysis
  Standard unsupervised (linear) data analysis technique
  Visualization, dimension reduction
  Aims at finding the principal axes of a dataset

Autoencoder: NN with a Diabolo shape (a minimal sketch follows)
  Reconstructs the input at the output via an intermediate (small) layer
  Unsupervised learning
  Non-linear projection, distributed representation
  The hidden layer may be larger than the input/output layers
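To make the Diabolo idea concrete, here is a minimal single-hidden-layer autoencoder sketch using Keras (one of the libraries listed at the end of the deck); the layer sizes, activations and the random placeholder data are illustrative assumptions, not values from the slides.

```python
# Minimal autoencoder sketch: encode the input into a small code, then reconstruct it.
# Sizes, activations and data are illustrative assumptions.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

d_in, d_code = 784, 32                                   # e.g. flattened 28x28 images -> 32-d code
autoencoder = keras.Sequential([
    keras.Input(shape=(d_in,)),
    layers.Dense(d_code, activation="relu", name="code"),  # encoder (bottleneck)
    layers.Dense(d_in, activation="sigmoid"),               # decoder
])
autoencoder.compile(optimizer="adam", loss="mse")

X = np.random.rand(1000, d_in).astype("float32")         # placeholder unlabeled data
autoencoder.fit(X, X, epochs=5, batch_size=64)           # target = input: unsupervised
```

The learned non-linear projection of a sample is read from the `code` layer.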


Deep autoencoders

Deep NN with a Diabolo shape
  Extension of autoencoders (figure from [Hinton et al., Science 2006])
  Pioneering work that started the Deep Learning wave


Convolutional NNs

Convolutional layer

Motivation: exploit structure in the data
  Images: spatial structure
  Text, audio: temporal structure
  Videos: spatio-temporal structure

Fully connected layers vs locally connected layers

(figures from [LeCun and Ranzato, DL Tutorial, 2015])
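As a sketch of what "locally connected with shared weights" means, the loop below computes one feature map by sliding a single small filter over an image; the image size and the hand-written edge filter are illustrative assumptions.

```python
# One convolutional feature map: the same small filter is applied at every
# spatial position (shared, local weights), unlike a fully connected layer.
import numpy as np

def conv2d_valid(image, kernel):
    """Single-channel 2D cross-correlation, stride 1, no padding."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

image = np.random.rand(28, 28)               # e.g. a grayscale digit (placeholder)
edge_filter = np.array([[1., 0., -1.],       # a hand-crafted vertical-edge filter
                        [1., 0., -1.],
                        [1., 0., -1.]])
feature_map = conv2d_valid(image, edge_filter)   # shape (26, 26)
```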

Convolution layer

(figures: the convolution operation, and an example of a filter)


Use of multiple maps

(figure from [LeCun and Ranzato, DL Tutorial, 2015])

Aggregation layers
  Subsampling layers with an aggregation operator
  Max pooling → brings robustness to small translations (see the sketch below)
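A minimal sketch of the aggregation (max-pooling) operator, assuming a 2×2 window with stride 2 on a single feature map; real frameworks provide this as a built-in layer.

```python
# Max pooling: keep only the strongest activation in each 2x2 block,
# halving the spatial resolution of the feature map.
import numpy as np

def max_pool2x2(feature_map):
    H, W = feature_map.shape
    H2, W2 = H // 2, W // 2
    blocks = feature_map[:H2 * 2, :W2 * 2].reshape(H2, 2, W2, 2)
    return blocks.max(axis=(1, 3))

fmap = np.random.rand(26, 26)
print(max_pool2x2(fmap).shape)   # (13, 13): resolution halved, strong activations kept
```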


Convolutional models

LeNet architecture [LeCun et al., 1998]
Most often a mix of (convolution + pooling) layers followed by dense layers (a Keras-style sketch is given below)
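A LeNet-style stack, sketched with Keras as (convolution + pooling) blocks followed by dense layers; the filter counts and layer sizes below are illustrative, not the original LeNet hyper-parameters.

```python
# LeNet-style CNN: conv + pooling blocks, then dense layers, then a softmax classifier.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),
    layers.Conv2D(6, kernel_size=5, activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Conv2D(16, kernel_size=5, activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Flatten(),
    layers.Dense(120, activation="relu"),
    layers.Dense(84, activation="relu"),
    layers.Dense(10, activation="softmax"),   # e.g. 10 digit classes
])
model.compile(optimizer="sgd", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```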


Outline

1 Main architectures

2 Learning
  SGD
  Optimization variants
  Learning DNNs

3 Deep architectures


SGD

Learning deep networks

Gradient descent optimization, as for MLPs
  SGD
  With momentum
  Adagrad, Adam, Adadelta, etc.


Gradient Descent Optimization

Initialize the weights (randomly)
Iterate (until convergence):
  Re-estimate $w_{t+1} = w_t - \varepsilon \left.\frac{\partial C(w)}{\partial w}\right|_{w_t}$
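The update rule above, sketched as a plain gradient descent loop on a toy least-squares cost; the data, step size and iteration count are illustrative assumptions.

```python
# Bare-bones gradient descent on C(w) = ||Xw - y||^2 / N.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

def grad_C(w):
    return 2.0 / len(y) * X.T @ (X @ w - y)   # dC/dw for the quadratic cost

w = rng.normal(size=3)         # random initialization
eps = 0.1                      # learning rate (epsilon above)
for t in range(200):           # "iterate until convergence"
    w = w - eps * grad_C(w)    # w_{t+1} = w_t - eps * dC/dw |_{w_t}
print(w)                       # close to the true coefficients [1, -2, 0.5]
```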


Gradient Descent: Tuning the Learning rate

Two-class classification problem

Weight trajectories for two different gradient step settings

Images from [LeCun et al.]


Gradient Descent: Tuning the Learning rate

Effect of the learning rate setting
  Assuming the gradient direction is good, there is an optimal value for the learning rate
  Using a smaller value slows convergence and may prevent it from converging
  Using a bigger value makes convergence chaotic and may cause divergence (illustrated in the toy example below)

Images from [LeCun et al.]
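The three regimes can be seen on a 1-D quadratic cost $C(w) = \frac{1}{2} a w^2$, where each gradient descent step multiplies the error by $(1 - \varepsilon a)$ and the optimal step is $\varepsilon = 1/a$; the constants below are illustrative.

```python
# Learning-rate regimes on a 1-D quadratic: gradient of C(w) = 0.5 * a * w^2 is a*w.
a, w0, steps = 2.0, 5.0, 20

def run(eps):
    w = w0
    for _ in range(steps):
        w = w - eps * a * w          # gradient descent update
    return w

print(run(0.5))    # eps = 1/a: reaches the minimum w = 0 in one step
print(run(0.05))   # too small: still far from 0 after 20 steps (slow convergence)
print(run(0.8))    # larger: oscillates (factor |1 - eps*a| = 0.6), still converges
print(run(1.2))    # too large: |1 - eps*a| = 1.4 > 1, the iterates diverge
```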


Gradient Descent: Batch, Stochastic, and Mini-batch

Objective: minimize $C(w) = \sum_{i=1}^{N} L_w(i)$ with $L_w(i) = L_w(x^i, y^i, w)$

Batch vs Stochastic vs Mini-batch
  Batch gradient descent
    Use $\nabla C(w)$
    Every iteration, all samples are used to compute the gradient direction and amplitude
  Stochastic gradient descent
    Use $\nabla L_w(i)$
    Every iteration, one sample (randomly chosen) is used to compute the gradient direction and amplitude
    Introduces randomization in the process: minimize $C(w)$ by minimizing parts of it successively
    Allows faster convergence, helps avoid local minima, etc.
  Mini-batch gradient descent (see the sketch below)
    Use $\nabla \sum_{j \in \text{minibatch}} L_w(j)$
    Every iteration, a batch of samples (randomly chosen) is used to compute the gradient direction and amplitude
    Introduces randomization in the process
    Exploits the GPU's computation ability
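A minimal mini-batch SGD loop on the same kind of toy least-squares cost: each update uses the gradient of a randomly chosen batch; batch size, learning rate and epoch count are illustrative assumptions.

```python
# Mini-batch SGD: shuffle each epoch, update on one small random batch at a time.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1000)

w, eps, batch_size = np.zeros(3), 0.1, 32
for epoch in range(10):
    order = rng.permutation(len(y))               # reshuffle the data each epoch
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]     # one random mini-batch
        Xb, yb = X[idx], y[idx]
        grad = 2.0 / len(idx) * Xb.T @ (Xb @ w - yb)
        w = w - eps * grad                        # SGD update on the mini-batch
print(w)   # approaches [1, -2, 0.5]
```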


Optimization variants

Using Momentum

SGD with Momentum

Standard stochastic gradient descent: $w = w - \varepsilon \frac{\partial C(w)}{\partial w}$

SGD with momentum:
  $v = \gamma v + \varepsilon \frac{\partial C(w)}{\partial w}$
  $w = w - v$
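The two momentum equations, sketched on a toy quadratic cost; the values of $\gamma$ and $\varepsilon$ are illustrative assumptions.

```python
# SGD with momentum: the velocity v accumulates past gradients.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5])

w = np.zeros(3)
v = np.zeros(3)                       # velocity
gamma, eps = 0.9, 0.05
for t in range(200):
    grad = 2.0 / len(y) * X.T @ (X @ w - y)
    v = gamma * v + eps * grad        # v = gamma * v + eps * dC/dw
    w = w - v                         # w = w - v
print(w)                              # approaches [1, -2, 0.5]
```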


Guiding the learning

Regularization
  Constraints on weights (L1 or L2)
  Constraints on activities (of the neurons in a hidden layer)
    L1 or L2
    Mean activity constraint (sparse autoencoders, [Ng et al.]); a sketch of this penalty follows the list
    Sparsity constraint (in a layer and/or in a batch)
    Winner-take-all-like strategies

Disturb learning to avoid memorizing the training set
  Noisy inputs (e.g. denoising autoencoders, linked to L2 regularization)
  Noisy labels
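A sketch of the mean-activity constraint of sparse autoencoders: the average activation of each hidden unit over a batch is pushed toward a small target $\rho$ through a KL-style penalty added to the reconstruction loss; the target and penalty weight below are illustrative assumptions.

```python
# Mean-activity (KL) sparsity penalty on a batch of sigmoid hidden activations.
import numpy as np

def sparsity_penalty(hidden_activations, rho=0.05, beta=3.0):
    """hidden_activations: (batch, n_hidden) values in (0, 1)."""
    rho_hat = hidden_activations.mean(axis=0)          # mean activity per hidden unit
    kl = (rho * np.log(rho / rho_hat)
          + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
    return beta * kl.sum()                             # added to the reconstruction loss

H = np.random.rand(64, 100) * 0.5                      # fake batch of activations
print(sparsity_penalty(H))                             # grows as mean activity moves away from rho
```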


Guiding the learning

Denoising autoencoders and Deep Belief Networks

(figure) Examples of filters learned with denoising autoencoders


Guiding the learning

Dropout

First method that allowed learning really deep networks without pretraining and smart initialization
Related to ensembles of models
Weights are normalized at inference time (see the dropout sketch below)
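A sketch of dropout on one hidden layer. The slides describe the original scheme, where weights are rescaled at inference time; the common "inverted dropout" variant shown here rescales activations during training instead, so inference uses the layer unchanged. The keep probability is an illustrative choice.

```python
# Inverted dropout: zero random units during training, rescale to keep expected activity.
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p_keep=0.5, training=True):
    if not training:
        return h                                # inference: full layer, no rescaling needed
    mask = rng.random(h.shape) < p_keep         # keep each unit with probability p_keep
    return h * mask / p_keep                    # rescale so the expected activation is unchanged

h = np.random.rand(4, 8)                        # a batch of hidden activations
print(dropout(h, training=True))                # roughly half the units zeroed
print(dropout(h, training=False))               # unchanged at test time
```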


Learning DNNs

Learning deep networks: Strategies

A few strategies (assuming large volumes of unlabeled data are available)
  Very large labeled training dataset: fully supervised setting
  Too few labeled training samples for supervised training: unsupervised feature learning (each layer one after the other) + fine-tuning with a classifier on top
  Very few labeled training samples: unsupervised feature learning (each layer one after the other) + flat classifier learning


Learning deep networks

Unsupervised feature learning layer by layer
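One possible realization of layer-wise unsupervised feature learning, sketched with stacked autoencoders in Keras: each layer is trained to reconstruct the representation produced by the previous one, then the pretrained encoders are stacked under a classifier for fine-tuning. Layer sizes, epochs and the classifier head are illustrative assumptions.

```python
# Greedy layer-wise pretraining with stacked autoencoders, then fine-tuning.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

X = np.random.rand(1000, 784).astype("float32")    # placeholder unlabeled data
sizes = [784, 256, 64]                              # input -> hidden 1 -> hidden 2
encoders, H = [], X

for i, (d_in, d_out) in enumerate(zip(sizes[:-1], sizes[1:])):
    # Train a one-hidden-layer autoencoder on the current representation H.
    ae = keras.Sequential([
        keras.Input(shape=(d_in,)),
        layers.Dense(d_out, activation="relu", name=f"enc_{i}"),
        layers.Dense(d_in, activation="sigmoid"),
    ])
    ae.compile(optimizer="adam", loss="mse")
    ae.fit(H, H, epochs=3, batch_size=64, verbose=0)
    encoders.append(ae.get_layer(f"enc_{i}"))        # keep the trained encoder only
    H = encoders[-1](H).numpy()                      # representation fed to the next layer

# Stack the pretrained encoders and put a classifier on top for fine-tuning.
classifier = keras.Sequential()
classifier.add(keras.Input(shape=(sizes[0],)))
for enc in encoders:
    classifier.add(enc)                              # shared, pretrained weights
classifier.add(layers.Dense(10, activation="softmax"))
classifier.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# classifier.fit(X_labeled, y_labeled, ...)          # fine-tuning on the labeled subset
```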


Learning more general architectures

Graph of modules (better without cycles...)

Still optimized with gradient descent:

$W = W - \varepsilon \frac{\partial C(W)}{\partial W}$

provided the functions implemented by the blocks are differentiable, and the derivatives $\frac{\partial Out(B)}{\partial In(B)}$ and $\frac{\partial Out(B)}{\partial W(B)}$ are available for every block (see the sketch below)
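A sketch of the block contract above: each module exposes a forward pass and a backward pass that uses $\partial Out(B)/\partial In(B)$ to propagate the gradient and $\partial Out(B)/\partial W(B)$ to update its own weights; the linear blocks, toy data and learning rate are illustrative assumptions.

```python
# Each block implements forward() and backward(); blocks are then chained in a graph.
import numpy as np

class LinearBlock:
    def __init__(self, d_in, d_out, rng):
        self.W = rng.normal(scale=0.1, size=(d_in, d_out))

    def forward(self, x):
        self.x = x                       # cache the input for the backward pass
        return x @ self.W

    def backward(self, grad_out, eps=0.01):
        grad_in = grad_out @ self.W.T    # dC/dIn  = dC/dOut * dOut/dIn
        grad_W = self.x.T @ grad_out     # dC/dW   via the chain rule
        self.W -= eps * grad_W           # gradient descent step on this block
        return grad_in                   # passed to the previous block in the graph

rng = np.random.default_rng(0)
blocks = [LinearBlock(4, 8, rng), LinearBlock(8, 1, rng)]
x, y = rng.normal(size=(32, 4)), rng.normal(size=(32, 1))
for _ in range(100):
    h = x
    for b in blocks:                     # forward through the graph
        h = b.forward(h)
    grad = 2 * (h - y) / len(y)          # dC/dOut for a squared-error cost
    for b in reversed(blocks):           # backward through the graph
        grad = b.backward(grad)
```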


Outline

1 Main architectures

2 Learning

3 Deep architectures
  Very deep Models
  What makes DNN work?


Very deep Models

The Times They Are A Changing

(slide from [Kaiming He])


From shallow to deep

Simply stacking layers does not work (CIFAR results)! (figures from [He et al., 2015])

(source: [He et al., 2016])


What makes DNN work?

Deep vs Shallow?

Characterizing the complexity of the functions a DNN may implement [Pascanu et al., 2014]
  DNNs with the ReLU activation function ⇒ piecewise linear functions
  Complexity of a DNN function measured as the number of linear regions on the input data
  Case of $n_0$ inputs and $n = 2 n_0$ hidden cells per hidden layer ($k$ hidden layers):

  Maximum number of regions: $2^{(k-1) n_0} \sum_{j=0}^{n_0} \binom{2 n_0}{j}$

Example: $n_0 = 2$ (checked numerically below)
  Shallow model with $4 n_0$ units → 37 regions
  Deep model with 2 hidden layers of $2 n_0$ units each → 44 regions
  Shallow model with $6 n_0$ units → 79 regions
  Deep model with 3 hidden layers of $2 n_0$ units each → 176 regions

Exponentially more regions per parameter in terms of the number of hidden layers
At least order $(k-2)$ polynomially more regions per parameter in terms of the hidden layer width $n$
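A quick numeric check of the counts above, assuming the standard hyperplane-arrangement bound $\sum_{j=0}^{n_0} \binom{n}{j}$ for a shallow ReLU layer with $n$ units and the deep-model bound quoted from [Pascanu et al., 2014].

```python
# Reproduce the region counts of the example (n0 = 2).
from math import comb

def shallow_regions(n, n0):
    # Maximum number of regions cut by n hyperplanes in an n0-dimensional input space.
    return sum(comb(n, j) for j in range(n0 + 1))

def deep_regions(k, n0):
    # Bound from the slide: k hidden layers of 2*n0 units each.
    return 2 ** ((k - 1) * n0) * shallow_regions(2 * n0, n0)

n0 = 2
print(shallow_regions(4 * n0, n0))   # 37
print(deep_regions(2, n0))           # 44
print(shallow_regions(6 * n0, n0))   # 79
print(deep_regions(3, n0))           # 176
```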


Deep vs Shallow?

From [Pascanu et al., 2014]
  Left: regions computed by a layer with 8 ReLU hidden neurons on a two-dimensional input space (i.e. the output of the previous layer)
  Middle: heat map of a function computed by a rectifier network with 2 inputs, 2 hidden layers of width 4, and one linear output unit. Black lines delimit the regions of linearity of the function
  Right: heat map of a function computed by a 4-layer model with a total of 24 hidden units. It takes at least 137 hidden units in a shallow model to represent the same function


Depth alone is not enough

Making the gradient flow for learning deep models
  Main mechanism: include the identity mapping as a possible path from the input to the output of a layer
  ResNet building block [He et al., 2015] (sketched below)

LSTM (deep in time) [Hochreiter and Schmidhuber, 1997]
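A sketch of a ResNet-style building block in Keras: the output is $F(x) + x$, so the identity path gives the gradient a direct route to earlier layers; the filter counts, kernel sizes and input shape are illustrative, not the exact configuration of [He et al., 2015].

```python
# Residual block: two conv layers form F(x), the shortcut adds x back before the final ReLU.
from tensorflow import keras
from tensorflow.keras import layers

def residual_block(x, filters=64):
    shortcut = x                                    # identity path
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([y, shortcut])                 # F(x) + x
    return layers.Activation("relu")(y)

inputs = keras.Input(shape=(32, 32, 64))
outputs = residual_block(inputs)
block = keras.Model(inputs, outputs)
block.summary()
```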


About generalization, overtraining, local minima, etc.

Traditional Machine Learning
  Overfitting is the enemy
  One may control generalization with appropriate regularization

Recent results in DL
  The overfitting idea should be revised for DL [Zhang et al., 2017]
    Deep NNs may learn noise!
    Regularization may slightly improve performance but is not THE answer for improving generalization
  The objective function does not exhibit lots of saddle points, and most local minima are good and close to global minima [Choromanska et al., 2015]
  It is not clear what in a DNN may allow one to predict its generalization ability


Favorable context

Huge training resources for huge models
  Huge volumes of training data
  Huge computing resources (clusters of GPUs)

Advances in understanding and optimizing NNs
  Regularization (Dropout, ...)
  Making the gradient flow (ResNets, LSTMs, ...)

Faster diffusion than ever
  Software
    TensorFlow, Theano, Torch, Keras, Lasagne, ...
  Results
    Publications (arXiv publication model) + code
    Architectures and weights (3 Python lines to load a state-of-the-art computer vision model!)
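The "three Python lines" claim, illustrated with one possible choice (a pretrained ResNet50 from Keras applications); the specific model is an illustrative example.

```python
# Load a state-of-the-art computer vision model with pretrained ImageNet weights.
from tensorflow.keras.applications import ResNet50

model = ResNet50(weights="imagenet")   # downloads and loads the pretrained weights
model.summary()
```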
