valentin leveau and titouan lorieul cocomeet 15/12/17 deep learning.pdf · valentin leveau -...

“Demystifying” Deep Learning

Valentin Leveau and Titouan Lorieul

Cocomeet 15/12/17

Summary

1. Introduction

2. From shallow neural networks to (very) deep ones

3. Why did deep CNNs become popular so late?

4. Theoretical focus: learning as an optimization problem

5. Practical limitations of neural networks

We are now good at mimicking one part of intelligence: learning

(Machine) Learning = Learning from examples to do a given task and generalize to new examples.

Predict some variables given others.

Capture statistical relationships / structure between observed variables.

Introduction: DL vs ML

Figure: nested sets, AI ⊃ Machine Learning ⊃ Deep Learning

Machine Learning: Supervised Learning

It is all about learning a prediction function parametrized by some parameters

A simple function could be a linear model:

The goal is to learn the parameters so that we predict good values of y_i given x_i
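As a minimal sketch of such a prediction function, a linear model can be written in a few lines of Python. The names `predict`, `w`, `b` and the numeric values are illustrative assumptions, not taken from the slides.

```python
def predict(x, w, b):
    """Linear model: f(x; w, b) = w^T x + b."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x)) + b

# Hypothetical parameters and input, chosen only for illustration.
w = [2.0, -1.0]
b = 0.5
x = [1.0, 3.0]
y_hat = predict(x, w, b)  # 2*1 - 1*3 + 0.5
```

Learning then amounts to choosing `w` and `b` so that `predict(x_i, w, b)` is close to `y_i` on the training examples.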

An image can be modeled by its pixel representation:

Very very high dimensional feature space

Not relevant features

Classification in high dimensional spaces

This is what a machine sees !!!

Find Non Linear Invariant in the data (S. Mallat)

Find a mapping such that the problem becomes linearly separable

We want to produce an abstraction of the image containing highly discriminative features

→ Representation Learning

Find Non Linear Invariant in the data (S. Mallat)

Produce handcrafted intermediate representations of images

Image abstraction that contains relevant information

Problem: Decide manually which kind of features are good for the different tasks

Solution: Let the system learn the change of variable

→ Toward Deep Representation Learning

Intermediate representation pipeline: Query Image → Low-Level Representation (SIFT, GIST, ...) → Mid-Level Representation (BoW, Fisher Vectors, ...) → Classification → Label

Several strategies to go non linear


• Learning several levels of abstraction of the input signal (compositionality).

• Progressive embedding from input pixels to the label space.

• Jointly learning all the trainable modules.

Deep (Convolutional) Neural Network

Intermediate representation pipeline: Image → Low-level feature extraction → Local pooling (sum, max, avg, etc.) → High-level feature extraction → Local pooling (sum, max, avg, etc.) → Classification

From Shallow Neural Networks to (very) Deep Ones

Perceptron : Simple Elementary Neural Unit

Multi Layer Perceptron 2 layers architecture :

Layer 1: several units in parallel + non-linearities

Layer 2 : final linear classifier unit
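The two-layer architecture above can be sketched as a forward pass in NumPy. This is only an illustration: the choice of `tanh` as nonlinearity, the layer sizes, and the names `mlp_forward`, `W1`, `b1`, `w2`, `b2` are assumptions, not taken from the slides.

```python
import numpy as np

def mlp_forward(x, W1, b1, w2, b2):
    """Two-layer perceptron: hidden units with a nonlinearity, then a linear unit."""
    h = np.tanh(W1 @ x + b1)   # layer 1: several units in parallel + nonlinearity
    return w2 @ h + b2         # layer 2: final linear classifier unit (a score)

rng = np.random.default_rng(0)
x = rng.normal(size=4)                           # hypothetical 4-dimensional input
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)    # 3 hidden units
w2, b2 = rng.normal(size=3), 0.0
score = mlp_forward(x, W1, b1, w2, b2)
label = 1 if score > 0 else 0                    # threshold the score to get a class
```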

Figure: layers W1 W2 W3 W4 W5, going from the input data through a deep nonlinear embedding to the classification layer

Multi Layer Perceptron

• The nonlinearities are crucial !!!

• Linear combination of linear combinations = Linear combination (depth is useless)

• Nonlinearities make it possible to bend the space so that samples become linearly separable (click the curved space)

http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html
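The claim that depth is useless without nonlinearities can be checked numerically. The sketch below uses small hand-picked matrices (an assumption made for illustration) and ReLU as the example nonlinearity.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# Hypothetical weights and input, chosen by hand so the result is easy to follow.
W1 = np.array([[1.0, 0.0],
               [-1.0, 0.0]])
W2 = np.array([[1.0, 1.0]])
x = np.array([1.0, 2.0])

# Two stacked linear layers equal one linear layer with matrix W2 @ W1.
deep_linear = W2 @ (W1 @ x)
single_linear = (W2 @ W1) @ x
collapses = np.allclose(deep_linear, single_linear)

# With a ReLU in between, the network computes |x[0]|, which no single
# linear layer can represent.
deep_nonlinear = W2 @ relu(W1 @ x)
differs = not np.allclose(deep_nonlinear, single_linear)
```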

Optimization strategy: Supervised Learning

Define an objective function to minimize with respect to the parameters

Example: least squares minimization over the training samples

Optimization strategy: Gradient Descent (GD)

Update the parameters of an objective function in the direction opposite to the gradient

Several GD strategies:

a) All the training samples: Batch GD (exact gradient of J)

b) Only one training sample: Stochastic GD

c) A subsample of the training set: Mini-batch GD (reasonable approximation of a)
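The three strategies differ only in how many samples are used for each gradient estimate. A minimal sketch on a least squares problem follows; the synthetic data, the learning rate, the number of epochs, and the names `grad` and `gradient_descent` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])          # hypothetical ground truth
y = X @ w_true + 0.01 * rng.normal(size=n)

def grad(w, Xb, yb):
    """Gradient of J(w) = mean((Xb w - yb)^2) / 2 on the given samples."""
    return Xb.T @ (Xb @ w - yb) / len(yb)

def gradient_descent(batch_size, lr=0.05, epochs=200):
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)            # visit samples in random order
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            w = w - lr * grad(w, X[idx], y[idx])   # step against the gradient
    return w

w_batch = gradient_descent(batch_size=n)   # a) batch GD: exact gradient of J
w_sgd = gradient_descent(batch_size=1)     # b) stochastic GD: one sample per step
w_mini = gradient_descent(batch_size=32)   # c) mini-batch GD
```

On this small convex problem all three variants should end up close to `w_true`; they differ in the cost and the noise of each step.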


Gradient descent through deep feedforward models

Gradient can be hard to compute

A lot of parameters with interdependencies from layer to layer

Solution: a deep model is a composition of functions, so the chain rule applies

Backpropagation of the gradient

For a linear layer, the weight update is collinear with the layer's input data
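Backpropagation can be sketched on a tiny two-layer network with a squared loss. The architecture, the `tanh` nonlinearity, and all names below are assumptions chosen for illustration; the finite-difference comparison at the end is a common sanity check, not something the slides describe.

```python
import numpy as np

def forward(x, W1, w2):
    a = W1 @ x            # pre-activation of the hidden layer
    h = np.tanh(a)        # hidden representation
    return a, h, w2 @ h   # scalar prediction

def loss_and_grads(x, y, W1, w2):
    """Squared loss 0.5 * (y_hat - y)^2 and its gradients by the chain rule."""
    a, h, y_hat = forward(x, W1, w2)
    loss = 0.5 * (y_hat - y) ** 2
    d_yhat = y_hat - y               # dL/dy_hat
    g_w2 = d_yhat * h                # dL/dw2
    d_h = d_yhat * w2                # propagate back to the hidden layer
    d_a = d_h * (1.0 - h ** 2)       # through tanh: tanh'(a) = 1 - tanh(a)^2
    g_W1 = np.outer(d_a, x)          # dL/dW1: every row is a multiple of x
    return loss, g_W1, g_w2

rng = np.random.default_rng(0)
x, y = rng.normal(size=4), 1.0
W1, w2 = rng.normal(size=(3, 4)), rng.normal(size=3)
loss, g_W1, g_w2 = loss_and_grads(x, y, W1, w2)

# Finite-difference check on one entry of W1.
eps = 1e-6
W1_plus = W1.copy()
W1_plus[0, 0] += eps
numeric = (loss_and_grads(x, y, W1_plus, w2)[0] - loss) / eps
```

The outer product in `g_W1` shows why each row of the weight update points along the input `x`.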

Convolutional Neural Network


Such a deep nonlinear mapping is still inefficient at learning invariances.

We still try to map a pixel-wise vector representation to the target space.

We need stronger assumptions about the model to learn useful invariants for vision.

Solution: Replace linear maps by convolution kernels with learnable parameters!

A convolution

Replace linear maps by a convolution

http://cs231n.github.io/assets/conv-demo/index.html

Applying a convolution is equivalent to sharing the parameters of local receptive fields!

→ Huge number of matrix multiplications

→ But the number of free parameters is low!
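Weight sharing can be made concrete with a naive 2-D convolution. This sketch assumes "valid" borders, stride 1, and no kernel flipping (the cross-correlation convention common in deep learning libraries); the image, the filter, and the name `conv2d_valid` are hypothetical.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one kernel over the image ('valid' mode, stride 1, no flipping)."""
    H, W = image.shape
    k, _ = kernel.shape
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + k, j:j + k]      # local receptive field
            out[i, j] = np.sum(patch * kernel)   # same weights at every position
    return out

image = np.arange(36, dtype=float).reshape(6, 6)   # hypothetical 6x6 image
kernel = np.array([[1.0, 0.0, -1.0]] * 3)          # hypothetical 3x3 filter
feature_map = conv2d_valid(image, kernel)

conv_params = kernel.size                          # weights shared by all positions
dense_params = image.size * feature_map.size       # fully connected map of the same shape
```

Comparing `conv_params` with `dense_params` shows how few free parameters the convolution needs for the same output size.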

Translation Invariance with Conv Layers

image patch at position (x,y)

Weight sharing:

Activation at position (x,y) for the k-th filter

Pooling layers

Progressively reduce spatial resolution (e.g. max pooling):

Backpropagate the gradient only through the spatial locations that were chosen as the max
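A minimal sketch of 2x2 max pooling and its backward pass follows. The window size, the stride of 2, the function names, and the example values are assumptions made for illustration.

```python
import numpy as np

def maxpool2x2_forward(x):
    """2x2 max pooling with stride 2 on an (H, W) map with even H and W."""
    H, W = x.shape
    # Group each 2x2 window into the last axis (4 values per window).
    blocks = x.reshape(H // 2, 2, W // 2, 2).transpose(0, 2, 1, 3).reshape(H // 2, W // 2, 4)
    argmax = blocks.argmax(axis=2)          # which of the 4 positions won
    return blocks.max(axis=2), argmax

def maxpool2x2_backward(grad_out, argmax):
    """Route each output gradient to the input location that was the max."""
    Hp, Wp = grad_out.shape
    grad_blocks = np.zeros((Hp, Wp, 4))
    rows, cols = np.indices((Hp, Wp))
    grad_blocks[rows, cols, argmax] = grad_out
    # Undo the window grouping to get back an (H, W) gradient map.
    return grad_blocks.reshape(Hp, Wp, 2, 2).transpose(0, 2, 1, 3).reshape(Hp * 2, Wp * 2)

x = np.array([[1.0, 2.0, 0.0, 1.0],
              [3.0, 4.0, 5.0, 0.0],
              [0.0, 1.0, 2.0, 2.5],
              [7.0, 0.0, 1.0, 0.5]])
pooled, argmax = maxpool2x2_forward(x)
grad_in = maxpool2x2_backward(np.ones_like(pooled), argmax)
```

Only one entry per window of `grad_in` is nonzero: the position that held the maximum in the forward pass.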

Why did ConvNets get famous so late?

• Not many theoretical justifications (compared to CV + Kernel Methods)

• Too many parameters for small datasets, and algorithms and hardware that were too slow

• Severe convergence problems (vanishing gradients, bad normalizations, …)

• No good ways to initialize deep models

→ Motivations in Unsupervised Learning (2000-2011)

Learn a model to:

Project data into a nonlinear intermediate embedding (Coding)

Reconstruct its own input from the codes (Decoding)

PCA’s eigen directions span the same space as a linear autoencoder’s

Coding function:

Decoding function:

Reconstruction objective:

Unsupervised Learning example: Autoencoder

min ||X - X_hat||²

Figure: X → h → X_hat
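The link between PCA and a linear autoencoder can be illustrated without any training. The sketch below plugs the top-k principal directions into a linear coding and decoding function (a known optimal solution of the linear reconstruction objective) and evaluates the reconstruction error. The data, the code size `k`, and the function names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))   # hypothetical data
X = X - X.mean(axis=0)                                     # center the data

# PCA: eigen-decomposition of the covariance matrix.
cov = X.T @ X / len(X)
eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
k = 2
U = eigvecs[:, -k:]                        # top-k principal directions

def encode(X, U):
    return X @ U          # coding: h = U^T x

def decode(H, U):
    return H @ U.T        # decoding: x_hat = U h

X_hat = decode(encode(X, U), U)
reconstruction_error = np.mean(np.sum((X - X_hat) ** 2, axis=1))
discarded_variance = eigvals[:-k].sum()    # variance along the dropped directions
```

The reconstruction error equals the variance along the discarded directions, which is the smallest error any linear code of size `k` can reach.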

Unsupervised Learning example: Autoencoder

It is more efficient in practice to learn the layers separately.

→ Actually with ReLU, that’s okay.

Stacked autoencoders: Train the first autoencoder layer to reconstruct the input

Use the intermediate representation as input to the next layer and re-apply the process

Fine-tune the model to jointly learn the layers

Why did ConvNets get famous so late?

• Large-scale labelled dataset: ImageNet (15 M)

• Hardware acceleration: GPU

• The ReLU nonlinearity, ReLU(x) = max(0, x)

• The killer detail: as good as (if not better than) unsupervised pretraining!

What Changed ?

• Batch Normalization (x14 faster !!!)

• Dropout

Evolution of the architectures

• LeNet 1-5 [Y. LeCun et al. 1989]

• AlexNet [A. Krizhevsky et al. 2012]

• VGG Net [K. Simonyan et al. 2014]

• GoogLeNet [C. Szegedy et al. 2015]

• Inception modules

• ResNet (152 layers, even 1,000 ...) [K. He et al. 2015]

• DenseNet [G. Huang et al. 2016]

Setting hyper-parameters of DL is a Nightmare !

• Many more good practices are needed to make it work:

• Data augmentation (random flip, crop, rotation, etc.)

• Momentum and adaptive learning rate methods

• Learning rate tuning and policy

• Learn an ensemble of models (and aggregate their predictions in some way)

• Learning rate decay
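Two of these practices, momentum and learning rate decay, can be sketched together on a one-dimensional toy objective. The step decay policy, the hyper-parameter values, and the names `step_decay` and `sgd_momentum` are assumptions; real training would use noisy mini-batch gradients.

```python
def step_decay(lr0, epoch, drop=0.5, every=30):
    """Step policy: multiply the learning rate by `drop` every `every` epochs."""
    return lr0 * drop ** (epoch // every)

def sgd_momentum(grad, theta, lr0=0.1, mu=0.9, epochs=200):
    """Minimize a 1-D function given its gradient, with momentum and decay."""
    velocity = 0.0
    for epoch in range(epochs):
        lr = step_decay(lr0, epoch)
        velocity = mu * velocity - lr * grad(theta)   # accumulate past gradients
        theta = theta + velocity
    return theta

# Hypothetical objective J(theta) = (theta - 3)^2, with gradient 2 (theta - 3).
theta_star = sgd_momentum(lambda t: 2.0 * (t - 3.0), theta=0.0)
```

Each of `lr0`, `mu`, `drop`, and `every` is one more hyper-parameter to tune, which is part of what makes the setup delicate.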

Transfer learning (fine-tuning)

Problem: CNNs require huge training data to learn the millions of parameters

Solution: Learn domain specific features by transfer learning

1. Train CNN on a generalist image dataset with millions of images

2. Keep the weights of the lowest layers but remove/reset the top layers

3. Feed forward and back-propagate new domain specific images (with usually a different number of classes C)

Figure: new output layer on top of Layer 7
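The three steps can be sketched in NumPy under strong simplifications. Here the "pre-trained" lower layers are a single frozen ReLU layer with random stand-in weights, only the new output layer is updated (the simplest variant of fine-tuning), and the data, sizes, and names are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for lower layers pre-trained on a generalist dataset.
# The weights are random here; in practice they would be loaded from a trained model.
W_pretrained = rng.normal(size=(16, 8))

def features(X):
    """Frozen lower layers: kept as they are, never updated below."""
    return np.maximum(X @ W_pretrained.T, 0.0)

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def cross_entropy(P, y):
    return -np.mean(np.log(P[np.arange(len(y)), y]))

# New domain-specific data with C classes (hypothetical, C = 3).
C = 3
X_new = rng.normal(size=(90, 8))
y_new = np.argmax(X_new[:, :C], axis=1)

# New output layer, reset to zero, replacing the old top layer.
W_out = np.zeros((16, C))
H = features(X_new)
loss_before = cross_entropy(softmax(H @ W_out), y_new)
lr = 0.01
for _ in range(200):
    P = softmax(H @ W_out)
    P[np.arange(len(y_new)), y_new] -= 1.0     # gradient of the loss w.r.t. the logits
    W_out -= lr * (H.T @ P) / len(y_new)       # update only the new layer
loss_after = cross_entropy(softmax(H @ W_out), y_new)
```

In a full fine-tuning setup the gradient would also flow into the kept layers, usually with a smaller learning rate; that part is left out of this sketch.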

The power of transfer learning

Transfer learning usually works for any domain

Table 1 - accuracy measured on several fine-grained image classification datasets

Even very specific ones:

Rice seeds varieties recognition 100 classes, 1 500 texture images

GoogLeNet trained from scratch 58.1%

GoogLeNet pre-trained on ImageNet 70.4%

Herbaria species recognition 255 classes, 11K herbaria sheets

Dataset            GoogLeNet trained from scratch    GoogLeNet pre-trained on ImageNet
Trademark Logos    67.7%                             87.5%
Car models         59.3%                             79.9%
Paris Buildings    55.3%                             71.3%
Aircraft models    72.7%                             88.1%
Bird species       24.4%                             72.4%
Flower species     59.5%                             89.5%

GoogLeNet trained from scratch 8.8%

GoogLeNet pre-trained on ImageNet 52.4%

Frameworks

From academia…

...to industry

etc.

Toolkit

Hardware

GPUs…

Mobile…

FPGA… Project Brainwave (Microsoft)

And others… TPUs (Google), etc.

Applications

Plant species recognition: Pl@ntNet

Localization / Segmentation (Deep Mask)

Style Transfer to real and sketch images

Image Generation with GAN [S. Chopra et al. 2015]

Generative Adversarial Network (GAN) [I. Goodfellow et al. 2014]

Logic with Deep Learning

NLP with Memory Networks

Image Captioning via attention based models

“Reverse Image Captioning” :

Resources

Yann LeCun's course at the Collège de France: https://www.college-de-france.fr/site/yann-lecun/course-2015-2016.htm

Stéphane Mallat's talk at the Collège de France: https://www.college-de-france.fr/site/yann-lecun/seminar-2016-02-19-15h30.htm

The Deep Learning Book: https://github.com/HFTrader/DeepLearningBook

Pattern Recognition and Machine Learning: http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf

Questions

Theoretical Focus: Learning as an Optimization Problem

Statistical Learning Theory

Data generating process: supervised learning

$\mathcal{X} \times \mathcal{Y}$ probability space

Input space $\mathcal{X}$, e.g. $\mathbb{R}^d$, images, ...

Output space $\mathcal{Y}$, e.g. $\mathbb{R}$, $\{0, 1\}$, $\{0, 1, \dots, K-1\}$, ...

Data drawn from $p(x, y)$

is fixed but unknown

probability models uncertainty

$p(x, y) = p(y|x)\,p(x)$

We have a sample set $T_n = \{(x_i, y_i),\ i = 1, 2, \dots, n\}$

Objective

Find the function f that best predicts y given x.

$\hat{y} = f(x)$


Formalism

Reduce search to a parametrized function class $f(\cdot\,; \theta)$, e.g. linear regression $f(x; w, b) = w^T x + b$, $\Theta = \mathbb{R}^d \times \mathbb{R}$

Loss function: $l(y, \hat{y}) = l(y, f(x; \theta))$, e.g. squared loss $l(y, \hat{y}) = (y - \hat{y})^2$

Risk function: $J^*(\theta) = \mathbb{E}_{x,y}[l(y, f(x; \theta))]$

Optimization problem

$\min_{\theta \in \Theta} J^*(\theta)$

Statistical Learning Theory


$\min_{\theta \in \Theta} \mathbb{E}_{x,y}[l(y, f(x; \theta))]$

$p(x, y)$ is unknown, we cannot compute the expectation.

Use empirical risk instead: $J(\theta) = \frac{1}{n} \sum_i l(y_i, f(x_i; \theta))$

$\min_{\theta \in \Theta} J(\theta)$
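The relation between the empirical risk and the true risk can be illustrated by simulation. In the sketch below the data-generating process is made up (y = 2x + 1 plus Gaussian noise of standard deviation 0.1), so that the true risk of the chosen `theta` under the squared loss is known to be the noise variance, 0.01. All names are hypothetical.

```python
import numpy as np

def squared_loss(y, y_hat):
    return (y - y_hat) ** 2

def empirical_risk(theta, X, y):
    """J(theta) = (1/n) sum_i l(y_i, f(x_i; theta)) for a linear f."""
    w, b = theta
    return np.mean(squared_loss(y, X @ w + b))

# Hypothetical data-generating process: y = 2 x + 1 + noise, noise ~ N(0, 0.1^2).
rng = np.random.default_rng(0)

def sample(n):
    X = rng.normal(size=(n, 1))
    y = 2.0 * X[:, 0] + 1.0 + 0.1 * rng.normal(size=n)
    return X, y

theta = (np.array([2.0]), 1.0)
risk_small = empirical_risk(theta, *sample(20))        # noisy estimate of the risk
risk_large = empirical_risk(theta, *sample(200_000))   # close to the true risk
```

With few samples the empirical risk fluctuates around the true risk; with many samples it concentrates on it.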

Optimization algorithms

Usually, we have convex objective functions and $\Theta \subset \mathbb{R}^p$. We know how to deal with these optimization problems:

if differentiable, first-order gradient approaches: Gradient Descent, Stochastic Gradient Descent, ...

if twice differentiable, second-order approaches: Newton-Raphson, Quasi-Newton, ...

...


However, when measured on a test set, the error is high. The fitted model does not generalize to new inputs...

Why?

Underfitting VS Overfitting

Regularization term


$\min_{\theta \in \Theta} J(\theta) + \lambda \Omega(\theta)$

where λ is the regularization parameter (hyperparameter)

Examples:

L1 regularization: $\Omega(\theta) = \|\theta\|_1 = \sum_k |\theta_k|$

L2 regularization: $\Omega(\theta) = \frac{1}{2} \|\theta\|_2^2 = \frac{1}{2} \sum_k \theta_k^2$
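The two penalties and their effect can be sketched for a linear model with squared loss. The closed-form solution below is specific to the L2 case with this loss; the synthetic data and the function names are assumptions.

```python
import numpy as np

def l1_penalty(theta):
    return np.sum(np.abs(theta))           # ||theta||_1

def l2_penalty(theta):
    return 0.5 * np.sum(theta ** 2)        # (1/2) ||theta||_2^2

def regularized_objective(theta, X, y, lam, penalty):
    """J(theta) + lambda * Omega(theta), with J the mean squared loss."""
    return np.mean((X @ theta - y) ** 2) + lam * penalty(theta)

def ridge_solution(X, y, lam):
    """Minimizer of mean((X theta - y)^2) + lam * l2_penalty(theta).

    Setting the gradient (2/n) X^T (X theta - y) + lam * theta to zero
    gives a linear system.
    """
    n, d = X.shape
    return np.linalg.solve(2.0 * X.T @ X / n + lam * np.eye(d), 2.0 * X.T @ y / n)

theta = np.array([3.0, -4.0])              # hypothetical parameters
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ theta + 0.1 * rng.normal(size=50)
theta_free = ridge_solution(X, y, lam=0.0)
theta_ridge = ridge_solution(X, y, lam=1.0)   # a larger lambda shrinks the weights
```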

Backed by theory: VC theory, PAC learning

$\epsilon^2 \le \frac{\log |H| + \log \frac{1}{\delta}}{2n}$
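To get a feeling for the numbers, the bound can be evaluated directly. The sketch assumes the usual reading of this inequality: H is a finite hypothesis class, n the number of samples, and ε the gap between true and empirical risk that holds with probability at least 1 − δ. The function names and the example values are hypothetical.

```python
import math

def generalization_gap_bound(hypothesis_count, n, delta):
    """Upper bound on epsilon from eps^2 <= (log|H| + log(1/delta)) / (2n)."""
    return math.sqrt((math.log(hypothesis_count) + math.log(1.0 / delta)) / (2 * n))

def samples_needed(hypothesis_count, epsilon, delta):
    """Smallest n for which the bound guarantees a gap of at most epsilon."""
    return math.ceil((math.log(hypothesis_count) + math.log(1.0 / delta)) / (2 * epsilon ** 2))

eps = generalization_gap_bound(hypothesis_count=1000, n=10_000, delta=0.05)
n_min = samples_needed(hypothesis_count=1000, epsilon=0.05, delta=0.05)
```

The bound grows only logarithmically with the size of the hypothesis class, but it shrinks slowly, as one over the square root of n.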

What about Neural Networks?

Optimization problem

definitely not convex

needs a lot of engineering to make it converge to something: good initialization, ReLU activations, lots of tricks to keep gradients from exploding, etc.

...but somehow Stochastic Gradient Descent works well

Generalization

no regularization term (but other approaches)

huge capacity (universal approximation theorem): classic theory does not provide a good bound on generalization...

In the end, we do not even optimize what we would like (surrogate loss), we just monitor the validation error in order to know when to stop...

What about Neural Networks?

Open question: Why does it work?

Optimization

empirically: a lot of minima reached by SGD are equivalent in terms of generalization error¹

Generalization

new theories rising?²

¹ Choromanska et al. 2015; Goodfellow and Vinyals 2014.

² Shwartz-Ziv and Tishby 2017.

Practical Limitations of Neural Networks

Some current limitations

Interpretability

building predictive models can be useful to perform analysis on a set of variables (e.g. economics, ecology, ...)

... given we can understand what the model has learned

but neural networks are black boxes

Can be fooled

no measure of prediction uncertainty

usually outputs only some sort of scores that cannot be interpreted as uncertainty estimates

Bayesian Neural Networks (BNN)³

easy to find adversarial examples

need for formal verification

³ Kendall and Gal 2017.

Some failures...

Because neural networks have allowed us to make considerable progress in a lot of different tasks, we might consider them solved... but they are not.

Some failures...

Adversarial examples⁴

⁴ Goodfellow, Shlens, et al. 2014; Kurakin et al. 2016.
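How easily an adversarial example can be found is illustrated below with a fast gradient sign step, the kind of attack discussed in Goodfellow, Shlens, et al. 2014, applied to a plain logistic classifier rather than a deep network. The weights, the input, and the value of `epsilon` are hypothetical and chosen so the effect is easy to see.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, b, epsilon):
    """Fast gradient sign step for the logistic loss of a linear classifier."""
    p = sigmoid(w @ x + b)            # predicted probability of class 1
    grad_x = (p - y) * w              # gradient of the loss with respect to x
    return x + epsilon * np.sign(grad_x)

# Hypothetical linear classifier on a 100-dimensional input.
w = np.full(100, 0.1)
w[::2] *= -1.0                        # alternate signs: -0.1, +0.1, ...
b = 0.0
x = 0.05 * np.sign(w)                 # clean input, classified as class 1
y = 1.0

x_adv = fgsm(x, y, w, b, epsilon=0.1)
p_clean = sigmoid(w @ x + b)          # above 0.5: class 1
p_adv = sigmoid(w @ x_adv + b)        # below 0.5: the prediction flips
```

Each coordinate moves by at most `epsilon`, yet the many small changes add up along `w` and flip the decision; in high dimension this effect gets stronger.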

Verification

For safety-critical systems, we need formal guarantees on the behavior of the algorithms, e.g.:

check properties of the DNN implementation of a next-generation airborne collision avoidance system for unmanned aircraft⁵

verify that there is no adversarial example around a test point⁶

⁵ Katz et al. 2017.

⁶ Huang et al. 2017.

Conclusion

Deep Learning has been very successful and has a very large array of applications.

It showed that the current statistical learning theories are insufficient.

We need to rethink generalization in high dimension.

We need to find a new learning theory.

Lack of theoretical understanding limits the usage of neural networks as black-box modules in software because they do not provide strong contracts (hence the need for formal verification).

There is an ongoing debate in the community (e.g. see Ali Rahimi’s NIPS 2017 Test-of-Time award talk).

Questions?
