
Deep Learning: a gentle introduction

Jamal Atif, jamal.atif@dauphine.fr

PSL, Université Paris-Dauphine, LAMSADE

February 8, 2016


Why a talk about deep learning?

Convolutional Networks (Yann LeCun): the ILSVRC challenge: 1,000 categories and 1,461,406 images.

Neuron’s basic anatomy

Figure: A neuron's basic anatomy consists of four parts: a soma (cell body), dendrites, an axon, and nerve terminals. Information is received by the dendrites, gets collected in the cell body, and flows down the axon.

Artificial neuron

Figure: Diagram of an artificial neuron. Inputs x1, …, xn (together with a constant input x_{n+1} = 1) are weighted by w1, …, wn and the bias b and summed into the pre-activation ∑_{i=1}^{n} w_i x_i + b, which is then passed through an activation function. The inputs, the summation, and the output play the roles of the dendrites, the cell body, and the axon.
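To make the pre-activation/activation split concrete, here is a minimal Python sketch of such a neuron; the sigmoid used as the default activation is an illustrative assumption (the slide leaves the activation unspecified).

```python
import math

def neuron(x, w, b, activation=lambda a: 1.0 / (1.0 + math.exp(-a))):
    """Artificial neuron: pre-activation a = sum_i w_i * x_i + b, then an activation.
    The sigmoid default is only an illustrative choice, not fixed by the slide."""
    a = sum(wi * xi for wi, xi in zip(w, x)) + b   # pre-activation (cell body)
    return activation(a)                           # activation (output along the axon)

# Two inputs, weights (1, 2), bias -1: a = 0*1 + 1*2 - 1 = 1, sigmoid(1) ≈ 0.73
print(neuron([0.0, 1.0], [1.0, 2.0], -1.0))
```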

Perceptron
Rosenblatt, 1957

Figure: Mark I Perceptron machine

Perceptron learning

Input: a sample S = {(x(1), y1), · · · , (x(n), yn)}I Initialize a parameter t to 0I Initialize the weights wi with random values.I Repeat

I Pick an example x(k) = [x(k)1 , · · · , x(k)

d ]T from SI Let y(k) be the target value and y(k) the computed value using the current

perceptronI If y(k) 6= y(k) then

I Update the weights: wi(t + 1) = wi(t) + ∆wi(t) where∆wi(t) = (y(k) − y(k))x

(k)i

I End IfI t = t+ 1

I Until all the examples in S were visited and no change occurs in theweights

Output: a perceptron for linear discrimination of S

Jamal Atif (Université Paris-Dauphine) Deep Learning February 8, 2016 8 / 1
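A minimal Python sketch of the learning rule above, run on the OR data used in the next example; the threshold activation (1 if a > 0, else 0), the constant bias input x3 = 1, and the initial weights are taken from that example.

```python
def perceptron_train(samples, w, max_epochs=100):
    """Perceptron learning: w_i <- w_i + (y - y_hat) * x_i on each misclassified example."""
    for _ in range(max_epochs):
        changed = False
        for x, y in samples:
            a = sum(wi * xi for wi, xi in zip(w, x))
            y_hat = 1 if a > 0 else 0
            if y_hat != y:
                w = [wi + (y - y_hat) * xi for wi, xi in zip(w, x)]
                changed = True
        if not changed:          # all examples visited with no weight change
            return w
    return w

# OR function; the third component is the constant bias input x3 = 1
or_data = [([0, 0, 1], 0), ([0, 1, 1], 1), ([1, 0, 1], 1), ([1, 1, 1], 1)]
print(perceptron_train(or_data, w=[1, 1, -1]))   # converges to [1, 2, 0], as in the table
```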

Perceptron learning
Example: learning the OR function

Initialization: w1(0) = w2(0) = 1, w3(0) = −1

t  w1(t)  w2(t)  w3(t)  x^(k)  ∑_i w_i x^(k)_i  ŷ^(k)  y^(k)  Δw1(t)  Δw2(t)  Δw3(t)
0    1      1     −1     001        −1            0      0       0       0       0
1    1      1     −1     011         0            0      1       0       1       1
2    1      2      0     101         1            1      1       0       0       0
3    1      2      0     111         3            1      1       0       0       0
4    1      2      0     001         0            0      0       0       0       0
5    1      2      0     011         2            1      1       0       0       0

Perceptron illustration

Activation: ŷ = 1 if a > 0, 0 otherwise, with a = ∑_i w_i x_i (x3 = 1 is the constant bias input). Update rule: w_i ← w_i + (y − ŷ) x_i.

- t = 0, x^(1) = [0, 0]^T, w = (1, 1, −1): a = −1, ŷ = 0, y = 0, no change.
- t = 1, x^(2) = [0, 1]^T, w = (1, 1, −1): a = 0, ŷ = 0, y = 1; update w1 = 1 + 1·0 = 1, w2 = 1 + 1·1 = 2, w3 = −1 + 1·1 = 0.
- t = 2, x^(3) = [1, 0]^T, w = (1, 2, 0): a = 1, ŷ = 1, y = 1, no change.
- t = 3, x^(4) = [1, 1]^T, w = (1, 2, 0): a = 3, ŷ = 1, y = 1, no change.
- t = 4, x^(1) = [0, 0]^T, w = (1, 2, 0): a = 0, ŷ = 0, y = 0, no change.
- t = 5, x^(2) = [0, 1]^T, w = (1, 2, 0): a = 2, ŷ = 1, y = 1, no change: the perceptron has converged.

Perceptron capacity

Figure: OR(x1, x2) and AND(x1, x2) plotted on the unit square. Both functions are linearly separable, so each can be computed by a single perceptron.

Perceptron autumn

Figure: XOR(x1, x2) plotted on the unit square. The two classes are not linearly separable, so no single perceptron can compute XOR.

Link with logistic regression

Pre-activation: a(x) = ∑_{i=1}^{d} w_i x_i + b, with inputs x1, …, xd and the constant input x_{d+1} = 1 carrying the bias b.

Hypothesis: h(x) = 1 / (1 + e^{−a(x)})

Stochastic gradient update rule:

w_j = w_j + λ (y^(i) − h(x^(i))) x^(i)_j

The same as the perceptron update rule!
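A minimal Python sketch of this stochastic gradient rule; the learning rate, number of steps, and the reuse of the OR data with a constant bias input are illustrative assumptions.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def logreg_sgd(samples, d, lam=0.5, steps=200, seed=0):
    """Stochastic gradient update for logistic regression:
    w_j <- w_j + lam * (y - h(x)) * x_j, the same form as the perceptron rule."""
    random.seed(seed)
    w = [0.0] * d
    for _ in range(steps):
        x, y = random.choice(samples)                 # pick one example at random
        h = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
        w = [wj + lam * (y - h) * xj for wj, xj in zip(w, x)]
    return w

# OR data again, with a constant input carrying the bias
data = [([0, 0, 1], 0), ([0, 1, 1], 1), ([1, 0, 1], 1), ([1, 1, 1], 1)]
print(logreg_sgd(data, d=3))
```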

But!

Figure: XOR(x1, x2) on the unit square with two linear decision boundaries, labelled AND(x1, x2) and AND(x̄1, x̄2): combining several linear units makes XOR separable.

Multilayer Perceptron
Paul Werbos, 1984. David Rumelhart, 1986.

Figure: A multilayer perceptron: an input layer x1, …, xd (plus a constant input 1 for the bias b), one or more hidden layers, and an output layer o1, …, om.

One-hidden-layer perceptron

Figure: A network with an input layer x1, …, xn, a hidden layer h1, …, hd (weights w_{ij}, bias b), and an output layer (weights w^2_i, bias b^2).

Hidden layer: h_j(x) = g(b + ∑_{i=1}^{n} w_{ij} x_i)

Output layer: f(x) = o(b^2 + ∑_{i=1}^{d} w^2_i h_i(x))
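A minimal numpy sketch of this forward pass; the sigmoid hidden activation g, the identity output o, and the layer sizes are illustrative assumptions.

```python
import numpy as np

def forward(x, W1, b1, W2, b2,
            g=lambda a: 1 / (1 + np.exp(-a)),   # hidden activation (assumed sigmoid)
            o=lambda a: a):                      # output activation (assumed identity)
    """One-hidden-layer forward pass:
    h_j(x) = g(b1_j + sum_i W1[i, j] * x_i),  f(x) = o(b2 + sum_j W2[j] * h_j(x))."""
    h = g(x @ W1 + b1)          # hidden layer
    return o(h @ W2 + b2)       # output layer

rng = np.random.default_rng(0)
x = np.array([0.0, 1.0, 1.0])               # n = 3 inputs
W1 = rng.normal(size=(3, 4))                # d = 4 hidden units
b1 = np.zeros(4)
W2 = rng.normal(size=4)                     # single output unit
b2 = 0.0
print(forward(x, W1, b1, W2, b2))
```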

MLP training: backpropagation
XOR example

Figure: A 2-2-1 network for XOR. Inputs x1 and x2 feed two hidden units with pre-activations a_j = ∑_i w^1_{ij} x_i + b^1_j and outputs h_j(x) = 1 / (1 + e^{−a_j(x)}); the output unit computes a_o = ∑_k w^2_k h_k + b^2 and ŷ = h_o(x) = 1 / (1 + e^{−a_o(x)}).

MLP training: backpropagation
XOR example: initialisation

w^1_{11} = w^1_{12} = w^1_{21} = w^1_{22} = 1, b^1_1 = b^1_2 = −1, w^2_1 = w^2_2 = 1, b^2 = −1.

MLP training: backpropagation
XOR example: feed-forward

Input x = (0, 0), target y = 0.

a_1 = a_2 = −1, so h_1 = h_2 = 1 / (1 + e^{1}) ≈ 0.27.
a_o = w^2_1 h_1 + w^2_2 h_2 + b^2 ≈ 0.27 + 0.27 − 1 = −0.46, so ŷ = h_o(x) ≈ 0.39.
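A quick numerical check of this forward pass in Python, using the initial parameters from the previous slide (all weights 1, all biases −1).

```python
import math

sigmoid = lambda a: 1.0 / (1.0 + math.exp(-a))

# Initial parameters: w1_11 = w1_12 = w1_21 = w1_22 = 1, b1_1 = b1_2 = -1,
# w2_1 = w2_2 = 1, b2 = -1; input x = (0, 0), target y = 0.
x1, x2 = 0.0, 0.0
a1 = 1 * x1 + 1 * x2 - 1            # -1
a2 = 1 * x1 + 1 * x2 - 1            # -1
h1, h2 = sigmoid(a1), sigmoid(a2)   # ≈ 0.27 each
ao = 1 * h1 + 1 * h2 - 1            # ≈ -0.46
print(h1, ao, sigmoid(ao))          # ≈ 0.27, -0.46, 0.39
```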

MLP training: backpropagation
XOR example: backpropagation (output layer)

With E_w = ½ (y − ŷ)² and the forward values above (h_1 = h_2 ≈ 0.27, ŷ ≈ 0.39, y = 0), the gradient with respect to an output weight is

δ_k = ∂E_w/∂w^2_k = (∂E_w/∂ŷ)(∂ŷ/∂a_o)(∂a_o/∂w^2_k) = −(y − ŷ)(1 − ŷ) ŷ h_k,

and the update is w^2_k = w^2_k − η δ_k.

MLP training: backpropagation
XOR example: backpropagation

Output layer (learning rate η = 1):

δ_k = ∂E_w/∂w^2_k = (∂E_w/∂ŷ)(∂ŷ/∂a_o)(∂a_o/∂w^2_k) = −(y − ŷ)(1 − ŷ) ŷ h_k, with E_w = ½ (y − ŷ)²
w^2_k = w^2_k − η δ_k

w^2_1 = w^2_1 + 1 · (0 − .39)(1 − .39) · .3 · 0.27 ≈ 0.98
w^2_2 = w^2_2 + 1 · (0 − .39)(1 − .39) · .3 · 0.27 ≈ 0.98
b^2 = b^2 + 1 · (0 − .39)(1 − .39) · .3 · 1 ≈ −1.07

Hidden layer:

δ_ij = ∂E_w/∂w^1_{ij} = (∂E_w/∂h_o)(∂h_o/∂a_o)(∂a_o/∂h_j)(∂h_j/∂a_j)(∂a_j/∂w^1_{ij}) = −(y − ŷ)(1 − ŷ) ŷ w^2_j h_j (1 − h_j) x_i
w^1_{ij} = w^1_{ij} − η δ_ij

w^1_{11} = 1 + 1 · (0 − .39)(1 − .39) · .39 · 1 · .27 · (1 − .27) · 0 = 1
w^1_{22} = 1 + 1 · (0 − .39)(1 − .39) · .39 · 1 · .27 · (1 − .27) · 0 = 1
b^1_1 = −1 + 1 · (0 − .39)(1 − .39) · .39 · 1 · .27 · (1 − .27) · (−1) ≈ −0.98
…

Updated parameters: w^1 unchanged, b^1_1 = b^1_2 ≈ −0.98, w^2_1 = w^2_2 ≈ 0.98, b^2 ≈ −1.07.
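A numpy sketch of one such backpropagation step for this 2-2-1 network, following the chain-rule expressions above with E = ½(y − ŷ)², η = 1, and the convention w ← w − η ∂E/∂w. It reproduces the forward values (h ≈ 0.27, ŷ ≈ 0.39) and the updated output weights (≈ 0.98); the bias values it produces differ slightly from the rounded arithmetic on the slide.

```python
import numpy as np

sigmoid = lambda a: 1 / (1 + np.exp(-a))

# Parameters from the slides: all first-layer weights 1, biases -1, eta = 1.
W1, b1 = np.ones((2, 2)), np.array([-1.0, -1.0])
w2, b2 = np.ones(2), -1.0
x, y = np.array([0.0, 0.0]), 0.0
eta = 1.0

# Forward pass
h = sigmoid(x @ W1 + b1)                    # ≈ [0.27, 0.27]
y_hat = sigmoid(h @ w2 + b2)                # ≈ 0.39

# Backward pass, E = 0.5 * (y - y_hat)**2
d_ao = -(y - y_hat) * y_hat * (1 - y_hat)   # dE/da_o
d_ah = d_ao * w2 * h * (1 - h)              # dE/da_j, through the hidden sigmoids
w2 = w2 - eta * d_ao * h                    # -> ≈ [0.98, 0.98]
b2 = b2 - eta * d_ao
W1 = W1 - eta * np.outer(x, d_ah)           # unchanged here, since x = (0, 0)
b1 = b1 - eta * d_ah
print(w2, b2, W1, b1)
```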

MLP training: backpropagation
XOR example: feed-forward after the update

Input x = (0, 1), target y = 1, with the updated parameters (w^1 unchanged, b^1_1 = b^1_2 = −.98, w^2_1 = w^2_2 = 0.98, b^2 = −1.07).

a_1 = a_2 = .02, so h_1 = h_2 ≈ 0.5.
a_o ≈ −0.09, so ŷ = h_o(x) ≈ 0.5.

Multilayer Perceptron as a deep NN

An MLP with more than 2 or 3 layers is a deep network.

Figure: An MLP with an input layer x1, …, xd (plus a constant input 1 for the bias b), several hidden layers, and an output layer o1, …, om.

Training issues

Setting the hyperparameters:
- Initialization
- Number of iterations
- Learning rate η
- Activation function
- Early stopping criterion

Overfitting/underfitting:
- Number of hidden layers
- Number of neurons

Optimization:
- Vanishing gradient problem

Overfitting/Underfitting

Figure: Illustrations of underfitting and overfitting. Source: Bishop, PRML.

What's new!

Before 2006, training deep architectures was unsuccessful!

- Underfitting: new optimization techniques
- Overfitting (Bengio, Hinton, LeCun):
  - Unsupervised pre-training
  - Stochastic "dropout" training

Unsupervised pre-training

Main idea: initialize the network in an unsupervised way.

Consequences:
- Network layers encode successive invariant features (the latent structure of the data)
- Better optimization, and hence better generalization

Unsupervised pre-training

How to? Proceed greedily, layer by layer: train the first hidden layer on the raw input, then train each subsequent hidden layer on the representation produced by the previous one (a sketch follows the list below).

Unsupervised NN learning techniques:
- Restricted Boltzmann Machines (RBM)
- Auto-encoders
- and many, many variants since 2006
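As a rough illustration of the greedy procedure, here is the numpy sketch mentioned above, stacking tied-weight autoencoders; the layer sizes, learning rate, squared reconstruction error, and plain batch gradient descent are all illustrative assumptions, not the exact recipe of the slides (which also mention RBMs).

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def train_autoencoder(X, n_hidden, lr=0.1, steps=200, seed=0):
    """One tied-weight autoencoder: h = s(X W + b), X_hat = s(h W^T + c),
    trained by batch gradient descent on the mean squared reconstruction error."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(scale=0.1, size=(d, n_hidden))
    b, c = np.zeros(n_hidden), np.zeros(d)
    for _ in range(steps):
        H = sigmoid(X @ W + b)                      # encode
        X_hat = sigmoid(H @ W.T + c)                # decode (tied weights)
        d_out = (X_hat - X) * X_hat * (1 - X_hat)   # gradient at the decoder pre-activation
        d_hid = (d_out @ W) * H * (1 - H)           # gradient at the encoder pre-activation
        W -= lr * (X.T @ d_hid + d_out.T @ H) / n   # encoder + decoder contributions
        b -= lr * d_hid.mean(axis=0)
        c -= lr * d_out.mean(axis=0)
    return W, b

def greedy_pretrain(X, layer_sizes):
    """Greedy layer-by-layer pre-training: each autoencoder is trained on the
    representation produced by the previously trained layer."""
    params, H = [], X
    for n_hidden in layer_sizes:
        W, b = train_autoencoder(H, n_hidden)
        params.append((W, b))
        H = sigmoid(H @ W + b)   # feed the learned representation to the next layer
    return params

# Toy usage: pre-train a 6-4-2 stack on random binary data
X = (np.random.default_rng(1).random((100, 6)) > 0.5).astype(float)
print([W.shape for W, _ in greedy_pretrain(X, [4, 2])])
```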

Unsupervised pre-training
Autoencoder

Figure: An autoencoder. Credit: Hugo Larochelle

Unsupervised pre-training
Sparse Autoencoders

Figure: Sparse autoencoders. Credit: Y. Bengio

Fine tuning

How to?
- Add the output layer
- Initialize its weights with random values
- Initialize the hidden layers with the values obtained from pre-training
- Update all the layers by backpropagation

Figure: The pre-trained network with a randomly initialized output layer added on top.

Dropout

Intuition: regularize the network by stochastically dropping out some hidden units.

Procedure: set each hidden unit to 0 with probability p (common choice: p = 0.5).

Figure: An MLP in which some hidden units have been dropped out.
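A minimal numpy sketch of the procedure above; the 1/(1 − p) rescaling of the surviving units ("inverted dropout") is a common implementation convention, not something stated on the slide.

```python
import numpy as np

def dropout(h, p=0.5, training=True, rng=None):
    """Set each hidden unit to 0 with probability p during training.
    Surviving units are rescaled by 1 / (1 - p) so the expected activation is
    unchanged ("inverted dropout"); at test time the layer is left untouched."""
    if not training:
        return h
    rng = rng or np.random.default_rng()
    mask = (rng.random(h.shape) >= p).astype(h.dtype)
    return h * mask / (1.0 - p)

h = np.array([0.2, 0.7, 0.5, 0.9])
print(dropout(h, rng=np.random.default_rng(0)))  # about half the units zeroed, the rest scaled by 2
```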

Some applications


Computer vision
Convolutional Neural Networks (LeCun, 1989–)

State of the art in digit recognition.


Modern CNN

A. Krizhevsky, I. Sutskever, and G. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012.

- 7 hidden layers, 650,000 units, 60,000,000 parameters
- Dropout
- 10^6 images
- GPU implementation
- Activation function: f(x) = max(0, x) (ReLU; sketched below)
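The ReLU activation named above is just an elementwise max; here is a minimal numpy sketch pairing it with a toy convolution. The stride-1, no-padding, single-channel setup is an assumption chosen only to show where ReLU sits inside a CNN layer.

```python
import numpy as np

def relu(x):
    """ReLU activation: f(x) = max(0, x), applied elementwise."""
    return np.maximum(0, x)

def conv2d(image, kernel):
    """Valid 2-D convolution (stride 1, no padding, single channel), in the
    cross-correlation form conventionally used in CNNs."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

img = np.arange(16, dtype=float).reshape(4, 4)
k = np.array([[1.0, -1.0], [1.0, -1.0]])
print(relu(conv2d(img, k)))  # one CNN building block: convolution followed by ReLU
```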

Computer vision
Convolutional Neural Networks

Figure: from LeCun, Bengio, and Hinton, "Deep learning", Nature, 2015.

What is deep in deep learning!
Old vs. new ML paradigms

Old paradigm: raw data → hand-crafted features → classifier → decision.
New paradigm: raw data → learned features → classifier → decision, with the feature-learning stages and the classifier trained together (representation learning).

Software

- TensorFlow (Google): https://www.tensorflow.org/
- Theano (Python, CPU/GPU): mathematical and deep learning library, http://deeplearning.net/software/theano. Can do automatic, symbolic differentiation.
- Senna (NLP), by Collobert et al.: http://ronan.collobert.com/senna/
  - State-of-the-art performance on many tasks
  - 3500 lines of C, extremely fast and using very little memory
- Torch ML Library (C + Lua): http://www.torch.ch/
- Recurrent Neural Network Language Model: http://www.fit.vutbr.cz/~imikolov/rnnlm
- Recursive Neural Net and RAE models for paraphrase detection, sentiment analysis, relation classification: www.socher.org
