
Deep Learning: a gentle introduction

Jamal Atif, jamal.atif@dauphine.fr

PSL, Université Paris-Dauphine, LAMSADE

February 8, 2016


Why a talk about deep learning?

Convolutional Networks (Yann LeCun): the ILSVRC challenge: 1,000 categories and 1,461,406 images.

Neuron’s basic anatomy

Figure: A neuron's basic anatomy consists of four parts: a soma (cell body), dendrites, an axon, and nerve terminals. Information is received by the dendrites, gets collected in the cell body, and flows down the axon.

Artificial neuron

Figure: Diagram of an artificial neuron. Inputs x1, …, xn (together with a constant input x_{n+1} = 1) are weighted by w1, …, wn and the bias b and summed into the pre-activation ∑_{i=1}^{n} w_i x_i + b, which is then passed through an activation function. The inputs, the summation, and the output play the roles of the dendrites, the cell body, and the axon.
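To make the pre-activation/activation split concrete, here is a minimal Python sketch of such a neuron; the sigmoid used as the default activation is an illustrative assumption (the slide leaves the activation unspecified).

```python
import math

def neuron(x, w, b, activation=lambda a: 1.0 / (1.0 + math.exp(-a))):
    """Artificial neuron: pre-activation a = sum_i w_i * x_i + b, then an activation.
    The sigmoid default is only an illustrative choice, not fixed by the slide."""
    a = sum(wi * xi for wi, xi in zip(w, x)) + b   # pre-activation (cell body)
    return activation(a)                           # activation (output along the axon)

# Two inputs, weights (1, 2), bias -1: a = 0*1 + 1*2 - 1 = 1, sigmoid(1) ≈ 0.73
print(neuron([0.0, 1.0], [1.0, 2.0], -1.0))
```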

Perceptron
Rosenblatt, 1957

Figure: Mark I Perceptron machine

Perceptron learning

Input: a sample S = {(x(1), y1), · · · , (x(n), yn)}I Initialize a parameter t to 0I Initialize the weights wi with random values.I Repeat

I Pick an example x(k) = [x(k)1 , · · · , x(k)

d ]T from SI Let y(k) be the target value and y(k) the computed value using the current

perceptronI If y(k) 6= y(k) then

I Update the weights: wi(t + 1) = wi(t) + ∆wi(t) where∆wi(t) = (y(k) − y(k))x

(k)i

I End IfI t = t+ 1

I Until all the examples in S were visited and no change occurs in theweights

Output: a perceptron for linear discrimination of S

Jamal Atif (Université Paris-Dauphine) Deep Learning February 8, 2016 8 / 1
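A minimal Python sketch of the learning rule above, run on the OR data used in the next example; the threshold activation (1 if a > 0, else 0), the constant bias input x3 = 1, and the initial weights are taken from that example.

```python
def perceptron_train(samples, w, max_epochs=100):
    """Perceptron learning: w_i <- w_i + (y - y_hat) * x_i on each misclassified example."""
    for _ in range(max_epochs):
        changed = False
        for x, y in samples:
            a = sum(wi * xi for wi, xi in zip(w, x))
            y_hat = 1 if a > 0 else 0
            if y_hat != y:
                w = [wi + (y - y_hat) * xi for wi, xi in zip(w, x)]
                changed = True
        if not changed:          # all examples visited with no weight change
            return w
    return w

# OR function; the third component is the constant bias input x3 = 1
or_data = [([0, 0, 1], 0), ([0, 1, 1], 1), ([1, 0, 1], 1), ([1, 1, 1], 1)]
print(perceptron_train(or_data, w=[1, 1, -1]))   # converges to [1, 2, 0], as in the table
```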

Perceptron learning
Example: learning the OR function

Initialization: w1(0) = w2(0) = 1, w3(0) = −1

t  w1(t)  w2(t)  w3(t)  x^(k)  ∑_i w_i x^(k)_i  ŷ^(k)  y^(k)  Δw1(t)  Δw2(t)  Δw3(t)
0    1      1     −1     001        −1            0      0       0       0       0
1    1      1     −1     011         0            0      1       0       1       1
2    1      2      0     101         1            1      1       0       0       0
3    1      2      0     111         3            1      1       0       0       0
4    1      2      0     001         0            0      0       0       0       0
5    1      2      0     011         2            1      1       0       0       0

Perceptron illustration

Activation: ŷ = 1 if a > 0, 0 otherwise, with a = ∑_i w_i x_i (x3 = 1 is the constant bias input). Update rule: w_i ← w_i + (y − ŷ) x_i.

- t = 0, x^(1) = [0, 0]^T, w = (1, 1, −1): a = −1, ŷ = 0, y = 0, no change.
- t = 1, x^(2) = [0, 1]^T, w = (1, 1, −1): a = 0, ŷ = 0, y = 1; update w1 = 1 + 1·0 = 1, w2 = 1 + 1·1 = 2, w3 = −1 + 1·1 = 0.
- t = 2, x^(3) = [1, 0]^T, w = (1, 2, 0): a = 1, ŷ = 1, y = 1, no change.
- t = 3, x^(4) = [1, 1]^T, w = (1, 2, 0): a = 3, ŷ = 1, y = 1, no change.
- t = 4, x^(1) = [0, 0]^T, w = (1, 2, 0): a = 0, ŷ = 0, y = 0, no change.
- t = 5, x^(2) = [0, 1]^T, w = (1, 2, 0): a = 2, ŷ = 1, y = 1, no change: the perceptron has converged.

Perceptron capacity

Figure: OR(x1, x2) and AND(x1, x2) plotted on the unit square. Both functions are linearly separable, so each can be computed by a single perceptron.

Perceptron autumn

Figure: XOR(x1, x2) plotted on the unit square. The two classes are not linearly separable, so no single perceptron can compute XOR.

Link with logistic regression

Pre-activation: a(x) = ∑_{i=1}^{d} w_i x_i + b, with inputs x1, …, xd and the constant input x_{d+1} = 1 carrying the bias b.

Hypothesis: h(x) = 1 / (1 + e^{−a(x)})

Stochastic gradient update rule:

w_j = w_j + λ (y^(i) − h(x^(i))) x^(i)_j

The same as the perceptron update rule!
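A minimal Python sketch of this stochastic gradient rule; the learning rate, number of steps, and the reuse of the OR data with a constant bias input are illustrative assumptions.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def logreg_sgd(samples, d, lam=0.5, steps=200, seed=0):
    """Stochastic gradient update for logistic regression:
    w_j <- w_j + lam * (y - h(x)) * x_j, the same form as the perceptron rule."""
    random.seed(seed)
    w = [0.0] * d
    for _ in range(steps):
        x, y = random.choice(samples)                 # pick one example at random
        h = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
        w = [wj + lam * (y - h) * xj for wj, xj in zip(w, x)]
    return w

# OR data again, with a constant input carrying the bias
data = [([0, 0, 1], 0), ([0, 1, 1], 1), ([1, 0, 1], 1), ([1, 1, 1], 1)]
print(logreg_sgd(data, d=3))
```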

But!

Figure: XOR(x1, x2) on the unit square with two linear decision boundaries, labelled AND(x1, x2) and AND(x̄1, x̄2): combining several linear units makes XOR separable.

Multilayer Perceptron
Paul Werbos, 1984. David Rumelhart, 1986.

Figure: A multilayer perceptron: an input layer x1, …, xd (plus a constant input 1 for the bias b), one or more hidden layers, and an output layer o1, …, om.

One-hidden-layer perceptron

Figure: A network with an input layer x1, …, xn, a hidden layer h1, …, hd (weights w_{ij}, bias b), and an output layer (weights w^2_i, bias b^2).

Hidden layer: h_j(x) = g(b + ∑_{i=1}^{n} w_{ij} x_i)

Output layer: f(x) = o(b^2 + ∑_{i=1}^{d} w^2_i h_i(x))
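A minimal numpy sketch of this forward pass; the sigmoid hidden activation g, the identity output o, and the layer sizes are illustrative assumptions.

```python
import numpy as np

def forward(x, W1, b1, W2, b2,
            g=lambda a: 1 / (1 + np.exp(-a)),   # hidden activation (assumed sigmoid)
            o=lambda a: a):                      # output activation (assumed identity)
    """One-hidden-layer forward pass:
    h_j(x) = g(b1_j + sum_i W1[i, j] * x_i),  f(x) = o(b2 + sum_j W2[j] * h_j(x))."""
    h = g(x @ W1 + b1)          # hidden layer
    return o(h @ W2 + b2)       # output layer

rng = np.random.default_rng(0)
x = np.array([0.0, 1.0, 1.0])               # n = 3 inputs
W1 = rng.normal(size=(3, 4))                # d = 4 hidden units
b1 = np.zeros(4)
W2 = rng.normal(size=4)                     # single output unit
b2 = 0.0
print(forward(x, W1, b1, W2, b2))
```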

MLP training: backpropagation
XOR example

Figure: A 2-2-1 network for XOR. Inputs x1 and x2 feed two hidden units with pre-activations a_j = ∑_i w^1_{ij} x_i + b^1_j and outputs h_j(x) = 1 / (1 + e^{−a_j(x)}); the output unit computes a_o = ∑_k w^2_k h_k + b^2 and ŷ = h_o(x) = 1 / (1 + e^{−a_o(x)}).

MLP training: backpropagation
XOR example: initialisation

w^1_{11} = w^1_{12} = w^1_{21} = w^1_{22} = 1, b^1_1 = b^1_2 = −1, w^2_1 = w^2_2 = 1, b^2 = −1.

MLP training: backpropagation
XOR example: feed-forward

Input x = (0, 0), target y = 0.

a_1 = a_2 = −1, so h_1 = h_2 = 1 / (1 + e^{1}) ≈ 0.27.
a_o = w^2_1 h_1 + w^2_2 h_2 + b^2 ≈ 0.27 + 0.27 − 1 = −0.46, so ŷ = h_o(x) ≈ 0.39.
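A quick numerical check of this forward pass in Python, using the initial parameters from the previous slide (all weights 1, all biases −1).

```python
import math

sigmoid = lambda a: 1.0 / (1.0 + math.exp(-a))

# Initial parameters: w1_11 = w1_12 = w1_21 = w1_22 = 1, b1_1 = b1_2 = -1,
# w2_1 = w2_2 = 1, b2 = -1; input x = (0, 0), target y = 0.
x1, x2 = 0.0, 0.0
a1 = 1 * x1 + 1 * x2 - 1            # -1
a2 = 1 * x1 + 1 * x2 - 1            # -1
h1, h2 = sigmoid(a1), sigmoid(a2)   # ≈ 0.27 each
ao = 1 * h1 + 1 * h2 - 1            # ≈ -0.46
print(h1, ao, sigmoid(ao))          # ≈ 0.27, -0.46, 0.39
```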

MLP training: backpropagation
XOR example: backpropagation (output layer)

With E_w = ½ (y − ŷ)² and the forward values above (h_1 = h_2 ≈ 0.27, ŷ ≈ 0.39, y = 0), the gradient with respect to an output weight is

δ_k = ∂E_w/∂w^2_k = (∂E_w/∂ŷ)(∂ŷ/∂a_o)(∂a_o/∂w^2_k) = −(y − ŷ)(1 − ŷ) ŷ h_k,

and the update is w^2_k = w^2_k − η δ_k.

MLP training: backpropagation
XOR example: backpropagation

Output layer (learning rate η = 1):

δ_k = ∂E_w/∂w^2_k = (∂E_w/∂ŷ)(∂ŷ/∂a_o)(∂a_o/∂w^2_k) = −(y − ŷ)(1 − ŷ) ŷ h_k, with E_w = ½ (y − ŷ)²
w^2_k = w^2_k − η δ_k

w^2_1 = w^2_1 + 1 · (0 − .39)(1 − .39) · .3 · 0.27 ≈ 0.98
w^2_2 = w^2_2 + 1 · (0 − .39)(1 − .39) · .3 · 0.27 ≈ 0.98
b^2 = b^2 + 1 · (0 − .39)(1 − .39) · .3 · 1 ≈ −1.07

Hidden layer:

δ_ij = ∂E_w/∂w^1_{ij} = (∂E_w/∂h_o)(∂h_o/∂a_o)(∂a_o/∂h_j)(∂h_j/∂a_j)(∂a_j/∂w^1_{ij}) = −(y − ŷ)(1 − ŷ) ŷ w^2_j h_j (1 − h_j) x_i
w^1_{ij} = w^1_{ij} − η δ_ij

w^1_{11} = 1 + 1 · (0 − .39)(1 − .39) · .39 · 1 · .27 · (1 − .27) · 0 = 1
w^1_{22} = 1 + 1 · (0 − .39)(1 − .39) · .39 · 1 · .27 · (1 − .27) · 0 = 1
b^1_1 = −1 + 1 · (0 − .39)(1 − .39) · .39 · 1 · .27 · (1 − .27) · (−1) ≈ −0.98
…

Updated parameters: w^1 unchanged, b^1_1 = b^1_2 ≈ −0.98, w^2_1 = w^2_2 ≈ 0.98, b^2 ≈ −1.07.
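A numpy sketch of one such backpropagation step for this 2-2-1 network, following the chain-rule expressions above with E = ½(y − ŷ)², η = 1, and the convention w ← w − η ∂E/∂w. It reproduces the forward values (h ≈ 0.27, ŷ ≈ 0.39) and the updated output weights (≈ 0.98); the bias values it produces differ slightly from the rounded arithmetic on the slide.

```python
import numpy as np

sigmoid = lambda a: 1 / (1 + np.exp(-a))

# Parameters from the slides: all first-layer weights 1, biases -1, eta = 1.
W1, b1 = np.ones((2, 2)), np.array([-1.0, -1.0])
w2, b2 = np.ones(2), -1.0
x, y = np.array([0.0, 0.0]), 0.0
eta = 1.0

# Forward pass
h = sigmoid(x @ W1 + b1)                    # ≈ [0.27, 0.27]
y_hat = sigmoid(h @ w2 + b2)                # ≈ 0.39

# Backward pass, E = 0.5 * (y - y_hat)**2
d_ao = -(y - y_hat) * y_hat * (1 - y_hat)   # dE/da_o
d_ah = d_ao * w2 * h * (1 - h)              # dE/da_j, through the hidden sigmoids
w2 = w2 - eta * d_ao * h                    # -> ≈ [0.98, 0.98]
b2 = b2 - eta * d_ao
W1 = W1 - eta * np.outer(x, d_ah)           # unchanged here, since x = (0, 0)
b1 = b1 - eta * d_ah
print(w2, b2, W1, b1)
```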

MLP training: backpropagation
XOR example: feed-forward after the update

Input x = (0, 1), target y = 1, with the updated parameters (w^1 unchanged, b^1_1 = b^1_2 = −.98, w^2_1 = w^2_2 = 0.98, b^2 = −1.07).

a_1 = a_2 = .02, so h_1 = h_2 ≈ 0.5.
a_o ≈ −0.09, so ŷ = h_o(x) ≈ 0.5.

Multilayer Perceptron as a deep NN

An MLP with more than 2 or 3 layers is a deep network.

Figure: An MLP with an input layer x1, …, xd (plus a constant input 1 for the bias b), several hidden layers, and an output layer o1, …, om.

Training issues

Setting the hyperparameters:
- Initialization
- Number of iterations
- Learning rate η
- Activation function
- Early stopping criterion

Overfitting/underfitting:
- Number of hidden layers
- Number of neurons

Optimization:
- Vanishing gradient problem

Overfitting/Underfitting

Figure: Illustrations of underfitting and overfitting. Source: Bishop, PRML.

What's new!

Before 2006, training deep architectures was unsuccessful!

- Underfitting: new optimization techniques
- Overfitting (Bengio, Hinton, LeCun):
  - Unsupervised pre-training
  - Stochastic "dropout" training

Unsupervised pre-training

Main idea: initialize the network in an unsupervised way.

Consequences:
- Network layers encode successive invariant features (the latent structure of the data)
- Better optimization, and hence better generalization

Unsupervised pre-training

How to? Proceed greedily, layer by layer: train the first hidden layer on the raw input, then train each subsequent hidden layer on the representation produced by the previous one (a sketch follows the list below).

Unsupervised NN learning techniques:
- Restricted Boltzmann Machines (RBM)
- Auto-encoders
- and many, many variants since 2006
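As a rough illustration of the greedy procedure, here is the numpy sketch mentioned above, stacking tied-weight autoencoders; the layer sizes, learning rate, squared reconstruction error, and plain batch gradient descent are all illustrative assumptions, not the exact recipe of the slides (which also mention RBMs).

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def train_autoencoder(X, n_hidden, lr=0.1, steps=200, seed=0):
    """One tied-weight autoencoder: h = s(X W + b), X_hat = s(h W^T + c),
    trained by batch gradient descent on the mean squared reconstruction error."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(scale=0.1, size=(d, n_hidden))
    b, c = np.zeros(n_hidden), np.zeros(d)
    for _ in range(steps):
        H = sigmoid(X @ W + b)                      # encode
        X_hat = sigmoid(H @ W.T + c)                # decode (tied weights)
        d_out = (X_hat - X) * X_hat * (1 - X_hat)   # gradient at the decoder pre-activation
        d_hid = (d_out @ W) * H * (1 - H)           # gradient at the encoder pre-activation
        W -= lr * (X.T @ d_hid + d_out.T @ H) / n   # encoder + decoder contributions
        b -= lr * d_hid.mean(axis=0)
        c -= lr * d_out.mean(axis=0)
    return W, b

def greedy_pretrain(X, layer_sizes):
    """Greedy layer-by-layer pre-training: each autoencoder is trained on the
    representation produced by the previously trained layer."""
    params, H = [], X
    for n_hidden in layer_sizes:
        W, b = train_autoencoder(H, n_hidden)
        params.append((W, b))
        H = sigmoid(H @ W + b)   # feed the learned representation to the next layer
    return params

# Toy usage: pre-train a 6-4-2 stack on random binary data
X = (np.random.default_rng(1).random((100, 6)) > 0.5).astype(float)
print([W.shape for W, _ in greedy_pretrain(X, [4, 2])])
```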

Unsupervised pre-training
Autoencoder

Figure: An autoencoder. Credit: Hugo Larochelle

Unsupervised pre-training
Sparse Autoencoders

Figure: Sparse autoencoders. Credit: Y. Bengio

Fine tuning

How to?
- Add the output layer
- Initialize its weights with random values
- Initialize the hidden layers with the values obtained from pre-training
- Update all the layers by backpropagation

Figure: The pre-trained network with a randomly initialized output layer added on top.

Dropout

Intuition: regularize the network by stochastically dropping out some hidden units.

Procedure: set each hidden unit to 0 with probability p (common choice: p = 0.5).

Figure: An MLP in which some hidden units have been dropped out.
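A minimal numpy sketch of the procedure above; the 1/(1 − p) rescaling of the surviving units ("inverted dropout") is a common implementation convention, not something stated on the slide.

```python
import numpy as np

def dropout(h, p=0.5, training=True, rng=None):
    """Set each hidden unit to 0 with probability p during training.
    Surviving units are rescaled by 1 / (1 - p) so the expected activation is
    unchanged ("inverted dropout"); at test time the layer is left untouched."""
    if not training:
        return h
    rng = rng or np.random.default_rng()
    mask = (rng.random(h.shape) >= p).astype(h.dtype)
    return h * mask / (1.0 - p)

h = np.array([0.2, 0.7, 0.5, 0.9])
print(dropout(h, rng=np.random.default_rng(0)))  # about half the units zeroed, the rest scaled by 2
```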

Some applications


Computer vision
Convolutional Neural Networks (LeCun, 1989–)

State of the art in digit recognition.


Modern CNN

A. Krizhevsky, I. Sutskever, and G. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012.

- 7 hidden layers, 650,000 units, 60,000,000 parameters
- Dropout
- 10^6 images
- GPU implementation
- Activation function: f(x) = max(0, x) (ReLU; sketched below)
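The ReLU activation named above is just an elementwise max; here is a minimal numpy sketch pairing it with a toy convolution. The stride-1, no-padding, single-channel setup is an assumption chosen only to show where ReLU sits inside a CNN layer.

```python
import numpy as np

def relu(x):
    """ReLU activation: f(x) = max(0, x), applied elementwise."""
    return np.maximum(0, x)

def conv2d(image, kernel):
    """Valid 2-D convolution (stride 1, no padding, single channel), in the
    cross-correlation form conventionally used in CNNs."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

img = np.arange(16, dtype=float).reshape(4, 4)
k = np.array([[1.0, -1.0], [1.0, -1.0]])
print(relu(conv2d(img, k)))  # one CNN building block: convolution followed by ReLU
```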

Computer vision
Convolutional Neural Networks

Figure: from LeCun, Bengio, and Hinton, "Deep learning", Nature, 2015.

What is deep in deep learning!
Old vs. new ML paradigms

Old paradigm: raw data → hand-crafted features → classifier → decision.
New paradigm: raw data → learned features → classifier → decision, with the feature-learning stages and the classifier trained together (representation learning).

Software

- TensorFlow (Google): https://www.tensorflow.org/
- Theano (Python, CPU/GPU): mathematical and deep learning library, http://deeplearning.net/software/theano. Can do automatic, symbolic differentiation.
- Senna (NLP), by Collobert et al.: http://ronan.collobert.com/senna/
  - State-of-the-art performance on many tasks
  - 3500 lines of C, extremely fast and using very little memory
- Torch ML Library (C + Lua): http://www.torch.ch/
- Recurrent Neural Network Language Model: http://www.fit.vutbr.cz/~imikolov/rnnlm
- Recursive Neural Net and RAE models for paraphrase detection, sentiment analysis, relation classification: www.socher.org
