
Introduction to Neural Networks

E. Scornet

January 2020

E. Scornet Deep Learning January 2020 1 / 118


1 Introduction

2 History of Neural Networks: Perceptron; Multilayer Perceptron - Backpropagation algorithm

3 Hyperparameters: Activation functions; Output units; Loss functions; Weight initialization

4 Regularization: Penalization; Dropout; Batch normalization; Early stopping

5 All in all

E. Scornet Deep Learning January 2020 2 / 118


Outline

1 Introduction

2 History of Neural Networks: Perceptron; Multilayer Perceptron - Backpropagation algorithm

3 Hyperparameters: Activation functions; Output units; Loss functions; Weight initialization

4 Regularization: Penalization; Dropout; Batch normalization; Early stopping

5 All in all

E. Scornet Deep Learning January 2020 3 / 118


Supervised Learning

Supervised Learning Framework
Input measurement X ∈ X
Output measurement Y ∈ Y
(X, Y) ∼ P with P unknown.
Training data: Dn = {(X1, Y1), . . . , (Xn, Yn)} (i.i.d. ∼ P)

Often
- X ∈ Rd and Y ∈ {−1, 1} (classification),
- or X ∈ Rd and Y ∈ R (regression).

A predictor is a function in F = {f : X → Y measurable}

Goal
Construct a good predictor f from the training data.

Need to specify the meaning of "good". Classification and regression are almost the same problem!

E. Scornet Deep Learning January 2020 4 / 118


Loss and Probabilistic Framework

Loss function for a generic predictor
Loss function: ℓ(Y, f(X)) measures the goodness of the prediction of Y by f(X).
Examples:
- Prediction loss: ℓ(Y, f(X)) = 1_{Y ≠ f(X)}
- Quadratic loss: ℓ(Y, f(X)) = |Y − f(X)|²

Risk function
Risk measured as the average loss for a new couple:
R(f) = E_{(X,Y)∼P}[ℓ(Y, f(X))]
Examples:
- Prediction loss: E[ℓ(Y, f(X))] = P{Y ≠ f(X)}
- Quadratic loss: E[ℓ(Y, f(X))] = E[|Y − f(X)|²]

Beware: As f depends on Dn, R(f) is a random variable!

E. Scornet Deep Learning January 2020 5 / 118


Supervised Learning

Experience, Task and Performance measure
Training data: D = {(X1, Y1), . . . , (Xn, Yn)} (i.i.d. ∼ P)
Predictor: f : X → Y measurable
Cost/Loss function: ℓ(Y, f(X)) measures how well f(X) "predicts" Y
Risk:
R(f) = E[ℓ(Y, f(X))] = E_X[E_{Y|X}[ℓ(Y, f(X))]]
Often ℓ(Y, f(X)) = |f(X) − Y|² or ℓ(Y, f(X)) = 1_{Y ≠ f(X)}

Goal
Learn a rule to construct a predictor f ∈ F from the training data Dn such that the risk R(f) is small on average or with high probability with respect to Dn.

E. Scornet Deep Learning January 2020 6 / 118


Best Solution

The best solution f ∗ (which is independent of Dn) is

f∗ = argmin_{f∈F} R(f) = argmin_{f∈F} E[ℓ(Y, f(X))]

Bayes Predictor (explicit solution)
- In binary classification with the 0−1 loss:
f∗(X) = +1 if P{Y = +1|X} ≥ P{Y = −1|X}, i.e., if P{Y = +1|X} ≥ 1/2, and −1 otherwise.
- In regression with the quadratic loss: f∗(X) = E[Y|X].

Issue: this solution requires knowing E[Y|X] for all values of X!

E. Scornet Deep Learning January 2020 7 / 118


Examples

Spam detection (Text classification)

Data: email collection
Input: email
Output: Spam or No Spam

E. Scornet Deep Learning January 2020 8 / 118


Examples

Face Detection

Data: annotated database of images
Input: sub-window in the image
Output: presence or absence of a face

E. Scornet Deep Learning January 2020 9 / 118


Examples

Number Recognition

Data: annotated database of images (each image is represented by a vector of 28 × 28 = 784 pixel intensities)
Input: image
Output: corresponding number

E. Scornet Deep Learning January 2020 10 / 118


Machine Learning

A definition by Tom Mitchell (http://www.cs.cmu.edu/~tom/)
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

E. Scornet Deep Learning January 2020 11 / 118


Unsupervised Learning

Experience, Task and Performance measure
Training data: D = {X1, . . . , Xn} (i.i.d. ∼ P)
Task: ???
Performance measure: ???

No obvious task definition!

Tasks for this lecture
Clustering (or unsupervised classification): construct a grouping of the data in homogeneous classes.
Dimension reduction: construct a map of the data in a low-dimensional space without distorting it too much.

E. Scornet Deep Learning January 2020 12 / 118


Marketing

Data: Base of customer data containing their properties and past buying records

Goal: Use the customers similarities to find groups.

Two directions:
- Clustering: propose an explicit grouping of the customers.
- Visualization: propose a representation of the customers so that the groups are visible.

E. Scornet Deep Learning January 2020 13 / 118


Machine Learning

E. Scornet Deep Learning January 2020 14 / 118


Outline

1 Introduction

2 History of Neural Networks: Perceptron; Multilayer Perceptron - Backpropagation algorithm

3 Hyperparameters: Activation functions; Output units; Loss functions; Weight initialization

4 Regularization: Penalization; Dropout; Batch normalization; Early stopping

5 All in all

E. Scornet Deep Learning January 2020 15 / 118


Outline

1 Introduction

2 History of Neural Networks: Perceptron; Multilayer Perceptron - Backpropagation algorithm

3 Hyperparameters: Activation functions; Output units; Loss functions; Weight initialization

4 Regularization: Penalization; Dropout; Batch normalization; Early stopping

5 All in all

E. Scornet Deep Learning January 2020 16 / 118


Outline

1 Introduction

2 History of Neural Networks: Perceptron; Multilayer Perceptron - Backpropagation algorithm

3 Hyperparameters: Activation functions; Output units; Loss functions; Weight initialization

4 Regularization: Penalization; Dropout; Batch normalization; Early stopping

5 All in all

E. Scornet Deep Learning January 2020 17 / 118


What is a neuron?

This is a real neuron!

Figure: Real Neuron - diagram

Figure: Real Neuron

The idea of neural networks began, unsurprisingly, as a model of how neurons in the brain function; this approach, termed "connectionism", used connected circuits to simulate intelligent behaviour.

E. Scornet Deep Learning January 2020 18 / 118



McCulloch and Pitts neuron - 1943

In 1943, the neuron was portrayed as a simple electrical circuit by neurophysiologist Warren McCulloch and mathematician Walter Pitts.

A McCulloch-Pitts neuron takes binary inputs, computes a weighted sum and returns 0 if the result is below a threshold and 1 otherwise.

Figure: [“A logical calculus of the ideas immanent in nervous activity”, McCulloch and Pitts 1943]
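Such a unit is easy to simulate; the following sketch (plain Python/NumPy, hypothetical names) computes the thresholded weighted sum just described, here wired up as an AND gate.

    import numpy as np

    def mcculloch_pitts(x, w, threshold):
        # Return 1 if the weighted sum of the binary inputs reaches the threshold, else 0.
        return int(np.dot(w, x) >= threshold)

    # AND gate: both inputs must be active for the weighted sum to reach the threshold.
    for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(x, mcculloch_pitts(np.array(x), w=np.array([1.0, 1.0]), threshold=2.0))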

Donald Hebb took the idea further by proposing that neural pathways strengthen over each successive use, especially between neurons that tend to fire at the same time.
[The organization of behavior: a neuropsychological theory, Hebb 1949]

E. Scornet Deep Learning January 2020 19 / 118


Perceptron - 1958

In the late 50s, Frank Rosenblatt, a psychologist at Cornell, worked on decision systems present in the eye of a fly, which determine its flee response.

In 1958, he proposed the idea of a Perceptron, calling it the Mark I Perceptron. It was a system with a simple input-output relationship, modelled on a McCulloch-Pitts neuron.

[“Perceptron simulation experiments”, Rosenblatt 1960]

E. Scornet Deep Learning January 2020 20 / 118


Perceptron diagram

Figure: True representation of perceptron

The connections between the input and the first hidden layer cannot be optimized!

E. Scornet Deep Learning January 2020 21 / 118


Perceptron Machine

First implementation: the Mark I Perceptron (1958).

The machine was connected to a camera (20×20 photocells, 400-pixel image).

Patchboard: allowed experimentation with different combinations of input features.

Potentiometers: implemented the adaptive weights.

E. Scornet Deep Learning January 2020 22 / 118


Parameters of the perceptron

Activation function: σ(z) = 1_{z>0}

Parameters:
- Weights w = (w1, . . . , wd) ∈ Rd
- Bias: b

The output is given by f(w,b)(x) = σ(⟨w, x⟩ + b) = 1_{⟨w,x⟩+b>0}.

How do we estimate (w, b)?

E. Scornet Deep Learning January 2020 23 / 118



The Perceptron Algorithm

To ease notations, we put w = (w1, . . . ,wd , b) and xi = (xi , 1)

Perceptron Algorithm - first (iterative) learning algorithm
Start with w = 0.
Repeat over all samples:
- if yi⟨w, xi⟩ < 0, modify w into w + yi xi,
- otherwise, do not modify w.
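A minimal sketch of this procedure in Python/NumPy (hypothetical names; labels are assumed to lie in {−1, +1}, and the test uses ≤ 0 so that the all-zero starting vector is also updated):

    import numpy as np

    def perceptron(X, y, n_epochs=100):
        Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # augmented inputs (x_i, 1)
        w = np.zeros(Xb.shape[1])                       # start with w = 0
        for _ in range(n_epochs):
            mistakes = 0
            for xi, yi in zip(Xb, y):
                if yi * np.dot(w, xi) <= 0:             # misclassified sample
                    w += yi * xi                        # w <- w + y_i x_i
                    mistakes += 1
            if mistakes == 0:                           # a full pass without errors: stop
                break
        return w[:-1], w[-1]                            # weights and bias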

Exercise
What is the rationale behind this procedure? Is this procedure related to a gradient descent method?

Gradient descent:
1. Start with w = w0.
2. Update w ← w − η∇L(w), where L is the loss to be minimized.
3. Stop when w does not vary too much.

E. Scornet Deep Learning January 2020 24 / 118


Solution

Perceptron algorithm can be seen as a stochastic gradient descent.

Perceptron Algorithm
Repeat over all samples:
- if yi⟨w, xi⟩ < 0, modify w into w + yi xi,
- otherwise, do not modify w.

A sample is misclassified if yi⟨w, xi⟩ < 0. Thus we want to minimize the loss
L(w) = − ∑_{i∈Mw} yi⟨w, xi⟩,
where Mw is the set of indices misclassified by the hyperplane w.

Stochastic Gradient Descent:
1. Select randomly i ∈ Mw.
2. Update w ← w − η∇Li(w) = w + η yi xi.

The perceptron algorithm corresponds to η = 1.

E. Scornet Deep Learning January 2020 25 / 118


Exercise

Figure: From http://image.diku.dk/kstensbo/notes/perceptron.pdf

Let R = max_i ‖xi‖. Let w⋆ be the optimal hyperplane of margin
γ = min_i yi⟨w⋆, xi⟩, with ‖w⋆‖ = 1.

Theorem (Block 1962; Novikoff 1963)
Assume that the training set Dn = {(x1, y1), . . . , (xn, yn)} is linearly separable (γ > 0). Start with w0 = 0. Then the number of updates k of the perceptron algorithm is bounded by
k ≤ (1 + R²)/γ².

Exercise: Prove it!

E. Scornet Deep Learning January 2020 26 / 118


Solution

According to the Cauchy-Schwarz inequality,
⟨w⋆, w_{k+1}⟩ ≤ ‖w_{k+1}‖₂,
since ‖w⋆‖₂ = 1, with equality if and only if w_{k+1} ∝ w⋆. By construction,
⟨w⋆, w_{k+1}⟩ = ⟨w⋆, w_k⟩ + yi⟨w⋆, xi⟩
≥ ⟨w⋆, w_k⟩ + γ
≥ . . . ≥ ⟨w⋆, w_0⟩ + kγ
≥ kγ,
since w0 = 0. We also have
‖w_{k+1}‖² = ‖w_k‖² + ‖xi‖² + 2⟨w_k, yi xi⟩
≤ ‖w_k‖² + R² + 1
≤ k(1 + R²).
Finally,
kγ ≤ √(k(1 + R²)) ⇔ k ≤ (1 + R²)/γ².

E. Scornet Deep Learning January 2020 27 / 118


Perceptron - Summary and drawbacks

Perceptron algorithm
- We have a data set Dn = {(xi, yi), i = 1, . . . , n}.
- We use the perceptron algorithm to learn the weight vector w and the bias b.
- We predict using f(w,b)(x) = 1_{⟨w,x⟩+b>0}.

Limitations

- The decision boundary is linear: too simple a model.
- The perceptron algorithm does not converge if the data are not linearly separable; in this case, the algorithm must not be used.
- In practice, we do not know whether the data are linearly separable... so the perceptron should never be used!

E. Scornet Deep Learning January 2020 28 / 118



Perceptron will make machine intelligent... or not!

The Perceptron project led by Rosenblatt was funded by the US Office of Naval Research.

The Navy revealed the embryo of an electronic computer today that it expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence. Later perceptrons will be able to recognize people and call out their names and instantly translate speech in one language to speech and writing in another language, it was predicted.

Press conference, 7 July 1958, New York Times.

For an extensive study of the perceptron, see [Principles of neurodynamics: perceptrons and the theory of brain mechanisms, Rosenblatt 1961].

E. Scornet Deep Learning January 2020 29 / 118


AI winter

In 1969, Minsky and Papert exhibited the fact that it was difficult for the perceptron to
- detect parity (number of activated pixels),
- detect connectedness (are the pixels connected?),
- represent simple nonlinear functions like XOR.

There is no reason to suppose that any of [the virtue of perceptrons] carry over to the many-layered version. Nevertheless, we consider it to be an important research problem to elucidate (or reject) our intuitive judgement that the extension is sterile. Perhaps some powerful convergence theorem will be discovered, or some profound reason for the failure to produce an interesting "learning theorem" for the multilayered machine will be found.

[“Perceptrons.”, Minsky and Papert 1969]

This book is the starting point of the period known as the "AI winter", a significant decline in funding of neural network research.

Controversy between Rosenblatt, Minsky, Papert:[“A sociological study of the official history of the perceptrons controversy”, Olazaran 1996]

E. Scornet Deep Learning January 2020 30 / 118


AI Winter: the XOR function

Exercise. Consider two binary variables x1 ∈ {0, 1} and x2 ∈ {0, 1}. The logical function AND applied to x1, x2 is defined as

AND: {0, 1}² → {0, 1}
(x1, x2) ↦ 1 if x1 = x2 = 1, 0 otherwise.

1) Find a perceptron (i.e., weights and bias) that implements the AND function.

The logical function XOR applied to x1, x2 is defined as

XOR: {0, 1}² → {0, 1}
(x1, x2) ↦ 0 if x1 = x2 = 0 or x1 = x2 = 1, 1 otherwise.

2) Prove that no perceptron can implement the XOR function.
3) Find a neural network with one hidden layer that implements the XOR function.

E. Scornet Deep Learning January 2020 31 / 118


Solution

1. The following perceptron, with w1 = w2 = 1 and b = −1.5, implements the AND function:

[Diagram: input layer (x1, x2, bias b) connected with weights w1, w2 to a single output neuron computing σ(w1 x1 + w2 x2 + b).]

2. The perceptron algorithm builds a hyperplane and predicts accordingly, depending on which side of the hyperplane the observation falls. Unfortunately, the XOR function cannot be represented with a hyperplane in the original space {0, 1}².

E. Scornet Deep Learning January 2020 32 / 118


Solution

3. The following network implements the XOR function with

w(1)_{1,1} = −1, w(1)_{1,2} = −1, b(1)_1 = 0.5,
w(1)_{2,1} = 1, w(1)_{2,2} = 1, b(1)_2 = −1.5,
w(2)_{1,1} = −1, w(2)_{1,2} = −1, b(2)_1 = 0.5.
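With the threshold activation σ(z) = 1_{z>0}, these weights can be checked directly; a small Python/NumPy sketch (hypothetical names):

    import numpy as np

    step = lambda z: (z > 0).astype(float)        # sigma(z) = 1_{z > 0}

    W1 = np.array([[-1.0, -1.0], [1.0, 1.0]]); b1 = np.array([0.5, -1.5])
    W2 = np.array([[-1.0, -1.0]]);             b2 = np.array([0.5])

    def xor_net(x):
        h = step(W1 @ x + b1)                     # hidden layer
        return step(W2 @ h + b2)[0]               # output layer

    for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(x, xor_net(np.array(x, float)))     # prints 0, 1, 1, 0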

E. Scornet Deep Learning January 2020 33 / 118


Solution

[Diagram: the XOR network, with input layer (x1, x2), a hidden layer of two neurons labelled "0" and "2" (biases b(1)_1, b(1)_2), and one output neuron (bias b(2)_1), connected by the weights w(1) and w(2) given above.]

E. Scornet Deep Learning January 2020 34 / 118


Moving forward - ADALINE, MADALINE

In 1959 at Stanford, Bernard Widrow and Marcian Hoff developed AdaLinE (ADAptive LINear Elements) and MAdaLinE (Multiple AdaLinE), the latter being the first network successfully applied to a real-world problem.
[Adaptive switching circuits, Bernard Widrow and Hoff 1960]

Loss: squared difference between a weighted sum of the inputs and the output.
Optimization procedure: trivial gradient descent.

E. Scornet Deep Learning January 2020 35 / 118


MADALINE

Many Adalines: network with one hidden layer composed of many Adaline units.[“Madaline Rule II: a training algorithm for neural networks”, Winter and Widrow 1988]

Applications:
- Speech and pattern recognition ["Real-Time Adaptive Speech-Recognition System", Talbert et al. 1963]
- Weather forecasting ["Application of the adaline system to weather forecasting", Hu 1964]
- Adaptive filtering and adaptive signal processing ["Adaptive signal processing", Bernard and Samuel 1985]

E. Scornet Deep Learning January 2020 36 / 118


Outline

1 Introduction

2 History of Neural Networks: Perceptron; Multilayer Perceptron - Backpropagation algorithm

3 Hyperparameters: Activation functions; Output units; Loss functions; Weight initialization

4 Regularization: Penalization; Dropout; Batch normalization; Early stopping

5 All in all

E. Scornet Deep Learning January 2020 37 / 118


Neural network with one hidden layer

Generic notations:
- W(ℓ)_{i,j}: weight between neuron j of layer ℓ−1 and neuron i of layer ℓ.
- b(ℓ)_j: bias of neuron j of layer ℓ.
- a(ℓ)_j: output of neuron j of layer ℓ.
- z(ℓ)_j: input of neuron j of layer ℓ, such that a(ℓ)_j = σ(z(ℓ)_j).

E. Scornet Deep Learning January 2020 38 / 118


How to find weights and bias?

Perceptron algorithm does not work anymore!

E. Scornet Deep Learning January 2020 39 / 118


Gradient Descent Algorithm

The prediction of the network is given by fθ(x).

Empirical risk minimization:

argmin_θ (1/n) ∑_{i=1}^n ℓ(Yi, fθ(Xi)) ≡ argmin_θ (1/n) ∑_{i=1}^n ℓi(θ)

(Stochastic) gradient descent rule:
While |θt − θt−1| ≥ ε do
- sample It ⊂ {1, . . . , n},
- update θt+1 = θt − η (1/|It|) ∑_{i∈It} ∇θ ℓi(θt).

How to compute ∇θ ℓi efficiently?
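A generic mini-batch SGD loop might look like the sketch below (Python/NumPy, hypothetical names; grad_loss(theta, Xb, yb) is assumed to return the averaged gradient over the batch):

    import numpy as np

    def sgd(theta0, X, y, grad_loss, lr=0.1, batch_size=32, eps=1e-6, max_epochs=1000):
        theta, n = theta0.copy(), X.shape[0]
        for _ in range(max_epochs):
            theta_old = theta.copy()
            idx = np.random.permutation(n)
            for start in range(0, n, batch_size):
                batch = idx[start:start + batch_size]        # sample I_t
                g = grad_loss(theta, X[batch], y[batch])     # (1/|I_t|) averaged gradient
                theta = theta - lr * g                       # theta_{t+1} = theta_t - eta g
            if np.linalg.norm(theta - theta_old) < eps:      # |theta_t - theta_{t-1}| < eps
                break
        return theta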

E. Scornet Deep Learning January 2020 40 / 118



Backprop Algorithm

A Clever Gradient Descent Implementation
- Popularized by Rumelhart, McClelland and Hinton in 1986.
- Can be traced back to Werbos in 1974.
- Nothing but the use of the chain rule with a touch of dynamic programming.

Key ingredient to make neural networks work! Still at the core of deep learning algorithms.

E. Scornet Deep Learning January 2020 41 / 118


Backpropagation equations

Consider a neural network with L layers, a vector output, and the quadratic cost
C = (1/2)‖y − a(L)‖².
By definition,
δ(ℓ)_j = ∂C/∂z(ℓ)_j.

The four fundamental equations of backpropagation are given by
δ(L) = ∇_a C ⊙ σ′(z(L)),
δ(ℓ) = ((w(ℓ+1))ᵀ δ(ℓ+1)) ⊙ σ′(z(ℓ)),
∂C/∂b(ℓ)_j = δ(ℓ)_j,
∂C/∂w(ℓ)_{j,k} = a(ℓ−1)_k δ(ℓ)_j.
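These four equations translate almost line for line into code. Below is a compact NumPy sketch for a fully connected network with sigmoid activations and the quadratic cost (hypothetical names; weights[l] has shape (n_{l+1}, n_l)):

    import numpy as np

    def sigma(z):  return 1.0 / (1.0 + np.exp(-z))
    def dsigma(z): return sigma(z) * (1.0 - sigma(z))

    def backprop(x, y, weights, biases):
        # Feedforward: store every z^(l) and activation a^(l).
        a, zs, acts = x, [], [x]
        for W, b in zip(weights, biases):
            z = W @ a + b
            zs.append(z)
            a = sigma(z)
            acts.append(a)
        # delta^(L) = grad_a C (*) sigma'(z^(L)), with C = 0.5 * ||y - a^(L)||^2
        delta = (a - y) * dsigma(zs[-1])
        grads_W = [np.outer(delta, acts[-2])]     # dC/dw^(L)_{jk} = a^(L-1)_k delta^(L)_j
        grads_b = [delta]                         # dC/db^(L)_j   = delta^(L)_j
        # delta^(l) = (w^(l+1)^T delta^(l+1)) (*) sigma'(z^(l)), going backwards
        for l in range(2, len(weights) + 1):
            delta = (weights[-l + 1].T @ delta) * dsigma(zs[-l])
            grads_W.insert(0, np.outer(delta, acts[-l - 1]))
            grads_b.insert(0, delta)
        return grads_W, grads_b                   # gradients for a (stochastic) gradient step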

E. Scornet Deep Learning January 2020 42 / 118


Proof

We start with the first equality,
δ(L) = ∇_a C ⊙ σ′(z(L)).
Applying the chain rule gives
δ(L)_j = ∑_k (∂C/∂a(L)_k)(∂a(L)_k/∂z(L)_j),
where z(L)_j is the input of neuron j of layer L. Since the activation a(L)_k depends only on the input z(L)_k, only the term k = j remains, and we have
δ(L)_j = (∂C/∂a(L)_j)(∂a(L)_j/∂z(L)_j).
Besides, since a(L)_j = σ(z(L)_j), we have
δ(L)_j = (∂C/∂a(L)_j) σ′(z(L)_j),
which is the component-wise version of the first equality.

E. Scornet Deep Learning January 2020 43 / 118


Proof

Now, we want to prove
δ(ℓ) = ((w(ℓ+1))ᵀ δ(ℓ+1)) ⊙ σ′(z(ℓ)).
Again, using the chain rule,
δ(ℓ)_j = ∂C/∂z(ℓ)_j = ∑_k (∂C/∂z(ℓ+1)_k)(∂z(ℓ+1)_k/∂z(ℓ)_j) = ∑_k δ(ℓ+1)_k (∂z(ℓ+1)_k/∂z(ℓ)_j).
Recalling that z(ℓ+1)_k = ∑_j w(ℓ+1)_{kj} σ(z(ℓ)_j) + b(ℓ+1)_k, we get
δ(ℓ)_j = ∑_k δ(ℓ+1)_k w(ℓ+1)_{kj} σ′(z(ℓ)_j),
which concludes the proof.

E. Scornet Deep Learning January 2020 44 / 118


Proof

Now, we want to prove
∂C/∂b(ℓ)_j = δ(ℓ)_j.
Using the chain rule,
∂C/∂b(ℓ)_j = ∑_k (∂C/∂z(ℓ)_k)(∂z(ℓ)_k/∂b(ℓ)_j).
However, only z(ℓ)_j depends on b(ℓ)_j, and z(ℓ)_j = ∑_k w(ℓ)_{jk} σ(z(ℓ−1)_k) + b(ℓ)_j. Therefore,
∂C/∂b(ℓ)_j = δ(ℓ)_j.

E. Scornet Deep Learning January 2020 45 / 118


Proof

Finally, we want to prove
∂C/∂w(ℓ)_{j,k} = a(ℓ−1)_k δ(ℓ)_j.
By the chain rule,
∂C/∂w(ℓ)_{j,k} = ∑_m (∂C/∂z(ℓ)_m)(∂z(ℓ)_m/∂w(ℓ)_{j,k}).
Since z(ℓ)_m = ∑_k w(ℓ)_{mk} σ(z(ℓ−1)_k) + b(ℓ)_m, we have
∂z(ℓ)_m/∂w(ℓ)_{j,k} = σ(z(ℓ−1)_k) 1_{m=j}.
Consequently,
∂C/∂w(ℓ)_{j,k} = δ(ℓ)_j σ(z(ℓ−1)_k) = a(ℓ−1)_k δ(ℓ)_j.

E. Scornet Deep Learning January 2020 46 / 118


Backpropagation Algorithm

Let
δ(ℓ)_j = ∂C/∂z(ℓ)_j,
where z(ℓ)_j is the input of neuron j of layer ℓ.

Backpropagation Algorithm
Initialize randomly the weights and biases of the network.

For each training sample xi:
1. Feedforward: let xi go through the network and store the value of the activation function and of its derivative for each neuron.
2. Output error: compute the neural network error for xi.
3. Backpropagation: compute recursively the vectors δ(ℓ), starting from ℓ = L down to ℓ = 1. Update the weights and biases using the backpropagation equations.

E. Scornet Deep Learning January 2020 47 / 118


Neural Network terminology

Epoch: one forward pass and one backward pass of all training examples.

(Mini-)Batch size: number of training examples in one forward/backward pass. The higher the batch size, the more memory you will need.

Number of iterations: number of passes, each pass using [batch size] examples. To be clear, one pass = one forward pass + one backward pass (we do not count the forward pass and backward pass as two different passes).

For example, with 1000 training examples and a batch size of 500, it takes 2 iterations to complete 1 epoch.
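The bookkeeping is simple integer arithmetic; a tiny sketch (hypothetical names):

    import math

    n_samples, batch_size, n_epochs = 1000, 500, 10
    iterations_per_epoch = math.ceil(n_samples / batch_size)   # 2 iterations per epoch
    total_iterations = n_epochs * iterations_per_epoch          # 20 forward/backward passes
    print(iterations_per_epoch, total_iterations)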

E. Scornet Deep Learning January 2020 48 / 118


Outline

1 Introduction

2 History of Neural Networks: Perceptron; Multilayer Perceptron - Backpropagation algorithm

3 Hyperparameters: Activation functions; Output units; Loss functions; Weight initialization

4 Regularization: Penalization; Dropout; Batch normalization; Early stopping

5 All in all

E. Scornet Deep Learning January 2020 49 / 118


Outline

1 Introduction

2 History of Neural Networks: Perceptron; Multilayer Perceptron - Backpropagation algorithm

3 Hyperparameters: Activation functions; Output units; Loss functions; Weight initialization

4 Regularization: Penalization; Dropout; Batch normalization; Early stopping

5 All in all

E. Scornet Deep Learning January 2020 50 / 118


Sigmoid

Figure: plot of the sigmoid function.

Sigmoid function:
x ↦ exp(x) / (1 + exp(x))

Problems:
1. The sigmoid is not a zero-centered function -> need for rescaling the data.
2. Saturated function: gradient killer.
3. Moreover, exp is a bit computationally expensive.

E. Scornet Deep Learning January 2020 51 / 118


Tanh

Figure: plot of the hyperbolic tangent function.

Hyperbolic tangent function:
x ↦ (exp(x) − exp(−x)) / (exp(x) + exp(−x))

Remarks:
1. Tanh is a zero-centered function -> no need for rescaling the data.
2. Saturated function: gradient killer.
3. Moreover, exp is a bit computationally expensive.

Note: tanh(x) = 2σ(2x)− 1.

E. Scornet Deep Learning January 2020 52 / 118


Rectified Linear Unit

Figure: plot of the ReLU function.

ReLU / positive part:
x ↦ max(0, x)

Advantages:
1. Not a saturated function.
2. Computationally efficient.
3. Empirically, convergence is faster than with sigmoid/tanh.
4. Plus: biologically plausible.

E. Scornet Deep Learning January 2020 53 / 118


A little bit more on ReLU

Popularized by ["Imagenet classification with deep convolutional neural networks", Krizhevsky et al. 2012] in AlexNet.

Problems:
- Not zero centered.
- The gradient is null if x < 0.
- If the weights are not properly initialized, the ReLU output can be zero.

The bias of ReLU units is often initialized so that they fire up at the start: useful or not? Mystery...

Related to biology ["Deep sparse rectifier neural networks", Glorot, Bordes, et al. 2011]:
- Most of the time, neurons are inactive.
- When they activate, their activation is proportional to their input.

E. Scornet Deep Learning January 2020 54 / 118


Leaky ReLU/ Parametric ReLU / Absolute value rectification

Figure: plot of the leaky ReLU function.

x ↦ max(αx, x)

Leaky ReLU: α = 0.1
["Rectifier nonlinearities improve neural network acoustic models", Maas et al. 2013]

Absolute Value Rectification: α = −1
["What is the best multi-stage architecture for object recognition?", Jarrett et al. 2009]

Parametric ReLU: α optimized during backpropagation; the activation function is learned.
["Empirical evaluation of rectified activations in convolutional network", Xu et al. 2015]

E. Scornet Deep Learning January 2020 55 / 118


ELU

Figure: plot of the ELU function.

Exponential Linear Unit:
x ↦ x if x ≥ 0, α(exp(x) − 1) otherwise

- Negative saturation regime, closer to zero-mean output.
- α is set to 1.0.
- Robustness to noise.
["Fast and accurate deep network learning by exponential linear units (elus)", Clevert et al. 2015]

E. Scornet Deep Learning January 2020 56 / 118


Maxout

Figure: plot of a maxout unit.

Maxout:
x ↦ max(w1 x + b1, w2 x + b2)

- Linear regime, not saturating, not dying.
- Number of parameters multiplied by 2 (resp. by k for k pieces).
- Learns piecewise linear functions (up to k pieces).
["Maxout networks", Goodfellow, Warde-Farley, et al. 2013]; ["Deep maxout neural networks for speech recognition", Cai et al. 2013]
- Resists catastrophic forgetting.
["An empirical investigation of catastrophic forgetting in gradient-based neural networks", Goodfellow, Mirza, et al. 2013]

E. Scornet Deep Learning January 2020 57 / 118


Conclusion on activation functions

- Use ReLU.
- Test Leaky ReLU, maxout, ELU.
- Try out tanh, but do not expect too much.
- Do not use sigmoid.
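For reference, a NumPy sketch of the activations discussed above (hypothetical helper names; the maxout unit takes its own learned parameters):

    import numpy as np

    def sigmoid(x):            return 1.0 / (1.0 + np.exp(-x))
    def tanh(x):               return np.tanh(x)
    def relu(x):               return np.maximum(0.0, x)
    def leaky_relu(x, a=0.1):  return np.maximum(a * x, x)
    def elu(x, a=1.0):         return np.where(x >= 0, x, a * (np.exp(x) - 1.0))
    def maxout(x, w1, b1, w2, b2):
        return np.maximum(w1 * x + b1, w2 * x + b2)   # max of two learned affine maps

    x = np.linspace(-5, 5, 11)
    print(relu(x), leaky_relu(x), elu(x))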

E. Scornet Deep Learning January 2020 58 / 118


Outline

1 Introduction

2 History of Neural Networks: Perceptron; Multilayer Perceptron - Backpropagation algorithm

3 Hyperparameters: Activation functions; Output units; Loss functions; Weight initialization

4 Regularization: Penalization; Dropout; Batch normalization; Early stopping

5 All in all

E. Scornet Deep Learning January 2020 59 / 118


Output units

Linear output unit:
y = Wᵀh + b
→ Linear regression based on the new variables h.

Sigmoid output unit, used to predict {0, 1} outputs:
P(Y = 1|h) = σ(Wᵀh + b),
where σ(t) = eᵗ/(1 + eᵗ).
→ Logistic regression based on the new variables h.

Softmax output unit, used to predict {1, . . . , K}:
softmax(z)_i = e^{z_i} / ∑_{k=1}^K e^{z_k},
where each z_i is the activation of one neuron of the previous layer, given by z_i = W_iᵀ h + b_i.
→ Multinomial logistic regression based on the new variables h.
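In practice the softmax is computed after subtracting max_k z_k, which leaves the result unchanged but avoids overflow in the exponentials; a small sketch (hypothetical names):

    import numpy as np

    def softmax(z):
        # softmax(z)_i = exp(z_i) / sum_k exp(z_k), computed in a numerically stable way
        z = z - np.max(z)          # invariant shift
        e = np.exp(z)
        return e / e.sum()

    print(softmax(np.array([2.0, 1.0, 0.1])))   # probabilities summing to 1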

E. Scornet Deep Learning January 2020 60 / 118


Multinomial logistic regression

Generalization of logistic regression to multiclass outputs: for all 1 ≤ k ≤ K,
log(P[Yi = k] / Z) = βk Xi.
Hence, for all 1 ≤ k ≤ K,
P[Yi = k] = Z e^{βk Xi},
where
Z = 1 / ∑_{k=1}^K e^{βk Xi}.
Thus,
P[Yi = k] = e^{βk Xi} / ∑_{ℓ=1}^K e^{βℓ Xi}.

E. Scornet Deep Learning January 2020 61 / 118


Biology bonus

Softmax, used with cross-entropy:
−log(P(Y = k|z)) = −log softmax(z)_k = log(∑_j exp(z_j)) − z_k ≈ max_j z_j − z_k ≈ 0,
when z_k is the largest activation: there is almost no contribution to the cost when softmax(z)_y is maximal.

Lateral inhibition: believed to exist between nearby neurons in the cortex. When the difference between the max and the others is large, the winner takes all: one neuron is set to 1 and the others go to zero.

More complex models: conditional Gaussian mixtures, when y is multimodal.
["On supervised learning from sequential data with applications for speech recognition"; "Generating sequences with recurrent neural networks", Schuster 1999; Graves 2013]

E. Scornet Deep Learning January 2020 62 / 118


Outline

1 Introduction

2 History of Neural Networks: Perceptron; Multilayer Perceptron - Backpropagation algorithm

3 Hyperparameters: Activation functions; Output units; Loss functions; Weight initialization

4 Regularization: Penalization; Dropout; Batch normalization; Early stopping

5 All in all

E. Scornet Deep Learning January 2020 63 / 118


Cost functions

Mean Square Error (MSE):
(1/n) ∑_{i=1}^n ℓ(Yi, fθ(Xi)) = (1/n) ∑_{i=1}^n (Yi − fθ(Xi))²

Mean Absolute Error:
(1/n) ∑_{i=1}^n ℓ(Yi, fθ(Xi)) = (1/n) ∑_{i=1}^n |Yi − fθ(Xi)|

0−1 Error:
(1/n) ∑_{i=1}^n ℓ(Yi, fθ(Xi)) = (1/n) ∑_{i=1}^n 1_{Yi ≠ fθ(Xi)}

E. Scornet Deep Learning January 2020 64 / 118


Cost functions

Cross entropy (or negative log-likelihood):
(1/n) ∑_{i=1}^n ℓ(Yi, fθ(Xi)) = −(1/n) ∑_{i=1}^n ∑_{k=1}^K 1_{Yi = k} log([fθ(Xi)]_k)

Cross-entropy:
- Very popular!
- Should help to prevent the saturation phenomenon, compared to MSE. With a sigmoid output,
−log(P(Y = yi|X)) = −log(σ((2yi − 1)(Wᵀh + b))), with σ(t) = eᵗ/(1 + eᵗ).
The sigmoid saturates when |(2yi − 1)(Wᵀh + b)| ≫ 1; in the harmful case (2yi − 1)(Wᵀh + b) ≪ −1, the loss −log(P(Y = yi|X)) is approximately linear in W and b, which keeps the gradient informative and makes gradient descent easy to implement.

Mean Square Error should not be used with softmax output units[“Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition”,

Bridle 1990]
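A sketch of the cross-entropy computed directly from the logits with the log-sum-exp trick (hypothetical names), which is the numerically stable way to pair a softmax output with this loss:

    import numpy as np

    def cross_entropy(logits, labels):
        # Mean of -log softmax(z_i)_{y_i} over the mini-batch; logits: (n, K), labels: (n,)
        z = logits - logits.max(axis=1, keepdims=True)                  # stabilization shift
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))    # log softmax
        return -log_probs[np.arange(len(labels)), labels].mean()

    logits = np.array([[2.0, 0.5, -1.0], [0.1, 3.0, 0.2]])
    print(cross_entropy(logits, np.array([0, 1])))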

E. Scornet Deep Learning January 2020 65 / 118


Outline

1 Introduction

2 History of Neural Networks: Perceptron; Multilayer Perceptron - Backpropagation algorithm

3 Hyperparameters: Activation functions; Output units; Loss functions; Weight initialization

4 Regularization: Penalization; Dropout; Batch normalization; Early stopping

5 All in all

E. Scornet Deep Learning January 2020 66 / 118


Weight initialization

First idea: Set all weights and bias to the same value.

E. Scornet Deep Learning January 2020 67 / 118


Small or big weights?

1. First idea: small random numbers, typically 0.01 × N(0, 1), i.e., N(0, 10⁻⁴).
- works for small networks,
- for big networks (∼ 10 layers), the outputs become a Dirac mass at 0: there is no activation at all.

2. Second idea: "big random numbers", e.g., N(0, 1).
→ Saturating phenomenon.

In any case, there is no need to tune the biases: they can be initially set to zero.

E. Scornet Deep Learning January 2020 68 / 118


Other initialization

Idea: the variance of the input should be the same as the variance of the output.

1. Xavier initialization
Initialize the biases to zero and the weights randomly using
U[−√6/√(nj + nj+1), √6/√(nj + nj+1)],
where nj is the size of layer j
["Understanding the difficulty of training deep feedforward neural networks", Glorot and Bengio 2010].
→ Sadly, it does not work well for ReLU (non-activated neurons).

2. He et al. initialization
Initialize the biases to zero and the weights randomly using
N(0, √(2/nj)),
where nj is the size of layer j.
["Delving deep into rectifiers: Surpassing human-level performance on imagenet classification", He et al. 2015]

Bonus: [“All you need is a good init”, Mishkin and Matas 2015]
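Both schemes are one-liners in NumPy; a sketch (hypothetical names, with n_in = nj, n_out = nj+1, and √(2/nj) read as the standard deviation of the He scheme):

    import numpy as np

    def xavier_init(n_in, n_out):
        # U[-sqrt(6)/sqrt(n_in + n_out), +sqrt(6)/sqrt(n_in + n_out)], biases at zero
        limit = np.sqrt(6.0) / np.sqrt(n_in + n_out)
        return np.random.uniform(-limit, limit, size=(n_out, n_in)), np.zeros(n_out)

    def he_init(n_in, n_out):
        # Gaussian weights with standard deviation sqrt(2 / n_in), biases at zero
        return np.random.randn(n_out, n_in) * np.sqrt(2.0 / n_in), np.zeros(n_out)

    W, b = he_init(784, 128)
    print(W.std())   # close to sqrt(2/784)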

E. Scornet Deep Learning January 2020 69 / 118


Exercise

1. Consider a neural network with two hidden layers (containing n1 and n2 neurons respectively) and equipped with linear hidden units. Find a simple sufficient condition on the weights so that the variance of the hidden units stays constant across layers.

2. Considering the same network, find a simple sufficient condition on the weights so that the gradient stays constant across layers when applying the backpropagation procedure.

3 Based on previous questions, propose a simple way to initialize weights.

E. Scornet Deep Learning January 2020 70 / 118


Solution

1. Consider neuron i of the second hidden layer. With linear units, the output of this neuron is

a(2)_i = ∑_{j=1}^{n1} W(2)_{ij} z(1)_j + b(2)_i
       = ∑_{j=1}^{n1} W(2)_{ij} ( ∑_{k=1}^{n0} W(1)_{jk} xk + b(1)_j ) + b(2)_i
       = ∑_{j=1}^{n1} W(2)_{ij} ( ∑_{k=1}^{n0+1} W(1)_{jk} xk ) + b(2)_i,

where we set x_{n0+1} = 1 and W(1)_{j,n0+1} = b(1)_j to ease notations. At fixed x, since the weights and biases are centered and independent, we have

V[a(2)_i] = V[ ∑_{j=1}^{n1} ∑_{k=1}^{n0+1} W(2)_{ij} W(1)_{jk} xk + b(2)_i ]
          = ∑_{j=1}^{n1} ∑_{k=1}^{n0+1} xk² V[W(2)_{ij} W(1)_{jk}]

E. Scornet Deep Learning January 2020 71 / 118


Solution

          = ∑_{j=1}^{n1} ∑_{k=1}^{n0+1} xk² σ2² σ1²,

where σ1² (resp. σ2²) is the shared variance of all weights between layers 0 and 1 (resp. layers 1 and 2). Since we want V[a(2)_i] = V[a(1)_i], it is sufficient that

n1 σ2² = 1.

E. Scornet Deep Learning January 2020 72 / 118


Solution

2. Consider a neural network where the cost function is given, for one training example, by

C = ( y − (1/n_{d+1}) ∑_{j=1}^{n_{d+1}} Wj z(d)_j )²,

which implies

∂C/∂z(d)_j = −(2 Wj / n_{d+1}) ( y − (1/n_{d+1}) ∑_i Wi z(d)_i ).

Hence,

E[∂C/∂z(d)_j] = E[ (2 Wj² / n_{d+1}²) z(d)_j ] = (2 σ²_{d+1} / n_{d+1}²) E[z(d)_j].

Besides, according to the chain rule, we have

∂C/∂z(d−1)_k = ∑_{j=1}^{n_{d+1}} (∂C/∂z(d)_j)(∂z(d)_j/∂z(d−1)_k) = ∑_{j=1}^{n_{d+1}} (∂C/∂z(d)_j) W(d)_{jk}.

E. Scornet Deep Learning January 2020 73 / 118


Solution

More precisely,

E[ (∂C/∂z(d)_j) W(d)_{jk} ]
= E[ −(2 Wj / n_{d+1}) y W(d)_{jk} + (2 Wj / n_{d+1}²) W(d)_{jk} ∑_i Wi ( ∑_ℓ W(d)_{iℓ} z(d−1)_ℓ + b(d)_i ) ]
= E[ (2 Wj² / n_{d+1}²) W(d)_{jk} ( ∑_ℓ W(d)_{jℓ} z(d−1)_ℓ + b(d)_j ) ]
= E[ (2 Wj² / n_{d+1}²) (W(d)_{jk})² z(d−1)_k ]
= (2 σ²_{d+1} σ²_d / n_{d+1}²) E[z(d−1)_k].

Using the chain rule, we get

E[∂C/∂z(d−1)_k] = E[∂C/∂z(d)_j]
⇔ n_{d+1} (2 σ²_{d+1} σ²_d / n_{d+1}²) E[z(d−1)_k] = (2 σ²_{d+1} / n_{d+1}²) E[z(d)_j]

E. Scornet Deep Learning January 2020 74 / 118


Solution

⇔ n_{d+1} σ²_d = E[z(d)_j] / E[z(d−1)_k].

This gives the sufficient condition n_{d+1} σ²_d ∼ 1, assuming that the activations of two consecutive hidden layers are similar.

E. Scornet Deep Learning January 2020 75 / 118


Outline

1 Introduction

2 History of Neural Networks: Perceptron; Multilayer Perceptron - Backpropagation algorithm

3 Hyperparameters: Activation functions; Output units; Loss functions; Weight initialization

4 Regularization: Penalization; Dropout; Batch normalization; Early stopping

5 All in all

E. Scornet Deep Learning January 2020 76 / 118


Regularizing to avoid overfitting

Avoid overfitting by imposing some constraints over the parameter space.

Reducing variance and increasing bias.

E. Scornet Deep Learning January 2020 77 / 118


Overfitting

There are many different ways to avoid overfitting:

Penalization (L1 or L2)
Replace the cost function L by L̃(θ, X, y) = L(θ, X, y) + pen(θ).

Soft weight sharing (cf. the CNN lecture)
Reduce the parameter space artificially by imposing explicit constraints.

Dropout
Randomly kill some neurons during optimization and predict with the full network.

Batch normalization
Renormalize a layer inside a batch, so that the network does not overfit on this particular batch.

Early stopping
Stop the gradient descent procedure when the error on the validation set increases.

E. Scornet Deep Learning January 2020 78 / 118


Outline

1 Introduction

2 History of Neural Networks: Perceptron; Multilayer Perceptron - Backpropagation algorithm

3 Hyperparameters: Activation functions; Output units; Loss functions; Weight initialization

4 Regularization: Penalization; Dropout; Batch normalization; Early stopping

5 All in all

E. Scornet Deep Learning January 2020 79 / 118


Constrained optimization problem:
min_θ L(θ, X, y) s.t. pen(θ) ≤ c.

Using a Lagrangian formulation, this problem is equivalent to
min_θ L(θ, X, y) + λ pen(θ),
where
- L(θ, X, y) is the loss function (data-driven term),
- pen is a function that increases when θ becomes more complex (penalty term),
- λ ≥ 0 is a constant standing for the strength of the penalty term.

For neural networks, pen only penalizes the weights and not the biases, the latter being easier to estimate than the weights.

E. Scornet Deep Learning January 2020 80 / 118


Example of penalization

min_θ L(θ, X, y) + pen(θ)

Ridge: pen(θ) = λ‖θ‖₂²
["Ridge regression: Biased estimation for nonorthogonal problems", Hoerl and Kennard 1970]. See also ["Lecture notes on ridge regression", Wieringen 2015].

Lasso: pen(θ) = λ‖θ‖₁
["Regression shrinkage and selection via the lasso", Tibshirani 1996]

Elastic Net: pen(θ) = λ‖θ‖₂² + µ‖θ‖₁
["Regularization and variable selection via the elastic net", Zou and Hastie 2005]

E. Scornet Deep Learning January 2020 81 / 118


Simple case: linear regression

Linear regression
The linear regression estimate β is given by

β ∈ argmin_{β∈Rd} ∑_{i=1}^n (Yi − ∑_{j=1}^d βj x_i^(j))²,

which can be written as

β ∈ argmin_{β∈Rd} ‖Y − Xβ‖₂²,

where X ∈ M_{n,d}(R).

Solution:
β = (X′X)⁻¹X′Y.

E. Scornet Deep Learning January 2020 82 / 118


Penalized regression

Penalized linear regression
The penalized estimate βλ,q is given by

βλ,q ∈ argmin_{β∈Rd} ‖Y − Xβ‖₂² + λ‖β‖_q^q.

q = 2: Ridge linear regression
q = 1: LASSO

E. Scornet Deep Learning January 2020 83 / 118


Ridge regression, q = 2

Ridge linear regression
The ridge estimate βλ,2 is given by

βλ,2 ∈ argmin_{β∈Rd} ‖Y − Xβ‖₂² + λ‖β‖₂².

Solution:
βλ,2 = (X′X + λI)⁻¹X′Y.

This estimate has a bias equal to −λ(X′X + λI)⁻¹β and a variance σ²(X′X + λI)⁻¹X′X(X′X + λI)⁻¹. Note that
V[βλ,2] ≤ V[β].

In the case of an orthonormal design (X′X = I), we have
βλ,2 = β/(1 + λ), i.e., coordinate-wise (βλ,2)_j = X′_j Y/(1 + λ).
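The closed form is a one-line linear solve; a NumPy sketch (hypothetical names):

    import numpy as np

    def ridge(X, Y, lam):
        # beta_ridge = (X'X + lambda I)^{-1} X'Y
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

    X, Y = np.random.randn(100, 5), np.random.randn(100)
    print(ridge(X, Y, lam=1.0))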

E. Scornet Deep Learning January 2020 84 / 118


Ridge regression q = 2

In general, the ridge penalization considers

pen(β) = (1/2)‖β‖₂² = (1/2) ∑_{j=1}^d βj²,

in the optimization problem
min_β L(β, X, y) + pen(β).

It penalizes the "size" of β.

This is the most widely used penalization:
- it is nice and easy,
- it helps to "deal" with correlated features,
- it actually helps training! With a ridge penalization, the optimization problem is easier.

E. Scornet Deep Learning January 2020 85 / 118


Sparsity

There is another desirable property of β.

If βj = 0, then feature j has no impact on the prediction:
y = sign(xᵀβ + b)

If we have many features (d is large), it would be nice if β contained zeros, and many of them:
- it means that only a few features are statistically relevant,
- it means that only a few features are useful to predict the label,
- it leads to a simpler model, with a "reduced" dimension.

How do we enforce sparsity in β?

E. Scornet Deep Learning January 2020 86 / 118


Sparsity

Tempting to solve

β_{λ,0} ∈ argmin_{β ∈ R^d} ‖Y − Xβ‖_2^2 + λ‖β‖_0,

where ‖β‖_0 = #{j ∈ {1, . . . , d} : β_j ≠ 0}.

To solve this, explore all possible supports of β. Too long! (NP-hard)

Find a convex proxy of ‖·‖_0: the ℓ1-norm ‖β‖_1 = ∑_{j=1}^d |β_j|.

E. Scornet Deep Learning January 2020 87 / 118


LASSO

LASSO
The LASSO (Least Absolute Shrinkage and Selection Operator) estimate of linear regression β_{λ,1} is given by

β_{λ,1} ∈ argmin_{β ∈ R^d} ‖Y − Xβ‖_2^2 + λ‖β‖_1.

Solution: no closed form in the general case.

If the X_j are orthonormal, then

(β_{λ,1})_j = X_j′Y (1 − λ / (2|X_j′Y|))_+,

where (x)_+ = max(0, x).

Thus, in the very specific case of orthogonal design, we can easily show that the ℓ1 penalization yields a sparse vector if the parameter λ is properly tuned.

E. Scornet Deep Learning January 2020 88 / 118
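The orthonormal-design formula above is a coordinate-wise soft-thresholding rule. Here is a small numpy sketch of it (my own illustration, with a toy orthonormal design, not from the slides):

import numpy as np

def lasso_orthonormal(X, Y, lam):
    # LASSO estimate when the columns of X are orthonormal (X'X = I):
    # soft-thresholding of the least-squares coordinates X_j'Y.
    z = X.T @ Y
    shrink = np.maximum(0.0, 1.0 - lam / (2.0 * np.abs(z)))
    return z * shrink          # coordinates with |X_j'Y| <= lam/2 are set exactly to 0

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(50, 5)))      # columns of Q are orthonormal
beta_true = np.array([2.0, 0.0, 0.0, -1.5, 0.0])
Y = Q @ beta_true + 0.05 * rng.normal(size=50)
print(lasso_orthonormal(Q, Y, lam=0.5))            # small coordinates become zero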


Sparsity - a picture

Why does ℓ2 (ridge) not induce sparsity? The ℓ1 ball has corners on the coordinate axes, so the constrained optimum often lies on an axis (some coordinates exactly zero), whereas the ℓ2 ball is smooth and typically yields no exact zeros.

E. Scornet Deep Learning January 2020 89 / 118


Outline

1 Introduction

2 History of Neural Networks
  Perceptron
  Multilayer Perceptron - Backpropagation algorithm

3 Hyperparameters
  Activation functions
  Output units
  Loss functions
  Weight initialization

4 Regularization
  Penalization
  Dropout
  Batch normalization
  Early stopping

5 All in all

E. Scornet Deep Learning January 2020 90 / 118


Dropout

Dropout refers to dropping out units (hidden and visible) in a neural network, i.e., temporarily removing them from the network, along with all their incoming and outgoing connections.

Each unit is independently retained with probability
- p = 0.5 for hidden units,
- p ∈ [0.5, 1] for input units, usually p = 0.8.

[“Improving neural networks by preventing co-adaptation of feature detectors”, Hinton et al. 2012]

E. Scornet Deep Learning January 2020 91 / 118


Dropout

E. Scornet Deep Learning January 2020 92 / 118


Dropout algorithm

Training step. While not converged:

1 Inside one epoch, for each mini-batch of size m,
  1 Sample m different masks. A mask consists of one Bernoulli variable per node of the network (hidden and input nodes, but not output nodes). These Bernoulli variables are i.i.d. Usually,
    - the probability of retaining a hidden node is 0.5,
    - the probability of retaining an input node is 0.8.
  2 For each of the m observations in the mini-batch,
    - do a forward pass on the masked network,
    - compute backpropagation in the masked network,
    - compute the gradient for each weight.
  3 Update the parameters according to the usual formula.

Prediction step.
Use all neurons in the network, with weights given by the previous optimization procedure multiplied by the probability p of being retained (0.5 for inner nodes, 0.8 for input nodes).

E. Scornet Deep Learning January 2020 93 / 118
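A minimal numpy sketch of this procedure for one hidden layer (the network, names and sizes are my own, not from the slides): masks are drawn during training, and at prediction time the weights are rescaled by the retention probabilities. Backpropagation is omitted; only the two forward regimes are shown.

import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 20, 50
W1 = 0.1 * rng.normal(size=(d_h, d_in))
W2 = 0.1 * rng.normal(size=(1, d_h))
p_in, p_hid = 0.8, 0.5                        # retention probabilities

def forward_train(x):
    # Forward pass on a masked network (one observation).
    m_in = rng.binomial(1, p_in, size=d_in)   # mask on input nodes
    h = np.maximum(0.0, W1 @ (m_in * x))      # ReLU hidden layer
    m_hid = rng.binomial(1, p_hid, size=d_h)  # mask on hidden nodes
    return W2 @ (m_hid * h)

def forward_predict(x):
    # Prediction step: keep all units, rescale the weights by p.
    h = np.maximum(0.0, (p_in * W1) @ x)
    return (p_hid * W2) @ h

x = rng.normal(size=d_in)
print(forward_train(x), forward_predict(x))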


Another way of seeing dropout - Ensemble method

Averaging many different neural networks.

“Different” can mean either:

- randomizing the data set on which we train each network (via subsampling).
  Problem: not enough data to obtain good performance...

- building different network architectures and training each large network separately on the whole training set.
  Problem: computationally prohibitive at training time and at test time!

Miscellaneous:
[“Fast dropout training”, Wang and Manning 2013]
[“Dropout: A simple way to prevent neural networks from overfitting”, Srivastava et al. 2014]

E. Scornet Deep Learning January 2020 94 / 118


Exercise: linear units

1 Consider a neural network with linear activation functions. Prove that dropout can be seen as a model averaging method.

2 Given one training example (x, y), consider the error of the ensemble of neural networks and that of one random neural network sampled with dropout:

E_ens = (1/2)(y − a_ens)^2 = (1/2)(y − ∑_{i=1}^d p_i w_i x_i)^2,

E_single = (1/2)(y − a_single)^2 = (1/2)(y − ∑_{i=1}^d δ_i w_i x_i)^2,

where δ_i ∈ {0, 1} represents the presence (δ_i = 1) or the absence of a connection between the input x_i and the output, with P(δ_i = 1) = p_i.

Prove that, for every coordinate i,

E[∂E_single/∂w_i] = ∂E_ens/∂w_i + w_i x_i^2 V(δ_i),

that is, E[∇_w E_single] = ∇_w (E_ens + (1/2) ∑_{i=1}^d w_i^2 x_i^2 V(δ_i)).

Comment.

E. Scornet Deep Learning January 2020 95 / 118


Solution

1 Simple calculations...

2 The gradient of the ensemble is given by

∂E_ens/∂w_i = −(y − a_ens) ∂a_ens/∂w_i = −(y − a_ens) p_i x_i,

and that of a single (random) network satisfies

∂E_single/∂w_i = −(y − a_single) ∂a_single/∂w_i
              = −(y − a_single) δ_i x_i
              = −y δ_i x_i + w_i δ_i^2 x_i^2 + ∑_{j≠i} w_j δ_i δ_j x_i x_j.

Taking the expectation of the last expression (using E[δ_i^2] = E[δ_i] = p_i and the independence of the δ_i) gives

E[∂E_single/∂w_i] = −y p_i x_i + w_i p_i x_i^2 + ∑_{j≠i} w_j p_i p_j x_i x_j
                  = −(y − a_ens) p_i x_i + w_i x_i^2 p_i (1 − p_i).

Therefore, we have

E[∂E_single/∂w_i] = ∂E_ens/∂w_i + w_i x_i^2 V[δ_i].

E. Scornet Deep Learning January 2020 96 / 118


Solution

In terms of error, we have

E[E_single] = E_ens + (1/2) ∑_{i=1}^d w_i^2 x_i^2 V[δ_i].

Since V[δ_i] = p_i(1 − p_i), the regularization term is maximized for p = 0.5.

E. Scornet Deep Learning January 2020 97 / 118
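As a sanity check of the identity above, here is a short numpy sketch (my own, with arbitrary toy values, not from the slides) that estimates E[E_single] by Monte Carlo and compares it with E_ens plus the ridge-like regularization term:

import numpy as np

rng = np.random.default_rng(0)
d = 4
w = np.array([1.0, -2.0, 0.5, 3.0])
x = rng.normal(size=d)
y, p = 1.0, 0.5

a_ens = np.sum(p * w * x)
E_ens = 0.5 * (y - a_ens) ** 2
reg = 0.5 * np.sum(w**2 * x**2 * p * (1 - p))    # (1/2) sum_i w_i^2 x_i^2 V[delta_i]

# Monte Carlo estimate of E[E_single]
deltas = rng.binomial(1, p, size=(200_000, d))
E_single = 0.5 * (y - deltas @ (w * x)) ** 2
print(E_single.mean(), E_ens + reg)              # the two numbers should be close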


Outline

1 Introduction

2 History of Neural Networks
  Perceptron
  Multilayer Perceptron - Backpropagation algorithm

3 Hyperparameters
  Activation functions
  Output units
  Loss functions
  Weight initialization

4 Regularization
  Penalization
  Dropout
  Batch normalization
  Early stopping

5 All in all

E. Scornet Deep Learning January 2020 98 / 118


Batch normalization

The network converges faster if its inputs are scaled (zero mean, unit variance) and decorrelated.
[“Efficient backprop”, LeCun et al. 1998]

It is hard to decorrelate variables: this requires computing the covariance matrix.
[“Batch normalization: Accelerating deep network training by reducing internal covariate shift”, Ioffe and Szegedy 2015]

Ideas:
- improving gradient flow;
- allowing higher learning rates;
- reducing the strong dependence on initialization;
- related to regularization (maybe slightly reduces the need for Dropout).

E. Scornet Deep Learning January 2020 99 / 118


Algorithm

See [“Batch normalization: Accelerating deep network training by reducing internal covariate shift”, Ioffe and Szegedy 2015]

1 For every unit in the first layer, given a mini-batch (x_1, . . . , x_m):
  1 µ_B = (1/m) ∑_{i=1}^m x_i
  2 σ_B^2 = (1/m) ∑_{i=1}^m (x_i − µ_B)^2
  3 x̂_i = (x_i − µ_B) / √(σ_B^2 + ε)
  4 y_i = γ x̂_i + β ≡ BN_{γ,β}(x_i)
2 y_i is fed to the next layer and the procedure iterates.
3 Backpropagation is performed on the network parameters, including (γ^(k), β^(k)). This returns a network.
4 For inference, compute the average over many training batches B:

  E_B[x] = E_B[µ_B] and V_B[x] = (m / (m − 1)) E_B[σ_B^2].

5 For inference, replace every function x ↦ BN_{γ,β}(x) in the network by

  x ↦ (γ / √(V_B[x] + ε)) x + (β − γ E_B[x] / √(V_B[x] + ε)).

E. Scornet Deep Learning January 2020 100 / 118
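A compact numpy sketch of this procedure for a single unit (my own illustration; names such as gamma, beta, batch_means are assumptions, not from the slides), showing the training-time normalization and the inference-time affine replacement:

import numpy as np

eps = 1e-5
gamma, beta = 1.0, 0.0            # learned by backpropagation in practice
batch_means, batch_vars = [], []  # statistics accumulated over training batches

def bn_train(x):
    # Batch-norm forward pass on a mini-batch x of shape (m,) for one unit.
    mu, var = x.mean(), x.var()
    batch_means.append(mu)
    batch_vars.append(var)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

def bn_inference(x, m):
    # Inference: replace BN by a fixed affine map built from averaged statistics.
    E_x = np.mean(batch_means)
    V_x = m / (m - 1) * np.mean(batch_vars)
    scale = gamma / np.sqrt(V_x + eps)
    return scale * x + (beta - scale * E_x)

rng = np.random.default_rng(0)
m = 32
for _ in range(10):                       # a few training mini-batches
    bn_train(3.0 + 2.0 * rng.normal(size=m))
print(bn_inference(np.array([1.0, 3.0, 5.0]), m))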


Outline

1 Introduction

2 History of Neural Networks
  Perceptron
  Multilayer Perceptron - Backpropagation algorithm

3 Hyperparameters
  Activation functions
  Output units
  Loss functions
  Weight initialization

4 Regularization
  Penalization
  Dropout
  Batch normalization
  Early stopping

5 All in all

E. Scornet Deep Learning January 2020 101 / 118


Early stopping

Idea:
- Store the parameter values that lead to the lowest error on the validation set.
- Return these values rather than the latest ones.

E. Scornet Deep Learning January 2020 102 / 118


Early stopping algorithm

Parameters:
- the patience p of the algorithm: number of times to observe a worsening validation-set error before giving up;
- the number of steps n between evaluations.

1 Start with initial random values θ_0.
2 Let θ* = θ_0, Err* = ∞, j = 0, i = 0.
3 While j < p:
  1 Update θ by running the training algorithm for n steps.
  2 i = i + n.
  3 Compute the error Err(θ) on the validation set.
  4 If Err(θ) < Err*: set θ* = θ, Err* = Err(θ), j = 0; else j = j + 1.
4 Return θ* and the corresponding number of steps i* = i − np.

E. Scornet Deep Learning January 2020 103 / 118
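A direct Python transcription of this loop, as a sketch (the train_n_steps and validation_error callables are placeholders I introduce, not part of the slides):

import copy
import math

def early_stopping(theta0, train_n_steps, validation_error, n=100, patience=5):
    # Returns the best parameters seen on the validation set and the step count i*.
    theta, theta_star = theta0, copy.deepcopy(theta0)
    err_star, i, j = math.inf, 0, 0
    while j < patience:
        theta = train_n_steps(theta, n)      # run the training algorithm for n steps
        i += n
        err = validation_error(theta)
        if err < err_star:
            theta_star, err_star, j = copy.deepcopy(theta), err, 0
        else:
            j += 1
    return theta_star, i - n * patience      # i* = i - n*p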


How to leverage early stopping?

First idea: use early stopping to determine the best number of iterations i*, then train on the whole data set for i* iterations.

- Let X^(train), y^(train) be the training set.
- Split X^(train), y^(train) into X^(subtrain), y^(subtrain) and X^(valid), y^(valid).
- Run the early stopping algorithm starting from random θ, using X^(subtrain), y^(subtrain) as training data and X^(valid), y^(valid) as validation data. This returns i*, the optimal number of steps.
- Set θ to random values again.
- Train on X^(train), y^(train) for i* steps.

E. Scornet Deep Learning January 2020 104 / 118
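A sketch of this first strategy in Python, reusing the early_stopping helper sketched earlier (the split size, init_params, train_n_steps and make_val_error names are my own assumptions):

import numpy as np

def retrain_with_early_stopping(X, y, init_params, train_n_steps, make_val_error,
                                n=100, patience=5):
    # Strategy 1: find i* on a subtrain/validation split, then retrain from scratch
    # on the full training set for i* steps.
    rng = np.random.default_rng(0)
    idx = rng.permutation(len(X))
    cut = int(0.8 * len(X))                           # 80/20 split (arbitrary choice)
    sub, val = idx[:cut], idx[cut:]
    _, i_star = early_stopping(
        init_params(),
        lambda theta, k: train_n_steps(theta, X[sub], y[sub], k),
        make_val_error(X[val], y[val]),
        n=n, patience=patience)
    return train_n_steps(init_params(), X, y, i_star)  # fresh random start, i* steps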


How to leverage early stopping?

Second idea: use early stopping to determine the best parameters and the training error reached at the best number of iterations. Then, starting from θ*, continue training on the whole data set until the validation error matches that early-stopping training error.

- Let X^(train), y^(train) be the training set.
- Split X^(train), y^(train) into X^(subtrain), y^(subtrain) and X^(valid), y^(valid).
- Run the early stopping algorithm starting from random θ, using X^(subtrain), y^(subtrain) as training data and X^(valid), y^(valid) as validation data. This returns the optimal parameters θ*.
- Set ε = L(θ*, X^(subtrain), y^(subtrain)).
- Initialize θ at θ*. While L(θ, X^(valid), y^(valid)) > ε, train θ on X^(train), y^(train) for n steps.

E. Scornet Deep Learning January 2020 105 / 118


To go further

Early stopping is a very old idea:
- [“Three topics in ill-posed problems”, Wahba 1987]
- [“A formal comparison of methods proposed for the numerical solution of first kind integral equations”, Anderssen and Prenter 1981]
- [“Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping”, Caruana et al. 2001]

But it is also an active area of research:
- [“Adaboost is consistent”, Bartlett and Traskin 2007]
- [“Boosting algorithms as gradient descent”, Mason et al. 2000]
- [“On early stopping in gradient descent learning”, Yao et al. 2007]
- [“Boosting with early stopping: Convergence and consistency”, Zhang, Yu, et al. 2005]
- [“Early stopping for kernel boosting algorithms: A general analysis with localized complexities”, Wei et al. 2017]

E. Scornet Deep Learning January 2020 106 / 118


More on reducing overfitting: averaging

Soft weight-sharing:
[“Simplifying neural networks by soft weight-sharing”, Nowlan and Hinton 1992]

Model averaging:
Average over random initializations, random selections of minibatches, hyperparameters, or outcomes of nondeterministic neural networks.

Boosting neural networks by incrementally adding neural networks to the ensemble:
[“Training methods for adaptive boosting of neural networks”, Schwenk and Bengio 1998]

Boosting has also been applied by interpreting an individual neural network as an ensemble, incrementally adding hidden units to the network:
[“Convex neural networks”, Bengio et al. 2006]

E. Scornet Deep Learning January 2020 107 / 118


Outline

1 Introduction

2 History of Neural Networks
  Perceptron
  Multilayer Perceptron - Backpropagation algorithm

3 Hyperparameters
  Activation functions
  Output units
  Loss functions
  Weight initialization

4 Regularization
  Penalization
  Dropout
  Batch normalization
  Early stopping

5 All in all

E. Scornet Deep Learning January 2020 108 / 118


Pipeline for neural networks

Step 1: Preprocess the data (subtract the mean, divide by the standard deviation).
More complex if the data are images.

Step 2: Choose the architecture (number of layers, number of nodes per layer).

Step 3:
1 First, run the network and check that the loss is reasonable (compare with a dumb classifier: uniform prediction for classification, mean prediction for regression).
2 Add some regularization and check that the error on the training set increases.
3 On a small portion of the data, make sure you can overfit when turning down the regularization.
4 Find the best learning rate:
  1 The error does not change much → learning rate too small.
  2 The error explodes or becomes NaN → learning rate too high.
  3 Find a rough range, e.g., [10^{-5}, 10^{-3}].

Playing with neural networks:
http://playground.tensorflow.org/

E. Scornet Deep Learning January 2020 109 / 118
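As an illustration of the first sanity check in Step 3 (my own example, not from the slides): for a K-class problem, a “dumb” uniform classifier has cross-entropy loss −log(1/K) = log(K), and for regression the mean predictor has a quadratic loss equal to the variance of y; the initial loss of a freshly initialized network should be close to these values.

import numpy as np

rng = np.random.default_rng(0)

# Classification with K classes: uniform predictions give cross-entropy log(K).
K = 10
print(np.log(K))                     # ~2.30: the initial loss should be close to this

# Regression: predicting the mean of y gives a quadratic loss equal to Var(y).
y = rng.normal(loc=3.0, scale=2.0, size=1000)
print(np.mean((y - y.mean()) ** 2))  # ~4: the initial MSE should be close to this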


To go further

The kernel perceptron algorithm was already introduced by
[“Theoretical foundations of the potential function method in pattern recognition learning”, Aizerman 1964].

General idea (works for all methods using only dot products): replace the dot product by a more complex kernel function.

A linear separation in the richer feature space induced by the kernel corresponds to a non-linear separation in the original space.

Margin bounds for the Perceptron algorithm in the general non-separable case were proven by
[“Large margin classification using the perceptron algorithm”, Freund and Schapire 1999]
and then by
[“Perceptron mistake bounds”, Mohri and Rostamizadeh 2013],
who extended existing results and gave new L1 bounds.

E. Scornet Deep Learning January 2020 110 / 118


Mark A Aizerman. “Theoretical foundations of the potential function method in pattern recognition learning”. In: Automation and Remote Control 25 (1964), pp. 821–837.

RS Anderssen and PM Prenter. “A formal comparison of methods proposed for the numerical solution of first kind integral equations”. In: The ANZIAM Journal 22.4 (1981), pp. 488–500.

Yoshua Bengio et al. “Convex neural networks”. In: Advances in neural information processing systems. 2006, pp. 123–130.

Hans-Dieter Block. “The perceptron: A model for brain functioning. I”. In: Reviews of Modern Physics 34.1 (1962), p. 123.

John S Bridle. “Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition”. In: Neurocomputing. Springer, 1990, pp. 227–236.

Bernard Widrow and Samuel D Stearns. “Adaptive signal processing”. In: Englewood Cliffs, NJ: Prentice-Hall 1 (1985), p. 491.

Peter L Bartlett and Mikhail Traskin. “Adaboost is consistent”. In: Journal of Machine Learning Research 8.Oct (2007), pp. 2347–2368.

E. Scornet Deep Learning January 2020 111 / 118


Rich Caruana, Steve Lawrence, and C Lee Giles. “Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping”. In: Advances in neural information processing systems. 2001, pp. 402–408.

Meng Cai, Yongzhe Shi, and Jia Liu. “Deep maxout neural networks for speech recognition”. In: Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on. IEEE. 2013, pp. 291–296.

Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. “Fast and accurate deep network learning by exponential linear units (elus)”. In: arXiv preprint arXiv:1511.07289 (2015).

Yoav Freund and Robert E Schapire. “Large margin classification using the perceptron algorithm”. In: Machine Learning 37.3 (1999), pp. 277–296.

Xavier Glorot and Yoshua Bengio. “Understanding the difficulty of training deep feedforward neural networks”. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. 2010, pp. 249–256.

Xavier Glorot, Antoine Bordes, and Yoshua Bengio. “Deep sparse rectifier neural networks”. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. 2011, pp. 315–323.

E. Scornet Deep Learning January 2020 112 / 118


Ian J Goodfellow, Mehdi Mirza, et al. “An empirical investigation of catastrophic forgetting in gradient-based neural networks”. In: arXiv preprint arXiv:1312.6211 (2013).

Ian J Goodfellow, David Warde-Farley, et al. “Maxout networks”. In: arXiv preprint arXiv:1302.4389 (2013).

Alex Graves. “Generating sequences with recurrent neural networks”. In: arXiv preprint arXiv:1308.0850 (2013).

Kaiming He et al. “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification”. In: Proceedings of the IEEE International Conference on Computer Vision. 2015, pp. 1026–1034.

DO Hebb. The organization of behavior: a neuropsychological theory. Wiley, 1949.

Geoffrey E Hinton et al. “Improving neural networks by preventing co-adaptation of feature detectors”. In: arXiv preprint arXiv:1207.0580 (2012).

Arthur E Hoerl and Robert W Kennard. “Ridge regression: Biased estimation for nonorthogonal problems”. In: Technometrics 12.1 (1970), pp. 55–67.

E. Scornet Deep Learning January 2020 113 / 118


Michael Jen-Chao Hu. “Application of the adaline system to weather forecasting”. PhD thesis. Department of Electrical Engineering, Stanford University, 1964.

Sergey Ioffe and Christian Szegedy. “Batch normalization: Accelerating deep network training by reducing internal covariate shift”. In: arXiv preprint arXiv:1502.03167 (2015).

Kevin Jarrett, Koray Kavukcuoglu, Yann LeCun, et al. “What is the best multi-stage architecture for object recognition?” In: Computer Vision, 2009 IEEE 12th International Conference on. IEEE. 2009, pp. 2146–2153.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. “Imagenet classification with deep convolutional neural networks”. In: Advances in neural information processing systems. 2012, pp. 1097–1105.

Yann LeCun et al. “Efficient backprop”. In: Neural networks: Tricks of the trade. Springer, 1998, pp. 9–50.

Llew Mason et al. “Boosting algorithms as gradient descent”. In: Advances in neural information processing systems. 2000, pp. 512–518.

Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. “Rectifier nonlinearities improve neural network acoustic models”. In: Proc. ICML. Vol. 30. 1. 2013, p. 3.

E. Scornet Deep Learning January 2020 114 / 118


Dmytro Mishkin and Jiri Matas. “All you need is a good init”. In: arXiv preprint arXiv:1511.06422 (2015).

Warren S McCulloch and Walter Pitts. “A logical calculus of the ideas immanent in nervous activity”. In: The Bulletin of Mathematical Biophysics 5.4 (1943), pp. 115–133.

Marvin Minsky and Seymour Papert. “Perceptrons.” In: (1969).

Mehryar Mohri and Afshin Rostamizadeh. “Perceptron mistake bounds”. In: arXiv preprint arXiv:1305.0208 (2013).

Steven J Nowlan and Geoffrey E Hinton. “Simplifying neural networks by soft weight-sharing”. In: Neural Computation 4.4 (1992), pp. 473–493.

Albert B Novikoff. On convergence proofs for perceptrons. Tech. rep. Stanford Research Institute, Menlo Park, CA, 1963.

Mikel Olazaran. “A sociological study of the official history of the perceptrons controversy”. In: Social Studies of Science 26.3 (1996), pp. 611–659.

Frank Rosenblatt. “Perceptron simulation experiments”. In: Proceedings of the IRE 48.3 (1960), pp. 301–309.

E. Scornet Deep Learning January 2020 115 / 118


Frank Rosenblatt. Principles of neurodynamics. Perceptrons and the theory of brain mechanisms. Tech. rep. Cornell Aeronautical Laboratory, Buffalo, NY, 1961.

Holger Schwenk and Yoshua Bengio. “Training methods for adaptive boosting of neural networks”. In: Advances in neural information processing systems. 1998, pp. 647–653.

Michael Schuster. “On supervised learning from sequential data with applications for speech recognition”. Doctoral dissertation. Nara Institute of Science and Technology 45 (1999).

Nitish Srivastava et al. “Dropout: A simple way to prevent neural networks from overfitting”. In: The Journal of Machine Learning Research 15.1 (2014), pp. 1929–1958.

LR Talbert, GF Groner, and JS Koford. “Real-Time Adaptive Speech-Recognition System”. In: The Journal of the Acoustical Society of America 35.5 (1963), pp. 807–807.

Robert Tibshirani. “Regression shrinkage and selection via the lasso”. In: Journal of the Royal Statistical Society. Series B (Methodological) (1996), pp. 267–288.

E. Scornet Deep Learning January 2020 116 / 118


Grace Wahba. “Three topics in ill-posed problems”. In: Inverse and ill-posed problems. Elsevier, 1987, pp. 37–51.

Bernard Widrow and Marcian E Hoff. Adaptive switching circuits. Tech. rep. Stanford University, CA, Stanford Electronics Labs, 1960.

Wessel N van Wieringen. “Lecture notes on ridge regression”. In: arXiv preprint arXiv:1509.09169 (2015).

Sida Wang and Christopher Manning. “Fast dropout training”. In: International Conference on Machine Learning. 2013, pp. 118–126.

Capt Rodney Winter and B Widrow. “Madaline Rule II: a training algorithm for neural networks”. In: Second Annual International Conference on Neural Networks. 1988, pp. 1–401.

Yuting Wei, Fanny Yang, and Martin J Wainwright. “Early stopping for kernel boosting algorithms: A general analysis with localized complexities”. In: Advances in Neural Information Processing Systems. 2017, pp. 6067–6077.

Bing Xu et al. “Empirical evaluation of rectified activations in convolutional network”. In: arXiv preprint arXiv:1505.00853 (2015).

E. Scornet Deep Learning January 2020 117 / 118


Yuan Yao, Lorenzo Rosasco, and Andrea Caponnetto. “On early stopping in gradient descent learning”. In: Constructive Approximation 26.2 (2007), pp. 289–315.

Tong Zhang, Bin Yu, et al. “Boosting with early stopping: Convergence and consistency”. In: The Annals of Statistics 33.4 (2005), pp. 1538–1579.

Hui Zou and Trevor Hastie. “Regularization and variable selection via the elastic net”. In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67.2 (2005), pp. 301–320.

E. Scornet Deep Learning January 2020 118 / 118