
Page 1: Lecture 15: From (Sigmoidal) Perceptrons to Neural Networks

Lecture 15: From (Sigmoidal) Perceptrons to Neural Networks

Reference: We will be referring to sections etc. of ‘Deep Learning’ by Yoshua Bengio, Ian J. Goodfellow and Aaron Courville

https://youtu.be/4PtaZVUbilI?list=PLyo3HAXSZD3zfv9O-y9DJhvrWQPscqATa&t=1187

Page 2: Lecture 15: From (Sigmoidal) Perceptrons to Neural Networks

Recap: Non-linearity via Kernelization (e.g., LR)

The regularized (logistic) cross-entropy loss function (minimized wrt $\mathbf{w} \in \mathbb{R}^p$):

$$E(\mathbf{w}) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)} \log f_{\mathbf{w}}\!\left(\mathbf{x}^{(i)}\right) + \left(1 - y^{(i)}\right)\log\!\left(1 - f_{\mathbf{w}}\!\left(\mathbf{x}^{(i)}\right)\right) \right] + \frac{\lambda}{2m}\|\mathbf{w}\|_2^2 \qquad (1)$$

Equivalent dual kernelized objective¹ (minimized wrt $\boldsymbol{\alpha} \in \mathbb{R}^m$):

$$E_D(\boldsymbol{\alpha}) = \sum_{i=1}^{m}\left[ \sum_{j=1}^{m}\left( -y^{(i)} K\!\left(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}\right)\alpha_j + \frac{\lambda}{2}\,\alpha_i\, K\!\left(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}\right)\alpha_j \right) + \log\!\left( 1 + \exp\!\left( \sum_{j=1}^{m} \alpha_j K\!\left(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}\right) \right) \right) \right] \qquad (2)$$

Decision function: $f_{\mathbf{w}}(\mathbf{x}) = \dfrac{1}{1 + \exp\!\left( \sum_{j=1}^{m} \alpha_j K\!\left(\mathbf{x}, \mathbf{x}^{(j)}\right) \right)}$

¹ Representer Theorem and http://perso.telecom-paristech.fr/~clemenco/Projets_ENPC_files/kernel-log-regression-svm-boosting.pdf
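A minimal sketch of evaluating the dual decision function above, assuming an RBF kernel and already-learned dual coefficients α; both the kernel choice and the toy numbers are hypothetical, since the slide fixes neither:

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    # K(x, z) = exp(-gamma * ||x - z||^2)  (one common kernel choice)
    return np.exp(-gamma * np.sum((x - z) ** 2))

def dual_decision(x, X_train, alpha, gamma=1.0):
    # f(x) = 1 / (1 + exp(sum_j alpha_j * K(x, x^(j)))), following the slide's convention
    s = sum(a * rbf_kernel(x, xj, gamma) for a, xj in zip(alpha, X_train))
    return 1.0 / (1.0 + np.exp(s))

# toy usage with made-up training points and dual coefficients
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0]])
alpha = np.array([0.5, -1.2, 0.3])   # hypothetical learned dual coefficients
print(dual_decision(np.array([0.5, 0.5]), X_train, alpha))
```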

Page 3: Lecture 15: From (Sigmoidal) Perceptrons to Neural Networks

Story so Far

Perceptron

Kernel Perceptron

Logistic Regression

Kernelized Logistic Regression

Neural Networks:

Page 4: Lecture 15: From (Sigmoidal) Perceptrons to Neural Networks

Story so Far

Perceptron

Kernel Perceptron

Logistic Regression

Kernelized Logistic Regression

Neural Networks: Universal Approximation Properties and Depth (Section 6.4)

With a single hidden layer of a sufficient size and a reasonable choice of nonlinearity (including the sigmoid, hyperbolic tangent, and RBF unit), one can represent any smooth function to any desired accuracy. The greater the required accuracy, the more hidden units are required. No free lunch theorems.
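A small illustration of the claim (not from the slides): fitting a smooth 1-D target with a single hidden layer of tanh units of increasing width, assuming scikit-learn's MLPRegressor is available:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Approximate a smooth target (sin) with one hidden layer of tanh units.
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X).ravel()

# More hidden units generally give a closer fit, in line with the slide's claim.
for width in (2, 10, 100):
    mlp = MLPRegressor(hidden_layer_sizes=(width,), activation='tanh',
                       max_iter=5000, random_state=0)
    mlp.fit(X, y)
    err = np.mean((mlp.predict(X) - y) ** 2)
    print(f"hidden units = {width:3d}, training MSE = {err:.4f}")
```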

Page 5: Lecture 15: From (Sigmoidal) Perceptrons to Neural Networks

Problem in Perspective

Given data points $\mathbf{x}_i$, $i = 1, 2, \ldots, m$

Possible class choices: $c_1, c_2, \ldots, c_k$

Wish to generate a mapping/classifier $f : \mathbf{x} \to \{c_1, c_2, \ldots, c_k\}$ to get class labels $y_1, y_2, \ldots, y_m$

Page 6: Lecture 15: From (Sigmoidal) Perceptrons to Neural Networks

Problem in Perspective

In general, a series of mappings

$$\mathbf{x} \xrightarrow{f(\cdot)} \mathbf{y} \xrightarrow{g(\cdot)} \mathbf{z} \xrightarrow{h(\cdot)} \{c_1, c_2, \ldots, c_k\}$$

where $\mathbf{y}$, $\mathbf{z}$ are in some latent space.

https://playground.tensorflow.org
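A tiny sketch of such a cascade; f, g, h below are hypothetical stand-ins, since the slide leaves the individual mappings abstract:

```python
import numpy as np

# Hypothetical stand-ins for the three stages of the cascade.
def f(x):          # raw input -> first latent representation
    return np.tanh(x)

def g(y):          # first latent -> second latent representation
    return np.maximum(0.0, y.sum(axis=-1, keepdims=True))  # ReLU of a sum

def h(z):          # second latent -> class label in {0, 1}
    return (z > 0.5).astype(int).ravel()

x = np.array([[0.2, -1.0, 3.0]])
print(h(g(f(x))))  # the classifier is just the composition h(g(f(x)))
```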

Page 7: Lecture 15: From (Sigmoidal) Perceptrons to Neural Networks

Other non-linear activation functions?

Consider classification: $f(\mathbf{x}) = g\!\left(\mathbf{w}^T\phi(\mathbf{x})\right)$

https://playground.tensorflow.org

Page 8: Lecture 15: From (Sigmoidal) Perceptrons to Neural Networks

Other non-linear activation functions?

Consider classification: $f(\mathbf{x}) = g\!\left(\mathbf{w}^T\phi(\mathbf{x})\right)$

$\mathrm{sign}\!\left(\mathbf{w}^T\phi(\mathbf{x})\right)$ is replaced by $g\!\left(\mathbf{w}^T\phi(\mathbf{x})\right)$, where $g(s)$ is a

1. Step function: $g(s) = 1$ if $s \in [\theta, \infty)$ and $g(s) = 0$ otherwise; OR
2. Sigmoid function: $g(s) = \frac{1}{1+e^{-s}}$, with possible thresholding using some $\theta$ (such as $\frac{1}{2}$).
3. Rectified Linear Unit (ReLU): $g(s) = \max(0, s)$: a most popular activation function.
4. Softplus: $g(s) = \ln(1 + e^s)$.

Options 2 and 4 have the thresholding step deferred. The threshold changes as the bias is changed.
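A quick sketch of the four options as plain NumPy functions; the threshold value used for the step function is a hypothetical choice:

```python
import numpy as np

THETA = 0.5  # hypothetical threshold for the step function

def step(s, theta=THETA):
    # option 1: 1 if s >= theta, else 0
    return (s >= theta).astype(float)

def sigmoid(s):
    # option 2: squashes s into (0, 1); thresholding (e.g. at 1/2) can be deferred
    return 1.0 / (1.0 + np.exp(-s))

def relu(s):
    # option 3: max(0, s)
    return np.maximum(0.0, s)

def softplus(s):
    # option 4: ln(1 + e^s), a smooth version of ReLU
    return np.log1p(np.exp(s))

s = np.linspace(-3, 3, 7)
for g in (step, sigmoid, relu, softplus):
    print(g.__name__, np.round(g(s), 3))
```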

Page 9: Lecture 15: From (Sigmoidal) Perceptrons to Neural Networks

Demonstration at https://www.desmos.com/calculator

Page 10: Lecture 15: From (Sigmoidal) Perceptrons to Neural Networks

Other non-linear activation functions?

Consider classification: $f(\mathbf{x}) = g\!\left(\mathbf{w}^T\phi(\mathbf{x})\right)$

$\mathrm{sign}\!\left(\mathbf{w}^T\phi(\mathbf{x})\right)$ is replaced by $g\!\left(\mathbf{w}^T\phi(\mathbf{x})\right)$, where $g(s)$ is a

1. Step function: $g(s) = 1$ if $s \in [\theta, \infty)$ and $g(s) = 0$ otherwise; OR
2. Sigmoid function: $g(s) = \frac{1}{1+e^{-s}}$, with possible thresholding using some $\theta$ (such as $\frac{1}{2}$).
3. Rectified Linear Unit (ReLU): $g(s) = \max(0, s)$: a most popular activation function.
4. Softplus: $g(s) = \ln(1 + e^s)$.

Options 2 and 4 have the thresholding step deferred. The threshold changes as the bias is changed.

Neural Networks: a cascade of layers of perceptrons giving you non-linearity. Check out https://playground.tensorflow.org/

Page 11: Lecture 15: From (Sigmoidal) Perceptrons to Neural Networks

Recall: Logistic Extension to multi-class

1. Each class $c = 1, 2, \ldots, K-1$ can have a different weight vector $[w_{c,1}, w_{c,2}, \ldots, w_{c,k}, \ldots, w_{c,K-1}]$ and

$$p(Y = c \mid \phi(\mathbf{x})) = \frac{e^{-(\mathbf{w}^c)^T\phi(\mathbf{x})}}{1 + \sum_{k=1}^{K-1} e^{-(\mathbf{w}^k)^T\phi(\mathbf{x})}}$$

for $c = 1, \ldots, K-1$, so that

$$p(Y = K \mid \phi(\mathbf{x})) = \frac{1}{1 + \sum_{k=1}^{K-1} e^{-(\mathbf{w}^k)^T\phi(\mathbf{x})}}$$

Page 12: Lecture 15: From (Sigmoidal) Perceptrons to Neural Networks

Softmax: (Equivalent) LR extension to multi-class

1. Each class $c = 1, 2, \ldots, K$ can have a different weight vector $[w_{c,1}, w_{c,2}, \ldots, w_{c,K}]$ and

$$p(Y = c \mid \phi(\mathbf{x})) = \frac{e^{-(\mathbf{w}^c)^T\phi(\mathbf{x})}}{\sum_{k=1}^{K} e^{-(\mathbf{w}^k)^T\phi(\mathbf{x})}}$$

for $c = 1, \ldots, K$.
2. This has one additional (redundant) set of weight vector parameters.
3. Tutorial 7: Show the (simple) equivalence between the two formulations (a numerical check is sketched below).
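The numerical check referred to in item 3: a sketch comparing the K-vector (softmax) form with the (K−1)-vector form, using the slide's negative-exponent convention and random made-up weights; the reduction used here is the standard one of shifting all weight vectors so the K-th becomes zero.

```python
import numpy as np

rng = np.random.RandomState(0)
K, p = 4, 3                      # number of classes, feature dimension (made up)
W = rng.randn(K, p)              # one weight vector per class (softmax form)
phi_x = rng.randn(p)             # some feature vector phi(x)

# Softmax form (Page 12): p(Y=c) = exp(-w_c.phi) / sum_k exp(-w_k.phi)
scores = np.exp(-W @ phi_x)
p_softmax = scores / scores.sum()

# (K-1)-vector form (Page 11): shift so the K-th weight vector becomes zero,
# then p(Y=c) = exp(-w'_c.phi) / (1 + sum_{k<K} exp(-w'_k.phi))
W_shift = W - W[-1]              # removes the redundant parameters
scores2 = np.exp(-W_shift[:-1] @ phi_x)
denom = 1.0 + scores2.sum()
p_reduced = np.append(scores2, 1.0) / denom

print(np.allclose(p_softmax, p_reduced))   # True: the two formulations agree
```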

Page 13: Lecture 15: From (Sigmoidal) Perceptrons to Neural Networks

Multi-layer Perceptron/LR (Neural Network) and VC Dimension

Page 14: Lecture 15: From (Sigmoidal) Perceptrons to Neural Networks

Measure for (non)separability using a classifier?

Aspect 1: Number of functions that can be represented.

Tutorial 3: Given $n$ boolean variables, how many of the $2^{2^n}$ boolean functions can be represented by the classifier? E.g., with the perceptron we saw that for $n = 2$ it is 14, for $n = 3$ it is 104, and for $n = 4$ it is 1882.

Page 15: Lecture 15: From (Sigmoidal) Perceptrons to Neural Networks

Measure for (non)separability using a classifier?

Aspect 1: Number of functions that can be represented.

Tutorial 3: Given $n$ boolean variables, how many of the $2^{2^n}$ boolean functions can be represented by the classifier? E.g., with the perceptron we saw that for $n = 2$ it is 14, for $n = 3$ it is 104, and for $n = 4$ it is 1882.
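A brute-force check of the n = 2 count (14 of the 16 boolean functions of two variables are representable by a single perceptron); the weight grid searched below is a hypothetical discretization that happens to suffice for this case:

```python
import itertools
import numpy as np

inputs = list(itertools.product([0, 1], repeat=2))        # the 4 input patterns
all_functions = list(itertools.product([0, 1], repeat=4)) # all 2^(2^2) = 16 labelings

# crude search over a grid of weights/biases; enough to find every separable labeling
grid = np.linspace(-2, 2, 9)
representable = set()
for w1, w2, b in itertools.product(grid, repeat=3):
    labels = tuple(int(w1 * x1 + w2 * x2 + b > 0) for x1, x2 in inputs)
    representable.add(labels)

count = sum(f in representable for f in all_functions)
print(count)   # prints 14: every labeling except XOR and XNOR
```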

Page 16: Lecture 15: From (Sigmoidal) Perceptrons to Neural Networks

Measure for (non)separability using a classifier?

Aspect 2: Cardinality of the largest set of points that can be shattered.

VC (Vapnik-Chervonenkis) dimension ⇒ richness of the space of functions learnable by a statistical classification algorithm.

A classification function $f_{\mathbf{w}}$ is said to shatter a set of data points $(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n)$ if, for all assignments of labels to those points, there exists a $\mathbf{w}$ such that $f_{\mathbf{w}}$ makes no errors when evaluating that set of data points. The cardinality of the largest set of points that $f_{\mathbf{w}}$ can shatter is its VC dimension (see extra & optional material on VC dimension).
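A sketch of the shattering check for three non-collinear points in the plane under linear classifiers; the specific points and the weight grid below are assumptions made for the demonstration:

```python
import itertools
import numpy as np

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # three non-collinear points

def exists_linear_separator(points, labels, grid=np.linspace(-2, 2, 17)):
    # brute-force search for (w1, w2, b) whose sign pattern matches every label
    for w1, w2, b in itertools.product(grid, repeat=3):
        preds = (points @ np.array([w1, w2]) + b > 0).astype(int)
        if np.array_equal(preds, labels):
            return True
    return False

shattered = all(exists_linear_separator(points, np.array(lab))
                for lab in itertools.product([0, 1], repeat=3))
print(shattered)   # True: all 8 labelings of these 3 points are linearly separable
```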

Page 17: Lecture 15: From (Sigmoidal) Perceptrons to Neural Networks

VC dimension: Examples

Three points can be shattered using linear classifiers

[Figure: all labelings of three points in the plane, each separated by a linear classifier.]

Page 18: Lecture 15: From (Sigmoidal) Perceptrons to Neural Networks

VC dimension: Examples

Three points can be shattered using linear classifiers

[Figure: all labelings of three points in the plane, each separated by a linear classifier.]

Four points can be shattered using axis-parallel rectangles

[Figure: all labelings of four points, each picked out by an axis-parallel rectangle.]
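A brute-force sketch of the four-point claim; the diamond placement of the points and the coordinate grid are assumptions (the original figure is not reproduced here):

```python
import itertools
import numpy as np

# a diamond configuration of four points (assumed; any layout where no point lies
# inside the bounding box of the others works)
pts = np.array([[0.0, 1.0], [1.0, 0.0], [0.0, -1.0], [-1.0, 0.0]])

def rectangle_labels(pts, xlo, xhi, ylo, yhi):
    return tuple(int(xlo <= x <= xhi and ylo <= y <= yhi) for x, y in pts)

coords = np.linspace(-1.5, 1.5, 7)
achievable = set()
for xlo, xhi, ylo, yhi in itertools.product(coords, repeat=4):
    if xlo <= xhi and ylo <= yhi:
        achievable.add(rectangle_labels(pts, xlo, xhi, ylo, yhi))

print(len(achievable) == 16)   # True: all 2^4 labelings are achievable
```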

Page 19: Lecture 15: From (Sigmoidal) Perceptrons to Neural Networks

Measure for (non)separability using a classifier?

A classification function $f_{\mathbf{w}}$ is said to shatter a set of data points $(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n)$ if, for all assignments of labels to those points, there exists a $\mathbf{w}$ such that $f_{\mathbf{w}}$ makes no errors when evaluating that set of data points.

The cardinality of the largest set of points that $f_{\mathbf{w}}$ can shatter is its VC dimension (see extra & optional material on VC dimension).

E.g.: For $f$ as a threshold interval (in $\mathbb{R}$),

Page 20: Lecture 15: From (Sigmoidal) Perceptrons to Neural Networks

Measure for (non)separability using a classifier?

A classification function $f_{\mathbf{w}}$ is said to shatter a set of data points $(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n)$ if, for all assignments of labels to those points, there exists a $\mathbf{w}$ such that $f_{\mathbf{w}}$ makes no errors when evaluating that set of data points.

The cardinality of the largest set of points that $f_{\mathbf{w}}$ can shatter is its VC dimension (see extra & optional material on VC dimension).

E.g.: For $f$ as a threshold interval (in $\mathbb{R}$), VC dimension = 1.
E.g.: For $f$ as an interval classifier (in $\mathbb{R}$),

Page 21: Lecture 15: From (Sigmoidal) Perceptrons to Neural Networks

Measure for (non)separability using a classifier?

A classification function $f_{\mathbf{w}}$ is said to shatter a set of data points $(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n)$ if, for all assignments of labels to those points, there exists a $\mathbf{w}$ such that $f_{\mathbf{w}}$ makes no errors when evaluating that set of data points.

The cardinality of the largest set of points that $f_{\mathbf{w}}$ can shatter is its VC dimension (see extra & optional material on VC dimension).

E.g.: For $f$ as a threshold interval (in $\mathbb{R}$), VC dimension = 1.
E.g.: For $f$ as an interval classifier (in $\mathbb{R}$), VC dimension = 2.
E.g.: For $f$ as a perceptron (in 2-dimensional $\mathbb{R}^2$),

Page 22: Lecture 15: From (Sigmoidal) Perceptrons to Neural Networks

Measure for (non)separability using a classifier?

A classification function $f_{\mathbf{w}}$ is said to shatter a set of data points $(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n)$ if, for all assignments of labels to those points, there exists a $\mathbf{w}$ such that $f_{\mathbf{w}}$ makes no errors when evaluating that set of data points.

The cardinality of the largest set of points that $f_{\mathbf{w}}$ can shatter is its VC dimension (see extra & optional material on VC dimension).

E.g.: For $f$ as a threshold interval (in $\mathbb{R}$), VC dimension = 1.
E.g.: For $f$ as an interval classifier (in $\mathbb{R}$), VC dimension = 2.
E.g.: For $f$ as a perceptron (in 2-dimensional $\mathbb{R}^2$), VC dimension = 3.
E.g.: For $f$ as a perceptron (in $\mathbb{R}^n$),

Page 23: Lecture 15: From (Sigmoidal) Perceptrons to Neural Networks

Measure for (non)separability using a classifier?

A classification function $f_{\mathbf{w}}$ is said to shatter a set of data points $(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n)$ if, for all assignments of labels to those points, there exists a $\mathbf{w}$ such that $f_{\mathbf{w}}$ makes no errors when evaluating that set of data points.

The cardinality of the largest set of points that $f_{\mathbf{w}}$ can shatter is its VC dimension (see extra & optional material on VC dimension).

E.g.: For $f$ as a threshold interval (in $\mathbb{R}$), VC dimension = 1.
E.g.: For $f$ as an interval classifier (in $\mathbb{R}$), VC dimension = 2.
E.g.: For $f$ as a perceptron (in 2-dimensional $\mathbb{R}^2$), VC dimension = 3.
E.g.: For $f$ as a perceptron (in $\mathbb{R}^n$), VC dimension = $n + 1$ (see extra slides for proof).
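A quick sketch checking the interval-classifier example (VC dimension 2 on ℝ): two points can be shattered, three cannot. The candidate endpoints searched below (midpoints plus outer margins) are an assumption that suffices for this check:

```python
import itertools

def interval_shatters(points):
    # candidate endpoints: midpoints between samples plus outer margins
    cands = sorted(points)
    cuts = [cands[0] - 1] + [(a + b) / 2 for a, b in zip(cands, cands[1:])] + [cands[-1] + 1]
    for labels in itertools.product([0, 1], repeat=len(points)):
        ok = any(all(int(lo <= x <= hi) == lab for x, lab in zip(points, labels))
                 for lo, hi in itertools.combinations_with_replacement(cuts, 2))
        if not ok:
            return False
    return True

print(interval_shatters([0.0, 1.0]))        # True: two points can be shattered
print(interval_shatters([0.0, 1.0, 2.0]))   # False: the labeling 1, 0, 1 is impossible
```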

Page 24: Lecture 15: From (Sigmoidal) Perceptrons to Neural Networks

Neural Networks

Great expressive power (recall VC dimension). Keys to non-linearity:

1. non-linearity in activation functions (Tutorial 7)
2. cascade of non-linear activation functions

Varied activation functions

Page 25: Lecture 15: From (Sigmoidal) Perceptrons to Neural Networks

Neural Networks

Great expressive power (recall VC dimension). Keys to non-linearity:

1. non-linearity in activation functions (Tutorial 7)
2. cascade of non-linear activation functions

Varied activation functions: $f(\mathbf{x}) = g\!\left(\mathbf{w}^T\phi(\mathbf{x})\right)$ where $g(s)$ can be any of the following:

1. Sign/step function: $g(s) = \mathrm{sign}(s)$, or $g(s) = 1$ if $s \in [\theta, \infty)$ and $g(s) = 0$ otherwise.
2. Sigmoid function: $g(s) = \frac{1}{1+e^{-s}}$, with possible thresholding using some $\theta$ (such as $\frac{1}{2}$); $\tanh(s) = 2\,\mathrm{sigmoid}(2s) - 1$ (checked numerically below).
3. Rectified Linear Unit (ReLU): $g(s) = \max(0, s)$: a most popular activation function.
4. Softplus: $g(s) = \ln(1 + e^s)$: a smooth version of ReLU.
5. Others: leaky ReLU, RBF, Maxout, hard tanh, absolute value rectification (Section 6.2.1).

Neural Networks: a cascade of layers of perceptrons giving you non-linearity.
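The numerical check of the tanh identity from item 2, tanh(s) = 2 · sigmoid(2s) − 1:

```python
import numpy as np

sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))
s = np.linspace(-5, 5, 11)
print(np.allclose(np.tanh(s), 2 * sigmoid(2 * s) - 1))   # True
```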

Page 26: Lecture 15: From (Sigmoidal) Perceptrons to Neural Networks

The 4 Design Choices in a Neural Network (Section 6.1)

Page 27: Lecture 15: From (Sigmoidal) Perceptrons to Neural Networks

Some activation functions

Page 28: Lecture 15: From (Sigmoidal) Perceptrons to Neural Networks

Derivatives of some activation functions

Page 29: Lecture 15: From (Sigmoidal) Perceptrons to Neural Networks

Some interesting visualizations

https://distill.pub/2018/building-blocks/

https://distill.pub/2017/feature-visualization/

http://colah.github.io/posts/2015-01-Visualizing-Representations/

http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

Page 30: Lecture 15: From (Sigmoidal) Perceptrons to Neural Networks

Simple OR using (step) perceptron

[Figure: a single step perceptron with inputs $x$, $y$ and a constant bias input $b = 1$; weights 1, 1 and bias weight $-0.25$; threshold $\theta = \frac{1}{2}$; output $x \lor y$.]

Page 31: Lecture 15: From (Sigmoidal) Perceptrons to Neural Networks

AND using (step) perceptron

[Figure: a single step perceptron with inputs $x$, $y$ and a constant bias input $b = 1$; weights 1, 1 and bias weight $-1.25$; threshold $\theta = \frac{1}{2}$; output $x \land y$.]

Page 32: Lecture 15: From (Sigmoidal) Perceptrons to Neural Networks

NOT using perceptron

[Figure: a single step perceptron with input $x$ and a constant bias input $b = 1$; weight $-1$ and bias weight $0.75$; threshold $\theta = \frac{1}{2}$; output $\lnot x$.]
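A minimal check of the OR, AND, and NOT units from the last three slides, using a step unit thresholded at θ = 1/2 as in the figures:

```python
import itertools

def step_unit(weights, bias, inputs, theta=0.5):
    # fires (outputs 1) when the weighted sum of inputs plus bias reaches theta
    return int(sum(w * x for w, x in zip(weights, inputs)) + bias >= theta)

for x, y in itertools.product([0, 1], repeat=2):
    or_out  = step_unit([1, 1], -0.25, [x, y])
    and_out = step_unit([1, 1], -1.25, [x, y])
    print(f"x={x} y={y}  OR={or_out}  AND={and_out}")

for x in (0, 1):
    print(f"x={x}  NOT={step_unit([-1], 0.75, [x])}")
```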

Page 33: Lecture 15: From (Sigmoidal) Perceptrons to Neural Networks

Feed-forward Neural Nets

[Figure: a feed-forward layer over inputs $x_1, x_2, \ldots, x_n$ and a constant input 1, with hidden units $z_1 = g\!\left(\sum_i w_{i1} x_i + w_{01}\right)$ and $z_2 = g\!\left(\sum_i w_{i2} x_i + w_{02}\right)$, i.e. $f_1 = g(\cdot)$ and $f_2 = g(\cdot)$.]
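A sketch of the forward pass for one such layer with two hidden units, assuming a sigmoid g; the weight values are made up, since the figure leaves them unspecified:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def layer_forward(x, W, w0, g=sigmoid):
    # z_k = g( sum_i W[i, k] * x_i + w0[k] ), one output per hidden unit
    return g(x @ W + w0)

x  = np.array([0.5, -1.0, 2.0])        # example inputs x1, x2, x3
W  = np.array([[ 1.0, -0.5],           # example weights w_{ik} (made up)
               [ 0.3,  0.8],
               [-1.2,  0.1]])
w0 = np.array([0.1, -0.2])             # bias weights w_{01}, w_{02}
print(layer_forward(x, W, w0))         # z1, z2
```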

Page 34: Lecture 15: From (Sigmoidal) Perceptrons to Neural Networks

Eg: Feed-forward Neural Net for XOR (θ = 0)

Page 35: Lecture 15: From (Sigmoidal) Perceptrons to Neural Networks

Eg: Feed-forward Neural Net for XOR (θ = 0)

[Figure: a two-layer step network ($\theta = 0$) over inputs $x_1$, $x_2$ and a constant input 1. Hidden units: $z_1 = g(x_1 + x_2 - 0.25)$ and $z_2 = g(-x_1 - x_2 + 1.25)$; output: $f_{\mathbf{w}} = g(z_1 + z_2 - 1.25)$, which computes $x_1 \oplus x_2$ (XOR).]
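A quick check of the XOR network as reconstructed above, with step units thresholded at θ = 0:

```python
import itertools

def step(s, theta=0.0):
    return int(s >= theta)

def xor_net(x1, x2):
    z1 = step( 1 * x1 + 1 * x2 - 0.25)   # OR-like hidden unit
    z2 = step(-1 * x1 - 1 * x2 + 1.25)   # NAND-like hidden unit
    return step(z1 + z2 - 1.25)          # AND of z1, z2 -> XOR of x1, x2

for x1, x2 in itertools.product([0, 1], repeat=2):
    print(x1, x2, "->", xor_net(x1, x2))
```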