
Machine Learning (CSE 446): Backpropagation

Noah Smith, © 2017 University of Washington

[email protected]

November 8, 2017


Neuron-Inspired Classifiers

[Figure: computation graph for the two-layer classifier. The input $\mathbf{x}_n$ is multiplied by the weights $\mathbf{W}$, a bias $\mathbf{b}$ is added, and $\tanh$ gives the “activation” of the “hidden units”; these are multiplied by the weights $\mathbf{v}$ and summed to give the classifier output $\hat{y}$ (“f”), which is compared against the correct output $y_n$ to compute the loss $L_n$.]


Two-Layer Neural Network

$$f(\mathbf{x}) = \operatorname{sign}\left(\sum_{h=1}^{H} v_h \cdot \tanh\left(\mathbf{w}_h \cdot \mathbf{x} + b_h\right)\right) = \operatorname{sign}\left(\mathbf{v} \cdot \tanh\left(\mathbf{W}\mathbf{x} + \mathbf{b}\right)\right)$$

- Two-layer networks allow decision boundaries that are nonlinear.

- It’s fairly easy to show that “XOR” can be simulated (recall conjunction features from the “practical issues” lecture on 10/18).

- Theoretical result: any continuous function on a bounded region in $\mathbb{R}^d$ can be approximated arbitrarily well, with a finite number of hidden units.

- The number of hidden units affects how complicated your decision boundary can be and how easily you will overfit.
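To make the two equivalent forms of the classifier concrete, here is a minimal NumPy sketch (not from the original slides; shapes and variable names are my own) that checks that the per-hidden-unit sum and the vectorized form agree:

```python
import numpy as np

# Two-layer classifier from the slide:
#   f(x) = sign( sum_h v[h] * tanh(w_h . x + b[h]) ) = sign( v . tanh(Wx + b) )
# W is H x d; b and v have length H.
rng = np.random.default_rng(0)
H, d = 4, 3
W = rng.normal(size=(H, d))
b = rng.normal(size=H)
v = rng.normal(size=H)
x = rng.normal(size=d)

# Per-hidden-unit form of the sum.
f_sum = np.sign(sum(v[h] * np.tanh(W[h] @ x + b[h]) for h in range(H)))

# Vectorized form.
f_vec = np.sign(v @ np.tanh(W @ x + b))

print(f_sum, f_vec)  # the two forms give the same prediction
```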


Learning with a Two-Layer Network

Parameters: $\mathbf{W} \in \mathbb{R}^{H \times d}$, $\mathbf{b} \in \mathbb{R}^{H}$, and $\mathbf{v} \in \mathbb{R}^{H}$

- If we choose a differentiable loss, then the whole function will be differentiable with respect to all parameters.

- Because of the squashing function, which is not convex, the overall learning problem is not convex.

- What does (stochastic) (sub)gradient descent do with non-convex functions? It finds a local minimum (see the small demo after this list).

- To calculate gradients, we need to use the chain rule from calculus.

- Special name for (S)GD with chain rule invocations: backpropagation.
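As a small illustration of the non-convexity point (my own toy demo, not from the slides), plain gradient descent on a simple non-convex one-dimensional function converges to different local minima from different starting points:

```python
import numpy as np

# Toy demo: gradient descent on the non-convex function
#   L(w) = sin(w) + 0.1 * w**2
# lands in different local minima depending on where it starts.

def grad(w):
    return np.cos(w) + 0.2 * w  # dL/dw

for w0 in (-6.0, 4.0):
    w = w0
    for _ in range(1000):
        w -= 0.1 * grad(w)      # fixed step size
    print(f"start at w0 = {w0:+.1f}  ->  converged near w = {w:+.3f}")
```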


Backpropagation

For every node in the computation graph, we wish to calculate the first derivative of $L_n$ with respect to that node. For any node $a$, let:

$$\bar{a} = \frac{\partial L_n}{\partial a}$$

Base case:

$$\bar{L}_n = \frac{\partial L_n}{\partial L_n} = 1$$

From here on, we overload notation and let a and b refer to nodes and to their values.


After working forwards through the computation graph to obtain the loss $L_n$, we work backwards through the computation graph, using the chain rule to calculate $\bar{a}$ for every node $a$, making use of the work already done for nodes that depend on $a$.

$$\frac{\partial L_n}{\partial a} = \sum_{b \,:\, a \to b} \frac{\partial L_n}{\partial b} \cdot \frac{\partial b}{\partial a}$$

$$\bar{a} = \sum_{b \,:\, a \to b} \bar{b} \cdot \frac{\partial b}{\partial a} = \sum_{b \,:\, a \to b} \bar{b} \cdot \begin{cases} 1 & \text{if } b = a + c \text{ for some } c \\ c & \text{if } b = a \cdot c \text{ for some } c \\ 1 - b^2 & \text{if } b = \tanh(a) \end{cases}$$
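As a concrete instance of this recursion, the sketch below (my own; the tiny graph and variable names are invented to mirror the slides) applies the three local rules by hand to a scalar graph and checks one derivative against a finite difference:

```python
import math

# Tiny scalar computation graph mirroring the recursion above:
#   d = w * x,   a = b + d,   e = tanh(a),   L = e * v
w, x, b, v = 0.4, -1.3, 0.2, 2.0

# Forward pass.
d = w * x
a = b + d
e = math.tanh(a)
L = e * v

# Backward pass: a_bar = sum over children of (child_bar * d(child)/d(a)),
# using the three local rules (sum -> 1, product -> the other factor, tanh -> 1 - value^2).
L_bar = 1.0                  # base case
e_bar = L_bar * v            # L = e * v   (product)
v_bar = L_bar * e
a_bar = e_bar * (1 - e**2)   # e = tanh(a) (hyperbolic tangent)
b_bar = a_bar * 1.0          # a = b + d   (sum)
d_bar = a_bar * 1.0
w_bar = d_bar * x            # d = w * x   (product)
x_bar = d_bar * w

# Check dL/dw against a finite difference.
eps = 1e-6
L_plus = math.tanh(b + (w + eps) * x) * v
print(w_bar, (L_plus - L) / eps)   # should approximately agree
```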


Backpropagation with Vectors

Pointwise (“Hadamard”) product for vectors in $\mathbb{R}^n$:

$$\mathbf{a} \odot \mathbf{b} = \begin{bmatrix} a[1] \cdot b[1] \\ a[2] \cdot b[2] \\ \vdots \\ a[n] \cdot b[n] \end{bmatrix}$$

$$\bar{\mathbf{a}} = \sum_{\mathbf{b} \,:\, \mathbf{a} \to \mathbf{b}} \sum_{i=1}^{|\mathbf{b}|} \bar{b}[i] \cdot \frac{\partial b[i]}{\partial \mathbf{a}} = \sum_{\mathbf{b} \,:\, \mathbf{a} \to \mathbf{b}} \begin{cases} \bar{\mathbf{b}} & \text{if } \mathbf{b} = \mathbf{a} + \mathbf{c} \text{ for some } \mathbf{c} \\ \bar{\mathbf{b}} \odot \mathbf{c} & \text{if } \mathbf{b} = \mathbf{a} \odot \mathbf{c} \text{ for some } \mathbf{c} \\ \bar{\mathbf{b}} \odot (\mathbf{1} - \mathbf{b} \odot \mathbf{b}) & \text{if } \mathbf{b} = \tanh(\mathbf{a}) \end{cases}$$
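A quick numerical sanity check of the vectorized tanh rule (again my own sketch, with an assumed downstream loss $L = \mathbf{v} \cdot \tanh(\mathbf{a})$):

```python
import numpy as np

# Sanity check of the vectorized tanh rule: if b = tanh(a) and b_bar is known,
# then a_bar = b_bar ⊙ (1 - b ⊙ b), i.e. elementwise products.
rng = np.random.default_rng(1)
n = 5
a = rng.normal(size=n)
v = rng.normal(size=n)        # assume the downstream loss is L = v . tanh(a)

b = np.tanh(a)
L = v @ b

b_bar = v                     # dL/db when L = v . b
a_bar = b_bar * (1 - b * b)   # the tanh rule from the slide

# Finite-difference check of one coordinate.
eps = 1e-6
a_pert = a.copy()
a_pert[2] += eps
print(a_bar[2], (v @ np.tanh(a_pert) - L) / eps)   # should approximately agree
```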


Backpropagation, Illustrated

[Figure: the computation graph with intermediate nodes named: $\mathbf{d} = \mathbf{W}\mathbf{x}_n$, $\mathbf{a} = \mathbf{b} + \mathbf{d}$, $\mathbf{e} = \tanh(\mathbf{a})$, $\mathbf{f} = \mathbf{v} \odot \mathbf{e}$, $g = \sum_h f[h]$, and the loss $L_n(y_n, g)$.]

Intermediate nodes are de-anonymized, to make notation easier.


Backpropagation, Illustrated

[Figure: the computation graph, annotated with the base case $\bar{L}_n = \frac{\partial L_n}{\partial L_n} = 1$.]


Backpropagation, Illustrated

[Figure: the computation graph, annotated with $\bar{L}_n = 1$; the next quantity to compute is $\bar{g}$.]

The form of $\bar{g}$ will be loss-function specific (e.g., $-2(y_n - g)$ for squared loss).
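For concreteness, assuming the squared loss $L_n = (y_n - g)^2$, the value quoted above follows directly:

$$\bar{g} = \frac{\partial L_n}{\partial g} = 2(y_n - g) \cdot (-1) = -2(y_n - g)$$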


Backpropagation, Illustrated

[Figure: the computation graph, now also annotated with $\bar{g}$ and $\bar{\mathbf{f}} = \bar{g} \cdot \mathbf{1}$.]

Sum.


Backpropagation, Illustrated

[Figure: the computation graph, now also annotated with $\bar{\mathbf{v}} = \bar{g} \cdot \mathbf{e}$ and $\bar{\mathbf{e}} = \bar{g} \cdot \mathbf{v}$.]

Product.


Backpropagation, Illustrated

[Figure: the computation graph, now also annotated with $\bar{\mathbf{a}} = (\bar{g} \cdot \mathbf{v}) \odot (\mathbf{1} - \mathbf{e} \odot \mathbf{e})$.]

Hyperbolic tangent.


Backpropagation, Illustrated

[Figure: the computation graph, now also annotated with $\bar{\mathbf{b}} = \bar{\mathbf{a}}$ and $\bar{\mathbf{d}} = \bar{\mathbf{a}}$.]

Sum.


Backpropagation, Illustrated

[Figure: the computation graph, now also annotated with $\bar{\mathbf{W}} = \bar{\mathbf{a}}\,\mathbf{x}_n^\top$.]

Product.
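Putting the illustrated steps together, here is a sketch of one full forward and backward pass for this network, followed by an SGD update (the squared loss, shapes, step size, and variable names are my assumptions; the slides leave $L_n$ generic):

```python
import numpy as np

rng = np.random.default_rng(0)
H, d = 4, 3
W = 0.1 * rng.normal(size=(H, d))
b = np.zeros(H)
v = 0.1 * rng.normal(size=H)
x_n = rng.normal(size=d)
y_n = 1.0

# Forward pass over the de-anonymized nodes from the figure.
d_vec = W @ x_n                 # d = W x_n
a = b + d_vec                   # a = b + d
e = np.tanh(a)                  # e = tanh(a)
f = v * e                       # f = v ⊙ e
g = f.sum()                     # g = sum_h f[h]
L_n = (y_n - g) ** 2            # assumed squared loss

# Backward pass, one illustrated step at a time.
g_bar = -2.0 * (y_n - g)        # loss-specific
f_bar = g_bar * np.ones(H)      # sum
v_bar = f_bar * e               # product
e_bar = f_bar * v               # product
a_bar = e_bar * (1 - e * e)     # hyperbolic tangent
b_bar = a_bar                   # sum
d_bar = a_bar                   # sum (gradient for the node d)
W_bar = np.outer(a_bar, x_n)    # product: a_bar x_n^T

# One SGD step on the parameters.
eta = 0.1
W -= eta * W_bar
b -= eta * b_bar
v -= eta * v_bar
print(L_n, (y_n - v @ np.tanh(W @ x_n + b)) ** 2)  # loss before vs. after (should typically decrease)
```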


Practical Notes

- Don’t initialize all parameters to zero; add some random noise (see the small initialization sketch after this list).

- Random restarts: train K networks with different initializers, and you’ll get K different classifiers of varying quality.

- Hyperparameters?
  - number of training iterations
  - learning rate for SGD
  - number of hidden units (H)
  - number of “layers” (and number of hidden units in each layer)
  - amount of randomness in initialization
  - regularization

- Interpretability?
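A minimal sketch of the first point above (the initialization scale 0.1 and the use of NumPy are my choices, not prescribed by the slides):

```python
import numpy as np

# Small random initialization breaks the symmetry between hidden units.
# For the two-layer tanh network above, the all-zeros point is a stationary
# point of the loss (every parameter gradient is exactly zero), so SGD would never move.
rng = np.random.default_rng(42)
H, d = 16, 10
W = 0.1 * rng.normal(size=(H, d))
b = 0.1 * rng.normal(size=H)
v = 0.1 * rng.normal(size=H)
```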


Challenge of Deeper Networks

Backpropagation aims to assign “credit” (or “blame”) to each parameter. In a deep network, credit/blame is shared across all layers. So parameters at early layers tend to have very small gradients. One solution is to train a shallow network, then use it to initialize a deeper network, perhaps gradually increasing network depth. This is called layer-wise training.
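A small numeric illustration of the shrinking-gradient effect (my own sketch, with arbitrarily chosen width, depth, and weight scale): backpropagating a fixed upstream gradient through a deep chain of tanh layers, the gradient norm decays roughly geometrically as we move toward the earliest layers.

```python
import numpy as np

# Backpropagate through a deep chain of tanh layers and watch the
# gradient norm shrink as we move toward the earliest layers.
rng = np.random.default_rng(0)
width, depth = 32, 20
Ws = [0.5 * rng.normal(size=(width, width)) / np.sqrt(width) for _ in range(depth)]

# Forward pass, keeping each layer's activations.
h = rng.normal(size=width)
acts = []
for W in Ws:
    h = np.tanh(W @ h)
    acts.append(h)

# Backward pass from an arbitrary unit-norm upstream gradient.
g = np.ones(width) / np.sqrt(width)
for layer in reversed(range(depth)):
    g = g * (1 - acts[layer] ** 2)    # through tanh
    g = Ws[layer].T @ g               # through the linear map
    print(f"layer {layer:2d}: ||gradient|| = {np.linalg.norm(g):.2e}")
```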


Radial Basis Function Networks

[Figure: computation graph for a radial basis function network with two hidden units. The input $\mathbf{x}_n$ is compared with each prototype $\mathbf{w}_h$ via sqd, scaled by $\gamma_h$, and passed through exp to give the hidden units’ “activation”; these are weighted by $v_1, v_2$ and summed to give the classifier output $\hat{y}$ (“f”), which is compared against the correct output $y_n$ to compute the loss $L_n$.]

In the diagram, $\text{sqd}(\mathbf{x}, \mathbf{w}) = \|\mathbf{x} - \mathbf{w}\|_2^2$.


Radial Basis Function Networks

Generalizing to H hidden units:

$$f(\mathbf{x}) = \operatorname{sign}\left(\sum_{h=1}^{H} v[h] \cdot \exp\left(-\gamma_h \cdot \|\mathbf{x} - \mathbf{w}_h\|_2^2\right)\right)$$

Each hidden unit is like a little “bump” in data space. $\mathbf{w}_h$ is the position of the bump, and $\gamma_h$ is inversely proportional to its width.
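A minimal sketch of the RBF classifier’s forward pass (shapes, prototype locations, and widths below are my own illustrative choices):

```python
import numpy as np

def rbf_predict(x, prototypes, gammas, v):
    """f(x) = sign( sum_h v[h] * exp(-gamma_h * ||x - w_h||^2) )."""
    sqd = np.sum((prototypes - x) ** 2, axis=1)   # squared distances ||x - w_h||^2
    hidden = np.exp(-gammas * sqd)                # one "bump" per hidden unit
    return np.sign(v @ hidden)

# Example with H = 2 bumps in d = 2 dimensions.
prototypes = np.array([[0.0, 0.0],
                       [3.0, 3.0]])   # bump centers w_h
gammas = np.array([1.0, 0.25])        # larger gamma -> narrower bump
v = np.array([1.0, -1.0])

print(rbf_predict(np.array([0.2, -0.1]), prototypes, gammas, v))  # near the first bump  -> +1
print(rbf_predict(np.array([2.8,  3.1]), prototypes, gammas, v))  # near the second bump -> -1
```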


A Gentle Reading on Backpropagation

http://colah.github.io/posts/2015-08-Backprop/
