
Lecture 12: Computational Graph, Backpropagation

Aykut Erdem, Hacettepe University, November 2018

Last time… Multilayer Perceptron

• Layer representation: (typically) iterate between a linear mapping Wx and a nonlinear function
• Loss function to measure the quality of the estimate so far

$y_i = W_i x_i, \qquad x_{i+1} = \sigma(y_i)$

[Diagram: layers $x_1 \to x_2 \to x_3 \to x_4 \to y$ through weight matrices $W_1, W_2, W_3, W_4$, with loss $l(y, y_i)$]

slide by Alex Smola

Last time… Forward Pass

• Output of the network can be written as:

($j$ indexing hidden units, $k$ indexing the output units, $D$ the number of inputs)
• Activation functions $f$, $g$: sigmoid/logistic, tanh, or rectified linear (ReLU)

slide by Raquel Urtasun, Richard Zemel, Sanja Fidler

Forward Pass: What does the Network Compute?

Output of the network can be written as:

$$h_j(x) = f\Big(v_{j0} + \sum_{i=1}^{D} x_i v_{ji}\Big), \qquad o_k(x) = g\Big(w_{k0} + \sum_{j=1}^{J} h_j(x)\, w_{kj}\Big)$$

($j$ indexing hidden units, $k$ indexing the output units, $D$ the number of inputs)

Activation functions $f$, $g$: sigmoid/logistic, tanh, or rectified linear (ReLU):

$$\sigma(z) = \frac{1}{1 + \exp(-z)}, \qquad \tanh(z) = \frac{\exp(z) - \exp(-z)}{\exp(z) + \exp(-z)}, \qquad \mathrm{ReLU}(z) = \max(0, z)$$

Urtasun, Zemel, Fidler (UofT), CSC 411: 10-Neural Networks I, Feb 10, 2016


Last time… Forward Pass in Python

• Example code for a forward pass for a 3-layer network in Python (a sketch is given after this slide)
• Can be implemented efficiently using matrix operations
• Example above: W1 is a matrix of size 4 × 3, W2 is 4 × 4. What about the biases and W3?

slide by Raquel Urtasun, Richard Zemel, Sanja Fidler [http://cs231n.github.io/neural-networks-1/]
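The code on the original slide is not reproduced in this transcript. Below is a minimal NumPy sketch of a 3-layer forward pass consistent with the shapes mentioned above (W1: 4 × 3, W2: 4 × 4); the shape of W3, the bias vectors, and the choice of sigmoid activation are assumptions made to complete the example.

```python
import numpy as np

f = lambda z: 1.0 / (1.0 + np.exp(-z))  # sigmoid activation, applied elementwise

x  = np.random.randn(3, 1)   # input vector (3x1)
W1 = np.random.randn(4, 3)   # first layer weights (4x3), as on the slide
b1 = np.zeros((4, 1))        # first layer biases (assumed)
W2 = np.random.randn(4, 4)   # second layer weights (4x4), as on the slide
b2 = np.zeros((4, 1))        # second layer biases (assumed)
W3 = np.random.randn(1, 4)   # output layer weights (assumed shape 1x4)
b3 = np.zeros((1, 1))        # output layer bias (assumed)

h1  = f(np.dot(W1, x) + b1)  # first hidden layer activations (4x1)
h2  = f(np.dot(W2, h1) + b2) # second hidden layer activations (4x1)
out = np.dot(W3, h2) + b3    # network output (1x1)
```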

Backpropagation


Recap: Loss function/Optimization

[Figure: class scores assigned to three example training images by the current linear classifier]

TODO:

1. Define a loss function that quantifies our unhappiness with the scores across the training data.

2. Come up with a way of efficiently finding the parameters that minimize the loss function. (optimization)

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

We defined a (linear) score function: $f(x_i, W) = W x_i$

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

Softmax Classifier (Multinomial Logistic Regression)

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

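The formulas on the softmax slides are not captured in this transcript. As a reminder (standard material in the CS231n lectures these slides credit), the softmax classifier treats the scores $s = f(x_i, W)$ as unnormalized log-probabilities and minimizes the cross-entropy loss:

$$P(Y = k \mid X = x_i) = \frac{e^{s_k}}{\sum_j e^{s_j}}, \qquad L_i = -\log\!\left(\frac{e^{s_{y_i}}}{\sum_j e^{s_j}}\right)$$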

Optimization

Gradient Descent

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
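The code on the gradient descent slide is not captured here. The following is a minimal, self-contained sketch of the vanilla gradient descent loop; the toy least-squares problem is an assumption for illustration, not the example used on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # toy data (assumed)
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w = np.zeros(3)                          # initial parameters
step_size = 0.1                          # learning rate

for _ in range(500):
    grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of the mean squared error
    w -= step_size * grad                  # step in the direction of the negative gradient

print(w)  # close to true_w
```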

Mini-batch Gradient Descent

• only use a small portion of the training set to compute the gradient (see the sketch after this slide)

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

there are also more fancy update formulas (momentum, Adagrad, RMSProp, Adam, …)

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
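A minimal sketch of the mini-batch variant, using the same assumed toy least-squares problem as above; each update estimates the gradient from a randomly sampled batch instead of the full training set.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))           # toy data (assumed)
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w = np.zeros(3)
step_size = 0.1
batch_size = 32

for _ in range(2000):
    idx = rng.integers(0, len(y), size=batch_size)  # sample a mini-batch
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size    # gradient estimated on the batch only
    w -= step_size * grad                           # parameter update

print(w)  # approaches true_w, with noisier steps than full-batch gradient descent
```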

The effects of different update formulas

(image credits to Alec Radford)

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson


Computational Graph

[Computational graph: inputs x and W feed a multiply node (*) producing the scores s; the scores go through a hinge loss node, a regularization term R(W) is computed from W, and the two are added (+) to give the total loss L]

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

Convolutional Network (AlexNet)

[Graph: the input image and weights flow through the network, which outputs a loss]

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

Neural Turing Machine

[Graph: the input tape flows through the network, which outputs a loss]

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

e.g. x = -2, y = 5, z = -4

Want: the gradients of the output with respect to x, y, and z

Chain rule:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
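The circuit drawn on these slides is not captured in the transcript. In the CS231n lecture they are adapted from, the running example is f(x, y, z) = (x + y) z; a sketch of the forward and backward passes under that assumption:

```python
# Assumed example: f(x, y, z) = (x + y) * z with x = -2, y = 5, z = -4
x, y, z = -2.0, 5.0, -4.0

# forward pass
q = x + y            # q = 3
f = q * z            # f = -12

# backward pass, applying the chain rule node by node
df_dz = q            # d(q*z)/dz = q          -> 3
df_dq = z            # d(q*z)/dq = z          -> -4
df_dx = 1.0 * df_dq  # dq/dx = 1, chain rule  -> -4
df_dy = 1.0 * df_dq  # dq/dy = 1, chain rule  -> -4

print(df_dx, df_dy, df_dz)  # -4.0 -4.0 3.0
```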

[Diagram: a single gate f inside the graph; activations flow forward through the gate, the gate has a "local gradient" relating its output to each input, and during the backward pass the gradient arriving from above is multiplied by the local gradient and passed on to the inputs]

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

Another example:

(-1) * (-0.20) = 0.20

[local gradient] x [its gradient]: [1] x [0.2] = 0.2 and [1] x [0.2] = 0.2 (both inputs!)

[local gradient] x [its gradient]: x0: [2] x [0.2] = 0.4, w0: [-1] x [0.2] = -0.2

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

sigmoid function: $\sigma(x) = \frac{1}{1 + e^{-x}}$, with gradient $\frac{d\sigma(x)}{dx} = (1 - \sigma(x))\,\sigma(x)$

sigmoid gate: the whole sigmoid can be backpropagated through in a single step, e.g. (0.73) * (1 - 0.73) ≈ 0.2

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
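The circuit on the "Another example" slides is not reproduced in the transcript. In the CS231n original it is a 2D neuron with a sigmoid activation, f(w, x) = 1 / (1 + exp(-(w0*x0 + w1*x1 + w2))); a minimal sketch under that assumption, using input values that reproduce the numbers quoted above:

```python
import math

w = [2.0, -3.0, -3.0]   # assumed weights w0, w1 and bias w2
x = [-1.0, -2.0]        # assumed inputs x0, x1

# forward pass
dot = w[0]*x[0] + w[1]*x[1] + w[2]           # = 1.0
f = 1.0 / (1.0 + math.exp(-dot))             # sigmoid output, ~0.73

# backward pass
ddot = (1 - f) * f                           # local gradient of the sigmoid, ~0.2
dx = [w[0] * ddot, w[1] * ddot]              # gradients on x0, x1 -> ~[0.4, -0.6]
dw = [x[0] * ddot, x[1] * ddot, 1.0 * ddot]  # gradients on w0, w1, w2 -> ~[-0.2, -0.4, 0.2]

print(f, dx, dw)
```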

Patterns in backward flow
• add gate: gradient distributor
• max gate: gradient router
• mul gate: gradient… "switcher"?

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
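As a concrete summary of these patterns, a small sketch (not from the slides) of the backward rules for scalar add, max, and multiply gates:

```python
def add_backward(dout):
    # add gate distributes the upstream gradient unchanged to both inputs
    return dout, dout

def max_backward(x, y, dout):
    # max gate routes the upstream gradient to the input that won the forward pass
    return (dout if x >= y else 0.0), (dout if y > x else 0.0)

def mul_backward(x, y, dout):
    # mul gate scales the upstream gradient by the *other* input ("switcher")
    return y * dout, x * dout

print(add_backward(2.0), max_backward(3.0, 5.0, 2.0), mul_backward(3.0, 5.0, 2.0))
```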

Gradients add at branches

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

Implementation: forward/backward API

Graph (or Net) object (rough pseudo code; a sketch is given below)

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
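The pseudo code on this slide is not captured in the transcript. It is roughly of the following form; how the gates are stored and what the final loss gate looks like are assumptions of this sketch.

```python
class ComputationalGraph(object):
    # Rough sketch: gates are assumed to be stored in topological order,
    # and every gate implements forward() and backward().
    def __init__(self, gates):
        self.gates = gates

    def forward(self):
        # run every gate forward; the last gate is assumed to output the loss
        for gate in self.gates:
            gate.forward()
        return self.gates[-1].output

    def backward(self):
        # run the gates in reverse, each applying the chain rule locally
        for gate in reversed(self.gates):
            gate.backward()
```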

Implementation: forward/backward API

(x, y, z are scalars)

[Diagram: a multiply gate * with inputs x, y and output z]

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
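The gate code on this slide is likewise missing from the transcript. A minimal sketch of a multiply gate's forward/backward implementation for the scalar case described above (the class and method names are assumptions):

```python
class MultiplyGate(object):
    def forward(self, x, y):
        # compute z = x * y and cache the inputs needed for the backward pass
        self.x = x
        self.y = y
        return x * y

    def backward(self, dz):
        # given dL/dz, apply the chain rule using the local gradients
        # dz/dx = y and dz/dy = x
        dx = self.y * dz
        dy = self.x * dz
        return dx, dy
```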

Summary

• neural nets will be very large: no hope of writing down the gradient formula by hand for all parameters
• backpropagation = recursive application of the chain rule along a computational graph to compute the gradients of all inputs/parameters/intermediates
• implementations maintain a graph structure, where the nodes implement the forward() / backward() API
• forward: compute the result of an operation and save any intermediates needed for gradient computation in memory
• backward: apply the chain rule to compute the gradient of the loss function with respect to the inputs

Where are we now…

Mini-batch SGD

Loop:
1. Sample a batch of data
2. Forward prop it through the graph, get loss
3. Backprop to calculate the gradients
4. Update the parameters using the gradient

Next Lecture: Convolutional Neural Networks
