Lecture 12: Computational Graph, Backpropagation
Aykut Erdem, November 2018, Hacettepe University
Last time… Multilayer Perceptron
!2
• Layer Representation
• (typically) iterate between linear mapping Wx and nonlinear function
• Loss function to measure quality of estimate so far
y_i = W_i x_i
x_{i+1} = σ(y_i)
[Figure: stack of layers x1 → x2 → x3 → x4 with weight matrices W1, W2, W3, W4, producing output y and loss l(y, yi)]
slide by Alex Smola
Last time… Forward Pass
• Output of the network can be written as:
(j indexing hidden units, k indexing the output units, D number of inputs)
• Activation functions f, g: sigmoid/logistic, tanh, or rectified linear (ReLU)
!3
slide by Raquel Urtasun, Richard Zemel, Sanja Fidler
Forward Pass: What does the Network Compute?
Output of the network can be written as:
h_j(x) = f(v_{j0} + Σ_{i=1}^{D} x_i v_{ji})
o_k(x) = g(w_{k0} + Σ_{j=1}^{J} h_j(x) w_{kj})
(j indexing hidden units, k indexing the output units, D number of inputs)
Activation functions f, g: sigmoid/logistic, tanh, or rectified linear (ReLU)
σ(z) = 1 / (1 + exp(-z)),  tanh(z) = (exp(z) - exp(-z)) / (exp(z) + exp(-z)),  ReLU(z) = max(0, z)
Urtasun, Zemel, Fidler (UofT), CSC 411: 10-Neural Networks I, Feb 10, 2016
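For concreteness, the activations and the two layers above can be written directly in NumPy; this is an illustrative sketch, not code from the slides:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))      # sigma(z) = 1 / (1 + exp(-z))

def relu(z):
    return np.maximum(0.0, z)            # ReLU(z) = max(0, z); np.tanh covers tanh

def forward(x, V, v0, W, w0, f=np.tanh, g=sigmoid):
    h = f(v0 + V @ x)                    # h_j(x) = f(v_j0 + sum_i x_i v_ji)
    o = g(w0 + W @ h)                    # o_k(x) = g(w_k0 + sum_j h_j(x) w_kj)
    return o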
Last time… Forward Pass in Python
• Example code for a forward pass for a 3-layer network in Python (see the sketch after this slide)
• Can be implemented efficiently using matrix operations
• Example above: W1 is a matrix of size 4 × 3, W2 is 4 × 4. What about the biases and W3?
!4
slide by Raquel Urtasun, Richard Zemel, Sanja Fidler [http://cs231n.github.io/neural-networks-1/]
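The code itself did not survive the transcript; below is a minimal NumPy sketch of a 3-layer forward pass in the spirit of the linked cs231n notes, with W1 of size 4 × 3 and W2 of size 4 × 4 as stated above. The output layer W3 (1 × 4) and the biases b1, b2, b3 are assumptions added to answer the question on the slide.

import numpy as np

f = lambda z: 1.0 / (1.0 + np.exp(-z))              # activation function (sigmoid)
x = np.random.randn(3, 1)                           # input vector (3x1)
W1, b1 = np.random.randn(4, 3), np.zeros((4, 1))    # first hidden layer: 4x3 weights
W2, b2 = np.random.randn(4, 4), np.zeros((4, 1))    # second hidden layer: 4x4 weights
W3, b3 = np.random.randn(1, 4), np.zeros((1, 1))    # assumed output layer: 1x4 weights
h1 = f(np.dot(W1, x) + b1)                          # first hidden layer activations (4x1)
h2 = f(np.dot(W2, h1) + b2)                         # second hidden layer activations (4x1)
out = np.dot(W3, h2) + b3                           # network output (1x1)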
Backpropagation
!5
Recap: Loss function/Optimization
!6
[Figure: example class scores for three training images under the current linear classifier f(x, W)]
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
We defined a (linear) score function: f(x, W) = Wx
TODO:
1. Define a loss function that quantifies our unhappiness with the scores across the training data.
2. Come up with a way of efficiently finding the parameters that minimize the loss function. (optimization)
!7
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
Softmax Classifier (Multinomial Logistic Regression)
!8
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
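The formulas on these slides are not preserved in the transcript. As a hedged sketch of what the softmax classifier does: it interprets the class scores as unnormalized log-probabilities and minimizes the negative log-likelihood of the correct class. In NumPy (illustrative code, not taken from the slides):

import numpy as np

def softmax_loss(scores, y):
    # scores: unnormalized class scores s = f(x_i, W); y: index of the correct class
    shifted = scores - np.max(scores)                    # shift for numerical stability
    probs = np.exp(shifted) / np.sum(np.exp(shifted))    # softmax probabilities
    return -np.log(probs[y])                             # L_i = -log P(Y = y_i | X = x_i)

# example usage: softmax_loss(np.array([3.2, 5.1, -1.7]), y=0) is about 2.04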
Optimization
!17
Gradient Descent
!18
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
Mini-batch Gradient Descent
• only use a small portion of the training set to compute the gradient
• there are also more fancy update formulas (momentum, Adagrad, RMSProp, Adam, …)
!19-20
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
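A minimal sketch of the mini-batch update loop described above; sample_training_data, evaluate_gradient, loss_fun, and the hyperparameter values are illustrative placeholders supplied by the caller, not code from the slides:

def sgd_loop(train_data, weights, loss_fun, sample_training_data, evaluate_gradient,
             batch_size=256, learning_rate=1e-3, num_steps=1000):
    for _ in range(num_steps):
        data_batch = sample_training_data(train_data, batch_size)        # small portion of the data
        weights_grad = evaluate_gradient(loss_fun, data_batch, weights)  # gradient on the mini-batch
        weights = weights - learning_rate * weights_grad                 # vanilla SGD update
        # fancier formulas (momentum, Adagrad, RMSProp, Adam, ...) change only this update step
    return weights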
The effects of different update formulas
!21
(image credits to Alec Radford)
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
!22
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
!23
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
Computational Graph
!24
[Figure: computational graph, inputs x and W feed a * node producing the scores s; s goes through a hinge loss node, W through a regularization node R, and the two results are summed (+) to give the total loss L]
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
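As one concrete reading of this graph (a hedged sketch: the slide does not spell out the exact loss, so the multiclass SVM hinge loss and L2 regularization below are assumptions):

import numpy as np

def svm_loss_graph(W, x, y, reg):
    s = W.dot(x)                              # * node: scores s = Wx
    margins = np.maximum(0, s - s[y] + 1.0)   # hinge loss node (assumed multiclass SVM form)
    margins[y] = 0.0
    data_loss = np.sum(margins)
    reg_loss = reg * np.sum(W * W)            # R node (assumed L2 regularization)
    return data_loss + reg_loss               # + node: total loss L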
Convolutional Network (AlexNet)
input image, weights
loss
!25
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
Neural Turing Machine
input tape
loss
!26
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
!27-40
e.g. x = -2, y = 5, z = -4
Want: the gradients of the output with respect to x, y, and z
Chain rule: multiply the local gradient of each node by the gradient flowing in from above
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
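The circuit itself is not preserved in the transcript; assuming, as in the cs231n lecture these slides follow, that the worked example is f(x, y, z) = (x + y) * z, the forward pass and the chain-rule backward pass look like this (illustrative code):

# forward pass
x, y, z = -2.0, 5.0, -4.0
q = x + y                 # intermediate node q = 3
f = q * z                 # output f = -12

# backward pass: apply the chain rule from the output back to the inputs
df_dz = q                 # d(q*z)/dz = q, i.e. 3
df_dq = z                 # d(q*z)/dq = z, i.e. -4
df_dx = df_dq * 1.0       # dq/dx = 1, so df/dx = -4
df_dy = df_dq * 1.0       # dq/dy = 1, so df/dy = -4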
!41-46
[Figure: a generic gate f with its input activations flowing forward; during the backward pass the gate multiplies the gradient arriving from above by its "local gradient" and passes the result back to its inputs]
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
Another example: (slides 47-61, the same circuit worked through gate by gate)
[Figure: a larger circuit whose output goes through a sigmoid; the backward pass fills in one gradient per gate as [local gradient] x [upstream gradient]]
(-1) x (-0.20) = 0.20
[local gradient] x [its gradient]: [1] x [0.2] = 0.2 (both inputs!)
[local gradient] x [its gradient]: x0: [2] x [0.2] = 0.4, w0: [-1] x [0.2] = -0.2
sigmoid function
sigmoid gate: (0.73) x (1 - 0.73) ≈ 0.2
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
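The 0.2 at the sigmoid gate follows from the identity dσ/dz = σ(z)(1 - σ(z)); a quick numerical check (illustrative code, not from the slides):

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
s = 0.73                                   # forward output of the sigmoid gate
local_grad = s * (1.0 - s)                 # 0.1971, the ~0.2 on the slide
z = np.log(s / (1.0 - s))                  # invert the sigmoid: z is about 1.0
eps = 1e-5
numeric_grad = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)   # also about 0.1971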
Patterns in backward flow
• add gate: gradient distributor
• max gate: gradient router
• mul gate: gradient… “switcher”?
!62
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
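In code, these patterns look as follows (a minimal sketch; x, y, and upstream are example values, not taken from the slides):

x, y, upstream = 3.0, -1.0, 2.0   # example inputs and upstream gradient dL/dz

# add gate (z = x + y): distributes the upstream gradient unchanged to both inputs
dx_add, dy_add = 1.0 * upstream, 1.0 * upstream

# max gate (z = max(x, y)): routes the upstream gradient to the larger input, 0 to the other
dx_max = upstream if x >= y else 0.0
dy_max = upstream if y > x else 0.0

# mul gate (z = x * y): each input receives the upstream gradient scaled by the other input
dx_mul, dy_mul = y * upstream, x * upstream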
Gradients add at branches
[Figure: a node whose output is used by two branches; the gradients coming back from the branches are summed (+)]
!63
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
Implementation: forward/backward API
!64
Graph (or Net) object. (Rough pseudocode)
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
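The pseudocode on this slide is not preserved in the transcript; the following is a hedged reconstruction of the kind of Graph/Net object being described, in the same rough-pseudocode spirit (method and attribute names such as gates_topologically_sorted are assumptions):

class ComputationalGraph(object):
    # rough pseudocode: the graph stores its gates (nodes) and the wiring between them
    def forward(self, inputs):
        # feed the inputs into the input gates, then run every gate's forward()
        for gate in self.gates_topologically_sorted():
            gate.forward()
        return self.loss   # the final gate outputs the loss

    def backward(self):
        # run every gate's backward() in reverse order, applying the chain rule
        for gate in reversed(self.gates_topologically_sorted()):
            gate.backward()
        return self.inputs_gradients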
Implementation: forward/backward API
!65-66
(x, y, z are scalars)
[Figure: a single * gate with inputs x, y and output z, implemented with a forward() and a backward() method]
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
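The gate code is likewise missing from the transcript; a hedged sketch of a multiply gate exposing this forward/backward API for scalar inputs (class and attribute names are assumptions):

class MultiplyGate(object):
    def forward(self, x, y):
        z = x * y
        self.x = x        # cache the inputs; they are needed for the local gradients
        self.y = y
        return z

    def backward(self, dz):
        # dz is the upstream gradient dL/dz; chain rule: [local gradient] x [upstream gradient]
        dx = self.y * dz  # dz/dx = y
        dy = self.x * dz  # dz/dy = x
        return [dx, dy]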
Summary
• neural nets will be very large: no hope of writing down gradient formula by hand for all parameters
• backpropagation = recursive application of the chain rule along a computational graph to compute the gradients of all inputs/parameters/intermediates
• implementations maintain a graph structure, where the nodes implement the forward() / backward() API
• forward: compute result of an operation and save any intermediates needed for gradient computation in memory
• backward: apply the chain rule to compute the gradient of the loss function with respect to the inputs
!67
Where are we now…
!68
Mini-batch SGD
Loop:
1. Sample a batch of data
2. Forward prop it through the graph, get loss
3. Backprop to calculate the gradients
4. Update the parameters using the gradient
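Putting the pieces together, the loop above might look like this in code (a minimal sketch; graph, sample_batch, and params are illustrative placeholders supplied by the caller, not code from the slides):

def train(graph, train_data, params, sample_batch, learning_rate=1e-3, num_steps=1000):
    for step in range(num_steps):
        batch = sample_batch(train_data)        # 1. sample a batch of data
        loss = graph.forward(batch, params)     # 2. forward prop it through the graph, get loss
        grads = graph.backward()                # 3. backprop to calculate the gradients
        for i in range(len(params)):            # 4. update the parameters using the gradient
            params[i] = params[i] - learning_rate * grads[i]
    return params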
Next Lecture:
Convolutional Neural Networks
!69