Lecture 12: Computational Graph, Backpropagation
Aykut Erdem, November 2018, Hacettepe University
Last time… Multilayer Perceptron
!2
• Layer Representation
• (typically) iterate between linear mapping Wx and nonlinear function
• Loss function to measure quality of estimate so far
y_i = W_i x_i
x_{i+1} = σ(y_i)
[Figure: stack of layers x1 → x2 → x3 → x4 with weight matrices W1, W2, W3, W4, producing output y and loss l(y, yi)]
slide by Alex Smola
Last time… Forward Pass
• Output of the network can be written as:
(j indexing hidden units, k indexing the output units, D number of inputs)
• Activation functions f, g: sigmoid/logistic, tanh, or rectified linear (ReLU)
!3
slide by Raquel Urtasun, Richard Zemel, Sanja Fidler
Forward Pass: What does the Network Compute?
Output of the network can be written as:
h_j(x) = f(v_{j0} + Σ_{i=1}^{D} x_i v_{ji})
o_k(x) = g(w_{k0} + Σ_{j=1}^{J} h_j(x) w_{kj})
(j indexing hidden units, k indexing the output units, D number of inputs)
Activation functions f, g: sigmoid/logistic, tanh, or rectified linear (ReLU)
σ(z) = 1 / (1 + exp(-z)),  tanh(z) = (exp(z) - exp(-z)) / (exp(z) + exp(-z)),  ReLU(z) = max(0, z)
Urtasun, Zemel, Fidler (UofT), CSC 411: 10-Neural Networks I, Feb 10, 2016
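For concreteness, the activations and the two layers above can be written directly in NumPy; this is an illustrative sketch, not code from the slides:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))      # sigma(z) = 1 / (1 + exp(-z))

def relu(z):
    return np.maximum(0.0, z)            # ReLU(z) = max(0, z); np.tanh covers tanh

def forward(x, V, v0, W, w0, f=np.tanh, g=sigmoid):
    h = f(v0 + V @ x)                    # h_j(x) = f(v_j0 + sum_i x_i v_ji)
    o = g(w0 + W @ h)                    # o_k(x) = g(w_k0 + sum_j h_j(x) w_kj)
    return o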
Last time… Forward Pass in Python
• Example code for a forward pass for a 3-layer network in Python (see the sketch after this slide)
• Can be implemented efficiently using matrix operations
• Example above: W1 is a matrix of size 4 × 3, W2 is 4 × 4. What about the biases and W3?
!4
slide by Raquel Urtasun, Richard Zemel, Sanja Fidler [http://cs231n.github.io/neural-networks-1/]
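The code itself did not survive the transcript; below is a minimal NumPy sketch of a 3-layer forward pass in the spirit of the linked cs231n notes, with W1 of size 4 × 3 and W2 of size 4 × 4 as stated above. The output layer W3 (1 × 4) and the biases b1, b2, b3 are assumptions added to answer the question on the slide.

import numpy as np

f = lambda z: 1.0 / (1.0 + np.exp(-z))              # activation function (sigmoid)
x = np.random.randn(3, 1)                           # input vector (3x1)
W1, b1 = np.random.randn(4, 3), np.zeros((4, 1))    # first hidden layer: 4x3 weights
W2, b2 = np.random.randn(4, 4), np.zeros((4, 1))    # second hidden layer: 4x4 weights
W3, b3 = np.random.randn(1, 4), np.zeros((1, 1))    # assumed output layer: 1x4 weights
h1 = f(np.dot(W1, x) + b1)                          # first hidden layer activations (4x1)
h2 = f(np.dot(W2, h1) + b2)                         # second hidden layer activations (4x1)
out = np.dot(W3, h2) + b3                           # network output (1x1)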
Backpropagation
!5
Recap: Loss function/Optimization
!6
[Figure: example class scores for three training images under the current linear classifier f(x, W)]
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
We defined a (linear) score function: f(x, W) = Wx
TODO:
1. Define a loss function that quantifies our unhappiness with the scores across the training data.
2. Come up with a way of efficiently finding the parameters that minimize the loss function. (optimization)
!7
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
Softmax Classifier (Multinomial Logistic Regression)
!8
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
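The formulas on these slides are not preserved in the transcript. As a hedged sketch of what the softmax classifier does: it interprets the class scores as unnormalized log-probabilities and minimizes the negative log-likelihood of the correct class. In NumPy (illustrative code, not taken from the slides):

import numpy as np

def softmax_loss(scores, y):
    # scores: unnormalized class scores s = f(x_i, W); y: index of the correct class
    shifted = scores - np.max(scores)                    # shift for numerical stability
    probs = np.exp(shifted) / np.sum(np.exp(shifted))    # softmax probabilities
    return -np.log(probs[y])                             # L_i = -log P(Y = y_i | X = x_i)

# example usage: softmax_loss(np.array([3.2, 5.1, -1.7]), y=0) is about 2.04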
Optimization
!17
Gradient Descent
!18
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
Mini-batch Gradient Descent
• only use a small portion of the training set to compute the gradient
• there are also more fancy update formulas (momentum, Adagrad, RMSProp, Adam, …)
!19-20
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
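A minimal sketch of the mini-batch update loop described above; sample_training_data, evaluate_gradient, loss_fun, and the hyperparameter values are illustrative placeholders supplied by the caller, not code from the slides:

def sgd_loop(train_data, weights, loss_fun, sample_training_data, evaluate_gradient,
             batch_size=256, learning_rate=1e-3, num_steps=1000):
    for _ in range(num_steps):
        data_batch = sample_training_data(train_data, batch_size)        # small portion of the data
        weights_grad = evaluate_gradient(loss_fun, data_batch, weights)  # gradient on the mini-batch
        weights = weights - learning_rate * weights_grad                 # vanilla SGD update
        # fancier formulas (momentum, Adagrad, RMSProp, Adam, ...) change only this update step
    return weights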
The effects of different update formulas
!21
(image credits to Alec Radford)
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
!22
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
!23
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
Computational Graph
!24
[Figure: computational graph, inputs x and W feed a * node producing the scores s; s goes through a hinge loss node, W through a regularization node R, and the two results are summed (+) to give the total loss L]
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
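As one concrete reading of this graph (a hedged sketch: the slide does not spell out the exact loss, so the multiclass SVM hinge loss and L2 regularization below are assumptions):

import numpy as np

def svm_loss_graph(W, x, y, reg):
    s = W.dot(x)                              # * node: scores s = Wx
    margins = np.maximum(0, s - s[y] + 1.0)   # hinge loss node (assumed multiclass SVM form)
    margins[y] = 0.0
    data_loss = np.sum(margins)
    reg_loss = reg * np.sum(W * W)            # R node (assumed L2 regularization)
    return data_loss + reg_loss               # + node: total loss L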
Convolutional Network (AlexNet)
input image, weights
loss
!25
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
Neural Turing Machine
input tape
loss
!26
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
!27-40
e.g. x = -2, y = 5, z = -4
Want: the gradients of the output with respect to x, y, and z
Chain rule: multiply the local gradient of each node by the gradient flowing in from above
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
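The circuit itself is not preserved in the transcript; assuming, as in the cs231n lecture these slides follow, that the worked example is f(x, y, z) = (x + y) * z, the forward pass and the chain-rule backward pass look like this (illustrative code):

# forward pass
x, y, z = -2.0, 5.0, -4.0
q = x + y                 # intermediate node q = 3
f = q * z                 # output f = -12

# backward pass: apply the chain rule from the output back to the inputs
df_dz = q                 # d(q*z)/dz = q, i.e. 3
df_dq = z                 # d(q*z)/dq = z, i.e. -4
df_dx = df_dq * 1.0       # dq/dx = 1, so df/dx = -4
df_dy = df_dq * 1.0       # dq/dy = 1, so df/dy = -4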
!41-46
[Figure: a generic gate f with its input activations flowing forward; during the backward pass the gate multiplies the gradient arriving from above by its "local gradient" and passes the result back to its inputs]
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
Another example: (slides 47-61, the same circuit worked through gate by gate)
[Figure: a larger circuit whose output goes through a sigmoid; the backward pass fills in one gradient per gate as [local gradient] x [upstream gradient]]
(-1) x (-0.20) = 0.20
[local gradient] x [its gradient]: [1] x [0.2] = 0.2 (both inputs!)
[local gradient] x [its gradient]: x0: [2] x [0.2] = 0.4, w0: [-1] x [0.2] = -0.2
sigmoid function
sigmoid gate: (0.73) x (1 - 0.73) ≈ 0.2
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
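The 0.2 at the sigmoid gate follows from the identity dσ/dz = σ(z)(1 - σ(z)); a quick numerical check (illustrative code, not from the slides):

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
s = 0.73                                   # forward output of the sigmoid gate
local_grad = s * (1.0 - s)                 # 0.1971, the ~0.2 on the slide
z = np.log(s / (1.0 - s))                  # invert the sigmoid: z is about 1.0
eps = 1e-5
numeric_grad = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)   # also about 0.1971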
Patterns in backward flow
• add gate: gradient distributor
• max gate: gradient router
• mul gate: gradient… “switcher”?
!62
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
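In code, these patterns look as follows (a minimal sketch; x, y, and upstream are example values, not taken from the slides):

x, y, upstream = 3.0, -1.0, 2.0   # example inputs and upstream gradient dL/dz

# add gate (z = x + y): distributes the upstream gradient unchanged to both inputs
dx_add, dy_add = 1.0 * upstream, 1.0 * upstream

# max gate (z = max(x, y)): routes the upstream gradient to the larger input, 0 to the other
dx_max = upstream if x >= y else 0.0
dy_max = upstream if y > x else 0.0

# mul gate (z = x * y): each input receives the upstream gradient scaled by the other input
dx_mul, dy_mul = y * upstream, x * upstream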
Gradients add at branches
[Figure: a node whose output is used by two branches; the gradients coming back from the branches are summed (+)]
!63
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
Implementation: forward/backward API
!64
Graph (or Net) object. (Rough pseudocode)
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
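The pseudocode on this slide is not preserved in the transcript; the following is a hedged reconstruction of the kind of Graph/Net object being described, in the same rough-pseudocode spirit (method and attribute names such as gates_topologically_sorted are assumptions):

class ComputationalGraph(object):
    # rough pseudocode: the graph stores its gates (nodes) and the wiring between them
    def forward(self, inputs):
        # feed the inputs into the input gates, then run every gate's forward()
        for gate in self.gates_topologically_sorted():
            gate.forward()
        return self.loss   # the final gate outputs the loss

    def backward(self):
        # run every gate's backward() in reverse order, applying the chain rule
        for gate in reversed(self.gates_topologically_sorted()):
            gate.backward()
        return self.inputs_gradients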
Implementation: forward/backward API
!65-66
(x, y, z are scalars)
[Figure: a single * gate with inputs x, y and output z, implemented with a forward() and a backward() method]
slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
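The gate code is likewise missing from the transcript; a hedged sketch of a multiply gate exposing this forward/backward API for scalar inputs (class and attribute names are assumptions):

class MultiplyGate(object):
    def forward(self, x, y):
        z = x * y
        self.x = x        # cache the inputs; they are needed for the local gradients
        self.y = y
        return z

    def backward(self, dz):
        # dz is the upstream gradient dL/dz; chain rule: [local gradient] x [upstream gradient]
        dx = self.y * dz  # dz/dx = y
        dy = self.x * dz  # dz/dy = x
        return [dx, dy]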
Summary
• neural nets will be very large: no hope of writing down gradient formula by hand for all parameters
• backpropagation = recursive application of the chain rule along a computational graph to compute the gradients of all inputs/parameters/intermediates
• implementations maintain a graph structure, where the nodes implement the forward() / backward() API
• forward: compute result of an operation and save any intermediates needed for gradient computation in memory
• backward: apply the chain rule to compute the gradient of the loss function with respect to the inputs
!67
Where are we now…
!68
Mini-batch SGD
Loop:
1. Sample a batch of data
2. Forward prop it through the graph, get loss
3. Backprop to calculate the gradients
4. Update the parameters using the gradient
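Putting the pieces together, the loop above might look like this in code (a minimal sketch; graph, sample_batch, and params are illustrative placeholders supplied by the caller, not code from the slides):

def train(graph, train_data, params, sample_batch, learning_rate=1e-3, num_steps=1000):
    for step in range(num_steps):
        batch = sample_batch(train_data)        # 1. sample a batch of data
        loss = graph.forward(batch, params)     # 2. forward prop it through the graph, get loss
        grads = graph.backward()                # 3. backprop to calculate the gradients
        for i in range(len(params)):            # 4. update the parameters using the gradient
            params[i] = params[i] - learning_rate * grads[i]
    return params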
Next Lecture:
Convolutional Neural Networks
!69