
  • CS 4803 / 7643: Deep Learning

    Zsolt Kira

    Georgia Tech

    Topics: – Optimization

    – Computing Gradients

  • Admin

    • PS0 grades out

    • Note: PS0 is 5% (pass/fail), HW0 is 15% (points)

    • HW1 coming out in week 4

    • Group formation (see piazza post)!

    • Thursday guest lecture by Peter on training NNs

    (C) Dhruv Batra and Zsolt Kira 2

  • Recap from last time

    (C) Dhruv Batra and Zsolt Kira 3

  • Regularization

    4

    Data loss: Model predictions should match training data

    Regularization: Prevent the model from doing too well on training data

    λ = regularization strength (hyperparameter)

    Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

    Occam’s Razor: “Among competing hypotheses, the simplest is the best” (William of Ockham, 1285-1347)

  • Regularization

    5

    Data loss: Model predictions should match training data

    Regularization: Prevent the model from doing too well on training data

    λ = regularization strength (hyperparameter)

    Simple examples: L2 regularization, L1 regularization, Elastic net (L1 + L2)

    More complex: Dropout, Batch normalization, Stochastic depth, fractional pooling, etc.

    Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
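
As a concrete illustration of the simple regularizers listed above, here is a minimal NumPy sketch; the function names and the example strength `lam` are illustrative, not from the slides.

```python
import numpy as np

def l2_reg(W):
    # L2 regularization: sum of squared weights
    return np.sum(W * W)

def l1_reg(W):
    # L1 regularization: sum of absolute weights
    return np.sum(np.abs(W))

def elastic_net(W, beta=0.5):
    # Elastic net: a weighted mix of L2 and L1 (beta is an illustrative mixing weight)
    return beta * l2_reg(W) + l1_reg(W)

# Full objective = data loss + lambda * R(W), where lambda is the regularization strength
def total_loss(data_loss, W, lam=1e-3):
    return data_loss + lam * l2_reg(W)
```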

  • (Before) Linear score function:

    Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

    Neural networks: without the brain stuff

  • 7

    (Before) Linear score function:

    (Now) 2-layer Neural Network

    Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

    Neural networks: without the brain stuff

  • 8

    [Figure: x (3072-d input) → W1 → h (100 hidden units) → W2 → s (10 class scores)]

    Neural networks: without the brain stuff

    Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

    (Before) Linear score function:

    (Now) 2-layer Neural Network
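
The sizes on this slide (3072 → 100 → 10) correspond to a CIFAR-10-sized input; a minimal NumPy sketch of the 2-layer score function f = W2 max(0, W1 x), with bias terms omitted for brevity (an assumption, not from the slides):

```python
import numpy as np

# Sizes from the slide: x is 3072-d, hidden layer h is 100-d, scores s are 10-d.
x = np.random.randn(3072)
W1 = 0.01 * np.random.randn(100, 3072)
W2 = 0.01 * np.random.randn(10, 100)

h = np.maximum(0, W1.dot(x))   # hidden layer: elementwise max(0, W1 x)
s = W2.dot(h)                  # class scores: s = W2 h
```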

  • 9

    (Before) Linear score function:

    (Now) 2-layer Neural Network or 3-layer Neural Network

    Neural networks: without the brain stuff

    Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  • 10

    Impulses carried toward cell body

    Impulses carried away from cell body

    This image by Felipe Perucho is licensed under CC-BY 3.0

    dendrite

    cell body

    axon

    presynaptic terminal

    Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  • 11

    sigmoid activation function

    Impulses carried toward cell body

    Impulses carried away from cell body

    This image by Felipe Perucho is licensed under CC-BY 3.0

    dendrite

    cell body

    axon

    presynaptic terminal

    Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  • Sigmoid

    tanh

    ReLU

    Leaky ReLU

    Maxout

    ELU

    Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

    Activation functions
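
A minimal NumPy sketch of the activation functions named above; the shape parameters (alpha) are common defaults, not values taken from the slides.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

def maxout(x, W1, b1, W2, b2):
    # Maxout: elementwise max over two (or more) linear pieces
    return np.maximum(W1.dot(x) + b1, W2.dot(x) + b2)
```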

  • Multilayer Networks

    • Cascade Neurons together

    • The output from one layer is the input to the next

    • Each layer has its own set of weights

    (C) Dhruv Batra and Zsolt Kira 13. Image Credit: Andrej Karpathy, CS231n

  • Plan for Today

    • (Finish) Optimization

    • Computing Gradients

    (C) Dhruv Batra and Zsolt Kira 14

  • Optimization

  • Strategy: Follow the slope

    Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  • Strategy: Follow the slope

    In 1-dimension, the derivative of a function:

    Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  • Strategy: Follow the slope

    In 1-dimension, the derivative of a function:

    In multiple dimensions, the gradient is the vector of (partial derivatives) along each dimension

    The slope in any direction is the dot product of the direction with the gradient. The direction of steepest descent is the negative gradient.

    Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
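
A small NumPy illustration of the two statements above, on a toy function f(w) = 0.5 ||w||^2 (chosen here only because its gradient is simply w; not from the slides):

```python
import numpy as np

w = np.array([3.0, -1.0])
grad = w                                          # gradient of f(w) = 0.5 * ||w||^2 is w
u = np.array([1.0, 0.0])                          # a unit-length direction
slope_along_u = u.dot(grad)                       # slope in direction u = dot(u, gradient) -> 3.0
steepest_descent = -grad / np.linalg.norm(grad)   # unit direction of steepest descent
```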

  • Gradient Descent

    Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
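
A runnable sketch of the gradient descent loop, using a toy quadratic loss so the analytic gradient is easy to write; the toy loss and `step_size` are illustrative choices, not from the slides.

```python
import numpy as np

def loss_fun(W, X, y):
    # toy squared-error loss, used only so the example runs end to end
    return 0.5 * np.sum((X.dot(W) - y) ** 2)

def evaluate_gradient(W, X, y):
    # analytic gradient of the toy loss above
    return X.T.dot(X.dot(W) - y)

X = np.random.randn(100, 5)
y = np.random.randn(100)
W = np.zeros(5)
step_size = 1e-3

for _ in range(100):
    W_grad = evaluate_gradient(W, X, y)
    W += -step_size * W_grad   # step in the direction of the negative gradient
```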

  • [Figure: gradient descent on a loss surface over weights W_1 and W_2, stepping from the original W in the negative gradient direction]

    Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  • Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  • Full sum expensive when N is large!

    Approximate the sum using a minibatch of examples; 32 / 64 / 128 are common choices.

    Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

    Stochastic Gradient Descent (SGD)
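
A minimal NumPy sketch of minibatch SGD on the same kind of toy squared-error loss; the batch size 64 is one of the common choices mentioned on the slide, and the other constants are illustrative.

```python
import numpy as np

N, D = 10000, 5
X = np.random.randn(N, D)
y = np.random.randn(N)
W = np.zeros(D)
step_size, batch_size = 1e-3, 64

for it in range(1000):
    idx = np.random.choice(N, batch_size, replace=False)   # sample a minibatch
    X_b, y_b = X[idx], y[idx]
    grad = X_b.T.dot(X_b.dot(W) - y_b) / batch_size         # gradient on the minibatch only
    W += -step_size * grad
```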

  • (C) Dhruv Batra and Zsolt Kira 24

  • How do we compute gradients?

    • Analytic or “Manual” Differentiation

    • Symbolic Differentiation

    • Numerical Differentiation

    • Automatic Differentiation

      – Forward mode AD

      – Reverse mode AD (aka “backprop”)

    (C) Dhruv Batra and Zsolt Kira 25

  • (C) Dhruv Batra and Zsolt Kira 26

  • How do we compute gradients?

    • Analytic or “Manual” Differentiation

    • Symbolic Differentiation

    • Numerical Differentiation

    • Automatic Differentiation

      – Forward mode AD

      – Reverse mode AD (aka “backprop”)

    (C) Dhruv Batra and Zsolt Kira 27

  • (C) Dhruv Batra and Zsolt Kira 28. By Brnbrnz (Own work) [CC BY-SA 4.0 (http://creativecommons.org/licenses/by-sa/4.0)]

  • current W:

    [0.34,-1.11,0.78,0.12,0.55,2.81,-3.1,-1.5,0.33,…] loss 1.25347

    gradient dW:

    [?,?,?,?,?,?,?,?,?,…]

    Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  • current W:

    [0.34,-1.11,0.78,0.12,0.55,2.81,-3.1,-1.5,0.33,…] loss 1.25347

    gradient dW:

    [?,?,?,?,?,?,?,?,?,…]

    Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

    W + h (first dim):

    [0.34 + 0.0001,-1.11,0.78,0.12,0.55,2.81,-3.1,-1.5,0.33,…] loss 1.25322

  • gradient dW:

    [-2.5,?,?,?,?,?,?,?,?,…]

    (1.25322 - 1.25347) / 0.0001 = -2.5

    current W:

    [0.34,-1.11,0.78,0.12,0.55,2.81,-3.1,-1.5,0.33,…] loss 1.25347

    Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

    W + h (first dim):

    [0.34 + 0.0001,-1.11,0.78,0.12,0.55,2.81,-3.1,-1.5,0.33,…] loss 1.25322

  • gradient dW:

    [-2.5,?,?,?,?,?,?,?,?,…]

    current W:

    [0.34,-1.11,0.78,0.12,0.55,2.81,-3.1,-1.5,0.33,…] loss 1.25347

    W + h (second dim):

    [0.34,-1.11 + 0.0001,0.78,0.12,0.55,2.81,-3.1,-1.5,0.33,…] loss 1.25353

    Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  • gradient dW:

    [-2.5,0.6,?,?,?,?,?,?,?,…]

    current W:

    [0.34,-1.11,0.78,0.12,0.55,2.81,-3.1,-1.5,0.33,…] loss 1.25347

    W + h (second dim):

    [0.34,-1.11 + 0.0001,0.78,0.12,0.55,2.81,-3.1,-1.5,0.33,…] loss 1.25353

    (1.25353 - 1.25347) / 0.0001 = 0.6

    Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  • gradient dW:

    [-2.5,0.6,?,?,?,?,?,?,?,…]

    current W:

    [0.34,-1.11,0.78,0.12,0.55,2.81,-3.1,-1.5,0.33,…] loss 1.25347

    W + h (third dim):

    [0.34,-1.11,0.78 + 0.0001,0.12,0.55,2.81,-3.1,-1.5,0.33,…] loss 1.25347

    Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  • gradient dW:

    [-2.5,0.6,0,?,?,?,?,?,?,…]

    current W:

    [0.34,-1.11,0.78,0.12,0.55,2.81,-3.1,-1.5,0.33,…] loss 1.25347

    W + h (third dim):

    [0.34,-1.11,0.78 + 0.0001,0.12,0.55,2.81,-3.1,-1.5,0.33,…] loss 1.25347

    (1.25347 - 1.25347) / 0.0001 = 0

    Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
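
The procedure worked through above (bump one dimension by h = 0.0001, recompute the loss, divide) as a minimal NumPy sketch; a centered difference (f(W+h) - f(W-h)) / 2h is usually more accurate, but the forward difference below matches the slides.

```python
import numpy as np

def numerical_gradient(f, W, h=1e-4):
    # Finite-difference gradient of a scalar function f at W, one dimension at a time:
    # dW[i] ~ (f(W + h * e_i) - f(W)) / h, exactly as in the worked example above.
    grad = np.zeros_like(W)
    fW = f(W)
    for i in range(W.size):
        old = W.flat[i]
        W.flat[i] = old + h              # bump a single dimension
        grad.flat[i] = (f(W) - fW) / h   # slope along that dimension
        W.flat[i] = old                  # restore
    return grad
```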


  • Numerical gradient: slow :(, approximate :(, easy to write :)

    Analytic gradient: fast :), exact :), error-prone :(

    In practice: Derive the analytic gradient, then check your implementation with the numerical gradient. This is called a gradient check.

    Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

    Numerical vs Analytic Gradients
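
A sketch of such a gradient check built on the `numerical_gradient` helper above; the relative-error formula and tolerance are common conventions, not prescribed by the slides.

```python
import numpy as np

def gradient_check(f, analytic_grad, W, h=1e-4, tol=1e-5):
    # Compare the (fast, exact, error-prone) analytic gradient against the
    # (slow, approximate, easy) numerical one.
    num_grad = numerical_gradient(f, W, h)
    rel_error = np.abs(num_grad - analytic_grad) / np.maximum(
        1e-8, np.abs(num_grad) + np.abs(analytic_grad))
    return rel_error.max() < tol
```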

  • Perceptron

    Output, Loss, Update

    • Simple linear hyperplanes

      – Regression perceptron (activation function = identity)


  • Update Rule for Sigmoid Activation Function


  • Logistic Regression as a Cascade

    (C) Dhruv Batra and Zsolt Kira 43

    Given a library of simple functions

    Compose them into a complicated function

    Slide Credit: Marc'Aurelio Ranzato, Yann LeCun

  • [Figure: computational graph for a linear classifier: x and W feed a multiply node (*) to produce scores s, which go through the hinge loss; a regularization term R is added (+) to give the total loss L]

    Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

    Computational Graph

  • [Figure: computational graph from input image and weights to the loss]

    Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission.

    Convolutional network (AlexNet)

    Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  • Figure reproduced with permission from a Twitter post by Andrej Karpathy.

    [Figure: computational graph from input image to loss]

    Neural Turing Machine

    Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  • Computational Graphs

    • Notation

    (C) Dhruv Batra and Zsolt Kira 48

  • Example

    (C) Dhruv Batra and Zsolt Kira 49

    [Figure: example computational graph over inputs x1 and x2 with sin( ), multiply (*), and add (+) nodes]

  • Any DAG of differentiable modules is allowed!

    Slide Credit: Marc'Aurelio Ranzato. (C) Dhruv Batra and Zsolt Kira 50

    Computational Graph

  • Key Computation: Forward-Prop

    (C) Dhruv Batra and Zsolt Kira 51. Slide Credit: Marc'Aurelio Ranzato, Yann LeCun

  • Key Computation: Back-Prop

    (C) Dhruv Batra and Zsolt Kira 52. Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
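
One common way to realize this forward-prop / back-prop pattern is a module whose forward() caches its inputs and whose backward() multiplies the upstream gradient by its local gradients. A minimal sketch for a multiply module (the class name and interface are illustrative, not from the slides):

```python
class Multiply:
    def forward(self, x, w):
        self.x, self.w = x, w      # cache inputs for the backward pass
        return x * w

    def backward(self, dout):
        # chain rule: gradient to each input = upstream gradient * local gradient
        dx = dout * self.w
        dw = dout * self.x
        return dx, dw

# usage: forward then backward with an upstream gradient of 1.0
gate = Multiply()
out = gate.forward(3.0, -4.0)      # -12.0
dx, dw = gate.backward(1.0)        # dx = -4.0, dw = 3.0
```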

  • Neural Network Training

    • Step 1: Compute Loss on mini-batch [F-Pass]

    (C) Dhruv Batra and Zsolt Kira 53. Slide Credit: Marc'Aurelio Ranzato, Yann LeCun


  • Neural Network Training

    • Step 1: Compute Loss on mini-batch [F-Pass]

    • Step 2: Compute gradients wrt parameters [B-Pass]

    (C) Dhruv Batra and Zsolt Kira 56. Slide Credit: Marc'Aurelio Ranzato, Yann LeCun


  • Neural Network Training

    • Step 1: Compute Loss on mini-batch [F-Pass]

    • Step 2: Compute gradients wrt parameters [B-Pass]

    • Step 3: Use gradient to update parameters

    (C) Dhruv Batra and Zsolt Kira 59. Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
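
Putting the three steps together on a tiny 2-layer ReLU network with a squared-error loss; everything here (sizes, random data, step size) is illustrative, but the structure follows the slide: forward pass, backward pass, parameter update.

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(64, 20)        # one minibatch of 64 examples, 20 features
y = np.random.randn(64, 1)
W1 = 0.1 * np.random.randn(20, 10)
W2 = 0.1 * np.random.randn(10, 1)
step_size = 1e-2

for it in range(200):
    # Step 1: compute loss on the minibatch (F-Pass)
    h = np.maximum(0, X.dot(W1))
    pred = h.dot(W2)
    loss = 0.5 * np.mean((pred - y) ** 2)

    # Step 2: compute gradients w.r.t. parameters (B-Pass)
    dpred = (pred - y) / X.shape[0]
    dW2 = h.T.dot(dpred)
    dh = dpred.dot(W2.T) * (h > 0)   # ReLU gate routes gradient only where h > 0
    dW1 = X.T.dot(dh)

    # Step 3: use the gradients to update the parameters
    W1 -= step_size * dW1
    W2 -= step_size * dW2
```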

  • (C) Dhruv Batra and Zsolt Kira 60

  • (C) Dhruv Batra and Zsolt Kira 61

  • 62

    Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

    Backpropagation: a simple example


  • 64

    e.g. x = -2, y = 5, z = -4

    Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

    Backpropagation: a simple example

  • 65

    e.g. x = -2, y = 5, z = -4

    Want:

    Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

    Backpropagation: a simple example


  • 74

    e.g. x = -2, y = 5, z = -4

    Want:

    Chain rule:

    Upstream gradient

    Local gradient

    Backpropagation: a simple example

    Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
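
In CS231n this simple example is f(x, y, z) = (x + y) z with x = -2, y = 5, z = -4; assuming that is the function on these slides, the forward and backward passes are:

```python
x, y, z = -2.0, 5.0, -4.0

# forward pass
q = x + y            # q = 3
f = q * z            # f = -12

# backward pass: at every node, downstream = upstream gradient * local gradient
df_dq = z            # mul gate: local gradient w.r.t. q is z -> -4
df_dz = q            # mul gate: local gradient w.r.t. z is q ->  3
df_dx = df_dq * 1.0  # add gate passes the gradient through   -> -4
df_dy = df_dq * 1.0  #                                           -> -4
```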


  • Patterns in backward flow

    Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n


  • add gate: gradient distributor

    Patterns in backward flow

    Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  • add gate: gradient distributor

    Q: What is a max gate?

    Patterns in backward flow

    Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  • add gate: gradient distributor

    max gate: gradient router

    Patterns in backward flow

    Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  • add gate: gradient distributor

    max gate: gradient router

    Q: What is a mul gate?

    Patterns in backward flow

    Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  • add gate: gradient distributor

    max gate: gradient router

    mul gate: gradient switcher

    Patterns in backward flow

    Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
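
The three patterns above as tiny local backward rules (a sketch; `dout` stands for the upstream gradient arriving at the gate):

```python
def add_backward(dout):
    # add gate: gradient distributor -- both inputs receive the upstream gradient unchanged
    return dout, dout

def max_backward(x, y, dout):
    # max gate: gradient router -- only the input that won the max receives the gradient
    return (dout, 0.0) if x >= y else (0.0, dout)

def mul_backward(x, y, dout):
    # mul gate: gradient "switcher" -- each input receives upstream times the other input
    return dout * y, dout * x
```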

  • 85

    Another example:

    Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n


  • 89

    Another example:

    Upstream gradient

    Local gradient

    Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n


  • 97

    Another example:

    [upstream gradient] x [local gradient]

    [0.2] x [1] = 0.2

    [0.2] x [1] = 0.2 (both inputs!)

    Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n


  • 99

    Another example:

    [upstream gradient] x [local gradient]

    x0: [0.2] x [2] = 0.4

    w0: [0.2] x [-1] = -0.2

    Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
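
The numbers above match the CS231n example f(w, x) = 1 / (1 + exp(-(w0 x0 + w1 x1 + w2))) with w0 = 2, x0 = -1, w1 = -3, x1 = -2, w2 = -3; assuming those are the values on these slides, a sketch of the full forward/backward pass:

```python
import numpy as np

w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0

# forward pass
s = w0 * x0 + w1 * x1 + w2      # 1.0
f = 1.0 / (1.0 + np.exp(-s))    # ~0.73

# backward pass: [upstream gradient] x [local gradient] at each input
df_ds = (1.0 - f) * f           # sigmoid local gradient -> ~0.2
dw0 = df_ds * x0                # 0.2 * (-1) = -0.2
dx0 = df_ds * w0                # 0.2 *   2  =  0.4
dw1 = df_ds * x1                # 0.2 * (-2) = -0.4
dx1 = df_ds * w1                # 0.2 * (-3) = -0.6
dw2 = df_ds * 1.0               # the add gate passes the 0.2 straight through
```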

  • 100

    sigmoid function

    sigmoid gate

    Computational graph representation may not be unique. Choose one where local gradients at each node can be easily expressed!

    Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  • sigmoid gate

    101

    [upstream gradient] x [local gradient]

    [1.00] x [(1 - 0.73) (0.73)] = 0.2

    sigmoid function

    Computational graph representation may not be unique. Choose one where local gradients at each node can be easily expressed!

    Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
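
The shortcut on this slide as a one-liner: the sigmoid gate's local gradient is (1 - σ(s)) σ(s), so with a forward value of about 0.73 and an upstream gradient of 1.00, the gradient at its input is about 0.2.

```python
sigma = 0.73                          # forward value of the sigmoid gate (from the slide)
d_s = 1.00 * (1.0 - sigma) * sigma    # [upstream] x [local] = ~0.197, i.e. the 0.2 used above
```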