CS231n Discussion Section: Backpropagation
April 17, 2020
Slides credits: Barak Oshri, Vincent Chen, Nish Khandwala, Yi Wen
TA: Yi Wen
cs231n.stanford.edu/slides/2020/section_2_backprop.pdf


Page 1

Backpropagation

Slides credits: Barak Oshri, Vincent Chen, Nish Khandwala, Yi Wen

TA: Yi Wen

April 17, 2020
CS231n Discussion Section

Page 2

Agenda

● Motivation
● Backprop Tips & Tricks
● Matrix calculus primer

Page 3

Agenda

● Motivation
● Backprop Tips & Tricks
● Matrix calculus primer

Page 4

Motivation

Recall: the optimization objective is to minimize the loss

Page 5

Motivation

Recall: the optimization objective is to minimize the loss

Goal: how should we tweak the parameters to decrease the loss?

Page 6

Agenda

● Motivation
● Backprop Tips & Tricks
● Matrix calculus primer

Page 7

A Simple Example

Loss

Goal: Tweak the parameters to minimize loss

=> minimize a multivariable function in parameter space

Page 8

A Simple Example

=> minimize a multivariable function

Plotted on WolframAlpha

Page 9

Approach #1: Random Search

Intuition: take random steps in the domain of the function, keeping a step only when it decreases the loss

Page 10

Approach #2: Numerical Gradient

Intuition: the rate of change of a function with respect to a variable, measured over a small region around the current point

Page 11

Approach #2: Numerical Gradient

Intuition: the rate of change of a function with respect to a variable, measured over a small region around the current point

Finite Differences:
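The finite-difference formula itself was an image on the original slide; the standard central-difference approximation it refers to is

\[
\frac{\partial f}{\partial x_i} \approx \frac{f(x + h\,e_i) - f(x - h\,e_i)}{2h}, \qquad h \text{ small (e.g. } 10^{-5}\text{)}.
\]

A minimal NumPy sketch of this idea (the function name is illustrative, not from the slides):

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Central-difference estimate of the gradient of f at x.

    x is perturbed in place and restored after each coordinate."""
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        i = it.multi_index
        old = x[i]
        x[i] = old + h
        fxph = f(x)                       # f(x + h * e_i)
        x[i] = old - h
        fxmh = f(x)                       # f(x - h * e_i)
        x[i] = old                        # restore the original value
        grad[i] = (fxph - fxmh) / (2 * h)
        it.iternext()
    return grad
```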

Page 12

Approach #3: Analytical Gradient

Recall: partial derivative by limit definition
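The limit definition shown on the slide was an image; the standard formula is

\[
\frac{\partial f}{\partial x_i}(x) \;=\; \lim_{h \to 0} \frac{f(x + h\,e_i) - f(x)}{h}.
\]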

Page 13

Approach #3: Analytical Gradient

Recall: chain rule

Page 14

Approach #3: Analytical Gradient

Recall: chain rule

E.g.
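The slide's worked example was an image and is not recoverable here; a standard instance of the chain rule of the same flavor: for \( f(w, b) = \sigma(wx + b) \) with the sigmoid \( \sigma \),

\[
\frac{\partial f}{\partial w} = \sigma'(wx + b)\,x, \qquad \frac{\partial f}{\partial b} = \sigma'(wx + b), \qquad \sigma'(z) = \sigma(z)\bigl(1 - \sigma(z)\bigr).
\]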

Page 15

Approach #3: Analytical Gradient

Recall: chain rule

E.g.

Page 16

Approach #3: Analytical Gradient

Recall: chain rule

Intuition: upstream gradient values propagate backwards -- we can reuse them!
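A small numeric sketch of this reuse (the expression is illustrative, not from the slides): backprop through f = (x*y + z)**2, where each step multiplies a local gradient by an upstream gradient that was already computed.

```python
# Forward pass with named intermediates: a = x*y, b = a + z, f = b**2
x, y, z = 3.0, -2.0, 5.0
a = x * y                 # a = -6.0
b = a + z                 # b = -1.0
f = b ** 2                # f =  1.0

# Backward pass: reuse upstream gradients instead of rederiving from scratch
df_db = 2 * b             # local gradient of b**2
df_da = df_db * 1.0       # reuses df_db; local gradient of (a + z) w.r.t. a is 1
df_dz = df_db * 1.0       # reuses df_db
df_dx = df_da * y         # reuses df_da; local gradient of x*y w.r.t. x is y
df_dy = df_da * x         # reuses df_da
```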

Page 17

Gradient

“direction and rate of fastest increase”

Numerical Gradient vs Analytical Gradient
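In practice the two are reconciled with a gradient check: compute the cheap analytical gradient, then compare it against the slow but nearly exact numerical one. A sketch, reusing the numerical_gradient helper sketched above:

```python
f = lambda x: np.sum(x ** 2)             # toy function with known gradient 2x
x = np.random.randn(5)

grad_analytic = 2 * x
grad_numeric = numerical_gradient(f, x)

# Relative error; values near 1e-7 or smaller usually indicate a correct gradient
rel_error = np.abs(grad_analytic - grad_numeric) / \
    np.maximum(1e-8, np.abs(grad_analytic) + np.abs(grad_numeric))
print(rel_error.max())
```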

Page 18

What about Autograd?

● Deep learning frameworks can automatically perform backprop!
● Problems related to the underlying gradients can still surface when you debug your models

“Yes You Should Understand Backprop”

https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b
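A minimal autograd example (PyTorch is an assumption here; the slides do not name a specific framework):

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
w = torch.tensor(-2.0, requires_grad=True)
b = torch.tensor(5.0, requires_grad=True)

f = (w * x + b) ** 2     # forward pass records the computation graph
f.backward()             # autograd runs backprop for us

print(x.grad, w.grad, b.grad)   # matches the hand-derived chain rule
```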

Page 19

Problem Statement: Backpropagation

Given a function f of inputs x, labels y, and parameters θ, compute the gradient of the loss with respect to θ

Page 20

Problem Statement: Backpropagation

An algorithm for computing the gradient of a compound function as a series of local, intermediate gradients:

1. Identify intermediate functions (forward prop)
2. Compute local gradients (chain rule)
3. Combine with upstream error signal to get full gradient

A single module in the graph, with input x, output y, and parameters W, b:

forward:  y = local(x, W, b)
backward: dx, dW, db = grad_local(dy, x, W, b)

The upstream gradient dy flows in from the output side; the module emits dx for its input and dW, db for its parameters.
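A sketch of this module interface for an affine layer (the class follows the slide's local/grad_local naming; the shapes are assumptions):

```python
import numpy as np

class AffineModule:
    """y = x @ W + b with x: (N, D), W: (D, M), b: (M,)."""
    def __init__(self, W, b):
        self.W, self.b = W, b

    def local(self, x):
        self.x = x                  # cache the input for the backward pass
        return x @ self.W + self.b

    def grad_local(self, dy):
        dx = dy @ self.W.T          # (N, M) @ (M, D) -> (N, D)
        dW = self.x.T @ dy          # (D, N) @ (N, M) -> (D, M)
        db = dy.sum(axis=0)         # sum out the batch dimension
        return dx, dW, db
```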

Page 21

Modularity: Previous Example

Compound function

Intermediate Variables (forward propagation)

Page 22

Modularity: 2-Layer Neural Network

Compound function

Intermediate Variables (forward propagation)

=> squared Euclidean distance between the network's prediction and the label
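A sketch of the forward pass with named intermediates for such a network (the slide's exact nonlinearity is not recoverable; sigmoid is assumed, as are the variable names):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# x: (N, D) inputs, y: (N, M) labels; W1: (D, H), b1: (H,), W2: (H, M), b2: (M,)
z1 = x @ W1 + b1                  # first affine layer
h1 = sigmoid(z1)                  # first nonlinearity
z2 = h1 @ W2 + b2                 # second affine layer
yhat = sigmoid(z2)                # network prediction
loss = np.sum((yhat - y) ** 2)    # squared Euclidean distance to the label
```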

Page 23

f(x; W, b) = Wx + b

Intermediate Variables (forward propagation)

(lecture note) the input is a single feature vector
(here) the input is a batch of data, stacked into a matrix
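Concretely, the two conventions differ only in shapes (row-per-example stacking is an assumption; the slide's figure is not recoverable):

```python
import numpy as np

M, D, N = 3, 4, 8
W, b = np.random.randn(M, D), np.random.randn(M)

x = np.random.randn(D)            # lecture note: one feature vector
y = W @ x + b                     # shape (M,)

X = np.random.randn(N, D)         # here: a batch, one example per row
Y = X @ W.T + b                   # shape (N, M); b broadcasts across the batch
```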

Page 24

Intermediate Variables (forward propagation)

Intermediate Gradients (backward propagation)

1. intermediate functions
2. local gradients
3. full gradients

???

???

???
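The three "???" gradients were worked out as images on the slide; a sketch of what the backward pass looks like for the 2-layer network sketched earlier (sigmoid assumed, continuing those variable names):

```python
# Walk the forward pass in reverse, multiplying local by upstream gradients
dyhat = 2 * (yhat - y)                # gradient of the squared-distance loss
dz2 = dyhat * yhat * (1 - yhat)       # sigmoid local gradient
dW2 = h1.T @ dz2
db2 = dz2.sum(axis=0)
dh1 = dz2 @ W2.T
dz1 = dh1 * h1 * (1 - h1)             # sigmoid local gradient again
dW1 = x.T @ dz1
db1 = dz1.sum(axis=0)
```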

Page 25

Agenda

● Motivation
● Backprop Tips & Tricks
● Matrix calculus primer

Page 26

Derivative w.r.t. Vector

Scalar-by-Vector

Vector-by-Vector
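The definitions on this slide were images; the standard conventions they refer to are

\[
\text{scalar-by-vector: } y \in \mathbb{R},\ x \in \mathbb{R}^n \ \Rightarrow\ \frac{\partial y}{\partial x} \in \mathbb{R}^n, \quad \left(\frac{\partial y}{\partial x}\right)_i = \frac{\partial y}{\partial x_i}
\]

\[
\text{vector-by-vector: } y \in \mathbb{R}^m,\ x \in \mathbb{R}^n \ \Rightarrow\ \frac{\partial y}{\partial x} \in \mathbb{R}^{m \times n} \text{ (the Jacobian)}, \quad \left(\frac{\partial y}{\partial x}\right)_{ij} = \frac{\partial y_i}{\partial x_j}
\]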

Page 27

Derivative w.r.t. Vector: Chain Rule

1. intermediate functions
2. local gradients
3. full gradients

?
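The expression behind the "?" was an image; in the conventions above, the chain rule for vectors is a product of Jacobians:

\[
z = f(y),\ y = g(x) \ \Rightarrow\ \frac{\partial z}{\partial x} = \frac{\partial z}{\partial y}\,\frac{\partial y}{\partial x}.
\]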

Page 28

Derivative w.r.t. Vector: Takeaway

Page 29

Derivative w.r.t. Matrix

Vector-by-Matrix ?

Scalar-by-Matrix

Page 30

Derivative w.r.t. Matrix: Dimension Balancing

When you take a scalar-by-matrix gradient, the gradient has the shape of the denominator: dL/dW has the same shape as W

● Dimension balancing is the "cheap" but efficient approach to gradient calculations in most practical settings
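A worked instance of dimension balancing (the shapes are assumptions for illustration): for \( Y = XW \) with \( X \in \mathbb{R}^{N \times D} \), \( W \in \mathbb{R}^{D \times M} \), and upstream gradient \( \frac{\partial L}{\partial Y} \in \mathbb{R}^{N \times M} \), the only way to combine the available matrices so that the shapes balance is

\[
\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} W^\top \in \mathbb{R}^{N \times D}, \qquad \frac{\partial L}{\partial W} = X^\top \frac{\partial L}{\partial Y} \in \mathbb{R}^{D \times M}.
\]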

Page 31

Derivative w.r.t. Matrix: Takeaway

Page 32

Intermediate Variables (forward propagation)

Intermediate Gradients (backward propagation)

1. intermediate functions
2. local gradients
3. full gradients

Page 33

Backprop Menu for Success

1. Write down variable graph

2. Keep track of error signals

3. Compute derivative of loss function

4. Enforce the shape rule on error signals, especially when deriving over a linear transformation (see the sketch below)
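A sketch of step 4's shape rule in code (shapes and names are illustrative): every error signal must match the shape of the variable it is the gradient of.

```python
import numpy as np

N, D, M = 8, 4, 3
X, W = np.random.randn(N, D), np.random.randn(D, M)
Y = X @ W
dY = np.random.randn(*Y.shape)    # upstream error signal, shape (N, M)

dX = dY @ W.T
dW = X.T @ dY

assert dX.shape == X.shape        # shape rule: each error signal matches
assert dW.shape == W.shape        # the variable it differentiates
```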

Page 34

Vector-by-vector

?

Page 35

Vector-by-vector

?

Page 36

Vector-by-vector

?

Page 37

Vector-by-vector

?
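The answer built up across these four slides was an image; a standard vector-by-vector result of the kind being derived: for an elementwise function \( y_i = f(x_i) \), the Jacobian is diagonal,

\[
\left(\frac{\partial y}{\partial x}\right)_{ij} = \begin{cases} f'(x_i) & i = j \\ 0 & i \neq j. \end{cases}
\]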

Page 38

Matrix multiplication [Backprop]

? ?
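The two "?" gradients were images; for \( Y = XW \) with upstream gradient \( \frac{\partial L}{\partial Y} \), the standard backprop results (consistent with the dimension-balancing slide) are

\[
\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y}\,W^\top, \qquad \frac{\partial L}{\partial W} = X^\top\,\frac{\partial L}{\partial Y}.
\]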

Page 39

Elementwise function [Backprop]

?
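The "?" was an image; because the Jacobian of an elementwise function is diagonal, its backprop reduces to an elementwise product with the upstream gradient. A NumPy sketch with ReLU as the example f:

```python
import numpy as np

x = np.random.randn(5)
y = np.maximum(0, x)         # elementwise ReLU as the example f
dy = np.random.randn(5)      # upstream gradient

dx = dy * (x > 0)            # dx_i = dy_i * f'(x_i); no full Jacobian is built
```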