
Backpropagation and the Chain Rule

David S. Rosenberg

New York University

April 17, 2018

DS-GA 1003 / CSCI-GA 2567

Contents

1. Introduction
2. Partial Derivatives and the Chain Rule
3. Example: Least Squares Regression
4. Example: Ridge Regression
5. General Backpropagation


Introduction


Learning with Back-Propagation

Back-propagation is an algorithm for computing the gradient. With lots of chain rule, you could also work out the gradient by hand.

Back-propagation is
- a clean way to organize the computation of the gradient
- an efficient way to compute the gradient


Partial Derivatives and the Chain Rule


Partial Derivatives

Consider a function $g : \mathbb{R}^p \to \mathbb{R}^n$.

(Figure: a typical computation graph for $g$, and the same graph broken out into components.)


Partial Derivatives

Consider a function $g : \mathbb{R}^p \to \mathbb{R}^n$, and let $b = g(a)$.

The partial derivative $\frac{\partial b_i}{\partial a_j}$ is the instantaneous rate of change of $b_i$ as we change $a_j$: if we change $a_j$ slightly to $a_j + \delta$, then (for small $\delta$) $b_i$ changes to approximately $b_i + \frac{\partial b_i}{\partial a_j}\delta$.
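As a quick numeric illustration (a sketch, not from the slides; the map $g$ below is hypothetical), finite differences recover this definition directly:

```python
import numpy as np

# A hypothetical map g : R^2 -> R^2, chosen only for illustration.
def g(a):
    return np.array([a[0] * a[1], a[0] + np.sin(a[1])])

a = np.array([1.0, 2.0])
delta = 1e-6
j = 1                                    # which input a_j to perturb
e_j = np.zeros_like(a); e_j[j] = delta
# For small delta, b_i changes by approximately (db_i/da_j) * delta,
# so this quotient estimates the partials db_i/da_j for all i.
print((g(a + e_j) - g(a)) / delta)
```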


Partial Derivatives of an Affine Function

Define the affine function $g(x) = Mx + c$, for $M \in \mathbb{R}^{n \times p}$ and $c \in \mathbb{R}^n$.

If we let $b = g(a)$, then what is $b_i$? It depends on the $i$th row of $M$:

$$b_i = \sum_{k=1}^{p} M_{ik} a_k + c_i \qquad\text{and}\qquad \frac{\partial b_i}{\partial a_j} = M_{ij}.$$

So for an affine mapping, the entries of the matrix $M$ directly tell us the rates of change.
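A small sketch confirming this numerically (the random $M$, $c$, and $a$ are stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 3, 4
M = rng.standard_normal((n, p))
c = rng.standard_normal(n)
g = lambda a: M @ a + c                  # affine map b = Ma + c

a = rng.standard_normal(p)
delta = 1e-6
# Finite-difference estimate of db_i/da_j, one column at a time.
J = np.column_stack([(g(a + delta * np.eye(p)[:, j]) - g(a)) / delta
                     for j in range(p)])
print(np.allclose(J, M, atol=1e-4))      # True: the partials equal M_ij
```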


Chain Rule (in terms of partial derivatives)

Let $g : \mathbb{R}^p \to \mathbb{R}^n$ and $f : \mathbb{R}^n \to \mathbb{R}^m$. Let $b = g(a)$ and $c = f(b)$.

The chain rule says that

$$\frac{\partial c_i}{\partial a_j} = \sum_{k=1}^{n} \frac{\partial c_i}{\partial b_k}\,\frac{\partial b_k}{\partial a_j}.$$

- A change in $a_j$ may change each of $b_1, \ldots, b_n$.
- Changes in $b_1, \ldots, b_n$ may each affect $c_i$.
- The chain rule tells us that, to first order, the net change in $c_i$ is the sum of the changes induced along each path from $a_j$ to $c_i$.
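As a sanity check of the identity, here is a minimal sketch with hypothetical maps $f$ and $g$, comparing the finite-difference Jacobian of the composition against the product of the individual Jacobians:

```python
import numpy as np

g = lambda a: np.array([a[0] * a[1], a[0] + a[1]])   # g : R^2 -> R^2
f = lambda b: np.array([b[0] ** 2 + b[1]])           # f : R^2 -> R^1

def jac(h, x, d=1e-6):
    # Finite-difference Jacobian of h at x; column j holds the partials w.r.t. x_j.
    return np.column_stack([(h(x + d * e) - h(x)) / d for e in np.eye(len(x))])

a = np.array([1.0, 2.0])
direct = jac(lambda t: f(g(t)), a)    # dc_i/da_j on the composition
chained = jac(f, g(a)) @ jac(g, a)    # sum_k (dc_i/db_k)(db_k/da_j)
print(np.allclose(direct, chained, atol=1e-3))
```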


Example: Least Squares Regression


Review: Linear least squares

Hypothesis space $\left\{ f(x) = w^T x + b \mid w \in \mathbb{R}^d,\, b \in \mathbb{R} \right\}$.

Data set $(x_1, y_1), \ldots, (x_n, y_n) \in \mathbb{R}^d \times \mathbb{R}$. Define

$$\ell_i(w, b) = \left[\left(w^T x_i + b\right) - y_i\right]^2.$$

In SGD, in each round we'd choose a random index $i \in \{1, \ldots, n\}$ and take a gradient step

$$w_j \leftarrow w_j - \eta\,\frac{\partial \ell_i(w, b)}{\partial w_j}, \quad \text{for } j = 1, \ldots, d$$

$$b \leftarrow b - \eta\,\frac{\partial \ell_i(w, b)}{\partial b},$$

for some step size $\eta > 0$. Let's revisit how to calculate these partial derivatives...


Computation Graph and Intermediate Variables

For a generic training point $(x, y)$, denote the loss by

$$\ell(w, b) = \left[\left(w^T x + b\right) - y\right]^2.$$

Let's break this down into some intermediate computations:

$$\text{(prediction)}\quad \hat{y} = \sum_{j=1}^{d} w_j x_j + b$$

$$\text{(residual)}\quad r = y - \hat{y}$$

$$\text{(loss)}\quad \ell = r^2$$


Partial Derivatives on Computation Graph

We'll work our way from the graph output $\ell$ back to the parameters $w$ and $b$:

$$\frac{\partial \ell}{\partial r} = 2r$$

$$\frac{\partial \ell}{\partial \hat{y}} = \frac{\partial \ell}{\partial r}\,\frac{\partial r}{\partial \hat{y}} = (2r)(-1) = -2r$$

$$\frac{\partial \ell}{\partial b} = \frac{\partial \ell}{\partial \hat{y}}\,\frac{\partial \hat{y}}{\partial b} = (-2r)(1) = -2r$$

$$\frac{\partial \ell}{\partial w_j} = \frac{\partial \ell}{\partial \hat{y}}\,\frac{\partial \hat{y}}{\partial w_j} = (-2r)\,x_j = -2r x_j$$
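Putting the forward pass and these partials together, here is a minimal sketch of one SGD step on a single training point (the data and step size `eta` below are hypothetical):

```python
import numpy as np

def least_squares_grads(w, b, x, y):
    # Forward pass: the intermediate variables of the computation graph.
    y_hat = w @ x + b            # prediction
    r = y - y_hat                # residual
    loss = r ** 2                # loss
    # Backward pass, mirroring the partials derived above.
    dl_dr = 2 * r
    dl_dyhat = dl_dr * (-1)      # dr/dy_hat = -1
    dl_db = dl_dyhat * 1         # dy_hat/db = 1
    dl_dw = dl_dyhat * x         # dy_hat/dw_j = x_j
    return loss, dl_dw, dl_db

rng = np.random.default_rng(0)
w, b, eta = rng.standard_normal(3), 0.0, 0.1
x, y = rng.standard_normal(3), 1.5
loss, dw, db = least_squares_grads(w, b, x, y)
w, b = w - eta * dw, b - eta * db    # one SGD step
```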


Example: Ridge Regression


Ridge Regression: Computation Graph and Intermediate Variables

For a training point $(x, y)$, the $\ell_2$-regularized objective function is

$$J(w, b) = \left[\left(w^T x + b\right) - y\right]^2 + \lambda w^T w.$$

Let's break this down into some intermediate computations:

$$\text{(prediction)}\quad \hat{y} = \sum_{j=1}^{d} w_j x_j + b$$

$$\text{(residual)}\quad r = y - \hat{y}$$

$$\text{(loss)}\quad \ell = r^2$$

$$\text{(regularization)}\quad R = \lambda w^T w$$

$$\text{(objective)}\quad J = \ell + R$$


Partial Derivatives on Computation Graph

We'll work our way from the graph output $J$ back to the parameters $w$ and $b$:

$$\frac{\partial J}{\partial \ell} = \frac{\partial J}{\partial R} = 1$$

$$\frac{\partial J}{\partial \hat{y}} = \frac{\partial J}{\partial \ell}\,\frac{\partial \ell}{\partial r}\,\frac{\partial r}{\partial \hat{y}} = (1)(2r)(-1) = -2r$$

$$\frac{\partial J}{\partial b} = \frac{\partial J}{\partial \hat{y}}\,\frac{\partial \hat{y}}{\partial b} = (-2r)(1) = -2r$$

$$\frac{\partial J}{\partial w_j} = \;?$$


Handling Nodes with Multiple Children

Consider $a \mapsto J = h(f(a), g(a))$.

It's helpful to think about having two independent copies of $a$, call them $a^{(1)}$ and $a^{(2)}$...


Handling Nodes with Multiple Children

$$\frac{\partial J}{\partial a} = \frac{\partial J}{\partial a^{(1)}}\,\frac{\partial a^{(1)}}{\partial a} + \frac{\partial J}{\partial a^{(2)}}\,\frac{\partial a^{(2)}}{\partial a} = \frac{\partial J}{\partial a^{(1)}} + \frac{\partial J}{\partial a^{(2)}}$$

The derivative w.r.t. $a$ is the sum of the derivatives w.r.t. each copy of $a$.
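A tiny worked instance of this rule (the functions $f$, $g$, $h$ are hypothetical):

```python
# J = h(f(a), g(a)) with f(a) = a**2, g(a) = 3*a, h(u, v) = u * v.
a = 2.0
# Treat the two uses of a as copies: a^(1) feeds f, a^(2) feeds g.
dJ_da1 = (3 * a) * (2 * a)   # dJ/du * df/da = v * 2a
dJ_da2 = (a ** 2) * 3        # dJ/dv * dg/da = u * 3
dJ_da = dJ_da1 + dJ_da2      # sum over the copies: 24 + 12 = 36
# Direct check: J(a) = 3*a**3, so dJ/da = 9*a**2 = 36 at a = 2.
```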


Partial Derivatives on Computation Graph

We'll work our way from the graph output $J$ back to the parameters $w$ and $b$:

$$\frac{\partial J}{\partial \hat{y}} = \frac{\partial J}{\partial \ell}\,\frac{\partial \ell}{\partial r}\,\frac{\partial r}{\partial \hat{y}} = (1)(2r)(-1) = -2r$$

$$\frac{\partial J}{\partial w_j^{(2)}} = \frac{\partial J}{\partial \hat{y}}\,\frac{\partial \hat{y}}{\partial w_j^{(2)}} = \frac{\partial J}{\partial \hat{y}}\,x_j$$

$$\frac{\partial J}{\partial w_j^{(1)}} = \frac{\partial J}{\partial R}\,\frac{\partial R}{\partial w_j^{(1)}} = (1)\left(2\lambda w_j^{(1)}\right)$$

$$\frac{\partial J}{\partial w_j} = \frac{\partial J}{\partial w_j^{(1)}} + \frac{\partial J}{\partial w_j^{(2)}}$$
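Collecting the ridge partials into code, a minimal sketch in the same style as the least squares example above:

```python
import numpy as np

def ridge_grads(w, b, x, y, lam):
    # Forward pass.
    y_hat = w @ x + b
    r = y - y_hat
    J = r ** 2 + lam * w @ w
    # Backward pass: w has two "copies", one feeding the prediction
    # (w^(2)) and one feeding the regularizer (w^(1)).
    dJ_dyhat = -2 * r
    dJ_db = dJ_dyhat
    dJ_dw2 = dJ_dyhat * x      # path through y_hat: dJ/dy_hat * x_j
    dJ_dw1 = 2 * lam * w       # path through R = lam * w^T w
    dJ_dw = dJ_dw1 + dJ_dw2    # sum over the two copies of w
    return J, dJ_dw, dJ_db
```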


General Backpropagation


Backpropagation: Overview

Backpropagation is a specific way to evaluate the partial derivatives of a computation graph output $J$ w.r.t. the inputs and outputs of all nodes.

Backpropagation works node by node. To run a "backward" step at a node $f$, we assume we've already run "backward" for all of $f$'s children.

Backward at node $f : a \mapsto b$ returns
- the partial of the objective value $J$ w.r.t. $f$'s output: $\partial J / \partial b$
- the partial of the objective value $J$ w.r.t. $f$'s input: $\partial J / \partial a$


Backpropagation: Simple Case

Simple case: all nodes take a single scalar as input and have a single scalar output.

Backprop for node $f$:

Input: $\frac{\partial J}{\partial b^{(1)}}, \ldots, \frac{\partial J}{\partial b^{(N)}}$ (the partials w.r.t. the inputs to all $N$ children)

Output:

$$\frac{\partial J}{\partial b} = \sum_{k=1}^{N} \frac{\partial J}{\partial b^{(k)}}$$

$$\frac{\partial J}{\partial a} = \frac{\partial J}{\partial b}\,\frac{\partial b}{\partial a}$$
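In code, the backward step at a scalar node might look like this sketch (assuming the node knows its local derivative $\partial b / \partial a$):

```python
def backward_scalar_node(grads_from_children, db_da):
    # grads_from_children: dJ/db^(1), ..., dJ/db^(N), one value per child.
    dJ_db = sum(grads_from_children)   # sum over the N children
    dJ_da = dJ_db * db_da              # chain rule through this node
    return dJ_db, dJ_da
```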


Backpropagation (General case)

More generally, consider $f : \mathbb{R}^d \to \mathbb{R}^n$.

Input: $\frac{\partial J}{\partial b_j^{(i)}}$, for $i = 1, \ldots, N$ and $j = 1, \ldots, n$

Output:

$$\frac{\partial J}{\partial b_j} = \sum_{k=1}^{N} \frac{\partial J}{\partial b_j^{(k)}}$$

$$\frac{\partial J}{\partial a_i} = \sum_{j=1}^{n} \frac{\partial J}{\partial b_j}\,\frac{\partial b_j}{\partial a_i}$$
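The vector case follows the same pattern, with the sum over $j$ implemented as a Jacobian-transpose-times-vector product; a sketch:

```python
import numpy as np

def backward_vector_node(grads_from_children, jacobian):
    # grads_from_children: N arrays dJ/db^(k), each of shape (n,).
    # jacobian: the matrix of partials db_j/da_i, of shape (n, d).
    dJ_db = np.sum(grads_from_children, axis=0)   # dJ/db_j, summed over children
    dJ_da = jacobian.T @ dJ_db                    # dJ/da_i = sum_j dJ/db_j * db_j/da_i
    return dJ_db, dJ_da
```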


Running Backpropagation

If we run "backward" on every node in our graph, we'll have the gradients of $J$ w.r.t. all our parameters.

To run backward on a particular node, we assumed we had already run it on all of that node's children.

A topological sort of the nodes in a directed acyclic graph is an ordering in which every node appears before its children.

So we'll evaluate backward on the nodes in reverse topological order.
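A minimal sketch of this ordering (the graph representation, a dict from each node to its list of children, is hypothetical):

```python
def topological_sort(nodes, children):
    # Returns an ordering in which every node appears before its children.
    order, seen = [], set()

    def visit(u):
        if u in seen:
            return
        seen.add(u)
        for v in children.get(u, []):
            visit(v)
        order.append(u)               # u lands after all of its children

    for u in nodes:
        visit(u)
    return list(reversed(order))

# Backward runs children-first, i.e. in reverse topological order:
#   for node in reversed(topological_sort(nodes, children)):
#       run_backward(node)
```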
