Deep Learning Basics Lecture 2: Backpropagation
Princeton University COS 495
Instructor: Yingyu Liang
How to train the dragon?
[Figure: a feedforward network mapping input $x$ through hidden layers $h_1, h_2, \ldots, h_L$ to output $y$]
How to get the expected output
[Figure: the network as a box computing $f(x; \theta)$ from input $x$, compared against the target $y$]
Loss of the system: $L(x; \theta) = l(f_\theta, x, y)$
Goal: $L(x; \theta) \approx 0$
How to get the expected output
Find a direction $\Delta$ so that the loss $L(x; \theta + \Delta) \approx 0$
How to get the expected output
How to find $\Delta$: $L(x; \theta + \eta v) \approx L(x; \theta) + \nabla L(x; \theta) \cdot \eta v$ for a small scalar $\eta$
Goal: $L(x; \theta + \Delta) \approx 0$
How to get the expected output
Conclusion: move $\theta$ along $-\nabla L(x; \theta)$ by a small amount, so that the loss $L(x; \theta + \Delta)$ decreases.
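To see why this works, here is a minimal numerical sketch (mine, not the slides'): it estimates $\nabla L$ by central finite differences on a toy quadratic loss and checks that a small step along $-\nabla L$ lowers the loss. The loss function and step size $\eta$ are illustrative assumptions.

```python
import numpy as np

def loss(theta):
    # Toy stand-in for L(x; theta); any smooth function works for the check.
    return np.sum((theta - np.array([1.0, -2.0])) ** 2)

def numerical_grad(f, theta, eps=1e-6):
    # Central differences: (f(theta + eps*e_i) - f(theta - eps*e_i)) / (2*eps)
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        grad[i] = (f(theta + e) - f(theta - e)) / (2 * eps)
    return grad

theta = np.array([3.0, 4.0])
eta = 0.1  # small step size
g = numerical_grad(loss, theta)
print(loss(theta), loss(theta - eta * g))  # 40.0 vs. 25.6: the step decreases the loss
```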
Neural networks as real circuits
Pictorial illustration of gradient descent
[Figure: gradient descent steps moving downhill on the loss surface]
Gradient
• Gradient of the loss is simple
• E.g., for the squared loss $l(f_\theta, x, y) = (f_\theta(x) - y)^2 / 2$:
$$\frac{\partial l}{\partial f_\theta} = f_\theta(x) - y$$
• Key part: gradient of the hypothesis
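As a quick sanity check on the squared-loss derivative, a one-off comparison of the analytic gradient $f_\theta(x) - y$ against a finite difference (the numbers are arbitrary):

```python
f_x, y = 2.5, 1.0                # arbitrary prediction f_theta(x) and target y
l = lambda f: (f - y) ** 2 / 2   # squared loss

analytic = f_x - y                                   # dl/df = f - y
eps = 1e-6
numeric = (l(f_x + eps) - l(f_x - eps)) / (2 * eps)  # finite difference
print(analytic, numeric)         # both ~1.5
```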
Open the box: real circuit
Single neuron
[Figure: inputs $x_1$ and $x_2$ feed a "$-$" node with output $f$]
Function: $f = x_1 - x_2$
Single neuron
[Figure: the same "$-$" node, with local gradient $1$ on the $x_1$ edge and $-1$ on the $x_2$ edge]
Function: $f = x_1 - x_2$
Gradient: $\frac{\partial f}{\partial x_1} = 1$, $\frac{\partial f}{\partial x_2} = -1$
Two neurons
[Figure: $x_3$ and $x_4$ feed a "$+$" node computing $x_2$; then $x_1$ and $x_2$ feed the "$-$" node computing $f$]
Function: $f = x_1 - x_2 = x_1 - (x_3 + x_4)$
Two neurons
[Figure: the same circuit, with local gradients $1$ and $-1$ on the "$-$" node and $1$, $1$ on the "$+$" node]
Function: $f = x_1 - x_2 = x_1 - (x_3 + x_4)$
Gradient: $\frac{\partial x_2}{\partial x_3} = 1$, $\frac{\partial x_2}{\partial x_4} = 1$. What about $\frac{\partial f}{\partial x_3}$?
Two neurons
[Figure: the same circuit, with $-1$ propagated back along the edges into the "$+$" node]
Function: $f = x_1 - x_2 = x_1 - (x_3 + x_4)$
Gradient (chain rule): $\frac{\partial f}{\partial x_3} = \frac{\partial f}{\partial x_2} \frac{\partial x_2}{\partial x_3} = -1$
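A minimal sketch of this circuit as code (variable names are mine): the forward pass computes the node values, and the backward pass multiplies local gradients along the edges, exactly as the chain rule above prescribes.

```python
# Forward pass: x2 = x3 + x4, then f = x1 - x2
x1, x3, x4 = 5.0, 2.0, 1.0
x2 = x3 + x4
f = x1 - x2

# Backward pass: start from df/df = 1 and multiply local gradients edge by edge.
df_df = 1.0
df_dx1 = df_df * 1.0    # "-" node: d(x1 - x2)/dx1 = 1
df_dx2 = df_df * -1.0   # "-" node: d(x1 - x2)/dx2 = -1
df_dx3 = df_dx2 * 1.0   # "+" node: dx2/dx3 = 1, so df/dx3 = -1
df_dx4 = df_dx2 * 1.0   # "+" node: dx2/dx4 = 1, so df/dx4 = -1
print(df_dx3, df_dx4)   # -1.0 -1.0
```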
Multiple input
[Figure: $x_3$, $x_4$, and $x_5$ all feed the "$+$" node computing $x_2$]
Function: $f = x_1 - x_2 = x_1 - (x_3 + x_4 + x_5)$
Gradient: $\frac{\partial x_2}{\partial x_5} = 1$
Multiple input
[Figure: the same circuit, with $-1$ propagated back to $x_5$]
Function: $f = x_1 - x_2 = x_1 - (x_3 + x_4 + x_5)$
Gradient: $\frac{\partial f}{\partial x_5} = \frac{\partial f}{\partial x_2} \frac{\partial x_2}{\partial x_5} = -1$
Weights on the edges
[Figure: the edges from $x_3$ and $x_4$ into the "$+$" node now carry weights $w_3$ and $w_4$]
Function: $f = x_1 - x_2 = x_1 - (w_3 x_3 + w_4 x_4)$
Weights on the edges
[Figure: the same circuit, with $-x_3$ and $-x_4$ propagated back to $w_3$ and $w_4$]
Function: $f = x_1 - x_2 = x_1 - (w_3 x_3 + w_4 x_4)$
Gradient: $\frac{\partial f}{\partial w_3} = \frac{\partial f}{\partial x_2} \frac{\partial x_2}{\partial w_3} = -1 \times x_3 = -x_3$
Activation
[Figure: the "$+$" node is replaced by a $\sigma$ node computing $x_2$]
Function: $f = x_1 - x_2 = x_1 - \sigma(w_3 x_3 + w_4 x_4)$
Activation
[Figure: the weighted sum entering the $\sigma$ node is labeled $net_2$]
Function: $f = x_1 - x_2 = x_1 - \sigma(w_3 x_3 + w_4 x_4)$
Let $net_2 = w_3 x_3 + w_4 x_4$
Activation
Function: $f = x_1 - x_2 = x_1 - \sigma(w_3 x_3 + w_4 x_4)$
Gradient: $\frac{\partial f}{\partial w_3} = \frac{\partial f}{\partial x_2} \frac{\partial x_2}{\partial net_2} \frac{\partial net_2}{\partial w_3} = -1 \times \sigma' \times x_3 = -\sigma' x_3$
where $\frac{\partial x_2}{\partial net_2} = \sigma'$ and $\frac{\partial net_2}{\partial w_3} = x_3$
Activation
[Figure: the same circuit, with $-\sigma'$ propagated back to $net_2$ and $-\sigma' x_3$ to $w_3$]
Function: $f = x_1 - x_2 = x_1 - \sigma(w_3 x_3 + w_4 x_4)$
Gradient: $\frac{\partial f}{\partial w_3} = \frac{\partial f}{\partial x_2} \frac{\partial x_2}{\partial net_2} \frac{\partial net_2}{\partial w_3} = -1 \times \sigma' \times x_3 = -\sigma' x_3$
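The same circuit in code, now through the activation. A hedged sketch: the slides leave $\sigma$ abstract, so the sigmoid below is one concrete choice, and the input and weight values are arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x1, x3, x4 = 5.0, 2.0, 1.0
w3, w4 = 0.5, -0.3   # arbitrary weights

# Forward pass
net2 = w3 * x3 + w4 * x4
x2 = sigmoid(net2)
f = x1 - x2

# Backward pass
df_dx2 = -1.0                    # "-" node
sigma_prime = x2 * (1.0 - x2)    # sigmoid'(net2) = sigmoid(net2) * (1 - sigmoid(net2))
df_dnet2 = df_dx2 * sigma_prime  # chain through sigma
df_dw3 = df_dnet2 * x3           # = -sigma' * x3
df_dw4 = df_dnet2 * x4           # = -sigma' * x4
print(df_dw3, df_dw4)
```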
Multiple paths
[Figure: $x_3$ and $x_5$ feed a "$+$" node computing $x_1$; $x_3$ also feeds the $\sigma$ node through weight $w_3$]
Function: $f = x_1 - x_2 = (x_3 + x_5) - \sigma(w_3 x_3 + w_4 x_4)$
Multiple paths
Function: $f = x_1 - x_2 = (x_3 + x_5) - \sigma(w_3 x_3 + w_4 x_4)$
Gradient (sum over paths): $\frac{\partial f}{\partial x_3} = \frac{\partial f}{\partial x_2} \frac{\partial x_2}{\partial net_2} \frac{\partial net_2}{\partial x_3} + \frac{\partial f}{\partial x_1} \frac{\partial x_1}{\partial x_3} = -1 \times \sigma' \times w_3 + 1 \times 1 = -\sigma' w_3 + 1$
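A sketch of the sum-over-paths rule (again with sigmoid as an illustrative $\sigma$ and arbitrary values): the two contributions to $\partial f / \partial x_3$ are computed separately, summed, and checked against a finite difference.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def f(x3, x4, x5, w3, w4):
    # f = (x3 + x5) - sigma(w3*x3 + w4*x4); x3 reaches f along two paths
    return (x3 + x5) - sigmoid(w3 * x3 + w4 * x4)

x3, x4, x5, w3, w4 = 2.0, 1.0, 0.5, 0.5, -0.3

x2 = sigmoid(w3 * x3 + w4 * x4)
sigma_prime = x2 * (1.0 - x2)
path_through_sigma = -1.0 * sigma_prime * w3  # via x2 = sigma(net2)
path_through_plus = 1.0 * 1.0                 # via x1 = x3 + x5
analytic = path_through_sigma + path_through_plus

eps = 1e-6
numeric = (f(x3 + eps, x4, x5, w3, w4) - f(x3 - eps, x4, x5, w3, w4)) / (2 * eps)
print(analytic, numeric)  # both equal -sigma' * w3 + 1
```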
Summary
• Forward pass to compute $f$
• Backward pass to compute the gradients
[Figure: inputs $x_1, x_2$ feed $\sigma$ units $h_{11}, h_{21}$ with pre-activations $net_{11}, net_{21}$; a "$+$" node combines their outputs into $f$]
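Putting the pieces together, a hedged end-to-end sketch of forward and backward passes for a network shaped like the figure: two sigmoid units on inputs $x_1, x_2$, summed at the output. The weight values and matrix layout are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, -2.0])        # inputs x1, x2
W = np.array([[0.5, -0.3],       # illustrative weights: row i feeds unit h_i1
              [0.8,  0.2]])

# Forward pass: compute f
net = W @ x                      # net11, net21
h = sigmoid(net)                 # h11, h21
f = h.sum()                      # "+" output node

# Backward pass: compute the gradients
df_dh = np.ones(2)               # "+" node passes gradient 1 to each h
df_dnet = df_dh * h * (1 - h)    # chain through sigma: sigma' = h * (1 - h)
df_dW = np.outer(df_dnet, x)     # d net_i / d W_ij = x_j
df_dx = W.T @ df_dnet            # gradients also flow back to the inputs
print(f, df_dW, df_dx)
```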
Math form
Gradient descent
• Minimize loss $L(\theta)$, where the hypothesis is parametrized by $\theta$
• Gradient descent:
  • Initialize $\theta^0$
  • $\theta^{t+1} = \theta^t - \eta_t \nabla L(\theta^t)$
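A minimal gradient-descent loop matching this update rule; the quadratic toy loss and the fixed step size $\eta_t = 0.1$ are assumptions for illustration.

```python
import numpy as np

def grad_L(theta):
    # Gradient of the toy loss L(theta) = ||theta - target||^2 / 2
    return theta - np.array([1.0, -2.0])

theta = np.zeros(2)   # initialize theta^0
eta = 0.1             # step size eta_t, held fixed here
for t in range(100):
    theta = theta - eta * grad_L(theta)  # theta^{t+1} = theta^t - eta_t * grad L(theta^t)
print(theta)          # converges to the minimizer [1, -2]
```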
Stochastic gradient descent (SGD)
• Suppose data points arrive one by one
• $L(\theta) = \frac{1}{n} \sum_{t=1}^{n} l(\theta, x_t, y_t)$, but we only know $l(\theta, x_t, y_t)$ at time $t$
• Idea: simply do what you can based on local information
  • Initialize $\theta^0$
  • $\theta^{t+1} = \theta^t - \eta_t \nabla l(\theta^t, x_t, y_t)$
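The corresponding SGD sketch: each arriving point triggers one update using only its own loss gradient. Linear least squares with noise-free labels and a constant step size are assumed purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = np.array([2.0, -1.0])
theta = np.zeros(2)   # initialize theta^0
eta = 0.1             # constant step size for simplicity (the slides allow eta_t to vary)

for t in range(1000):
    x_t = rng.normal(size=2)          # data point arriving at time t
    y_t = theta_true @ x_t            # its label (noise-free toy data)
    grad = (theta @ x_t - y_t) * x_t  # grad of l = (theta.x - y)^2 / 2
    theta = theta - eta * grad        # theta^{t+1} = theta^t - eta_t * grad l
print(theta)          # close to theta_true
```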
Mini-batch
• Instead of one data point, work with a small batch of $b$ points
$$(x_{tb+1}, y_{tb+1}), \ldots, (x_{tb+b}, y_{tb+b})$$
• Update rule:
$$\theta^{t+1} = \theta^t - \eta_t \nabla \left[ \frac{1}{b} \sum_{1 \le i \le b} l(\theta^t, x_{tb+i}, y_{tb+i}) \right]$$
• Typical batch size: $b = 128$
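The mini-batch version averages per-example gradients before each update. A sketch on the same toy regression problem, with a small batch size for readability (the slides note $b = 128$ as typical):

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = np.array([2.0, -1.0])
theta = np.zeros(2)
b, eta = 8, 0.1   # small illustrative batch; slides suggest b = 128 in practice

for t in range(500):
    X = rng.normal(size=(b, 2))       # batch (x_{tb+1}, ..., x_{tb+b})
    y = X @ theta_true                # labels
    grad = X.T @ (X @ theta - y) / b  # average of per-example gradients
    theta = theta - eta * grad        # theta^{t+1} = theta^t - eta_t * grad
print(theta)      # close to theta_true
```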