Gradient Descent — lecture slides transcript (xiaohongliu.ca)
Review: Gradient Descent

• In step 3, we have to solve the following optimization problem:

$$\theta^* = \arg\min_\theta L(\theta)$$

($L$: loss function, $\theta$: parameters)

Suppose that $\theta$ has two variables $\{\theta_1, \theta_2\}$. Randomly start at

$$\theta^0 = \begin{bmatrix}\theta_1^0\\ \theta_2^0\end{bmatrix}, \qquad \nabla L(\theta) = \begin{bmatrix}\partial L(\theta)/\partial\theta_1\\ \partial L(\theta)/\partial\theta_2\end{bmatrix}$$

Then update:

$$\begin{bmatrix}\theta_1^1\\ \theta_2^1\end{bmatrix} = \begin{bmatrix}\theta_1^0\\ \theta_2^0\end{bmatrix} - \eta\begin{bmatrix}\partial L(\theta^0)/\partial\theta_1\\ \partial L(\theta^0)/\partial\theta_2\end{bmatrix} \quad\Longleftrightarrow\quad \theta^1 = \theta^0 - \eta\nabla L(\theta^0)$$

$$\begin{bmatrix}\theta_1^2\\ \theta_2^2\end{bmatrix} = \begin{bmatrix}\theta_1^1\\ \theta_2^1\end{bmatrix} - \eta\begin{bmatrix}\partial L(\theta^1)/\partial\theta_1\\ \partial L(\theta^1)/\partial\theta_2\end{bmatrix} \quad\Longleftrightarrow\quad \theta^2 = \theta^1 - \eta\nabla L(\theta^1)$$
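The update rule above can be sketched in a few lines of Python. The quadratic loss $L(\theta) = (\theta_1 - 1)^2 + (\theta_2 + 2)^2$ is an illustrative assumption, not from the slides:

```python
# Minimal sketch of theta^{i+1} = theta^i - eta * grad L(theta^i).
# The loss L(theta) = (theta1 - 1)^2 + (theta2 + 2)^2 is an
# illustrative assumption, not from the slides.
def grad_L(theta):
    """Gradient of the example loss with respect to theta1 and theta2."""
    t1, t2 = theta
    return [2.0 * (t1 - 1.0), 2.0 * (t2 + 2.0)]

def gradient_descent(theta0, eta=0.1, steps=100):
    theta = list(theta0)
    for _ in range(steps):
        g = grad_L(theta)
        theta = [t - eta * gi for t, gi in zip(theta, g)]  # theta <- theta - eta * grad
    return theta

print(gradient_descent([0.0, 0.0]))  # converges toward the minimizer (1, -2)
```

With this learning rate each step contracts the distance to the minimizer by a constant factor, so a hundred steps land essentially on $(1, -2)$.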
Review: Gradient Descent

Start at position $\theta^0$.
Compute the gradient at $\theta^0$; move to $\theta^1 = \theta^0 - \eta\nabla L(\theta^0)$.
Compute the gradient at $\theta^1$; move to $\theta^2 = \theta^1 - \eta\nabla L(\theta^1)$.
……

[Figure: contour plot of the loss showing the trajectory $\theta^0 \to \theta^1 \to \theta^2 \to \theta^3$, with the gradients $\nabla L(\theta^0), \nabla L(\theta^1), \nabla L(\theta^2), \nabla L(\theta^3)$ drawn at each point; each movement is opposite to the gradient.]

Gradient: the normal direction of the loss's contour lines.
Gradient Descent Tip 1: Tuning Your Learning Rates
Learning Rate

$$\theta^{i+1} = \theta^i - \eta\nabla L(\theta^i)$$

[Figure: loss versus number of parameter updates for different learning rates — a very large rate diverges, a large rate fails to descend into the minimum, a small rate converges very slowly, and a just-right rate converges quickly.]

Set the learning rate $\eta$ carefully. If there are more than three parameters, you cannot visualize the loss surface, but you can always visualize the loss against the number of parameter updates.
Adaptive Learning Rates

• Popular & simple idea: reduce the learning rate by some factor every few epochs.
  • At the beginning, we are far from the destination, so we use a larger learning rate.
  • After several epochs, we are close to the destination, so we reduce the learning rate.
  • E.g. 1/t decay: $\eta^t = \eta/\sqrt{t+1}$
• The learning rate cannot be one-size-fits-all:
  • give different parameters different learning rates.
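The 1/t decay schedule above can be sketched directly; the base rate $\eta = 1.0$ is an arbitrary illustration:

```python
# A minimal sketch of the 1/t decay schedule eta^t = eta / sqrt(t + 1);
# the base rate eta = 1.0 is an arbitrary illustration.
import math

def decayed_lr(eta, t):
    """Learning rate at update step t under 1/t decay."""
    return eta / math.sqrt(t + 1)

print([round(decayed_lr(1.0, t), 3) for t in range(4)])  # [1.0, 0.707, 0.577, 0.5]
```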
Adagrad

• Divide the learning rate of each parameter by the root mean square of its previous derivatives.

Here $w$ is one parameter, $g^t = \partial L(\theta^t)/\partial w$, and $\sigma^t$ is the root mean square of the previous derivatives of parameter $w$, so the denominator is parameter dependent.

Vanilla gradient descent: $w^{t+1} \leftarrow w^t - \eta^t g^t$, with $\eta^t = \eta/\sqrt{t+1}$.

Adagrad: $w^{t+1} \leftarrow w^t - \dfrac{\eta^t}{\sigma^t}\, g^t$.
Adagrad

$$w^1 \leftarrow w^0 - \frac{\eta^0}{\sigma^0}g^0, \qquad \sigma^0 = \sqrt{(g^0)^2}$$

$$w^2 \leftarrow w^1 - \frac{\eta^1}{\sigma^1}g^1, \qquad \sigma^1 = \sqrt{\tfrac{1}{2}\left[(g^0)^2 + (g^1)^2\right]}$$

$$w^3 \leftarrow w^2 - \frac{\eta^2}{\sigma^2}g^2, \qquad \sigma^2 = \sqrt{\tfrac{1}{3}\left[(g^0)^2 + (g^1)^2 + (g^2)^2\right]}$$

……

$$w^{t+1} \leftarrow w^t - \frac{\eta^t}{\sigma^t}g^t, \qquad \sigma^t = \sqrt{\frac{1}{t+1}\sum_{i=0}^{t}(g^i)^2}$$

$\sigma^t$: root mean square of the previous derivatives of parameter $w$.
Adagrad

• Divide the learning rate of each parameter by the root mean square of its previous derivatives. With the 1/t decay $\eta^t = \eta/\sqrt{t+1}$ and $\sigma^t = \sqrt{\frac{1}{t+1}\sum_{i=0}^{t}(g^i)^2}$, the $\sqrt{t+1}$ factors in $\eta^t$ and $\sigma^t$ cancel, and the update simplifies to:

$$w^{t+1} \leftarrow w^t - \frac{\eta}{\sqrt{\sum_{i=0}^{t}(g^i)^2}}\, g^t$$
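The simplified update can be sketched for one parameter. The loss $L(w) = (w-3)^2$ is an illustrative assumption, not from the slides:

```python
# A sketch of the simplified Adagrad update for one parameter w, using the
# illustrative loss L(w) = (w - 3)^2 (an assumption, not from the slides).
import math

def grad(w):
    return 2.0 * (w - 3.0)                    # dL/dw

def adagrad(w0, eta=1.0, steps=200):
    w, sum_sq = w0, 0.0
    for _ in range(steps):
        g = grad(w)
        sum_sq += g * g                       # accumulate (g^i)^2 over the history
        w -= eta / math.sqrt(sum_sq) * g      # w <- w - eta / sqrt(sum) * g
    return w

print(adagrad(0.0))  # approaches the minimizer w = 3
```

Notice that no hand-tuned decay schedule is needed: the accumulated squared gradients shrink the effective step automatically.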
Contradiction?

Vanilla gradient descent: $w^{t+1} \leftarrow w^t - \eta^t g^t$ — larger gradient, larger step.

Adagrad: $w^{t+1} \leftarrow w^t - \dfrac{\eta}{\sqrt{\sum_{i=0}^{t}(g^i)^2}}\, g^t$ — the numerator $g^t$ says larger gradient, larger step, while the denominator says larger gradient, smaller step.

(Here $g^t = \partial L(\theta^t)/\partial w$ and $\eta^t = \eta/\sqrt{t+1}$.)
Intuitive Reason

• How surprising the current gradient is: the denominator creates a contrast effect.

$$w^{t+1} \leftarrow w^t - \frac{\eta}{\sqrt{\sum_{i=0}^{t}(g^i)^2}}\, g^t$$

| $g^0$ | $g^1$ | $g^2$ | $g^3$ | $g^4$ | …… |
|---|---|---|---|---|---|
| 0.001 | 0.001 | 0.003 | 0.002 | 0.1 | …… |

Here $g^4$ is especially large by contrast with the history, so the step is amplified.

| $g^0$ | $g^1$ | $g^2$ | $g^3$ | $g^4$ | …… |
|---|---|---|---|---|---|
| 10.8 | 20.9 | 31.7 | 12.1 | 0.1 | …… |

Here $g^4$ is especially small by contrast, so the step is damped.

($g^t = \partial L(\theta^t)/\partial w$, $\eta^t = \eta/\sqrt{t+1}$.)
Larger gradient, larger steps?

Consider $y = ax^2 + bx + c$, with $\left|\partial y/\partial x\right| = |2ax + b|$ and minimizer $x = -b/(2a)$.

[Figure: the parabola, a point $x_0$, its distance $|x_0 + b/(2a)|$ to the minimizer, and the first-derivative magnitude $|2ax_0 + b|$.]

The best step from $x_0$ is the distance to the minimizer:

$$\left|x_0 + \frac{b}{2a}\right| = \frac{|2ax_0 + b|}{2a}$$

which is proportional to the first derivative. So, for a single parameter, a larger first-order derivative means being farther from the minimum.
Comparison between different parameters

[Figure: contour plot of the loss over $w_1$ (shallow direction) and $w_2$ (steep direction); points a and b lie along $w_1$, points c and d along $w_2$, with a > b and c > d.]

Within one parameter, a larger first-order derivative does mean being farther from the minimum (a > b along $w_1$, c > d along $w_2$). But the rule does not hold across parameters: along the steeper $w_2$ direction, a point can have a larger derivative than a point along $w_1$ while being closer to the minimum.
Second Derivative

$$y = ax^2 + bx + c, \qquad \left|\frac{\partial y}{\partial x}\right| = |2ax+b|, \qquad \frac{\partial^2 y}{\partial x^2} = 2a$$

[Figure: the parabola with minimizer $-b/(2a)$, a point $x_0$, its distance $|x_0 + b/(2a)|$, and the first-derivative magnitude $|2ax_0 + b|$.]

The best step is

$$\frac{|2ax_0+b|}{2a} = \frac{|\text{First derivative}|}{\text{Second derivative}}$$
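A quick numeric check of the formula above: for a quadratic, a step of size |first derivative| / second derivative from $x_0$ lands exactly on the minimizer. The values of $a$, $b$, $x_0$ are arbitrary illustrations:

```python
# Numeric check of the "best step" formula for y = a x^2 + b x + c:
# stepping from x0 by (first derivative) / (second derivative) lands
# exactly on the minimizer -b/(2a). a, b, x0 are arbitrary illustrations.
a, b = 2.0, -8.0                 # minimizer at -b/(2a) = 2.0
x0 = 5.0
first = 2 * a * x0 + b           # first derivative at x0 (signed)
second = 2 * a                   # second derivative (constant for a quadratic)
x1 = x0 - first / second         # step of size |first| / second toward the minimum
print(x1)  # 2.0
```

For a quadratic this is exact in one step; for general losses it is only the local ideal, which is the point of the slide.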
Comparison between different parameters

[Figure: the same contour plot over $w_1$ and $w_2$, with points a, b along $w_1$ and c, d along $w_2$ (a > b, c > d).]

The best step is |First derivative| / Second derivative. The $w_1$ direction has the smaller second derivative and the $w_2$ direction the larger one, so dividing each first derivative by its own second derivative makes the step sizes comparable across parameters.
The best step is |First derivative| / Second derivative — but second derivatives can be expensive to compute, so Adagrad uses first derivatives to estimate the second derivative: sampled first derivatives tend to be small where the second derivative is smaller (the $w_1$ direction) and large where it is larger (the $w_2$ direction), so $\sqrt{\sum_{i=0}^{t}(g^i)^2}$ acts as a cheap proxy for the second derivative in

$$w^{t+1} \leftarrow w^t - \frac{\eta}{\sqrt{\sum_{i=0}^{t}(g^i)^2}}\, g^t$$
Gradient Descent Tip 2: Stochastic Gradient Descent

Make the training faster.
Stochastic Gradient Descent

Gradient descent: $\theta^{i+1} = \theta^i - \eta\nabla L(\theta^i)$, where the loss is the summation over all training examples:

$$L = \sum_n \left(\hat{y}^n - \left(b + \sum_i w_i x_i^n\right)\right)^2$$

Stochastic gradient descent: pick an example $x^n$ and use the loss for only that one example — faster!

$$\theta^{i+1} = \theta^i - \eta\nabla L^n(\theta^i), \qquad L^n = \left(\hat{y}^n - \left(b + \sum_i w_i x_i^n\right)\right)^2$$
• Demo
Stochastic Gradient Descent

Gradient descent: see all examples, then update once after seeing all of them.

Stochastic gradient descent: see only one example at a time and update for each example. If there are 20 examples, one pass gives 20 updates — 20 times faster.
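The contrast above can be sketched on a tiny linear model $y = b + wx$. The dataset (true relation $y = 2x + 1$) and the learning rate are illustrative assumptions, not from the slides:

```python
# Sketch contrasting the two update rules on the linear model y = b + w*x.
# The toy dataset (true relation y = 2x + 1) and the learning rate are
# illustrative assumptions, not from the slides.
data = [(x, 2.0 * x + 1.0) for x in [0.0, 1.0, 2.0, 3.0]]

def grad_one(w, b, x, y):
    """Gradient of L^n = (y - (b + w x))^2 for a single example."""
    err = y - (b + w * x)
    return -2.0 * err * x, -2.0 * err

def gd(epochs=500, eta=0.02):
    w = b = 0.0
    for _ in range(epochs):                    # one update after seeing ALL examples
        gw = sum(grad_one(w, b, x, y)[0] for x, y in data)
        gb = sum(grad_one(w, b, x, y)[1] for x, y in data)
        w, b = w - eta * gw, b - eta * gb
    return w, b

def sgd(epochs=500, eta=0.02):
    w = b = 0.0
    for _ in range(epochs):
        for x, y in data:                      # one update PER example
            gw, gb = grad_one(w, b, x, y)
            w, b = w - eta * gw, b - eta * gb
    return w, b

print(gd(), sgd())  # both approach (w, b) = (2, 1)
```

Per pass over the data, SGD performs as many updates as there are examples, which is the "20 times faster" point on the slide.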
Gradient Descent Tip 3: Feature Scaling
Feature Scaling

Make different features have the same scaling.

$$y = b + w_1 x_1 + w_2 x_2$$

[Figure: distributions of $x_1$ and $x_2$ before and after scaling. Source of figure: http://cs231n.github.io/neural-networks-2/]
Feature Scaling

$$y = b + w_1 x_1 + w_2 x_2$$

If $x_1$ takes values like 1, 2, …… while $x_2$ takes values like 100, 200, ……, then a small change in $w_2$ changes the loss $L$ far more than the same change in $w_1$, so the contours of $L$ in the $(w_1, w_2)$ plane are elongated ellipses and gradient descent zig-zags. After scaling both features to values like 1, 2, ……, the contours are close to circles and the update points toward the minimum — training is easier.
Feature Scaling

Given examples $x^1, x^2, x^3, \ldots, x^r, \ldots, x^R$, for each dimension $i$: compute the mean $m_i$ and the standard deviation $\sigma_i$ of the values $x_i^1, x_i^2, \ldots$, then

$$x_i^r \leftarrow \frac{x_i^r - m_i}{\sigma_i}$$

The means of all dimensions are 0, and the variances are all 1.
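The per-dimension standardization above can be sketched directly; the three two-dimensional examples are illustrative assumptions:

```python
# A sketch of per-dimension standardization; the three two-dimensional
# examples are illustrative assumptions.
import math

def standardize(examples):
    """For each dimension i: x_i^r <- (x_i^r - m_i) / sigma_i."""
    n, dims = len(examples), len(examples[0])
    out = [list(x) for x in examples]
    for i in range(dims):
        col = [x[i] for x in examples]
        m = sum(col) / n                                       # mean m_i
        sigma = math.sqrt(sum((v - m) ** 2 for v in col) / n)  # std dev sigma_i
        for r in range(n):
            out[r][i] = (col[r] - m) / sigma
    return out

scaled = standardize([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
# after scaling, every dimension has mean 0 and variance 1
```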
Gradient Descent: Theory
Question

• When solving $\theta^* = \arg\min_\theta L(\theta)$ by gradient descent, each time we update the parameters, we obtain a $\theta$ that makes $L(\theta)$ smaller:

$$L(\theta^0) > L(\theta^1) > L(\theta^2) > \cdots$$

Is this statement correct?
Warning of Math
Formal Derivation

• Suppose that $\theta$ has two variables $\{\theta_1, \theta_2\}$.

[Figure: contour plot of $L(\theta)$ over $(\theta_1, \theta_2)$ with a starting point $\theta^0$ and a small circle drawn around it.]

Given a point, we can easily find the point with the smallest value nearby (within a small circle). How?
Taylor Series

• Taylor series: let $h(x)$ be any function infinitely differentiable around $x = x_0$:

$$h(x) = \sum_{k=0}^{\infty} \frac{h^{(k)}(x_0)}{k!}(x - x_0)^k = h(x_0) + h'(x_0)(x - x_0) + \frac{h''(x_0)}{2!}(x - x_0)^2 + \cdots$$

When $x$ is close to $x_0$:

$$h(x) \approx h(x_0) + h'(x_0)(x - x_0)$$
E.g. the Taylor series for $h(x) = \sin(x)$ around $x_0 = \pi/4$:

$$\sin(x) = \sin\!\left(\tfrac{\pi}{4}\right) + \cos\!\left(\tfrac{\pi}{4}\right)\!\left(x - \tfrac{\pi}{4}\right) - \frac{\sin(\pi/4)}{2!}\left(x - \tfrac{\pi}{4}\right)^2 - \cdots$$

[Figure: $\sin(x)$ together with its truncated Taylor expansions.] The approximation is good around $\pi/4$.
Multivariable Taylor Series

$$h(x, y) = h(x_0, y_0) + \frac{\partial h(x_0, y_0)}{\partial x}(x - x_0) + \frac{\partial h(x_0, y_0)}{\partial y}(y - y_0)$$
$$\qquad +\ \text{something related to } (x - x_0)^2 \text{ and } (y - y_0)^2 + \cdots$$

When $x$ and $y$ are close to $x_0$ and $y_0$:

$$h(x, y) \approx h(x_0, y_0) + \frac{\partial h(x_0, y_0)}{\partial x}(x - x_0) + \frac{\partial h(x_0, y_0)}{\partial y}(y - y_0)$$
Back to Formal Derivation

Based on the Taylor series: if the red circle centered at $(a, b)$ is small enough, then inside the circle

$$L(\theta) \approx s + u(\theta_1 - a) + v(\theta_2 - b)$$

where

$$s = L(a, b), \qquad u = \frac{\partial L(a, b)}{\partial \theta_1}, \qquad v = \frac{\partial L(a, b)}{\partial \theta_2}$$
Back to Formal Derivation

Based on the Taylor series: if the red circle is small enough, inside it

$$L(\theta) \approx s + u(\theta_1 - a) + v(\theta_2 - b), \qquad s = L(a,b),\quad u = \frac{\partial L(a,b)}{\partial \theta_1},\quad v = \frac{\partial L(a,b)}{\partial \theta_2}$$

Find $\theta_1$ and $\theta_2$ in the red circle minimizing $L(\theta)$:

$$(\theta_1 - a)^2 + (\theta_2 - b)^2 \le d^2$$

Since $s$ is a constant, we only need to minimize the remaining linear terms. Simple, right?
Gradient descent — two variables

Red circle (if the radius $d$ is small): $L(\theta) \approx s + u(\theta_1 - a) + v(\theta_2 - b)$, to be minimized over $(\theta_1 - a)^2 + (\theta_2 - b)^2 \le d^2$.

Writing $(\Delta\theta_1, \Delta\theta_2) = (\theta_1 - a, \theta_2 - b)$, we are minimizing the inner product of $(\Delta\theta_1, \Delta\theta_2)$ with $(u, v)$, so the best choice points in the direction opposite to $(u, v)$ and extends to the boundary of the circle:

$$\begin{bmatrix}\Delta\theta_1\\ \Delta\theta_2\end{bmatrix} = -\eta\begin{bmatrix}u\\ v\end{bmatrix} \qquad\Longrightarrow\qquad \begin{bmatrix}\theta_1\\ \theta_2\end{bmatrix} = \begin{bmatrix}a\\ b\end{bmatrix} - \eta\begin{bmatrix}u\\ v\end{bmatrix}$$
Back to Formal Derivation

Find $\theta_1$ and $\theta_2$ yielding the smallest value of $L(\theta)$ in the circle:

$$\begin{bmatrix}\theta_1\\ \theta_2\end{bmatrix} = \begin{bmatrix}a\\ b\end{bmatrix} - \eta\begin{bmatrix}u\\ v\end{bmatrix} = \begin{bmatrix}a\\ b\end{bmatrix} - \eta\begin{bmatrix}\partial L(a,b)/\partial\theta_1\\ \partial L(a,b)/\partial\theta_2\end{bmatrix}$$

This is gradient descent.

The Taylor approximation $L(\theta) \approx s + u(\theta_1 - a) + v(\theta_2 - b)$ is not satisfied if the red circle (i.e. the learning rate) is not small enough — so the loss is not guaranteed to decrease at every update. You can also consider the second-order term, e.g. Newton's method.
End of Warning
More Limitations of Gradient Descent

[Figure: loss $L$ as a function of a parameter $w$, showing three trouble spots: a plateau where $\partial L/\partial w \approx 0$ (very slow progress), a saddle point where $\partial L/\partial w = 0$ (stuck), and a local minimum where $\partial L/\partial w = 0$ (stuck).]
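The "stuck where the derivative vanishes" behavior is easy to reproduce. Here $f(w) = w^3$, an illustrative function not from the slides, has a saddle point at $w = 0$ where $f'(0) = 0$:

```python
# A sketch of getting stuck: f(w) = w^3 (an illustrative function, not
# from the slides) has a saddle point at w = 0 where f'(0) = 0, and
# gradient descent started just right of it crawls to a near-halt.
def grad(w):
    return 3.0 * w * w          # derivative of w^3

w = 0.01
for _ in range(1000):
    w -= 0.1 * grad(w)          # steps shrink as the gradient vanishes
print(w)  # still between 0 and 0.01: effectively stuck near the saddle
```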
Acknowledgement

• Thanks to Victor Chen for spotting typos in the slides.