Lecture 5 recap
Neural Network

[Figure: a neural network, annotated with its depth (number of layers) and its width (number of neurons per layer)]
Gradient Descent for Neural Networks
[Figure: two-layer network with inputs $x_0, x_1, x_2$, hidden units $h_0, \dots, h_3$, outputs $y_0, y_1$, and targets $t_0, t_1$]

$h_j = A\left(b_{0,j} + \sum_k x_k w_{0,j,k}\right) \qquad y_i = A\left(b_{1,i} + \sum_j h_j w_{1,i,j}\right)$

$L_i = (y_i - t_i)^2$

$\nabla_{w,b} f_{x,t}(w) = \left[\frac{\partial f}{\partial w_{0,0,0}}, \dots, \frac{\partial f}{\partial w_{l,m,n}}, \dots, \frac{\partial f}{\partial b_{l,m}}\right]$

Just simple: $A(x) = \max(0, x)$
Stochastic Gradient Descent (SGD)

$\theta^{k+1} = \theta^k - \alpha \nabla_\theta L(\theta^k, x_{\{1..m\}}, y_{\{1..m\}})$

$\nabla_\theta L = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta L_i$ (gradient for the $k$-th batch)

$k$ now refers to the $k$-th iteration; $m$ is the number of training samples in the current batch.

Note the terminology: iteration vs. epoch.
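As a minimal NumPy-style sketch of one SGD step (the names `grad_fn`, `x_batch`, and `y_batch` are illustrative assumptions, not from the slides):

```python
def sgd_step(theta, grad_fn, x_batch, y_batch, alpha=1e-2):
    """One SGD iteration: step against the gradient averaged over the batch."""
    grad = grad_fn(theta, x_batch, y_batch)  # (1/m) * sum of per-sample gradients
    return theta - alpha * grad
```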
Gradient Descent with Momentum

$v^{k+1} = \beta \cdot v^k + \nabla_\theta L(\theta^k)$
(velocity $v$; accumulation rate $\beta$, the 'friction' or momentum; $\nabla_\theta L(\theta^k)$ is the gradient of the current minibatch)

$\theta^{k+1} = \theta^k - \alpha \cdot v^{k+1}$
(model update: learning rate $\alpha$ times velocity)

Exponentially-weighted average of gradients. Important: the velocity $v^k$ is vector-valued!
Gradient Descent with Momentum

$\theta^{k+1} = \theta^k - \alpha \cdot v^{k+1}$

The step will be largest when a sequence of gradients all point in the same direction.

Fig. credit: I. Goodfellow

Hyperparameters are $\alpha$ and $\beta$; $\beta$ is often set to 0.9.
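A minimal sketch mirroring the two update equations above (variable names are illustrative; `grad` is the current minibatch gradient):

```python
def momentum_step(theta, v, grad, alpha=1e-2, beta=0.9):
    """Momentum update: velocity accumulates gradients; model moves along velocity."""
    v = beta * v + grad           # exponentially-weighted average of gradients
    theta = theta - alpha * v
    return theta, v
```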
RMSProp

$s^{k+1} = \beta \cdot s^k + (1 - \beta)\left[\nabla_\theta L \circ \nabla_\theta L\right]$
($\circ$ denotes element-wise multiplication)

$\theta^{k+1} = \theta^k - \alpha \cdot \frac{\nabla_\theta L}{\sqrt{s^{k+1}} + \epsilon}$

Hyperparameters: $\alpha$ (needs tuning!), $\beta$ (often 0.9), $\epsilon$ (typically $10^{-8}$).
RMSProp

[Figure: elongated loss surface; small gradients in the x-direction, large gradients in the y-direction. Fig. credit: A. Ng]

$s^{k+1} = \beta \cdot s^k + (1 - \beta)\left[\nabla_\theta L \circ \nabla_\theta L\right]$

$\theta^{k+1} = \theta^k - \alpha \cdot \frac{\nabla_\theta L}{\sqrt{s^{k+1}} + \epsilon}$

We are dividing by the (root of the) squared gradients: the denominator is large in the y-direction and small in the x-direction, so steps shrink where gradients oscillate and grow where they are small.

$s$ is the (uncentered) variance of the gradients -> the second moment. Can increase the learning rate!
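A minimal sketch of the RMSProp update, assuming NumPy arrays for parameters and gradients:

```python
import numpy as np

def rmsprop_step(theta, s, grad, alpha=1e-3, beta=0.9, eps=1e-8):
    """RMSProp: scale each gradient component by its running root-mean-square."""
    s = beta * s + (1.0 - beta) * grad * grad     # second moment, element-wise
    theta = theta - alpha * grad / (np.sqrt(s) + eps)
    return theta, s
```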
Adaptive Moment Estimation (Adam)

Combines Momentum and RMSProp:

$m^{k+1} = \beta_1 \cdot m^k + (1 - \beta_1)\,\nabla_\theta L(\theta^k)$
(first moment: mean of the gradients)

$v^{k+1} = \beta_2 \cdot v^k + (1 - \beta_2)\left[\nabla_\theta L(\theta^k) \circ \nabla_\theta L(\theta^k)\right]$
(second moment: variance of the gradients)

$\theta^{k+1} = \theta^k - \alpha \cdot \frac{m^{k+1}}{\sqrt{v^{k+1}} + \epsilon}$
Adam

Combines Momentum and RMSProp:

$m^{k+1} = \beta_1 \cdot m^k + (1 - \beta_1)\,\nabla_\theta L(\theta^k)$

$v^{k+1} = \beta_2 \cdot v^k + (1 - \beta_2)\left[\nabla_\theta L(\theta^k) \circ \nabla_\theta L(\theta^k)\right]$

$m^{k+1}$ and $v^{k+1}$ are initialized with zero -> biased towards zero. Hence, typically bias-corrected moment updates:

$\hat{m}^{k+1} = \frac{m^{k+1}}{1 - \beta_1^{k+1}} \qquad \hat{v}^{k+1} = \frac{v^{k+1}}{1 - \beta_2^{k+1}}$

$\theta^{k+1} = \theta^k - \alpha \cdot \frac{\hat{m}^{k+1}}{\sqrt{\hat{v}^{k+1}} + \epsilon}$
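Putting the pieces together as a sketch (the iteration index `k` starts at 0; the defaults follow the common Adam settings):

```python
import numpy as np

def adam_step(theta, m, v, grad, k, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: momentum-style first moment + RMSProp-style second moment + bias correction."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1 ** (k + 1))   # undo the bias towards zero
    v_hat = v / (1.0 - beta2 ** (k + 1))
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```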
Convergence
Importance of Learning Rate
Jacobian and Hessian

• Derivative
• Gradient
• Jacobian
• Hessian <- SECOND DERIVATIVE
Newton's method

• Approximate our function by a second-order Taylor series expansion, with a first-derivative term and a second-derivative (curvature) term.

https://en.wikipedia.org/wiki/Taylor_series
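The expansion itself appeared on the slide only as an image; for reference, the standard multivariate second-order form it refers to (with gradient $\nabla_\theta L$ and Hessian $H$):

$$L(\theta) \approx L(\theta^k) + (\theta - \theta^k)^\top \nabla_\theta L(\theta^k) + \tfrac{1}{2}(\theta - \theta^k)^\top H(\theta^k)(\theta - \theta^k)$$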
Newton's method

• Differentiate and equate to zero -> update step.

We got rid of the learning rate (compare against SGD, which needs $\alpha$)!
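The update-step equations were likewise images; reconstructed from the expansion above (a standard derivation, stated here as an assumption about the slide's exact content):

$$\nabla_\theta L(\theta^k) + H(\theta^k)(\theta - \theta^k) = 0 \;\Rightarrow\; \theta^{k+1} = \theta^k - H(\theta^k)^{-1}\nabla_\theta L(\theta^k), \qquad \text{vs. SGD: } \theta^{k+1} = \theta^k - \alpha\,\nabla_\theta L(\theta^k)$$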
Newton's method

• Differentiate and equate to zero -> update step.

But consider the sizes involved: a network has millions of parameters. For $N$ parameters, the Hessian has $N^2$ elements, and the computational complexity of the 'inversion' per iteration is on the order of $N^3$.
Newton's method

• SGD (green) vs. Newton's method: Newton's method exploits the curvature to take a more direct route.

Image from Wikipedia
Newton's method

Can you apply Newton's method to linear regression? What do you get as a result?
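(One answer, added here for completeness: a least-squares loss is exactly quadratic, so the second-order Taylor expansion is exact and Newton's method converges in a single step, recovering the normal equations.)

$$L(\theta) = \lVert X\theta - y\rVert^2, \quad \nabla L = 2X^\top(X\theta - y), \quad H = 2X^\top X \;\Rightarrow\; \theta^1 = \theta^0 - H^{-1}\nabla L(\theta^0) = (X^\top X)^{-1}X^\top y$$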
BFGS and L-BFGS

• Broyden-Fletcher-Goldfarb-Shanno algorithm
• Belongs to the family of quasi-Newton methods
• Maintains an approximation of the inverse of the Hessian
• BFGS; limited-memory variant: L-BFGS
Gauss-Newton

• Newton: $x^{k+1} = x^k - H_f(x^k)^{-1} \nabla f(x^k)$
  – 'true' 2nd derivatives are often hard to obtain (e.g., numerics)
  – approximate: $H_f \approx 2 J_F^\top J_F$

• Gauss-Newton (GN): $x^{k+1} = x^k - \left[2 J_F(x^k)^\top J_F(x^k)\right]^{-1} \nabla f(x^k)$

• Solve the linear system (again, inverting a matrix is unstable):
  $2\, J_F(x^k)^\top J_F(x^k)\left(x^k - x^{k+1}\right) = \nabla f(x^k)$
  -> solve for the delta vector $x^k - x^{k+1}$
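A minimal NumPy sketch of one GN step for a least-squares objective $f(x) = \lVert F(x)\rVert^2$ (the helpers `residual_fn` and `jacobian_fn` are illustrative assumptions):

```python
import numpy as np

def gauss_newton_step(x, residual_fn, jacobian_fn):
    """One GN step: solve 2 J^T J * delta = grad f instead of inverting the matrix."""
    F = residual_fn(x)            # residual vector F(x)
    J = jacobian_fn(x)            # Jacobian of F at x
    grad = 2.0 * J.T @ F          # gradient of f(x) = ||F(x)||^2
    # Solve the linear system for the delta vector (x_k - x_{k+1}):
    delta = np.linalg.solve(2.0 * J.T @ J, grad)
    return x - delta
```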
Levenberg

• Levenberg: a "damped" version of Gauss-Newton:
  $\left[J_F(x^k)^\top J_F(x^k) + \lambda \cdot I\right]\left(x^k - x^{k+1}\right) = \nabla f(x^k)$
  (the $\lambda \cdot I$ term acts as a Tikhonov regularization)

• The damping factor $\lambda$ is adjusted in each iteration, ensuring $f(x^k) > f(x^{k+1})$:
  – if the inequality is not fulfilled, increase $\lambda$
  – a trust-region method

• "Interpolation" between Gauss-Newton (small $\lambda$) and Gradient Descent (large $\lambda$)
Levenberg-Marquardt

• Levenberg-Marquardt (LM):
  $\left[J_F(x^k)^\top J_F(x^k) + \lambda \cdot \mathrm{diag}\left(J_F(x^k)^\top J_F(x^k)\right)\right]\left(x^k - x^{k+1}\right) = \nabla f(x^k)$

• Instead of a plain Gradient Descent for large $\lambda$, scale each component of the gradient according to the curvature.
  – Avoids slow convergence in components with a small gradient.
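A sketch of the damping loop common to Levenberg / LM (the adjustment factor 10, the retry bound, and the helper names are conventional choices assumed here, not from the slides):

```python
import numpy as np

def lm_step(x, loss_fn, residual_fn, jacobian_fn, lam=1e-3):
    """One Levenberg-Marquardt iteration with simple damping adaptation."""
    F = residual_fn(x)
    J = jacobian_fn(x)
    grad = 2.0 * J.T @ F                         # gradient of ||F(x)||^2
    JTJ = J.T @ J
    for _ in range(50):                          # bounded retries
        A = JTJ + lam * np.diag(np.diag(JTJ))    # LM damping with diag(J^T J)
        delta = np.linalg.solve(A, grad)
        if loss_fn(x - delta) < loss_fn(x):      # accept: f(x_k) > f(x_{k+1})
            return x - delta, lam / 10.0         # relax damping
        lam *= 10.0                              # reject: increase damping
    return x, lam                                # give up; keep current iterate
```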
Which, what and when?

• Standard: Adam
• Fallback option: SGD with momentum
• Newton, L-BFGS, GN, LM only if you can do full-batch updates (doesn't work well for minibatches!!)
  – This practically never happens for DL. Theoretically it would be nice though, due to the fast convergence.
General Optimization

• Linear systems (Ax = b):
  – LU, QR, Cholesky, Jacobi, Gauss-Seidel, CG, PCG, etc.

• Non-linear (gradient-based):
  – Newton, Gauss-Newton, LM, (L-)BFGS <- second order
  – Gradient Descent, SGD <- first order

• Others:
  – Genetic algorithms, MCMC, Metropolis-Hastings, etc.
  – Constrained and convex solvers (Lagrange, ADMM, Primal-Dual, etc.)
Please Remember!

• Think about your problem and the optimization at hand
• SGD is specifically designed for minibatches
• When you can, use a 2nd-order method -> it's just faster
• GD or SGD is not a way to solve a linear system!
Importance of Learning Rate
Learning Rate

Need a high learning rate when far away from the optimum; need a low learning rate when close.
Learning Rate Decay

• $\alpha = \frac{1}{1 + \mathrm{decay\_rate} \cdot \mathrm{epoch}} \cdot \alpha_0$
  – E.g., $\alpha_0 = 0.1$, $\mathrm{decay\_rate} = 1.0$:
    -> Epoch 0: 0.1
    -> Epoch 1: 0.05
    -> Epoch 2: 0.033
    -> Epoch 3: 0.025
    ...
[Plot: learning rate over epochs, decaying from 0.1 towards 0]
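A quick sketch reproducing the example numbers above:

```python
def lr_decay(epoch, alpha0=0.1, decay_rate=1.0):
    """1/t learning-rate decay, as in the example above."""
    return alpha0 / (1.0 + decay_rate * epoch)

print([round(lr_decay(e), 3) for e in range(4)])  # [0.1, 0.05, 0.033, 0.025]
```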
Learning Rate Decay

Many options:

• Step decay: $\alpha = \alpha - t \cdot \alpha$ (applied only every n steps)
  – t is the decay rate (often 0.5)

• Exponential decay: $\alpha = t^{\mathrm{epoch}} \cdot \alpha_0$
  – t is the decay rate (t < 1.0)

• $\alpha = \frac{t}{\sqrt{\mathrm{epoch}}} \cdot \alpha_0$
  – t is the decay rate

• Etc.
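Minimal sketches of these schedules (note: the $\sqrt{\mathrm{epoch}}$ denominator above is a reconstruction of a garbled formula, so treat that form as an assumption):

```python
import math

def step_decay(alpha, t=0.5):
    """Step decay: alpha <- alpha - t * alpha, applied only every n epochs."""
    return alpha - t * alpha

def exp_decay(epoch, alpha0, t=0.95):
    """Exponential decay: alpha = t^epoch * alpha0, with t < 1.0."""
    return (t ** epoch) * alpha0

def sqrt_decay(epoch, alpha0, t=1.0):
    """alpha = t / sqrt(epoch) * alpha0 (for epoch >= 1)."""
    return t / math.sqrt(epoch) * alpha0
```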
Training Schedule

Manually specify the learning rate for the entire training process:

• Manually set the learning rate every n epochs
• How?
  – Trial and error (the hard way)
  – Some experience (only generalizes to some degree)

Consider: #epochs, training set size, network size, etc.
Learning Rate: Implications

• What if it's too high?
• What if it's too low?
Training

• Given a dataset with ground-truth labels: $\{x_i, y_i\}$
  – For instance, $x_i$ is the $i$-th training image, with label $y_i$
  – Often $\dim(x) \gg \dim(y)$ (e.g., for classification)
  – $i$ often ranges over hundreds of thousands or millions of samples

• Take the network $f$ and its parameters $w, b$

• Use SGD (or a variation) to find the optimal parameters $w, b$
  – Gradients come from backprop
Learning

• Learning means generalization to an unknown dataset
  – (so far, no 'real' learning)
  – I.e., train on a known dataset -> test with the optimized parameters on an unknown dataset

• Basically, we hope that, based on the training set, the optimized parameters will give similar results on different data (i.e., the test data)
Learning

• Training set ('train'):
  – Use for training your neural network

• Validation set ('val'):
  – Hyperparameter optimization
  – Check generalization progress

• Test set ('test'):
  – Only for the very end
  – NEVER TOUCH DURING DEVELOPMENT OR TRAINING
Learning

• Typical splits:
  – Train (60%), Val (20%), Test (20%)
  – Train (80%), Val (10%), Test (10%)

• During training:
  – The train error comes from the average minibatch error
  – Typically evaluate on a subset of the validation set every n iterations
Learning

• Training graph: plot accuracy and loss over the course of training (curves shown with EMA smoothing)
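The EMA smoothing mentioned here is the same exponentially-weighted average used by momentum; a minimal sketch (the factor 0.9 is a typical choice, not from the slide):

```python
def ema_smooth(values, beta=0.9):
    """Exponential moving average, e.g. to smooth a noisy loss curve for plotting."""
    smoothed, s = [], values[0]
    for v in values:
        s = beta * s + (1.0 - beta) * v
        smoothed.append(s)
    return smoothed
```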
Learning

• Validation graph
Over- and Underfitting

Underfitted | Appropriate | Overfitted

Figure extracted from Deep Learning by Adam Gibson, Josh Patterson, O'Reilly Media Inc., 2017
Over- and Underfitting
Source: http://srdas.github.io/DLBook/ImprovingModelGeneralization.html
Hyperparameters

• Network architecture (e.g., number of layers, #weights)
• Number of iterations
• Learning rate(s) (i.e., solver parameters, decay, etc.)
• Regularization (more in the next lecture)
• Batch size
• ...
• Overall: learning setup + optimization = hyperparameters
Hyperparameter Tuning

• Methods:
  – Manual search: most common
  – Grid search (structured, for 'real' applications):
    define ranges for all parameter spaces, select points (usually pseudo-uniformly distributed), and iterate over all possible configurations
  – Random search:
    like grid search, but pick the points at random within the predefined ranges
Simple Grid Search Example

    learning_rates = [1e-2, 1e-3, 1e-4, 1e-5]
    regularization_strengths = [1e2, 1e3, 1e4, 1e5]
    num_iters = [500, 1000, 1500]
    best_val = 0

    for learning_rate in learning_rates:
        for reg in regularization_strengths:
            for iterations in num_iters:
                model = train_model(learning_rate, reg, iterations)
                validation_accuracy = evaluate(model)
                if validation_accuracy > best_val:
                    best_val = validation_accuracy
                    best_model = model
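For comparison, the random-search variant from the previous slide as a sketch (`train_model` and `evaluate` are the same assumed helpers as above; sampling the exponent is the usual trick for ranges spanning orders of magnitude):

```python
import random

best_val, best_model = 0, None
for _ in range(20):                                # budget of 20 random configs
    learning_rate = 10 ** random.uniform(-5, -2)   # sample the exponent
    reg = 10 ** random.uniform(2, 5)
    iterations = random.choice([500, 1000, 1500])
    model = train_model(learning_rate, reg, iterations)
    validation_accuracy = evaluate(model)
    if validation_accuracy > best_val:
        best_val, best_model = validation_accuracy, model
```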
Cross Validation

• Example: k = 5

Figure extracted from cs231n
Cross Validation

• Used when the dataset is extremely small and/or our method of choice has low training times

• Partition the data into k subsets, train on k-1, and evaluate performance on the remaining subset

• To reduce variability: perform this on the different partitions and average the results (see the sketch below)
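A minimal sketch of the k-fold procedure (`train_model` and `evaluate` are assumed helpers, here taking the data explicitly; `X` and `y` are NumPy arrays):

```python
import numpy as np

def cross_validate(X, y, k=5):
    """k-fold cross validation: train on k-1 folds, evaluate on the held-out fold."""
    folds = np.array_split(np.random.permutation(len(X)), k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_model(X[train_idx], y[train_idx])
        scores.append(evaluate(model, X[val_idx], y[val_idx]))
    return np.mean(scores)   # average over partitions to reduce variability
```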
Cross Validation

[Plot: validation results for k = 5, one point per fold, over the hyperparameter value. Figure extracted from cs231n]
Basic recipe for machine learning
Basic recipe for machine learning

• Split your data: train (60%) | validation (20%) | test (20%)

• Find your hyperparameters
Basic recipe for machine learning

• Split your data: train (60%) | validation (20%) | test (20%)

  – Human-level error ....... 1%
  – Training set error ...... 5%
  – Val/test set error ...... 8%

  The gap between human-level and training error indicates bias (or underfitting); the gap between training and val/test error indicates variance (overfitting).
Basic recipe for machine learning

[Flowchart omitted. Credits: A. Ng]
Next lecture

• Monday: Deadline Ex1!

• Next Tuesday:
  – Discussion of the exercise solution and presentation of exercise 2

• Next lecture on Dec 6th:
  – Training Neural Networks
See you next week!