Adam: A Method for Stochastic Optimization
Diederik P. Kingma, Jimmy Ba
Presented by Dor Ringel (v2, 2018-08-20)
Content

• Background
  • Supervised ML theory and the importance of optimum finding
  • Gradient descent and its variants
  • Limitations of SGD
• Earlier approaches / building blocks
  • Momentum
  • Nesterov accelerated gradient (NAG)
  • AdaGrad (Adaptive Gradient)
  • AdaDelta and RMSProp
• Adam
  • Update rule
  • Bias correction
  • AdaMax
• Post-Adam innovations
  • Improving Adam
  • Additional approaches
  • Shampoo
Basic Supervised Machine Learning Terminology
Notation      Explanation
X             Instance space
Y             Label space
X ~ D         Unknown probability distribution
f : X → Y     True mapping between instance and label spaces, unknown
Examples of instance and label spaces

X                                                      Y
The space of all RGB images of some dimension          Is there a cat in the image? ({0,1})
The space of a stock’s historical price sequences      The stock’s next day’s closing price ([0,∞))
The space of all finite Chinese sentences              The set of all finite English sentences
The space of all information regarding two companies   The probability of a merger being successful
The space of all MRI images                            A probability, location, and type of a tumor
The space of all finite-length voice sequences         The corresponding Amazon product being referred to
…                                                      …
Basic Supervised Machine Learning Terminology
Notation                                     Explanation
X                                            Instance space
Y                                            Label space
X ~ D                                        Unknown probability distribution
f : X → Y                                    True mapping between instance and label spaces, unknown
{(x_i, y_i)}_{i=1}^{n}, x_i ∈ X, y_i ∈ Y     Training set
h : X → Y, h ∈ H                             Hypothesis, the object we wish to learn
Basic Supervised Machine Learning theory
• The goal is to find a hypothesis h that approximates f as well as possible. But approximates it in what sense?
• We need a way to evaluate a hypothesis’ quality.
Basic Supervised Machine Learning theory
Let us define a loss function ℓ : Y × Y → [0, ∞)

• For example:
  • Zero-one loss: 1[h(x_i) ≠ y_i]
  • Quadratic loss: (h(x_i) − y_i)²
  (Here we use y_i = f(x_i).)
• A measure of “how bad” the hypothesis did on a single sample.
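As a minimal illustration (the function names are mine, not from the slides), the two example losses can be written directly:

```python
def zero_one_loss(h_x, y):
    """Zero-one loss: 1 if the prediction differs from the label, else 0."""
    return 1 if h_x != y else 0

def quadratic_loss(h_x, y):
    """Quadratic loss: squared difference between prediction and label."""
    return (h_x - y) ** 2

# A classification sample the hypothesis gets wrong, and a regression sample
print(zero_one_loss(0, 1))       # → 1
print(quadratic_loss(0.8, 1))    # squared error of 0.2
```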
Basic Supervised Machine Learning theory
Let us also define the Generalization error (a.k.a. Risk) of a hypothesis h:

L_D(h) = E_{x~D}[ℓ(h(x), f(x))]

• A measure of “how bad” the hypothesis did over the entire instance space.
• A good hypothesis is one with a low Risk value.
Basic Supervised Machine Learning theory

Unfortunately, the true Risk L_D(h) cannot be computed, because the distribution D is unknown to the learning algorithm.
We can, however, compute a proxy of the true Risk, called the Empirical Risk.
Basic Supervised Machine Learning theory

Let us define the Empirical Risk of a hypothesis h:

L_S(h) = (1/n) Σ_{i=1}^{n} ℓ(h(x_i), f(x_i))

• Recall: {(x_i, y_i)}_{i=1}^{n}, x_i ∈ X, y_i ∈ Y is the Training set
• L_D(h) = E_{x~D}[ℓ(h(x), f(x))]
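The Empirical Risk is just the average of per-sample losses over the training set. A small sketch (names are my own):

```python
def empirical_risk(h, loss, samples):
    """Average loss of hypothesis h over a finite set of (x, y) samples."""
    return sum(loss(h(x), y) for x, y in samples) / len(samples)

# Toy example: hypothesis h(x) = 2x on three labeled samples
samples = [(1, 2), (2, 4), (3, 7)]              # (x_i, y_i) pairs
quadratic = lambda p, y: (p - y) ** 2
risk = empirical_risk(lambda x: 2 * x, quadratic, samples)
print(risk)  # only the third sample contributes: (6 - 7)^2 / 3
```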
Empirical Risk Minimization (ERM) strategy

• After mountains of theory (PAC Learning, VC Theory, etc.), the following theorem is proven (stated very informally):
The “best” strategy for a learning algorithm is to minimize the Empirical Risk
Empirical Risk Minimization strategy (ERM)

This means that the learning algorithm defined by the ERM principle boils down to solving the following optimization problem:

Find ĥ such that:

ĥ = argmin_{h∈H} L_S(h) = argmin_{h∈H} (1/n) Σ_{i=1}^{n} ℓ(h(x_i), f(x_i))
The hypothesis class can be very simple
• When x ∈ ℝ³ and y ∈ {−1, 1}
• H is the class of all three-dimensional hyperplanes
• h = w₀ + w₁x₁ + w₂x₂ + w₃x₃
• θ = {w₀, w₁, w₂, w₃} ∈ ℝ⁴
The hypothesis class can be very complex
• Recent years’ deep learning architectures result in models with tens of millions of parameters: θ ∈ ℝ^{O(10⁷)}
The bottom line
Machine Learning applications require us to find good local minima of extremely high-dimensional, noisy, non-convex functions.

ĥ = argmin_{h∈H} (1/n) Σ_{i=1}^{n} ℓ(h(x_i), f(x_i))
We need a principled method for achieving this goal.
Introducing – The Gradient method
Questions?
A couple of notes before we head on
• I stick to the papers’ notations.
• From here on we’ll use J(θ), which is the same as L_S(h).
Introducing – The Gradient method
Input: learning rate η, tolerance parameter ε > 0

Initialization: pick θ₀ ∈ ℝᵈ arbitrarily

General Step:
• Set θ_{t+1} = θ_t − η ∇_θ J(θ_t)
• If ‖∇_θ J(θ_{t+1})‖ ≤ ε, then STOP, and θ_{t+1} is the output
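The general step above can be sketched as a plain Python loop (a minimal sketch; the gradient function and the max-iteration cap are my own framing):

```python
import math

def gradient_descent(grad, theta0, lr=0.1, tol=1e-6, max_iters=10_000):
    """Repeat theta <- theta - lr * grad(theta) until the gradient norm <= tol."""
    theta = list(theta0)
    for _ in range(max_iters):
        g = grad(theta)
        theta = [t - lr * gi for t, gi in zip(theta, g)]
        # Stopping rule: gradient norm small enough at the new point
        if math.sqrt(sum(gi * gi for gi in grad(theta))) <= tol:
            break
    return theta

# Minimize J(theta) = (theta_0 - 3)^2 + (theta_1 + 1)^2
grad = lambda th: [2 * (th[0] - 3), 2 * (th[1] + 1)]
print(gradient_descent(grad, [0.0, 0.0]))  # approaches (3, -1)
```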
Gradient descent example – simple Linear regression
X = ℝ, Y = ℝ

{(x_i, y_i)}_{i=1}^{n}, x_i ∈ X, y_i ∈ Y

H = {mx + b | m ∈ ℝ, b ∈ ℝ}, θ = (m, b)

h = mx + b
The goal is to find “good” m and b values.
Gradient descent example – simple Linear regression
h = mx + b

ℓ(h(x), y) = (y − h(x))²

L_S(h) = (1/n) Σ_{i=1}^{n} ℓ(h(x_i), f(x_i))
       = (1/n) Σ_{i=1}^{n} (y_i − h(x_i))²
       = (1/n) Σ_{i=1}^{n} (y_i − (m x_i + b))²
GD example – computing the gradient

• For m:

∂/∂m L_S(h) = ∂/∂m (1/n) Σ_{i=1}^{n} (y_i − (m x_i + b))²
            = (1/n) Σ_{i=1}^{n} ∂/∂m (y_i − (m x_i + b))²
            = (1/n) Σ_{i=1}^{n} 2 (y_i − (m x_i + b)) (−x_i)

∂J/∂m = −(2/n) Σ_{i=1}^{n} x_i (y_i − (m x_i + b))

• Similarly for b:

∂J/∂b = −(2/n) Σ_{i=1}^{n} (y_i − (m x_i + b))
GD example – computing the gradient

So the gradient vector is:

∇_θ J(θ) = (∂J/∂m, ∂J/∂b)
         = ( −(2/n) Σ_{i=1}^{n} x_i (y_i − (m x_i + b)),  −(2/n) Σ_{i=1}^{n} (y_i − (m x_i + b)) )
GD example – the complete algorithm
Input: learning rate η, tolerance parameter ε > 0

Initialization: pick θ₀ = (m, b) ∈ ℝ² arbitrarily

General Step:
• Set θ_{t+1} = θ_t − η ∇_θ J(θ_t):
  • m_{t+1} = m_t + η (2/n) Σ_{i=1}^{n} x_i (y_i − (m x_i + b))
  • b_{t+1} = b_t + η (2/n) Σ_{i=1}^{n} (y_i − (m x_i + b))
• If ‖∇_θ J(θ_{t+1})‖ ≤ ε, then STOP, and θ_{t+1} is the output
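Putting the gradient formulas and the stopping rule together, the whole linear-regression example fits in a few lines (a sketch; function and variable names are my own choosing):

```python
import math

def fit_line(xs, ys, lr=0.01, tol=1e-8, max_iters=100_000):
    """Gradient descent for simple linear regression y ≈ m*x + b."""
    m, b, n = 0.0, 0.0, len(xs)
    for _ in range(max_iters):
        residuals = [y - (m * x + b) for x, y in zip(xs, ys)]
        grad_m = -(2 / n) * sum(x * r for x, r in zip(xs, residuals))
        grad_b = -(2 / n) * sum(residuals)
        m, b = m - lr * grad_m, b - lr * grad_b
        if math.hypot(grad_m, grad_b) <= tol:   # gradient small: stop
            break
    return m, b

# Noise-free data generated from y = 2x + 1: GD should recover m ≈ 2, b ≈ 1
m, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(m, b)
```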
GD example – visualized
Variants of Gradient descent

• Differ in how much data we use to compute the gradient of the objective function.
• A trade-off between the accuracy of the update and the computation time per update.
Batch Gradient descent

• Computes the gradient of the function w.r.t. the parameters θ for the entire training dataset.

θ = θ − η · ∇_θ J(θ; x^(1:n), y^(1:n))
Batch Gradient descent – Pros and Cons

Pros
• Guaranteed to converge to the global minimum for convex surfaces, and to a local minimum otherwise.
• An unbiased estimate of the gradient.

Cons
• Possibly slow or impossible to compute.
• Some examples may be redundant.
• Converges to the minimum of the basin the parameters are placed in.
Stochastic Gradient descent (SGD)

• Computes the gradient of the function w.r.t. the parameters θ for a single training sample.

θ = θ − η · ∇_θ J(θ; x^(i), y^(i))
Stochastic Gradient descent – Pros and Cons
Pros
• Much faster to compute.
• Potential to jump to better basins (and better local minima).

Cons
• High variance that causes the objective to fluctuate heavily.
Mini-batch Gradient descent

• Computes the gradient of the function w.r.t. the parameters θ for a mini-batch of n training samples.
• n is usually 32 to 256.

θ = θ − η · ∇_θ J(θ; x^(i:i+n), y^(i:i+n))
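One epoch of the mini-batch scheme can be sketched like this (the shuffling and slicing are my own framing; the per-batch update is the one above):

```python
import random

def minibatch_epoch(params, grad_fn, data, lr=0.01, batch_size=32):
    """One pass over the data, updating params once per mini-batch."""
    data = data[:]                        # copy so the caller's list is untouched
    random.shuffle(data)                  # mini-batches are sampled randomly
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        grad = grad_fn(params, batch)     # gradient estimated on this batch only
        params = [p - lr * g for p, g in zip(params, grad)]
    return params
```

Here `grad_fn` is expected to average the per-sample gradients over the batch, so each update is a noisy but cheap estimate of the full-batch gradient.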
Mini-batch Gradient descent – Pros and Cons
• The “best” of both worlds: fast, explorational, and allows for stable convergence.
• Makes use of highly optimized matrix-operation libraries and hardware.
• The method of choice for most Supervised Machine Learning scenarios.
Variants of Gradient descent - visualizations
It’s all Mini-batch from here

• The remainder of the presentation will focus on variants of the Mini-batch version.
• From here on, Gradient descent, SGD, and Gradient step all refer to the Mini-batch variant.
• We’ll leave out the parameters x^(i:i+n), y^(i:i+n) for simplicity.
Challenges and limitations of the plain SGD
• Choosing a proper learning rate.
• Sharing the learning rate across all parameters.
• Optimization in the face of highly non-convex functions.
Questions?
Novelties over the plain SGD

• Momentum
• Nesterov accelerated gradient (NAG)
• AdaGrad (Adaptive Gradient)
• AdaDelta and RMSProp
We will focus only on algorithms that are feasible to compute in practice for high dimensional data sets (and will ignore second-order methods such as Newton’s method).
Momentum (Qian, N. 1999)
• Plain SGD can make erratic updates on non-smooth loss functions.
  • Consider an outlier example which “throws off” the learning process.
• We need to maintain some history of updates.
• Physics example:
  • A moving ball acquires “momentum”, at which point it becomes less sensitive to the direct force (the gradient).
Momentum (Qian, N. 1999)

• Add a fraction γ (usually about 0.9) of the update vector of the past time step to the current update vector.
• Faster convergence and reduced oscillations.

v_t = γ v_{t−1} + η ∇_θ J(θ)
θ = θ − v_t
  = θ − γ v_{t−1} − η ∇_θ J(θ)

Notation:
J(θ): objective function
θ ∈ ℝᵈ: parameters
∇_θ J(θ): gradient vector
η: learning rate
Plain SGD: θ = θ − η · ∇_θ J(θ)
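The two momentum equations translate directly into an update function (a sketch; the velocity bookkeeping and names are mine):

```python
def momentum_step(theta, velocity, grad, lr=0.01, gamma=0.9):
    """v <- gamma*v + lr*grad;  theta <- theta - v  (element-wise)."""
    velocity = [gamma * v + lr * g for v, g in zip(velocity, grad)]
    theta = [t - v for t, v in zip(theta, velocity)]
    return theta, velocity

theta, vel = [1.0], [0.0]
for _ in range(200):
    grad = [2 * theta[0]]                  # gradient of J(x) = x^2
    theta, vel = momentum_step(theta, vel, grad)
print(theta)  # approaches the minimum at 0
```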
Momentum (Qian, N. 1999)
(a) SGD without momentum (b) SGD with momentum
Nesterov accelerated gradient (Nesterov, Y. 1983)

• Momentum is usually pretty high once we get near our goal point.
• The algorithm has no idea when to slow down, and therefore might miss the goal point.
• We would like our momentum to have a kind of foresight.
Momentum has no idea when to slow down
Nesterov accelerated gradient (Nesterov, Y. 1983)

• First make a jump based on our previous momentum, calculate the gradient, and then make a correction.
• Look ahead by calculating the gradient not w.r.t. our current parameters, but w.r.t. the approximate future position of our parameters.

v_t = γ v_{t−1} + η ∇_θ J(θ − γ v_{t−1})
θ = θ − v_t

Notation:
J(θ): objective function
θ ∈ ℝᵈ: parameters
∇_θ J(θ): gradient vector
η: learning rate
Plain SGD: θ = θ − η · ∇_θ J(θ)
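The only change from the momentum sketch is where the gradient is evaluated: at the look-ahead point θ − γv (a sketch; names are mine):

```python
def nag_step(theta, velocity, grad_fn, lr=0.01, gamma=0.9):
    """Nesterov step: evaluate the gradient at the look-ahead point."""
    lookahead = [t - gamma * v for t, v in zip(theta, velocity)]
    grad = grad_fn(lookahead)              # gradient at the anticipated position
    velocity = [gamma * v + lr * g for v, g in zip(velocity, grad)]
    theta = [t - v for t, v in zip(theta, velocity)]
    return theta, velocity

theta, vel = [1.0], [0.0]
for _ in range(200):
    theta, vel = nag_step(theta, vel, lambda th: [2 * th[0]])  # J(x) = x^2
print(theta)  # approaches 0
```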
Nesterov accelerated gradient (Nesterov, Y. 1983)
Adagrad (Duchi, J., Hazan, E., & Singer, Y. 2011)
• Now we are able to adapt our updates to the slope.
• But updates are the same for all the parameters being updated.
• We would like to adapt our updates to each individual parameter, depending on its importance.
Adagrad (Duchi, J., Hazan, E., & Singer, Y. 2011)
• Perform larger updates for infrequent parameters and smaller updates for frequent ones.
• Use a different learning rate for every parameter θᵢ, at every time step t.
• Well-suited for dealing with sparse data.
• Eliminates the need to manually tune the learning rate (most just use 0.01).
Adagrad (Duchi, J., Hazan, E., & Singer, Y. 2011)
g_{t,i} = ∇_θ J(θ_{t,i})

θ_{t+1,i} = θ_{t,i} − (η / √(G_{t,ii} + ε)) · g_{t,i}

Notation:
J(θ): objective function
θ ∈ ℝᵈ: parameters
∇_θ J(θ): gradient vector
η: learning rate
Plain SGD: θ = θ − η · ∇_θ J(θ)
Adagrad (Duchi, J., Hazan, E., & Singer, Y. 2011)
g_{t,i} = ∇_θ J(θ_{t,i})

θ_{t+1,i} = θ_{t,i} − (η / √(G_{t,ii} + ε)) · g_{t,i}

In vector form: θ_{t+1} = θ_t − (η / √(G_t + ε)) ⊙ g_t

• g_{t,i}: the gradient w.r.t. the parameter θᵢ, at time step t.
• G_t ∈ ℝ^{d×d}: a diagonal matrix, where G_{t,ii} is the sum of squared gradients w.r.t. θᵢ up to time step t.
• ε: prevents division by zero (on the order of 1e-8).
Adagrad vs. Plain SGD
Adagrad: θ_{t+1,i} = θ_{t,i} − (η / √(G_{t,ii} + ε)) · ∇_θ J(θ_{t,i})

Plain SGD: θ_{t+1,i} = θ_{t,i} − η · ∇_θ J(θ_{t,i})
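Per-parameter scaling is just an extra accumulator of squared gradients on top of the plain SGD step (a sketch; variable names are my own):

```python
import math

def adagrad_step(theta, accum, grad, lr=0.01, eps=1e-8):
    """accum_i += g_i^2;  theta_i -= lr / sqrt(accum_i + eps) * g_i."""
    accum = [a + g * g for a, g in zip(accum, grad)]
    theta = [t - lr / math.sqrt(a + eps) * g
             for t, a, g in zip(theta, accum, grad)]
    return theta, accum

theta, accum = [1.0, 1.0], [0.0, 0.0]
for _ in range(500):
    grad = [2 * theta[0], 200 * theta[1]]   # very different curvatures
    theta, accum = adagrad_step(theta, accum, grad)
print(theta)  # both coordinates shrink despite the 100x curvature gap
```

Note how the coordinate with large gradients accumulates a large denominator and therefore takes proportionally smaller raw steps, which is exactly the per-parameter adaptation described above.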
Adadelta (Zeiler, M. D. 2012) and RMSProp (Hinton)
• In Adagrad, the accumulated sum keeps growing. This causes the learning rate to shrink until it becomes infinitesimally small, impeding convergence.
• We need an efficient way of taming this aggressive, monotonically decreasing learning rate.
• Instead, recursively define a decaying average of past squared gradients.
• The running average at time step $t$ then depends only on the previous average and the current gradient.
! "# $ = &! "# $'( + 1− & "$#
,$-( = ,$ − .1
! "# $ + /0 "$
1 , − 234567895 :;<6782<
, ∈ >? − @ABAC575BD∇F1 , − "BAG85<7 95672B
. − H5AB<8<" BA75, = , − . 0 ∇F1 ,; J, L
& − C2C5<7;C 625:.
"$ = ∇FN1 ,$
! "# $ = &! "# $'( + 1− & "$#
,$-( = ,$ − .1
! "# $ + /0 "$
• ! "# $ – the running average of squared gradient, for time step 1.• 2 – prevents division by zero (in the order of 1e-8)
3 , − 4567819:7 ;<=8194=
, ∈ ?@ − ABCBD717CE
∇G3 , − "CBH97=1 :7814C
. − I7BC=9=" CB17
, = , − . 0 ∇G3 ,; K, M
& − D4D7=1<D 847;.
"$ = ∇GO3 ,$
Adadelta (Zeiler, M. D. 2012)
Visualizations of the discussed algorithms
Questions?
Adam (Kingma, D. P., & Ba, J. 2014)
• RMSProp allows for an adaptive per-parameter update, but the update itself is still made with the current, "noisy" gradient.
• We would like to replace the gradient itself with a similar exponentially decaying average of past gradients.
Adam – Update rule
!" = $%!"&% + (1 − $%),"
-" = $.-"&% + (1 − $.),".
/"0% = /" − 11
-" + 23 !"
• !" and -" are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients respectively.• Recommended values in the paper are $% = 0.9, $. = 0.999, 8 = 19 − 8
; / − <=>9?@A-9 BCD?@A<D
/ ∈ FG − HIJI!9@9JK
∇M; / −,JINA9D@ -9?@<J
1 −O9IJDAD,JI@9
/ =/−1 3 ∇M; /;Q,R
," = ∇MS; /"
Adam – Bias towards zero
• as !" and #" are initialized as 0’s, they are biased towards zero.
• Most significant in the initial steps.
• Most significant when $% and $& are close to 1.
• A correction is required.
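A quick numeric check of the bias (a sketch of my own, not from the slides): with a constant gradient $g = 1$ the true mean is 1, the raw average equals $1 - \beta_1^t$, and dividing by that factor recovers the mean exactly.

```python
beta1, g, m = 0.9, 1.0, 0.0
for t in range(1, 6):
    m = beta1 * m + (1 - beta1) * g     # raw (biased) first moment
    corrected = m / (1 - beta1 ** t)    # bias-corrected estimate
    print(f"t={t}: m_t={m:.3f}  corrected={corrected:.3f}")
# m_t starts at 0.100, far below the true mean of 1.0;
# the corrected estimate is 1.000 at every step.
```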
Adam – Bias correction
!"# ="#
1−'(#, *+# =
+#1 −',#
-#.( = -# − /1*+# + 1
2 !"#
• Correct both moments to get the final update rule.
3 - − 456789:+7 ;<=89:4=- ∈ ?@ − ABCB"797CD
∇F3 - −GCBH:7=9 +7894C/ −I7BC=:=GCB97
G# =∇FJ3 -#
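Putting the pieces together, a minimal sketch of one full bias-corrected Adam step (names are mine; this mirrors Algorithm 1 of the paper):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad        # biased first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # biased second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = adam_step(0.0, 1.0, 0.0, 0.0, t=1)
# Thanks to the correction, the very first step has size ~ lr (0.001),
# regardless of the raw gradient's scale.
```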
Adam vs. Adadelta
Adam: !"#$ = !" − '$
()*#+, -."
Adadelta: !"#$ = !" − '$
)*#+, /"
0 ! − 123456784 9:;5671;
! ∈ => − ?@[email protected]
∇D0 ! − /A@E74;6 84561A
' − F4@A;7;/ A@64
/" = ∇D*0 !"
." = G$."H$ + (1 − G$)/"
8" = GM8"H$ + (1 − GM)/"M
Adam - Performance
Questions?
Improving Adam

• Nadam – incorporating Nesterov momentum into Adam.
• AdamW – decoupling weight decay from the gradient update:

$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \cdot \hat{m}_t - \eta \lambda \theta_t$

• AMSGrad – fixing the exponential moving average:

$\hat{v}_t = \max(\hat{v}_{t-1}, v_t), \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \cdot m_t$

• Uses the maximum of past squared gradients instead of their exponential moving average.
• Adam with warm restarts.
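As an illustration of the AMSGrad fix only (a sketch under my own naming, not the authors' code): keeping a running elementwise maximum of $v_t$ means the effective per-parameter learning rate can never increase.

```python
import numpy as np

def amsgrad_step(theta, grad, m, v, v_max, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    v_max = np.maximum(v_max, v)              # never let the denominator shrink
    theta = theta - lr * m / (np.sqrt(v_max) + eps)
    return theta, m, v, v_max

theta, m, v, v_max = 0.0, 0.0, 0.0, 0.0
theta, m, v, v_max = amsgrad_step(theta, 1.0, m, v, v_max)   # large gradient
prev_v_max = v_max
theta, m, v, v_max = amsgrad_step(theta, 0.0, m, v, v_max)   # gradient vanishes
# v_max holds its old value even though v itself has decayed.
```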
Additional approaches

• Snapshot ensembles
• Learning to optimize
Shampoo (Gupta, V., Koren, T., & Singer, Y. 2018)
Summary

• A brief walkthrough of supervised machine learning.
• An argument for the importance and relevance of gradient methods.
• An overview of modern gradient descent optimization algorithms.
• The contribution of Adam.
• Innovations that followed Adam.
Questions?
The End
Links

• http://ruder.io/optimizing-gradient-descent/index.html
• http://ruder.io/deep-learning-optimization-2017/
• https://imgur.com/a/Hqolp
• https://towardsdatascience.com/gradient-descent-algorithm-and-its-variants-10f652806a3
• https://stats.stackexchange.com/questions/153531/what-is-batch-size-in-neural-network
• https://mathematica.stackexchange.com/questions/9928/identifying-critical-points-lines-of-2-3d-image-cubes
• https://towardsdatascience.com/stochastic-gradient-descent-with-momentum-a84097641a5d
• https://distill.pub/2017/momentum/
• https://towardsdatascience.com/applied-deep-learning-part-1-artificial-neural-networks-d7834f67a4f6
• https://meetshah1995.github.io/semantic-segmentation/deep-learning/pytorch/visdom/2017/06/01/semantic-segmentation-over-the-years.html
• https://github.com/mattnedrich/GradientDescentExample
References

• Ruder, S. (2016). An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747.
• Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., & Bengio, Y. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. arXiv, pp. 1–14.
• Dozat, T. (2016). Incorporating Nesterov Momentum into Adam. ICLR Workshop, (1):2013–2016.
• Kingma, D. P., & Ba, J. L. (2015). Adam: A Method for Stochastic Optimization. International Conference on Learning Representations, pp. 1–13.
• Nesterov, Y. A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). Doklady AN SSSR (translated as Soviet Math. Dokl.), 269:543–547.
• Qian, N. (1999). On the momentum term in gradient descent learning algorithms. Neural Networks, 12(1):145–151.
• Zeiler, M. D. (2012). ADADELTA: An Adaptive Learning Rate Method. arXiv preprint arXiv:1212.5701.
• Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.