![Page 1: Acceleration and Momentum - Cornell University · Consequences of momentum analysis •Convergence rate depends only on momentum parameter β •Not on step size or curvature. •We](https://reader033.vdocument.in/reader033/viewer/2022042402/5f131484dba7421a010e68f5/html5/thumbnails/1.jpg)
Acceleration and Momentum
CS6787 Lecture 3, Fall 2017
![Page 2: Acceleration and Momentum - Cornell University · Consequences of momentum analysis •Convergence rate depends only on momentum parameter β •Not on step size or curvature. •We](https://reader033.vdocument.in/reader033/viewer/2022042402/5f131484dba7421a010e68f5/html5/thumbnails/2.jpg)
How does the step size affect convergence?
• Let’s go back to gradient descent
$$x_{t+1} = x_t - \alpha \nabla f(x_t)$$
• Simplest possible case: a quadratic function
$$f(x) = \frac{1}{2} x^2$$
$$x_{t+1} = x_t - \alpha x_t = (1 - \alpha) x_t$$
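The one-dimensional update can be checked numerically. The following is a minimal sketch, not part of the lecture; the function name `gradient_descent_1d` and the chosen step sizes are illustrative assumptions.

```python
def gradient_descent_1d(alpha, x0=1.0, steps=5):
    """Run gradient descent on f(x) = x**2 / 2, whose gradient is x.

    Hypothetical helper for illustration; returns the list of iterates.
    """
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(x - alpha * x)  # identical to (1 - alpha) * x
    return xs


# A few step sizes: each iterate is the previous one times (1 - alpha).
for alpha in (0.5, 1.0, 1.5, 2.5):
    print(alpha, [round(x, 4) for x in gradient_descent_1d(alpha)])
```

With `alpha = 1.0` the iterate reaches the optimum in one step, while any `alpha` above 2 makes `|1 - alpha|` exceed 1 and the iterates grow.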
![Page 3: Acceleration and Momentum - Cornell University · Consequences of momentum analysis •Convergence rate depends only on momentum parameter β •Not on step size or curvature. •We](https://reader033.vdocument.in/reader033/viewer/2022042402/5f131484dba7421a010e68f5/html5/thumbnails/3.jpg)
Step size vs. convergence: graphically
$$|x_{t+1} - 0| = |1 - \alpha| \, |x_t - 0|$$
[Plot: convergence rate (vertical axis, 0 to 1.8) versus step size (horizontal axis, 0 to 3)]
![Page 4: Acceleration and Momentum - Cornell University · Consequences of momentum analysis •Convergence rate depends only on momentum parameter β •Not on step size or curvature. •We](https://reader033.vdocument.in/reader033/viewer/2022042402/5f131484dba7421a010e68f5/html5/thumbnails/4.jpg)
What if the curvature is different?
$$f(x) = 2x^2, \qquad x_{t+1} = x_t - 4\alpha x_t = (1 - 4\alpha) x_t$$
[Plot: convergence rate (vertical axis, 0 to 1.8) versus step size (horizontal axis, 0 to 3); curves: previous f, new f]
![Page 5: Acceleration and Momentum - Cornell University · Consequences of momentum analysis •Convergence rate depends only on momentum parameter β •Not on step size or curvature. •We](https://reader033.vdocument.in/reader033/viewer/2022042402/5f131484dba7421a010e68f5/html5/thumbnails/5.jpg)
Step size vs. curvature
• For these one-dimensional quadratics, how we should set the step size depends on the curvature
  • More curvature → smaller ideal step size
• What about higher-dimensional problems?
  • Let’s look at a really simple quadratic that’s just a sum of our examples.
$$f(x, y) = \frac{1}{2} x^2 + 2y^2$$
![Page 6: Acceleration and Momentum - Cornell University · Consequences of momentum analysis •Convergence rate depends only on momentum parameter β •Not on step size or curvature. •We](https://reader033.vdocument.in/reader033/viewer/2022042402/5f131484dba7421a010e68f5/html5/thumbnails/6.jpg)
Simple two dimensional problem
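$$f(x, y) = \frac{1}{2} x^2 + 2y^2$$
• Gradient descent:
$$\begin{bmatrix} x_{t+1} \\ y_{t+1} \end{bmatrix} = \begin{bmatrix} x_t \\ y_t \end{bmatrix} - \alpha \begin{bmatrix} x_t \\ 4y_t \end{bmatrix} = \begin{bmatrix} 1 - \alpha & 0 \\ 0 & 1 - 4\alpha \end{bmatrix} \begin{bmatrix} x_t \\ y_t \end{bmatrix}$$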
![Page 7: Acceleration and Momentum - Cornell University · Consequences of momentum analysis •Convergence rate depends only on momentum parameter β •Not on step size or curvature. •We](https://reader033.vdocument.in/reader033/viewer/2022042402/5f131484dba7421a010e68f5/html5/thumbnails/7.jpg)
What’s the convergence rate?
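• Look at the worst-case contraction factor of the update
$$\max_{x,y} \frac{\left\| \begin{bmatrix} 1 - \alpha & 0 \\ 0 & 1 - 4\alpha \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} \right\|}{\left\| \begin{bmatrix} x \\ y \end{bmatrix} \right\|} = \max(|1 - \alpha|, |1 - 4\alpha|)$$
• Contraction is maximum of previous two values.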
![Page 8: Acceleration and Momentum - Cornell University · Consequences of momentum analysis •Convergence rate depends only on momentum parameter β •Not on step size or curvature. •We](https://reader033.vdocument.in/reader033/viewer/2022042402/5f131484dba7421a010e68f5/html5/thumbnails/8.jpg)
Convergence of two-dimensional quadratic
[Plot: convergence rate (vertical axis, 0 to 1.8) versus step size (horizontal axis, 0 to 3); curves: previous f, new f, two-dimensional f]
![Page 9: Acceleration and Momentum - Cornell University · Consequences of momentum analysis •Convergence rate depends only on momentum parameter β •Not on step size or curvature. •We](https://reader033.vdocument.in/reader033/viewer/2022042402/5f131484dba7421a010e68f5/html5/thumbnails/9.jpg)
What does this example show?
• We’d like to set the step size larger for the dimension with less curvature, and smaller for the dimension with more curvature.
• But we can’t, because there is only a single step-size parameter.
• There’s a trade-off
  • Optimal convergence rate is substantially worse than what we’d get in each scenario individually, where we converge in one iteration.
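To make the trade-off concrete, here is a small sketch that sweeps the step size for the two-dimensional example and picks the one with the smallest worst-case contraction. The grid resolution and the helper name `contraction_2d` are assumptions made for illustration.

```python
def contraction_2d(alpha):
    """Worst-case per-step contraction for f(x, y) = x**2 / 2 + 2 * y**2."""
    return max(abs(1 - alpha), abs(1 - 4 * alpha))


# Sweep a grid of candidate step sizes (hypothetical resolution of 0.001).
grid = [i / 1000 for i in range(1, 1000)]
best_alpha = min(grid, key=contraction_2d)

print("best step size:", best_alpha)
print("contraction at best step size:", contraction_2d(best_alpha))

# Each coordinate on its own could be solved in one step:
print("x alone, alpha = 1:", abs(1 - 1.0))
print("y alone, alpha = 0.25:", abs(1 - 4 * 0.25))
```

The best shared step size sits where the two curves cross, and the resulting factor is far from the zero contraction that each one-dimensional problem could achieve with its own step size.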
![Page 10: Acceleration and Momentum - Cornell University · Consequences of momentum analysis •Convergence rate depends only on momentum parameter β •Not on step size or curvature. •We](https://reader033.vdocument.in/reader033/viewer/2022042402/5f131484dba7421a010e68f5/html5/thumbnails/10.jpg)
For general quadratics
• For PSD symmetric A,
$$f(x) = \frac{1}{2} x^T A x$$
• Gradient descent has update step
$$x_{t+1} = x_t - \alpha A x_t = (I - \alpha A) x_t$$
• What does the convergence rate look like in general?
![Page 11: Acceleration and Momentum - Cornell University · Consequences of momentum analysis •Convergence rate depends only on momentum parameter β •Not on step size or curvature. •We](https://reader033.vdocument.in/reader033/viewer/2022042402/5f131484dba7421a010e68f5/html5/thumbnails/11.jpg)
Convergence rate for general quadratics
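Writing the eigendecomposition of A as $A = \sum_{i=1}^n \lambda_i u_i u_i^T$ with orthonormal eigenvectors $u_i$:
$$\max_x \frac{\|(I - \alpha A) x\|}{\|x\|} = \max_x \frac{1}{\|x\|} \left\| \left( I - \alpha \sum_{i=1}^n \lambda_i u_i u_i^T \right) x \right\|$$
$$= \max_x \frac{\left\| \sum_{i=1}^n (1 - \alpha\lambda_i) u_i u_i^T x \right\|}{\left\| \sum_{i=1}^n u_i u_i^T x \right\|}$$
$$= \max_i |1 - \alpha\lambda_i|$$
$$= \max(1 - \alpha\lambda_{\min},\; \alpha\lambda_{\max} - 1)$$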
![Page 12: Acceleration and Momentum - Cornell University · Consequences of momentum analysis •Convergence rate depends only on momentum parameter β •Not on step size or curvature. •We](https://reader033.vdocument.in/reader033/viewer/2022042402/5f131484dba7421a010e68f5/html5/thumbnails/12.jpg)
Optimal convergence rate
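• Minimize:
$$\max(1 - \alpha\lambda_{\min},\; \alpha\lambda_{\max} - 1)$$
• Optimal value occurs when
$$1 - \alpha\lambda_{\min} = \alpha\lambda_{\max} - 1 \;\Rightarrow\; \alpha = \frac{2}{\lambda_{\max} + \lambda_{\min}}$$
• Optimal rate is
$$\max(1 - \alpha\lambda_{\min},\; \alpha\lambda_{\max} - 1) = \frac{\lambda_{\max} - \lambda_{\min}}{\lambda_{\max} + \lambda_{\min}}$$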
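As a quick numeric check of the optimal step size, the sketch below evaluates the contraction factor $\max_i |1 - \alpha\lambda_i|$ directly and compares it with the closed form. The eigenvalues are a made-up example, and `gd_rate` and `optimal_gd_step` are hypothetical helper names.

```python
def gd_rate(alpha, eigenvalues):
    """Worst-case contraction max_i |1 - alpha * lambda_i| for gradient descent."""
    return max(abs(1 - alpha * lam) for lam in eigenvalues)


def optimal_gd_step(lam_min, lam_max):
    """Step size and rate obtained by balancing the two extreme eigenvalues."""
    alpha = 2 / (lam_max + lam_min)
    rate = (lam_max - lam_min) / (lam_max + lam_min)
    return alpha, rate


eigs = [1.0, 3.0, 10.0]  # assumed spectrum, for illustration only
alpha, rate = optimal_gd_step(min(eigs), max(eigs))

print("alpha:", alpha, "closed-form rate:", rate)
print("direct evaluation:", gd_rate(alpha, eigs))
print("slightly larger step:", gd_rate(1.05 * alpha, eigs))
print("slightly smaller step:", gd_rate(0.95 * alpha, eigs))
```

Only the smallest and largest eigenvalues matter here; the intermediate one never determines the maximum.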
![Page 13: Acceleration and Momentum - Cornell University · Consequences of momentum analysis •Convergence rate depends only on momentum parameter β •Not on step size or curvature. •We](https://reader033.vdocument.in/reader033/viewer/2022042402/5f131484dba7421a010e68f5/html5/thumbnails/13.jpg)
What affects this optimal rate?
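$$\text{rate} = \frac{\lambda_{\max} - \lambda_{\min}}{\lambda_{\max} + \lambda_{\min}} = \frac{\lambda_{\max}/\lambda_{\min} - 1}{\lambda_{\max}/\lambda_{\min} + 1} = \frac{\kappa - 1}{\kappa + 1}$$
• Here, $\kappa = \lambda_{\max} / \lambda_{\min}$ is called the condition number of the matrix A.
• Problems with larger condition numbers converge slower.
  • Called poorly conditioned.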
![Page 14: Acceleration and Momentum - Cornell University · Consequences of momentum analysis •Convergence rate depends only on momentum parameter β •Not on step size or curvature. •We](https://reader033.vdocument.in/reader033/viewer/2022042402/5f131484dba7421a010e68f5/html5/thumbnails/14.jpg)
Poorly conditioned problems
• Intuitively, these are problems that are highly curved in some directions but flat in others
• Happens pretty often in machine learning
  • Measure something unrelated → low curvature in that direction
  • Also affects stochastic gradient descent
• How do we deal with this?
![Page 15: Acceleration and Momentum - Cornell University · Consequences of momentum analysis •Convergence rate depends only on momentum parameter β •Not on step size or curvature. •We](https://reader033.vdocument.in/reader033/viewer/2022042402/5f131484dba7421a010e68f5/html5/thumbnails/15.jpg)
Momentum
![Page 16: Acceleration and Momentum - Cornell University · Consequences of momentum analysis •Convergence rate depends only on momentum parameter β •Not on step size or curvature. •We](https://reader033.vdocument.in/reader033/viewer/2022042402/5f131484dba7421a010e68f5/html5/thumbnails/16.jpg)
Motivation
• Can we tell the difference between the curved and flat directions using information that is already available to the algorithm?
• Idea: in the one-dimensional case, if the gradients are reversing sign, then the step size is too large
  • Because we’re over-shooting the optimum
  • And if the gradients stay in the same direction, then the step size is too small
• Can we leverage this to make steps smaller when gradients reverse sign and larger when gradients are consistently in the same direction?
![Page 17: Acceleration and Momentum - Cornell University · Consequences of momentum analysis •Convergence rate depends only on momentum parameter β •Not on step size or curvature. •We](https://reader033.vdocument.in/reader033/viewer/2022042402/5f131484dba7421a010e68f5/html5/thumbnails/17.jpg)
Polyak Momentum
• Add extra momentum term to gradient descent
$$x_{t+1} = x_t - \alpha \nabla f(x_t) + \beta (x_t - x_{t-1})$$
• Intuition: if current gradient step is in same direction as previous step, then move a little further in that direction.
  • And if it’s in the opposite direction, move less far.
• Also known as the heavy ball method.
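The Polyak update can be sketched in a few lines of plain Python. This is an illustrative implementation under stated assumptions, not the lecture's demo code: the function `heavy_ball`, the zero-momentum start ($x_{-1} = x_0$), and the parameter values are all choices made here.

```python
def heavy_ball(grad, x0, alpha, beta, steps):
    """Polyak momentum: x_{t+1} = x_t - alpha * grad(x_t) + beta * (x_t - x_{t-1}).

    Vectors are plain lists of floats. Assumption: the first step uses
    x_{-1} = x_0, so the momentum term starts at zero.
    """
    x_prev = list(x0)
    x = list(x0)
    for _ in range(steps):
        g = grad(x)
        x_next = [xi - alpha * gi + beta * (xi - pi)
                  for xi, gi, pi in zip(x, g, x_prev)]
        x_prev, x = x, x_next
    return x


def grad_f(p):
    """Gradient of the earlier example f(x, y) = x**2 / 2 + 2 * y**2."""
    x, y = p
    return [x, 4 * y]


# Hypothetical parameter values chosen only to show the call.
result = heavy_ball(grad_f, [1.0, 1.0], alpha=0.4, beta=0.1, steps=50)
print(result)
```

Setting `beta=0` recovers plain gradient descent, which makes it easy to compare the two on the same problem.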
![Page 18: Acceleration and Momentum - Cornell University · Consequences of momentum analysis •Convergence rate depends only on momentum parameter β •Not on step size or curvature. •We](https://reader033.vdocument.in/reader033/viewer/2022042402/5f131484dba7421a010e68f5/html5/thumbnails/18.jpg)
Momentum for 1D Quadratics
$$f(x) = \frac{\lambda}{2} x^2$$
• Momentum gradient descent gives
$$x_{t+1} = x_t - \alpha\lambda x_t + \beta (x_t - x_{t-1})$$
$$= (1 + \beta - \alpha\lambda) x_t - \beta x_{t-1}$$
![Page 19: Acceleration and Momentum - Cornell University · Consequences of momentum analysis •Convergence rate depends only on momentum parameter β •Not on step size or curvature. •We](https://reader033.vdocument.in/reader033/viewer/2022042402/5f131484dba7421a010e68f5/html5/thumbnails/19.jpg)
Characterizing momentum for 1D quadratics
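• Start with
$$x_{t+1} = (1 + \beta - \alpha\lambda) x_t - \beta x_{t-1}$$
• Trick: let
$$x_t = \beta^{t/2} z_t$$
$$\beta^{(t+1)/2} z_{t+1} = (1 + \beta - \alpha\lambda) \beta^{t/2} z_t - \beta \cdot \beta^{(t-1)/2} z_{t-1}$$
$$z_{t+1} = \frac{1 + \beta - \alpha\lambda}{\sqrt{\beta}} z_t - z_{t-1}$$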
![Page 20: Acceleration and Momentum - Cornell University · Consequences of momentum analysis •Convergence rate depends only on momentum parameter β •Not on step size or curvature. •We](https://reader033.vdocument.in/reader033/viewer/2022042402/5f131484dba7421a010e68f5/html5/thumbnails/20.jpg)
Characterizing momentum (continued)
• Let
$$u = \frac{1 + \beta - \alpha\lambda}{2\sqrt{\beta}}$$
• Then we get the simplified characterization
$$z_{t+1} = 2u z_t - z_{t-1}$$
• This makes $z_t$ a degree-t polynomial in u
![Page 21: Acceleration and Momentum - Cornell University · Consequences of momentum analysis •Convergence rate depends only on momentum parameter β •Not on step size or curvature. •We](https://reader033.vdocument.in/reader033/viewer/2022042402/5f131484dba7421a010e68f5/html5/thumbnails/21.jpg)
Chebyshev Polynomials
• If we initialize such that $z_0 = 1$, $z_1 = u$, then these are a special family of polynomials called the Chebyshev polynomials
$$z_{t+1} = 2u z_t - z_{t-1}$$
• Standard notation:
$$T_{t+1}(u) = 2u T_t(u) - T_{t-1}(u)$$
• These polynomials have an important property: for all t,
$$-1 \le u \le 1 \;\Rightarrow\; -1 \le z_t \le 1$$
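The recurrence and the boundedness property are easy to check numerically. The sketch below evaluates $T_t(u)$ with the three-term recurrence and compares it against the standard identity $T_t(\cos\theta) = \cos(t\theta)$; the function name `chebyshev` and the test points are illustrative choices.

```python
import math


def chebyshev(t, u):
    """Evaluate T_t(u) using T_0 = 1, T_1 = u, T_{k+1} = 2 u T_k - T_{k-1}."""
    prev, cur = 1.0, u
    if t == 0:
        return prev
    for _ in range(t - 1):
        prev, cur = cur, 2 * u * cur - prev
    return cur


# Inside [-1, 1] the values stay in [-1, 1] for every degree.
grid = [-1 + k / 50 for k in range(101)]
largest = max(abs(chebyshev(t, u)) for t in range(20) for u in grid)
print("largest |T_t(u)| on [-1, 1] for t < 20:", largest)

# Agreement with the cosine identity at an arbitrary angle.
theta = 0.4
print(chebyshev(7, math.cos(theta)), math.cos(7 * theta))

# Just outside the interval the polynomials grow quickly with t.
print("T_10(1.1):", chebyshev(10, 1.1))
```

The rapid growth outside $[-1, 1]$ is why the analysis needs $u$ to stay inside that interval.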
![Page 22: Acceleration and Momentum - Cornell University · Consequences of momentum analysis •Convergence rate depends only on momentum parameter β •Not on step size or curvature. •We](https://reader033.vdocument.in/reader033/viewer/2022042402/5f131484dba7421a010e68f5/html5/thumbnails/22.jpg)
Chebyshev Polynomials
[Plot: $T_0(u) = 1$; both axes from −1.5 to 1.5]
![Page 23: Acceleration and Momentum - Cornell University · Consequences of momentum analysis •Convergence rate depends only on momentum parameter β •Not on step size or curvature. •We](https://reader033.vdocument.in/reader033/viewer/2022042402/5f131484dba7421a010e68f5/html5/thumbnails/23.jpg)
Chebyshev Polynomials
[Plot: $T_1(u) = u$; both axes from −1.5 to 1.5]
![Page 24: Acceleration and Momentum - Cornell University · Consequences of momentum analysis •Convergence rate depends only on momentum parameter β •Not on step size or curvature. •We](https://reader033.vdocument.in/reader033/viewer/2022042402/5f131484dba7421a010e68f5/html5/thumbnails/24.jpg)
Chebyshev Polynomials
[Plot: $T_2(u) = 2u^2 - 1$; both axes from −1.5 to 1.5]
![Page 25: Acceleration and Momentum - Cornell University · Consequences of momentum analysis •Convergence rate depends only on momentum parameter β •Not on step size or curvature. •We](https://reader033.vdocument.in/reader033/viewer/2022042402/5f131484dba7421a010e68f5/html5/thumbnails/25.jpg)
Chebyshev Polynomials
[Plot: Chebyshev polynomial; both axes from −1.5 to 1.5]
![Page 29: Acceleration and Momentum - Cornell University · Consequences of momentum analysis •Convergence rate depends only on momentum parameter β •Not on step size or curvature. •We](https://reader033.vdocument.in/reader033/viewer/2022042402/5f131484dba7421a010e68f5/html5/thumbnails/29.jpg)
Characterizing momentum (continued)
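• What does this mean for our 1D quadratics?
  • Recall that we let
$$x_t = \beta^{t/2} z_t$$
• So
$$x_t = \beta^{t/2} \cdot x_0 \cdot T_t(u) = \beta^{t/2} \cdot x_0 \cdot T_t\!\left( \frac{1 + \beta - \alpha\lambda}{2\sqrt{\beta}} \right)$$
$$-1 \le \frac{1 + \beta - \alpha\lambda}{2\sqrt{\beta}} \le 1 \;\Rightarrow\; |x_t| \le \beta^{t/2} |x_0|$$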
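The closed form can be compared against the momentum recurrence directly. This sketch assumes the first iterate is set to $x_1 = \sqrt{\beta}\, u\, x_0$, which is what the initialization $z_0 = 1$, $z_1 = u$ (scaled by $x_0$) implies; the values of `lam`, `alpha`, and `beta` are arbitrary illustrative choices, and the helper names are hypothetical.

```python
import math


def chebyshev(t, u):
    """Evaluate T_t(u) using T_0 = 1, T_1 = u, T_{k+1} = 2 u T_k - T_{k-1}."""
    prev, cur = 1.0, u
    if t == 0:
        return prev
    for _ in range(t - 1):
        prev, cur = cur, 2 * u * cur - prev
    return cur


def heavy_ball_1d(lam, alpha, beta, x0, steps):
    """Iterate x_{t+1} = (1 + beta - alpha*lam) x_t - beta x_{t-1}.

    Assumption: x_1 = sqrt(beta) * u * x0, matching z_0 = 1, z_1 = u.
    Returns the iterates x_0..x_steps and the value of u.
    """
    u = (1 + beta - alpha * lam) / (2 * math.sqrt(beta))
    xs = [x0, math.sqrt(beta) * u * x0]
    for _ in range(steps - 1):
        xs.append((1 + beta - alpha * lam) * xs[-1] - beta * xs[-2])
    return xs, u


lam, alpha, beta, x0 = 2.0, 0.5, 0.25, 1.0  # illustrative values
xs, u = heavy_ball_1d(lam, alpha, beta, x0, steps=20)

for t in (0, 1, 2, 5, 10, 20):
    closed_form = beta ** (t / 2) * x0 * chebyshev(t, u)
    bound = beta ** (t / 2) * abs(x0)
    print(t, xs[t], closed_form, bound)
```

For these values $u$ lies inside $[-1, 1]$, so every iterate stays within the $\beta^{t/2} |x_0|$ envelope.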
![Page 30: Acceleration and Momentum - Cornell University · Consequences of momentum analysis •Convergence rate depends only on momentum parameter β •Not on step size or curvature. •We](https://reader033.vdocument.in/reader033/viewer/2022042402/5f131484dba7421a010e68f5/html5/thumbnails/30.jpg)
Consequences of momentum analysis
• Convergence rate depends only on momentum parameter β
  • Not on step size or curvature.
• We don’t need to be that precise in setting the step size
  • It just needs to be within a window
  • Pointed out in “YellowFin and the Art of Momentum Tuning” by Zhang et al.
• If we have a multidimensional quadratic problem, the convergence rate will be the same in all directions
  • This is different from the gradient descent case where we had a trade-off
![Page 31: Acceleration and Momentum - Cornell University · Consequences of momentum analysis •Convergence rate depends only on momentum parameter β •Not on step size or curvature. •We](https://reader033.vdocument.in/reader033/viewer/2022042402/5f131484dba7421a010e68f5/html5/thumbnails/31.jpg)
Choosing the parameters
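• How should we set the step size and momentum parameter if we only have bounds on λ?
• Need:
$$-1 \le \frac{1 + \beta - \alpha\lambda}{2\sqrt{\beta}} \le 1$$
• Suffices to have:
$$-1 = \frac{1 + \beta - \alpha\lambda_{\max}}{2\sqrt{\beta}} \quad \text{and} \quad \frac{1 + \beta - \alpha\lambda_{\min}}{2\sqrt{\beta}} = 1$$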
![Page 32: Acceleration and Momentum - Cornell University · Consequences of momentum analysis •Convergence rate depends only on momentum parameter β •Not on step size or curvature. •We](https://reader033.vdocument.in/reader033/viewer/2022042402/5f131484dba7421a010e68f5/html5/thumbnails/32.jpg)
Choosing the parameters (continued)
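• Adding both equations:
$$0 = \frac{2 + 2\beta - \alpha\lambda_{\max} - \alpha\lambda_{\min}}{2\sqrt{\beta}}$$
$$0 = 2 + 2\beta - \alpha\lambda_{\max} - \alpha\lambda_{\min}$$
$$\alpha = \frac{2 + 2\beta}{\lambda_{\max} + \lambda_{\min}}$$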
![Page 33: Acceleration and Momentum - Cornell University · Consequences of momentum analysis •Convergence rate depends only on momentum parameter β •Not on step size or curvature. •We](https://reader033.vdocument.in/reader033/viewer/2022042402/5f131484dba7421a010e68f5/html5/thumbnails/33.jpg)
Choosing the parameters (continued)
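• Subtracting both equations:
$$\frac{1 + \beta - \alpha\lambda_{\min} - 1 - \beta + \alpha\lambda_{\max}}{2\sqrt{\beta}} = 2$$
$$\frac{\alpha(\lambda_{\max} - \lambda_{\min})}{2\sqrt{\beta}} = 2$$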
![Page 34: Acceleration and Momentum - Cornell University · Consequences of momentum analysis •Convergence rate depends only on momentum parameter β •Not on step size or curvature. •We](https://reader033.vdocument.in/reader033/viewer/2022042402/5f131484dba7421a010e68f5/html5/thumbnails/34.jpg)
Choosing the parameters (continued)
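• Combining these results:
$$\alpha = \frac{2 + 2\beta}{\lambda_{\max} + \lambda_{\min}}, \qquad \frac{\alpha(\lambda_{\max} - \lambda_{\min})}{2\sqrt{\beta}} = 2$$
$$\frac{2 + 2\beta}{\lambda_{\max} + \lambda_{\min}} \cdot \frac{\lambda_{\max} - \lambda_{\min}}{2\sqrt{\beta}} = 2$$
$$0 = 1 - 2\sqrt{\beta}\, \frac{\lambda_{\max} + \lambda_{\min}}{\lambda_{\max} - \lambda_{\min}} + \beta$$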
![Page 35: Acceleration and Momentum - Cornell University · Consequences of momentum analysis •Convergence rate depends only on momentum parameter β •Not on step size or curvature. •We](https://reader033.vdocument.in/reader033/viewer/2022042402/5f131484dba7421a010e68f5/html5/thumbnails/35.jpg)
Choosing the parameters (continued)
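• Quadratic formula:
$$0 = 1 - 2\sqrt{\beta}\, \frac{\lambda_{\max} + \lambda_{\min}}{\lambda_{\max} - \lambda_{\min}} + \beta$$
$$\sqrt{\beta} = \frac{\kappa + 1}{\kappa - 1} - \sqrt{\left( \frac{\kappa + 1}{\kappa - 1} \right)^2 - 1}$$
$$= \frac{\kappa + 1}{\kappa - 1} - \sqrt{\frac{4\kappa}{\kappa^2 - 2\kappa + 1}}$$
$$= \frac{\kappa + 1}{\kappa - 1} - \frac{2\sqrt{\kappa}}{\kappa - 1} = \frac{(\sqrt{\kappa} - 1)^2}{\kappa - 1} = \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}$$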
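The derived parameters can be sanity-checked by plugging them back into the two endpoint conditions. The sketch below is illustrative only: `momentum_parameters` and `u_value` are hypothetical helper names, and the eigenvalue bounds are made up.

```python
import math


def momentum_parameters(lam_min, lam_max):
    """Step size and momentum from the derivation, given eigenvalue bounds."""
    kappa = lam_max / lam_min
    sqrt_beta = (math.sqrt(kappa) - 1) / (math.sqrt(kappa) + 1)
    beta = sqrt_beta ** 2
    alpha = (2 + 2 * beta) / (lam_max + lam_min)
    return alpha, beta


def u_value(lam, alpha, beta):
    """The Chebyshev argument u = (1 + beta - alpha*lam) / (2 sqrt(beta))."""
    return (1 + beta - alpha * lam) / (2 * math.sqrt(beta))


lam_min, lam_max = 1.0, 4.0  # assumed bounds, for illustration
alpha, beta = momentum_parameters(lam_min, lam_max)

print("alpha:", alpha, "beta:", beta)
print("u at lam_min:", u_value(lam_min, alpha, beta))
print("u at lam_max:", u_value(lam_max, alpha, beta))
print("u in between:", u_value(2.5, alpha, beta))
```

The two extreme curvatures land on the edges of $[-1, 1]$, and every curvature between them lands strictly inside, so the same $\sqrt{\beta}$ rate applies to all directions.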
![Page 36: Acceleration and Momentum - Cornell University · Consequences of momentum analysis •Convergence rate depends only on momentum parameter β •Not on step size or curvature. •We](https://reader033.vdocument.in/reader033/viewer/2022042402/5f131484dba7421a010e68f5/html5/thumbnails/36.jpg)
Gradient Descent versus Momentum
• Recall: gradient descent had a convergence rate of
$$\frac{\kappa - 1}{\kappa + 1}$$
• But with momentum, the optimal rate is
$$\sqrt{\beta} = \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}$$
• This is called convergence at an accelerated rate
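To get a feel for what the accelerated rate buys, the following sketch counts how many iterations each per-step rate needs to shrink the error by a fixed factor. It takes the two rates at face value as geometric contraction factors; the tolerance and the list of condition numbers are arbitrary assumptions.

```python
import math


def gd_rate(kappa):
    """Optimal gradient descent rate (kappa - 1) / (kappa + 1)."""
    return (kappa - 1) / (kappa + 1)


def momentum_rate(kappa):
    """Optimal momentum rate (sqrt(kappa) - 1) / (sqrt(kappa) + 1)."""
    return (math.sqrt(kappa) - 1) / (math.sqrt(kappa) + 1)


def iterations_to(tol, rate):
    """Smallest integer t (up to rounding) with rate**t <= tol."""
    return math.ceil(math.log(tol) / math.log(rate))


tol = 1e-6  # illustrative target reduction in error
for kappa in (10, 100, 10000):
    print(kappa,
          iterations_to(tol, gd_rate(kappa)),
          iterations_to(tol, momentum_rate(kappa)))
```

The gradient descent count grows roughly in proportion to $\kappa$, while the momentum count grows roughly in proportion to $\sqrt{\kappa}$.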
![Page 37: Acceleration and Momentum - Cornell University · Consequences of momentum analysis •Convergence rate depends only on momentum parameter β •Not on step size or curvature. •We](https://reader033.vdocument.in/reader033/viewer/2022042402/5f131484dba7421a010e68f5/html5/thumbnails/37.jpg)
Demo
![Page 38: Acceleration and Momentum - Cornell University · Consequences of momentum analysis •Convergence rate depends only on momentum parameter β •Not on step size or curvature. •We](https://reader033.vdocument.in/reader033/viewer/2022042402/5f131484dba7421a010e68f5/html5/thumbnails/38.jpg)
Setting the parameters
• How do we set the momentum in practice for machine learning?
• One method: metaparameter optimization
• Another method: just set β = 0.9
  • Works across a range of problems
  • Actually quite popular in deep learning
![Page 39: Acceleration and Momentum - Cornell University · Consequences of momentum analysis •Convergence rate depends only on momentum parameter β •Not on step size or curvature. •We](https://reader033.vdocument.in/reader033/viewer/2022042402/5f131484dba7421a010e68f5/html5/thumbnails/39.jpg)
Nesterov momentum
![Page 40: Acceleration and Momentum - Cornell University · Consequences of momentum analysis •Convergence rate depends only on momentum parameter β •Not on step size or curvature. •We](https://reader033.vdocument.in/reader033/viewer/2022042402/5f131484dba7421a010e68f5/html5/thumbnails/40.jpg)
What about more general functions?
• Previous analysis was for quadratics
• Does this work for general convex functions?
• Answer: not in general
  • We need to do something slightly different
![Page 41: Acceleration and Momentum - Cornell University · Consequences of momentum analysis •Convergence rate depends only on momentum parameter β •Not on step size or curvature. •We](https://reader033.vdocument.in/reader033/viewer/2022042402/5f131484dba7421a010e68f5/html5/thumbnails/41.jpg)
Nesterov Momentum
• Slightly different rule
$$x_{t+1} = y_t - \alpha \nabla f(y_t)$$
$$y_{t+1} = x_{t+1} + \beta (x_{t+1} - x_t)$$
• Main difference: separate the momentum state from the point that we are calculating the gradient at.
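The two-sequence update can be sketched as follows. This is an illustrative implementation rather than the lecture's demo: the function `nesterov`, the start $y_0 = x_0$, and the test problem are assumptions, and the parameters use the choice $\alpha = 1/\lambda_{\max}$, $\beta = (\sqrt{\kappa} - 1)/(\sqrt{\kappa} + 1)$ that these slides give for Nesterov momentum.

```python
import math


def nesterov(grad, x0, alpha, beta, steps):
    """Nesterov momentum on plain lists of floats.

    x_{t+1} = y_t - alpha * grad(y_t)
    y_{t+1} = x_{t+1} + beta * (x_{t+1} - x_t)
    Assumption: y_0 = x_0.
    """
    x = list(x0)
    y = list(x0)
    for _ in range(steps):
        g = grad(y)  # gradient is taken at y, not at x
        x_next = [yi - alpha * gi for yi, gi in zip(y, g)]
        y = [xn + beta * (xn - xi) for xn, xi in zip(x_next, x)]
        x = x_next
    return x


def grad_f(p):
    """Gradient of the earlier example f(x, y) = x**2 / 2 + 2 * y**2."""
    x, y = p
    return [x, 4 * y]


lam_min, lam_max = 1.0, 4.0  # curvatures of the example problem
kappa = lam_max / lam_min
alpha = 1 / lam_max
beta = (math.sqrt(kappa) - 1) / (math.sqrt(kappa) + 1)

result = nesterov(grad_f, [1.0, 1.0], alpha, beta, steps=60)
print(result)
```

Keeping `x` and `y` as separate variables mirrors the point of the method: the momentum state and the gradient evaluation point are tracked independently.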
![Page 42: Acceleration and Momentum - Cornell University · Consequences of momentum analysis •Convergence rate depends only on momentum parameter β •Not on step size or curvature. •We](https://reader033.vdocument.in/reader033/viewer/2022042402/5f131484dba7421a010e68f5/html5/thumbnails/42.jpg)
Nesterov Momentum Analysis
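• Converges at an accelerated rate for ANY strongly convex problem
$$\sqrt{\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa}}}$$
• Optimal assignment of the parameters:
$$\alpha = \frac{1}{\lambda_{\max}}, \qquad \beta = \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}$$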
![Page 43: Acceleration and Momentum - Cornell University · Consequences of momentum analysis •Convergence rate depends only on momentum parameter β •Not on step size or curvature. •We](https://reader033.vdocument.in/reader033/viewer/2022042402/5f131484dba7421a010e68f5/html5/thumbnails/43.jpg)
Nesterov Momentum is Also Very Popular
• People use it in practice for deep learning all the time
• Significant speedups in practice
![Page 44: Acceleration and Momentum - Cornell University · Consequences of momentum analysis •Convergence rate depends only on momentum parameter β •Not on step size or curvature. •We](https://reader033.vdocument.in/reader033/viewer/2022042402/5f131484dba7421a010e68f5/html5/thumbnails/44.jpg)
Demo
![Page 45: Acceleration and Momentum - Cornell University · Consequences of momentum analysis •Convergence rate depends only on momentum parameter β •Not on step size or curvature. •We](https://reader033.vdocument.in/reader033/viewer/2022042402/5f131484dba7421a010e68f5/html5/thumbnails/45.jpg)
If time remains, PCA on the board
![Page 46: Acceleration and Momentum - Cornell University · Consequences of momentum analysis •Convergence rate depends only on momentum parameter β •Not on step size or curvature. •We](https://reader033.vdocument.in/reader033/viewer/2022042402/5f131484dba7421a010e68f5/html5/thumbnails/46.jpg)
Questions?
• Upcoming things
  • Paper 1 review due tonight
  • Next paper presentation on Wednesday