Understanding the Role of Optimism in Minimax Optimization: A Proximal Point Approach

Aryan Mokhtari (UT Austin)
joint work with
Sarath Pattathil (MIT) & Asu Ozdaglar (MIT)

Workshop on Bridging Game Theory & Deep Learning, NeurIPS 2019
Vancouver, Canada, December 14, 2019
Introduction

We focus on minimax optimization (also known as the saddle-point problem):

    min_{x∈R^m} max_{y∈R^n} f(x, y)

Minimax optimization has been studied for a very long time:

Game Theory: [Başar & Olsder, '99]
Control Theory: [Hast et al., '13]
Robust Optimization: [Ben-Tal et al., '09]

Recently popularized to train GANs [Goodfellow et al., '14].
Training GANs

A game between the discriminator and the generator.

Can be formulated as a minimax optimization problem. It is nonconvex-nonconcave, since both the discriminator and the generator are neural networks.
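For reference, the original GAN training objective of [Goodfellow et al., '14] is exactly such a minimax problem, with the generator G as the minimizing player and the discriminator D as the maximizing player (the formula is recalled here; it does not appear on the slide):

    min_G max_D  E_{x∼p_data}[log D(x)] + E_{z∼p_z}[log(1 − D(G(z)))]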
Algorithms

A simple algorithm to use is Gradient Descent Ascent (GDA):

    x_{k+1} = x_k − η∇_x f(x_k, y_k),   y_{k+1} = y_k + η∇_y f(x_k, y_k).

The Optimistic Gradient Descent Ascent (OGDA) method is

    x_{k+1} = x_k − η∇_x f(x_k, y_k) − η(∇_x f(x_k, y_k) − ∇_x f(x_{k−1}, y_{k−1}))
    y_{k+1} = y_k + η∇_y f(x_k, y_k) + η(∇_y f(x_k, y_k) − ∇_y f(x_{k−1}, y_{k−1}))

It has been observed that "negative momentum" or "optimism" helps in training GANs ([Gidel et al., '18], [Daskalakis et al., '18]).
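A minimal Python sketch of the two updates, assuming a user-supplied gradient oracle grad_f that returns the pair (∇_x f, ∇_y f); the oracle name and the toy example are illustrative, not from the talk:

    def gda_step(x, y, grad_f, eta):
        """One Gradient Descent Ascent step: descend in x, ascend in y."""
        gx, gy = grad_f(x, y)
        return x - eta * gx, y + eta * gy

    def ogda_step(x, y, grads_prev, grad_f, eta):
        """One OGDA step: the extra (g - g_prev) term is the 'optimism' correction."""
        gx, gy = grad_f(x, y)
        gx_prev, gy_prev = grads_prev
        x_new = x - eta * gx - eta * (gx - gx_prev)
        y_new = y + eta * gy + eta * (gy - gy_prev)
        return x_new, y_new, (gx, gy)

    # Example: f(x, y) = x * y on scalars, so grad_f returns (y, x).
    grad_f = lambda x, y: (y, x)
    x, y = gda_step(1.0, 1.0, grad_f, eta=0.1)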
Impact of Optimism

Consider the following images generated by GANs (from [Daskalakis et al., '18]).

[Figure: samples generated with Adam (left) vs. Optimistic Adam (right)]

Adam: similar and inferior-quality images.
Optimistic Adam: diverse and higher-quality images.
History of Optimism

It is clear that optimism helps achieve faster convergence.

There are several results that study optimistic methods:

Introduced in [Popov '80] as a variant of the Extra-Gradient (EG) method
In the context of online learning in [Rakhlin et al. '13]
A variant of the Forward-Backward algorithm [Malitsky & Tam, '19]
Several works analyzing OGDA in special settings: [Daskalakis et al. '18], [Gidel et al. '19], [Liang & Stokes '19], [Mertikopoulos et al. '19]

We show that optimism or negative momentum helps to approximate the proximal point method more accurately!

OGDA can be considered as an approximation of Proximal Point.
Simple Example

Consider the following bilinear problem:

    min_{x∈R^d} max_{y∈R^d} f(x, y) = x^T y

The solution is (x*, y*) = (0, 0).

The Gradient Descent Ascent (GDA) updates for this problem:

    x_{k+1} = x_k − η y_k,   y_{k+1} = y_k + η x_k,

since ∇_x f(x_k, y_k) = y_k and ∇_y f(x_k, y_k) = x_k; here η is the stepsize.
GDA

On running GDA, the iterates satisfy:

    ‖x_{k+1}‖² + ‖y_{k+1}‖² = (1 + η²)(‖x_k‖² + ‖y_k‖²)

GDA diverges, since (1 + η²) > 1.
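A quick numerical check of this growth factor on the scalar version of the bilinear example (a sketch; the initialization and stepsize are arbitrary illustrative choices):

    eta = 0.1
    x, y = 1.0, 1.0
    for k in range(5):
        before = x * x + y * y
        x, y = x - eta * y, y + eta * x   # simultaneous GDA step on f(x, y) = x*y
        after = x * x + y * y
        print(after / before)             # prints 1 + eta**2 = 1.01 at every step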
OGDA

The OGDA updates for the bilinear problem:

    x_{k+1} = x_k − η(2y_k − y_{k−1}),   where 2y_k − y_{k−1} = 2∇_x f(x_k, y_k) − ∇_x f(x_{k−1}, y_{k−1}),
    y_{k+1} = y_k + η(2x_k − x_{k−1}),   where 2x_k − x_{k−1} = 2∇_y f(x_k, y_k) − ∇_y f(x_{k−1}, y_{k−1}).
Proximal Point

The Proximal Point (PP) updates for the same problem:

    x_{k+1} = x_k − η y_{k+1},   y_{k+1} = y_k + η x_{k+1},

where η is the stepsize (note that y_{k+1} = ∇_x f(x_{k+1}, y_{k+1}) and x_{k+1} = ∇_y f(x_{k+1}, y_{k+1})).

The difference from GDA is that the gradient at the iterate (x_{k+1}, y_{k+1}) is used for the update instead of the gradient at (x_k, y_k).

Although the update is implicit, for this problem it takes a simple closed form:

    x_{k+1} = (x_k − η y_k) / (1 + η²),   y_{k+1} = (y_k + η x_k) / (1 + η²).

The Proximal Point method in general involves operator inversion and is not easy to implement.
Proximal Point

On running PP, the iterates satisfy:

    ‖x_{k+1}‖² + ‖y_{k+1}‖² = (1 / (1 + η²))(‖x_k‖² + ‖y_k‖²)

PP converges, since 1/(1 + η²) < 1.
OGDA vs Proximal Point

It seems like OGDA approximates the Proximal Point method!

Their convergence paths are very similar.
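A small sketch comparing the two trajectories on the scalar bilinear problem f(x, y) = xy, using the closed-form PP update above; the stepsize, initialization, and the GDA warm-start for OGDA's history are arbitrary illustrative choices:

    eta = 0.2
    xp, yp = 1.0, 1.0                                      # Proximal Point iterates
    x_prev, y_prev = 1.0, 1.0                              # OGDA history
    xo, yo = x_prev - eta * y_prev, y_prev + eta * x_prev  # warm-start with one GDA step

    for k in range(30):
        # PP, closed form for f(x, y) = x*y (simultaneous update)
        xp, yp = (xp - eta * yp) / (1 + eta**2), (yp + eta * xp) / (1 + eta**2)
        # OGDA on the same problem (simultaneous update)
        xo, yo, x_prev, y_prev = (xo - eta * (2 * yo - y_prev),
                                  yo + eta * (2 * xo - x_prev), xo, yo)
        # Both distances to the saddle point (0, 0) shrink at similar geometric rates.
        print(k, (xp**2 + yp**2) ** 0.5, (xo**2 + yo**2) ** 0.5)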
Contributions & Outline

View OGDA as an approximation of PP for finding a saddle point:
We show the iterates of OGDA are o(η²) approximations of the PP iterates.

We focus on the convex-concave setting in this talk.

We use the PP approximation viewpoint to show that for OGDA:
The iterates are bounded
The function value converges at a rate of O(1/k)

At the end, we also provide convergence rate results for:
Bilinear problems
Strongly convex-strongly concave problems

We revisit the Extra-gradient (EG) method using the same approach.
Problem

We consider finding a saddle point of the problem:

    min_{x∈R^m} max_{y∈R^n} f(x, y)

f is convex in x and concave in y.
f(x, y) is continuously differentiable in x and y.
∇_x f and ∇_y f are Lipschitz in x and y; L denotes the Lipschitz constant.

(x*, y*) ∈ R^m × R^n is a saddle point if it satisfies

    f(x*, y) ≤ f(x*, y*) ≤ f(x, y*)

for all x ∈ R^m, y ∈ R^n.
Proximal Point

The PP method at each step solves the following:

    (x_{k+1}, y_{k+1}) = arg min_{x∈R^m} max_{y∈R^n} { f(x, y) + (1/(2η))‖x − x_k‖² − (1/(2η))‖y − y_k‖² }.

Using the first-order optimality conditions leads to the following update:

    x_{k+1} = x_k − η∇_x f(x_{k+1}, y_{k+1}),   y_{k+1} = y_k + η∇_y f(x_{k+1}, y_{k+1}).

Theorem (Convergence of Proximal Point)
The iterates generated by Proximal Point satisfy

    [max_{y∈D} f(x_k, y) − f*] + [f* − min_{x∈D} f(x, y_k)] ≤ (‖x_0 − x*‖² + ‖y_0 − y*‖²) / (ηk),

where D := {(x, y) | ‖x − x*‖² + ‖y − y*‖² ≤ ‖x_0 − x*‖² + ‖y_0 − y*‖²}.
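To spell out the first-order optimality step (a short derivation, not written out on the slide): setting the gradients of the regularized objective to zero at the solution (x_{k+1}, y_{k+1}) gives

    ∇_x f(x_{k+1}, y_{k+1}) + (1/η)(x_{k+1} − x_k) = 0,
    ∇_y f(x_{k+1}, y_{k+1}) − (1/η)(y_{k+1} − y_k) = 0,

which rearrange to the implicit updates above. The gradients are evaluated at the new iterate, which is what makes PP expensive to implement directly.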
OGDA updates - How prediction takes place

One way of approximating the Proximal Point update is as follows:

    ∇_x f(x_{k+1}, y_{k+1}) ≈ ∇_x f(x_k, y_k) + (∇_x f(x_k, y_k) − ∇_x f(x_{k−1}, y_{k−1}))
    ∇_y f(x_{k+1}, y_{k+1}) ≈ ∇_y f(x_k, y_k) + (∇_y f(x_k, y_k) − ∇_y f(x_{k−1}, y_{k−1}))

This leads to the OGDA update:

    x_{k+1} = x_k − η∇_x f(x_k, y_k) − η(∇_x f(x_k, y_k) − ∇_x f(x_{k−1}, y_{k−1}))
    y_{k+1} = y_k + η∇_y f(x_k, y_k) + η(∇_y f(x_k, y_k) − ∇_y f(x_{k−1}, y_{k−1}))
OGDA vs PP

Given a point (x_k, y_k), let
(x̂_{k+1}, ŷ_{k+1}) be the point obtained by performing PP on (x_k, y_k),
(x_{k+1}, y_{k+1}) be the point obtained by performing OGDA on (x_k, y_k).

For a given stepsize η > 0 we have

    ‖x̂_{k+1} − x_{k+1}‖ ≤ o(η²),   ‖ŷ_{k+1} − y_{k+1}‖ ≤ o(η²).

The proof follows from Taylor's expansion of ∇_x f(x̂_{k+1}, ŷ_{k+1}) and ∇_y f(x̂_{k+1}, ŷ_{k+1}) at the point (x_k, y_k).
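A sketch of the Taylor argument in the compact notation z = [x; y], F(z) = [∇_x f; −∇_y f] used later in the talk (my elaboration of the stated proof idea, assuming f is smooth enough for the expansions): both methods move O(η) per iteration, so

    F(ẑ_{k+1}) = F(z_k) + ∇F(z_k)(ẑ_{k+1} − z_k) + o(η) = F(z_k) − η∇F(z_k)F(z_k) + o(η),
    2F(z_k) − F(z_{k−1}) = F(z_k) + ∇F(z_{k−1})(z_k − z_{k−1}) + o(η) = F(z_k) − η∇F(z_k)F(z_k) + o(η).

The implicit PP gradient and the OGDA extrapolation therefore agree up to o(η), and multiplying by the stepsize η makes the resulting iterates agree up to o(η²).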
Convergence rates

Theorem (Convex-Concave case)
Let the stepsize η satisfy 0 < η ≤ 1/(2L). Then the iterates generated by OGDA satisfy

    [max_{y∈D} f(x_k, y) − f*] + [f* − min_{x∈D} f(x, y_k)] ≤ (‖x_0 − x*‖² + ‖y_0 − y*‖²)(8L + 1/(2η)) / k,

where D := {(x, y) | ‖x − x*‖² + ‖y − y*‖² ≤ 2(‖x_0 − x*‖² + ‖y_0 − y*‖²)}.

OGDA converges at a rate of O(1/k) (same as PP).
This is the first convergence guarantee for OGDA in this setting.
Proximal Point with error

We study 'inexact' versions of the Proximal Point method, given by

    x_{k+1} = x_k − η∇_x f(x_{k+1}, y_{k+1}) + ε_k^x
    y_{k+1} = y_k + η∇_y f(x_{k+1}, y_{k+1}) − ε_k^y

The 'error' terms ε_k^x, ε_k^y should be such that:
the algorithm is easy to implement,
it retains the convergence properties of PP.

Lemma (3-point equality for Proximal Point with error)

    F(z_{k+1})^T(z_{k+1} − z) = (1/(2η))‖z_k − z‖² − (1/(2η))‖z_{k+1} − z‖² − (1/(2η))‖z_{k+1} − z_k‖² + (1/η)ε_k^T(z_{k+1} − z),

where z = [x; y], F(z) = [∇_x f(x, y); −∇_y f(x, y)], and ε_k = [ε_k^x; −ε_k^y].
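In the z-notation of the lemma, OGDA reads z_{k+1} = z_k − 2ηF(z_k) + ηF(z_{k−1}); adding and subtracting ηF(z_{k+1}) (a one-line rearrangement, not on the slide) identifies it as inexact PP:

    z_{k+1} = z_k − ηF(z_{k+1}) + η[F(z_{k+1}) − 2F(z_k) + F(z_{k−1})],   i.e.   ε_k = η[F(z_{k+1}) − 2F(z_k) + F(z_{k−1})],

which is exactly the error used in the proof sketch that follows.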
Convergence rates

Proof Sketch: (Here z = [x; y] and F(z) = [∇_x f(x, y); −∇_y f(x, y)].)

For OGDA, we have: ε_k = η[F(z_{k+1}) − 2F(z_k) + F(z_{k−1})].

Substitute this error in the 3-point lemma for PP with error to get:

    Σ_{k=0}^{N−1} F(z_{k+1})^T(z_{k+1} − z) ≤ (1/(2η))‖z_0 − z‖² − (1/(2η))‖z_N − z‖² − (L/2)‖z_N − z_{N−1}‖² + (F(z_N) − F(z_{N−1}))^T(z_N − z).

Bounding the last two terms using the Lipschitz continuity of F yields

    Σ_{k=0}^{N−1} F(z_{k+1})^T(z_{k+1} − z) ≤ (1/(2η))‖z_0 − z‖² − (1/(2η) − L/2)‖z_N − z‖².

F(z)^T(z − z*) ≥ 0 for all z ⇒ the iterates are bounded:

    ‖z_N − z*‖² ≤ c‖z_0 − z*‖²

Using convexity-concavity of f(x, y) we have:

    [max_{y∈D} f(x_N, y) − f*] + [f* − min_{x∈D} f(x, y_N)] ≤ (1/N) Σ_{k=0}^{N−1} F(z_k)^T(z_k − z) ≤ C‖z_0 − z*‖² / N
Other Approximation of Proximal Point

The Proximal Point updates are:

    x_{k+1} = x_k − η∇_x f(x_{k+1}, y_{k+1}),   y_{k+1} = y_k + η∇_y f(x_{k+1}, y_{k+1}).

This can also be written as:

    x_{k+1} = x_k − η∇_x f(x_k − η∇_x f(x_{k+1}, y_{k+1}), y_k + η∇_y f(x_{k+1}, y_{k+1})),
    y_{k+1} = y_k + η∇_y f(x_k − η∇_x f(x_{k+1}, y_{k+1}), y_k + η∇_y f(x_{k+1}, y_{k+1})).

Approximate the inner gradients at (x_{k+1}, y_{k+1}) by the gradients at (x_k, y_k):

    x_{k+1} = x_k − η∇_x f(x_k − η∇_x f(x_k, y_k), y_k + η∇_y f(x_k, y_k)),
    y_{k+1} = y_k + η∇_y f(x_k − η∇_x f(x_k, y_k), y_k + η∇_y f(x_k, y_k)),

which is nothing but the Extra-gradient algorithm!
EG updates - How prediction takes place

The updates of EG:

    x_{k+1/2} = x_k − η∇_x f(x_k, y_k),   y_{k+1/2} = y_k + η∇_y f(x_k, y_k).

The gradients evaluated at the midpoints x_{k+1/2} and y_{k+1/2} are used to compute the new iterates x_{k+1} and y_{k+1} by performing the updates

    x_{k+1} = x_k − η∇_x f(x_{k+1/2}, y_{k+1/2}),
    y_{k+1} = y_k + η∇_y f(x_{k+1/2}, y_{k+1/2}).
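A minimal sketch of one EG step, reusing the hypothetical grad_f oracle from the earlier GDA/OGDA sketch:

    def eg_step(x, y, grad_f, eta):
        """One Extra-gradient step: probe at a midpoint, then update the
        original point (x, y) using the midpoint gradients."""
        gx, gy = grad_f(x, y)
        x_half, y_half = x - eta * gx, y + eta * gy   # prediction (midpoint)
        gx_half, gy_half = grad_f(x_half, y_half)
        return x - eta * gx_half, y + eta * gy_half   # correction from (x, y)

Note that EG costs two gradient evaluations per iteration, whereas OGDA achieves a similar prediction with one evaluation by reusing the previous gradient.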
EG vs PP

Given a point (x_k, y_k), let
(x̂_{k+1}, ŷ_{k+1}) be the point we obtain by performing PP on (x_k, y_k),
(x_{k+1}, y_{k+1}) be the point we obtain by performing EG on (x_k, y_k).

Then, for a given stepsize η > 0 we have

    ‖x̂_{k+1} − x_{k+1}‖ ≤ o(η²),   ‖ŷ_{k+1} − y_{k+1}‖ ≤ o(η²).

The proof follows from Taylor's expansion of ∇_x f(x̂_{k+1}, ŷ_{k+1}) and ∇_y f(x̂_{k+1}, ŷ_{k+1}), as well as ∇_x f(x_{k+1/2}, y_{k+1/2}) and ∇_y f(x_{k+1/2}, y_{k+1/2}), about the point (x_k, y_k).
EG vs PP

EG also does a good job of approximating the Proximal Point method!
Convergence rates

Theorem (Convex-Concave case)
For η = σ/L, where σ ∈ (0, 1), the iterates generated by the EG method satisfy

    [max_{y∈𝒟} f(x_k, y) − f*] + [f* − min_{x∈𝒟} f(x, y_k)] ≤ DL(16 + 3/(32(1 − σ²))) / k,

where D := ‖x_0 − x*‖² + ‖y_0 − y*‖² and 𝒟 := {(x, y) | ‖x − x*‖² + ‖y − y*‖² ≤ (2 + 2/(1 − η²L²))D}.

EG converges at a rate of O(1/k) (same as Proximal Point).

The O(1/k) rate for the convex-concave case was shown in [Nemirovski '04] when the feasible set is compact.
[Monteiro & Svaiter '10] extended it to unbounded sets using a different termination criterion.

Our result shows a convergence rate of O(1/k) in terms of function value, without assuming compactness.
Extensions

Bilinear:
f(x, y) = x^T B y
B ∈ R^{d×d} is a square full-rank matrix.
(x*, y*) = (0, 0) is the unique saddle point.
The condition number is defined as κ := λ_max(B^T B) / λ_min(B^T B).

Strongly Convex - Strongly Concave:
f(x, y) is continuously differentiable in x and y.
f is μ_x-strongly convex in x and μ_y-strongly concave in y.
We define μ := min{μ_x, μ_y}.
The condition number is defined as κ := L/μ.
Convergence rates

Theorem (Bilinear case (OGDA))
Let the stepsize η = 1/(40√λ_max(B^T B)). Then the iterates generated by the OGDA method satisfy

    ‖x_{k+1}‖² + ‖y_{k+1}‖² ≤ (1 − 1/(cκ))^k (‖x_0‖² + ‖y_0‖²),

where c is a positive constant independent of the problem parameters.

Theorem (Bilinear case (EG))
Let the stepsize η = 1/(2√(2λ_max(B^T B))). Then the iterates generated by the EG method satisfy

    ‖x_{k+1}‖² + ‖y_{k+1}‖² ≤ (1 − 1/(cκ))(‖x_k‖² + ‖y_k‖²),

where c is a positive constant independent of the problem parameters.
Convergence rates

Theorem (Strongly convex-Strongly concave case (OGDA))
Let the stepsize η = 1/(4L). Then the iterates generated by OGDA satisfy

    ‖x_{k+1} − x*‖² + ‖y_{k+1} − y*‖² ≤ (1 − 1/(cκ))^k (‖x_0 − x*‖² + ‖y_0 − y*‖²),

where c is a positive constant independent of the problem parameters.

Theorem (Strongly convex-Strongly concave case (EG))
Let the stepsize η = 1/(4L). Then the iterates generated by the EG method satisfy

    ‖x_{k+1} − x*‖² + ‖y_{k+1} − y*‖² ≤ (1 − 1/(cκ))(‖x_k − x*‖² + ‖y_k − y*‖²),

where c is a positive constant independent of the problem parameters.
Convergence rates

OGDA and EG have an iteration complexity of O(κ log(1/ε)).
This is the same iteration complexity as the PP method.
Similar rates were obtained in [Tseng '95], [Liang & Stokes '19], [Gidel et al. '19].

Performance of GDA in these settings:
Bilinear: GDA may not converge.
Strongly convex-Strongly concave: GDA has an iteration complexity of O(κ² log(1/ε)).
Generalized OGDA

Using the interpretation that OGDA is an approximation of PP, we can extend this algorithm to a more general set of parameters.

The Generalized OGDA updates are:

    x_{k+1} = x_k − α∇_x f(x_k, y_k) − β(∇_x f(x_k, y_k) − ∇_x f(x_{k−1}, y_{k−1}))
    y_{k+1} = y_k + α∇_y f(x_k, y_k) + β(∇_y f(x_k, y_k) − ∇_y f(x_{k−1}, y_{k−1}))

Note that:
for β = 0, this reduces to the GDA updates;
for β = α, this reduces to the standard OGDA updates.
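A sketch of the generalized update, again using the hypothetical grad_f oracle from the earlier sketches; choosing beta = 0 recovers GDA, and beta = alpha recovers standard OGDA:

    def generalized_ogda_step(x, y, grads_prev, grad_f, alpha, beta):
        """One Generalized OGDA step with stepsize alpha and an independent
        optimism coefficient beta on the gradient-difference term."""
        gx, gy = grad_f(x, y)
        gx_prev, gy_prev = grads_prev
        x_new = x - alpha * gx - beta * (gx - gx_prev)
        y_new = y + alpha * gy + beta * (gy - gy_prev)
        return x_new, y_new, (gx, gy)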
Conclusions and Future Work

Studied optimism through the lens of the Proximal Point method.
Analyzed OGDA as an inexact version of the Proximal Point method:
convex-concave, bilinear, and strongly convex-strongly concave settings.
Revisited EG as an approximation of the Proximal Point method.
This interpretation also aided in the design of Generalized OGDA.

Extensions to more general settings:
nonconvex-concave
weakly convex-weakly concave
References

Thanks!

Convergence Rate of O(1/k) for Optimistic Gradient and Extra-gradient Methods in Smooth Convex-Concave Saddle Point Problems. https://arxiv.org/pdf/1906.01115.pdf

A Unified Analysis of Extra-gradient and Optimistic Gradient Methods for Saddle Point Problems: Proximal Point Approach. https://arxiv.org/pdf/1901.08511.pdf