ELE 522: Large-Scale Optimization for Data Science
Subgradient methods
Yuxin Chen
Princeton University, Fall 2019
Outline
• Steepest descent
• Subgradients
• Projected subgradient descent
  ◦ Convex and Lipschitz problems
  ◦ Strongly convex and Lipschitz problems
• Convex-concave saddle point problems
Subgradient methods 4-2
Nondifferentiable problems
Differentiability of the objective function f is essential for the validity of gradient methods

However, there is no shortage of interesting cases (e.g. ℓ1 minimization, nuclear norm minimization) where non-differentiability is present at some points
Subgradient methods 4-3
Generalizing steepest descent?
minimizex f(x) subject to x ∈ C
• find a search direction dt that minimizes the directional derivative
dt ∈ arg min_{d : ‖d‖2 ≤ 1} f′(xt; d)

where f′(x; d) := lim_{α↓0} [f(x + αd) − f(x)] / α

• update: xt+1 = xt + ηt dt
Subgradient methods 4-4
Issues
• Finding the steepest descent direction (or even finding a descent direction) may involve expensive computation

• Stepsize rules are tricky to choose: for certain popular stepsize rules (like exact line search), steepest descent might converge to non-optimal points
Subgradient methods 4-5
Wolfe’s example
f(x1, x2) =
  5(9x1² + 16x2²)^(1/2),  if x1 > |x2|
  9x1 + 16|x2|,           if x1 ≤ |x2|

• (0, 0) is a non-differentiable point
• if one starts from x0 = (16/9, 1) and uses exact line search, then
  ◦ {xt} are all differentiable points
  ◦ xt → (0, 0) as t → ∞
• even though it never hits non-differentiable points, steepest descent with exact line search gets stuck around a non-optimal point (i.e. (0, 0))
• problem: steepest descent directions may undergo large / discontinuous changes when close to convergence limits
Subgradient methods 4-6
(Projected) subgradient method
Practically, a popular choice is “subgradient-based methods”
xt+1 = PC(xt − ηt gt)    (4.1)

where gt is any subgradient of f at xt

• the focus of this lecture
• caution: this update rule does not necessarily yield reduction w.r.t. the objective values
Subgradient methods 4-7
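A minimal NumPy sketch of update rule (4.1); the Euclidean ball as the constraint set C, f(x) = ‖x‖1 with the sign vector as its subgradient, and the stepsize are all illustrative choices, not prescribed by the slides:

```python
import numpy as np

def project_ball(x, radius=1.0):
    # Euclidean projection onto {x : ||x||_2 <= radius}, an
    # illustrative choice of the constraint set C
    nrm = np.linalg.norm(x)
    return x if nrm <= radius else (radius / nrm) * x

def subgradient_step(x, subgrad, eta, project):
    # one iteration of (4.1): x_{t+1} = P_C(x_t - eta_t * g_t)
    return project(x - eta * subgrad(x))

# example: f(x) = ||x||_1, with sign(x) as one valid subgradient
x = np.array([0.9, -0.4])
x_next = subgradient_step(x, np.sign, eta=0.5, project=project_ball)
```

Any valid subgradient may be plugged in; as cautioned above, a single step need not decrease the objective value.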
Subgradients
We say g is a subgradient of f at the point x if
f(z) ≥ f(x) + g⊤(z − x), ∀z    (4.2)

where the right-hand side is a linear under-estimate of f

• the set of all subgradients of f at x is called the subdifferential of f at x, denoted by ∂f(x)
Subgradient methods 4-9
Example: f(x) = |x|

∂f(x) =
  {−1},     if x < 0
  [−1, 1],  if x = 0
  {1},      if x > 0

Subgradient methods 4-10
Example: a subgradient of norms at 0
Let f(x) = ‖x‖ for any norm ‖ · ‖, then for any g obeying ‖g‖∗ ≤ 1,
g ∈ ∂f(0)
where ‖ · ‖∗ is the dual norm of ‖ · ‖ (i.e. ‖x‖∗ := supz:‖z‖≤1〈z,x〉)
Proof: To see this, it suffices to prove that
f(z) ≥ f(0) + 〈g, z − 0〉, ∀z
⇐⇒ 〈g, z〉 ≤ ‖z‖, ∀z

This follows from generalized Cauchy-Schwarz, i.e.
〈g, z〉 ≤ ‖g‖∗‖z‖ ≤ ‖z‖
∎

Subgradient methods 4-11
Example: max{f1(x), f2(x)}
Example: max{f1(x), f2(x)}
f(x) = max{f1(x), f2(x)} where f1 and f2 are di�erentiable
ˆf(x) =
Y__]__[
{f Õ1(x)}, if f1(x) > f2(x)
[f Õ1(x), f Õ
2(x)], if f1(x) = f2(x){f Õ
2(x)}, if f1(x) < f2(x)
Subgradient methods 4-11
Example: f(x) = |x|
f(x) = |x|
ˆf(x) =
Y__]__[
{≠1}, if x < 0[≠1, 1], if x = 0{1}, if x > 0
Subgradient methods 4-10
f(x) = max{f1(x), f2(x)} where f1 and f2 are differentiable

∂f(x) =
  {f′1(x)},              if f1(x) > f2(x)
  conv{f′1(x), f′2(x)},  if f1(x) = f2(x)
  {f′2(x)},              if f1(x) < f2(x)

Subgradient methods 4-12
Basic rules
• scaling: ∂(αf) = α∂f (for α > 0)
• summation: ∂(f1 + f2) = ∂f1 + ∂f2
Subgradient methods 4-13
Example: ℓ1 norm

f(x) = ‖x‖1 = Σ_{i=1}^n |xi|, where fi(x) := |xi|

since

∂fi(x) =
  {sgn(xi) ei},   if xi ≠ 0
  [−1, 1] · ei,   if xi = 0

we have

Σ_{i : xi ≠ 0} sgn(xi) ei ∈ ∂f(x)
Subgradient methods 4-14
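A small sketch checking that the coordinate-wise sign vector is indeed a valid subgradient of the ℓ1 norm, by testing the subgradient inequality (4.2) at random points (the test points, seed, and tolerance are arbitrary choices):

```python
import numpy as np

def l1_subgradient(x):
    # one valid subgradient of f(x) = ||x||_1:
    # sgn(x_i) where x_i != 0, and 0 (a point of [-1, 1]) where x_i = 0
    return np.sign(x)

rng = np.random.default_rng(0)
x = np.array([1.5, 0.0, -2.0])
g = l1_subgradient(x)
# subgradient inequality (4.2): f(z) >= f(x) + <g, z - x> for all z
ok = all(
    np.abs(z).sum() >= np.abs(x).sum() + g @ (z - x) - 1e-9
    for z in rng.normal(size=(100, 3))
)
```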
Basic rules (cont.)
• affine transformation: if h(x) = f(Ax + b), then
∂h(x) = A⊤ ∂f(Ax + b)
Subgradient methods 4-15
Example: ‖Ax + b‖1
h(x) = ‖Ax + b‖1

letting f(x) = ‖x‖1 and A = [a1, · · · , am]⊤, we have

g = Σ_{i : ai⊤x + bi ≠ 0} sgn(ai⊤x + bi) ei ∈ ∂f(Ax + b)

⟹ A⊤g = Σ_{i : ai⊤x + bi ≠ 0} sgn(ai⊤x + bi) ai ∈ ∂h(x)
Subgradient methods 4-16
Basic rules (cont.)
• chain rule: suppose f is convex, and g is differentiable, nondecreasing, and convex. Let h = g ◦ f; then

∂h(x) = g′(f(x)) ∂f(x)

• composition: suppose f(x) = h(f1(x), · · · , fn(x)), where the fi's are convex, and h is differentiable, nondecreasing, and convex. Let q = ∇h(y)|_{y=[f1(x),··· ,fn(x)]}, and gi ∈ ∂fi(x). Then

q1 g1 + · · · + qn gn ∈ ∂f(x)
Subgradient methods 4-17
Basic rules (cont.)
• pointwise maximum: if f(x) = max_{1≤i≤k} fi(x), then

∂f(x) = conv{ ⋃ {∂fi(x) | fi(x) = f(x)} }

i.e. the convex hull of the subdifferentials of all active functions

• pointwise supremum: if f(x) = sup_{α∈F} fα(x), then

∂f(x) = closure( conv{ ⋃ {∂fα(x) | fα(x) = f(x)} } )
Subgradient methods 4-18
Example: piece-wise linear functions
f(x) = max_{1≤i≤m} {ai⊤x + bi}

pick any aj s.t. aj⊤x + bj = max_i {ai⊤x + bi}; then

aj ∈ ∂f(x)
Subgradient methods 4-19
Example: the ℓ∞ norm

f(x) = ‖x‖∞ = max_{1≤i≤n} |xi|

if x ≠ 0, then pick any xj obeying |xj| = max_i |xi| to obtain

sgn(xj) ej ∈ ∂f(x)
Subgradient methods 4-20
Example: the maximum eigenvalue
f(x) = λmax (x1A1 + · · ·+ xnAn)
where A1, · · · ,An are real symmetric matrices
Rewrite

f(x) = sup_{y : ‖y‖2 = 1} y⊤(x1A1 + · · · + xnAn)y

as the supremum of affine functions of x. Therefore, taking y as the leading eigenvector of x1A1 + · · · + xnAn, we have

[y⊤A1y, · · · , y⊤Any]⊤ ∈ ∂f(x)
Subgradient methods 4-21
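A NumPy sketch of this recipe for f(x) = λmax(x1A1 + · · · + xnAn); the two diagonal matrices are a toy example chosen so the answer is easy to verify by hand:

```python
import numpy as np

def lambda_max_subgradient(x, mats):
    # f(x) = lambda_max(sum_i x_i A_i); a subgradient is
    # [y^T A_1 y, ..., y^T A_n y] with y a leading unit eigenvector
    M = sum(xi * A for xi, A in zip(x, mats))
    w, V = np.linalg.eigh(M)      # eigenvalues in ascending order
    y = V[:, -1]                  # leading eigenvector, ||y||_2 = 1
    return np.array([y @ A @ y for A in mats]), w[-1]

A1 = np.diag([1.0, 0.0])
A2 = np.diag([0.0, 1.0])
# M = 2*A1 + 1*A2 = diag(2, 1), so f = 2 and y = e1
g, fval = lambda_max_subgradient(np.array([2.0, 1.0]), [A1, A2])
```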
Example: the nuclear norm
Let X ∈ Rm×n with SVD X = UΣV > and
f(X) = Σ_{i=1}^{min{m,n}} σi(X)

where σi(X) is the ith largest singular value of X

Rewrite

f(X) = sup_{orthonormal A,B} ⟨AB⊤, X⟩ =: sup_{orthonormal A,B} fA,B(X)

Recognizing that fA,B(X) is maximized by A = U and B = V and that ∇fA,B(X) = AB⊤, we have

UV⊤ ∈ ∂f(X)
Subgradient methods 4-22
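A sketch of the nuclear-norm subgradient UV⊤ via the compact SVD, with a numerical check of the subgradient inequality at a random pair of matrices (the sizes and random seed are arbitrary):

```python
import numpy as np

def nuclear_norm_subgradient(X):
    # f(X) = sum of singular values of X; U V^T is one subgradient at X
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ Vt, s.sum()

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 3))
G, fX = nuclear_norm_subgradient(X)

# subgradient inequality: f(Z) >= f(X) + <G, Z - X> at a random Z
Z = rng.normal(size=(4, 3))
fZ = np.linalg.svd(Z, compute_uv=False).sum()
gap = fZ - (fX + np.sum(G * (Z - X)))
```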
Negative subgradients are not necessarily descent directions

Example: f(x) = |x1| + 3|x2|
at x = (1, 0):
• g1 = (1, 0) ∈ ∂f(x), and −g1 is a descent direction
• g2 = (1, 3) ∈ ∂f(x), but −g2 is not a descent direction

Reason: lack of continuity — one can change directions significantly without violating the validity of subgradients

Subgradient methods 4-23
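The example above can be checked numerically; η = 0.1 is an arbitrary small stepsize:

```python
import numpy as np

def f(x):
    return abs(x[0]) + 3 * abs(x[1])

x = np.array([1.0, 0.0])
g1 = np.array([1.0, 0.0])   # a subgradient of f at (1, 0)
g2 = np.array([1.0, 3.0])   # also a subgradient of f at (1, 0)

eta = 0.1
decrease_g1 = f(x - eta * g1) < f(x)   # True: -g1 is a descent direction
decrease_g2 = f(x - eta * g2) < f(x)   # False: f increases along -g2
```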
Negative subgradients are not necessarily descent directions

Since f(xt) is not necessarily monotone, we will keep track of the best point

fbest,t := min_{1≤i≤t} f(xi)
We also denote by fopt := minx f(x) the optimal objective value
Subgradient methods 4-24
Convex and Lipschitz problems
Clearly, we cannot analyze all nonsmooth functions. A nice (and widely encountered) class to start with is Lipschitz functions, i.e. the set of all f obeying

|f(x) − f(z)| ≤ Lf ‖x − z‖2, ∀x and z
Subgradient methods 4-25
Fundamental inequality for projected subgradient methods
We'd like to optimize ‖xt+1 − x∗‖2², but don't have access to x∗

Key idea (majorization-minimization): find another function that majorizes ‖xt+1 − x∗‖2², and optimize the majorizing function
Lemma 4.1
Projected subgradient update rule (4.1) obeys

‖xt+1 − x∗‖2² ≤ ‖xt − x∗‖2² − 2ηt(f(xt) − fopt) + ηt²‖gt‖2²    (4.3)

where the first term on the right-hand side is fixed, and the right-hand side as a whole is the majorizing function
Subgradient methods 4-26
Proof of Lemma 4.1
‖xt+1 − x∗‖2² = ‖PC(xt − ηt gt) − PC(x∗)‖2²
             ≤ ‖xt − ηt gt − x∗‖2²    (nonexpansiveness of projection)
             = ‖xt − x∗‖2² − 2ηt〈xt − x∗, gt〉 + ηt²‖gt‖2²
             ≤ ‖xt − x∗‖2² − 2ηt(f(xt) − f(x∗)) + ηt²‖gt‖2²

where the last line uses the subgradient inequality

f(x∗) − f(xt) ≥ 〈x∗ − xt, gt〉
Subgradient methods 4-27
Polyak’s stepsize rule
The majorizing function in (4.3) suggests a stepsize (Polyak ’87)
ηt = (f(xt) − fopt) / ‖gt‖2²    (4.4)

which leads to error reduction

‖xt+1 − x∗‖2² ≤ ‖xt − x∗‖2² − (f(xt) − f(x∗))² / ‖gt‖2²    (4.5)
• useful if fopt is known
• the estimation error is monotonically decreasing with Polyak’sstepsize
Subgradient methods 4-28
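A sketch of the subgradient method with Polyak's stepsize (4.4) on f(x) = ‖x‖1, where fopt = 0 is known; the starting point and iteration cap are arbitrary:

```python
import numpy as np

def polyak_subgradient(f, subgrad, f_opt, x0, iters=200):
    # subgradient method with Polyak's stepsize (4.4); needs f_opt
    x = x0.astype(float)
    for _ in range(iters):
        g = subgrad(x)
        if np.allclose(g, 0):      # 0 is a subgradient: x is optimal
            break
        eta = (f(x) - f_opt) / (g @ g)
        x = x - eta * g
    return x

f = lambda x: np.abs(x).sum()      # f_opt = 0, minimized at x* = 0
x_hat = polyak_subgradient(f, np.sign, f_opt=0.0, x0=np.array([3.0, -2.0]))
```

Consistent with (4.5), the distance to x∗ never increases along the way.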
Example: projection onto intersection of convex sets

[PICTURE]
Let C1, C2 be closed convex sets and suppose C1 ∩ C2 ≠ ∅

find x ∈ C1 ∩ C2

⇕

minimizex max{distC1(x), distC2(x)}

where distC(x) := min_{z∈C} ‖x − z‖2

Subgradient methods 4-29
For this problem, the subgradient method with Polyak's stepsize rule is equivalent to alternating projection

xt+1 = PC1(xt), xt+2 = PC2(xt+1)
Subgradient methods 4-30
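A sketch of this equivalence with two concrete sets, the unit ball C1 and a half-space C2 (both projections have the standard closed forms used below); after repeated alternating projections the iterate lands in C1 ∩ C2 up to tolerance:

```python
import numpy as np

def proj_ball(x):
    # P_C1 for C1 = {x : ||x||_2 <= 1}
    nrm = np.linalg.norm(x)
    return x if nrm <= 1 else x / nrm

def proj_halfspace(x, a=np.array([1.0, 1.0]), b=1.0):
    # P_C2 for C2 = {x : a^T x >= b}
    slack = b - a @ x
    return x + (slack / (a @ a)) * a if slack > 0 else x

# alternating projection (= subgradient method with Polyak's stepsize)
x = np.array([3.0, -3.0])
for _ in range(100):
    x = proj_halfspace(proj_ball(x))

in_ball = np.linalg.norm(x) <= 1 + 1e-6
in_half = x[0] + x[1] >= 1 - 1e-6
```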
Example: projection onto intersection of convex sets
Proof: Use the subgradient rule for pointwise max functions to get
gt ∈ ∂distCi(xt)
where i = arg max_{j=1,2} distCj(xt)

If distCi(xt) ≠ 0, then one has

gt = ∇distCi(xt) = (xt − PCi(xt)) / distCi(xt)

which follows since ∇(½ distCi²(xt)) = xt − PCi(xt) (homework) and

∇(½ distCi²(xt)) = distCi(xt) · ∇distCi(xt)
Subgradient methods 4-31
Example: projection onto intersection of convex sets
Proof (cont.): Adopting Polyak's stepsize rule and recognizing that ‖gt‖2 = 1 and fopt = 0, we arrive at

xt+1 = xt − ηt gt = xt − distCi(xt) · (xt − PCi(xt)) / distCi(xt) = PCi(xt)

where ηt = (f(xt) − fopt)/‖gt‖2² = distCi(xt) and i = arg max_{j=1,2} distCj(xt) ∎
Subgradient methods 4-32
Convergence rate with Polyak’s stepsize
Theorem 4.2 (Convergence of projected subgradient method with Polyak's stepsize)

Suppose f is convex and Lf-Lipschitz continuous. Then the projected subgradient method (4.1) with Polyak's stepsize rule obeys

fbest,t − fopt ≤ Lf ‖x0 − x∗‖2 / √(t + 1)
• sublinear convergence rate O(1/√t)
Subgradient methods 4-33
Proof of Theorem 4.2
We have seen from (4.5) that

(f(xt) − f(x∗))² ≤ {‖xt − x∗‖2² − ‖xt+1 − x∗‖2²} ‖gt‖2²
               ≤ {‖xt − x∗‖2² − ‖xt+1 − x∗‖2²} Lf²

Applying it recursively for all iterations (from 0th to tth) and summing them up yield

Σ_{k=0}^t (f(xk) − f(x∗))² ≤ {‖x0 − x∗‖2² − ‖xt+1 − x∗‖2²} Lf²

⟹ (t + 1)(fbest,t − fopt)² ≤ ‖x0 − x∗‖2² Lf²

which concludes the proof

Subgradient methods 4-34
Other stepsize choices?
Unfortunately, Polyak’s stepsize rule requires knowledge of fopt,which is often unknown a priori
We might often need simpler rules for setting stepsizes
Subgradient methods 4-35
Convex and Lipschitz problems
Theorem 4.3 (Subgradient methods for convex and Lipschitz functions)

Suppose f is convex and Lf-Lipschitz continuous. Then the projected subgradient update rule (4.1) obeys

fbest,t − fopt ≤ (‖x0 − x∗‖2² + Lf² Σ_{i=0}^t ηi²) / (2 Σ_{i=0}^t ηi)
Subgradient methods 4-36
Implications: stepsize rules
• Constant stepsize ηt ≡ η:

lim_{t→∞} fbest,t ≤ fopt + Lf² η / 2

i.e. may converge to non-optimal points

• Diminishing stepsize obeying Σt ηt² < ∞ and Σt ηt → ∞:

lim_{t→∞} fbest,t = fopt

i.e. converges to optimal points
Subgradient methods 4-37
Implications: stepsize rule
• Optimal choice? ηt = 1/√t:

fbest,t − fopt ≲ (‖x0 − x∗‖2² + Lf² log t) / √t

i.e. attains ε-accuracy within about O(1/ε²) iterations (ignoring the log factor)
Subgradient methods 4-38
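A sketch of the ηt = 1/√t rule on unconstrained f(x) = ‖x‖1 (so the projection is the identity and fopt = 0); the starting point and iteration budget are arbitrary:

```python
import numpy as np

def subgradient_method(f, subgrad, x0, iters=3000):
    # eta_t = 1/sqrt(t+1); C = R^n, so the projection is the identity
    x, f_best = x0.astype(float), f(x0)
    for t in range(iters):
        x = x - subgrad(x) / np.sqrt(t + 1)
        f_best = min(f_best, f(x))      # track f_best,t = min_i f(x_i)
    return f_best

f = lambda x: np.abs(x).sum()           # f_opt = 0
f_best = subgradient_method(f, np.sign, np.array([5.0, -5.0]))
```

The iterates oscillate around 0 with amplitude on the order of the current stepsize, so fbest,t keeps shrinking even though f(xt) is not monotone.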
Proof of Theorem 4.3
Applying Lemma 4.1 recursively gives

‖xt+1 − x∗‖2² ≤ ‖x0 − x∗‖2² − 2 Σ_{i=0}^t ηi(f(xi) − fopt) + Σ_{i=0}^t ηi²‖gi‖2²

Rearranging terms, we are left with

2 Σ_{i=0}^t ηi(f(xi) − fopt) ≤ ‖x0 − x∗‖2² − ‖xt+1 − x∗‖2² + Σ_{i=0}^t ηi²‖gi‖2²
                            ≤ ‖x0 − x∗‖2² + Lf² Σ_{i=0}^t ηi²

⟹ fbest,t − fopt ≤ Σ_{i=0}^t ηi(f(xi) − fopt) / Σ_{i=0}^t ηi ≤ (‖x0 − x∗‖2² + Lf² Σ_{i=0}^t ηi²) / (2 Σ_{i=0}^t ηi)
Subgradient methods 4-39
Strongly convex and Lipschitz problems
If f is strongly convex, then the convergence guarantees can be improved to O(1/t), as long as the stepsize diminishes at O(1/t)
Theorem 4.4 (Subgradient methods for strongly convex andLipschitz functions)
Let f be µ-strongly convex and Lf-Lipschitz continuous over C. If ηt = 2/(µ(t+1)), then

fbest,t − fopt ≤ (2Lf²/µ) · 1/(t + 1)
Subgradient methods 4-40
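A sketch of Theorem 4.4's stepsize ηt = 2/(µ(t+1)) on the 1-strongly convex function f(x) = ½‖x‖2² + ‖x‖1, whose minimizer is 0 with fopt = 0 (the Lipschitz condition holds on the bounded region the iterates stay in; the starting point is arbitrary):

```python
import numpy as np

mu = 1.0
f = lambda x: 0.5 * (x @ x) + np.abs(x).sum()   # 1-strongly convex, f_opt = 0
g = lambda x: x + np.sign(x)                    # a subgradient of f

x, f_best = np.array([4.0, -4.0]), np.inf
for t in range(500):
    x = x - (2.0 / (mu * (t + 1))) * g(x)       # eta_t = 2 / (mu * (t + 1))
    f_best = min(f_best, f(x))
```

After 500 iterations the suboptimality gap is on the order of 1/t, in line with the theorem.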
Proof of Theorem 4.4
When f is µ-strongly convex, we can improve Lemma 4.1 to (exercise)

‖xt+1 − x∗‖2² ≤ (1 − µηt)‖xt − x∗‖2² − 2ηt(f(xt) − fopt) + ηt²‖gt‖2²

⟹ f(xt) − fopt ≤ ((1 − µηt)/(2ηt)) ‖xt − x∗‖2² − (1/(2ηt)) ‖xt+1 − x∗‖2² + (ηt/2) ‖gt‖2²

Since ηt = 2/(µ(t+1)), we have

f(xt) − fopt ≤ (µ(t−1)/4) ‖xt − x∗‖2² − (µ(t+1)/4) ‖xt+1 − x∗‖2² + (1/(µ(t+1))) ‖gt‖2²

and hence

t(f(xt) − fopt) ≤ (µt(t−1)/4) ‖xt − x∗‖2² − (µt(t+1)/4) ‖xt+1 − x∗‖2² + (1/µ) ‖gt‖2²
Subgradient methods 4-41
Proof of Theorem 4.4 (cont.)
Summing over all iterations before t, we get

Σ_{k=0}^t k(f(xk) − fopt) ≤ 0 − (µt(t+1)/4) ‖xt+1 − x∗‖2² + (1/µ) Σ_{k=1}^t ‖gk‖2²
                          ≤ (t/µ) Lf²

⟹ fbest,t − fopt ≤ (Lf²/µ) · t / Σ_{k=0}^t k ≤ (2Lf²/µ) · 1/(t + 1)
Subgradient methods 4-42
Summary: subgradient methods
| problem class | stepsize rule | convergence rate | iteration complexity |
| --- | --- | --- | --- |
| convex & Lipschitz | ηt ≍ 1/√t | O(1/√t) | O(1/ε²) |
| strongly convex & Lipschitz | ηt ≍ 1/t | O(1/t) | O(1/ε) |
Subgradient methods 4-43
Convex-concave saddle point problems
Convex-concave saddle point problems
minimize_{x∈X} max_{y∈Y} f(x, y)

• f(x, y): convex in x and concave in y
• X, Y: bounded closed convex sets
• arises in game theory, robust optimization, generative adversarial networks (GANs), ...
• under mild conditions, it is equivalent to its dual formulation

maximize_{y∈Y} min_{x∈X} f(x, y)
Subgradient methods 4-45
Saddle points
[PICTURE]
Optimal point (x∗,y∗) obeys
f(x∗,y) ≤ f(x∗,y∗) ≤ f(x,y∗), ∀x ∈ X , y ∈ Y
Subgradient methods 4-46
Projected subgradient method
A natural strategy is to apply the subgradient-based approach
[xt+1; yt+1] = PX×Y([xt; yt] − ηt [gtx; −gty])    (4.6)

             = projection([subgradient descent on xt; subgradient ascent on yt])

where gtx ∈ ∂x f(xt, yt) and −gty ∈ ∂y(−f(xt, yt))
Subgradient methods 4-47
Performance metric
One way to measure the quality of the solution is via the following error metric (think of it as a certain “duality gap”)
$$\varepsilon(x, y) := \Big[ \max_{y \in \mathcal{Y}} f(x, y) - f_{\mathrm{opt}} \Big] + \Big[ f_{\mathrm{opt}} - \min_{x \in \mathcal{X}} f(x, y) \Big] = \max_{y \in \mathcal{Y}} f(x, y) - \min_{x \in \mathcal{X}} f(x, y)$$

where $f_{\mathrm{opt}} := f(x^*, y^*)$ with $(x^*, y^*)$ the optimal solution
Subgradient methods 4-48
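As a toy illustration of this metric (an example of my own, not from the slides), take the bilinear game $f(x, y) = xy$ on $\mathcal{X} = \mathcal{Y} = [-1, 1]$: the saddle point is $(0, 0)$ with $f_{\mathrm{opt}} = 0$, and the inner optimizations have closed forms.

```python
# Duality-gap metric for the toy bilinear game f(x, y) = x*y on
# X = Y = [-1, 1].  Saddle point: (x*, y*) = (0, 0), so f_opt = 0, and
#   max_{y in [-1, 1]} x * y = |x|,   min_{x in [-1, 1]} x * y = -|y|,
# hence eps(x, y) = (|x| - 0) + (0 - (-|y|)) = |x| + |y|.

def gap(x, y):
    return abs(x) + abs(y)

print(gap(0.0, 0.0))    # 0.0 -- the gap vanishes exactly at the saddle point
print(gap(0.5, -0.5))   # 1.0 -- positive away from it
```

As expected, $\varepsilon(x, y) \ge 0$ everywhere and equals zero only at the saddle point.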
Convex-concave and Lipschitz problems
Theorem 4.5 (Subgradient methods for saddle point problems)
Suppose $f$ is convex in $x$ and concave in $y$, and is $L_f$-Lipschitz continuous over $\mathcal{X} \times \mathcal{Y}$. Let $D_{\mathcal{X}}$ (resp. $D_{\mathcal{Y}}$) be the diameter of $\mathcal{X}$ (resp. $\mathcal{Y}$). Then the projected subgradient method (4.6) obeys
$$\varepsilon(\bar{x}^t, \bar{y}^t) \le \frac{D_{\mathcal{X}}^2 + D_{\mathcal{Y}}^2 + L_f^2 \sum_{\tau=0}^{t} \eta_\tau^2}{2 \sum_{\tau=0}^{t} \eta_\tau}$$

where $\bar{x}^t = \frac{\sum_{\tau=0}^{t} \eta_\tau x^\tau}{\sum_{\tau=0}^{t} \eta_\tau}$ and $\bar{y}^t = \frac{\sum_{\tau=0}^{t} \eta_\tau y^\tau}{\sum_{\tau=0}^{t} \eta_\tau}$
• similar to our theory for convex problems
• suggests varying stepsize $\eta_t \asymp 1/\sqrt{t}$
Subgradient methods 4-49
Iterate averaging
Notably, it is crucial to output the weighted averages $(\bar{x}^t, \bar{y}^t)$ of the iterates of the subgradient method
In fact, the original iterates (xt,yt) might not converge
Example (bilinear game): f(x, y) = xy
• When $\eta_t \to 0$ (continuous limit), $(x^t, y^t)$ exhibits cycling behavior around $(x^*, y^*) = (0, 0)$ without converging to it
Subgradient methods 4-50
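A quick numerical sketch of this phenomenon (unconstrained, so the projection is the identity, with a small constant stepsize standing in for the continuous limit): the raw iterates stay on a slowly expanding cycle around $(0, 0)$, while their running average lands near the saddle point.

```python
import numpy as np

# Bilinear game f(x, y) = x*y, unconstrained, small constant stepsize.
x, y = 1.0, 0.0
xs, ys = [], []
eta = 0.01
for _ in range(10_000):
    x, y = x - eta * y, y + eta * x    # simultaneous descent/ascent step
    xs.append(x); ys.append(y)

r_last = np.hypot(xs[-1], ys[-1])              # last iterate: still far from (0, 0)
r_avg = np.hypot(np.mean(xs), np.mean(ys))     # averaged iterate: close to (0, 0)
print(r_last, r_avg)
```

Each step applies the rotation-like map $\begin{bmatrix} 1 & -\eta \\ \eta & 1 \end{bmatrix}$, whose eigenvalues have modulus $\sqrt{1 + \eta^2} > 1$, so the last iterate actually spirals slowly outward while the average of nearly full cycles nearly cancels.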
Proof of Theorem 4.5

By the convexity-concavity of $f$,

$$f(x^t, y^t) - f(x, y^t) \le \langle g_x^t, x^t - x \rangle, \quad x \in \mathcal{X}$$
$$f(x^t, y) - f(x^t, y^t) \le \langle g_y^t, y - y^t \rangle, \quad y \in \mathcal{Y}$$

Adding these two inequalities yields

$$f(x^t, y) - f(x, y^t) \le \langle g_x^t, x^t - x \rangle - \langle g_y^t, y^t - y \rangle, \quad x \in \mathcal{X},\ y \in \mathcal{Y}$$

Therefore, invoking the convexity-concavity of $f$ once again gives

$$\begin{aligned} \varepsilon(\bar{x}^t, \bar{y}^t) &= \max_{y \in \mathcal{Y}} f(\bar{x}^t, y) - \min_{x \in \mathcal{X}} f(x, \bar{y}^t) \\ &\le \frac{1}{\sum_{\tau=0}^{t} \eta_\tau} \left\{ \max_{y \in \mathcal{Y}} \sum_{\tau=0}^{t} \eta_\tau f(x^\tau, y) - \min_{x \in \mathcal{X}} \sum_{\tau=0}^{t} \eta_\tau f(x, y^\tau) \right\} \\ &\le \frac{1}{\sum_{\tau=0}^{t} \eta_\tau} \max_{x \in \mathcal{X},\, y \in \mathcal{Y}} \sum_{\tau=0}^{t} \eta_\tau \big\{ \langle g_x^\tau, x^\tau - x \rangle - \langle g_y^\tau, y^\tau - y \rangle \big\} \end{aligned} \tag{4.7}$$
Subgradient methods 4-51
Proof of Theorem 4.5 (cont.)
It then suffices to control the RHS of (4.7) as follows:
Lemma 4.6
$$\max_{x \in \mathcal{X},\, y \in \mathcal{Y}} \sum_{\tau=0}^{t} \eta_\tau \big\{ \langle g_x^\tau, x^\tau - x \rangle - \langle g_y^\tau, y^\tau - y \rangle \big\} \le \frac{D_{\mathcal{X}}^2 + D_{\mathcal{Y}}^2 + L_f^2 \sum_{\tau=0}^{t} \eta_\tau^2}{2}$$
This lemma together with (4.7) immediately establishes Theorem 4.5
Subgradient methods 4-52
Proof of Lemma 4.6

For any $x \in \mathcal{X}$ we have

$$\begin{aligned} \|x^{\tau+1} - x\|_2^2 &= \|\mathcal{P}_{\mathcal{X}}(x^\tau - \eta_\tau g_x^\tau) - \mathcal{P}_{\mathcal{X}}(x)\|_2^2 \\ &\le \|x^\tau - \eta_\tau g_x^\tau - x\|_2^2 \qquad \text{(nonexpansiveness of } \mathcal{P}_{\mathcal{X}} \text{, by convexity of } \mathcal{X}) \\ &= \|x^\tau - x\|_2^2 - 2\eta_\tau \langle x^\tau - x, g_x^\tau \rangle + \eta_\tau^2 \|g_x^\tau\|_2^2 \end{aligned}$$

$$\Longrightarrow \quad 2\eta_\tau \langle x^\tau - x, g_x^\tau \rangle \le \|x^\tau - x\|_2^2 - \|x^{\tau+1} - x\|_2^2 + \eta_\tau^2 \|g_x^\tau\|_2^2$$

Similarly, for any $y \in \mathcal{Y}$ one has

$$-2\eta_\tau \langle y^\tau - y, g_y^\tau \rangle \le \|y^\tau - y\|_2^2 - \|y^{\tau+1} - y\|_2^2 + \eta_\tau^2 \|g_y^\tau\|_2^2$$

Combining these two inequalities and using Lipschitz continuity (which gives $\|g_x^\tau\|_2^2 + \|g_y^\tau\|_2^2 \le L_f^2$) yields

$$2\eta_\tau \langle g_x^\tau, x^\tau - x \rangle - 2\eta_\tau \langle g_y^\tau, y^\tau - y \rangle \le \|x^\tau - x\|_2^2 + \|y^\tau - y\|_2^2 - \|x^{\tau+1} - x\|_2^2 - \|y^{\tau+1} - y\|_2^2 + \eta_\tau^2 L_f^2$$
Subgradient methods 4-53
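The first inequality in the proof is just nonexpansiveness of the projection onto a convex set. A quick numerical sanity check of that fact, using a box (whose projection is a coordinate-wise clip; the set and dimensions are a hypothetical choice for illustration):

```python
import numpy as np

# X = [-1, 1]^5, whose Euclidean projection is a coordinate-wise clip.
def proj(z):
    return np.clip(z, -1.0, 1.0)

rng = np.random.default_rng(0)
for _ in range(1000):
    a, b = 3 * rng.standard_normal(5), 3 * rng.standard_normal(5)
    # ||P(a) - P(b)|| <= ||a - b||: projection never moves points apart
    assert np.linalg.norm(proj(a) - proj(b)) <= np.linalg.norm(a - b) + 1e-12
print("nonexpansiveness held on all 1000 random pairs")
```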
Proof of Lemma 4.6 (cont.)
Summing up these inequalities over $\tau = 0, \cdots, t$ gives

$$\begin{aligned} 2\sum_{\tau=0}^{t} \big\{ \eta_\tau \langle g_x^\tau, x^\tau - x \rangle - \eta_\tau \langle g_y^\tau, y^\tau - y \rangle \big\} &\le \|x^0 - x\|_2^2 + \|y^0 - y\|_2^2 - \|x^{t+1} - x\|_2^2 - \|y^{t+1} - y\|_2^2 + L_f^2 \sum_{\tau=0}^{t} \eta_\tau^2 \\ &\le \|x^0 - x\|_2^2 + \|y^0 - y\|_2^2 + L_f^2 \sum_{\tau=0}^{t} \eta_\tau^2 \\ &\le D_{\mathcal{X}}^2 + D_{\mathcal{Y}}^2 + L_f^2 \sum_{\tau=0}^{t} \eta_\tau^2 \end{aligned}$$
as claimed
Remark: this lemma does NOT rely on the convexity-concavity of f(·, ·)
Subgradient methods 4-54