
L. Vandenberghe ECE236C (Spring 2020)

10. Dual proximal gradient method

• proximal gradient method applied to the dual

• examples

• alternating minimization method

10.1


Dual methods

Subgradient method: converges slowly, step size selection is difficult

Gradient method: requires differentiable dual cost function

• often the dual cost function is not differentiable, or has a nontrivial domain

• dual function can be smoothed by adding small strongly convex term to primal

Augmented Lagrangian method

• equivalent to gradient ascent on a smoothed dual problem

• quadratic penalty in augmented Lagrangian destroys separable primal structure

Proximal gradient method (this lecture): the dual cost is split into two terms

• one term is differentiable with Lipschitz continuous gradient

• other term has an inexpensive prox operator

Dual proximal gradient method 10.2


Composite primal and dual problem

primal: minimize $f(x) + g(Ax)$

dual: maximize $-g^*(z) - f^*(-A^Tz)$

the dual problem has the right structure for the proximal gradient method if

• $f$ is strongly convex: this implies that $f^*(-A^Tz)$ has a Lipschitz continuous gradient,

$$\|A\nabla f^*(-A^Tu) - A\nabla f^*(-A^Tv)\|_2 \le \frac{\|A\|_2^2}{\mu}\,\|u - v\|_2$$

where $\mu$ is the strong convexity constant of $f$ (see page 5.19; a short derivation follows the list below)

• prox operator of g (or g∗) is inexpensive (closed form or simple algorithm)
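The Lipschitz bound in the first bullet follows in two steps from the fact (page 5.19) that strong convexity of $f$ with constant $\mu$ makes $\nabla f^*$ Lipschitz continuous with constant $1/\mu$:

$$\|A\nabla f^*(-A^Tu) - A\nabla f^*(-A^Tv)\|_2 \le \|A\|_2\,\|\nabla f^*(-A^Tu) - \nabla f^*(-A^Tv)\|_2 \le \frac{\|A\|_2}{\mu}\,\|A^T(u - v)\|_2 \le \frac{\|A\|_2^2}{\mu}\,\|u - v\|_2$$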

Dual proximal gradient method 10.3


Dual proximal gradient update

minimize $g^*(z) + f^*(-A^Tz)$

• proximal gradient update:

$$z^+ = \mathrm{prox}_{tg^*}\bigl(z + tA\nabla f^*(-A^Tz)\bigr)$$

• $\nabla f^*$ can be computed by minimizing the partial Lagrangian (from p. 5.15, p. 5.19), giving the two-step update (a runnable sketch follows this list):

$$x = \mathop{\mathrm{argmin}}_x\,\bigl(f(x) + z^TAx\bigr), \qquad z^+ = \mathrm{prox}_{tg^*}(z + tAx)$$

• partial Lagrangian is a separable function of x if f is separable

• step size $t$ is constant ($t \le \mu/\|A\|_2^2$) or adjusted by backtracking

• faster variant uses accelerated proximal gradient method of lecture 7
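As an illustration (not part of the original slides), here is a minimal runnable sketch of this update for the specific choices $f(x) = \frac{1}{2}\|x - c\|_2^2$ (strongly convex with $\mu = 1$) and $g(y) = \gamma\|y\|_1$: then $g^*$ is the indicator of the $\ell_\infty$-ball of radius $\gamma$, so $\mathrm{prox}_{tg^*}$ is componentwise clipping, and the $x$-update has the closed form $x = c - A^Tz$. All names and problem data are illustrative.

```python
import numpy as np

# sketch: dual proximal gradient for  minimize (1/2)||x - c||^2 + gamma*||Ax||_1
# f(x) = (1/2)||x - c||^2 has mu = 1; prox_{t g*} clips onto [-gamma, gamma]
rng = np.random.default_rng(0)
m, n, gamma = 30, 20, 0.5
A = rng.standard_normal((m, n))
c = rng.standard_normal(n)

t = 1.0 / np.linalg.norm(A, 2)**2            # fixed step t <= mu/||A||_2^2
z = np.zeros(m)
for k in range(500):
    x = c - A.T @ z                          # x = argmin_x (f(x) + z^T A x)
    z = np.clip(z + t * (A @ x), -gamma, gamma)   # z+ = prox_{t g*}(z + t A x)

print(0.5 * np.linalg.norm(x - c)**2 + gamma * np.abs(A @ x).sum())
```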

Dual proximal gradient method 10.4


Dual proximal gradient update

$$x = \mathop{\mathrm{argmin}}_x\,\bigl(f(x) + z^TAx\bigr), \qquad z^+ = \mathrm{prox}_{tg^*}(z + tAx)$$

• the Moreau decomposition gives an alternate expression for the $z$-update:

$$z^+ = z + tAx - t\,\mathrm{prox}_{t^{-1}g}(t^{-1}z + Ax)$$

• the right-hand side can be written as $z + t(Ax - y)$ where

$$y = \mathrm{prox}_{t^{-1}g}(t^{-1}z + Ax) = \mathop{\mathrm{argmin}}_y\,\Bigl(g(y) + \frac{t}{2}\|Ax + t^{-1}z - y\|_2^2\Bigr) = \mathop{\mathrm{argmin}}_y\,\Bigl(g(y) + z^T(Ax - y) + \frac{t}{2}\|Ax - y\|_2^2\Bigr)$$

Dual proximal gradient method 10.5


Alternating minimization interpretation

$$x = \mathop{\mathrm{argmin}}_x\,\bigl(f(x) + z^TAx\bigr)$$

$$y = \mathop{\mathrm{argmin}}_y\,\Bigl(g(y) - z^Ty + \frac{t}{2}\|Ax - y\|_2^2\Bigr)$$

$$z^+ = z + t(Ax - y)$$

• first minimize Lagrangian over x, then augmented Lagrangian over y

• compare with augmented Lagrangian method:

$$(x, y) = \mathop{\mathrm{argmin}}_{x,y}\,\Bigl(f(x) + g(y) + z^T(Ax - y) + \frac{t}{2}\|Ax - y\|_2^2\Bigr)$$

• requires strongly convex f (in contrast to augmented Lagrangian method)

Dual proximal gradient method 10.6


Outline

• proximal gradient method applied to the dual

• examples

• alternating minimization method


Regularized norm approximation

primal: minimize $f(x) + \|Ax - b\|$

dual: maximize $-b^Tz - f^*(-A^Tz)$ subject to $\|z\|_* \le 1$

(see page 5.23)

• we assume f is strongly convex with constant µ, not necessarily differentiable

• we assume projections on unit ‖ · ‖∗-ball are simple

• this is a special case of the problem on page 10.3 with $g(y) = \|y - b\|$:

$$g^*(z) = \begin{cases} b^Tz & \|z\|_* \le 1 \\ +\infty & \text{otherwise,} \end{cases} \qquad \mathrm{prox}_{tg^*}(z) = P_C(z - tb)$$

where $P_C$ is projection on $C = \{z \mid \|z\|_* \le 1\}$

Dual proximal gradient method 10.7


Dual gradient projection

primal: minimize $f(x) + \|Ax - b\|$

dual: maximize $-b^Tz - f^*(-A^Tz)$ subject to $\|z\|_* \le 1$

• dual gradient projection update ($C = \{z \mid \|z\|_* \le 1\}$):

$$z^+ = P_C\bigl(z + t(A\nabla f^*(-A^Tz) - b)\bigr)$$

• the gradient of $f^*$ can be computed by minimizing the partial Lagrangian:

$$x = \mathop{\mathrm{argmin}}_x\,\bigl(f(x) + z^TAx\bigr), \qquad z^+ = P_C\bigl(z + t(Ax - b)\bigr)$$

Dual proximal gradient method 10.8


Example

primal: minimize $f(x) + \sum_{i=1}^p \|B_ix\|_2$

dual: maximize $-f^*(-B_1^Tz_1 - \cdots - B_p^Tz_p)$ subject to $\|z_i\|_2 \le 1$, $i = 1, \ldots, p$

Dual gradient projection update (for strongly convex $f$):

$$x = \mathop{\mathrm{argmin}}_x\,\Bigl(f(x) + \bigl(\sum_{i=1}^p B_i^Tz_i\bigr)^Tx\Bigr), \qquad z_i^+ = P_{C_i}(z_i + tB_ix), \quad i = 1, \ldots, p$$

• $C_i$ is the unit Euclidean norm ball in $\mathbf{R}^{m_i}$ if $B_i \in \mathbf{R}^{m_i \times n}$

• x-calculation decomposes if f is separable

Dual proximal gradient method 10.9


Example

• we take $f(x) = (1/2)\|Cx - d\|_2^2$

• each iteration requires the solution of a linear equation with coefficient $C^TC$

• randomly generated $C \in \mathbf{R}^{2000 \times 1000}$, $B_i \in \mathbf{R}^{10 \times 1000}$, $p = 500$ (a reduced-size sketch of this experiment follows below)
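As a concrete supplement (not from the slides), the following is a reduced-size sketch of this experiment, running only the plain projected dual gradient iteration of page 10.9; the dimensions are shrunk so it runs quickly, and all data are randomly generated.

```python
import numpy as np

# reduced-size sketch of the example above: f(x) = (1/2)||Cx - d||_2^2
# with dual gradient projection (page 10.9); no FISTA variant here
rng = np.random.default_rng(0)
n, p, mi = 100, 50, 5                  # slides use n = 1000, p = 500, m_i = 10
C = rng.standard_normal((200, n))
d = rng.standard_normal(200)
B = [rng.standard_normal((mi, n)) for _ in range(p)]

H = C.T @ C                            # x-update solves H x = C^T d - sum_i B_i^T z_i
mu = np.linalg.eigvalsh(H).min()       # strong convexity constant of f
BtB = sum(Bi.T @ Bi for Bi in B)       # A^T A for the stacked A = (B_1, ..., B_p)
t = mu / np.linalg.eigvalsh(BtB).max() # fixed step t <= mu/||A||_2^2

z = [np.zeros(mi) for _ in range(p)]
for k in range(300):
    x = np.linalg.solve(H, C.T @ d - sum(Bi.T @ zi for Bi, zi in zip(B, z)))
    for i in range(p):                 # project z_i + t B_i x onto the unit ball
        v = z[i] + t * (B[i] @ x)
        z[i] = v / max(1.0, np.linalg.norm(v))

print(0.5 * np.linalg.norm(C @ x - d)**2 + sum(np.linalg.norm(Bi @ x) for Bi in B))
```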

[figure: relative dual suboptimality versus iteration (semilog, $10^{-6}$ to $10^0$, 500 iterations) for the projected gradient and FISTA variants]

Dual proximal gradient method 10.10


Minimization over intersection of convex sets

minimize $f(x)$
subject to $x \in C_1 \cap \cdots \cap C_p$

• f is strongly convex with constant µ

• we assume each set Ci is closed, convex, and easy to project onto

• this is a special case of the problem on page 10.3 with

$$g(y_1, \ldots, y_p) = \delta_{C_1}(y_1) + \cdots + \delta_{C_p}(y_p), \qquad A = \begin{bmatrix} I & I & \cdots & I \end{bmatrix}^T$$

with this choice of g and A,

$$f(x) + g(Ax) = f(x) + \delta_{C_1}(x) + \cdots + \delta_{C_p}(x)$$

Dual proximal gradient method 10.11


Dual problem

primal: minimize $f(x) + \delta_{C_1}(x) + \cdots + \delta_{C_p}(x)$

dual: maximize $-\delta_{C_1}^*(z_1) - \cdots - \delta_{C_p}^*(z_p) - f^*(-z_1 - \cdots - z_p)$

• proximal mapping of $\delta_{C_i}^*$: from the Moreau decomposition (page 6.18),

$$\mathrm{prox}_{t\delta_{C_i}^*}(u) = u - tP_{C_i}(u/t)$$

• gradient of $h(z_1, \ldots, z_p) = f^*(-z_1 - \cdots - z_p)$:

$$\nabla h(z) = -A\nabla f^*(-A^Tz) = -\begin{bmatrix} I \\ \vdots \\ I \end{bmatrix} \nabla f^*(-z_1 - \cdots - z_p)$$

• $\nabla h(z)$ is Lipschitz continuous with constant $\|A\|_2^2/\mu = p/\mu$

Dual proximal gradient method 10.12


Dual proximal gradient method

primal: minimize $f(x) + \delta_{C_1}(x) + \cdots + \delta_{C_p}(x)$

dual: maximize $-\delta_{C_1}^*(z_1) - \cdots - \delta_{C_p}^*(z_p) - f^*(-z_1 - \cdots - z_p)$

• dual proximal gradient update:

$$s = -z_1 - \cdots - z_p, \qquad z_i^+ = z_i + t\nabla f^*(s) - tP_{C_i}\bigl(t^{-1}z_i + \nabla f^*(s)\bigr), \quad i = 1, \ldots, p$$

• the gradient of $f^*$ can be computed by minimizing the partial Lagrangian:

$$x = \mathop{\mathrm{argmin}}_x\,\bigl(f(x) + (z_1 + \cdots + z_p)^Tx\bigr), \qquad z_i^+ = z_i + tx - tP_{C_i}(z_i/t + x), \quad i = 1, \ldots, p$$

• the step size is fixed ($t \le \mu/p$) or adjusted by backtracking

Dual proximal gradient method 10.13


Euclidean projection on intersection of convex sets

minimize $\frac{1}{2}\|x - a\|_2^2$
subject to $x \in C_1 \cap \cdots \cap C_p$

• special case of previous problem with

$$f(x) = \frac{1}{2}\|x - a\|_2^2, \qquad f^*(u) = \frac{1}{2}\|u\|_2^2 + a^Tu$$

• the strong convexity constant is $\mu = 1$, so the step size $t = 1/p$ works

• dual proximal gradient update (with the change of variables $w_i = pz_i$):

$$x = a - \frac{1}{p}(w_1 + \cdots + w_p), \qquad w_i^+ = w_i + x - P_{C_i}(w_i + x), \quad i = 1, \ldots, p$$

• the $p$ projections in the second step can be computed in parallel (see the sketch below)
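A minimal runnable sketch of this iteration, with two illustrative sets chosen for their easy projections (the nonnegative orthant and the halfspace $\{x \mid \mathbf{1}^Tx \le 1\}$); the sets, sizes, and names are assumptions for the example only.

```python
import numpy as np

# project a onto C1 ∩ C2 with the update above (t = 1/p, w_i = p z_i)
def proj_orthant(x):                   # C1 = {x | x >= 0}
    return np.maximum(x, 0.0)

def proj_halfspace(x):                 # C2 = {x | 1^T x <= 1}
    return x - max(x.sum() - 1.0, 0.0) / x.size

projections = [proj_orthant, proj_halfspace]
p = len(projections)

rng = np.random.default_rng(0)
a = rng.standard_normal(50)
w = [np.zeros(50) for _ in range(p)]
for k in range(1000):
    x = a - sum(w) / p                 # x = a - (1/p)(w_1 + ... + w_p)
    w = [wi + x - P(wi + x) for wi, P in zip(w, projections)]  # parallelizable

print(max(0.0, -x.min()), max(0.0, x.sum() - 1.0))   # constraint violations -> 0
```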

Dual proximal gradient method 10.14


Nearest positive semidefinite unit-diagonal Z-matrix

projection in the Frobenius norm of $A \in \mathbf{S}^{100}$ on the intersection of two sets (both projections are sketched in code below):

$$C_1 = \mathbf{S}^{100}_+, \qquad C_2 = \{X \in \mathbf{S}^{100} \mid \mathrm{diag}(X) = \mathbf{1},\ X_{ij} \le 0 \text{ for } i \ne j\}$$
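Both projections have simple forms: eigenvalue clipping for $C_1$, and a componentwise operation for $C_2$ (set the diagonal to one, clip off-diagonal entries at zero). A sketch that plugs them into the $w$-iteration of page 10.14, with the dimension reduced from $n = 100$ for speed:

```python
import numpy as np

def proj_psd(X):                       # projection onto S^n_+: clip eigenvalues
    lam, Q = np.linalg.eigh(X)
    return (Q * np.maximum(lam, 0.0)) @ Q.T

def proj_Z(X):                         # diag(X) = 1, X_ij <= 0 off the diagonal
    Y = np.minimum(X, 0.0)             # componentwise clip at zero ...
    np.fill_diagonal(Y, 1.0)           # ... then restore the unit diagonal
    return Y

rng = np.random.default_rng(0)
n = 30                                 # slides use n = 100
A = rng.standard_normal((n, n)); A = (A + A.T) / 2
w = [np.zeros((n, n)), np.zeros((n, n))]
for k in range(300):
    X = A - (w[0] + w[1]) / 2          # x-update of page 10.14 with p = 2
    w = [wi + X - P(wi + X) for wi, P in zip(w, [proj_psd, proj_Z])]

print(np.linalg.eigvalsh(X).min(), np.abs(np.diag(X) - 1).max())  # both -> 0
```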

[figure: relative dual suboptimality versus iteration (semilog, $10^{-6}$ to $10^0$, 150 iterations) for the proximal gradient and FISTA variants]

Dual proximal gradient method 10.15


Euclidean projection on polyhedron

• intersection of $p$ halfspaces $C_i = \{x \mid a_i^Tx \le b_i\}$, with projection

$$P_{C_i}(x) = x - \frac{\max\{a_i^Tx - b_i,\, 0\}}{\|a_i\|_2^2}\, a_i$$

• example with $p = 2000$ inequalities and $n = 1000$ variables (a reduced-size sketch follows below)
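A reduced-size sketch of this example (not from the slides): the halfspace projection formula above, vectorized over all $p$ constraints, inside the parallel iteration of page 10.14; the data are random, with $b$ constructed so the polyhedron is nonempty.

```python
import numpy as np

# Euclidean projection of a0 onto the polyhedron {x | a_i^T x <= b_i, i = 1..p}
rng = np.random.default_rng(0)
p, n = 200, 50                         # slides use p = 2000, n = 1000
A = rng.standard_normal((p, n))
b = A @ rng.standard_normal(n) + rng.random(p)   # guarantees a feasible point
a0 = rng.standard_normal(n)            # the point to project

norms2 = (A**2).sum(axis=1)            # ||a_i||_2^2 for every halfspace
def project_rows(V):                   # row i of V projected onto halfspace i
    viol = np.maximum((A * V).sum(axis=1) - b, 0.0)
    return V - (viol / norms2)[:, None] * A

W = np.zeros((p, n))                   # w_i stored as the rows of W
for k in range(2000):
    x = a0 - W.sum(axis=0) / p         # x = a0 - (1/p) sum_i w_i
    V = W + x                          # broadcast: v_i = w_i + x
    W = V - project_rows(V)            # w_i+ = w_i + x - P_{C_i}(w_i + x)

print(np.maximum(A @ x - b, 0.0).max())   # maximum constraint violation -> 0
```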

[figure: relative dual suboptimality versus iteration (semilog, $10^{-4}$ to $10^0$, 4000 iterations) for the proximal gradient and FISTA variants]

Dual proximal gradient method 10.16


Decomposition of primal-dual separable problems

$$\text{minimize} \quad \sum_{j=1}^n f_j(x_j) + \sum_{i=1}^m g_i(A_{i1}x_1 + \cdots + A_{in}x_n)$$

• special case of f (x) + g(Ax) with (block-)separable f and g

• for example,

minimize $\sum_{j=1}^n f_j(x_j)$
subject to $\sum_{j=1}^n A_{1j}x_j \in C_1$, $\ldots$, $\sum_{j=1}^n A_{mj}x_j \in C_m$

• we assume each $f_j$ is strongly convex and each $g_i$ has an inexpensive prox operator

Dual proximal gradient method 10.17


Decomposition of primal-dual separable problems

primal: minimize $\sum_{j=1}^n f_j(x_j) + \sum_{i=1}^m g_i(A_{i1}x_1 + \cdots + A_{in}x_n)$

dual: maximize $-\sum_{i=1}^m g_i^*(z_i) - \sum_{j=1}^n f_j^*(-A_{1j}^Tz_1 - \cdots - A_{mj}^Tz_m)$

Dual proximal gradient update

$$x_j = \mathop{\mathrm{argmin}}_{x_j}\,\Bigl(f_j(x_j) + \sum_{i=1}^m z_i^TA_{ij}x_j\Bigr), \quad j = 1, \ldots, n$$

$$z_i^+ = \mathrm{prox}_{tg_i^*}\Bigl(z_i + t\sum_{j=1}^n A_{ij}x_j\Bigr), \quad i = 1, \ldots, m$$

Dual proximal gradient method 10.18


Outline

• proximal gradient method applied to the dual

• examples

• alternating minimization method


Separable structure with one strongly convex term

minimize $f_1(x_1) + f_2(x_2) + g(A_1x_1 + A_2x_2)$

• composite problem with separable f (two terms, for simplicity)

• if f1 and f2 are strongly convex, dual method of page 10.4 applies

$$x_1 = \mathop{\mathrm{argmin}}_{x_1}\,\bigl(f_1(x_1) + z^TA_1x_1\bigr), \qquad x_2 = \mathop{\mathrm{argmin}}_{x_2}\,\bigl(f_2(x_2) + z^TA_2x_2\bigr)$$

$$z^+ = \mathrm{prox}_{tg^*}\bigl(z + t(A_1x_1 + A_2x_2)\bigr)$$

• we now assume that one of the functions ($f_2$) is not strongly convex

Dual proximal gradient method 10.19


Separable structure with one strongly convex term

primal: minimize $f_1(x_1) + f_2(x_2) + g(A_1x_1 + A_2x_2)$

dual: maximize $-g^*(z) - f_1^*(-A_1^Tz) - f_2^*(-A_2^Tz)$

• we split the dual objective into the components $-f_1^*(-A_1^Tz)$ and $-g^*(z) - f_2^*(-A_2^Tz)$

• the component $f_1^*(-A_1^Tz)$ is differentiable with a Lipschitz continuous gradient

• the proximal mapping of $h(z) = g^*(z) + f_2^*(-A_2^Tz)$ was discussed on page 8.7:

$$\mathrm{prox}_{th}(w) = w + t(A_2x_2 - y)$$

where $x_2$, $y$ minimize a partial augmented Lagrangian:

$$(x_2, y) = \mathop{\mathrm{argmin}}_{x_2,y}\,\Bigl(f_2(x_2) + g(y) + \frac{t}{2}\|A_2x_2 - y + w/t\|_2^2\Bigr)$$

Dual proximal gradient method 10.20


Dual proximal gradient method

$$z^+ = \mathrm{prox}_{th}\bigl(z + tA_1\nabla f_1^*(-A_1^Tz)\bigr)$$

• evaluate $\nabla f_1^*$ by minimizing the partial Lagrangian:

$$x_1 = \mathop{\mathrm{argmin}}_{x_1}\,\bigl(f_1(x_1) + z^TA_1x_1\bigr), \qquad z^+ = \mathrm{prox}_{th}(z + tA_1x_1)$$

• evaluate $\mathrm{prox}_{th}(z + tA_1x_1)$ by minimizing the augmented Lagrangian:

$$(x_2, y) = \mathop{\mathrm{argmin}}_{x_2,y}\,\Bigl(f_2(x_2) + g(y) + \frac{t}{2}\|A_2x_2 - y + z/t + A_1x_1\|_2^2\Bigr)$$

$$z^+ = z + t(A_1x_1 + A_2x_2 - y)$$

Dual proximal gradient method 10.21


Alternating minimization method

starting at some initial z, repeat the following iteration

1. minimize the Lagrangian over $x_1$:

$$x_1 = \mathop{\mathrm{argmin}}_{x_1}\,\bigl(f_1(x_1) + z^TA_1x_1\bigr)$$

2. minimize the augmented Lagrangian over $x_2$, $y$:

$$(x_2, y) = \mathop{\mathrm{argmin}}_{x_2,y}\,\Bigl(f_2(x_2) + g(y) + \frac{t}{2}\|A_1x_1 + A_2x_2 - y + z/t\|_2^2\Bigr)$$

3. update the dual variable:

$$z^+ = z + t(A_1x_1 + A_2x_2 - y)$$

Dual proximal gradient method 10.22


Comparison with augmented Lagrangian method

Augmented Lagrangian method (for problem on page 10.19)

1. compute the minimizer $x_1$, $x_2$, $y$ of the augmented Lagrangian

$$f_1(x_1) + f_2(x_2) + g(y) + \frac{t}{2}\|A_1x_1 + A_2x_2 - y + z/t\|_2^2$$

2. update dual variable:

$$z^+ = z + t(A_1x_1 + A_2x_2 - y)$$

Differences with alternating minimization (dual proximal gradient method)

• augmented Lagrangian method does not require strong convexity of f1

• there is no upper limit on the step size t in augmented Lagrangian method

• quadratic term in step 1 of AL method destroys separability of f1(x1) + f2(x2)

Dual proximal gradient method 10.23


Example

minimize $\frac{1}{2}x_1^TPx_1 + q_1^Tx_1 + q_2^Tx_2$
subject to $B_1x_1 \preceq d_1$, $B_2x_2 \preceq d_2$, $A_1x_1 + A_2x_2 = b$

• without the equality constraint, the problem would separate into an independent QP and LP

• we assume $P \succ 0$

Formulation for dual decomposition

minimize $f_1(x_1) + f_2(x_2)$
subject to $A_1x_1 + A_2x_2 = b$

• the first function is strongly convex:

$$f_1(x_1) = \frac{1}{2}x_1^TPx_1 + q_1^Tx_1, \qquad \mathrm{dom}\,f_1 = \{x_1 \mid B_1x_1 \preceq d_1\}$$

• the second function is not: $f_2(x_2) = q_2^Tx_2$ with domain $\{x_2 \mid B_2x_2 \preceq d_2\}$

Dual proximal gradient method 10.24


Example

Alternating minimization algorithm

1. compute the solution $x_1$ of the QP

minimize $\frac{1}{2}x_1^TPx_1 + (q_1 + A_1^Tz)^Tx_1$
subject to $B_1x_1 \preceq d_1$

2. compute the solution $x_2$ of the QP

minimize $(q_2 + A_2^Tz)^Tx_2 + \frac{t}{2}\|A_1x_1 + A_2x_2 - b\|_2^2$
subject to $B_2x_2 \preceq d_2$

3. dual update (a small numerical sketch of the iteration follows):

$$z^+ = z + t(A_1x_1 + A_2x_2 - b)$$
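A minimal sketch of this iteration using CVXPY for the two subproblems (not from the slides); the data are randomly generated, box constraints $-\mathbf{1} \preceq x_i \preceq \mathbf{1}$ stand in for $B_ix_i \preceq d_i$ so both QPs are always solvable, and $b$ is constructed so the coupled problem is feasible.

```python
import numpy as np
import cvxpy as cp

# sketch of the three-step alternating minimization above (illustrative data)
rng = np.random.default_rng(0)
n1, n2, m = 10, 10, 5
M = rng.standard_normal((n1, n1))
P = M @ M.T + np.eye(n1)               # P > 0, so f1 is strongly convex
q1, q2 = rng.standard_normal(n1), rng.standard_normal(n2)
B1, d1 = np.vstack([np.eye(n1), -np.eye(n1)]), np.ones(2 * n1)  # box |x1| <= 1
B2, d2 = np.vstack([np.eye(n2), -np.eye(n2)]), np.ones(2 * n2)  # box |x2| <= 1
A1, A2 = rng.standard_normal((m, n1)), rng.standard_normal((m, n2))
b = A1 @ rng.uniform(-0.5, 0.5, n1) + A2 @ rng.uniform(-0.5, 0.5, n2)

mu = np.linalg.eigvalsh(P).min()       # strong convexity constant of f1
t = mu / np.linalg.norm(A1, 2)**2      # step size limit from page 10.4
z = np.zeros(m)
x1, x2 = cp.Variable(n1), cp.Variable(n2)
for k in range(100):
    # step 1: QP in x1 (minimize the Lagrangian over x1)
    cp.Problem(cp.Minimize(0.5 * cp.quad_form(x1, P) + (q1 + A1.T @ z) @ x1),
               [B1 @ x1 <= d1]).solve()
    # step 2: QP in x2 (minimize the augmented Lagrangian, x1 fixed)
    cp.Problem(cp.Minimize((q2 + A2.T @ z) @ x2
                           + (t / 2) * cp.sum_squares(A1 @ x1.value + A2 @ x2 - b)),
               [B2 @ x2 <= d2]).solve()
    # step 3: dual update
    z = z + t * (A1 @ x1.value + A2 @ x2.value - b)

print(np.linalg.norm(A1 @ x1.value + A2 @ x2.value - b))  # residual of Ax = b
```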

Dual proximal gradient method 10.25


References

• P. Tseng, Applications of a splitting algorithm to decomposition in convex programming and variational inequalities, SIAM J. Control and Optimization (1991).

• P. Tseng, Further applications of a splitting algorithm to decomposition in variational inequalities and convex programming, Mathematical Programming (1990).

Dual proximal gradient method 10.26