
Tutorial: Algorithms for Large-Scale Semidefinite Programming

Alexandre d’Aspremont, CNRS & Ecole Polytechnique.

Support from NSF, ERC (project SIPA) and Google.

HPOPT, Delft, June 2012.


Introduction

A semidefinite program (SDP) is written

minimize   Tr(CX)
subject to Tr(AiX) = bi,  i = 1, . . . , m
           X ⪰ 0,

where X ⪰ 0 means that the matrix variable X ∈ Sn is positive semidefinite.

Its dual can be written

maximize   bTy
subject to C − ∑i yiAi ⪰ 0,

which is another semidefinite program, in the variable y ∈ Rm.
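As an illustration (a sketch, not part of the original slides), this primal SDP can be prototyped in a few lines of CVXPY; the data C, Ai, bi below are synthetic placeholders chosen so the problem is feasible and bounded.

import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, m = 6, 4
sym = lambda M: (M + M.T) / 2

A = [sym(rng.standard_normal((n, n))) for _ in range(m)]
C = sym(rng.standard_normal((n, n))) + 2 * n * np.eye(n)  # C positive definite, so Tr(CX) >= 0 on X >= 0
X0 = sym(rng.standard_normal((n, n))); X0 = X0 @ X0       # a PSD point used to build feasible data
b = np.array([np.trace(Ai @ X0) for Ai in A])

X = cp.Variable((n, n), PSD=True)
constraints = [cp.trace(A[i] @ X) == b[i] for i in range(m)]
prob = cp.Problem(cp.Minimize(cp.trace(C @ X)), constraints)
prob.solve()

print("optimal value:", prob.value)
print("min eig of X :", np.linalg.eigvalsh(X.value).min())  # should be >= -solver tolerance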


Introduction

Classical algorithms for semidefinite programming

� Following [Nesterov and Nemirovskii, 1994], most of the attention was focused on interior point methods.

� Basic idea: Newton’s method, with efficient linear algebra to compute the Newton step (or solve the KKT system).

� Fast and robust on small problems (n ∼ 500).

� Computing the Hessian is too hard on larger problems. Exploiting structure (sparsity, etc.) is hard too.

Solvers

� Open source solvers: SDPT3, SEDUMI, SDPA, CSDP, . . .

� Very powerful modeling systems: CVX


Introduction

Solving a MaxCut relaxation using CVX

max.  Tr(XC)
s.t.  diag(X) = 1
      X ⪰ 0,

is written as follows in CVX/MATLAB

cvx_begin
    variable X(n,n) symmetric
    maximize trace(C*X)
    subject to
        diag(X) == 1
        X == semidefinite(n)
cvx_end


Introduction

Algorithms for large-scale semidefinite programming.

Structure ⇒ algorithmic choices

Examples:

� SDPs with constant trace cast as max. eigenvalue minimization problems.

� Fast projection steps.

� Fast prox or affine minimization subproblems.

� Closed-form or efficiently solvable block minimization subproblems.

� Etc. . .


Introduction

Example. In many semidefinite relaxations of combinatorial problems, we can impose Tr(X) = 1 and solve

maximize   Tr(CX)
subject to Tr(AiX) = bi,  i = 1, . . . , m
           Tr(X) = 1, X ⪰ 0.

The dual can be written as a maximum eigenvalue minimization problem

minx λmax ( C + ∑i xiAi ) − bTx

in the variable x ∈ Rm.
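As an aside (an illustration, not from the slides), this objective and one subgradient are cheap to evaluate once a leading eigenvector of C + ∑i xiAi is known: if v1 is a unit leading eigenvector, the vector with entries v1TAiv1 − bi is a subgradient. A small NumPy sketch on random placeholder data:

import numpy as np

rng = np.random.default_rng(1)
n, m = 50, 10
sym = lambda M: (M + M.T) / 2
C = sym(rng.standard_normal((n, n)))
A = [sym(rng.standard_normal((n, n))) for _ in range(m)]
b = rng.standard_normal(m)

def f_and_subgrad(x):
    Ax = C + sum(xi * Ai for xi, Ai in zip(x, A))
    w, V = np.linalg.eigh(Ax)            # eigenvalues in ascending order
    v1 = V[:, -1]                        # leading eigenvector
    g = np.array([v1 @ Ai @ v1 for Ai in A]) - b
    return w[-1] - b @ x, g

val, g = f_and_subgrad(np.zeros(m))
print(val, np.linalg.norm(g))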


Outline

� Introduction

� First-order methods

◦ Subgradient methods

◦ Smoothing & accelerated algorithms

◦ Improving iteration complexity

� Exploiting structure

◦ Frank-Wolfe

◦ Block coordinate descent

◦ Dykstra, alternating projection

◦ Localization, cutting-plane methods


Subgradient methods

Solve

minx∈Q λmax(A(x)) + cTx

where A(x) = C + ∑i xiAi, using the projected subgradient method.

Input: A starting point x0 ∈ Rm.
1: for t = 0 to N − 1 do
2:    Set xt+1 = PQ(xt − γ ∂λmax(A(xt))).
3: end for
Output: A point x̄ = (1/N) ∑t=1..N xt.

Here, γ > 0 and PQ(·) is the Euclidean projection on Q.
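A hedged NumPy sketch of this loop (my own illustration; Q is taken to be a Euclidean ball of radius R purely so that PQ is a one-line rescaling, and the data are random placeholders):

import numpy as np

rng = np.random.default_rng(2)
n, m, R, gamma, N = 40, 8, 5.0, 0.05, 500
sym = lambda M: (M + M.T) / 2
C = sym(rng.standard_normal((n, n)))
A = [sym(rng.standard_normal((n, n))) for _ in range(m)]
c = rng.standard_normal(m)

def proj_ball(x, R):                      # Euclidean projection onto {||x|| <= R}
    nrm = np.linalg.norm(x)
    return x if nrm <= R else x * (R / nrm)

x, x_avg = np.zeros(m), np.zeros(m)
for t in range(N):
    w, V = np.linalg.eigh(C + sum(xi * Ai for xi, Ai in zip(x, A)))
    v1 = V[:, -1]
    g = np.array([v1 @ Ai @ v1 for Ai in A]) + c   # subgradient of the objective
    x = proj_ball(x - gamma * g, R)
    x_avg += x / N                                  # running average of the iterates

obj = np.linalg.eigvalsh(C + sum(xi * Ai for xi, Ai in zip(x_avg, A)))[-1] + c @ x_avg
print("objective at averaged iterate:", obj)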


Subgradient methods

� The number of iterations required to reach a target precision ε is

N = DQ² M² / ε²

where DQ is the diameter of Q and ‖∂λmax(A(x))‖ ≤ M on Q.

� The cost per iteration is the sum of

◦ The cost pQ of computing the Euclidean projection on Q.

◦ The cost of computing ∂λmax(A(x)), which is e.g. v1v1T where v1 is a leading eigenvector of A(x).

Computing one leading eigenvector of a dense matrix X with relative precision ε, using a randomly started Lanczos method, which succeeds with probability at least 1 − δ, costs

O( n² log(n/δ²) / √ε )

flops [Kuczynski and Wozniakowski, 1992, Th.4.2].
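For large sparse A(x), this eigenvector is typically computed with a few Lanczos iterations rather than a full eigendecomposition; a sketch with SciPy (synthetic sparse data, illustrative tolerance):

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

rng = np.random.default_rng(3)
n = 5000
M = sp.random(n, n, density=1e-3, format="csr", random_state=3)
S = (M + M.T) / 2                         # a large sparse symmetric matrix standing in for A(x)

v0 = rng.standard_normal(n)               # random start for the Lanczos iteration
lam, V = spla.eigsh(S, k=1, which="LA", v0=v0, tol=1e-6)
v1 = V[:, 0]
print("lambda_max approx:", lam[0])
# A subgradient of lambda_max at A(x) is the rank one matrix v1 v1^T;
# in the subgradient step only inner products with v1 are needed, so it is never formed.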


Subgradient methods

Solving minx∈Q λmax(A(x)) using projected subgradient.

� Easy to implement.

� Very poor performance in practice. The 1/ε² dependence is somewhat punishing. . .

Example below on MAXCUT.

[Figure: f − f∗ versus iteration for projected subgradient on a MAXCUT instance.]


Smoothing & accelerated algorithms


Smoothing & accelerated algorithms

[Nesterov, 2007] We can regularize the objective and solve

minx∈Q fµ(x) ≜ µ log Tr ( exp(A(x)/µ) )

for some regularization parameter µ > 0 (exp(·) is the matrix exponential here).

� If we set µ = ε/ log n we get

λmax(A(x)) ≤ fµ(x) ≤ λmax(A(x)) + ε

� The gradient ∇fµ(x) is Lipschitz continuous with constant

‖A‖² log n / ε

where ‖A‖ = sup‖h‖≤1 ‖A(h)‖2.
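A sketch (not from the slides) of this smooth oracle: both fµ(x) and ∇fµ(x) follow from one eigendecomposition of A(x), shifting by λmax(A(x)) before exponentiating for numerical stability; the data below are random placeholders.

import numpy as np

rng = np.random.default_rng(4)
n, m, mu = 30, 5, 0.1
sym = lambda M: (M + M.T) / 2
C = sym(rng.standard_normal((n, n)))
A = [sym(rng.standard_normal((n, n))) for _ in range(m)]

def f_mu_and_grad(x):
    Ax = C + sum(xi * Ai for xi, Ai in zip(x, A))
    w, V = np.linalg.eigh(Ax)
    z = np.exp((w - w[-1]) / mu)          # shift by lambda_max for numerical stability
    E = (V * z) @ V.T                     # exp((A(x) - lambda_max I)/mu)
    val = w[-1] + mu * np.log(z.sum())
    # gradient component i: Tr(exp(A(x)/mu) Ai) / Tr(exp(A(x)/mu)), shift cancels
    grad = np.array([np.trace(E @ Ai) for Ai in A]) / z.sum()
    return val, grad

val, grad = f_mu_and_grad(np.zeros(m))
print(val, np.linalg.norm(grad))          # val exceeds lambda_max(A(0)) by at most mu*log(n)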


Smoothing & accelerated algorithms

� The number of iterations required to get an ε solution using the smooth minimization algorithm in Nesterov [1983] grows as

(‖A‖ √log n / ε) √(d(x∗)/σ)

where d(·) is strongly convex with parameter σ > 0.

� The cost per iteration is (usually) dominated by the cost of forming the matrix exponential

exp(A(x)/µ)

which is O(n³) flops [Moler and Van Loan, 2003].

� Much better empirical performance.


Smoothing & accelerated algorithms

This means that the two classical complexity options for solving

minx∈Q λmax(A(x))

(assuming A(x) is cheap to form)

� Subgradient methods

O( DQ² (n² log n + pQ) / ε² )

� Smooth optimization

O( DQ √log n (n³ + pQ) / ε )

if we pick ‖ · ‖2² in the prox term.


Improving iteration complexity


Approximate gradients

An approximate gradient is often enough. This means computing only a few leading eigenvectors.

[Figure: spectrum λi of exp((X − λmax(X)I)/0.1) at the MAXCUT solution.]


Approximate gradients

Convergence guarantees using approximate gradients: if ∇̃f(x) is the approximate gradient oracle, we require

|〈∇̃f(x) − ∇f(x), y − z〉| ≤ δ,   for x, y, z ∈ Q,

(the condition depends on the diameter of Q). For example, to solve

minimize   λmax(A + X)
subject to |Xij| ≤ ρ

we only need to compute the j largest eigenvalues of A+X, with j such that

(n − j) e^λj √(∑i≤j e^2λi) / (∑i≤j e^λi)²  +  √(n − j) e^λj / ∑i≤j e^λi  ≤  δ/(ρn).

The impact of the diameter makes these conditions quite conservative.


Approximate gradients

Other possible conditions (often less stringent), when solving

minx∈Q maxu∈U Ψ(x, u)

If ux is an approximate solution to maxu∈U Ψ(x, u), we can check Vi(ux) ≤ δ

V1(ux) = maxu∈U ∇2Ψ(x, ux)T(u − ux)

V2(ux) = maxu∈U { Ψ(x, u) − Ψ(x, ux) + κ‖u − ux‖²/2 }

V3(ux) = maxu∈U Ψ(x, u) − Ψ(x, ux)

where V1(ux) ≤ V2(ux) ≤ V3(ux) ≤ δ.

The target accuracy δ on the oracle is a function of the target accuracy ε.

See [d’Aspremont, 2008a] and [Devolder, Glineur, and Nesterov, 2011] for further details.


Stochastic Smoothing

Max-rank one Gaussian smoothing. Suppose we pick ui ∈ Rn with i.i.d. uij ∼ N(0, 1) and define

f(X) = E [ maxi=1,...,k λmax(X + (ε/n) uiuiT) ]

� Approximation results are preserved up to a constant ck > 0

λmax(X) ≤ E[λmax(X + (ε/n)uuT )] ≤ λmax(X) + ckε

� The function f(X) is smooth and the Lipschitz constant of its gradient is bounded by

Lf ≤ E [ (n/ε) mini=1,...,k 1/u²i,1 ] ≤ Ck n/ε

where Ck = k/(√2 (k − 2)) is finite when k ≥ 3.

� Computing maxi=1,...,k λmax(X + (ε/n) uiuiT) costs O(kn² log n).
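A sketch of one call to this stochastic oracle (an illustration of the construction above; sizes are arbitrary): sample u1, . . . , uk, take the largest perturbed eigenvalue as the value estimate, and the corresponding leading eigenvector v as the stochastic gradient vvT.

import numpy as np
import scipy.sparse.linalg as spla

rng = np.random.default_rng(5)
n, k, eps = 500, 3, 0.1
X = rng.standard_normal((n, n)); X = (X + X.T) / 2

def stochastic_oracle(X, k, eps):
    n = X.shape[0]
    best_val, best_vec = -np.inf, None
    for _ in range(k):
        u = rng.standard_normal(n)
        Xi = X + (eps / n) * np.outer(u, u)      # rank-one Gaussian perturbation
        lam, v = spla.eigsh(Xi, k=1, which="LA") # leading eigenpair via Lanczos
        if lam[0] > best_val:
            best_val, best_vec = lam[0], v[:, 0]
    # value estimate and a stochastic (sub)gradient v v^T in S_n
    return best_val, np.outer(best_vec, best_vec)

val, G = stochastic_oracle(X, k, eps)
print(val, np.trace(G))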


Stochastic Smoothing

Optimal Stochastic Composite Optimization. The algorithm in Lan [2009] solves

minx∈Q Ψ(x) ≜ f(x) + h(x)

with the following assumptions

� f(x) has Lipschitz gradient with constant L and h(x) is Lipschitz with constant M,

� we have a stochastic oracle G(x, ξt) for the gradient, which satisfies

E[G(x, ξt)] = g(x) ∈ ∂Ψ(x)   and   E[‖G(x, ξt) − g(x)‖∗²] ≤ σ²

After N iterations, the iterate xagN+1 satisfies

E[ Ψ(xagN+1) − Ψ∗ ] ≤ 8 L Dω,Q² / N² + 4 Dω,Q √(4M² + σ²) / √N

which is optimal. Additional assumptions guarantee convergence w.h.p.


Maximum Eigenvalue Minimization

For maximum eigenvalue minimization

� We have σ ≤ 1, but we can reduce this by averaging q gradients, to control the tradeoff between smooth and non-smooth terms.

� If we set q = max{1, DQ/(ε√n)} and N = 2DQ√n/ε we get the following complexity picture

Complexity                Num. of Iterations       Cost per Iteration
Nonsmooth alg.            O(DQ²/ε²)                O(pQ + n² log n)
Smooth stochastic alg.    O(DQ√n/ε)                O(pQ + max{1, DQ/(ε√n)} n² log n)
Smoothing alg.            O(DQ√(log n)/ε)          O(pQ + n³)


Stochastic Smoothing

� Approximate gradients reduce empirical complexity. No a priori bounds on iteration cost.

� More efficient to run a lot of cheaper iterations, everything else being equal.

Many open questions. . .

� Not clear if rank one perturbations achieve the optimal complexity/smoothness tradeoff. Can we replicate the exponential smoothing stochastically?

� Non-monotonic line search for stochastic optimization?

� Bundle methods also improve the performance of subgradient techniques [Lemarechal et al., 1995, Kiwiel, 1995, Helmberg and Rendl, 2000, Oustry, 2000, Ben-Tal and Nemirovski, 2005, Lan, 2010]...


Outline

� Introduction

� First-order methods

◦ Subgradient methods

◦ Smoothing & accelerated algorithms

◦ Improving iteration complexity

� Exploiting structure

◦ Frank-Wolfe

◦ Block coordinate descent

◦ Dykstra, alternating projection

◦ Localization, cutting-plane methods


Frank-Wolfe


Frank-Wolfe

� Classical first order methods for solving

minimize   f(x)
subject to x ∈ C,

in x ∈ Rn, with C ⊂ Rn convex, relied on the assumption that the following prox subproblem could be solved efficiently

minimize   yTx + d(x)
subject to x ∈ C,

in the variable x ∈ Rn, where d(x) is a strongly convex function.

� The Frank-Wolfe algorithm assumes that the affine minimization subproblem

minimize   yTx
subject to x ∈ C

can be solved efficiently for any y ∈ Rn.


Frank-Wolfe

Frank and Wolfe [1956] algorithm. See also [Jaggi, 2011].

Input: A starting point x0 ∈ C.
1: for k = 0 to N − 1 do
2:    Compute ∇f(xk).
3:    Solve the affine minimization subproblem

          minimize   xT∇f(xk)
          subject to x ∈ C

      in x ∈ Rn, call the solution xd.
4:    Update the current point xk+1 = xk + (2/(k + 2)) (xd − xk)
5: end for
Output: A point xN.

Note that all iterates are feasible.


Frank-Wolfe

� Complexity. Assume that f is differentiable. Define the curvature Cf of the function f(x) as

Cf ≜ sup s,x∈M, α∈[0,1], y=x+α(s−x)  (1/α²) ( f(y) − f(x) − 〈y − x, ∇f(x)〉 ).

The Frank-Wolfe algorithm will then produce an ε solution after

Nmax = 4Cf/ε

iterations.

� Can use line search at each iteration to improve convergence.


Frank-Wolfe

� Stopping criterion. At each iteration, we get a lower bound on the optimum as a byproduct of the affine minimization step. By convexity,

f(xk) + ∇f(xk)T(xd − xk) ≤ f(x),   for all x ∈ C

and finally, calling f∗ the optimal value of the problem, we obtain

f(xk) − f∗ ≤ ∇f(xk)T(xk − xd).

This allows us to bound the suboptimality of the iterate at no additional cost.


Frank-Wolfe

Example. Semidefinite optimization with bounded trace.

minimize   f(X)
subject to Tr(X) = 1, X ⪰ 0,

in the variable X ∈ Sn.

The affine minimization subproblem is written

minimize   Tr(∇f(X)Y)
subject to Tr(Y) = 1, Y ⪰ 0,

in the variable Y ∈ Sn, and can be solved by a partial eigenvalue decomposition, with the optimum value equal to λmin(∇f(X)) [cf. Jaggi, 2011]. Each iteration is a rank one update.
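A hedged sketch of Frank-Wolfe on this bounded-trace feasible set: the objective f(X) = ½‖X − M‖F² below is only an arbitrary smooth stand-in (an assumption, not from the slides), and each linear subproblem is one extreme eigenvector computation.

import numpy as np
import scipy.sparse.linalg as spla

rng = np.random.default_rng(6)
n, N = 200, 300
M = rng.standard_normal((n, n)); M = (M + M.T) / 2

f      = lambda X: 0.5 * np.linalg.norm(X - M, "fro") ** 2
grad_f = lambda X: X - M

X = np.eye(n) / n                                   # feasible starting point
for k in range(N):
    G = grad_f(X)
    lam, v = spla.eigsh(G, k=1, which="SA")         # smallest eigenvalue of the gradient
    Xd = np.outer(v[:, 0], v[:, 0])                 # vertex of the spectrahedron
    gap = np.trace(G @ (X - Xd))                    # duality gap from the previous slide
    if gap <= 1e-3:
        break
    X += 2.0 / (k + 2) * (Xd - X)                   # rank one update of the iterate
print("iterations:", k, "gap:", gap, "f(X):", f(X))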


Block coordinate descent methods


Coordinate Descent

We seek to solve

minimize   f(x)
subject to x ∈ C

in the variable x ∈ Rn, with C ⊂ Rn convex.

� Our main assumption here is that C is a product of simpler sets. We rewrite the problem

minimize   f(x1, . . . , xp)
subject to xi ∈ Ci,  i = 1, . . . , p

where C = C1 × . . .× Cp.

� This helps if the minimization subproblems

minxi∈Ci f(x1, . . . , xi, . . . , xp)

can be solved very efficiently (or in closed-form).


Coordinate Descent

Algorithm. The algorithm simply computes the iterates x(k+1) as

xi(k+1) = argminxi∈Ci f(x1(k), . . . , xi, . . . , xp(k))

xj(k+1) = xj(k),   j ≠ i

for a certain i ∈ [1, p], cycling over all indices in [1, p].

Convergence.

� Complexity analysis similar to coordinate-wise gradient descent (or steepest descent in the ℓ1 norm).

� Need f(x) strongly convex to get an explicit complexity bound [Nesterov, 2010].

� Generalization to block methods for SDP in the “row-by-row” method of [Wen, Goldfarb, Ma, and Scheinberg, 2009].


Coordinate Descent

Example. Covariance selection [d’Aspremont et al., 2006]. The dual of the covariance selection problem is written

maximize   log det(S + U)
subject to ‖U‖∞ ≤ ρ
           S + U ⪰ 0

Let C = S + U be the current iterate; after permutation we can always assume that we optimize over the last column

maximize   log det ( C11       C12 + u )
                   ( C21 + uT  C22     )
subject to ‖u‖∞ ≤ ρ

where C12 is the last column of C (off-diag.).


Coordinate Descent

We can use the block determinant formula

det ( A  B ) = det(A) det(D − CA−1B)
    ( C  D )

to show that each row/column iteration reduces to a simple box-constrained QP

minimize   uT(C11)−1u
subject to ‖u‖∞ ≤ ρ

The dual of this last problem is a LASSO optimization problem.
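For illustration, a generic sketch of a box-constrained QP of this type (the linear term q and the data are my own placeholders, not the exact covariance-selection subproblem): projected gradient works well here because the projection onto {‖u‖∞ ≤ ρ} is coordinate-wise clipping.

import numpy as np

rng = np.random.default_rng(7)
p, rho = 50, 0.1
B = rng.standard_normal((p, p))
Q = B @ B.T + np.eye(p)                   # positive definite quadratic term
q = rng.standard_normal(p)                # illustrative linear term

L = 2 * np.linalg.eigvalsh(Q)[-1]         # Lipschitz constant of the gradient 2Qu + q
u = np.zeros(p)
for _ in range(500):
    u = np.clip(u - (2 * Q @ u + q) / L, -rho, rho)   # gradient step, then clip to the box
print("objective:", u @ Q @ u + q @ u)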


Dykstra, alternating projection


Dykstra, alternating projection

We focus on a simple feasibility problem

find x ∈ C1 ∩ C2

in the variable x ∈ Rn with C1, C2 ⊂ Rn two convex sets.

We assume now that the projection problems on Ci are easier to solve

minimize   ‖x − y‖2
subject to x ∈ Ci

in x ∈ Rn.


Dykstra, alternating projection

Algorithm (alternating projection)

� Choose x0 ∈ Rn.

� For k = 1, . . . , kmax iterate

1. Project on C1

xk+1/2 = argminx∈C1 ‖x − xk‖2

2. Project on C2

xk+1 = argminx∈C2 ‖x − xk+1/2‖2

Convergence. We can show dist(xk, C1 ∩ C2) → 0. Linear convergence holds provided some additional regularity assumptions are satisfied. See e.g. [Lewis, Malick, et al., 2008].


Dykstra, alternating projection

Algorithm (Dykstra)

� Choose x0, z0 ∈ Rn.

� For k = 1, . . . , kmax iterate

1. Project on C1

xk+1/2 = argminx∈C1 ‖x − zk‖2

2. Update zk+1/2 = 2xk+1/2 − zk

3. Project on C2

xk+1 = argminx∈C2 ‖x − zk+1/2‖2

4. Update zk+1 = zk + xk+1 − xk+1/2

Convergence. Usually faster than simple alternating projection.


Dykstra, alternating projection

Example. Matrix completion problem, given coefficients bij for (i, j) ∈ S

Find       X
such that  Xij = bij,  (i, j) ∈ S
           X ⪰ 0,

in the variable X ∈ Sn.

[Figure: positive semidefinite matrix completion, ‖Xk+1 − Xk+1/2‖F versus iteration k.]

Blue: alternating projection. Red: Dykstra. (from EE364B)
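A sketch of both schemes on this matrix completion problem (my own illustration; the sizes and sampling pattern S are arbitrary). The two projections are: restore the observed entries (the affine set C1) and clip negative eigenvalues (the PSD cone C2).

import numpy as np

rng = np.random.default_rng(8)
n = 40
Ztrue = rng.standard_normal((n, n)); Ztrue = Ztrue @ Ztrue.T   # ground-truth PSD matrix
mask = rng.random((n, n)) < 0.5
mask = np.triu(mask) | np.triu(mask).T                         # symmetric index set S
mask |= np.eye(n, dtype=bool)
B = Ztrue * mask                                               # observed entries b_ij

def proj_affine(X):                    # projection onto C1: enforce the observed entries
    Y = X.copy(); Y[mask] = B[mask]; return Y

def proj_psd(X):                       # projection onto C2: clip negative eigenvalues
    w, V = np.linalg.eigh((X + X.T) / 2)
    return (V * np.maximum(w, 0)) @ V.T

def alternating(K=100):
    X = np.zeros((n, n))
    for _ in range(K):
        X = proj_psd(proj_affine(X))
    return X

def dykstra(K=100):                    # update scheme from the previous slide
    X = np.zeros((n, n)); Z = np.zeros((n, n))
    for _ in range(K):
        Xh = proj_affine(Z)
        Zh = 2 * Xh - Z
        X = proj_psd(Zh)
        Z = Z + X - Xh
    return X

for name, X in [("alternating", alternating()), ("Dykstra", dykstra())]:
    print(name, "residual on S:", np.abs(X - B)[mask].max(),
          "min eig:", np.linalg.eigvalsh(X).min())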


Dykstra, alternating projection

Countless variations. . .

� Proximal point algorithm

� Douglas-Rachford splitting

� Operator splitting methods

� Bregman iterative methods

� . . .


Localization methods


Localization methods

From EE364B course at Stanford. . .

� Function f : Rn → R convex (and for now, differentiable)

� problem: minimize f

� oracle model: for any x we can evaluate f and ∇f(x) (at some cost)

Main assumption: evaluating the gradient is very expensive.

Convexity means f(x) ≥ f(x0) + ∇f(x0)T(x − x0), so

∇f(x0)T(x − x0) ≥ 0 ⟹ f(x) ≥ f(x0)

i.e., all points in the halfspace ∇f(x0)T(x − x0) ≥ 0 are worse than x0.


Localization methods

[Figure: the localization set Pk−1 is cut through x(k), with normal ∇f(x(k)), to produce Pk.]

� Pk gives our uncertainty of x⋆ at iteration k

� want to pick x(k) so that Pk+1 is as small as possible

� clearly want x(k) near the center of Pk


Localization methods

The analytic center of the polyhedron P = {z | aiTz ≤ bi, i = 1, . . . , m} is

AC(P) = argminz − ∑i=1..m log(bi − aiTz)

ACCPM is the localization method with next query point x(k+1) = AC(Pk) (found by Newton’s method).


Localization methods

� let x∗ be the analytic center of P = {z | aiTz ≤ bi, i = 1, . . . , m}

� let H∗ be the Hessian of the barrier at x∗,

H∗ = −∇² ∑i=1..m log(bi − aiTz) |z=x∗ = ∑i=1..m aiaiT / (bi − aiTx∗)²

� then, P ⊆ E = {z | (z − x∗)TH∗(z − x∗) ≤ m²} (not hard to show)

� let E(k) be the outer ellipsoid associated with x(k)

� a lower bound on the optimal value p⋆ is

p⋆ ≥ infz∈E(k) ( f(x(k)) + g(k)T(z − x(k)) ) = f(x(k)) − mk √(g(k)T H(k)−1 g(k))

(mk is the number of inequalities in Pk)

� gives the simple stopping criterion √(g(k)T H(k)−1 g(k)) ≤ ε/mk


Localization methods

ACCPM algorithm.

Input: A polyhedron P containing x⋆.
1: for t = 0 to N − 1 do
2:    Compute x∗, the analytic center of P, and the Hessian H∗.
3:    Compute f(x∗) and g ∈ ∂f(x∗).
4:    Set u := min{u, f(x∗)} and l := max{l, f(x∗) − m √(gTH∗−1g)}.
5:    Add the inequality gT(z − x∗) ≤ 0 to P.
6: end for

Output: A localization set P.
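A compact sketch of the whole loop (an illustration on a toy objective, not the original implementation): analytic centering by damped Newton on the log barrier, then a cut gT(z − x∗) ≤ 0 is appended and the next centering is warm-started from a point pushed slightly off the new cut.

import numpy as np

rng = np.random.default_rng(9)
d = 5
z0 = rng.uniform(-0.5, 0.5, d)                 # toy objective: f(z) = ||z - z0||^2
f      = lambda z: float(np.sum((z - z0) ** 2))
grad_f = lambda z: 2 * (z - z0)

A = np.vstack([np.eye(d), -np.eye(d)])         # initial polyhedron: the box -1 <= z <= 1
b = np.ones(2 * d)

def analytic_center(A, b, x, iters=50):
    # Damped Newton on -sum log(b - A z), keeping the iterate strictly feasible.
    for _ in range(iters):
        s = b - A @ x
        g = A.T @ (1.0 / s)
        if np.linalg.norm(g) < 1e-9:
            break
        H = A.T @ ((1.0 / s ** 2)[:, None] * A)
        dx = np.linalg.solve(H, -g)
        t = 1.0
        while np.any(b - A @ (x + t * dx) <= 0):   # backtrack until strictly feasible
            t *= 0.5
        x = x + t * dx
    s = b - A @ x
    return x, A.T @ ((1.0 / s ** 2)[:, None] * A)  # center and barrier Hessian

x, upper, lower = np.zeros(d), np.inf, -np.inf
for k in range(30):
    x, H = analytic_center(A, b, x)
    g = grad_f(x)
    upper = min(upper, f(x))
    lower = max(lower, f(x) - A.shape[0] * np.sqrt(g @ np.linalg.solve(H, g)))
    if upper - lower < 1e-4 or np.linalg.norm(g) < 1e-9:
        break
    step = 0.5 * np.min(b - A @ x) / np.linalg.norm(A, axis=1).max()
    A = np.vstack([A, g])                          # add the cut g'(z - x) <= 0
    b = np.append(b, g @ x)
    x = x - step * g / np.linalg.norm(g)           # strictly feasible warm start
print("cuts:", k, " gap:", upper - lower, " x:", x)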


Localization methods

ACCPM adds an inequality to P at each iteration, so centering gets harder and more storage is needed as the algorithm progresses.

Schemes for dropping constraints from P(k):

� remove all redundant constraints (expensive)

� remove some constraints known to be redundant

� remove constraints based on some relevance ranking


Localization methods

Example. Classification with indefinite kernels [Luss and d’Aspremont, 2008]. Solve

min{K⪰0, ‖K−K0‖F²≤β}  max{αTy=0, 0≤α≤C}  αTe − (1/2) Tr(K(Y α)(Y α)T)

in the variables K ∈ Sn and α ∈ Rn. This can be written

maximize   αTe − (1/2) ∑i max(0, λi(K0 + (Y α)(Y α)T/4ρ)) (αTY vi)²
           + ρ ∑i ( max(0, λi(K0 + (Y α)(Y α)T/4ρ)) )² + ρ Tr(K0K0)
           − 2ρ ∑i Tr((viviT)K0) max(0, λi(K0 + (Y α)(Y α)T/4ρ))
subject to αTy = 0, 0 ≤ α ≤ C

in the variable α ∈ Rn.

Computing the gradient at each iteration is expensive, but the feasible set is a polyhedron.


Localization methods


Convergence plots for ACCPM (left) and the projected gradient method (right) on random subsets of the USPS-SS-3-5 data set (average gap versus iteration number, dashed lines at plus and minus one standard deviation).


Conclusion

Countless other methods not discussed here. Some with no convergence guarantees.

� Low-rank semidefinite programming. (Choose a factorization X = V VT and solve in V.) [Burer and Monteiro, 2003, Journee et al., 2008]

� Row-by-row methods. Some solver variants for MAXCUT require only matrix-vector products. [Wen et al., 2009]

� Multiplicative update methods [Arora and Kale, 2007]. No implementation or performance details.

Some recent activity on subsampling.

� Variational inequality formulation [Juditsky et al., 2008, Baes et al., 2011].

� Columnwise or elementwise matrix subsampling [d’Aspremont, 2008b].


Conclusion

Large-scale semidefinite programs.

� First-order algorithms for solving (mostly) generic problems.

� For more specialized problems

Structure ⇒ algorithmic choices

What subproblem can you solve easily? Which algorithm exploits it best?


References

S. Arora and S. Kale. A combinatorial, primal-dual approach to semidefinite programs. In Proceedings of the thirty-ninth annual ACM symposium on Theory of Computing, pages 227–236, 2007.

M. Baes, M. Burgisser, and A. Nemirovski. A randomized mirror-prox method for solving structured large-scale matrix saddle-point problems. ArXiv preprint arXiv:1112.1274, 2011.

A. Ben-Tal and A. Nemirovski. Non-Euclidean restricted memory level method for large-scale convex optimization. Mathematical Programming, 102(3):407–456, 2005.

S. Burer and R.D.C. Monteiro. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming, 95(2):329–357, 2003.

A. d’Aspremont. Smooth optimization with approximate gradient. SIAM Journal on Optimization, 19(3):1171–1183, 2008a.

A. d’Aspremont. Subsampling algorithms for semidefinite programming. ArXiv preprint arXiv:0803.1990, 2008b.

A. d’Aspremont, O. Banerjee, and L. El Ghaoui. First-order methods for sparse covariance selection. SIAM Journal on Matrix Analysis and Applications, 30(1):56–66, 2006.

O. Devolder, F. Glineur, and Y. Nesterov. First-order methods of smooth convex optimization with inexact oracle. CORE Discussion Papers, (2011/02), 2011.

M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95–110, 1956.

C. Helmberg and F. Rendl. A spectral bundle method for semidefinite programming. SIAM Journal on Optimization, 10(3):673–696, 2000.

M. Jaggi. Convex optimization without projection steps. ArXiv preprint arXiv:1108.1170, 2011.

M. Journee, F. Bach, P.A. Absil, and R. Sepulchre. Low-rank optimization for semidefinite convex problems. ArXiv preprint arXiv:0807.4423, 2008.

A. Juditsky, A.S. Nemirovskii, and C. Tauvel. Solving variational inequalities with Stochastic Mirror-Prox algorithm. ArXiv preprint arXiv:0809.0815, 2008.

K.C. Kiwiel. Proximal level bundle methods for convex nondifferentiable optimization, saddle-point problems and variational inequalities. Mathematical Programming, 69(1):89–109, 1995.

J. Kuczynski and H. Wozniakowski. Estimating the largest eigenvalue by the power and Lanczos algorithms with a random start. SIAM J. Matrix Anal. Appl., 13(4):1094–1122, 1992.

G. Lan. An optimal method for stochastic composite optimization. Technical report, School of Industrial and Systems Engineering, Georgia Institute of Technology, 2009.


G. Lan. Bundle-type methods uniformly optimal for smooth and non-smooth convex optimization. Manuscript, Department of Industrial and Systems Engineering, University of Florida, Gainesville, FL, 32611, 2010.

C. Lemarechal, A. Nemirovskii, and Y. Nesterov. New variants of bundle methods. Mathematical Programming, 69(1):111–147, 1995.

A. Lewis and J. Malick. Alternating projections on manifolds. Mathematics of Operations Research, 33(1):216–234, 2008.

R. Luss and A. d’Aspremont. Support vector machine classification with indefinite kernels. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 953–960. MIT Press, Cambridge, MA, 2008.

C. Moler and C. Van Loan. Nineteen dubious ways to compute the exponential of a matrix, twenty-five years later. SIAM Review, 45(1):3–49, 2003.

Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 27(2):372–376, 1983.

Y. Nesterov. Smoothing technique and its applications in semidefinite optimization. Mathematical Programming, 110(2):245–259, 2007.

Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. CORE Discussion Papers, 2010.

Y. Nesterov and A. Nemirovskii. Interior-point polynomial algorithms in convex programming. Society for Industrial and Applied Mathematics, Philadelphia, 1994.

F. Oustry. A second-order bundle method to minimize the maximum eigenvalue function. Mathematical Programming, 89(1):1–33, 2000.

Z. Wen, D. Goldfarb, S. Ma, and K. Scheinberg. Row by row methods for semidefinite programming. Technical report, Department of IEOR, Columbia University, 2009.
