TRANSCRIPT
Algorithms: Successive Convex Approximation (SCA)
Prof. Daniel P. Palomar
ELEC5470/IEDA6100A - Convex Optimization
The Hong Kong University of Science and Technology (HKUST)
Fall 2020-21
Outline
1 Successive Convex Approximation (SCA) in a Nutshell
2 Classical Algorithms as SCA Instances
3 Parallel SCA
4 Applications
5 Surrogate Functions∗
6 Extensions of Parallel SCA∗
7 Connection to MM
Successive Convex Approximation (SCA)

Consider the following presumably difficult optimization problem:

    minimize_x   F(x)
    subject to   x ∈ X,

where the feasible set X is convex and F(x) is continuous.

Basic idea of SCA: solve a difficult problem via a sequence of simpler problems:

    x^{k+1} = arg min_{x∈X} F(x | x^k),

where F(x | x^k) is a surrogate or approximation of the original function F(x).

How to construct the surrogate functions F(x | x^k) at each iteration k = 0, 1, . . . so that the iterates {x^k} converge to an optimal x⋆? Answer: the basic idea of SCA in (Scutari et al. 2014)¹.
1 G. Scutari, F. Facchinei, P. Song, D. P. Palomar, and J.-S. Pang, “Decomposition by partial linearization: Parallel optimization of multi-agent systems,” IEEE Trans. Signal Processing, vol. 62, no. 3, pp. 641–656, 2014.
Iterative algorithm
[Three figure-only slides illustrating the iterative algorithm.]
Surrogate or approximating function
The surrogate function F(x | x^k) (apart from other technical conditions) has to satisfy:
1. F(x | x^k) is strongly convex on X;
2. F(x | x^k) is differentiable with gradient consistency at x^k: ∇F(x^k | x^k) = ∇F(x^k).

Original method proposed in (Scutari et al. 2014)².
Nice overview of SCA in (Scutari and Sun 2018)³.
2 G. Scutari, F. Facchinei, P. Song, D. P. Palomar, and J.-S. Pang, “Decomposition by partial linearization: Parallel optimization of multi-agent systems,” IEEE Trans. Signal Processing, vol. 62, no. 3, pp. 641–656, 2014.
3 G. Scutari and Y. Sun, “Parallel and distributed successive convex approximation methods for big-data optimization,” in C.I.M.E. Lecture Notes in Mathematics, Springer Verlag Series, 2018, pp. 141–308.
Algorithm
Algorithm: SCA
Set k = 0, initialize with a feasible point x^0 ∈ X and a stepsize sequence {γ^k} ⊂ (0, 1].
repeat
    Solve the surrogate problem:
        x̂(x^k) = arg min_{x∈X} F(x | x^k)
    Smooth it out to get the next iterate:
        x^{k+1} = x^k + γ^k (x̂(x^k) − x^k)
    k ← k + 1
until convergence
return x^k
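A minimal sketch of this loop (the function name solve_surrogate, the constant stepsize, and the stopping tolerance are our illustrative choices; the surrogate solver itself is problem-specific):

```python
import numpy as np

def sca(x0, solve_surrogate, gamma=0.5, tol=1e-6, max_iter=1000):
    """Generic SCA loop: x^{k+1} = x^k + gamma * (x_hat(x^k) - x^k).

    solve_surrogate(xk) must return the minimizer over X of the strongly
    convex surrogate F(. | xk); gamma is kept constant here, but any
    stepsize rule with gamma^k in (0, 1] can be plugged in.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_hat = solve_surrogate(x)            # solve the simpler convex problem
        if np.linalg.norm(x_hat - x) < tol:   # ||x_hat(x^k) - x^k|| -> 0 at a stationary point
            break
        x = x + gamma * (x_hat - x)           # smoothing step (stays feasible since X is convex)
    return x
```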
References
Original paper with convergence:
G. Scutari, F. Facchinei, P. Song, D. P. Palomar, and J.-S. Pang (2014). “Decomposition by Partial Linearization: Parallel Optimization of Multi-Agent Systems.” IEEE Trans. Signal Processing, 62(3), 641–656.

Overview book chapter:
G. Scutari and Y. Sun (2018). Parallel and Distributed Successive Convex Approximation Methods for Big-Data Optimization. C.I.M.E. Lecture Notes in Mathematics, Springer Verlag Series, 141–308.
Outline
1 Successive Convex Approximation (SCA) in a Nutshell
2 Classical Algorithms as SCA Instances
3 Parallel SCA
4 Applications
5 Surrogate Functions∗
6 Extensions of Parallel SCA∗
7 Connection to MM
Gradient descent method as SCA
Consider the unconstrained problem

    minimize_x   F(x)

Let's use the following surrogate function for F(x):

    F(x | x^k) = F(x^k) + ∇F(x^k)^T (x − x^k) + (1/(2α^k)) ∥x − x^k∥².

To solve the surrogate problem, set the gradient of F(x | x^k) to zero:

    ∇F(x^k) + (1/α^k)(x − x^k) = 0.

The iterate is finally obtained as

    x^{k+1} = x^k − α^k ∇F(x^k).
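A tiny numerical sketch on a toy quadratic (the test function and the step size α are illustrative): the closed-form minimizer of this surrogate is exactly the gradient step.

```python
import numpy as np

# Toy problem: F(x) = 0.5 * x^T A x - b^T x, with A symmetric positive definite.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad_F = lambda x: A @ x - b

alpha = 0.1              # fixed step size (illustrative)
x = np.zeros(2)
for _ in range(300):
    # Minimizing F(x^k) + grad^T (x - x^k) + (1/(2*alpha)) ||x - x^k||^2
    # gives, in closed form, the gradient-descent step:
    x = x - alpha * grad_F(x)

print(x, np.linalg.solve(A, b))  # both approach the exact minimizer A^{-1} b
```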
Newton’s method as SCA
Consider again the unconstrained problem

    minimize_x   F(x)

Let's now use the following better surrogate function for F(x):

    F(x | x^k) = F(x^k) + ∇F(x^k)^T (x − x^k) + (1/(2α^k)) (x − x^k)^T ∇²F(x^k) (x − x^k).

To solve the surrogate problem, set the gradient of F(x | x^k) to zero:

    ∇F(x^k) + (1/α^k) ∇²F(x^k)(x − x^k) = 0.

The iterate is finally obtained as

    x^{k+1} = x^k − α^k ∇²F(x^k)^{−1} ∇F(x^k).
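The corresponding sketch for the Newton-type surrogate on the same toy quadratic (again illustrative); solving this surrogate amounts to a linear system with the Hessian.

```python
import numpy as np

# Same toy problem: F(x) = 0.5 * x^T A x - b^T x, so the Hessian is A everywhere.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad_F = lambda x: A @ x - b
hess_F = lambda x: A

alpha = 1.0              # full Newton step (illustrative)
x = np.zeros(2)
for _ in range(5):
    # Setting the surrogate's gradient to zero: grad + (1/alpha) H (x - x^k) = 0,
    # i.e., x^{k+1} = x^k - alpha * H^{-1} grad.
    x = x - alpha * np.linalg.solve(hess_F(x), grad_F(x))

print(x, np.linalg.solve(A, b))  # a single Newton step already solves this quadratic exactly
```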
Outline
1 Successive Convex Approximation (SCA) in a Nutshell
2 Classical Algorithms as SCA Instances
3 Parallel SCA
4 Applications
5 Surrogate Functions∗
6 Extensions of Parallel SCA∗
7 Connection to MM
Parallel SCA
Suppose we can partition the optimization variable into n blocks:

    x = (x_1, . . . , x_n)

and that the feasible set X has a Cartesian product form:

    X = X_1 × · · · × X_n.

Then we can write our problem as

    minimize_{x_1,...,x_n}   F(x_1, . . . , x_n)
    subject to               x_i ∈ X_i,   i = 1, . . . , n,

where each X_i is a convex set and F(x) is continuous.
Parallel SCA
The most natural parallel (Jacobi-type) solution method one can employ is to solve the problem blockwise and in parallel, i.e., given x^k, all the block variables x_i^k are updated simultaneously:

    x_i^{k+1} = arg min_{x_i∈X_i} F(x_i, x_{−i}^k),   ∀i = 1, . . . , n.

Unfortunately, this method converges only under very restrictive conditions that are seldom verified in practice. Furthermore, the computation of x_i^{k+1} is in general difficult due to the nonconvexity of F.

To cope with those issues, the parallel SCA minimizes a strongly convex surrogate function instead:

    x̂_i(x^k) = arg min_{x_i∈X_i} F_i(x_i | x^k),   ∀i = 1, . . . , n,

and then sets

    x_i^{k+1} = x_i^k + γ^k (x̂_i(x^k) − x_i^k).
Surrogate functions
Each surrogate function F_i(x_i | x^k) (apart from other technical conditions) has to satisfy (Scutari et al. 2014)⁴, (Scutari and Sun 2018)⁵:
1. F_i(x_i | x^k) is strongly convex on X_i;
2. F_i(x_i | x^k) is differentiable with gradient consistency at x^k: ∇_{x_i} F_i(x_i^k | x^k) = ∇_{x_i} F(x^k).

Stepsize rules for {γ^k} ⊂ (0, 1]:
1. Bounded stepsize: γ^k sufficiently small;
2. Diminishing stepsize: ∑_{k=1}^∞ γ^k = +∞ and ∑_{k=1}^∞ (γ^k)² < +∞;
3. Backtracking line search.
4 G. Scutari, F. Facchinei, P. Song, D. P. Palomar, and J.-S. Pang, “Decomposition by partial linearization: Parallel optimization of multi-agent systems,” IEEE Trans. Signal Processing, vol. 62, no. 3, pp. 641–656, 2014.
5 G. Scutari and Y. Sun, “Parallel and distributed successive convex approximation methods for big-data optimization,” in C.I.M.E. Lecture Notes in Mathematics, Springer Verlag Series, 2018, pp. 141–308.
Algorithm
Algorithm: Parallel SCA
Set k = 0, initialize with a feasible point x^0 ∈ X and a stepsize sequence {γ^k} ⊂ (0, 1].
repeat
    For all i = 1, . . . , n, solve in parallel:
        x̂_i(x^k) = arg min_{x_i∈X_i} F_i(x_i | x^k)
    Smooth it out to get the next iterate:
        x^{k+1} = x^k + γ^k (x̂(x^k) − x^k)
    k ← k + 1
until convergence
return x^k
Algorithm: Stepsize rule

The bounded stepsize is difficult to use in practice since it has to be smaller than some bound which is difficult to know.
The backtracking line search is effective in terms of iterations, but performing the line search requires evaluating the objective function multiple times per iteration, resulting in more costly iterations.
The diminishing stepsize is a very good option in practice. Two very effective examples of diminishing stepsize rules are:

    γ^{k+1} = γ^k (1 − ϵ γ^k),   k = 0, 1, . . . ,   γ^0 < 1/ϵ,

where ϵ ∈ (0, 1), and

    γ^{k+1} = (γ^k + α^k) / (1 + β^k),   k = 0, 1, . . . ,   γ^0 = 1,

where α^k and β^k satisfy 0 ≤ α^k ≤ β^k and α^k/β^k → 0 as k → ∞ while ∑_k α^k/β^k = ∞. Examples of such α^k and β^k are: α^k = α or α^k = log(k)^α, and β^k = β·k or β^k = β·√k, where α and β are given constants satisfying α ∈ (0, 1), β ∈ (0, 1), and α ≤ β.
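A quick sketch generating both diminishing sequences (the constants ϵ, α, β and the choices α^k = α, β^k = β·k below are illustrative instances of the stated conditions):

```python
import numpy as np

def stepsizes_rule1(num_iter, eps=0.01, gamma0=0.99):
    """gamma^{k+1} = gamma^k * (1 - eps * gamma^k), with gamma^0 < 1/eps."""
    gammas = [gamma0]
    for _ in range(num_iter - 1):
        g = gammas[-1]
        gammas.append(g * (1.0 - eps * g))
    return np.array(gammas)

def stepsizes_rule2(num_iter, alpha=0.5, beta=0.5):
    """gamma^{k+1} = (gamma^k + alpha^k) / (1 + beta^k), gamma^0 = 1."""
    gammas = [1.0]
    for k in range(num_iter - 1):
        alpha_k = alpha           # constant alpha^k (one of the suggested choices)
        beta_k = beta * (k + 1)   # beta^k = beta * k, so alpha^k / beta^k -> 0
        gammas.append((gammas[-1] + alpha_k) / (1.0 + beta_k))
    return np.array(gammas)

print(stepsizes_rule1(5))  # slowly decaying sequences in (0, 1]
print(stepsizes_rule2(5))
```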
Convergence
The following theorem gives the convergence of the parallel SCA algorithm to a stationary point (Scutari et al. 2014)⁶, (Scutari and Sun 2018)⁷.

Theorem
Suppose each F_i satisfies the previous assumptions and {γ^k} is chosen according to the bounded stepsize, the diminishing rule, or the backtracking line search. Then,

    lim_{k→∞} ∥x̂(x^k) − x^k∥ = 0.

6 G. Scutari, F. Facchinei, P. Song, D. P. Palomar, and J.-S. Pang, “Decomposition by partial linearization: Parallel optimization of multi-agent systems,” IEEE Trans. Signal Processing, vol. 62, no. 3, pp. 641–656, 2014.
7 G. Scutari and Y. Sun, “Parallel and distributed successive convex approximation methods for big-data optimization,” in C.I.M.E. Lecture Notes in Mathematics, Springer Verlag Series, 2018, pp. 141–308.
Outline
1 Successive Convex Approximation (SCA) in a Nutshell
2 Classical Algorithms as SCA Instances
3 Parallel SCA
4 Applications
5 Surrogate Functions∗
6 Extensions of Parallel SCA∗
7 Connection to MM
Recall: LASSO (ℓ2 − ℓ1 optimization) via BCD
Consider the problem

    minimize_x   f(x) ≜ (1/2)∥y − Ax∥_2² + λ∥x∥_1

We can use BCD on each element of x = (x_1, . . . , x_N). The optimization w.r.t. each block x_i at iteration k = 0, 1, . . . is

    minimize_{x_i}   f_i(x_i) ≜ (1/2)∥y_i^k − a_i x_i∥_2² + λ|x_i|,

where y_i^k ≜ y − ∑_{j<i} a_j x_j^{k+1} − ∑_{j>i} a_j x_j^k. This leads to the iterates for k = 0, 1, . . .

    x_i^{k+1} = soft_λ(a_i^T y_i^k) / ∥a_i∥²,   i = 1, . . . , N,

where soft_λ(u) ≜ sign(u)[|u| − λ]_+ is the soft-thresholding operator ([·]_+ ≜ max{·, 0}).
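A minimal sketch of this cyclic BCD sweep (the random test data and variable names are our own):

```python
import numpy as np

def soft_threshold(u, lam):
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def lasso_bcd(y, A, lam, num_iter=100):
    """Cyclic BCD for (1/2)||y - A x||_2^2 + lam * ||x||_1."""
    N = A.shape[1]
    x = np.zeros(N)
    col_norms2 = np.sum(A**2, axis=0)
    for _ in range(num_iter):
        for i in range(N):                      # sequential (cyclic) sweep over the elements
            y_i = y - A @ x + A[:, i] * x[i]    # residual excluding column i (uses latest updates)
            x[i] = soft_threshold(A[:, i] @ y_i, lam) / col_norms2[i]
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
x_true = np.zeros(20); x_true[:3] = [1.0, -2.0, 0.5]
y = A @ x_true + 0.01 * rng.standard_normal(50)
print(lasso_bcd(y, A, lam=0.1)[:5])
```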
Recall: LASSO (ℓ2 − ℓ1 optimization) via MM
The critical step in the application of MM is to find a convenient majorizer of the function f(x) ≜ (1/2)∥y − Ax∥_2² + λ∥x∥_1.

Consider the following majorizer of f(x):

    u(x, x^k) = f(x) + dist(x, x^k),

where dist(x, x^k) = (c/2)∥x − x^k∥_2² − (1/2)∥Ax − Ax^k∥_2² and c > λ_max(A^T A).

Note that u(x, x^k) is a valid majorizer because it is continuous, it is an upper bound u(x, x^k) ≥ f(x) with u(x^k, x^k) = f(x^k), and ∇u(x^k, x^k) = ∇f(x^k).

The majorizer can be rewritten in a more convenient way as

    u(x, x^k) = (c/2)∥x − x̄^k∥_2² + λ∥x∥_1 + const.,

where x̄^k ≜ (1/c) A^T(y − Ax^k) + x^k.
Recall: LASSO (ℓ2 − ℓ1 optimization) via MM

Now that we have the majorizer, we can formulate the problem to be solved at each iteration k = 0, 1, . . .

    minimize_x   (c/2)∥x − x̄^k∥_2² + λ∥x∥_1

This problem looks like the original one but without the matrix A mixing all the components. As a consequence, it decouples into an optimization for each element, whose solution we already know to be given by the soft-thresholding operator, leading to the iterates for k = 0, 1, . . .

    x^{k+1} = soft_{λ/c}(x̄^k),

where the soft-thresholding operator is applied elementwise.

So what's the difference between the algorithms obtained via BCD and MM?
The BCD algorithm updates each element in a successive or cyclical way;
the MM algorithm updates all elements simultaneously.
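A small sketch of the resulting MM iteration (this is the classical ISTA-type update; the data and the choice of c are illustrative):

```python
import numpy as np

def soft_threshold(u, lam):
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def lasso_mm(y, A, lam, num_iter=500):
    """MM for (1/2)||y - A x||_2^2 + lam*||x||_1 with c > lambda_max(A^T A)."""
    c = np.linalg.norm(A, 2)**2 * 1.01       # c slightly above the largest eigenvalue of A^T A
    x = np.zeros(A.shape[1])
    for _ in range(num_iter):
        x_bar = x + (A.T @ (y - A @ x)) / c  # x_bar^k = x^k + (1/c) A^T (y - A x^k)
        x = soft_threshold(x_bar, lam / c)   # all elements updated simultaneously
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
x_true = np.zeros(20); x_true[:3] = [1.0, -2.0, 0.5]
y = A @ x_true + 0.01 * rng.standard_normal(50)
print(lasso_mm(y, A, lam=0.1)[:5])
```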
LASSO (ℓ2 − ℓ1 optimization) via SCA

We will now use SCA to solve the problem

    minimize_x   F(x) ≜ (1/2)∥y − Ax∥_2² + λ∥x∥_1.

We define the surrogate function for block (element) i as

    F_i(x_i | x^k) = (1/2)∥y_i^k − a_i x_i∥_2² + λ|x_i| + (τ_i/2)(x_i − x_i^k)²,

where y_i^k ≜ y − ∑_{j≠i} a_j x_j^k.

Thus, the resulting algorithm is to solve for all i = 1, . . . , N in parallel:

    minimize_{x_i}   (1/2)∥y_i^k − a_i x_i∥_2² + λ|x_i| + (τ_i/2)(x_i − x_i^k)²,

with solution

    x̂_i(x^k) = (1/(τ_i + ∥a_i∥²)) soft_λ(a_i^T y_i^k + τ_i x_i^k),

and then smooth it out as x^{k+1} = x^k + γ^k (x̂(x^k) − x^k).
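A sketch of this parallel SCA iteration (all elements updated at once, with τ_i = 0 since each scalar surrogate is already strongly convex; the stepsize rule, constants, and data are illustrative):

```python
import numpy as np

def soft_threshold(u, lam):
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def lasso_parallel_sca(y, A, lam, tau=0.0, gamma=0.9, eps=0.01, num_iter=300):
    """Parallel SCA for (1/2)||y - A x||_2^2 + lam*||x||_1 with a diminishing stepsize."""
    N = A.shape[1]
    x = np.zeros(N)
    col_norms2 = np.sum(A**2, axis=0)
    for _ in range(num_iter):
        # a_i^T y_i^k = a_i^T (y - A x^k) + ||a_i||^2 x_i^k, computed for all i at once
        corr = A.T @ (y - A @ x) + col_norms2 * x
        x_hat = soft_threshold(corr + tau * x, lam) / (tau + col_norms2)
        x = x + gamma * (x_hat - x)            # smoothing step
        gamma = gamma * (1.0 - eps * gamma)    # diminishing stepsize rule
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
x_true = np.zeros(20); x_true[:3] = [1.0, -2.0, 0.5]
y = A @ x_true + 0.01 * rng.standard_normal(50)
print(lasso_parallel_sca(y, A, lam=0.1)[:5])
```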
LASSO (ℓ2 − ℓ1 optimization) via SCA
Recall the iterate from the BCD algorithm:

    x_i^{k+1} = (1/∥a_i∥²) soft_λ(a_i^T y_i^k),   i = 1, . . . , N,

where y_i^k ≜ y − ∑_{j<i} a_j x_j^{k+1} − ∑_{j>i} a_j x_j^k.

Now compare with the iterate obtained with SCA:

    x̂_i(x^k) = (1/(τ_i + ∥a_i∥²)) soft_λ(a_i^T y_i^k + τ_i x_i^k),

with y_i^k ≜ y − ∑_{j≠i} a_j x_j^k, and then x^{k+1} = x^k + γ^k (x̂(x^k) − x^k).

There are several differences:
1. The SCA iterate has additional terms with τ_i; if we set τ_i = 0, both iterates look similar.
2. The updates in BCD are sequential (cyclical), whereas in SCA they are simultaneous (in parallel).
3. The term y_i^k is slightly different in the two approaches due to the sequential vs. parallel updates.
LASSO (ℓ2 − ℓ1 optimization) via SCA

Consider the BCD method but with parallel updates (it is then called the nonlinear Jacobi method):

    x_i^{k+1} = (1/∥a_i∥²) soft_λ(a_i^T y_i^k),   i = 1, . . . , N,

where y_i^k ≜ y − ∑_{j≠i} a_j x_j^k. Unfortunately, this is not guaranteed to converge.

Instead, consider the parallel SCA updates (setting τ_i = 0, which can be done since the surrogate is already strongly convex):

    x̂_i(x^k) = (1/∥a_i∥²) soft_λ(a_i^T y_i^k),   i = 1, . . . , N,

and then smooth: x^{k+1} = x^k + γ^k (x̂(x^k) − x^k).

The parallel SCA is guaranteed to converge (assuming γ^k is properly chosen) thanks to the smoothing operation.
Dictionary learning
Consider the dictionary learning problem

    minimize_{D,X}   (1/2)∥Y − DX∥_F² + λ∥X∥_1
    subject to       ∥[D]_{:,i}∥ ≤ 1,   i = 1, . . . , m,

where ∥X∥_F and ∥X∥_1 denote the Frobenius norm and the ℓ1 matrix norm of X, respectively.

Matrix D is the so-called dictionary; it is a fat matrix that contains a dictionary of the possible columns.
Matrix X selects a few columns of the dictionary, which is why it has to be sparse.
This problem is not jointly convex in (D, X), but it is bi-convex: for a fixed D, it is convex in X and, for a fixed X, it is convex in D.
Dictionary learning
We define the following two surrogate functions:

    F_1(D | X^k) = (1/2)∥Y − D X^k∥_F²
    F_2(X | D^k) = (1/2)∥Y − D^k X∥_F²

This leads to the following two convex problems:

a normalized LS problem:

    minimize_D   (1/2)∥Y − D X^k∥_F²
    subject to   ∥[D]_{:,i}∥ ≤ 1,   i = 1, . . . , m,

and a matrix LASSO problem:

    minimize_X   (1/2)∥Y − D^k X∥_F² + λ∥X∥_1,

which can be decomposed as a set of regular LASSO problems for each column of X.
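A rough sketch of the resulting block updates followed by the SCA smoothing step; the inner solvers (projected gradient for the D-block, ISTA for the X-block) and all constants are our illustrative choices, not prescribed by the slides:

```python
import numpy as np

def soft_threshold(U, lam):
    return np.sign(U) * np.maximum(np.abs(U) - lam, 0.0)

def update_D(Y, D, X, inner_iters=50):
    """Approximately solve min_D 0.5||Y - D X||_F^2 s.t. ||D[:, i]|| <= 1 (projected gradient)."""
    L = np.linalg.norm(X @ X.T, 2) + 1e-12                   # Lipschitz constant of the gradient
    for _ in range(inner_iters):
        D = D - ((D @ X - Y) @ X.T) / L                      # gradient step
        D = D / np.maximum(np.linalg.norm(D, axis=0), 1.0)   # project columns onto the unit ball
    return D

def update_X(Y, D, X, lam, inner_iters=50):
    """Approximately solve min_X 0.5||Y - D X||_F^2 + lam ||X||_1 (ISTA, column-wise LASSOs)."""
    L = np.linalg.norm(D.T @ D, 2) + 1e-12
    for _ in range(inner_iters):
        X = soft_threshold(X - (D.T @ (D @ X - Y)) / L, lam / L)
    return X

def dictionary_learning_sca(Y, m, lam=0.1, gamma=0.9, eps=0.01, num_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((Y.shape[0], m))
    D = D / np.linalg.norm(D, axis=0)          # feasible starting dictionary
    X = np.zeros((m, Y.shape[1]))
    for _ in range(num_iter):
        D_hat = update_D(Y, D, X)              # block D: constrained LS given X^k
        X_hat = update_X(Y, D, X, lam)         # block X: matrix LASSO given D^k (in parallel)
        D = D + gamma * (D_hat - D)            # smoothing step on both blocks
        X = X + gamma * (X_hat - X)
        gamma = gamma * (1.0 - eps * gamma)    # diminishing stepsize
    return D, X

Y = np.random.default_rng(1).standard_normal((30, 100))
D, X = dictionary_learning_sca(Y, m=50)
```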
Outline
1 Successive Convex Approximation (SCA) in a Nutshell
2 Classical Algorithms as SCA Instances
3 Parallel SCA
4 Applications
5 Surrogate Functions∗
6 Extensions of Parallel SCA∗
7 Connection to MM
Surrogates based on block-wise convexity
Suppose F(x_1, . . . , x_n) is convex in each block x_i separately (but not necessarily jointly). A natural approximation to exploit this partial convexity is

    F(x | x^k) = ∑_{i=1}^n F_i(x_i | x^k)

with

    F_i(x_i | x^k) = F(x_i, x_{−i}^k) + (τ_i/2)(x_i − x_i^k)^T H_i (x_i − x_i^k),

where τ_i is any positive constant and H_i is any positive definite matrix (e.g., H_i = I). The quadratic term involving H_i can be removed if F(x_i, x_{−i}^k) is strongly convex on X_i.
Proximal gradient-like surrogates
If no convexity is present in F(x), one alternative is to mimic proximal-gradient methods:

    F(x | x^k) = ∑_{i=1}^n F_i(x_i | x^k)

with

    F_i(x_i | x^k) = ∇_{x_i} F(x^k)^T (x_i − x_i^k) + (τ_i/2)∥x_i − x_i^k∥².

Observe that if τ_i is large enough, then the surrogate function F_i(x_i | x^k) becomes a majorizer. But in this case τ_i can be any positive number, so F_i(x_i | x^k) may not be a majorizer. The reason convergence can still be guaranteed is that γ^k < 1 and the smoothing step ensures convergence.
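A tiny sketch for unconstrained blocks, where this surrogate has the closed-form minimizer x_i^k − ∇_{x_i}F(x^k)/τ_i (the nonconvex test function and the constants are illustrative):

```python
import numpy as np

# Illustrative nonconvex function: F(x) = sum(x^2) + sin(sum(x)).
F = lambda x: np.sum(x**2) + np.sin(np.sum(x))
grad_F = lambda x: 2 * x + np.cos(np.sum(x))   # gradient (same formula for every scalar block)

tau, gamma, eps = 5.0, 0.9, 0.01
x = np.array([2.0, -1.0])
for _ in range(300):
    # For X_i = R, minimizing grad_i * (x_i - x_i^k) + (tau/2)(x_i - x_i^k)^2
    # gives the closed-form block update x_i^k - grad_i / tau.
    x_hat = x - grad_F(x) / tau
    x = x + gamma * (x_hat - x)        # smoothing step
    gamma *= (1.0 - eps * gamma)       # diminishing stepsize

print(x, F(x))  # settles at a stationary point of F
```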
Surrogates based on sum-utility function

Suppose F(x) = ∑_{i=1}^I f_i(x_1, . . . , x_n).

Such structure arises, e.g., in multi-agent systems wherein f_i is the cost function of agent i, which controls its own block variables x_i ∈ X_i.
It is common that the cost functions f_i are convex in some agents' variables. Define the set of indices of all the functions convex in x_i as

    C_i ≜ {j : f_j(x_i, x_{−i}) is convex in x_i}.

Then we can construct the following surrogate function:

    F(x | x^k) = ∑_{i=1}^n F_i^{C_i}(x_i | x^k),

where

    F_i^{C_i}(x_i | x^k) = ∑_{j∈C_i} f_j(x_i, x_{−i}^k) + ∑_{l∉C_i} ∇_{x_i} f_l(x^k)^T (x_i − x_i^k) + (τ_i/2)(x_i − x_i^k)^T H_i (x_i − x_i^k).

In words: each agent i keeps the convex part of F while it linearizes the nonconvex part.
Surrogates based on product of functions
Suppose that F(x) is expressed as a product of functions. For simplicity, consider the case of the product of two functions:

    F(x) = f_1(x) f_2(x),

with f_1 and f_2 convex and non-negative on X.

Inspired by the form of the gradient of F, ∇F = f_2 ∇f_1 + f_1 ∇f_2, it seems natural to consider the following surrogate function:

    F(x | x^k) = f_1(x) f_2(x^k) + f_1(x^k) f_2(x) + (τ/2)(x − x^k)^T H (x − x^k).

If f_1 and f_2 are not necessarily convex, we can instead use

    F(x | x^k) = f̃_1(x | x^k) f_2(x^k) + f_1(x^k) f̃_2(x | x^k),

where f̃_1 and f̃_2 are legitimate surrogates of f_1 and f_2, respectively.
Surrogates based on composition of functions
Let

    F(x) = h(f(x)),

where h is a convex smooth function nondecreasing in each component, and f is a smooth mapping with each component not necessarily convex. Example: F(x) = ∥f(x)∥².

Observe that the gradient is

    ∇F(x) = ∇f(x) ∇h(f(x)),

where ∇f(x) = [∇f_1(x) · · · ∇f_q(x)] is the Jacobian of f. A surrogate function is

    F(x | x^k) = h(f(x^k) + ∇f(x^k)^T (x − x^k)) + (τ/2)∥x − x^k∥².
Outline
1 Successive Convex Approximation (SCA) in a Nutshell
2 Classical Algorithms as SCA Instances
3 Parallel SCA
4 Applications
5 Surrogate Functions∗
6 Extensions of Parallel SCA∗
7 Connection to MM
Extension: Nondifferentiable objective function
Consider now the following problem:

    minimize_{x_1,...,x_n}   F(x_1, . . . , x_n) + G(x_1, . . . , x_n)
    subject to               x_i ∈ X_i,   i = 1, . . . , n,

where each X_i is a convex set, F(x) is continuous, and G(x) is convex but possibly nonsmooth.

Parallel SCA minimizes the following strongly convex surrogate functions in parallel for all i = 1, . . . , n (Scutari and Sun 2018)⁸:

    x̂_i(x^k) = arg min_{x_i∈X_i} F_i(x_i | x^k) + G(x_i, x_{−i}^k),

and then sets

    x^{k+1} = x^k + γ^k (x̂(x^k) − x^k).

8 G. Scutari and Y. Sun, “Parallel and distributed successive convex approximation methods for big-data optimization,” in C.I.M.E. Lecture Notes in Mathematics, Springer Verlag Series, 2018, pp. 141–308.
Extension: Inexact updates
The parallel SCA at each iteration k = 0, 1, . . . solves:

    x̂_i(x^k) = arg min_{x_i∈X_i} F_i(x_i | x^k),   ∀i = 1, . . . , n.

In practice, however, it may be too costly to obtain such minimizers. The inexact parallel SCA instead considers inexact solutions z_i^k ≈ x̂_i(x^k), for all i = 1, . . . , n, that satisfy:
1. ∥z_i^k − x̂_i(x^k)∥ ≤ ε_i^k with lim_{k→∞} ε_i^k = 0;
2. F_i(z_i^k | x^k) ≤ F_i(x_i^k | x^k).

In words, the inexact solutions are obtained with a desired accuracy ε_i^k, which must asymptotically vanish (increasing accuracy) in order to guarantee convergence (Scutari and Sun 2018)⁹.

9 G. Scutari and Y. Sun, “Parallel and distributed successive convex approximation methods for big-data optimization,” in C.I.M.E. Lecture Notes in Mathematics, Springer Verlag Series, 2018, pp. 141–308.
Extension: Selective updates
For problems with a large number of blocks n, it may be too costly to update all of them in parallel:

    x̂_i(x^k) = arg min_{x_i∈X_i} F_i(x_i | x^k),   ∀i = 1, . . . , n,

and

    x_i^{k+1} = x_i^k + γ^k (x̂_i(x^k) − x_i^k),   ∀i = 1, . . . , n.

The SCA with selective updates defines at each iteration k the set of blocks to be updated, S^k, so that the iterate is

    x_i^{k+1} = x_i^k + γ^k (x̂_i(x^k) − x_i^k)   if i ∈ S^k,
    x_i^{k+1} = x_i^k                             otherwise.
Extension: Selective updates

Several options are possible for the block selection rule S^k:
according to some deterministic (cyclic) rule;
according to some random-based rule;
greedy-like schemes: updating only the blocks that are “far away” from the optimum.

Convergence can still be guaranteed for the following update rules (Scutari and Sun 2018)¹⁰; a small sketch of the random-based rule follows below:
1. Essentially cyclic rule: S^k is selected so that ∪_{s=0}^{T−1} S^{k+s} = {1, . . . , n} for some finite T.
2. Greedy rule: Each S^k contains at least one index i such that

    E_i(x^k) ≥ ρ max_j E_j(x^k),

where ρ ∈ (0, 1] and E_i(x^k) is an approximation E_i(x^k) ≈ ∥x̂_i(x^k) − x_i^k∥.
3. Random-based rule: The sets S^k are realizations of independent random sets such that Pr(i ∈ S^k) ≥ p for some p > 0.

10 G. Scutari and Y. Sun, “Parallel and distributed successive convex approximation methods for big-data optimization,” in C.I.M.E. Lecture Notes in Mathematics, Springer Verlag Series, 2018, pp. 141–308.
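A minimal sketch of the random-based rule grafted onto the earlier LASSO parallel SCA update (the inclusion probability p and all other constants are illustrative; for brevity the candidate updates are computed for all blocks, whereas in practice one would compute x̂_i only for i ∈ S^k):

```python
import numpy as np

def soft_threshold(u, lam):
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def lasso_sca_selective(y, A, lam, p=0.5, gamma=0.9, eps=0.01, num_iter=300, seed=0):
    """Parallel SCA for LASSO updating only a random subset S^k of blocks per iteration."""
    rng = np.random.default_rng(seed)
    N = A.shape[1]
    x = np.zeros(N)
    col_norms2 = np.sum(A**2, axis=0)
    for _ in range(num_iter):
        S = rng.random(N) < p                         # random-based rule: Pr(i in S^k) = p > 0
        corr = A.T @ (y - A @ x) + col_norms2 * x     # a_i^T y_i^k for all i
        x_hat = soft_threshold(corr, lam) / col_norms2
        x[S] += gamma * (x_hat[S] - x[S])             # only the selected blocks move
        gamma *= (1.0 - eps * gamma)                  # diminishing stepsize
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 20))
y = A[:, 0] - 2 * A[:, 1] + 0.01 * rng.standard_normal(50)
print(lasso_sca_selective(y, A, lam=0.1)[:5])
```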
Outline
1 Successive Convex Approximation (SCA) in a Nutshell
2 Classical Algorithms as SCA Instances
3 Parallel SCA
4 Applications
5 Surrogate Functions∗
6 Extensions of Parallel SCA∗
7 Connection to MM
Majorization-Minimization (MM)

Consider the following problem:

    minimize_x   f(x)
    subject to   x ∈ X,

where X is the feasible set and f(x) is a continuous function.

The idea of MM is to iteratively approximate the problem by a simpler one (like in SCA). MM approximates f(x) by an upper-bound function (majorizer) u(x | x^k) satisfying some technical conditions, like the first-order condition ∇u(x^k | x^k) = ∇f(x^k).

At iteration k = 0, 1, . . . the surrogate problem is (Scutari et al. 2014)¹¹

    minimize_x   u(x | x^k)
    subject to   x ∈ X.

11 G. Scutari, F. Facchinei, P. Song, D. P. Palomar, and J.-S. Pang, “Decomposition by partial linearization: Parallel optimization of multi-agent systems,” IEEE Trans. Signal Processing, vol. 62, no. 3, pp. 641–656, 2014.
MM vs SCA
Surrogate function:
MM requires the surrogate function to be a global upper bound (which can be too demanding in some cases), albeit not necessarily convex.
SCA relaxes the upper-bound condition, but it requires the surrogate to be strongly convex.
MM vs SCA
Constraint set:
In principle, both SCA and MM require the feasible set X to be convex.
MM can be easily extended to nonconvex X on a case-by-case basis; for example: (Song et al. 2015)¹², (Kumar et al. 2019)¹³, (Kumar et al. 2020)¹⁴.
SCA can be extended to convexify the constraint functions, but it cannot deal with a nonconvex X directly, which limits its applicability in many real-world applications.

12 J. Song, P. Babu, and D. P. Palomar, “Sparse generalized eigenvalue problem via smooth optimization,” IEEE Trans. Signal Processing, vol. 63, no. 7, pp. 1627–1642, 2015.
13 S. Kumar, J. Ying, J. V. de M. Cardoso, and D. P. Palomar, “Structured graph learning via Laplacian spectral constraints,” in Proc. Advances in Neural Information Processing Systems (NeurIPS), Vancouver, Canada, 2019.
14 S. Kumar, J. Ying, J. V. de M. Cardoso, and D. P. Palomar, “A unified framework for structured graph learning via spectral constraints,” Journal of Machine Learning Research (JMLR), pp. 1–60, 2020.
MM vs SCA
Schedule of updates:
MM updates the whole variable x at each iteration (so, in principle, no distributed implementation).
If the majorizer in MM happens to be block separable in x = (x_1, . . . , x_N), then one can have a parallel update.
Block MM updates each block of x = (x_1, . . . , x_N) sequentially.
SCA, on the other hand, naturally has a parallel update (assuming the constraints are separable), which can be useful for distributed implementation.
Thanks
For more information visit:
https://www.danielppalomar.com
References I
Kumar, S., Ying, J., Cardoso, J. V. de M., & Palomar, D. P. (2019). Structured graph learning via Laplacian spectral constraints. In Proc. Advances in Neural Information Processing Systems (NeurIPS). Vancouver, Canada.
Kumar, S., Ying, J., Cardoso, J. V. de M., & Palomar, D. P. (2020). A unified framework for structured graph learning via spectral constraints. Journal of Machine Learning Research (JMLR), 1–60.
Scutari, G., Facchinei, F., Song, P., Palomar, D. P., & Pang, J.-S. (2014). Decomposition by partial linearization: Parallel optimization of multi-agent systems. IEEE Trans. Signal Processing, 62(3), 641–656.
Scutari, G., & Sun, Y. (2018). Parallel and distributed successive convex approximation methods for big-data optimization. In C.I.M.E. Lecture Notes in Mathematics (pp. 141–308). Springer Verlag Series.
References II
Song, J., Babu, P., & Palomar, D. P. (2015). Sparse generalized eigenvalue problem via smooth optimization. IEEE Trans. Signal Processing, 63(7), 1627–1642.