
Page 1

Algorithms: Successive Convex Approximation (SCA)

Prof. Daniel P. Palomar

ELEC5470/IEDA6100A - Convex Optimization
The Hong Kong University of Science and Technology (HKUST)

Fall 2020-21

Page 2

Outline

1 Successive Convex Approximation (SCA) in a Nutshell

2 Classical Algorithms as SCA Instances

3 Parallel SCA

4 Applications

5 Surrogate Functions∗

6 Extensions of Parallel SCA∗

7 Connection to MM

Page 3

1 Successive Convex Approximation (SCA) in a Nutshell

Page 4

Successive Convex Approximation (SCA)

Consider the following presumably difficult optimization problem:

minimize_x    F(x)
subject to    x ∈ X,

where the feasible set X is convex and F(x) is continuous.

Basic idea of SCA: solve a difficult problem via a sequence of simpler problems:

x^{k+1} = arg min_{x∈X}  F(x | x^k),

where F(x | x^k) is a surrogate or approximation of the original function F(x).

How to construct the surrogate functions F(x | x^k) at each iteration k = 0, 1, . . . so that the iterates {x^k} converge to an optimal x⋆?

Answer: the basic idea of SCA in (Scutari et al. 2014) [1].

[1] G. Scutari, F. Facchinei, P. Song, D. P. Palomar, and J.-S. Pang, “Decomposition by partial linearization: Parallel optimization of multi-agent systems,” IEEE Trans. Signal Processing, vol. 62, no. 3, pp. 641–656, 2014.

D. Palomar (HKUST) Algorithms: SCA 4 / 48

Page 5

Iterative algorithm

D. Palomar (HKUST) Algorithms: SCA 5 / 48

Page 6

Iterative algorithm

D. Palomar (HKUST) Algorithms: SCA 6 / 48

Page 7

Iterative algorithm

D. Palomar (HKUST) Algorithms: SCA 7 / 48

Page 8

Surrogate or approximating function

The surrogate function F(x | x^k) (apart from other technical conditions) has to satisfy:

1. F(x | x^k) is strongly convex on X;
2. F(x | x^k) is differentiable with gradient consistency at the current iterate: ∇F(x^k | x^k) = ∇F(x^k).

Original method proposed in (Scutari et al. 2014) [2]. Nice overview of SCA in (Scutari and Sun 2018) [3].

[2] G. Scutari, F. Facchinei, P. Song, D. P. Palomar, and J.-S. Pang, “Decomposition by partial linearization: Parallel optimization of multi-agent systems,” IEEE Trans. Signal Processing, vol. 62, no. 3, pp. 641–656, 2014.

[3] G. Scutari and Y. Sun, “Parallel and distributed successive convex approximation methods for big-data optimization,” in C.I.M.E Lecture Notes in Mathematics, Springer Verlag Series, 2018, pp. 141–308.

D. Palomar (HKUST) Algorithms: SCA 8 / 48

Page 9

Algorithm

Algorithm: SCA
Set k = 0, initialize with a feasible point x^0 ∈ X and a stepsize sequence {γ^k} ⊂ (0, 1].
repeat
    Solve the surrogate problem:
        x̂(x^k) = arg min_{x∈X}  F(x | x^k)
    Smooth it out to get the next iterate:
        x^{k+1} = x^k + γ^k (x̂(x^k) − x^k)
    k ← k + 1
until convergence
return x^k

D. Palomar (HKUST) Algorithms: SCA 9 / 48
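To make the loop above concrete, here is a minimal Python sketch (not from the slides): `solve_surrogate` stands for whatever problem-specific minimizer of F(· | x^k) over X one has, and the data, τ and constant stepsize in the usage example are arbitrary illustrative choices.

```python
import numpy as np

def sca(solve_surrogate, x0, gamma, max_iter=1000, tol=1e-6):
    """Generic SCA loop: x^{k+1} = x^k + gamma(k) * (xhat(x^k) - x^k)."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        x_hat = solve_surrogate(x)            # minimizer of the surrogate F(. | x^k) over X
        if np.linalg.norm(x_hat - x) <= tol:  # ||xhat(x^k) - x^k|| -> 0 at convergence
            break
        x = x + gamma(k) * (x_hat - x)        # smoothing step
    return x

# Illustrative usage: F(x) = 0.5*||A x - b||^2 with the surrogate
# grad F(x^k)^T (x - x^k) + (tau/2)*||x - x^k||^2, whose minimizer is a gradient step.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
tau = np.linalg.norm(A, 2) ** 2               # tau >= Lipschitz constant of grad F
best_response = lambda xk: xk - (A.T @ (A @ xk - b)) / tau
print(np.round(sca(best_response, np.zeros(5), gamma=lambda k: 1.0), 3))
```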

Page 10

References

Original paper with convergence:
G. Scutari, F. Facchinei, P. Song, D. P. Palomar, and J.-S. Pang (2014). “Decomposition by Partial Linearization: Parallel Optimization of Multi-Agent Systems.” IEEE Trans. Signal Processing, 62(3), 641–656.

Overview book chapter:
G. Scutari and Y. Sun (2018). Parallel and Distributed Successive Convex Approximation Methods for Big-Data Optimization. C.I.M.E Lecture Notes in Mathematics, Springer Verlag Series, 141–308.

D. Palomar (HKUST) Algorithms: SCA 10 / 48

Page 11

2 Classical Algorithms as SCA Instances

Page 12

Gradient descent method as SCA

Consider the unconstrained problem

minimize_x    F(x)

Let’s use the following surrogate function for F (x):

F(x | x^k) = F(x^k) + ∇F(x^k)^T (x − x^k) + (1/(2α^k)) ∥x − x^k∥².

To solve the surrogate problem, set the gradient of F (x | xk) to zero:

∇F(x^k) + (1/α^k)(x − x^k) = 0.

The iterate is finally obtained as

x^{k+1} = x^k − α^k ∇F(x^k).

D. Palomar (HKUST) Algorithms: SCA 12 / 48

Page 13

Newton’s method as SCA

Consider again the unconstrained problem

minimize_x    F(x)

Let’s use now the following better surrogate function for F (x):

F(x | x^k) = F(x^k) + ∇F(x^k)^T (x − x^k) + (1/(2α^k)) (x − x^k)^T ∇²F(x^k) (x − x^k).

To solve the surrogate problem, set the gradient of F (x | xk) to zero:

∇F(x^k) + (1/α^k) ∇²F(x^k)(x − x^k) = 0.

The iterate is finally obtained as

x^{k+1} = x^k − α^k ∇²F(x^k)^{−1} ∇F(x^k).

D. Palomar (HKUST) Algorithms: SCA 13 / 48
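As a quick numerical sanity check (not part of the slides), both surrogates above can be minimized in closed form on a toy strictly convex quadratic, recovering exactly the gradient iteration and the (one-step exact) Newton iteration; the data below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
M = rng.standard_normal((n, n))
Q = M @ M.T + n * np.eye(n)              # positive definite Hessian of F(x) = 0.5 x'Qx - b'x
b = rng.standard_normal(n)
x_star = np.linalg.solve(Q, b)           # unique minimizer of F
grad = lambda x: Q @ x - b

# First surrogate with alpha^k = 1/L: the SCA iterate is a gradient step.
alpha = 1.0 / np.linalg.eigvalsh(Q).max()
x = np.zeros(n)
for k in range(500):
    x = x - alpha * grad(x)              # x^{k+1} = x^k - alpha * grad F(x^k)
print("gradient-step SCA error:", np.linalg.norm(x - x_star))

# Second surrogate with alpha^k = 1: the SCA iterate is a Newton step (exact here).
x = np.zeros(n)
x = x - np.linalg.solve(Q, grad(x))      # x^{k+1} = x^k - [hess F(x^k)]^{-1} grad F(x^k)
print("Newton-step SCA error:", np.linalg.norm(x - x_star))
```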

Page 14

3 Parallel SCA

Page 15

Parallel SCA

Suppose we can partition the optimization variable into n blocks:

x = (x1, . . . , xn)

and that the feasible set X has a Cartesian product form:

X = X1 × · · · × Xn.

Then we can write our problem as

minimize_{x_1,...,x_n}    F(x_1, . . . , x_n)
subject to                x_i ∈ X_i,   i = 1, . . . , n

where each X_i is a convex set and F(x) is continuous.

D. Palomar (HKUST) Algorithms: SCA 15 / 48

Page 16

Parallel SCA

The most natural parallel (Jacobi-type) solution method one can employ is to solve the problem blockwise and in parallel, i.e., given x^k, all the block variables x_i^k are updated simultaneously:

x_i^{k+1} = arg min_{x_i∈X_i}  F(x_i, x_{−i}^k),   ∀i = 1, . . . , n.

Unfortunately, this method converges only under very restrictive conditions that are seldom verified in practice. Furthermore, the computation of x_i^{k+1} is in general difficult due to the nonconvexity of F.

To cope with these issues, parallel SCA minimizes a strongly convex surrogate function instead:

x̂_i(x^k) = arg min_{x_i∈X_i}  F_i(x_i | x^k),   ∀i = 1, . . . , n

and then sets

x_i^{k+1} = x_i^k + γ^k (x̂_i(x^k) − x_i^k).

D. Palomar (HKUST) Algorithms: SCA 16 / 48

Page 17

Surrogate functions

Each surrogate function F_i(x_i | x^k) (apart from other technical conditions) has to satisfy (Scutari et al. 2014) [4], (Scutari and Sun 2018) [5]:

1. F_i(x_i | x^k) is strongly convex on X_i;
2. F_i(x_i | x^k) is differentiable with gradient consistency at the current iterate: ∇_{x_i} F_i(x_i^k | x^k) = ∇_{x_i} F(x^k).

Stepsize rules for {γ^k} ⊂ (0, 1]:

1. Bounded stepsize: γ^k sufficiently small;
2. Diminishing stepsize: ∑_{k=1}^∞ γ^k = +∞ and ∑_{k=1}^∞ (γ^k)² < +∞;
3. Backtracking line search.

[4] G. Scutari, F. Facchinei, P. Song, D. P. Palomar, and J.-S. Pang, “Decomposition by partial linearization: Parallel optimization of multi-agent systems,” IEEE Trans. Signal Processing, vol. 62, no. 3, pp. 641–656, 2014.

[5] G. Scutari and Y. Sun, “Parallel and distributed successive convex approximation methods for big-data optimization,” in C.I.M.E Lecture Notes in Mathematics, Springer Verlag Series, 2018, pp. 141–308.

D. Palomar (HKUST) Algorithms: SCA 17 / 48

Page 18

Algorithm

Algorithm: Parallel SCA
Set k = 0, initialize with a feasible point x^0 ∈ X and a stepsize sequence {γ^k} ⊂ (0, 1].
repeat
    For all i = 1, . . . , n, solve in parallel:
        x̂_i(x^k) = arg min_{x_i∈X_i}  F_i(x_i | x^k)
    Smooth it out to get the next iterate:
        x^{k+1} = x^k + γ^k (x̂(x^k) − x^k)
    k ← k + 1
until convergence
return x^k

D. Palomar (HKUST) Algorithms: SCA 18 / 48

Page 19

Algorithm: Stepsize rule

The bounded stepsize is difficult to use in practice since it has to be smaller than some bound which is difficult to know.

The backtracking line search is effective in terms of iterations, but performing the line search requires evaluating the objective function multiple times per iteration, resulting in more costly iterations.

The diminishing stepsize is a very good option in practice. Two very effective examples of diminishing stepsize rules are:

γ^{k+1} = γ^k (1 − ϵ γ^k),   k = 0, 1, . . . ,   γ^0 < 1/ϵ,

where ϵ ∈ (0, 1), and

γ^{k+1} = (γ^k + α^k) / (1 + β^k),   k = 0, 1, . . . ,   γ^0 = 1,

where α^k and β^k satisfy 0 ≤ α^k ≤ β^k and α^k/β^k → 0 as k → ∞ while ∑_k α^k/β^k = ∞. Examples of such α^k and β^k are: α^k = α or α^k = log(k)^α, and β^k = β·k or β^k = β·√k, where α and β are given constants satisfying α ∈ (0, 1), β ∈ (0, 1), and α ≤ β.

D. Palomar (HKUST) Algorithms: SCA 19 / 48
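The two diminishing rules are easy to generate; a small sketch follows (ϵ, α, β and the horizon are arbitrary; for the second rule the exact index offset of β^k is an assumption, chosen here simply so that γ^k stays in (0, 1]).

```python
import numpy as np

def diminishing_rule_1(eps=0.1, gamma0=0.99, n_steps=12):
    """gamma^{k+1} = gamma^k * (1 - eps * gamma^k), with eps in (0,1) and gamma^0 < 1/eps."""
    g = [gamma0]
    for _ in range(n_steps - 1):
        g.append(g[-1] * (1.0 - eps * g[-1]))
    return np.array(g)

def diminishing_rule_2(alpha=0.5, beta=0.6, n_steps=12):
    """gamma^{k+1} = (gamma^k + alpha^k) / (1 + beta^k), here with alpha^k = alpha constant
    and beta^k = beta * (k + 1) (index offset assumed so that gamma^k remains in (0, 1])."""
    g = [1.0]                                 # gamma^0 = 1
    for k in range(n_steps - 1):
        g.append((g[-1] + alpha) / (1.0 + beta * (k + 1)))
    return np.array(g)

print(np.round(diminishing_rule_1(), 3))
print(np.round(diminishing_rule_2(), 3))
```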

Page 20

Convergence

The following gives the convergence of the parallel SCA algorithm to a stationary point (Scutari et al. 2014) [6], (Scutari and Sun 2018) [7].

Theorem. Suppose each F_i satisfies the previous assumptions and {γ^k} is chosen according to the bounded stepsize, the diminishing rule, or the backtracking line search. Then,

lim_{k→∞} ∥x̂(x^k) − x^k∥ = 0.

[6] G. Scutari, F. Facchinei, P. Song, D. P. Palomar, and J.-S. Pang, “Decomposition by partial linearization: Parallel optimization of multi-agent systems,” IEEE Trans. Signal Processing, vol. 62, no. 3, pp. 641–656, 2014.

[7] G. Scutari and Y. Sun, “Parallel and distributed successive convex approximation methods for big-data optimization,” in C.I.M.E Lecture Notes in Mathematics, Springer Verlag Series, 2018, pp. 141–308.

D. Palomar (HKUST) Algorithms: SCA 20 / 48

Page 21

4 Applications

Page 22

Recall: LASSO (ℓ2 − ℓ1 optimization) via BCD

Consider the problem

minimize_x    f(x) ≜ (1/2)∥y − Ax∥_2² + λ∥x∥_1

We can use BCD on each element of x = (x_1, . . . , x_N). The optimization w.r.t. each block x_i at iteration k = 0, 1, . . . is

minimize_{x_i}    f_i(x_i) ≜ (1/2)∥y_i^k − a_i x_i∥_2² + λ|x_i|

where y_i^k ≜ y − ∑_{j<i} a_j x_j^{k+1} − ∑_{j>i} a_j x_j^k. This leads to the iterates for k = 0, 1, . . .

x_i^{k+1} = soft_λ(a_i^T y_i^k) / ∥a_i∥²,   i = 1, . . . , N

where soft_λ(u) ≜ sign(u) [|u| − λ]_+ is the soft-thresholding operator ([·]_+ ≜ max{·, 0}).

D. Palomar (HKUST) Algorithms: SCA 22 / 48
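A compact NumPy sketch of this cyclic BCD loop (the synthetic data and λ are arbitrary); the residual r = y − Ax is kept up to date so that y_i^k never has to be rebuilt from scratch.

```python
import numpy as np

def soft(u, lam):
    """Soft-thresholding: sign(u) * max(|u| - lam, 0)."""
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def lasso_bcd(A, y, lam, n_sweeps=100):
    """Cyclic BCD for 0.5*||y - A x||^2 + lam*||x||_1."""
    N = A.shape[1]
    x = np.zeros(N)
    r = y - A @ x                          # residual with the current x
    col_norm2 = np.sum(A**2, axis=0)       # ||a_i||^2
    for _ in range(n_sweeps):
        for i in range(N):
            y_i = r + A[:, i] * x[i]       # y_i^k = y - sum_{j != i} a_j x_j (latest values)
            x_new = soft(A[:, i] @ y_i, lam) / col_norm2[i]
            r += A[:, i] * (x[i] - x_new)  # keep the residual consistent
            x[i] = x_new
    return x

# Tiny synthetic example (illustrative only)
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
x_true = np.zeros(20); x_true[:3] = [2.0, -1.5, 1.0]
y = A @ x_true + 0.01 * rng.standard_normal(50)
print(np.round(lasso_bcd(A, y, lam=0.5), 2))
```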

Page 23

Recall: LASSO (ℓ2 − ℓ1 optimization) via MM

The critical step in the application of MM is to find a convenient majorizer of the function f(x) ≜ (1/2)∥y − Ax∥_2² + λ∥x∥_1.

Consider the following majorizer of f(x):

u(x, x^k) = f(x) + dist(x, x^k)

where dist(x, x^k) = (c/2)∥x − x^k∥_2² − (1/2)∥Ax − Ax^k∥_2² and c > λ_max(A^T A).

Note that u(x, x^k) is a valid majorizer because it is continuous, it is an upper bound u(x, x^k) ≥ f(x) with u(x^k, x^k) = f(x^k), and ∇u(x^k, x^k) = ∇f(x^k).

The majorizer can be rewritten in a more convenient way as

u(x, x^k) = (c/2)∥x − x̄^k∥_2² + λ∥x∥_1 + const.

where x̄^k ≜ x^k + (1/c) A^T(y − Ax^k).

D. Palomar (HKUST) Algorithms: SCA 23 / 48

Page 24

Recall: LASSO (ℓ2 − ℓ1 optimization) via MM

Now that we have the majorizer, we can formulate the problem to be solved at each iteration k = 0, 1, . . .

minimize_x    (c/2)∥x − x̄^k∥_2² + λ∥x∥_1

This problem looks like the original one but without the matrix A mixing all the components.

As a consequence, this problem decouples into an optimization for each element, whose solution we already know to be given by the soft-thresholding operator, leading to the iterates for k = 0, 1, . . .

x^{k+1} = soft_{λ/c}(x̄^k),

where the soft-thresholding operator is applied elementwise.

So what's the difference between the algorithms obtained via BCD and MM?

- The BCD algorithm updates each element in a successive or cyclical way;
- the MM algorithm updates all elements simultaneously.

D. Palomar (HKUST) Algorithms: SCA 24 / 48
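The resulting MM iteration written out as code (a sketch: c is set slightly above λ_max(AᵀA), and the data and λ are arbitrary); this is the familiar ISTA-type update, with all elements thresholded simultaneously.

```python
import numpy as np

def soft(u, lam):
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def lasso_mm(A, y, lam, n_iter=500):
    """MM iteration: x^{k+1} = soft_{lam/c}( x^k + (1/c) A^T (y - A x^k) )."""
    c = 1.01 * np.linalg.norm(A, 2) ** 2      # c > lambda_max(A^T A)
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x_bar = x + (A.T @ (y - A @ x)) / c   # the point xbar^k around which we majorize
        x = soft(x_bar, lam / c)              # elementwise soft-thresholding
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
y = 2.0 * A[:, 0] - 1.5 * A[:, 1] + 0.01 * rng.standard_normal(50)
print(np.round(lasso_mm(A, y, lam=0.5), 2))
```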

Page 25

LASSO (ℓ2 − ℓ1 optimization) via SCA

We will now use SCA to solve the problem

minimize_x    F(x) ≜ (1/2)∥y − Ax∥_2² + λ∥x∥_1.

We define the surrogate function for block (element) i as

F_i(x_i | x^k) = (1/2)∥y_i^k − a_i x_i∥_2² + λ|x_i| + (τ_i/2)(x_i − x_i^k)²

where y_i^k ≜ y − ∑_{j≠i} a_j x_j^k.

Thus, the resulting algorithm is to solve for all i = 1, . . . , N in parallel:

minimize_{x_i}    (1/2)∥y_i^k − a_i x_i∥_2² + λ|x_i| + (τ_i/2)(x_i − x_i^k)²,

with solution

x̂_i(x^k) = soft_λ(a_i^T y_i^k + τ_i x_i^k) / (τ_i + ∥a_i∥²),

and then smooth it out as x^{k+1} = x^k + γ^k (x̂(x^k) − x^k).

D. Palomar (HKUST) Algorithms: SCA 25 / 48
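A parallel-SCA sketch of this LASSO update: all best responses are computed simultaneously and combined through the smoothing step with a diminishing stepsize (τ_i, γ^0, ϵ and the data are illustrative choices).

```python
import numpy as np

def soft(u, lam):
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def lasso_parallel_sca(A, y, lam, tau=1e-3, gamma0=0.9, eps=0.5, n_iter=300):
    """Parallel SCA for 0.5*||y - A x||^2 + lam*||x||_1 with smoothing."""
    x = np.zeros(A.shape[1])
    col_norm2 = np.sum(A**2, axis=0)              # ||a_i||^2
    gamma = gamma0
    for _ in range(n_iter):
        r = y - A @ x
        aTy = A.T @ r + col_norm2 * x             # a_i^T y_i^k, since y_i^k = r + a_i x_i^k
        x_hat = soft(aTy + tau * x, lam) / (tau + col_norm2)   # all best responses in parallel
        x = x + gamma * (x_hat - x)               # smoothing step
        gamma *= (1.0 - eps * gamma)              # diminishing stepsize
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
y = 2.0 * A[:, 0] - 1.5 * A[:, 1] + 0.01 * rng.standard_normal(50)
print(np.round(lasso_parallel_sca(A, y, lam=0.5), 2))
```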

Page 26

LASSO (ℓ2 − ℓ1 optimization) via SCA

Recall the iterate from the BCD algorithm:

x_i^{k+1} = soft_λ(a_i^T y_i^k) / ∥a_i∥²,   i = 1, . . . , N

where y_i^k ≜ y − ∑_{j<i} a_j x_j^{k+1} − ∑_{j>i} a_j x_j^k.

Now compare with the iterate obtained with SCA:

x̂_i(x^k) = soft_λ(a_i^T y_i^k + τ_i x_i^k) / (τ_i + ∥a_i∥²),

where y_i^k ≜ y − ∑_{j≠i} a_j x_j^k, and then x^{k+1} = x^k + γ^k (x̂(x^k) − x^k).

There are several differences:

1. The SCA iterate has additional terms with τ_i; if we set τ_i = 0 both iterates look similar.
2. The updates in BCD are sequential (cyclical) whereas in SCA they are simultaneous (in parallel).
3. The term y_i^k is slightly different in the two approaches due to the sequential vs parallel updates.

D. Palomar (HKUST) Algorithms: SCA 26 / 48

Page 27

LASSO (ℓ2 − ℓ1 optimization) via SCA

Consider the BCD method but with parallel updates (it is then called the nonlinear Jacobi method):

x_i^{k+1} = soft_λ(a_i^T y_i^k) / ∥a_i∥²,   i = 1, . . . , N

where y_i^k ≜ y − ∑_{j≠i} a_j x_j^k.

Unfortunately, this is not guaranteed to converge.

Instead, consider the parallel SCA updates (setting τ_i = 0, which can be done since the surrogate is already strongly convex):

x̂_i(x^k) = soft_λ(a_i^T y_i^k) / ∥a_i∥²,   i = 1, . . . , N

and then smooth it: x^{k+1} = x^k + γ^k (x̂(x^k) − x^k).

The parallel SCA is guaranteed to converge (assuming γ^k is properly chosen) thanks to the smoothing operation.

D. Palomar (HKUST) Algorithms: SCA 27 / 48

Page 28

Dictionary learning

Consider the dictionary learning problem

minimize_{D,X}    (1/2)∥Y − DX∥_F² + λ∥X∥_1
subject to        ∥[D]_{:,i}∥ ≤ 1,   i = 1, . . . , m,

where ∥X∥_F and ∥X∥_1 denote the Frobenius norm and the ℓ1 matrix norm of X, respectively.

Matrix D is the so-called dictionary; it is a fat matrix that contains the possible columns.

Matrix X selects a few columns of the dictionary, which is why it has to be sparse.

This problem is not jointly convex in (D, X), but it is bi-convex: for fixed D it is convex in X and, for fixed X, it is convex in D.

D. Palomar (HKUST) Algorithms: SCA 28 / 48

Page 29

Dictionary learning

We define the following two surrogate functions:

F_1(D | X^k) = (1/2)∥Y − D X^k∥_F²
F_2(X | D^k) = (1/2)∥Y − D^k X∥_F²

This leads to the following two convex problems:

a normalized LS problem:

minimize_D    (1/2)∥Y − D X^k∥_F²
subject to    ∥[D]_{:,i}∥ ≤ 1,   i = 1, . . . , m,

a matrix LASSO problem:

minimize_X    (1/2)∥Y − D^k X∥_F² + λ∥X∥_1,

which can be decomposed as a set of regular LASSO problems, one for each column of X.

D. Palomar (HKUST) Algorithms: SCA 29 / 48
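A rough sketch of how the two surrogate problems could be alternated in code. This is one possible instantiation, not the slides' prescription: the matrix-LASSO step is solved with a few ISTA iterations, the norm-constrained LS step with a few projected-gradient iterations, and the two blocks are cycled Gauss–Seidel style (a Jacobi-type parallel update with smoothing would be the other natural SCA choice); all constants and the data are arbitrary.

```python
import numpy as np

def soft(u, lam):
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def update_X(Y, D, lam, n_iter=100):
    """Matrix LASSO: min_X 0.5*||Y - D X||_F^2 + lam*||X||_1 via ISTA (columns decouple)."""
    c = 1.01 * np.linalg.norm(D, 2) ** 2
    X = np.zeros((D.shape[1], Y.shape[1]))
    for _ in range(n_iter):
        X = soft(X + D.T @ (Y - D @ X) / c, lam / c)
    return X

def update_D(Y, X, D0, n_iter=100):
    """Constrained LS: min_D 0.5*||Y - D X||_F^2 s.t. ||[D]_{:,i}|| <= 1, via projected gradient."""
    L = 1.01 * np.linalg.norm(X @ X.T, 2) + 1e-12             # Lipschitz constant of the gradient in D
    D = D0.copy()
    for _ in range(n_iter):
        D = D - (D @ X - Y) @ X.T / L                         # gradient step
        D = D / np.maximum(np.linalg.norm(D, axis=0), 1.0)    # project columns onto the unit ball
    return D

def dictionary_learning(Y, m, lam=0.1, n_outer=20, seed=0):
    """Alternate the two convex subproblems starting from a random feasible dictionary."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((Y.shape[0], m))
    D = D / np.maximum(np.linalg.norm(D, axis=0), 1.0)
    X = np.zeros((m, Y.shape[1]))
    for _ in range(n_outer):
        X = update_X(Y, D, lam)
        D = update_D(Y, X, D)
    return D, X

rng = np.random.default_rng(1)
Y = rng.standard_normal((10, 8)) @ rng.standard_normal((8, 100))   # synthetic data
D, X = dictionary_learning(Y, m=20, lam=0.1)
print(np.linalg.norm(Y - D @ X, "fro"))
```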

Page 30

5 Surrogate Functions∗

Page 31

Surrogates based on block-wise convexity

Suppose F(x_1, . . . , x_n) is convex in each block x_i separately (but not necessarily jointly). A natural approximation to exploit this partial convexity is

F(x | x^k) = ∑_{i=1}^n F_i(x_i | x^k)

with

F_i(x_i | x^k) = F(x_i, x_{−i}^k) + (τ_i/2)(x_i − x_i^k)^T H_i (x_i − x_i^k),

where τ_i is any positive constant and H_i is any positive definite matrix (e.g., H_i = I). The quadratic term involving H_i can be removed if F(x_i, x_{−i}^k) is strongly convex on X_i.

D. Palomar (HKUST) Algorithms: SCA 31 / 48

Page 32

Proximal gradient-like surrogates

If no convexity is present in F(x), one alternative is to mimic proximal-gradient methods:

F(x | x^k) = ∑_{i=1}^n F_i(x_i | x^k)

with

F_i(x_i | x^k) = ∇_{x_i}F(x^k)^T (x_i − x_i^k) + (τ_i/2)∥x_i − x_i^k∥².

Observe that if τ_i is large enough, then the surrogate function F_i(x_i | x^k) becomes a majorizer. But here τ_i can be any positive number, so F_i(x_i | x^k) may not be a majorizer. The reason convergence can still be guaranteed is the smoothing step with γ^k < 1.

D. Palomar (HKUST) Algorithms: SCA 32 / 48
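For this surrogate the best response has a closed form: minimizing the linear term plus the proximal term over X_i gives the projection onto X_i of the gradient step x_i^k − ∇_{x_i}F(x^k)/τ_i. A minimal sketch follows, where the nonconvex test function, the box feasible set and all constants are arbitrary illustrations; note that τ here is smaller than the gradient's Lipschitz constant, so the surrogate is not a majorizer — exactly the situation described above.

```python
import numpy as np

def sca_prox_grad(grad_F, project, x0, tau=5.0, gamma0=0.9, eps=0.5, n_iter=200):
    """Parallel SCA with the linearized-plus-proximal surrogate:
    best response x_hat = Proj_X( x^k - grad F(x^k) / tau ), then the smoothing step."""
    x = np.asarray(x0, dtype=float)
    gamma = gamma0
    for _ in range(n_iter):
        x_hat = project(x - grad_F(x) / tau)   # all blocks at once (here blocks = coordinates)
        x = x + gamma * (x_hat - x)            # smoothing
        gamma *= (1.0 - eps * gamma)           # diminishing stepsize
    return x

# F(x) = sum_i (x_i^2 - cos(3 x_i)) is smooth but nonconvex; feasible set X = [-1, 1]^n.
grad_F = lambda x: 2.0 * x + 3.0 * np.sin(3.0 * x)
project = lambda z: np.clip(z, -1.0, 1.0)
print(np.round(sca_prox_grad(grad_F, project, x0=np.full(5, 0.8)), 4))
```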

Page 33

Surrogates based on sum-utility function

Suppose F(x) = ∑_{i=1}^I f_i(x_1, . . . , x_n).

Such structure arises, e.g., in multi-agent systems wherein f_i is the cost function of agent i, who controls its own block variables x_i ∈ X_i.

It is common that the cost functions f_i are convex in some agents' variables. Define the set of indices of all the functions convex in x_i as

C_i ≜ {j : f_j(x_i, x_{−i}) is convex in x_i}.

Then we can construct the following surrogate function:

F(x | x^k) = ∑_{i=1}^n F_i^C(x_i | x^k)

where

F_i^C(x_i | x^k) = ∑_{j∈C_i} f_j(x_i, x_{−i}^k) + ∑_{l∉C_i} ∇_{x_i} f_l(x^k)^T (x_i − x_i^k) + (τ_i/2)(x_i − x_i^k)^T H_i (x_i − x_i^k).

In words: each agent i keeps the convex part of F while it linearizes the nonconvex part.

D. Palomar (HKUST) Algorithms: SCA 33 / 48

Page 34

Surrogates based on product of functions

Suppose that F(x) is expressed as a product of functions. For simplicity, consider the case of a product of two functions:

F(x) = f_1(x) f_2(x)

with f_1 and f_2 convex and non-negative on X.

Inspired by the form of the gradient of F, ∇F = f_2∇f_1 + f_1∇f_2, it seems natural to consider the following surrogate function:

F(x | x^k) = f_1(x) f_2(x^k) + f_1(x^k) f_2(x) + (τ_i/2)(x_i − x_i^k)^T H_i (x_i − x_i^k).

If f_1 and f_2 are not necessarily convex, we can instead use

F(x | x^k) = f̃_1(x | x^k) f_2(x^k) + f_1(x^k) f̃_2(x | x^k),

where f̃_1 and f̃_2 are legitimate surrogates of f_1 and f_2, respectively.

D. Palomar (HKUST) Algorithms: SCA 34 / 48

Page 35

Surrogates based on composition of functions

Let

F(x) = h(f(x)),

where h is a convex smooth function nondecreasing in each component, and f is a smooth mapping with each component not necessarily convex.

Example: F(x) = ∥f(x)∥².

Observe that the gradient is

∇F(x) = ∇f(x) ∇h(f(x)),

where ∇f(x) = [∇f_1(x) · · · ∇f_q(x)] is the Jacobian of f.

A surrogate function is

F(x | x^k) = h(f(x^k) + ∇f(x^k)^T (x − x^k)) + (τ/2)∥x − x^k∥².

D. Palomar (HKUST) Algorithms: SCA 35 / 48
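For the example F(x) = ∥f(x)∥², this surrogate is the squared norm of the linearized residual plus a proximal term, and its minimizer is a damped Gauss–Newton (Levenberg–Marquardt-like) step: solve (JᵀJ + (τ/2)I) d = −Jᵀ f(x^k) and set x̂ = x^k + d. A toy nonlinear least-squares sketch (the residual model, τ, γ and the data are all arbitrary):

```python
import numpy as np

def sca_composition(f, jac, x0, tau=1.0, gamma=0.8, n_iter=100):
    """SCA for F(x) = ||f(x)||^2 with the surrogate
    ||f(x^k) + J(x^k)(x - x^k)||^2 + (tau/2)*||x - x^k||^2."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        r, J = f(x), jac(x)
        d = np.linalg.solve(J.T @ J + 0.5 * tau * np.eye(x.size), -J.T @ r)  # best-response step
        x = x + gamma * d                                                    # smoothing step
    return x

# Toy problem: fit y(t) = a * exp(b * t) to data generated with a = 2, b = -1.5.
t = np.linspace(0.0, 1.0, 30)
y = 2.0 * np.exp(-1.5 * t)
f = lambda p: p[0] * np.exp(p[1] * t) - y                                   # residual vector
jac = lambda p: np.column_stack([np.exp(p[1] * t), p[0] * t * np.exp(p[1] * t)])
print(np.round(sca_composition(f, jac, x0=np.array([1.0, 0.0])), 4))
```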

Page 36

6 Extensions of Parallel SCA∗

Page 37

Extension: Nondifferentiable objective function

Consider now the following problem:

minimize_{x_1,...,x_n}    F(x_1, . . . , x_n) + G(x_1, . . . , x_n)
subject to                x_i ∈ X_i,   i = 1, . . . , n

where each X_i is a convex set, F(x) is continuous, and G(x) is convex but possibly nonsmooth.

Parallel SCA minimizes the following strongly convex surrogate functions in parallel for all i = 1, . . . , n (Scutari and Sun 2018) [8]:

x̂_i(x^k) = arg min_{x_i∈X_i}  F_i(x_i | x^k) + G(x_i, x_{−i}^k)

and then sets

x^{k+1} = x^k + γ^k (x̂(x^k) − x^k).

[8] G. Scutari and Y. Sun, “Parallel and distributed successive convex approximation methods for big-data optimization,” in C.I.M.E Lecture Notes in Mathematics, Springer Verlag Series, 2018, pp. 141–308.

D. Palomar (HKUST) Algorithms: SCA 37 / 48

Page 38

Extension: Inexact updates

The parallel SCA at each iteration k = 0, 1, . . . solves:

x̂_i(x^k) = arg min_{x_i∈X_i}  F_i(x_i | x^k),   ∀i = 1, . . . , n.

In practice, however, it may be too costly to obtain such minimizers.

The inexact parallel SCA instead considers inexact solutions z_i^k ≈ x̂_i(x^k), for all i = 1, . . . , n, that satisfy:

1. ∥z_i^k − x̂_i(x^k)∥ ≤ ε_i^k with lim_{k→∞} ε_i^k = 0;
2. F_i(z_i^k | x^k) ≤ F_i(x_i^k | x^k).

In words, the inexact solutions are obtained with a desired accuracy of ε_i^k, which must asymptotically vanish (increasing accuracy) in order to guarantee convergence (Scutari and Sun 2018) [9].

[9] G. Scutari and Y. Sun, “Parallel and distributed successive convex approximation methods for big-data optimization,” in C.I.M.E Lecture Notes in Mathematics, Springer Verlag Series, 2018, pp. 141–308.

D. Palomar (HKUST) Algorithms: SCA 38 / 48

Page 39

Extension: Selective updates

For problems with a large number of blocks n, it may be too costly to update all of them in parallel:

x̂_i(x^k) = arg min_{x_i∈X_i}  F_i(x_i | x^k),   ∀i = 1, . . . , n

and

x_i^{k+1} = x_i^k + γ^k (x̂_i(x^k) − x_i^k),   ∀i = 1, . . . , n.

The SCA with selective updates defines at each iteration k the set of blocks to be updated, S^k, so that the iterate is

x_i^{k+1} = x_i^k + γ^k (x̂_i(x^k) − x_i^k)   if i ∈ S^k,
x_i^{k+1} = x_i^k                            otherwise.

D. Palomar (HKUST) Algorithms: SCA 39 / 48

Page 40

Extension: Selective updates

Several options are possible for the block selection rule S^k:

- according to some deterministic (cyclic) rule;
- according to some random-based rule;
- greedy-like schemes: updating only the blocks that are "far away" from the optimum.

Convergence can still be guaranteed for the following update rules (Scutari and Sun 2018) [10]:

1. Essentially cyclic rule: S^k is selected so that ∪_{s=0}^{T−1} S^{k+s} = {1, . . . , n} for some finite T.
2. Greedy rule: Each S^k contains at least one index i such that E_i(x^k) ≥ ρ max_j E_j(x^k), where ρ ∈ (0, 1] and E_i(x^k) is an approximation E_i(x^k) ≈ ∥x̂_i(x^k) − x_i^k∥.
3. Random-based rule: The sets S^k are realizations of independent random sets such that Pr(i ∈ S^k) ≥ p for some p > 0.

[10] G. Scutari and Y. Sun, “Parallel and distributed successive convex approximation methods for big-data optimization,” in C.I.M.E Lecture Notes in Mathematics, Springer Verlag Series, 2018, pp. 141–308.

D. Palomar (HKUST) Algorithms: SCA 40 / 48
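A sketch of the random-based selection rule plugged into the parallel-SCA LASSO update from earlier (for clarity the best responses are computed for every i even though only the selected ones are used; p, τ, the stepsize schedule and the data are illustrative).

```python
import numpy as np

def soft(u, lam):
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def lasso_sca_selective(A, y, lam, p=0.3, tau=1e-3, gamma0=0.9, eps=0.5, n_iter=500, seed=0):
    """Parallel SCA for LASSO updating only a random subset S^k of blocks per iteration."""
    rng = np.random.default_rng(seed)
    N = A.shape[1]
    x = np.zeros(N)
    col_norm2 = np.sum(A**2, axis=0)
    gamma = gamma0
    for _ in range(n_iter):
        S = rng.random(N) < p                       # random rule: Pr(i in S^k) = p > 0
        r = y - A @ x
        aTy = A.T @ r + col_norm2 * x
        x_hat = soft(aTy + tau * x, lam) / (tau + col_norm2)
        x[S] = x[S] + gamma * (x_hat[S] - x[S])     # update only the selected blocks
        gamma *= (1.0 - eps * gamma)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
y = 2.0 * A[:, 0] - 1.5 * A[:, 1] + 0.01 * rng.standard_normal(50)
print(np.round(lasso_sca_selective(A, y, lam=0.5), 2))
```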

Page 41

7 Connection to MM

Page 42

Majorization-Minimization (MM)

Consider the following problem:

minimize_x    f(x)
subject to    x ∈ X,

where X is the feasible set and f(x) a continuous function.

The idea of MM is to iteratively approximate the problem by a simpler one (like in SCA).

MM approximates f(x) by an upper-bound function (majorizer) u(x | x^k) satisfying some technical conditions like the first-order condition ∇u(x^k | x^k) = ∇f(x^k).

At iteration k = 0, 1, . . . the surrogate problem is (Scutari et al. 2014) [11]

minimize_x    u(x | x^k)
subject to    x ∈ X.

[11] G. Scutari, F. Facchinei, P. Song, D. P. Palomar, and J.-S. Pang, “Decomposition by partial linearization: Parallel optimization of multi-agent systems,” IEEE Trans. Signal Processing, vol. 62, no. 3, pp. 641–656, 2014.

D. Palomar (HKUST) Algorithms: SCA 42 / 48

Page 43

MM vs SCA

Surrogate function:

- MM requires the surrogate function to be a global upper bound (which can be too demanding in some cases), albeit not necessarily convex.
- SCA relaxes the upper-bound condition, but it requires the surrogate to be strongly convex.

D. Palomar (HKUST) Algorithms: SCA 43 / 48

Page 44

MM vs SCA

Constraint set:

- In principle, both SCA and MM require the feasible set X to be convex.
- MM can be easily extended to nonconvex X on a case-by-case basis; for example: (Song et al. 2015) [12], (Kumar et al. 2019) [13], (Kumar et al. 2020) [14].
- SCA can be extended to convexify the constraint functions, but it cannot deal with a nonconvex X directly, which limits its applicability in many real-world applications.

[12] J. Song, P. Babu, and D. P. Palomar, “Sparse generalized eigenvalue problem via smooth optimization,” IEEE Trans. Signal Processing, vol. 63, no. 7, pp. 1627–1642, 2015.

[13] S. Kumar, J. Ying, J. V. de M. Cardoso, and D. P. Palomar, “Structured graph learning via Laplacian spectral constraints,” in Proc. Advances in Neural Information Processing Systems (NeurIPS), Vancouver, Canada, 2019.

[14] S. Kumar, J. Ying, J. V. de M. Cardoso, and D. P. Palomar, “A unified framework for structured graph learning via spectral constraints,” Journal of Machine Learning Research (JMLR), pp. 1–60, 2020.

D. Palomar (HKUST) Algorithms: SCA 44 / 48

Page 45

MM vs SCA

Schedule of updates:

- MM updates the whole variable x at each iteration (so in principle no distributed implementation).
- If the majorizer in MM happens to be block separable in x = (x_1, . . . , x_N), then one can have a parallel update.
- Block MM updates each block of x = (x_1, . . . , x_N) sequentially.
- SCA, on the other hand, naturally has a parallel update (assuming the constraints are separable), which can be useful for distributed implementation.

D. Palomar (HKUST) Algorithms: SCA 45 / 48

Page 46

Thanks

For more information visit:

https://www.danielppalomar.com

Page 47

References I

Kumar, S., Ying, J., Cardoso, J. V. de M., & Palomar, D. P. (2019). Structured graph learning via Laplacian spectral constraints. In Proc. Advances in Neural Information Processing Systems (NeurIPS). Vancouver, Canada.

Kumar, S., Ying, J., Cardoso, J. V. de M., & Palomar, D. P. (2020). A unified framework for structured graph learning via spectral constraints. Journal of Machine Learning Research (JMLR), 1–60.

Scutari, G., Facchinei, F., Song, P., Palomar, D. P., & Pang, J.-S. (2014). Decomposition by partial linearization: Parallel optimization of multi-agent systems. IEEE Trans. Signal Processing, 62(3), 641–656.

Scutari, G., & Sun, Y. (2018). Parallel and distributed successive convex approximation methods for big-data optimization. In C.I.M.E Lecture Notes in Mathematics (pp. 141–308). Springer Verlag Series.

D. Palomar (HKUST) Algorithms: SCA 47 / 48

Page 48

References II

Song, J., Babu, P., & Palomar, D. P. (2015). Sparse generalized eigenvalue problem via smooth optimization. IEEE Trans. Signal Processing, 63(7), 1627–1642.

D. Palomar (HKUST) Algorithms: SCA 48 / 48