TRANSCRIPT
Algorithms: Successive Convex Approximation (SCA)
Prof. Daniel P. Palomar
ELEC5470/IEDA6100A - Convex Optimization
The Hong Kong University of Science and Technology (HKUST)
Fall 2020-21
Outline
1 Successive Convex Approximation (SCA) in a Nutshell
2 Classical Algorithms as SCA Instances
3 Parallel SCA
4 Applications
5 Surrogate Functions∗
6 Extensions of Parallel SCA∗
7 Connection to MM
Successive Convex Approximation (SCA)

Consider the following presumably difficult optimization problem:

    minimize_x   F(x)
    subject to   x ∈ X,

where the feasible set X is convex and F(x) is continuous.

Basic idea of SCA: solve a difficult problem via a sequence of simpler problems:

    x^{k+1} = arg min_{x∈X} F(x | x^k),

where F(x | x^k) is a surrogate or approximation of the original function F(x).

How to construct the surrogate functions F(x | x^k) at each iteration k = 0, 1, . . . so that the iterates {x^k} converge to an optimal x⋆? Answer: the basic idea of SCA in (Scutari et al. 2014)¹.
1 G. Scutari, F. Facchinei, P. Song, D. P. Palomar, and J.-S. Pang, “Decomposition by partial linearization: Parallel optimization of multi-agent systems,” IEEE Trans. Signal Processing, vol. 62, no. 3, pp. 641–656, 2014.
Iterative algorithm
[Three figure-only slides illustrating the iterative algorithm.]
Surrogate or approximating function
The surrogate function F(x | x^k) (apart from other technical conditions) has to satisfy:
1. F(x | x^k) is strongly convex on X;
2. F(x | x^k) is differentiable with gradient consistency at x^k: ∇F(x^k | x^k) = ∇F(x^k).

Original method proposed in (Scutari et al. 2014)².
Nice overview of SCA in (Scutari and Sun 2018)³.
2 G. Scutari, F. Facchinei, P. Song, D. P. Palomar, and J.-S. Pang, “Decomposition by partial linearization: Parallel optimization of multi-agent systems,” IEEE Trans. Signal Processing, vol. 62, no. 3, pp. 641–656, 2014.
3 G. Scutari and Y. Sun, “Parallel and distributed successive convex approximation methods for big-data optimization,” in C.I.M.E. Lecture Notes in Mathematics, Springer Verlag Series, 2018, pp. 141–308.
Algorithm
Algorithm: SCA
Set k = 0, initialize with a feasible point x^0 ∈ X and a stepsize sequence {γ^k} ⊂ (0, 1].
repeat
    Solve the surrogate problem:
        x̂(x^k) = arg min_{x∈X} F(x | x^k)
    Smooth it out to get the next iterate:
        x^{k+1} = x^k + γ^k (x̂(x^k) − x^k)
    k ← k + 1
until convergence
return x^k
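A minimal sketch of this loop (the function name solve_surrogate, the constant stepsize, and the stopping tolerance are our illustrative choices; the surrogate solver itself is problem-specific):

```python
import numpy as np

def sca(x0, solve_surrogate, gamma=0.5, tol=1e-6, max_iter=1000):
    """Generic SCA loop: x^{k+1} = x^k + gamma * (x_hat(x^k) - x^k).

    solve_surrogate(xk) must return the minimizer over X of the strongly
    convex surrogate F(. | xk); gamma is kept constant here, but any
    stepsize rule with gamma^k in (0, 1] can be plugged in.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_hat = solve_surrogate(x)            # solve the simpler convex problem
        if np.linalg.norm(x_hat - x) < tol:   # ||x_hat(x^k) - x^k|| -> 0 at a stationary point
            break
        x = x + gamma * (x_hat - x)           # smoothing step (stays feasible since X is convex)
    return x
```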
References
Original paper with convergence:
G. Scutari, F. Facchinei, P. Song, D. P. Palomar, and J.-S. Pang (2014). “Decomposition by Partial Linearization: Parallel Optimization of Multi-Agent Systems.” IEEE Trans. Signal Processing, 62(3), 641–656.

Overview book chapter:
G. Scutari and Y. Sun (2018). Parallel and Distributed Successive Convex Approximation Methods for Big-Data Optimization. C.I.M.E. Lecture Notes in Mathematics, Springer Verlag Series, 141–308.
Outline
1 Successive Convex Approximation (SCA) in a Nutshell
2 Classical Algorithms as SCA Instances
3 Parallel SCA
4 Applications
5 Surrogate Functions∗
6 Extensions of Parallel SCA∗
7 Connection to MM
Gradient descent method as SCA
Consider the unconstrained problem

    minimize_x   F(x)

Let's use the following surrogate function for F(x):

    F(x | x^k) = F(x^k) + ∇F(x^k)^T (x − x^k) + (1/(2α^k)) ∥x − x^k∥².

To solve the surrogate problem, set the gradient of F(x | x^k) to zero:

    ∇F(x^k) + (1/α^k)(x − x^k) = 0.

The iterate is finally obtained as

    x^{k+1} = x^k − α^k ∇F(x^k).
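A tiny numerical sketch on a toy quadratic (the test function and the step size α are illustrative): the closed-form minimizer of this surrogate is exactly the gradient step.

```python
import numpy as np

# Toy problem: F(x) = 0.5 * x^T A x - b^T x, with A symmetric positive definite.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad_F = lambda x: A @ x - b

alpha = 0.1              # fixed step size (illustrative)
x = np.zeros(2)
for _ in range(300):
    # Minimizing F(x^k) + grad^T (x - x^k) + (1/(2*alpha)) ||x - x^k||^2
    # gives, in closed form, the gradient-descent step:
    x = x - alpha * grad_F(x)

print(x, np.linalg.solve(A, b))  # both approach the exact minimizer A^{-1} b
```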
Newton’s method as SCA
Consider again the unconstrained problem

    minimize_x   F(x)

Let's now use the following better surrogate function for F(x):

    F(x | x^k) = F(x^k) + ∇F(x^k)^T (x − x^k) + (1/(2α^k)) (x − x^k)^T ∇²F(x^k) (x − x^k).

To solve the surrogate problem, set the gradient of F(x | x^k) to zero:

    ∇F(x^k) + (1/α^k) ∇²F(x^k)(x − x^k) = 0.

The iterate is finally obtained as

    x^{k+1} = x^k − α^k ∇²F(x^k)^{−1} ∇F(x^k).
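The corresponding sketch for the Newton-type surrogate on the same toy quadratic (again illustrative); solving this surrogate amounts to a linear system with the Hessian.

```python
import numpy as np

# Same toy problem: F(x) = 0.5 * x^T A x - b^T x, so the Hessian is A everywhere.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad_F = lambda x: A @ x - b
hess_F = lambda x: A

alpha = 1.0              # full Newton step (illustrative)
x = np.zeros(2)
for _ in range(5):
    # Setting the surrogate's gradient to zero: grad + (1/alpha) H (x - x^k) = 0,
    # i.e., x^{k+1} = x^k - alpha * H^{-1} grad.
    x = x - alpha * np.linalg.solve(hess_F(x), grad_F(x))

print(x, np.linalg.solve(A, b))  # a single Newton step already solves this quadratic exactly
```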
Outline
1 Successive Convex Approximation (SCA) in a Nutshell
2 Classical Algorithms as SCA Instances
3 Parallel SCA
4 Applications
5 Surrogate Functions∗
6 Extensions of Parallel SCA∗
7 Connection to MM
Parallel SCA
Suppose we can partition the optimization variable into n blocks:

    x = (x_1, . . . , x_n)

and that the feasible set X has a Cartesian product form:

    X = X_1 × · · · × X_n.

Then we can write our problem as

    minimize_{x_1,...,x_n}   F(x_1, . . . , x_n)
    subject to               x_i ∈ X_i,   i = 1, . . . , n,

where each X_i is a convex set and F(x) is continuous.
Parallel SCA
The most natural parallel (Jacobi-type) solution method one can employ is to solve the problem blockwise and in parallel, i.e., given x^k, all the block variables x_i^k are updated simultaneously:

    x_i^{k+1} = arg min_{x_i∈X_i} F(x_i, x_{−i}^k),   ∀i = 1, . . . , n.

Unfortunately, this method converges only under very restrictive conditions that are seldom verified in practice. Furthermore, the computation of x_i^{k+1} is in general difficult due to the nonconvexity of F.

To cope with those issues, the parallel SCA minimizes a strongly convex surrogate function instead:

    x̂_i(x^k) = arg min_{x_i∈X_i} F_i(x_i | x^k),   ∀i = 1, . . . , n,

and then sets

    x_i^{k+1} = x_i^k + γ^k (x̂_i(x^k) − x_i^k).
Surrogate functions
Each surrogate function F_i(x_i | x^k) (apart from other technical conditions) has to satisfy (Scutari et al. 2014)⁴, (Scutari and Sun 2018)⁵:
1. F_i(x_i | x^k) is strongly convex on X_i;
2. F_i(x_i | x^k) is differentiable with gradient consistency at x^k: ∇_{x_i} F_i(x_i^k | x^k) = ∇_{x_i} F(x^k).

Stepsize rules for {γ^k} ⊂ (0, 1]:
1. Bounded stepsize: γ^k sufficiently small;
2. Diminishing stepsize: ∑_{k=1}^∞ γ^k = +∞ and ∑_{k=1}^∞ (γ^k)² < +∞;
3. Backtracking line search.
4 G. Scutari, F. Facchinei, P. Song, D. P. Palomar, and J.-S. Pang, “Decomposition by partial linearization: Parallel optimization of multi-agent systems,” IEEE Trans. Signal Processing, vol. 62, no. 3, pp. 641–656, 2014.
5 G. Scutari and Y. Sun, “Parallel and distributed successive convex approximation methods for big-data optimization,” in C.I.M.E. Lecture Notes in Mathematics, Springer Verlag Series, 2018, pp. 141–308.
Algorithm
Algorithm: Parallel SCA
Set k = 0, initialize with a feasible point x^0 ∈ X and a stepsize sequence {γ^k} ⊂ (0, 1].
repeat
    For all i = 1, . . . , n, solve in parallel:
        x̂_i(x^k) = arg min_{x_i∈X_i} F_i(x_i | x^k)
    Smooth it out to get the next iterate:
        x^{k+1} = x^k + γ^k (x̂(x^k) − x^k)
    k ← k + 1
until convergence
return x^k
Algorithm: Stepsize rule

The bounded stepsize is difficult to use in practice since it has to be smaller than some bound which is difficult to know.
The backtracking line search is effective in terms of iterations, but performing the line search requires evaluating the objective function multiple times per iteration, resulting in more costly iterations.
The diminishing stepsize is a very good option in practice. Two very effective examples of diminishing stepsize rules are:

    γ^{k+1} = γ^k (1 − ϵ γ^k),   k = 0, 1, . . . ,   γ^0 < 1/ϵ,

where ϵ ∈ (0, 1), and

    γ^{k+1} = (γ^k + α^k) / (1 + β^k),   k = 0, 1, . . . ,   γ^0 = 1,

where α^k and β^k satisfy 0 ≤ α^k ≤ β^k and α^k/β^k → 0 as k → ∞ while ∑_k α^k/β^k = ∞. Examples of such α^k and β^k are: α^k = α or α^k = log(k)^α, and β^k = β·k or β^k = β·√k, where α and β are given constants satisfying α ∈ (0, 1), β ∈ (0, 1), and α ≤ β.
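A quick sketch generating both diminishing sequences (the constants ϵ, α, β and the choices α^k = α, β^k = β·k below are illustrative instances of the stated conditions):

```python
import numpy as np

def stepsizes_rule1(num_iter, eps=0.01, gamma0=0.99):
    """gamma^{k+1} = gamma^k * (1 - eps * gamma^k), with gamma^0 < 1/eps."""
    gammas = [gamma0]
    for _ in range(num_iter - 1):
        g = gammas[-1]
        gammas.append(g * (1.0 - eps * g))
    return np.array(gammas)

def stepsizes_rule2(num_iter, alpha=0.5, beta=0.5):
    """gamma^{k+1} = (gamma^k + alpha^k) / (1 + beta^k), gamma^0 = 1."""
    gammas = [1.0]
    for k in range(num_iter - 1):
        alpha_k = alpha           # constant alpha^k (one of the suggested choices)
        beta_k = beta * (k + 1)   # beta^k = beta * k, so alpha^k / beta^k -> 0
        gammas.append((gammas[-1] + alpha_k) / (1.0 + beta_k))
    return np.array(gammas)

print(stepsizes_rule1(5))  # slowly decaying sequences in (0, 1]
print(stepsizes_rule2(5))
```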
Convergence
The following theorem gives the convergence of the parallel SCA algorithm to a stationary point (Scutari et al. 2014)⁶, (Scutari and Sun 2018)⁷.

Theorem
Suppose each F_i satisfies the previous assumptions and {γ^k} is chosen according to the bounded stepsize, the diminishing rule, or the backtracking line search. Then,

    lim_{k→∞} ∥x̂(x^k) − x^k∥ = 0.

6 G. Scutari, F. Facchinei, P. Song, D. P. Palomar, and J.-S. Pang, “Decomposition by partial linearization: Parallel optimization of multi-agent systems,” IEEE Trans. Signal Processing, vol. 62, no. 3, pp. 641–656, 2014.
7 G. Scutari and Y. Sun, “Parallel and distributed successive convex approximation methods for big-data optimization,” in C.I.M.E. Lecture Notes in Mathematics, Springer Verlag Series, 2018, pp. 141–308.
Outline
1 Successive Convex Approximation (SCA) in a Nutshell
2 Classical Algorithms as SCA Instances
3 Parallel SCA
4 Applications
5 Surrogate Functions∗
6 Extensions of Parallel SCA∗
7 Connection to MM
Recall: LASSO (ℓ2 − ℓ1 optimization) via BCD
Consider the problem

    minimize_x   f(x) ≜ (1/2)∥y − Ax∥_2² + λ∥x∥_1

We can use BCD on each element of x = (x_1, . . . , x_N). The optimization w.r.t. each block x_i at iteration k = 0, 1, . . . is

    minimize_{x_i}   f_i(x_i) ≜ (1/2)∥y_i^k − a_i x_i∥_2² + λ|x_i|,

where y_i^k ≜ y − ∑_{j<i} a_j x_j^{k+1} − ∑_{j>i} a_j x_j^k. This leads to the iterates for k = 0, 1, . . .

    x_i^{k+1} = soft_λ(a_i^T y_i^k) / ∥a_i∥²,   i = 1, . . . , N,

where soft_λ(u) ≜ sign(u)[|u| − λ]_+ is the soft-thresholding operator ([·]_+ ≜ max{·, 0}).
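A minimal sketch of this cyclic BCD sweep (the random test data and variable names are our own):

```python
import numpy as np

def soft_threshold(u, lam):
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def lasso_bcd(y, A, lam, num_iter=100):
    """Cyclic BCD for (1/2)||y - A x||_2^2 + lam * ||x||_1."""
    N = A.shape[1]
    x = np.zeros(N)
    col_norms2 = np.sum(A**2, axis=0)
    for _ in range(num_iter):
        for i in range(N):                      # sequential (cyclic) sweep over the elements
            y_i = y - A @ x + A[:, i] * x[i]    # residual excluding column i (uses latest updates)
            x[i] = soft_threshold(A[:, i] @ y_i, lam) / col_norms2[i]
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
x_true = np.zeros(20); x_true[:3] = [1.0, -2.0, 0.5]
y = A @ x_true + 0.01 * rng.standard_normal(50)
print(lasso_bcd(y, A, lam=0.1)[:5])
```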
Recall: LASSO (ℓ2 − ℓ1 optimization) via MM
The critical step in the application of MM is to find a convenient majorizer of the function f(x) ≜ (1/2)∥y − Ax∥_2² + λ∥x∥_1.

Consider the following majorizer of f(x):

    u(x, x^k) = f(x) + dist(x, x^k),

where dist(x, x^k) = (c/2)∥x − x^k∥_2² − (1/2)∥Ax − Ax^k∥_2² and c > λ_max(A^T A).

Note that u(x, x^k) is a valid majorizer because it is continuous, it is an upper bound u(x, x^k) ≥ f(x) with u(x^k, x^k) = f(x^k), and ∇u(x^k, x^k) = ∇f(x^k).

The majorizer can be rewritten in a more convenient way as

    u(x, x^k) = (c/2)∥x − x̄^k∥_2² + λ∥x∥_1 + const.,

where x̄^k ≜ (1/c) A^T(y − Ax^k) + x^k.
Recall: LASSO (ℓ2 − ℓ1 optimization) via MM

Now that we have the majorizer, we can formulate the problem to be solved at each iteration k = 0, 1, . . .

    minimize_x   (c/2)∥x − x̄^k∥_2² + λ∥x∥_1

This problem looks like the original one but without the matrix A mixing all the components. As a consequence, it decouples into an optimization for each element, whose solution we already know to be given by the soft-thresholding operator, leading to the iterates for k = 0, 1, . . .

    x^{k+1} = soft_{λ/c}(x̄^k),

where the soft-thresholding operator is applied elementwise.

So what's the difference between the algorithms obtained via BCD and MM?
The BCD algorithm updates each element in a successive or cyclical way;
the MM algorithm updates all elements simultaneously.
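A small sketch of the resulting MM iteration (this is the classical ISTA-type update; the data and the choice of c are illustrative):

```python
import numpy as np

def soft_threshold(u, lam):
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def lasso_mm(y, A, lam, num_iter=500):
    """MM for (1/2)||y - A x||_2^2 + lam*||x||_1 with c > lambda_max(A^T A)."""
    c = np.linalg.norm(A, 2)**2 * 1.01       # c slightly above the largest eigenvalue of A^T A
    x = np.zeros(A.shape[1])
    for _ in range(num_iter):
        x_bar = x + (A.T @ (y - A @ x)) / c  # x_bar^k = x^k + (1/c) A^T (y - A x^k)
        x = soft_threshold(x_bar, lam / c)   # all elements updated simultaneously
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
x_true = np.zeros(20); x_true[:3] = [1.0, -2.0, 0.5]
y = A @ x_true + 0.01 * rng.standard_normal(50)
print(lasso_mm(y, A, lam=0.1)[:5])
```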
LASSO (ℓ2 − ℓ1 optimization) via SCA

We will now use SCA to solve the problem

    minimize_x   F(x) ≜ (1/2)∥y − Ax∥_2² + λ∥x∥_1.

We define the surrogate function for block (element) i as

    F_i(x_i | x^k) = (1/2)∥y_i^k − a_i x_i∥_2² + λ|x_i| + (τ_i/2)(x_i − x_i^k)²,

where y_i^k ≜ y − ∑_{j≠i} a_j x_j^k.

Thus, the resulting algorithm is to solve for all i = 1, . . . , N in parallel:

    minimize_{x_i}   (1/2)∥y_i^k − a_i x_i∥_2² + λ|x_i| + (τ_i/2)(x_i − x_i^k)²,

with solution

    x̂_i(x^k) = (1/(τ_i + ∥a_i∥²)) soft_λ(a_i^T y_i^k + τ_i x_i^k),

and then smooth it out as x^{k+1} = x^k + γ^k (x̂(x^k) − x^k).
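A sketch of this parallel SCA iteration (all elements updated at once, with τ_i = 0 since each scalar surrogate is already strongly convex; the stepsize rule, constants, and data are illustrative):

```python
import numpy as np

def soft_threshold(u, lam):
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def lasso_parallel_sca(y, A, lam, tau=0.0, gamma=0.9, eps=0.01, num_iter=300):
    """Parallel SCA for (1/2)||y - A x||_2^2 + lam*||x||_1 with a diminishing stepsize."""
    N = A.shape[1]
    x = np.zeros(N)
    col_norms2 = np.sum(A**2, axis=0)
    for _ in range(num_iter):
        # a_i^T y_i^k = a_i^T (y - A x^k) + ||a_i||^2 x_i^k, computed for all i at once
        corr = A.T @ (y - A @ x) + col_norms2 * x
        x_hat = soft_threshold(corr + tau * x, lam) / (tau + col_norms2)
        x = x + gamma * (x_hat - x)            # smoothing step
        gamma = gamma * (1.0 - eps * gamma)    # diminishing stepsize rule
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
x_true = np.zeros(20); x_true[:3] = [1.0, -2.0, 0.5]
y = A @ x_true + 0.01 * rng.standard_normal(50)
print(lasso_parallel_sca(y, A, lam=0.1)[:5])
```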
LASSO (ℓ2 − ℓ1 optimization) via SCA
Recall the iterate from the BCD algorithm:

    x_i^{k+1} = (1/∥a_i∥²) soft_λ(a_i^T y_i^k),   i = 1, . . . , N,

where y_i^k ≜ y − ∑_{j<i} a_j x_j^{k+1} − ∑_{j>i} a_j x_j^k.

Now compare with the iterate obtained with SCA:

    x̂_i(x^k) = (1/(τ_i + ∥a_i∥²)) soft_λ(a_i^T y_i^k + τ_i x_i^k),

with y_i^k ≜ y − ∑_{j≠i} a_j x_j^k, and then x^{k+1} = x^k + γ^k (x̂(x^k) − x^k).

There are several differences:
1. The SCA iterate has additional terms with τ_i; if we set τ_i = 0, both iterates look similar.
2. The updates in BCD are sequential (cyclical), whereas in SCA they are simultaneous (in parallel).
3. The term y_i^k is slightly different in the two approaches due to the sequential vs. parallel updates.
LASSO (ℓ2 − ℓ1 optimization) via SCA

Consider the BCD method but with parallel updates (it is then called the nonlinear Jacobi method):

    x_i^{k+1} = (1/∥a_i∥²) soft_λ(a_i^T y_i^k),   i = 1, . . . , N,

where y_i^k ≜ y − ∑_{j≠i} a_j x_j^k. Unfortunately, this is not guaranteed to converge.

Instead, consider the parallel SCA updates (setting τ_i = 0, which can be done since the surrogate is already strongly convex):

    x̂_i(x^k) = (1/∥a_i∥²) soft_λ(a_i^T y_i^k),   i = 1, . . . , N,

and then smooth: x^{k+1} = x^k + γ^k (x̂(x^k) − x^k).

The parallel SCA is guaranteed to converge (assuming γ^k is properly chosen) thanks to the smoothing operation.
Dictionary learning
Consider the dictionary learning problem

    minimize_{D,X}   (1/2)∥Y − DX∥_F² + λ∥X∥_1
    subject to       ∥[D]_{:,i}∥ ≤ 1,   i = 1, . . . , m,

where ∥X∥_F and ∥X∥_1 denote the Frobenius norm and the ℓ1 matrix norm of X, respectively.

Matrix D is the so-called dictionary; it is a fat matrix that contains a dictionary of the possible columns.
Matrix X selects a few columns of the dictionary, which is why it has to be sparse.
This problem is not jointly convex in (D, X), but it is bi-convex: for a fixed D, it is convex in X and, for a fixed X, it is convex in D.
Dictionary learning
We define the following two surrogate functions:

    F_1(D | X^k) = (1/2)∥Y − D X^k∥_F²
    F_2(X | D^k) = (1/2)∥Y − D^k X∥_F²

This leads to the following two convex problems:

a normalized LS problem:

    minimize_D   (1/2)∥Y − D X^k∥_F²
    subject to   ∥[D]_{:,i}∥ ≤ 1,   i = 1, . . . , m,

and a matrix LASSO problem:

    minimize_X   (1/2)∥Y − D^k X∥_F² + λ∥X∥_1,

which can be decomposed as a set of regular LASSO problems for each column of X.
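A rough sketch of the resulting block updates followed by the SCA smoothing step; the inner solvers (projected gradient for the D-block, ISTA for the X-block) and all constants are our illustrative choices, not prescribed by the slides:

```python
import numpy as np

def soft_threshold(U, lam):
    return np.sign(U) * np.maximum(np.abs(U) - lam, 0.0)

def update_D(Y, D, X, inner_iters=50):
    """Approximately solve min_D 0.5||Y - D X||_F^2 s.t. ||D[:, i]|| <= 1 (projected gradient)."""
    L = np.linalg.norm(X @ X.T, 2) + 1e-12                   # Lipschitz constant of the gradient
    for _ in range(inner_iters):
        D = D - ((D @ X - Y) @ X.T) / L                      # gradient step
        D = D / np.maximum(np.linalg.norm(D, axis=0), 1.0)   # project columns onto the unit ball
    return D

def update_X(Y, D, X, lam, inner_iters=50):
    """Approximately solve min_X 0.5||Y - D X||_F^2 + lam ||X||_1 (ISTA, column-wise LASSOs)."""
    L = np.linalg.norm(D.T @ D, 2) + 1e-12
    for _ in range(inner_iters):
        X = soft_threshold(X - (D.T @ (D @ X - Y)) / L, lam / L)
    return X

def dictionary_learning_sca(Y, m, lam=0.1, gamma=0.9, eps=0.01, num_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((Y.shape[0], m))
    D = D / np.linalg.norm(D, axis=0)          # feasible starting dictionary
    X = np.zeros((m, Y.shape[1]))
    for _ in range(num_iter):
        D_hat = update_D(Y, D, X)              # block D: constrained LS given X^k
        X_hat = update_X(Y, D, X, lam)         # block X: matrix LASSO given D^k (in parallel)
        D = D + gamma * (D_hat - D)            # smoothing step on both blocks
        X = X + gamma * (X_hat - X)
        gamma = gamma * (1.0 - eps * gamma)    # diminishing stepsize
    return D, X

Y = np.random.default_rng(1).standard_normal((30, 100))
D, X = dictionary_learning_sca(Y, m=50)
```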
Outline
1 Successive Convex Approximation (SCA) in a Nutshell
2 Classical Algorithms as SCA Instances
3 Parallel SCA
4 Applications
5 Surrogate Functions∗
6 Extensions of Parallel SCA∗
7 Connection to MM
Surrogates based on block-wise convexity
Suppose F(x_1, . . . , x_n) is convex in each block x_i separately (but not necessarily jointly). A natural approximation to exploit this partial convexity is

    F(x | x^k) = ∑_{i=1}^n F_i(x_i | x^k)

with

    F_i(x_i | x^k) = F(x_i, x_{−i}^k) + (τ_i/2)(x_i − x_i^k)^T H_i (x_i − x_i^k),

where τ_i is any positive constant and H_i is any positive definite matrix (e.g., H_i = I). The quadratic term involving H_i can be removed if F(x_i, x_{−i}^k) is strongly convex on X_i.
Proximal gradient-like surrogates
If no convexity is present in F(x), one alternative is to mimic proximal-gradient methods:

    F(x | x^k) = ∑_{i=1}^n F_i(x_i | x^k)

with

    F_i(x_i | x^k) = ∇_{x_i} F(x^k)^T (x_i − x_i^k) + (τ_i/2)∥x_i − x_i^k∥².

Observe that if τ_i is large enough, then the surrogate function F_i(x_i | x^k) becomes a majorizer. But in this case τ_i can be any positive number, so F_i(x_i | x^k) may not be a majorizer. The reason convergence can still be guaranteed is that γ^k < 1 and the smoothing step ensures convergence.
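A tiny sketch for unconstrained blocks, where this surrogate has the closed-form minimizer x_i^k − ∇_{x_i}F(x^k)/τ_i (the nonconvex test function and the constants are illustrative):

```python
import numpy as np

# Illustrative nonconvex function: F(x) = sum(x^2) + sin(sum(x)).
F = lambda x: np.sum(x**2) + np.sin(np.sum(x))
grad_F = lambda x: 2 * x + np.cos(np.sum(x))   # gradient (same formula for every scalar block)

tau, gamma, eps = 5.0, 0.9, 0.01
x = np.array([2.0, -1.0])
for _ in range(300):
    # For X_i = R, minimizing grad_i * (x_i - x_i^k) + (tau/2)(x_i - x_i^k)^2
    # gives the closed-form block update x_i^k - grad_i / tau.
    x_hat = x - grad_F(x) / tau
    x = x + gamma * (x_hat - x)        # smoothing step
    gamma *= (1.0 - eps * gamma)       # diminishing stepsize

print(x, F(x))  # settles at a stationary point of F
```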
Surrogates based on sum-utility function

Suppose F(x) = ∑_{i=1}^I f_i(x_1, . . . , x_n).

Such structure arises, e.g., in multi-agent systems wherein f_i is the cost function of agent i, which controls its own block variables x_i ∈ X_i.
It is common that the cost functions f_i are convex in some agents' variables. Define the set of indices of all the functions convex in x_i as

    C_i ≜ {j : f_j(x_i, x_{−i}) is convex in x_i}.

Then we can construct the following surrogate function:

    F(x | x^k) = ∑_{i=1}^n F_i^{C_i}(x_i | x^k),

where

    F_i^{C_i}(x_i | x^k) = ∑_{j∈C_i} f_j(x_i, x_{−i}^k) + ∑_{l∉C_i} ∇_{x_i} f_l(x^k)^T (x_i − x_i^k) + (τ_i/2)(x_i − x_i^k)^T H_i (x_i − x_i^k).

In words: each agent i keeps the convex part of F while it linearizes the nonconvex part.
Surrogates based on product of functions
Suppose that F(x) is expressed as a product of functions. For simplicity, consider the case of the product of two functions:

    F(x) = f_1(x) f_2(x),

with f_1 and f_2 convex and non-negative on X.

Inspired by the form of the gradient of F, ∇F = f_2 ∇f_1 + f_1 ∇f_2, it seems natural to consider the following surrogate function:

    F(x | x^k) = f_1(x) f_2(x^k) + f_1(x^k) f_2(x) + (τ/2)(x − x^k)^T H (x − x^k).

If f_1 and f_2 are not necessarily convex, we can instead use

    F(x | x^k) = f̃_1(x | x^k) f_2(x^k) + f_1(x^k) f̃_2(x | x^k),

where f̃_1 and f̃_2 are legitimate surrogates of f_1 and f_2, respectively.
Surrogates based on composition of functions
Let

    F(x) = h(f(x)),

where h is a convex smooth function nondecreasing in each component, and f is a smooth mapping with each component not necessarily convex. Example: F(x) = ∥f(x)∥².

Observe that the gradient is

    ∇F(x) = ∇f(x) ∇h(f(x)),

where ∇f(x) = [∇f_1(x) · · · ∇f_q(x)] is the Jacobian of f. A surrogate function is

    F(x | x^k) = h(f(x^k) + ∇f(x^k)^T (x − x^k)) + (τ/2)∥x − x^k∥².
Outline
1 Successive Convex Approximation (SCA) in a Nutshell
2 Classical Algorithms as SCA Instances
3 Parallel SCA
4 Applications
5 Surrogate Functions∗
6 Extensions of Parallel SCA∗
7 Connection to MM
Extension: Nondifferentiable objective function
Consider now the following problem:

    minimize_{x_1,...,x_n}   F(x_1, . . . , x_n) + G(x_1, . . . , x_n)
    subject to               x_i ∈ X_i,   i = 1, . . . , n,

where each X_i is a convex set, F(x) is continuous, and G(x) is convex but possibly nonsmooth.

Parallel SCA minimizes the following strongly convex surrogate functions in parallel for all i = 1, . . . , n (Scutari and Sun 2018)⁸:

    x̂_i(x^k) = arg min_{x_i∈X_i} F_i(x_i | x^k) + G(x_i, x_{−i}^k),

and then sets

    x^{k+1} = x^k + γ^k (x̂(x^k) − x^k).

8 G. Scutari and Y. Sun, “Parallel and distributed successive convex approximation methods for big-data optimization,” in C.I.M.E. Lecture Notes in Mathematics, Springer Verlag Series, 2018, pp. 141–308.
Extension: Inexact updates
The parallel SCA at each iteration k = 0, 1, . . . solves:

    x̂_i(x^k) = arg min_{x_i∈X_i} F_i(x_i | x^k),   ∀i = 1, . . . , n.

In practice, however, it may be too costly to obtain such minimizers. The inexact parallel SCA instead considers inexact solutions z_i^k ≈ x̂_i(x^k), for all i = 1, . . . , n, that satisfy:
1. ∥z_i^k − x̂_i(x^k)∥ ≤ ε_i^k with lim_{k→∞} ε_i^k = 0;
2. F_i(z_i^k | x^k) ≤ F_i(x_i^k | x^k).

In words, the inexact solutions are obtained with a desired accuracy ε_i^k, which must asymptotically vanish (increasing accuracy) in order to guarantee convergence (Scutari and Sun 2018)⁹.

9 G. Scutari and Y. Sun, “Parallel and distributed successive convex approximation methods for big-data optimization,” in C.I.M.E. Lecture Notes in Mathematics, Springer Verlag Series, 2018, pp. 141–308.
Extension: Selective updates
For problems with a large number of blocks n, it may be too costly to update all of them in parallel:

    x̂_i(x^k) = arg min_{x_i∈X_i} F_i(x_i | x^k),   ∀i = 1, . . . , n,

and

    x_i^{k+1} = x_i^k + γ^k (x̂_i(x^k) − x_i^k),   ∀i = 1, . . . , n.

The SCA with selective updates defines at each iteration k the set of blocks to be updated, S^k, so that the iterate is

    x_i^{k+1} = x_i^k + γ^k (x̂_i(x^k) − x_i^k)   if i ∈ S^k,
    x_i^{k+1} = x_i^k                             otherwise.
Extension: Selective updates

Several options are possible for the block selection rule S^k:
according to some deterministic (cyclic) rule;
according to some random-based rule;
greedy-like schemes: updating only the blocks that are “far away” from the optimum.

Convergence can still be guaranteed for the following update rules (Scutari and Sun 2018)¹⁰; a small sketch of the random-based rule follows below:
1. Essentially cyclic rule: S^k is selected so that ∪_{s=0}^{T−1} S^{k+s} = {1, . . . , n} for some finite T.
2. Greedy rule: Each S^k contains at least one index i such that

    E_i(x^k) ≥ ρ max_j E_j(x^k),

where ρ ∈ (0, 1] and E_i(x^k) is an approximation E_i(x^k) ≈ ∥x̂_i(x^k) − x_i^k∥.
3. Random-based rule: The sets S^k are realizations of independent random sets such that Pr(i ∈ S^k) ≥ p for some p > 0.

10 G. Scutari and Y. Sun, “Parallel and distributed successive convex approximation methods for big-data optimization,” in C.I.M.E. Lecture Notes in Mathematics, Springer Verlag Series, 2018, pp. 141–308.
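A minimal sketch of the random-based rule grafted onto the earlier LASSO parallel SCA update (the inclusion probability p and all other constants are illustrative; for brevity the candidate updates are computed for all blocks, whereas in practice one would compute x̂_i only for i ∈ S^k):

```python
import numpy as np

def soft_threshold(u, lam):
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def lasso_sca_selective(y, A, lam, p=0.5, gamma=0.9, eps=0.01, num_iter=300, seed=0):
    """Parallel SCA for LASSO updating only a random subset S^k of blocks per iteration."""
    rng = np.random.default_rng(seed)
    N = A.shape[1]
    x = np.zeros(N)
    col_norms2 = np.sum(A**2, axis=0)
    for _ in range(num_iter):
        S = rng.random(N) < p                         # random-based rule: Pr(i in S^k) = p > 0
        corr = A.T @ (y - A @ x) + col_norms2 * x     # a_i^T y_i^k for all i
        x_hat = soft_threshold(corr, lam) / col_norms2
        x[S] += gamma * (x_hat[S] - x[S])             # only the selected blocks move
        gamma *= (1.0 - eps * gamma)                  # diminishing stepsize
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 20))
y = A[:, 0] - 2 * A[:, 1] + 0.01 * rng.standard_normal(50)
print(lasso_sca_selective(y, A, lam=0.1)[:5])
```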
Outline
1 Successive Convex Approximation (SCA) in a Nutshell
2 Classical Algorithms as SCA Instances
3 Parallel SCA
4 Applications
5 Surrogate Functions∗
6 Extensions of Parallel SCA∗
7 Connection to MM
Majorization-Minimization (MM)

Consider the following problem:

    minimize_x   f(x)
    subject to   x ∈ X,

where X is the feasible set and f(x) is a continuous function.

The idea of MM is to iteratively approximate the problem by a simpler one (like in SCA). MM approximates f(x) by an upper-bound function (majorizer) u(x | x^k) satisfying some technical conditions, like the first-order condition ∇u(x^k | x^k) = ∇f(x^k).

At iteration k = 0, 1, . . . the surrogate problem is (Scutari et al. 2014)¹¹

    minimize_x   u(x | x^k)
    subject to   x ∈ X.

11 G. Scutari, F. Facchinei, P. Song, D. P. Palomar, and J.-S. Pang, “Decomposition by partial linearization: Parallel optimization of multi-agent systems,” IEEE Trans. Signal Processing, vol. 62, no. 3, pp. 641–656, 2014.
MM vs SCA
Surrogate function:
MM requires the surrogate function to be a global upper bound (which can be too demanding in some cases), albeit not necessarily convex.
SCA relaxes the upper-bound condition, but it requires the surrogate to be strongly convex.
MM vs SCA
Constraint set:
In principle, both SCA and MM require the feasible set X to be convex.
MM can be easily extended to nonconvex X on a case-by-case basis; for example: (Song et al. 2015)¹², (Kumar et al. 2019)¹³, (Kumar et al. 2020)¹⁴.
SCA can be extended to convexify the constraint functions, but it cannot deal with a nonconvex X directly, which limits its applicability in many real-world applications.

12 J. Song, P. Babu, and D. P. Palomar, “Sparse generalized eigenvalue problem via smooth optimization,” IEEE Trans. Signal Processing, vol. 63, no. 7, pp. 1627–1642, 2015.
13 S. Kumar, J. Ying, J. V. de M. Cardoso, and D. P. Palomar, “Structured graph learning via Laplacian spectral constraints,” in Proc. Advances in Neural Information Processing Systems (NeurIPS), Vancouver, Canada, 2019.
14 S. Kumar, J. Ying, J. V. de M. Cardoso, and D. P. Palomar, “A unified framework for structured graph learning via spectral constraints,” Journal of Machine Learning Research (JMLR), pp. 1–60, 2020.
MM vs SCA
Schedule of updates:
MM updates the whole variable x at each iteration (so, in principle, no distributed implementation).
If the majorizer in MM happens to be block separable in x = (x_1, . . . , x_N), then one can have a parallel update.
Block MM updates each block of x = (x_1, . . . , x_N) sequentially.
SCA, on the other hand, naturally has a parallel update (assuming the constraints are separable), which can be useful for distributed implementation.
Thanks
For more information visit:
https://www.danielppalomar.com
References I
Kumar, S., Ying, J., Cardoso, J. V. de M., & Palomar, D. P. (2019). Structured graph learning via Laplacian spectral constraints. In Proc. Advances in Neural Information Processing Systems (NeurIPS). Vancouver, Canada.
Kumar, S., Ying, J., Cardoso, J. V. de M., & Palomar, D. P. (2020). A unified framework for structured graph learning via spectral constraints. Journal of Machine Learning Research (JMLR), 1–60.
Scutari, G., Facchinei, F., Song, P., Palomar, D. P., & Pang, J.-S. (2014). Decomposition by partial linearization: Parallel optimization of multi-agent systems. IEEE Trans. Signal Processing, 62(3), 641–656.
Scutari, G., & Sun, Y. (2018). Parallel and distributed successive convex approximation methods for big-data optimization. In C.I.M.E. Lecture Notes in Mathematics (pp. 141–308). Springer Verlag Series.
References II
Song, J., Babu, P., & Palomar, D. P. (2015). Sparse generalized eigenvalue problem via smooth optimization. IEEE Trans. Signal Processing, 63(7), 1627–1642.