
Lecture Notes on ADMM-Type Methods

Ernest K. Ryu Wotao Yin

December 2019

This chapter on the Alternating Direction Method of Multipliers (ADMM) is an excerpt from an upcoming textbook by the authors.

This chapter introduces modern ADMM-type methods that extend and generalize the classical ADMM of the 1970s. Specifically, we present a single unified ADMM capturing many modern variants and provide a self-contained convergence proof. We furthermore show techniques expanding the applicability of the presented ADMM and highlight connections to a wide range of classical and modern methods. However, the chapter does not focus on distributed optimization, non-convex optimization, or convergence rates. (The presented proof establishes convergence, but does not discuss convergence rates.)


Basic definitions

We quickly provide key definitions so that the chapter can be used as a set of self-contained lecture notes on ADMM.

Unless otherwise specified, functions in these notes map to the extended real line R ∪ {±∞}. A function f : R^n → R ∪ {±∞} is convex if

f(αx+ (1− α)y) ≤ αf(x) + (1− α)f(y), ∀x, y ∈ Rn, α ∈ (0, 1). (1)

A function is proper if its value is never −∞ and is finite somewhere. We say a convex and proper function is closed if it is lower semi-continuous. We say a function is CCP if it is closed, convex, and proper. Most convex functions of interest are closed and proper. We say f is concave if −f is convex.

Let f be a CCP function on R^n. Let α > 0. We define the proximal operator with respect to αf as

    Prox_{αf}(y) = argmin_{x∈R^n} { αf(x) + (1/2)‖x − y‖² }.

If f is CCP, then Prox_{αf} is well defined, i.e., the argmin uniquely exists.
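To make the definition concrete, here is a minimal numerical sketch (ours, with illustrative data). It uses the standard fact that for f = ‖·‖₁ the proximal operator of αf is soft-thresholding, and verifies the optimality condition y − x ∈ α∂‖·‖₁(x) coordinate-wise:

    import numpy as np

    def prox_l1(y, alpha):
        # Soft-thresholding: the closed-form proximal operator of alpha*||.||_1.
        return np.sign(y) * np.maximum(np.abs(y) - alpha, 0.0)

    rng = np.random.default_rng(0)
    y, alpha = rng.standard_normal(5), 0.3
    x = prox_l1(y, alpha)
    # Optimality check: y - x must lie in alpha * (subdifferential of ||.||_1 at x).
    for xi, yi in zip(x, y):
        if xi != 0.0:
            assert np.isclose(yi - xi, alpha * np.sign(xi))
        else:
            assert abs(yi) <= alpha + 1e-12
    print(x)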

Let L(x, u) : R^n × R^m → R ∪ {±∞} be convex-concave, i.e., L is convex in x when u is fixed and concave in u when x is fixed. We say (x⋆, u⋆) is a saddle point of L if

L(x?, u) ≤ L(x?, u?) ≤ L(x, u?) ∀x ∈ Rn, u ∈ Rm.

We call

    minimize_{x∈R^n}   sup_{u∈R^m} L(x, u)

the primal problem generated by L and write p⋆ = inf_x sup_u L(x, u) for the primal optimal value. We call

    maximize_{u∈R^m}   inf_{x∈R^n} L(x, u)

the dual problem generated by L and write d⋆ = sup_u inf_x L(x, u) for the dual optimal value. Weak duality, which states d⋆ ≤ p⋆, always holds. Strong duality, which states d⋆ = p⋆, holds often but not always in convex optimization.

Total duality states that a primal solution exists, a dual solution exists, and strong duality holds. Total duality holds if and only if L has a saddle point. So solving the primal and dual optimization problems is equivalent to finding a saddle point of L, provided that total duality holds. Although the term “total duality” is not commonly used, the assumption is commonly used throughout the ADMM literature, often through assuming L has a saddle point.

If all eigenvalues of a symmetric matrix A are nonnegative, we say A is symmetric positive semidefinite and write A ⪰ 0. We write A ⪰ B if A − B ⪰ 0. Given A ⪰ 0, define the A-seminorm as

    ‖x‖_A = (xᵀAx)^{1/2}.


Chapter 10

ADMM-type methods

In this chapter, we present the alternating direction method of multipliers (ADMM) and its variants, which we loosely refer to as ADMM-type methods. We first present FLiP-ADMM, a general and highly versatile variant, and establish its convergence directly without relying on the machinery of monotone operators. We then derive a wide range of ADMM-type methods from FLiP-ADMM.

10.1 Function-linearized proximal ADMM

Let f1 and f2 be CCP functions on R^p and g1 and g2 be CCP functions on R^q. Let f2 and g2 be differentiable. Let A ∈ R^{n×p}, B ∈ R^{n×q}, and c ∈ R^n. For notational convenience, write

f = f1 + f2, g = g1 + g2.

Consider the primal problem

    minimize_{x∈R^p, y∈R^q}   f1(x) + f2(x) + g1(y) + g2(y)
    subject to   Ax + By = c,

generated by the Lagrangian

L(x, y, u) = f(x) + g(y) + 〈u,Ax+By − c〉.

The method function-linearized proximal alternating direction method of multipliers (FLiP-ADMM) is

    x^{k+1} ∈ argmin_{x∈R^p} { f1(x) + 〈∇f2(x^k) + Aᵀu^k, x〉 + (ρ/2)‖Ax + By^k − c‖² + (1/2)‖x − x^k‖²_P }
    y^{k+1} ∈ argmin_{y∈R^q} { g1(y) + 〈∇g2(y^k) + Bᵀu^k, y〉 + (ρ/2)‖Ax^{k+1} + By − c‖² + (1/2)‖y − y^k‖²_Q }
    u^{k+1} = u^k + ϕρ(Ax^{k+1} + By^{k+1} − c),

where ρ > 0, ϕ > 0, P ∈ R^{p×p} with P ⪰ 0, and Q ∈ R^{q×q} with Q ⪰ 0.


Theorem 17. Consider FLiP-ADMM. Assume total duality. Assume the x- and y-subproblems always have solutions. Assume f2 is L_f-smooth and g2 is L_g-smooth, where L_f ≥ 0 and L_g ≥ 0. Assume there is an ε ∈ (0, 2 − ϕ) such that

    P ⪰ L_f I,   Q ⪰ 0,   ρ(1 − (1 − ϕ)²/(2 − ϕ − ε)) BᵀB + Q ⪰ 3L_g I.

Then

    f(x^k) + g(y^k) → f(x⋆) + g(y⋆),   Ax^k + By^k − c → 0,

where (x⋆, y⋆) is a primal solution.

To clarify, we set L_f = 0 or L_g = 0 respectively when f2 = 0 or g2 = 0. The condition

    ρ(1 − (1 − ϕ)²/(2 − ϕ − ε)) BᵀB + Q ⪰ 3L_g I        (10.1)

imposes constraints on ϕ and ρ. Note

    1 − (1 − ϕ)²/(2 − ϕ)   > 0  for ϕ ∈ (0, (√5 + 1)/2),
                            = 0  for ϕ = (√5 + 1)/2,
                            < 0  for ϕ ∈ ((√5 + 1)/2, 2).

For each ϕ ∈ (0, (√5 + 1)/2), there is a small enough ε > 0 such that

    1 − (1 − ϕ)²/(2 − ϕ − ε) > 0,

so large ρ helps satisfy (10.1). For each ϕ ∈ [(√5 + 1)/2, 2) and all ε ∈ (0, 2 − ϕ),

    1 − (1 − ϕ)²/(2 − ϕ − ε) < 0,

so small ρ helps satisfy (10.1). In the classical ADMM setup, where Q = 0 and L_g = 0, (10.1) is satisfied with ϕ ∈ (0, (√5 + 1)/2) only.

10.1.1 Parameter choices

FLiP-ADMM has several algorithmic parameters: ϕ, ρ, P, Q, f2, and g2. Their choices affect the number of iterations required to achieve a desired accuracy and the computational cost per iteration. The optimal choice for a given problem balances the number of iterations and the cost per iteration.

The dual extrapolation parameter ϕ

While the choice ϕ = 1 is most common, larger values for ϕ can provide a speedup. In the classical ADMM setup, where f2 = 0, g2 = 0, P = 0, and Q = 0, FLiP-ADMM reduces to

    x^{k+1} ∈ argmin_{x∈R^p} L_ρ(x, y^k, u^k)
    y^{k+1} ∈ argmin_{y∈R^q} L_ρ(x^{k+1}, y, u^k)
    u^{k+1} = u^k + ϕρ(Ax^{k+1} + By^{k+1} − c),

where

    L_ρ(x, y, u) = f(x) + g(y) + 〈u, Ax + By − c〉 + (ρ/2)‖Ax + By − c‖².        (10.2)

This is called the golden ratio ADMM, since the stepsize requirement is 0 < ϕ < (1 + √5)/2, and (1 + √5)/2 ≈ 1.618 is the golden ratio.
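As a minimal illustration (a sketch with illustrative data, not a tuned implementation), the classical ADMM iteration above with dual extrapolation ϕ can be run on a toy lasso-type split f(x) = (1/2)‖Mx − b‖², g(y) = λ‖y‖₁, A = I, B = −I, c = 0; the names M, b, lam, rho, phi are placeholders:

    import numpy as np

    def soft(v, t):  # prox of t*||.||_1
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def admm_lasso(M, b, lam, rho=1.0, phi=1.0, iters=500):
        # Classical ADMM with dual extrapolation phi for
        #   minimize (1/2)||Mx - b||^2 + lam*||y||_1  subject to  x - y = 0.
        n = M.shape[1]
        x = y = u = np.zeros(n)
        MtM, Mtb = M.T @ M, M.T @ b
        for _ in range(iters):
            x = np.linalg.solve(MtM + rho * np.eye(n), Mtb - u + rho * y)  # x-subproblem
            y = soft(x + u / rho, lam / rho)                               # y-subproblem
            u = u + phi * rho * (x - y)                                    # dual update
        return x, y, u

    rng = np.random.default_rng(0)
    M, b = rng.standard_normal((40, 10)), rng.standard_normal(40)
    x, y, u = admm_lasso(M, b, lam=0.5, phi=1.5)   # phi in (0, (1+sqrt(5))/2)
    print(np.linalg.norm(x - y))                    # constraint violation -> 0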

Penalty parameter ρ

The parameter ρ controls the relative priority between primal and dual convergence. The Lyapunov function used to prove convergence of FLiP-ADMM contains the terms ρ‖B(y^k − y⋆)‖² (primal error) and (1/(ϕρ))‖u^k − u⋆‖² (dual error), and large ρ prioritizes primal accuracy while small ρ prioritizes dual accuracy. When 0 < ϕ < (√5 + 1)/2, we can use large ρ. When (√5 + 1)/2 ≤ ϕ < 2, we can use small ρ.

Proximal terms via P and Q

The “P” in FLiP-ADMM describes the presence of the proximal terms

    (1/2)‖x − x^k‖²_P,   (1/2)‖y − y^k‖²_Q.

Empirically, smaller P and Q lead to fewer required iterations. When f2 = 0 and g2 = 0, the choice P = 0 and Q = 0 is often optimal in the number of required iterations. In the cases considered in §10.2.1 and §10.2.3, however, we can use a nonzero P and Q to cancel out (linearize) unwieldy quadratic terms and thereby reduce the cost per iteration.

Linearized functions f2 and g2

When f2 = 0, the x-update of FLiP-ADMM is

    x^{k+1} ∈ argmin_{x∈R^p} { L_ρ(x, y^k, u^k) + (1/2)‖x − x^k‖²_P },

where L_ρ is the augmented Lagrangian (10.2). When f2 ≠ 0, we have

    x^{k+1} ∈ argmin_{x∈R^p} { f1(x) + f2(x^k) + 〈∇f2(x^k), x − x^k〉 + g(y^k)
                               + 〈u^k, Ax + By^k − c〉 + (ρ/2)‖Ax + By^k − c‖² + (1/2)‖x − x^k‖²_P },

i.e., replace f2(x) with its first-order approximation f2(x^k) + 〈∇f2(x^k), x − x^k〉 in L_ρ(x, y^k, u^k) and minimize with respect to x. We call this linearizing the function, or function-linearization. The same discussion holds for the y-update. The “FLi (Function-Linearized)” in FLiP-ADMM describes this feature of accessing f2 and g2 through their gradients.

FLiP-ADMM presents the choice of whether or not to use function-linearization. Often, the choices f2 = 0 and g2 = 0 lead to fewer required iterations. In some cases, however, nonzero choices of f2 and g2 reduce the cost per iteration of solving the x- and y-subproblems.

10.1.2 Further discussion

Solvability of subproblems

We say the x- and y-subproblems are solvable if they have solutions (not necessarily unique). Solvability of the subproblems is not automatic, even when total duality holds and when f and g are CCP. In the derivation of ADMM in §3, solvability was ensured by assuming the additional regularity condition (3.6). In this section, we directly assume solvability instead. See the notes and references of §3 for further discussion.

Notion of convergence

The notion of convergence in Theorem 17 is different from what we had previously seen in Part I. The result establishes that the objective value converges to the optimal value and that the constraint violation converges to 0, rather than showing that the iterates converge to a solution.

Relation to method of multipliers

The “AD (Alternating Direction)” in FLiP-ADMM describes the solving of the two x- and y-subproblems in an alternating fashion. The “MM” in FLiP-ADMM describes the method’s similarity to the method of multipliers, which has only one primal subproblem.

When q = 0, the y-subproblem and B-matrix vanish, the ADMM setup reduces to the method of multipliers setup, and we obtain methods similar to what we had seen in Exercise 3.6. (Theorem 17 applies when q = 0.) In particular, when q = 0, f2 = 0, g2 = 0, B = 0, P = 0, and Q = 0, FLiP-ADMM reduces to the classical method of multipliers

    x^{k+1} ∈ argmin_x { f(x) + 〈u^k, Ax〉 + (ρ/2)‖Ax − c‖² }
    u^{k+1} = u^k + ϕρ(Ax^{k+1} − c),

which converges for ϕ ∈ (0, 2).
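As a small sketch of the q = 0 case (with illustrative data), the classical method of multipliers applied to an equality-constrained least-squares problem, where the x-subproblem is an explicit linear solve:

    import numpy as np

    def method_of_multipliers(A, c, z, rho=1.0, phi=1.5, iters=200):
        # Method of multipliers for  minimize (1/2)||x - z||^2  subject to  Ax = c.
        m, n = A.shape
        x, u = np.zeros(n), np.zeros(m)
        for _ in range(iters):
            x = np.linalg.solve(np.eye(n) + rho * A.T @ A, z - A.T @ u + rho * A.T @ c)
            u = u + phi * rho * (A @ x - c)     # converges for phi in (0, 2)
        return x, u

    rng = np.random.default_rng(1)
    A, z = rng.standard_normal((3, 8)), rng.standard_normal(8)
    c = A @ rng.standard_normal(8)              # feasible right-hand side
    x, u = method_of_multipliers(A, c, z)
    print(np.linalg.norm(A @ x - c))            # constraint violation -> 0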


10.1.3 Scaled form

With the substitution v^k = (1/ρ)u^k for k = 0, 1, . . . , FLiP-ADMM becomes

    x^{k+1} ∈ argmin_{x∈R^p} { f1(x) + 〈∇f2(x^k), x〉 + (ρ/2)‖Ax + By^k − c + v^k‖² + (1/2)‖x − x^k‖²_P }
    y^{k+1} ∈ argmin_{y∈R^q} { g1(y) + 〈∇g2(y^k), y〉 + (ρ/2)‖Ax^{k+1} + By − c + v^k‖² + (1/2)‖y − y^k‖²_Q }
    v^{k+1} = v^k + ϕ(Ax^{k+1} + By^{k+1} − c).

We call this the scaled form of FLiP-ADMM. In some cases, the scaled forms of ADMM-type methods are simpler than the original forms and are therefore preferred. In this chapter, we use the unscaled form to make clear that the u^k-iterates represent (unscaled) dual variables.

10.1.4 Proof of Theorem 17

Proof. The assumption of total duality means L has a saddle point (x⋆, y⋆, u⋆). Define

    w⋆ = (x⋆, y⋆, u⋆),    w^k = (x^k, y^k, u^k)    for k = 0, 1, . . . .

Define η = 2 − ϕ − ε. Define the symmetric positive semidefinite block-diagonal matrices

    M0 = (1/2) diag( P,  ρBᵀB + Q,  (1/(ϕρ)) I ),
    M1 = (1/2) diag( 0,  Q + L_g I,  (η/(ϕ²ρ)) I ),
    M2 = (1/2) diag( P − L_f I,  ρ(1 − (1 − ϕ)²/η) BᵀB + Q − 3L_g I,  ((2 − ϕ − η)/(ϕ²ρ)) I ),

where the three diagonal blocks act on the x-, y-, and u-components of w. Define the Lyapunov function

    V^k = ‖w^k − w⋆‖²_{M0} + ‖w^k − w^{k−1}‖²_{M1}.

Proof outline. In stage 1, we use the definition of x^{k+1} and y^{k+1} as minimizers to obtain certain inequalities respectively relating x^{k+1} with x⋆ and y^{k+1} with y⋆. In stage 2, we use the definition of y^k and y^{k+1} as minimizers to obtain an inequality relating y^k with y^{k+1}. In stage 3, we use the inequalities of the previous stages to establish the key inequality

    V^{k+1} ≤ V^k − ‖w^{k+1} − w^k‖²_{M2} − (L(x^{k+1}, y^{k+1}, u⋆) − L(x⋆, y⋆, u⋆)).        (10.3)

In stage 4, we use the summability argument to show convergence.


Stage 1. The solution x^{k+1} to the x-subproblem in FLiP-ADMM satisfies

    0 ≤ f1(x) − f1(x^{k+1}) + 〈∇f2(x^k) + Aᵀ(u^k + ρ(Ax^{k+1} + By^k − c)) + P(x^{k+1} − x^k), x − x^{k+1}〉

for any x ∈ R^p. By convexity of f2 and L_f-smoothness of f2,

    〈∇f2(x^k), x − x^{k+1}〉 = 〈∇f2(x^k), x − x^k〉 + 〈∇f2(x^k), x^k − x^{k+1}〉
        ≤ f2(x) − f2(x^k) + f2(x^k) − f2(x^{k+1}) + (L_f/2)‖x^{k+1} − x^k‖²
        = f2(x) − f2(x^{k+1}) + (L_f/2)‖x^{k+1} − x^k‖².

Adding the two inequalities, we get

    0 ≤ f(x) − f(x^{k+1}) + (L_f/2)‖x^{k+1} − x^k‖²
        + 〈Aᵀ(u^k + ρ(Ax^{k+1} + By^k − c)) + P(x^{k+1} − x^k), x − x^{k+1}〉.

To simplify notation, define

    ũ^{k+1} = u^k + ρ(Ax^{k+1} + By^{k+1} − c)

and rewrite the inequality as

    f(x^{k+1}) − f(x) + 〈ũ^{k+1}, A(x^{k+1} − x)〉
        ≤ (L_f/2)‖x^{k+1} − x^k‖² + ρ〈B(y^{k+1} − y^k), A(x^{k+1} − x)〉 − 〈x^{k+1} − x^k, x^{k+1} − x〉_P        (10.4)

for any x ∈ R^p. Repeating analogous steps with the y-update, we get

    g(y^{k+1}) − g(y) + 〈ũ^{k+1}, B(y^{k+1} − y)〉 ≤ (L_g/2)‖y^{k+1} − y^k‖² − 〈y^{k+1} − y^k, y^{k+1} − y〉_Q        (10.5)

for any y ∈ R^q.

Set x = x⋆ in (10.4), set y = y⋆ in (10.5), add the two inequalities, add the identity

    〈u⋆ − ũ^{k+1}, Ax^{k+1} + By^{k+1} − c〉 = (1/ρ)〈u⋆ − ũ^{k+1}, ũ^{k+1} − u^k〉,

and substitute

    A(x^{k+1} − x⋆) = (1/(ϕρ))(u^{k+1} − u^k) − B(y^{k+1} − y⋆)
    u⋆ − ũ^{k+1} = (1 − 1/ϕ)(u^{k+1} − u^k) − (u^{k+1} − u⋆)
    ũ^{k+1} − u^k = (1/ϕ)(u^{k+1} − u^k)

to get

    L(x^{k+1}, y^{k+1}, u⋆) − L(x⋆, y⋆, u⋆)
        ≤ (L_f/2)‖x^{k+1} − x^k‖² + (L_g/2)‖y^{k+1} − y^k‖² + (1 − 1/ϕ)(1/(ϕρ))‖u^{k+1} − u^k‖²
          − 2〈w^{k+1} − w^k, w^{k+1} − w⋆〉_{M0} + (1/ϕ)〈u^{k+1} − u^k, B(y^{k+1} − y^k)〉.        (10.6)

Stage 2. Consider

    g(y^{k+1}) − g(y^k) + 〈ũ^{k+1}, B(y^{k+1} − y^k)〉 ≤ (L_g/2)‖y^{k+1} − y^k‖² − 〈y^{k+1} − y^k, y^{k+1} − y^k〉_Q,

which follows from (10.5) with y = y^k, and

    g(y^k) − g(y^{k+1}) + 〈ũ^k, B(y^k − y^{k+1})〉 ≤ (L_g/2)‖y^k − y^{k−1}‖² − 〈y^k − y^{k−1}, y^k − y^{k+1}〉_Q,

which follows from decrementing the indices (k + 1, k) to (k, k − 1) in (10.5) and using y = y^{k+1}. We add the two inequalities and reorganize to get

    (1/ϕ)〈u^{k+1} − u^k, B(y^{k+1} − y^k)〉
        ≤ (L_g/2)‖y^{k+1} − y^k‖² + (L_g/2)‖y^k − y^{k−1}‖² − ‖y^{k+1} − y^k‖²_Q
          + 〈y^{k+1} − y^k, y^k − y^{k−1}〉_Q − (1 − 1/ϕ)〈u^k − u^{k−1}, B(y^{k+1} − y^k)〉.

We apply Young’s inequality

    〈a, b〉 ≤ (ζ/2)‖a‖² + (1/(2ζ))‖b‖²,   ∀ a, b ∈ R^n, ζ > 0,

to the two inner products on the right-hand side and reorganize to get

    (1/ϕ)〈u^{k+1} − u^k, B(y^{k+1} − y^k)〉
        ≤ (1/2)‖y^{k+1} − y^k‖²_{L_g I − Q + ((1−ϕ)²/η)ρBᵀB} + (1/2)‖y^k − y^{k−1}‖²_{L_g I + Q} + (η/(2ϕ²ρ))‖u^k − u^{k−1}‖²        (10.7)

for any η > 0. The left-hand side of this inequality is the last term on the right-hand side of (10.6).

Stage 3. Using

    ‖w^{k+1} − w⋆‖²_{M0} = ‖w^k − w⋆‖²_{M0} − ‖w^{k+1} − w^k‖²_{M0} + 2〈w^{k+1} − w^k, w^{k+1} − w⋆〉_{M0}

on the differences between V^{k+1} and V^k, we get

    V^{k+1} = V^k − ‖w^k − w^{k−1}‖²_{M1} + ‖w^{k+1} − w^k‖²_{M1} − ‖w^{k+1} − w^k‖²_{M0}
              + 2〈w^{k+1} − w^k, w^{k+1} − w⋆〉_{M0}.

To this identity, we add the inequalities (10.7) and (10.6) to get

    V^{k+1} ≤ V^k − (1/2)‖x^{k+1} − x^k‖²_{P − L_f I} − (1/2)‖y^{k+1} − y^k‖²_{(1 − (1−ϕ)²/η)ρBᵀB + Q − 3L_g I}
              − ((2 − η − ϕ)/(2ϕ²ρ))‖u^{k+1} − u^k‖² − (L(x^{k+1}, y^{k+1}, u⋆) − L(x⋆, y⋆, u⋆))
            = V^k − ‖w^{k+1} − w^k‖²_{M2} − (L(x^{k+1}, y^{k+1}, u⋆) − L(x⋆, y⋆, u⋆))

for any η > 0, which is the key inequality (10.3). Note that

    L(x^{k+1}, y^{k+1}, u⋆) − L(x⋆, y⋆, u⋆) ≥ 0,

since (x⋆, y⋆, u⋆) is a saddle point of L.

Stage 4. Applying the summability argument on (10.3) tells us ‖w^{k+1} − w^k‖²_{M2} → 0 and L(x^{k+1}, y^{k+1}, u⋆) − L(x⋆, y⋆, u⋆) → 0. Note that ‖w^{k+1} − w^k‖²_{M2} → 0 implies u^{k+1} − u^k → 0 and thus Ax^k + By^k − c → 0. Since

    L(x^{k+1}, y^{k+1}, u⋆) = f(x^{k+1}) + g(y^{k+1}) + 〈u⋆, Ax^{k+1} + By^{k+1} − c〉 → L(x⋆, y⋆, u⋆) = f(x⋆) + g(y⋆),

where the inner-product term converges to 0, we conclude f(x^k) + g(y^k) → f(x⋆) + g(y⋆).

The inequalities we show in stages 1 and 2 are sometimes referred to as variational inequalities due to their connection to variational inequality problems. The key technical difficulty of the proof is the construction of the Lyapunov function V^k, which comes from the insights accumulated over the many papers studying various generalizations of ADMM.

10.2 Derived ADMM-type methods

10.2.1 Linearized methods

Consider the problem

    minimize_{x∈R^p, y∈R^q}   f1(x) + g1(y)
    subject to   Ax + By = c,


where f2 = 0 and g2 = 0. We can use the linearization technique of §3.5 with FLiP-ADMM. With P = (1/α)I − ρAᵀA and Q = (1/β)I − ρBᵀB, we get

    x^{k+1} = Prox_{αf}(x^k − αAᵀ(u^k + ρ(Ax^k + By^k − c)))
    y^{k+1} = Prox_{βg}(y^k − βBᵀ(u^k + ρ(Ax^{k+1} + By^k − c)))
    u^{k+1} = u^k + ϕρ(Ax^{k+1} + By^{k+1} − c),

linearized ADMM. The stepsize requirement is satisfied with 1 ≥ αρλ_max(AᵀA), 1 ≥ βρλ_max(BᵀB), and 0 < ϕ < (1 + √5)/2. The linearized ADMM we had seen in §3.5 corresponds to the case ϕ = 1.

“Linearization” in the context of ADMM-type methods is an ambiguous term referring to more than one technique. While it most often refers to the technique of canceling out inconvenient quadratic terms, there are other “linearizations” as we will see in §10.2.2 and §10.2.6.
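A minimal sketch of the linearized ADMM above (ours, with an illustrative splitting): take f1 = λ‖·‖₁, g1 = (1/2)‖· − b‖², A = M, B = −I, c = 0, and choose α and β to satisfy the stated stepsize requirement:

    import numpy as np

    def soft(v, t):
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def linearized_admm(M, b, lam, rho=1.0, phi=1.0, iters=2000):
        # Linearized ADMM for  minimize lam*||x||_1 + (1/2)||y - b||^2  s.t.  Mx - y = 0.
        # Stepsizes: 1 >= alpha*rho*lambda_max(M^T M) and 1 >= beta*rho.
        m, n = M.shape
        alpha = 0.99 / (rho * np.linalg.norm(M, 2) ** 2)
        beta = 0.99 / rho
        x, y, u = np.zeros(n), np.zeros(m), np.zeros(m)
        for _ in range(iters):
            x = soft(x - alpha * M.T @ (u + rho * (M @ x - y)), alpha * lam)  # Prox_{alpha*f1}
            v = y + beta * (u + rho * (M @ x - y))                            # B = -I, so -B^T(...) = +(...)
            y = (v + beta * b) / (1.0 + beta)                                 # Prox_{beta*g1}
            u = u + phi * rho * (M @ x - y)
        return x, y

    rng = np.random.default_rng(0)
    M, b = rng.standard_normal((30, 12)), rng.standard_normal(30)
    x, y = linearized_admm(M, b, lam=0.3)
    print(np.linalg.norm(M @ x - y))   # constraint violation -> 0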

PDHG

Consider the problem

    minimize_{x∈R^p, y∈R^q}   f1(x) + g1(y)
    subject to   −Ix + By = 0.

As discussed in §3.5, PDHG is an instance of FLiP-ADMM. With ϕ = 1, P = 0, and Q = (1/β)I − ρBᵀB, we recover PDHG:

    μ^{k+1} = Prox_{ρf1*}(μ^k + ρB(2y^k − y^{k−1}))
    y^{k+1} = Prox_{βg1}(y^k − βBᵀμ^{k+1}).

The stepsize requirement is 1 ≥ βρλ_max(BᵀB).
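For concreteness, a small sketch (ours) of the PDHG iteration above applied to 1D total-variation denoising, i.e., f1 = λ‖·‖₁, g1 = (1/2)‖· − b‖², and B a finite-difference matrix; Prox_{ρf1*} is then the projection onto [−λ, λ]:

    import numpy as np

    def pdhg_tv(b, lam, rho=1.0, iters=3000):
        # PDHG for  minimize lam*||B y||_1 + (1/2)||y - b||^2  with B the 1D difference matrix.
        n = len(b)
        B = np.diff(np.eye(n), axis=0)                  # (n-1) x n finite-difference matrix
        beta = 0.99 / (rho * np.linalg.norm(B, 2) ** 2) # 1 >= beta*rho*lambda_max(B^T B)
        y, y_old, mu = b.copy(), b.copy(), np.zeros(n - 1)
        for _ in range(iters):
            mu = np.clip(mu + rho * B @ (2 * y - y_old), -lam, lam)  # Prox_{rho*f1*} = projection
            y_old = y
            y = (y - beta * B.T @ mu + beta * b) / (1.0 + beta)      # Prox_{beta*g1}
        return y

    t = np.linspace(0, 1, 50)
    b = (t > 0.5).astype(float) + 0.1 * np.random.default_rng(0).standard_normal(50)
    print(pdhg_tv(b, lam=0.5).round(2))   # approximately piecewise-constant reconstruction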

10.2.2 Function-linearized methods

FLiP-ADMM linearizes the functions f2 and g2, i.e., it accesses f2 and g2 through their gradient evaluations, rather than through minimization subproblems. This feature provides great flexibility.

Condat–Vu

Consider the problem

    minimize_{x∈R^p, y∈R^q}   f1(x) + g1(y) + g2(y)
    subject to   −Ix + By = 0.

FLiP-ADMM with ϕ = 1, P = 0, and Q = (1/β)I − ρBᵀB is

    x^{k+1} = Prox_{(1/ρ)f1}((1/ρ)u^k + By^k)
    y^{k+1} = Prox_{βg1}(y^k − β∇g2(y^k) − βBᵀ(u^k − ρ(x^{k+1} − By^k)))
    u^{k+1} = u^k − ρ(x^{k+1} − By^{k+1}).

Use the Moreau identity (2.11) to get

    x^{k+1} = (1/ρ)u^k + By^k − (1/ρ)μ^{k+1},   where μ^{k+1} = Prox_{ρf1*}(u^k + ρBy^k),
    y^{k+1} = Prox_{βg1}(y^k − β∇g2(y^k) − βBᵀμ^{k+1})
    u^{k+1} = μ^{k+1} + ρB(y^{k+1} − y^k),

and we recover Condat–Vu

    μ^{k+1} = Prox_{ρf1*}(μ^k + ρB(2y^k − y^{k−1}))
    y^{k+1} = Prox_{βg1}(y^k − β∇g2(y^k) − βBᵀμ^{k+1}).

Therefore, Condat–Vu is a special case of FLiP-ADMM. The stepsize requirement of FLiP-ADMM translates to the requirement 1 ≥ βρλ_max(BᵀB) + 3βL_g, which is worse than what we had seen in §3.3.
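A parallel sketch (ours) of the Condat–Vu iteration on the same total-variation toy problem, now handling the quadratic as the smooth term g2 through its gradient (g1 = 0, so L_g = 1):

    import numpy as np

    def condat_vu_tv(b, lam, rho=1.0, iters=3000):
        # Condat-Vu for  minimize lam*||B y||_1 + (1/2)||y - b||^2,
        # with f1 = lam*||.||_1, g1 = 0, g2 = (1/2)||. - b||^2.
        n = len(b)
        B = np.diff(np.eye(n), axis=0)
        # Stepsize requirement from the text: 1 >= beta*rho*lambda_max(B^T B) + 3*beta*Lg.
        beta = 0.99 / (rho * np.linalg.norm(B, 2) ** 2 + 3.0)
        y, y_old, mu = b.copy(), b.copy(), np.zeros(n - 1)
        for _ in range(iters):
            mu = np.clip(mu + rho * B @ (2 * y - y_old), -lam, lam)
            y_old = y
            y = y - beta * (y - b) - beta * B.T @ mu    # Prox_{beta*g1} is the identity since g1 = 0
        return y

    b = np.r_[np.zeros(25), np.ones(25)] + 0.1 * np.random.default_rng(1).standard_normal(50)
    print(condat_vu_tv(b, lam=0.5).round(2))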

Doubly-linearized ADMM

Consider the general problem

    minimize_{x∈R^p, y∈R^q}   f1(x) + f2(x) + g1(y) + g2(y)
    subject to   Ax + By = c.

FLiP-ADMM with P = (1/α)I − ρAᵀA and Q = (1/β)I − ρBᵀB is

    x^{k+1} = Prox_{αf1}(x^k − α(∇f2(x^k) + Aᵀu^k + ρAᵀ(Ax^k + By^k − c)))
    y^{k+1} = Prox_{βg1}(y^k − β(∇g2(y^k) + Bᵀu^k + ρBᵀ(Ax^{k+1} + By^k − c)))
    u^{k+1} = u^k + ϕρ(Ax^{k+1} + By^{k+1} − c).

We call this method doubly-linearized ADMM as it linearizes both the quadratic terms and the functions f2 and g2. The stepsize requirement is satisfied with 1 ≥ αρλ_max(AᵀA) + αL_f, 1 ≥ βρλ_max(BᵀB) + 3βL_g, and 0 < ϕ < (1 + √5)/2. This method generalizes PDHG and Condat–Vu.

Partial linearization

Consider the problem

    minimize_{x∈R^p, y∈R^q}   f2(x) + g1(y) + g2(y)
    subject to   Ax + By = c.

Assume γI + ρAᵀA is not easily invertible, but there is a C ≈ ρAᵀA such that γI + C is easily invertible, for γ > 0. Choose P = γI + C − ρAᵀA where γ > λ_max(ρAᵀA − C). Since C ≈ ρAᵀA is a close approximation, γ can be small.

In this case, the x-update of FLiP-ADMM is

    x^{k+1} = x^k − (γI + C)^{−1}(∇f2(x^k) + Aᵀu^k + ρAᵀ(Ax^k + By^k − c)).


The x-update is easy to compute, and we say it has been partially linearized. In contrast, the x-update of the doubly linearized ADMM is

    x^{k+1} = x^k − (1/δ)(∇f2(x^k) + Aᵀu^k + ρAᵀ(Ax^k + By^k − c))

for some δ > λ_max(ρAᵀA). When C ≈ ρAᵀA, partial linearization reduces the number of required iterations compared to (full) linearization. This setup arises when AᵀA is diagonally dominant in the regular basis or the discrete Fourier basis.

Example 10.1 CT imaging with total variation regularization. Consider the problem

    minimize_{x∈R^p}   ℓ(Ax − b) + λ‖Dx‖₁,

where x represents an unknown 2D or 3D image reshaped into a vector, A is the discrete Radon transform operator, b represents the measurements, D is a finite-difference operator, and ℓ is a CCP function. For simplicity, assume A has full column rank, i.e., assume AᵀA is invertible. The problem is equivalent to

    minimize_{x, y, z}   ℓ(y) + λ‖z‖₁
    subject to   [A; D] x − [y; z] = [b; 0].

PDHG applied to the equivalent problem,

    u^{k+1} = Prox_{ρℓ*}(u^k + ρA(2x^k − x^{k−1}) − ρb)
    v^{k+1} = P_{[−λ,λ]}(v^k + ρD(2x^k − x^{k−1}))
    x^{k+1} = x^k − (1/γ)(Aᵀu^{k+1} + Dᵀv^{k+1}),

has a small cost per iteration, but sometimes requires too many iterations to converge. Classical ADMM with ϕ = 1 applied to the equivalent problem,

    u^{k+1} = Prox_{ρℓ*}(u^k + ρA(2x^k − x^{k−1}) − ρb)
    v^{k+1} = P_{[−λ,λ]}(v^k + ρD(2x^k − x^{k−1}))
    x^{k+1} = x^k − (ρAᵀA + ρDᵀD)^{−1}(Aᵀu^{k+1} + Dᵀv^{k+1}),

cannot be implemented as (ρAᵀA + ρDᵀD)^{−1} is too expensive to compute. FLiP-ADMM where (i) we update the y- and z-variables first, the x-variable second, and the dual variable last and (ii) we use the proximal term P = γI + C − ρAᵀA − ρDᵀD for the x-update and ϕ = 1 for the dual update simplifies to

    u^{k+1} = Prox_{ρℓ*}(u^k + ρA(2x^k − x^{k−1}) − ρb)
    v^{k+1} = P_{[−λ,λ]}(v^k + ρD(2x^k − x^{k−1}))
    x^{k+1} = x^k − (γI + C)^{−1}(Aᵀu^{k+1} + Dᵀv^{k+1}).

See Exercise 10.2 for the derivation.

This partially linearized method can provide a significant speedup over PDHG. We can compute (γI + C)^{−1} efficiently with the fast Fourier transform; since AᵀA and DᵀD are discretizations of shift-invariant continuous operators, they are closely approximated by circulant matrices, which are diagonalizable by the discrete Fourier basis. A small γ > λ_max(ρAᵀA + ρDᵀD − C) makes P small and thereby minimizes the slowdown caused by the proximal term.

10.2.3 Block splitting

Partition x ∈ R^p into m non-overlapping blocks of sizes p1, . . . , pm. Write x = (x1, . . . , xm), so xi ∈ R^{pi} for i = 1, . . . , m. Consider the problem

    minimize_{(x1,...,xm)∈R^p}   Σ_{i=1}^m fi(xi)
    subject to   Ax = c,        (10.8)

where

    A = [A_{:,1}  A_{:,2}  · · ·  A_{:,m}],   Ax = A_{:,1}x1 + A_{:,2}x2 + · · · + A_{:,m}xm.

This problem is known as the multi-block ADMM problem or the extended monotropic program.

The objective function splits across the blocks x1, . . . , xm. However, the x-update of FLiP-ADMM couples the m blocks in general; FLiP-ADMM with no function linearization and no y-block is

    x^{k+1} ∈ argmin_{x∈R^p} { Σ_{i=1}^m fi(xi) + 〈Aᵀu^k, x〉 + (ρ/2)‖Ax − c‖² + (1/2)‖x − x^k‖²_P }
    u^{k+1} = u^k + ρ(Ax^{k+1} − c),

and the blocks x^{k+1}_1, . . . , x^{k+1}_m cannot be computed independently. In this section, we present techniques to obtain ADMM-type methods with split x-updates that can be computed independently in parallel.

Orthogonal blocks

Consider problem (10.8). When the columns of A are block-wise orthogonal, i.e., A_{:,i}ᵀA_{:,j} = 0 for all i ≠ j, the x-updates of FLiP-ADMM with P = 0 split:

    x^{k+1}_i ∈ argmin_{xi∈R^{pi}} { fi(xi) + 〈u^k − ρc, A_{:,i}xi〉 + (ρ/2)‖A_{:,i}xi‖² }   for i = 1, . . . , m
    u^{k+1} = u^k + ϕρ(Ax^{k+1} − c).


Jacobi ADMM

Consider problem (10.8). Consider the matrix

P ∈ R^{p×p} with diagonal blocks P_{ii} = γI and off-diagonal blocks P_{ij} = −ρA_{:,i}ᵀA_{:,j} for i ≠ j, which is positive semidefinite for γ ≥ ρλ_max(AᵀA). Let

    L_ρ(x, u) = Σ_{i=1}^m fi(xi) + 〈u, Ax − c〉 + (ρ/2)‖Ax − c‖².

Let x^k_{≠i} denote all components of x^k excluding x^k_i. Then FLiP-ADMM is

    x^{k+1}_i = argmin_{xi∈R^{pi}} { L_ρ(xi, x^k_{≠i}, u^k) + (γ/2)‖xi − x^k_i‖² }   for i = 1, . . . , m
    u^{k+1} = u^k + ϕρ(Ax^{k+1} − c).

This method is called Jacobi ADMM in analogy to the Jacobi method of numerical linear algebra; for i = 1, . . . , m, the update x^{k+1}_i is computed with the other blocks fixed to the older copies x^k_j for j ≠ i. The stepsize requirement is satisfied with γ ≥ ρλ_max(AᵀA) and ϕ ∈ (0, 2).

The off-diagonal blocks of P remove the interaction between the x-blocks and thereby allow the x-update to split. Although we used γI for the diagonal blocks of P, other choices are possible; they just need to be, loosely speaking, sufficiently positive to ensure P is positive semidefinite. In Exercise 10.3, we use different diagonal blocks to perform linearizations with Jacobi ADMM.
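A sketch of Jacobi ADMM (ours, with illustrative quadratic blocks fi(xi) = (1/2)‖xi − zi‖² so each block update is a small linear solve carried out with the other blocks frozen at their previous values):

    import numpy as np

    def jacobi_admm(A_blocks, c, z_blocks, rho=1.0, phi=1.0, iters=500):
        # Jacobi ADMM for  minimize sum_i (1/2)||x_i - z_i||^2  subject to  sum_i A_i x_i = c.
        gamma = rho * np.linalg.norm(np.hstack(A_blocks), 2) ** 2   # gamma >= rho*lambda_max(A^T A)
        x = [np.zeros(Ai.shape[1]) for Ai in A_blocks]
        u = np.zeros(len(c))
        for _ in range(iters):
            Ax = sum(Ai @ xi for Ai, xi in zip(A_blocks, x))
            new_x = []
            for Ai, xi, zi in zip(A_blocks, x, z_blocks):
                r = Ax - Ai @ xi - c                                # contribution of the other blocks, minus c
                H = (1.0 + gamma) * np.eye(Ai.shape[1]) + rho * Ai.T @ Ai
                g = zi - Ai.T @ u - rho * Ai.T @ r + gamma * xi
                new_x.append(np.linalg.solve(H, g))
            x = new_x
            u = u + phi * rho * (sum(Ai @ xi for Ai, xi in zip(A_blocks, x)) - c)
        return x, u

    rng = np.random.default_rng(0)
    A_blocks = [rng.standard_normal((4, 3)) for _ in range(3)]
    z_blocks = [rng.standard_normal(3) for _ in range(3)]
    c = rng.standard_normal(4)
    x, u = jacobi_admm(A_blocks, c, z_blocks, phi=1.5)   # phi in (0, 2)
    print(np.linalg.norm(sum(Ai @ xi for Ai, xi in zip(A_blocks, x)) - c))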

Dummy variables

Consider the problem

    minimize_{(x1,...,xm)∈R^p, y∈R^n}   Σ_{i=1}^m fi(xi) + g(y)
    subject to   Ax + y = c.

Introduce dummy variables z1, . . . , zm and eliminate y to get the equivalent problem

    minimize_{(x1,...,xm)∈R^p, z1,...,zm∈R^n}   Σ_{i=1}^m fi(xi) + g(c − Σ_{i=1}^m zi)
    subject to   A_{:,i}xi − zi = 0   for i = 1, . . . , m.


We apply FLiP-ADMM with P = 0, Q = 0, no function linearization, and initial u-variables satisfying u^0_1 = · · · = u^0_m. Then we can show u^k_1 = · · · = u^k_m for all k, and the iteration simplifies to

    x^{k+1}_i ∈ argmin_{xi∈R^{pi}} { fi(xi) + 〈u^k + (ρ/m)(Ax^k − z^k_sum), A_{:,i}xi〉 + (ρ/2)‖A_{:,i}(xi − x^k_i)‖² }   for i = 1, . . . , m
    z^{k+1}_sum = c − Prox_{(m/ρ)g}(c − Ax^{k+1} − (m/ρ)u^k)
    u^{k+1} = u^k + (ϕρ/m)(Ax^{k+1} − z^{k+1}_sum).

The stepsize requirement is ϕ ∈ (0, (1 + √5)/2).

10.2.4 Consensus technique

Consider the problem

    minimize_{x∈R^p}   Σ_{i=1}^n fi(x).

Use the consensus technique to get the equivalent problem

    minimize_{x1,...,xn, z∈R^p}   Σ_{i=1}^n fi(xi)
    subject to   xi = z   for i = 1, . . . , n.

Apply FLiP-ADMM with P = 0, Q = 0, no function linearization, and initial u-variables satisfying u^0_1 + · · · + u^0_n = 0 to get

    x^{k+1}_i = argmin_{xi∈R^p} { fi(xi) + 〈u^k_i, xi〉 + (ρ/2)‖xi − z^k‖² }   for i = 1, . . . , n
    z^{k+1} = (1/n) Σ_{i=1}^n x^{k+1}_i
    u^{k+1}_i = u^k_i + ϕρ(x^{k+1}_i − z^{k+1})   for i = 1, . . . , n.

The stepsize requirement is ϕ ∈ (0, (1 + √5)/2). To clarify, each xi represents a copy of the entire x and therefore has the same dimension. This contrasts with the block splitting of §10.2.3, where each xi represented a single block of x.

In this version of the consensus technique, we constrain x1, . . . , xn to equal a single z. In general, one can have multiple z-variables related through a graph structure. We explore this technique further in §11, in the context of decentralized optimization.
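As a minimal illustration (ours, with illustrative local least-squares terms fi(x) = (1/2)‖Mi x − bi‖²), the consensus iteration above; the xi-updates are independent linear solves that could run in parallel:

    import numpy as np

    def consensus_admm(Ms, bs, rho=1.0, phi=1.0, iters=300):
        # Consensus ADMM for  minimize sum_i (1/2)||M_i x - b_i||^2,
        # with each x_i a full copy of x constrained to equal the common variable z.
        n, p = len(Ms), Ms[0].shape[1]
        z = np.zeros(p)
        xs = [np.zeros(p) for _ in range(n)]
        us = [np.zeros(p) for _ in range(n)]          # sum of initial u_i is 0
        for _ in range(iters):
            xs = [np.linalg.solve(Mi.T @ Mi + rho * np.eye(p), Mi.T @ bi - ui + rho * z)
                  for Mi, bi, ui in zip(Ms, bs, us)]   # local x_i-updates (parallelizable)
            z = sum(xs) / n                            # averaging step
            us = [ui + phi * rho * (xi - z) for ui, xi in zip(us, xs)]
        return z

    rng = np.random.default_rng(0)
    Ms = [rng.standard_normal((20, 5)) for _ in range(4)]
    x_true = rng.standard_normal(5)
    bs = [Mi @ x_true for Mi in Ms]
    print(np.linalg.norm(consensus_admm(Ms, bs, phi=1.5) - x_true))   # small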

10.2.5 2-1-2 ADMM

Consider the problem

    minimize_{x∈R^p, y∈R^q}   f(x) + g(y)
    subject to   Ax + By = c.


Assume g is a strongly convex quadratic function with affine constraints, i.e.,

    g(y) = yᵀMy + μᵀy + δ_{{y∈R^q | Ny=ν}}(y)

for a positive definite M ∈ R^{q×q}, N ∈ R^{s×q}, and ν ∈ R(N). If there is no affine constraint, we set s = 0. Define

    L_ρ(x, y, u) = f(x) + g(y) + 〈u, Ax + By − c〉 + (ρ/2)‖Ax + By − c‖².

We call the method

    y^{k+1/2} = argmin_{y∈R^q} L_ρ(x^k, y, u^k)
    x^{k+1} ∈ argmin_{x∈R^p} L_ρ(x, y^{k+1/2}, u^k)
    y^{k+1} = argmin_{y∈R^q} L_ρ(x^{k+1}, y, u^k)
    u^{k+1} = u^k + ϕρ(Ax^{k+1} + By^{k+1} − c)

2-1-2 ADMM. As we soon show, 2-1-2 ADMM is an instance of FLiP-ADMM and has the stepsize requirement of ϕ ∈ (0, 2).

Derivation

As we show in Exercise 10.10, the y-update can be expressed as

    y(x) = argmin_y L_ρ(x, y, u^k) = −TBᵀAx + t(u^k)

for a symmetric positive semidefinite T ∈ R^{q×q} and t(u^k) ∈ R^q. Consider the FLiP-ADMM

    (x^{k+1}, y^{k+1}) ∈ argmin_{x∈R^p, y∈R^q} { L_ρ(x, y, u^k) + (ρ/2)‖x − x^k‖²_P }
    u^{k+1} = u^k + ϕρ(Ax^{k+1} + By^{k+1} − c)        (10.9)

with P = AᵀBTBᵀA. (In this instance of FLiP-ADMM, there is no second primal block, so there is no alternating update.) The stepsize requirement is ϕ ∈ (0, 2). The optimality condition is

    0 ∈ ∂f(x^{k+1}) + Aᵀ(u^k + ρ(Ax^{k+1} + By^{k+1} − c)) + ρAᵀBTBᵀA(x^{k+1} − x^k)
    y^{k+1} = −TBᵀAx^{k+1} + t(u^k).

On the other hand, the optimality condition of 2-1-2 ADMM is

    y^{k+1/2} = −TBᵀAx^k + t(u^k)
    0 ∈ ∂f(x^{k+1}) + Aᵀ(u^k + ρ(Ax^{k+1} + By^{k+1/2} − c))
    y^{k+1} = −TBᵀAx^{k+1} + t(u^k).

Eliminating y^{k+1/2} gives us the same optimality condition as that of (10.9). Therefore, 2-1-2 ADMM and FLiP-ADMM (10.9) are equivalent in the sense that they share the same set of iterates (x^k, y^k).


10.2.6 Trip-ADMM

Consider the more general problem

    minimize_{x∈R^p, y∈R^q}   f1(Cx) + f2(x) + g1(Dy) + g2(y)
    subject to   Ax + By = c,

where C ∈ R^{r×p} and D ∈ R^{s×q}. We can solve this problem with

    x^{k+1/2} = x^k − σ(Cᵀv^k + ∇f2(x^k) + Aᵀu^k + ρAᵀ(Ax^k + By^k − c))
    v^{k+1} = Prox_{τf1*}(v^k + τCx^{k+1/2})
    x^{k+1} = x^{k+1/2} − σCᵀ(v^{k+1} − v^k)
    y^{k+1/2} = y^k − σ(Dᵀw^k + ∇g2(y^k) + Bᵀu^k + ρBᵀ(Ax^{k+1} + By^k − c))
    w^{k+1} = Prox_{τg1*}(w^k + τDy^{k+1/2})
    y^{k+1} = y^{k+1/2} − σDᵀ(w^{k+1} − w^k)
    u^{k+1} = u^k + ρ(Ax^{k+1} + By^{k+1} − c),

which we call triple-linearized ADMM (Trip-ADMM). The stepsize requirement is satisfied with ρ > 0, σ > 0, τ > 0,

    1 ≥ σρλ_max(AᵀA) + σL_f,   1 ≥ σρλ_max(BᵀB) + 3σL_g,
    1 ≥ στλ_max(CCᵀ),   1 ≥ στλ_max(DDᵀ),

and under total duality, we have

    f1(Cx^k − σC̄C̄ᵀ(v^{k+1} − v^k)) + f2(x^k) + g1(Dy^k − σD̄D̄ᵀ(w^{k+1} − w^k)) + g2(y^k) → f1(Cx⋆) + f2(x⋆) + g1(Dy⋆) + g2(y⋆),
    C̄C̄ᵀ(v^{k+1} − v^k) → 0,   D̄D̄ᵀ(w^{k+1} − w^k) → 0,
    Ax^k + By^k − c → 0,

where C̄C̄ᵀ = (1/(τσ))I − CCᵀ and D̄D̄ᵀ = (1/(τσ))I − DDᵀ, as in the derivation below.

Derivation

First, we show that for any CCP function h on R^n and A ∈ R^{m×n},

    μ⁺ ∈ argmin_{μ∈R^m} { h*(μ) − 〈z0, Aᵀμ〉 + (α/2)‖Aᵀμ‖² },   x⁺ = z0 − αAᵀμ⁺        (10.10)

    ⇒   x⁺ = argmin_{x∈R^n} { h(Ax) + (1/(2α))‖x − z0‖² },

provided that an argmin μ⁺ exists. This follows from

    x⁺ = z0 − αAᵀμ⁺, μ⁺ is an argmin
    ⇔ x⁺ = z0 − αAᵀμ⁺,   0 ∈ ∂h*(μ⁺) − Az0 + αAAᵀμ⁺
    ⇔ x⁺ = z0 − αAᵀμ⁺,   Ax⁺ ∈ ∂h*(μ⁺)
    ⇔ Aᵀμ⁺ + (1/α)(x⁺ − z0) = 0,   μ⁺ ∈ ∂h(Ax⁺)
    ⇔ 0 ∈ Aᵀ∂h(Ax⁺) + (1/α)(x⁺ − z0)
    ⇒ x⁺ is the argmin.

We now start the derivation. Consider the equivalent problem

    minimize_{x∈R^p, x̄∈R^r, y∈R^q, ȳ∈R^s}   f1(Cx + C̄x̄) + f2(x) + g1(Dy + D̄ȳ) + g2(y)
    subject to   Ax + By = c
                 (1/√(ρσ)) x̄ = 0
                 (1/√(ρσ)) ȳ = 0,

where C̄ ∈ R^{r×r} and D̄ ∈ R^{s×s}. FLiP-ADMM applied to this problem is

    (x^{k+1}, x̄^{k+1}) ∈ argmin_{x∈R^p, x̄∈R^r} { f1(Cx + C̄x̄) + 〈∇f2(x^k) + Aᵀu^k, x〉 + (1/√(ρσ))〈ū_x^k, x̄〉
                          + (ρ/2)‖Ax + By^k − c‖² + (1/(2σ))‖x̄‖² + (1/2)‖x − x^k‖²_P + (1/2)‖x̄ − x̄^k‖²_P̄ }
    (y^{k+1}, ȳ^{k+1}) ∈ argmin_{y∈R^q, ȳ∈R^s} { g1(Dy + D̄ȳ) + 〈∇g2(y^k) + Bᵀu^k, y〉 + (1/√(ρσ))〈ū_y^k, ȳ〉
                          + (ρ/2)‖Ax^{k+1} + By − c‖² + (1/(2σ))‖ȳ‖² + (1/2)‖y − y^k‖²_Q + (1/2)‖ȳ − ȳ^k‖²_Q̄ }
    u^{k+1} = u^k + ϕρ(Ax^{k+1} + By^{k+1} − c)
    ū_x^{k+1} = ū_x^k + ϕ√(ρ/σ) x̄^{k+1}
    ū_y^{k+1} = ū_y^k + ϕ√(ρ/σ) ȳ^{k+1},

where ū_x and ū_y are the dual variables associated with the constraints on x̄ and ȳ. Let

    ϕ = 1,   P = (1/σ)I − ρAᵀA,   P̄ = 0,   Q = (1/σ)I − ρBᵀB,   Q̄ = 0.


Then we have

    (x^{k+1}, x̄^{k+1}) ∈ argmin_{x∈R^p, x̄∈R^r} { f1(Cx + C̄x̄) + (1/(2σ))‖x̄ + (√σ/√ρ) ū_x^k‖²
                          + (1/(2σ))‖x − x^k + σ(∇f2(x^k) + Aᵀu^k + ρAᵀ(Ax^k + By^k − c))‖² }
    (y^{k+1}, ȳ^{k+1}) ∈ argmin_{y∈R^q, ȳ∈R^s} { g1(Dy + D̄ȳ) + (1/(2σ))‖ȳ + (√σ/√ρ) ū_y^k‖²
                          + (1/(2σ))‖y − y^k + σ(∇g2(y^k) + Bᵀu^k + ρBᵀ(Ax^{k+1} + By^k − c))‖² }
    u^{k+1} = u^k + ρ(Ax^{k+1} + By^{k+1} − c)
    ū_x^{k+1} = ū_x^k + √(ρ/σ) x̄^{k+1}
    ū_y^{k+1} = ū_y^k + √(ρ/σ) ȳ^{k+1}.

Using (10.10), we get

    v^{k+1} = argmin_{v∈R^r} { f1*(v) + 〈(√σ/√ρ) ū_x^k, C̄ᵀv〉 + (σ/2)(‖Cᵀv‖² + ‖C̄ᵀv‖²)
                               − 〈x^k − σ(∇f2(x^k) + Aᵀu^k + ρAᵀ(Ax^k + By^k − c)), Cᵀv〉 }
    x^{k+1} = x^k − σ(∇f2(x^k) + Aᵀu^k + ρAᵀ(Ax^k + By^k − c)) − σCᵀv^{k+1}
    x̄^{k+1} = −(√σ/√ρ) ū_x^k − σC̄ᵀv^{k+1}
    w^{k+1} = argmin_{w∈R^s} { g1*(w) + 〈(√σ/√ρ) ū_y^k, D̄ᵀw〉 + (σ/2)(‖Dᵀw‖² + ‖D̄ᵀw‖²)
                               − 〈y^k − σ(∇g2(y^k) + Bᵀu^k + ρBᵀ(Ax^{k+1} + By^k − c)), Dᵀw〉 }
    y^{k+1} = y^k − σ(∇g2(y^k) + Bᵀu^k + ρBᵀ(Ax^{k+1} + By^k − c)) − σDᵀw^{k+1}
    ȳ^{k+1} = −(√σ/√ρ) ū_y^k − σD̄ᵀw^{k+1}
    u^{k+1} = u^k + ρ(Ax^{k+1} + By^{k+1} − c)
    ū_x^{k+1} = ū_x^k + √(ρ/σ) x̄^{k+1} = −√(ρσ) C̄ᵀv^{k+1}
    ū_y^{k+1} = ū_y^k + √(ρ/σ) ȳ^{k+1} = −√(ρσ) D̄ᵀw^{k+1}.


Eliminating some variables, we get

    v^{k+1} = argmin_{v∈R^r} { f1*(v) − σ〈v^k, C̄C̄ᵀv〉 + (σ/2)(‖Cᵀv‖² + ‖C̄ᵀv‖²)
                               − 〈x^k − σ(∇f2(x^k) + Aᵀu^k + ρAᵀ(Ax^k + By^k − c)), Cᵀv〉 }
    x^{k+1} = x^k − σ(∇f2(x^k) + Aᵀu^k + ρAᵀ(Ax^k + By^k − c) + Cᵀv^{k+1})
    w^{k+1} = argmin_{w∈R^s} { g1*(w) − σ〈w^k, D̄D̄ᵀw〉 + (σ/2)(‖Dᵀw‖² + ‖D̄ᵀw‖²)
                               − 〈y^k − σ(∇g2(y^k) + Bᵀu^k + ρBᵀ(Ax^{k+1} + By^k − c)), Dᵀw〉 }
    y^{k+1} = y^k − σ(∇g2(y^k) + Bᵀu^k + ρBᵀ(Ax^{k+1} + By^k − c) + Dᵀw^{k+1})
    u^{k+1} = u^k + ρ(Ax^{k+1} + By^{k+1} − c).

Next, we set

    C̄C̄ᵀ = (1/(τσ))I − CCᵀ,   D̄D̄ᵀ = (1/(τσ))I − DDᵀ

to get

    v^{k+1} = Prox_{τf1*}(v^k + τC(x^k − σ(∇f2(x^k) + Aᵀu^k + ρAᵀ(Ax^k + By^k − c) + Cᵀv^k)))
    x^{k+1} = x^k − σ(∇f2(x^k) + Aᵀu^k + ρAᵀ(Ax^k + By^k − c) + Cᵀv^{k+1})
    w^{k+1} = Prox_{τg1*}(w^k + τD(y^k − σ(∇g2(y^k) + Bᵀu^k + ρBᵀ(Ax^{k+1} + By^k − c) + Dᵀw^k)))
    y^{k+1} = y^k − σ(∇g2(y^k) + Bᵀu^k + ρBᵀ(Ax^{k+1} + By^k − c) + Dᵀw^{k+1})
    u^{k+1} = u^k + ρ(Ax^{k+1} + By^{k+1} − c).

The minimization subproblems are solvable since they are evaluations of proximal operators of the CCP functions f1 and g1. We further simplify to get

    x^{k+1/2} = x^k − σ(Cᵀv^k + ∇f2(x^k) + Aᵀu^k + ρAᵀ(Ax^k + By^k − c))
    v^{k+1} = Prox_{τf1*}(v^k + τCx^{k+1/2})
    x^{k+1} = x^{k+1/2} − σCᵀ(v^{k+1} − v^k)
    y^{k+1/2} = y^k − σ(Dᵀw^k + ∇g2(y^k) + Bᵀu^k + ρBᵀ(Ax^{k+1} + By^k − c))
    w^{k+1} = Prox_{τg1*}(w^k + τDy^{k+1/2})
    y^{k+1} = y^{k+1/2} − σDᵀ(w^{k+1} − w^k)
    u^{k+1} = u^k + ρ(Ax^{k+1} + By^{k+1} − c).


10.3 Bregman methods

A class of methods called the Bregman methods was developed and popularized in the image processing community. Later, it was discovered that the Bregman methods are related to the method of multipliers and ADMM. In this section, we briefly describe the relationship.

Bregman distance. Let f be a CCP function. When f is differentiable, we define the f-induced Bregman distance or Bregman divergence as

    D_f(x, y) = f(x) − f(y) − 〈∇f(y), x − y〉.

When f is not differentiable, we use

    D_f^v(x, y) = f(x) − f(y) − 〈v, x − y〉,

where v ∈ ∂f(y).

D_f generalizes the squared Euclidean distance since D_f(x, y) = ‖x − y‖² when f(x) = ‖x‖². D_f(x, y) ≥ 0 follows from convexity of f. However, despite its name, the Bregman “distance” is not a mathematical distance (metric). In particular, D_f(x, y) may not equal D_f(y, x), and D_f(x, y) = 0 may hold for x ≠ y.
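A two-line sketch of the definition (ours): with f the negative entropy Σᵢ xᵢ log xᵢ, the Bregman distance is the (generalized) Kullback–Leibler divergence, which is nonnegative but not symmetric:

    import numpy as np

    def bregman(f, grad_f, x, y):
        # D_f(x, y) = f(x) - f(y) - <grad f(y), x - y>, for differentiable f.
        return f(x) - f(y) - grad_f(y) @ (x - y)

    f = lambda x: np.sum(x * np.log(x))   # negative entropy
    g = lambda x: np.log(x) + 1.0
    x, y = np.array([0.2, 0.8]), np.array([0.5, 0.5])
    print(bregman(f, g, x, y), bregman(f, g, y, x))   # nonnegative, not symmetric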

Bregman method and method of multipliers. Consider the problem

    minimize_{x∈R^n}   f(x)
    subject to   Ax = b,        (10.11)

where f is CCP, A ∈ R^{m×n}, and b ∈ R^m. Let h(x) = h̄(Ax − b) for some differentiable CCP function h̄ such that h̄(0) = 0 and h̄(u) > 0 for u ≠ 0. When f is differentiable, the Bregman method is

    x^{k+1} = argmin_{x∈R^n} { D_f(x, x^k) + ρh(x) },

where ρ is a scalar. When f is not differentiable, the Bregman method is

    x^{k+1} = argmin_{x∈R^n} { D_f^{v^k}(x, x^k) + ρh(x) }
    v^{k+1} = v^k − ρ∇h(x^{k+1}),

where v^0 ∈ ∂f(x^0). The optimality condition of the first step ensures

    v^{k+1} = v^k − ρ∇h(x^{k+1}) ∈ ∂f(x^{k+1}).

Since the argmin depends on x^k only through v^k ∈ ∂f(x^k), we can pick any v^0 ∈ range(∂f) without explicitly specifying x^0. The method converges under certain mild conditions.

When h(x) = (1/2)‖Ax − b‖², the Bregman method with the change of variables v^k = −Aᵀu^k coincides with the method of multipliers

    x^{k+1} ∈ argmin_{x∈R^n} { f(x) + 〈u^k, Ax〉 + (ρ/2)‖Ax − b‖² }
    u^{k+1} = u^k + ρ(Ax^{k+1} − b).
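A minimal sketch (ours) for the differentiable case f(x) = (1/2)‖x‖² and h(x) = (1/2)‖Ax − b‖²; by the change of variables above this is the method of multipliers for minimizing (1/2)‖x‖² subject to Ax = b, and the iterates approach the minimum-norm solution:

    import numpy as np

    def bregman_min_norm(A, b, rho=1.0, iters=300):
        # Bregman step: x^{k+1} = argmin_x D_f(x, x^k) + rho*h(x), here a linear solve
        # since D_f(x, x^k) = (1/2)||x - x^k||^2 for f = (1/2)||.||^2.
        m, n = A.shape
        x = np.zeros(n)
        H = np.eye(n) + rho * A.T @ A
        for _ in range(iters):
            x = np.linalg.solve(H, x + rho * A.T @ b)
        return x

    rng = np.random.default_rng(0)
    A, b = rng.standard_normal((3, 8)), rng.standard_normal(3)
    x = bregman_min_norm(A, b)
    print(np.linalg.norm(A @ x - b), np.linalg.norm(x - np.linalg.pinv(A) @ b))  # both small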


Split Bregman method and ADMM. Consider the problem

    minimize_{x∈R^p, y∈R^q}   f(x) + g(y)
    subject to   Ax + y = c,

where f and g are CCP functions, A ∈ R^{q×p}, and c ∈ R^q. Let h(x, y) = (1/2)‖Ax + y − c‖² and apply the Bregman method to f(x) + g(y) to get

    (x^{k+1}, y^{k+1}) ∈ argmin_{x∈R^p, y∈R^q} { D_f^{v^k}(x, x^k) + D_g^{u^k}(y, y^k) + (ρ/2)‖Ax + y − c‖² }
    v^{k+1} = v^k − ρAᵀ(Ax^{k+1} + y^{k+1} − c)
    u^{k+1} = u^k − ρ(Ax^{k+1} + y^{k+1} − c).

Using v^k = Aᵀu^k, we eliminate v^k to get

    (x^{k+1}, y^{k+1}) ∈ argmin_{x∈R^p, y∈R^q} { f(x) + g(y) − 〈u^k, Ax + y〉 + (ρ/2)‖Ax + y − c‖² }
    u^{k+1} = u^k − ρ(Ax^{k+1} + y^{k+1} − c).

The split Bregman method computes x^{k+1} and y^{k+1} approximately through alternating updates. When only one pass of sequential minimization over x and then y is performed, the split Bregman method coincides with ADMM.

Conclusion

In this section, we established the convergence of FLiP-ADMM and presented techniques for applying the method to a wide range of problems. As we will see through the exercises, these techniques are modular and can be combined to solve problems with complicated structures.

The analysis of FLiP-ADMM differs from that of Part I, which derived various methods, including ADMM-type methods, as instances of monotone operator methods. There are two distinct approaches to analyzing the classical ADMM. The first is to derive ADMM from DRS, as we had done in §3. This approach leads to a possible intermediate dual variable update, as we discuss in Exercise 10.1. The second is to construct a Lyapunov function and analyze convergence directly. This approach leads to a possible dual extrapolation parameter ϕ. To the best of our knowledge, the fully general FLiP-ADMM cannot be reduced to a monotone operator splitting method and therefore must be analyzed directly with a Lyapunov function.

ADMM-type methods are “splitting methods” in that they decompose the optimization problem into smaller, simpler pieces and operate on them separately. They are intimately related to monotone operator methods, although, strictly speaking, they are not monotone operator methods themselves.


Notes and references

FLiP-ADMM. The FLiP-ADMM presented in this chapter and the proof of Theorem 17 are new. More specifically, while the individual components of the method (dual extrapolation parameter, function linearization, and proximal terms) are known techniques within the ADMM literature, combining them all into a single method is a novel contribution of this chapter.

Early development. ADMM dates back to the 1970s. When Glowinski and Marroco were studying nonlinear Dirichlet problems of the form

    minimize_{v∈V}   f(Av) + g(v),        (10.12)

they reformulated it as

    minimize_{u∈AV, v∈V}   f(u) + g(v)
    subject to   u − Av = 0        (10.13)

and applied Hestenes and Powell’s augmented Lagrangian method (ALM) [Hes69, Pow69]. In [GM75], Glowinski and Marroco proposed to solve the ALM subproblem by updating u and v in an alternating manner while the Lagrange multipliers are fixed until a stopping criterion is met. Their approach hinted at ADMM. ADMM for (10.13) was first presented with a convergence proof by Gabay and Mercier [GM76, Algorithm 3.4]. They credited [GM74] for numerical experiments of their algorithm without proof. Gabay and Mercier’s proof assumes g is linear and establishes subsequence-convergence of the dual iterates for ϕ ∈ (0, 2). When the objective functions are differentiable, they obtained convergence of the dual iterates. Then Glowinski and Fortin [FG83] took a large step forward by studying general convex f and g. They proved subsequence-convergence of the dual iterates for ϕ ∈ (0, (√5 + 1)/2). Later, using [Opi67], Glowinski and Le Tallec obtained convergence of the dual iterates (weak convergence in infinite-dimensional Hilbert spaces) [GLT89]. Bertsekas and Tsitsiklis also presented a proof in [BT89, §3.4].

Relationship with DRS. In [Roc76b], Rockafellar showed that ALM for a linearly constrained convex problem is the proximal point algorithm applied to its dual problem, i.e., “ALM = PPA to the dual”. Gabay [Gab83] extended this result to “ADMM = DRS to the dual” and “PRS-ADMM = PRS to the dual.” (We discuss PRS-ADMM in Exercise 10.1.) Following Gabay’s naming in [Gab83], many numerical analysts refer to ALM, ADMM, and PRS-ADMM as ALG1, ALG2, and ALG3, respectively. Using this characterization, Eckstein generalized ADMM to allow the subproblems to be solved inexactly [Eck89b, Proposition 4.7].

Although ADMM is equivalent to DRS applied to the dual, one can still directly apply ADMM to the dual. Eckstein [Eck89b, Chapter 3.5] showed that ADMM is equivalent to DRS applied to the primal problem when A = I, Eckstein and Fukushima [EF94] showed the same for certain special problems, and Yan and Yin [YY16] extended their results to general ADMM.

Update order. Swapping the order of the two subproblems of ADMM leads to different iterates in general, and Bauschke and Moursi studied the dependence of the iterates on this order [BM16]. Yan and Yin [YY16] showed that the iterates generated by the two orders are in fact equivalent if one of the functions is quadratic. While either of the two orders leads to convergence, repeatedly switching the orders causes divergence, as demonstrated by an example in [YY16]. Sun, Luo, and Ye [SLY15] showed that, for solving linear systems of equations, ADMM under randomly permuted orders converges in expectation.


Solvability and regularity conditions. For the ADMM iterations to be well defined, one must either assume certain regularity conditions or directly assume the subproblems are solvable. This fact was overlooked by several authors until it was pointed out by Chen, Sun, and Toh in [CST17].

Parameter selection. The general question of how to optimally choose the scalar parameters ϕ and ρ is open. Ghadimi, Teixeira, Shames, and Johansson characterized the optimal parameters for the specific problem of ℓ2-regularized and constrained quadratic programming [GTSJ15]. There has been extensive work presenting adaptive methods for tuning ρ in various settings [HYW00, Woh17, XFG17, XFY+17, XLLY17].

Dual extrapolation parameter. The first appearance of ϕ ∈ (0, (1 + √5)/2), the golden ratio range of the dual extrapolation parameter, is due to Fortin and Glowinski [FG83, Glo84]. Xu showed that the same golden ratio range can be used for proximal ADMM [Xu07]. Tao and Yuan showed that when both f and g are quadratics, the parameter range extends to ϕ ∈ (0, 2) [TY18].

Techniques. Rockafellar presented the proximal method of multipliers [Roc76a]. Chen and Teboulle presented the predictor corrector proximal multiplier method [CT94]. Shefi and Teboulle [ST14] later identified that Chen and Teboulle’s method is an instance of what we call the linearized method of multipliers. Eckstein presented proximal ADMM and showed it is Douglas–Rachford splitting applied to the saddle subdifferential [Eck94]. He, Liao, Han, and Yang further generalized proximal ADMM [HLHY02]. However, what we call the “linearization technique”, where one chooses the proximal term carefully to cancel out and linearize quadratic terms, was likely unknown at the time of these publications. The first explicit description of the linearization technique is due to the concurrent work of Deng and Yin [DY16b] and Shefi and Teboulle [ST14].

Partial linearization in ADMM was first proposed by Deng and Yin [DY16b], and was applied to CT and PET imaging by Ryu, Ko, and Won [RKW19]. In particular, the method presented in Example 10.1 is the near-circulant splitting of [RKW19].

Function linearization in ADMM was first presented by Yang and Zhang [YZ11] and Yang and Yuan [YY13a]; they applied function linearization to one quadratic subproblem. Lin, Ma, and Zhang [LMZ17] applied function linearization to general smooth convex functions in an ADMM-type method. Banert, Bot, and Csetnek [BBC16] applied function linearization to one function in ADMM with proximal terms and B = I.

Jacobi-ADMM was first presented in [DLPY17, GHY14, HHY15, Tao14, HXY16, BK19]. Similar but different methods are proposed and analyzed in [TY12, WS17, WHML15].

The 2-1-2 technique was first presented by Sun, Toh, Yang, and Li in [STY15, LST16]. They also showed that the technique can be applied twice and that it can be combined with linearizations, as we do in Exercises 10.11 and 10.12. The 2-1-2 technique has also been used as the basis for the symmetric Gauss–Seidel ADMM of Li, Sun, and Toh [LST19].

3-block ADMM. Since the early 2010s, there have been attempts to generalize ADMM to three or more blocks of primal variables. Chen, He, Ye, and Yuan [CHYY14] showed that the direct extension of ADMM to three blocks with sequential updates does not converge. On the other hand, the multi-block generalization does converge with additional assumptions or modifications [HY12a, CSY13, LMZ15b, LST15, DY17b, LMZ16, HL16].

Applications. In image processing, Wang, Yang, Yin, and Zhang [WYYZ08] formulated a total variation deblurring problem with (10.13) such that both subproblems have closed-form solutions; however, their method is inexact ALM rather than ADMM. In compressed sensing, Yang and Zhang [YZ11] applied ADMM to ℓ1-optimization. Wen, Goldfarb, and Yin [WGY10] used ADMM to solve semi-definite and conic programs, and O’Donoghue, Chu, Parikh, and Boyd [OCPB16] applied ADMM to the self-dual homogeneous embedding reformulation of conic programs. Lin, Ma, Ye, and Zhang [LMYZ18] used ADMM on a barrier formulation of linear programming. Yuan and Yang [YY13b, YY13a] used ADMM to recover sparse and low-rank components of a matrix. Yuan [Yua12] used ADMM for covariance matrix selection. Ma, Xue, and Zou [MXZ13] used ADMM for graphical model selection. Liu et al. [LMT+10] used ADMM for metric learning. Due to its popularity, ADMM has numerous other applications.

Relationship with Bregman methods. Bregman methods successively minimize a sequence of Bregman distances instead of updating Lagrange multipliers. Osher et al. [OBG+05] proposed the Bregman method in the context of image processing for finding a member of {x | ‖Ax − b‖ ≤ ε}, for some small ε > 0, that serves as a good reconstruction of the original image. This motivation is why the Bregman method looks like a method for minimizing h, rather than minimizing f. Osher et al. show h(x^k) → min h monotonically in [OBG+05]. Yin, Osher, Goldfarb, and Darbon proved the Bregman method converges to a solution of the optimization problem [YOGD08, Sections 5.1, 5.2] and argued that the method is equivalent to the method of multipliers, and more generally the augmented Lagrangian method, under a change of variables [YOGD08, Section 3.4]. In [YOGD08, Section 5.3], they introduced the linearized Bregman iteration. Both linearized Bregman and the linearized method of multipliers have simpler subproblems, but they are not equivalent. Goldstein and Osher [GO09] introduced the split Bregman method, equivalent to the method of multipliers under a change of variables, and reported great empirical performance with a “single pass”, which is equivalent to ADMM for (10.13). Zhang, Burger, and Osher [ZBO11] linearized a split Bregman subproblem, obtaining a method that is equivalent to linearized ADMM for (10.13). These works helped to popularize ADMM-type methods in image processing. For other convergence results of the Bregman method, see [OBG+05] for the general setting, [YOGD08, YO13] for the ℓ1-norm and piecewise-linear functions, and [JZZ09].

Other ADMM-type methods. Wang and Banerjee [WB13] extended ADMM to the online setting. Suzuki [Suz13] and Ouyang, He, Tran, and Gray [OHTG13] concurrently extended ADMM to taking stochastic samples. Suzuki [Suz14], Zhong and Kwok [ZK14], and Zheng and Kwok [ZK16] presented accelerated stochastic ADMM using variance reduction techniques. Yi and Pavel [YP19] applied ADMM to computing generalized Nash equilibrium. Chen, Chan, Ma, and Yang [CCMY15] and Bot, Csetnek, and Hendrich [BCH15] presented ADMM and DRS with inertial acceleration.

Convergence rates. Researchers have used different quantities to measure the rate of ADMM convergence. He and Yuan [HY12c] established an O(1/k) rate applied to the violation of a variational-inequality optimality condition of ADMM. Monteiro and Svaiter [MS13] showed another O(1/k) rate applied to the sizes of some approximate subgradients and their approximation levels. Shefi and Teboulle [ST14] analyzed the convergence rates of the proximal and linearized ADMM. He and Yuan [HY15] showed fixed-point residuals decay at an O(1/k) rate. Davis and Yin [DY16a, DY17a] presented a comprehensive list of rates for the function value and constraint violation corresponding to different smoothness and strong convexity assumptions. They also improved the O(1/k) rate of some setups to o(1/k) and established tightness of the o(1/k) rate by extending examples from [BBCN+14]. Lin, Ma, and Zhang [LMZ15b] presented conditions for this rate to hold for multi-block ADMM. Deng, Lai, Peng, and Yin established an o(1/k) rate for Jacobi-ADMM [DLPY17].

With some additional assumptions, ADMM can be modified to achieve an accelerated sublinear rate. Assuming strong convexity of f and g, Goldstein, O’Donoghue, Setzer, and Baraniuk [GOSB14] used a modified ADMM to achieve an O(1/k²) rate. Ouyang, Chen, Lan, and Pasiliao [OCLP15] introduced an accelerated linearized ADMM that also linearizes one objective function f. Assuming f is L_f-Lipschitz differentiable, they established a rate of the form O(L_g/k² + C/k), where C includes quantities independent of L_g. Xu [Xu17] proposed a modified linearized ADMM that achieves a full O(1/k²) rate when either f or g1 + g2 is strongly convex. His method linearizes the Lipschitz-differentiable function g2.

2+C/k), where C includes quantities independentof Lg. Xu [Xu17] proposed a modified linearized ADMM that achieves a full O(1/k2)rate when either f or g1 + g2 is strongly convex. His method linearizes the Lipschitzdifferentiable function g2.

Under different combinations of assumptions, ADMM converges linearly. Deng and Yin [DY16b] proved linear convergence of ADMM for four combinations, and Davis and Yin [DY17a, §6] discovered more combinations. Giselsson and Boyd [GB17] and Ryu, Taylor, Bergeling, and Giselsson [RTBG18] obtained tight linear rates for certain combinations. Hong and Luo [HL16] established linear convergence for multi-block ADMM under an “error-bound” condition. Lin, Ma, and Zhang [LMZ15a] established linear convergence for multi-block ADMM under certain strong convexity, smoothness, and rank conditions.

Linear convergence of ADMM has also been studied for specific classes of problems. Eckstein and Bertsekas [EB90] proved it for linear programming. Bauschke et al. [BBCN+14] showed that, for finding an intersection point of two subspaces, the linear rate of ADMM is the cosine of the Friedrichs angle between the subspaces. Boley [Bol13] analyzed linear convergence of ADMM in different phases when applied to certain quadratic problems. Raghunathan and Di Cairano [RDC14] related the linear rate of ADMM applied to quadratic programming with linear equality and bound constraints to the spectrum of the Hessian and the Friedrichs angle. Aspelmeier, Charitha, and Luke [ACL16] established eventual linear convergence based on metric subregularity. Liang, Fadili, and Peyre [LFP17] studied local linear convergence and manifold identification.

Miscellaneous. Eckstein showed that when total duality fails, ADMM diverges in the sense that it generates unbounded iterates [Eck89b, Proposition 4.8]. There has been extensive work characterizing the manner in which the iterates diverge and using the divergent iterates to identify pathological problems [RDC14, SBG+17, BGSB19, LRY19, RLY19].

Given a Douglas–Rachford splitting operator T_DRS, there generally is no function f such that T_DRS = Prox_f [DY16a], but, when it exists, it is called the Douglas–Rachford envelope. Patrinos, Stella, Bemporad, and Themelis showed that the Douglas–Rachford envelope exists under smoothness conditions and used it to accelerate DRS and ADMM [PSB14, TP17].

Finally, the review papers by Boyd et al. [BPC+11], Eckstein and Yao [EY12], and Glowinski [Glo14] serve as excellent tutorials on ADMM.


Exercises

10.1 Peaceman–Rachford ADMM. In §3, we derived ADMM as an instance of DRS. One may wonder: what if we used the FPI with

    (1 − θ)I + θR_{α∂f}R_{α∂g},

where f and g are as defined in §3.1 or 3.2, with θ ∈ (0, 1]? Do we recover the golden ratio ADMM? The answer is no. Show that we instead get

    x^{k+1} = argmin_x { L(x, y^k, u^k) + (α/2)‖Ax + By^k − c‖² }
    u^{k+1/2} = u^k + α(2θ − 1)(Ax^{k+1} + By^k − c)
    y^{k+1} = argmin_y { L(x^{k+1}, y, u^{k+1/2}) + (α/2)‖Ax^{k+1} + By − c‖² }
    u^{k+1} = u^{k+1/2} + α(Ax^{k+1} + By^{k+1} − c).

When θ = 1, this method is called Peaceman–Rachford ADMM. Although Peaceman–Rachford ADMM does not converge in general, it can be faster than regular ADMM under additional assumptions [Gab83, HLWY14].

10.2 Provide the derivation of Example 10.1.

Hint. First show that FLiP-ADMM can be written as

    y^{k+1} = Prox_{(1/ρ)ℓ}(Ax^k − b + (1/ρ)ξ^k)
    z^{k+1} = Prox_{(λ/ρ)‖·‖₁}(Dx^k + (1/ρ)ζ^k)
    x^{k+1} = x^k − (γI + C)^{−1}(Aᵀ(ξ^k + ρ(Ax^k − b) − ρy^{k+1}) + Dᵀ(ζ^k + ρDx^k − ρz^{k+1}))
    ξ^{k+1} = ξ^k + ρ(Ax^{k+1} − b − y^{k+1})
    ζ^{k+1} = ζ^k + ρ(Dx^{k+1} − z^{k+1}),

where ξ^{k+1} and ζ^{k+1} are the dual variables. Then apply the Moreau identity (2.11).

10.3 Jacobi doubly linearized ADMM. Consider the problem

    minimize_{x∈R^p}   Σ_{i=1}^m (fi(xi) + hi(xi))
    subject to   Ax = c,

where x = (x1, . . . , xm), f1, . . . , fm are proximable CCP functions, and h1, . . . , hm are differentiable CCP functions. Find a method analogous to Jacobi ADMM with a split x-update utilizing the proximal operators of f1, . . . , fm. What are the stepsize requirements?

10.4 Jacobi+1. Consider the problem

    minimize_{x∈R^p, y∈R^q}   Σ_{i=1}^m fi(xi) + g(y)
    subject to   Ax + By = c.

Find a method analogous to Jacobi ADMM that performs a split x-update and then the y-update. What are the stepsize requirements?


10.5 More dummy variables. Consider the problem

minimize_{x∈R^p, y∈R^q}   Σ_{i=1}^m f_i(x_i) + Σ_{j=1}^ℓ g_j(y_j)
subject to   Ax + By = c,

where x = (x_1, . . . , x_m) with x_i ∈ R^{p_i} for i = 1, . . . , m and y = (y_1, . . . , y_ℓ) with y_j ∈ R^{q_j} for j = 1, . . . , ℓ. Introduce dummy variables ξ_1, . . . , ξ_m and ζ_1, . . . , ζ_ℓ to get the equivalent problem

minimize_{(x_1,...,x_m)∈R^p, (y_1,...,y_ℓ)∈R^q, ξ_1,...,ξ_m, ζ_1,...,ζ_ℓ∈R^n}   Σ_{i=1}^m f_i(x_i) + Σ_{j=1}^ℓ g_j(y_j)
subject to   A_{:,i}x_i = ξ_i   for i = 1, . . . , m
             B_{:,j}y_j = ζ_j   for j = 1, . . . , ℓ
             Σ_{i=1}^m ξ_i + Σ_{j=1}^ℓ ζ_j = c.

Find an ADMM-type method with split x- and y-updates. In particular, apply FLiP-ADMM with the x- and ζ-variables updated first, the y- and ξ-variables updated second, and the dual variables updated last. What are the stepsize requirements?

10.6 The exchange problem. Consider the problem

minimize_{x_1,...,x_m∈R^n}   Σ_{i=1}^m f_i(x_i)
subject to   x_1 + · · · + x_m = b,
             x_1, . . . , x_m ≥ 0.

Assume f_i is L_f-smooth and CCP and evaluating ∇f_i is efficient, for i = 1, . . . , m. Provide an ADMM-type method to efficiently solve this problem. What are the stepsize requirements?

Remark. The economics interpretation of this problem is as follows. There are n goods with total amounts b_1, . . . , b_n. There are m agents exchanging these goods while preserving the total amounts. Agents cannot hold a negative amount of any good. The goal is to find the optimal exchange that minimizes the global cost of all agents.

10.7 Canonical sharing problem. Consider the problem

minimize_{x_1,...,x_m∈R^n}   f( Σ_{i=1}^m x_i ) + Σ_{i=1}^m g_i(x_i).

Assume f, g_1, . . . , g_m are proximable CCP functions. Provide an ADMM-type method to efficiently solve this problem. What are the stepsize requirements?

Remark. The economics interpretation of this problem is as follows. There are m agents sharing n common resources. The common shared cost is represented by f and the individual costs by g_i. For example, f may represent the shared cost of pollution, while g_i represents the individual cost (or negative gain) incurred by performing actions producing pollutants. The goal is to minimize the sum of the global and all individual costs.

10.8 Model parallelism vs. data parallelism. Consider the problem

minimize_{x∈R^p}   f(x) + g(Ax − b),

where A ∈ R^{n×p} and b ∈ R^n. Partition x ∈ R^p into m non-overlapping blocks of sizes p_1, . . . , p_m, and write x = (x_1, . . . , x_m), so x_i ∈ R^{p_i} for i = 1, . . . , m. Partition z ∈ R^n into ℓ non-overlapping blocks of sizes n_1, . . . , n_ℓ, and write z = (z_1, . . . , z_ℓ), so z_j ∈ R^{n_j} for j = 1, . . . , ℓ. Assume f(x) = f_1(x_1) + · · · + f_m(x_m), where f_1, . . . , f_m are CCP functions. Assume g(z) = g_1(z_1) + · · · + g_ℓ(z_ℓ), where g_1, . . . , g_ℓ are CCP functions.

The problem is equivalent to

minimize_{(x_1,...,x_m)∈R^p, z∈R^n}   Σ_{i=1}^m f_i(x_i) + g(z − b)
subject to   Σ_{i=1}^m A_{:,i}x_i = z,

where

A = [ A_{:,1}  A_{:,2}  · · ·  A_{:,m} ],    Ax = A_{:,1}x_1 + A_{:,2}x_2 + · · · + A_{:,m}x_m.

Provide an ADMM-type method that updates the blocks x_1, . . . , x_m in parallel. The update for x_i should access the data A_{:,i} for i = 1, . . . , m. What are the stepsize requirements?

The problem is also equivalent to

minimize_{x_1,...,x_ℓ, y∈R^p}   Σ_{j=1}^ℓ g_j(A_{j,:}x_j − b_j) + f(y)
subject to   x_j = y,   j = 1, . . . , ℓ,

where

A = [ A_{1,:} ; A_{2,:} ; . . . ; A_{ℓ,:} ]  (rows stacked vertically),    Ax = ( A_{1,:}x, A_{2,:}x, . . . , A_{ℓ,:}x ).

Provide an ADMM-type method that updates the copies x_1, . . . , x_ℓ in parallel. The update for x_j should access the data A_{j,:} and b_j for j = 1, . . . , ℓ. What are the stepsize requirements?

Remark. In the context of machine learning, we say the first type of method utilizes model parallelism and the second type data parallelism. We view (x_1, . . . , x_m) = x as a decomposition of the machine learning model represented by x into m blocks, and we view (A_{1,:}, b_1), . . . , (A_{ℓ,:}, b_ℓ) as a decomposition of the data into ℓ blocks. In model parallelism, the i-th parallel process updates the i-th block using all data blocks (but only the data relevant to the i-th block of the model). In data parallelism, the j-th parallel process updates the entire model using (A_{j,:}, b_j), the j-th block of data points.
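
As a concrete illustration of the two decompositions (the data and block sizes below are made up), the following sketch checks that the column-block and row-block views of Ax agree.

import numpy as np

# Made-up data: column blocks for model parallelism, row blocks for data parallelism.
rng = np.random.default_rng(1)
n, p = 6, 4
A = rng.standard_normal((n, p))
x = rng.standard_normal(p)

col_blocks = [slice(0, 2), slice(2, 4)]   # x = (x_1, x_2) with p_1 = p_2 = 2
row_blocks = [slice(0, 3), slice(3, 6)]   # z = (z_1, z_2) with n_1 = n_2 = 3

# Model-parallel view: Ax = sum_i A[:, P_i] x_i.
Ax_model = sum(A[:, blk] @ x[blk] for blk in col_blocks)
# Data-parallel view: Ax = (A[Q_1, :] x, ..., A[Q_l, :] x) stacked.
Ax_data = np.concatenate([A[blk, :] @ x for blk in row_blocks])

assert np.allclose(Ax_model, A @ x) and np.allclose(Ax_data, A @ x)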

10.9 Consolidating blocks with graph coloring. Consider the multi-block ADMM formulation

minimize_{x∈R^p}   Σ_{i=1}^m f_i(x_i)
subject to   Σ_{i=1}^m A_{:,i}x_i = c.

As discussed in §10.2.3, we can solve this problem with several ADMM-type methods utilizing block splitting. However, such methods can be slow; empirically, using proximal terms or introducing dummy variables slows down the iteration. On the other hand, we know that if the blocks are orthogonal, i.e., A_{:,i}^T A_{:,j} = 0 for all i ≠ j, then the classical ADMM has split x-updates.

Consider the graph G = (V, E) with the vertex set V = {1, . . . , m} representing the blocks. For the edge set E, we have {i, j} ∈ E if and only if i ≠ j and A_{:,i}^T A_{:,j} ≠ 0.
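
As a quick illustration (with a made-up block-sparse A), the following sketch builds E by testing the products A_{:,i}^T A_{:,j}; in this toy instance the resulting graph is a path on three vertices and hence 2-colorable.

import numpy as np
from itertools import combinations

# Made-up block-sparse A: blocks 1 and 3 touch disjoint rows, so they are orthogonal.
rng = np.random.default_rng(2)
A = np.zeros((6, 6))
A[0:4, 0:2] = rng.standard_normal((4, 2))   # block 1
A[2:6, 2:4] = rng.standard_normal((4, 2))   # block 2
A[4:6, 4:6] = rng.standard_normal((2, 2))   # block 3
blocks = [slice(0, 2), slice(2, 4), slice(4, 6)]

edges = {(i + 1, j + 1) for i, j in combinations(range(len(blocks)), 2)
         if np.any(A[:, blocks[i]].T @ A[:, blocks[j]] != 0)}
print(edges)   # {(1, 2), (2, 3)}: take V_1 = {1, 3} and V_2 = {2}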

Assume G has a 2-coloring, i.e., we can partition V into V_1 and V_2 such that for any i, j ∈ V_k, {i, j} ∉ E for k = 1, 2. The problem is equivalent to

minimize_{x∈R^p}   Σ_{i∈V_1} f_i(x_i) + Σ_{i∈V_2} f_i(x_i)
subject to   Σ_{i∈V_1} A_{:,i}x_i + Σ_{i∈V_2} A_{:,i}x_i = c.


Provide an ADMM-type method that updates the primal blocks in V_1 and V_2 in an alternating manner. What are the stepsize requirements?

Next, assume G has a χ-coloring, i.e., we partition V into V_1, . . . , V_χ such that for any i, j ∈ V_k, {i, j} ∉ E for k = 1, . . . , χ. The problem is equivalent to

minimize_{x∈R^p}   Σ_{k=1}^χ Σ_{i∈V_k} f_i(x_i)
subject to   Σ_{k=1}^χ Σ_{i∈V_k} A_{:,i}x_i = c.

Provide an ADMM-type method analogous to Jacobi ADMM that updates the primal blocks in V_1, . . . , V_χ concurrently. What are the stepsize requirements?

10.10 Quadratic subproblem with affine constraints. Assume g_1 is a strongly convex quadratic function with affine constraints, i.e.,

g_1(y) = (1/2)y^T Cy + ⟨d, y⟩ + δ_{{y | Ey=f}}(y),

where C ∈ R^{q×q}, C ≻ 0, d ∈ R^q, E ∈ R^{s×q} has linearly independent rows, and f ∈ R^s. When s = 0, the affine constraints vanish. Show that the solution to the subproblem

y^{k+1} = argmin_{y∈R^q} { g_1(y) + ⟨∇g_2(y^k) + B^T u^k, y⟩ + (ρ/2)‖Ax^k + By − c‖² + (1/2)‖y − y^k‖²_Q }

is given by

y^{k+1} = −T( (ρ/2)B^T Ax^k + ∇g_2(y^k) + B^T u^k − Qy^k ) − h

for some symmetric positive semidefinite matrix T ∈ R^{q×q} and h = T(d − (ρ/2)B^T c) + M^{−1}E^T(EM^{−1}E^T)^{−1}f, with M = C + Q + ρB^T B.
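
Independently of the closed form above, the subproblem is an equality-constrained quadratic program, so a candidate solution can be sanity-checked numerically with a single KKT solve. The sketch below uses made-up data and treats Ax^k − c and ∇g_2(y^k) as given vectors.

import numpy as np

# Made-up data for the y-subproblem of Exercise 10.10.
rng = np.random.default_rng(5)
q, s, n, rho = 4, 2, 3, 1.0
C = np.diag(rng.uniform(1.0, 2.0, q))      # C positive definite
Q = np.eye(q)
B = rng.standard_normal((n, q))
E = rng.standard_normal((s, q))
f = rng.standard_normal(s)
d = rng.standard_normal(q)
grad_g2 = rng.standard_normal(q)           # stands in for grad g_2(y^k)
u = rng.standard_normal(n)
Axk_minus_c = rng.standard_normal(n)       # stands in for Ax^k - c
yk = rng.standard_normal(q)

# Stationarity plus the constraint: (C + Q + rho*B'B) y + v + E' lam = 0, Ey = f,
# where v collects the terms of the subproblem that are linear in y.
M = C + Q + rho * B.T @ B
v = d + grad_g2 + B.T @ u + rho * B.T @ Axk_minus_c - Q @ yk
KKT = np.block([[M, E.T], [E, np.zeros((s, s))]])
sol = np.linalg.solve(KKT, np.concatenate([-v, f]))
y_next = sol[:q]
print("constraint residual:", np.linalg.norm(E @ y_next - f))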

10.11 Four-block ADMM with 2-1-2-4-3-4 updates. Consider the problem

minimize_{x_1∈R^{p_1}, x_2∈R^{p_2}, x_3∈R^{p_3}, x_4∈R^{p_4}}   f_1(x_1) + f_2(x_2) + f_3(x_3) + f_4(x_4)
subject to   A_{:,1}x_1 + A_{:,2}x_2 + A_{:,3}x_3 + A_{:,4}x_4 = c,

where f_2 and f_4 are strongly convex quadratic functions with affine constraints as in Exercise 10.10. Find an ADMM-type method that updates the primal blocks in the order x_2 → x_1 → x_2 → x_4 → x_3 → x_4, analogous to the 2-1-2 ADMM of §10.2.5. What are the stepsize requirements?

10.12 2-1-2 ADMM with FLiP. Consider the problem

minimize_{x∈R^p, y∈R^q}   f_1(x) + f_2(x) + g_1(y) + g_2(y)
subject to   Ax + By = c.

Assume g_1 is a strongly convex quadratic function with affine constraints as in Exercise 10.10. Consider the 2-1-2 ADMM method with function linearization and proximal terms.

y^{k+1/2} = argmin_{y∈R^q} { g_1(y) + ⟨∇g_2(y^k) + B^T u^k, y⟩ + (ρ/2)‖Ax^k + By − c‖² + (1/2)‖y − y^k‖²_Q }

x^{k+1} ∈ argmin_{x∈R^p} { f_1(x) + ⟨∇f_2(x^k) + A^T u^k, x⟩ + (ρ/2)‖Ax + By^{k+1/2} − c‖² + (1/2)‖x − x^k‖²_P }

y^{k+1} = argmin_{y∈R^q} { g_1(y) + ⟨∇g_2(y^k) + B^T u^k, y⟩ + (ρ/2)‖Ax^{k+1} + By − c‖² + (1/2)‖y − y^k‖²_Q }

u^{k+1} = u^k + ϕρ(Ax^{k+1} + By^{k+1} − c).


To clarify, the function linearization and the proximal terms of the y-updates are centered at y^k for both the y^{k+1/2}- and y^{k+1}-updates. Show that this method reduces to an instance of FLiP-ADMM, analogous to the 2-1-2 ADMM of §10.2.5. What are the stepsize requirements?

10.13 Alternate proof of Theorem 17. In this exercise, we perform an alternate analysis for Stages 2 and 3 of the proof of Theorem 17 to obtain a different stepsize requirement. Bound the last term of (10.6), via Young's inequality, with

(1/ϕ)⟨u^{k+1} − u^k, B(y^{k+1} − y^k)⟩ ≤ (η/(2ϕ²ρ))‖u^{k+1} − u^k‖² + (1/2)‖y^{k+1} − y^k‖²_{(ρ/η)B^T B}

to get

‖w^{k+1} − w^⋆‖²_{M_0} ≤ ‖w^k − w^⋆‖²_{M_0} − ‖w^{k+1} − w^k‖²_{M_3} − ( L(x^{k+1}, y^{k+1}, u^⋆) − L(x^⋆, y^⋆, u^⋆) ),

where

M_3 = (1/2) [ P − L_f I          0                               0
              0                  ρ(1 − 1/η)B^T B + Q − L_g I     0
              0                  0                               ((2 − ϕ − η)/(ϕ²ρ)) I ] .

Show that FLiP-ADMM converges if

P ⪰ L_f I,   Q ⪰ 0,

and there exists ε ∈ (0, 2 − ϕ) such that

ρ( 1 − 1/(2 − ϕ − ε) )B^T B + Q ⪰ L_g I.

Remark. This new condition is useful when L_g is large, as it does not have a factor of 3.

10.14 Refinement of Theorem 17. Unify the proof of Theorem 17 and the analysis of Exercise 10.13. For α ∈ [0, 1], define

V^k_α = ‖w^k − w^⋆‖²_{M_0} + α‖w^k − w^{k−1}‖²_{M_1}.

Show the key inequality

V^{k+1}_α ≤ V^k_α − ‖w^{k+1} − w^k‖²_{αM_2+(1−α)M_3} − ( L(x^{k+1}, y^{k+1}, u^⋆) − L(x^⋆, y^⋆, u^⋆) ),

where M_3 is as defined in Exercise 10.13. Show that FLiP-ADMM converges if

P ⪰ L_f I,   Q ⪰ 0,

and there exist α ∈ [0, 1] and ε ∈ (0, 2 − ϕ) such that

ρ( 1 − (α(1 − ϕ)² + 1 − α)/(2 − ϕ − ε) )B^T B + Q ⪰ (2α + 1)L_g I.

Remark. When ϕ = 1, the choice α = ε leads to the (sufficient) condition Q ≻ L_g I.


References

[ACL16] T. Aspelmeier, C. Charitha, and D. Luke. Local linear convergence of the ADMM/Douglas–Rachford algorithms without strong convexity and application to statistical imaging. SIAM Journal on Imaging Sciences, 9(2):842–868, 2016.

[BBC16] S. Banert, R. I. Bot, and E. R. Csetnek. Fixing and extending some recent results on the ADMM algorithm. arXiv preprint arXiv:1612.05057, 2016.

[BBCN+14] H. H. Bauschke, J. Y. Bello Cruz, T. T. A. Nghia, H. M. Phan, and X. Wang. The rate of linear convergence of the Douglas–Rachford algorithm for subspaces is the cosine of the Friedrichs angle. Journal of Approximation Theory, 185:63–79, 2014.

[BCH15] R. I. Bot, E. R. Csetnek, and C. Hendrich. Inertial Douglas–Rachford splitting for monotone inclusion problems. Applied Mathematics and Computation, 256:472–487, 2015.

[BGSB19] G. Banjac, P. Goulart, B. Stellato, and S. Boyd. Infeasibility detection in the alternating direction method of multipliers for convex optimization. Journal of Optimization Theory and Applications, 183(2):490–519, 2019.

[BK19] E. Borgens and C. Kanzow. Regularized Jacobi-type ADMM-methods for a class of separable convex optimization problems in Hilbert spaces. Computational Optimization and Applications, 73(3):755–790, 2019.

[BM16] H. H. Bauschke and W. M. Moursi. On the order of the operators in the Douglas–Rachford algorithm. Optimization Letters, 10(3):447–455, 2016.

[Bol13] D. Boley. Local linear convergence of the alternating direction method of multipliers on quadratic or linear programs. SIAM Journal on Optimization, 23(4):2183–2207, 2013.

[BPC+11] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

[BT89] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice Hall, Englewood Cliffs, NJ, 1989.

[CCMY15] C. Chen, R. H. Chan, S. Ma, and J. Yang. Inertial proximal ADMM for linearly constrained separable convex optimization. SIAM Journal on Imaging Sciences, 8(4):2239–2267, 2015.

[CHYY14] C. Chen, B. He, Y. Ye, and X. Yuan. The direct extension of ADMM for multi-block convex minimization problems is not necessarily convergent. Mathematical Programming, pages 1–23, 2014.

[CST17] L. Chen, D. Sun, and K.-C. Toh. A note on the convergence of ADMM for linearly constrained convex optimization problems. Computational Optimization and Applications, 66(2):327–343, 2017.


[CSY13] C. Chen, Y. Shen, and Y. You. On the convergence analysis of the alternating direction method of multipliers with three blocks. Abstract and Applied Analysis, 2013:e183961, 2013.

[CT94] G. Chen and M. Teboulle. A proximal-based decomposition method for convex minimization problems. Mathematical Programming, 64(1):81–101, 1994.

[DLPY17] W. Deng, M.-J. Lai, Z. Peng, and W. Yin. Parallel multi-block ADMM with o(1/k) convergence. Journal of Scientific Computing, 71(2):712–736, 2017.

[DY16a] D. Davis and W. Yin. Convergence rate analysis of several splitting schemes. In R. Glowinski, S. Osher, and W. Yin, editors, Splitting Methods in Communication, Imaging, Science and Engineering, Chapter 4, pages 115–163. Springer, 2016.

[DY16b] W. Deng and W. Yin. On the global and linear convergence of the generalized alternating direction method of multipliers. Journal of Scientific Computing, 66(3):889–916, 2016.

[DY17a] D. Davis and W. Yin. Faster convergence rates of relaxed Peaceman–Rachford and ADMM under regularity assumptions. Mathematics of Operations Research, 42(3):783–805, 2017.

[DY17b] D. Davis and W. Yin. A three-operator splitting scheme and its optimization applications. Set-Valued and Variational Analysis, 25(4):829–858, 2017.

[EB90] J. Eckstein and D. Bertsekas. An alternating direction method for linear programming. Technical Report LIDS-P-1967, Division of Research, Harvard University, 1990.

[Eck89] J. Eckstein. Splitting Methods for Monotone Operators with Applications to Parallel Optimization. PhD thesis, MIT, 1989.

[Eck94] J. Eckstein. Some saddle-function splitting methods for convex programming. Optimization Methods and Software, 4(1):75–83, 1994.

[EF94] J. Eckstein and M. Fukushima. Some reformulations and applications of the alternating direction method of multipliers. In W. W. Hager, D. W. Hearn, and P. M. Pardalos, editors, Large Scale Optimization, pages 115–134. Springer US, 1994.

[EY12] J. Eckstein and W. Yao. Augmented Lagrangian and alternating direction methods for convex optimization: A tutorial and some illustrative computational results. Technical Report RRR 32-2012, RUTCOR, Rutgers University, 2012.

[FG83] M. Fortin and R. Glowinski. On decomposition-coordination methods using an augmented Lagrangian. In M. Fortin and R. Glowinski, editors, Augmented Lagrangian Methods: Applications to the Numerical Solution of Boundary-Value Problems, pages 97–146. Elsevier, 1983.

[Gab83] D. Gabay. Application of the methods of multipliers to variational inequalities. In M. Fortin and R. Glowinski, editors, Augmented Lagrangians: Application to the Numerical Solution of Boundary Value Problems, pages 299–331. North-Holland, Amsterdam, 1983.

[GB17] P. Giselsson and S. Boyd. Linear convergence and metric selection for Douglas–Rachford splitting and ADMM. IEEE Transactions on Automatic Control, 62(2):532–544, 2017.

[GHY14] G. Gu, B. He, and X. Yuan. Customized proximal point algorithms for linearly constrained convex minimization and saddle-point problems: A unified approach. Computational Optimization and Applications, 59(1):135–161, 2014.


[Glo84] R. Glowinski. Numerical Methods for Nonlinear Variational Problems. Springer, 1984.

[Glo14] R. Glowinski. On alternating direction methods of multipliers: A historical perspective. In W. Fitzgibbon, Y. A. Kuznetsov, P. Neittaanmäki, and O. Pironneau, editors, Modeling, Simulation and Optimization for Science and Technology, volume 34, pages 59–82. Springer Netherlands, Dordrecht, 2014.

[GLT89] R. Glowinski and P. Le Tallec. Augmented Lagrangian and Operator-Splitting Methods in Nonlinear Mechanics. Society for Industrial and Applied Mathematics, 1989.

[GM74] R. Glowinski and A. Marrocco. Rapport No 74023. Technical report, Laboratoire d'Analyse Numérique, L.A. 189, Université de Paris VI, Paris, France, 1974.

[GM75] R. Glowinski and A. Marroco. Sur l'approximation, par éléments finis d'ordre un, et la résolution, par pénalisation-dualité d'une classe de problèmes de Dirichlet non linéaires. Revue Française d'Automatique, Informatique, Recherche Opérationnelle. Analyse Numérique, 9(2):41–76, 1975.

[GM76] D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers and Mathematics with Applications, 2(1):17–40, 1976.

[GO09] T. Goldstein and S. Osher. The split Bregman method for L1-regularized problems. SIAM Journal on Imaging Sciences, 2(2):323–343, 2009.

[GOSB14] T. Goldstein, B. O'Donoghue, S. Setzer, and R. Baraniuk. Fast alternating direction optimization methods. SIAM Journal on Imaging Sciences, 7(3):1588–1623, 2014.

[GTSJ15] E. Ghadimi, A. Teixeira, I. Shames, and M. Johansson. Optimal parameter selection for the alternating direction method of multipliers (ADMM): Quadratic problems. IEEE Transactions on Automatic Control, 60(3):644–658, 2015.

[Hes69] M. R. Hestenes. Multiplier and gradient methods. Journal of Optimization Theory and Applications, 4(5):303–320, 1969.

[HHY15] B. He, L. Hou, and X. Yuan. On full Jacobian decomposition of the augmented Lagrangian method for separable convex programming. SIAM Journal on Optimization, 25(4):2274–2312, 2015.

[HL16] M. Hong and Z.-Q. Luo. On the linear convergence of the alternating direction method of multipliers. Mathematical Programming, pages 1–35, 2016.

[HLHY02] B. He, L.-Z. Liao, D. Han, and H. Yang. A new inexact alternating directions method for monotone variational inequalities. Mathematical Programming, 92(1):103–118, 2002.

[HLWY14] B. He, H. Liu, Z. Wang, and X. Yuan. A strictly contractive Peaceman–Rachford splitting method for convex programming. SIAM Journal on Optimization, 24(3):1011–1040, 2014.

[HXY16] B. He, H.-K. Xu, and X. Yuan. On the proximal Jacobian decomposition of ALM for multiple-block separable convex minimization problems and its relationship to ADMM. Journal of Scientific Computing, 66(3):1204–1217, 2016.

[HY12a] D. Han and X. Yuan. A note on the alternating direction method of multipliers. Journal of Optimization Theory and Applications, 155(1):227–238, 2012.

[HY12b] B. He and X. Yuan. On the O(1/n) convergence rate of the Douglas–Rachford alternating direction method. SIAM Journal on Numerical Analysis, 50(2):700–709, 2012.


[HY15] B. He and X. Yuan. On non-ergodic convergence rate of Douglas–Rachford alternating direction method of multipliers. Numerische Mathematik, 130(3):567–577, 2015.

[HYW00] B. S. He, H. Yang, and S. L. Wang. Alternating direction method with self-adaptive penalty parameters for monotone variational inequalities. Journal of Optimization Theory and Applications, 106(2):337–356, 2000.

[JZZ09] R.-Q. Jia, H. Zhao, and W. Zhao. Convergence analysis of the Bregman method for the variational model of image denoising. Applied and Computational Harmonic Analysis, 27(3):367–379, 2009.

[LFP17] J. Liang, J. Fadili, and G. Peyre. Local convergence properties of Douglas–Rachford and alternating direction method of multipliers. Journal of Optimization Theory and Applications, 172(3):874–913, 2017.

[LMT+10] W. Liu, S. Ma, D. Tao, J. Liu, and P. Liu. Semi-supervised sparse metric learning using alternating linearization optimization. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '10, pages 1139–1148, New York, NY, USA, 2010. ACM.

[LMYZ18] T. Lin, S. Ma, Y. Ye, and S. Zhang. An ADMM-based interior-point method for large-scale linear programming. arXiv:1805.12344 [math], 2018.

[LMZ15a] T. Lin, S. Ma, and S. Zhang. On the global linear convergence of the ADMM with multiblock variables. SIAM Journal on Optimization, 25(3):1478–1497, 2015.

[LMZ15b] T.-Y. Lin, S.-Q. Ma, and S.-Z. Zhang. On the sublinear convergence rate of multi-block ADMM. Journal of the Operations Research Society of China, 3(3):251–274, 2015.

[LMZ16] T. Lin, S. Ma, and S. Zhang. Iteration complexity analysis of multi-block ADMM for a family of convex minimization without strong convexity. Journal of Scientific Computing, 69(1):52–81, 2016.

[LMZ17] T. Lin, S. Ma, and S. Zhang. An extragradient-based alternating direction method for convex minimization. Foundations of Computational Mathematics, 17(1):35–59, 2017.

[LRY19] Y. Liu, E. K. Ryu, and W. Yin. A new use of Douglas–Rachford splitting and ADMM for identifying infeasible, unbounded, and pathological conic programs. Mathematical Programming, 177:225–253, 2019.

[LST15] M. Li, D. Sun, and K.-C. Toh. A convergent 3-block semi-proximal ADMM for convex minimization problems with one strongly convex block. Asia-Pacific Journal of Operational Research, 32(04):1550024, 2015.

[LST16] X. Li, D. Sun, and K.-C. Toh. A Schur complement based semi-proximal ADMM for convex quadratic conic programming and extensions. Mathematical Programming, 155(1):333–373, 2016.

[LST19] X. Li, D. Sun, and K.-C. Toh. A block symmetric Gauss–Seidel decomposition theorem for convex composite quadratic programming and its applications. Mathematical Programming, 175(1):395–418, 2019.

[MS13] R. D. C. Monteiro and B. F. Svaiter. Iteration-complexity of block-decomposition algorithms and the alternating direction method of multipliers. SIAM Journal on Optimization, 23(1):475–507, 2013.

[MXZ13] S. Ma, L. Xue, and H. Zou. Alternating direction methods for latent variable Gaussian graphical model selection. Neural Computation, 25(8):2172–2198, 2013.

[OBG+05] S. Osher, M. Burger, D. Goldfarb, J. Xu, and W. Yin. An iterative regularization method for total variation-based image restoration. Multiscale Modeling and Simulation, 4(2):460–489, 2005.


[OCLP15] Y. Ouyang, Y. Chen, G. Lan, and E. Pasiliao. An accelerated linearized alternating direction method of multipliers. SIAM Journal on Imaging Sciences, 8(1):644–681, 2015.

[OCPB16] B. O'Donoghue, E. Chu, N. Parikh, and S. Boyd. Conic optimization via operator splitting and homogeneous self-dual embedding. Journal of Optimization Theory and Applications, 169(3):1042–1068, 2016.

[OHTG13] H. Ouyang, N. He, L. Q. Tran, and A. Gray. Stochastic alternating direction method of multipliers. In International Conference on Machine Learning, pages 46–54, 2013.

[Opi67] Z. Opial. Weak convergence of the sequence of successive approximations for nonexpansive mappings. Bulletin of the American Mathematical Society, 73(4):591–598, 1967.

[Pow69] M. J. D. Powell. A method for nonlinear constraints in minimization problems. In R. Fletcher, editor, Optimization: Symposium of the Institute of Mathematics and Its Applications, University of Keele, England, 1968, pages 283–298, 1969.

[PSB14] P. Patrinos, L. Stella, and A. Bemporad. Douglas–Rachford splitting: Complexity estimates and accelerated variants. In 53rd IEEE Conference on Decision and Control, pages 4234–4239, 2014.

[RDC14] A. U. Raghunathan and S. Di Cairano. ADMM for convex quadratic programs: Q-linear convergence and infeasibility detection. arXiv:1411.7288 [math], 2014.

[RKW19] E. K. Ryu, S. Ko, and J.-H. Won. Splitting with near-circulant linear systems: Applications to total variation CT and PET. SIAM Journal on Scientific Computing, 2019.

[RLY19] E. K. Ryu, Y. Liu, and W. Yin. Douglas–Rachford splitting for pathological convex optimization. Computational Optimization and Applications, online first, 2019.

[Roc76a] R. T. Rockafellar. Augmented Lagrangians and applications of the proximal point algorithm in convex programming. Mathematics of Operations Research, 1(2):97–116, 1976.

[Roc76b] R. T. Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14(5):877–898, 1976.

[RTBG18] E. K. Ryu, A. B. Taylor, C. Bergeling, and P. Giselsson. Operator splitting performance estimation: Tight contraction factors and optimal parameter selection. arXiv, 2018.

[SBG+17] B. Stellato, G. Banjac, P. Goulart, A. Bemporad, and S. Boyd. OSQP: An operator splitting solver for quadratic programs. arXiv preprint arXiv:1711.08013, 2017.

[SLY15] R. Sun, Z.-Q. Luo, and Y. Ye. On the efficiency of random permutation for ADMM and coordinate descent. arXiv:1503.06387 [math], 2015.

[ST14] R. Shefi and M. Teboulle. Rate of convergence analysis of decomposition methods based on the proximal method of multipliers for convex minimization. SIAM Journal on Optimization, 24(1):269–297, 2014.

[STY15] D. Sun, K.-C. Toh, and L. Yang. A convergent 3-block semi-proximal alternating direction method of multipliers for conic programming with 4-type constraints. SIAM Journal on Optimization, 25(2):882–915, 2015.

[Suz13] T. Suzuki. Dual averaging and proximal gradient descent for online alternating direction multiplier method. In International Conference on Machine Learning, pages 392–400, 2013.


[Suz14] T. Suzuki. Stochastic dual coordinate ascent with alternating direction method of multipliers. In International Conference on Machine Learning, pages 736–744, 2014.

[Tao14] M. Tao. Some parallel splitting methods for separable convex programming with the o(1/t) convergence rate. Pacific Journal of Optimization, 10(2):359–384, 2014.

[TP17] A. Themelis and P. Patrinos. Douglas–Rachford splitting and ADMM for nonconvex optimization: Tight convergence results. arXiv, 2017.

[TY12] M. Tao and X. Yuan. An inexact parallel splitting augmented Lagrangian method for monotone variational inequalities with separable structures. Computational Optimization and Applications, 52(2):439–461, 2012.

[TY18] M. Tao and X. Yuan. On Glowinski's open question on the alternating direction method of multipliers. Journal of Optimization Theory and Applications, 179(1):163–196, 2018.

[WB13] H. Wang and A. Banerjee. Online alternating direction method (longer version). arXiv preprint arXiv:1306.3721, 2013.

[WGY10] Z. Wen, D. Goldfarb, and W. Yin. Alternating direction augmented Lagrangian methods for semidefinite programming. Mathematical Programming Computation, 2(3-4):203–230, 2010.

[WHML15] X. Wang, M. Hong, S. Ma, and Z.-Q. Luo. Solving multiple-block separable convex minimization problems using two-block alternating direction method of multipliers. Pacific Journal of Optimization, 11(4):645–667, 2015.

[Woh17] B. Wohlberg. ADMM penalty parameter selection by residual balancing. arXiv, 2017.

[WS17] J. J. Wang and W. Song. An algorithm twisted from generalized ADMM for multi-block separable convex minimization models. Journal of Computational and Applied Mathematics, 309:342–358, 2017.

[WYYZ08] Y. Wang, J. Yang, W. Yin, and Y. Zhang. A new alternating minimization algorithm for total variation image reconstruction. SIAM Journal on Imaging Sciences, 1(3):248–272, 2008.

[XFG17] Z. Xu, M. Figueiredo, and T. Goldstein. Adaptive ADMM with spectral penalty parameter selection. International Conference on Artificial Intelligence and Statistics, 2017.

[XFY+17] Z. Xu, M. A. T. Figueiredo, X. Yuan, C. Studer, and T. Goldstein. Adaptive relaxed ADMM: Convergence theory and practical implementation. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[XLLY17] Y. Xu, M. Liu, Q. Lin, and T. Yang. ADMM without a fixed penalty parameter: Faster convergence with new adaptive penalization. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 1267–1277. Curran Associates, Inc., 2017.

[Xu07] M. H. Xu. Proximal alternating directions method for structured variational inequalities. Journal of Optimization Theory and Applications, 134(1):107–117, 2007.

[Xu17] Y. Xu. Accelerated first-order primal-dual proximal methods for linearly constrained composite convex programming. SIAM Journal on Optimization, 27(3):1459–1484, 2017.

[YO13] W. Yin and S. Osher. Error forgetting of Bregman iteration. Journal of Scientific Computing, 54(2-3):684–695, 2013.


[YOGD08] W. Yin, S. Osher, D. Goldfarb, and J. Darbon. Bregman iterative algorithms for ℓ1-minimization with applications to compressed sensing. SIAM Journal on Imaging Sciences, 1(1):143–168, 2008.

[YP19] P. Yi and L. Pavel. Distributed generalized Nash equilibria computation of monotone games via double-layer preconditioned proximal-point algorithms. IEEE Transactions on Control of Network Systems, 6(1):299–311, 2019.

[Yua12] X. Yuan. Alternating direction method for covariance selection models. Journal of Scientific Computing, 51(2):261–273, 2012.

[YY13a] J. Yang and X. Yuan. Linearized augmented Lagrangian and alternating direction methods for nuclear norm minimization. Mathematics of Computation, 82(281):301–329, 2013.

[YY13b] X. Yuan and J. Yang. Sparse and low-rank matrix decomposition via alternating direction methods. Pacific Journal of Optimization, 9(1):167–180, 2013.

[YY16] M. Yan and W. Yin. Self equivalence of the alternating direction method of multipliers. In R. Glowinski, S. Osher, and W. Yin, editors, Splitting Methods in Communication, Imaging, Science and Engineering, pages 165–194. Springer, 2016.

[YZ11] J. Yang and Y. Zhang. Alternating direction algorithms for ℓ1-problems in compressive sensing. SIAM Journal on Scientific Computing, 33(1):250–278, 2011.

[ZBO11] X. Zhang, M. Burger, and S. Osher. A unified primal-dual algorithm framework based on Bregman iteration. Journal of Scientific Computing, 46(1):20–46, 2011.

[ZK14] W. Zhong and J. Kwok. Fast stochastic alternating direction method of multipliers. In International Conference on Machine Learning, pages 46–54, 2014.

[ZK16] S. Zheng and J. T. Kwok. Fast-and-light stochastic ADMM. In International Joint Conference on Artificial Intelligence, pages 2407–2413, 2016.