
Lecture Notes on Convex Analysis and Iterative Algorithms

İlker Bayram
[email protected]


About These Notes

These are the lecture notes of a graduate course I offered in the Dept. of Electronics and Telecommunications Engineering at Istanbul Technical University. My goal was to get students acquainted with methods of convex analysis, to make them more comfortable in following arguments that appear in recent signal processing literature, and to understand/analyze the proximal point algorithm, along with its many variants. In the first half of the course, convex analysis is introduced at a level suitable for graduate students in electrical engineering (i.e., assuming some familiarity with the notions of convex sets and convex functions from other courses). Then several other algorithms are derived based on the proximal point algorithm, such as the Douglas-Rachford algorithm, ADMM, and some applications to saddle point problems. There are no references in this version. I hope to add some in the future.

İlker Bayram
December, 2018


Contents

1 Convex Sets
1.1 Basic Definitions
1.2 Operations Preserving Convexity of Sets
1.3 Projections onto Closed Convex Sets
1.4 Separation and Normal Cones
1.5 Tangent and Normal Cones

2 Convex Functions
2.1 Operations That Preserve the Convexity of Functions
2.2 First Order Differentiation
2.3 Second Order Differentiation

3 Conjugate Functions

4 Duality
4.1 A General Discussion of Duality
4.2 Lagrangian Duality
4.3 Karush-Kuhn-Tucker (KKT) Conditions

5 Subdifferentials
5.1 Motivation, Definition, Properties of Subdifferentials
5.2 Connection with the KKT Conditions
5.3 Monotonicity of the Subdifferential

6 Applications to Algorithms
6.1 The Proximal Point Algorithm
6.2 Firmly-Nonexpansive Operators
6.3 The Dual PPA and the Augmented Lagrangian
6.4 The Douglas-Rachford Algorithm
6.5 Alternating Direction Method of Multipliers
6.6 A Generalized Proximal Point Algorithm


1 Convex Sets

This first chapter introduces convex sets and discusses some of their properties. Having a solid understanding of convex sets is very useful for convex analysis of functions, because we can and will regard a convex function as a special representation of a convex set, namely its epigraph.

1.1 Basic Definitions

Definition 1. A set C ⊂ Rn is said to be convex if x ∈ C, x′ ∈ C implies that αx + (1 − α)x′ ∈ C for any α ∈ [0, 1]. □

Consider the sets below. Each pair (x, x′) we select in the set on the left defines a line segment which lies inside the set. However, this is not the case for the set on the right. Even though we are able to find line segments with endpoints inside the set (as in (x, x′)), this is not true in general, as exemplified by (y, y′).

[Figure: a convex set (left) containing the segment (x, x′), and a non-convex set (right), where the segment (y, y′) leaves the set.]

For the examples below, decide if the set is convex or not, and prove whatever you think is true.

Example 1 (Hyperplane). For given s ∈ Rn, r ∈ R, consider the set Hs,r = {x ∈ Rn : 〈s, x〉 = r}. Notice that this is a subspace for r = 0.

Example 2 (Affine Subspace). This is a set V ⊂ Rn such that if x ∈ V and x′ ∈ V, then αx + (1 − α)x′ ∈ V for all α ∈ R.

Example 3 (Half Space). For given s ∈ Rn, r ∈ R, consider the set H−s,r = {x ∈ Rn : 〈s, x〉 ≤ r}.

Example 4 (Cone). A cone K ⊂ Rn is a set such that if x ∈ K, then αx ∈ K for all α > 0. Note that a cone may be convex or non-convex. See below for an example of a convex (left) and a non-convex (right) cone.


Exercise 1. Let K be a cone. Show that K is convex if and only if x, y ∈ K implies x + y ∈ K. □

1.2 Operations Preserving Convexity of Sets

Proposition 1 (Intersection of Convex Sets). Let C1, C2, . . . , Ck be convex sets. Then C = ∩i Ci is convex.

Proof. Exercise!

Question 1. What happens if the intersection is empty? Is the empty set convex? □

This simple result is useful for characterizing solution sets of linear systems of equations or inequalities.

Example 5. For a matrix A, the solution set of Ax = b is an intersection of hyperplanes (one per row). Therefore it is convex.

For the example above, we can in fact say more, thanks to the following variation of Prop. 1.

Exercise 2. Show that the intersection of a finite collection of affine subspaces is an affine subspace. □

Let us continue with systems of linear inequalities.

Example 6. For a matrix A, the solution set of Ax ≤ b is an intersection of half spaces. Therefore it is convex.

Proposition 2 (Cartesian Products of Convex Sets). Suppose C1, . . . , Ck are convex sets in Rn. Then the Cartesian product C1 × · · · × Ck is a convex set in Rn × · · · × Rn.

Proof. Exercise!

Given an operator F and a set C, we can apply F to elements of C to obtain the image of C under F. We will denote that set as F C. If F is linear, then it preserves convexity.

Proposition 3 (Linear Transformations of Convex Sets). Let M be a matrix. If C is convex, then M C is also convex.

Consider now an operator that just adds a vector d to its operand. This is a translation operator. Geometrically, it should be obvious that translation preserves convexity. It is a good exercise to translate this mental picture into an algebraic expression and show the following.


Proposition 4 (Translation). If C is convex, then the set C + d = {x : x = v + d for some v ∈ C} is also convex.

Given C1, C2, consider the set of points of the form v = v1 + v2, where vi ∈ Ci. This set is denoted by C1 + C2 and is called the Minkowski sum of C1 and C2. We have the following result concerning Minkowski sums.

Proposition 5 (Minkowski Sums of Convex Sets). If C1 and C2 are convex, then C1 + C2 is also convex.

Proof. Observe that

C1 + C2 = [I I] (C1 × C2).

Thus it follows by Prop. 2 and Prop. 3 that C1 + C2 is convex.

Example 7. The Minkowski sum of a rectangle and a disk in R2 is shown below.

[Figure: a rectangle plus a disk yields a larger rectangle with rounded corners.]
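As a small numerical illustration of Prop. 5 (a sketch in Python, assuming numpy is available; the rectangle and disk dimensions are arbitrary choices), one can sample both sets and form all pairwise sums:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample the rectangle [0, 2] x [0, 1] and the unit disk.
rect = rng.uniform([0.0, 0.0], [2.0, 1.0], size=(400, 2))
theta = rng.uniform(0.0, 2.0 * np.pi, 400)
radius = np.sqrt(rng.uniform(0.0, 1.0, 400))   # sqrt gives a uniform density over the disk
disk = np.column_stack([radius * np.cos(theta), radius * np.sin(theta)])

# Minkowski sum: every v1 + v2 with v1 in the rectangle, v2 in the disk.
msum = (rect[:, None, :] + disk[None, :, :]).reshape(-1, 2)
print(msum.shape)   # (160000, 2); a scatter plot shows a rectangle with rounded corners
```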

Definition 2 (Convex Combination). Consider a finite collection of points x1, . . . , xk. x is said to be a convex combination of the xi’s if x satisfies

x = α1 x1 + · · · + αk xk

for some αi such that

αi ≥ 0 for all i, and ∑_{i=1}^k αi = 1. □

Definition 3 (Convex Hull). The set of all convex combinations of a set C is called the convex hull of C and is denoted as Co(C). □

Below are two examples, showing the convex hull of the sets C = {x1, x2} and C′ = {y1, y2, y3}.


[Figure: the convex hull of {x1, x2} is the segment between the two points; the convex hull of {y1, y2, y3} is the filled triangle with these vertices.]

Notice that in the definition of the convex hull, the set C does not have to be convex (in fact, C is not convex in the examples above). Also, regardless of the dimension of C, when constructing Co(C), we can consider convex combinations of any finite number of elements chosen from C. In fact, if we denote the set of all convex combinations of k elements from C as Ck, then we can show that Ck ⊂ Ck+m for m ≥ 0. An interesting result, which we will not use in this course, is the following.

Exercise 3 (Caratheodory’s Theorem). Show that, if C ⊂ Rn, then Co(C) = Cn+1. □

The following proposition justifies the term ‘convex’ in the definition of the convex hull.

Proposition 6. The convex hull of a set C is convex.

Proof. Exercise!

The convex hull of C is the smallest convex set that contains C. More precisely, we have the following.

Proposition 7. If D = Co(C), and if E is a convex set with C ⊂ E, then D ⊂ E.

Proof. The idea is to show that, for any integer k, E contains all convex combinations involving k elements from C. For this, we will present an argument based on induction.

We start with k = 2. Suppose x1, x2 ∈ C. This implies x1, x2 ∈ E. Since E is convex, we have αx1 + (1 − α)x2 ∈ E, for all α ∈ [0, 1]. Since x1, x2 were arbitrary elements of C, it follows that E contains all convex combinations of any two elements from C.

Suppose now that E contains all convex combinations of any k − 1 elements from C. That is, if x1, . . . , xk−1 are in C, then for ∑_{i=1}^{k−1} αi = 1 and αi ≥ 0, we have ∑_{i=1}^{k−1} αi xi ∈ E. Suppose we pick a kth element, say xk, from C. Also, let α1, . . . , αk be on the unit simplex, with αk ≠ 0 (if αk = 0, we have nothing


to prove). Observe that

∑_{i=1}^k αi xi = αk xk + ∑_{i=1}^{k−1} αi xi
= αk xk + (1 − αk) ∑_{i=1}^{k−1} [αi / (1 − αk)] xi.

Notice that

∑_{i=1}^{k−1} αi / (1 − αk) = 1, and αi / (1 − αk) ≥ 0, for all i.

Therefore,

y = ∑_{i=1}^{k−1} [αi / (1 − αk)] xi

is an element of E, since it is a convex combination of k − 1 elements of C. But then,

∑_{i=1}^k αi xi = αk xk + (1 − αk) y

is an element of E (why?). We are done.

Another operation of interest is the affine hull. For that, let us introduce affine combinations.

Definition 4 (Affine Combination). Consider a finite collection of points x1, . . . , xk. x is said to be an affine combination of the xi’s if x satisfies

x = α1 x1 + · · · + αk xk

for some αi such that

∑_{i=1}^k αi = 1. □

Definition 5 (Affine Hull). The set of all affine combinations of a set C is called the affine hull of C. □


The affine hull of two points x1, x2 is the line passing through the two points.

[Figure: the line through x1 and x2, extending beyond both points.]

Exercise 4. Consider a set C ⊂ R2, composed of two points C = {x1, x2}. What is the difference between the affine and convex hulls of C? □

Let us end this discussion with some definitions.

Definition 6 (Interior). x is said to be in the interior of C if there exists an open set B such that x ∈ B and B ⊂ C. The interior of C is denoted as int(C). □

Definition 7 (Boundary). The boundary of a set C, denoted bd(C), is defined to be C ∩ (int(C))ᶜ. □

1.3 Projections onto Closed Convex Sets

Definition 8 (Projection). Let C be a set. For a given point y (inside or outside C), x ∈ C is said to be a projection of y onto C if it is a minimizer of the following problem:

min_{z∈C} ‖y − z‖2.

In general, projections may not exist, or may not be uniquely defined.

Example 8. Suppose D denotes the open unit disk in R2 and let y be such that ‖y‖2 > 1. Then, the projection of y onto D does not exist. Notice that D is convex but not closed. □

Example 9. Let C be the unit circle. Can you find a point y ∉ C that has infinitely many projections? Can you find a point which does not have a projection onto C? Notice in this case that C is not convex, but closed. □

The two examples above imply that projections are not always guaranteed to be well-defined. In fact, convexity alone is not sufficient. We will see later that convexity and closedness together ensure the existence of a unique minimizer. The following provides a useful characterization of the projection.

Proposition 8. Let C be a convex set. x ∈ C is the projection of y onto C if and only if

〈z − x, y − x〉 ≤ 0, for all z ∈ C.


It is useful to understand what this proposition means geometrically. Consider the figure below. If x is the projection of y onto the convex set, then Prop. 8 implies that the angle ∠zxy is obtuse.

[Figure: y, its projection x onto a convex set, and a point z in the set; the angle at x between z − x and y − x is obtuse.]

Proof of Prop. 8. (⇒) Suppose x = PC(y) but there exists z ∈ C such that

〈z − x, y − x〉 > 0.

The idea is as follows. Consider the figure below.

[Figure: the points y, x, z, a point t on the segment between x and z, and angles α, β.]

Pick t such that β > α. But then

‖y − t‖2 < ‖y − x‖2.

Thus, x cannot be the projection.

Let us now provide an algebraic proof. Consider

‖y − (αz + (1 − α)x)‖2² = ‖y − x + α(x − z)‖2²
= ‖y − x‖2² + α² ‖x − z‖2² + 2α 〈x − z, y − x〉
= ‖y − x‖2² + α² d − 2α c,

where d = ‖x − z‖2² and c = 〈z − x, y − x〉. Notice now that, since c > 0 by assumption, the polynomial α²d − 2αc is negative in the interval (0, 2c/d). Thus we can find α ∈ (0, 1) such that ‖y − (αz + (1 − α)x)‖2² is strictly less than ‖y − x‖2². But this contradicts the assumption that x = PC(y).

(⇐) Suppose x ∈ C and x ≠ PC(y). Let z = PC(y). Also, suppose that

〈z′ − x, y − x〉 ≤ 0, for all z′ ∈ C.

Consider

‖y − z‖2² = ‖y − x + x − z‖2²
= ‖y − x‖2² + ‖x − z‖2² + 2 〈x − z, y − x〉,

and set c = 〈x − z, y − x〉. Note that c ≥ 0, by the assumed inequality applied with z′ = z.


Now, since c ≥ 0 and ‖x − z‖2 > 0, we obtain ‖y − x‖2 < ‖y − z‖2. But this contradicts the fact that z = PC(y) minimizes the distance to y.

Corollary 1. If C is closed and convex, then PC(y) is unique.

Proof. Suppose x1, x2 ∈ C and

‖x1 − y‖2 = ‖x2 − y‖2 ≤ ‖z − y‖2 for all z ∈ C.

Then, we have, by Prop. 8 that

〈y − x1, x2 − x1〉 ≤ 0,

〈y − x2, x1 − x2〉 ≤ 0.

Adding these inequalities, we obtain

〈x1 − x2, x1 − x2〉 ≤ 0,

which implies x1 = x2.

Projection operators enjoy useful properties. One of them is the following, which we will refer to as ‘firm nonexpansivity’ (to be properly defined later).

Proposition 9. For a closed, convex C, ‖PC(x1) − PC(x2)‖2² ≤ 〈PC(x1) − PC(x2), x1 − x2〉.

Proof. Let pi = PC(xi). Then, by Prop. 8, we have

〈x1 − p1, p2 − p1〉 ≤ 0,
〈x2 − p2, p1 − p2〉 ≤ 0.

Summing these, we have

〈(p2 − p1) + (x1 − x2), p2 − p1〉 ≤ 0.

Rearranging, we obtain the desired inequality.

Consider the following figure. The proposition states that the inner product of the two vectors PC(x1) − PC(x2) and x1 − x2 is greater than the squared length of the shorter one.

[Figure: points x1, x2 and their projections PC(x1), PC(x2) onto a convex set C.]

As a first corollary of this proposition, we have:


Corollary 2. For a closed, convex C, we have 〈PC(x1)−PC(x2), x1−x2〉 ≥ 0.

Applying the Cauchy-Schwarz inequality, we obtain from Prop. 9 that projection operators are ‘non-expansive’ (also to be discussed later).

Corollary 3. For a closed, convex C, we have ‖PC(x1)−PC(x2)‖2 ≤ ‖x1−x2‖2.
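As a quick numerical check of Prop. 9 and Corollary 3 (a sketch in Python, assuming numpy; C is taken to be the closed unit ℓ2 ball, whose projection operator has the simple closed form used below):

```python
import numpy as np

def proj_unit_ball(x):
    """Projection onto the closed unit l2 ball: scale down if ||x|| > 1."""
    n = np.linalg.norm(x)
    return x if n <= 1.0 else x / n

rng = np.random.default_rng(1)
for _ in range(100):
    x1, x2 = rng.normal(size=3), rng.normal(size=3)
    p1, p2 = proj_unit_ball(x1), proj_unit_ball(x2)
    # Prop. 9 (firm nonexpansivity) and Cor. 3 (nonexpansivity):
    assert np.dot(p1 - p2, p1 - p2) <= np.dot(p1 - p2, x1 - x2) + 1e-12
    assert np.linalg.norm(p1 - p2) <= np.linalg.norm(x1 - x2) + 1e-12
print("Prop. 9 and Cor. 3 hold for all sampled pairs")
```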

1.4 Separation and Normal Cones

Proposition 10. Let C be a closed convex set and x ∉ C. Then, there exists s such that

〈s, x〉 > sup_{y∈C} 〈s, y〉.

Proof. To outline the idea of the proof, consider the left figure below.

[Figure: left, the point x, its projection p = PC(x), and a hyperplane H through p normal to x − p; right, a point z ∈ C on the wrong side of H.]

The hyperplane H, normal to x − p and touching C at p, should separate x and C. If this were not the case, the situation would resemble the right figure above, and we would have ‖x − z‖ < ‖x − p‖.

Algebraically, 〈x − PC(x), z − PC(x)〉 ≤ 0, for all z ∈ C. Set s = x − PC(x). Then, we have

〈s, z − x + s〉 ≤ 0, ∀z ∈ C ⇐⇒ 〈s, x〉 ≥ 〈s, s〉 + 〈s, z〉, ∀z ∈ C.

Since 〈s, s〉 > 0 (as x ∉ C), the claim follows.

As a generalization, we have the following result.

Proposition 11. Let C, D be disjoint closed convex sets. Suppose also that C − D is closed. Then, there exists s such that 〈s, x〉 > 〈s, y〉, for all x ∈ C, y ∈ D.

Proof. The idea is to consider the segment between the closest points of C and D, and construct a separating hyperplane that is orthogonal to this segment, as visualized below.


[Figure: disjoint convex sets C and D, separated by a hyperplane orthogonal to the segment joining their closest points.]

We want to find s such that

〈s, y − x〉 < 0, ∀x ∈ C, y ∈ D.

Note that y − x ∈ D − C. Also, since C ∩ D = ∅, we have 0 ∉ D − C. Therefore, by Prop. 10, there exists s such that 〈s, 0〉 > 〈s, z〉, for all z ∈ D − C. This s satisfies

〈s, y − x〉 < 0, ∀ y ∈ D, x ∈ C.

Definition 9. An affine hyperplane Hs,r is said to support the set C if 〈s, x〉 ≤ r for all x ∈ C. Notice that this is equivalent to C ⊂ H−s,r. □

For a given set C, let ΣC denote the set of (s, r) such that C ⊂ H−s,r.

Proposition 12. If C is closed and convex, then

C = ∩_{(s,r)∈ΣC} H−s,r.

Proof. The proof follows by showing that the expressions on the rhs and lhs are subsets of each other.

Obviously,

C ⊂ ∩_{(s,r)∈ΣC} H−s,r.

Let us now show the other inclusion. Take x ∉ C. Then, by Prop. 10, there exists p such that

〈x, p〉 > 〈z, p〉, ∀ z ∈ C.

Take q = sup_{z∈C} 〈p, z〉, so that 〈x, p〉 > q. Then H−p,q ⊃ C, and x ∉ H−p,q. Since (p, q) ∈ ΣC, we find that x ∉ ∩_{(s,r)∈ΣC} H−s,r. Thus,

C ⊃ ∩_{(s,r)∈ΣC} H−s,r.

We also state, without proof, the following result, which will be useful in the discussion of duality.

Proposition 13. Suppose C is a convex set. There exists a supporting hyperplane for any x ∈ bd(C).


1.5 Tangent and Normal Cones

Definition 10. Let C be a closed convex set. The direction s ∈ Rn is said to be normal to C at x when

〈s, y − x〉 ≤ 0, ∀ y ∈ C.

According to the definition, for any y ∈ C, the angle between the two vectors shown below is obtuse.

[Figure: a normal direction s at x, and the vector y − x for a point y ∈ C.]

Notice that if s is normal to C at x, then αs is also normal, for α ≥ 0. The set of normal directions is therefore a cone.

Definition 11. The cone mentioned above is called the normal cone of C at x, and is denoted as NC(x). □

Proposition 14. NC(x) is a convex cone.

Proof. If s1, s2 are in NC(x) and α ∈ [0, 1], then

〈αs1 + (1 − α)s2, y − x〉 = α〈s1, y − x〉 + (1 − α)〈s2, y − x〉 ≤ 0, ∀ y ∈ C,

implying that αs1 + (1 − α)s2 ∈ NC(x). That NC(x) is a cone was observed above.

Below are two examples.

[Figure: a convex set C with the translated normal cones x1 + NC(x1) and x2 + NC(x2) at two boundary points, and a set D with z + ND(z) at a point z.]

From the definition, we directly obtain the following result on normal cones.

Proposition 15. Let C be closed, convex. If s ∈ NC(x), then

PC(x + αs) = x, ∀ α ≥ 0.


Normal cones will be of interest when we discuss constrained minimization and subdifferentials of functions.

Let us now define a related cone through ‘polarity’.

Definition 12. Given a closed cone C, the polar cone of C is the set of p such that

〈p, s〉 ≤ 0, ∀ s ∈ C.

In the figure below, the dot indicates the origin, and D is the polar cone of C. Note also that C is the polar cone of D.

[Figure: a cone C and its polar cone D.]

Definition 13. The polar cone of NC(x) is called the tangent cone of C at x, and is denoted by TC(x). □

The figures below show the tangent cones of two sets at the origin (the origin is indicated by a dot).

[Figure: two sets with their tangent cones TC(0) and normal cones NC(0) at the origin.]

Proposition 16. For a convex C, we have x + TC(x) ⊃ C.

Proof. If p ∈ C, then for every s ∈ NC(x) we have

〈p − x, s〉 ≤ 0.

Since TC(x) is the polar cone of NC(x), this gives p − x ∈ TC(x).


Proposition 17. Suppose C is closed and convex. Then

∩_{x∈C} (x + TC(x)) = C.

Proof. Let us denote the set on the lhs as D. Note that, by the previous proposition, D ⊃ C.

To see the converse inclusion, take z ∉ C. Let x = PC(z). Then z − x ∈ NC(x), and since 〈z − x, z − x〉 > 0, we get z − x ∉ TC(x). Thus z ∉ x + TC(x), and therefore z ∉ D.

Proposition 18. TC(x) is the closure of the cone generated by C − x.

Exercise 5. Show that a closed set C is convex if and only if (1/2)(x + y) ∈ C, for all x, y ∈ C. □

Proposition 19. Let x ∈ bd(C), where C is convex. There exists s such that

〈s, x〉 ≤ 〈s, y〉, ∀y ∈ C.

Proof. Consider a sequence {xk}k with xk ∉ C and limk xk = x. Also, let y ∈ C. We can find a sequence {sk}k with ‖sk‖2 = 1 such that 〈sk, xk〉 ≤ 〈sk, y〉 for all k (apply Prop. 10 to xk, negate, and normalize). Now, extract a convergent subsequence {ski} with limit s (such a subsequence exists by the Bolzano-Weierstrass theorem, since the sk are bounded and are in Rn). Then, we must have

〈ski, xki〉 ≤ 〈ski, y〉, ∀i.

Taking limits, we obtain 〈s, x〉 ≤ 〈s, y〉.

Alternative Proof (Sketch). Assume NC(x) ≠ ∅. Take a nonzero z ∈ NC(x). Then, 〈z, y − x〉 ≤ 0 for all y ∈ C. This is equivalent to 〈−z, x〉 ≤ 〈−z, y〉 for all y ∈ C, so s = −z works.


2 Convex Functions

The standard definition is as follows.

Definition 14. f : Rn → R is said to be convex if for all x, y ∈ Rn, and α ∈ [0, 1], we have

f(αx + (1 − α)y) ≤ αf(x) + (1 − α) f(y).

If the inequality is strict whenever x ≠ y and α ∈ (0, 1), the function is said to be strictly convex.

The inequality is demonstrated in the figure below.

[Figure: the graph of f between x and y lies below the chord connecting (x, f(x)) and (y, f(y)).]

In order to link convex functions and sets, let us also introduce the following.

Definition 15. Given f : Rn → R, the epigraph of f is the subset of Rn+1 defined as

epi(f) = {(x, r) ∈ Rn × R : r ≥ f(x)}.

[Figure: a convex function and the shaded region epi(f) above its graph.]

Proposition 20. f is convex if and only if its epigraph is a convex set in Rn+1.

Proof. (⇒) Suppose f is convex. Pick (x1, r1), (x2, r2) from epi(f), and let α ∈ [0, 1]. Then,

r1 ≥ f(x1),
r2 ≥ f(x2).

Using the convexity of f, we have

f(αx1 + (1 − α)x2) ≤ αf(x1) + (1 − α)f(x2) ≤ αr1 + (1 − α)r2.

Therefore

α(x1, r1) + (1 − α)(x2, r2) ∈ epi(f),


and so epi(f) is convex.

(⇐) Suppose epi(f) is convex. Notice that since (x1, f(x1)), (x2, f(x2)) ∈ epi(f), we have

(αx1 + (1 − α)x2, αf(x1) + (1 − α)f(x2)) ∈ epi(f).

But this means that

f(αx1 + (1 − α)x2) ≤ αf(x1) + (1 − α) f(x2).

Thus, f is convex.

Definition 16. The domain of f : Rn → R is the set

dom(f) = {x ∈ Rn : f(x) < ∞}.

Example 10. Let C be a set in Rn. Consider the function

iC(x) = 0, if x ∈ C;  ∞, if x ∉ C.

iC(x) is called the indicator function of the set C. Its domain is the set C.

Exercise 6. Show that iC(x) is convex if and only if C is convex. □

Exercise 7. Consider the function

uC(x) = 0, if x ∈ C;  1, if x ∉ C.

Determine if uC(x) is convex. If so, under what conditions? □

Proposition 21 (Jensen’s inequality). Let f : Rn → R be convex and x1, . . . , xk be points in its domain. Also, let α1, . . . , αk ∈ [0, 1] with ∑i αi = 1. Then,

f(∑i αi xi) ≤ ∑i αi f(xi). (2.1)

Proof. Notice that (xi, f(xi)) ∈ epi(f), for i = 1, . . . , k. Since epi(f) is convex (Prop. 20), it follows that

(∑i αi xi, ∑i αi f(xi)) ∈ epi(f).

But this is equivalent to (2.1).


Definition 17. f is said to be concave if −f is convex. □

Below are some examples of convex functions.

Example 11. Affine functions: f(x) = 〈s, x〉 + b, for some s and b. In relation with this, determine if f(x, y) = xy is convex.

Example 12 (Norms). Suppose ‖·‖ is a norm. Then, by the triangle inequality and the homogeneity of the norm, we have

‖αx + (1 − α)y‖ ≤ α‖x‖ + (1 − α)‖y‖.

In particular, for 1 ≤ p ≤ ∞, the ℓp norm is defined (for finite p) as

‖x‖p = (∑i |xi|^p)^{1/p}.

Show that ‖x‖p is actually a norm. What happens if p < 1?
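A quick numeric experiment (a sketch in Python, assuming numpy) illustrates what goes wrong for p < 1: the triangle inequality fails, so ‖·‖p is no longer a norm and its unit ball is not convex.

```python
import numpy as np

def lp(x, p):
    """The l_p 'norm' (a true norm only for p >= 1)."""
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

x, y = np.array([1.0, 0.0]), np.array([0.0, 1.0])
for p in (2.0, 1.0, 0.5):
    lhs, rhs = lp(x + y, p), lp(x, p) + lp(y, p)
    print(f"p={p}: ||x+y||={lhs:.3f}, ||x||+||y||={rhs:.3f}, triangle holds: {lhs <= rhs}")
# For p = 0.5, ||x+y|| = 4 > 2 = ||x|| + ||y||: the triangle inequality fails.
```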

Example 13 (Quadratic Forms). f(x) = xᵀQx is convex if Q + Qᵀ is positive semidefinite. Show this!

Exercise 8. Show that f(x) above is not convex if Q + Qᵀ is not positive semidefinite. □
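One way to probe Example 13 and Exercise 8 numerically (a sketch in Python, assuming numpy; the matrix Q below is an arbitrary example): the Hessian of f(x) = xᵀQx is Q + Qᵀ (this will follow from Section 2.3), so convexity can be read off from the eigenvalues of the symmetric part.

```python
import numpy as np

rng = np.random.default_rng(2)
Q = rng.normal(size=(3, 3))                 # a generic, non-symmetric matrix

# f(x) = x^T Q x has Hessian Q + Q^T, so f is convex iff Q + Q^T is PSD.
eigs = np.linalg.eigvalsh(Q + Q.T)          # eigvalsh: for symmetric matrices
print("eigenvalues of Q + Q^T:", np.round(eigs, 3))
print("f convex:", bool(np.all(eigs >= 0)))
```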

2.1 Operations That Preserve the Convexity of Functions

1. Translation by an arbitrary amount, multiplication with a non-negative constant.

2. Dilation: If f(x) is convex, so is f(αx), for α ∈ R. (Follows by considering the epigraph.)

3. Pre-composition with a matrix: If f(x) is convex, so is f(Ax). Show this!

4. Post-composition with an increasing convex function: Suppose g : R → R is an increasing convex function and f : Rn → R is convex. Then, g(f(·)) is convex.

Proof. Since g is increasing and convex, we have

g(f(αx1 + (1 − α)x2)) ≤ g(αf(x1) + (1 − α)f(x2))
≤ α g(f(x1)) + (1 − α) g(f(x2)).


5. Pointwise supremum of convex functions: Suppose f1(·), . . . , fk(·) are convex functions, and define

g(x) = maxi fi(x).

Then g is convex.

Proof. Notice that

epi(g) = ∩i epi(fi).

Since intersections of convex sets are convex, epi(g) is convex, and so g is convex.

Below is a visual demonstration of the proof.

[Figure: two convex functions f1, f2; the epigraph of g = max(f1, f2) is epi(f1) ∩ epi(f2).]
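A direct numerical spot-check of item 5 (a sketch in Python, assuming numpy; the affine pieces are arbitrary): build g as a pointwise maximum of affine functions and test the convexity inequality on random points.

```python
import numpy as np

rng = np.random.default_rng(0)
# g(x) = max_i <s_i, x> + b_i: a pointwise maximum of affine (hence convex) functions.
S, b = rng.normal(size=(5, 2)), rng.normal(size=5)
g = lambda x: np.max(S @ x + b)

# Spot-check the convexity inequality on random pairs and alphas.
for _ in range(1000):
    x, y = rng.normal(size=2), rng.normal(size=2)
    a = rng.uniform()
    assert g(a * x + (1 - a) * y) <= a * g(x) + (1 - a) * g(y) + 1e-9
print("convexity inequality holds on all sampled points")
```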

2.2 First Order Differentiation

Proposition 22. Suppose f : Rn → R is a differentiable function. Then, f is convex if and only if

f(y) ≥ f(x) + 〈∇f(x), y − x〉, ∀x, y ∈ Rn.

Proof. (⇒) Suppose f is convex. Then,

f(x + α(y − x)) ≤ (1 − α)f(x) + αf(y), for 0 ≤ α ≤ 1.

Rearranging, we obtain

[f(x + α(y − x)) − f(x)] / α ≤ f(y) − f(x), for 0 < α ≤ 1.

Letting α → 0, the left hand side converges to 〈∇f(x), y − x〉.

(⇐) Consider the function gy(x) = f(y) + 〈∇f(y), x − y〉. Then, gy(x) ≤ f(x) and gy(y) = f(y) (see below). Also, gy(x) is convex (in fact, affine).

[Figure: f and the affine minorant gy, touching the graph of f at y.]


Now set

h(x) = supy gy(x).

By the two properties of gy(·) above, it follows that h(x) = f(x). But since h(x) is the supremum of a collection of convex functions, it is convex. Thus, it follows that f(x) is convex.

We remark that a construction similar to the second part of the proof will lead to the conjugate of f(x), which will be discussed later.

Note that if f : R → R is differentiable and convex, then f′(·) is a monotone increasing function. To see this, suppose that f′(x0) = 0. Then, for y > x0, we can show that f′(y) ≥ f′(x0) = 0. Indeed, f(y) ≥ f(x0) because of the proposition. If f′(y) < 0, then

f(x0) ≥ f(y) + f′(y)(x0 − y) > f(y) ≥ f(x0),

since f′(y)(x0 − y) > 0. This is a contradiction. Therefore, we must have f′(y) ≥ 0.

To generalize this argument, apply it to

hx0(x) = f(x) − f′(x0) x.

Observe that h′x0(x0) = 0, and h′x0(x) = f′(x) − f′(x0).

Question 2. How does the foregoing discussion generalize to convex functions defined on Rn? □

Definition 18. An operator M : Rn → Rn is said to be monotone if

〈M(x) − M(y), x − y〉 ≥ 0, for all x, y ∈ Rn.

Below is an instance of this relation. The two vectors x − y and M(x) − M(y) approximately point in the same direction.

[Figure: points x, y and values M(x), M(y); the vectors x − y and M(x) − M(y) form an acute angle.]

Proposition 23. Suppose f : Rn → R is differentiable. Then, f is convex if and only if ∇f is monotone.


Proof. (⇒) Suppose f is convex. This implies the following two inequalities:

f(y) ≥ f(x) + 〈∇f(x), y − x〉,
f(x) ≥ f(y) + 〈∇f(y), x − y〉.

Adding these inequalities and rearranging, we obtain

0 ≥ 〈∇f(x) − ∇f(y), y − x〉.

(⇐) Suppose ∇f is monotone. For y, x ∈ Rn, let z = y − x. Then

f(y) − f(x) = ∫₀¹ 〈∇f(x + αz), z〉 dα.

Rearranging,

f(y) − f(x) − ∫₀¹ 〈∇f(x), z〉 dα = ∫₀¹ 〈∇f(x + αz) − ∇f(x), z〉 dα.

But the right hand side is non-negative by the monotonicity of ∇f. It follows that

f(y) ≥ f(x) + 〈∇f(x), y − x〉.

It then follows by Prop. 22 that f is convex.

2.3 Second Order Differentiation

We have seen that f : R → R is convex if and only if its first derivative is monotone increasing. But the latter property is equivalent to the second derivative being non-negative. Therefore, we also have an equivalent condition that involves the second derivative. This generalizes as follows.

Proposition 24. Let f : Rn → R be a twice-differentiable function. Then, f is convex if and only if ∇²f is positive semi-definite.

Proof. (⇒) If f is convex, then F = ∇f : Rn → Rn is a monotone mapping, by Prop. 23. Let d ∈ Rn. Then,

〈F(x + αd) − F(x), αd〉 ≥ 0 for all α > 0.

This implies

〈[F(x + αd) − F(x)] / α, d〉 ≥ 0 for all α > 0.


Taking limits (which exist because f is twice differentiable), we obtain 〈G(x) d, d〉 ≥ 0, where G(x) = ∇²f is the Hessian matrix, with (i, j)th entry ∂j ∂i f.

(⇐) Conversely, assume that G(x) = ∇F is positive semi-definite. We will show that F = ∇f is monotone. Notice that

F(x + d) − F(x) = ∫₀¹ G(x + αd) d dα.

This implies

〈F(x + d) − F(x), d〉 = ∫₀¹ 〈G(x + αd) d, d〉 dα ≥ 0,

since the integrand is non-negative. Therefore F is monotone.


3 Conjugate Functions

This chapter introduces the notion of a conjugate function, along with some basic properties.

Suppose epi(f) is a closed set. We will now consider a dual representation of this set.

Consider a point in epi(f): (x0, f(x0)) ∈ Rn+1, where f : Rn → R.

[Figure: the point (x0, f(x0)) on the boundary of epi(f).]

Notice that this point is on the boundary of epi(f). Therefore, we can find a point (z, c) such that

〈z, x0〉 + c f(x0) ≥ 〈z, y〉 + c r for all (y, r) ∈ epi(f). (3.2)

Notice that here c ≤ 0, since if y ∈ dom(f), r can be arbitrarily large.

Now, if c ≠ 0, we can find s and r such that fs,r(x) = 〈s, x〉 + r minorizes f(x) at x0. That is,

fs,r(x0) = f(x0),
fs,r(x) ≤ f(x) for all x.

To be concrete, we obtain s and r as follows. Dividing (3.2) by c < 0 (with y = x, r = f(x) there) gives

〈z/c, x0〉 + f(x0) ≤ 〈z/c, x〉 + f(x)
⇐⇒ 〈−z/c, x〉 + 〈z/c, x0〉 + f(x0) ≤ f(x),

so that s = −z/c and r = 〈z/c, x0〉 + f(x0).

The remaining question is: can we always find a (z, c) pair with c ≠ 0 such that (3.2) holds?

Fortunately, the answer is yes, and this is easier to see if we assume dom(f) = Rn. Note that, in this case, if (z, c) ≠ (0, 0) and (3.2) holds, then c = 0 implies that

〈z, x0〉 ≥ 〈z, y〉 for all y ∈ Rn.

But this inequality is not valid for y = 2z‖x0‖2/‖z‖2. Thus, we must have c ≠ 0. The general case is considered below.

Lemma 1. Suppose f : Rn → R is closed, convex. Then, there exist (z, c) with z ≠ 0, c ≠ 0 such that (3.2) holds.


Proof. To see that we can find a pair (z, c) with c ≠ 0, let xk ∈ relint(dom(f)) with xk → x. Then, we can find (sk, ck) such that

〈sk, xk〉 + ck f(xk) ≤ 〈sk, y〉 + ck f(y), for all y ∈ dom(f).

Here, if ck = 0, then

〈sk, xk − y〉 ≤ 0, ∀y ∈ dom(f).

But since xk ∈ relint(dom(f)), we also have xk + α(xk − y) ∈ dom(f) for sufficiently small α > 0. Using this point in place of y implies 〈sk, y − xk〉 ≤ 0, so that 〈sk, xk − y〉 = 0 for all y ∈ dom(f), which is a contradiction (here, we assume that sk is included in the subspace parallel to aff(dom(f))). Therefore, ck ≠ 0.

The foregoing discussion implies that, for a closed convex f : Rn → R, given any x0 ∈ Rn, we can find z ∈ Rn and r ∈ R such that the following two conditions hold:

〈z, x0〉 + r = f(x0),
〈z, x〉 + r ≤ f(x) for all x. (3.3)

This is depicted below.

[Figure: the affine function 〈z, ·〉 + r touching f at x0.]

Notice that, for a given z, there is a maximum value of r so that the above two conditions are valid. How can we find this value? Consider the following figure.

[Figure: the linear function 〈z, ·〉 and f; the minimum vertical distance between the two graphs determines r.]

In order to find the maximum r, we can look for the minimum vertical distance between f(x) and 〈z, x〉. That is, we set

r = infx f(x) − 〈z, x〉.


Observe that this definition implies the two conditions in (3.3).

In order to work with sup rather than inf, and to emphasize the dependence on z, we define

f∗(z) = −r = supx 〈z, x〉 − f(x).

The function f∗(z) is called the Fenchel conjugate of f, and thanks to the supremum operation, it is convex with respect to z – note that in the definition of f∗, x acts like an index. In addition to being convex, f∗ is also closed, because its epigraph is an intersection of closed sets (half spaces). The following inequality follows immediately from the definition.

Proposition 25 (Fenchel’s inequality). For a closed convex f : Rn → R, we have

f(x) + f∗(z) ≥ 〈z, x〉, for all x, z.

Consider now the conjugate of the conjugate:

f∗∗(x) = supz 〈z, x〉 − f∗(z).

Since, for each fixed z, the function x ↦ 〈z, x〉 − f∗(z) minorizes f(x), we have f∗∗(x) ≤ f(x). However, by the previous discussion, we also know that for any x∗, there is a pair (z∗, r∗), with r∗ = −f∗(z∗), such that 〈z∗, x∗〉 + r∗ = f(x∗). Therefore, we deduce that f∗∗(x) = f(x). Thus, we have shown the following.

Proposition 26. If f : Rn → R is a closed convex function, then

f(x) = supz 〈z, x〉 − f∗(z).

Example 14. Let f(x) = (1/2) xᵀQx, where Q is positive definite. Then,

f∗(z) = supx 〈z, x〉 − (1/2) xᵀQx.

The maximum is achieved at x∗ such that z = Qx∗. Plugging in x∗ = Q⁻¹z, we obtain

f∗(z) = zᵀQ⁻¹z − (1/2) zᵀQ⁻ᵀQQ⁻¹z = (1/2) zᵀQ⁻¹z. □
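Example 14 is easy to verify numerically (a sketch in Python, assuming numpy and scipy are available; Q and z below are arbitrary): compute the supremum defining f∗(z) with a generic optimizer and compare with the closed form (1/2) zᵀQ⁻¹z.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
A = rng.normal(size=(3, 3))
Q = A @ A.T + 3.0 * np.eye(3)        # a symmetric positive definite Q
z = rng.normal(size=3)

# f*(z) = sup_x <z, x> - 0.5 x^T Q x: maximize by minimizing the negative.
obj = lambda x: 0.5 * x @ Q @ x - z @ x
numeric = -minimize(obj, np.zeros(3)).fun
closed_form = 0.5 * z @ np.linalg.solve(Q, z)
print(numeric, closed_form)           # the two values agree (up to solver tolerance)
```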

Example 15. Let C be a convex set. The indicator function is defined as

iC(x) = 0, if x ∈ C;  ∞, if x ∉ C.

Its conjugate is

iC∗(z) = supx 〈z, x〉 − iC(x) = sup_{x∈C} 〈z, x〉 =: σC(z),

the support function of C. □


Some Properties of the Conjugate Function

(i) If g(x) = f(x − s), then

g∗(z) = supx 〈z, x〉 − f(x − s)
= supy 〈z, y + s〉 − f(y)
= f∗(z) + 〈z, s〉.

(ii) If g(x) = t f(x), with t > 0, then

g∗(z) = supx 〈z, x〉 − t f(x)
= t supx 〈z/t, x〉 − f(x)
= t f∗(z/t).

(iii) If g(x) = f(tx), then

g∗(z) = sup_{y=tx} 〈z, y/t〉 − f(y)
= f∗(z/t).

(iv) If A is invertible, and g(x) = f(Ax), then

g∗(z) = supx 〈z, x〉 − f(Ax)
= supy 〈A⁻ᵀz, y〉 − f(y)
= f∗(A⁻ᵀz).

(v) If g(x) = f(x) + 〈s0, x〉, then

g∗(z) = supx 〈z − s0, x〉 − f(x)
= f∗(z − s0).

(vi) If g(x1, x2) = f1(x1) + f2(x2), then

g∗(z1, z2) = sup_{x1,x2} 〈z1, x1〉 + 〈z2, x2〉 − f1(x1) − f2(x2)
= f1∗(z1) + f2∗(z2).

(vii) If g(x) = f1(x) + f2(x), then, writing f2(x) = supy 〈y, x〉 − f2∗(y),

g∗(z) = supx 〈z, x〉 − f1(x) − [supy 〈y, x〉 − f2∗(y)]
= supx infy 〈z, x〉 − f1(x) − 〈y, x〉 + f2∗(y).


We will later discuss that whenever there is a ‘saddle point’, we can exchange the order of inf and sup. Doing so, we obtain

g∗(z) = infy supx 〈z − y, x〉 − f1(x) + f2∗(y)
= infy f1∗(z − y) + f2∗(y).

The operation that appears in the last line is called infimal convolution.

(viii) More generally, if g(x) = f1(x) + f2(x) + · · · + fn(x), then

g∗(z) = inf_{z1,...,zn : z = z1+···+zn} f1∗(z1) + · · · + fn∗(zn).

Example 16. The last property will be useful when we consider multiple constraints. In particular, let C = A ∩ B, where A, B are convex sets. Then we have that

σC(x) = sup_{z∈C} 〈z, x〉 = infy σA(x − y) + σB(y).

To see this, note that σC∗(z) = iC(z). But iA∩B(z) = iA(z) + iB(z). Computing the conjugate via property (vii), we obtain

iC∗(x) = σC(x) = infy σA(x − y) + σB(y).


4 Duality

This chapter introduces the notion of duality. We start with a general discussion of duality. We then pass to Lagrangian duality, and finally consider the Karush-Kuhn-Tucker conditions of optimality.

4.1 A General Discussion of Duality

Consider a minimization problem like

min_{x∈C} f(x), where C is a closed convex set.

Suppose there exists a function K(x, λ) which is

(i) convex, as a function of x, for fixed λ,

(ii) concave, as a function of λ, for fixed x,

and which satisfies, for some closed convex set D,

max_{λ∈D} K(x, λ) = f(x), if x ∈ C;  ∞, if x ∉ C.

Example 17. Consider the problem

minx (1/2) ‖y − x‖2² + ‖x‖2, (4.4)

for x, y ∈ C = Rn. Here, by the Cauchy-Schwarz inequality, we can write

‖x‖2 = max_{λ∈B2} 〈x, λ〉,

where B2 is the unit ℓ2 ball of Rn. In this case, the problem (4.4) is equivalent to

minx max_{λ∈B2} (1/2) ‖y − x‖2² + 〈x, λ〉,

and the inner expression is the K(x, λ) of the discussion above. □

In short, “min_{x∈C} f(x)” is equivalent to

min_{x∈C} max_{λ∈D} K(x, λ).

We call (x∗, λ∗) a solution of this problem if

K(x∗, λ) ≤ K(x∗, λ∗) ≤ K(x, λ∗) for all x ∈ C, λ ∈ D.


If such a pair (x∗, λ∗) exists, it is called a saddle point of K(x, λ). See below for a depiction.

[Figure: a saddle-shaped surface K(x, λ), minimized over x and maximized over λ.]

This approach is useful especially if, for fixed λ, K(x, λ) is easy to minimize with respect to x. In that case, if λ = λ∗, then minimizing K(x, λ∗) over x ∈ C is equivalent to solving the problem.

Example 18. Suppose λ is fixed. Then, to minimize

(1/2) ‖y − x‖2² + 〈λ, x〉

over x, set the gradient to zero. This gives

x − y + λ = 0 ⇐⇒ x = y − λ. □

The question now is, how do we obtain λ∗? For that, define the function

g(λ) = min_{x∈C} K(x, λ).

Notice that

g(λ) ≤ K(x, λ) for all x ∈ C.

Consider now the problem

max_{λ∈D} g(λ). (4.5)

Exercise 9. Show that g(λ) is concave for λ ∈ D. □

Now let λ̄ ∈ arg max_{λ∈D} g(λ). Also, let x̄ ∈ arg min_{x∈C} K(x, λ̄). Then,

K(x̄, λ̄) ≤ K(x, λ̄) for x ∈ C,


and, for λ ∈ D,

g(λ̄) = K(x̄, λ̄) ≥ g(λ).

If, in addition, g(λ̄) = f(x̄) (which is the case when a saddle point exists), then K(x̄, λ) ≤ max_{λ∈D} K(x̄, λ) = f(x̄) = K(x̄, λ̄). Combining these, we obtain

K(x̄, λ) ≤ K(x̄, λ̄) ≤ K(x, λ̄) for x ∈ C, λ ∈ D.

Therefore, (x̄, λ̄) is a saddle point, and x̄ solves the problem

min_{x∈C} f(x).

We have shown that if a saddle point exists, we can obtain it either by solving a min-max or a max-min problem.

Here, (4.4) is called the primal problem and (4.5) is called the dual problem. Note that there might be different dual problems, depending on how we choose K(x, λ). Notice also that if a saddle point exists, we have

d∗ = max_{λ∈D} g(λ) = min_{x∈C} f(x) = p∗.

Example 19. Consider the problem

minx (1/2) ‖y − x‖2² + ‖x‖2.

An equivalent problem is

minx max_{λ∈B2} (1/2) ‖y − x‖2² + 〈x, λ〉,

where B2 is the unit ball of the ℓ2 norm. Let us define

g(λ) = minx (1/2) ‖y − x‖2² + 〈x, λ〉.

Carrying out the minimization (the minimizer is x = y − λ), we find

g(λ) = (1/2) ‖λ‖2² + 〈y − λ, λ〉 = −(1/2) ‖y − λ‖2² + c,

for some constant c that does not depend on λ. Therefore, the dual problem is

max_{λ∈B2} −‖y − λ‖2²,

or, equivalently,

min_{λ∈B2} ‖y − λ‖2².


The minimizer is the projection of y onto B2, and is given by

λ∗ = y, if ‖y‖2 ≤ 1;  λ∗ = y/‖y‖2, if ‖y‖2 > 1.

The solution of the primal problem is

x∗ = y − λ∗ = y − PB2(y) = 0, if ‖y‖2 ≤ 1;  x∗ = y (‖y‖2 − 1)/‖y‖2, if ‖y‖2 > 1.
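The map y ↦ x∗ derived above shrinks the magnitude of y by 1 and sets small inputs to zero; it is often called block (or vector) soft thresholding. A minimal sketch in Python, assuming numpy:

```python
import numpy as np

def shrink(y):
    """x* = y - P_{B2}(y): solves min_x 0.5 ||y - x||^2 + ||x||_2."""
    n = np.linalg.norm(y)
    return np.zeros_like(y) if n <= 1.0 else (1.0 - 1.0 / n) * y

y = np.array([3.0, 4.0])              # ||y|| = 5
print(shrink(y))                      # [2.4, 3.2]: magnitude shrunk from 5 to 4
print(shrink(np.array([0.3, 0.4])))   # [0., 0.]: small inputs are set to zero
```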

Notice that this discussion depends heavily on the existence of a saddle point. However, even if such a point does not exist, we can define a dual problem; but in this case, the maximum of the dual, d∗, and the minimum of the primal problem, p∗, are not necessarily equal. Instead, we will have d∗ ≤ p∗. To see this, note that

g(λ) ≤ K(x, λ) ≤ f(x) for all x, λ.

Therefore,

d∗ = g(λ∗) ≤ K(x∗, λ∗) ≤ f(x∗) = p∗.

Therefore, p∗ − d∗ is always nonnegative. This difference is called the duality gap.

The following proposition is a summary of the foregoing discussion. We provide a proof for the sake of completeness.

Proposition 27. Let K(x, λ) be convex-concave (convex in x for fixed λ, concave in λ for fixed x), and define

f(x) = supλ K(x, λ),
g(λ) = infx K(x, λ).

Then,

K(x∗, λ) ≤ K(x∗, λ∗) ≤ K(x, λ∗) for all x, λ (4.6)

if and only if

x∗ ∈ arg minx f(x),
λ∗ ∈ arg maxλ g(λ),
infx supλ K(x, λ) = supλ infx K(x, λ). (4.7)


Proof. (⇒) Suppose (4.6) holds. Then, since K(x∗, λ∗) = f(x∗) = g(λ∗), and since f(x) ≥ g(λ) for arbitrary x, λ, it follows that (4.7) holds too.

(⇐) Suppose (4.7) holds. Then, by the last equality in (4.7), we have f(x∗) = g(λ∗). Now by the definition of f(·), we have

f(x∗) ≥ K(x∗, λ), for any λ.

Similarly, by the definition of g(·), we obtain

g(λ∗) ≤ K(x, λ∗) for any x.

Using the definitions of f and g once again, we can write

f(x∗) ≥ K(x∗, λ∗) ≥ g(λ∗).

Since f(x∗) = g(λ∗), we obtain the desired inequality (4.6) by combining these inequalities.

4.2 Lagrangian Duality

We now discuss a specific dual, associated with the constrained minimization problem

minx f(x) subject to gi(x) ≤ 0, i = 1, . . . , m, (4.8)

where all of the functions are closed, convex, and defined on Rn.

In this setting, we define the Lagrangian function as

L(x, λ) = f(x) + λ1 g1(x) + λ2 g2(x) + · · · + λm gm(x), if λi ≥ 0 for all i;  −∞, if λi < 0 for some i.

Notice that

max_{λ≥0} L(x, λ) = f(x), if gi(x) ≤ 0 for all i;  ∞, otherwise.

Therefore, (4.8) is equivalent to

minx max_{λ≥0} L(x, λ).


Also, notice that for fixed x, L(x, λ) is concave (in fact affine) with respect to λ. It follows from the previous discussion that if (x∗, λ∗) is a saddle point of L(x, λ), then x∗ solves (4.8), and

x∗ ∈ arg minx L(x, λ∗) = arg minx f(x) + λ∗1 g1(x) + λ∗2 g2(x) + · · · + λ∗m gm(x).

Notice that the problem is transformed into an unconstrained problem, with the help of λ∗. To obtain λ∗, we consider the Lagrangian dual, defined as

g(λ) = minx L(x, λ).

The dual problem is

max_{λ≥0} g(λ).

If a saddle point exists, λ∗ is the solution of this dual problem. In that case, we obtain a minimizer (which need not be unique) as

x ∈ arg minx L(x, λ∗).

Example 20. Consider the problem

minx x subject to x² + 2x ≤ 0.

The dual function is

g(λ) = minx x + λ(x² + 2x),

where the expression being minimized is L(x, λ) for λ ≥ 0. At the minimum we must have

1 + λ(2x + 2) = 0 ⇐⇒ x = −1/(2λ) − 1.

Plugging in, we obtain

g(λ) = −1/(2λ) − 1 + λ (1/(4λ²) + 1 + 1/λ − 1/λ − 2) = −λ − 1/(4λ) − 1.

λ∗ satisfies

−1 + 1/(2λ∗)² = 0.

Solving for λ∗, and taking the positive root (since λ∗ ≥ 0), we find λ∗ = 1/2. Therefore,

x∗ = arg minx x + (1/2)(x² + 2x),

which can be easily solved to give x∗ = −2. Note that we can see this easily if we sketch the constraint function. □
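A quick numeric check of Example 20 (a sketch in Python, assuming numpy): maximize g(λ) on a grid and compare with the primal optimum over the feasible set [−2, 0].

```python
import numpy as np

lam = np.linspace(0.01, 2.0, 2000)
g = -lam - 1.0 / (4.0 * lam) - 1.0      # the dual function derived above
k = np.argmax(g)
print(lam[k], g[k])                      # ~0.5 and ~-2.0: lambda* = 1/2, d* = -2

x = np.linspace(-3.0, 1.0, 4001)
feasible = x[x**2 + 2.0 * x <= 0]        # the constraint set is [-2, 0]
print(feasible.min())                    # -2.0 = p*: the duality gap is zero
```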


So far, the discussion relied on the assumption that the Lagrangian has a saddle point, so that the duality gap is zero. A natural question is to ask when this can be guaranteed. The conditions that ensure the existence of a saddle point are called constraint qualifications. A simple one to state is Slater’s condition.

Proposition 28 (Slater’s condition). Suppose f(·) and gi(·) for i = 1, 2, . . . , n are convex functions. Consider the problem

minx f(x), s.t. gi(x) ≤ 0, ∀i. (4.9)

Suppose there exists x̄ such that gi(x̄) ≤ 0 for all i, and gj(x̄) < 0 for some j. Then, x∗ solves (4.9) if and only if there exists λ∗ ≥ 0 such that (x∗, λ∗) is a saddle point of the function

L(x, λ) = f(x) + ∑_{i=1}^n λi gi(x), if λi ≥ 0 for all i;  −∞, otherwise.

Proof. For simplicity, we assume that n = 1, so there is only one constraint function g(·). Assume that x∗ solves (4.9). Consider the sets

A = {(z1, z2) : (z1, z2) ≥ (f(x), g(x)) for some x},
B = {(z1, z2) : (z1, z2) < (f(x∗), 0)},

where the inequalities hold componentwise.

[Figure: the sets A and B in the (f, g) plane, separated by a line through (f(x∗), 0).]

It can be shown that both A and B are convex (show this!). Further, we have A ∩ B = ∅ (show this!). Therefore, there exist µ1, µ2 such that

µ1 z1 + µ2 z2 ≤ µ1 t1 + µ2 t2 for all z ∈ B, t ∈ A.

Note here that µ2 ≥ 0, since z2 can be taken as small as desired. Similarly, µ1 ≥ 0, since z1 can be taken as small as desired. Also notice that µ1 ≠ 0, since otherwise we would have 0 ≤ g(x) for all x, but we already know that g(x̄) < 0. Therefore, for λ∗ = µ2/µ1, we can write

z1 + λ∗ z2 ≤ t1 + λ∗ t2, for all z ∈ B, t ∈ A.


Consequently,

f(x∗) ≤ f(x) + λ∗ g(x) for all x, with λ∗ ≥ 0.

In particular, we have that f(x∗) ≤ f(x∗) + λ∗ g(x∗). Since g(x∗) ≤ 0 and λ∗ ≥ 0, it follows that λ∗ g(x∗) = 0. So, we have

f(x) + λ∗ g(x) ≥ f(x∗) = f(x∗) + λ∗ g(x∗) ≥ f(x∗) + λ g(x∗) for all λ ≥ 0.

Thus, (x∗, λ∗) is a saddle point.

4.3 Karush-Kuhn-Tucker (KKT) Conditions

Suppose now that Slater’s condition is satisfied and that x∗ solves the problem, so that (x∗, λ∗) is a saddle point of the Lagrangian L(x, λ). Notice that in this case, x∗ is a minimizer for the problem

minx f(x) + λ∗1 g1(x) + λ∗2 g2(x) + · · · + λ∗m gm(x).

If all of the functions above are differentiable, we can write

∇f(x∗) + λ∗1 ∇g1(x∗) + λ∗2 ∇g2(x∗) + · · · + λ∗m ∇gm(x∗) = 0.

Moreover, since λ∗ maximizes L(x∗, ·) over λ ≥ 0, we have that if gi(x∗) < 0, then we must have λ∗i = 0. Therefore, λ∗i gi(x∗) = 0 for all i. Collected together, these conditions are known as the KKT conditions:

λ∗i ≥ 0,
gi(x∗) ≤ 0,
λ∗i gi(x∗) = 0, (‘complementary slackness’)
∇f(x∗) + ∑i λ∗i ∇gi(x∗) = 0. (4.10a)

By the above discussion, these conditions are necessary for optimality. It turns out that they are also sufficient. To see this, first observe that (4.10a) implies that x∗ minimizes the convex function L(·, λ∗), so that

g(λ∗) = L(x∗, λ∗).

But since λ∗i gi(x∗) = 0, we have

L(x∗, λ∗) = f(x∗) + ∑i λ∗i gi(x∗) = f(x∗).

Therefore, g(λ∗) = L(x∗, λ∗) = f(x∗), i.e., (x∗, λ∗) is a saddle point of L(x, λ). Thus x∗ solves the primal problem.
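The four conditions are mechanical to check on a concrete instance. Below is a minimal sketch (plain Python) verifying them for Example 20, where x∗ = −2 and λ∗ = 1/2:

```python
# KKT check for Example 20: min x s.t. g(x) = x^2 + 2x <= 0,
# with the candidate pair x* = -2, lambda* = 1/2 found earlier.
x, lam = -2.0, 0.5
g = x**2 + 2 * x                    # = 0: the constraint is active
grad_f, grad_g = 1.0, 2 * x + 2     # derivatives of f(x) = x and g(x)

assert lam >= 0                     # dual feasibility
assert g <= 0                       # primal feasibility
assert lam * g == 0                 # complementary slackness
assert grad_f + lam * grad_g == 0   # stationarity
print("all KKT conditions hold at (x*, lambda*) = (-2, 1/2)")
```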


5 Subdifferentials

A convex function does not have to be differentiable. However, even when it is not differentiable, there is considerable structure in how it varies locally. Subdifferentials generalize derivatives (or gradients) and capture this structure for convex functions. In addition, they enjoy a certain calculus, which proves very useful in deriving minimization algorithms. We introduce and discuss some basic properties of subdifferentials in this chapter. We start with the motivation underlying the definitions and basic properties. We then explore connections with the KKT conditions. Finally, we discuss the notion of monotonicity, a fundamental property of subdifferentials.

5.1 Motivation, Definition, Properties of Subdifferentials

Recall that if f : Rn → R is differentiable and convex, then

f(x) ≥ f(y) + 〈x − y, ∇f(y)〉

for all x, y. In fact, for fixed y, s = ∇f(y) is the unique vector that satisfies the inequality below:

f(x) ≥ f(y) + 〈x − y, s〉 for all x. (5.11)

[Figure: f and the tangent line f(y) + 〈x − y, ∇f(y)〉 at y.]

This useful observation has the shortcoming that it requires f to be differentiable. In general, a convex function may not be differentiable – consider for instance f(x) = |x|. We define the subdifferential by making use of the inequality (5.11).

Definition 19. Let f : Rn → R be convex. The subdifferential of f at y is the set of s that satisfy

f(x) ≥ f(y) + 〈x − y, s〉 for all x.

This set is denoted by ∂f(y). □

Since for every y ∈ dom(f), we can find r, s such that

(i) r + 〈s, y〉 = f(y),


(ii) r + 〈s, x〉 ≤ f(x) for all x,

it follows that ∂f(y) is non-empty. ∂f(·) has other interesting properties as well.

Proposition 29. For every y ∈ dom(f), ∂f(y) is a convex set.

Proof. Suppose s1, s2 ∈ ∂f(y) and α ∈ [0, 1]. Then, we have

f(y) + 〈x − y, s1〉 ≤ f(x),
f(y) + 〈x − y, s2〉 ≤ f(x).

Taking a convex combination of these inequalities, we find

f(y) + 〈x − y, αs1 + (1 − α)s2〉 ≤ f(x).

Example 21. Let f(x) = |x|. Then

∂f(x) = {−1}, if x < 0;  [−1, 1], if x = 0;  {1}, if x > 0.

Notice that in the example above, wherever the function is differentiable, the subdifferential contains only the derivative. This holds in general.

Proposition 30. If f : Rn → R is differentiable at x, then ∂f(x) = {∇f(x)}.

Proof. Suppose s ∈ ∂f(x). In this case, for any d ∈ Rn and t > 0, we have

f(x + t d) ≥ f(x) + 〈s, t d〉 ⇐⇒ 〈s, d〉 ≤ [f(x + td) − f(x)] / t for all t > 0, d.

Now let t → 0. The inequality above implies

〈s, d〉 ≤ 〈∇f(x), d〉, for all d. (5.12)

Similarly, notice that

〈s, −d〉 ≤ [f(x − td) − f(x)] / t for all t > 0, d.

Therefore,

〈s, −d〉 ≤ 〈∇f(x), −d〉, for all d. (5.13)

Taken together, (5.12), (5.13) imply 〈s, d〉 = 〈∇f(x), d〉 for all d. Thus, s = ∇f(x).


The subdifferential can be used to characterize the minimizers of convex functions.

Proposition 31. Let f : Rn → R be convex. Then,

x∗ ∈ arg minx f(x)

if and only if

0 ∈ ∂f(x∗).

Proof. This follows from the equivalences

x∗ ∈ arg minx f(x)
⇐⇒ f(x∗) ≤ f(x) for all x,
⇐⇒ f(x) ≥ f(x∗) + 〈0, x − x∗〉 for all x,
⇐⇒ 0 ∈ ∂f(x∗).

Proposition 32. Let f, g : Rn → R be convex functions. Also let h = f + g. If x ∈ dom(f) ∩ dom(g), then

∂h(x) = ∂f(x) + ∂g(x).

Proof. Write A = ∂f(x), B = ∂g(x), C = ∂h(x). Let s1 ∈ A, s2 ∈ B. Then,

(f(x) + 〈y − x, s1〉) + (g(x) + 〈y − x, s2〉) ≤ f(y) + g(y).

Therefore s1 + s2 ∈ C. Since s1 and s2 are arbitrary members of A, B, it follows that A + B ⊂ C.

For the converse (i.e., C ⊂ A + B), we need to show that for any z ∈ ∂h(x) we can find u ∈ ∂f(x) such that z − u ∈ ∂g(x). This is equivalent to saying that we can find u ∈ ∂f(x), v ∈ ∂g(x) such that z = u + v. We take a detour to show this result.

Let us now study the link between conjugate functions and subdifferentials. This will complete the remaining part of the proof of Prop. 32.

Proposition 33. Let f : Rn → R be a closed, convex function and f∗ denote its conjugate. Then s ∈ ∂f(x) if and only if

f(x) + f∗(s) = 〈s, x〉. (5.14)


Proof. Since f(x) = supz 〈z, x〉 − f∗(z), we have

f(x) + f∗(z) ≥ 〈z, x〉, for all z, x. (5.15)

Consider now the following chain of equivalences:

s ∈ ∂f(x)
⇐⇒ f(x) − 〈s, x〉 ≤ f(y) − 〈s, y〉 for all y
⇐⇒ f(x) − 〈s, x〉 ≤ infy f(y) − 〈s, y〉 = −supy [〈s, y〉 − f(y)] = −f∗(s). (5.16)

Combining the two inequalities (5.15), (5.16), we obtain (5.14).

Corollary 4. x ∈ ∂f∗(s) if and only if f(x) + f∗(s) = 〈x, s〉.

Corollary 5. 0 ∈ ∂f(x) if and only if x ∈ ∂f∗(0).

Proof. Consider the following chain of equivalences:

0 ∈ ∂f(x) ⇐⇒ f(x) + f∗(0) = 〈0, x〉 ⇐⇒ x ∈ ∂f∗(0).

Example 22. Recall the definition of a support function: for a closed convex set C, let σC(x) = sup_{z∈C} 〈z, x〉. Recall also that σC∗(s) = iC(s). Therefore,

s ∈ ∂σC(x) ⇐⇒ σC(x) + iC(s) = 〈s, x〉 ⇐⇒ s ∈ C and σC(x) = 〈s, x〉.

In words, ∂σC(x) is the set of s ∈ C for which σC(x) = 〈s, x〉.

[Insert Fig. on p37] □

Example 23. Consider now a closed convex set C, and let us obtain a description of the subdifferential of the indicator function of C at x, i.e., ∂iC(x).

Observe that

s ∈ ∂iC(x) ⇐⇒ iC(y) ≥ iC(x) + 〈s, y − x〉, for all y.

If x ∉ C, the inequality is not satisfied for any s; in this case, ∂iC(x) = ∅. If x ∈ C, then there are two cases to consider:

(i) If y ∉ C, then iC(y) = ∞, and the inequality is always satisfied.

(ii) If y ∈ C, then

0 ≥ 〈s, y − x〉, ∀y ∈ C ⇐⇒ s ∈ NC(x).


In summary,

∂iC(x) = NC(x), if x ∈ C;  ∅, if x ∉ C.

See below for the case when C is a rectangular set.

[Figure: a rectangle C, a point x on its boundary, and the translated subdifferential x + ∂iC(x).]

We now go back to Prop. 32, and complete the proof. Recall that what remains is to show that ∂h(x) ⊂ ∂f(x) + ∂g(x), where h(x) = f(x) + g(x) for convex functions f(x), g(x).

Proof of Prop. 32, cont’d: Let u ∈ ∂h(x). This implies, by Prop. 33,

h(x) + h∗(u) = 〈x, u〉.

Plugging in the definition of h, and using the infimal convolution expression for h∗ (property (vii) of the conjugate), we can write

f(x) + g(x) + infz [f∗(z) + g∗(u − z)] = 〈x, u〉.

Suppose the infimum is achieved when z = s. Then,

[f(x) + f∗(s) − 〈x, s〉] + [g(x) + g∗(u − s) − 〈u − s, x〉] = 0.

Both terms in square brackets are non-negative by Fenchel’s inequality. Therefore, for equality to hold, we need both terms to be zero. Thus, we can write

f(x) + f∗(s) = 〈x, s〉,
g(x) + g∗(u − s) = 〈u − s, x〉.

Thus, by Prop. 33, s ∈ ∂f(x) and u − s ∈ ∂g(x), so that

u = s + (u − s) ∈ ∂f(x) + ∂g(x).

Since u was arbitrary, this implies ∂h(x) ⊂ ∂f(x) + ∂g(x).


From this discussion, we obtain the following corollary concerning constrained minimization.

Corollary 6. Consider the problem

min_{x∈C} f(x),

where f is a convex function and C is a convex set. x∗ is a solution of this problem if and only if

0 ∈ ∂f(x∗) + NC(x∗).

5.2 Connection with the KKT Conditions

Let us now study what the condition stated in Corollary 6 means. Let us consider the problem

minx f(x) subject to g1(x) ≤ 0, g2(x) ≤ 0,

where the gi are convex and differentiable. Let Ci = {x : gi(x) ≤ 0}. Note that both Ci’s are convex sets. Suppose gi(x) < 0. Then NCi(x) = {0}. However, if gi(x) = 0, then

NCi(x) = {s : 〈s, y − x〉 ≤ 0 for all y with gi(y) ≤ 0}.

That is, NCi(x) = {α ∇gi(x) : α ≥ 0}. Also, if gi(x) > 0, then NCi(x) = ∅. Therefore, the condition

0 ∈ ∇f(x) + NC1(x) + NC2(x)

is equivalent to

0 = ∇f(x) + α1 ∇g1(x) + α2 ∇g2(x), where αi ≥ 0, αi gi(x) = 0, gi(x) ≤ 0.

These are precisely the KKT conditions.

5.3 Monotonicity of the Subdifferential

Recall that for a convex differentiable f, we had

〈∇f(x) − ∇f(y), x − y〉 ≥ 0, for all x, y.

A similar property holds for the subdifferential.


Proposition 34. Suppose f : Rn → R is convex. If s ∈ ∂f(x) and z ∈ ∂f(y), then

〈s − z, x − y〉 ≥ 0. (5.17)

Proof. Observe that

s ∈ ∂f(x) =⇒ f(y) ≥ f(x) + 〈s, y − x〉,
z ∈ ∂f(y) =⇒ f(x) ≥ f(y) + 〈z, x − y〉.

Summing the inequalities, we obtain (5.17).

Definition 20. An operator T : Rn → 2^{Rn} is said to be monotone if 〈s − z, x − y〉 ≥ 0, for all x, y, and s ∈ T(x), z ∈ T(y). □

It is useful to think of set-valued operators in terms of their graphs.

Definition 21. The graph of T : Rn → 2^{Rn} is the set of pairs (u, v) such that v ∈ T(u). □

Notice that for a convex function f, the graph of ∂f is the set of (x, u) such that x ∈ dom(f) and u ∈ ∂f(x).

A curious property satisfied by ∂f(x) is ‘maximal’ monotonicity.

Definition 22. T : Rn → 2^{Rn} is said to be maximal monotone if there is no monotone operator F such that the graph of T is a strict subset of the graph of F. □

Example 24. For f(x) = |x|, the graph of ∂f(x) is shown below.

[Insert Fig on p.41] □

We state the following fact without proof. See Rockafellar’s book for a proof.

Proposition 35. If T = ∂f for a closed convex function f , then T is maximalmonotone.

Let us now revisit a minimization problem like min_x f(x), for a convex f. In terms of the subdifferential of f, this is equivalent to looking for x such that 0 ∈ T(x), where T = ∂f. An equivalent problem is to find x such that x ∈ x + λT(x), for λ > 0.

Definition 23. Given S : Rn → 2^Rn, S⁻¹ is the operator whose graph is the set of (u, v) where u ∈ S(v). □

Note that, by the foregoing discussion, min_x f(x) is equivalent to finding x such that x = (I + λT)⁻¹ x. In the following, we study the properties of JλT = (I + λT)⁻¹.

Let us first write down our previous observation as a proposition.


Proposition 36. If f(z) ≤ f(x) for all x, then JλT (z) = z.

Definition 24. An operator U : Rn → Rn is said to be firmly-nonexpansive if

‖U(x) − U(y)‖₂² + ‖(I − U)(x) − (I − U)(y)‖₂² ≤ ‖x − y‖₂².

Proposition 37. If T is monotone, then (I + T )−1 is firmly non-expansive.

Proof. Note that

‖(I − J)x − (I − J)y‖₂² = ‖x − y‖₂² + ‖Jx − Jy‖₂² − 2〈Jx − Jy, x − y〉.

Therefore, if we can show that

〈Jx − Jy, x − y〉 ≥ ‖Jx − Jy‖₂², (5.18)

we are done.

Now suppose

x = u + v, with u ∈ T(v),
y = z + t, with z ∈ T(t).

Then, J(x) = v and J(y) = t. This implies

〈Jx − Jy, x − y〉 = 〈v − t, u + v − z − t〉 = ‖v − t‖₂² + 〈v − t, u − z〉.

The last term is non-negative by the monotonicity of T. Also, since ‖Jx − Jy‖₂² = ‖v − t‖₂², (5.18) follows.

Alternative Proof:

T is monotone
⇐⇒ 〈x′ − x, y′ − y〉 ≥ 0 for all (x, y), (x′, y′) ∈ T,
⇐⇒ 〈(x′ + y′) − (x + y), x′ − x〉 ≥ ‖x′ − x‖₂² for all (x, y), (x′, y′) ∈ T.

Proposition 38. Suppose f : Rn → R is differentiable, convex, and

‖∇f(x) − ∇f(y)‖₂ ≤ L‖x − y‖₂, (5.19)

for any x, y pair. Then,

f(x) ≤ f(y) + 〈∇f(y), x − y〉 + (L/2)‖x − y‖₂².


Proof. Consider the function g(s) = f(y + s(x − y)). Observe that g(0) = f(y), g(1) = f(x), and

g(1) − g(0) = ∫₀¹ g′(s) ds

= ∫₀¹ 〈∇f(y + s(x − y)), x − y〉 ds

≤ ∫₀¹ |〈∇f(y + s(x − y)) − ∇f(y), x − y〉| + 〈∇f(y), x − y〉 ds

≤ ∫₀¹ ‖∇f(y + s(x − y)) − ∇f(y)‖₂ ‖x − y‖₂ + 〈∇f(y), x − y〉 ds

≤ ∫₀¹ Ls‖x − y‖₂² + 〈∇f(y), x − y〉 ds

= (L/2)‖x − y‖₂² + 〈∇f(y), x − y〉,

where we applied the Cauchy-Schwarz inequality and (5.19) to obtain the last two inequalities.

Now suppose we want to minimize f and it satisfies the hypothesis of Prop. 38, namely (5.19). We will derive an algorithm which will start from some initial point x0 and produce a sequence that converges to a minimizer.

Suppose that at the kth step we have xk, and we set xk+1 as

xk+1 = arg min_x Fk(x), where Fk(x) = f(xk) + 〈∇f(xk), x − xk〉 + (α/2)‖x − xk‖₂².

Observe that

(i) Fk(xk) = f(xk),

(ii) Fk(x) ≥ f(x) for all x, provided that α ≥ L (by Prop. 38).

[Figure: the majorizer Fk(·) lies above f(·) and touches it at xk.]

Therefore,

f(xk+1) ≤ Fk(xk+1) ≤ Fk(xk) = f(xk).

In words, at each iteration, we reduce the value of f. Let us derive what xk+1 is. Observe that

0 = ∇f(xk) + α(xk+1 − xk).


This is equivalent to

xk+1 = xk − (1/α)∇f(xk).

This is an instance of the steepest-descent algorithm with a fixed step-size.
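As a quick sanity check, the following Python sketch runs this fixed-step steepest descent on a least-squares objective (the example, problem sizes, and names are my own; any α ≥ L is an admissible step parameter).

import numpy as np

# Our toy objective: f(x) = 0.5 * ||H x - y||_2^2, whose gradient is
# H^T (H x - y) and is Lipschitz with constant L = lambda_max(H^T H).
rng = np.random.default_rng(0)
H = rng.standard_normal((20, 5))
y = rng.standard_normal(20)

L = np.linalg.eigvalsh(H.T @ H).max()     # Lipschitz constant of grad f
alpha = L                                 # any alpha >= L keeps F_k a majorizer

x = np.zeros(5)
for _ in range(500):
    x = x - (H.T @ (H @ x - y)) / alpha   # x_{k+1} = x_k - (1/alpha) grad f(x_k)

# Compare with the least-squares solution; this should print True.
print(np.allclose(x, np.linalg.lstsq(H, y, rcond=None)[0], atol=1e-6))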


6 Applications to Algorithms

We now consider the application of the foregoing discussion of convex analysis to algorithms. We start with the proximal point algorithm. Although the proximal point algorithm in its first form is almost never used in minimization algorithms, its convergence proof contains key ideas that are relevant for more complicated algorithms. In fact, the algorithms discussed in the sequel can be written as proximal point algorithms. We then discuss firmly non-expansive operators, which may be regarded as building blocks for developing convergent algorithms. Following this, we discuss the augmented Lagrangian, the Douglas-Rachford algorithm, and the alternating direction method of multipliers algorithm (ADMM). The last section considers a generalization of the proximal point algorithm, along with an application to a saddle point problem. I hope to add more algorithms to this collection...

6.1 The Proximal Point Algorithm

Consider the minimization of a convex function f(x). The proximal point algorithm constructs a sequence xk that converges to a minimizer. The sequence is defined as

xk+1 = arg min_x Fk(x), where Fk(x) = (1/(2α))‖x − xk‖₂² + f(x). (6.20)

The function Fk(x) is a perturbation of f(x) around xk. The quadratic term ensures that xk+1 is close to xk.

In fact, this algorithm may actually be regarded as an MM algorithm since

• Fk(xk) = f(xk),

• Fk(x) ≥ f(x) for all x.

It follows immediately that

f(xk+1) ≤ Fk(xk+1) ≤ Fk(xk) = f(xk)

In terms of subdifferentials, we have

0 ∈ (xk+1 − xk) + α ∂f(xk+1).

Equivalently, we can write

xk ∈ (I + α∂f)xk+1.

Thus, at the kth step of PPA, we are essentially computing the inverse of (I + α∂f) at xk. Notice that, since Fk(x) in (6.20) is strictly convex, xk+1 is uniquely defined. Therefore, (I + α∂f)⁻¹ is a single-valued operator.


Definition 25. For a convex f, the operator Jf = (I + ∂f)⁻¹ is called the proximity operator of f. □
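For instance (an example of mine, not from the notes), for f(u) = λ|u| the proximity operator has the well-known closed form called soft-thresholding; the sketch below verifies this numerically against a brute-force minimization of the defining objective.

import numpy as np

# For f(u) = lam * |u|, the proximity operator J_f(x) = (I + df)^{-1}(x)
# is soft-thresholding (this closed form is standard; the check is ours).
def prox_abs(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

# J_f(x) is also the minimizer of 0.5*(u - x)^2 + lam*|u|; compare on a grid.
x, lam = 1.3, 0.5
grid = np.linspace(-3.0, 3.0, 200001)
brute = grid[np.argmin(0.5 * (grid - x) ** 2 + lam * np.abs(grid))]
print(prox_abs(x, lam), brute)           # both approximately 0.8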

The proximity operator of a convex function acts like a projection operator in the following sense.

Proposition 39. For a proximity operator Jαf , we have

〈Jαf(x) − Jαf(y), x − y〉 ≥ ‖Jαf(x) − Jαf(y)‖₂². (6.21)

Proof. Let x′ = Jαf(x) and y′ = Jαf(y). Then, we have

x ∈ (I + α∂f)x′ ⇐⇒ (x − x′) ∈ α∂f(x′),
y ∈ (I + α∂f)y′ ⇐⇒ (y − y′) ∈ α∂f(y′).

Thanks to the monotonicity of α∂f, we can therefore write

〈x′ − y′, (x − x′) − (y − y′)〉 ≥ 0 ⇐⇒ 〈x′ − y′, x − y〉 ≥ ‖x′ − y′‖₂²,

which is what we wanted to show.

The property in (6.21) is a key observation so we give it a name.

Definition 26. An operator F is said to be firmly-nonexpansive if

〈Fx − Fy, x − y〉 ≥ ‖Fx − Fy‖₂².

Thus, proximity operators are firmly-nonexpansive.
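A quick numerical sanity check of this fact, applied to the soft-thresholding operator from the earlier example (the test itself is mine):

import numpy as np

# Check the inequality of Definition 26 on random pairs, for the proximity
# operator of 0.5*||.||_1 (soft-thresholding), applied componentwise.
rng = np.random.default_rng(7)
prox = lambda v: np.sign(v) * np.maximum(np.abs(v) - 0.5, 0.0)

for _ in range(1000):
    x, y = rng.standard_normal(4), rng.standard_normal(4)
    lhs = np.dot(prox(x) - prox(y), x - y)
    rhs = np.sum((prox(x) - prox(y)) ** 2)
    assert lhs >= rhs - 1e-12            # firm nonexpansiveness holds
print("firm nonexpansiveness held on all random pairs")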

Let us now make another observation regarding the proximity operator. Suppose x∗ minimizes f. For α > 0, this is equivalent to

0 ∈ ∂f(x∗)

⇐⇒ 0 ∈ α ∂f(x∗)

⇐⇒ x∗ ∈ x∗ + α ∂f(x∗)

⇐⇒ x∗ = Jαf (x∗).

The last equality is a very special one.

Definition 27. A point x is said to be a fixed point of an operator F if x = F(x). □

We have thus shown the following.


Proposition 40. A point x∗ minimizes the convex function f if and only if x∗ = Jαf(x∗) for all α > 0.

The following theorem is known as the Krasnoselskii-Mann theorem.

Theorem 1 (Krasnoselskii-Mann). Suppose F is a firmly-nonexpansive mapping on RN and its set of fixed points is non-empty. Let x0 ∈ RN. If the sequence xk is defined as xk+1 = F xk, then xk converges to a fixed point of F.

Before we prove the theorem, let us state an auxiliary result of interest that provides an alternative definition of firm-nonexpansivity.

Lemma 2. For an operator F, the following conditions are equivalent.

• 〈Fx − Fy, x − y〉 ≥ ‖Fx − Fy‖₂²,

• ‖Fx − Fy‖₂² + ‖(I − F)x − (I − F)y‖₂² ≤ ‖x − y‖₂².

Proof. Exercise! Hint: Expand the expression

‖Fx − Fy‖₂² + ‖(x − y) − (Fx − Fy)‖₂².

We are now ready for the proof of the Krasnoselskii-Mann Theorem.

Proof of the Krasnoselskii-Mann Theorem. Pick some z such that z = Fz. We have

‖Fxk − z‖₂² = ‖Fxk − Fz‖₂²
≤ ‖xk − z‖₂² − ‖(I − F)xk − (I − F)z‖₂²,

where (I − F)z = 0. Therefore,

‖xk+1 − z‖₂² ≤ ‖xk − z‖₂² − ‖xk − xk+1‖₂². (6.22)

Summing this inequality from k = 0 to n, and rearranging, we obtain

‖xn+1 − z‖₂² ≤ ‖x0 − z‖₂² − ∑_{k=0}^{n} ‖xk − xk+1‖₂².

From this inequality, we obtain

∑_{k=0}^{n} ‖xk − xk+1‖₂² ≤ ‖x0 − z‖₂².


Therefore, ‖xk − xk+1‖2 → 0, which implies (I − F )xk → 0.

Also, since ‖xn − z‖₂ ≤ ‖x0 − z‖₂ for any n, the sequence xk is bounded. Therefore, we can pick a convergent subsequence {xkn}n with limit, say, x∗. By the continuity of I − F, we have

0 = lim_{n→∞} (I − F)xkn = (I − F)x∗.

Therefore x∗ = Fx∗. If we now plug in z = x∗ in (6.22), we have

‖xkn+m − x∗‖₂ ≤ ‖xkn − x∗‖₂ for all m > 0.

Since ‖xkn − x∗‖₂ can be made arbitrarily small by choosing n large enough, it follows that xk → x∗.

An immediate corollary of this general result (which we will refer to later) is the convergence of PPA.

Corollary 7. The sequence xk constructed by the proximal point algorithm in (6.20) converges to a minimizer of f, for any α > 0.
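To illustrate Corollary 7 (with an example of my own choosing), the sketch below runs PPA on f(x) = (1/2)‖Hx − y‖₂², for which the proximal step (6.20) reduces to a linear solve.

import numpy as np

# PPA for f(x) = 0.5*||H x - y||^2: step (6.20) becomes
# x_{k+1} = (I + alpha H^T H)^{-1} (x_k + alpha H^T y).
rng = np.random.default_rng(1)
H = rng.standard_normal((30, 8))
y = rng.standard_normal(30)
alpha = 10.0                              # any alpha > 0, per Corollary 7

A = np.eye(8) + alpha * H.T @ H           # matrix of the proximal linear system
x = np.zeros(8)
for _ in range(200):
    x = np.linalg.solve(A, x + alpha * H.T @ y)

# Compare with the least-squares minimizer; this should print True.
print(np.allclose(x, np.linalg.lstsq(H, y, rcond=None)[0], atol=1e-8))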

Thanks to the Krasnoselskii-Mann theorem, firmly-nonexpansive operators play a central role in the convergence study of a number of algorithms. We now provide a brief study of these operators. We state the results in their most general form, for later reference.

6.2 Firmly-Nonexpansive Operators

We already saw that proximity operators are firmly-nonexpansive. However, not all firmly-nonexpansive operators are proximity operators. Firmly-nonexpansive operators can be generated from monotone, or maximal monotone, operators. Specifically, suppose T is a monotone operator with domain RN and consider SαT = (I + αT)⁻¹. Let us name this operator, for it has interesting properties that will be useful later.

Definition 28. For a monotone operator T, the operator ST = (I + T)⁻¹ is called the resolvent of T. □

We pose two questions regarding resolvent operators.

• Is SαT defined everywhere?

• Is SαT single-valued?

We noted earlier that for a convex function f, if T = ∂f, then the answer to both questions is affirmative. However, for a general monotone operator, SαT may not be defined everywhere. Let us consider an example to demonstrate this.


Example 25. Suppose we define T : R → R as the unit step function, i.e.,

T(x) = { 0, if x ≤ 0,
         1, if x > 0.

Consider the operator I + T. Observe that

(I + T)(x) = { x, if x ≤ 0,
               x + 1, if x > 0.

Notice that the interval (0, 1] is not included in the range of I + T. Therefore, ST = (I + T)⁻¹ is not defined for any point in (0, 1]. □

Nevertheless, in the range of I + αT , the inverse (I + αT )−1 is single-valued.

Proposition 41. Suppose T is a monotone mapping and α > 0. Then, for any x in the range of the operator I + αT, we can find a unique z such that (I + αT)z = x. That is, (I + αT)⁻¹ is a single-valued operator on its domain.

Proof. First, observe that for α > 0, the operator T is monotone if and only if αT is monotone. Therefore, without loss of generality, we take α = 1.

We first show that for w ∈ Tu and z ∈ Ty, if u + w = y + z, then u = y. To see this, observe that

u + w − y − z = 0

implies

‖(u − y) + (w − z)‖₂² = 0,

which is equivalent to

‖u − y‖₂² + ‖w − z‖₂² + 2〈u − y, w − z〉 = 0. (6.23)

But the inner product term in (6.23) is non-negative thanks to the monotonicity of T. Therefore, in order for equality to hold in (6.23), we must have u = y and w = z.

Now, if x is in the range of I + αT, then there is a point v and u ∈ αTv such that u + v = x. But the observation above implies that if this is the case, then v is unique and, in fact, v = (I + αT)⁻¹x.

We noted that for a monotone operator T, the range of I + αT may be a strict subset of Rn. This restricts the domain of SαT. For maximal monotone operators, the range of I + αT is in fact the whole RN. This non-trivial result (which we will not prove) is known as Minty's theorem.


Theorem 2 (Minty's Theorem). Suppose T is a maximal monotone operator defined on RN and α > 0. Then, the range of I + αT is RN.

To demonstrate Minty's theorem, let us consider the maximal monotone operator whose graph is a superset of the graph of the operator in Example 25.

Example 26. Suppose the set-valued operator T is defined as

T x = { {0}, if x < 0,
        [0, 1], if x = 0,
        {1}, if x > 0.

Consider now the operator I + T. Observe that

(I + T)(x) = { {x}, if x < 0,
               [0, 1], if x = 0,
               {x + 1}, if x > 0.

Notice that for any y ∈ R, we can find x such that y ∈ (I + T)x. Therefore, the range of I + T is the whole R. □

Minty's theorem ensures that the domain of SαT is the whole space if T is maximal monotone. Therefore, we can state the following corollary.

Corollary 8. If T is a maximal monotone operator on RN, then SαT is single-valued and defined for all x ∈ RN.

We have the following generalization of Prop. 39 in this case.

Proposition 42. Suppose T is maximal monotone and α > 0. Then, SαT is firmly-nonexpansive.

Proof. Exercise! (Hint: Consider the argument used in the proof of Prop. 39.)

We remark that a subdifferential is maximal monotone. Therefore, a proximity operator is in fact the resolvent of a maximal monotone operator, and Prop. 39 is a special case of Prop. 42.

We introduce a final object called the 'reflected resolvent' that will be of interest in the convergence analysis of the algorithms that follow. Let us first state a result to motivate the definition.

Proposition 43. An operator S is firmly non-expansive if and only if 2S − I is non-expansive.


Proof. Suppose S is firmly nonexpansive and let N = 2S − I. Observe that

‖Nx − Ny‖₂² = ‖x − y‖₂² + { 4‖Sx − Sy‖₂² − 4〈Sx − Sy, x − y〉 }.

But the term inside the curly brackets above is non-positive thanks to the firm-nonexpansivity of S. Thus we obtain

‖Nx − Ny‖₂ ≤ ‖x − y‖₂.

Conversely, suppose N = 2S − I is non-expansive. Then, S = (1/2)I + (1/2)N. Observe also that I − S = (1/2)I − (1/2)N. We compute

‖Sx − Sy‖₂² + ‖(I − S)x − (I − S)y‖₂² = (1/2)‖x − y‖₂² + (1/2)‖Nx − Ny‖₂²
≤ ‖x − y‖₂².

Thus, S is firmly nonexpansive by Lemma 2.

Definition 29. Suppose T is maximal monotone and α > 0. The operator NT = 2ST − I is called the reflected resolvent of T. □

Thus, the reflected resolvent of a maximal monotone operator is non-expansive.

Let us state Prop. 43 from another viewpoint for later reference.

Corollary 9. N is nonexpansive if and only if ((1/2)I + (1/2)N) is firmly-nonexpansive.

Another useful observation is the following.

Corollary 10. If f is a convex function and α > 0, then (2Jαf − I) is non-expansive.

The following is a useful property to know concerning reflected resolvents (which, in fact, justifies the term 'reflected').

Proposition 44. Suppose T is maximal monotone. Then, NT(x + y) = x − y if and only if y ∈ Tx.

Proof. The claim follows from the following chain of equivalences.

NT(x + y) = x − y
⇐⇒ 2ST(x + y) − (x + y) = x − y
⇐⇒ ST(x + y) = x
⇐⇒ x + y ∈ (I + T)x
⇐⇒ y ∈ Tx.


6.3 The Dual PPA and the Augmented Lagrangian

Consider now a problem of the form

min_x f(x) subject to Ex = d, (6.24)

where E is a matrix.

In order to solve this problem, we will apply PPA on the dual problem. For this, let us derive a dual problem through the use of Lagrangians. Let

L(x, λ) = f(x) + 〈λ, Ex − d〉.

Then, (6.24) can be expressed as

min_x max_λ f(x) + 〈λ, Ex − d〉.

In order to obtain the dual problem, we change the order of min and max. This gives us the dual problem

max_λ g(λ),

where

g(λ) = min_x f(x) + 〈λ, Ex − d〉.

Recall that if

λ∗ ∈ arg max_λ g(λ),

then

x∗ ∈ arg min_x L(x, λ∗)

is a solution of the original problem (6.24). To find λ∗, we apply PPA on the dual problem and define a sequence as

λk+1 = arg max_λ g(λ) − (1/(2α))‖λ − λk‖₂².

To find λk+1, we need to solve

max_λ min_x f(x) + 〈λ, Ex − d〉 − (1/(2α))‖λ − λk‖₂². (6.25)

Let (xk+1, λk+1) denote the solution of this saddle point problem. To find this point, suppose we first tackle the maximization. Observe that, for fixed x, the optimality condition for the maximization part implies

0 = (Ex − d) − (1/α)(λ∗ − λk) ⇐⇒ λ∗ = λk + α(Ex − d).

52

Page 55: Lecture Notes on Convex Analysis and Iterative Algorithms · Having a solid understanding of convex sets is very useful for convex analysis of functions because we can and will regard

To find xk+1, plug this expression into the saddle point problem (6.25):

xk+1 = arg min_x f(x) + 〈λk + α(Ex − d), Ex − d〉 − (1/(2α))‖λk + α(Ex − d) − λk‖₂²

     = arg min_x f(x) + 〈λk, Ex − d〉 + (α/2)‖Ex − d‖₂².

To summarize, the dual PPA algorithm is

xk+1 = arg min_x LA(x, λk),

λk+1 = λk + α(Exk+1 − d),

where

LA(x, λ) = f(x) + 〈λ, Ex − d〉 + (α/2)‖Ex − d‖₂².

The function LA is called the augmented Lagrangian. Notice that LA is similar to the Lagrangian L but contains the additional quadratic penalty term (α/2)‖Ex − d‖₂², justifying the expression 'augmented'.
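The following Python sketch runs these iterations on a toy equality-constrained quadratic (the problem instance and all names are my own); for this f, the x-update is a linear solve.

import numpy as np

# Dual PPA / augmented Lagrangian for: min 0.5*||x - c||^2  s.t.  E x = d.
rng = np.random.default_rng(2)
E = rng.standard_normal((3, 6))
d = rng.standard_normal(3)
c = rng.standard_normal(6)
alpha = 1.0

A = np.eye(6) + alpha * E.T @ E          # system matrix of the x-update
x, lam = np.zeros(6), np.zeros(3)
for _ in range(300):
    # x_{k+1} = argmin_x L_A(x, lam_k), in closed form for this quadratic f
    x = np.linalg.solve(A, c - E.T @ lam + alpha * E.T @ d)
    lam = lam + alpha * (E @ x - d)      # multiplier update
print(np.linalg.norm(E @ x - d))         # constraint residual, approaching 0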

Let us now consider the convergence of this algorithm. Observe that, by the multiplier update,

Exk+1 − d = (1/α)(λk+1 − λk),

and since λk+1 − λk → 0, we have

lim_k Exk − d = 0.

Therefore, if x∗ is a limit point of xk, and x is an arbitrary vector such that Ex = d, we have

f(x∗) = LA(x∗, λ∗) ≤ LA(x, λ∗) = f(x).

Therefore x∗ is a solution.

Note that this argument is not necessarily a convergence proof, because it depends on the assumption that x∗ is a limit point of the sequence of xk's.

6.4 The Douglas-Rachford Algorithm

Consider now a problem of the form

min_x f(x) + g(x), (6.26)

where both f and g are convex. Let F = ∂f and G = ∂g. If x is a solution of this problem, we should have

0 ∈ Fx + Gx. (6.27)


Let us now derive equivalent expressions of this inclusion, to obtain a fixed point iteration. Suppose we fix α > 0. Then, (6.27) and the following statements are equivalent.

There exist u ∈ Fx, z ∈ Gx such that 0 = u + z
⇐⇒ there exist u ∈ Fx, z ∈ Gx such that x + αz = x − αu
⇐⇒ there exists u ∈ Fx such that x = (I + αG)⁻¹(x − αu) (6.28)
⇐⇒ x ∈ (I + αG)⁻¹(I − αF)x (6.29)

At this point, observe that if f is differentiable, then F is single-valued. In that case, the inclusion in (6.29) is actually an equality (may I confuse matters: this statement is correct if we consider the range of F as Rn and not 2^Rn). This suggests the following fixed point iterations, known as the forward-backward splitting algorithm:

xk+1 = (I + αG)⁻¹(I − αF)xk.

We will discuss this algorithm later.
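Even so, here is a minimal sketch of the forward-backward iteration (my example: a smooth least-squares f and g = λ‖·‖₁, so the backward step is soft-thresholding; a step-size of 1/L is a safe choice under (5.19)).

import numpy as np

# Forward-backward splitting for min_x 0.5*||H x - y||^2 + lam*||x||_1.
rng = np.random.default_rng(3)
H = rng.standard_normal((40, 10))
y = rng.standard_normal(40)
lam = 0.1
alpha = 1.0 / np.linalg.eigvalsh(H.T @ H).max()    # step-size 1/L

x = np.zeros(10)
for _ in range(2000):
    v = x - alpha * H.T @ (H @ x - y)              # forward (gradient) step on f
    x = np.sign(v) * np.maximum(np.abs(v) - alpha * lam, 0.0)  # backward (prox) step on g

print(0.5 * np.sum((H @ x - y) ** 2) + lam * np.abs(x).sum())  # final objective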

We would like to derive an algorithm that employs Jαf = (I + αF)⁻¹. For this, let us now backtrack to (6.28) and write down another equivalent statement:

⇐⇒ there exists u ∈ Fx such that x + αu = (I + αG)⁻¹(x − αu) + αu. (6.30)

If we now define a new variable t = x + αu, we have the following useful equalities:

x = (I + αF)⁻¹ t = Jαf(t),
αu = t − x = (I − Jαf) t.

Plugging these in (6.30), we obtain the following proposition.

Proposition 45. Suppose f and g are convex and the minimization problem (6.26) has at least one solution. A point x is a solution of (6.26) if and only if

x = Jαf(t), for some t that satisfies t = (Jαg(2Jαf − I) + (I − Jαf))(t). (6.31)

Thus, if we can obtain t that satisfies the fixed point equation in (6.31), we can obtain the solution to our minimization problem as Jαf(t). A natural choice is to consider the fixed point iterations

tk+1 = (Jαg(2Jαf − I) + (I − Jαf))(tk).

These constitute the Douglas-Rachford iterations. By a little algebra, we can put them in a form that is easier to interpret. For this, observe that

Jαg(2Jαf − I) + (I − Jαf) = Jαg(2Jαf − I) − (1/2)(2Jαf − I) + (1/2)I
                          = (1/2)I + (1/2)(2Jαg − I)(2Jαf − I). (6.32)


The convergence of the algorithm is easier to see in this form. Recall that Corollary 10 implies the non-expansivity of (2Jαf − I) and (2Jαg − I). But the composition of non-expansive operators is also non-expansive. Therefore, the composite operator (2Jαg − I)(2Jαf − I) is non-expansive. Finally, by Corollary 9, we can conclude that the operator in (6.32) is firmly-nonexpansive. Combining this observation with Prop. 45, we obtain the following convergence result as a consequence of the Krasnoselskii-Mann theorem (i.e., Thm. 1).

Proposition 46. Suppose f and g are convex and the minimization problem (6.26) has at least one solution. The sequence constructed as

tk+1 = ((1/2)I + (1/2)(2Jαg − I)(2Jαf − I))(tk), (6.33)

is convergent. If t∗ denotes the limit of this sequence, then x∗ = Jαf(t∗) is a solution of (6.26).

Generalized Douglas-Rachford Algorithm

In this section, we obtain a small variation on the Douglas-Rachford iterations in (6.31). Let us start with the following general observation. Suppose T is a single-valued operator and z = ((1 − β∗)I + β∗T)z for some β∗ ≠ 0. Then, actually z = ((1 − β)I + βT)z for any β. Therefore, we have the following generalization of Prop. 45.

Proposition 47. Suppose f and g are convex and the minimization problem (6.26) has at least one solution. A point x is a solution of (6.26) if and only if x = Jαf(t) for some t that satisfies

t = ((1 − β)I + β(2Jαg − I)(2Jαf − I))t, for all β ≠ 0.

Although this proposition is true for any nonzero β, we will specifically be interested in β ∈ (0, 1), because that is the interval for which we can construct a convergent sequence that can be used to obtain a minimizer of (6.26). The generalized DR iterations are as follows.

Proposition 48. Suppose f and g are convex and the minimization problem (6.26) has at least one solution. Also, let β ∈ (0, 1). Then, the sequence constructed as

tk+1 = ((1 − β)I + β(2Jαg − I)(2Jαf − I))(tk), (6.34)

is convergent. If t∗ denotes the limit of this sequence, then x∗ = Jαf(t∗) is a solution of (6.26).

In order to prove this result, we need a generalization of the Krasnoselskii-Mann theorem. Let us start with a definition.


Definition 30. An operator T is said to be β-averaged with β ∈ (0, 1) if T can be written as

T = (1 − β)I + βN,

for a non-expansive operator N. □

Specifically, a firmly-nonexpansive operator is (1/2)-averaged. Notice also that if T is β-averaged and β < β′, then T is also β′-averaged.

Let us now consider how we can generalize the Krasnoselskii-Mann theorem. Recall that a key inequality used in the beginning of the proof of that theorem was of the form

‖Tx − Ty‖₂² + ‖(I − T)x − (I − T)y‖₂² ≤ ‖x − y‖₂², (6.35)

where T is firmly nonexpansive. However, this depends heavily on the firm non-expansivity of T. If T is only β-averaged, with β > 1/2, then (6.35) does not hold any more. Let us now see if we can come up with an alternative. So, let T = (1 − β)I + βN, for a nonexpansive N. Then, we have

‖Tx − Ty‖₂² = (1 − β)²‖x − y‖₂² + β²‖Nx − Ny‖₂² + 2β(1 − β)〈Nx − Ny, x − y〉,
‖(I − T)x − (I − T)y‖₂² = β²‖x − y‖₂² + β²‖Nx − Ny‖₂² − 2β²〈Nx − Ny, x − y〉.

In order to cancel the inner product terms, let us consider a weighted sum:

‖Tx − Ty‖₂² + ((1 − β)/β)‖(I − T)x − (I − T)y‖₂²
= [(1 − β)² + β(1 − β)]‖x − y‖₂² + [β² + β(1 − β)]‖Nx − Ny‖₂²
= (1 − β)‖x − y‖₂² + β‖Nx − Ny‖₂²
≤ ‖x − y‖₂².

Thus, we have shown the following.

Proposition 49. Suppose T is β-averaged. Then,

‖Tx − Ty‖₂² + ((1 − β)/β)‖(I − T)x − (I − T)y‖₂² ≤ ‖x − y‖₂². (6.36)

We can now state a generalization of Theorem 1, which we also refer to as the Krasnoselskii-Mann theorem.

Theorem 3 (Krasnoselskii-Mann (General Statement)). Suppose F is a β-averaged mapping on RN and its set of fixed points is non-empty. Given an initial x0, suppose we define a sequence as xk+1 = F xk. Then, xk converges to a fixed point of F.


Proof. Exactly the same arguments as in the proof of Thm. 1 work, except that we now start with the inequality (6.36).

Proposition 48 now follows as a corollary of Theorem 3, because the operator in (6.34) is β-averaged with β ∈ (0, 1).

6.5 Alternating Direction Method of Multipliers

Consider the problem

min_x f(x) + g(Mx). (6.37)

In order to solve this problem, we will apply the Douglas-Rachford algorithm on the dual problem. The resulting algorithm is known as the alternating direction method of multipliers (ADMM). We will assume that M has full column rank.

We first split variables and write (6.37) as a saddle point problem:

min_{x,z} max_λ f(x) + g(z) + 〈λ, Mx − z〉.

The dual problem is then

max_λ [ min_x f(x) + 〈λ, Mx〉 ] + [ min_z g(z) − 〈λ, z〉 ],

where the first bracketed term is −d1(λ) and the second is −d2(λ).

Equivalently, the dual problem can be expressed as

min_λ d1(λ) + d2(λ), (6.38)

for

d1(λ) = max_x 〈−M^T λ, x〉 − f(x) = f∗(−M^T λ), (6.39a)
d2(λ) = max_z 〈λ, z〉 − g(z) = g∗(λ). (6.39b)

For this problem, the Douglas-Rachford algorithm, starting from some y0, can be written as follows:

ȳk = Nαd2(yk),
ỹk = Nαd1(ȳk),
yk+1 = (1/2)yk + (1/2)ỹk,
λk+1 = Jαd2(yk+1). (6.40)


Recall that the sequence of λk's converges to a solution of (6.38). Let us now find expressions for the terms above in terms of f and g.

Suppose yk = λk + αzk, with zk ∈ ∂d2(λk). This implies (also recall Prop. 44) that λk = Jαd2(yk), so that

ȳk = 2Jαd2(yk) − yk
   = 2λk − (λk + αzk)
   = λk − αzk.

Now observe that

Jαd1(ȳk) = arg min_y (1/(2α))‖y − ȳk‖₂² + d1(y)
         = arg min_y [ max_x (1/(2α))‖y − ȳk‖₂² − f(x) − 〈M^T y, x〉 ].

Changing the order of min and max, we find that Jαd1(ȳk) must satisfy

Jαd1(ȳk) = ȳk + αMxk+1,

where

xk+1 := arg max_x (1/(2α))‖αMx‖₂² − f(x) − 〈ȳk + αMx, Mx〉
      = arg max_x (1/(2α))‖αMx‖₂² − f(x) − 〈λk − αzk + αMx, Mx〉
      = arg min_x f(x) + 〈λk, Mx〉 + (α/2)‖Mx − zk‖₂².

Therefore,

ỹk = 2Jαd1(ȳk) − ȳk
   = ȳk + 2αMxk+1
   = λk − αzk + 2αMxk+1.

We also have

yk+1 = (1/2)yk + (1/2)ỹk
     = λk + αMxk+1.

Let us finally show that yk+1 can be expressed as yk+1 = λk+1 + αzk+1 for some zk+1 ∈ ∂d2(λk+1), so that the assumption stated in the beginning of the kth iteration is also valid at the (k + 1)st iteration, when we define zk+1 properly. We have

λk+1 = arg min_y (1/(2α))‖y − yk+1‖₂² + d2(y)
     = arg min_y [ max_z (1/(2α))‖y − yk+1‖₂² + 〈y, z〉 − g(z) ].


Changing the order of min-max, we find

λk+1 = yk+1 − αzk+1
     = λk + α(Mxk+1 − zk+1),

where

zk+1 = arg max_z (1/(2α))‖αz‖₂² + 〈yk+1 − αz, z〉 − g(z)
     = arg max_z (1/(2α))‖αz‖₂² + 〈λk + αMxk+1 − αz, z〉 − g(z)
     = arg min_z g(z) − 〈λk, z〉 + (α/2)‖Mxk+1 − z‖₂².

The optimality conditions for the last equation are

0 ∈ ∂g(zk+1) − λk + α(zk+1 − Mxk+1)
⇐⇒ λk − α(zk+1 − Mxk+1) ∈ ∂g(zk+1)
⇐⇒ λk+1 ∈ ∂g(zk+1)
⇐⇒ zk+1 ∈ ∂g∗(λk+1) = ∂d2(λk+1).

ADMM in Terms of the Primal Variables

We can rewrite the iterations solely in terms of xk, zk, λk. This produces the following algorithm, referred to as the ADMM.

xk+1 = arg min_x f(x) + 〈λk, Mx〉 + (α/2)‖Mx − zk‖₂²,
zk+1 = arg min_z g(z) − 〈λk, z〉 + (α/2)‖Mxk+1 − z‖₂²,
λk+1 = λk + α(Mxk+1 − zk+1).
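A small Python sketch of these iterations (my toy instance: f a least-squares term, g = λ‖·‖₁, and a full-column-rank M, so the z-update is a soft-threshold and the x-update a linear solve):

import numpy as np

# ADMM for min_x 0.5*||H x - y||^2 + lam*||M x||_1.
rng = np.random.default_rng(5)
H = rng.standard_normal((30, 8))
y = rng.standard_normal(30)
M = np.eye(8) + 0.1 * rng.standard_normal((8, 8))   # full column rank
lam, alpha = 0.1, 1.0

A = H.T @ H + alpha * M.T @ M
x, z, lmb = np.zeros(8), np.zeros(8), np.zeros(8)
for _ in range(1000):
    # x-update: the quadratic minimization has a closed form
    x = np.linalg.solve(A, H.T @ y - M.T @ lmb + alpha * M.T @ z)
    # z-update: soft-threshold M x + lmb/alpha at level lam/alpha
    v = M @ x + lmb / alpha
    z = np.sign(v) * np.maximum(np.abs(v) - lam / alpha, 0.0)
    # multiplier update
    lmb = lmb + alpha * (M @ x - z)

print(np.linalg.norm(M @ x - z))          # primal residual, approaching 0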

Convergence of ADMM

Recall that we derived ADMM as an instance of the Douglas-Rachford algorithm applied to the dual problem in (6.38). The iterations we started with, and their relation to xk, zk, are given below.

ȳk = Nαd2(yk) [ȳk = λk − αzk] (6.42a)
ỹk = Nαd1(ȳk) [ỹk = λk − αzk + 2αMxk+1] (6.42b)
yk+1 = (1/2)yk + (1/2)ỹk [yk+1 = λk + αMxk+1] (6.42c)
λk+1 = Jαd2(yk+1) [λk+1 = λk + α(Mxk+1 − zk+1)] (6.42d)


The convergence results on the Douglas-Rachford iterations therefore ensure that in (6.40), yk is convergent. Since the operators Nαd2, Nαd1, Jαd2 are continuous, this in turn implies that ȳk, ỹk and λk are also convergent sequences. Thanks to the relations in (6.42), and the full column rank property of M, we then obtain that xk and zk are also convergent. Let λ∗, x∗, z∗ denote the corresponding limits.

First, notice that (6.42d) implies

λ∗ = λ∗ + α(Mx∗ − z∗).

Thus, we have Mx∗ = z∗.

Now, using Mx∗ = z∗, from (6.42c) and (6.42a) we obtain

Nαd2(λ∗ + αz∗) = λ∗ − αz∗.

But by Prop. 44, this implies that z∗ ∈ ∂d2(λ∗). In view of (6.39b), equivalently λ∗ ∈ ∂g(z∗).

By a similar argument, we obtain from (6.42a) that ȳ∗ = λ∗ − αMx∗, from (6.42b) that ỹ∗ = λ∗ + αMx∗, and

Nαd1(λ∗ − αMx∗) = λ∗ + αMx∗.

This time, Prop. 44 implies −Mx∗ ∈ ∂d1(λ∗). In view of (6.39a), equivalently −M^T λ∗ ∈ ∂f(x∗). Using this and the previous observation λ∗ ∈ ∂g(z∗), we can thus write

0 ∈ ∂f(x∗) + M^T λ∗
=⇒ 0 ∈ ∂f(x∗) + M^T ∂g(z∗)
⇐⇒ 0 ∈ ∂f(x∗) + M^T (∂g)(Mx∗).

In words, x∗ is a solution of the primal problem (6.37).

6.6 A Generalized Proximal Point Algorithm

Recall that, given a maximal monotone T, the proximal point algorithm consists of

xk+1 = (I + αT)⁻¹ xk.

The sequence xk converges to some x∗ such that 0 ∈ T(x∗). Recall that the convergence of PPA depended on the firm-nonexpansivity of S = (I + αT)⁻¹, which is equivalent to

〈S x1 − S x2, x1 − x2〉 ≥ ‖S x1 − S x2‖₂².


To derive a generalized PPA, suppose M is a positive definite matrix, and consider the following train of equivalences:

0 ∈ T(x)
⇐⇒ Mx ∈ Mx + αT(x)
⇐⇒ x = (M + αT)⁻¹ M x.

Note that the last line assumes that (M + αT) has an inverse. This is indeed the case, since M can be written as M = cI + U for some positive definite matrix U and c > 0. Thanks to positive definiteness, U is maximal monotone. Consequently, U + αT is also maximal monotone, and we can then resort to Minty's theorem and Prop. 42 to conclude that (M + αT) = cI + (U + αT) has a well-defined inverse. We also remark at this point that M does not have to be symmetric.

We will study the operator (M + αT )−1M in a modified norm.

Lemma 3. Suppose M is a symmetric positive definite matrix. Then, the mapping 〈x, y〉M = 〈x, My〉 defines an inner product.

Proof. To be added...

In the following, we denote the induced norm √〈·, ·〉M by ‖·‖M.

Proposition 50. Suppose M is positive definite and T is maximal monotone with respect to the inner product 〈·, ·〉I. Then, the operator S = (M + αT)⁻¹M is firmly-nonexpansive with respect to the inner product 〈·, ·〉M. That is,

〈Sx − Sy, x − y〉M ≥ ‖Sx − Sy‖M².

Proof. Without loss of generality, take α = 1. Since the range of M + T is the whole space, and M is invertible, for any yi we can find xi and ui ∈ T(xi) such that Myi = Mxi + ui. Notice that xi = Syi. But then,

〈Sy1 − Sy2, y1 − y2〉M = 〈x1 − x2, x1 − x2〉M + 〈x1 − x2, M⁻¹(u1 − u2)〉M
                      = ‖x1 − x2‖M² + 〈x1 − x2, u1 − u2〉I
                      ≥ ‖x1 − x2‖M².

To be added : generalization of the Krasnoselskii-Mann theorem...

In summary, the generalized PPA consists of the following iterations

xk+1 = (M + αT )−1M xk.


Application to a Saddle Point Problem

Consider a problem of the form

min_x max_{z∈B} f(x) + 〈Ax, z〉,

where B is a closed, convex set, and f is a convex function. Rewriting, we obtain an equivalent problem as

min_x max_z { L(x, z) = f(x) + 〈Ax, z〉 − iB(z) }.

Observe that L(x, z) is a convex-concave function. For such functions, the following operator replaces the subdifferential:

T(x, z) = ( ∂x L(x, z), ∂z(−L(x, z)) ) = ( ∂f(x) + A^T z, ∂iB(z) − Ax ). (6.43)

Proposition 51. The operator T defined in (6.43) is maximal monotone.

Proof. Let us first show monotonicity. Observe that

〈T(x1, z1) − T(x2, z2), (x1, z1) − (x2, z2)〉

= 〈 (∂f(x1), ∂iB(z1)) − (∂f(x2), ∂iB(z2)), (x1 − x2, z1 − z2) 〉
  + (x1 − x2, z1 − z2)^T [0 A^T; −A 0] (x1 − x2, z1 − z2)

≥ 0,

since the quadratic term vanishes (the matrix is skew-symmetric) and the first term is non-negative by the monotonicity of ∂f and ∂iB.

Proof of maximality to be added...

If we apply PPA to obtain a zero of the T(x, z) defined in (6.43), the resulting iterations are complicated. Consider now a generalized PPA (GPPA) with the choice

M = [ I    αA^T ]
    [ αA   I    ].

Observe that M is positive definite if α²σ(A^T A) < 1. In order to apply GPPA, we need an expression for (M + αT)⁻¹. Notice that the off-diagonal terms coming from M and from αT add up in the first component and cancel in the second, so that

(M + αT)(x, z) = ( (I + α∂f)x + 2αA^T z, (I + α∂iB)z ).


Thus,

(M + αT)(x̄, z̄) ∋ (x, z)

means

(I + α∂f)x̄ + 2αA^T z̄ ∋ x, (6.44a)
(I + α∂iB)z̄ ∋ z. (6.44b)

Solving (6.44b) and plugging the result back in (6.44a), we obtain the expressions for x̄ and z̄ as

z̄ = PB(z),
x̄ = Jαf(x − 2αA^T z̄),

where PB = JαiB denotes the projection operator onto B and Jαf is the proximity operator of f.

Observe also that

M(x, z) = ( x + αA^T z, αAx + z ).

Therefore, the GPPA for this problem is

zk+1 = PB(zk + αAxk),
xk+1 = Jαf(xk + αA^T zk − 2αA^T zk+1) = Jαf(xk − αA^T(2zk+1 − zk)).

By the analysis of GPPA, we can state that this algorithm converges if α²σ(A^T A) < 1.

Application to an Analysis Prior Problem

Consider now the problem

min_x (1/2)‖y − Hx‖₂² + λ‖Ax‖₁.

By making use of the dual expression of the ℓ1 norm, we can express this problem as a saddle point problem. For this, recall that

‖x‖₁ = max_{z∈B∞} 〈x, z〉 = max_z 〈x, z〉 − iB∞(z),


where B∞ denotes the unit ball of the ℓ∞ norm. The equivalent saddle point problem is

min_x max_z f(x) + λ〈Ax, z〉 − λ iB∞(z), where f(x) = (1/2)‖y − Hx‖₂².

This is exactly the same form considered above. The choice

M = [ I     −αA^T ]
    [ −αA   I     ]

leads to the iterations

xk+1 = Jαf(xk − αA^T zk),
zk+1 = PB∞(zk + αA(2xk+1 − xk)).

These are the iterations proposed by Chambolle and Pock.
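A minimal sketch of these iterations (my toy sizes; λ is folded into A for brevity, Jαf is a linear solve, and PB∞ is a componentwise clip to [−1, 1]):

import numpy as np

# Chambolle-Pock-type iterations for min_x 0.5*||y - H x||^2 + ||A x||_1,
# where A already includes the factor lam.
rng = np.random.default_rng(6)
H = rng.standard_normal((30, 10))
y = rng.standard_normal(30)
A = 0.2 * rng.standard_normal((9, 10))

alpha = 0.9 / np.sqrt(np.linalg.eigvalsh(A.T @ A).max())  # alpha^2 sigma(A^T A) < 1
Q = np.eye(10) + alpha * H.T @ H          # matrix behind J_af

x, z = np.zeros(10), np.zeros(9)
for _ in range(2000):
    x_new = np.linalg.solve(Q, x - alpha * A.T @ z + alpha * H.T @ y)  # J_af step
    z = np.clip(z + alpha * A @ (2 * x_new - x), -1.0, 1.0)  # projection onto B_inf
    x = x_new

print(0.5 * np.sum((y - H @ x) ** 2) + np.abs(A @ x).sum())  # final objective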
