Selected Methods for Modern Optimization in Data Analysis
Department of Statistics and Operations Research
UNC-Chapel Hill
Fall 2018
Instructor: Quoc Tran-Dinh Scribe: Quoc Tran-Dinh
Lecture 3: A brief overview on convex analysis and duality
Index Terms: Convex sets, convex cones, convex functions, Fenchel’s conjugate, proximal operators,
monotone operators, duality theory, convergence rate, and computational complexity theory.
Copyright: This lecture is released under a Creative Commons License.
This lecture gives a very short overview for the following central concepts:
• Convex sets, polyhedra, convex hulls, and convex cones.
• Convex functions, strictly/strongly convex functions, subgradients, and smooth convex functions.
• Fenchel conjugate and properties.
• Proximal operators, projections, proximity functions and Bregman divergences.
• Lagrange duality theory, saddle-points, Slater condition, and Karush-Kuhn-Tucker (KKT) condition.
• Maximal monotone operators.
• A brief review on convergence rates and computational complexity theory.
For more details on these topics, we refer to three monographs [1, 3, 14].
1 Convex sets and convex functions
Let Rp be a finite-dimensional Euclidean space, and let X be a subset of Rp. We assume that students are already familiar with some topological properties of X, such as the interior and the boundary of X, and when X is closed or open, etc. We denote by int(X) the interior of X, cl(X) the closure of X, and ∂(X) the boundary of X. We say that X is closed if X = cl(X). We say that X is compact if it is closed and bounded. We use R+ and R++ for the sets of nonnegative and positive real numbers, respectively. We use Sp (respectively, Sp+ and Sp++) to denote the set of symmetric (respectively, symmetric positive semidefinite and symmetric positive definite) matrices of size p × p. A matrix X ∈ Sp+ is denoted by X ⪰ 0. If X ∈ Sp++, we write X ≻ 0. Similarly, X ⪯ 0 and X ≺ 0 mean that −X ∈ Sp+ and −X ∈ Sp++, respectively. For P, Q ∈ Sp, we say that P ⪯ Q if Q − P ∈ Sp+. If ‖·‖ is a norm on Rp, then ‖u‖∗ := max{〈u, v〉 | ‖v‖ ≤ 1} denotes its dual norm.
Two central concepts we use in this course are convex sets and convex functions. Let us quickly review them.
1.1 Convex sets
1.1.1 Definitions and examples
A set X is called an affine set if (1−α)x1 +αx2 ∈ X for all x1,x2 ∈ X and α ∈ R. In other words, an affine set
X is a set that contains any line going through two points x1,x2 ∈ X .
Definition 1 (Convex set). We say that X is convex if for all x1, x2 ∈ X and α ∈ [0, 1], we have (1−α)x1 + αx2 ∈ X.
STOR @ UNC Numerical optimization: Theory, algorithms, and applications Quoc Tran-Dinh
Figure 1: Convex set (left), nonconvex set (middle) and polytope (right)
There are several equivalent definitions of convex sets. For example, X is convex if it contains any closed
segment [x1,x2] connecting two points x1,x2 ∈ X .
Example 1. Here are a few quick examples:
1. There are two trivial convex sets: the singleton {0p} and the whole space Rp. Of course, the empty set, ∅, is also convex.
2. A hyperplane H(a, b) := {x ∈ Rp | a>x = b} in Rp is convex. A half-space H−(a, b) := {x ∈ Rp | a>x ≤ b} in Rp is convex.
3. A ball Br(x̄) := {x ∈ Rp | ‖x − x̄‖ ≤ r} is convex, where the center x̄ ∈ Rp and the radius r ≥ 0 are given.
4. The solution set of a linear matrix inequality, {x ∈ Rp | ∑_{i=1}^p xiAi ⪯ 0}, is convex, where the Ai (i = 1, · · · , p) are symmetric matrices in Sq (why?).
Proof: It is the intersection ⋂_{u∈Rq} {x ∈ Rp | u>A(x)u ≤ 0} with A(x) := ∑_{i=1}^p xiAi, which is linear in x.
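This convexity can be checked numerically: any convex combination of two feasible points of the LMI remains feasible. Below is a small sanity check (not from the lecture); the matrices A1, A2 are made-up examples.

```python
import numpy as np

# Hypothetical data: two symmetric matrices A1, A2 defining the LMI
# x1*A1 + x2*A2 <= 0 (in the semidefinite order) on R^2.
A1 = np.array([[-1.0, 0.0], [0.0, -2.0]])
A2 = np.array([[-1.0, 1.0], [1.0, -3.0]])

def lmi_feasible(x, tol=1e-10):
    """Check that x1*A1 + x2*A2 is negative semidefinite."""
    M = x[0] * A1 + x[1] * A2
    return np.linalg.eigvalsh(M).max() <= tol

# Two feasible points and convex combinations of them.
x_a, x_b = np.array([1.0, 0.0]), np.array([0.5, 0.5])
assert lmi_feasible(x_a) and lmi_feasible(x_b)
for alpha in np.linspace(0.0, 1.0, 11):
    assert lmi_feasible((1 - alpha) * x_a + alpha * x_b)
```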
1.1.2 Polyhedra
Definition 2 (Polyhedron). We say that X is polyhedral in Rp if it is the intersection of a finite number of half-spaces in Rp, i.e.,
P := {x ∈ Rp | a_i>x ≤ bi, i = 1, · · · ,m}.
When P is bounded, we call P a polytope as in Figure 1 (right). We can represent a polyhedron in matrix form as P := {x ∈ Rp | Ax ≤ b}, where A ∈ Rm×p is the matrix whose rows are a_i> for i = 1, · · · ,m, and b := (b1, · · · , bm)> ∈ Rm. Convex polyhedra form the feasible sets of linear and quadratic programs, as we have seen before. They also represent the feasible sets of many other problems, as we will see in the next lectures.
Example 2. Here are a few quick examples:
1. A box in Rp, [a, b] := {x ∈ Rp | aj ≤ xj ≤ bj , j = 1, · · · , p}, is a polyhedron in Rp.
2. A hyperplane H(a, b) := {x ∈ Rp | a>x = b} in Rp is also a polyhedron.
3. The ℓ1-norm ball B‖·‖1 := {x ∈ Rp | ‖x‖1 ≤ 1} is a polytope.
4. The standard simplex in Rp, ∆p := {x ∈ Rp | ∑_{i=1}^p xi = 1, xi ≥ 0, i = 1, · · · , p}, is also a polytope.
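As an illustration of the matrix form P = {x | Ax ≤ c}, the box example above can be written with 2p inequalities (xj ≤ bj and −xj ≤ −aj). A minimal sketch, with made-up bounds:

```python
import numpy as np

# A box [a, b] in R^p written in polyhedral form {x | A x <= c}.
p = 3
a = np.array([-1.0, 0.0, 2.0])
b = np.array([1.0, 2.0, 5.0])
A = np.vstack([np.eye(p), -np.eye(p)])   # 2p x p constraint matrix
c = np.concatenate([b, -a])              # right-hand side in R^{2p}

def in_polyhedron(x, tol=1e-12):
    return np.all(A @ x <= c + tol)

x1 = np.array([0.0, 1.0, 3.0])   # inside the box
x2 = np.array([1.0, 0.0, 2.0])   # a vertex of the box
assert in_polyhedron(x1) and in_polyhedron(x2)
# A convex combination of feasible points stays feasible.
assert in_polyhedron(0.3 * x1 + 0.7 * x2)
assert not in_polyhedron(np.array([2.0, 0.0, 3.0]))  # violates x_1 <= 1
```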
1.1.3 Convex hulls
Given a set of n points {xi}_{i=1}^n in Rp and n nonnegative numbers {αi}_{i=1}^n ⊂ R+ such that ∑_{i=1}^n αi = 1, the point ∑_{i=1}^n αixi is called a convex combination of {xi}_{i=1}^n with the coefficients {αi}_{i=1}^n. Given a nonempty set X in Rp, we define the following set:
{x ∈ Rp | x = ∑_{i=1}^n αixi, αi ∈ R+, ∑_{i=1}^n αi = 1, xi ∈ X , n ≥ 1}. (1)
This set is called the convex hull of X, which is denoted by conv(X).
Given n points {xi}_{i=1}^n in Rp and n real numbers {αi}_{i=1}^n ⊂ R such that ∑_{i=1}^n αi = 1 (not necessarily nonnegative), the point ∑_{i=1}^n αixi is called an affine combination of {xi}_{i=1}^n with the coefficients {αi}_{i=1}^n. Given a subset X of Rp, let aff(X) be the set that contains all the affine combinations of any nonempty set of points in X. Then aff(X) is called the affine hull of X.
Figure 2: Convex hulls. Left: The convex combination of two points, Middle: The convex hull of a finite set of
points, Right: The convex hull of a nonconvex set Y.
Example 3. Here are a few quick examples:
1. Given two points x(1) and x(2), conv({x(1), x(2)}) = [x(1), x(2)], the segment between these two points.
2. conv({ei, −ei | 1 ≤ i ≤ p}) = B‖·‖1 := {x ∈ Rp | ‖x‖1 ≤ 1}, where ei is the i-th unit vector of Rp.
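The second item can be made concrete: any x with ‖x‖1 ≤ 1 is explicitly a convex combination of the 2p points ±ei. A minimal sketch (the slack-splitting choice below is one arbitrary way to use up the leftover coefficient mass):

```python
import numpy as np

def l1_ball_coefficients(x):
    """Write x with ||x||_1 <= 1 as a convex combination of {+e_i, -e_i}:
    weight |x_i| on sign(x_i)*e_i, and the leftover mass 1 - ||x||_1
    split evenly between +e_1 and -e_1 (so it cancels)."""
    coef_plus = np.maximum(x, 0.0)    # weight on +e_i
    coef_minus = np.maximum(-x, 0.0)  # weight on -e_i
    slack = 1.0 - np.abs(x).sum()
    coef_plus[0] += slack / 2
    coef_minus[0] += slack / 2
    return coef_plus, coef_minus

x = np.array([0.3, -0.2, 0.1])        # ||x||_1 = 0.6 <= 1
cp, cm = l1_ball_coefficients(x)
assert np.all(cp >= 0) and np.all(cm >= 0)
assert abs(cp.sum() + cm.sum() - 1.0) < 1e-12   # coefficients sum to 1
recon = cp @ np.eye(3) + cm @ (-np.eye(3))      # sum_i cp_i e_i + cm_i (-e_i)
assert np.allclose(recon, x)
```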
1.1.4 Basic properties
We recall some basic properties of convex sets:
• The intersection of convex sets is convex. More precisely, if Xi, i ∈ I (I is an index set) are convex, then
∩i∈IXi is also convex.
• However, the union of two convex sets is not necessarily convex.
• Any scaling of a convex set is convex, i.e., αX := {αx | x ∈ X} is convex for any convex set X in Rp and α ∈ R.
• The Cartesian product X × Y of two convex sets X and Y is convex.
• The sum X + Y := {x + y | x ∈ X, y ∈ Y} of two convex sets X and Y is convex.
1.1.5 Relative interior
We say that x ∈ X is a relative interior point of X if there exists (an open) neighborhood Vε(x) of x such that
Vε(x) ∩ aff (X ) ⊆ X .
Example 4. A real interval [a, b] has interior (a, b) in R but has no interior when viewed as a subset of R2. In this case, [a, b] has relative interior (a, b) in R2. Similarly, a plane has no interior in R3, but it does have a relative interior in R3.
1.1.6 Extreme points
Now, we define extreme points, and extreme rays of a convex set.
Definition 3. Given a convex set X, a point x ∈ X is called an extreme point of X if x cannot be written as the midpoint of two distinct points y, z ∈ X.
In other words, x ∈ X is an extreme point of X if x cannot be written as x = (1/2)(y + z), where y, z ∈ X and y ≠ z.
Figure 3 illustrates that x (denoted by E) is an extreme point of X, since it cannot be written as x = (1/2)(y + z) for y, z ∈ X with y ≠ z. But x (denoted by N) is not an extreme point, since we can write x = (1/2)(y + z) where y, z ∈ X. Here, x is a point on the boundary of X. Clearly, any extreme point must lie on the boundary of X.
Figure 3: Extreme points: E is an extreme point, and N is not an extreme point. Left: X has a finite number of
extreme points (red), Middle: X has an infinite number of extreme points, Right: All the points on the sphere
are extreme points.
The existence of extreme points is guaranteed by the Krein–Milman theorem [14]. We omit the proof here.
Theorem 1.1 (Krein-Milman’s theorem). Any nonempty, closed and convex set that does not contain a straight
line always has an extreme point.
1.1.7 Convex cones
A set K in Rp is called a cone if tx ∈ K for any t ≥ 0 and x ∈ K. The set K is called a convex cone if K is a cone and convex. In other words, K is a convex cone if for all x1, x2 ∈ K and α, β ≥ 0, we have αx1 + βx2 ∈ K.
Figure 4: A convex cone
Example 5. Here are a few examples.
• The nonnegative orthant Rp+ of Rp is a convex cone, called an orthant cone.
• The set of positive semidefinite matrices Sp+ := {X ∈ Sp | X ⪰ 0} is a convex cone (why?).
Proof: First, if X, Y ∈ Sp+, then for any α, β ≥ 0, we have αX + βY ∈ Sp+. Hence, it is a convex cone. Alternatively, we can write Sp+ = ⋂_{u∈Rp} {X ∈ Sp | u>Xu ≥ 0} with u>Xu = trace((uu>)X), which is linear in X.
• The Lorentz set Lp := {(x, t) ∈ Rp+1 | ‖x‖2 ≤ t} is a convex cone, called an ice-cream cone, or a second-order cone (why?).
Proof: For t > 0, the inequality ‖x‖2 ≤ t is equivalent to t² − ‖x‖²2 ≥ 0, i.e., t − t^{−1}x>x ≥ 0. By the Schur complement, this is equivalent to
[ t x> ; x tI ] ⪰ 0.
This inequality defines a convex set.
One can also prove this directly. Take (x, t) ∈ Lp and (y, s) ∈ Lp, so that ‖x‖2 ≤ t and ‖y‖2 ≤ s. Then, for any α ≥ 0 and β ≥ 0, we have ‖αx + βy‖2 ≤ α‖x‖2 + β‖y‖2 ≤ αt + βs, which shows that α(x, t) + β(y, s) ∈ Lp.
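The direct argument above is easy to probe numerically: conic combinations of random second-order-cone points stay in the cone. A quick sanity check (not from the lecture):

```python
import numpy as np

def in_soc(x, t, tol=1e-12):
    """Membership in the second-order cone L^p = {(x, t) : ||x||_2 <= t}."""
    return np.linalg.norm(x) <= t + tol

rng = np.random.default_rng(0)
for _ in range(100):
    # Build two random points of the cone and nonnegative weights.
    x = rng.standard_normal(4); t = np.linalg.norm(x) + rng.random()
    y = rng.standard_normal(4); s = np.linalg.norm(y) + rng.random()
    alpha, beta = rng.random(), rng.random()
    assert in_soc(x, t) and in_soc(y, s)
    # The conic combination alpha*(x,t) + beta*(y,s) stays in the cone.
    assert in_soc(alpha * x + beta * y, alpha * t + beta * s)
```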
Given a nonempty, closed, and convex set X, we denote by TX(x) and NX(x) the tangent and normal cones of X at x ∈ X, respectively, which are defined as
TX(x) := cl({d ∈ Rp | x + td ∈ X for some t > 0}) and NX(x) := {w ∈ Rp | w>(x − y) ≥ 0, ∀y ∈ X}. (2)
These two cones are key concepts to study the optimality condition of optimization problems.
1.2 Convex functions
In this section, we briefly review some basic concepts and properties of convex functions. The reader can refer to
[1, 3, 14] for more details.
1.2.1 Basic definitions
Consider a function f from Rp to R ∪ {±∞}, which may take the values −∞ and +∞.
1. We denote by dom(f) := {x ∈ Rp | f(x) < +∞} the (effective) domain of f.
2. We say that f is proper if f(x) > −∞ for all x ∈ Rp, i.e., f does not take the value −∞, and dom(f) ≠ ∅.
3. We also denote by epi(f) := {(x, t) ∈ dom(f) × R | f(x) ≤ t} the epigraph of f.
4. We say that f is closed if epi(f) is closed.
Now, we give a formal definition of convex function.
Definition 4. Given a nonempty subset X ⊆ Rp, we say that f is convex on X if for any x1,x2 ∈ X and any
α ∈ [0, 1], we have (1− α)x1 + αx2 ∈ X and
f((1− α)x1 + αx2) ≤ (1− α)f(x1) + αf(x2). (3)
We say that f is convex if it is convex on its domain. The function f is said to be concave on X if −f is convex on X.
We can also show that f is convex iff its epi (f) is convex. We denote by Γ0(X ) the class of proper, closed and
convex functions on X . If a function f is not convex, then we say that f is nonconvex.
Example 6. Here are a few examples of convex functions.
1. The affine function f(x) := a>x + b for given a ∈ Rp and b ∈ R is convex on Rp.
2. The quadratic function (with positive semidefinite Hessian) f(x) := (1/2)〈Qx, x〉 + q>x for Q ∈ Sp+ and q ∈ Rp is convex on Rp.
Figure 5: Nonconvex function (left), convex function (middle), concave function (right)
3. The logarithmic function f(x) := − log(x) for x ∈ R++ is convex on R++.
4. The logistic loss function f(x) := log(1 + ex) is convex on R.
5. Any norm f(x) := ‖x‖ is convex on Rp. In particular, the squared ℓ2-norm f(x) := ‖x‖²2 is convex on Rp.
1.2.2 The indicator and support functions
There are two special functions: the indicator and support functions of a convex set. Given a nonempty, closed,
and convex set X in Rp, we define
• The indicator function of X:
δX(x) := 0 if x ∈ X, and δX(x) := +∞ otherwise. (4)
This function is proper, closed, and convex on Rp with dom(δX) = X.
• The support function sX of X:
sX(x) := sup_{u∈X} 〈x, u〉. (5)
This function is also convex on Rp.
1.2.3 Basic properties of convex functions
Here are some basic properties of convex functions.
1. f is convex on Rp iff epi(f) is convex on Rp × R.
Proof: Assume that f is convex. For any (x, t) ∈ epi(f) and (y, s) ∈ epi(f), with x, y ∈ dom(f) and t, s ∈ R, we have f(x) ≤ t and f(y) ≤ s. Since f is convex, for any α ∈ [0, 1], f((1−α)x + αy) ≤ (1−α)f(x) + αf(y) ≤ (1−α)t + αs. This shows that ((1−α)x + αy, (1−α)t + αs) ∈ epi(f). Conversely, for any x, y ∈ dom(f), we take t = f(x) and s = f(y), so that (x, t), (y, s) ∈ epi(f). Since epi(f) is convex, for any α ∈ [0, 1], we have ((1−α)x + αy, (1−α)t + αs) ∈ epi(f), which means that f((1−α)x + αy) ≤ (1−α)t + αs = (1−α)f(x) + αf(y). This shows that f is convex.
2. If f and g are two convex functions on Rp, then αf + βg is also convex for any α, β ∈ R+.
3. If f : Rn → R ∪ {+∞} is convex and A : Rp → Rn is a linear/affine mapping, then f(A(·)) is also convex.
4. Jensen's inequality: Given a convex function f ∈ Γ0(Rp), n points xi ∈ dom(f), and coefficients αi ≥ 0 for i = 1, · · · , n with ∑_{i=1}^n αi = 1, we have
f(∑_{i=1}^n αixi) ≤ ∑_{i=1}^n αif(xi).
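Jensen's inequality is easy to verify numerically for a concrete convex function; below is a quick check (not from the lecture) with f(x) = ‖x‖²2 and random convex coefficients:

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sum(x**2)   # a convex function on R^p

for _ in range(50):
    n, p = 5, 3
    points = rng.standard_normal((n, p))
    alpha = rng.random(n); alpha /= alpha.sum()    # convex coefficients
    lhs = f(alpha @ points)                        # f(sum_i alpha_i x_i)
    rhs = sum(a * f(x) for a, x in zip(alpha, points))
    assert lhs <= rhs + 1e-12                      # Jensen's inequality
```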
6/29
STOR @ UNC Numerical optimization: Theory, algorithms, and applications Quoc Tran-Dinh
5. A function f : Rp → R ∪ {+∞} is convex iff, for any x ∈ dom(f) and v ∈ Rp, its line restriction
ϕ(t) := f(x + tv) with dom(ϕ) := {t ∈ R | x + tv ∈ dom(f)}
is convex on dom(ϕ). This property is very useful for verifying whether a function is convex or not.
Proof: If f is convex, we have ϕ((1−α)t + αs) = f(x + ((1−α)t + αs)v) = f((1−α)(x + tv) + α(x + sv)) ≤ (1−α)f(x + tv) + αf(x + sv) = (1−α)ϕ(t) + αϕ(s) for any α ∈ [0, 1] and t, s ∈ dom(ϕ). Conversely, if ϕ((1−α)t + αs) ≤ (1−α)ϕ(t) + αϕ(s), then, for any x, y ∈ dom(f), we take v = y − x, so that f((1−α)(x + tv) + α(x + sv)) ≤ (1−α)f(x + tv) + αf(x + sv). Taking t = 0 and s = 1, we get f((1−α)x + αy) ≤ (1−α)f(x) + αf(y).
6. If f1, · · · , fn are convex, then f(x) := max{fi(x) | 1 ≤ i ≤ n} is also convex. This is a useful property for representing nonsmooth convex functions. For example, |x| = 2 max{x, 0} − x.
Proof: We note that (x, t) ∈ epi(f) iff f(x) ≤ t, which means that fi(x) ≤ t for i = 1, · · · , n. Hence, epi(f) = ⋂_{i=1}^n {(x, t) | fi(x) ≤ t} = ⋂_{i=1}^n epi(fi). One can also verify this property directly using the definition of convex functions above.
Example 7 (Sum of r largest components). For x ∈ Rp, we denote by x[i] the i-th largest component of x, i.e., x[1] ≥ x[2] ≥ · · · ≥ x[p]. Then, the function f(x) := ∑_{i=1}^r x[i] is convex.
Indeed, it can be written as f(x) = max{xi1 + · · · + xir | 1 ≤ i1 < i2 < · · · < ir ≤ p}.
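This max-of-linear-functions representation can be checked numerically by brute force over all index subsets (a quick check, not from the lecture):

```python
import itertools
import numpy as np

def sum_r_largest(x, r):
    """f(x) = sum of the r largest components, via sorting."""
    return np.sort(x)[::-1][:r].sum()

def max_over_subsets(x, r):
    """The same f(x), as a max of x_{i1}+...+x_{ir} over index subsets."""
    return max(sum(x[list(idx)])
               for idx in itertools.combinations(range(len(x)), r))

rng = np.random.default_rng(2)
x, y = rng.standard_normal(6), rng.standard_normal(6)
r = 3
assert abs(sum_r_largest(x, r) - max_over_subsets(x, r)) < 1e-12
# Midpoint convexity along the segment [x, y].
mid = sum_r_largest(0.5 * (x + y), r)
assert mid <= 0.5 * sum_r_largest(x, r) + 0.5 * sum_r_largest(y, r) + 1e-12
```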
7. Let F(x, y) be a bifunction for y ∈ Y. If F(·, y) is convex w.r.t. x for each y, then the marginal function
f(x) := sup_{y∈Y} F(x, y)
of F indexed by y is also convex (why?).
Proof: We note that epi(f) = ⋂_{y∈Y} epi(F(·, y)).
Example 8. The conjugate of f, defined by f∗(x) := sup_{y∈Rp} {〈x, y〉 − f(y)}, is convex.
8. Let P : Rp × R → Rp be defined as P(x, t) := x/t with dom(P) := {(x, t) ∈ Rp × R | t > 0}. Then, for any convex set X ⊆ dom(P), the image P(X) := {P(x, t) | (x, t) ∈ X} ⊂ Rp is a convex set.
Proof: Take (x, t), (y, s) ∈ X. For any α ∈ [0, 1], we consider P((1−α)(x, t) + α(y, s)) = ((1−α)x + αy)/((1−α)t + αs) = (1−β)(x/t) + β(y/s) = (1−β)P(x, t) + βP(y, s), where β := αs/((1−α)t + αs) ∈ [0, 1].
9. Let f be convex from Rp → R ∪ {+∞}. The perspective of f is defined as
g(x, t) := tf(x/t) with dom(g) := {(x, t) ∈ Rp × R | x/t ∈ dom(f), t > 0}.
In this case, g is also convex.
Example 9. The function g(x, t) := ‖x‖²2/t is convex on Rp × R++.
Proof: We note that (x, t, s) ∈ epi(g) ⇔ tf(x/t) ≤ s ⇔ f(x/t) ≤ s/t ⇔ (x/t, s/t) ∈ epi(f). Finally, using Item 8, we obtain the result.
10. If F(x, y) is jointly convex with respect to (x, y), and Y is a nonempty, convex set, then
f(x) := inf_{y∈Y} F(x, y)
is convex.
Example 10. The Moreau envelope g(x) := inf_{u∈Rp} {f(u) + (1/2)‖u − x‖²2} is convex. The distance function from x to a convex set X, dX(x) := inf_{y∈X} ‖y − x‖2, is convex.
Proof: Given x1, x2 ∈ dom(g), since g(xi) = inf_{y∈Y} F(xi, y) for i = 1, 2, for any ε > 0 there exist y1, y2 ∈ Y such that F(xi, yi) ≤ g(xi) + ε. We verify the definition as g((1−α)x1 + αx2) = inf_{y∈Y} F((1−α)x1 + αx2, y) ≤ F((1−α)x1 + αx2, (1−α)y1 + αy2) ≤ (1−α)F(x1, y1) + αF(x2, y2) ≤ (1−α)g(x1) + αg(x2) + ε. Letting ε tend to zero, we obtain g((1−α)x1 + αx2) ≤ (1−α)g(x1) + αg(x2). Hence, g is convex.
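For f(u) = |u|, the Moreau envelope has a well-known closed form (the Huber function). The grid-based evaluation below is only a numerical approximation of the infimum, used here as a sanity check (not from the lecture):

```python
import numpy as np

def moreau_env_abs(x, grid=np.linspace(-5, 5, 20001)):
    """Approximate inf_u { |u| + (1/2)(u - x)^2 } on a fine grid."""
    return np.min(np.abs(grid) + 0.5 * (grid - x) ** 2)

def huber(x):
    """Closed form of the Moreau envelope of |.| (the Huber function)."""
    return 0.5 * x**2 if abs(x) <= 1 else abs(x) - 0.5

for x in [-2.0, -0.3, 0.0, 0.7, 3.0]:
    assert abs(moreau_env_abs(x) - huber(x)) < 1e-6
# Midpoint convexity of the envelope on a few sample pairs.
for a, b in [(-2.0, 1.0), (0.2, 3.0)]:
    m = moreau_env_abs(0.5 * (a + b))
    assert m <= 0.5 * moreau_env_abs(a) + 0.5 * moreau_env_abs(b) + 1e-6
```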
1.2.4 Strictly and strongly convex functions
We say that f is strictly convex on X if for any x, y ∈ X with x ≠ y and α ∈ (0, 1), we have f((1−α)x + αy) < (1−α)f(x) + αf(y). We say that f is strongly convex with the convexity parameter µf > 0 if f(·) − (µf/2)‖·‖² is convex. This is equivalent to
f((1−α)x + αy) ≤ (1−α)f(x) + αf(y) − (µf/2)(1−α)α‖y − x‖². (6)
Figure 6: Nonstrongly convex function (left) vs. strongly convex function (right)
Example 11. We note that f(x) = −log(x) is strictly convex on R++ but is not strongly convex on R++. The function f(x) := (1/2)‖x‖² is strongly convex with the convexity parameter µf = 1. If f is a convex function on Rp and µf > 0, then f(·) + (µf/2)‖·‖² is strongly convex with the convexity parameter µf. We also say, for short, that f is µf-strongly convex when f is strongly convex with the strong convexity parameter µf > 0.
1.2.5 Subgradients and subdifferential
Given a convex function f, a vector w ∈ Rp is said to be a subgradient of f at x̄ ∈ dom(f) if
f(x) ≥ f(x̄) + 〈w, x − x̄〉, ∀x ∈ dom(f). (7)
We denote by ∂f(x̄) := {w ∈ Rp | f(x) ≥ f(x̄) + 〈w, x − x̄〉, ∀x ∈ dom(f)} the subdifferential of f at x̄. A function f ∈ Γ0(Rp) is said to be subdifferentiable at x ∈ dom(f) if ∂f(x) is nonempty. If f is convex and continuous on dom(f), then it is subdifferentiable at any point x ∈ ri(dom(f)). If f is differentiable at x ∈ dom(f), then ∂f(x) = {∇f(x)}, where ∇f(x) is the gradient of f at x.
Here are some basic properties:
• If ∂f(x) is nonempty for every x ∈ dom(f), then f is convex.
• If f ∈ Γ0(Rp), then ∂f(x) is nonempty and bounded for any x ∈ int (dom (f)).
• f(x⋆) = min_{x∈Rp} f(x) if and only if 0 ∈ ∂f(x⋆) (Fermat's rule).
Example 12. The function f(x) = |x| on R is subdifferentiable at x = 0, and ∂f(0) = [−1, 1], the whole interval [−1, 1]. If x ≠ 0, then f is differentiable at x with ∂f(x) = {sign(x)}.
Figure 7: Nonsmooth function (left), smooth function (middle), support of epi-graph (right)
Evaluation of subgradients: For simple functions, computing a subgradient can be done explicitly. However,
if the function is complicated, this task is not easy. We review some basic rules in order to compute a subgradient
of a convex function.
• If f ∈ Γ0(Rp), then ∂f(x) is closed and convex, and it is nonempty and compact if x ∈ ri(dom(f)).
• If g(x) = f(Ax + b), then ∂g(x) = A>∂f(Ax + b).
• If f = αg + βh with α, β ≥ 0, then ∂f(x) = α∂g(x) + β∂h(x) for any x ∈ int(dom(f)) = int(dom(g)) ∩ int(dom(h)).
• If f(x) = max_{1≤i≤m} fi(x), then ∂f(x) = conv(⋃{∂fi(x) | f(x) = fi(x)}).
• Let {fω}_{ω∈Ω} ⊂ Γ0(Rp) be a family of convex functions indexed by ω ∈ Ω. If f(x) = sup_{ω∈Ω} fω(x), then conv(⋃_{ω∈I(x)} ∂fω(x)) ⊆ ∂f(x), where I(x) := {ω ∈ Ω | fω(x) = f(x)}. If Ω is compact and fω is continuous w.r.t. ω, then equality holds.
Example 13. Let f(x) = max{a_i>x + bi | 1 ≤ i ≤ m}, and denote I(x) := {i ∈ {1, 2, · · · ,m} | f(x) = a_i>x + bi}. Then, ∂f(x) = conv({ai | i ∈ I(x)}).
Example 14. Consider f(X) := λmax(X), the maximum eigenvalue of a symmetric matrix X. We can write λmax(X) = max_{‖u‖2≤1} u>Xu. Since fu(X) := u>Xu is linear in X, we have ∂fu(X) = {uu>}. In particular, taking u(X) to be a unit eigenvector of X corresponding to the eigenvalue λmax(X), we obtain u(X)u(X)> ∈ ∂f(X).
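Example 14 can be verified directly against the subgradient inequality (7): for G = uu> with u a top unit eigenvector of X, we should have λmax(Y) ≥ λmax(X) + 〈G, Y − X〉 for all symmetric Y. A quick numeric check (not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((5, 5)); X = 0.5 * (X + X.T)  # random symmetric X

lam, V = np.linalg.eigh(X)
u = V[:, -1]            # unit eigenvector for lambda_max (eigh sorts ascending)
G = np.outer(u, u)      # candidate subgradient u u^T

# Subgradient inequality at random symmetric test points Y:
# lambda_max(Y) >= lambda_max(X) + <G, Y - X>.
for _ in range(100):
    Y = rng.standard_normal((5, 5)); Y = 0.5 * (Y + Y.T)
    lhs = np.linalg.eigvalsh(Y).max()
    rhs = lam[-1] + np.sum(G * (Y - X))
    assert lhs >= rhs - 1e-10
```

The inequality holds because λmax(Y) ≥ u>Yu = u>Xu + u>(Y − X)u = λmax(X) + 〈uu>, Y − X〉.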
1.3 Smooth convex functions
We assume that f is smooth on X, meaning that f has continuous first-order (or up to second-order) derivatives on X. We denote by ∇f and ∇²f its gradient (a vector) and Hessian (a matrix), respectively.
For a smooth function f on X , it is convex on X iff
(a) f(y) ≥ f(x) + 〈∇f(x),y − x〉 for all x,y ∈ X , or
(b) 〈∇f(x)−∇f(y),x− y〉 ≥ 0 for all x,y ∈ X .
If, in addition, f is second-order smooth (i.e., twice continuously differentiable) on X, then f is convex on X iff ∇²f(x) ⪰ 0 for all x ∈ X.
Proof: (a) If f is convex, then f((1−α)x + αy) ≤ (1−α)f(x) + αf(y), i.e., f(x + α(y − x)) − f(x) ≤ α(f(y) − f(x)). Hence, [f(x + α(y − x)) − f(x)]/α ≤ f(y) − f(x). Taking the limit as α → 0⁺, we get 〈∇f(x), y − x〉 = lim_{α→0⁺} [f(x + α(y − x)) − f(x)]/α ≤ f(y) − f(x). Conversely, let zα = (1−α)x + αy. We have f(y) ≥ f(zα) + 〈∇f(zα), y − zα〉 and f(x) ≥ f(zα) + 〈∇f(zα), x − zα〉. Multiplying the first inequality by α and the second one by 1−α, then summing up, we obtain (1−α)f(x) + αf(y) ≥ f(zα). Here, we use the fact that (1−α)(x − zα) + α(y − zα) = 0.
(b) Adding (a) to itself with the roles of x and y exchanged, we get (b). Conversely, let zt = x + t(y − x) for t ∈ [0, 1], so that zt − x = t(y − x). By the fundamental theorem of calculus, we have f(y) − f(x) − 〈∇f(x), y − x〉 = ∫₀¹ 〈∇f(x + t(y − x)) − ∇f(x), y − x〉 dt = ∫₀¹ (1/t)〈∇f(zt) − ∇f(x), zt − x〉 dt ≥ 0, due to (b). Hence, we get (a).
1.3.1 Lipschitz gradient convex functions
We say that ∇f is Lipschitz continuous on X with the Lipschitz constant Lf ∈ [0, +∞) if
‖∇f(x) − ∇f(y)‖∗ ≤ Lf‖x − y‖, ∀x, y ∈ X, (8)
where ‖·‖∗ is the dual norm of ‖·‖. We then say that f has a Lipschitz gradient ∇f (or has an Lf-Lipschitz gradient, or is Lf-smooth). We denote by F^{1,1}_L(X) the class of convex functions f whose gradient ∇f is Lipschitz continuous with the Lipschitz constant Lf ∈ [0, +∞).
If f ∈ F^{1,1}_{Lf}(X) and g ∈ F^{1,1}_{Lg}(X), then h = f + g has a Lipschitz gradient with Lh = Lf + Lg, and h = cf has a Lipschitz gradient with Lh = |c|Lf for any c ∈ R. If f ∈ F^{1,1}_{Lf}(X) and g ∈ F^{1,1}_{Lg}(f(X)), then the composition h = g ∘ f has a Lipschitz gradient with Lh = LfLg.
Figure 8: A Lipschitz gradient convex function
Example 15. Here are a few examples.
1. The function f(x) := (1/2)‖Ax − b‖²2 has gradient ∇f(x) = A>(Ax − b) on Rp, which is Lipschitz continuous on Rp with the Lipschitz constant Lf := ‖A‖² (see the operator or matrix norm defined above).
2. The function f(x) := log(1 + e^{a>x+b}) has gradient ∇f(x) = (e^{a>x+b}/(1 + e^{a>x+b}))a, which is Lipschitz continuous on Rp with the Lipschitz constant Lf := ‖a‖²2/4.
3. The gradient of the function f(x) := −log(a>x − b) is NOT Lipschitz continuous on its domain dom(f) := {x ∈ Rp | a>x − b > 0}.
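The constants in items 1 and 2 can be probed numerically: inequality (8) should hold at random pairs of points with the stated Lf. A quick check (not from the lecture), using random data A, b, a, c:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((10, 4)); b = rng.standard_normal(10)
a = rng.standard_normal(4); c = 0.3

grad_ls = lambda x: A.T @ (A @ x - b)        # gradient of (1/2)||Ax - b||^2
L_ls = np.linalg.norm(A, 2) ** 2             # ||A||^2 (spectral norm squared)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
grad_log = lambda x: sigmoid(a @ x + c) * a  # gradient of log(1 + e^{a^T x + c})
L_log = np.linalg.norm(a) ** 2 / 4           # ||a||^2 / 4

for _ in range(200):
    x, y = rng.standard_normal(4), rng.standard_normal(4)
    d = np.linalg.norm(x - y)
    assert np.linalg.norm(grad_ls(x) - grad_ls(y)) <= L_ls * d + 1e-9
    assert np.linalg.norm(grad_log(x) - grad_log(y)) <= L_log * d + 1e-9
```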
From the Lipschitz gradient assumption, we can derive the following bounds [10]:
Theorem 1.2. 1. Bound on the gradient (co-coercivity) [the Baillon–Haddad theorem]: For any x, y ∈ dom(f), we have
〈∇f(x) − ∇f(y), x − y〉 ≥ (1/Lf)‖∇f(x) − ∇f(y)‖²∗. (9)
2. Bound on the function values: For any x, y ∈ dom(f), we have
(1/(2Lf))‖∇f(x) − ∇f(y)‖²∗ ≤ f(y) − f(x) − 〈∇f(x), y − x〉 ≤ (Lf/2)‖y − x‖². (10)
Inequality (9) strengthens the monotonicity of ∇f. As we will see, these inequalities are very useful for developing first-order algorithms in the sequel.
Proof: (1) Fix x ∈ dom(f), and consider the function ϕ(y) := f(y) − f(x) − ∇f(x)>(y − x), which is smooth and convex. Moreover, ∇ϕ(·) = ∇f(·) − ∇f(x) is also Lipschitz continuous with the same Lipschitz constant Lf. For any y ∈ dom(f), by convexity of f, we have ϕ(y) ≥ ϕ(x) = 0. Hence, x is an optimal solution of min ϕ. Using the Lipschitz continuity of ∇ϕ, we have 0 = ϕ(x) ≤ ϕ(y − (1/Lf)∇ϕ(y)) ≤ ϕ(y) − (1/(2Lf))‖∇ϕ(y)‖²∗. Using the definition of ϕ, we can write this as f(x) + 〈∇f(x), y − x〉 + (1/(2Lf))‖∇f(y) − ∇f(x)‖²∗ ≤ f(y). Exchanging the roles of x and y in this inequality, and summing it with the resulting one, we obtain (9).
(2) By the fundamental theorem of calculus, we have f(y) − f(x) = ∫₀¹ ∇f(x + τ(y − x))>(y − x) dτ. Hence, |f(y) − f(x) − ∇f(x)>(y − x)| = |∫₀¹ [∇f(x + τ(y − x)) − ∇f(x)]>(y − x) dτ| ≤ ∫₀¹ ‖∇f(x + τ(y − x)) − ∇f(x)‖∗ ‖y − x‖ dτ ≤ Lf‖y − x‖² ∫₀¹ τ dτ = (Lf/2)‖y − x‖². This implies the right-hand side of (10).
To prove the left-hand side of (10), we consider the function φ(y) := f(y) − ∇f(x)>(y − x) for a fixed x ∈ Rp. We have ∇φ(y) = ∇f(y) − ∇f(x), which gives ∇φ(x) = 0. Hence, y = x is the solution of min_y φ(y). On the other hand, since ∇φ is also Lipschitz continuous with the same Lipschitz constant Lf, we have φ(y − L_f^{-1}∇φ(y)) ≤ φ(y) + ∇φ(y)>(y − L_f^{-1}∇φ(y) − y) + (Lf/2)‖y − L_f^{-1}∇φ(y) − y‖² ≤ φ(y) − (1/(2Lf))‖∇φ(y)‖²∗. Using these arguments, we can derive
φ(x) ≤ φ(y − L_f^{-1}∇φ(y)) ≤ φ(y) − (1/(2Lf))‖∇φ(y)‖²∗.
Using φ(y) = f(y) − ∇f(x)>(y − x) in the above inequality, we get
f(x) ≤ f(y) − ∇f(x)>(y − x) − (1/(2Lf))‖∇f(y) − ∇f(x)‖²∗,
which is the left-hand side of (10).
If we further assume that f is twice continuously differentiable with an Lf-Lipschitz gradient and is µf-strongly convex, then it is necessary and sufficient that
µfI ⪯ ∇²f(x) ⪯ LfI, ∀x ∈ Rp,
where I is the identity matrix. As a standard example, consider a quadratic function f(x) := (1/2)〈Qx, x〉 + q>x, where Q is symmetric positive definite. We have ∇²f(x) = Q. Hence, f is strongly convex with µf = λmin(Q) > 0 and has a Lipschitz continuous gradient with Lf := λmax(Q).
1.3.2 Strongly convex and smooth functions
For strongly convex and smooth functions, we have the following properties (We skip the proof, the reader can
find it in [10]):
1. Strong monotonicity: 〈∇f(x)−∇f(y),x− y〉 ≥ µf ‖x− y‖2 for x,y ∈ Rp.
2. Bound on function values: f(y) ≥ f(x) + 〈∇f(x),y − x〉+µf
2 ‖y − x‖2 for x,y ∈ Rp.
3. Bound on function values with gradients: f(y) ≤ f(x) + 〈∇f(x),y − x〉+ 12µf‖∇f(y)−∇f(x)‖2 for x,y ∈
Rp.
4. Co-coercive property: 〈∇f(y)−∇f(x),y − x〉 ≤ 1µf‖∇f(y)−∇f(x)‖2 for x,y ∈ Rp.
When the function f is both Lf -Lipschitz continuous and µf -strongly convex, we have the following property:
〈∇f(x)−∇f(y),x− y〉 ≥ µfLfLf + µf
‖x− y‖2 +1
Lf + µf‖∇f(x)−∇f(y)‖2 .
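For a positive definite quadratic, these inequalities can be verified numerically with µf = λmin(Q) and Lf = λmax(Q). A quick check (not from the lecture), using a random positive definite Q:

```python
import numpy as np

rng = np.random.default_rng(5)
M = rng.standard_normal((4, 4))
Q = M @ M.T + 0.5 * np.eye(4)          # symmetric positive definite
grad = lambda x: Q @ x                  # gradient of f(x) = (1/2) x^T Q x
mu, L = np.linalg.eigvalsh(Q).min(), np.linalg.eigvalsh(Q).max()

for _ in range(100):
    x, y = rng.standard_normal(4), rng.standard_normal(4)
    g = grad(x) - grad(y); d = x - y
    # Strong monotonicity (property 1).
    assert g @ d >= mu * (d @ d) - 1e-9
    # The combined strongly-convex + Lipschitz-gradient inequality.
    assert g @ d >= (mu * L / (mu + L)) * (d @ d) \
                    + (1.0 / (mu + L)) * (g @ g) - 1e-9
```

For the quadratic case, the combined inequality reduces, in the eigenbasis of Q, to (λ − µf)(Lf − λ) ≥ 0 for each eigenvalue λ ∈ [µf, Lf].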
11/29
STOR @ UNC Numerical optimization: Theory, algorithms, and applications Quoc Tran-Dinh
To prove this property, we consider φ(x) := f(x) − (µf/2)‖x‖² and apply (9) to φ.
Let f be strongly convex and smooth with the strong convexity parameter µf > 0. We consider the following
unconstrained convex problem:
f⋆ := min_{x∈Rp} f(x). (11)
The optimality condition (i.e., Fermat's rule) is ∇f(x⋆) = 0, which is necessary and sufficient for x⋆ to be the unique optimal solution of this problem. Moreover, we have the following property:
f(x) ≥ f(x⋆) + (µf/2)‖x − x⋆‖², (12)
for all x ∈ dom(f). Note that, by replacing the gradient ∇f with subgradients in the following proof, the existence and uniqueness of x⋆ and the inequality (12) still hold for (11) even when f is nonsmooth.
Proof: We note that f(x) ≥ f(x⋆) + 〈∇f(x⋆), x − x⋆〉 + (µf/2)‖x − x⋆‖² = f⋆ + (µf/2)‖x − x⋆‖² for all x ∈ dom(f). If x̂⋆ is another solution of the above problem, then f(x̂⋆) = f⋆. Using the above inequality, we have f⋆ = f(x̂⋆) ≥ f⋆ + (µf/2)‖x̂⋆ − x⋆‖². Hence, ‖x̂⋆ − x⋆‖² ≤ 0, so x̂⋆ = x⋆. We conclude that the problem has a unique solution.
To prove the existence of x⋆, we take any α ∈ R and consider Lα(f) := {x ∈ dom(f) | f(x) ≤ α}, the sublevel set of f. Since f is closed, Lα(f) is closed. Now, we show that it is bounded. Assume by contradiction that Lα(f) is unbounded. Then there exists {xk} ⊂ Lα(f) such that ‖xk‖ → +∞. This also implies that ‖xk − (x0 − (1/µf)∇f(x0))‖² → ∞. By strong convexity of f, we have f(xk) ≥ f(x0) + 〈∇f(x0), xk − x0〉 + (µf/2)‖xk − x0‖²2 = f(x0) + (µf/2)‖xk − (x0 − (1/µf)∇f(x0))‖²2 − (1/(2µf))‖∇f(x0)‖² → ∞. This contradicts the fact that f(xk) ≤ α. Hence, Lα(f) is bounded. By Weierstrass's theorem, the problem min_{x∈Rp} f(x) has a solution x⋆.
The inequality (12) follows from Property 2 above and the fact that ∇f(x⋆) = 0.
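A strongly convex quadratic makes both claims easy to check numerically: gradient descent with step 1/Lf converges to the unique x⋆ = −Q⁻¹q, and inequality (12) holds with µf = λmin(Q). A quick check (not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(6)
M = rng.standard_normal((3, 3))
Q = M @ M.T + np.eye(3)                      # mu_f = lambda_min(Q) >= 1
q = rng.standard_normal(3)
f = lambda x: 0.5 * x @ Q @ x + q @ x
mu = np.linalg.eigvalsh(Q).min()
L = np.linalg.eigvalsh(Q).max()

# Gradient descent converges to the unique minimizer x* = -Q^{-1} q.
x = np.zeros(3)
for _ in range(2000):
    x = x - (1.0 / L) * (Q @ x + q)
x_star = -np.linalg.solve(Q, q)
assert np.linalg.norm(x - x_star) < 1e-8

# Inequality (12): f(z) >= f(x*) + (mu/2) ||z - x*||^2 at random points.
for _ in range(100):
    z = rng.standard_normal(3)
    assert f(z) >= f(x_star) + 0.5 * mu * np.linalg.norm(z - x_star) ** 2 - 1e-9
```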
2 Fenchel conjugates
Let f ∈ Γ0(Rp) be a proper, closed, and convex function. The Fenchel conjugate of f is defined as
f∗(y) := sup_{x∈dom(f)} {〈x, y〉 − f(x)}. (13)
Clearly, f∗ is convex even if f is nonconvex. Moreover, f∗∗∗ = f∗. This concept is illustrated in Figure 9 for both convex and nonconvex functions.
Figure 9: The conjugate of a nonconvex function (left) and of a convex function (right). Here, f∗(y) is the maximum gap between the linear function x>y (red line) and f(x), shown by the dashed line.
It can be proven that if f ∈ Γ0(Rp) (i.e., f is proper, closed, and convex), then f∗∗ = f and f∗ ∈ Γ0(Rp). One direction is easy:
f∗∗(x) = sup_y {〈x, y〉 − f∗(y)} = sup_y {〈x, y〉 − sup_v {〈y, v〉 − f(v)}} = sup_y inf_v {〈y, x − v〉 + f(v)} ≤ f(x),
where the last inequality follows by taking v = x in the infimum.
It is also clear that
f(x) + f∗(y) ≥ 〈x, y〉, ∀x, y, and y ∈ ∂f(x) ⇔ x ∈ ∂f∗(y). (14)
From this definition, if we take f(u) := δU(u), the indicator of a nonempty, closed, and convex set U ⊂ Rp, then f∗(x) = sup_{u∈U} 〈x, u〉, which is the support function of U. Hence, the conjugate of the indicator function of a convex set is its support function. If f(u) = ‖u‖, then
f∗(x) = sup_u {〈x, u〉 − ‖u‖} = 0 if ‖x‖∗ ≤ 1, and +∞ otherwise,
i.e., the indicator of the dual-norm unit ball B‖·‖∗(0, 1). To prove this, we use 〈x, u〉 ≤ ‖u‖‖x‖∗.
Here are some well-known examples. More examples can be found in [1].
1. If f(x) = −log(x) for x ∈ R++, then f∗(y) = −log(−y) − 1 for y ∈ −R++.
2. If f(X) = −log det(X) for X ∈ Sp++, then f∗(Y) = −log det(−Y) − p for Y ∈ −Sp++.
3. If f(x) = log(1 + e^x) for x ∈ R, then f∗(y) = y ln(y) + (1 − y) ln(1 − y) for 0 ≤ y ≤ 1, which is the (negative) entropy function (here, we use the convention that 0 ln(0) = 0).
For a nonconvex function f, its conjugate f∗ is still convex, since f∗(x) := sup_u F(x, u), where F(x, u) := 〈x, u〉 − f(u) is linear (and therefore convex) in x for each u (applying the marginal supremum property above). Two well-known examples:
1. Consider the cardinality function f(x) := card(x) = ‖x‖0, the number of nonzero elements of x. Without loss of generality, we consider x ∈ [−1, 1]p. The conjugate of this function is
f∗(u) := sup_{x∈[−1,1]p} {〈u, x〉 − card(x)}.
Proof: Let σ be a permutation of {1, 2, · · · , p} such that |u_{σ1}| ≥ |u_{σ2}| ≥ · · · ≥ |u_{σp}| (i.e., we sort the absolute values of the coordinates of u in descending order). Then, we can write f∗ as
f∗(u) = max_{r∈{0,1,··· ,p}} { sup_{x_{σi}∈[−1,1]} ∑_{i=1}^r u_{σi}x_{σi} − r }.
For a fixed r, the inner supremum is attained at x_{σi} = sign(u_{σi}), giving ∑_{i=1}^r |u_{σi}| − r. Hence, we can write
f∗(u) = max_{r∈{0,1,··· ,p}} ∑_{i=1}^r (|u_{σi}| − 1) = ∑_{i=1}^p max{0, |ui| − 1}.
Now, we take the conjugate of f∗:
f∗∗(x) := sup_{u∈Rp} {〈x, u〉 − f∗(u)} = sup_{u∈Rp} {〈x, u〉 − ∑_{i=1}^p max{0, |ui| − 1}} = ∑_{i=1}^p sup_{ui} {xiui − max{0, |ui| − 1}}.
It is not hard to show that sup_{ui} {xiui − max{0, |ui| − 1}} = |xi| when xi ∈ [−1, 1], by solving a one-variable maximization problem. Hence, f∗∗(x) = ∑_{i=1}^p |xi| = ‖x‖1 on [−1, 1]p. Indeed, we consider the following cases:
• If we restrict |ui| ≤ 1, then sup_{|ui|≤1} xiui = |xi|, attained at ui = sign(xi).
• If ui > 1, the objective is xiui − ui + 1 = (xi − 1)ui + 1 ≤ 1 for xi ≤ 1, with value 1 = |xi| at xi = 1; if xi > 1, it grows to +∞ as ui → +∞.
• If ui < −1, symmetrically, the objective is xiui + ui + 1 = (xi + 1)ui + 1 ≤ 1 for xi ≥ −1, with value 1 = |xi| at xi = −1; if xi < −1, it grows to +∞ as ui → −∞.
This allows us to conclude that sup_{ui} {xiui − max{0, |ui| − 1}} = |xi| for xi ∈ [−1, 1].
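The closed form f∗(u) = ∑i max{0, |ui| − 1} can be verified by brute force: since the objective is linear in x for each fixed support, the supremum over [−1, 1]^p is attained at points with coordinates in {−1, 0, +1}. A quick check (not from the lecture):

```python
import itertools
import numpy as np

def conj_card_bruteforce(u):
    """sup over x in [-1,1]^p of <u,x> - card(x): the sup is attained at
    x with coordinates in {-1, 0, +1}, so enumerate those points."""
    best = -np.inf
    for x in itertools.product([-1.0, 0.0, 1.0], repeat=len(u)):
        x = np.array(x)
        best = max(best, u @ x - np.count_nonzero(x))
    return best

def conj_card_formula(u):
    """Closed form derived above."""
    return np.sum(np.maximum(0.0, np.abs(u) - 1.0))

rng = np.random.default_rng(7)
for _ in range(20):
    u = 3.0 * rng.standard_normal(4)
    assert abs(conj_card_bruteforce(u) - conj_card_formula(u)) < 1e-12
```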
2. What about the rank function f(X) := rank(X)? Its convex envelope (on the spectral-norm unit ball) is the nuclear norm, f∗∗(X) = ‖X‖∗. By the SVD, we have rank(X) = ‖σ(X)‖₀, where σ(X) := (σ₁(X), · · · , σ_r(X)) with r = min{m, n} is the vector of singular values of X. Hence, we can write the function in X as a function in σ as

   f(X) = ‖σ‖₀ with X = U diag(σ) Vᵀ, UᵀU = I, VᵀV = I.

We note that the nuclear norm is ‖X‖∗ := Σ_{i=1}^r σ_i(X) = ‖σ(X)‖₁.
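As a quick numerical sanity check (not part of the original notes), the closed form f∗(u) = Σ_i max{0, |u_i| − 1} derived in the first example can be compared against a brute-force evaluation of the supremum on a grid. The grid contains 0 and ±1, where the per-coordinate optimum is attained, so the two values agree up to rounding. The function names below are hypothetical; a minimal sketch:

```python
import numpy as np

def card_conjugate(u):
    # Closed form derived above: f*(u) = sum_i max(0, |u_i| - 1).
    return np.sum(np.maximum(0.0, np.abs(u) - 1.0))

def card_conjugate_bruteforce(u, grid=np.linspace(-1.0, 1.0, 41)):
    # Direct evaluation of sup_{x in [-1,1]^p} <u, x> - ||x||_0 on a grid.
    # The grid contains 0 and +-1, where the coordinatewise optimum lies.
    best = -np.inf
    for idx in np.ndindex(*([len(grid)] * len(u))):
        x = grid[list(idx)]
        best = max(best, u @ x - np.count_nonzero(x))
    return best

rng = np.random.default_rng(0)
u = rng.uniform(-2.0, 2.0, size=3)
assert abs(card_conjugate(u) - card_conjugate_bruteforce(u)) < 1e-8
```

The brute force is exponential in p (here p = 3), so this is only a check of the formula, not a practical method.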
One important property of Fenchel conjugate functions is the following result.

Theorem 2.1. If f is µ_f-strongly convex with µ_f > 0, then f∗ is smooth, and its gradient is Lipschitz continuous with the Lipschitz constant L_{f∗} = 1/µ_f. Conversely, if ∇f is Lipschitz continuous with the constant L_f > 0, then f∗ is strongly convex with the parameter µ_{f∗} = 1/L_f.
Proof: We consider f∗(x) = sup_u { 〈x, u〉 − f(u) } = −min_u { f(u) − 〈x, u〉 }. Since f is µ_f-strongly convex, f(·) − 〈x, ·〉 is also µ_f-strongly convex, so the minimization problem has a unique solution u∗(x) for any x. By the marginal theorem, f∗ is smooth, and its gradient is given by ∇f∗(x) = u∗(x).

Now, we show that ∇f∗ is Lipschitz continuous. Indeed, from the optimality condition of the minimization problem, we have x ∈ ∂f(u∗(x)). Hence, f(u∗(y)) ≥ f(u∗(x)) + 〈x, u∗(y) − u∗(x)〉 + (µ_f/2)‖u∗(y) − u∗(x)‖². Similarly, f(u∗(x)) ≥ f(u∗(y)) + 〈y, u∗(x) − u∗(y)〉 + (µ_f/2)‖u∗(y) − u∗(x)‖². Summing up both inequalities, we obtain µ_f ‖u∗(y) − u∗(x)‖² ≤ 〈u∗(y) − u∗(x), y − x〉. Hence, by the Cauchy-Schwarz inequality, ‖u∗(y) − u∗(x)‖ ≤ (1/µ_f)‖x − y‖. We conclude that ∇f∗ is Lipschitz continuous with the constant L_{f∗} = 1/µ_f.
To prove the second part, since ∇f is L_f-Lipschitz, ∇f is co-coercive as shown in (10), i.e.,

   〈∇f(u∗(x)) − ∇f(u∗(y)), u∗(x) − u∗(y)〉 ≥ (1/L_f)‖∇f(u∗(x)) − ∇f(u∗(y))‖².

Since f is smooth, from the optimality condition, we have x = ∇f(u∗(x)) and ∇f∗(x) = u∗(x). Using this in the above inequality, we get

   〈x − y, ∇f∗(x) − ∇f∗(y)〉 = 〈x − y, u∗(x) − u∗(y)〉 ≥ (1/L_f)‖x − y‖².

We conclude that f∗ is strongly convex with the parameter µ_{f∗} = 1/L_f.
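Theorem 2.1 can be checked numerically on the quadratic f(u) = (1/2)uᵀQu with Q ≻ 0, for which f∗(x) = (1/2)xᵀQ⁻¹x is available in closed form. This is only an illustrative sketch (the matrix size and random seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
Q = M @ M.T + 0.5 * np.eye(4)        # positive definite
mu_f = np.linalg.eigvalsh(Q)[0]      # strong convexity parameter of f(u) = u'Qu/2

# For f(u) = (1/2) u'Qu, the conjugate is f*(x) = (1/2) x'Q^{-1}x, so
# grad f*(x) = Q^{-1}x and its Lipschitz constant is 1/lambda_min(Q) = 1/mu_f.
Qinv = np.linalg.inv(Q)
L_fstar = np.linalg.eigvalsh(Qinv)[-1]
assert abs(L_fstar - 1.0 / mu_f) < 1e-8

# grad f*(x) = u*(x) = argmin_u { f(u) - <x, u> }, as in the marginal theorem:
x = rng.standard_normal(4)
u_star = np.linalg.solve(Q, x)       # stationarity: Qu = x
assert np.allclose(u_star, Qinv @ x)
```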
Another important convex function is the following max function:

   f(x) := max_{u ∈ U} { 〈x, Au〉 − ϕ(u) },    (15)

where U is a nonempty, closed, and convex set, and ϕ is a strongly convex function with the parameter µ_ϕ > 0. Then, by the same proof, we can show that f is smooth, and its gradient is Lipschitz continuous with the Lipschitz constant L_f := ‖A‖²/µ_ϕ. We leave this as an exercise for students.
3 Proximal operators
A key tool for handling the nonsmoothness of a convex function is the proximal operator. This notion is also at the core of many proximal-type algorithms in the sequel. We discuss it here.
3.1 Definition and basic properties
Given a proper, closed, and convex function f : R^p → R ∪ {+∞}, the proximal operator of f at x is defined as

   prox_f(x) := arg min_y { f(y) + (1/2)‖y − x‖² }.    (16)

First, for any x ∈ R^p, prox_f(x) is well-defined and unique. Why? Because the objective of this minimization problem, f(·) + (1/2)‖· − x‖², is strongly convex with the convexity parameter µ = 1. Here are some important properties of prox_f.
Lemma 3.1. Let f ∈ Γ₀(R^p). Then prox_f satisfies the following properties:

1. prox_f satisfies

   x − prox_f(x) ∈ ∂f(prox_f(x)),
   ‖prox_f(x) − prox_f(y)‖² ≤ 〈x − y, prox_f(x) − prox_f(y)〉, ∀x, y ∈ R^p.

2. prox_f is nonexpansive, i.e.,

   ‖prox_f(x) − prox_f(y)‖ ≤ ‖x − y‖, ∀x, y ∈ R^p.

3. (Fenchel-Moreau decomposition) If prox_{f∗} is the proximal operator of the conjugate function f∗ of f, then prox_f and prox_{f∗} satisfy

   prox_f(x) + prox_{f∗}(x) = x, ∀x ∈ R^p.
Proof: The first relation comes from the optimality condition of the strongly convex minimization problem (16). Next, from the first item, we have x − prox_f(x) ∈ ∂f(prox_f(x)) and y − prox_f(y) ∈ ∂f(prox_f(y)). Using the monotonicity of ∂f, we get 〈x − prox_f(x) − (y − prox_f(y)), prox_f(x) − prox_f(y)〉 ≥ 0. Rearranging this, we obtain the second estimate of the first item. The second item is a direct consequence of the first item and the Cauchy-Schwarz inequality.

We finally prove the third item. Consider f∗(x) = sup_u { 〈x, u〉 − f(u) }, the conjugate of f. The proximal operator prox_{f∗}(x) = v⋆(x) of f∗ at x is the solution of

   prox_{f∗}(x) = v⋆(x) = arg min_v { f∗(v) + (1/2)‖v − x‖² } ⇔ min_v max_u { 〈v, u〉 − f(u) + (1/2)‖v − x‖² }.

Writing the optimality condition of this min-max problem, we get 0 ∈ v⋆(x) − ∂f(u⋆(x)) and u⋆(x) + v⋆(x) − x = 0. The last relation implies v⋆(x) = x − u⋆(x). Substituting this into the first relation gives 0 ∈ u⋆(x) − x + ∂f(u⋆(x)); hence, u⋆(x) = prox_f(x). Using u⋆(x) = prox_f(x), v⋆(x) = prox_{f∗}(x), and u⋆(x) + v⋆(x) = x, we conclude that prox_f(x) + prox_{f∗}(x) = x.
Exercise. Prove that prox_{γf}(x) + γ prox_{γ⁻¹f∗}(γ⁻¹x) = x for any f ∈ Γ₀(R^p), γ > 0, and x ∈ R^p.
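A quick numerical check of this identity (not a proof) for the hypothetical instance f = ‖·‖₁: then prox_{γf} is soft-thresholding and f∗ is the indicator of the ℓ∞-unit ball, whose prox (at any scaling) is a clip:

```python
import numpy as np

def soft_threshold(x, gamma):
    # prox_{gamma * ||.||_1}(x), the soft-thresholding operator
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

def proj_linf_ball(x):
    # prox of f*(u) = indicator of {||u||_inf <= 1}, the conjugate of ||.||_1;
    # since f* is an indicator, prox_{f*/gamma} is this projection for any gamma.
    return np.clip(x, -1.0, 1.0)

rng = np.random.default_rng(2)
x = rng.standard_normal(6) * 3.0
gamma = 0.7

# Moreau decomposition: prox_{gamma f}(x) + gamma * prox_{f*/gamma}(x/gamma) = x
lhs = soft_threshold(x, gamma) + gamma * proj_linf_ball(x / gamma)
assert np.allclose(lhs, x)
```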
Projection: Clearly, if f is the indicator function of a nonempty, closed, and convex set X ⊆ R^p, i.e., f = δ_X, then δ_X ∈ Γ₀(R^p). Hence, we can define the proximal operator of δ_X using (16), which turns out to be

   prox_{δ_X}(x) := arg min_y { δ_X(y) + (1/2)‖y − x‖² } = arg min_{y ∈ X} ‖y − x‖ = π_X(x),    (17)

where π_X is the projection operator onto X.
3.2 Tractable proximity
Computing prox_f of a convex function f requires solving a strongly convex program: min_y { f(y) + (1/2)‖y − x‖² }. If f is smooth, this reduces to solving the nonlinear system ∇f(y) + y = x in the unknown y. In particular, when f is component-wise separable as f(x) = Σ_{i=1}^p f_i(x_i), where each f_i is convex, we can compute prox_f by solving p one-dimensional minimization problems min_{u_i} { f_i(u_i) + (1/2)|u_i − x_i|² } for i = 1, · · · , p. This can often be done in a closed form or efficiently.
Definition 5 (Tractable proximity). Given f ∈ Γ₀(R^p), we say that f is proximally tractable if prox_f defined by (16) can be computed efficiently.

By "efficiently", we mean that prox_f can be computed in a closed form (i.e., in an explicit analytical form) or in [low-order] polynomial time. We denote by F_prox(R^p) the class of proximally tractable convex functions.
Here are a few concrete examples; a non-exhaustive list is given in Table 1 and can also be found in [4, 12].
1. Separable functions: Many separable functions admit a closed-form prox-operator. For example, f(x) := ‖x‖₁ = Σ_{i=1}^p |x_i| has the prox-operator prox_{λf}(x) = sign(x) ⊗ max{|x| − λ, 0}, which is the well-known soft-thresholding operator used in image/signal processing. Another example is the group-sparsity penalty f(x) := Σ_{j=1}^g ‖x_{[j]}‖₂.
2. Smooth functions: For several smooth functions, we can obtain a closed-form proximal operator whenever we can solve the optimality condition λ∇f(z) + z − x = 0 explicitly. For example, with f(x) := (1/2)‖Ax − b‖₂², we have prox_{λf}(x) = (I + λAᵀA)⁻¹(λAᵀb + x).
3. The indicator of simple sets: For a given nonempty, closed, and convex set C, consider the projection π_C onto C. Then, the prox-operator of the indicator function f(x) := δ_C(x) of C is prox_{λf}(x) = π_C(x). When C is simple (e.g., a subspace, box, cone, or simplex), computing π_C is cheap and efficient.
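As an illustration of item 2, the closed-form prox of f(x) = (1/2)‖Ax − b‖₂² can be verified against its optimality condition λ∇f(y) + y − x = 0; the problem data below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)
x = rng.standard_normal(3)
lam = 0.4

# prox of f(x) = (1/2)||Ax - b||^2: solve (I + lam A'A) y = lam A'b + x
y = np.linalg.solve(np.eye(3) + lam * A.T @ A, lam * A.T @ b + x)

# Check the optimality condition lam * grad f(y) + (y - x) = 0
residual = lam * A.T @ (A @ y - b) + (y - x)
assert np.linalg.norm(residual) < 1e-10
```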
3.3 A short overview of the proximal-point algorithm
The proximal-point method is one of the fundamental methods in convex optimization. This method is often
combined with other schemes such as gradient descent or Newton methods to obtain implementable variants. Let
us briefly review it here.
Let g ∈ Γ₀(R^p) be proper, closed, and convex. We consider the following convex minimization problem:

   g⋆ := min_{x ∈ R^p} g(x).    (18)

Let us denote by S⋆ the solution set of this problem and assume that S⋆ is nonempty. The proximal-point algorithm for solving (18) can be described as in the scheme (PPA) below [6, 15]. The update x_{k+1} := prox_{λ_k g}(x_k) can be explicitly written as

   x_{k+1} := arg min_{y ∈ R^p} { g(y) + (1/(2λ_k))‖y − x_k‖² }.

Hence, computing x_{k+1} requires solving a strongly convex problem. Using the Fermat rule, we have 0 ∈ (x_{k+1} − x_k) + λ_k ∂g(x_{k+1}), which is called a monotone inclusion (or a generalized equation).
The following theorem proves global convergence of (PPA); its proof can be found in [6].

Theorem 3.2 (Convergence of PPA). Let {x_k}_{k≥0} be a sequence generated by PPA. If 0 < λ_k < +∞, then

   g(x_k) − g⋆ ≤ ‖x₀ − x⋆‖₂² / (2 Σ_{j=0}^{k} λ_j), ∀x⋆ ∈ S⋆, k ≥ 0.
Table 1: A non-exhaustive list of convex functions with tractable proximal operators

• ℓ₁-norm: f(x) := ‖x‖₁; prox_{λf}(x) = sign(x) ⊗ max{|x| − λ, 0}; complexity O(n).
• ℓ₂-norm: f(x) := ‖x‖₂; prox_{λf}(x) = (1 − λ/‖x‖₂)₊ x; complexity O(n).
• Support function: f(x) := max_{y ∈ C} xᵀy; prox_{λf}(x) = x − λπ_C(x/λ).
• Indicator of a box: f(x) := δ_{[a,b]}(x); prox_{λf}(x) = π_{[a,b]}(x); complexity O(n).
• Indicator of S^p₊: f(x) := δ_{S^p₊}(x); prox_{λf}(x) = U[Σ]₊Uᵀ, where x = UΣUᵀ is an eigendecomposition; complexity O(p³).
• Indicator of a hyperplane C := {x | aᵀx = b}: prox_{λf}(x) = π_C(x) = x + ((b − aᵀx)/‖a‖²) a; complexity O(n).
• Indicator of the simplex S := {x | x ≥ 0, 1ᵀx = 1}: prox_{λf}(x) = [x − ν1]₊ for some ν ∈ R; complexity O(n).
• Convex quadratic: f(x) := (1/2)xᵀQx + qᵀx; prox_{λf}(x) = (I + λQ)⁻¹(x − λq); complexity O(n) → O(n³).
• Squared ℓ₂-norm: f(x) := (1/2)‖x‖₂²; prox_{λf}(x) = (1/(1 + λ))x; complexity O(n).
• Negative log: f(x) := −log(x); prox_{λf}(x) = ((x² + 4λ)^{1/2} + x)/2; complexity O(1).

Here, [x]₊ := max{0, x}, δ_C is the indicator function of the convex set C, sign is the sign function, and S^p₊ is the set of symmetric positive semidefinite matrices of size p × p.
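For the simplex row of Table 1, the multiplier ν can be found by a standard sorting-based procedure; the sketch below is one common implementation (O(n log n) because of the sort, rather than the theoretically possible O(n)), with a feasibility check on the output:

```python
import numpy as np

def proj_simplex(x):
    # Projection onto {y : y >= 0, sum(y) = 1}: returns max(x - nu, 0)
    # with nu chosen so that the positive entries sum to one.
    u = np.sort(x)[::-1]                      # sort in descending order
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(x) + 1) > 0)[0][-1]
    nu = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(x - nu, 0.0)

y = proj_simplex(np.array([0.9, 0.5, -0.2]))
assert abs(y.sum() - 1.0) < 1e-12 and (y >= 0).all()
```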
Proximal-point algorithm (PPA)

1. Choose x₀ ∈ R^p and a positive sequence {λ_k}_{k≥0} ⊂ R₊₊.
2. For k = 0, 1, · · ·, update x_{k+1} := prox_{λ_k g}(x_k).
If λ_k ≥ λ > 0, then g(x_k) − g⋆ ≤ ‖x₀ − x⋆‖₂²/(2λ(k + 1)), which shows that the convergence rate of PPA is O(1/k).
PPA is one of the most common algorithms for solving nonsmooth convex optimization problems of the form (18). However, from a practical point of view, it is still a conceptual method, since every iteration requires computing the proximal operator prox_g of g at the current point x_k. Unfortunately, computing prox_g is almost as hard as solving the original problem (18) in general. Therefore, PPA alone is not very useful in practice. However, when PPA is combined with other techniques such as splitting or inexact variants, it becomes a powerful tool for solving many convex problems, as we will see later.
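To make the PPA scheme concrete, here is a toy run on g(x) = (1/2)‖x − c‖², whose prox is available in closed form; the run also checks the bound of Theorem 3.2 with a constant step size λ_k = λ (the data c and λ are arbitrary choices):

```python
import numpy as np

# PPA on g(x) = (1/2)||x - c||^2; its prox has the closed form
# prox_{lam g}(x) = (x + lam * c) / (1 + lam), and g* = 0 at x* = c.
c = np.array([1.0, -2.0, 0.5])
x = np.zeros(3)
lam = 1.0
gaps = []
for k in range(50):
    x = (x + lam * c) / (1.0 + lam)
    gaps.append(0.5 * np.sum((x - c) ** 2))   # g(x_k) - g*

# Theorem 3.2 with lam_j = lam: g(x_k) - g* <= ||x0 - x*||^2 / (2 lam (k+1))
x0_dist2 = np.sum(c ** 2)
for k, gap in enumerate(gaps):
    assert gap <= x0_dist2 / (2.0 * lam * (k + 1)) + 1e-12
```

For this strongly convex g the actual convergence is linear, so the O(1/k) bound is very loose here; the point is only that it holds.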
3.4 Distance generating functions and Bregman divergences
Given a closed, nonempty, and convex set X in R^p, and a norm ‖·‖ defined on R^p, we consider a continuous and convex function ω from X to R. We say that ω is a distance generating function if it satisfies the following properties:

• ω admits a selection ∇ω(x) of subgradients that is continuous in x ∈ X⁰. Here, X⁰ := dom(∂ω), the domain of ∂ω.

• ω is compatible with ‖·‖, i.e., ω is strongly convex with the parameter µ_ω > 0 w.r.t. the norm ‖·‖:

   ω(y) ≥ ω(x) + 〈∇ω(x), y − x〉 + (µ_ω/2)‖y − x‖², ∀x ∈ X⁰, y ∈ X, ∇ω(x) ∈ ∂ω(x).
Examples of such functions include ω(x) = (1/2)‖x‖₂², the standard squared ℓ₂-norm, and ω(x) = Σ_{j=1}^p x_j ln(x_j) + p, the entropy function defined on the standard simplex X ≡ Δ_p := { x ∈ R^p₊ | Σ_{j=1}^p x_j = 1 }. In Nesterov's smoothing context, we call ω a proximity function, or a prox-function.
We define

   x⋆_c := arg min_{x ∈ X} ω(x) and diam_ω(X) := max_{x ∈ X} ω(x),    (19)

the ω-center of X and the ω-diameter of X, respectively. Clearly, x⋆_c exists, and if X is bounded, then diam_ω(X) < +∞. Without loss of generality, we can always assume that ω(x⋆_c) = 0; otherwise, we consider ω(x) := ω(x) − ω(x⋆_c).
Distance generating functions are often used to smooth a nonsmooth convex function in Nesterov's smoothing methods, or in mirror descent methods [2, 9, 11]. For instance, let f be a convex function given by the max-structure (15). Then, we can smooth it as follows:

   f_γ(x) := max_{u ∈ U} { 〈x, Au〉 − ϕ(u) − γω(u) },    (20)

where γ > 0 is called a smoothness parameter. Clearly, we can show that

   f_γ(x) ≤ f(x) ≤ f_γ(x) + γ diam_ω(U), ∀x ∈ dom(f).

Hence, when γ is sufficiently small, f_γ approximates f. By Theorem 2.1, we can show that f_γ is smooth, and its gradient is Lipschitz continuous with the Lipschitz constant L_{f_γ} := ‖A‖²/(γµ_ω).
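As a one-dimensional illustration of this smoothing (a hypothetical instance, not from the notes): taking f(x) = |x| = max_{|u|≤1} xu with ω(u) = u²/2 and µ_ω = 1 yields the Huber function, and the sandwich bound holds with diam_ω(U) = 1/2:

```python
import numpy as np

def f_gamma(x, gamma):
    # Smoothed absolute value: max_{|u|<=1} { x*u - gamma*u^2/2 }.
    # The inner maximizer is the clipped unconstrained solution u = x/gamma.
    u = np.clip(x / gamma, -1.0, 1.0)
    return x * u - 0.5 * gamma * u ** 2

gamma = 0.2
xs = np.linspace(-3.0, 3.0, 601)
fg = f_gamma(xs, gamma)

# f_gamma(x) <= |x| <= f_gamma(x) + gamma * (1/2), since max_{|u|<=1} u^2/2 = 1/2
assert (fg <= np.abs(xs) + 1e-12).all()
assert (np.abs(xs) <= fg + 0.5 * gamma + 1e-12).all()
```

Explicitly, f_γ(x) = x²/(2γ) for |x| ≤ γ and |x| − γ/2 otherwise, which is the classical Huber function.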
Given a distance generating function ω defined on X, we consider the following bifunction from R^p × R^p to R:

   b_ω(x, y) := ω(x) − ω(y) − 〈∇ω(y), x − y〉.    (21)

This function is called the Bregman divergence between x and y (defined through ω); it is also called a Bregman distance. Here are some basic properties of b_ω:

• Since ω is strongly convex, we have b_ω(x, y) ≥ (µ_ω/2)‖x − y‖². Hence, b_ω(x, y) ≥ 0 for all x, y, and b_ω(x, x) = 0.

• In general, b_ω is not a distance in the standard metric sense: it may not be symmetric and may not satisfy the triangle inequality.
• bω(·,y) is strongly convex with the same parameter µω.
We will use Bregman distances to design mirror descent algorithms in the next lectures. We list here some important and widely used Bregman divergences induced by the corresponding distance generating functions:

1. Euclidean distance: If ω(x) := (1/2)‖x‖₂², then b_ω(x, y) = (1/2)‖x − y‖₂².

2. Weighted Euclidean distance: If ω(x) := (1/2)xᵀQx for some Q ∈ S^p₊₊, then b_ω(x, y) = (1/2)(x − y)ᵀQ(x − y).

3. Kullback-Leibler divergence: If ω(x) := Σ_{i=1}^p x_i(log(x_i) − 1), the entropy function, then b_ω(x, y) = Σ_{i=1}^p [ x_i log(x_i/y_i) − x_i + y_i ] is the Kullback-Leibler divergence.

4. Itakura-Saito divergence: If ω(x) := −Σ_{i=1}^p log(x_i), then b_ω(x, y) = Σ_{i=1}^p ( x_i/y_i − log(x_i/y_i) − 1 ) is the Itakura-Saito divergence.
We note that only the divergences in items 1 and 2 are distances. The last two divergences do not satisfy the symmetry axiom of distances, i.e., b_ω(x, y) ≠ b_ω(y, x) for x ≠ y in general.
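The closed forms in items 3 and 4 can be checked directly against the definition (21); the sketch below does this for the entropy case and also exhibits the asymmetry (the probability vectors are arbitrary choices):

```python
import numpy as np

def bregman(omega, grad_omega, x, y):
    # b_omega(x, y) = omega(x) - omega(y) - <grad omega(y), x - y>
    return omega(x) - omega(y) - grad_omega(y) @ (x - y)

# Entropy omega(x) = sum x_i (log x_i - 1), with grad omega(x) = log(x)
omega_ent = lambda x: np.sum(x * (np.log(x) - 1.0))
grad_ent = lambda x: np.log(x)

x = np.array([0.2, 0.5, 0.3])
y = np.array([0.4, 0.4, 0.2])

# Matches the Kullback-Leibler closed form from item 3
kl = np.sum(x * np.log(x / y) - x + y)
assert abs(bregman(omega_ent, grad_ent, x, y) - kl) < 1e-12

# Bregman divergences are generally asymmetric:
assert abs(bregman(omega_ent, grad_ent, x, y)
           - bregman(omega_ent, grad_ent, y, x)) > 1e-6
```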
4 Maximally monotone operators: A brief overview
In order to view splitting methods from a maximally monotone operator perspective, we briefly recall some concepts
and properties of such mappings. For more details on this topic, we refer the reader to many monographs including
[1, 5, 13, 14].
4.1 Maximally monotone operators
We consider a multivalued (or set-valued) mapping T : R^p → 2^{R^p}, which maps each x ∈ R^p to a set T(x) in the range space 2^{R^p}, the collection of all subsets of R^p. More generally, we can define a multivalued mapping on a given nonempty, closed, and convex set X instead of R^p. Similar to single-valued mappings, we define dom(T) := { x ∈ R^p | T(x) ≠ ∅ }, the domain of T, and gra(T) := { (x, y) ∈ dom(T) × R^p | y ∈ T(x) }, the graph of T.
4.1.1 Definitions
Definition 6 (Maximal monotonicity). The multivalued mapping T : R^p → 2^{R^p} is said to be monotone (on its domain) if, for any x, y ∈ dom(T),

   〈u − v, x − y〉 ≥ 0, ∀u ∈ T(x), v ∈ T(y).

The mapping T is said to be strongly monotone with the monotonicity parameter µ_T > 0 if, for any x, y ∈ dom(T),

   〈u − v, x − y〉 ≥ µ_T ‖x − y‖₂², ∀u ∈ T(x), v ∈ T(y).

The mapping T is said to be maximal if there exists no other monotone mapping T̃ such that gra(T̃) properly contains gra(T).
We note that if T is single-valued, then T is monotone if 〈T (x)− T (y),x− y〉 ≥ 0 for all x,y ∈ dom (T ), and
T is strongly monotone with the parameter µT > 0 if 〈T (x)− T (y),x− y〉 ≥ µT ‖x− y‖22 for all x,y ∈ dom (T ).
Note that a monotone mapping is not the same as a decreasing function; in one dimension, monotone mappings correspond to nondecreasing functions. Figure 10 shows that the first function is monotone, while the second and the third ones are non-monotone.

Figure 10: Monotone function (left) and non-monotone functions (middle and right)

The maximality concept is not obvious. Figure 11
provides an example to illustrate this concept. This example is given as follows:
(Maximal mapping) T(x) = { 1 if x > 0;  [−1, 1] if x = 0;  −1 if x < 0 },

(Non-maximal mapping) T̂(x) = { 1 if x > 0;  {1/2} if x = 0;  −1 if x < 0 }.

Here T = ∂|·| is maximal, while gra(T̂) is strictly contained in gra(T), so T̂ is monotone but not maximal.
The most well-known maximally monotone operator is perhaps the subdifferential of a proper, closed, and convex function f. In this case, the mapping T := ∂f is maximally monotone, and it is strongly monotone with µ_T = µ_f if f is µ_f-strongly convex.
Figure 11: Maximally monotone operators
Clearly, if T₁ and T₂ are maximally monotone, then, for any α, β ≥ 0, αT₁ + βT₂ is also maximally monotone. The identity mapping I is strongly maximally monotone with the parameter µ_I = 1. Given a maximally monotone operator T, we define T⁻¹(u) := { x ∈ dom(T) | u ∈ T(x) }, the inverse mapping of T at a given u.
Another example of a maximally monotone operator is the normal cone N_X(·) of a nonempty, closed, and convex set X in R^p, which is defined as follows:

   N_X(x) := { u ∈ R^p | 〈u, x − y〉 ≥ 0, ∀y ∈ X } if x ∈ X, and N_X(x) := ∅ otherwise.    (22)
Figure 12 illustrates the monotonicity of NX (·) of a convex set X .
Figure 12: Normal cone and its maximal monotonicity
4.1.2 The resolvent of a maximally monotone operator
Definition 7 (Resolvent). Given a maximally monotone operator T : R^p → 2^{R^p}, we consider the inverse mapping J_T(·) := (I + T)⁻¹(·). This mapping is called the resolvent of T. The mapping R_T(·) := 2J_T(·) − I(·) is called the reflection of T.

From this definition, we have J_T(u) := { x ∈ dom(T) | u ∈ x + T(x) }. The well-definedness of J_T(·) is guaranteed by the maximal monotonicity of T (see, e.g., [1]). We skip the proof.
The following proposition shows that J_T is single-valued and nonexpansive; the nonexpansiveness of R_T is derived afterwards.

Proposition 4.1. The resolvent J_T := (I + T)⁻¹ of a maximally monotone operator T is single-valued and co-coercive, i.e.,

   ‖J_T(x) − J_T(y)‖² ≤ 〈J_T(x) − J_T(y), x − y〉, ∀x, y.    (23)

Consequently, J_T is nonexpansive, i.e., ‖J_T(x) − J_T(y)‖ ≤ ‖x − y‖ for all x, y.
Proof. Assume that J_T is not single-valued. Then, there exist u, v ∈ J_T(x) with u ≠ v for some x. By the definition of J_T, we have x ∈ u + T(u) and x ∈ v + T(v), i.e., x − u ∈ T(u) and x − v ∈ T(v). By the monotonicity of T, we have 0 ≤ 〈(x − u) − (x − v), u − v〉 = −‖u − v‖², which implies u = v, a contradiction. This shows that J_T is single-valued.

Next, since J_T is single-valued, we define u := J_T(x) and v := J_T(y). By the definition of J_T, we have x ∈ u + T(u) and y ∈ v + T(v). Using the monotonicity of T, we have 〈x − u − (y − v), u − v〉 ≥ 0. Rearranging this, we get ‖u − v‖² ≤ 〈u − v, x − y〉, which is exactly (23). The last statement follows directly from (23) by the Cauchy-Schwarz inequality.
Similarly, we can show that the reflection R_T is also nonexpansive. From Proposition 4.1, we have

   ‖R_T(x) − R_T(y)‖² = ‖2(J_T(x) − J_T(y)) − (x − y)‖² = 4‖J_T(x) − J_T(y)‖² − 4〈J_T(x) − J_T(y), x − y〉 + ‖x − y‖² ≤ ‖x − y‖²,

where the inequality follows from (23). Hence, R_T is also nonexpansive. We note that γT is also maximally monotone for any γ > 0 and any maximally monotone mapping T. Hence, we can define J_{γT} and R_{γT}, respectively.
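A quick numerical illustration for T = ∂‖·‖₁, whose resolvent J_{γT} is the soft-thresholding operator: co-coercivity (23) and the nonexpansiveness of the reflection can be checked on random points (the dimension, γ, and seed are arbitrary choices):

```python
import numpy as np

def soft_threshold(x, gamma):
    # J_{gamma T} for T = subdifferential of ||.||_1, i.e. prox_{gamma ||.||_1}
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

rng = np.random.default_rng(4)
gamma = 0.3
for _ in range(100):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    Jx, Jy = soft_threshold(x, gamma), soft_threshold(y, gamma)
    # co-coercivity (firm nonexpansiveness), inequality (23)
    assert np.sum((Jx - Jy) ** 2) <= (Jx - Jy) @ (x - y) + 1e-12
    # nonexpansiveness of the reflection R = 2 J - I
    Rx, Ry = 2 * Jx - x, 2 * Jy - y
    assert np.linalg.norm(Rx - Ry) <= np.linalg.norm(x - y) + 1e-12
```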
If T := ∂g, the subdifferential mapping of a proper, closed, and convex function g, then J_{∂g} = prox_g. In this case, we can write

   J_{∂g}(x) = (I + ∂g)⁻¹(x) = prox_g(x) = arg min_u { g(u) + (1/2)‖u − x‖² }.

Any property of the resolvent J_T remains true for the proximal operator prox_g.

In particular, if g(·) := δ_X(·), the indicator function of a nonempty, closed, and convex set X in R^p, then ∂δ_X(·) = N_X(·), the normal cone of X. Hence, prox_{δ_X} = π_X, the projection onto X, i.e., J_{∂δ_X} = π_X. In this case, the co-coercivity of π_X implies the well-known projection inequality

   〈π_X(u) − u, y − π_X(u)〉 ≥ 0, ∀y ∈ X.
In particular, π_X(u) is uniquely defined for every u, by the single-valuedness of the resolvent.
4.2 Monotone inclusions and variational inequalities
We consider the composite convex minimization problem discussed in Lecture 1:

   φ⋆ := min_{x ∈ R^p} { φ(x) := f(x) + g(x) }.    (24)

The optimality condition for (24) can be written as

   0 ∈ ∂f(x) + ∂g(x) ≡ T₁(x) + T₂(x),    (25)

which can be viewed as a monotone inclusion for the sum of two maximally monotone operators T₁ = ∂f and T₂ = ∂g. Associated with a maximally monotone operator T, we define the resolvent J_{γT}(x) := (I + γT)⁻¹(x). When T ≡ ∂g, we have J_{γ∂g}(x) ≡ prox_{γg}(x), the proximal operator of γg.
Definition 8. Let T : R^p → 2^{R^p} be a set-valued mapping. The problem of finding x⋆ ∈ R^p such that

   0 ∈ T(x⋆)    (26)

is called an inclusion (also called a generalized equation). When T is single-valued, (26) becomes a system of equations T(x⋆) = 0. If T is maximally monotone, then we call (26) a [maximally] monotone inclusion.
From this definition, we can see that the optimality condition (25) is indeed a maximally monotone inclusion. Next, we consider a constrained convex program:

   f⋆ := min_{x ∈ X} f(x),    (27)

where f is a smooth and convex function, and X is a nonempty, closed, and convex set in R^p. Then, the optimality condition of this problem can be written as

   ∇f(x⋆)ᵀ(x − x⋆) ≥ 0, ∀x ∈ X.    (28)

The inequality (28) is called the variational inequality associated with (27). We define it formally as follows.
Definition 9. Given a nonempty, closed, and convex set X in R^p, let F : X → R^p be a mapping. The problem of finding x⋆ ∈ X such that

   F(x⋆)ᵀ(y − x⋆) ≥ 0, ∀y ∈ X,    (VIP)

is called a variational inequality problem (VIP). If F is maximally monotone, we call (VIP) a [maximally] monotone variational inequality.
Let N_X be the normal cone of X. Then, we can write (VIP) as a maximally monotone inclusion of the form:

   0 ∈ F(x⋆) + N_X(x⋆).

Since F is maximally monotone, T := F + N_X is also maximally monotone. Hence, (VIP) is a special case of (26). The theory, applications, and numerical methods for monotone inclusions and variational inequalities can be found, e.g., in [5, 7, 8, 16].
5 Duality theory
We first discuss the KKT (Karush-Kuhn-Tucker) condition of a smooth convex optimization problem. Then, we
recall some basic facts of the Lagrange duality theory.
5.1 Minimax theorem and its optimization view
Consider a bifunction L : X × Y → R ∪ {±∞}, where X and Y are two nonempty, closed, and convex subsets of R^p and R^n, respectively. The following relation is always true:

   sup_{y ∈ Y} inf_{x ∈ X} L(x, y) ≤ inf_{x ∈ X} sup_{y ∈ Y} L(x, y).    (29)

If, in addition, L(·, y) is convex on X for any y ∈ Y, and L(x, ·) is concave on Y for any x ∈ X, then, under some mild conditions, we have

   sup_{y ∈ Y} inf_{x ∈ X} L(x, y) = inf_{x ∈ X} sup_{y ∈ Y} L(x, y).    (30)

The relation (30) is an informal statement of the min-max theorem in nonlinear/functional analysis.
Now, consider a convex problem

   f⋆ := min_{x ∈ R^p} f(x).    (31)

Since f is proper, closed, and convex, we can write its biconjugate as f∗∗(x) = sup_y { yᵀx − f∗(y) } = f(x). Therefore, by the min-max relation (30), we have

   f⋆ = min_x f(x) = min_x max_y { yᵀx − f∗(y) } = max_y min_x { yᵀx − f∗(y) }.

Since min_x yᵀx equals 0 if y = 0 and −∞ otherwise, the inner minimum is −f∗(y) if y = 0 and −∞ otherwise; hence f⋆ = −f∗(0).
For the composite convex problem

   min_{x ∈ R^p} { F(x) := f(x) + g(x) },    (32)

where f and g are proper, closed, and convex, a more direct way to derive the dual problem is as follows. Writing (32) via the conjugates of f and g, we have

   min_x { f(x) + g(x) } = min_{x ∈ R^p} { max_u { xᵀu − f∗(u) } + max_v { xᵀv − g∗(v) } }
   = max_u max_v min_x { xᵀ(u + v) − f∗(u) − g∗(v) }   (by (30))
   = max_{u,v} { −f∗(u) − g∗(v) | u + v = 0 }
   = −min_u { f∗(u) + g∗(−u) }.

Hence, we call the last problem in this chain the dual problem of (32). This principle can also be applied to other problems as well.
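To see this dual construction in action, consider the hypothetical instance f(x) = (1/2)‖x − a‖² and g = λ‖·‖₁: then f∗(u) = (1/2)‖u‖² + aᵀu and g∗ is the indicator of {‖·‖∞ ≤ λ}, so both the primal and dual values are computable in closed form and should coincide:

```python
import numpy as np

# Primal: min_x (1/2)||x - a||^2 + lam*||x||_1, solved by soft-thresholding.
# Dual:   -min_{||u||_inf <= lam} { (1/2)||u||^2 + a'u }, from
#         f*(u) = (1/2)||u||^2 + a'u and g*(v) = indicator of {||v||_inf <= lam}.
a = np.array([2.0, -0.3, 1.1, -4.0])
lam = 1.0

x_star = np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)
primal = 0.5 * np.sum((x_star - a) ** 2) + lam * np.sum(np.abs(x_star))

u_star = np.clip(-a, -lam, lam)   # coordinatewise clipped unconstrained minimizer
dual = -(0.5 * np.sum(u_star ** 2) + a @ u_star)

assert abs(primal - dual) < 1e-12   # strong duality, zero gap
```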
5.2 Dual problem of a smooth convex program and KKT condition
Let us start with the following smooth equality-constrained convex program:

   f⋆ := min_{x ∈ R^p} f(x)  s.t.  Ax = b,    (33)

where f : R^p → R is a smooth convex function, A ∈ R^{m×p}, and b ∈ R^m.

The Lagrange function of (33) is defined as

   L(x, y) := f(x) − yᵀ(Ax − b),    (34)

where y ∈ R^m is the Lagrange multiplier associated with the linear constraint Ax = b. We can define the dual function of (33) as

   d(y) := inf_{x ∈ dom(f)} L(x, y) = inf_{x ∈ dom(f)} { f(x) − yᵀ(Ax − b) } = −sup_{x ∈ dom(f)} { (Aᵀy)ᵀx − f(x) } + bᵀy = −f∗(Aᵀy) + bᵀy.    (35)
Note that the dual function is concave, but generally nonsmooth (why?). Here, we use inf and sup since the min or max may not be attained. The dual problem of (33) is defined as

   d⋆ := sup_{y ∈ R^m} d(y).    (36)

Clearly, we have

   d⋆ := sup_{y ∈ R^m} d(y) = sup_{y ∈ R^m} inf_{x ∈ dom(f)} L(x, y) ≤ inf_{x ∈ dom(f)} sup_{y ∈ R^m} L(x, y) = inf { f(x) | x ∈ dom(f), Ax = b },    (37)

since sup_y L(x, y) equals f(x) if Ax = b and +∞ otherwise.
Hence, the dual function d(y) either takes the value −∞ or provides a lower bound on the optimal value f⋆. In particular, we always have d(y) ≤ f(x) for all y ∈ R^m and all x ∈ dom(f) with Ax = b, which is called weak duality. It is obvious that d(y) ≤ d⋆ ≤ f⋆ ≤ f(x) for x ∈ F := { x ∈ dom(f) | Ax = b } and y ∈ R^m. This inequality also leads to the notion of a saddle point of the Lagrange function: (x⋆, y⋆) is called a saddle point of L if

   L(x⋆, y) ≤ L(x⋆, y⋆) ≤ L(x, y⋆), ∀x ∈ dom(f), y ∈ R^m.    (38)

Now, we can write the optimality condition of (33) as

   ∂L(x, y)/∂x = ∇f(x) − Aᵀy = 0,    ∂L(x, y)/∂y = b − Ax = 0.    (39)
This condition is also called the KKT (Karush-Kuhn-Tucker) condition of (33). It is indeed a nonlinear system. If the Slater condition

   ri(dom(f)) ∩ { x ∈ R^p | Ax = b } ≠ ∅    (40)

holds, then (39) is necessary and sufficient for the pair (x⋆, y⋆) to be a primal-dual solution of (33) and (36). In this case, we have f⋆ = d⋆, which is referred to as strong duality (with zero duality gap).
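A numerical sketch of (33)-(39) for a convex quadratic f(x) = (1/2)xᵀQx + qᵀx (the data are arbitrary): the KKT system (39) is linear, and f∗ is known in closed form, so strong duality f⋆ = d⋆ can be verified directly:

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 4, 2
M = rng.standard_normal((n, n))
Q = M @ M.T + np.eye(n)          # positive definite
q = rng.standard_normal(n)
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# KKT system (39): grad f(x) - A'y = Qx + q - A'y = 0 and Ax = b
K = np.block([[Q, -A.T], [A, np.zeros((m, m))]])
sol = np.linalg.solve(K, np.concatenate([-q, b]))
x_star, y_star = sol[:n], sol[n:]

primal = 0.5 * x_star @ Q @ x_star + q @ x_star
# d(y) = -f*(A'y) + b'y with f*(z) = (1/2)(z - q)'Q^{-1}(z - q)
z = A.T @ y_star
dual = -0.5 * (z - q) @ np.linalg.solve(Q, z - q) + b @ y_star

assert np.linalg.norm(A @ x_star - b) < 1e-10
assert abs(primal - dual) < 1e-8    # strong duality for (33)
```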
5.3 Dual problem of a general convex program and KKT condition
We consider the following constrained convex optimization problem:

   f⋆ := min_{x ∈ R^p} f(x)  s.t.  Ax − b ∈ K, x ∈ X,    (41)

where f : R^p → R is a smooth and convex function, A ∈ R^{m×p}, b ∈ R^m, X is a nonempty, closed, and convex subset of R^p, and K is a nonempty, closed, and convex cone in R^m.

We define a partial Lagrange function of this problem as follows:

   L(x, y) := f(x) − 〈y, Ax − b〉,    (42)

where y ∈ R^m is a Lagrange multiplier associated with the constraint Ax − b ∈ K. The dual function arising from this Lagrange function is defined as

   d(y) = inf_{x ∈ X} L(x, y) = inf_{x ∈ X} { f(x) − 〈y, Ax − b〉 } = −sup_x { 〈Aᵀy, x〉 − f(x) − δ_X(x) } + bᵀy = −f∗_X(Aᵀy) + bᵀy.    (43)
Here, we absorb X into f as f_X(·) := f(·) + δ_X(·), where δ_X is the indicator of X. The dual problem becomes

   d⋆ = max_{y ∈ R^m} { d(y) − s_K(−y) } = −min_{y ∈ R^m} { f∗_X(Aᵀy) − bᵀy + s_K(−y) },    (44)

where s_K is the support function of K. Since K is a cone, s_K(−y) = δ_{K∗}(y), the indicator of the dual cone K∗. We have the following special cases:

• If K = {0_m}, then Ax − b = 0 and s_K(−y) = 0 for all y ∈ R^m. The dual problem becomes min_{y ∈ R^m} { f∗_X(Aᵀy) − bᵀy }.

• If K = R^m₊, then Ax − b ≥ 0 and s_K(−y) = δ_{R^m₊}(y). The dual problem becomes min_{y ∈ R^m₊} { f∗_X(Aᵀy) − bᵀy }.

• If K = S^m₊, then Ax − b ⪰ 0 and s_K(−y) = δ_{S^m₊}(y). The dual problem becomes min_{y ∈ S^m₊} { f∗_X(Aᵀy) − bᵀy }.
If we define the Lagrange function of (41) as L(x, y) := f(x) + 〈y, Ax − b〉, then (x⋆, y⋆) is a saddle point of L if and only if

   0 ∈ ∂f(x⋆) + Aᵀy⋆ + N_X(x⋆),
   Ax⋆ − b ∈ K, x⋆ ∈ X, y⋆ ∈ −K∗,    (45)

where K∗ is the dual cone of K. If, in addition, we have

   〈Ax⋆ − b, y⋆〉 = 0,    (46)

then x⋆ is a primal optimal solution of (41). The last condition (46) is called the complementary slackness condition.

We can consider a more general constraint of the form A(x) − b ∈ K, where A is convex w.r.t. the convex cone K, i.e.,

   A((1 − α)x + αy) − (1 − α)A(x) − αA(y) ∈ K, ∀x, y ∈ R^p, ∀α ∈ [0, 1].

Here, A is not necessarily linear. For instance, if K = −R^n₊ and g : R^p → R^n is convex (componentwise), then the constraint g(x) ≤ 0 is convex w.r.t. −R^n₊. We refer the reader to [1] for more details on this topic.
Now, we present two examples of the optimality condition (45) and (46).
Example 16 (Linear programming). We consider the following linear program:

   min_{x ∈ R^p} { cᵀx | Ax − b ≤ 0 }.

Here, K := −R^n₊ and X := R^p. The dual cone of −R^n₊ is −R^n₊ itself, and N_{R^p}(x⋆) = {0}. The optimality condition of this problem is

   Aᵀy⋆ + c = 0, Ax⋆ − b ≤ 0, y⋆ ≥ 0, (Ax⋆ − b) ∘ y⋆ = 0.

This has exactly the form (45). This condition is known as the KKT condition for LPs. The last equality comes from the complementary slackness condition (46), where ∘ is the element-wise product.
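The LP optimality condition above can be checked on a tiny hand-constructed example (the data and the primal-dual pair below are hypothetical, chosen so that the KKT conditions can be verified numerically):

```python
import numpy as np

# Tiny LP: min x1 + x2  s.t.  Ax - b <= 0, encoding x >= 0 and x1 + x2 >= 1.
c = np.array([1.0, 1.0])
A = np.array([[-1.0, 0.0], [0.0, -1.0], [-1.0, -1.0]])
b = np.array([0.0, 0.0, -1.0])

# Hand-computed primal/dual pair: only the third constraint is active.
x_star = np.array([0.5, 0.5])
y_star = np.array([0.0, 0.0, 1.0])

assert np.allclose(A.T @ y_star + c, 0.0)         # stationarity
assert (A @ x_star - b <= 1e-12).all()            # primal feasibility
assert (y_star >= 0).all()                        # dual feasibility
assert np.allclose((A @ x_star - b) * y_star, 0)  # complementary slackness
```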
Example 17. In the second example, we consider the following general convex program:

   f⋆ := min_{x ∈ X} { f(x) | Ax = b, g_i(x) ≤ 0, i = 1, · · · , m },    (47)

where f : R^p → R ∪ {+∞} is convex and smooth, A ∈ R^{n×p}, b ∈ R^n, g_i : R^p → R ∪ {+∞} for i = 1, · · · , m are convex and smooth, and X is a nonempty, closed, and convex set in R^p. To avoid the trivial case, we assume that X ∩ ri(dom(f)) ≠ ∅. The corresponding dual problem is

   d⋆ := sup_{(µ,λ)} { d(µ, λ) | µ ≥ 0 },    (48)

where the dual function is defined as

   d(µ, λ) := inf_{x ∈ X} { L(x, µ, λ) := f(x) + µᵀg(x) + λᵀ(Ax − b) }.

We say that problem (47) satisfies the Slater condition if

   ∃x ∈ ri(X) such that g_i(x) < 0, ∀i = 1, · · · , m, and Ax = b.    (49)

Here, g = (g₁, · · · , g_m). Under this Slater condition, we have the following strong duality statement:
• There is no duality gap, and there exists at least one Lagrange multiplier pair (µ⋆, λ⋆).

• If the primal problem (47) has an optimal solution x⋆, then the triple (x⋆, µ⋆, λ⋆) satisfies the following optimality condition (also called a generalized KKT condition):

   〈∇f(x⋆) + ∇g(x⋆)ᵀµ⋆ + Aᵀλ⋆, x − x⋆〉 ≥ 0, ∀x ∈ X   (Lagrangian optimality)
   x⋆ ∈ X, g(x⋆) ≤ 0, Ax⋆ = b   (Primal feasibility)
   µ⋆ ≥ 0   (Dual feasibility)
   µ⋆_i g_i(x⋆) = 0, i = 1, · · · , m.   (Complementary slackness)    (50)

When X = R^p, the first condition reduces to ∇f(x⋆) + ∇g(x⋆)ᵀµ⋆ + Aᵀλ⋆ = 0. In this case, (50) reduces to the standard KKT condition widely used in the literature.
6 Convergence rates and computational complexity theory
To evaluate the efficiency of iterative methods, we often compare the rates of convergence of the sequences they generate. Let us recall these notions here, together with the big-O notation borrowed from complexity theory. We then review the concepts of local and global convergence used for iterative methods.
6.1 Rate of convergence
We consider a sequence {a_k}_{k≥0} of real numbers. We say that {a_k} is monotonically nonincreasing (respectively, nondecreasing) if a_{k+1} ≤ a_k (respectively, a_{k+1} ≥ a_k) for all k ≥ 0. We say that {a_k} is bounded if sup_{k≥0} |a_k| < +∞.

Definition 10. We say that a sequence {a_k} in R converges to a⋆ ∈ R if for any ε > 0, there exists k_ε depending on ε such that |a_k − a⋆| < ε for all k ≥ k_ε.
For example, the sequence {k/(k+1)}_{k≥0} converges to a⋆ = 1, and the sequence {k sin(1/k)}_{k≥1} converges to a⋆ = 1, while the sequence {(−1)^k} and the sequence {sin(k)}_{k≥0} do not converge even though they are bounded.
6.1.1 Convergence rate
1. We say that {a_k} converges to a⋆ at a sublinear rate if |a_k − a⋆| ≤ C/k^d for some given constants d ∈ (0, +∞) and C > 0.

2. We say that {a_k} converges to a⋆ at a linear rate if |a_k − a⋆| ≤ Cρ^k for some given constants ρ ∈ (0, 1) and C > 0. Here, ρ is called a contraction factor of this sequence. This rate of convergence is also called an R-linear convergence rate. If instead |a_{k+1} − a⋆| ≤ ρ|a_k − a⋆| for all k ≥ 0, then we have a Q-linear rate.¹

3. We say that {a_k} converges to a⋆ at a superlinear rate if |a_k − a⋆| ≤ C_k ρ^k for some given constant ρ ∈ (0, 1) and C_k > 0 such that lim_{k→∞} C_k = 0. Similarly, we call this an R-superlinear convergence rate. If |a_{k+1} − a⋆|/|a_k − a⋆| → 0 as k → ∞, then we say that {a_k} converges to a⋆ at a Q-superlinear rate.

4. We say that {a_k} converges to a⋆ at a quadratic rate if |a_k − a⋆| ≤ Cρ^{2^k} for some given constants ρ ∈ (0, 1) and C > 0. This rate of convergence is again an R-quadratic rate. If |a_{k+1} − a⋆|/|a_k − a⋆|² ≤ C for all k ≥ 0, then we have a Q-quadratic convergence rate.
Here are a few examples (we leave these examples as exercises for the reader):

• The sequence {τ_k} generated by τ₀ := 1 and τ_{k+1} := (τ_k/2)(√(τ_k² + 4) − τ_k) converges to zero at a sublinear rate with d = 1.

¹ Here, "R" stands for "root", and "Q" means "quotient".
• With τk defined above, the sequence βk with β0 = 1 and βk+1 := (1 − τk)βk converges to zero at a
sublinear rate with d = 2.
• The sequence {a_k} defined by a_0 = 1 and a_{k+1} := (3/4)a_k + 1/8 converges to a* = 1/2 at a linear rate with the contraction factor ρ = 3/4.
• The sequence {a_k} defined by a_0 = 1 and a_{k+1} := a_k/(k + 2) converges to a* = 0 at a superlinear rate.
• The sequence {a_k} defined by a_0 := 0.5 and a_{k+1} := a_k² − 2a_k + 2 converges to a* = 1 at a quadratic rate.
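The examples above can also be checked numerically. The following sketch (the `iterate` helper is our own, not from the lecture) runs each recursion and prints quantities that reveal the claimed rates.

```python
import math

# A sketch reproducing the example sequences above and checking the claimed
# rates numerically (the helper name is our own construction).
def iterate(a0, step, n):
    """Return [a_0, ..., a_n] for the recursion a_{k+1} = step(a_k, k)."""
    a, out = a0, [a0]
    for k in range(n):
        a = step(a, k)
        out.append(a)
    return out

tau = iterate(1.0, lambda t, k: 0.5 * t * (math.sqrt(t * t + 4.0) - t), 1000)
lin = iterate(1.0, lambda a, k: 0.75 * a + 0.125, 50)   # limit a* = 1/2
sup = iterate(1.0, lambda a, k: a / (k + 2), 20)        # limit a* = 0
quad = iterate(0.5, lambda a, k: a * a - 2 * a + 2, 5)  # limit a* = 1

print(1000 * tau[1000])       # roughly 2: empirically tau_k behaves like 2/k
print(abs(lin[50] - 0.5))     # about 0.5 * 0.75^50: linear rate
print(sup[20])                # about 1/21!: superlinear rate
print(abs(quad[5] - 1.0))     # about 0.5^(2^5) = 2.3e-10: quadratic rate
```

Only five quadratic iterations are taken because the error squares at every step and soon drops below double precision.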
To illustrate these rates geometrically, we also look at the following figures.
[Figure: three semilog plots of the error |a_k − a*| against the iteration counter k, comparing sublinear vs. linear, linear vs. superlinear, and superlinear vs. quadratic rates.]
Figure 13: Sublinear: a_k = 1/k; Linear: a_k = 0.5^k; Superlinear: a_k = 1/k^k; Quadratic: a_k = 0.5^(2^k).
6.2 A brief review of complexity theory in optimization
When we design an iterative optimization algorithm of the form x_{k+1} ∈ A(x_0, · · · , x_k, G_k), we often query the necessary information G_k at each iteration k to form the sequence {x_k}, where k is the iteration counter. This information can be function values, subgradients, first-order and/or second-order derivatives, etc. Depending on the information G_k queried at each iteration, we have different classes of algorithms as follows.
• Zero-order methods: Using only function values of the objective function (and/or constraints).
• First-order methods: Using function values and first-order derivatives (e.g., subgradients, gradients, Jacobians, etc.) of the objective function (and/or constraints).
• Quasi-second-order methods: Using function values, [sub]gradients, and approximations of the Hessian built from gradients of the objective function (and/or constraints).
• Second-order methods: Using function values, gradients, and Hessians of the objective function (and/or
constraints).
The mechanism for querying this information is called an oracle. The algorithm calls this oracle to obtain the information G_k at each iteration. We will distinguish two notions of computational complexity:
• Analytical complexity: The total number of oracle calls in an algorithm to achieve a desired solution.
• Arithmetical complexity: The total number of arithmetic operations (floating-point operations (flops))
in an algorithm to achieve a desired solution.
Sometimes, we refer to the analytical complexity as the iteration-complexity, while the arithmetical complexity is called the computational complexity. The latter can be viewed as a combination of the iteration-complexity and the per-iteration complexity. Most iterative optimization algorithms are characterized by their worst-case iteration-complexity to achieve an approximate solution up to a given accuracy in a certain sense.
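To make the distinction concrete, here is a toy sketch of our own (not from the lecture): gradient descent on a separable quadratic, queried through a counting first-order oracle. The analytical complexity is the number of oracle calls; the arithmetical complexity multiplies this count by the per-call cost.

```python
# A toy construction of our own separating the two complexity notions:
# gradient descent on f(x) = 0.5 * sum(q_i * x_i^2), queried via a counting
# first-order oracle.
q = [1.0, 2.0, 5.0, 10.0]      # diagonal Hessian; Lipschitz constant L = max(q)
oracle_calls = 0

def grad_oracle(x):
    """First-order oracle: one call returns the gradient, costing O(p) flops here."""
    global oracle_calls
    oracle_calls += 1
    return [qi * xi for qi, xi in zip(q, x)]

L = max(q)
x = [1.0] * len(q)
g = grad_oracle(x)
while max(abs(gi) for gi in g) > 1e-8:        # stop when the gradient is small
    x = [xi - gi / L for xi, gi in zip(x, g)]  # step size 1/L
    g = grad_oracle(x)

# Analytical (iteration) complexity: the number of oracle calls made.
# Arithmetical complexity: roughly oracle_calls * p flops for this oracle.
print(oracle_calls)
```

For this strongly convex toy problem the error contracts by the factor 1 − μ/L per iteration, so the oracle-call count grows like (L/μ) log(1/ε), matching the linear-rate discussion above.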
We also use the following big-O notation to denote the worst-case complexity or convergence rate of an algorithm. Given two functions f : R → R and g : R → R, we say that f(x) = O(g(x)) if there exist a constant C > 0 independent of x and a point x_0 such that |f(x)| ≤ C|g(x)| for all x ≥ x_0. Here are a few examples:
• O(1) means that f(x) is bounded by a constant.
• O(log(n)) is a logarithmic complexity (e.g., finding an item in a sorted array with a binary search). It means that |f(n)| ≤ C log(n) for all sufficiently large n ∈ N.
• O(n) is a linear time complexity (e.g., finding a minimum element of an array).
• O(np) (p > 0) is a polynomial time complexity. For instance, interior-point methods for conic programming
often have a worst-case polynomial time complexity.
• O(2n) is an exponential time complexity. This complexity often appears in combinatorial optimization. It
is also a worst-case complexity of simplex methods for LPs.
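As an illustration of the O(log n) case, the following sketch counts probes in a binary search (the counting wrapper is our own construction):

```python
import math

# A sketch verifying the O(log n) example above by counting probes in a
# binary search (the counting wrapper is our own; each probe does one
# three-way comparison of arr[mid] against the target).
def binary_search(arr, target):
    comparisons = 0
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        comparisons += 1
        if arr[mid] == target:
            return mid, comparisons
        elif arr[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, comparisons

n = 1 << 20                              # sorted array of about one million items
idx, c = binary_search(list(range(n)), n - 1)
print(c, math.ceil(math.log2(n)) + 1)    # probes stay within log2(n) + 1
```

Each probe halves the search interval, so the count never exceeds about log2(n) + 1 even in the worst case.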
Note that we also use big-O notation to characterize convergence rates. For example, O(1/n), O(1/n²), or O(1/√n) stands for a sublinear convergence rate.
6.3 Local convergence vs. global convergence
We say that a sequence {x_k} in X generated by an algorithm A such that x_{k+1} ∈ A(x_0, · · · , x_k) in a given domain X locally converges to x* ∈ X if there exists a neighborhood N(x*) such that whenever we choose a starting point x_0 ∈ N(x*), we have lim_{k→∞} x_k = x*. Equivalently, there exists r > 0 such that for any x_0 ∈ B_r(x*) := {x ∈ X | ‖x − x*‖ ≤ r}, the sequence {x_k} converges to x*. In many numerical algorithms, we often see that {x_k} ⊆ B_r(x*), but this is not necessary.
[Figure: a local convergence region around x*: iterates x_0, x_1, . . . , x_k converge to x* when started inside the region (local convergence), whereas global convergence allows any starting point.]
We say that a sequence {x_k} in X generated by an algorithm A such that x_{k+1} ∈ A(x_0, · · · , x_k) in a given domain X globally converges to x* ∈ X if, for any starting point x_0 ∈ X, {x_k} converges to x*. However, in many cases we do not have exactly this type of convergence. Hence, we may use some metric M to describe convergence as lim_{k→∞} M(x_k, x*) = 0. For instance, in gradient-type methods, we often use the objective function f as a metric to measure convergence, e.g., lim_{k→∞} (f(x_k) − f*) = 0, where f* is the optimal value.
As concrete examples of local and global convergence, gradient methods for convex optimization often have a global convergence guarantee, while Newton methods have only a local quadratic convergence. In nonconvex optimization, it is important to distinguish these two types of convergence.
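As a quick numerical illustration (the example function is our own choice, not from the lecture), consider Newton's method for solving tanh(x) = 0. Since tanh′(x) = 1 − tanh(x)², the Newton step is x_{k+1} = x_k − tanh(x_k)/(1 − tanh(x_k)²) = x_k − sinh(x_k)cosh(x_k). The iterates converge extremely fast from a starting point near the root x* = 0, but blow up from a starting point far away:

```python
import math

# Newton's method for tanh(x) = 0 (our own illustrative example): the Newton
# step x - tanh(x)/(1 - tanh(x)^2) simplifies to x - sinh(x) * cosh(x).
def newton_tanh(x0, n):
    x = x0
    for _ in range(n):
        x = x - math.sinh(x) * math.cosh(x)
    return x

print(abs(newton_tanh(0.5, 10)))  # x0 inside the local region: error -> 0 rapidly
print(abs(newton_tanh(1.5, 3)))   # x0 outside the region: the iterates diverge
```

This is the classical picture of a local convergence region: the very same iteration is fast near x* and useless far from it, which is why Newton-type methods are often globalized with line searches or trust regions.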
References
[1] H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer-Verlag, 2nd edition, 2017.
[2] A. Beck and M. Teboulle. Smoothing and first order methods: A unified framework. SIAM J. Optim.,
22(2):557–580, 2012.
[3] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, 2004.
[4] P. L. Combettes and J.-C. Pesquet. Proximal splitting methods in signal processing. In Fixed-point algorithms
for inverse problems in science and engineering, pages 185–212. Springer, 2011.
[5] F. Facchinei and J.-S. Pang. Finite-dimensional variational inequalities and complementarity problems, vol-
ume 1-2. Springer-Verlag, 2003.
[6] O. Guler. On the convergence of the proximal point algorithm for convex minimization. SIAM J. Control
Optim., 29(2):403–419, 1991.
[7] D. Klatte and B. Kummer. Nonsmooth equations in optimization: regularity, calculus, methods and applica-
tions. Springer-Verlag, 2001.
[8] B. S. Mordukhovich. Variational Analysis and Generalized Differentiation I: Basic Theory, volume 330. Springer Science & Business Media, 2006.
[9] A. Nemirovski. Lectures on modern convex optimization. Society for Industrial and Applied Mathematics (SIAM), 2001.
[10] Y. Nesterov. Introductory lectures on convex optimization: A basic course, volume 87 of Applied Optimization.
Kluwer Academic Publishers, 2004.
[11] Y. Nesterov. Smooth minimization of non-smooth functions. Math. Program., 103(1):127–152, 2005.
[12] N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):123–231, 2013.
[13] R. Rockafellar and R. Wets. Variational Analysis, volume 317. Springer, 2004.
[14] R. T. Rockafellar. Convex Analysis, volume 28 of Princeton Mathematics Series. Princeton University Press,
1970.
[15] R.T. Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal on Control and
Optimization, 14:877–898, 1976.
[16] R.T. Rockafellar and R. J-B. Wets. Variational Analysis. Springer-Verlag, 1997.