Selected Methods for Modern Optimization in Data Analysis
Department of Statistics and Operations Research
UNC-Chapel Hill
Fall 2018
Instructor: Quoc Tran-Dinh Scribe: Quoc Tran-Dinh
Lecture 3: A brief overview on convex analysis and duality
Index Terms: Convex sets, convex cones, convex functions, Fenchel’s conjugate, proximal operators,
monotone operators, duality theory, convergence rate, and computational complexity theory.
Copyright: This lecture is released under a Creative Commons License.
This lecture gives a very short overview for the following central concepts:
• Convex sets, polyhedra, convex hulls, and convex cones.
• Convex functions, strictly/strongly convex functions, subgradients, and smooth convex functions.
• Fenchel conjugate and properties.
• Proximal operators, projections, proximity functions and Bregman divergences.
• Lagrange duality theory, saddle-points, Slater condition, and Karush-Kuhn-Tucker (KKT) condition.
• Maximal monotone operators.
• A brief review on convergence rates and computational complexity theory.
For more details on these topics, we refer to three monographs [1, 3, 14].
1 Convex sets and convex functions
Let Rp be a finite-dimensional Euclidean space, and let X be a subset of Rp. We assume that students are already familiar with some topological properties of X, such as the interior and the boundary of X, and when X is closed or open, etc. We denote by int(X) the interior of X, cl(X) the closure of X, and ∂(X) the boundary of X. We say that X is closed if X = cl(X). We say that X is compact if it is closed and bounded. We use R+ and R++ for the sets of nonnegative and positive real numbers, respectively. We use Sp (respectively, Sp+ and Sp++) to denote the set of symmetric (respectively, symmetric positive semidefinite and symmetric positive definite) matrices of size p × p. A matrix X ∈ Sp+ is denoted by X ⪰ 0. If X ∈ Sp++, we write X ≻ 0. Similarly, X ⪯ 0 and X ≺ 0 mean that −X ∈ Sp+ and −X ∈ Sp++, respectively. For P, Q ∈ Sp, we say that P ⪯ Q if Q − P ∈ Sp+. If ‖·‖ is a norm on Rp, then ‖u‖∗ := max{〈u, v〉 | ‖v‖ ≤ 1} denotes its dual norm.
Two central concepts we use in this course are convex sets and convex functions. Let us quickly review them.
1.1 Convex sets
1.1.1 Definitions and examples
A set X is called an affine set if (1−α)x1 +αx2 ∈ X for all x1,x2 ∈ X and α ∈ R. In other words, an affine set
X is a set that contains any line going through two points x1,x2 ∈ X .
Definition 1 (Convex set). We say that X is convex if for all x1, x2 ∈ X and α ∈ [0, 1], we have (1−α)x1 + αx2 ∈ X.
STOR @ UNC Numerical optimization: Theory, algorithms, and applications Quoc Tran-Dinh
Figure 1: Convex set (left), nonconvex set (middle) and polytope (right)
There are several equivalent definitions of convex sets. For example, X is convex if it contains any closed
segment [x1,x2] connecting two points x1,x2 ∈ X .
Example 1. Here are a few quick examples:
1. There are two trivial convex sets: the singleton {0p} and the whole space Rp. Of course, the empty set, ∅, is also convex.
2. A hyperplane H(a, b) := {x ∈ Rp | a>x = b} in Rp is convex. A half-space H−(a, b) := {x ∈ Rp | a>x ≤ b} in Rp is convex.
3. A ball Br(x̄) := {x ∈ Rp | ‖x − x̄‖ ≤ r} is convex, where the center x̄ ∈ Rp and the radius r ≥ 0 are given.
4. The solution set of a linear matrix inequality, {x ∈ Rp | ∑_{i=1}^p xiAi ⪯ 0}, is convex, where the Ai (i = 1, · · · , p) are symmetric matrices in Sq (why?).
Proof: It is the intersection ⋂_{u∈Rq} {x ∈ Rp | u>A(x)u ≤ 0} with A(x) := ∑_{i=1}^p xiAi, which is linear in x.
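This convexity can be checked numerically: any convex combination of two feasible points of the LMI remains feasible. Below is a small sanity check (not from the lecture); the matrices A1, A2 are made-up examples.

```python
import numpy as np

# Hypothetical data: two symmetric matrices A1, A2 defining the LMI
# x1*A1 + x2*A2 <= 0 (in the semidefinite order) on R^2.
A1 = np.array([[-1.0, 0.0], [0.0, -2.0]])
A2 = np.array([[-1.0, 1.0], [1.0, -3.0]])

def lmi_feasible(x, tol=1e-10):
    """Check that x1*A1 + x2*A2 is negative semidefinite."""
    M = x[0] * A1 + x[1] * A2
    return np.linalg.eigvalsh(M).max() <= tol

# Two feasible points and convex combinations of them.
x_a, x_b = np.array([1.0, 0.0]), np.array([0.5, 0.5])
assert lmi_feasible(x_a) and lmi_feasible(x_b)
for alpha in np.linspace(0.0, 1.0, 11):
    assert lmi_feasible((1 - alpha) * x_a + alpha * x_b)
```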
1.1.2 Polyhedra
Definition 2 (Polyhedron). We say that X is polyhedral in Rp if it is the intersection of a finite number of half-spaces in Rp, i.e.,
P := {x ∈ Rp | a_i>x ≤ bi, i = 1, · · · ,m}.
When P is bounded, we call P a polytope as in Figure 1 (right). We can represent a polyhedron in matrix form as P := {x ∈ Rp | Ax ≤ b}, where A ∈ Rm×p is the matrix whose rows are a_i> for i = 1, · · · ,m, and b := (b1, · · · , bm)> ∈ Rm. Convex polyhedra form the feasible sets of linear and quadratic programs, as we have seen before. They also represent the feasible sets of many other problems, as we will see in the next lectures.
Example 2. Here are a few quick examples:
1. A box in Rp, [a, b] := {x ∈ Rp | aj ≤ xj ≤ bj , j = 1, · · · , p}, is a polyhedron in Rp.
2. A hyperplane H(a, b) := {x ∈ Rp | a>x = b} in Rp is also a polyhedron.
3. The ℓ1-norm ball B‖·‖1 := {x ∈ Rp | ‖x‖1 ≤ 1} is a polytope.
4. The standard simplex in Rp, ∆p := {x ∈ Rp | ∑_{i=1}^p xi = 1, xi ≥ 0, i = 1, · · · , p}, is also a polytope.
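As an illustration of the matrix form P = {x | Ax ≤ c}, the box example above can be written with 2p inequalities (xj ≤ bj and −xj ≤ −aj). A minimal sketch, with made-up bounds:

```python
import numpy as np

# A box [a, b] in R^p written in polyhedral form {x | A x <= c}.
p = 3
a = np.array([-1.0, 0.0, 2.0])
b = np.array([1.0, 2.0, 5.0])
A = np.vstack([np.eye(p), -np.eye(p)])   # 2p x p constraint matrix
c = np.concatenate([b, -a])              # right-hand side in R^{2p}

def in_polyhedron(x, tol=1e-12):
    return np.all(A @ x <= c + tol)

x1 = np.array([0.0, 1.0, 3.0])   # inside the box
x2 = np.array([1.0, 0.0, 2.0])   # a vertex of the box
assert in_polyhedron(x1) and in_polyhedron(x2)
# A convex combination of feasible points stays feasible.
assert in_polyhedron(0.3 * x1 + 0.7 * x2)
assert not in_polyhedron(np.array([2.0, 0.0, 3.0]))  # violates x_1 <= 1
```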
1.1.3 Convex hulls
Given a set of n points {xi}_{i=1}^n in Rp and n nonnegative numbers {αi}_{i=1}^n ⊂ R+ such that ∑_{i=1}^n αi = 1, the point ∑_{i=1}^n αixi is called a convex combination of {xi}_{i=1}^n with the coefficients {αi}_{i=1}^n. Given a nonempty set X in Rp, we define the following set:
{x ∈ Rp | x = ∑_{i=1}^n αixi, αi ∈ R+, ∑_{i=1}^n αi = 1, xi ∈ X , n ≥ 1}. (1)
This set is called the convex hull of X, which is denoted by conv(X).
Given n points {xi}_{i=1}^n in Rp and n real numbers {αi}_{i=1}^n ⊂ R such that ∑_{i=1}^n αi = 1 (not necessarily nonnegative), the point ∑_{i=1}^n αixi is called an affine combination of {xi}_{i=1}^n with the coefficients {αi}_{i=1}^n. Given a subset X of Rp, let aff(X) be the set that contains all the affine combinations of any nonempty set of points in X. Then aff(X) is called the affine hull of X.
Figure 2: Convex hulls. Left: The convex combination of two points, Middle: The convex hull of a finite set of
points, Right: The convex hull of a nonconvex set Y.
Example 3. Here are a few quick examples:
1. Given two points x(1) and x(2), conv({x(1), x(2)}) = [x(1), x(2)], the segment between these two points.
2. conv({ei, −ei | 1 ≤ i ≤ p}) = B‖·‖1 := {x ∈ Rp | ‖x‖1 ≤ 1}, where ei is the i-th unit vector of Rp.
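The second item can be made concrete: any x with ‖x‖1 ≤ 1 is explicitly a convex combination of the 2p points ±ei. A minimal sketch (the slack-splitting choice below is one arbitrary way to use up the leftover coefficient mass):

```python
import numpy as np

def l1_ball_coefficients(x):
    """Write x with ||x||_1 <= 1 as a convex combination of {+e_i, -e_i}:
    weight |x_i| on sign(x_i)*e_i, and the leftover mass 1 - ||x||_1
    split evenly between +e_1 and -e_1 (so it cancels)."""
    coef_plus = np.maximum(x, 0.0)    # weight on +e_i
    coef_minus = np.maximum(-x, 0.0)  # weight on -e_i
    slack = 1.0 - np.abs(x).sum()
    coef_plus[0] += slack / 2
    coef_minus[0] += slack / 2
    return coef_plus, coef_minus

x = np.array([0.3, -0.2, 0.1])        # ||x||_1 = 0.6 <= 1
cp, cm = l1_ball_coefficients(x)
assert np.all(cp >= 0) and np.all(cm >= 0)
assert abs(cp.sum() + cm.sum() - 1.0) < 1e-12   # coefficients sum to 1
recon = cp @ np.eye(3) + cm @ (-np.eye(3))      # sum_i cp_i e_i + cm_i (-e_i)
assert np.allclose(recon, x)
```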
1.1.4 Basic properties
We recall some basic properties of convex sets:
• The intersection of convex sets is convex. More precisely, if Xi, i ∈ I (I is an index set) are convex, then
∩i∈IXi is also convex.
• However, the union of two convex sets is not necessarily convex.
• Any scaling of a convex set is convex, i.e., αX := {αx | x ∈ X} is convex for any convex set X in Rp and α ∈ R.
• The Cartesian product X × Y of two convex sets X and Y is convex.
• The sum X + Y := {x + y | x ∈ X, y ∈ Y} of two convex sets X and Y is convex.
1.1.5 Relative interior
We say that x ∈ X is a relative interior point of X if there exists (an open) neighborhood Vε(x) of x such that
Vε(x) ∩ aff (X ) ⊆ X .
Example 4. A real interval [a, b] has interior (a, b) in R but has no interior when viewed as a subset of R2. In this case, [a, b] has relative interior (a, b) in R2. Similarly, a plane has no interior in R3, but it does have a relative interior in R3.
1.1.6 Extreme points
Now, we define extreme points, and extreme rays of a convex set.
Definition 3. Given a convex set X, a point x ∈ X is called an extreme point of X if x cannot be written as the midpoint of two distinct points y, z ∈ X.
In other words, x ∈ X is an extreme point of X if x cannot be written as x = (1/2)(y + z), where y, z ∈ X and y ≠ z.
Figure 3 illustrates that x (denoted by E) is an extreme point of X, since it cannot be written as x = (1/2)(y + z) for y, z ∈ X with y ≠ z. But x (denoted by N) is not an extreme point, since we can write x = (1/2)(y + z) where y, z ∈ X. Here, x is a point on the boundary of X. Clearly, any extreme point must lie on the boundary of X.
Figure 3: Extreme points: E is an extreme point, and N is not an extreme point. Left: X has a finite number of
extreme points (red), Middle: X has an infinite number of extreme points, Right: All the points on the sphere
are extreme points.
The existence of extreme points is guaranteed by the Krein–Milman theorem [14]. We omit the proof here.
Theorem 1.1 (Krein-Milman’s theorem). Any nonempty, closed and convex set that does not contain a straight
line always has an extreme point.
1.1.7 Convex cones
A set K in Rp is called a cone if tx ∈ K for any t ≥ 0 and x ∈ K. The set K is called a convex cone if K is a cone and convex. In other words, K is a convex cone if for all x1, x2 ∈ K and α, β ≥ 0, we have αx1 + βx2 ∈ K.
Figure 4: A convex cone
Example 5. Here are a few examples.
• The nonnegative orthant Rp+ of Rp is a convex cone, called an orthant cone.
• The set of positive semidefinite matrices Sp+ := {X ∈ Sp | X ⪰ 0} is a convex cone (why?).
Proof: First, if X, Y ∈ Sp+, then for any α, β ≥ 0, we have αX + βY ∈ Sp+. Hence, it is a convex cone. Alternatively, we can write Sp+ = ⋂_{u∈Rp} {X ∈ Sp | u>Xu ≥ 0} with u>Xu = trace((uu>)X), which is linear in X.
• The Lorentz set Lp := {(x, t) ∈ Rp+1 | ‖x‖2 ≤ t} is a convex cone, called an ice-cream cone, or a second-order cone (why?).
Proof: For t > 0, the inequality ‖x‖2 ≤ t is equivalent to t² − ‖x‖²2 ≥ 0, i.e., t − t^{−1}x>x ≥ 0. By the Schur complement, this is equivalent to
[ t x> ; x tI ] ⪰ 0.
This inequality defines a convex set.
One can also prove this directly. Take (x, t) ∈ Lp and (y, s) ∈ Lp, so that ‖x‖2 ≤ t and ‖y‖2 ≤ s. Then, for any α ≥ 0 and β ≥ 0, we have ‖αx + βy‖2 ≤ α‖x‖2 + β‖y‖2 ≤ αt + βs, which shows that α(x, t) + β(y, s) ∈ Lp.
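The direct argument above is easy to probe numerically: conic combinations of random second-order-cone points stay in the cone. A quick sanity check (not from the lecture):

```python
import numpy as np

def in_soc(x, t, tol=1e-12):
    """Membership in the second-order cone L^p = {(x, t) : ||x||_2 <= t}."""
    return np.linalg.norm(x) <= t + tol

rng = np.random.default_rng(0)
for _ in range(100):
    # Build two random points of the cone and nonnegative weights.
    x = rng.standard_normal(4); t = np.linalg.norm(x) + rng.random()
    y = rng.standard_normal(4); s = np.linalg.norm(y) + rng.random()
    alpha, beta = rng.random(), rng.random()
    assert in_soc(x, t) and in_soc(y, s)
    # The conic combination alpha*(x,t) + beta*(y,s) stays in the cone.
    assert in_soc(alpha * x + beta * y, alpha * t + beta * s)
```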
Given a nonempty, closed, and convex set X, we denote by TX(x) and NX(x) the tangent and normal cones of X at x ∈ X, respectively, which are defined as
TX(x) := cl({d ∈ Rp | x + td ∈ X for some t > 0}) and NX(x) := {w ∈ Rp | w>(x − y) ≥ 0, ∀y ∈ X}. (2)
These two cones are key concepts to study the optimality condition of optimization problems.
1.2 Convex functions
In this section, we briefly review some basic concepts and properties of convex functions. The reader can refer to
[1, 3, 14] for more details.
1.2.1 Basic definitions
Consider a function f from Rp to R ∪ {±∞}, which may take the values −∞ and +∞.
1. We denote by dom(f) := {x ∈ Rp | f(x) < +∞} the (effective) domain of f.
2. We say that f is proper if f(x) > −∞ for all x ∈ Rp, i.e., f does not take the value −∞, and dom(f) ≠ ∅.
3. We also denote by epi(f) := {(x, t) ∈ dom(f) × R | f(x) ≤ t} the epigraph of f.
4. We say that f is closed if epi(f) is closed.
Now, we give a formal definition of convex function.
Definition 4. Given a nonempty subset X ⊆ Rp, we say that f is convex on X if for any x1,x2 ∈ X and any
α ∈ [0, 1], we have (1− α)x1 + αx2 ∈ X and
f((1− α)x1 + αx2) ≤ (1− α)f(x1) + αf(x2). (3)
We say that f is convex if it is convex on its domain. The function f is said to be concave on X if −f is convex on X.
We can also show that f is convex iff its epi (f) is convex. We denote by Γ0(X ) the class of proper, closed and
convex functions on X . If a function f is not convex, then we say that f is nonconvex.
Example 6. Here are a few examples of convex functions.
1. The affine function f(x) := a>x + b for given a ∈ Rp and b ∈ R is convex on Rp.
2. The quadratic function (with positive semidefinite Hessian) f(x) := (1/2)〈Qx, x〉 + q>x for Q ∈ Sp+ and q ∈ Rp is convex on Rp.
Figure 5: Nonconvex function (left), convex function (middle), concave function (right)
3. The logarithmic function f(x) := − log(x) for x ∈ R++ is convex on R++.
4. The logistic loss function f(x) := log(1 + ex) is convex on R.
5. Any norm f(x) := ‖x‖ is convex on Rp. In particular, the squared ℓ2-norm f(x) := ‖x‖²2 is convex on Rp.
1.2.2 The indicator and support functions
There are two special functions: the indicator and support functions of a convex set. Given a nonempty, closed,
and convex set X in Rp, we define
• The indicator function of X:
δX(x) := 0 if x ∈ X, and δX(x) := +∞ otherwise. (4)
This function is proper, closed, and convex on Rp with dom(δX) = X.
• The support function sX of X:
sX(x) := sup_{u∈X} 〈x, u〉. (5)
This function is also convex on Rp.
1.2.3 Basic properties of convex functions
Here are some basic properties of convex functions.
1. f is convex on Rp iff epi(f) is convex on Rp × R.
Proof: Assume that f is convex. For any (x, t) ∈ epi(f) and (y, s) ∈ epi(f), with x, y ∈ dom(f) and t, s ∈ R, we have f(x) ≤ t and f(y) ≤ s. Since f is convex, for any α ∈ [0, 1], f((1−α)x + αy) ≤ (1−α)f(x) + αf(y) ≤ (1−α)t + αs. This shows that ((1−α)x + αy, (1−α)t + αs) ∈ epi(f). Conversely, for any x, y ∈ dom(f), we take t = f(x) and s = f(y), so that (x, t), (y, s) ∈ epi(f). Since epi(f) is convex, for any α ∈ [0, 1], we have ((1−α)x + αy, (1−α)t + αs) ∈ epi(f), which means that f((1−α)x + αy) ≤ (1−α)t + αs = (1−α)f(x) + αf(y). This shows that f is convex.
2. If f and g are two convex functions on Rp, then αf + βg is also convex for any α, β ∈ R+.
3. If f : Rn → R ∪ {+∞} is convex and A : Rp → Rn is a linear/affine mapping, then f(A(·)) is also convex.
4. Jensen's inequality: Given a convex function f ∈ Γ0(Rp), n points xi ∈ dom(f), and coefficients αi ≥ 0 for i = 1, · · · , n with ∑_{i=1}^n αi = 1, we have
f(∑_{i=1}^n αixi) ≤ ∑_{i=1}^n αif(xi).
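Jensen's inequality is easy to verify numerically for a concrete convex function; below is a quick check (not from the lecture) with f(x) = ‖x‖²2 and random convex coefficients:

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sum(x**2)   # a convex function on R^p

for _ in range(50):
    n, p = 5, 3
    points = rng.standard_normal((n, p))
    alpha = rng.random(n); alpha /= alpha.sum()    # convex coefficients
    lhs = f(alpha @ points)                        # f(sum_i alpha_i x_i)
    rhs = sum(a * f(x) for a, x in zip(alpha, points))
    assert lhs <= rhs + 1e-12                      # Jensen's inequality
```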
6/29
STOR @ UNC Numerical optimization: Theory, algorithms, and applications Quoc Tran-Dinh
5. A function f : Rp → R ∪ {+∞} is convex iff, for any x ∈ dom(f) and v ∈ Rp, its line restriction
ϕ(t) := f(x + tv) with dom(ϕ) := {t ∈ R | x + tv ∈ dom(f)}
is convex on dom(ϕ). This property is very useful for verifying whether a function is convex or not.
Proof: If f is convex, we have ϕ((1−α)t + αs) = f(x + ((1−α)t + αs)v) = f((1−α)(x + tv) + α(x + sv)) ≤ (1−α)f(x + tv) + αf(x + sv) = (1−α)ϕ(t) + αϕ(s) for any α ∈ [0, 1] and t, s ∈ dom(ϕ). Conversely, if ϕ((1−α)t + αs) ≤ (1−α)ϕ(t) + αϕ(s), then, for any x, y ∈ dom(f), we take v = y − x, so that f((1−α)(x + tv) + α(x + sv)) ≤ (1−α)f(x + tv) + αf(x + sv). Taking t = 0 and s = 1, we get f((1−α)x + αy) ≤ (1−α)f(x) + αf(y).
6. If f1, · · · , fn are convex, then f(x) := max{fi(x) | 1 ≤ i ≤ n} is also convex. This is a useful property for representing nonsmooth convex functions. For example, |x| = 2 max{x, 0} − x.
Proof: We note that (x, t) ∈ epi(f) iff f(x) ≤ t, which means that fi(x) ≤ t for i = 1, · · · , n. Hence, epi(f) = ⋂_{i=1}^n {(x, t) | fi(x) ≤ t} = ⋂_{i=1}^n epi(fi). One can also verify this property directly using the definition of convex functions above.
Example 7 (Sum of r largest components). For x ∈ Rp, we denote by x[i] the i-th largest component of x, i.e., x[1] ≥ x[2] ≥ · · · ≥ x[p]. Then, the function f(x) := ∑_{i=1}^r x[i] is convex.
Indeed, it can be written as f(x) = max{xi1 + · · · + xir | 1 ≤ i1 < i2 < · · · < ir ≤ p}.
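This max-of-linear-functions representation can be checked numerically by brute force over all index subsets (a quick check, not from the lecture):

```python
import itertools
import numpy as np

def sum_r_largest(x, r):
    """f(x) = sum of the r largest components, via sorting."""
    return np.sort(x)[::-1][:r].sum()

def max_over_subsets(x, r):
    """The same f(x), as a max of x_{i1}+...+x_{ir} over index subsets."""
    return max(sum(x[list(idx)])
               for idx in itertools.combinations(range(len(x)), r))

rng = np.random.default_rng(2)
x, y = rng.standard_normal(6), rng.standard_normal(6)
r = 3
assert abs(sum_r_largest(x, r) - max_over_subsets(x, r)) < 1e-12
# Midpoint convexity along the segment [x, y].
mid = sum_r_largest(0.5 * (x + y), r)
assert mid <= 0.5 * sum_r_largest(x, r) + 0.5 * sum_r_largest(y, r) + 1e-12
```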
7. Let F(x, y) be a bifunction for y ∈ Y. If F(·, y) is convex w.r.t. x for each y, then the marginal function
f(x) := sup_{y∈Y} F(x, y)
of F indexed by y is also convex (why?).
Proof: We note that epi(f) = ⋂_{y∈Y} epi(F(·, y)).
Example 8. The conjugate of f, defined by f∗(x) := sup_{y∈Rp} {〈x, y〉 − f(y)}, is convex.
8. Let P : Rp × R → Rp be defined as P(x, t) := x/t with dom(P) := {(x, t) ∈ Rp × R | t > 0}. Then, for any convex set X ⊆ dom(P), the image P(X) := {P(x, t) | (x, t) ∈ X} ⊂ Rp is a convex set.
Proof: Take (x, t), (y, s) ∈ X. For any α ∈ [0, 1], we consider P((1−α)(x, t) + α(y, s)) = ((1−α)x + αy)/((1−α)t + αs) = (1−β)(x/t) + β(y/s) = (1−β)P(x, t) + βP(y, s), where β := αs/((1−α)t + αs) ∈ [0, 1].
9. Let f be convex from Rp → R ∪ {+∞}. The perspective of f is defined as
g(x, t) := tf(x/t) with dom(g) := {(x, t) ∈ Rp × R | x/t ∈ dom(f), t > 0}.
In this case, g is also convex.
Example 9. The function g(x, t) := ‖x‖²2/t is convex on Rp × R++.
Proof: We note that (x, t, s) ∈ epi(g) ⇔ tf(x/t) ≤ s ⇔ f(x/t) ≤ s/t ⇔ (x/t, s/t) ∈ epi(f). Finally, using Item 8, we obtain the result.
10. If F(x, y) is jointly convex with respect to (x, y), and Y is a nonempty, convex set, then
f(x) := inf_{y∈Y} F(x, y)
is convex.
Example 10. The Moreau envelope g(x) := inf_{u∈Rp} {f(u) + (1/2)‖u − x‖²2} is convex. The distance function from x to a convex set X, dX(x) := inf_{y∈X} ‖y − x‖2, is convex.
Proof: Given x1, x2 ∈ dom(g), since g(xi) = inf_{y∈Y} F(xi, y) for i = 1, 2, for any ε > 0 there exist y1, y2 ∈ Y such that F(xi, yi) ≤ g(xi) + ε. We verify the definition as g((1−α)x1 + αx2) = inf_{y∈Y} F((1−α)x1 + αx2, y) ≤ F((1−α)x1 + αx2, (1−α)y1 + αy2) ≤ (1−α)F(x1, y1) + αF(x2, y2) ≤ (1−α)g(x1) + αg(x2) + ε. Letting ε tend to zero, we obtain g((1−α)x1 + αx2) ≤ (1−α)g(x1) + αg(x2). Hence, g is convex.
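For f(u) = |u|, the Moreau envelope has a well-known closed form (the Huber function). The grid-based evaluation below is only a numerical approximation of the infimum, used here as a sanity check (not from the lecture):

```python
import numpy as np

def moreau_env_abs(x, grid=np.linspace(-5, 5, 20001)):
    """Approximate inf_u { |u| + (1/2)(u - x)^2 } on a fine grid."""
    return np.min(np.abs(grid) + 0.5 * (grid - x) ** 2)

def huber(x):
    """Closed form of the Moreau envelope of |.| (the Huber function)."""
    return 0.5 * x**2 if abs(x) <= 1 else abs(x) - 0.5

for x in [-2.0, -0.3, 0.0, 0.7, 3.0]:
    assert abs(moreau_env_abs(x) - huber(x)) < 1e-6
# Midpoint convexity of the envelope on a few sample pairs.
for a, b in [(-2.0, 1.0), (0.2, 3.0)]:
    m = moreau_env_abs(0.5 * (a + b))
    assert m <= 0.5 * moreau_env_abs(a) + 0.5 * moreau_env_abs(b) + 1e-6
```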
1.2.4 Strictly and strongly convex functions
We say that f is strictly convex on X if for any x, y ∈ X with x ≠ y and α ∈ (0, 1), we have f((1−α)x + αy) < (1−α)f(x) + αf(y). We say that f is strongly convex with the convexity parameter µf > 0 if f(·) − (µf/2)‖·‖² is convex. This is equivalent to
f((1−α)x + αy) ≤ (1−α)f(x) + αf(y) − (µf/2)(1−α)α‖y − x‖². (6)
Figure 6: Nonstrongly convex function (left) vs. strongly convex function (right)
Example 11. We note that f(x) = −log(x) is strictly convex on R++ but is not strongly convex on R++. The function f(x) := (1/2)‖x‖² is strongly convex with the convexity parameter µf = 1. If f is a convex function on Rp and µf > 0, then f(·) + (µf/2)‖·‖² is strongly convex with the convexity parameter µf. We also say, for short, that f is µf-strongly convex when f is strongly convex with the strong convexity parameter µf > 0.
1.2.5 Subgradients and subdifferential
Given a convex function f, a vector w ∈ Rp is said to be a subgradient of f at x̄ ∈ dom(f) if
f(x) ≥ f(x̄) + 〈w, x − x̄〉, ∀x ∈ dom(f). (7)
We denote by ∂f(x̄) := {w ∈ Rp | f(x) ≥ f(x̄) + 〈w, x − x̄〉, ∀x ∈ dom(f)} the subdifferential of f at x̄. A function f ∈ Γ0(Rp) is said to be subdifferentiable at x ∈ dom(f) if ∂f(x) is nonempty. If f is convex and continuous on dom(f), then it is subdifferentiable at any point x ∈ ri(dom(f)). If f is differentiable at x ∈ dom(f), then ∂f(x) = {∇f(x)}, where ∇f(x) is the gradient of f at x.
Here are some basic properties:
• If ∂f(x) is nonempty for every x ∈ dom(f), then f is convex.
• If f ∈ Γ0(Rp), then ∂f(x) is nonempty and bounded for any x ∈ int (dom (f)).
• f(x⋆) = min_{x∈Rp} f(x) if and only if 0 ∈ ∂f(x⋆) (Fermat's rule).
Example 12. The function f(x) = |x| on R is subdifferentiable at x = 0, and ∂f(0) = [−1, 1], the whole interval [−1, 1]. If x ≠ 0, then f is differentiable at x with ∂f(x) = {sign(x)}.
Figure 7: Nonsmooth function (left), smooth function (middle), support of epi-graph (right)
Evaluation of subgradients: For simple functions, computing a subgradient can be done explicitly. However,
if the function is complicated, this task is not easy. We review some basic rules in order to compute a subgradient
of a convex function.
• If f ∈ Γ0(Rp), then ∂f(x) is closed and convex, and it is nonempty and compact if x ∈ ri(dom(f)).
• If g(x) = f(Ax + b), then ∂g(x) = A>∂f(Ax + b).
• If f = αg + βh with α, β ≥ 0, then ∂f(x) = α∂g(x) + β∂h(x) for any x ∈ int(dom(f)) = int(dom(g)) ∩ int(dom(h)).
• If f(x) = max_{1≤i≤m} fi(x), then ∂f(x) = conv(⋃{∂fi(x) | f(x) = fi(x)}).
• Let {fω}_{ω∈Ω} ⊂ Γ0(Rp) be a family of convex functions indexed by ω ∈ Ω. If f(x) = sup_{ω∈Ω} fω(x), then conv(⋃_{ω∈I(x)} ∂fω(x)) ⊆ ∂f(x), where I(x) := {ω ∈ Ω | fω(x) = f(x)}. If Ω is compact and fω is continuous w.r.t. ω, then equality holds.
Example 13. Let f(x) = max{a_i>x + bi | 1 ≤ i ≤ m}, and denote I(x) := {i ∈ {1, 2, · · · ,m} | f(x) = a_i>x + bi}. Then, ∂f(x) = conv({ai | i ∈ I(x)}).
Example 14. Consider f(X) := λmax(X), the maximum eigenvalue of a symmetric matrix X. We can write λmax(X) = max_{‖u‖2≤1} u>Xu. Since fu(X) := u>Xu is linear in X, we have ∂fu(X) = {uu>}. In particular, taking u(X) to be a unit eigenvector of X corresponding to the eigenvalue λmax(X), we obtain u(X)u(X)> ∈ ∂f(X).
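Example 14 can be verified directly against the subgradient inequality (7): for G = uu> with u a top unit eigenvector of X, we should have λmax(Y) ≥ λmax(X) + 〈G, Y − X〉 for all symmetric Y. A quick numeric check (not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((5, 5)); X = 0.5 * (X + X.T)  # random symmetric X

lam, V = np.linalg.eigh(X)
u = V[:, -1]            # unit eigenvector for lambda_max (eigh sorts ascending)
G = np.outer(u, u)      # candidate subgradient u u^T

# Subgradient inequality at random symmetric test points Y:
# lambda_max(Y) >= lambda_max(X) + <G, Y - X>.
for _ in range(100):
    Y = rng.standard_normal((5, 5)); Y = 0.5 * (Y + Y.T)
    lhs = np.linalg.eigvalsh(Y).max()
    rhs = lam[-1] + np.sum(G * (Y - X))
    assert lhs >= rhs - 1e-10
```

The inequality holds because λmax(Y) ≥ u>Yu = u>Xu + u>(Y − X)u = λmax(X) + 〈uu>, Y − X〉.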
1.3 Smooth convex functions
We assume that f is smooth on X, meaning that f has continuous first-order (or up to second-order) derivatives on X. We denote by ∇f and ∇²f its gradient (a vector) and Hessian (a matrix), respectively.
For a smooth function f on X , it is convex on X iff
(a) f(y) ≥ f(x) + 〈∇f(x),y − x〉 for all x,y ∈ X , or
(b) 〈∇f(x)−∇f(y),x− y〉 ≥ 0 for all x,y ∈ X .
If, in addition, f is second-order smooth (i.e., twice continuously differentiable) on X, then f is convex on X iff ∇²f(x) ⪰ 0 for all x ∈ X.
Proof: (a) If f is convex, then f((1−α)x + αy) ≤ (1−α)f(x) + αf(y), i.e., f(x + α(y − x)) − f(x) ≤ α(f(y) − f(x)). Hence, [f(x + α(y − x)) − f(x)]/α ≤ f(y) − f(x). Taking the limit as α → 0⁺, we get 〈∇f(x), y − x〉 = lim_{α→0⁺} [f(x + α(y − x)) − f(x)]/α ≤ f(y) − f(x). Conversely, let zα = (1−α)x + αy. We have f(y) ≥ f(zα) + 〈∇f(zα), y − zα〉 and f(x) ≥ f(zα) + 〈∇f(zα), x − zα〉. Multiplying the first inequality by α and the second one by 1−α, then summing up, we obtain (1−α)f(x) + αf(y) ≥ f(zα). Here, we use the fact that (1−α)(x − zα) + α(y − zα) = 0.
(b) Adding (a) to itself with the roles of x and y exchanged, we get (b). Conversely, let zt = x + t(y − x) for t ∈ [0, 1], so that zt − x = t(y − x). By the fundamental theorem of calculus, we have f(y) − f(x) − 〈∇f(x), y − x〉 = ∫₀¹ 〈∇f(x + t(y − x)) − ∇f(x), y − x〉 dt = ∫₀¹ (1/t)〈∇f(zt) − ∇f(x), zt − x〉 dt ≥ 0, due to (b). Hence, we get (a).
1.3.1 Lipschitz gradient convex functions
We say that ∇f is Lipschitz continuous on X with the Lipschitz constant Lf ∈ [0, +∞) if
‖∇f(x) − ∇f(y)‖∗ ≤ Lf‖x − y‖, ∀x, y ∈ X, (8)
where ‖·‖∗ is the dual norm of ‖·‖. We then say that f has a Lipschitz gradient ∇f (or has an Lf-Lipschitz gradient, or is Lf-smooth). We denote by F^{1,1}_L(X) the class of convex functions f whose gradient ∇f is Lipschitz continuous with the Lipschitz constant Lf ∈ [0, +∞).
If f ∈ F^{1,1}_{Lf}(X) and g ∈ F^{1,1}_{Lg}(X), then h = f + g has a Lipschitz gradient with Lh = Lf + Lg, and h = cf has a Lipschitz gradient with Lh = |c|Lf for any c ∈ R. If f ∈ F^{1,1}_{Lf}(X) and g ∈ F^{1,1}_{Lg}(f(X)), then the composition h = g ∘ f has a Lipschitz gradient with Lh = LfLg.
Figure 8: A Lipschitz gradient convex function
Example 15. Here are a few examples.
1. The function f(x) := (1/2)‖Ax − b‖²2 has gradient ∇f(x) = A>(Ax − b) on Rp, which is Lipschitz continuous on Rp with the Lipschitz constant Lf := ‖A‖² (see the operator or matrix norm defined above).
2. The function f(x) := log(1 + e^{a>x+b}) has gradient ∇f(x) = (e^{a>x+b}/(1 + e^{a>x+b}))a, which is Lipschitz continuous on Rp with the Lipschitz constant Lf := ‖a‖²2/4.
3. The gradient of the function f(x) := −log(a>x − b) is NOT Lipschitz continuous on its domain dom(f) := {x ∈ Rp | a>x − b > 0}.
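The constants in items 1 and 2 can be probed numerically: inequality (8) should hold at random pairs of points with the stated Lf. A quick check (not from the lecture), using random data A, b, a, c:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((10, 4)); b = rng.standard_normal(10)
a = rng.standard_normal(4); c = 0.3

grad_ls = lambda x: A.T @ (A @ x - b)        # gradient of (1/2)||Ax - b||^2
L_ls = np.linalg.norm(A, 2) ** 2             # ||A||^2 (spectral norm squared)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
grad_log = lambda x: sigmoid(a @ x + c) * a  # gradient of log(1 + e^{a^T x + c})
L_log = np.linalg.norm(a) ** 2 / 4           # ||a||^2 / 4

for _ in range(200):
    x, y = rng.standard_normal(4), rng.standard_normal(4)
    d = np.linalg.norm(x - y)
    assert np.linalg.norm(grad_ls(x) - grad_ls(y)) <= L_ls * d + 1e-9
    assert np.linalg.norm(grad_log(x) - grad_log(y)) <= L_log * d + 1e-9
```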
From the Lipschitz gradient assumption, we can derive the following bounds [10]:
Theorem 1.2. 1. Bound on the gradient (co-coercivity) [the Baillon–Haddad theorem]: For any x, y ∈ dom(f), we have
〈∇f(x) − ∇f(y), x − y〉 ≥ (1/Lf)‖∇f(x) − ∇f(y)‖²∗. (9)
2. Bound on the function values: For any x, y ∈ dom(f), we have
(1/(2Lf))‖∇f(x) − ∇f(y)‖²∗ ≤ f(y) − f(x) − 〈∇f(x), y − x〉 ≤ (Lf/2)‖y − x‖². (10)
Inequality (9) strengthens the monotonicity of ∇f. As we will see, these inequalities are very useful for developing first-order algorithms in the sequel.
Proof: (1) Fix x ∈ dom(f), and consider the function ϕ(y) := f(y) − f(x) − ∇f(x)>(y − x), which is smooth and convex. Moreover, ∇ϕ(·) = ∇f(·) − ∇f(x) is also Lipschitz continuous with the same Lipschitz constant Lf. For any y ∈ dom(f), by convexity of f, we have ϕ(y) ≥ ϕ(x) = 0. Hence, x is an optimal solution of min ϕ. Using the Lipschitz continuity of ∇ϕ, we have 0 = ϕ(x) ≤ ϕ(y − (1/Lf)∇ϕ(y)) ≤ ϕ(y) − (1/(2Lf))‖∇ϕ(y)‖²∗. Using the definition of ϕ, we can write this as f(x) + 〈∇f(x), y − x〉 + (1/(2Lf))‖∇f(y) − ∇f(x)‖²∗ ≤ f(y). Exchanging the roles of x and y in this inequality, and summing it with the resulting one, we obtain (9).
(2) By the fundamental theorem of calculus, we have f(y) − f(x) = ∫₀¹ ∇f(x + τ(y − x))>(y − x) dτ. Hence, |f(y) − f(x) − ∇f(x)>(y − x)| = |∫₀¹ [∇f(x + τ(y − x)) − ∇f(x)]>(y − x) dτ| ≤ ∫₀¹ ‖∇f(x + τ(y − x)) − ∇f(x)‖∗ ‖y − x‖ dτ ≤ Lf‖y − x‖² ∫₀¹ τ dτ = (Lf/2)‖y − x‖². This implies the right-hand side of (10).
To prove the left-hand side of (10), we consider the function φ(y) := f(y) − ∇f(x)>(y − x) for a fixed x ∈ Rp. We have ∇φ(y) = ∇f(y) − ∇f(x), which gives ∇φ(x) = 0. Hence, y = x is the solution of min_y φ(y). On the other hand, since ∇φ is also Lipschitz continuous with the same Lipschitz constant Lf, we have φ(y − L_f^{-1}∇φ(y)) ≤ φ(y) + ∇φ(y)>(y − L_f^{-1}∇φ(y) − y) + (Lf/2)‖y − L_f^{-1}∇φ(y) − y‖² ≤ φ(y) − (1/(2Lf))‖∇φ(y)‖²∗. Using these arguments, we can derive
φ(x) ≤ φ(y − L_f^{-1}∇φ(y)) ≤ φ(y) − (1/(2Lf))‖∇φ(y)‖²∗.
Using φ(y) = f(y) − ∇f(x)>(y − x) in the above inequality, we get
f(x) ≤ f(y) − ∇f(x)>(y − x) − (1/(2Lf))‖∇f(y) − ∇f(x)‖²∗,
which is the left-hand side of (10).
If we further assume that f is twice continuously differentiable with an Lf-Lipschitz gradient and is µf-strongly convex, then it is necessary and sufficient that
µfI ⪯ ∇²f(x) ⪯ LfI, ∀x ∈ Rp,
where I is the identity matrix. As a standard example, consider a quadratic function f(x) := (1/2)〈Qx, x〉 + q>x, where Q is symmetric positive definite. We have ∇²f(x) = Q. Hence, f is strongly convex with µf = λmin(Q) > 0 and has a Lipschitz continuous gradient with Lf := λmax(Q).
1.3.2 Strongly convex and smooth functions
For strongly convex and smooth functions, we have the following properties (We skip the proof, the reader can
find it in [10]):
1. Strong monotonicity: 〈∇f(x)−∇f(y),x− y〉 ≥ µf ‖x− y‖2 for x,y ∈ Rp.
2. Bound on function values: f(y) ≥ f(x) + 〈∇f(x),y − x〉+µf
2 ‖y − x‖2 for x,y ∈ Rp.
3. Bound on function values with gradients: f(y) ≤ f(x) + 〈∇f(x),y − x〉+ 12µf‖∇f(y)−∇f(x)‖2 for x,y ∈
Rp.
4. Co-coercive property: 〈∇f(y)−∇f(x),y − x〉 ≤ 1µf‖∇f(y)−∇f(x)‖2 for x,y ∈ Rp.
When the function f is both Lf -Lipschitz continuous and µf -strongly convex, we have the following property:
〈∇f(x)−∇f(y),x− y〉 ≥ µfLfLf + µf
‖x− y‖2 +1
Lf + µf‖∇f(x)−∇f(y)‖2 .
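For a positive definite quadratic, these inequalities can be verified numerically with µf = λmin(Q) and Lf = λmax(Q). A quick check (not from the lecture), using a random positive definite Q:

```python
import numpy as np

rng = np.random.default_rng(5)
M = rng.standard_normal((4, 4))
Q = M @ M.T + 0.5 * np.eye(4)          # symmetric positive definite
grad = lambda x: Q @ x                  # gradient of f(x) = (1/2) x^T Q x
mu, L = np.linalg.eigvalsh(Q).min(), np.linalg.eigvalsh(Q).max()

for _ in range(100):
    x, y = rng.standard_normal(4), rng.standard_normal(4)
    g = grad(x) - grad(y); d = x - y
    # Strong monotonicity (property 1).
    assert g @ d >= mu * (d @ d) - 1e-9
    # The combined strongly-convex + Lipschitz-gradient inequality.
    assert g @ d >= (mu * L / (mu + L)) * (d @ d) \
                    + (1.0 / (mu + L)) * (g @ g) - 1e-9
```

For the quadratic case, the combined inequality reduces, in the eigenbasis of Q, to (λ − µf)(Lf − λ) ≥ 0 for each eigenvalue λ ∈ [µf, Lf].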
11/29
STOR @ UNC Numerical optimization: Theory, algorithms, and applications Quoc Tran-Dinh
To prove this property, we consider φ(x) := f(x) − (µf/2)‖x‖² and apply (9) to φ.
Let f be strongly convex and smooth with the strong convexity parameter µf > 0. We consider the following
unconstrained convex problem:
f⋆ := min_{x∈Rp} f(x). (11)
The optimality condition (i.e., Fermat's rule) is ∇f(x⋆) = 0, which is necessary and sufficient for x⋆ to be the unique optimal solution of this problem. Moreover, we have the following property:
f(x) ≥ f(x⋆) + (µf/2)‖x − x⋆‖², (12)
for all x ∈ dom(f). Note that, by replacing the gradient ∇f with subgradients in the following proof, the existence and uniqueness of x⋆ and the inequality (12) still hold for (11) even when f is nonsmooth.
Proof: We note that f(x) ≥ f(x⋆) + 〈∇f(x⋆), x − x⋆〉 + (µf/2)‖x − x⋆‖² = f⋆ + (µf/2)‖x − x⋆‖² for all x ∈ dom(f). If x̂⋆ is another solution of the above problem, then f(x̂⋆) = f⋆. Using the above inequality, we have f⋆ = f(x̂⋆) ≥ f⋆ + (µf/2)‖x̂⋆ − x⋆‖². Hence, ‖x̂⋆ − x⋆‖² ≤ 0, so x̂⋆ = x⋆. We conclude that the problem has a unique solution.
To prove the existence of x⋆, we take any α ∈ R and consider Lα(f) := {x ∈ dom(f) | f(x) ≤ α}, the sublevel set of f. Since f is closed, Lα(f) is closed. Now, we show that it is bounded. Assume by contradiction that Lα(f) is unbounded. Then there exists {xk} ⊂ Lα(f) such that ‖xk‖ → +∞. This also implies that ‖xk − (x0 − (1/µf)∇f(x0))‖² → ∞. By strong convexity of f, we have f(xk) ≥ f(x0) + 〈∇f(x0), xk − x0〉 + (µf/2)‖xk − x0‖²2 = f(x0) + (µf/2)‖xk − (x0 − (1/µf)∇f(x0))‖²2 − (1/(2µf))‖∇f(x0)‖² → ∞. This contradicts the fact that f(xk) ≤ α. Hence, Lα(f) is bounded. By Weierstrass's theorem, the problem min_{x∈Rp} f(x) has a solution x⋆.
The inequality (12) follows from Property 2 above and the fact that ∇f(x⋆) = 0.
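A strongly convex quadratic makes both claims easy to check numerically: gradient descent with step 1/Lf converges to the unique x⋆ = −Q⁻¹q, and inequality (12) holds with µf = λmin(Q). A quick check (not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(6)
M = rng.standard_normal((3, 3))
Q = M @ M.T + np.eye(3)                      # mu_f = lambda_min(Q) >= 1
q = rng.standard_normal(3)
f = lambda x: 0.5 * x @ Q @ x + q @ x
mu = np.linalg.eigvalsh(Q).min()
L = np.linalg.eigvalsh(Q).max()

# Gradient descent converges to the unique minimizer x* = -Q^{-1} q.
x = np.zeros(3)
for _ in range(2000):
    x = x - (1.0 / L) * (Q @ x + q)
x_star = -np.linalg.solve(Q, q)
assert np.linalg.norm(x - x_star) < 1e-8

# Inequality (12): f(z) >= f(x*) + (mu/2) ||z - x*||^2 at random points.
for _ in range(100):
    z = rng.standard_normal(3)
    assert f(z) >= f(x_star) + 0.5 * mu * np.linalg.norm(z - x_star) ** 2 - 1e-9
```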
2 Fenchel conjugates
Let f ∈ Γ0(Rp) be a proper, closed, and convex function. The Fenchel conjugate of f is defined as
f∗(y) := sup_{x∈dom(f)} {〈x, y〉 − f(x)}. (13)
Clearly, f∗ is convex even if f is nonconvex. Moreover, f∗∗∗ = f∗. This concept is illustrated in Figure 9 for both convex and nonconvex functions.
Figure 9: The conjugate of a nonconvex function (left) and of a convex function (right). Here, f∗(y) is the maximum gap between the linear function x>y (red line) and f(x), shown by the dashed line.
It can be proven that if f ∈ Γ0(Rp) (i.e., f is proper, closed, and convex), then f∗∗ = f and f∗ ∈ Γ0(Rp). One direction is easy:
f∗∗(x) = sup_y {〈x, y〉 − f∗(y)} = sup_y {〈x, y〉 − sup_v {〈y, v〉 − f(v)}} = sup_y inf_v {〈y, x − v〉 + f(v)} ≤ f(x),
where the last inequality follows by taking v = x in the infimum.
It is also clear that
f(x) + f∗(y) ≥ 〈x, y〉, ∀x, y, and y ∈ ∂f(x) ⇔ x ∈ ∂f∗(y). (14)
From this definition, if we take f(u) := δU(u), the indicator of a nonempty, closed, and convex set U ⊂ Rp, then f∗(x) = sup_{u∈U} 〈x, u〉, which is the support function of U. Hence, the conjugate of the indicator function of a convex set is its support function. If f(u) = ‖u‖, then
f∗(x) = sup_u {〈x, u〉 − ‖u‖} = 0 if ‖x‖∗ ≤ 1, and +∞ otherwise,
i.e., the indicator of the dual-norm unit ball B‖·‖∗(0, 1). To prove this, we use 〈x, u〉 ≤ ‖u‖‖x‖∗.
Here are some well-known examples. More examples can be found in [1].
1. If f(x) = −log(x) for x ∈ R++, then f∗(y) = −log(−y) − 1 for y ∈ −R++.
2. If f(X) = −log det(X) for X ∈ Sp++, then f∗(Y) = −log det(−Y) − p for Y ∈ −Sp++.
3. If f(x) = log(1 + e^x) for x ∈ R, then f∗(y) = y ln(y) + (1 − y) ln(1 − y) for 0 ≤ y ≤ 1, which is the (negative) entropy function (here, we use the convention that 0 ln(0) = 0).
For a nonconvex function f, its conjugate f∗ is still convex, since f∗(x) := sup_u F(x, u), where F(x, u) := 〈x, u〉 − f(u) is linear (and therefore convex) in x for each u (applying the marginal supremum property above). Two well-known examples:
1. Consider the cardinality function f(x) := card(x) = ‖x‖0, the number of nonzero elements of x. Without loss of generality, we consider x ∈ [−1, 1]p. The conjugate of this function is
f∗(u) := sup_{x∈[−1,1]p} {〈u, x〉 − card(x)}.
Proof: Let σ be a permutation of {1, 2, · · · , p} such that |u_{σ1}| ≥ |u_{σ2}| ≥ · · · ≥ |u_{σp}| (i.e., we sort the absolute values of the coordinates of u in descending order). Then, we can write f∗ as
f∗(u) = max_{r∈{0,1,··· ,p}} { sup_{x_{σi}∈[−1,1]} ∑_{i=1}^r u_{σi}x_{σi} − r }.
For a fixed r, the inner supremum is attained at x_{σi} = sign(u_{σi}), giving ∑_{i=1}^r |u_{σi}| − r. Hence, we can write
f∗(u) = max_{r∈{0,1,··· ,p}} ∑_{i=1}^r (|u_{σi}| − 1) = ∑_{i=1}^p max{0, |ui| − 1}.
Now, we take the conjugate of f∗:
f∗∗(x) := sup_{u∈Rp} {〈x, u〉 − f∗(u)} = sup_{u∈Rp} {〈x, u〉 − ∑_{i=1}^p max{0, |ui| − 1}} = ∑_{i=1}^p sup_{ui} {xiui − max{0, |ui| − 1}}.
It is not hard to show that sup_{ui} {xiui − max{0, |ui| − 1}} = |xi| when xi ∈ [−1, 1], by solving a one-variable maximization problem. Hence, f∗∗(x) = ∑_{i=1}^p |xi| = ‖x‖1 on [−1, 1]p. Indeed, we consider the following cases:
• If we restrict |ui| ≤ 1, then sup_{|ui|≤1} xiui = |xi|, attained at ui = sign(xi).
• If ui > 1, the objective is xiui − ui + 1 = (xi − 1)ui + 1 ≤ 1 for xi ≤ 1, with value 1 = |xi| at xi = 1; if xi > 1, it grows to +∞ as ui → +∞.
• If ui < −1, symmetrically, the objective is xiui + ui + 1 = (xi + 1)ui + 1 ≤ 1 for xi ≥ −1, with value 1 = |xi| at xi = −1; if xi < −1, it grows to +∞ as ui → −∞.
This allows us to conclude that sup_{ui} {xiui − max{0, |ui| − 1}} = |xi| for xi ∈ [−1, 1].
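The closed form f∗(u) = ∑i max{0, |ui| − 1} can be verified by brute force: since the objective is linear in x for each fixed support, the supremum over [−1, 1]^p is attained at points with coordinates in {−1, 0, +1}. A quick check (not from the lecture):

```python
import itertools
import numpy as np

def conj_card_bruteforce(u):
    """sup over x in [-1,1]^p of <u,x> - card(x): the sup is attained at
    x with coordinates in {-1, 0, +1}, so enumerate those points."""
    best = -np.inf
    for x in itertools.product([-1.0, 0.0, 1.0], repeat=len(u)):
        x = np.array(x)
        best = max(best, u @ x - np.count_nonzero(x))
    return best

def conj_card_formula(u):
    """Closed form derived above."""
    return np.sum(np.maximum(0.0, np.abs(u) - 1.0))

rng = np.random.default_rng(7)
for _ in range(20):
    u = 3.0 * rng.standard_normal(4)
    assert abs(conj_card_bruteforce(u) - conj_card_formula(u)) < 1e-12
```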
2. What about the rank function f(X) := rank(X)? Its convex envelope (on the spectral-norm unit ball) is the nuclear norm, f∗∗(X) = ‖X‖∗. By the SVD, we have rank(X) = ‖σ(X)‖₀, where σ(X) := (σ₁(X), · · · , σ_r(X)) with r = min{m, n} is the vector of singular values of X. Hence, we can write the function in X as a function in σ as

   f(X) = ‖σ‖₀ with X = U diag(σ) Vᵀ, UᵀU = I, VᵀV = I.

We note that the nuclear norm is ‖X‖∗ := Σ_{i=1}^r σ_i(X) = ‖σ(X)‖₁.
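As a quick numerical sanity check (not part of the original notes), the closed form f∗(u) = Σ_i max{0, |u_i| − 1} derived in the first example can be compared against a brute-force evaluation of the supremum on a grid. The grid contains 0 and ±1, where the per-coordinate optimum is attained, so the two values agree up to rounding. The function names below are hypothetical; a minimal sketch:

```python
import numpy as np

def card_conjugate(u):
    # Closed form derived above: f*(u) = sum_i max(0, |u_i| - 1).
    return np.sum(np.maximum(0.0, np.abs(u) - 1.0))

def card_conjugate_bruteforce(u, grid=np.linspace(-1.0, 1.0, 41)):
    # Direct evaluation of sup_{x in [-1,1]^p} <u, x> - ||x||_0 on a grid.
    # The grid contains 0 and +-1, where the coordinatewise optimum lies.
    best = -np.inf
    for idx in np.ndindex(*([len(grid)] * len(u))):
        x = grid[list(idx)]
        best = max(best, u @ x - np.count_nonzero(x))
    return best

rng = np.random.default_rng(0)
u = rng.uniform(-2.0, 2.0, size=3)
assert abs(card_conjugate(u) - card_conjugate_bruteforce(u)) < 1e-8
```

The brute force is exponential in p (here p = 3), so this is only a check of the formula, not a practical method.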
One important property of Fenchel conjugate functions is the following result.

Theorem 2.1. If f is µ_f-strongly convex with µ_f > 0, then f∗ is smooth, and its gradient is Lipschitz continuous with the Lipschitz constant L_{f∗} = 1/µ_f. Conversely, if ∇f is Lipschitz continuous with the constant L_f > 0, then f∗ is strongly convex with the parameter µ_{f∗} = 1/L_f.
Proof: We consider f∗(x) = sup_u { 〈x, u〉 − f(u) } = −min_u { f(u) − 〈x, u〉 }. Since f is µ_f-strongly convex, f(·) − 〈x, ·〉 is also µ_f-strongly convex, so the minimization problem has a unique solution u∗(x) for any x. By the marginal theorem, f∗ is smooth, and its gradient is given by ∇f∗(x) = u∗(x).

Now, we show that ∇f∗ is Lipschitz continuous. Indeed, from the optimality condition of the minimization problem, we have x ∈ ∂f(u∗(x)). Hence, f(u∗(y)) ≥ f(u∗(x)) + 〈x, u∗(y) − u∗(x)〉 + (µ_f/2)‖u∗(y) − u∗(x)‖². Similarly, f(u∗(x)) ≥ f(u∗(y)) + 〈y, u∗(x) − u∗(y)〉 + (µ_f/2)‖u∗(y) − u∗(x)‖². Summing up both inequalities, we obtain µ_f ‖u∗(y) − u∗(x)‖² ≤ 〈u∗(y) − u∗(x), y − x〉. Hence, by the Cauchy-Schwarz inequality, ‖u∗(y) − u∗(x)‖ ≤ (1/µ_f)‖x − y‖. We conclude that ∇f∗ is Lipschitz continuous with the constant L_{f∗} = 1/µ_f.
To prove the second part, since ∇f is L_f-Lipschitz, ∇f is co-coercive as shown in (10), i.e.,

   〈∇f(u∗(x)) − ∇f(u∗(y)), u∗(x) − u∗(y)〉 ≥ (1/L_f)‖∇f(u∗(x)) − ∇f(u∗(y))‖².

Since f is smooth, from the optimality condition, we have x = ∇f(u∗(x)) and ∇f∗(x) = u∗(x). Using this in the above inequality, we get

   〈x − y, ∇f∗(x) − ∇f∗(y)〉 = 〈x − y, u∗(x) − u∗(y)〉 ≥ (1/L_f)‖x − y‖².

We conclude that f∗ is strongly convex with the parameter µ_{f∗} = 1/L_f.
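Theorem 2.1 can be checked numerically on the quadratic f(u) = (1/2)uᵀQu with Q ≻ 0, for which f∗(x) = (1/2)xᵀQ⁻¹x is available in closed form. This is only an illustrative sketch (the matrix size and random seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
Q = M @ M.T + 0.5 * np.eye(4)        # positive definite
mu_f = np.linalg.eigvalsh(Q)[0]      # strong convexity parameter of f(u) = u'Qu/2

# For f(u) = (1/2) u'Qu, the conjugate is f*(x) = (1/2) x'Q^{-1}x, so
# grad f*(x) = Q^{-1}x and its Lipschitz constant is 1/lambda_min(Q) = 1/mu_f.
Qinv = np.linalg.inv(Q)
L_fstar = np.linalg.eigvalsh(Qinv)[-1]
assert abs(L_fstar - 1.0 / mu_f) < 1e-8

# grad f*(x) = u*(x) = argmin_u { f(u) - <x, u> }, as in the marginal theorem:
x = rng.standard_normal(4)
u_star = np.linalg.solve(Q, x)       # stationarity: Qu = x
assert np.allclose(u_star, Qinv @ x)
```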
Another important convex function is the following max function:

   f(x) := max_{u ∈ U} { 〈x, Au〉 − ϕ(u) },    (15)

where U is a nonempty, closed, and convex set, and ϕ is a strongly convex function with the parameter µ_ϕ > 0. Then, by the same proof, we can show that f is smooth, and its gradient is Lipschitz continuous with the Lipschitz constant L_f := ‖A‖²/µ_ϕ. We leave this as an exercise for students.
3 Proximal operators
A key tool for handling the nonsmoothness of a convex function is the proximal operator. This notion is also at the core of many proximal-type algorithms in the sequel. We discuss it here.
3.1 Definition and basic properties
Given a proper, closed, and convex function f : R^p → R ∪ {+∞}, the proximal operator of f at x is defined as

   prox_f(x) := arg min_y { f(y) + (1/2)‖y − x‖² }.    (16)

First, for any x ∈ R^p, prox_f(x) is well-defined and unique. Why? Because the objective of this minimization problem, f(·) + (1/2)‖· − x‖², is strongly convex with the convexity parameter µ = 1. Here are some important properties of prox_f.
Lemma 3.1. Let f ∈ Γ₀(R^p). Then prox_f satisfies the following properties:

1. prox_f satisfies

   x − prox_f(x) ∈ ∂f(prox_f(x)),
   ‖prox_f(x) − prox_f(y)‖² ≤ 〈x − y, prox_f(x) − prox_f(y)〉, ∀x, y ∈ R^p.

2. prox_f is nonexpansive, i.e.,

   ‖prox_f(x) − prox_f(y)‖ ≤ ‖x − y‖, ∀x, y ∈ R^p.

3. (Fenchel-Moreau decomposition) If prox_{f∗} is the proximal operator of the conjugate function f∗ of f, then prox_f and prox_{f∗} satisfy

   prox_f(x) + prox_{f∗}(x) = x, ∀x ∈ R^p.
Proof: The first relation comes from the optimality condition of the strongly convex minimization problem (16). Next, from the first item, we have x − prox_f(x) ∈ ∂f(prox_f(x)) and y − prox_f(y) ∈ ∂f(prox_f(y)). Using the monotonicity of ∂f, we get 〈x − prox_f(x) − (y − prox_f(y)), prox_f(x) − prox_f(y)〉 ≥ 0. Rearranging this, we obtain the second estimate of the first item. The second item is a direct consequence of the first item and the Cauchy-Schwarz inequality.

We finally prove the third item. Consider f∗(x) = sup_u { 〈x, u〉 − f(u) }, the conjugate of f. The proximal operator prox_{f∗}(x) = v⋆(x) of f∗ at x is the solution of

   prox_{f∗}(x) = v⋆(x) = arg min_v { f∗(v) + (1/2)‖v − x‖² } ⇔ min_v max_u { 〈v, u〉 − f(u) + (1/2)‖v − x‖² }.

Writing the optimality condition of this min-max problem, we get 0 ∈ v⋆(x) − ∂f(u⋆(x)) and u⋆(x) + v⋆(x) − x = 0. The last relation implies v⋆(x) = x − u⋆(x). Substituting this into the first relation gives 0 ∈ u⋆(x) − x + ∂f(u⋆(x)); hence, u⋆(x) = prox_f(x). Using u⋆(x) = prox_f(x), v⋆(x) = prox_{f∗}(x), and u⋆(x) + v⋆(x) = x, we conclude that prox_f(x) + prox_{f∗}(x) = x.
Exercise. Prove that prox_{γf}(x) + γ prox_{γ⁻¹f∗}(γ⁻¹x) = x for any f ∈ Γ₀(R^p), γ > 0, and x ∈ R^p.
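A quick numerical check of this identity (not a proof) for the hypothetical instance f = ‖·‖₁: then prox_{γf} is soft-thresholding and f∗ is the indicator of the ℓ∞-unit ball, whose prox (at any scaling) is a clip:

```python
import numpy as np

def soft_threshold(x, gamma):
    # prox_{gamma * ||.||_1}(x), the soft-thresholding operator
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

def proj_linf_ball(x):
    # prox of f*(u) = indicator of {||u||_inf <= 1}, the conjugate of ||.||_1;
    # since f* is an indicator, prox_{f*/gamma} is this projection for any gamma.
    return np.clip(x, -1.0, 1.0)

rng = np.random.default_rng(2)
x = rng.standard_normal(6) * 3.0
gamma = 0.7

# Moreau decomposition: prox_{gamma f}(x) + gamma * prox_{f*/gamma}(x/gamma) = x
lhs = soft_threshold(x, gamma) + gamma * proj_linf_ball(x / gamma)
assert np.allclose(lhs, x)
```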
Projection: Clearly, if f is the indicator function of a nonempty, closed, and convex set X ⊆ R^p, i.e., f = δ_X, then δ_X ∈ Γ₀(R^p). Hence, we can define the proximal operator of δ_X using (16), which turns out to be

   prox_{δ_X}(x) := arg min_y { δ_X(y) + (1/2)‖y − x‖² } = arg min_{y ∈ X} ‖y − x‖ = π_X(x),    (17)

where π_X is the projection operator onto X.
3.2 Tractable proximity
Computing prox_f of a convex function f requires solving a strongly convex program: min_y { f(y) + (1/2)‖y − x‖² }. If f is smooth, this reduces to solving the nonlinear system ∇f(y) + y = x in the unknown y. In particular, when f is component-wise separable as f(x) = Σ_{i=1}^p f_i(x_i), where each f_i is convex, we can compute prox_f by solving p one-dimensional minimization problems min_{u_i} { f_i(u_i) + (1/2)|u_i − x_i|² } for i = 1, · · · , p. This can often be done in a closed form or efficiently.
Definition 5 (Tractable proximity). Given f ∈ Γ₀(R^p), we say that f is proximally tractable if prox_f defined by (16) can be computed efficiently.

By "efficiently", we mean that prox_f can be computed in a closed form (i.e., in an explicit analytical form) or in [low-order] polynomial time. We denote by F_prox(R^p) the class of proximally tractable convex functions.
Here are a few concrete examples; a non-exhaustive list is given in Table 1 and can also be found in [4, 12].
1. Separable functions: Many separable functions admit a closed-form prox-operator. For example, f(x) := ‖x‖₁ = Σ_{i=1}^p |x_i| has the prox-operator prox_{λf}(x) = sign(x) ⊗ max{|x| − λ, 0}, which is the well-known soft-thresholding operator used in image/signal processing. Another example is the group-sparsity penalty f(x) := Σ_{j=1}^g ‖x_{[j]}‖₂.
2. Smooth functions: For several smooth functions, we can obtain a closed-form proximal operator whenever we can solve the optimality condition λ∇f(z) + z − x = 0 explicitly. For example, with f(x) := (1/2)‖Ax − b‖₂², we have prox_{λf}(x) = (I + λAᵀA)⁻¹(λAᵀb + x).
3. The indicator of simple sets: For a given nonempty, closed, and convex set C, consider the projection π_C onto C. Then, the prox-operator of the indicator function f(x) := δ_C(x) of C is prox_{λf}(x) = π_C(x). When C is simple (e.g., a subspace, box, cone, or simplex), computing π_C is cheap and efficient.
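As an illustration of item 2, the closed-form prox of f(x) = (1/2)‖Ax − b‖₂² can be verified against its optimality condition λ∇f(y) + y − x = 0; the problem data below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)
x = rng.standard_normal(3)
lam = 0.4

# prox of f(x) = (1/2)||Ax - b||^2: solve (I + lam A'A) y = lam A'b + x
y = np.linalg.solve(np.eye(3) + lam * A.T @ A, lam * A.T @ b + x)

# Check the optimality condition lam * grad f(y) + (y - x) = 0
residual = lam * A.T @ (A @ y - b) + (y - x)
assert np.linalg.norm(residual) < 1e-10
```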
3.3 A short overview of the proximal-point algorithm
The proximal-point method is one of the fundamental methods in convex optimization. This method is often
combined with other schemes such as gradient descent or Newton methods to obtain implementable variants. Let
us briefly review it here.
Let g ∈ Γ₀(R^p) be proper, closed, and convex. We consider the following convex minimization problem:

   g⋆ := min_{x ∈ R^p} g(x).    (18)

Let us denote by S⋆ the solution set of this problem and assume that S⋆ is nonempty. The proximal-point algorithm for solving (18) can be described as in the scheme (PPA) below [6, 15]. The update x_{k+1} := prox_{λ_k g}(x_k) can be explicitly written as

   x_{k+1} := arg min_{y ∈ R^p} { g(y) + (1/(2λ_k))‖y − x_k‖² }.

Hence, computing x_{k+1} requires solving a strongly convex problem. Using the Fermat rule, we have 0 ∈ (x_{k+1} − x_k) + λ_k ∂g(x_{k+1}), which is called a monotone inclusion (or a generalized equation).
The following theorem proves global convergence of (PPA); its proof can be found in [6].

Theorem 3.2 (Convergence of PPA). Let {x_k}_{k≥0} be a sequence generated by PPA. If 0 < λ_k < +∞, then

   g(x_k) − g⋆ ≤ ‖x₀ − x⋆‖₂² / (2 Σ_{j=0}^{k} λ_j), ∀x⋆ ∈ S⋆, k ≥ 0.
Table 1: A non-exhaustive list of convex functions with tractable proximal operators

• ℓ₁-norm: f(x) := ‖x‖₁; prox_{λf}(x) = sign(x) ⊗ max{|x| − λ, 0}; complexity O(n).
• ℓ₂-norm: f(x) := ‖x‖₂; prox_{λf}(x) = (1 − λ/‖x‖₂)₊ x; complexity O(n).
• Support function: f(x) := max_{y ∈ C} xᵀy; prox_{λf}(x) = x − λπ_C(x/λ).
• Indicator of a box: f(x) := δ_{[a,b]}(x); prox_{λf}(x) = π_{[a,b]}(x); complexity O(n).
• Indicator of S^p₊: f(x) := δ_{S^p₊}(x); prox_{λf}(x) = U[Σ]₊Uᵀ, where x = UΣUᵀ is an eigendecomposition; complexity O(p³).
• Indicator of a hyperplane C := {x | aᵀx = b}: prox_{λf}(x) = π_C(x) = x + ((b − aᵀx)/‖a‖²) a; complexity O(n).
• Indicator of the simplex S := {x | x ≥ 0, 1ᵀx = 1}: prox_{λf}(x) = [x − ν1]₊ for some ν ∈ R; complexity O(n).
• Convex quadratic: f(x) := (1/2)xᵀQx + qᵀx; prox_{λf}(x) = (I + λQ)⁻¹(x − λq); complexity O(n) → O(n³).
• Squared ℓ₂-norm: f(x) := (1/2)‖x‖₂²; prox_{λf}(x) = (1/(1 + λ))x; complexity O(n).
• Negative log: f(x) := −log(x); prox_{λf}(x) = ((x² + 4λ)^{1/2} + x)/2; complexity O(1).

Here, [x]₊ := max{0, x}, δ_C is the indicator function of the convex set C, sign is the sign function, and S^p₊ is the set of symmetric positive semidefinite matrices of size p × p.
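For the simplex row of Table 1, the multiplier ν can be found by a standard sorting-based procedure; the sketch below is one common implementation (O(n log n) because of the sort, rather than the theoretically possible O(n)), with a feasibility check on the output:

```python
import numpy as np

def proj_simplex(x):
    # Projection onto {y : y >= 0, sum(y) = 1}: returns max(x - nu, 0)
    # with nu chosen so that the positive entries sum to one.
    u = np.sort(x)[::-1]                      # sort in descending order
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(x) + 1) > 0)[0][-1]
    nu = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(x - nu, 0.0)

y = proj_simplex(np.array([0.9, 0.5, -0.2]))
assert abs(y.sum() - 1.0) < 1e-12 and (y >= 0).all()
```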
Proximal-point algorithm (PPA)

1. Choose x₀ ∈ R^p and a positive sequence {λ_k}_{k≥0} ⊂ R₊₊.
2. For k = 0, 1, · · ·, update x_{k+1} := prox_{λ_k g}(x_k).
If λ_k ≥ λ > 0, then g(x_k) − g⋆ ≤ ‖x₀ − x⋆‖₂²/(2λ(k + 1)), which shows that the convergence rate of PPA is O(1/k).
PPA is one of the most common algorithms for solving nonsmooth convex optimization problems of the form (18). However, from a practical point of view, it is still a conceptual method, since every iteration requires computing the proximal operator prox_g of g at the current point x_k. Unfortunately, computing prox_g is almost as hard as solving the original problem (18) in general. Therefore, PPA alone is not very useful in practice. However, when PPA is combined with other techniques such as splitting or inexact variants, it becomes a powerful tool for solving many convex problems, as we will see later.
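To make the PPA scheme concrete, here is a toy run on g(x) = (1/2)‖x − c‖², whose prox is available in closed form; the run also checks the bound of Theorem 3.2 with a constant step size λ_k = λ (the data c and λ are arbitrary choices):

```python
import numpy as np

# PPA on g(x) = (1/2)||x - c||^2; its prox has the closed form
# prox_{lam g}(x) = (x + lam * c) / (1 + lam), and g* = 0 at x* = c.
c = np.array([1.0, -2.0, 0.5])
x = np.zeros(3)
lam = 1.0
gaps = []
for k in range(50):
    x = (x + lam * c) / (1.0 + lam)
    gaps.append(0.5 * np.sum((x - c) ** 2))   # g(x_k) - g*

# Theorem 3.2 with lam_j = lam: g(x_k) - g* <= ||x0 - x*||^2 / (2 lam (k+1))
x0_dist2 = np.sum(c ** 2)
for k, gap in enumerate(gaps):
    assert gap <= x0_dist2 / (2.0 * lam * (k + 1)) + 1e-12
```

For this strongly convex g the actual convergence is linear, so the O(1/k) bound is very loose here; the point is only that it holds.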
3.4 Distance generating functions and Bregman divergences
Given a closed, nonempty, and convex set X in R^p, and a norm ‖·‖ defined on R^p, we consider a continuous and convex function ω from X to R. We say that ω is a distance generating function if it satisfies the following properties:

• ω admits a selection ∇ω(x) of subgradients that is continuous in x ∈ X⁰. Here, X⁰ := dom(∂ω), the domain of ∂ω.

• ω is compatible with ‖·‖, i.e., ω is strongly convex with the parameter µ_ω > 0 w.r.t. the norm ‖·‖:

   ω(y) ≥ ω(x) + 〈∇ω(x), y − x〉 + (µ_ω/2)‖y − x‖², ∀x ∈ X⁰, y ∈ X, ∇ω(x) ∈ ∂ω(x).
Examples of such functions include ω(x) = (1/2)‖x‖₂², the standard squared ℓ₂-norm, and ω(x) = Σ_{j=1}^p x_j ln(x_j) + p, the entropy function defined on the standard simplex X ≡ Δ_p := { x ∈ R^p₊ | Σ_{j=1}^p x_j = 1 }. In Nesterov's smoothing context, we call ω a proximity function, or a prox-function.
We define

   x⋆_c := arg min_{x ∈ X} ω(x) and diam_ω(X) := max_{x ∈ X} ω(x),    (19)

the ω-center of X and the ω-diameter of X, respectively. Clearly, x⋆_c exists, and if X is bounded, then diam_ω(X) < +∞. Without loss of generality, we can always assume that ω(x⋆_c) = 0; otherwise, we consider ω(x) := ω(x) − ω(x⋆_c).
Distance generating functions are often used to smooth a nonsmooth convex function in Nesterov's smoothing methods, or in mirror descent methods [2, 9, 11]. For instance, let f be a convex function given by the max-structure (15). Then, we can smooth it as follows:

   f_γ(x) := max_{u ∈ U} { 〈x, Au〉 − ϕ(u) − γω(u) },    (20)

where γ > 0 is called a smoothness parameter. Clearly, we can show that

   f_γ(x) ≤ f(x) ≤ f_γ(x) + γ diam_ω(U), ∀x ∈ dom(f).

Hence, when γ is sufficiently small, f_γ approximates f. By Theorem 2.1, we can show that f_γ is smooth, and its gradient is Lipschitz continuous with the Lipschitz constant L_{f_γ} := ‖A‖²/(γµ_ω).
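As a one-dimensional illustration of this smoothing (a hypothetical instance, not from the notes): taking f(x) = |x| = max_{|u|≤1} xu with ω(u) = u²/2 and µ_ω = 1 yields the Huber function, and the sandwich bound holds with diam_ω(U) = 1/2:

```python
import numpy as np

def f_gamma(x, gamma):
    # Smoothed absolute value: max_{|u|<=1} { x*u - gamma*u^2/2 }.
    # The inner maximizer is the clipped unconstrained solution u = x/gamma.
    u = np.clip(x / gamma, -1.0, 1.0)
    return x * u - 0.5 * gamma * u ** 2

gamma = 0.2
xs = np.linspace(-3.0, 3.0, 601)
fg = f_gamma(xs, gamma)

# f_gamma(x) <= |x| <= f_gamma(x) + gamma * (1/2), since max_{|u|<=1} u^2/2 = 1/2
assert (fg <= np.abs(xs) + 1e-12).all()
assert (np.abs(xs) <= fg + 0.5 * gamma + 1e-12).all()
```

Explicitly, f_γ(x) = x²/(2γ) for |x| ≤ γ and |x| − γ/2 otherwise, which is the classical Huber function.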
Given a distance generating function ω defined on X, we consider the following bifunction from R^p × R^p to R:

   b_ω(x, y) := ω(x) − ω(y) − 〈∇ω(y), x − y〉.    (21)

This function is called the Bregman divergence between x and y (defined through ω); it is also called a Bregman distance. Here are some basic properties of b_ω:

• Since ω is strongly convex, we have b_ω(x, y) ≥ (µ_ω/2)‖x − y‖². Hence, b_ω(x, y) ≥ 0 for all x, y, and b_ω(x, x) = 0.

• In general, b_ω is not a distance in the standard metric sense: it may not be symmetric and may not satisfy the triangle inequality.
• bω(·,y) is strongly convex with the same parameter µω.
We will use Bregman distances to design mirror descent algorithms in the next lectures. We list here some important and widely used Bregman divergences induced by the corresponding distance generating functions:

1. Euclidean distance: If ω(x) := (1/2)‖x‖₂², then b_ω(x, y) = (1/2)‖x − y‖₂².

2. Weighted Euclidean distance: If ω(x) := (1/2)xᵀQx for some Q ∈ S^p₊₊, then b_ω(x, y) = (1/2)(x − y)ᵀQ(x − y).

3. Kullback-Leibler divergence: If ω(x) := Σ_{i=1}^p x_i(log(x_i) − 1), the entropy function, then b_ω(x, y) = Σ_{i=1}^p [ x_i log(x_i/y_i) − x_i + y_i ] is the Kullback-Leibler divergence.

4. Itakura-Saito divergence: If ω(x) := −Σ_{i=1}^p log(x_i), then b_ω(x, y) = Σ_{i=1}^p ( x_i/y_i − log(x_i/y_i) − 1 ) is the Itakura-Saito divergence.
We note that only the divergences in items 1 and 2 are distances. The last two divergences do not satisfy the symmetry axiom of distances, i.e., b_ω(x, y) ≠ b_ω(y, x) for x ≠ y in general.
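The closed forms in items 3 and 4 can be checked directly against the definition (21); the sketch below does this for the entropy case and also exhibits the asymmetry (the probability vectors are arbitrary choices):

```python
import numpy as np

def bregman(omega, grad_omega, x, y):
    # b_omega(x, y) = omega(x) - omega(y) - <grad omega(y), x - y>
    return omega(x) - omega(y) - grad_omega(y) @ (x - y)

# Entropy omega(x) = sum x_i (log x_i - 1), with grad omega(x) = log(x)
omega_ent = lambda x: np.sum(x * (np.log(x) - 1.0))
grad_ent = lambda x: np.log(x)

x = np.array([0.2, 0.5, 0.3])
y = np.array([0.4, 0.4, 0.2])

# Matches the Kullback-Leibler closed form from item 3
kl = np.sum(x * np.log(x / y) - x + y)
assert abs(bregman(omega_ent, grad_ent, x, y) - kl) < 1e-12

# Bregman divergences are generally asymmetric:
assert abs(bregman(omega_ent, grad_ent, x, y)
           - bregman(omega_ent, grad_ent, y, x)) > 1e-6
```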
4 Maximally monotone operators: A brief overview
In order to view splitting methods from a maximally monotone operator perspective, we briefly recall some concepts
and properties of such mappings. For more details on this topic, we refer the reader to many monographs including
[1, 5, 13, 14].
4.1 Maximally monotone operators
We consider a multivalued (or set-valued) mapping T : R^p → 2^{R^p}, which maps each x ∈ R^p to a set T(x) in the range space 2^{R^p}, the collection of all subsets of R^p. More generally, we can define a multivalued mapping on a given nonempty, closed, and convex set X instead of R^p. Similar to single-valued mappings, we define dom(T) := { x ∈ R^p | T(x) ≠ ∅ }, the domain of T, and gra(T) := { (x, y) ∈ dom(T) × R^p | y ∈ T(x) }, the graph of T.
4.1.1 Definitions
Definition 6 (Maximal monotonicity). The multivalued mapping T : R^p → 2^{R^p} is said to be monotone (on its domain) if, for any x, y ∈ dom(T),

   〈u − v, x − y〉 ≥ 0, ∀u ∈ T(x), v ∈ T(y).

The mapping T is said to be strongly monotone with the monotonicity parameter µ_T > 0 if, for any x, y ∈ dom(T),

   〈u − v, x − y〉 ≥ µ_T ‖x − y‖₂², ∀u ∈ T(x), v ∈ T(y).

The mapping T is said to be maximal if there exists no other monotone mapping T̃ such that gra(T̃) properly contains gra(T).
We note that if T is single-valued, then T is monotone if 〈T (x)− T (y),x− y〉 ≥ 0 for all x,y ∈ dom (T ), and
T is strongly monotone with the parameter µT > 0 if 〈T (x)− T (y),x− y〉 ≥ µT ‖x− y‖22 for all x,y ∈ dom (T ).
Note that a monotone mapping is not the same as a decreasing function; in one dimension, monotone mappings correspond to nondecreasing functions. Figure 10 shows that the first function is monotone, while the second and the third ones are non-monotone.

Figure 10: Monotone function (left) and non-monotone functions (middle and right)

The maximality concept is not obvious. Figure 11
provides an example to illustrate this concept. This example is given as follows:
(Maximal mapping) T(x) = { 1 if x > 0;  [−1, 1] if x = 0;  −1 if x < 0 },

(Non-maximal mapping) T̂(x) = { 1 if x > 0;  {1/2} if x = 0;  −1 if x < 0 }.

Here T = ∂|·| is maximal, while gra(T̂) is strictly contained in gra(T), so T̂ is monotone but not maximal.
The most well-known maximally monotone operator is perhaps the subdifferential of a proper, closed, and convex function f. In this case, the mapping T := ∂f is maximally monotone, and it is strongly monotone with µ_T = µ_f if f is µ_f-strongly convex.
Figure 11: Maximally monotone operators
Clearly, if T₁ and T₂ are maximally monotone, then, for any α, β ≥ 0, αT₁ + βT₂ is also maximally monotone. The identity mapping I is strongly maximally monotone with the parameter µ_I = 1. Given a maximally monotone operator T, we define T⁻¹(u) := { x ∈ dom(T) | u ∈ T(x) }, the inverse mapping of T at a given u.
Another example of a maximally monotone operator is the normal cone N_X(·) of a nonempty, closed, and convex set X in R^p, which is defined as follows:

   N_X(x) := { u ∈ R^p | 〈u, x − y〉 ≥ 0, ∀y ∈ X } if x ∈ X, and N_X(x) := ∅ otherwise.    (22)
Figure 12 illustrates the monotonicity of NX (·) of a convex set X .
Figure 12: Normal cone and its maximal monotonicity
4.1.2 The resolvent of a maximally monotone operator
Definition 7 (Resolvent). Given a maximally monotone operator T : R^p → 2^{R^p}, we consider the inverse mapping J_T(·) := (I + T)⁻¹(·). This mapping is called the resolvent of T. The mapping R_T(·) := 2J_T(·) − I(·) is called the reflection of T.

From this definition, we have J_T(u) := { x ∈ dom(T) | u ∈ x + T(x) }. The well-definedness of J_T(·) is guaranteed by the maximal monotonicity of T (see, e.g., [1]). We skip the proof.
The following proposition shows that J_T is single-valued and nonexpansive; the nonexpansiveness of R_T is derived afterwards.

Proposition 4.1. The resolvent J_T := (I + T)⁻¹ of a maximally monotone operator T is single-valued and co-coercive, i.e.,

   ‖J_T(x) − J_T(y)‖² ≤ 〈J_T(x) − J_T(y), x − y〉, ∀x, y.    (23)

Consequently, J_T is nonexpansive, i.e., ‖J_T(x) − J_T(y)‖ ≤ ‖x − y‖ for all x, y.
Proof. Assume that J_T is not single-valued. Then, there exist u, v ∈ J_T(x) with u ≠ v for some x. By the definition of J_T, we have x ∈ u + T(u) and x ∈ v + T(v), i.e., x − u ∈ T(u) and x − v ∈ T(v). By the monotonicity of T, we have 0 ≤ 〈(x − u) − (x − v), u − v〉 = −‖u − v‖², which implies u = v, a contradiction. This shows that J_T is single-valued.

Next, since J_T is single-valued, we define u := J_T(x) and v := J_T(y). By the definition of J_T, we have x ∈ u + T(u) and y ∈ v + T(v). Using the monotonicity of T, we have 〈x − u − (y − v), u − v〉 ≥ 0. Rearranging this, we get ‖u − v‖² ≤ 〈u − v, x − y〉, which is exactly (23). The last statement follows directly from (23) by the Cauchy-Schwarz inequality.
Similarly, we can show that the reflection R_T is also nonexpansive. From Proposition 4.1, we have

   ‖R_T(x) − R_T(y)‖² = ‖2(J_T(x) − J_T(y)) − (x − y)‖² = 4‖J_T(x) − J_T(y)‖² − 4〈J_T(x) − J_T(y), x − y〉 + ‖x − y‖² ≤ ‖x − y‖²,

where the inequality follows from (23). Hence, R_T is also nonexpansive. We note that γT is also maximally monotone for any γ > 0 and any maximally monotone mapping T. Hence, we can define J_{γT} and R_{γT}, respectively.
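A quick numerical illustration for T = ∂‖·‖₁, whose resolvent J_{γT} is the soft-thresholding operator: co-coercivity (23) and the nonexpansiveness of the reflection can be checked on random points (the dimension, γ, and seed are arbitrary choices):

```python
import numpy as np

def soft_threshold(x, gamma):
    # J_{gamma T} for T = subdifferential of ||.||_1, i.e. prox_{gamma ||.||_1}
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

rng = np.random.default_rng(4)
gamma = 0.3
for _ in range(100):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    Jx, Jy = soft_threshold(x, gamma), soft_threshold(y, gamma)
    # co-coercivity (firm nonexpansiveness), inequality (23)
    assert np.sum((Jx - Jy) ** 2) <= (Jx - Jy) @ (x - y) + 1e-12
    # nonexpansiveness of the reflection R = 2 J - I
    Rx, Ry = 2 * Jx - x, 2 * Jy - y
    assert np.linalg.norm(Rx - Ry) <= np.linalg.norm(x - y) + 1e-12
```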
If T := ∂g, the subdifferential mapping of a proper, closed, and convex function g, then J_{∂g} = prox_g. In this case, we can write

   J_{∂g}(x) = (I + ∂g)⁻¹(x) = prox_g(x) = arg min_u { g(u) + (1/2)‖u − x‖² }.

Any property of the resolvent J_T remains true for the proximal operator prox_g.

In particular, if g(·) := δ_X(·), the indicator function of a nonempty, closed, and convex set X in R^p, then ∂δ_X(·) = N_X(·), the normal cone of X. Hence, prox_{δ_X} = π_X, the projection onto X, i.e., J_{∂δ_X} = π_X. In this case, the co-coercivity of π_X implies the well-known projection inequality

   〈π_X(u) − u, y − π_X(u)〉 ≥ 0, ∀y ∈ X.
In particular, π_X(u) is uniquely defined for every u, by the single-valuedness of the resolvent.
4.2 Monotone inclusions and variational inequalities
We consider the composite convex minimization problem discussed in Lecture 1:

   φ⋆ := min_{x ∈ R^p} { φ(x) := f(x) + g(x) }.    (24)

The optimality condition for (24) can be written as

   0 ∈ ∂f(x) + ∂g(x) ≡ T₁(x) + T₂(x),    (25)

which can be viewed as a monotone inclusion for the sum of two maximally monotone operators T₁ = ∂f and T₂ = ∂g. Associated with a maximally monotone operator T, we define the resolvent J_{γT}(x) := (I + γT)⁻¹(x). When T ≡ ∂g, we have J_{γ∂g}(x) ≡ prox_{γg}(x), the proximal operator of γg.
Definition 8. Let T : R^p → 2^{R^p} be a set-valued mapping. The problem of finding x⋆ ∈ R^p such that

   0 ∈ T(x⋆)    (26)

is called an inclusion (also called a generalized equation). When T is single-valued, (26) becomes a system of equations T(x⋆) = 0. If T is maximally monotone, then we call (26) a [maximally] monotone inclusion.
From this definition, we can see that the optimality condition (25) is indeed a maximally monotone inclusion. Next, we consider a constrained convex program:

   f⋆ := min_{x ∈ X} f(x),    (27)

where f is a smooth and convex function, and X is a nonempty, closed, and convex set in R^p. Then, the optimality condition of this problem can be written as

   ∇f(x⋆)ᵀ(x − x⋆) ≥ 0, ∀x ∈ X.    (28)

The inequality (28) is called the variational inequality associated with (27). We define it formally as follows.
Definition 9. Given a nonempty, closed, and convex set X in R^p, let F : X → R^p be a mapping. The problem of finding x⋆ ∈ X such that

   F(x⋆)ᵀ(y − x⋆) ≥ 0, ∀y ∈ X,    (VIP)

is called a variational inequality problem (VIP). If F is maximally monotone, we call (VIP) a [maximally] monotone variational inequality.
Let N_X be the normal cone of X. Then, we can write (VIP) as a maximally monotone inclusion of the form:

   0 ∈ F(x⋆) + N_X(x⋆).

Since F is maximally monotone, T := F + N_X is also maximally monotone. Hence, (VIP) is a special case of (26). The theory, applications, and numerical methods for monotone inclusions and variational inequalities can be found, e.g., in [5, 7, 8, 16].
5 Duality theory
We first discuss the KKT (Karush-Kuhn-Tucker) condition of a smooth convex optimization problem. Then, we
recall some basic facts of the Lagrange duality theory.
5.1 Minimax theorem and its optimization view
Consider a bifunction L : X × Y → R ∪ {±∞}, where X and Y are two nonempty, closed, and convex subsets of R^p and R^n, respectively. The following relation is always true:

   sup_{y ∈ Y} inf_{x ∈ X} L(x, y) ≤ inf_{x ∈ X} sup_{y ∈ Y} L(x, y).    (29)

If, in addition, L(·, y) is convex on X for any y ∈ Y, and L(x, ·) is concave on Y for any x ∈ X, then, under some mild conditions, we have

   sup_{y ∈ Y} inf_{x ∈ X} L(x, y) = inf_{x ∈ X} sup_{y ∈ Y} L(x, y).    (30)

The relation (30) is an informal statement of the min-max theorem in nonlinear/functional analysis.
Now, consider a convex problem

   f⋆ := min_{x ∈ R^p} f(x).    (31)

Since f is proper, closed, and convex, we can write its biconjugate as f∗∗(x) = sup_y { yᵀx − f∗(y) } = f(x). Therefore, by the min-max relation (30), we have

   f⋆ = min_x f(x) = min_x max_y { yᵀx − f∗(y) } = max_y min_x { yᵀx − f∗(y) }.

Since min_x yᵀx equals 0 if y = 0 and −∞ otherwise, the inner minimum is −f∗(y) if y = 0 and −∞ otherwise; hence f⋆ = −f∗(0).
For the composite convex problem

   min_{x ∈ R^p} { F(x) := f(x) + g(x) },    (32)

where f and g are proper, closed, and convex, a more direct way to derive the dual problem is as follows. Writing (32) via the conjugates of f and g, we have

   min_x { f(x) + g(x) } = min_{x ∈ R^p} { max_u { xᵀu − f∗(u) } + max_v { xᵀv − g∗(v) } }
   = max_u max_v min_x { xᵀ(u + v) − f∗(u) − g∗(v) }   (by (30))
   = max_{u,v} { −f∗(u) − g∗(v) | u + v = 0 }
   = −min_u { f∗(u) + g∗(−u) }.

Hence, we call the last problem in this chain the dual problem of (32). This principle can also be applied to other problems as well.
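To see this dual construction in action, consider the hypothetical instance f(x) = (1/2)‖x − a‖² and g = λ‖·‖₁: then f∗(u) = (1/2)‖u‖² + aᵀu and g∗ is the indicator of {‖·‖∞ ≤ λ}, so both the primal and dual values are computable in closed form and should coincide:

```python
import numpy as np

# Primal: min_x (1/2)||x - a||^2 + lam*||x||_1, solved by soft-thresholding.
# Dual:   -min_{||u||_inf <= lam} { (1/2)||u||^2 + a'u }, from
#         f*(u) = (1/2)||u||^2 + a'u and g*(v) = indicator of {||v||_inf <= lam}.
a = np.array([2.0, -0.3, 1.1, -4.0])
lam = 1.0

x_star = np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)
primal = 0.5 * np.sum((x_star - a) ** 2) + lam * np.sum(np.abs(x_star))

u_star = np.clip(-a, -lam, lam)   # coordinatewise clipped unconstrained minimizer
dual = -(0.5 * np.sum(u_star ** 2) + a @ u_star)

assert abs(primal - dual) < 1e-12   # strong duality, zero gap
```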
5.2 Dual problem of a smooth convex program and KKT condition
Let us start with the following smooth equality-constrained convex program:

   f⋆ := min_{x ∈ R^p} f(x)  s.t.  Ax = b,    (33)

where f : R^p → R is a smooth convex function, A ∈ R^{m×p}, and b ∈ R^m.

The Lagrange function of (33) is defined as

   L(x, y) := f(x) − yᵀ(Ax − b),    (34)

where y ∈ R^m is the Lagrange multiplier associated with the linear constraint Ax = b. We can define the dual function of (33) as

   d(y) := inf_{x ∈ dom(f)} L(x, y) = inf_{x ∈ dom(f)} { f(x) − yᵀ(Ax − b) } = −sup_{x ∈ dom(f)} { (Aᵀy)ᵀx − f(x) } + bᵀy = −f∗(Aᵀy) + bᵀy.    (35)
Note that the dual function is concave, but generally nonsmooth (why?). Here, we use inf and sup since the min or max may not be attained. The dual problem of (33) is defined as

   d⋆ := sup_{y ∈ R^m} d(y).    (36)

Clearly, we have

   d⋆ := sup_{y ∈ R^m} d(y) = sup_{y ∈ R^m} inf_{x ∈ dom(f)} L(x, y) ≤ inf_{x ∈ dom(f)} sup_{y ∈ R^m} L(x, y) = inf { f(x) | x ∈ dom(f), Ax = b },    (37)

since sup_y L(x, y) equals f(x) if Ax = b and +∞ otherwise.
Hence, the dual function d(y) either takes the value −∞ or provides a lower bound on the optimal value f⋆. In particular, we always have d(y) ≤ f(x) for all y ∈ R^m and all x ∈ dom(f) with Ax = b, which is called weak duality. It is obvious that d(y) ≤ d⋆ ≤ f⋆ ≤ f(x) for x ∈ F := { x ∈ dom(f) | Ax = b } and y ∈ R^m. This inequality also leads to the notion of a saddle point of the Lagrange function: (x⋆, y⋆) is called a saddle point of L if

   L(x⋆, y) ≤ L(x⋆, y⋆) ≤ L(x, y⋆), ∀x ∈ dom(f), y ∈ R^m.    (38)

Now, we can write the optimality condition of (33) as

   ∂L(x, y)/∂x = ∇f(x) − Aᵀy = 0,    ∂L(x, y)/∂y = b − Ax = 0.    (39)
This condition is also called the KKT (Karush-Kuhn-Tucker) condition of (33). It is indeed a nonlinear system. If the Slater condition

   ri(dom(f)) ∩ { x ∈ R^p | Ax = b } ≠ ∅    (40)

holds, then (39) is necessary and sufficient for the pair (x⋆, y⋆) to be a primal-dual solution of (33) and (36). In this case, we have f⋆ = d⋆, which is referred to as strong duality (with zero duality gap).
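A numerical sketch of (33)-(39) for a convex quadratic f(x) = (1/2)xᵀQx + qᵀx (the data are arbitrary): the KKT system (39) is linear, and f∗ is known in closed form, so strong duality f⋆ = d⋆ can be verified directly:

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 4, 2
M = rng.standard_normal((n, n))
Q = M @ M.T + np.eye(n)          # positive definite
q = rng.standard_normal(n)
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# KKT system (39): grad f(x) - A'y = Qx + q - A'y = 0 and Ax = b
K = np.block([[Q, -A.T], [A, np.zeros((m, m))]])
sol = np.linalg.solve(K, np.concatenate([-q, b]))
x_star, y_star = sol[:n], sol[n:]

primal = 0.5 * x_star @ Q @ x_star + q @ x_star
# d(y) = -f*(A'y) + b'y with f*(z) = (1/2)(z - q)'Q^{-1}(z - q)
z = A.T @ y_star
dual = -0.5 * (z - q) @ np.linalg.solve(Q, z - q) + b @ y_star

assert np.linalg.norm(A @ x_star - b) < 1e-10
assert abs(primal - dual) < 1e-8    # strong duality for (33)
```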
5.3 Dual problem of a general convex program and KKT condition
We consider the following constrained convex optimization problem:

   f⋆ := min_{x ∈ R^p} f(x)  s.t.  Ax − b ∈ K, x ∈ X,    (41)

where f : R^p → R is a smooth and convex function, A ∈ R^{m×p}, b ∈ R^m, X is a nonempty, closed, and convex subset of R^p, and K is a nonempty, closed, and convex cone in R^m.

We define a partial Lagrange function of this problem as follows:

   L(x, y) := f(x) − 〈y, Ax − b〉,    (42)

where y ∈ R^m is a Lagrange multiplier associated with the constraint Ax − b ∈ K. The dual function arising from this Lagrange function is defined as

   d(y) = inf_{x ∈ X} L(x, y) = inf_{x ∈ X} { f(x) − 〈y, Ax − b〉 } = −sup_x { 〈Aᵀy, x〉 − f(x) − δ_X(x) } + bᵀy = −f∗_X(Aᵀy) + bᵀy.    (43)
Here, we absorb X into f as f_X(·) := f(·) + δ_X(·), where δ_X is the indicator of X. The dual problem becomes

   d⋆ = max_{y ∈ R^m} { d(y) − s_K(−y) } = −min_{y ∈ R^m} { f∗_X(Aᵀy) − bᵀy + s_K(−y) },    (44)

where s_K is the support function of K. Since K is a cone, s_K(−y) = δ_{K∗}(y), the indicator of the dual cone K∗. We have the following special cases:

• If K = {0_m}, then Ax − b = 0 and s_K(−y) = 0 for all y ∈ R^m. The dual problem becomes min_{y ∈ R^m} { f∗_X(Aᵀy) − bᵀy }.

• If K = R^m₊, then Ax − b ≥ 0 and s_K(−y) = δ_{R^m₊}(y). The dual problem becomes min_{y ∈ R^m₊} { f∗_X(Aᵀy) − bᵀy }.

• If K = S^m₊, then Ax − b ⪰ 0 and s_K(−y) = δ_{S^m₊}(y). The dual problem becomes min_{y ∈ S^m₊} { f∗_X(Aᵀy) − bᵀy }.
If we define the Lagrange function of (41) as L(x, y) := f(x) + 〈y, Ax − b〉, then (x⋆, y⋆) is a saddle point of L if and only if

   0 ∈ ∂f(x⋆) + Aᵀy⋆ + N_X(x⋆),
   Ax⋆ − b ∈ K, x⋆ ∈ X, y⋆ ∈ −K∗,    (45)

where K∗ is the dual cone of K. If, in addition, we have

   〈Ax⋆ − b, y⋆〉 = 0,    (46)

then x⋆ is a primal optimal solution of (41). The last condition (46) is called the complementary slackness condition.

We can consider a more general constraint of the form A(x) − b ∈ K, where A is convex w.r.t. the convex cone K, i.e.,

   A((1 − α)x + αy) − (1 − α)A(x) − αA(y) ∈ K, ∀x, y ∈ R^p, ∀α ∈ [0, 1].

Here, A is not necessarily linear. For instance, if K = −R^n₊ and g : R^p → R^n is convex (componentwise), then the constraint g(x) ≤ 0 is convex w.r.t. −R^n₊. We refer the reader to [1] for more details on this topic.
Now, we present two examples of the optimality condition (45) and (46).
Example 16 (Linear programming). We consider the following linear program:

   min_{x ∈ R^p} { cᵀx | Ax − b ≤ 0 }.

Here, K := −R^n₊ and X := R^p. The dual cone of −R^n₊ is −R^n₊ itself, and N_{R^p}(x⋆) = {0}. The optimality condition of this problem is

   Aᵀy⋆ + c = 0, Ax⋆ − b ≤ 0, y⋆ ≥ 0, (Ax⋆ − b) ∘ y⋆ = 0.

This has exactly the form (45). This condition is known as the KKT condition for LPs. The last equality comes from the complementary slackness condition (46), where ∘ is the element-wise product.
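The LP optimality condition above can be checked on a tiny hand-constructed example (the data and the primal-dual pair below are hypothetical, chosen so that the KKT conditions can be verified numerically):

```python
import numpy as np

# Tiny LP: min x1 + x2  s.t.  Ax - b <= 0, encoding x >= 0 and x1 + x2 >= 1.
c = np.array([1.0, 1.0])
A = np.array([[-1.0, 0.0], [0.0, -1.0], [-1.0, -1.0]])
b = np.array([0.0, 0.0, -1.0])

# Hand-computed primal/dual pair: only the third constraint is active.
x_star = np.array([0.5, 0.5])
y_star = np.array([0.0, 0.0, 1.0])

assert np.allclose(A.T @ y_star + c, 0.0)         # stationarity
assert (A @ x_star - b <= 1e-12).all()            # primal feasibility
assert (y_star >= 0).all()                        # dual feasibility
assert np.allclose((A @ x_star - b) * y_star, 0)  # complementary slackness
```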
Example 17. In the second example, we consider the following general convex program:

   f⋆ := min_{x ∈ X} { f(x) | Ax = b, g_i(x) ≤ 0, i = 1, · · · , m },    (47)

where f : R^p → R ∪ {+∞} is convex and smooth, A ∈ R^{n×p}, b ∈ R^n, g_i : R^p → R ∪ {+∞} for i = 1, · · · , m are convex and smooth, and X is a nonempty, closed, and convex set in R^p. To avoid the trivial case, we assume that X ∩ ri(dom(f)) ≠ ∅. The corresponding dual problem is

   d⋆ := sup_{(µ,λ)} { d(µ, λ) | µ ≥ 0 },    (48)

where the dual function is defined as

   d(µ, λ) := inf_{x ∈ X} { L(x, µ, λ) := f(x) + µᵀg(x) + λᵀ(Ax − b) }.

We say that problem (47) satisfies the Slater condition if

   ∃x ∈ ri(X) such that g_i(x) < 0, ∀i = 1, · · · , m, and Ax = b.    (49)

Here, g = (g₁, · · · , g_m). Under this Slater condition, we have the following strong duality statement:
• There is no duality gap, and there exists at least one Lagrange multiplier pair (µ⋆, λ⋆).

• If the primal problem (47) has an optimal solution x⋆, then the triple (x⋆, µ⋆, λ⋆) satisfies the following optimality condition (also called a generalized KKT condition):

   〈∇f(x⋆) + ∇g(x⋆)ᵀµ⋆ + Aᵀλ⋆, x − x⋆〉 ≥ 0, ∀x ∈ X   (Lagrangian optimality)
   x⋆ ∈ X, g(x⋆) ≤ 0, Ax⋆ = b   (Primal feasibility)
   µ⋆ ≥ 0   (Dual feasibility)
   µ⋆_i g_i(x⋆) = 0, i = 1, · · · , m.   (Complementary slackness)    (50)

When X = R^p, the first condition reduces to ∇f(x⋆) + ∇g(x⋆)ᵀµ⋆ + Aᵀλ⋆ = 0. In this case, (50) reduces to the standard KKT condition widely used in the literature.
6 Convergence rates and computational complexity theory
To evaluate the efficiency of iterative methods, we often compare the rates of convergence of the sequences they generate. Let us recall these notions here, together with the big-O notation borrowed from complexity theory. We then review the concepts of local and global convergence used for iterative methods.
6.1 Rate of convergence
We consider a sequence {a_k}_{k≥0} of real numbers. We say that {a_k} is monotonically nonincreasing (respectively, nondecreasing) if a_{k+1} ≤ a_k (respectively, a_{k+1} ≥ a_k) for all k ≥ 0. We say that {a_k} is bounded if sup_{k≥0} |a_k| < +∞.

Definition 10. We say that a sequence {a_k} in R converges to a⋆ ∈ R if for any ε > 0, there exists k_ε depending on ε such that |a_k − a⋆| < ε for all k ≥ k_ε.
For example, the sequence {k/(k+1)}_{k≥0} converges to a⋆ = 1, and the sequence {k sin(1/k)}_{k≥1} converges to a⋆ = 1, while the sequence {(−1)^k} and the sequence {sin(k)}_{k≥0} do not converge even though they are bounded.
6.1.1 Convergence rate
1. We say that {a_k} converges to a⋆ at a sublinear rate if |a_k − a⋆| ≤ C/k^d for some given constants d ∈ (0, +∞) and C > 0.

2. We say that {a_k} converges to a⋆ at a linear rate if |a_k − a⋆| ≤ Cρ^k for some given constants ρ ∈ (0, 1) and C > 0. Here, ρ is called a contraction factor of this sequence. This rate of convergence is also called an R-linear convergence rate. If instead |a_{k+1} − a⋆| ≤ ρ|a_k − a⋆| for all k ≥ 0, then we have a Q-linear rate.¹

3. We say that {a_k} converges to a⋆ at a superlinear rate if |a_k − a⋆| ≤ C_k ρ^k for some given constant ρ ∈ (0, 1) and C_k > 0 such that lim_{k→∞} C_k = 0. Similarly, we call this an R-superlinear convergence rate. If |a_{k+1} − a⋆|/|a_k − a⋆| → 0 as k → ∞, then we say that {a_k} converges to a⋆ at a Q-superlinear rate.

4. We say that {a_k} converges to a⋆ at a quadratic rate if |a_k − a⋆| ≤ Cρ^{2^k} for some given constants ρ ∈ (0, 1) and C > 0. This rate of convergence is again an R-quadratic rate. If |a_{k+1} − a⋆|/|a_k − a⋆|² ≤ C for all k ≥ 0, then we have a Q-quadratic convergence rate.
Here are a few examples (we leave these examples as exercises for the reader):

• The sequence {τ_k} generated by τ₀ := 1 and τ_{k+1} := (τ_k/2)(√(τ_k² + 4) − τ_k) converges to zero at a sublinear rate with d = 1.

¹ Here, "R" stands for "root", and "Q" means "quotient".
• With τk defined above, the sequence βk with β0 = 1 and βk+1 := (1 − τk)βk converges to zero at a
sublinear rate with d = 2.
• The sequence {a_k} defined by a_0 = 1 and a_{k+1} := (3/4)a_k + 1/8 converges to a* = 1/2 at a linear rate with the contraction factor ρ = 3/4.
• The sequence {a_k} defined by a_0 = 1 and a_{k+1} := a_k/(k + 2) converges to a* = 0 at a superlinear rate.
• The sequence {a_k} defined by a_0 := 0.5 and a_{k+1} := a_k² − 2a_k + 2 converges to a* = 1 at a quadratic rate.
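The examples above can also be checked numerically. The following sketch (the `iterate` helper is our own, not from the lecture) runs each recursion and prints quantities that reveal the claimed rates.

```python
import math

# A sketch reproducing the example sequences above and checking the claimed
# rates numerically (the helper name is our own construction).
def iterate(a0, step, n):
    """Return [a_0, ..., a_n] for the recursion a_{k+1} = step(a_k, k)."""
    a, out = a0, [a0]
    for k in range(n):
        a = step(a, k)
        out.append(a)
    return out

tau = iterate(1.0, lambda t, k: 0.5 * t * (math.sqrt(t * t + 4.0) - t), 1000)
lin = iterate(1.0, lambda a, k: 0.75 * a + 0.125, 50)   # limit a* = 1/2
sup = iterate(1.0, lambda a, k: a / (k + 2), 20)        # limit a* = 0
quad = iterate(0.5, lambda a, k: a * a - 2 * a + 2, 5)  # limit a* = 1

print(1000 * tau[1000])       # roughly 2: empirically tau_k behaves like 2/k
print(abs(lin[50] - 0.5))     # about 0.5 * 0.75^50: linear rate
print(sup[20])                # about 1/21!: superlinear rate
print(abs(quad[5] - 1.0))     # about 0.5^(2^5) = 2.3e-10: quadratic rate
```

Only five quadratic iterations are taken because the error squares at every step and soon drops below double precision.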
To illustrate these rates geometrically, we also look at the following figures.
[Figure: three semilog plots of the error |a_k − a*| against the iteration counter k, comparing sublinear vs. linear, linear vs. superlinear, and superlinear vs. quadratic rates.]
Figure 13: Sublinear: a_k = 1/k; Linear: a_k = 0.5^k; Superlinear: a_k = 1/k^k; Quadratic: a_k = 0.5^(2^k).
6.2 A brief review of complexity theory in optimization
When we design an iterative optimization algorithm of the form x_{k+1} ∈ A(x_0, · · · , x_k, G_k), we often query the necessary information G_k at each iteration k to form the sequence {x_k}, where k is the iteration counter. This information can be function values, subgradients, first-order and/or second-order derivatives, etc. Depending on the information G_k queried at each iteration, we have different classes of algorithms as follows.
• Zero-order methods: Using only function values of the objective function (and/or constraints).
• First-order methods: Using function values and first-order derivatives (e.g., subgradients, gradients, Jacobians, etc.) of the objective function (and/or constraints).
• Quasi-second-order methods: Using function values, [sub]gradients, and approximations of the Hessian built from gradients of the objective function (and/or constraints).
• Second-order methods: Using function values, gradients, and Hessians of the objective function (and/or
constraints).
The mechanism for querying this information is called an oracle. The algorithm calls this oracle to obtain the information G_k at each iteration. We will distinguish two notions of computational complexity:
• Analytical complexity: The total number of oracle calls in an algorithm to achieve a desired solution.
• Arithmetical complexity: The total number of arithmetic operations (floating-point operations (flops))
in an algorithm to achieve a desired solution.
Sometimes, we refer to the analytical complexity as the iteration-complexity, while the arithmetical complexity is called the computational complexity. The latter can be viewed as a combination of the iteration-complexity and the per-iteration complexity. Most iterative optimization algorithms are characterized by their worst-case iteration-complexity to achieve an approximate solution up to a given accuracy in a certain sense.
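To make the distinction concrete, here is a toy sketch of our own (not from the lecture): gradient descent on a separable quadratic, queried through a counting first-order oracle. The analytical complexity is the number of oracle calls; the arithmetical complexity multiplies this count by the per-call cost.

```python
# A toy construction of our own separating the two complexity notions:
# gradient descent on f(x) = 0.5 * sum(q_i * x_i^2), queried via a counting
# first-order oracle.
q = [1.0, 2.0, 5.0, 10.0]      # diagonal Hessian; Lipschitz constant L = max(q)
oracle_calls = 0

def grad_oracle(x):
    """First-order oracle: one call returns the gradient, costing O(p) flops here."""
    global oracle_calls
    oracle_calls += 1
    return [qi * xi for qi, xi in zip(q, x)]

L = max(q)
x = [1.0] * len(q)
g = grad_oracle(x)
while max(abs(gi) for gi in g) > 1e-8:        # stop when the gradient is small
    x = [xi - gi / L for xi, gi in zip(x, g)]  # step size 1/L
    g = grad_oracle(x)

# Analytical (iteration) complexity: the number of oracle calls made.
# Arithmetical complexity: roughly oracle_calls * p flops for this oracle.
print(oracle_calls)
```

For this strongly convex toy problem the error contracts by the factor 1 − μ/L per iteration, so the oracle-call count grows like (L/μ) log(1/ε), matching the linear-rate discussion above.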
We also use the following big-O notation to denote the worst-case complexity or convergence rate of an algorithm. Given two functions f : R → R and g : R → R, we say that f(x) = O(g(x)) if there exist a constant C > 0 independent of x and a point x_0 such that |f(x)| ≤ C|g(x)| for all x ≥ x_0. Here are a few examples:
• O(1) means that f(x) is bounded by a constant.
• O(log(n)) is a logarithmic complexity (e.g., finding an item in a sorted array with a binary search). It means that |f(n)| ≤ C log(n) for all sufficiently large n ∈ N.
• O(n) is a linear time complexity (e.g., finding a minimum element of an array).
• O(np) (p > 0) is a polynomial time complexity. For instance, interior-point methods for conic programming
often have a worst-case polynomial time complexity.
• O(2n) is an exponential time complexity. This complexity often appears in combinatorial optimization. It
is also a worst-case complexity of simplex methods for LPs.
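As an illustration of the O(log n) case, the following sketch counts probes in a binary search (the counting wrapper is our own construction):

```python
import math

# A sketch verifying the O(log n) example above by counting probes in a
# binary search (the counting wrapper is our own; each probe does one
# three-way comparison of arr[mid] against the target).
def binary_search(arr, target):
    comparisons = 0
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        comparisons += 1
        if arr[mid] == target:
            return mid, comparisons
        elif arr[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, comparisons

n = 1 << 20                              # sorted array of about one million items
idx, c = binary_search(list(range(n)), n - 1)
print(c, math.ceil(math.log2(n)) + 1)    # probes stay within log2(n) + 1
```

Each probe halves the search interval, so the count never exceeds about log2(n) + 1 even in the worst case.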
Note that we also use big-O notation to characterize convergence rates. For example, O(1/n), O(1/n²), or O(1/√n) stands for a sublinear convergence rate.
6.3 Local convergence vs. global convergence
We say that a sequence {x_k} in X generated by an algorithm A such that x_{k+1} ∈ A(x_0, · · · , x_k) in a given domain X locally converges to x* ∈ X if there exists a neighborhood N(x*) such that whenever we choose a starting point x_0 ∈ N(x*), we have lim_{k→∞} x_k = x*. Equivalently, there exists r > 0 such that for any x_0 ∈ B_r(x*) := {x ∈ X | ‖x − x*‖ ≤ r}, the sequence {x_k} converges to x*. In many numerical algorithms, we often see that {x_k} ⊆ B_r(x*), but this is not necessary.
[Figure: a local convergence region around x*: iterates x_0, x_1, . . . , x_k converge to x* when started inside the region (local convergence), whereas global convergence allows any starting point.]
We say that a sequence {x_k} in X generated by an algorithm A such that x_{k+1} ∈ A(x_0, · · · , x_k) in a given domain X globally converges to x* ∈ X if, for any starting point x_0 ∈ X, {x_k} converges to x*. However, in many cases we do not have exactly this type of convergence. Hence, we may use some metric M to describe convergence as lim_{k→∞} M(x_k, x*) = 0. For instance, in gradient-type methods, we often use the objective function f as a metric to measure convergence, e.g., lim_{k→∞} (f(x_k) − f*) = 0, where f* is the optimal value.
As concrete examples of local and global convergence, gradient methods for convex optimization often have a global convergence guarantee, while Newton methods have only a local quadratic convergence. In nonconvex optimization, it is important to distinguish these two types of convergence.
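As a quick numerical illustration (the example function is our own choice, not from the lecture), consider Newton's method for solving tanh(x) = 0. Since tanh′(x) = 1 − tanh(x)², the Newton step is x_{k+1} = x_k − tanh(x_k)/(1 − tanh(x_k)²) = x_k − sinh(x_k)cosh(x_k). The iterates converge extremely fast from a starting point near the root x* = 0, but blow up from a starting point far away:

```python
import math

# Newton's method for tanh(x) = 0 (our own illustrative example): the Newton
# step x - tanh(x)/(1 - tanh(x)^2) simplifies to x - sinh(x) * cosh(x).
def newton_tanh(x0, n):
    x = x0
    for _ in range(n):
        x = x - math.sinh(x) * math.cosh(x)
    return x

print(abs(newton_tanh(0.5, 10)))  # x0 inside the local region: error -> 0 rapidly
print(abs(newton_tanh(1.5, 3)))   # x0 outside the region: the iterates diverge
```

This is the classical picture of a local convergence region: the very same iteration is fast near x* and useless far from it, which is why Newton-type methods are often globalized with line searches or trust regions.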
References
[1] H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer-Verlag, 2nd edition, 2017.
[2] A. Beck and M. Teboulle. Smoothing and first order methods: A unified framework. SIAM J. Optim.,
22(2):557–580, 2012.
[3] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, 2004.
[4] P. L. Combettes and J.-C. Pesquet. Proximal splitting methods in signal processing. In Fixed-point algorithms
for inverse problems in science and engineering, pages 185–212. Springer, 2011.
[5] F. Facchinei and J.-S. Pang. Finite-dimensional variational inequalities and complementarity problems, vol-
ume 1-2. Springer-Verlag, 2003.
[6] O. Guler. On the convergence of the proximal point algorithm for convex minimization. SIAM J. Control
Optim., 29(2):403–419, 1991.
[7] D. Klatte and B. Kummer. Nonsmooth equations in optimization: regularity, calculus, methods and applica-
tions. Springer-Verlag, 2001.
[8] B. S. Mordukhovich. Variational Analysis and Generalized Differentiation I: Basic Theory, volume 330. Springer Science & Business Media, 2006.
[9] A. Nemirovski. Lectures on modern convex optimization. Society for Industrial and Applied Mathematics (SIAM), 2001.
[10] Y. Nesterov. Introductory lectures on convex optimization: A basic course, volume 87 of Applied Optimization.
Kluwer Academic Publishers, 2004.
[11] Y. Nesterov. Smooth minimization of non-smooth functions. Math. Program., 103(1):127–152, 2005.
[12] N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):123–231, 2013.
[13] R. Rockafellar and R. Wets. Variational Analysis, volume 317. Springer, 2004.
[14] R. T. Rockafellar. Convex Analysis, volume 28 of Princeton Mathematics Series. Princeton University Press,
1970.
[15] R.T. Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal on Control and
Optimization, 14:877–898, 1976.
[16] R.T. Rockafellar and R. J-B. Wets. Variational Analysis. Springer-Verlag, 1997.