
Convexity, Smoothness, Duality, and Stability

Yao-Liang Yu
yaoliang@cs.cmu.edu

Machine Learning Department
Carnegie Mellon University

December 14, 2015

This note is about the interplay between convexity, smoothness, and stability, through duality. The note is still largely under construction and will be updated from time to time.

Contents

1 Topological background

2 Convex Functions

3 Uniformly convex and uniformly smooth functions

1 Topological background

In this section we collect some useful topological results. A significant portion is devoted to uniform spaces, as the writer has started to appreciate this notion and hence wishes to learn more about them.

Theorem 1.1: Many useful spaces are not pseudo-metrizable

The space {0, 1}^ω is metrizable iff |ω| ≤ ℵ0.
Proof: The if part is easy. For the only if part, note that {0, 1}^ω is not first countable if |ω| > ℵ0.

Now take some (pseudo) metric spaces {X_γ : γ ∈ Γ} and consider their product ∏_γ X_γ. If |Γ| > ℵ0 (and each space X_γ contains two topologically distinct points), then the product space is not (pseudo) metrizable.

As we will see, many important topological spaces can be treated as a subspace of a product of metrizable spaces (e.g. [0, 1]^J for some index set J). Theorem 1.1 suggests that these spaces may not be metrizable, but nevertheless they still enjoy enough “metric” structure. So we want to study and characterize these spaces.

Definition 1.2: Topology

A topology on a set X is a collection of families of sets (nhoods) {𝒰_x ⊆ 2^X : x ∈ X} such that:

(I). U ∈ 𝒰_x ⟹ x ∈ U;

(II). If U ∈ 𝒰_x, then there exists some V ∈ 𝒰_x such that U ∈ 𝒰_y for all y ∈ V;

(III). U, V ∈ 𝒰_x ⟹ U ∩ V ∈ 𝒰_x;

(IV). U ∈ 𝒰_x, U ⊆ V ⟹ V ∈ 𝒰_x.

A set U is called open iff U ∈ 𝒰_x for all x ∈ U. Trivially ∅ and X are open. A set is closed iff its complement is open.

The collection {𝒰_x ⊆ 2^X : x ∈ X} is called an nhood basis iff

(I). U ∈ 𝒰_x ⟹ x ∈ U;

(II). If U ∈ 𝒰_x, then there exists some V ∈ 𝒰_x such that for all y ∈ V there exists some W ∈ 𝒰_y, W ⊆ U;

(III). U, V ∈ 𝒰_x ⟹ there exists some W ⊆ U ∩ V, W ∈ 𝒰_x.

Enlarging the basis by including all supersets we recover the nhoods. Removing the last condition we get a subbasis of the topology; we can recover the basis by taking all finite intersections.

Theorem 1.3: The closure operator, Kurotowski

Theorem 1.4: Convergence class

Definition 1.5: Uniformity

A uniformity on a product set X × X is a collection 𝒟 of sets D ⊆ X × X such that

(I). D ∈ 𝒟 ⟹ D ⊇ Δ := {(x, x) : x ∈ X};

(II). D ∈ 𝒟 ⟹ D^{-1} ∈ 𝒟;

(III). If D ∈ 𝒟, then there exists some E ∈ 𝒟 with E ∘ E ⊆ D;

(IV). D, E ∈ 𝒟 ⟹ D ∩ E ∈ 𝒟;

(V). D ∈ 𝒟, D ⊆ E ⟹ E ∈ 𝒟,

where D^{-1} := {(y, x) : (x, y) ∈ D} and D ∘ E := {(x, y) : (x, z) ∈ E, (z, y) ∈ D for some z ∈ X}.

Uniformity is introduced to measure the “distance” between two points: there is a clear analogy between (I), (II), (III) and the definition of distance. (IV) and (V) are needed to extract a topology from a uniformity.

If we omit (V) (and weaken (IV) a bit) we obtain a basis, while if we omit both (IV) and (V) we obtain a subbasis.

Note that Δ ∘ D = D ∘ Δ = D. Using (I): E ∘ E ⊆ D ⟹ E ⊆ D. Thus, if (III) is satisfied, then we can strengthen it: for all n ∈ ℕ, there exists some E ∈ 𝒟 such that the n-fold composition E ∘ ··· ∘ E ⊆ D, where w.l.o.g. we can assume E is symmetric, i.e. E = E^{-1}.

Note also that usually Δ ∉ 𝒟. A topological space that admits a compatible (see Definition 1.7 below) uniformity will be called uniformizable.

The notion of uniformity was first introduced in Weil [1937], who allegedly also invented the empty set symbol ∅ (from the Norwegian alphabet).
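On a finite set the axioms (I)-(V) of Definition 1.5 can be checked mechanically. The following is only a minimal sketch: the helper names (`inverse`, `compose`, `supersets`, `is_uniformity`) and the three-point example are illustrative assumptions, not part of the note.

```python
from itertools import combinations, product


def inverse(D):
    """D^{-1} = {(y, x) : (x, y) in D}."""
    return frozenset((y, x) for (x, y) in D)


def compose(D, E):
    """D ∘ E = {(x, y) : (x, z) in E and (z, y) in D for some z}."""
    return frozenset((x, y) for (x, z) in E for (w, y) in D if z == w)


def supersets(base, X):
    """All subsets of X × X that contain `base` (hypothetical helper)."""
    rest = [p for p in product(X, X) if p not in base]
    return {frozenset(base) | frozenset(extra)
            for k in range(len(rest) + 1)
            for extra in combinations(rest, k)}


def is_uniformity(X, family):
    """Check axioms (I)-(V) for a family of subsets of X × X, X finite."""
    family = set(family)
    diagonal = frozenset((x, x) for x in X)
    full = frozenset(product(X, X))
    for D in family:
        if not diagonal <= D:                              # (I)
            return False
        if inverse(D) not in family:                       # (II)
            return False
        if not any(compose(E, E) <= D for E in family):    # (III)
            return False
        # (V): adding one pair at a time suffices for upward closure.
        if any(D | {pair} not in family for pair in full - D):
            return False
    # (IV): closed under pairwise (hence finite) intersections.
    return all(D & E in family for D, E in product(family, family))


X = [0, 1, 2]
diag = {(x, x) for x in X}
# Supersets of an equivalence relation: a uniformity.
good = supersets(diag | {(0, 1), (1, 0)}, X)
# Supersets of a non-symmetric relation: axiom (II) fails.
bad = supersets(diag | {(0, 1)}, X)
print(is_uniformity(X, good), is_uniformity(X, bad))
```

The brute-force enumeration is exponential in |X|², so it is only meant for toy examples.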

Alert 1.6: Intersection / union of uniformities

Unlike topologies, the intersection of uniformities need not be a (subbasis of a) uniformity: For any x ∈ [0, 1] let 𝒟_x be the collection of supersets of Δ ∪ {(1, x)} ∪ {(x, 1)}. Clearly 𝒟_x is a uniformity, but for x ≠ y (both distinct from 1), 𝒟_x ∩ 𝒟_y is not even a subbasis: it is the collection of supersets of Δ̃ := Δ ∪ {(1, x)} ∪ {(x, 1)} ∪ {(1, y)} ∪ {(y, 1)}, but Δ̃ ∘ Δ̃ = Δ̃ ∪ {(x, y)} ∪ {(y, x)}, hence (III) in Definition 1.5 is violated for Δ̃.

Clearly, the intersection of uniformities is a (subbasis of a) uniformity iff they all contain a common (subbasis of a) uniformity. Thus, for a collection of (subbases of) uniformities, we can define the smallest uniformity that contains all of them (since their union is a subbasis of a uniformity).
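The counterexample can be replayed on the three points 1, x, y alone. This is a sketch under the assumption that x = 0.25 and y = 0.5 (any two distinct points other than 1 would do); `compose` is a hypothetical helper implementing the composition of Definition 1.5.

```python
def compose(D, E):
    """D ∘ E = {(a, c) : (a, b) in E and (b, c) in D for some b}."""
    return {(a, c) for (a, b) in E for (b2, c) in D if b == b2}


one, x, y = 1.0, 0.25, 0.5          # illustrative choice of points
points = [one, x, y]
diagonal = {(p, p) for p in points}

base_x = diagonal | {(one, x), (x, one)}   # smallest member of D_x
base_y = diagonal | {(one, y), (y, one)}   # smallest member of D_y
delta_tilde = base_x | base_y              # smallest member of D_x ∩ D_y

square = compose(delta_tilde, delta_tilde)
extra = square - delta_tilde               # pairs that break (III)
print(sorted(extra))
```

Since every member of the intersection contains `delta_tilde`, its self-composition contains `square`, which is not inside `delta_tilde`; this is exactly the failure of (III).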

Definition 1.7: (Uniform) topology from uniformity

Let 𝒟 be a uniformity on X × X, then its induced topology (a.k.a. uniform topology) on X is defined using nhoods:

\[ \forall x \in X,\quad \mathcal{D}_x := \{D_x : D \in \mathcal{D}\},\qquad D_x := \{y : (x, y) \in D\}. \tag{1} \]

We easily verify that {𝒟_x}_{x∈X} is a (basis, subbasis of) topology on X if 𝒟 is a (basis, subbasis of) uniformity on X × X.

It is apparent that there is an intimate relation between topology and uniformity, and our central question is to reveal what kind of topology is derived from a uniformity.

Example 1.8: Pseudo-metric uniformity

Let (X, d) be a pseudo-metric space, then it admits a natural uniformity whose basis is:

\[ \mathcal{D} := \big\{ \{(x, y) : d(x, y) < r\} \big\}_{r > 0}. \tag{2} \]

Not surprisingly, the topology of X induced by the metric d coincides with the one induced by the above uniformity.

Proposition 1.9: Interior preserves uniformity

If D is a member of the uniformity 𝒟, then int D ∈ 𝒟, where the interior is taken w.r.t. the product of the uniform topology on X.
Proof: Since D ∈ 𝒟 there exists some symmetric E ∈ 𝒟 such that E ⊆ E ∘ E ∘ E ⊆ D. We claim that E ⊆ int D: Indeed, (x, y) ∈ E ⟹ E_x × E_y ⊆ E ∘ E ∘ E ⊆ D.

Therefore, the collection of open symmetric sets D ∈ 𝒟 is a basis of 𝒟.
It is now clear that D ∈ 𝒟 ⟹ D is an nhood of Δ. However, the converse need not hold: Consider D := {(x, y) ∈ ℝ² : |x − y| < 1/(1 + |y|)}, which is an nhood of the diagonal, but D cannot contain any member of the basis in (2).
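To see concretely why D contains no basis member {(x, y) : |x − y| < r}, one can exhibit, for each r > 0, a pair that is r-close yet outside D. The sketch below uses exact rational arithmetic; the particular witness y = 4/r, x = y + r/2 is an illustrative choice, not taken from the note.

```python
from fractions import Fraction


def in_D(x, y):
    """Membership in D = {(x, y) : |x - y| < 1 / (1 + |y|)}."""
    return abs(x - y) < 1 / (1 + abs(y))


def witness_outside_D(r):
    """Return (x, y) with |x - y| < r but (x, y) not in D.

    At y = 4/r the set D has half-width r/(r + 4) < r/2, so moving
    r/2 away from y leaves D while staying r-close.
    """
    y = Fraction(4) / r
    x = y + r / 2
    return x, y


for r in (Fraction(1), Fraction(1, 10), Fraction(1, 10**9)):
    x, y = witness_outside_D(r)
    print(r, abs(x - y) < r, in_D(x, y))
```

Rationals are used instead of floats because for very small r the offset r/2 would be lost to rounding at y = 4/r.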

Proposition 1.10: Closed sets as intersection of open sets

A topological space X is R0 (i.e., for all x, y ∈ X, x has an nhood not containing y iff y has an nhood not containing x) iff every closed set A ⊆ X can be written as the intersection of (a family of) open supersets of A.
Proof: Let X be R0 and A ⊆ X be closed. For any x ∉ A and y ∈ A, the open set A^c contains x but not y, hence there is an open set U_y containing y but not x. Therefore the open set ∪_{y∈A} U_y contains A but not x. Since x is arbitrary, we know A is the intersection of all open supersets.
Conversely, suppose any closed set A is the intersection of a family of open supersets. Take any x, y ∈ X such that there is an open set U that contains x but not y. Thus the closed set U^c = ∩_α V_α for a family of open sets V_α ⊇ U^c, and obviously y ∈ U^c, x ∉ U^c. Therefore, there exists some α such that y ∈ V_α, x ∉ V_α.

Note that a topological space is T1 iff it is T0 and R0. Moreover, a regular topological space is R0.

Proposition 1.11: Closure in uniform space

For any subset A in a uniform space (X, 𝒟), cl A = ⋂_{D∈𝒟} D_A, where D_A := ⋃_{x∈A} D_x. Similarly, for any B ⊆ X × X, cl B = ⋂_{D∈𝒟} D ∘ B ∘ D.
Proof: x ∈ cl A ⟺ ∀D ∈ 𝒟, D_x ∩ A ≠ ∅ ⟺ ∀ symmetric D ∈ 𝒟, x ∈ D_A.
Similarly, (x, y) ∈ cl B ⟺ ∀ symmetric D ∈ 𝒟, (D_x × D_y) ∩ B ≠ ∅ ⟺ ∀ symmetric D ∈ 𝒟, (x, y) ∈ D ∘ B ∘ D.

Clearly, both intersections can be restricted to any symmetric basis (but not subbasis).
It follows that for any E ∈ 𝒟, cl E ⊆ E ∘ E ∘ E, hence the collection of closed symmetric sets D ∈ 𝒟 is a basis of 𝒟. Thus, the uniform topology is at least regular: Let x ∈ U for any open set U, then there exists a closed symmetric set D such that x ∈ D_x ⊆ U. Since the section map is closed, D_x is closed. Since D_x is an nhood, there is an open set V such that x ∈ V ⊆ D_x ⊆ U.

Since the section map is open, D_A is an open nhood of A if D is open. Thus, cl A can be written as the intersection of a family of open supersets of A.

Proposition 1.12: T0 = T2 in regular space

A regular topological space is T0 iff T1 iff T2.
Proof: Since a regular space is R0, it is T1 iff T0. If a regular space is T1, then it can separate distinct points since a point is closed.

Therefore, the uniform topology is Hausdorff (T2) iff (cl{x} =) ⋂_{D∈𝒟} D_x = {x} for all x ∈ X iff ⋂_{D∈𝒟} D = Δ.

Definition 1.13: Uniform continuity

Let (X, 𝒟) and (Y, ℰ) be two uniform spaces. We call the function f : X → Y uniformly continuous iff for all E ∈ ℰ, the set {(x, y) : (f(x), f(y)) ∈ E} ∈ 𝒟.

Theorem 1.14: Composition preserves uniform continuity

Let f : (X, 𝒟) → (Y, ℰ), g : (Y, ℰ) → (Z, ℱ) be uniformly continuous, then g ∘ f is also uniformly continuous.
Proof: For any F ∈ ℱ,

\[
\{(x,y) : (g(f(x)), g(f(y))) \in F\}
= \overbrace{\big\{(x,y) : (f(x), f(y)) \in \underbrace{\{(a,b) : (g(a), g(b)) \in F\}}_{\in\,\mathcal{E}}\big\}}^{\in\,\mathcal{D}},
\]

since both f and g are uniformly continuous.

Theorem 1.15: Uniform continuous is continuous

A uniformly continuous function f : (X, 𝒟) → (Y, ℰ) is continuous (w.r.t. the uniform topologies).
Proof: Fix any x and f(x). For any nhood E_{f(x)} of f(x), E ∈ ℰ, the set D := {(y, z) : (f(y), f(z)) ∈ E} ∈ 𝒟. Then f(D_x) ⊆ E_{f(x)}.

The following definitions are slight modifications of their topological counterparts (with continuity enhanced to uniform continuity).

Definition 1.16: Making functions uniformly continuous

Let f : X → (Y, ℰ), then the collection of sets

\[ \big\{ \{(x, y) : (f(x), f(y)) \in E\} \big\}_{E \in \mathcal{E}} \]

is easily verified to be a basis of a uniformity. Therefore, by including all supersets we construct the coarsest uniformity on X that makes f uniformly continuous.
Similarly, there exists a coarsest uniformity 𝒲 on X that makes a family of functions f_α : X → (Y^α, ℰ^α) all uniformly continuous. Moreover, f : (Z, ℱ) → (X, 𝒲) is uniformly continuous iff f_α ∘ f is uniformly continuous for all α.

Definition 1.17: Subspace uniformity

Let A be a subset of the uniform space (X, 𝒟). We call A a (uniform) subspace of X if it is equipped with the coarsest uniformity such that the inclusion map ι : A → X, a ↦ a is uniformly continuous. More precisely, the subspace uniformity on A is (A × A) ∩ 𝒟 := {(A × A) ∩ D : D ∈ 𝒟}. Not surprisingly, the subspace topology on A coincides with the topology induced by the subspace uniformity.

Definition 1.18: Product uniformity

Let (Y^α, ℰ^α) be a collection of uniform spaces, then its product uniform space is defined as the product ∏_α Y^α equipped with the coarsest uniformity such that the projections π_α : ∏_α Y^α → Y^α are uniformly continuous. Again, the product topology coincides with the topology induced by the product uniformity. Moreover, the function f : (X, 𝒟) → (∏_α Y^α, ∏_α ℰ^α) is uniformly continuous iff π_α ∘ f is uniformly continuous for all α.


Definition 1.19: Quotient uniformity

Definition 1.20: Topological embedding

Let f_α : X → X_α be a family of functions and define the evaluation map e : X → ∏_α X_α, with [e(x)]_α = f_α(x). We usually can choose the spaces X_α, and we would like to know when the evaluation map e is a topological embedding.
The functions f_α : X → X_α separate points in X iff for all x ≠ y in X there exists some α such that f_α(x) ≠ f_α(y). This is equivalent to the evaluation map being 1-1.
The functions f_α : X → X_α separate points from closed sets in X iff for all x ∈ X and closed sets A ⊆ X with x ∉ A there exists some α such that f_α(x) ∉ cl f_α(A).

Theorem 1.21: Topological (uniform) embedding

The evaluation map e : X → ∏_α X_α is a topological (uniform) embedding iff the functions f_α : X → X_α separate points and X is equipped with the coarsest topology (uniformity) that makes every f_α (uniformly) continuous.
Proof: We only prove the uniform case. The topological case is completely analogous.
First note that the evaluation map e is 1-1 iff the f_α separate points.
The evaluation map is uniformly continuous iff for all α and D_α ∈ 𝒟_α, the sets

\[ \Big\{(x, y) \in X \times X : (e(x), e(y)) \in \big\{(u, v) \in \textstyle\prod_\alpha X_\alpha \times \prod_\alpha X_\alpha : (u_\alpha, v_\alpha) \in D_\alpha\big\}\Big\}, \]

which after simplification are

\[ \{(x, y) : (f_\alpha(x), f_\alpha(y)) \in D_\alpha\}, \tag{3} \]

generate a uniformity coarser than the uniformity on X.
The evaluation map, being 1-1, has uniformly continuous inverse (when restricted onto its range) iff for all D in the uniformity of X, the sets

\[ \{(u, v) \in \mathrm{Im}(e) \times \mathrm{Im}(e) : (e^{-1}(u), e^{-1}(v)) \in D\} = \{(e(x), e(y)) : (x, y) \in D\} \]

generate a uniformity coarser than the product uniformity on Im(e) × Im(e). The latter has a subbasis as follows:

\[ \{(u, v) \in \mathrm{Im}(e) \times \mathrm{Im}(e) : (u_\alpha, v_\alpha) \in D_\alpha\} = \{(e(x), e(y)) : (f_\alpha(x), f_\alpha(y)) \in D_\alpha\}. \]

Thus, e has uniformly continuous inverse iff the uniformity on X is coarser than the uniformity generated by (3).

Proposition 1.22: Separating points from closed sets

A collection of continuous real-valued functions f_α on a topological space X separates points from closed sets in X iff the sets {f_α^{-1}(V) : V open in X_α}_α form a base for the topology on X.
Proof: ⇐: Let x ∈ X be disjoint from the closed set A ⊆ X. Then x ∈ A^c. Since A^c is open, there exist an α and an open set V in X_α such that x ∈ f_α^{-1}(V) ⊆ A^c. Thus f_α(x) ∈ V and f_α(A) ∩ V = ∅. Since f_α(A) ⊆ V^c and V^c is closed, f_α(x) ∉ V^c ⊇ cl f_α(A).
⇒: Let U be an open set in X and x ∈ U. There exists an α such that f_α(x) ∉ cl f_α(U^c) =: V^c. Thus f_α(x) ∈ V, i.e. x ∈ f_α^{-1}(V), which is open since V is open and f_α is continuous. Since f_α(U^c) ∩ V = ∅, f_α^{-1}(V) ⊆ U. Therefore the sets f_α^{-1}(V) form a base.


It follows that the topology on X is the coarsest topology that makes all fα continuous.

Corollary 1.23: Characterizing completely regular spaces

The topological space X is completely regular iff it is endowed with the weak topology generated by all continuous (and bounded) real-valued functions.
Proof: ⇒: Since X is completely regular, the class of continuous (and bounded) functions separates points from closed sets. From Proposition 1.22 we know X is endowed with the weak topology.
⇐: Let x ∉ A for any closed set A. Then x ∈ A^c, hence x ∈ ⋂_{i=1}^{n} f_i^{-1}(V_i) ⊆ A^c for some open intervals V_i in ℝ. Clearly we can take V_i = (a_i, ∞) or V_i = (−∞, b_i). By changing f_i to −f_i we can assume V_i = (a_i, ∞). By changing f_i to (f_i − a_i)_+ we can assume a_i ≡ 0 and f_i ≥ 0. Let f = ∏_i f_i. Then x ∈ ⋂_i f_i^{-1}(ℝ_{++}) = f^{-1}(ℝ_{++}) ⊆ A^c. Since f(A) = {0}, the continuous function f separates x from A.

Corollary 1.24: Certifying topological embedding

If {f_α} is a collection of continuous real-valued functions on a topological space X that separate points from points and from closed sets, then the evaluation map e : X → ∏_α X_α is an embedding.

If X is T1, then the functions f_α automatically separate points since they separate points from closed sets.

Proposition 1.25: Embedding continuous functions

Let Y be a continuous image of X, then C(Y) ↪ C(X) with the pointwise topology.
Proof: Let t : X → Y be the continuous surjection, and consider the map ψ : C(Y) → C(X), f ↦ f ∘ t. The claim follows from the surjectivity of t: ψ is 1-1, and f_α → f iff f_α ∘ t → f ∘ t.

Frequently Y will be a (topological) subspace of X.

Theorem 1.26: Metrization Lemma

Let {D_n} be a decreasing sequence of sets in X × X such that D_0 = X × X, (x, x) ∈ D_n, and D_{n+1} ∘ D_{n+1} ∘ D_{n+1} ⊆ D_n for all n and x ∈ X. Then there is a function d : X × X → ℝ_+ such that

(I). d(x, x) = 0 for all x ∈ X;

(II). d(x, y) + d(y, z) ≥ d(x, z) for all x, y, z ∈ X;

(III). D_n ⊆ {(x, y) : d(x, y) < 2^{-n}} ⊆ D_{n−1} for all n ≥ 1.

If each D_n is symmetric, then d can be chosen symmetric too.
Proof: Since D_0 = X × X the following function is well-defined on X × X:

\[
f(x, y) =
\begin{cases}
2^{-n}, & (x, y) \in D_{n-1} \setminus D_n \\
0, & (x, y) \in \bigcap_n D_n.
\end{cases}
\tag{4}
\]

We take the chain approximation of f:

\[
d(x, y) = \inf\Big\{ \sum_{i=0}^{n} f(x_i, x_{i+1}) : x_0 = x,\ x_{n+1} = y,\ n \in \mathbb{N} \Big\}.
\tag{5}
\]

Note that d : X × X → [0, 1/2]. As (x, x) ∈ ∩_n D_n, d(x, x) = 0 for all x. Clearly d satisfies the triangle inequality, and it is symmetric if each D_n is so.


By construction d ≤ f, hence D_n ⊆ {f < 2^{-n}} ⊆ {d < 2^{-n}}. For the other direction we first use induction on n to prove that

\[
\forall x_0, \ldots, x_{n+1} \in X,\qquad f(x_0, x_{n+1}) \le 2 \sum_{i=0}^{n} f(x_i, x_{i+1}).
\tag{6}
\]

Indeed, let a = ∑_{i=0}^{n} f(x_i, x_{i+1}). If a = 0, then every (x_i, x_{i+1}) ∈ ∩_m D_m, hence using D_{m+1} ∘ D_{m+1} ∘ D_{m+1} ⊆ D_m we know (x_0, x_{n+1}) ∈ ∩_m D_m, i.e. the claim holds. Since f ≤ 1/2, the claim also holds trivially when a ≥ 2^{-2}, so we need only consider 2^{-2} > a > 0 in the following. Find the largest k such that ∑_{i=0}^{k−1} f(x_i, x_{i+1}) ≤ a/2 (the empty sum for k = 0 is zero, so k exists), and note that k ≤ n since a > 0. By maximality ∑_{i=0}^{k} f(x_i, x_{i+1}) > a/2, hence ∑_{i=k+1}^{n} f(x_i, x_{i+1}) ≤ a/2. By the induction hypothesis we thus have f(x_0, x_k) ≤ 2 ∑_{i=0}^{k−1} f(x_i, x_{i+1}) ≤ a, f(x_{k+1}, x_{n+1}) ≤ a, and of course f(x_k, x_{k+1}) ≤ a. Let m be the smallest integer such that 2^{-m} ≤ a. Since 2^{-2} > a we know m ≥ 3. As f only takes values in {0} ∪ {2^{-j} : j ≥ 1}, the three values above are at most 2^{-m}, so (x_0, x_k), (x_k, x_{k+1}), (x_{k+1}, x_{n+1}) ∈ D_{m−1}. Using D_{m−1} ∘ D_{m−1} ∘ D_{m−1} ⊆ D_{m−2} we know (x_0, x_{n+1}) ∈ D_{m−2}, i.e. f(x_0, x_{n+1}) ≤ 2^{-m+1} = 2 · 2^{-m} ≤ 2a. The induction is now complete.

Using (6) we know d(x, y) < 2^{-n} ⟹ f(x, y) < 2^{-n+1} ⟹ (x, y) ∈ D_{n−1}.
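On a finite set the chain approximation (5) is an all-pairs shortest-path problem, so the construction can be tried out directly. The sketch below is only an illustration under stated assumptions: X is a grid of rationals in [0, 1] and D_n := {(x, y) : |x − y| ≤ 3^{-n}}, which satisfies D_0 = X × X and the triple-composition condition of Theorem 1.26; these choices and all function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def in_D(n, x, y):
    """Membership in D_n = {(x, y) : |x - y| <= 3^{-n}} (assumed example)."""
    return abs(x - y) <= Fraction(1, 3 ** n)


def f(x, y):
    """Equation (4): 2^{-n} on D_{n-1} minus D_n, and 0 on every D_n."""
    if x == y:                      # here the intersection of all D_n is the diagonal
        return Fraction(0)
    n = 1
    while in_D(n, x, y):            # find the first n with (x, y) not in D_n
        n += 1
    return Fraction(1, 2 ** n)


def chain_pseudometric(X):
    """Equation (5) on finite X via Floyd-Warshall shortest paths."""
    d = {(x, y): f(x, y) for x, y in product(X, X)}
    for z in X:                     # allow chains passing through z
        for x in X:
            for y in X:
                if d[x, z] + d[z, y] < d[x, y]:
                    d[x, y] = d[x, z] + d[z, y]
    return d


X = [Fraction(k, 20) for k in range(21)]
d = chain_pseudometric(X)

# Property (III): D_n ⊆ {d < 2^{-n}} ⊆ D_{n-1}.
holds = all(
    (not in_D(n, x, y) or d[x, y] < Fraction(1, 2 ** n))
    and (not d[x, y] < Fraction(1, 2 ** n) or in_D(n - 1, x, y))
    for n in range(1, 6) for x, y in product(X, X)
)
print(holds)
```

Exact rationals avoid any floating-point ambiguity at the thresholds 3^{-n} and 2^{-n}.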

Corollary 1.27: Metrization

A uniform space is pseudo-metrizable iff its uniformity has a countable base.
Proof: Using induction we can extract from the countable base a decreasing sequence {D_n} that satisfies the conditions in Theorem 1.26, therefore there is a pseudo-metric whose pseudo-uniformity is equivalent to the original uniformity (see (III) in Theorem 1.26). The converse is clear from the pseudo-metric uniformity.

Theorem 1.28: Characterizing pseudo-metric uniformity

Let (X, 𝒟) be a uniform space and d : X × X → ℝ be a pseudo-metric on X. Then d is uniformly continuous (w.r.t. the product uniformity) iff 𝒟 is finer than the pseudo-uniformity generated by d.

Proof: A moment’s reflection convinces us that d is uniformly continuous iff for all r > 0 there exists some D ∈ 𝒟 such that

\[
\{((x, y), (u, v)) : (x, u) \in D \text{ and } (y, v) \in D\} \subseteq \{((x, y), (u, v)) : |d(x, y) - d(u, v)| < r\}.
\]

Taking the restriction u = v = y we know {((x, y), (y, y)) : (x, y) ∈ D} = LHS ⊆ RHS. Thus, if d is uniformly continuous, for all r > 0 there exists some D ∈ 𝒟 such that D ⊆ {(x, y) : d(x, y) < r}.
Conversely, for any r > 0, find D ⊆ {(x, y) : d(x, y) < r/2}. Then, if (x, u) ∈ D and (y, v) ∈ D,

\[ d(x, y) - d(u, v) \le d(x, u) + d(u, y) - d(u, v) \le d(x, u) + d(y, v) < r, \]

and similarly d(u, v) − d(x, y) < r. So d is uniformly continuous, and the proof is complete.

As a consequence, the pseudo-uniformity generated by a pseudo-metric is the coarsest uniformity that makes the pseudo-metric uniformly continuous (w.r.t. the product uniformity).
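The key estimate in the converse direction is |d(x, y) − d(u, v)| ≤ d(x, u) + d(y, v), which follows from two applications of the triangle inequality. A quick numerical spot-check, assuming the Euclidean metric on the plane as an illustrative stand-in for d (the function names are hypothetical):

```python
import math
import random


def d(p, q):
    """Euclidean metric on the plane (illustrative choice of pseudo-metric)."""
    return math.dist(p, q)


def max_violation(trials=10_000, seed=0):
    """Largest observed value of |d(x,y) - d(u,v)| - (d(x,u) + d(y,v)), floored at 0."""
    rng = random.Random(seed)

    def point():
        return (rng.uniform(-1, 1), rng.uniform(-1, 1))

    worst = 0.0
    for _ in range(trials):
        x, y, u, v = point(), point(), point(), point()
        gap = abs(d(x, y) - d(u, v)) - (d(x, u) + d(y, v))
        worst = max(worst, gap)
    return worst


print(max_violation() <= 1e-12)
```

A small tolerance is used because floating-point rounding can make the gap marginally positive even though the inequality holds exactly.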

Corollary 1.29: Relating uniformity to pseudo-metric

Every uniformity is the coarsest uniformity that makes a family of pseudo-metrics uniformly continuous (w.r.t. the product uniformity).
Proof: By Theorem 1.26 every countable family from the uniformity on X corresponds to a uniformly continuous pseudo-metric. The coarsest claim follows from Theorem 1.28.


Theorem 1.30: Uniform topology through pseudo-metrics

Let (X, 𝒟) be a uniform space whose uniformity is generated by a family of pseudo-metrics d_κ. Then for all A ⊆ X, cl A = ⋂_κ [cl_κ A := {x ∈ X : d_κ(x, A) = 0}]. Moreover, the net x_α → x iff for all κ, d_κ(x_α, x) → 0.
Proof: We need only note that the balls {{y ∈ X : d_κ(x, y) < r}}_{κ, x∈X, r>0} constitute a subbasis of the topology on X.

We also note that the function f : (Z, ℱ) → (X, 𝒟) is uniformly continuous iff for all κ, the pseudo-metric (x, y) ↦ d_κ(f(x), f(y)) is uniformly continuous on Z × Z.

Definition 1.31: Cauchy net (in completely regular space)

A net {x_γ : γ ∈ Γ} in a uniform space (X, 𝒟) is called Cauchy iff for all D ∈ 𝒟 there exists some γ ∈ Γ such that α, β ≥ γ ⟹ (x_α, x_β) ∈ D.

Equivalently, if the uniformity is generated by the family of pseudo-metrics d_κ, then the net is Cauchy iff for all κ, d_κ(x_α, x_β) → 0.
We want to emphasize that the notion of Cauchy net is well-defined in any uniform space, or equivalently any completely regular space.

Proposition 1.32: Uniformly continuous functions preserve Cauchy nets

Let f : (X, 𝒟) → (Y, ℰ) be a uniformly continuous function, then x_α is Cauchy in X ⟹ f(x_α) is Cauchy in Y.

Proposition 1.33: Convergence and Cauchy

A convergent net is Cauchy, and a Cauchy net converges to each of its cluster points.
Proof: Let the pseudo-metrics d_κ generate the uniformity. If the net x_α → x, then d_κ(x_α, x_β) ≤ d_κ(x_α, x) + d_κ(x, x_β) → 0, i.e., the convergent net x_α is Cauchy.
Let x be a cluster point of the Cauchy net x_α, i.e. for all κ, lim inf_α d_κ(x_α, x) = 0. But for all ε > 0, lim sup_α d_κ(x_α, x) ≤ lim sup_α d_κ(x_α, x_β) + d_κ(x_β, x) ≤ ε, by choosing β smartly.

The catch here is that a Cauchy net need not have a cluster point, and when it always does, we’d better be very serious about the underlying space.

Definition 1.34: Complete uniform space

A uniform space is complete iff every Cauchy net in it has at least one cluster point, or equivalently iff every Cauchy net converges to some point.

Proposition 1.35: Subspace of complete space

A closed subspace of a complete uniform space is complete. Conversely, a complete subspace of a Hausdorff uniform space is closed.


Proposition 1.36: Sufficient condition for compacta to be uncountable

Let X be a locally compact Hausdorff space. If for all x ∈ X, {x} is not open, then |X| > ℵ0.
Proof: Using the one-point compactification we can assume w.l.o.g. that X is compact. The one-point sets {x} remain non-open.
Take any nonempty open set U in X. Fix x. Since {x} is not open we can choose y ∈ U, y ≠ x. Because X is Hausdorff there exist open sets V ∋ y, W ∋ x, V ∩ W = ∅, implying that y ∈ cl V ∌ x. Of course we can make V ⊆ U by intersecting it with the latter.
Now let f : ℕ → X be any function. We prove that f cannot be surjective. Indeed, let x_n = f(n), n = 1, 2, . . . , and we construct nonempty open sets V_n ⊆ V_{n−1} such that x_n ∉ cl V_n, where V_0 = X. Since X is compact, there exists x ∈ ∩_n cl V_n, but x ≠ x_n for all n.

Clearly, if every point x in X is a limit point, then {x} is not open.

Definition 1.37: Topological group

Recall that a semi-group is a set G on which we can define an associative multiplication operator · : G × G → G. A group is a semi-group that has an identity so that we can also define the inverse operator ^{-1} : G → G.

A topological group is a group equipped with a topology so that the multiplication and the inverse are continuous, or equivalently the map (x, y) ↦ xy^{-1} is continuous.
Clearly, the maps x ↦ x^{-1}, x ↦ yx, x ↦ xz, x ↦ yxz are homeomorphisms of G.

Proposition 1.38: Homogeneity of topological groups

𝒩_e is an nhood basis at the identity e of the group G iff x𝒩_e or 𝒩_e x is an nhood basis at any element x ∈ G.

Therefore a topological group is locally compact (connected, path connected) if it is locally compact (connected, path connected) at the identity (or any other element).

Theorem 1.39: Continuity of group homomorphism

The group homomorphism φ : G → H between two topological groups G and H is continuous iff it is continuous at a single point x ∈ G.

2 Convex Functions

Let our domain be some vector space X. Instead of always assuming the field to be real, we note that a complex vector space can always be treated as a real vector space: we simply “forget” the multiplication with complex numbers. Thus, we will use X_ℝ to denote this “realization”. Be reminded that X_ℝ is the same space (of points) as X and they share the same topology and vector addition. The only difference is that the scalar multiplication in X_ℝ is the restriction of that of X to real scalars. By definition many algebraic properties, such as convexity below, are the same in X or X_ℝ.

Definition 2.1: Convex set

A point set C ⊆ X is called convex if

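\[ \forall \mathbf{x}, \mathbf{y} \in C,\quad [\mathbf{x}, \mathbf{y}] := \{\lambda \mathbf{x} + (1 - \lambda)\mathbf{y} : \lambda \in [0, 1]\} \subseteq C. \tag{7} \]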

Proposition 2.2: Intersection and union of convex sets

Arbitrary intersections and increasing unions of convex sets are convex.
Thus, lim inf_α C_α = ∪_α ∩_{β≥α} C_β is convex.

Clearly, an arbitrary union or lim sup_α C_α = ∩_α ∪_{β≥α} C_β may not be convex.

Definition 2.3: Convex function, Jensen [1905] (in Danish)

The extended real-valued function f : X → (−∞, ∞] is called convex if

\[ \forall \mathbf{x}, \mathbf{y},\ \forall \lambda \in (0, 1),\quad f(\lambda \mathbf{x} + (1 - \lambda)\mathbf{y}) \le \lambda f(\mathbf{x}) + (1 - \lambda) f(\mathbf{y}). \tag{8} \]

It is necessary that the (effective) domain of f, i.e. dom f := {x ∈ X : f(x) < ∞}, is a convex set.
We call f strictly convex iff the equality in (8) holds only when x = y.
According to Wikipedia (https://en.wikipedia.org/wiki/Johan_Jensen_(mathematician)), Jensen (Danish) never held any academic position and proved his mathematical results in his spare time.

In the above definitions we have used the fact that X is a vector space, so that we can add vectors and multiply them with (real) scalars. It is quite remarkable that such a simple definition leads to a huge body of interesting results, a tiny part of which we shall be able to present below.
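Inequality (8) can be spot-checked on a grid of sample points. The sketch below is only a finite sanity check, not a proof of convexity; the function name `satisfies_jensen`, the grid on the real line, and the sample functions are illustrative assumptions.

```python
from fractions import Fraction


def satisfies_jensen(f, points, lambdas):
    """True iff inequality (8) holds for every sampled x, y and λ."""
    return all(
        f(lam * x + (1 - lam) * y) <= lam * f(x) + (1 - lam) * f(y)
        for x in points for y in points for lam in lambdas
    )


# Exact rational samples avoid floating-point noise in the comparison.
points = [Fraction(k, 4) for k in range(-8, 9)]       # grid on [-2, 2]
lambdas = [Fraction(k, 10) for k in range(1, 10)]     # λ in (0, 1)

print(satisfies_jensen(lambda t: t * t, points, lambdas))    # convex
print(satisfies_jensen(abs, points, lambdas))                # convex
print(satisfies_jensen(lambda t: -t * t, points, lambdas))   # concave, fails (8)
```

A failed check certifies non-convexity, while a passed check only says that no violation was found among the samples.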

Proposition 2.4: Distributive law of convex sets

Let C, D ⊆ X. Then for all α, β ≥ 0,

\[ \alpha(C + D) = \alpha C + \alpha D,\qquad (\alpha + \beta) C \subseteq \alpha C + \beta C. \tag{9} \]

The latter also holds as equality when C is convex.
Proof: We need only consider the case when C is convex and α + β > 0 (the case α = β = 0 being trivial). Let x, y ∈ C, then (α/(α+β)) x + (β/(α+β)) y ∈ C too. Hence, αx + βy ∈ (α + β)C.

Letting C = {1, −1} ⊆ ℝ shows that in general the equality may not hold for nonconvex sets.
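The nonconvex example is small enough to compute by hand or by machine. A minimal sketch with finite subsets of ℝ, where `scale` and `minkowski_sum` are hypothetical helper names and the second set D_set is an arbitrary illustrative choice:

```python
def scale(a, C):
    """aC = {a * c : c in C}."""
    return {a * c for c in C}


def minkowski_sum(C, D):
    """C + D = {c + d : c in C, d in D}."""
    return {c + d for c in C for d in D}


C = {1, -1}
alpha, beta = 1, 1

lhs = scale(alpha + beta, C)                                 # (α + β)C
rhs = minkowski_sum(scale(alpha, C), scale(beta, C))         # αC + βC
print(sorted(lhs), sorted(rhs))

# The first identity in (9) holds for arbitrary sets.
D_set = {0, 3}
same = scale(2, minkowski_sum(C, D_set)) == minkowski_sum(scale(2, C), scale(2, D_set))
print(same)
```

Here the right-hand side contains 0 = 1 + (−1), which is not a multiple of an element of C, so the inclusion in (9) is strict.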

Definition 2.5: Subadditive function

An extended real-valued function f : X → (−∞, ∞] is subadditive if for all x, y

\[ f(\mathbf{x} + \mathbf{y}) \le f(\mathbf{x}) + f(\mathbf{y}). \tag{10} \]

For a subadditive function f, if 0 ∈ int(dom f), then f is continuous on int(dom f) iff it is continuous at the origin: By subadditivity

\[ -f(\mathbf{x} - \mathbf{y}) \le f(\mathbf{y}) - f(\mathbf{x}) \le f(\mathbf{y} - \mathbf{x}). \]

Note that we always have f(x) + f(−x) ≥ 0. If 0 ∈ dom(f), then necessarily ∞ > f(0) ≥ 0.

We are now ready for our first important class of convex functions.

Definition 2.6: Sublinear function, always convex

A subadditive function p that is also positive homogeneous (i.e. p(tx) = tp(x) for all t ≥ 0 and x ∈ dom p) is called sublinear. If 0 ∈ dom(p), then necessarily p(0) = 0.

Theorem 2.7: Cauchy’s functional equation

The solution to Cauchy’s functional equation

∀x, y ∈ R, f(x+ y) = f(x) + f(y),

if not linear, must have dense graph.
Proof: Using induction we know for all r ∈ ℚ and x, f(rx) = rf(x). Let λ = f(1) and suppose for some z we have f(z) = λz + δ for some δ ≠ 0. Fix any rational ball, i.e. the ball with rational center (p, q) and rational radius r > 0. Let x = p + s(z − t) for some rational s and t. We have

\[ y = f(x) = f(p + s(z - t)) = \lambda p + s\delta + \lambda s (z - t). \]

Certainly we can choose s and then t so that ‖(x, y) − (p, q)‖ < r, i.e., the graph of f meets any rational ball.

Thus, if f is additionally continuous, or bounded on any interval, or monotone on any interval, it has to be linear.
To get a solution that is not linear: take a Hamel basis of the vector space ℝ over the field ℚ, assign arbitrary values there, and extend to all of ℝ using linearity.

Proposition 2.8: Determining linearity

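A sublinear function p is linear (on its domain) iff for all x ∈ dom p, p(λx) = λp(x) for all |λ| = 1.
Proof: Indeed, using subadditivity and homogeneity

\[ p\Big(\sum_i \alpha_i \mathbf{x}_i\Big) \le \sum_i p(\alpha_i \mathbf{x}_i) = \sum_i \alpha_i p(\mathbf{x}_i) \tag{11} \]

\[ -p\Big(\sum_i \alpha_i \mathbf{x}_i\Big) = p\Big(-\sum_i \alpha_i \mathbf{x}_i\Big) \le \sum_i -\alpha_i p(\mathbf{x}_i). \tag{12} \]

Thus we have p(∑_i α_i x_i) = ∑_i α_i p(x_i).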

Definition 2.9: Balanced set

A set A ⊆ X is balanced if for all |λ| ≤ 1, λA ⊆ A. It is star-shaped (at the origin) if for all λ ∈ [0, 1], λA ⊆ A. Clearly a star-shaped set contains the origin and a balanced set is star-shaped and symmetric. More generally, A is star-shaped at x if A − x is star-shaped. We easily verify that a set is convex iff it is star-shaped at any of its points.

Every (convex) nhood of the origin contains a (convex) balanced (open) nhood: The multiplication (λ, x) ↦ λx is continuous at (0, 0), hence for every nhood V there exist δ > 0 and an (open) nhood U such that W := ⋃_{|λ|≤δ} λU ⊆ V is an (open) balanced nhood. Additionally, if V is convex, conv W ⊆ V is a convex balanced nhood.
The union, intersection, convex hull, and closure of balanced (star-shaped) sets are balanced (star-shaped). So we can define the balanced hull of a set, i.e., the smallest balanced superset

\[ \mathrm{bh}(A) = \bigcup_{|\lambda| \le 1} \lambda A. \tag{13} \]

A set is called absolutely convex iff it is convex and balanced. Equivalently, for all |α| + |β| ≤ 1, αA + βA ⊆ A. We can define the absolute convex hull of a set:

\[ \mathrm{absconv}(A) := \Big\{ \sum_{i=1}^{n} \lambda_i \mathbf{a}_i : n \in \mathbb{N},\ \mathbf{a}_i \in A,\ \sum_{i=1}^{n} |\lambda_i| \le 1 \Big\} = \mathrm{conv}(\mathrm{bh} A) \supseteq \mathrm{bh}(\mathrm{conv} A). \tag{14} \]

Definition 2.10: Absorbing set

A set A ⊆ X is absorbing if for all x ∈ X there exists some r ≥ 0 such that for all |λ| ≥ r we have x ∈ λA. It is weakly absorbing if ∪_{t≥0} tA = X. (Think of say {−1, 1} in ℝ.) For real vector spaces, the two notions coincide for star-shaped sets.
An absolutely convex set is absorbing in its linear hull (not so for a star-shaped balanced set, think of the two axes in ℝ²).
The superset and finite intersection of absorbing sets are absorbing.
Every nhood (of the origin) is absorbing: The multiplication (λ, x) ↦ λx is continuous at (0, x), hence for every nhood V there exist some δ > 0 and nhood U such that λ(x + U) ⊆ V for all |λ| ≤ δ, in particular, x ∈ (1/λ)V.

Definition 2.11: Core of a set

For a set A ⊆ X, its core is defined as:

\[ \operatorname{core} A := \{\mathbf{x} \in A : A - \mathbf{x} \text{ is absorbing}\} = \{\mathbf{x} \in A : \forall \mathbf{d} \in X,\ \exists t > 0 \text{ s.t. } \mathbf{x} + B(t)\mathbf{d} \subseteq A\}, \tag{15} \]

where B(r) is the centered ball of radius r of the field. Sometimes we only need the core in the sense of the real field, i.e. the core in X_R, hence we also define

\[ \operatorname{rcore} A := \{\mathbf{x} \in A : \forall \mathbf{d} \in X,\ \exists t > 0 \text{ s.t. } [\mathbf{x}, \mathbf{x} + t\mathbf{d}] \subseteq A\}. \tag{16} \]

Clearly, int A ⊆ core A ⊆ rcore A ⊆ A, but note that unlike the interior, the definition of the core does not require a topology on X.

Note that if A is (mid-point) convex then core A = rcore A: for any d ∈ X, there exist s, t > 0 such that [x − 2td, x + 2td] ⊆ A and [x − 2sid, x + 2sid] ⊆ A, hence [x − (t + si)d, x + (t + si)d] ⊆ A. However, in general core A ⊂ rcore A: rotate in the plane and shrink the radius to 0.

Definition 2.12: Positive homogeneous (p.h.) function

The function p : X → (−∞, ∞] is p.h. iff for all λ > 0 and x ∈ X, p(λx) = λp(x).

P.h. functions enjoy the following properties (see Definition 2.14 for the gauge pB):

• p(0) ∈ {0, ∞}.

• If p ≥ 0, then p = pB for any {x : p(x) < 1} ⊆ B ⊆ {x : p(x) ≤ 1}. Thus gauges exhaust all nonnegative p.h. functions.

• {x : p(x) ≤ λ} = λ{x : p(x) ≤ 1} for all λ > 0. Similarly if we use strict inequality. This property in fact characterizes nonnegative p.h. functions, since any extended real-valued function is completely determined by its sublevel sets:

\[ f(\mathbf{x}) = \inf\{\lambda : \mathbf{x} \in A_\lambda\}, \quad \text{where } A_\lambda := \{\mathbf{x} : f(\mathbf{x}) \le \lambda\}. \tag{17} \]

• p is finite-valued iff the open “unit ball” Bp := {x : p(x) < 1} is real absorbing (absorbing w.r.t. the real field) iff, provided p ≥ 0, it is a gauge of a real absorbing set. Thus gauges of real absorbing sets exhaust all nonnegative finite-valued p.h. functions.

• If p ≥ 0, p is sublinear (equivalently subadditive or convex) iff the (open) closed unit ball Bp := {x : p(x) ≤ 1} is convex iff it is a gauge of a convex set. Thus gauges of (real absorbing) convex sets exhaust all nonnegative (finite-valued) sublinear functions.

• If p ≥ 0, p is symmetric iff its (open) closed ball is balanced iff it is a gauge of a balanced set. Thus gauges of balanced sets exhaust all symmetric nonnegative p.h. functions.

• p is (upper) lower semicontinuous iff its (open) closed unit ball Bp is (open) closed iff, provided p ≥ 0, it is a gauge of a (open) closed star-shaped set.

• p is continuous at the origin iff it is bounded on an nhood iff, provided p ≥ 0, it is a gauge of a (star-shaped) nhood (hence finite-valued). In particular, p.h. functions continuous at the origin map bounded sets into bounded intervals, and the converse holds if there exists a bounded nhood (such as in any normable space). Similarly, a nonnegative sublinear function is continuous iff it is a gauge of a convex nhood. Thus gauges of convex nhoods exhaust all nonnegative continuous sublinear functions.

We mention three important positive homogeneous functions.

Definition 2.13: Seminorm

A nonnegative finite-valued sublinear function p is called a seminorm if for all λ and x, p(λx) = |λ|p(x), i.e., the open (or closed) unit ball Bp is balanced (and convex and absorbing). Thus gauges of absolutely convex absorbing sets exhaust all seminorms.

A seminorm is continuous iff its open unit ball is an nhood. Thus gauges of absolutely convex nhoods exhaust all continuous seminorms.

A seminorm is called a norm iff p(x) = 0 ⟺ x = 0, i.e., its unit ball is bounded on each ray (or simply bounded on any normable space).

Definition 2.14: Minkowski’s gauge function

For any set A ⊆ X we associate the extended nonnegative-valued function:

\[ p_A(\mathbf{x}) := \inf\{\lambda \ge 0 : \mathbf{x} \in \lambda A\}. \tag{18} \]

• pA is always positive homogeneous, and p_{γA} = (1/γ) pA for all γ > 0 (bigger ball, smaller gauge).

• pA(0) = 0 ⇐⇒ 0 ∈ A.

• A ∩ R+x ⊆ B ∩ R+x ⟹ pA(x) ≥ pB(x); in particular, p_{cl A} ≤ pA. Conversely, if B is star-shaped (on R+x), then pA(x) > pB(x) ⟹ A ∩ R+x ⊆ B ∩ R+x.

• p_{A∩B} ≥ pA ∨ pB, with equality if A, B are star-shaped. (If pA(x) = pB(x) then A ∩ R+x = B ∩ R+x except one of them may not include the endpoint.)

• pA∪B = pA ∧ pB .

• p_{A+B} ≤ (pA^{−1} + pB^{−1})^{−1}.

• A is symmetric, i.e. λx ∈ A for all x ∈ A and |λ| = 1, ⟹ pA is symmetric, i.e. pA(λx) = pA(x) for all x and |λ| = 1. For a symmetric set A, pA = p_{bh(A)}. The converse does not hold: take A to be an arbitrary unbounded (on both ends) subset of a line Rx, then pA(x) = pA(−x) ≡ 0.

• A is convex ⟹ pA is subadditive: an immediate consequence of Proposition 2.4. The converse is not true: take A = {−1, 0, 1}, then pA(x) = |x| is subadditive.

• p_{conv A} ≤ pA, and the inequality can be strict: take a small ball in, say, R² and put a big triangle around it. Let A be the union of the ball and the three vertices.
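Several of the gauge identities above can be checked numerically. The following is a minimal sketch, not part of the original note: the helper `gauge` and the membership oracles `disk`, `square`, and `half_ball` are hypothetical names introduced for illustration, and the bisection assumes that the set is star-shaped at the origin so that {λ > 0 : x ∈ λA} is an up-closed interval.

```python
import math

def gauge(contains, x, hi=1e6, tol=1e-10):
    """Approximate p_A(x) = inf{lam >= 0 : x in lam*A} by bisection.

    `contains` is a membership oracle for A (an assumption of this sketch).
    Bisection is only valid when A is star-shaped at the origin.
    Returns math.inf if x is not even in hi*A.
    """
    if all(c == 0 for c in x):
        # Matches the convention p_A(0) = 0 iff 0 in A.
        return 0.0 if contains(x) else math.inf

    def in_scaled(lam):
        # x in lam*A  iff  x/lam in A  (lam > 0)
        return contains(tuple(c / lam for c in x))

    if not in_scaled(hi):
        return math.inf
    lo = 0.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if in_scaled(mid):
            hi = mid
        else:
            lo = mid
    return hi

def disk(z):
    """Closed Euclidean unit disk in R^2."""
    return z[0] ** 2 + z[1] ** 2 <= 1.0

def square(z):
    """Closed unit ball of the max-norm in R^2."""
    return max(abs(z[0]), abs(z[1])) <= 1.0

def half_ball(z):
    """Star-shaped, non-convex 'l_p ball' with p = 1/2."""
    return math.sqrt(abs(z[0])) + math.sqrt(abs(z[1])) <= 1.0
```

For instance, `gauge(disk, (3.0, 4.0))` approximates the Euclidean norm of the point, and passing `lambda z: disk(z) and square(z)` or `lambda z: disk(z) or square(z)` illustrates the intersection and union rules for star-shaped sets.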

Definition 2.15: Support function

A dual notion of the gauge is the support function defined on the dual space X∗:

\[ \sigma_A(\mathbf{x}^*) := \sup_{\mathbf{x} \in A} \langle \mathbf{x}; \mathbf{x}^* \rangle. \tag{19} \]

The following properties are clear:

• σA is sublinear and σA(0) = 0;

• σA+B = σA + σB , σA∪B = σA ∨ σB .

• A ⊆ B =⇒ σA ≤ σB .


• σA = σconvA.

The closed unit ball of σA will be called the polar of A, and denoted as A°. Note that A° is always convex and 0 ∈ A°. Moreover, A° is absorbing iff σA is finite-valued.
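For a finite set A in R^d the supremum in the support function is a maximum, so the listed properties are easy to test. This is a small illustrative sketch under that finiteness assumption; the function names `support`, `minkowski_sum`, and `in_polar` are hypothetical and not from the note.

```python
import math

def support(points, w):
    """sigma_A(w) = max over a in A of <a, w>, for a finite set A in R^d."""
    return max(sum(ai * wi for ai, wi in zip(a, w)) for a in points)

def minkowski_sum(A, B):
    """All pairwise sums a + b with a in A and b in B."""
    return [tuple(ai + bi for ai, bi in zip(a, b)) for a in A for b in B]

def in_polar(points, w):
    """Membership in the polar, i.e. the closed unit ball of sigma_A."""
    return support(points, w) <= 1.0

A = [(1.0, 0.0), (0.0, 1.0), (-1.0, -1.0)]
B = [(2.0, 0.0), (0.0, -1.0)]
w = (0.3, -0.7)

# sigma_{A+B} = sigma_A + sigma_B
sum_rule = math.isclose(support(minkowski_sum(A, B), w),
                        support(A, w) + support(B, w))
# sigma_{A u B} = max(sigma_A, sigma_B); list concatenation models the union
union_rule = support(A + B, w) == max(support(A, w), support(B, w))
# Adding a point of conv A does not change the support function
hull_rule = math.isclose(support(A + [(0.5, 0.5)], w), support(A, w))
```

Adding the midpoint (0.5, 0.5) of two points of A is only a spot check of σA = σconvA, not a proof.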

Definition 2.16: Ray open / closed

We call a set A ray open/closed if for all x, R+x ∩ A is open/closed w.r.t. the topology inherited from R+x. Clearly, if A is open/closed (w.r.t. some topology of X) then it is ray open/closed.

Proposition 2.17: Reconstruct the set from gauge

Let A ⊆ X and pA(·) its gauge function.

• We always have A ⊆ {x : pA(x) ≤ 1};

• If A is star-shaped, then {x : pA(x) < 1} ⊆ A;

• If A is ray open, then A ⊆ {x : pA(x) < 1};

• If A is ray closed, then {x : pA(x) = 1} ⊆ A;

• If A is star-shaped, then A = {x : pA(x) < 1} if A is ray open, and A = {x : pA(x) ≤ 1} if A is ray closed.

Proof: We only prove the last four items.

If pA(x) < 1, then either x = 0 or x ∈ λA for some 0 < λ < 1, namely x/λ ∈ A. In either case x ∈ A if A is star-shaped.

If A is ray open, then for each x ∈ A there exists some δ > 0 such that (1 + δ)x ∈ A, i.e., pA(x) ≤ 1/(1 + δ) < 1.

If pA(x) = 1, then there exists λn → 1 such that x/λn ∈ A. If A is ray closed, then x ∈ A.

Finally, we note that a star-shaped set satisfies A = ⋃_{λ∈[0,1]} λA, and {x : pA(x) = 0} ⊆ A.

Proposition 2.18: Interior/closure preserve convexity

For any star-shaped (convex) set A, core A, rcore A, int A and cl A are star-shaped (convex).

Proof: Let A be convex and x, y ∈ core A. Fix any λ ∈ (0, 1). For every direction d, there exist t, s > 0 such that x + B(t)d ⊆ A and y + B(s)d ⊆ A, where B(r) is the centered ball of radius r of the underlying field. Then λx + (1 − λ)y + B(λt)d ⊆ λ[x + B(t)d] + (1 − λ)[y + B(s)d] ⊆ A, i.e. λx + (1 − λ)y ∈ core A.

Let A be convex and x, y ∈ int A, i.e. x + V ⊆ A and y + W ⊆ A for some nhoods V, W. Taking the nhood U = V ∩ W, we know x + U ⊆ A and y + U ⊆ A. Using convexity we know λx + (1 − λ)y + U ⊆ λx + (1 − λ)y + λU + (1 − λ)U = λ(x + U) + (1 − λ)(y + U) ⊆ A. Thus [x, y] ⊆ int A.

Let A be convex and x, y ∈ cl A. Then there exist nets xα → x, yβ → y in A. But then A ∋ λxα + (1 − λ)yβ → λx + (1 − λ)y, i.e. [x, y] ⊆ cl A.

The star-shaped case is similar and omitted.

In fact, for the star-shaped case we can interpret the interior and closure as taken on each ray. This avoids putting any topology on X.

Note that core(core A) ⊆ core A, with equality if A is convex: for any d, e and x ∈ core A, there exists t > 0 such that x + B(2t)d, x + B(2t)e ⊆ A, hence x + B(t)d + B(t)e ⊆ A, i.e. x ∈ core core A. The equality can fail in general, for instance when core A is a singleton. Similarly for the real core.


Proposition 2.19: Topological cancellations of convex sets

If x ∈ cl C and y ∈ int C for a convex set C, then (x, y] := {λx + (1 − λ)y : λ ∈ [0, 1)} ⊆ int C. Moreover, int cl C = int C and cl int C = cl C, provided that int C ≠ ∅.

Proof: Since y ∈ int C, there exists an nhood V such that y + V ⊆ C. Then λx + (1 − λ)y + (1 − λ)V = λx + (1 − λ)(y + V) ⊆ C. As λ < 1, (1 − λ)V is also an nhood. The proof for (x, y] ⊆ int C is now complete.

For the second claim let x ∈ cl C and y ∈ int C (whose existence is assumed), then we can choose points in (x, y] to converge to x. This proves cl int C = cl C. On the other hand, let x ∈ int cl C and y ∈ int C ⊆ int cl C. For any λ ∈ (0, 1) we have x − λ(x − y) = (1 − λ)x + λy ∈ int C, while for λ sufficiently small we have x + λ(x − y) ∈ int cl C ⊆ cl C. Thus x = ½(x − λ(x − y)) + ½(x + λ(x − y)) ∈ int C. This proves int cl C = int C.

We can of course replace the interior with the relative interior (and replace the closure with the relative closure, too).

In a finite dimensional space we always have ri C ≠ ∅ for a nonempty convex set C: there exist affinely independent vectors x1, . . . , xd+1 in C, but their convex hull is a non-degenerate simplex hence contains an interior point. It also follows that a (nonempty) convex set in a finite dimensional space is closed iff it is ray closed.

For infinite dimensional spaces, ri C can be empty and the second claim above may not hold: Take for instance a (real) infinite dimensional Fréchet space X (locally convex complete metrizable TVS) and a discontinuous linear functional f (see footnote a). Let C := {x ∈ X : −1 ≤ f(x) ≤ 1}. Clearly C is convex and 0 ∈ core C. Since C is symmetric convex and 0 ∉ int C (otherwise f would be continuous), int(C) = ∅. However, int cl C = core cl C ∋ 0, see the comment of Corollary 2.32. Note also that the convex set C is not closed but ray closed.

Footnote a: Constructed as follows: take a countable subset x1, x2, . . . from a Hamel basis, a countable decreasing nhood basis V1, V2, . . ., and nonzero real numbers t1, t2, . . . so that ti xi ∈ Vi for all i. Consider the linear functional f(xi) = 1/ti and f(y) = 0 for other elements in the Hamel basis. Clearly ti xi → 0 but f(ti xi) ≡ 1. This construction shows that the continuous dual of any metrizable TVS is strictly smaller than its algebraic dual.

The gauge function is also useful in proving the following result:

Theorem 2.20: Compact convex sets are homeomorphic

Compact convex sets with nonempty interior in R^d are homeomorphic.

Proof: W.l.o.g., we assume C is a compact convex set with zero in its interior. We prove that C is homeomorphic to the closed unit ball B of the underlying normed space.

We define the (scaling) map s using the gauge function:

\[ s(\mathbf{z}) := \begin{cases} \dfrac{p_C(\mathbf{z})}{\|\mathbf{z}\|}\, \mathbf{z}, & \mathbf{z} \in C \setminus \{\mathbf{0}\} \\ \mathbf{0}, & \mathbf{z} = \mathbf{0}. \end{cases} \tag{20} \]

Clearly, s maps C into B, and it does so continuously (since C is a convex nhood, the gauge pC is a continuous sublinear function). Because C is compact, pC(z) = 0 iff z = 0. If z1 ≠ z2 are not on the same ray, we clearly have s(z1) ≠ s(z2). If z1 ≠ z2 are on the same ray, again we have s(z1) ≠ s(z2) since C is star-shaped (and compact). This proves the map s is 1-1. Since 0 ∈ core C and C is compact and star-shaped, pC(·) takes values in [0, 1] when restricted to the intersection of C with any direction (with both endpoints attainable). By the intermediate value theorem the map s is onto.

To summarize, we have constructed a continuous bijection s from the compact space C onto the Hausdorff space B. The inverse of s is automatically continuous.
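As a concrete instance of the map s in (20), take C = [−1, 1]² in R², whose gauge is the max-norm, and let B be the Euclidean unit disk. The sketch below is illustrative only; `gauge_square`, `s`, and `s_inv` are hypothetical names, and the closed-form inverse relies on this particular choice of C.

```python
import math

def gauge_square(z):
    """Gauge of C = [-1, 1]^2, i.e. the max-norm."""
    return max(abs(z[0]), abs(z[1]))

def s(z):
    """Radial rescaling s(z) = (p_C(z) / ||z||_2) z, with s(0) = 0."""
    r = math.hypot(z[0], z[1])
    if r == 0.0:
        return (0.0, 0.0)
    k = gauge_square(z) / r
    return (k * z[0], k * z[1])

def s_inv(w):
    """Inverse map from the Euclidean unit disk back to the square.

    If w = s(z) then ||w||_2 = p_C(z) and z = c*w with c = ||w||_2 / p_C(w).
    """
    p = gauge_square(w)
    if p == 0.0:
        return (0.0, 0.0)
    k = math.hypot(w[0], w[1]) / p
    return (k * w[0], k * w[1])
```

The corner (1, 1) of the square lands on the unit circle, and `s_inv(s(z))` recovers `z` up to rounding, which mirrors the bijectivity argument in the proof.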

The theorem does not hold for compact star-shaped sets, if we recall the following convenient rule:

For any n, homeomorphic sets have homeomorphic sets of points that cut them into n (path)connected components.


Now the compact star-shaped set can be cut into 2 connected components by a single point while the ball cannot.

The crucial property we needed in the proof (but missing for the star-shaped set above) is the continuity of the gauge function pC. Thus, compact sets (upon translation) that are star-shaped, with 0 in their real core, and having continuous gauge (restricted to the set) are homeomorphic, such as the ℓp ball with 0 < p < 1.

We have written our proof in a way to seemingly suggest that the closed ball of any normed space is compact. This is of course not true, and the catch is that no compact set in an infinite dimensional Hausdorff TVS can have nonempty interior. In other words, we have accidentally proved: A normed space has a compact nhood iff it is finite dimensional.

We need a technical lemma that is very interesting in its own right.

Proposition 2.21: Closed sets are zeros of smooth functions

Let A ⊆ R^d be a closed set, then there exists a C^∞ function f : R^d → R+ such that f^{−1}(0) = A.

Proof: For each x ∈ A^c, take the ball Bx := B(x, ½ d(x, A)), where d(x, A) = min_{a∈A} ‖x − a‖. Since A is closed, d(x, A) > 0 hence Bx ∩ A = ∅. The set of balls {Bx}_{x∈A^c} is an open covering of the metric space A^c, hence there exists a partition of unity {ϕx}_{x∈A^c}. On each ball Bx consider the C^∞ function fx : R^d → R+ defined as

\[ f_{\mathbf{x}}(\mathbf{z}) = \begin{cases} \exp\Big( \dfrac{1}{\|\mathbf{z} - \mathbf{x}\| - \frac{1}{2} d(\mathbf{x}, A)} \Big), & \mathbf{z} \in B_{\mathbf{x}} \\ 0, & \text{otherwise.} \end{cases} \]

Putting f(z) = ∑_{x∈A^c} ϕx(z) fx(z) completes our proof.

Clearly closedness of the set is necessary, as the zero set of any continuous function is closed.

Theorem 2.22: Open star-shaped sets are C∞-diffeomorphic

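Every open star-shaped set Ω in R^d is C^∞-diffeomorphic to R^d.

Proof: W.l.o.g. assume 0 ∈ Ω and Ω is star-shaped at 0. Thanks to Proposition 2.21 there exists a C^∞ function φ : R^d → R+ with Ω^c = φ^{−1}(0).

Define the function f : Ω → R^d as

\[ f(\mathbf{x}) = \underbrace{\Big[ 1 + \Big( \int_0^1 \frac{1}{\varphi(\nu \mathbf{x})}\, d\nu \Big)^2 \|\mathbf{x}\|_2^2 \Big]}_{\lambda(\mathbf{x})} \cdot\, \mathbf{x} = \Big[ 1 + \Big( \int_0^{\|\mathbf{x}\|_2} \frac{1}{\varphi(t\mathbf{x}/\|\mathbf{x}\|_2)}\, dt \Big)^2 \Big] \cdot \mathbf{x}, \tag{21} \]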

which clearly is C^∞ on Ω.

Take two points x1, x2 ∈ Ω. If x1 and x2 are not on the same ray then clearly f(x1) ≠ f(x2) since λ(x) ≥ 1. On the other hand, if x1 and x2 are on the same ray, again f(x1) ≠ f(x2) since λ(sx) is an increasing function of s (better seen from the second equality above). Therefore f is 1-1.

Since Ω is star-shaped at 0 we know the interval [0, x/pΩ(x)) ⊆ Ω for any x ∈ Ω. Consider the C^∞ function g(s) := f(sx) where s ∈ [0, 1/pΩ(x)). Obviously g(0) = 0. If pΩ(x) = 0, then clearly g(s) is onto R+x (by the intermediate value theorem). If pΩ(x) > 0, then φ(x/pΩ(x)) = 0 since x/pΩ(x) ∈ Ω^c. By the mean value theorem we know

|φ(x/pΩ(x)) − φ(ν x/pΩ(x))| = φ(ν x/pΩ(x)) ≤ M(1 − ν)

for some constant M (that does not depend on ν, since |φ′| attains its maximum on the interval [0, x/pΩ(x)]). Therefore λ(sx) diverges to infinity as s ↑ 1/pΩ(x). Thanks again to the intermediate value theorem we know g(s) maps again onto R+x. In summary, f is onto.


To show f^{−1} is C^∞ we need only show that f′(x) is nonsingular at every x, thanks to the inverse function theorem. Suppose to the contrary there exists some h ≠ 0 such that

f′(x)h = λ(x)h + ⟨λ′(x), h⟩x = 0.

Then x ≠ 0 and h = µx for some µ ≠ 0. Plugging this back in we obtain

\[ \lambda(\mathbf{x}) + \langle \lambda'(\mathbf{x}), \mathbf{x} \rangle = 0 = \lambda(\mathbf{x}) + \frac{d\lambda(s\mathbf{x})}{ds}\Big|_{s=1}, \]

which is impossible as λ(x) ≥ 1 and λ(sx) is increasing w.r.t. s.

This result seems to be folklore, but the beautiful proof here is from page 60 of Gonnord & Tosel’s book "Calcul Différentiel", Ellipses, 1998.

Theorem 2.23: Nullity of the boundary of convex sets, Lang [1986]

The boundary of a convex set in R^d is null w.r.t. every product measure µ := ⊗_{i=1}^d µi on the Borel field with non-atomic σ-finite marginals µ1, . . . , µd.

Proof: Fix the convex set C ⊆ R^d. First note that the boundary ∂C = cl C \ int C is closed hence Borel. Let

M := {B ⊆ R^d Borel : µ(B ∩ ∂C) ≤ (1 − 3^{−d}) µ(B)}.

We claim that all rectangles are in M. Indeed, let A = ∏_{i=1}^d (ai, bi]. W.l.o.g. assume µi is finite, and using non-atomicity we can find ai < xi < yi < bi such that µi((ai, xi)) = µi((xi, yi)) = µi((yi, bi)) = 3^{−1} µi((ai, bi]). Thus, we partition the rectangle A into 3^d open rectangles of equal measure. If cl C meets all of the 2^d corner open rectangles, then the center open rectangle is in int C because of convexity. Either way we have µ(A ∩ ∂C) ≤ (1 − 3^{−d}) µ(A), proving our claim that all rectangles are in M. As M is clearly closed upon taking countable unions, we know M is exactly the Borel sets (finite unions of rectangles form an algebra). Therefore, µ(∂C) ≤ (1 − 3^{−d}) µ(∂C), i.e., µ(∂C) = 0.

The same proof works for “order solid” sets, i.e., x, y ∈ C ⟹ z ∈ C for all x ≤ z ≤ y.

Corollary 2.24: Measurability of convex sets, Lang [1986]

Convex sets in R^d are measurable w.r.t. every complete σ-finite product measure on the Borel field.

Proof: Note that every σ-finite measure ν on the Borel field of a separable metric space can be written as

\[ \nu = \sum_{k=1}^{\infty} \nu_k + \mu, \]

where νk concentrates on a single point and µ is non-atomic. Therefore the product σ-finite measure can be written as a countable sum of

\[ \bigotimes_{i=1}^{d'} \nu_i \times \bigotimes_{j=d'+1}^{d} \mu_j. \]

Since νi concentrates on, say, xi, C is measurable iff its section C_{x1,...,xd′} is measurable. This is indeed so since the section remains convex hence its boundary is null.

In particular, convex sets are Lebesgue measurable hence have volume. Another proof: For any x ∈ ∂C, half of the open ball centered at x does not meet C due to convexity. By the Lebesgue density theorem we know the boundary is null.

The same proof again works for “order solid” sets.


Corollary 2.25: Measure of convex sets

For any convex set C ⊆ R^d, we have µ(C) = µ(cl C) = µ(int C) w.r.t. every complete σ-finite non-atomic product measure µ on the Borel field.

This result is not true for non-convex sets: Take the complement of a Cantor set with Lebesgue measure 1 > ε > 0; its closure has full measure 1 while its interior has measure 1 − ε.

Corollary 2.26: Measurability of monotone functions, Lang [1986]

A monotone function f : R^d → R^m is (Borel) measurable w.r.t. every complete product σ-finite measure on the Borel field.

Proof: We need only verify the measurability of the “order solid” set [[a ≤ f ≤ b]].

Thus, monotone functions are Lebesgue measurable.

Example 2.27: Convex sets need not be Borel

The union of the open unit ball and any non-measurable subset (see footnote a) of the unit sphere is a non-Borel convex set. This convex set is also non-measurable w.r.t. the uniform distribution on the sphere (which is complete, non-atomic but not product).

Footnote a: To construct such a set, take the unit interval [0, 1] and consider all equivalence classes [r] := {x ∈ R : x − r ∈ Q}. There are uncountably many such sets [r] and each of them has at least one representative in [0, 1]. Let V ⊆ [0, 1] be a set such that V ∩ [r] is a singleton for all r ∈ R. Yes, here we need the axiom of choice for the existence of V. Let q1, q2, . . . enumerate the rationals in [−1, 1], then [0, 1] ⊆ ⋃_k (V + qk) ⊆ [−1, 2]. But the sets V + qk are disjoint, hence we cannot assign a measure to V. To extend the construction to the unit sphere, consider a homeomorphism from the unit interval to the sphere.

Theorem 2.28: Bounded convex functions are locally Lipschitz

If a convex function f : X → (−∞, ∞] is lower bounded by m on a set A and upper bounded by M on A + W, where W is a star-shaped nhood, then

∀u,v ∈ A, |f(u)− f(v)| ≤ (M −m) · pW (u− v). (22)

Proof: Let u, v ∈ A, and y = u + (1/(α + δ))(u − v), where α = pW(u − v) and δ > 0 is arbitrary. Then pW(y − u) = α/(α + δ) < 1, hence y − u ∈ W according to Proposition 2.17 (item II). Therefore, y ∈ u + W ⊆ A + W. Clearly, u = ((α + δ)/(α + δ + 1)) y + (1/(α + δ + 1)) v. Using the boundedness assumption and convexity:

\[ f(\mathbf{u}) - f(\mathbf{v}) \le \frac{\alpha+\delta}{\alpha+\delta+1} [f(\mathbf{y}) - f(\mathbf{v})] \le (\alpha+\delta)(M-m) = (M-m) \cdot p_W(\mathbf{u}-\mathbf{v}) + \delta(M-m). \]

The proof is complete since δ > 0 is arbitrary, and we can swap u and v.
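The estimate (22) can be sanity-checked in one dimension. In the sketch below (an illustration, not part of the note) we assume X = R, A = [a_lo, a_hi], and W = (−r, r), so that pW(u) = |u|/r and A + W = (a_lo − r, a_hi + r). The function name `check_lipschitz_estimate` is hypothetical.

```python
import math

def check_lipschitz_estimate(f, a_lo, a_hi, r, n=101):
    """Test |f(u) - f(v)| <= (M - m) * |u - v| / r on a grid of A = [a_lo, a_hi].

    f is assumed convex and finite on [a_lo - r, a_hi + r].
    """
    grid = [a_lo + (a_hi - a_lo) * i / (n - 1) for i in range(n)]
    # Lower bound of f over the grid points of A.
    m = min(f(u) for u in grid)
    # A convex function on an interval is maximized at an endpoint,
    # so this bounds f from above on A + W.
    M = max(f(a_lo - r), f(a_hi + r))
    for u in grid:
        for v in grid:
            if abs(f(u) - f(v)) > (M - m) * abs(u - v) / r + 1e-12:
                return False
    return True
```

For f(x) = x² on A = [−1, 1] with r = 1 the bound gives the constant M − m = 4, while the best Lipschitz constant on A is 2, so the estimate is valid but not tight.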

Conveniently, sometimes we need only verify the upper boundedness in Theorem 2.28.

Proposition 2.29: Upper bounded implies lower bounded

For a convex function f : X → (−∞,∞], it is automatically lower bounded on a set A if:

(I). A = −A and f is upper bounded on A; or

(II). A is bounded, and f is upper bounded on a star-shaped nhood of A.

Proof: Firstly, if f is upper bounded by M on the symmetric set A, then we have

∀x ∈ A, f(x) ≥ 2f(0)− f(−x) ≥ 2f(0)−M > −∞. (23)


Secondly, let f be upper bounded by M on A + V for some star-shaped nhood V, and A be bounded. Then A − A ⊆ λV for some sufficiently large λ > 0. Therefore, for any x, y ∈ A, we have pV(y − x) ≤ λ. Fix an arbitrary δ > 0 and use convexity:

\[ f(\mathbf{y}) \le \frac{\lambda+\delta}{\lambda+\delta+1} f\Big(\mathbf{y} + \frac{1}{\lambda+\delta}(\mathbf{y}-\mathbf{x})\Big) + \frac{1}{\lambda+\delta+1} f(\mathbf{x}). \]

Note that pV((y − x)/(λ + δ)) ≤ λ/(λ + δ) < 1, hence y + (y − x)/(λ + δ) ∈ A + V since V is star-shaped. Therefore f(x) ≥ (λ + δ + 1)f(y) − (λ + δ)M. As δ > 0 and y ∈ A are arbitrary, we have

\[ \forall \mathbf{x} \in A, \quad f(\mathbf{x}) \ge \lambda \Big[ \sup_{\mathbf{y} \in A} f(\mathbf{y}) - \sup_{\mathbf{z} \in A+V} f(\mathbf{z}) \Big] + \sup_{\mathbf{y} \in A} f(\mathbf{y}) > -\infty, \tag{24} \]

where recall that λ is the smallest positive number such that A − A ⊆ λV.

In both cases we obtain explicit estimates of the lower bound (cf. the numbered equations). Clearly, lower bounded does not imply upper bounded for convex functions (think of the hinge loss). To see the necessity of A = −A in (I) or A bounded in (II), think of a linear function capped on the right.

Theorem 2.30: Continuity of convex functions

Let f : X → (−∞, ∞] be a convex function and x ∈ int dom f be arbitrary. Then the following are equivalent:

(I). f is continuous on int dom f ;

(II). f is continuous at x;

(III). f is upper semicontinuous at x;

(IV). f is upper bounded on an nhood of x;

Proof: Clearly, (I) ⟹ (II) ⟹ (III) ⟹ (IV).

(IV) ⟹ (II): W.l.o.g. we can take a symmetric nhood U + x on which f is upper bounded and U ⊇ V + W where V and W are balanced nhoods. Then the continuity of f at x follows from Theorem 2.28 and item (I) of Proposition 2.29. (Recall that the p.h. function pV is continuous at the origin iff V is an nhood.)

(IV) ⟹ (I): Suppose f is upper bounded on x + V ⊆ dom f for some balanced nhood V. We show (IV) holds for any z ∈ int dom f. Find w ∈ int dom f such that z = λx + (1 − λ)w for some λ ∈ (0, 1]. Take U = λV. Clearly, for all v ∈ V,

f(z + λv) = f(λ(x + v) + (1 − λ)w) ≤ λf(x + v) + (1 − λ)f(w).

Thus f is upper bounded on z + U ⊆ dom f.

Corollary 2.31: Continuity of convex functions: finite dimensional case

Any convex function on a finite dimensional space is continuous on the interior of its domain.

Proof: In finite dimensional spaces, the simplex around a point is an nhood. Thus due to convexity the function is upper bounded on the simplex nhood of each interior point of its domain.

Corollary 2.32: Continuity of convex functions: lower semicontinuous case

Any lower semicontinuous (l.s.c.) convex function on a barrelled space is continuous on the interior, or equivalently the core, of its domain.


Proof: Fix x ∈ int dom f (or the core) and choose t such that f(x) < t. Consider the sublevel set C = {y : f(x + y) ≤ t}, which is closed (by lower semicontinuity), nonempty, and convex. Thanks to Corollary 2.31, the convex restriction g(λ) := f(x + λy) is continuous at 0 for any y, hence upper bounded by t on some nonempty open set |λ| < δ. Thus C is absorbing, hence an nhood (as the space is barrelled). It follows that f is upper bounded on an nhood of x and Theorem 2.30 applies.

Recall that a topological vector space is barrelled iff every closed (absolutely, see footnote a) convex absorbing set (i.e. a barrel) is an nhood. Every Baire TVS (e.g. F-space, complete metrizable TVS) is barrelled: Let V be a closed convex absorbing set. Thus X = ⋃_{t>0} tV = ⋃_n nV. By Baire’s theorem one of nV, hence V, has nonempty interior, say x + U ⊆ V. Using absorbing again we know −x/n ∈ V for some n > 0. Then (1/(n+1))U = (1/(n+1))(x + U) + (n/(n+1))(−x/n) ⊆ V is an open nhood of 0. In summary, for a closed convex set C in a barrelled space, core C = int C.

The l.s.c. assumption cannot be dropped even on Hilbert spaces: there exist discontinuous linear functionals.

Footnote a: The usual definition also requires the set to be balanced, which is motivated by the unit balls of l.s.c. seminorms. To see that balanced is unnecessary, let C be closed convex absorbing and consider the set C0 := {x ∈ C : λx ∈ C for all |λ| ≤ 1}, which we easily verify to be a balanced closed convex absorbing subset of C. Translating to functions, every (l.s.c.) nonnegative finite-valued sublinear function f is “equivalent” to a (l.s.c.) seminorm: let p(x) := max{f(λx) : |λ| ≤ 1}, then f(x) ≤ p(x) ≤ f(x) + f(ix) (or f(x) + f(−x) for the real field).

Theorem 2.33: Continuity of convex functions: real case

Any convex function on R is upper semicontinuous (u.s.c.) when restricted to its domain.

Proof: The domain of any convex function f on R is an interval (a, b) with possibly one or both endpoints included. By Corollary 2.31 f is continuous on (a, b). Convexity also demands f(a) ≥ lim sup_{x↓a} f(x). Indeed, for any λ ∈ [0, 1], f(λa + (1 − λ)x) ≤ λf(a) + (1 − λ)f(x). Similar arguments apply to the other endpoint.

One needs to interpret this result carefully: f as an extended-valued function on the whole space R need not be u.s.c., as we can approach from the infinity side. Still, this result is nontrivial: consider the solid disk in R²; put the interior to 0, half of its boundary to 2, and the rest to 1.

Theorem 2.34: Directional derivative

Let f : X → (−∞, ∞] be convex and x ∈ core dom f. Then for all d ∈ X the univariate function t ↦ (f(x + td) − f(x))/t is finite and increasing on the interval [−δ, 0) ∪ (0, δ] for some δ > 0 (that may depend on x and d). Thus, the right directional derivative

\[ f'_+(\mathbf{x};\mathbf{d}) := \lim_{t \downarrow 0} \frac{f(\mathbf{x} + t\mathbf{d}) - f(\mathbf{x})}{t} \ge \lim_{t \uparrow 0} \frac{f(\mathbf{x} + t\mathbf{d}) - f(\mathbf{x})}{t} =: f'_-(\mathbf{x};\mathbf{d}) = -f'_+(\mathbf{x};-\mathbf{d}) \tag{25} \]

is a well-defined finite sublinear function of d. Likewise, the left derivative f′− is positive homogeneous and superadditive.

Proof: The existence of δ > 0 so that f(x + td) < ∞ on [−δ, 0) ∪ (0, δ] follows from the assumption x ∈ core dom f. Let 0 < s ≤ t ≤ δ, then 0 < s/t ≤ 1 and using convexity

\[ \mathbf{x} + s\mathbf{d} = \tfrac{s}{t}(\mathbf{x} + t\mathbf{d}) + (1 - \tfrac{s}{t})\mathbf{x} \implies f(\mathbf{x} + s\mathbf{d}) - f(\mathbf{x}) \le \tfrac{s}{t}[f(\mathbf{x} + t\mathbf{d}) - f(\mathbf{x})]. \]

Thus, the increasing property on (0, δ] is clear. The proof for the other interval [−δ, 0) is analogous.

Due to monotonicity and sandwiching the directional derivatives are well-defined and finite. Their positive homogeneity is clear. To see the subadditivity, use convexity again: for t > 0,

\[ \frac{f(\mathbf{x} + t(\mathbf{d}_1 + \mathbf{d}_2)) - f(\mathbf{x})}{t} \le \frac{f(\mathbf{x} + 2t\mathbf{d}_1) - f(\mathbf{x})}{2t} + \frac{f(\mathbf{x} + 2t\mathbf{d}_2) - f(\mathbf{x})}{2t}. \]


Taking the limit t ↓ 0 shows the subadditivity of f ′+.
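The monotonicity of the difference quotient, and the gap between the two one-sided derivatives at a kink, can be seen on a simple example. The sketch below uses f(x) = |x| + x² on R at x = 0 as an assumed test function; `quotient` and `one_sided_derivatives` are hypothetical helper names, and the one-sided limits are only approximated with a small step.

```python
def quotient(f, x, d, t):
    """Difference quotient (f(x + t d) - f(x)) / t for t != 0."""
    return (f(x + t * d) - f(x)) / t

def one_sided_derivatives(f, x, d, t=1e-7):
    """Crude approximations (left, right) of f'_-(x; d) and f'_+(x; d)."""
    return quotient(f, x, d, -t), quotient(f, x, d, t)

def f(x):
    """Convex test function with a kink at the origin."""
    return abs(x) + x * x

# Increasing in t across both [-delta, 0) and (0, delta].
ts = [-1.0, -0.5, -0.1, -1e-3, 1e-3, 0.1, 0.5, 1.0]
qs = [quotient(f, 0.0, 1.0, t) for t in ts]
```

Here the quotient equals sign(t) + t, so the left and right derivatives in direction d = 1 are −1 and 1 respectively, consistent with f′−(x; d) = −f′+(x; −d).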

Definition 2.35: Bregman divergence, Bregman [1967]

The Bregman divergence induced by a convex function f is: for all y ∈ dom f, x ∈ core dom f,

\[ D_f(\mathbf{y},\mathbf{x}) := f(\mathbf{y}) - f(\mathbf{x}) - f'_+(\mathbf{x};\mathbf{y}-\mathbf{x}) \ge \frac{t f(\mathbf{y}) + (1-t) f(\mathbf{x}) - f(t\mathbf{y} + (1-t)\mathbf{x})}{t} \ge 0, \tag{26} \]

where t > 0 is any sufficiently small number.

The Bregman divergence need not be symmetric (i.e. Df(y, x) ≠ Df(x, y) in general) or satisfy the triangle inequality (i.e. Df(y, x) ≤ Df(y, z) + Df(z, x)). In fact, it is defined over the asymmetric product dom f × core dom f.
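When f is differentiable at x, the directional derivative is f′+(x; d) = ⟨∇f(x), d⟩ and the Bregman divergence takes its familiar gradient form. The sketch below works under that differentiability assumption in R^d; the names `bregman`, `half_sq_norm`, and `neg_entropy` (with their gradients) are illustrative choices, not notation from the note.

```python
import math

def bregman(f, grad_f, y, x):
    """D_f(y, x) = f(y) - f(x) - <grad f(x), y - x> for differentiable f."""
    g = grad_f(x)
    return f(y) - f(x) - sum(gi * (yi - xi) for gi, yi, xi in zip(g, y, x))

def half_sq_norm(x):
    return 0.5 * sum(c * c for c in x)

def grad_half_sq_norm(x):
    return tuple(x)

def neg_entropy(x):
    """Negative entropy on the positive orthant."""
    return sum(c * math.log(c) for c in x)

def grad_neg_entropy(x):
    return tuple(math.log(c) + 1.0 for c in x)

y, x = (0.2, 0.8), (0.5, 0.5)
d_sq = bregman(half_sq_norm, grad_half_sq_norm, y, x)   # equals 0.5 * ||y - x||^2
d_yx = bregman(neg_entropy, grad_neg_entropy, y, x)     # KL(y || x) on the simplex
d_xy = bregman(neg_entropy, grad_neg_entropy, x, y)     # KL(x || y)
```

The squared norm yields a symmetric divergence, while the negative entropy yields the Kullback-Leibler divergence on probability vectors, for which `d_yx` and `d_xy` differ.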

Theorem 2.36: Weak convexity of the Bregman divergence

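Let f be a convex function and y ∈ dom f. Then for all x ∈ core dom f and 0 ≤ λ ≤ 1

\[ D_f(\mathbf{x},\mathbf{x}) = 0, \tag{27} \]
\[ D_f((1-\lambda)\mathbf{x} + \lambda\mathbf{y}, \mathbf{x}) \le \lambda D_f(\mathbf{y},\mathbf{x}), \tag{28} \]

i.e. the Bregman divergence is convex w.r.t. the first argument on each line segment connecting to the second argument.

Proof: The first equality is clear. We easily verify

\begin{align*}
D_f((1-\lambda)\mathbf{x} + \lambda\mathbf{y}, \mathbf{x}) &= f((1-\lambda)\mathbf{x} + \lambda\mathbf{y}) - f(\mathbf{x}) - f'_+(\mathbf{x};\lambda(\mathbf{y}-\mathbf{x})) \\
&\le (1-\lambda) f(\mathbf{x}) + \lambda f(\mathbf{y}) - f(\mathbf{x}) - \lambda f'_+(\mathbf{x};\mathbf{y}-\mathbf{x}) \\
&= \lambda D_f(\mathbf{y},\mathbf{x}).
\end{align*}

Since Df(x, x) = 0, this shows a weak form of convexity of the Bregman divergence.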

Due to asymmetry, the claim is no longer true if we swap the order of the arguments. The significance of this weak convexity lies in its generality: there is no topology involved.
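Inequality (28) does not need differentiability, which can be illustrated with a non-smooth function. The sketch below takes f(x) = |x| + x² on R as an assumed example, for which the right directional derivative is available in closed form; `right_dd`, `bregman_1d`, and `weak_convexity_holds` are hypothetical names.

```python
import math

def f(x):
    """Convex and non-differentiable at the origin."""
    return abs(x) + x * x

def right_dd(x, d):
    """Right directional derivative f'_+(x; d) of f(x) = |x| + x^2."""
    if x != 0:
        return (math.copysign(1.0, x) + 2 * x) * d
    return abs(d)  # at the kink, f'_+(0; d) = |d|

def bregman_1d(y, x):
    """D_f(y, x) = f(y) - f(x) - f'_+(x; y - x)."""
    return f(y) - f(x) - right_dd(x, y - x)

def weak_convexity_holds(x, y, lam, tol=1e-12):
    """Check D_f((1 - lam) x + lam y, x) <= lam * D_f(y, x)."""
    z = (1 - lam) * x + lam * y
    return bregman_1d(z, x) <= lam * bregman_1d(y, x) + tol
```

For example, with x = −1 and y = 1 the segment crosses the kink at 0, and the check still passes for every λ ∈ [0, 1].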

Definition 2.37: Gateaux derivative

The function f is said to be Gateaux differentiable at x ∈ rcore dom f if the limit

\[ f'(\mathbf{x};\mathbf{d}) = \lim_{t \to 0} \frac{f(\mathbf{x} + t\mathbf{d}) - f(\mathbf{x})}{t} \tag{29} \]

exists for all d ∈ X. Usually we also require the derivative to be a linear functional of d.

René Gateaux was killed in WWI before he could defend his doctoral thesis (on integration on functional spaces), see Mazliak [2015] for a detailed account of this history.

Theorem 2.38: Gateaux differentiable = Directional derivative linear

A convex function is Gateaux differentiable at a point x ∈ core dom f iff

∀d ∈ X, f ′+(x;−d) = −f ′+(x;d),

i.e., the directional derivative (hence the Gateaux derivative) is linear.

Proof: Simply combine Proposition 2.8 and Theorem 2.34.


Theorem 2.39: Continuity of the directional derivative

If the convex function f : X → (−∞, ∞] is continuous at a point x, then its directional derivative at x is also continuous. (Hence the Gateaux derivative of f, if it exists, is continuous.)

Proof: By Theorem 2.28 and Proposition 2.29 we know f is in fact Lipschitz continuous: For some star-shaped nhood W, some nhood V, and a finite constant L ≥ 0, for any u, v ∈ V,

|f(x + u) − f(x + v)| ≤ L · pW(u − v).

Therefore,

\[ |f'_+(\mathbf{x};\mathbf{d}) - f'_+(\mathbf{x};\mathbf{e})| = \lim_{t \downarrow 0} \Big| \frac{f(\mathbf{x} + t\mathbf{d}) - f(\mathbf{x} + t\mathbf{e})}{t} \Big| \le L \cdot p_W(\mathbf{d} - \mathbf{e}). \]

Since W is an nhood, pW is continuous at the origin, and the theorem follows.

The proof shows that the directional derivative enjoys the same Lipschitz continuity as the functionitself.

Theorem 2.40: Sierpinski Theorem

A Lebesgue measurable function is convex iff it is mid-point convex.

3 Uniformly convex and uniformly smooth functions

Let X∗ ⊆ X′, the latter being the algebraic dual of X (i.e. all linear functionals on X). We associate the dual pairing ⟨·; ·⟩ for X and X∗, and topologize X (resp. X∗) with the weak (resp. weak-∗) topology induced by X∗ (resp. X). Note there is a slight asymmetry between X and its dual X∗: we had to define X first and X∗∗ := (X∗)∗ ⊇ X where the containment may be strict.

In this section we will let X be a Banach space with norm ‖ · ‖, and X∗ its topological dual (but again equipped with the weak-∗ topology).

The following class of univariate functions will be frequently referenced:

\[ \mathcal{A} := \{ f : \mathbb{R}_+ \to \mathbb{R}_+ \cup \{+\infty\},\ f(t) = 0 \iff t = 0 \}. \tag{30} \]

(A stands for Asplund.)

Definition 3.1: Uniformly convex functions

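A function f : X → (−∞, ∞] is called σ-convex if for all x, y ∈ X and λ ∈ (0, 1),

\[ f((1-\lambda)\mathbf{x} + \lambda\mathbf{y}) + \lambda(1-\lambda) \cdot \sigma(\|\mathbf{x} - \mathbf{y}\|) \le (1-\lambda) f(\mathbf{x}) + \lambda f(\mathbf{y}), \tag{31} \]

where σ : R+ → (−∞, ∞]. Note that dom f has to be convex.

The existence of some σ immediately implies the existence of a largest σ, in the following sense:

\begin{align}
\sigma_f(t) &:= \sup\{\sigma(t) : f \text{ is } \sigma\text{-convex}\} \tag{32} \\
&= \inf\Big\{ \frac{(1-\lambda) f(\mathbf{x}) + \lambda f(\mathbf{y}) - f((1-\lambda)\mathbf{x} + \lambda\mathbf{y})}{\lambda(1-\lambda)} : \lambda \in (0,1),\ \mathbf{x},\mathbf{y} \in \operatorname{dom} f,\ \|\mathbf{x} - \mathbf{y}\| = t \Big\}. \tag{33}
\end{align}

Clearly f is convex iff σf ≥ 0, in which case σf(0) = 0 (provided that dom f ≠ ∅).

We call f uniformly convex (hence bona fide convex) if σf ≥ 0 and σf(t) = 0 ⟺ t = 0.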
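The infimum in (33) can be probed by random sampling. The sketch below is an illustration in the Euclidean space R², which is an assumption made here for concreteness; it only produces an upper estimate of σf(t), since it minimizes over finitely many samples. The names `convexity_gap_ratio` and `sampled_modulus` are hypothetical.

```python
import math
import random

def convexity_gap_ratio(f, x, y, lam):
    """The quantity inside the infimum in (33) for given x, y, lam."""
    mix = tuple((1 - lam) * a + lam * b for a, b in zip(x, y))
    gap = (1 - lam) * f(x) + lam * f(y) - f(mix)
    return gap / (lam * (1 - lam))

def sampled_modulus(f, t, dim=2, trials=2000, seed=0):
    """Monte Carlo upper estimate of sigma_f(t) over pairs with ||x - y||_2 = t."""
    rng = random.Random(seed)
    best = math.inf
    for _ in range(trials):
        x = tuple(rng.uniform(-5, 5) for _ in range(dim))
        u = tuple(rng.gauss(0, 1) for _ in range(dim))
        n = math.sqrt(sum(c * c for c in u))
        if n == 0.0:
            continue  # degenerate direction, resample
        y = tuple(a + t * c / n for a, c in zip(x, u))
        lam = rng.uniform(0.01, 0.99)
        best = min(best, convexity_gap_ratio(f, x, y, lam))
    return best

def half_sq(v):
    return 0.5 * sum(c * c for c in v)

def norm2(v):
    return math.sqrt(sum(c * c for c in v))
```

For f = ½‖·‖² in a Euclidean space the ratio equals ½‖x − y‖² for every λ, so the estimate returns t²/2 up to rounding. For the norm itself, taking x and y on the same ray gives a ratio of 0.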


Remark 3.2: Modification for positive homogeneous functions

The definition of uniform convexity does not work for positive homogeneous functions: taking y ∝ x, we see that $\sigma_f \equiv 0$.

The fix is simple: we constrain both x and y to have unit norm in the definition (31) or (33). This modification must be kept in mind when we talk about uniformly convex norms.
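The degeneracy can be made explicit with a short computation. The sketch below assumes that f is convex and positively homogeneous, i.e. f(cx) = c f(x) for c > 0, and that dom f contains some x ≠ 0; the specific choice of y is only one convenient instance of y ∝ x.

```latex
% Fix t > 0 and x \in \operatorname{dom} f with x \neq 0, and put
%   y := (1 + t/\|x\|)\, x ,
% so that \|x - y\| = t and y \propto x. By positive homogeneity,
\begin{align*}
(1-\lambda) f(x) + \lambda f(y)
  &= \Bigl(1 + \tfrac{\lambda t}{\|x\|}\Bigr) f(x), \\
f\bigl((1-\lambda) x + \lambda y\bigr)
  &= f\Bigl(\bigl(1 + \tfrac{\lambda t}{\|x\|}\bigr) x\Bigr)
   = \Bigl(1 + \tfrac{\lambda t}{\|x\|}\Bigr) f(x).
\end{align*}
% The numerator in (33) vanishes, hence \sigma_f(t) \le 0 for every t > 0;
% together with \sigma_f \ge 0 (convexity) this gives \sigma_f \equiv 0.
```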

Definition 3.3: Totally convex functions

We define the modulus of total convexity of a convex function f : X → (−∞,∞] as:

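$$\begin{aligned}
\tau_f(t) &:= \inf\{ D_f(y; x) := f(y) - f(x) - f'(x; y - x) : x \in \operatorname{core} \operatorname{dom} f,\ \|x - y\| = t \} && (34) \\
&= \inf\{ D_f(y; x) := f(y) - f(x) - f'(x; y - x) : x \in \operatorname{core} \operatorname{dom} f,\ \|x - y\| \ge t \}, && (35)
\end{aligned}$$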

where the second equality follows from the weak convexity of the Bregman divergence, see Theorem 2.36: if ‖x − y‖ = s > t, then we can find z ∈ [x, y] so that ‖z − x‖ = t and $D_f(z; x) \le D_f(y; x)$. Thus, it follows from (34) that $\tau_f(0) = 0$ and from (35) that $\tau_f$ is an increasing function.

We call a convex function f totally convex iff $\tau_f(t) = 0 \iff t = 0$. Totally convex functions are perfect candidates for Lyapunov functions: $D_f(x_n; x) \to 0 \implies x_n \to x$.
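A small numerical illustration of (34), under the assumption X = R with f(x) = x⁴ (an example chosen here, not taken from the note). Expanding the Bregman divergence gives D_f(x ± t; x) = 6x²t² ± 4xt³ + t⁴, which is minimised at x = ∓t/3 with value t⁴/3. So for this f one expects τ_f(t) = t⁴/3, which vanishes only at t = 0. The grid search below is a rough sketch with hypothetical function names and grid bounds.

```python
def f(x):
    """Example function f(x) = x^4 on the real line."""
    return x ** 4

def df(x):
    """Derivative of f, so that f'(x; d) = df(x) * d."""
    return 4 * x ** 3

def bregman(y, x):
    """D_f(y; x) = f(y) - f(x) - f'(x; y - x) for the differentiable f above."""
    return f(y) - f(x) - df(x) * (y - x)

def total_convexity_modulus(t, lo=-5.0, hi=5.0, steps=100001):
    """Grid approximation of tau_f(t) in (34): minimise D_f(y; x) over |x - y| = t."""
    best = float("inf")
    for i in range(steps):
        x = lo + (hi - lo) * i / (steps - 1)
        # On the real line the constraint |x - y| = t leaves only y = x + t or y = x - t.
        best = min(best, bregman(x + t, x), bregman(x - t, x))
    return best
```

The grid minimum can only overestimate the infimum, and it is reliable only while the minimiser ∓t/3 stays inside the search interval.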

Theorem 3.4: Uniformly convex ⊂ Totally convex

For any convex (or more generally directionally differentiable) function f : X → (−∞,∞] we have $\tau_f \ge \sigma_f$.

Proof: Apply the definition of the directional derivative in (31).
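The one-line proof can be unpacked as follows. This is a sketch of the intended argument as we read it, using only (31), (33) and (34).

```latex
% Take x \in \operatorname{core}\operatorname{dom} f and y \in \operatorname{dom} f
% with \|x - y\| = t (if y \notin \operatorname{dom} f then D_f(y; x) = \infty and
% there is nothing to show). Rearranging (31) with \sigma = \sigma_f and dividing
% by \lambda > 0 gives
\begin{align*}
(1-\lambda)\, \sigma_f(t)
  \le f(y) - f(x) - \frac{f\bigl(x + \lambda (y - x)\bigr) - f(x)}{\lambda}.
\end{align*}
% Letting \lambda \downarrow 0 and using the definition of the directional derivative:
\begin{align*}
\sigma_f(t) \le f(y) - f(x) - f'(x; y - x) = D_f(y; x).
\end{align*}
% Taking the infimum over all such pairs yields \sigma_f(t) \le \tau_f(t), cf. (34).
```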

Theorem 3.5: Uniformly convex ⊂ Strictly convex

Uniformly convex functions are strictly convex.

Definition 3.6: Uniformly smooth functions

A (proper) function f : X → (−∞,∞] is called ρ-smooth if for all x, y ∈ X and λ ∈ (0, 1) such that (1 − λ)x + λy ∈ dom f,

(1− λ)f(x) + λf(y) ≤ f((1− λ)x + λy) + λ(1− λ) · ρ(‖x− y‖), (36)

where ρ : R+ → (−∞,∞]. For convex f, it is necessary to have ρ ≥ 0.

The existence of some ρ immediately implies the existence of a smallest ρ, in the following sense:

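$$\begin{aligned}
\rho_f(t) &:= \inf\{ \rho(t) : f \text{ is } \rho\text{-smooth} \} && (37) \\
&= \sup\left\{ \frac{(1-\lambda) f(x) + \lambda f(y) - f((1-\lambda)x + \lambda y)}{\lambda(1-\lambda)} : \lambda \in (0,1),\ (1-\lambda)x + \lambda y \in \operatorname{dom} f,\ \|x - y\| = t \right\} && (38) \\
&= \sup\left\{ \frac{(1-\lambda) f(x - \lambda t y) + \lambda f(x + (1-\lambda) t y) - f(x)}{\lambda(1-\lambda)} : \lambda \in (0,1),\ x \in \operatorname{dom} f,\ \|y\| = 1 \right\}. && (39)
\end{aligned}$$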

Clearly, $\rho_f(0) = 0$ iff dom f ≠ ∅ iff dom $\rho_f$ ≠ ∅. From (39) we know $\rho_f$ is l.s.c. or convex if f is so.
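The passage from (38) to (39) is a change of variables, spelled out below for t > 0. The auxiliary names u and v are introduced here only to avoid clashing with the x and y of (39).

```latex
% Given \lambda \in (0,1), x \in \operatorname{dom} f and \|y\| = 1 as in (39), set
%   u := x - \lambda t y, \qquad v := x + (1-\lambda) t y .
\begin{align*}
(1-\lambda) u + \lambda v
  &= x - \lambda(1-\lambda) t y + \lambda(1-\lambda) t y = x \in \operatorname{dom} f, \\
\|u - v\| &= \|{-t y}\| = t .
\end{align*}
% So (u, v) is admissible in (38) and the two quotients coincide.
% Conversely, for t > 0 and an admissible pair (u, v) in (38), take
%   x := (1-\lambda) u + \lambda v, \qquad y := (v - u)/t ,
% which recovers u = x - \lambda t y and v = x + (1-\lambda) t y with \|y\| = 1.
```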

Remark 3.7: Modification for positive homogeneous functions

For purely symmetric reasons we also constrain x, y to have unit norms in the above definition of uniform smoothness, when the underlying function is positive homogeneous. This is for the sake of a


beautiful duality result, although the original definition is not really broken (unlike the uniformly convex case).

As we shall see, uniform convexity is a dual notion to uniform smoothness, hence there are a lot of “similarities” between the two, with one notable exception: behaviour with respect to restriction. It is easily verified that if f is σ-convex, then $f + \delta_A$ is also σ-convex, while uniform smoothness does not enjoy this property (see the next proposition). Nevertheless, we can still define uniform smoothness¹ of f w.r.t. some convex set A by replacing dom f in (??)-(??) with dom f ∩ A, in which case we denote the gage of uniform smoothness by $\rho_{f,A}$ (hence $\rho_f = \rho_{f,\operatorname{dom} f}$).

Proposition 3.8

Let ∅ ≠ A ⊆ dom f and $t_0 > 0$. If $\rho_{f,A}(t_0) < \infty$ then $A + t_0 B_X \subseteq \operatorname{dom} f$. In particular, if $\rho_f(t_0) < \infty$ then dom f = X, hence $\rho_f$ is l.s.c.

Proof: From definition (??) and the given condition $\rho_{f,A}(t_0) < \infty$ it follows that $A + t_0 B_X \subseteq \operatorname{dom} f$.

We have already seen that $\rho_f$ is nondecreasing if f is closed convex. Much more is true for $\sigma_f$, although less transparently.

Proposition 3.9

The function $\sigma_f(t)/t^2$ is nondecreasing, hence also $\sigma_f(t)/t$ and $\sigma_f(t)$.

Proof: Let t > 0 and 1 < c < 2 be such that $\sigma_f(ct) < \infty$ (if no such t and c exist, then $\sigma_f \equiv \infty$, hence nothing to prove). Fix ε > 0; there exist x ∈ dom f, y ∈ dom f and 0 < λ ≤ 1/2 such that ‖y − x‖ = ct and

$$\sigma_f(ct) + \varepsilon > \frac{(1-\lambda) f(x) + \lambda f(y) - f((1-\lambda)x + \lambda y)}{\lambda(1-\lambda)}.$$

Consider $x_\lambda := (1-\lambda)x + \lambda y$ and $x_c := (1 - c^{-1})x + c^{-1}y$; we have $\|x_c - x\| = t$ and $x_\lambda = (1 - c\lambda)x + c\lambda x_c$ (where cλ ∈ (0, 1)). Hence

$$(1-\lambda) f(x) + \lambda f(y) - \lambda(1-\lambda)\sigma_f(ct) - \varepsilon\lambda(1-\lambda) < f(x_\lambda) \le (1 - c\lambda) f(x) + c\lambda f(x_c) - c\lambda(1 - c\lambda)\sigma_f(t),$$

and

$$f(x_c) \le (1 - c^{-1}) f(x) + c^{-1} f(y) - c^{-1}(1 - c^{-1})\sigma_f(ct).$$

A bit of simplification yields

$$c^2 \sigma_f(t) < \sigma_f(ct) + \varepsilon c\,\frac{1-\lambda}{1 - c\lambda} \le \sigma_f(ct) + \frac{\varepsilon c}{2 - c}.$$

Letting ε → 0 proves $c^2\sigma_f(t) \le \sigma_f(ct)$ for any t > 0 and 1 < c < 2. Induction gives $c^{2n}\sigma_f(t) \le \sigma_f(c^n t)$ for any n ∈ N, t > 0, 1 < c < 2, hence $c^2\sigma_f(t) \le \sigma_f(ct)$ for any t > 0 and c > 1.
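The "simplification" step in the proof above can be traced as follows. To keep the sketch independent of the symbol used for the gage, we write g for the gage that the proof manipulates and abbreviate a := g(t), b := g(ct); these shorthands are introduced here and are not the note's notation.

```latex
% Substitute the bound on f(x_c) into the right-hand side of the first chain:
\begin{align*}
c\lambda f(x_c) &\le (c-1)\lambda f(x) + \lambda f(y) - \lambda (1 - c^{-1})\, b, \\
(1 - c\lambda) f(x) + c\lambda f(x_c)
  &\le (1-\lambda) f(x) + \lambda f(y) - \lambda (1 - c^{-1})\, b .
\end{align*}
% Cancel (1-\lambda) f(x) + \lambda f(y) on both sides of the chain and divide by \lambda:
\begin{align*}
c (1 - c\lambda)\, a
  &< \bigl[(1-\lambda) - (1 - c^{-1})\bigr]\, b + \varepsilon (1-\lambda)
   = \frac{1 - c\lambda}{c}\, b + \varepsilon (1-\lambda), \\
c^2 a &< b + \varepsilon c\, \frac{1-\lambda}{1 - c\lambda} .
\end{align*}
% Since \lambda \mapsto (1-\lambda)/(1 - c\lambda) is increasing on (0, 1/2] for c > 1,
% it is at most (1/2)/(1 - c/2) = 1/(2 - c), which gives the final bound.
```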

We have said that uniform convexity and uniform smoothness are “dual” to each other; here is the formal statement.

Theorem 3.10

If f is σ-convex then f∗ is σ∗-smooth, and if f is ρ-smooth then f∗ is ρ∗-convex.

Proof:

¹ When talking about uniform convexity/smoothness on some convex set A, our current strategy of definition seems to provide a consistent treatment: just replace dom f with dom f ∩ A. There is an important subtlety though: in Eq. (??), one of x and y, say x, could lie in dom f − A and yet f(x) < ∞. Had we restricted f to the set A, f(x) would have to be ∞. Said more explicitly, the uniform convexity of f on some convex set A is the same as the uniform convexity of $f + \delta_A$, while this is NOT so for uniform smoothness.


Remark 3.11

It is tempting to say f is σ-convex iff f∗ is σ∗-smooth. This is indeed so if f is l.s.c.
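A quick numerical check of the duality on the simplest example, offered as a sketch under stated assumptions. We take X = R and f(x) = (μ/2)x², and we read the starred modulus as the Fenchel conjugate of the modulus extended evenly to R; the note does not define it explicitly in this section. Here f satisfies (31) and (36) with equality for m(t) = (μ/2)t², its conjugate is f∗(y) = y²/(2μ), and m∗(s) = s²/(2μ). The constant MU and all function names are hypothetical.

```python
import random

MU = 3.0  # hypothetical curvature constant, chosen only for illustration

def f(x):
    """f(x) = (MU / 2) * x^2 on the real line."""
    return 0.5 * MU * x * x

def f_conj(y):
    """Closed-form Fenchel conjugate of f: f*(y) = y^2 / (2 * MU)."""
    return y * y / (2.0 * MU)

def modulus(t):
    """m(t) = (MU / 2) * t^2; f satisfies (31) and (36) with equality for this m."""
    return 0.5 * MU * t * t

def conjugate(g, s, lo=-10.0, hi=10.0, steps=200001):
    """Grid approximation of g*(s) = sup_t (s * t - g(t)) over lo <= t <= hi."""
    best = float("-inf")
    for i in range(steps):
        t = lo + (hi - lo) * i / (steps - 1)
        best = max(best, s * t - g(t))
    return best

def quotient(g, x, y, lam):
    """The quotient shared by (33) and (38), evaluated for the function g."""
    z = (1 - lam) * x + lam * y
    return ((1 - lam) * g(x) + lam * g(y) - g(z)) / (lam * (1 - lam))
```

For random x, y and λ, `quotient(f, x, y, lam)` should agree with `modulus(abs(x - y))`, and `quotient(f_conj, x, y, lam)` with `conjugate(modulus, abs(x - y))`, up to the grid resolution.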

