
Università degli Studi di Genova

Facoltà di Scienze Matematiche, Fisiche e Naturali

Dipartimento di Matematica

Master's thesis in Mathematics

Compressive Sensing and Sparse Recovery

Ilaria Giulini

Advisor: Prof. Ernesto De Vito

Co-advisor: Prof. Filippo De Mari

Academic Year 2011/2012


Contents

1 Introduction

2 Basics on Probability
  2.1 Preliminaries and Notation
  2.2 Concentration inequalities
  2.3 Moments and Tails
  2.4 Rademacher sequence and Symmetrization
  2.5 Hoeffding's inequality for Rademacher sums
  2.6 Rudelson's Lemma
  2.7 Dudley's Inequality
  2.8 Deviation Inequalities for Suprema of Empirical Processes

3 Recovery via ℓ1 minimization
  3.1 Sparsity
  3.2 Compressive Sensing
  3.3 The Null Space Property
  3.4 The Restricted Isometry Property
  3.5 Recovery of Individual Vectors
  3.6 Coherence
  3.7 RIP for Gaussian and Bernoulli Random Matrices
  3.8 Compressive Sensing and Gelfand widths

4 Bounded Orthonormal Systems
  4.1 Nonuniform Recovery
  4.2 Uniform Recovery
  4.3 Proof of Theorem 4.4
  4.4 Proof of Theorem 4.10
    4.4.1 Proof of Lemma 4.16

5 Equivalent approach
  5.1 Subgradient
  5.2 The LASSO estimator

6 Appendix
  6.1
  6.2


Chapter 1

Introduction

Given a high resolution image, the first thing most people do, if they plan to send it by e-mail or post it on the web, is to compact it to a more manageable size. Indeed, dealing with high-dimensional data requires finding the most concise representation of a signal, that is, a compressed version of it.

In the digital imaging world, a signal is an image, and a sample of the image is typically a pixel, that is, a measurement of light intensity at a particular point.

Typically, in order to obtain a compressed version of a signal, one acquires the full signal, computes the complete set of coefficients, encodes the largest coefficients and sets all the others to zero. This process of massive data acquisition followed by compression is extremely wasteful.

Compressive sensing suggests ways to economically translate data into an already compressed form. For instance, modern transform coders, such as JPEG, use the fact that many signals can be stored (or transmitted) using only a small number of adaptively chosen coefficients rather than all the signal samples.

The history of compressive sensing begins with the works of Nyquist, Shannon, Kotelnikov and Whittaker in 1949, [20, 26, 23, 31]. Their results show that a band-limited signal can be perfectly reconstructed from a set of uniformly spaced samples taken at the so-called Nyquist rate of twice the highest frequency present in the signal of interest.

The theory of compressive sensing was introduced in the 2000s by the works of D. Donoho, J. Romberg, E. Candès and T. Tao, [4, 6, 7, 8, 9, 13], and shows that signals and images can be reconstructed from far fewer measurements than what is usually believed necessary. Several compressive sensing algorithms were introduced. The basis pursuit, or ℓ1 minimization, algorithm was empirically introduced in the 1970s and then studied mathematically in the 1990s by Chen, Donoho and Saunders, [10]. Other algorithms are matching pursuit, introduced by Mallat and Zhang, [12], and the LASSO (Least Absolute Shrinkage and Selection Operator), introduced by Tibshirani, [29].

In the compressive sensing framework, the achievable resolution is controlled primarily by the information content of the image. An image with low information content can be reconstructed perfectly from a small number of measurements. If such images were rare or unusual, this news might not be very exciting. But virtually all real-world images have low information content. For instance, an image may be many megapixels in size, but when viewed in the right basis many coefficients may be negligible, and so the image may be compressed into a file of much smaller size without seriously affecting the image quality. This is the basis behind algorithms such as JPEG. The idea of compressive sensing is then to use the low information content of most real-life images to circumvent the Shannon-Nyquist sampling theorem.

Many types of real-world signals, for example images and sounds, can be viewed as an N-dimensional vector x = ᵗ(x₁, . . . , x_N) ∈ R^N. To acquire such a signal we consider a linear measurement model in which we measure an m-dimensional vector y = Ax ∈ R^m, for some measurement matrix A ∈ M_{m,N}(R). This leads to the following question: how many measurements do we need to make in order to recover the signal x exactly from y? In other words, when can we solve the equation Ax = y?

By linear algebra we know that if m ≥ N and the matrix A has full rank, we can solve Ax = y uniquely. If instead m < N, the problem is underdetermined, even if A has full rank, and we cannot determine x completely.

Let us suppose x is k-sparse, which means that at most k of the coefficients of x are nonzero. As already said, sparsity is a simple but effective model for many real-life signals. Hence, if a signal x ∈ R^N is k-sparse, in principle one should only need k (with k ≪ N) measurements to exactly reconstruct x.

Knowing that a unique solution exists is not the same thing as being able to find it. Indeed, the problem is that we do not know in advance which k coordinates of x are nonzero. The natural approach for reconstructing a sparse signal x ∈ R^N from y = Ax is to find the sparsest solution to Ax = y, or in other words, to solve the non-convex minimization problem

min ‖z‖0 subject to Az = y,

where ‖z‖_0 is the so-called ℓ0-norm of the vector, that is, the number of nonzero entries of z. This naive approach is equivalent to trying all the possibilities until you hit on the right one, but this turns out to be a hopelessly slow algorithm.

We can draw an analogy with the game of "20 questions". If you have to find a number between 1 and N, the worst way to proceed is to guess individual numbers; this is the analog of measuring individual pixels. On average, it will take you N/2 guesses. By contrast, asking questions like "Is the number less than N/2?", then "Is the number less than N/4?", and so on, it is possible to find the concealed number with at most log₂ N questions. If N is a large number, this is an enormous speed-up. Notice that the "20 questions" strategy is adaptive: you are allowed to adapt your questions in light of the previous answers.

In the digital signal world then, given an image consisting of a few sparse dots or a few sharp lines, the worst way to sample it is by capturing individual pixels. Candès and Tao found a way to make the measurement process nonadaptive. Their approach is called ℓ1-minimization, and the idea is to find the solution to Az = y with the smallest ℓ1-norm, that is,

min ‖z‖1 subject to Az = y,

where

‖z‖_1 = ‖(z₁, . . . , z_N)‖_1 = ∑_{j=1}^N |z_j|.
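To make the ℓ1 approach concrete, the following is a minimal numerical sketch of basis pursuit: the ℓ1 problem above is rewritten as a linear program by splitting z into nonnegative parts and solved with SciPy. The function name basis_pursuit and the use of scipy.optimize.linprog are illustrative choices, not part of the thesis.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(A, y):
    """Solve min ||z||_1 subject to A z = y via the standard LP reformulation.

    Split z = z_plus - z_minus with z_plus, z_minus >= 0, so that
    ||z||_1 = sum(z_plus + z_minus) and A z = A z_plus - A z_minus.
    """
    m, N = A.shape
    c = np.ones(2 * N)                 # objective: sum of the split variables
    A_eq = np.hstack([A, -A])          # equality constraint A z_plus - A z_minus = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:N] - res.x[N:]
```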

They proved, [8], that in many cases this is also the unique solution with the smallest ℓ0-norm, that is, ℓ1 minimization ensures sparse recovery. This result clearly does not hold for any measurement matrix. Indeed, the matrix A should satisfy some conditions, namely the Null Space Property (NSP) and the Restricted Isometry Property (RIP), which are also necessary.

It is possible to deterministically construct matrices of size m × N that satisfy the restricted isometry property of order k, but such constructions require m to be relatively large. In particular, we obtain a quadratic dependence between the minimal number of measurements and the sparsity,

m ≥ Ck².

These limitations can be overcome by randomizing the matrix construction, for example choosing the entries according to a Gaussian distribution. In this context, a Gaussian random matrix satisfies the restricted isometry property provided

m ≥ Ck log(N/k).
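As a rough illustration of this scaling and of the nonadaptive measurement model y = Ax, the sketch below draws a Gaussian measurement matrix with m on the order of k log(N/k) and recovers a random k-sparse vector with the basis_pursuit routine sketched earlier. The constant factor 4 and all sizes are arbitrary illustrative choices, not values derived in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
N, k = 400, 10
m = int(4 * k * np.log(N / k))                 # heuristic choice m ~ C k log(N/k)

# k-sparse signal with random support and Gaussian nonzero entries
x = np.zeros(N)
support = rng.choice(N, size=k, replace=False)
x[support] = rng.standard_normal(k)

A = rng.standard_normal((m, N)) / np.sqrt(m)   # Gaussian measurement matrix
y = A @ x                                      # nonadaptive linear measurements

x_hat = basis_pursuit(A, y)                    # see the sketch above
print("recovery error:", np.linalg.norm(x_hat - x))
```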

Compressive sensing enables a large reduction in the sampling and computation costs for sensing signals that have not only a sparse representation but, more generally, a compressible representation, where by compressible representation we mean that the signal is well approximated by sparse signals.

In the thesis we have considered the problem of sparse recovery by ℓ1 minimization in the case of structured random matrices. Given an orthonormal system of complex-valued functions, uniformly bounded in L∞, the corresponding measurement matrix has as entries these functions evaluated at some random sampling points. So the structure is determined by the function system, while the randomness comes from the sampling points. The presentation is mainly based on the review papers of H. Rauhut [25] and H. Rauhut and M. Fornasier [17]. The thesis is organized as follows.

In Chapter 2, we introduce the probabilistic tools we need to prove the statistical properties of sparse recovery. We discuss the relation between moment and tail estimates. We introduce Rademacher sequences and we prove the Symmetrization Lemma (Lemma 2.15) and Hoeffding's inequality for real Rademacher sums (Proposition 2.19), based on Khintchine inequalities. We also prove Rudelson's Lemma (Theorem 2.28) and Dudley's inequality on the expectation of the supremum of a subgaussian process (Theorem 2.30). These latter results are necessary to prove the two main theorems concerning ℓ1 sparse recovery in the case of structured random matrices.

In Chapter 3, we present the concept of sparsity and the recovery of sparse vectors using ℓ1 minimization. We show that requiring that a measurement matrix A satisfies the null space property (Definition 3.5) is equivalent to ensuring ℓ1 sparse recovery from y = Ax. In particular, it is sufficient to prove that the measurement matrix satisfies the restricted isometry property, that is, for every k-sparse vector z,

(1 − δ)‖z‖_2² ≤ ‖Az‖_2² ≤ (1 + δ)‖z‖_2²

for small values of δ. Since it is difficult to prove these properties for deterministic matrices, we introduce Gaussian random matrices. We prove that Gaussian random matrices satisfy, with high probability, the restricted isometry property. As a consequence, with probability at least 1 − ε, every sparse vector can be recovered by ℓ1 minimization, provided that the minimal number of measurements m satisfies

m ≥ C(k log(N/k) + log(ε⁻¹)).

This result shows that in the case of Gaussian random matrices there exists, up to a log factor, a linear dependence between the minimal number of measurements and the sparsity. Moreover, we show that the log factor cannot be removed, and hence Gaussian random matrices provide optimal estimates for the minimal number of required measurements. Since the use of Gaussian random matrices in applications is limited, structured random matrices were introduced.

Hence, in Chapter 4, we show recovery results for ℓ1 minimization in connection with structured random matrices associated to bounded orthonormal systems of complex-valued functions. Given an orthonormal system φ₁, . . . , φ_N on a probability space D, a structured random matrix is a matrix whose entries are the functions φ_j evaluated at the random sampling points t₁, . . . , t_m ∈ D. Following [25] we have considered both uniform and nonuniform recovery. A uniform recovery result means that, once we have chosen a random matrix, with high probability every sparse signal can be recovered by ℓ1 minimization, while a nonuniform recovery result means that, once the measurement matrix and the sparse vector are fixed, with high probability the sparse signal can be recovered. The main theorem concerning nonuniform sparse recovery (Theorem 4.4) states that a k-sparse signal, whose nonzero entries are chosen at random, is the unique solution of the ℓ1 minimization problem, with probability at least 1 − ε, provided

m ≥ Ck log(6N/ε).

Concerning uniform recovery, the result is slightly worse. Theorem 4.10 ensures that with probability at least 1 − ε every k-sparse vector can be recovered by ℓ1 minimization, provided

m ≥ Ck log⁴(N).


Chapter 2

Basics on Probability

In this chapter we first introduce some notation and then we prove some important facts from probability theory.

2.1 Preliminaries and Notation

Let us first give some notation. We consider a vector x ∈ C^N as a column vector and we denote by ᵗx the corresponding row vector.

Given p ≥ 1, the ℓp norm of a vector x = ᵗ(x₁, . . . , x_N) ∈ C^N is defined as

‖x‖_p := (∑_{j=1}^N |x_j|^p)^{1/p},    ‖x‖_∞ := max_{j=1,...,N} |x_j|.

In this context it is useful to extend the notion of ℓp norm to the case p < 1. In this case we obtain a quasinorm, meaning that the triangle inequality is not satisfied. The unit ball in ℓ_p^N is defined as

B_p^N = {x ∈ C^N : ‖x‖_p ≤ 1}.

Figure 0 shows the unit sphere B_p in R² in the cases p = 1, p = 2, p = ∞ and p = 1/2.

Figure 0: Unit spheres in R² for the ℓp norms.
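Since the original figure is not reproduced here, the following matplotlib sketch redraws the boundaries of the unit balls B_p² in R² for p = 1/2, 1, 2, ∞; it is only an illustrative way to regenerate a picture of the kind described by the caption, not the thesis' figure.

```python
import numpy as np
import matplotlib.pyplot as plt

def lp_unit_ball_boundary(p, n=2001):
    """Points on {x in R^2 : ||x||_p = 1} (a quasinorm sphere for p < 1)."""
    t = np.linspace(0, 2 * np.pi, n)
    c, s = np.cos(t), np.sin(t)
    # scale each direction so that |x1|^p + |x2|^p = 1
    r = (np.abs(c) ** p + np.abs(s) ** p) ** (-1.0 / p)
    return r * c, r * s

fig, ax = plt.subplots(figsize=(4, 4))
for p, label in [(0.5, "p = 1/2"), (1, "p = 1"), (2, "p = 2")]:
    ax.plot(*lp_unit_ball_boundary(p), label=label)
square = np.array([[1, 1], [-1, 1], [-1, -1], [1, -1], [1, 1]])  # p = infinity
ax.plot(square[:, 0], square[:, 1], label="p = inf")
ax.set_aspect("equal"); ax.legend(); plt.show()
```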

We have that, for p > q, the ℓq ball is contained in the ℓp ball, that is:


Lemma 2.1 For p > q,

‖x‖_p ≤ ‖x‖_q.  (2.1)

Proof. Without loss of generality we may assume ‖x‖_q = 1, since the ℓq norm is homogeneous of degree one. Since ‖x‖_q^q = 1, for every j = 1, . . . , N we have |x_j| ≤ 1, where x = ᵗ(x₁, . . . , x_N). It follows that, for p > q,

|x_j|^p ≤ |x_j|^q,

and summing over j gives ‖x‖_p^p ≤ ‖x‖_q^q = 1, which implies ‖x‖_p ≤ 1 = ‖x‖_q. □

The operator norm of a matrix A ∈ M_{m,N}(C) from ℓ_p^N to ℓ_p^m is denoted by

‖A‖_{p→p} = max_{‖x‖_p=1} ‖Ax‖_p.

We now recall some explicit expressions for the operator norm of a matrix A ∈ M_{m,N}(C).

Lemma 2.2 Let A ∈ M_{m,N}(C). Then:

1. ‖A‖_{1→1} = max_{k=1,...,N} ∑_{j=1}^m |a_{jk}|

2. ‖A‖_{∞→∞} = max_{j=1,...,m} ∑_{k=1}^N |a_{jk}|

3. ‖A‖_{2→2} = σ_max(A) = √(λ_max(A*A)),

where σ_max(A) denotes the largest singular value of A and λ_max(A*A) ≥ 0 is the largest eigenvalue of the positive matrix A*A.

Proof. We first prove 1. For every x,

‖Ax‖_1 = ∑_{j=1}^m |∑_{k=1}^N a_{jk} x_k| ≤ ∑_{j=1}^m ∑_{k=1}^N |a_{jk}| |x_k| ≤ (max_{k=1,...,N} ∑_{j=1}^m |a_{jk}|) ‖x‖_1,

which means

‖A‖_{1→1} ≤ max_{k=1,...,N} ∑_{j=1}^m |a_{jk}|.

Furthermore,

‖A‖_{1→1} = sup_{‖x‖_1=1} ‖Ax‖_1 ≥ max_{k=1,...,N} ‖Ae_k‖_1 = max_{k=1,...,N} ∑_{j=1}^m |a_{jk}|.

We now prove 2. For every x,

‖Ax‖_∞ = max_{j=1,...,m} |∑_{k=1}^N a_{jk} x_k| ≤ max_{j=1,...,m} ∑_{k=1}^N |a_{jk}| |x_k| ≤ (max_{j=1,...,m} ∑_{k=1}^N |a_{jk}|) ‖x‖_∞.

Moreover, for every l = 1, . . . , m let x^l = ᵗ(sgn(a_{l1}), . . . , sgn(a_{lN})). Then ‖x^l‖_∞ = 1 and

‖Ax^l‖_∞ = max_{j=1,...,m} |∑_{k=1}^N a_{jk} sgn(a_{lk})| ≥ |∑_{k=1}^N a_{lk} sgn(a_{lk})| = ∑_{k=1}^N |a_{lk}|,

which implies

‖A‖_{∞→∞} ≥ max_{l=1,...,m} ‖Ax^l‖_∞ ≥ max_{l=1,...,m} ∑_{k=1}^N |a_{lk}|.

Finally we prove 3. We have

‖Ax‖_2² = ⟨Ax, Ax⟩ = ⟨A*Ax, x⟩.

Since A*A is diagonalizable and positive, there exists a unitary matrix U such that U⁻¹A*AU = diag(λ₁, . . . , λ_N), where λ_j ≥ 0 for all j = 1, . . . , N. Let y = U⁻¹x; then

‖Ax‖_2² = ⟨A*AUy, Uy⟩ = ∑_{k=1}^N λ_k |y_k|² ≤ (max_{k=1,...,N} λ_k) ⟨y, y⟩ = (max_{k=1,...,N} λ_k) ‖y‖_2² = (max_{k=1,...,N} λ_k) ‖x‖_2².

Then

‖A‖_{2→2} ≤ √(max_{k=1,...,N} λ_k) = √(λ_max).

Conversely, let λ_max be the largest eigenvalue of A*A and x_{λ_max} a corresponding eigenvector with ‖x_{λ_max}‖_2 = 1. Then

‖Ax_{λ_max}‖_2² = ⟨A*Ax_{λ_max}, x_{λ_max}⟩ = λ_max ⟨x_{λ_max}, x_{λ_max}⟩ = λ_max ‖x_{λ_max}‖_2²,

which means ‖A‖_{2→2} = √(λ_max). □
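The formulas of Lemma 2.2 are easy to sanity-check numerically; the sketch below compares them with numpy's built-in matrix norms on a random complex matrix, and also checks the Schur test of Proposition 2.3 stated below. This is only an illustrative check, not part of the thesis.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 7)) + 1j * rng.standard_normal((5, 7))

norm_1   = np.abs(A).sum(axis=0).max()              # max column sum  = ||A||_{1->1}
norm_inf = np.abs(A).sum(axis=1).max()              # max row sum     = ||A||_{inf->inf}
norm_2   = np.linalg.svd(A, compute_uv=False)[0]    # largest singular value = ||A||_{2->2}

assert np.isclose(norm_1,   np.linalg.norm(A, 1))
assert np.isclose(norm_inf, np.linalg.norm(A, np.inf))
assert np.isclose(norm_2,   np.linalg.norm(A, 2))
# Schur test (Proposition 2.3): ||A||_{2->2} <= sqrt(||A||_{1->1} ||A||_{inf->inf})
assert norm_2 <= np.sqrt(norm_1 * norm_inf) + 1e-12
```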

It is clear that ‖A‖_{1→1} = ‖A*‖_{∞→∞}.

Let us now introduce a version of the Schur test.

Proposition 2.3 (Schur test) Let A ∈ M_{m,N}(C). Then

‖A‖_{2→2} ≤ √(‖A‖_{1→1} ‖A‖_{∞→∞}) ≤ max{‖A‖_{1→1}, ‖A‖_{∞→∞}}.

Proof. By definition, ‖Ax‖_2² = ∑_{i=1}^m |(Ax)_i|². We observe that, using the Cauchy-Schwarz inequality,

|(Ax)_i|² ≤ (∑_{j=1}^N |a_{ij}| |x_j|)² = (∑_{j=1}^N |a_{ij}|^{1/2} |a_{ij}|^{1/2} |x_j|)² ≤ (∑_{j=1}^N |a_{ij}|)(∑_{j=1}^N |a_{ij}| |x_j|²) ≤ ‖A‖_{∞→∞} ∑_{j=1}^N |a_{ij}| |x_j|².

It follows that

‖Ax‖_2² ≤ ‖A‖_{∞→∞} ∑_{i=1}^m ∑_{j=1}^N |a_{ij}| |x_j|² = ‖A‖_{∞→∞} ∑_{j=1}^N (∑_{i=1}^m |a_{ij}|) |x_j|² ≤ ‖A‖_{∞→∞} ‖A‖_{1→1} ‖x‖_2².

Hence, taking the maximum over the vectors x ∈ C^N with ‖x‖_2 = 1,

‖A‖_{2→2} ≤ √(‖A‖_{1→1} ‖A‖_{∞→∞}).

Using the fact that, for α, β ≥ 0, √(αβ) ≤ max{α, β}, we conclude the proof. □

In particular, for a Hermitian matrix A = A*,

‖A‖_{2→2} ≤ ‖A‖_{1→1} = ‖A‖_{∞→∞}.  (2.2)

Moreover, for a Hermitian matrix it holds that

‖A‖_{2→2} = sup_{‖x‖_2=1} |⟨Ax, x⟩|.  (2.3)

Other norms we will use are the Schatten norms. Before defining the Schatten norms we recall the notion of singular values.

Given a matrix A ∈ M_{m,N}(C) of rank r, there exist a "diagonal" matrix Σ ∈ M_{m,N}(C) and unitary matrices U ∈ M_{m,m}(C), V ∈ M_{N,N}(C), that is U*U = I = UU* and V*V = I = VV*, such that

A = UΣV*.

The diagonal entries Σ_{jj} = σ_j of Σ are nonnegative and can be arranged in order of decreasing magnitude. The positive ones are called the singular values of A. It follows that

A*A = VΣ*ΣV*   and   AA* = UΣΣ*U*.

Note that both Σ*Σ and ΣΣ* are square diagonal matrices whose first r diagonal entries are the σ_j² and whose remaining diagonal entries are equal to zero.

The singular value decomposition (SVD) has the form

A = ∑_{j=1}^r σ_j(A) u_j v_j*,

where r is the rank of A, u₁, . . . , u_r ∈ C^m and v₁, . . . , v_r ∈ C^N are orthonormal vectors, and σ₁(A) ≥ · · · ≥ σ_r(A) > 0 are the singular values of A. Let S₁ and S₂ be the linear spans of u₁, . . . , u_r and v₁, . . . , v_r respectively. The pair of linear vector spaces (S₁, S₂) is called the support of A. We denote by P_S the orthogonal projector onto a linear vector space S.

Let σ(A) = (σ₁(A), . . . , σ_r(A)) be the sequence of singular values of the matrix A. We define the Schatten p-norm as

‖A‖_{S_p} := ‖σ(A)‖_p,   1 ≤ p ≤ ∞.


Note that in the case p = ∞ we have ‖A‖_{S_∞} = σ₁(A). The operator norm is the Schatten norm with p = ∞,

‖A‖_{2→2} = ‖σ(A)‖_∞ = ‖A‖_{S_∞}.

By the analogous property of the ℓp vector norms, equation (2.1), we have that

‖A‖_{S_q} ≤ ‖A‖_{S_p}   for q ≥ p.

In particular,

‖A‖_{2→2} ≤ ‖A‖_{S_p}   for all 1 ≤ p ≤ ∞.

If A has rank r, then A*A has at most r nonzero eigenvalues. Then

‖σ(A)‖_p = (∑_{j=1}^r |σ_j(A)|^p)^{1/p} ≤ (max_{j=1,...,r} |σ_j(A)|) r^{1/p},

and it follows that

‖A‖_{S_p} ≤ r^{1/p} ‖A‖_{2→2}.  (2.4)

Let D² = ΣΣ*. Since, for square matrices, the trace is cyclic, that is tr(AB) = tr(BA), we have that, for every n ∈ N,

‖A‖_{S_{2n}}^{2n} = ‖σ(A)‖_{2n}^{2n} = tr(D^{2n}) = tr(D^{2n} U*U) = tr(U D^{2n} U*) = tr((U D² U*)^n) = tr((AA*)^n) = tr((A*A)^n).  (2.5)
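The Schatten norms and the identity (2.5) can likewise be checked numerically from the SVD; the helper below is an illustrative sketch (the function name schatten_norm is not from the thesis).

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 6)) + 1j * rng.standard_normal((4, 6))

def schatten_norm(A, p):
    """||A||_{S_p} = ||sigma(A)||_p, computed from the singular values."""
    s = np.linalg.svd(A, compute_uv=False)
    return np.linalg.norm(s, p)

n = 3
lhs = schatten_norm(A, 2 * n) ** (2 * n)
rhs = np.trace(np.linalg.matrix_power(A.conj().T @ A, n)).real   # tr((A*A)^n), cf. (2.5)
assert np.isclose(lhs, rhs)

# monotonicity in p and the rank bound (2.4) with p = 2
r = np.linalg.matrix_rank(A)
assert schatten_norm(A, 4) <= schatten_norm(A, 2) + 1e-12
assert schatten_norm(A, 2) <= r ** 0.5 * np.linalg.norm(A, 2) + 1e-12
```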

Lemma 2.4 Given A ∈ M_{m,N}(C), it holds that

‖A‖_{2→2}² = ‖A*A‖_{2→2}.

Proof. Since ‖Ax‖_2² = |⟨A*Ax, x⟩| and the matrix A*A is self-adjoint, equation (2.3) gives

‖A‖_{2→2}² = sup_{‖x‖_2=1} |⟨A*Ax, x⟩| = ‖A*A‖_{2→2}. □

For every subset T ⊂ {1, . . . , N} we denote by x_T ∈ C^N the vector which coincides with x ∈ C^N on the entries in T and is zero outside T, while x_T ∈ C^{|T|} also denotes the vector x restricted to the entries in T (which of the two is meant will be clear from the context).

Similarly, A_T denotes the submatrix of A corresponding to the columns indexed by T; that is, if A = (a₁| · · · |a_N) with a_j ∈ C^m, then A_T = (a_j)_{j∈T}.

Further, the cardinality of T is denoted by |T|.


2.2 Concentration inequalities

Let (Ω, Σ, P) be a probability space, where Σ is a σ-algebra on the sample space Ω and P is a probability measure on (Ω, Σ).

We recall that a random vector is a function X : Ω → R^n such that for any Borel set A ⊂ R^n,

X⁻¹(A) = {ω ∈ Ω : X(ω) ∈ A} ∈ Σ.

If n = 1, X is called a random variable.

A random variable X has finite expectation if X is integrable with respect to the probability measure P, that is,

∫_Ω |X(ω)| dP(ω) < ∞.

We denote the expectation of X by

E[X] = ∫_Ω X(ω) dP(ω).

If X is positive and not integrable we set E[X] = +∞. The quantities E[|X|^p] are called moments of order p. The following lemma holds.

Lemma 2.5 For 1 ≤ p < ∞ it holds that

E[|X + Y|^p]^{1/p} ≤ E[|X|^p]^{1/p} + E[|Y|^p]^{1/p}.

Proof. We have

E[|X + Y|^p]^{1/p} ≤ E[(|X| + |Y|)^p]^{1/p} = (∫ (|X| + |Y|)^p dP)^{1/p} = ‖|X| + |Y|‖_p ≤ ‖X‖_p + ‖Y‖_p = E[|X|^p]^{1/p} + E[|Y|^p]^{1/p}. □

For any set E ∈ Σ we denote the indicator function of E by

1_E(ω) = 1 if ω ∈ E,   1_E(ω) = 0 if ω ∉ E.

The moment of order p of a random variable X can be expressed as follows.

Lemma 2.6 Let X be a random variable. For p > 0 it holds that

E[|X|^p] = p ∫_0^∞ P[|X| ≥ t] t^{p−1} dt.


Proof. Using the Fubini theorem and a change of variable,

E[|X|^p] = ∫_Ω |X(ω)|^p dP(ω) = ∫_Ω ∫_0^∞ 1_{{|X|^p ≥ x}}(ω) dx dP(ω)
 = ∫_0^∞ ∫_Ω 1_{{|X|^p ≥ x}}(ω) dP(ω) dx = ∫_0^∞ P[|X|^p ≥ x] dx
 = ∫_0^∞ P[|X|^p ≥ t^p] p t^{p−1} dt = p ∫_0^∞ P[|X| ≥ t] t^{p−1} dt.

In the first line we used the fact that

|X(ω)|^p = ∫_0^{|X(ω)|^p} dx = ∫_0^∞ θ(|X(ω)|^p − x) dx = ∫_0^∞ 1_{{|X|^p ≥ x}}(ω) dx,

where θ(x) = 1 if x ≥ 0 and θ(x) = 0 if x < 0. □
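A quick Monte Carlo experiment illustrates the identity of Lemma 2.6: both sides are estimated from the same sample of an exponential random variable. This is only an illustrative sketch; the distribution, the sample size and the integration grid are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.exponential(scale=1.0, size=50_000)
p = 2.5

lhs = np.mean(X ** p)                               # E[|X|^p]

dt = 0.01
t = np.arange(0.0, 25.0, dt)
tail = np.array([np.mean(X >= ti) for ti in t])     # empirical P[|X| >= t]
rhs = p * np.sum(tail * t ** (p - 1)) * dt          # p * integral of P[|X| >= t] t^(p-1) dt

print(lhs, rhs)   # the two estimates agree up to Monte Carlo and discretization error
```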

Theorem 2.7 (Markov inequality). Let X be a random variable. Then, for p ≥ 1 and all t > 0,

P[|X| ≥ t] ≤ E[|X|^p] / t^p.

Proof. Since P[|X| ≥ t] = E[1_{{|X| ≥ t}}] and t·1_{{|X| ≥ t}} ≤ |X|, we have

|X|^p ≥ (t·1_{{|X| ≥ t}})^p = t^p 1_{{|X| ≥ t}}.

Hence, taking the expectation,

E[|X|^p] ≥ t^p E[1_{{|X| ≥ t}}] = t^p P[|X| ≥ t]. □

Theorem 2.8 (Jensen's inequality). Let f : C^n → R be a convex function and let X ∈ C^n be a random vector. Assume that X and f(X) both have finite expectation. Then

f(E[X]) ≤ E[f(X)].

Proof. Since f is convex, for every x₀ there exists a (real-)linear functional m such that

f(X) ≥ f(x₀) + m(X − x₀).

Choosing x₀ = E[X],

f(X) ≥ f(E[X]) + m(X − E[X]).

Taking the expectation we conclude the proof,

E[f(X)] ≥ f(E[X]) + m(E[X] − E[X]) = f(E[X]). □


Lemma 2.9 (Borel-Cantelli lemma). Let (Ω, Σ, P) be a probability space and let A₁, A₂, . . . ∈ Σ be events. Let

A* = limsup_{n→∞} A_n = ⋂_{n=1}^∞ ⋃_{j=n}^∞ A_j.

If ∑_{n=1}^∞ P[A_n] < ∞, then P[A*] = 0.

Proof. Since A* ⊂ ⋃_{j=n}^∞ A_j for all n, we have P[A*] ≤ ∑_{j=n}^∞ P[A_j] → 0 as n → ∞, whenever ∑_{n=1}^∞ P[A_n] < ∞. □

We now introduce a further definition which will be useful later.

Definition 2.10 A random vector X is called symmetric if X and −X have the same distribution.

2.3 Moments and Tails

Recall that the function t 7→ P[|X| ≥ t] is called the tail of X.

Proposition 2.11 Suppose Z is a random variable satisfying

(E[|Z|^p])^{1/p} ≤ α β^{1/p} p^{1/γ}   for all p ≥ p₀,

for some constants α, β, γ, p₀ > 0. Then

P[|Z| ≥ e^{1/γ} α u] ≤ β e^{−u^γ/γ}

for all u ≥ p₀^{1/γ}.

Proof. By the Markov inequality, we obtain

P[|Z| ≥ t] ≤ E[|Z|^p] / t^p ≤ β (α p^{1/γ} / t)^p.

Choosing t = e^κ α u for an arbitrary κ > 0,

P[|Z| ≥ e^κ α u] ≤ β (p^{1/γ} / (u e^κ))^p.

Choosing p = u^γ ≥ p₀ and κ = 1/γ we obtain the claim. □

In particular, if E[|Z|^p]^{1/p} ≤ α β^{1/p} √p for all p ≥ 2, then choosing γ = 2 in the above proposition, the random variable Z satisfies the subgaussian tail estimate, for all u ≥ √2,

P[|Z| ≥ e^{1/2} α u] ≤ β e^{−u²/2}.

For a random variable satisfying a subgaussian tail estimate the following results hold.

Lemma 2.12 For u > 0,

∫_u^∞ e^{−t²/2} dt ≤ min{√(π/2), 1/u} e^{−u²/2}.


Proof. By a change of variable we obtain

∫_u^∞ e^{−t²/2} dt = ∫_0^∞ e^{−(s+u)²/2} ds = ∫_0^∞ e^{−(s²/2 + su + u²/2)} ds = e^{−u²/2} ∫_0^∞ e^{−s²/2} e^{−su} ds.

On the one hand, e^{−su} ≤ 1 since s, u ≥ 0, so

∫_u^∞ e^{−t²/2} dt ≤ e^{−u²/2} ∫_0^∞ e^{−s²/2} ds = e^{−u²/2} √(π/2).

On the other hand, e^{−s²/2} ≤ 1, so

∫_u^∞ e^{−t²/2} dt ≤ e^{−u²/2} ∫_0^∞ e^{−su} ds = e^{−u²/2} (1/u).

This concludes the proof. □

Lemma 2.13 Let X₁, . . . , X_M be random variables satisfying

P[|X_l| ≥ u] ≤ β e^{−u²/2}   for u ≥ √2, l = 1, . . . , M,

for some β ≥ 1. Then

E[max_{l=1,...,M} |X_l|] ≤ C_β √(log(4βM)),

where C_β = √2 + 1/(4√2 log(4β)).

Proof. According to Lemma 2.6 and using the fact that any probability is bounded by 1, for some α ≥ √2,

E[max_{l=1,...,M} |X_l|] = ∫_0^∞ P[max_{l=1,...,M} |X_l| > u] du
 ≤ ∫_0^α 1 du + ∫_α^∞ P[max_{l=1,...,M} |X_l| > u] du
 = α + ∫_α^∞ P[⋃_{l=1}^M {|X_l| > u}] du
 ≤ α + ∫_α^∞ ∑_{l=1}^M P[|X_l| > u] du
 ≤ α + Mβ ∫_α^∞ e^{−u²/2} du.

Using Lemma 2.12 we obtain

E[max_{l=1,...,M} |X_l|] ≤ α + (Mβ/α) e^{−α²/2},

since α ≥ √2 implies 1/α ≤ √(π/2).

Choosing α = √(2 log(4βM)) ≥ √(2 log 4) ≥ √2, we have

E[max_{l=1,...,M} |X_l|] ≤ √(2 log(4βM)) + Mβ · (1/√(2 log(4βM))) · 1/(4βM)
 = √(log(4βM)) (√2 + 1/(4√2 log(4βM)))
 ≤ C_β √(log(4βM)),

having chosen C_β = √2 + 1/(4√2 log(4β)). □


2.4 Rademacher sequence and Symmetrization

Definition 2.14 A Rademacher variable is a random variable which takes the values +1 and −1 with probability 1/2 each.

A sequence ε = (ε₁, . . . , ε_M) of independent Rademacher variables ε_j is called a Rademacher sequence. A sum of the form ∑_{j=1}^M ε_j x_j, where the x_j may be scalars, vectors or matrices, is called a Rademacher sum.

Lemma 2.15 (Symmetrization). Assume ξ = (ξ_j)_{j=1,...,M} is a sequence of random vectors in C^n (equipped with a seminorm ‖·‖), having expectations E[ξ_j] = x_j. Then, for 1 ≤ p < ∞,

(E[‖∑_{j=1}^M (ξ_j − x_j)‖^p])^{1/p} ≤ 2 (E[‖∑_{j=1}^M ε_j ξ_j‖^p])^{1/p},  (2.6)

where ε = (ε_j)_{j=1,...,M} is a Rademacher sequence independent of ξ.

Proof. Let ξ′ = (ξ′₁, . . . , ξ′_M) be an independent copy of the sequence of random vectors (ξ₁, . . . , ξ_M). Since E[ξ′_j] = x_j, by Jensen's inequality it holds that

E[‖∑_{j=1}^M (ξ_j − x_j)‖^p] = E[‖∑_{j=1}^M (ξ_j − E[ξ′_j])‖^p] ≤ E[‖∑_{j=1}^M (ξ_j − ξ′_j)‖^p] = E[‖∑_{j=1}^M ε_j(ξ_j − ξ′_j)‖^p],

since (ξ_j − ξ′_j)_{j=1,...,M} has the same distribution as (ε_j(ξ_j − ξ′_j))_{j=1,...,M}. Furthermore,

(E[‖∑_{j=1}^M (ξ_j − x_j)‖^p])^{1/p} ≤ (E[‖∑_{j=1}^M ε_j(ξ_j − ξ′_j)‖^p])^{1/p}
 ≤ (E[‖∑_{j=1}^M ε_j ξ_j‖^p])^{1/p} + (E[‖∑_{j=1}^M ε_j ξ′_j‖^p])^{1/p}
 ≤ 2 (E[‖∑_{j=1}^M ε_j ξ_j‖^p])^{1/p},

since ξ′ is an independent copy of ξ and the map X ↦ (E[‖X‖^p])^{1/p} is a norm and hence subadditive. □

2.5 Hoeffding's inequality for Rademacher sums

In order to prove Hoeffding's inequality, we need to introduce the Khintchine inequalities, which provide estimates for the moments of Rademacher sums.


We recall that the multinomial theorem states that

(∑_{j=1}^M x_j)^n = ∑_{k₁+···+k_M = n, k_i ∈ {0,1,...,n}} n!/(k₁! · · · k_M!) x₁^{k₁} · · · x_M^{k_M}.

Theorem 2.16 (Khintchine's inequality). Let b ∈ C^M and let ε = (ε₁, . . . , ε_M) be a Rademacher sequence. Then, for all n ∈ N,

E[|∑_{j=1}^M ε_j b_j|^{2n}] ≤ (2n)!/(2^n n!) ‖b‖_2^{2n}.

Proof. We first assume that the b_j are real valued. Using the multinomial theorem, we have that

E := E[|∑_{j=1}^M ε_j b_j|^{2n}]
 = ∑_{j₁+···+j_M = n} (2n)!/((2j₁)! · · · (2j_M)!) |b₁|^{2j₁} · · · |b_M|^{2j_M} E[ε₁^{2j₁}] · · · E[ε_M^{2j_M}]
 = ∑_{j₁+···+j_M = n} (2n)!/((2j₁)! · · · (2j_M)!) |b₁|^{2j₁} · · · |b_M|^{2j_M},

where we used the independence of the ε_j and the fact that E[ε_j^{2k}] = 1 and E[ε_j^{2k+1}] = 0, so that only terms with all exponents even survive. Furthermore, for integers satisfying j₁ + · · · + j_M = n it holds that

2^n j₁! · · · j_M! = 2^{j₁} j₁! · · · 2^{j_M} j_M! ≤ (2j₁)! · · · (2j_M)!.

This implies

E ≤ (2n)!/(2^n n!) ∑_{j₁+···+j_M = n} n!/(j₁! · · · j_M!) |b₁|^{2j₁} · · · |b_M|^{2j_M} = (2n)!/(2^n n!) (∑_{j=1}^M |b_j|²)^n = (2n)!/(2^n n!) ‖b‖_2^{2n}.

The general complex case is derived as follows:

E[|∑_{j=1}^M ε_j (Re(b_j) + i Im(b_j))|^{2n}]^{1/(2n)} = E[(|∑_{j=1}^M ε_j Re(b_j)|² + |∑_{j=1}^M ε_j Im(b_j)|²)^n]^{1/(2n)}
 ≤ (E[|∑_{j=1}^M ε_j Re(b_j)|^{2n}]^{1/n} + E[|∑_{j=1}^M ε_j Im(b_j)|^{2n}]^{1/n})^{1/2}   (by Lemma 2.5)
 ≤ (((2n)!/(2^n n!))^{1/n} (‖Re(b)‖_2² + ‖Im(b)‖_2²))^{1/2}
 = ((2n)!/(2^n n!))^{1/(2n)} ‖b‖_2. □


We recall Stirling's formula for the factorial: for any n ≥ 1,

n! = n^n e^{−n} √(2πn) e^{λ_n},  (2.7)

where 1/(12n + 1) ≤ λ_n ≤ 1/(12n).

Corollary 2.17 (Khintchine's inequality). Let b ∈ C^M and let ε = (ε₁, . . . , ε_M) be a Rademacher sequence. Then, for all p ≥ 2,

E[|∑_{j=1}^M ε_j b_j|^p]^{1/p} ≤ 2^{3/(4p)} e^{−1/2} √p ‖b‖_2.

Proof. Without loss of generality we may assume ‖b‖_2 = 1. Note that, as a consequence of the Hölder inequality, for θ ∈ [0, 1] and a random variable Z we have

E[|Z|^{2n+2θ}] = E[(|Z|^{2n})^{1−θ} (|Z|^{2n+2})^θ] ≤ (E[|Z|^{2n}])^{1−θ} (E[|Z|^{2n+2}])^θ.

This estimate combined with Khintchine's inequality (Theorem 2.16) gives

E[|∑_{j=1}^M ε_j b_j|^{2n+2θ}] ≤ (E[|∑_{j=1}^M ε_j b_j|^{2n}])^{1−θ} (E[|∑_{j=1}^M ε_j b_j|^{2n+2}])^θ
 ≤ ((2n)!/(2^n n!))^{1−θ} ((2n+2)!/(2^{n+1}(n+1)!))^θ
 ≤ (√2 (2/e)^n n^n)^{1−θ} (√2 (2/e)^{n+1} (n+1)^{n+1})^θ
 = √2 (2/e)^{n+θ} n^{n(1−θ)} (n+1)^{(n+1)θ},

since by Stirling's formula (2.7)

(2n)!/(2^n n!) = 2^n n^n e^{−n} √2 e^{λ_{2n} − λ_n} ≤ 2^n n^n e^{−n} √2,

where λ_{2n} − λ_n ≤ (1 − 2n)/(24n(12n + 1)) ≤ 0. Furthermore,

n^{n(1−θ)} (n+1)^{(n+1)θ} = (n^{1−θ}(n+1)^θ)^n (n+1)^θ = (n^{1−θ}(n+1)^θ)^{n+θ} ((n+1)/n)^{θ(1−θ)} ≤ (n+θ)^{n+θ} ((n+1)/n)^{θ(1−θ)} ≤ (n+θ)^{n+θ} 2^{1/4},

since (n+1)/n ≤ 2 and θ(1−θ) ≤ 1/4, and since, using the concavity of the logarithm,

n^{1−θ}(n+1)^θ = e^{(1−θ) log n + θ log(n+1)} ≤ e^{log((1−θ)n + θ(n+1))} = n + θ.

We have then shown that

E[|∑_{j=1}^M ε_j b_j|^{2n+2θ}] ≤ √2 (2/e)^{n+θ} (n+θ)^{n+θ} 2^{1/4} = 2^{3/4} (2/e)^{n+θ} (n+θ)^{n+θ}.

Replacing n + θ by p/2 we conclude the proof. □


Corollary 2.18 Let b = (b₁, . . . , b_M) ∈ C^M and let ε = (ε₁, . . . , ε_M) be a Rademacher sequence. Then, for u ≥ √2,

P[|∑_{j=1}^M ε_j b_j| ≥ ‖b‖_2 u] ≤ 2 exp(−u²/2).

Proof. By the above corollary we have

(E[|∑_{j=1}^M ε_j b_j|^p])^{1/p} ≤ 2^{3/(4p)} e^{−1/2} √p ‖b‖_2.

Applying Proposition 2.11 with p₀ = 2, γ = 2, β = 2^{3/4} and α = e^{−1/2} ‖b‖_2, we obtain

P[|∑_{j=1}^M ε_j b_j| ≥ ‖b‖_2 u] ≤ 2 exp(−u²/2). □

We now give the standard version of Hoeffding's inequality for real Rademacher sums.

Proposition 2.19 Let b = (b₁, . . . , b_M) ∈ R^M and let ε = (ε₁, . . . , ε_M) be a Rademacher sequence. Then, for u > 0,

P[∑_{j=1}^M ε_j b_j ≥ ‖b‖_2 u] ≤ exp(−u²/2)  (2.8)

and consequently,

P[|∑_{j=1}^M ε_j b_j| ≥ ‖b‖_2 u] ≤ 2 exp(−u²/2).

Proof. Without loss of generality we may assume ‖b‖_2 = 1. By the Markov inequality and independence, we have, for λ > 0,

P[∑_{j=1}^M ε_j b_j ≥ u] = P[exp(λ ∑_{j=1}^M ε_j b_j) ≥ e^{λu}] ≤ e^{−λu} E[exp(λ ∑_{j=1}^M ε_j b_j)] = e^{−λu} ∏_{j=1}^M E[exp(λ ε_j b_j)].

Note that, for s ∈ R,

E[exp(εs)] = (e^s + e^{−s})/2 = cosh(s) = ∑_{j=0}^∞ s^{2j}/(2j)! ≤ ∑_{j=0}^∞ s^{2j}/(2^j j!) = ∑_{j=0}^∞ (s²/2)^j / j! = e^{s²/2}.

Then,

P[∑_{j=1}^M ε_j b_j ≥ u] ≤ e^{−λu} ∏_{j=1}^M E[exp(λ ε_j b_j)] ≤ e^{−λu} ∏_{j=1}^M e^{λ² b_j²/2} = e^{−λu} e^{λ² ‖b‖_2²/2}.

Choosing λ = u and recalling that ‖b‖_2 = 1, we obtain

P[∑_{j=1}^M ε_j b_j ≥ u] ≤ e^{−u²} e^{u²/2} = e^{−u²/2}.

Finally,

P[|∑_{j=1}^M ε_j b_j| ≥ u] = P[∑_{j=1}^M ε_j b_j ≥ u] + P[−∑_{j=1}^M ε_j b_j ≥ u] ≤ 2 e^{−u²/2},

since −ε_j has the same distribution as ε_j. □
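The sketch below compares the empirical tail of a real Rademacher sum with the bound 2 exp(−u²/2) of Proposition 2.19; the vector b, the number of trials and the grid of u values are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(5)
M, trials = 50, 200_000
b = rng.standard_normal(M)
b /= np.linalg.norm(b)                               # normalize so that ||b||_2 = 1

eps = rng.choice([-1.0, 1.0], size=(trials, M))      # Rademacher sequences
S = eps @ b                                          # Rademacher sums

for u in [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]:
    empirical = np.mean(np.abs(S) >= u)              # empirical P[|sum| >= u]
    bound = 2 * np.exp(-u ** 2 / 2)                  # Hoeffding bound
    print(u, empirical, round(bound, 4))
```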

We compare the previous result with Chebyshev's inequality.

Proposition 2.20 (Chebyshev's inequality). Let X be a random variable with finite expectation µ and nonzero variance σ². Then, for any real number u > 0,

P[|X − µ| ≥ u√(σ²)] ≤ 1/u².

In this context, let X = ∑_{j=1}^M ε_j b_j. Then we have

E[X] = ∑_{j=1}^M E[ε_j] b_j = 0

and

Var[X] = E[X²] − E[X]² = E[X²] = E[(∑_{j=1}^M ε_j b_j)²] = ∑_{j=1}^M b_j² = ‖b‖_2²,

since E[ε_j ε_k] = δ_{jk} for all j, k = 1, . . . , M. Chebyshev's inequality then becomes

P[|∑_{j=1}^M ε_j b_j| ≥ u ‖b‖_2] ≤ 1/u².

We thus observe that Proposition 2.19 gives a much stronger (exponential rather than polynomial) bound than Chebyshev's inequality.

We also note that equation (2.8) can be rewritten as

P[|⟨ε, b⟩| / (‖b‖_2 ‖ε‖_2) > t] ≤ exp(−M t²/2),

choosing u = √M t and using that ‖ε‖_2 = √M. For large values of the dimension M, the probability P_{t,M} := P[|⟨ε, b⟩| / (‖b‖_2 ‖ε‖_2) ≤ t] is therefore close to one for (almost) any t, as shown in Table 2.1.


Table 2.1: Estimates of P_{t,M} depending on t and M.

  t \ M      2       3       5       7      10      25      50     100     200
  0.01     0.02%   0.04%   0.12%   0.24%   0.50%   3.08%  11.75%  39.35%  86.47%
  0.05     0.50%   1.12%   3.08%   5.94%  11.75%  54.22%  95.61%    100%    100%
  0.1      1.98%   4.40%  11.75%  21.73%  39.35%  95.61%    100%    100%    100%
  0.2      7.69%  16.47%  39.35%  62.47%  86.47%    100%    100%    100%    100%
  0.3     16.47%  33.30%  67.53%  88.97%  98.89%    100%    100%    100%    100%
  0.5     39.35%  67.53%  95.61%  99.78%    100%    100%    100%    100%    100%
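For comparison, the short sketch below tabulates the lower bound 1 − exp(−Mt²/2) on P_{t,M} that follows directly from (2.8) with u = √M t, on the same grid of t and M. The thesis computed its table in its own way, so these values need not coincide with the entries above.

```python
import numpy as np

ts = [0.01, 0.05, 0.1, 0.2, 0.3, 0.5]
Ms = [2, 3, 5, 7, 10, 25, 50, 100, 200]

print("t \\ M " + "".join(f"{M:>9d}" for M in Ms))
for t in ts:
    row = [1 - np.exp(-M * t ** 2 / 2) for M in Ms]   # lower bound implied by (2.8)
    print(f"{t:>5.2f} " + "".join(f"{100 * v:>8.2f}%" for v in row))
```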

2.6 Rudelson’s Lemma

Rudelson's Lemma is an estimate for the operator norm of a Rademacher sum. We first introduce some notation and some technical results.

Definition 2.21 A pairing is a partition of the set {1, . . . , 2n} into n two-element subsets, called blocks.

We now introduce some notation. We denote the set of all pairings of {1, . . . , 2n} by P_{2n}. The canonical pairing is denoted by 1 = {D₁, . . . , D_n}, with blocks D_j = {2j − 1, 2j}. Let π = {D₁, . . . , D_n} be a pairing; its cyclic shift is the pairing Tπ = {TD₁, . . . , TD_n}, where T{j, k} = {j + 1, k + 1} modulo 2n.

The symmetrized pairing ←π contains all the blocks {j, k} of π such that j, k ≤ n, together with their "reflected" blocks {2n + 1 − j, 2n + 1 − k}. In addition, each block {j, k} with j ≤ n and k > n is replaced by the symmetric block {j, 2n + 1 − j}, while the blocks {j, k} with j, k > n are omitted. Similarly, the pairing →π contains all the blocks {j, k} of π such that j, k > n, together with their "reflected" blocks {2n + 1 − j, 2n + 1 − k}; each block {j, k} with j ≤ n and k > n is replaced by the symmetric block {2n + 1 − k, k}, while the blocks {j, k} with j, k ≤ n are omitted.

In order to explain the previous definitions we introduce an example.

Example 2.22 Let n = 4 and

π = {{1, 7}, {2, 3}, {4, 6}, {5, 8}}.

In order to construct ←π we consider the block {2, 3}, its reflected block {6, 7}, and the symmetric blocks {1, 8} and {4, 5}. Then

←π = {{2, 3}, {6, 7}, {1, 8}, {4, 5}}.

Similarly,

→π = {{5, 8}, {1, 4}, {3, 6}, {2, 7}}.


Let B = (B₁, . . . , B_M) be a sequence of matrices of the same dimension and let π = {D₁, . . . , D_n} ∈ P_{2n}. We set

π(B) = ∑_{k₁,...,k_n=1}^M B_{k_{α(1)}} B*_{k_{α(2)}} B_{k_{α(3)}} B*_{k_{α(4)}} · · · B_{k_{α(2n−1)}} B*_{k_{α(2n)}},

where α = α_π : {1, . . . , 2n} → {1, . . . , n} is such that α(j) = l if and only if j ∈ D_l.

Lemma 2.23 Let π ∈ P_{2n} and let B = (B₁, . . . , B_M) be a sequence of complex matrices of the same dimension. Then there exist γ ≥ 1/(4n) and nonnegative numbers p_ρ = p_ρ(π), ρ ∈ P_{2n}, satisfying γ + ∑_{ρ∈P_{2n}} p_ρ = 1, such that

|tr π(B)| ≤ (∏_{ρ∈P_{2n}} |tr ρ(B)|^{p_ρ}) (max{‖(∑_{j=1}^M B_j B_j*)^{1/2}‖_{S_{2n}}^{2n}, ‖(∑_{j=1}^M B_j* B_j)^{1/2}‖_{S_{2n}}^{2n}})^γ.

Proof. In order to clarify the proof of this lemma we give some examples at the end. We first observe that

1(B) = ∑_{k₁,...,k_n=1}^M B_{k₁} B*_{k₁} · · · B_{k_n} B*_{k_n} = ∑_{k₁,...,k_n=1}^M ∏_{j=1}^n B_{k_j} B*_{k_j} = (∑_{k=1}^M B_k B_k*)^n = ((∑_{k=1}^M B_k B_k*)^{1/2} (∑_{k=1}^M B_k B_k*)^{1/2})^n,

since the matrix inside the brackets is self-adjoint and positive. Moreover, in the second equality we used the fact that, by induction,

∑_{k₁,...,k_n=1}^M ∏_{j=1}^n a_{k_j} = (∑_{k=1}^M a_k)^n.

Using equation (2.5) yields

tr 1(B) = ‖(∑_{j=1}^M B_j B_j*)^{1/2}‖_{S_{2n}}^{2n}.

Since the trace is cyclic, we obtain in the same way that

tr T1(B) = ‖(∑_{j=1}^M B_j* B_j)^{1/2}‖_{S_{2n}}^{2n}.

Let t ∈ {0, 1, . . . , n} be the maximal number such that {p, p + 1}, {p + 2, p + 3}, . . . , {p + 2t − 2, p + 2t − 1} are blocks of the pairing π, for some p.

If t = n then π = 1 or π = T1, and we are done.


Suppose t ∈ {1, . . . , n − 1}. By the cyclicity of the trace,

tr π(B) = tr T^{n−p−2t+1}π(B)   if n − p is odd,
tr π(B) = tr T^{n−p−2t+1}π(B*)  if n − p is even,

where B* = (B₁*, . . . , B_M*). Note that it is sufficient to prove the lemma for n − p odd, since if n − p is even we conclude by observing that

tr 1(B*) = tr T1(B)   and   tr T1(B*) = tr 1(B).

By definition of T^{n−p−2t+1}, the sets {n − 2t + 1, n − 2t + 2}, {n − 2t + 3, n − 2t + 4}, . . . , {n − 1, n} are blocks of the pairing T^{n−p−2t+1}π =: {D₁, . . . , D_n}. Let α : {1, . . . , 2n} → {1, . . . , n} be such that α(j) = l if and only if j ∈ D_l.

By the cyclicity of the trace, we have that

tr π(B) = tr T^{n−p−2t+1}π(B) = tr ∑_{k₁,...,k_n=1}^M B_{k_{α(1)}} B*_{k_{α(2)}} B_{k_{α(3)}} B*_{k_{α(4)}} · · · B_{k_{α(2n−1)}} B*_{k_{α(2n)}}.

Assume n is even and p is odd.

Observe that k_j with j ∈ {α(1), . . . , α(n)} is a repeated index if both elements of D_j are in {1, . . . , n}, while if j ∈ {α(n + 1), . . . , α(2n)} then both elements of D_j are in {n + 1, . . . , 2n}.

Denote by L the subset of {1, . . . , n} containing the indices l for which both elements of D_l are in {1, . . . , n}, by R the subset containing the indices l for which both elements of D_l are in {n + 1, . . . , 2n}, and by U the set containing the remaining indices, corresponding to blocks which have elements in both {1, . . . , n} and {n + 1, . . . , 2n}.


It follows that

|tr π(B)| = |tr T^{n−p−2t+1}π(B)|
 = |tr(∑_{k_i : i∈U} (∑_{k_i : i∈L} B_{k_{α(1)}} · · · B*_{k_{α(n)}}) (∑_{k_i : i∈R} B_{k_{α(n+1)}} · · · B*_{k_{α(2n)}}))|
 ≤ ∑_{k_i : i∈U} |tr((∑_{k_i : i∈L} B_{k_{α(1)}} · · · B*_{k_{α(n)}}) (∑_{k_i : i∈R} B_{k_{α(n+1)}} · · · B*_{k_{α(2n)}}))|
 ≤ ∑_{k_i : i∈U} √(tr[(∑_{k_i : i∈L} B_{k_{α(1)}} · · · B*_{k_{α(n)}}) (∑_{k_i : i∈L} B_{k_{α(n)}} · · · B*_{k_{α(1)}})]) √(tr[(∑_{k_i : i∈R} B_{k_{α(2n)}} · · · B*_{k_{α(n+1)}}) (∑_{k_i : i∈R} B_{k_{α(n+1)}} · · · B*_{k_{α(2n)}})])
 ≤ √(∑_{k_i : i∈U} tr[(∑_{k_i : i∈L} B_{k_{α(1)}} · · · B*_{k_{α(n)}}) (∑_{k_i : i∈L} B_{k_{α(n)}} · · · B*_{k_{α(1)}})]) √(∑_{k_i : i∈U} tr[(∑_{k_i : i∈R} B_{k_{α(2n)}} · · · B*_{k_{α(n+1)}}) (∑_{k_i : i∈R} B_{k_{α(n+1)}} · · · B*_{k_{α(2n)}})])
 = |tr(←(T^{n−p−2t+1}π))(B)|^{1/2} |tr(→(T^{n−p−2t+1}π))(B)|^{1/2}
 = |tr(←(T^{n−p−2t+1}π))(B)|^{1/2} |tr ρ(B)|^{1/2}.

In the fourth line we used the fact that

|tr(B*A)| ≤ √(tr(A*A)) √(tr(B*B)).

Indeed, by the Cauchy-Schwarz inequality, since tr(B*A) = ∑_{i,j} b̄_{ji} a_{ji}, we have

|tr(B*A)| ≤ ∑_{i,j} |b_{ji}| |a_{ji}| ≤ (∑_{i,j} |b_{ij}|²)^{1/2} (∑_{i,j} |a_{ij}|²)^{1/2}.

Moreover, in the fifth line we used the fact that, for nonnegative numbers a_i, b_i,

∑_i √(a_i b_i) ≤ √(∑_i a_i) √(∑_i b_i).  (2.9)


Indeed, proving (2.9) is equivalent to showing that

∑_{i,j} √(a_i b_i a_j b_j) ≤ (∑_i a_i)(∑_j b_j),

which is equivalent to

∑_{i≠j} a_i b_j − ∑_{i≠j} √(a_i b_i a_j b_j) ≥ 0.

We observe that

∑_{i≠j} (a_i b_j − √(a_i b_i a_j b_j)) = ∑_{i<j} (a_i b_j + a_j b_i − 2√(a_i b_i a_j b_j)) = ∑_{i<j} (√(a_i b_j) − √(a_j b_i))² ≥ 0.

At the end of the proof we introduce some examples in order to clarify the previous steps.

If t ≥ n/2, then ←(T^{n−p−2t+1}π) is equal to 1 or T1, and we conclude choosing γ = 1/2 and p_ρ = 1/2. Note that in the case n odd we have to consider the case t ≥ (n+1)/2.

Otherwise, if t < n/2, then ←(T^{n−p−2t+1}π) contains the blocks {n − 2t + 1, n − 2t + 2}, . . . , {n − 1, n}, {n + 1, n + 2}, . . . , {n + 2t − 1, n + 2t}. Applying the same estimate as above to T^{−2t}(←(T^{n−p−2t+1}π)) we obtain

|tr π(B)| ≤ |tr(←(T^{−2t}(←(T^{n−p−2t+1}π))))(B)|^{1/4} |tr ρ′(B)|^{1/4} |tr ρ(B)|^{1/2}

for suitable ρ, ρ′ ∈ P_{2n}. Similarly as above, if t ≥ n/4 then ←(T^{−2t}(←(T^{n−p−2t+1}π))) equals 1 or T1 and we are finished; otherwise, if t < n/4, we continue in this way.

Since at every step we must have 1 ≤ t ≤ n/2^k, after at most log₂(n) such estimates we obtain the claim with γ = (1/2)^{[log₂(n)]} ≥ 1/(2n).

Finally, if t = 0, we apply the above method to T^q π, where q is such that {n, p} is a block of T^q π for some p > n. Then, as above, we obtain

|tr π(B)| ≤ |tr(←(T^q π))(B)|^{1/2} |tr(→(T^q π))(B)|^{1/2} = |tr(←(T^q π))(B)|^{1/2} |tr ρ(B)|^{1/2}.

Moreover, by definition of ←(T^q π), the set {n, n + 1} is a block of ←(T^q π). Iterating this method we obtain the claim with γ = (1/2)(1/2)^{[log₂(n)]} ≥ 1/(4n). □

The following examples give an idea of the proof of the lemma.

Example 2.24 Let π = {{1, 5}, {2, 6}, {3, 4}}. In this case n = 3, t = 1 and p = 3, so T^{n−p−2t+1}π = T^{−1}π = {{4, 6}, {1, 5}, {2, 3}}. It follows that

|tr π(B)| = |tr T^{−1}π(B)| = |tr ∑_{k₁,k₂,k₃=1}^M B_{k₂} B*_{k₃} B_{k₃} B*_{k₁} B_{k₂} B*_{k₁}|
 ≤ ∑_{k₂=1}^M √(tr[(∑_{k₃=1}^M B*_{k₃} B_{k₃} B*_{k₂})(∑_{i=1}^M B_{k₂} B*_i B_i)]) √(tr[(∑_{k₁=1}^M B_{k₁} B*_{k₂} B_{k₁})(∑_{j=1}^M B*_j B_{k₂} B*_j)])
 ≤ √(tr ∑_{k₂,k₃,i=1}^M B*_{k₃} B_{k₃} B*_{k₂} B_{k₂} B*_i B_i) √(tr ∑_{k₂,k₁,j=1}^M B_{k₁} B*_{k₂} B_{k₁} B*_j B_{k₂} B*_j)
 = |tr T1(B)|^{1/2} |tr ρ(B)|^{1/2}.

Note that ←(T^{−1}π) = {{2, 3}, {4, 5}, {1, 6}} = T1.

Example 2.25 Let π = {{1, 7}, {2, 3}, {4, 6}, {5, 8}}. In this case n = 4, t = 1 and p = 2, so T^{n−p−2t+1}π = Tπ = {{2, 8}, {3, 4}, {5, 7}, {1, 6}}. It follows that

|tr π(B)| = |tr Tπ(B)| = |tr ∑_{k₁,k₂,k₃,k₄=1}^M B_{k₄} B*_{k₁} B_{k₂} B*_{k₂} B_{k₃} B*_{k₄} B_{k₃} B*_{k₁}|
 = |tr ∑_{k₁,k₄=1}^M (∑_{k₂=1}^M B_{k₄} B*_{k₁} B_{k₂} B*_{k₂}) (∑_{k₃=1}^M B_{k₃} B*_{k₄} B_{k₃} B*_{k₁})|
 ≤ √(tr ∑_{k₁,k₄,k₂,i=1}^M B_{k₂} B*_{k₂} B_{k₁} B*_{k₄} B_{k₄} B*_{k₁} B_i B*_i) √(tr ∑_{k₁,k₄,k₃,j=1}^M B_{k₁} B*_{k₃} B_{k₄} B*_{k₃} B_j B*_{k₄} B_j B*_{k₁})
 = |tr ←(Tπ)(B)|^{1/2} |tr ρ(B)|^{1/2}.

Observe that ←(Tπ) = {{3, 4}, {5, 6}, {1, 8}, {2, 7}}. Since t = 1 < n/2 = 2, we continue in the same way:

|tr ←(Tπ)(B)| = |tr T^{−2}(←(Tπ))(B)| = |tr ∑_{k₁,k₂,k₃,k₄=1}^M B_{k₁} B*_{k₁} B_{k₂} B*_{k₂} B_{k₄} B*_{k₃} B_{k₃} B*_{k₄}|
 ≤ √(tr ∑_{k₁,k₂,i,j=1}^M B_{k₂} B*_{k₂} B_{k₁} B*_{k₁} B_j B*_j B_i B*_i) √(tr ∑_{k₃,k₄,i,j=1}^M B_{k₄} B*_{k₃} B_{k₃} B*_{k₄} B_i B*_j B_j B*_i)
 = |tr 1(B)|^{1/2} |tr ρ′(B)|^{1/2}.

Note that ←(T^{−2}(←(Tπ))) = {{1, 2}, {3, 4}, {7, 8}, {5, 6}} = 1. We have then proved that

|tr π(B)| ≤ |tr 1(B)|^{1/4} |tr ρ′(B)|^{1/4} |tr ρ(B)|^{1/2}.


Corollary 2.26 Let B = (B₁, . . . , B_M) be a sequence of complex matrices of the same dimension. Then, for all π ∈ P_{2n},

|tr π(B)| ≤ max{‖(∑_{j=1}^M B_j B_j*)^{1/2}‖_{S_{2n}}^{2n}, ‖(∑_{j=1}^M B_j* B_j)^{1/2}‖_{S_{2n}}^{2n}}.

Proof. Denote

D := max{‖(∑_{j=1}^M B_j B_j*)^{1/2}‖_{S_{2n}}^{2n}, ‖(∑_{j=1}^M B_j* B_j)^{1/2}‖_{S_{2n}}^{2n}}.

In Lemma 2.23 the constant γ can be chosen to be the same for all pairings π ∈ P_{2n}, for instance γ = γ₁ = 1/(4n). Indeed, if γ is initially larger, we can move some weight from D^γ to |tr 1(B)|^{p_1(π)} or |tr T1(B)|^{p_{T1}(π)}, whichever term is larger.

Applying the above lemma twice, we obtain

|tr π(B)| ≤ D^{γ₁} ∏_{κ∈P_{2n}} |tr κ(B)|^{p_κ(π)}
 ≤ D^{γ₁} ∏_{κ∈P_{2n}} (D^{γ₁} ∏_{ρ∈P_{2n}} |tr ρ(B)|^{p_ρ(κ)})^{p_κ(π)}
 = D^{γ₁} ∏_{κ∈P_{2n}} D^{γ₁ p_κ(π)} ∏_{ρ∈P_{2n}} |tr ρ(B)|^{p_ρ(κ) p_κ(π)}
 = D^{γ₁(1 + ∑_{κ∈P_{2n}} p_κ(π))} ∏_{ρ∈P_{2n}} |tr ρ(B)|^{∑_{κ∈P_{2n}} p_ρ(κ) p_κ(π)}
 = D^{γ₁ + γ₁(1 − γ₁)} ∏_{ρ∈P_{2n}} |tr ρ(B)|^{∑_{κ∈P_{2n}} p_ρ(κ) p_κ(π)}.

Let

γ₂ = γ₁ + γ₁(1 − γ₁)   and   p_ρ^{(2)}(π) = ∑_{κ∈P_{2n}} p_ρ(κ) p_κ(π).

Since γ₁ = 1/(4n) ∈ (0, 1), the new constant γ₂ is larger than γ₁. Iterating this process, we obtain increasingly larger constants γ_l, defined recursively by

γ_{l+1} = γ_l + γ_l(1 − γ_l).

Let L = lim_{l→∞} γ_l. Then L = L + L(1 − L), that is, L(1 − L) = 0. This implies L = 1, since γ_l is an increasing sequence and γ₁ = 1/(4n) > 0. We have therefore proved that

lim_{l→∞} γ_l = 1,

and then, since for all l

γ_l + ∑_{ρ∈P_{2n}} p_ρ^{(l)}(π) = 1,

we conclude that for all ρ ∈ P_{2n}

lim_{l→∞} p_ρ^{(l)}(π) = 0.

This completes the proof. □

We can now introduce the noncommutative Khintchine inequality.


Theorem 2.27 (Noncommutative Khintchine inequality). Let ε = (ε₁, . . . , ε_M) be a Rademacher sequence and let B_j, j = 1, . . . , M, be complex matrices of the same dimension. For any n ∈ N,

E[‖∑_{j=1}^M ε_j B_j‖_{S_{2n}}^{2n}] ≤ (2n)!/(2^n n!) max{‖(∑_{j=1}^M B_j B_j*)^{1/2}‖_{S_{2n}}^{2n}, ‖(∑_{j=1}^M B_j* B_j)^{1/2}‖_{S_{2n}}^{2n}}.

Proof. By (2.5),

E := E[‖∑_{j=1}^M ε_j B_j‖_{S_{2n}}^{2n}] = E[tr((∑_{k=1}^M ε_k B_k)(∑_{j=1}^M ε_j B_j)*)^n] = E[tr(∑_{j,k=1}^M ε_j ε_k B_j B_k*)^n]
 = ∑_{k₁,...,k_{2n}=1}^M E[ε_{k₁} · · · ε_{k_{2n}}] tr(B_{k₁} B*_{k₂} B_{k₃} · · · B*_{k_{2n}}),

since (∑_{j,k} a_{jk})^n = ∑_{j₁,...,j_n,k₁,...,k_n=1}^M a_{j₁k₁} · · · a_{j_nk_n}.

Observe that E[ε_{k₁} · · · ε_{k_{2n}}] = 1 if for every index j there exists i ≠ j such that k_j = k_i, and E[ε_{k₁} · · · ε_{k_{2n}}] = 0 otherwise. Indeed, ε_{k₁} · · · ε_{k_{2n}} is a Rademacher variable, and if for every j there exists i ≠ j with k_j = k_i, then ε_{k₁} · · · ε_{k_{2n}} = ε_{j₁}² · · · ε_{j_n}² = 1, with j₁, . . . , j_n ∈ {1, . . . , M} not necessarily distinct.

Denoting B = (B₁, . . . , B_M) and using Corollary 2.26, we have

E = ∑_{k₁,...,k_{2n}=1}^M E[ε_{k₁} · · · ε_{k_{2n}}] tr(B_{k₁} B*_{k₂} B_{k₃} · · · B*_{k_{2n}})
 = ∑_{π∈P_{2n}} ∑_{k₁,...,k_{2n} paired according to π} tr(B_{k₁} B*_{k₂} B_{k₃} · · · B*_{k_{2n}})
 = ∑_{π∈P_{2n}} tr π(B)
 ≤ |P_{2n}| max{‖(∑_{j=1}^M B_j B_j*)^{1/2}‖_{S_{2n}}^{2n}, ‖(∑_{j=1}^M B_j* B_j)^{1/2}‖_{S_{2n}}^{2n}}
 = (2n)!/(2^n n!) max{‖(∑_{j=1}^M B_j B_j*)^{1/2}‖_{S_{2n}}^{2n}, ‖(∑_{j=1}^M B_j* B_j)^{1/2}‖_{S_{2n}}^{2n}},

since |P_{2n}| = (2n)!/(2^n n!). □

We can now introduce the main result of this section, the so-called Rudelson's Lemma.


Theorem 2.28 (Rudelson's Lemma). Let A ∈ M_{m,M}(C) be of rank r with columns a₁, . . . , a_M, and let ε = (ε₁, . . . , ε_M) be a Rademacher sequence. Then, for 2 ≤ p < ∞,

(E[‖∑_{j=1}^M ε_j a_j a_j*‖_{2→2}^p])^{1/p} ≤ 2^{3/(4p)} r^{1/p} √p e^{−1/2} ‖A‖_{2→2} max_{j=1,...,M} ‖a_j‖_2.

Proof. We write p = 2n + 2θ with n ∈ N, θ ∈ [0, 1). Note that

(a_j a_j*)*(a_j a_j*) = (a_j a_j*)(a_j a_j*)* = (a_j a_j*)(a_j a_j*) = ‖a_j‖_2² (a_j a_j*).

Therefore, the noncommutative Khintchine inequality yields

E[‖∑_{j=1}^M ε_j a_j a_j*‖_{2→2}^{2n}]^{1/(2n)} ≤ E[‖∑_{j=1}^M ε_j a_j a_j*‖_{S_{2n}}^{2n}]^{1/(2n)} ≤ C_n ‖(∑_{j=1}^M ‖a_j‖_2² a_j a_j*)^{1/2}‖_{S_{2n}},

where C_n = ((2n)!/(2^n n!))^{1/(2n)}.

The operator ∑_j ‖a_j‖_2² a_j a_j* has rank at most r (for the proof see the Appendix). Then we have

E[‖∑_{j=1}^M ε_j a_j a_j*‖_{2→2}^{2n}]^{1/(2n)} ≤ C_n ‖(∑_{j=1}^M ‖a_j‖_2² a_j a_j*)^{1/2}‖_{S_{2n}}
 ≤ C_n r^{1/(2n)} ‖(∑_{j=1}^M ‖a_j‖_2² a_j a_j*)^{1/2}‖_{2→2}   (using (2.4))
 ≤ C_n r^{1/(2n)} ‖∑_{j=1}^M a_j a_j*‖_{2→2}^{1/2} max_{k=1,...,M} ‖a_k‖_2
 ≤ C_n r^{1/(2n)} ‖A‖_{2→2} max_{k=1,...,M} ‖a_k‖_2,

since ‖∑_{j=1}^M a_j a_j*‖_{2→2} = ‖AA*‖_{2→2} = ‖A‖_{2→2}² (for the proof see the Appendix).

As a consequence of the Hölder inequality, for a random variable Z,

E[|Z|^{2n+2θ}] = E[|Z|^{2n(1−θ)} |Z|^{θ(2n+2)}] = E[(|Z|^{2n})^{1−θ} (|Z|^{2n+2})^θ] ≤ (E[|Z|^{2n}])^{1−θ} (E[|Z|^{2n+2}])^θ.


Then,

E[‖∑_{j=1}^M ε_j a_j a_j*‖_{2→2}^{2n+2θ}]
 ≤ (E[‖∑_{j=1}^M ε_j a_j a_j*‖_{2→2}^{2n}])^{1−θ} (E[‖∑_{j=1}^M ε_j a_j a_j*‖_{2→2}^{2n+2}])^θ
 ≤ (E[‖∑_{j=1}^M ε_j a_j a_j*‖_{S_{2n}}^{2n}])^{1−θ} (E[‖∑_{j=1}^M ε_j a_j a_j*‖_{S_{2n+2}}^{2n+2}])^θ
 ≤ (C_n r^{1/(2n)} ‖A‖_{2→2} max_{k} ‖a_k‖_2)^{2n(1−θ)} (C_{n+1} r^{1/(2n+2)} ‖A‖_{2→2} max_{k} ‖a_k‖_2)^{(2n+2)θ}
 = ((2n)!/(2^n n!))^{1−θ} ((2n+2)!/(2^{n+1}(n+1)!))^θ r (‖A‖_{2→2} max_{k} ‖a_k‖_2)^{2n+2θ}.

Using Stirling's inequality,

((2n)!/(2^n n!))^{1−θ} ((2n+2)!/(2^{n+1}(n+1)!))^θ ≤ (√2 (2/e)^n n^n)^{1−θ} (√2 (2/e)^{n+1} (n+1)^{n+1})^θ = √2 (2/e)^{n+θ} n^{n(1−θ)} (n+1)^{θ(n+1)}.

Moreover,

n^{n(1−θ)} (n+1)^{θ(n+1)} = (n^{1−θ}(n+1)^θ)^n (n+1)^θ = (n^{1−θ}(n+1)^θ)^{n+θ} ((n+1)/n)^{θ(1−θ)} ≤ (n+θ)^{n+θ} ((n+1)/n)^{θ(1−θ)} ≤ (n+θ)^{n+θ} 2^{1/4},

since (n+1)/n ≤ 2 and θ(1−θ) ≤ 1/4, where we used that n^{1−θ}(n+1)^θ ≤ n + θ, since the logarithm is concave and therefore

(1−θ) log(n) + θ log(n+1) ≤ log(n+θ).

We have then proved that

E[‖∑_{j=1}^M ε_j a_j a_j*‖_{2→2}^{2n+2θ}] ≤ 2^{3/4} (2/e)^{n+θ} (n+θ)^{n+θ} r (‖A‖_{2→2} max_{k=1,...,M} ‖a_k‖_2)^{2n+2θ}.

Substituting p/2 = n + θ, we complete the proof. □

The following corollary gives an exponential bound for the tail of the random variable ‖∑_{j=1}^M ε_j a_j a_j*‖_{2→2}.

Corollary 2.29 Let A ∈ M_{m,M}(C) be of rank r with columns a₁, . . . , a_M, and let ε = (ε₁, . . . , ε_M) be a Rademacher sequence. Then, for all u ≥ √2,

P[‖∑_{j=1}^M ε_j a_j a_j*‖_{2→2} ≥ u ‖A‖_{2→2} max_{j=1,...,M} ‖a_j‖_2] ≤ 2^{3/4} r e^{−u²/2}.

Proof. The claim follows from Rudelson's Lemma (Theorem 2.28) and Proposition 2.11 with p₀ = 2, γ = 2, β = 2^{3/4} r and α = e^{−1/2} ‖A‖_{2→2} max_{j=1,...,M} ‖a_j‖_2. □
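The sketch below gives a Monte Carlo illustration of Corollary 2.29: for a fixed matrix A it estimates the tail of ‖∑_j ε_j a_j a_j*‖_{2→2} and compares it with the stated bound. The matrix sizes, trial count and grid of u are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(6)
m, M, trials = 20, 60, 5000
A = rng.standard_normal((m, M))
cols = A.T                                            # row j of cols is the column a_j of A
scale = np.linalg.norm(A, 2) * np.linalg.norm(cols, axis=1).max()  # ||A|| * max_j ||a_j||_2
r = np.linalg.matrix_rank(A)

norms = np.empty(trials)
for i in range(trials):
    eps = rng.choice([-1.0, 1.0], size=M)
    S = (cols * eps[:, None]).T @ cols                # sum_j eps_j a_j a_j^*
    norms[i] = np.linalg.norm(S, 2)                   # operator norm

for u in [np.sqrt(2), 2.0, 2.5, 3.0]:
    empirical = np.mean(norms >= u * scale)
    bound = 2 ** 0.75 * r * np.exp(-u ** 2 / 2)       # bound of Corollary 2.29
    print(round(u, 2), empirical, round(bound, 3))
```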


2.7 Dudley’s Inequality

Let X_t, t ∈ 𝒯, be a stochastic process of complex-valued random variables indexed by some set 𝒯. We are interested in bounding the moments of its supremum. We endow 𝒯 with the pseudometric

d(s, t) = E[|X_t − X_s|²]^{1/2}.

We consider a stochastic process of the form

X_t = ∑_{j=1}^M ε_j x_j(t),   t ∈ 𝒯,

where ε = (ε₁, . . . , ε_M) is a Rademacher sequence and the x_j : 𝒯 → C are deterministic functions. Such a process is called a Rademacher process, and it satisfies (2.10) below. Indeed, let x(t) = (x₁(t), . . . , x_M(t)); then

d(s, t)² = E[|X_t − X_s|²] = E[|∑_{j=1}^M ε_j (x_j(t) − x_j(s))|²]
 = ∑_{j,k=1}^M E[ε_j ε_k] (x_j(t) − x_j(s)) (x_k(t) − x_k(s))‾
 = ∑_{j=1}^M |x_j(t) − x_j(s)|² = ‖x(t) − x(s)‖_2²,

since E[ε_j ε_k] = δ_{jk}. Therefore the pseudometric is

d(s, t) = E[|X_t − X_s|²]^{1/2} = ‖x(t) − x(s)‖_2.

Proposition 2.19 shows that the Rademacher process above satisfies

P[|X_t − X_s| ≥ u d(t, s)] ≤ 2 exp(−u²/2),   t, s ∈ 𝒯, u > 0.  (2.10)

A process X_t, t ∈ 𝒯, satisfying condition (2.10) is called a subgaussian process.

For a subset T ⊂ 𝒯, the covering number N(T, d, ε) is defined as the minimal number N such that there exists a subset E ⊂ T with |E| = N satisfying

T ⊂ ⋃_{t∈E} B_d(t, ε),

where B_d(t, ε) = {s ∈ 𝒯 : d(t, s) ≤ ε}. In other words, it is the minimal number of balls of radius ε needed to cover T. We denote the diameter of T by

∆(T) := sup_{s,t∈T} d(s, t).
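Covering numbers are rarely computed exactly, but a simple greedy procedure gives an upper bound that is often good enough for experiments. The sketch below bounds N(T, d, ε) for a finite point cloud T ⊂ R² under the Euclidean metric; it is only an illustration of the definition, not a construction used in the thesis.

```python
import numpy as np

def greedy_covering_number(points, eps):
    """Greedy upper bound on N(T, d, eps) for a finite set of points in R^n with
    the Euclidean metric: repeatedly pick an uncovered point as a new ball center
    until every point lies in some ball of radius eps."""
    remaining = points.copy()
    centers = 0
    while len(remaining) > 0:
        center = remaining[0]
        dist = np.linalg.norm(remaining - center, axis=1)
        remaining = remaining[dist > eps]       # discard the points covered by this ball
        centers += 1
    return centers

rng = np.random.default_rng(7)
T = rng.uniform(-1, 1, size=(2000, 2))          # finite subset of the square [-1, 1]^2
for eps in [1.0, 0.5, 0.25, 0.1]:
    print(eps, greedy_covering_number(T, eps))  # the bound grows as eps shrinks
```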


We are interested in bounding the moments of sup_{t∈T} |X_t − X_{t₀}|. In order to avoid measurability issues, since the supremum of an uncountable number of random variables might not be measurable, we define the so-called lattice supremum, [28],

E[sup_{t∈T} |X_t − X_{t₀}|] = sup_{F⊂T, F finite} E[sup_{t∈F} |X_t − X_{t₀}|].

We now introduce a version of Dudley’s inequality.

Theorem 2.30 Let X_t, t ∈ 𝒯, be a complex-valued process indexed by the pseudometric space (𝒯, d), as defined above. Then, for a subset T ⊂ 𝒯 and any point t₀ ∈ T, it holds that

E[sup_{t∈T} |X_t − X_{t₀}|] ≤ C₁ ∫_0^{∆(T)} √(log(N(T, d, u))) du + D₁ ∆(T)  (2.11)

with constants C₁ = 16.51 and D₁ = 4.424. Furthermore, for p ≥ 2,

(E[sup_{t∈T} |X_t − X_{t₀}|^p])^{1/p} ≤ β^{1/p} √p (C ∫_0^{∆(T)} √(log(N(T, d, u))) du + D ∆(T))  (2.12)

with constants C = 14.372, D = 5.818 and β = 6.028.

Proof. Without loss of generality we may assume ∆(T) = 1; otherwise we consider the rescaled process Y_t = X_t / ∆(T).

Let b > 1 to be chosen later. According to the definition of the covering number, for every j ≥ 1 there exist finite subsets E_j ⊂ T of cardinality |E_j| = N(T, d, b^{−j}) such that

T ⊂ ⋃_{t∈E_j} B_d(t, b^{−j}).

For each t ∈ T and j ≥ 1 we can therefore choose π_j(t) ∈ E_j such that

d(t, π_j(t)) ≤ b^{−j}.

Further, set π₀(t) = t₀. By the triangle inequality, for all j ≥ 2,

d(π_j(t), π_{j−1}(t)) ≤ d(π_j(t), t) + d(t, π_{j−1}(t)) ≤ b^{−j}(1 + b),

and

d(π₁(t), π₀(t)) ≤ ∆(T) = 1 ≤ b^{−1}(1 + b).

Therefore, for all j ≥ 1,

d(π_j(t), π_{j−1}(t)) ≤ b^{−j}(1 + b).

Step 1. We now claim the chaining identity

X_t − X_{t₀} = ∑_{j=1}^∞ (X_{π_j(t)} − X_{π_{j−1}(t)})   almost surely.  (2.13)

Indeed, by (2.10) we have that

P[|X_{π_j(t)} − X_{π_{j−1}(t)}| ≥ b^{−j/2}] ≤ P[|X_{π_j(t)} − X_{π_{j−1}(t)}| ≥ (b^{j/2}/(b + 1)) d(π_j(t), π_{j−1}(t))] ≤ 2 exp(−b^j / (2(b + 1)²)).

This implies that ∑_{j=1}^∞ P[|X_{π_j(t)} − X_{π_{j−1}(t)}| ≥ b^{−j/2}] < ∞.

By the Borel-Cantelli Lemma, choosing A_j = {|X_{π_j(t)} − X_{π_{j−1}(t)}| ≥ b^{−j/2}}, we have P[⋂_{n=1}^∞ ⋃_{j=n}^∞ A_j] = 0, which means that P[⋃_{n=1}^∞ ⋂_{j=n}^∞ A_j^C] = 1. The fact that ω ∈ ⋃_{n=1}^∞ ⋂_{j=n}^∞ A_j^C means that there exists j₀ such that |X_{π_j(t)}(ω) − X_{π_{j−1}(t)}(ω)| < b^{−j/2} for all j ≥ j₀, and this holds almost surely. Consequently,

∑_{j=1}^∞ (X_{π_j(t)} − X_{π_{j−1}(t)})

converges almost surely. Further,

E[|X_t − X_{t₀} − ∑_{j=1}^J (X_{π_j(t)} − X_{π_{j−1}(t)})|²] = E[|X_t − X_{π_J(t)}|²] = (d(t, π_J(t)))² ≤ b^{−2J} → 0   as J → ∞.

Then there exists a subsequence of the partial sums converging almost surely to X_t − X_{t₀}. Since ∑_{j=1}^∞ (X_{π_j(t)} − X_{π_{j−1}(t)}) also converges almost surely, it must converge to X_t − X_{t₀}. This proves the claim.

Step 2. We want to prove that

E[sup_{t∈F} |X_t − X_{t₀}|^p]^{1/p} ≤ p^{1/p} α^{−1}(1 + b) (α^p/p + 2/(b − 1) ∫_α^∞ b^{−u²/α²} u^{p−1} du)^{1/p} · (2b/(b − 1) ∫_0^{∆(T)} √(log(N(T, d, u))) du + √(2 log b) (2b − 1)/(b − 1)²).

Let F be a finite subset of T. Let aj > 0, for every j > 0, to be determined later. Forbrevity of notation we will write N(T, d, b−j) = N(b−j). Then, using (2.13)

P[

maxt∈F|Xt −Xt0| > u

∞∑j=1

aj]≤ P

[maxt∈F

∞∑j=1

|Xπj(t) −Xπj−1(t)| > u

∞∑j=1

aj]

≤ P[∪∞j=1 max

t∈F|Xπj(t) −Xπj−1(t)| > u aj

]≤

∞∑j=1

P[

maxt∈F|Xπj(t) −Xπj−1(t)| > u aj

].

Denoting E ′j = πj(t) | t ∈ F ⊂ Ej it follows that

Page 36: Compressive Sensing and Sparse Recovery2.15) and the Hoe dings inequality for real Rademacher sums (Proposition 2.19) based on Khintchine inequalities. We also prove Rudelson’s Lemma

36

P[

maxt∈F|Xt −Xt0| > u

∞∑j=1

aj]≤

∞∑j=1

P[

maxπj(t)∈E′j

πj−1(t)∈E′j−1

|Xπj(t) −Xπj−1(t)| > u aj]

≤∞∑j=1

∑πj(t)∈E′j

πj−1(t)∈E′j−1

P[|Xπj(t) −Xπj−1(t)| > u aj

]

≤∞∑j=1

|E ′j||E ′j−1| maxt∈F

P[|Xπj(t) −Xπj−1(t)| > u aj

]≤

∞∑j=1

N(b−j) N(b−(j−1)) maxt∈F

P[|Xπj(t) −Xπj−1(t)| > u aj

]≤

∞∑j=1

N(b−j) N(b−(j−1)) maxt∈F

P[|Xπj(t) −Xπj−1(t)| >

u ajb−j(1 + b)

d(πj(t), πj−1(t))]

≤ 2∞∑j=1

N(b−j) N(b−(j−1)) exp(− 1

2

u2(bjaj)2

(1 + b)2

).

Setting for j ≥ 1,

aj =√

2α−1(1 + b)b−j√

log(N(b−j)N(b−(j−1)) bj+1

)for some α > 0, to be chosen later, then we have that, for u ≥ α,

P[

maxt∈F|Xt −Xt0 | > u

∞∑j=1

aj]≤ 2

∞∑j=1

N(b−j) N(b−(j−1))(N(b−j)N(b−(j−1)) bj+1

)− u2α2

≤ 2∞∑j=1

b−(j+1)u2

α2 = 2b−u2

α2

∞∑j=1

b−ju2

α2

≤ 2b−u2

α2

∞∑j=1

b−j = 2b−u2

α21

b− 1.

(2.14)

Using that N(b−(j−1)) ≤ N(b−j) we obtain

Θ :=∞∑j=1

aj ≤√

2α−1(1 + b)∞∑j=1

b−j√

log(N2(b−j) bj+1

)≤√

2α−1(1 + b)∞∑j=1

b−j√

2 log(N(b−j)

)+ (j + 1) log b

≤√

2α−1(1 + b)∞∑j=1

b−j(√

2 log(N(b−j)

)+√

(j + 1) log b)

= 2α−1(1 + b)∞∑j=1

b−j√

log(N(b−j)

)+√

2α−1(1 + b)√

log b∞∑j=1

b−j√

(j + 1).

Page 37: Compressive Sensing and Sparse Recovery2.15) and the Hoe dings inequality for real Rademacher sums (Proposition 2.19) based on Khintchine inequalities. We also prove Rudelson’s Lemma

37

We treat the two terms of the above inequality separately. To bound the first term weobserve that

∞∑j=1

b−j√

log(N(b−j)

)=

b

b− 1

∞∑j=1

√log(N(b−j)

) ∫ b−j

b−(j+1)

du

≤ b

b− 1

∞∑j=1

∫ b−j

b−(j+1)

√log(N(T, d, u)

)du

≤ b

b− 1

∫ b−1

0

√log(N(T, d, u)

)du

≤ b

b− 1

∫ ∆(T )

0

√log(N(T, d, u)

)du.

In the second line we have used the fact that the function u 7→√

log(N(T, d, u)

)is

increasing.The second term is bounded as follows,

∞∑j=1

xj√

(j + 1) ≤∞∑j=1

xj(j + 1) =∞∑j=1

jxj +∞∑j=1

xj

= x∞∑j=1

xj−1 +∞∑j=1

xj = xd

dx

( ∞∑j=1

xj)

+∞∑j=1

xj

= xd

dx

( 1

1− x

)+

x

1− x=

x

(1− x)2+

x

1− x

=x

1− x2− x1− x

.

Then∞∑j=1

b−j√

(j + 1) =2b− 1

(b− 1)2.

Let denote

Θ∗ = 2α−1(1 + b)b

b− 1

∫ ∆(T )

0

√log(N(T, d, u)

)du+

√2α−1(1 + b)

√log b

2b− 1

(b− 1)2,

we have that

P[

maxt∈F|Xt −Xt0| > uΘ∗

]≤ 2

1

b− 1b−

u2

α2 . (2.15)

Using Lemma 2.6, the fact that any probability is bounded by 1 and equation (2.15) ,

E[

supt∈F|Xt −Xt0|p

]= p

∫ ∞0

P[supt∈F|Xt −Xt0 | ≥ v]vp−1dv

= Θp∗p

∫ ∞0

P[supt∈F|Xt −Xt0| ≥ uΘ∗]u

p−1du

= Θp∗p

(∫ α

0

up−1du+2

b− 1

∫ ∞α

b−u2

α2 up−1du

)

= Θp∗p

(αp

p+

2

b− 1

∫ ∞α

b−u2

α2 up−1du

)

Page 38: Compressive Sensing and Sparse Recovery2.15) and the Hoe dings inequality for real Rademacher sums (Proposition 2.19) based on Khintchine inequalities. We also prove Rudelson’s Lemma

38

It follows that, taking the supremum over all finite subsets F ⊂ T,

E[

supt∈T|Xt −Xt0 |p

] 1p

≤ Θ∗p1p

(αp

p+

2

b− 1

∫ ∞α

b−u2

α2 up−1du

) 1p

= p1pα−1(1 + b)

(αp

p+

2

b− 1

∫ ∞α

b−u2

α2 up−1du

) 1p

·

·(

2b

b− 1

∫ ∆(T )

0

√log(N(T, d, u)

)du+

√2 log b

2b− 1

(b− 1)2

).

We chose α =√

2 log b.

Step 3. Consider first p = 1. We have that

E[

supt∈T|Xt −Xt0|

]≤ K1

∫ ∆(T )

0

√log(N(T, d, u)

)du+ J1 ∆(T ),

where

K1 = α−1(1 + b)2b

b− 1

(α +

2

b− 1

∫ ∞α

b−u2

α2 du

)

J1 =√

2α−1(1 + b) log b2b− 1

(b− 1)2

(α +

2

b− 1

∫ ∞α

b−u2

α2 du

).

By Lemma 2.12 we have that∫ ∞α

b−u2

α2 du =

∫ ∞α

e−u2

2 du ≤ e−α2

2 min√π

2,

1

α

≤ 1

αe−

α2

2 =1

b√

2 log b.

It follows that

K1 ≤2b(1 + b)

b− 1+

2(b+ 1)

(b− 1)2 log b

and

J1 ≤(2b− 1)(b+ 1)

(b− 1)2+

(2b− 1)(b+ 1)

b(b− 1)3 log b.

Choosing b = 3.95 we obtain K1 ≤ 14.08 =: C1 and J1 ≤ 4.17 =: D1. This choice of thevalue of b is related to the minimum of the function K1 +J1, as shown in the figure below.

Figure 0: At the value b = 3.95 the function K1 + J1 has its minimum.

Page 39: Compressive Sensing and Sparse Recovery2.15) and the Hoe dings inequality for real Rademacher sums (Proposition 2.19) based on Khintchine inequalities. We also prove Rudelson’s Lemma

39

Now assume p ≥ 2. Recalling that α =√

2 log b, set

Kp = p1pα−1 2b(1 + b)

b− 1

(αpp

+2

b− 1

∫ ∞α

b−u2

α2 up−1du) 1p

and

Jp =√

2p1pα−1(b+ 1) log b

2b− 1

(b− 1)2

(αpp

+2

b− 1

∫ ∞α

b−u2

α2 up−1du) 1p.

To evaluate these constants we need to introduce the Gamma function Γ(x) =∫∞

0e−ttx−1dt

and the inequality Γ(x) ≤ xx−12

ex−1 , [21]. Then∫ ∞α

b−u2

α2 up−1du =

∫ ∞α

e−u2

2 up−1du ≤∫ ∞

0

e−u2

2 up−1du

=

∫ ∞0

(√

2t)p−2e−tdt = 2p2−1γ(

p

2)

≤ 2p2−1e−

p2

+1(p

2

) p2

( p2−1)

=e√2p

(2p

e2

) p4.

We obtain

Kp ≤ p1pα−1 2b(1 + b)

b− 1

(√2 log bp

p+

2

b− 1

e√2p

(2p

e2

) p4) 1p

≤ p1pα−1 2b(1 + b)

b− 1

(√2 log b

p1p

+( 2e√

2p(b− 1)

) 1p(2p

e2

) 14)

=2b(1 + b)

b− 1+

2b(1 + b)

b− 1

1√2 log b

(√2pe

b− 1

) 1p(2p

e2

) 14

=

√2eb

b− 1

√p

(2b(1 + b)

b(b− 1)

) ( 1

p14

+

√e

214

√log b

)≤√

2eb

b− 1

√p 2

34b(1 + b)

b(b− 1)

(1 +

√e

log b

)and

Jp =√

2p1pα−1(b+ 1) log b

2b− 1

(b− 1)2

(αpp

+2

b− 1

∫ ∞α

b−u2

α2 up−1du) 1p

≤√

2p1pα−1(b+ 1) log b

2b− 1

(b− 1)2

(√2 log b

p1p

+( 2e√

2p(b− 1)

) 1p(2p

e2

) 14)

=√

2(2b− 1)(b+ 1)

(b− 1)2log b+

(2b− 1)(b+ 1)

(b− 1)2log b

(√2pe

b− 1

) 1p(2p

e2

) 14

≤√

2eb

b− 1

√p

(2b− 1)(b+ 1)

b(b− 1)2log b

(√2 log b+

√e

214

).

In conclusion, choosing b = 2.76 we have that

β =

√2eb

b− 1= 6.028

Kp ≤ 234b(1 + b)

b(b− 1)

(1 +

√e

log b

)=: C ≈ 9.47

Jp ≤(2b− 1)(b+ 1)

b(b− 1)2log b

(√2 log b+

√e

214

)= D ≈ 5.63.

ut

Page 40: Compressive Sensing and Sparse Recovery2.15) and the Hoe dings inequality for real Rademacher sums (Proposition 2.19) based on Khintchine inequalities. We also prove Rudelson’s Lemma

40

2.8 Deviation Inequalities for Suprema of Empirical

Process

We introduce a deviation inequality for supremum of empirical processes, on which de-pends Theorem 4.10.

Definition 2.31 A function f of n variables is subadditive if there exists a sequence offunctions fk on n− 1 variables such that for all x1, . . . , xn,

n∑k=1

(f(x1, . . . , xn)− fk(x1, . . . , xk, . . . , xn)

)≤ f(x1, . . . , xn).

Definition 2.32 A function of n independent random variables X1, . . . , Xn is subaddi-tive if Z = φ(X1, . . . , Xn) is σ(X1, . . . , Xn) measurable, Zk = φk(X1, . . . , Xk, . . . , Xn) isσ(X1, . . . , Xk, . . . , Xn) measurable for all k = 1, . . . , n and

n∑k=1

(Z − Zk) ≤ Z almost surely.

We consider a sequence of independent random variables X1, . . . , Xn taking values in aPolish space X . We recall that

Definition 2.33 A Polish space is a separable completely metrizable topological space,that is, a space homeomorphic to a complete metric space that has a countable densesubset.

We introduce the following σ algebras

A := σ(X1, . . . , Xn)

Ak := σ(X1, . . . , Xk, . . . , Xn) for all k = 1, . . . , n.

Given a function f : X n → R and n functions fk : X n−1 → R Borel measurable, we definethe random variables

Z := φ(X1, . . . , Xn)

Zk := φk(X1, . . . , Xk, . . . , Xn).

We denote Ek the expectation taken conditionally to Ak. We recall the following result,proven in [1].

Theorem 2.34 Assume that Z is subadditive, and let Yk be a A measurable randomvariable such that Yk ≤ Z − Zk ≤ 1 and Ek[Yk] ≥ 0. Let σ2 be a real number such that

σ2 ≥ 1

n

n∑k=1

Ek[Y 2k ].

If, for all k = 1, . . . , n there exists some b > 0 such that Yk ≤ b almost surely, then weobtain, for all λ ≤ 0,

logE[eλ(Z−E(Z))

]≤ ψ(−λ)v,

where v = (1 + b)E[Z] + nσ2 and ψ(x) = e−x − 1 + x.

Page 41: Compressive Sensing and Sparse Recovery2.15) and the Hoe dings inequality for real Rademacher sums (Proposition 2.19) based on Khintchine inequalities. We also prove Rudelson’s Lemma

41

We now apply the above result in order to have a deviation inequality for the supremumof an empirical process indexed by a class of bounded functions.

Theorem 2.35 Let X1, . . . , Xn be independent random variables taking values in a Polishspace X and let F be a countable set of functions from X to R. Assume that all functionsf ∈ F are measurable, square-integrable and satisfy E[f(Xk)] = 0.Assume that supf∈F ess supf ≤ 1 and denote

Z = supf∈F

n∑k=1

f(Xk).

Let σ be a positive real number such that nσ2 ≥∑n

k=1 supf∈F E[f 2(Xk)]. Then, for allx ≥ 0, we have

P[Z ≥ E[Z] + x

]≤ exp

(− vh(

x

v))

where v = nσ2 + 2E[Z] and h(x) = (1 + x) log(1 + x)− x. Moreover,

P[Z ≥ E[Z] + x

]≤ exp

(− x2

2v + 23x

).

Proof. It is sufficient to prove the theorem in the case F is a finite set of functions, thegeneral case is deduced from this. Indeed, suppose F = ft | t ∈ I, where I is a nonfinite set of indexes. Without loss of generality we may suppose I = N. Observe that

Z = supf∈F

n∑i=1

f(Xk) = supt∈I

n∑i=1

ft(Xk)

= limN→∞

supt≤N

n∑i=1

ft(Xk) =: limN→∞

ZN .

Observe that ZN ≤ ZN+1 and ZN ≤ Z for all N. Then by the monotone convergencetheorem we have that limN→∞ E[ZN ] = E[Z]. Hence limN→∞ vN = v, where vN = nσ2 +2E[ZN ].It follows,

P[Z ≥ E[Z] + x] = limN→∞

P[ZN ≥ E[Z] + x]

≤ limN→∞

P[ZN ≥ E[ZN ] + x]

≤ limN→∞

exp(− vNh(

x

vN))

= exp(− vh(

x

v)),

where in the last line we have used the thesis for a finite set and the continuity of thefunction exp .

We may then assume that I is a finite set of indexes. Let

Zk = supf∈F

∑i 6=k

f(Xi) = maxf∈F

∑i 6=k

f(Xi).

Let define the random variable

tk = mint ∈ I |

∑i 6=k

ft(Xi) = Zk

Page 42: Compressive Sensing and Sparse Recovery2.15) and the Hoe dings inequality for real Rademacher sums (Proposition 2.19) based on Khintchine inequalities. We also prove Rudelson’s Lemma

42

and similarly

t0 = mint ∈ I |

n∑i=1

ft(Xi) = Z.

Observe thatt ∈ I |

∑i 6=k ft(Xi) = Zk

is not empty since I is finite.

Let defineYk = ftk(Xk).

We have thatEk[Yk] = Ek[ftk(Xk)] = 0

since, by assumption, for all f ∈ F , E[f(Xk)] = 0. Moreover,

Yk = ftk(Xk) =n∑i=1

ftk(Xi)−∑i 6=k

ftk(Xi)

≤ supf∈F

n∑i=1

ftk(Xi)− Zk = Z − Zk

=n∑i=1

ft0(Xi)− Zk

≤n∑i=1

ft0(Xi)−∑i 6=k

ft0(Xi)

= ft0(Xk) ≤ 1 almost surely.

It follows, by definition of Yk and ftk ,

n∑k=1

Ek[Y 2k ] =

n∑k=1

Ek[f 2tk

(Xk)] ≤n∑k=1

supf∈F

Ek[f 2(Xk)

]≤ nσ2.

By Theorem 2.34, with b = 1 we have that

logE[eλ(Z−E[Z])

]≤ ψ(−λ)v

where v = (2E[Z] + nσ2). This means that

E[eλ(Z−E[Z])

]≤ exp

(ψ(−λ)v

).

Then, by Markov inequality,

P[Z ≥ E[Z] + x

]= P

[eλ(Z−E[Z]) ≥ eλx

]≤ E[eλ(Z−E[Z])]e−λx

≤ exp(ψ(−λ)v

)exp(−λx).

Since ψ(−λ)v − λx = −v(1 + λ+ λxv− eλ), setting λ = log(x

v+ 1) we have

1 + λ+ λx

v− eλ = 1 + log(

x

v+ 1) + log(

x

v+ 1)

x

v− x

v− 1 = log(

x

v+ 1)(

x

v+ 1)− x

v.

ThenP[Z ≥ E[Z] + x

]≤ exp

(ψ(−λ)v

)e−λx

= exp(− v(

log(x

v+ 1)(

x

v+ 1)− x

v

))= exp

(− vh(

x

v)).

Page 43: Compressive Sensing and Sparse Recovery2.15) and the Hoe dings inequality for real Rademacher sums (Proposition 2.19) based on Khintchine inequalities. We also prove Rudelson’s Lemma

43

To see the last bound, we observe that for x > 0,

h(x) ≥ x2

2 + 23x.

Indeed, for some a to be chosen later, let

g(x) = (1 + x) log(1 + x)− x− x2

2 + ax.

Then we have that

g′(x) = log(1 + x)− 4x+ ax2

(2 + ax)2

and

g′′(x) =a3x3 + 6a2x2 + (12a− 8)x

(1 + x)(2 + ax)3.

For a ≥ 23, we have that g′′(x) > 0 which means g′ increasing, then g′(x) > g′(0) = 0.

This implies g(x) > g(0) = 0 then, choosing a = 23

we have

(1 + x) log(1 + x)− x ≥ x2

2 + 23x.

ut

Page 44: Compressive Sensing and Sparse Recovery2.15) and the Hoe dings inequality for real Rademacher sums (Proposition 2.19) based on Khintchine inequalities. We also prove Rudelson’s Lemma

44

Page 45: Compressive Sensing and Sparse Recovery2.15) and the Hoe dings inequality for real Rademacher sums (Proposition 2.19) based on Khintchine inequalities. We also prove Rudelson’s Lemma

Chapter 3

Recovery via `1 minimization

In this chapter we introduce the concept of sparsity and we consider `1 minimization asrecovery method for sparse vectors. In the last sections we proved some results concerningGaussian random matrices. We first recall some definitions.

Definition 3.1 A set φjj=1,...,N is called a basis for RN if the vectors in the set arelinearly independent and they span RN .

This means that for any x ∈ RN there exist (unique) coefficients c1, . . . , cN such that

x =N∑j=1

cjφj.

Note that if we denote Φ = (φ1| . . . |φN) the matrix in Mn(R) with columns given by φjand c = t(c1, . . . , cN), then we can rewrite the relation above as

x = Φc.

An important special case of a basis is the orthonormal basis. See Section 4 for moredetails.

We now generalize the concept of basis and introduce the definition of frame.

Definition 3.2 A frame is a set of vectors φjj=1,...,N in Rd, d < N corresponding to amatrix Φ ∈Md,n(R) such that for all vectors x ∈ Rd,

C1‖x‖22 ≤ ‖Φx‖2

2 ≤ C2‖x‖22,

with 0 < C1 ≤ C2 <∞.

3.1 Sparsity

Compressive sensing is based on the empirical observation that many types of real worldsignals can be well approximated as a linear combination of just a few elements from aknown basis, or, in other words, that the coefficients vector can be well approximated byone having only a small number of nonvanishing entries.

The support of a vector x ∈ CN is denoted supp(x) = j | xj 6= 0. Let define

||x||0 := |supp(x)|

and it is often called `0 norm.

45

Page 46: Compressive Sensing and Sparse Recovery2.15) and the Hoe dings inequality for real Rademacher sums (Proposition 2.19) based on Khintchine inequalities. We also prove Rudelson’s Lemma

46

Definition 3.3 Given k ∈ 1, . . . , N, a vector x is called k-sparse if

||x||0 ≤ k.

and

Σk := x ∈ CN | ||x||0 ≤ k

denotes the set of k-sparse vectors.

Only few real world signals are truly sparse, rather they are compressible, meaning thatthey can be well approximated by sparse signals.

We can quantify the compressibility by calculating the error incurred by approximatinga signal x by a k sparse one. The best k-term approximation error of a vector x ∈ CN in`p is defined as

σk(x)p = infz∈Σk||x− z||p.

Clearly if x ∈ Σk then σk(x)p = 0 for any p.

Note that in order to compress x one may simply store only the k largest entries. Whenreconstructing x from its compressed version, the nonstored entries are set to zero, andthe reconstructing error is σk(x)p.

It is emphasized at this point that the procedure of obtaining a compressed version of avecor x is adaptive and nonlinear, since it is required the search of the largest entries of xin absolute value. Indeed, sparsity is a highly nonlinear model, since the choice of whichdictionary elements are used can change from signal to signal.

Moreover, for any x, z ∈ Σk we do not necessarily have that x + y ∈ Σk, althoughx+ y ∈ Σ2k. The set of sparse signal Σk does not form a linear space.

The best k−term approximationan of x can be obtained using the rearrangemnet r(x) =(|xi1 |, ..., |xiN |), where ij denotes a permutation of the indexes such that |xij | ≥ |xij+1

|,for j = 1, .., N. Let denote rj(x) the jth component of r(x). Then

σk(x)p =( N∑j=k+1

rj(x)p) 1p, 0 < p <∞,

and the vector x[k], derived from x by setting to zero all the N − k smallest entries inabsolute value is the best k−term approximation, that is

x[k] = arg minz∈Σk||x− z||p, 0 < p ≤ ∞.

The following Lemma shows that the elements of BNq are well approximated by sparse

vectors.

Lemma 3.4 Let 0 < q < p ≤ ∞ and set r = 1q− 1

p. Then

σk(x)p ≤ k−r,

for all x ∈ BNq , and k = 1, ..., N.

Page 47: Compressive Sensing and Sparse Recovery2.15) and the Hoe dings inequality for real Rademacher sums (Proposition 2.19) based on Khintchine inequalities. We also prove Rudelson’s Lemma

47

Proof. Let T be the set of indexes of the k largest entries of x in absolute value. Usingthe notation above, we have that |rk(x)| ≤ |xj| for all j ∈ T and therefore, since x ∈ BN

q ,

krk(x)q ≤∑j∈T

|xj|q ≤ ||x||qq ≤ 1.

Hence

rk(x) ≤ k−1q .

Therefore,

σk(x)pp =∑j /∈T

|xj|p ≤∑j /∈T

|rk|p−q|xj|q

≤∑j /∈T

k−p−qq |xj|q

≤ k−p−qq ||x||qq ≤ k−

p−qq ,

which implies

σk(x)p ≤ k−r.

ut

3.2 Compressive Sensing

The above strategy of compressing a signal x by keeping only its largest coefficients isvalid when full information on x is avaliable. The aim is to obtain, in a more direct way, acompressed version of the signal by taking only a small amount of linear and nonadaptivemeasurements. Compressive sensing allows the reconstruction from vastly undersampledmeasurements.

Taking m linear measurements of a signal x ∈ CN corresponds to apply a matrix A ∈Mm,N(C), that is

y = Ax,

the matrix A is the measurement matrix, the vector y is called the measurement vector.

The matrix A represents a dimensionality reduction, that is it maps CN into Cm, where mis usually much smaller than N. We assume that measurements are nonadaptive, meaningthat the rows of A are fixed in advance and do not depend on the previously acquiredmeasurements.

The main interest is the vastly undersampled case, m N. It is impossible to recoverx from y, since the linear system is underdetermined, unless we impose the additionalassumption that x is k sparse.

The main questions are how designing the measurement matrix A to ensure that it pre-serves the information in the signal x and how we can recover the original vector x fromthe measurements vector y.

Given a measurement vector y and knowing that the original signal x is sparse, or com-pressible, a natural approach is to attempt to recover x by solving an optimization problemof the form

min ||z||0 subject to z ∈ B(y). (3.1)

Page 48: Compressive Sensing and Sparse Recovery2.15) and the Hoe dings inequality for real Rademacher sums (Proposition 2.19) based on Khintchine inequalities. We also prove Rudelson’s Lemma

48

In the case the measurement is noise free we can set B(y) = z | Az = y, while in the caseit has been contaminated by a bounded noise, we will consider B(y) = z | ‖Az−y‖2 ≤ ε.In both cases we find the sparest x that is consistent with the measurement y.Note that we are assuming that x itself is sparse. In the more common set where x = Φc,we consider the problem (3.1) where B(y) = z | AΦz = y or B(y) = z | ‖AΦz− y‖2 ≤ε. Choosing A = AΦ we obtain the same problem.

Unfortunately, an algorithm that solves this `0 minimization problem for any matrixA and any vector y is computationally intractable. Therefore, an alternative to the `0

minimization problem is the `1 minimization, which can be interpret as its convexification.

The `1 minimization approach considers the solution of

min ||z||1 subject to Az = y. (3.2)

In general, if we are interested in solving the problem

min f(x) subject to x ∈ C,

where f is a non convex function and C is a closed convex set, it may be convenient toconsider its convexification,

min f(x) subject to x ∈ C,

wheref(x) = supg(x) | g(y) ≤ f(y) ∀y, g convex .

The function f is called the convex relaxation or the convex envelop of f.

While f can have many local minimizers on C, its convex relaxation f has global mini-mizers and such global minima are likely to be in a neighborhood of a global minimum off.

As said before, we can interpret the `1 norm as the convex relaxation of the `0 norm.Indeed, since

||x||0 = |supp(x)| =N∑j=1

|sgn(xj)|

we have that, for every y, the convex envelope of ‖·‖0 restricted to B∞(0, R)∩z | Az = y,is bounded below by 1

R

∑Nj=1 |xj| =

1R||x||1. Indeed,

N∑j=1

|sgn(xj)| ≥N∑j=1

|xj|R

=1

R‖x‖1.

However it is not clear when a global minimizer of the `1 minimization coincides with asolution of the `0 minimization problem (3.1). There are intuitive reasons to expect that`1 minimization will promote sparsity. For example, suppose we are given a signal x ∈ R2,that means N = 2, m = 1. Hence the affine space of solution F = z | Az = y is a linein R2. We wish to approximate x using a point in F .If we measure the approximation error using an `p norm, then our task is to find the z ∈ Fthat minimizes ‖x− z‖p. The choice of p will have significant effects on the properties ofthe resulting approximation error, as shown in Figure 0. To compute the closest point

Page 49: Compressive Sensing and Sparse Recovery2.15) and the Hoe dings inequality for real Rademacher sums (Proposition 2.19) based on Khintchine inequalities. We also prove Rudelson’s Lemma

49

Figure 0: The figure shows the best approximation of a point in R2 by a one dimensionalsubspace using the `p norms.

in F to x using an `p norm, we can imagine growing an `p sphere centered on x until itintersects with F . This will be the point z ∈ F .We observe that smaller p leads to an error that is more unevently distributed and tendsto be sparse.

If we do not consider the situation where N = kerA is parallel to one of the faces of the`1 ball B1, then we find a unique minimizer for the `1 problem (3.2) which also coincideswith the minimizer of the `0 problem.

In higher dimension these reasonings become more involved. The so called Null SpaceProperty is introduced in order to guarantee that , in high dimension, the solution of the`1 minimization (3.2) coincides with the solution of the `0 minimization problem (3.1).

3.3 The Null Space Property

Definition 3.5 A matrix A ∈ Mm,N(C) satisfies the null space property (NSP) of orderk with constant γ ∈ (0, 1) if

||ηT ||1 ≤ γ||ηTC ||1,

for all η ∈ kerA and for all sets T ⊂ 1, ..., N, |T | ≤ k.

Lemma 3.6 The null space property is equivalent to the condition that for any T ⊂1, . . . , N, |T | ≤ k, η ∈ kerA,

‖ηT‖1 < ‖ηTC‖1. (3.3)

Proof. Clearly the NSP implies condition (3.3) since 0 < γ < 1.Moreover, we observe that, given T ⊂ 1, ..., N, |T | ≤ k, the null space property isequivalent to the condition

‖PT‖1→1 ≤γ

1 + γ,

where PT : kerA→ Ck is the linear operator defined by

PTη = ηT .

Page 50: Compressive Sensing and Sparse Recovery2.15) and the Hoe dings inequality for real Rademacher sums (Proposition 2.19) based on Khintchine inequalities. We also prove Rudelson’s Lemma

50

Indeed, condition‖ηT‖1 ≤ γ‖ηTC‖1 = γ(‖η‖1 − ‖ηT‖1)

is equivalent to(1 + γ)‖ηT‖1 ≤ γ‖η‖1.

That is‖PTη‖1 ≤

γ

1 + γ‖η‖1

and hence, taking the supremum over ‖x‖1 = 1 and using the fact that 0 < γ < 1,

‖PT‖1→1 ≤γ

1 + γ<

1

2.

Moreover, suppose that for every T ⊂ 1, ..., N, |T | ≤ k, there exist γT such that

‖PT‖1→1 ≤γT

1 + γT. (3.4)

Let γ be the supremum of γT satisfying (3.4). Since there exists only a finite number ofγT then γ < 1, which means the null space property holds with γ = γ. We conclude theproof observing that condition (3.3) to ‖ηT‖1 <

12.

ut

Theorem 3.7 Let A ∈ Mm,N(C) be a matrix that satisfies the null space property oforder k with constant γ ∈ (0, 1). Let x ∈ CN , y = Ax and let x∗ be a solution of the `1

minimization problem (3.2). Then

||x− x∗||1 ≤2(1 + γ)

1− γσk(x)1.

In particular, if x is k sparse then x = x∗.

Proof. Let η = x − x∗. Then η ∈ kerA and ||x∗||1 ≤ ||x||1 since x∗ is a solution of the`1 minimization problem (3.2). Let T be the set of the k largest entries of x in absolutevalue. It follows from the triangular inequality that

‖xT‖1 − ‖ηT‖1 + ‖ηTC‖1 − ‖xTC‖1 ≤ ‖xT − ηT‖1 + ‖ηTC − xTC‖1

= ‖x∗T‖1 + ‖x∗TC‖1

≤ ‖xT‖1 + ‖xTC‖1.

Then||ηTC ||1 ≤ 2||xTC ||1 + ||ηT ||1

≤ 2σk(x)1 + γ||ηTC ||1where the last inequality follows from the NSP and the fact that ||xTC ||1 = σk(x)1. Indeed,

σk(x)1 =N∑

j=k+1

rj(x) =∑j∈TC|xj| = ||xTC ||1.

It follows that

||ηTC ||1 ≤2

1− γσk(x)1. (3.5)

Page 51: Compressive Sensing and Sparse Recovery2.15) and the Hoe dings inequality for real Rademacher sums (Proposition 2.19) based on Khintchine inequalities. We also prove Rudelson’s Lemma

51

Finally

||x− x∗||1 = ||η||1 = ||ηT ||1 + ||ηTC ||1≤ (1 + γ)||ηTC ||1 NSP

≤ (1 + γ)2

1− γσk(x)1.

ut

We have then proved that given a matrix A ∈Mm,N(C) satisfying the null space propertyof order k with constant γ ∈ (0, 1), and x ∈ CN such that y = Ax, if x is k sparse thereexists a unique solution of the `1 minimization problem (3.2).

Conversely, the following theorem holds.

Theorem 3.8 If any k sparse vector x ∈ CN is the unique solution of the `1 minimizationproblem (3.2) with y = Ax, then the matrix A satisfies the null spase property of order kwith some constant γ ∈ (0, 1).

Proof. Since every k sparse vector x ∈ CN is the unique minimizer of ||z||1 subject toAz = Ax, then, for any η ∈ kerA and for any T ⊂ 1, ..., N, such that |T | = k, thek−sparse vector ηT is the unique minimizer of ||z||1 subject to Az = AηT .

Observe that

0 = Aη = AηT + AηTC .

Therefore we must have

||ηT ||1 < ||ηTC ||1,

since ηT is the unique minimizer of ||z||1 subject to Az = AηT . That is, the null spaceproperty holds,

||ηT ||1 ≤ γ||ηTC ||1, γ ∈ (0, 1).

ut

Therefore, the null space property is equivalent to sparse `1 recovery.

3.4 The Restricted Isometry Property

The null space property is difficult to show directly, so we introduce the so called RestrictedIsometry Property, which is easier to prove.

Definition 3.9 The restricted isometry constant δk of a matrix A ∈ Mm,N(C) is thesmallest number such that

(1− δk)||z||22 ≤ ||Az||22 ≤ (1 + δk)||z||22 (3.6)

for all z ∈ Σk.

A matrix A is said to satisfy the restricted isometry property (RIP) of order k with con-stant δk if δk ∈ (0, 1).

Page 52: Compressive Sensing and Sparse Recovery2.15) and the Hoe dings inequality for real Rademacher sums (Proposition 2.19) based on Khintchine inequalities. We also prove Rudelson’s Lemma

52

Equivalentely δk can be defined as

δk = maxT⊂1,...,N,|T |≤k

||A∗TAT − I||2→2. (3.7)

Indeed, by definition we have that∣∣||Az||22 − ||z||22∣∣ ≤ δk||z||22

and moreover ∣∣||Az||22 − ||z||22∣∣ = |〈(A∗A− I)z, z〉|.

Taking the supremum over z such that supp(z) ∈ T, ||z||2 = 1 we have that

||A∗TAT − I||2→2 ≤ δk,

since A∗A− I is hermitian. Taking the maximum over T ⊂ 1, ..., N, |T | ≤ k,

maxT⊂1,...,N,|T |≤k

||A∗TAT − I||2→2 ≤ δk.

Since δk is, by definition, the smallest number for which (3.6) holds, we have proved that

maxT⊂1,...,N,|T |≤k

||A∗TAT − I||2→2 = δk.

Observe that, since a k sparse vector is a k + 1 sparse vector, it follows that

δ1 ≤ δ2 ≤ δ3 ≤ . . .

Moreover, using the characterization (3.7), in the case k = 1, we have that δ1 = 0.

The following lemma shows that the restricted isometry property implies the null spaceproperty.

Lemma 3.10 Let A ∈ Mm,N(C) be a matrix satisfying the restricted isometry propertyof order K = k + h with constant δK ∈ (0, 1). Then A satisfies the null space property of

order k with constant γ =√

kh

1+δK1−δK

.

Proof. Let η ∈ kerA and T ⊂ 1, ..., N, |T | ≤ k. Define T0 = T and let T1, ..., Tsdisjoint sets of indexes such that |Tj| = h, for h ≤ s − 1 and |Ts| ≤ h, associated to arearrangement of the entries of η,

|ηj| ≤ |ηi| for all j ∈ Tl, i ∈ Tl′ , l ≥ l′ ≥ 1. (3.8)

Note that

0 = Aη = Aη∪sj=0Tj= AηT0∪T1 +

s∑j=2

AηTj .

Page 53: Compressive Sensing and Sparse Recovery2.15) and the Hoe dings inequality for real Rademacher sums (Proposition 2.19) based on Khintchine inequalities. We also prove Rudelson’s Lemma

53

Then||ηT ||1 ≤

√k||ηT ||2 by Cauchy-Schwarz inequality

≤√k||ηT0∪T1||2

≤√

k

1− δK||AηT0∪T1||2 RIP

=

√k

1− δK||Aη∪sj=2Tj

||2

≤√

k

1− δK

s∑j=2

||AηTj ||2

≤√k

1 + δK1− δK

s∑j=2

||ηTj ||2 RIP.

(3.9)

It follows from (3.8) that

|ηi| ≤ |ηl| for all i ∈ Tj+1, l ∈ Tj.

Taking the sum over l ∈ Tj,|ηi| |Tj| ≤ ||ηTj ||1

and hence|ηi| ≤ h−1||ηTj ||1.

Taking the `2 norm over i ∈ Tj+1 yields∑i∈Tj+1

|ηi|2 ≤∑i∈Tj+1

h−2||ηTj ||21 = h−2||ηTj ||21|Tj+1| ≤ h−1||ηTj ||21

since |Tj+1| = h. We have just prove that

||ηTj+1||2 ≤ h−

12 ||ηTj ||1. (3.10)

Using the latter estimate in (3.9) we have that

||ηT ||1 ≤√k

1 + δK1− δK

s∑j=2

||ηTj ||2 =

√k

1 + δK1− δK

s−1∑j=1

||ηTj+1||2

≤√k

1 + δK1− δK

s−1∑j=1

h−12 ||ηTj ||1 using (3.10)

≤√k

1 + δK1− δK

h−12 ||ηTC ||1

=

√k

h

1 + δK1− δK

||ηTC ||1.

(3.11)

ut

Observe that the previous lemma does not guarantees that γ < 1. However, taking h = 2k,the above lemma shows that if δ3k <

13

then γ < 1, in fact

γ =

√k

2k

1 + δ3k

1− δ3k

<

√1

2

1 + 1/3

1− 1/3= 1.

As a corollary of Theorem 3.7 we have the following result.

Page 54: Compressive Sensing and Sparse Recovery2.15) and the Hoe dings inequality for real Rademacher sums (Proposition 2.19) based on Khintchine inequalities. We also prove Rudelson’s Lemma

54

Corollary 3.11 Let A ∈Mm,N(C) be a matrix satisfying the restricted isometry propertyof order 3k and constant δ3k <

13

then every k sparse vector x ∈ CN can be recovered by`1 minimization.

Next theorem shows that the restricted isometry property implies also a bound on thereconstruction error in `2.

Theorem 3.12 Let A ∈Mm,N(C) be a matrix satisfying the restricted isometry propertyof order 3k with δ3k < 1

3. For x ∈ CN , let y = Ax and x∗ be a solution of the `1

minimization problem (3.2). Then

||x− x∗||2 ≤ Cσk(x)1√

k,

with C = (2γ + 1) 21−γ and γ =

√1+δ3k

2(1−δ3k).

Proof. As in the proof of the lemma, let η = x∗ − x ∈ kerA, T0 = T the set of the klargest entries of η in absolute value, and Tj sets of size at most 2k, corresponding to thenonincreasing rearrangement of η.As in equations (3.9) and (3.11), choosing h = 2k,

||ηT ||2 ≤√

1 + δ3k

1− δ3k

s∑j=2

||ηTj ||2

≤√

1 + δ3k

1− δ3k

(2k)−12 ||ηTC ||1

= γk−12 ||ηTC ||1

(3.12)

where γ =√

1+δ3k2(1−δ3k)

.

From the assumption δ3k <13

it follows that γ < 1. Then

||ηTC ||2 ≤ σk(η)2 ≤ k−12 ||η||1 by Lemma 3.4

≤ k−12 (γ + 1)||ηTC ||1 RIP

(3.13)

Then, by the triangular inequality,

||x− x∗||2 = ||η||2 ≤ ||ηT ||2 + ||ηTC ||2≤ γk−

12 ||ηTC ||1 + ||ηTC ||2 by (3.12)

≤ γk−12 ||ηTC ||1 + (γ + 1)k−

12 ||ηTC ||1 by (3.13)

= k−12 (2γ + 1)||ηTC ||1

≤ k−12 (2γ + 1)

2

1− γσk(x)1 by (3.5)

since it holds ||ηTC ||1 ≤ ||η(suppx[2k])C ||1 ≤ ||η(suppx[k])

C ||1.ut

Suppose the measurements are contaminated with some bounded noise, that is we considerthe problem

Ax = y0 + ε with ‖ε‖2 ≤ η.

The following theorem is the best known result [18] concerning recovery using a noise.

Page 55: Compressive Sensing and Sparse Recovery2.15) and the Hoe dings inequality for real Rademacher sums (Proposition 2.19) based on Khintchine inequalities. We also prove Rudelson’s Lemma

55

Theorem 3.13 Assume that the restricted isometry constant δ2k of the matrix A ∈Mm,N(C) satisfies

δ2k <2

3 +√

74

≈ 0.4627.

Then the following holds for all x ∈ CN . Let noisy measurements

y = Ax+ e

be given with ||e||2 ≤ η. Let x∗ be a solution of

min ||z||1 subject to ||Az − y||2 ≤ η.

Then,

||x− x∗||2 ≤ C1η + C2σk(x)1√

k

for some constants C1, C2 > 0 that depend only on δ2k.

3.5 Recovery of Individual Vectors

Let introduce the sign vector sgn(x) ∈ CN whose entries are

sgn(x)j :=

xj|xj | if xj 6= 0

0 if xj = 0j = 1, . . . , N.

The following theorem makes clear that the success of sparse recovery by `1 minimizationonly depends on the support set S and on the sign pattern of the nonzero coefficients ofx.

Theorem 3.14 Let A ∈Mm,N(C) and let x ∈ CN with S = supp(x). Assume that AS isinjective and that there exists a vector h ∈ Cm such that

A∗Sh = sgn(xS)

|(A∗h)l| < 1, l ∈ 1, ..., N \ S.Then x is the unique solution of the `1 minimization problem (3.2) with y = Ax.

Proof. We have that

||x||1 =N∑j=1

|xj| =N∑j=1

xj|xj|

xj =∑j∈S

sgn(xj)xj =∑j∈S

(A∗h)jxj = 〈A∗h, x〉 = 〈h,Ax〉.

Thus, for z ∈ CN , z 6= x, such that Az = y, we have that

||x||1 = 〈h,Ax〉 = 〈h,Az〉 = 〈A∗h, zS〉+ 〈A∗h, zSC 〉≤ ||(A∗h)S||∞||zS||1 + ||(A∗h)SC ||∞||zSC ||1 < ||zS||1 + ||zSC ||1 = ||z||1,

using the hypothesis. Note that the strict inequality follows from ||zSC ||1 > 0 because ofthe hypothesis of injectivity of AS. Indeed, if zSC = 0, we have y = Az = ASz. On theother hand, y = Ax = ASx. Then we have that ASz = ASx, which is a contradiction.Hence the vector x is the unique solution of the `1 minimization problem (3.2). ut

Page 56: Compressive Sensing and Sparse Recovery2.15) and the Hoe dings inequality for real Rademacher sums (Proposition 2.19) based on Khintchine inequalities. We also prove Rudelson’s Lemma

56

Choosing the vector h = (A†S)∗sgn(xS) leads to the following corollary.

Corollary 3.15 Let A ∈ Mm,N(C) and let x ∈ CN with S = supp(x). If the matrix ASis injective and if

|〈A†Sal, sgn(xS)〉| < 1 l ∈ 1, ..., N \ Sthen the vector x is the unique solution to the `1 minimization problem (3.2) with y = Ax.

Proof. The vector h = (A†S)∗sgn(xS) satisfies the condition A∗Sh = sgn(xS), indeed

A∗Sh = A∗S(A†S)∗sgn(xS) = A∗S((A∗SAS)−1A∗S)∗sgn(xS) = (A∗SAS)(A∗SAS)−1sgn(xS) = sgn(xS).

Hence, the statement follows form Theorem 3.14, since

〈sgn(xS), A†Sal〉 = a∗l (A†S)∗sgn(xS) = (A∗h)l.

ut

3.6 Coherence

The coherence is a way of analyzing the recovery abilities of a measurement matrix. Asmall coherence is desired in order to have good sparse recovery properties.

Definition 3.16 The coherence of a matrix A is the largest absolute inner product be-tween two any columns aj, ak of A :

µ := max1≤j<k≤N

|〈aj, ak〉|‖aj‖2‖ak‖2

.

If a matrix A = (a1|...|aN) ∈ Mm,N(C) has normalized columns ||al||2 = 1, then thecoherence is defined as

µ = maxl 6=k|〈al, ak〉|.

An example of matrix with small coherence is the concatenation of A = (I|F ) ∈Mm,2m(C)of the identity matrix and the unitary Fourier matrix F ∈ Mm(C) with entries Fjk =

1√me

2πijkm . In this case, µ = 1√

m, since F is unitary. Then δk ≤ k−1√

m, in particular

δk ≤ C k√m, for some C.

Proposition 3.17 In the case the matrix A has normalized columns, it holds that

δk ≤ (k − 1)µ.

Proof. Note that the matrix A∗TAT − I has zeros on the diagonal. Indeed, since thecolumns of A are orthonormal,

(A∗TAT )ij =m∑k=1

akiakj =m∑k=1

|aki|2 = ‖ai‖22 = 1.

Since ||A||1→1 = maxk∈1,...,N∑m

j=1 |ajk|, then

||A∗TAT − I||1→1 = maxj∈T

∑k∈Tk 6=j

|〈aj, ak〉|.

Page 57: Compressive Sensing and Sparse Recovery2.15) and the Hoe dings inequality for real Rademacher sums (Proposition 2.19) based on Khintchine inequalities. We also prove Rudelson’s Lemma

57

Taking the maximum over T ⊂ 1, ..., N, |T | ≤ k, we have

maxT⊂1,...,N|T |≤k

||A∗TAT − I||1→1 = maxT⊂1,...,N|T |≤k

maxj∈T

∑l∈Tl 6=j

|〈aj, al〉|

= maxj∈1,...,N

maxT⊂1,...,N−j

|T |≤k−1

∑l∈T

|〈aj, al〉|

≤ maxj∈1,...,N

maxT⊂1,...,N−j

|T |≤k−1

∑l∈T

µ

= µ (k − 1).

Furthermore, using (3.7) and (2.2)

δk = maxT⊂1,...,N|T |≤k

||A∗TAT − I||2→2 ≤ maxT⊂1,...,N|T |≤k

||A∗TAT − I||1→1.

Then we have just proved thatδk ≤ (k − 1)µ.

ut

As a consequence of Proposition 3.17 and Corollary 3.11 it holds the following corollary.

Corollary 3.18 Let A ∈Mm,N(C) be a matrix satisfying the restricted isometry propertyof order 3k and constant δ3k ≤ (3k− 1)µ < 1

3, then all k sparse vectors x can be recovered

from y = Ax via `1 minimization.

In the case the coherence µ = 1√m

we have that all k sparse vectors x can be recoveredfrom y = Ax via `1 minimization, provided

m > C ′k2. (3.14)

The above result can be improved but we cannot use the coherence because of the followinggeneral lower bound, known as Welch bound, [27]

µ ≥

√N −mm(N − 1)

∼ 1√m

(N sufficently large).

In order to improve (3.14), it is necessary to take into account also the cancellations inthe Gramian A∗TAT − I, which is quite difficult for deterministic matrices. It is indeedeasier to deal with cancellations in the Gramian using probabilistic techniques. For thatreason, to overcome the ”quadratic bottleneck” (3.14), random matrices are introduced.

3.7 RIP for Gaussian and Bernoulli Random

Matrices

As said before, we want to construct matrices that satisfy the properties above. It ispossible to deterministically construct matrices of size m × N that satisfy the restricted

Page 58: Compressive Sensing and Sparse Recovery2.15) and the Hoe dings inequality for real Rademacher sums (Proposition 2.19) based on Khintchine inequalities. We also prove Rudelson’s Lemma

58

isometry property of order k, but such constructions require m to be relatively large.These limitations can be overcome by randomizing the matrix construction, for examplechoosing the entries according to a Gaussian or Bernoulli distribution.

Gaussian, Bernoulli or in general subgaussian random matrices allow to obtain optimalestimate for the restricted isometry constants in terms of the number m of measurementmatrices. Let recall some definitions.

Definition 3.19 A Gaussian random matrix is a matrix whose entries are chosen asi.i.d. Gaussian random variables with expectation 0 and variance 1

m.

Definition 3.20 A Bernoulli random matrix is a matrix whose entries are independentrealizations of ± 1

mBernoulli random variables, that is, each entries takes the value 1

mor

− 1m

with probability 12.

We recall that a random variable X is called subgaussian if there exist constants α, β > 0,such that, for all t > 0,

P[|X| > t] ≤ βe−αt2

.

It can be shown that if X has means zero, the above condition is equivalent to

E[eλX ] ≤ ecλ2

(3.15)

for all λ ∈ R and for some constant c.

Definition 3.21 A subgaussian random matrix is a matrix whose entries are independentmean-zero subgaussian random variables with the same constant c in (3.15).

In general, let X be a real random variable. Then we can define a random matrix

A = A(ω) ∈Mm,N(R), ω ∈ Ω,

as the matrix whose entries are independent realizations of X, in the probability space(Ω,Σ,P). We also assume that for any x ∈ RN

E[||Ax||22] = ||x||22. (3.16)

Observe that equation (3.16) is equivalent to the condition

E[X] = 0, V[X] = E[X2] =1

m.

Indeed,

E[‖Ax‖22] = E

[ m∑k=1

|N∑j=1

ajkxj|2]

= E[ m∑k=1

(N∑j=1

ajkxj)(N∑i=1

aikxi)]

=m∑k=1

N∑i,j=1

xjxiE[ajkaik]

=m∑k=1

N∑i,j=1i 6=j

xjxiE[ajk]E[aik] +m∑k=1

N∑i=1

|xi|2E[a2ik].

Page 59: Compressive Sensing and Sparse Recovery2.15) and the Hoe dings inequality for real Rademacher sums (Proposition 2.19) based on Khintchine inequalities. We also prove Rudelson’s Lemma

59

By equation (3.16) we must have

m∑k=1

N∑i,j=1i 6=j

xjxiE[ajk]E[aik] +m∑k=1

N∑i=1

|xi|2E[a2ik] =

N∑l=1

|xl|2.

Since the entries of A are independent realizations of the random variable X, this impliesthat

mN∑

i,j=1i 6=j

xjxiE[X]2 +mN∑i=1

|xi|2E[X2] =N∑l=1

|xl|2,

which is equivalent to E[X] = 0 and E[X2] = 1m.

Let consider the concentration inequality of the form

P(∣∣||Ax||22 − ||x||22∣∣ ≥ δ||x||22) ≤ 2 e−c0δ

2m, 0 < δ < 1, (3.17)

where c0 > 0 is some constant.

Both Gaussian and Bernoulli matrices satisfy the concentration inequality (3.17). Fromnow on we will consider Gaussian random matrices, but the results below hold for sub-gaussian random variables.

Lemma 3.22 Let X ∼ N (0, σ2) then, for any λ < 1,

E[exp(λX2

2σ2

)] =

1√1− λ

.

Proof. We have that

E[exp(λX2

2σ2

)] =

∫ ∞−∞

eλx2

2σ21√2πe−

x2

2σ2 dx

=

∫ ∞−∞

e−x2

2σ2(1−λ) 1√

2πdx

=1√

1− λ

∫ ∞−∞

1√2πe−

y2

2σ2 dy

=1√

1− λ.

ut

Theorem 3.23 Let X = (X1, . . . , Xm) be a random vector where Xi are i.i.d. withXi ∼ N (0, σ2). Then

E[‖X‖22] = mσ2.

Moreover, given βmax > 1, for any α ∈ (0, 1) and for any β ∈ [1, βmax] there exists aconstant τ = τβmax ≥ 4 such that

P[‖X‖22 ≤ αmσ2] ≤ exp

(− m(1− α)2

τ

)(3.18)

P[‖X‖22 ≥ βmσ2] ≤ exp

(− m(1− β)2

τ

). (3.19)

Page 60: Compressive Sensing and Sparse Recovery2.15) and the Hoe dings inequality for real Rademacher sums (Proposition 2.19) based on Khintchine inequalities. We also prove Rudelson’s Lemma

60

Proof. Since the Xi are independent we obtain

E[‖X‖22] =

m∑i=1

E[X2i ] = σ2m.

We noe prove equation (3.19). By Markov’s inequality

P[‖X‖22 ≥ βmσ2] = P[eλ‖X‖

22 ≥ eλβmσ

2

]

≤ E[eλ‖X‖22 ]e−λβmσ

2

≤m∏i=1

E[eλX2i ]e−λβmσ

2

.

Since Xi ∼ N (0, σ2) we have from Lemma 3.22 that, if 2σ2λ < 1,

E[eλX2i ] = E[e

2σ2λX2i

2σ2 ] =1√

1− 2σ2λ.

Thus,

P[‖X‖22 ≥ βmσ2] ≤

m∏i=1

E[eλX2i ]e−λβmσ

2

≤( 1√

1− 2λσ2

)me−λβmσ

2

=( e−2λβσ2

1− 2λσ2

)m2.

By setting the derivate, with respect to λ, to zero and solving for λ we obtain that theoptimal λ is

λ =β − 1

2βσ2.

Since λ > 0, this implies β > 1. We then obtain that

P[‖X‖22 ≥ βmσ2] ≤

(βe−(β−1)

)m2 .

Similarly,

P[‖X‖22 ≤ αmσ2] ≤

(αe−(α−1)

)m2 . (3.20)

Indeed, by Marcov inequality

P[‖X‖22 ≤ αmσ2] = P[e−‖X‖

22 ≤ e−αmσ

2

]

≤ E[e−‖X‖22 ]eαmσ

2

=m∏i=1

E[e−λX2i ]eαmσ

2

.

It follows that P[‖X‖22 ≤ αmσ2] ≤

(e2λασ

2

1+2λσ2

)m2. As before, setting the derivate to zero and

solving for λ we obtain that λ = 1−α2σ2α

. Then

P[‖X‖22 ≤ αmσ2] ≤

(αe1−α)m2 .

Page 61: Compressive Sensing and Sparse Recovery2.15) and the Hoe dings inequality for real Rademacher sums (Proposition 2.19) based on Khintchine inequalities. We also prove Rudelson’s Lemma

61

Let define

τ = 2(βmax − 1)2

(βmax − 1)− log(βmax).

It holds that τ ≥ 4. Indeed, let consider the function f(x) = x2

2− x + log(x + 1). For

x > −1, the function f is increasing, in particular, since f(0) = 0, it holds that f(x) =x2

2− x+ log(x+ 1) > 0 for x > 0. It follows that

x2

x− log(x+ 1)≥ 2 for any x > 0.

Setting x = t− 1 we obtain

(t− 1)2

t− 1− log(t)≥ 2 for any t > 1.

Moreover, the function g(t) = 2(t− 1)2 − τ(t− 1) + τ log(t) is decreasing in the interval[1, τ

4] and increasing outside. Let t∗ > τ

4be the point such that g(t∗) = 0. It follows that

for any t ∈ (0, t∗], g(t) ≤ 0. Hence, setting βmax = t∗, for any γ ∈ [0, βmax] we have thebound

log(γ) ≤ (γ − 1)− 2(γ − 1)2

τ

and hence

γ ≤ exp(

(γ − 1)− 2(γ − 1)2

τ

).

Setting γ = α and using (3.20) we obtain

P[‖X‖22 ≤ αmσ2] ≤ exp

(− m(1− α)2

τ

)and similarly setting γ = β,

P[‖X‖22 ≥ βmσ2] ≤ exp

(− m(1− β)2

τ

).

ut

We now prove that Gaussian matrices satisfy the concentration inequality (3.17).

Theorem 3.24 Let A ∈ Mm,N(R) be a Gaussian matrix. Then for any δ > 0 and forany x ∈ RN ,

E[‖Ax‖22] = ‖x‖2

2

and

P[|‖Ax‖22 − ‖x‖2

2| ≥ δ‖x‖22] ≤ 2 exp

(− mδ2

τ

),

where τ = 21−log(2)

.

Page 62: Compressive Sensing and Sparse Recovery2.15) and the Hoe dings inequality for real Rademacher sums (Proposition 2.19) based on Khintchine inequalities. We also prove Rudelson’s Lemma

62

Proof. Let Ai denote the i th row of A. Observe that Yi = 〈Ai, x〉 is a Gaussian randomvariable. Indeed, since the rows Ai are independent realization of a random vector X =

(X1, . . . , XN) ∈ RN such that Xi ∼ N (0, 1m

), then Y = 〈X, x〉 ∼ N (0,‖x‖22m

). Indeed, since

Y =∑N

i=1 xiXi, and X1, . . . , XN are independent, then Y is a Gaussian random variable

with expectation E[Y ] =∑N

i=1 xiE[Xi] = 0 and variance V [Y ] =∑N

i=1 x2iV [Xi] = 1

m‖x‖2

2.

Then Y ∼ N (0,‖x‖22m

). Hence, we may apply Theorem 3.23 to Ax = Y = (Y1, . . . , Ym)with α = 1− δ and β = 1 + δ, and we obtain

P[‖Ax‖22 ≤ (1− δ)‖x‖2

2] ≤ exp(− mδ

τ

)and hence

P[‖Ax‖22 − ‖x‖2

2 ≤ −δ‖x‖22] ≤ exp

(− mδ

τ

).

Similarly we obtain

P[‖Ax‖22 − ‖x‖2

2 ≥ δ‖x‖22] ≤ exp

(− mδ

τ

),

which concludes the proof. The value of the constant τ follows from the observation thatβ = 1 + δ ∈ [1, 2], so we can set βmax = 2. Setting t = βmax−1 = 1, we have that function

t2

t−log(t+1)is increasing and hence τ ≥ 2

1−log(2). ut

Based on the concentration inequality (3.17) the following estimate on the restrictedisometry constant holds.

Theorem 3.25 Let A ∈ Mm,N(R) be a random matrix satisfying the concentration in-equality (3.17). Then there exists a constant C depending only on c0 such that therestricted isometry constant of A satisfies δk ≤ δ with probability 1− ε provided

m ≥ Cδ−2(k log(

N

k) + log(ε−1)

). (3.21)

Proof. First we observe that is enough to prove the RIP in the case ‖x‖2 = 1. Next, letT ⊂ 1, . . . , N, |T | = k and and define ΣT = x ∈ RN | xTC = 0.Fix 0 < η < 1 and let NT,η be the minimal number of closed balls Bη of radius ηneeded to cover ΣT and xi ∈ ΣT the centers of the balls, i = 1, . . . , NT,η. Given i =1, . . . , NT,η, by equation (3.17), replacing δ with δ√

2we have that with probability at

least 1− 2 exp(−c0mδ2

2) it holds that

(1− δ√2

)‖xi‖22 ≤ ‖Axi‖2

2 ≤ (1 +δ√2

)‖xi‖22. (3.22)

Then equation (3.22) holds for every i with probability at least 1−NT,η2 exp(−c0mδ2

2).

We now define D as the smallest number such that

‖Ax‖2 ≤√

1 +D‖x‖2 for all x ∈ ΣT , ‖x‖2 ≤ 1. (3.23)

Our goal is to prove that D ≤ δ. It holds that, for every x such that ‖x− xi‖ ≤ η, usingequation (3.22)

‖Ax‖2 ≤ ‖Axi‖2 +‖A(x−xi)‖2 ≤

√1 +

δ√2

+√

1 +D‖x−xi‖2 =

√1 +

δ√2

+√

1 +D η.

Page 63: Compressive Sensing and Sparse Recovery2.15) and the Hoe dings inequality for real Rademacher sums (Proposition 2.19) based on Khintchine inequalities. We also prove Rudelson’s Lemma

63

Since, by definition, D is the smallest number for which (3.23) holds, we obtain that√

1 +D ≤√

1 + δ√2

+√

1 +D η and therefore, choosing η = δ14,

√1 +D ≤

√1 + δ√

2

1− δ14

≤√

1 + δ.

Hence, since D ≤ δ,‖Ax‖2 ≤

√1 + δ.

The lower bound of the RIP follows from this, since, using equation (3.23)

‖Ax‖2 ≥ ‖Axi‖2 − ‖A(x− xi)‖2 ≥

√1− δ√

2−√

1 + δ‖x− xi‖2.

Having choose ‖x− xi‖2 ≤ η < 1 it holds

‖Ax‖2 ≥

√1− δ√

2−√

1− δ η ≥√

1− δ(1− η) ≥√

1− δ.

Therefore‖Ax‖2 ≥

√1− δ.

By Proposition 4.19,

NT,η ≤(

1 +2

η

)k≤(3

η

)k=(42

δ

)k.

We consider the above process for all possible index set T. Since there are(Nk

)≤(eNk

)kpossible choices for T, the covering number of the set of k sparse vector Σk = ∪TΣT is

Nη ≤(42eN

δk

)k.

Then we use the union bound to obtain that, with probability at least 1−2e−c0mδ2

2

(42eNδk

)k,

the RIP holds for every k sparse vector.

This implies that with probability at least 1− ε the RIP holds for every k sparse vectorprovided

ε ≥ 2e−c0mδ2

2

(42eN

δk

)kwhich is equivalent to

m ≥ Cδ−2(k log(

N

k) + log(ε−1)

).

ut

Note that the concentration inequality is invariant under unitary transformations. Indeed,suppose z ∈ CN is sparse with respect to an orthonormal basis, which is not the canonicalbasis. Then z = Ux, for a sparse x ∈ CN and for a unitary matrix U ∈MN(C). Applyingthe measurement matrix A yields

Az = AUx,

Page 64: Compressive Sensing and Sparse Recovery2.15) and the Hoe dings inequality for real Rademacher sums (Proposition 2.19) based on Khintchine inequalities. We also prove Rudelson’s Lemma

64

that means that is equivalent to work with the new measurement matrix A′ = AU andthe sparsity with respect to the canonical basis. Indeed, the matrix A′ satisfies again theconcentration inequality (3.17) once A does. Indeed, chosing x = U−1x′,

P[∣∣||AUx||22 − ||x||22∣∣ ≥ δ||x||2lN2 ] = P[

∣∣||Ax′||22 − ||U−1x′||22∣∣ ≥ δ||U−1x′||2lN2 ]

= P[∣∣||Ax′||22 − ||x′||22∣∣ ≥ δ||x′||2lN2 ] since U is unitary

≤ 2e−c0δ−2m.

Theorem 3.25 also applies to A′ = AU. This fact is sometimes called the universality ofthe Gaussian or Bernoulli random matrices. It does not matter in which basis the signalx is actually sparse.

We observe that, in Theorem 3.25, choosing ε = e−cm with c = 12C′, C ′ = Cδ−2 we obtain

that recovery by `1 minimization is successful with probability 1− e−cm provided

m ≥ Ck log(N

k).

Indeed,

m ≥ C ′(k log(

N

k) + log(ecm)

)= C ′

(k log(

N

k) +

m

2C ′)≥ C ′k log(

N

k) +

m

2,

which means

m ≥ 2C ′k log(N

k).

This leads to the following theorem.

Theorem 3.26 Let A ∈ Mm,N(R) be a random matrix satisfying the concentration in-

equality (3.17) and let δ3k <13. Then with probability at least 1 − exp(−mδ2

2C), every k

sparse vector x can be recovered by `1 minimization provided

m ≥ 2Cδ−2k log(N

k). (3.24)

The proof of the above theorem is a direct consequence of Theorem 3.25 and Corollary3.11.

Moreover, we observe that, up to the log factor the theorem provides a linear scaling ofthe number m of measurements with respect to the sparsity k. As shown in the sectionbelow, condition (3.24) cannot be further improved; in particular the log factor cannotbe removed.

3.8 Compressive Sensing and Gelfand widths

In this section we want to investigate how well any measurements matrix and any recon-struction method may perform. This leads to the study of the Gelfand widths. We willsee that Gaussian random matrices in connection with `1 minimiazion provide optimalperformance.

We will treated only the real valued case. The complex-valued case is deduced from thereal-valued case by identifying CN with R2N and by corresponding norm equivalences of`p norms.

Page 65: Compressive Sensing and Sparse Recovery2.15) and the Hoe dings inequality for real Rademacher sums (Proposition 2.19) based on Khintchine inequalities. We also prove Rudelson’s Lemma

65

We will also call the measurement matrix A ∈Mm,N(R) encoder.

The set Am,N denotes all possible encoder/decoder pairs (A,∆), where A ∈Mm,N(R) and∆ : Rm → RN is any (nonlinear) function, which represents our specific recovery method.The `1 minimization is an example where A is the measurement matrix and ∆ : Rm → RN

is defined as ∆(y) = x such x that has minimal `1 norm and Ax = y.

For 1 ≤ k ≤ N, the reconstruction errors over subsets K ⊂ RN , where RN is endowedwith a norm || · ||X , are defined as

σk(K)X := supx∈K

σk(x)X ,

Em(K,X) := inf(A,∆)∈Am,N

supx∈K||x−∆(Ax)||X .

That means, Em(K,X) is the worst reconstruction error for the best pair of encoder/decoder.The goal is to find the largest k such that

Em(K,X) ≤ C0σk(K)X .

Note that of particular interest are the unit balls, K = BNp for 0 < p ≤ 1 and X = `N2 ,

since by Lemma 3.4 the elements of BNp are well approximated by sparse vectors.

Definition 3.27 Let K be a compact set in a normed space X. The Gelfand width of Kof order m is

dm(K,X) := infY⊂X

codimY≤m

sup||x||X | x ∈ K ∩ Y ,

where the infimum is over all linear subspaces Y of X of codimension less or equal to m.

The following relationship between Em(K,X) and the Gelfand widths holds.

Proposition 3.28 Let K ⊂ RN be a closed compact set such that K = −K and K+K ⊂C0K for some constant C0. Let X = (RN , || · ||X) be a normed space. Then

dm(K,X) ≤ Em(K,X) ≤ C0dm(K,X).

We will prove the above theorem in the case K = BNp , for 0 < p ≤ 1 and X = (RN , ‖ · ‖q).

Let denote

dmp := dm(BNp , X).

Proof. Observe that given A ∈Mm,N(R), the subspace Y = kerA has codimension lessor equal to m. Coversely, to any subspace Y ∈ RN of codimension less or equal to m, canbe associated a matrix A ∈Mm,N(R), whose rows form a basis for Y ⊥.

This identification yields

dmp = infA∈Rm×N

sup||η||q | η ∈ kerA ∩BNp . (3.25)

Let (A,∆) ∈ Am,N be an encoder/decoder pair and let z = ∆(0). Denote Y = kerA.

Page 66: Compressive Sensing and Sparse Recovery2.15) and the Hoe dings inequality for real Rademacher sums (Proposition 2.19) based on Khintchine inequalities. We also prove Rudelson’s Lemma

66

By equation (3.25)

dm(K,X) = infA∈Rm×N

sup||η||q | η ∈ Y ∩BNp

≤ sup||η||q | η ∈ Y ∩BNp ≤ sup||η − z||q | η ∈ Y ∩BN

p = sup||η −∆(Aη)||q | η ∈ Y ∩BN

p since z = ∆(0) = ∆(Aη)

≤ sup||x−∆(Ax)||q | x ∈ BNp since Y ∩BN

p ⊂ BNp .

In the second line, if ||η||q ≤ ||η−z||q there is nothing to prove. Else, if ||η||q ≤ ||−η−z||qthen

||η||q ≤ || − η − z||q = ||η′ − z||q,where η′ ∈ BN

p ∩Y, since K = −K and if η ∈ Y, then −η ∈ Y, since A(−η) = −A(η) = 0.Moreover, either ||η− z||q ≥ ||η||q or || − η− z||q ≥ ||η||q. Indeed, if both inequalities werefalse,

||2η||q = ||η − z + z + η||q ≤ ||η − z||q + ||z + η||q < 2||η||q,that is a contradiction, by definition of norm.

Taking the infimum over all (A,∆) ∈ Am,N yields

dmp ≤ Em(BNp , X).

By equation (3.25) there exist Y of codimension less or equal to m and ε > 0 such that

sup||η||q | η ∈ Y ∩BNp ≤ dmp + ε

Let A be a matrix whose rows form a basis for Y ⊥. Denote the affine solution space

F(y) = x | Ax = y.

If F(y)∩BNp 6= ∅ there exist at least a solution, xy, such that y = Axy and xy ∈ BN

p . Letdefine a decoder as follows,

∆(y) =

xy if F(y) ∩BN

p 6= ∅x otherwise

where x is any element in BNp .

The following chain of inequality is then deduced

Em(BNp , X) ≤ sup

x∈BNp‖x−∆(Ax)‖q

= supx,x′∈BNp

||x− x′||q

≤ supη∈2(Y ∩BNp )

||η||q

≤ 2 supη∈Y ∩BNp

||η||q

since η := x− x′ ∈ kerA and x− x′ ∈ 2BNp , since x− x′ ∈ BN

p +BNp ⊂ 2BN

p . Taking theinfimum over Y ⊂ X, codimY ≤ m,

Em(K,X) ≤ 2dmp .

ut

Page 67: Compressive Sensing and Sparse Recovery2.15) and the Hoe dings inequality for real Rademacher sums (Proposition 2.19) based on Khintchine inequalities. We also prove Rudelson’s Lemma

67

Next theorem provides an estimate of the Gelfand widths dm(BNp , `

N2 ), [19, 14, 30].

Theorem 3.29 Let 0 < p ≤ 1. There exist universal constants Cp, Dp > 0 such that theGelfand widths dm(BN

p , `N2 ) satisfy

Cp min

1,log(2N

m)

m

1p− 1

2 ≤ dm(BNp , `

N2 ) ≤ Dp min

1,

log(2Nm

)

m

1p− 1

2.

Choosing p = 1 and combining Proposition 3.28 and Theorem 3.29 gives, for large m

C1

√log(2N

m)

m≤ Em(BN

1 , `N2 ) ≤ D1

√log(2N

m)

m, (3.26)

where C1 = C1 and D1 = C0D1.

This estimate implies a lower bound for the minimal number of required samples whichallows approximate sparse recovery using any measurement matrix and any recoverymethod.

The following corollary provides the best estimate of the minimal number of requiredmeasurements.

Corollary 3.30 Let A ∈Mm,N(R) and ∆ : Rm → RN such that

||x−∆(Ax)||2 ≤ Cσk(x)1√

k(3.27)

for all x ∈ BN1 and some constant C > 0. Then

m ≥ C ′k log(2N

m). (3.28)

Proof. Since σk(x)1 ≤ ||x||1 ≤ 1, the assumption (3.27) implies ||x − ∆(Ax)||2 ≤ C√k.

Taking the supremum over x ∈ K and then the infimum over (A,∆) ∈ Am,N ,

Em(BN1 , `

N2 ) ≤ C√

k.

Moreover, using (3.26),

C1

√log(2N

m)

m≤ Em(BN

1 , `N2 ) ≤ C√

k.

Consequently m ≥ C ′k log(2Nm

) as claimed. ut

The above corollary applies to `1 minimization and hence if δk <13

for a matrix A ∈Mm,N(C) we have m ≥ Ck log(2N

m), using Theorem 3.12.

Therefore, the recovery results for Gaussian and Bernoulli random matrices with `1 min-imization stated above, in particular Theorem 3.25, are optimal.

Page 68: Compressive Sensing and Sparse Recovery2.15) and the Hoe dings inequality for real Rademacher sums (Proposition 2.19) based on Khintchine inequalities. We also prove Rudelson’s Lemma

68

Page 69: Compressive Sensing and Sparse Recovery2.15) and the Hoe dings inequality for real Rademacher sums (Proposition 2.19) based on Khintchine inequalities. We also prove Rudelson’s Lemma

Chapter 4

Bounded Orthonormal Systems

Let D ⊂ Rd be endowed with a probability measure ν. Let φ1, ..., φN be an orthonormalsystem of complex valued functions on D, that is, for j, l ∈ 1, ..., N,∫

Dφj(t) φl(t)dν(t) = δjl. (4.1)

We assume the orthonormal system is uniformly bounded in L∞

||φj||∞ = supt∈D|φj(t)| ≤ K, for all j ∈ 1, ..., N. (4.2)

Note that the smallest value of the constant K is K = 1. Indeed,

1 =

∫D|φj(t)|2dν(t) ≤ sup

t∈D|φj(t)|2

∫Ddν(t) ≤ K2.

In the case K = 1 we will necessarily have |φj(t)| = 1 for ν−almost all t ∈ D.Indeed, let E = t ∈ D | |φj(t)| < 1 ⊂ D and suppose ν(E) > 0, then

1 =

∫D|φj(t)|2dν(t) =

∫E

+

∫D\E|φj(t)|2dν(t) < ν(E) + ν(D \ E) = 1,

which is a contradiction.

In general let Eε = t ∈ D | |φj(t)| < ε then

ν(Eε) ≤K2 − 1

K2 − ε.

Indeed, as before,

1 =

∫D|φj(t)|2dν(t) =

∫Eε

+

∫D\Eε|φj(t)|2dν(t)

< ε2ν(Eε) +K2ν(D \ Eε)= ε2ν(Eε) +K2

(ν(D)− ν(Eε)

)= (ε2 −K2)ν(Eε) +K2ν(D),

which implies ν(Eε) ≤ K2−1K2−ε .

69

Page 70: Compressive Sensing and Sparse Recovery2.15) and the Hoe dings inequality for real Rademacher sums (Proposition 2.19) based on Khintchine inequalities. We also prove Rudelson’s Lemma

70

We will consider functions of the form

f(t) = ∑_{j=1}^N x_j φ_j(t), t ∈ D,   (4.3)

with coefficients x_1, ..., x_N ∈ ℂ. Consider sampling points t_1, ..., t_m ∈ D and the sample values

y_l = f(t_l) = ∑_{j=1}^N x_j φ_j(t_l), l = 1, ..., m.

Introducing the sampling matrix A ∈ M_{m,N}(ℂ) with entries

A_{lj} = φ_j(t_l), l = 1, ..., m, j = 1, ..., N,   (4.4)

we have that

y = Ax,   (4.5)

where y is the vector of sample values and x the vector of coefficients. The aim is to reconstruct the function f, that is, its vector of coefficients x, from the vector of samples y. We are interested in the case m ≪ N.

Definition 4.1 A function f of the form (4.3) is called k-sparse if its coefficient vector x is k-sparse.

Recovering a k-sparse function from m samples thus reduces to solving (4.5) when the vector of coefficients x is k-sparse.

We have seen that Gaussian matrices achieve the optimal scaling of the minimal number of samples required for sparse recovery, but their use in applications is limited: applications often impose constraints on the measurement matrix, and assuming A to be Gaussian may not be justifiable. Moreover, Gaussian matrices carry no structure, so they are not suitable for large-scale problems. A very important class of structured random matrices that overcomes this problem is that of random partial Fourier matrices.

Definition 4.2 A random partial Fourier matrix A ∈ M_{m,N}(ℂ) is obtained from the discrete Fourier matrix F ∈ M_N(ℂ), with entries

F_{jk} = (1/√N) e^{2πijk/N},

by selecting m rows uniformly at random among all N rows.

Taking measurements of a sparse vector x ∈ ℂ^N then corresponds to observing m of the entries of its discrete Fourier transform x̂ = Fx.

It is not possible to obtain the (near) optimal condition shown for Gaussian matrices. In the case of Fourier systems we have the following theorem, [7, 24].

Theorem 4.3 Let x ∈ ℂ^N be a k-sparse vector and let A be the random partial Fourier matrix. Then x is the unique solution of the ℓ1 minimization problem (3.2) with probability at least 1 − ε provided

m ≥ C k log(N/ε).
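To make Definition 4.2 concrete, here is a minimal sketch of how such a partial Fourier measurement matrix can be assembled and applied. It is written in Python with NumPy, which is an assumption of this illustration and not part of the text; the sizes N, m, k are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
N, m, k = 64, 16, 3                       # illustrative sizes (assumed, not from the text)

# Discrete Fourier matrix F, entries F[j, l] = exp(2*pi*i*j*l/N) / sqrt(N)
idx = np.arange(N)
F = np.exp(2 * np.pi * 1j * np.outer(idx, idx) / N) / np.sqrt(N)

# Random partial Fourier matrix: m rows of F chosen uniformly at random
rows = rng.choice(N, size=m, replace=False)
A = F[rows, :]

# A k-sparse vector and its measurements y = A x
x = np.zeros(N, dtype=complex)
support = rng.choice(N, size=k, replace=False)
x[support] = rng.standard_normal(k) + 1j * rng.standard_normal(k)
y = A @ x

# The measurements coincide with m entries of the discrete Fourier transform x_hat = F x
x_hat = F @ x
print(np.allclose(y, x_hat[rows]))        # True
```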


Let us now introduce randomness. Suppose the sampling points t_1, ..., t_m are selected independently at random according to the probability measure ν. This means that for every measurable set B ⊂ D,

P[t_l ∈ B] = ν(B), l = 1, ..., m.

The matrix A in (4.4) is then a so-called structured random matrix.

Let us give some examples of bounded orthonormal systems.

• Trigonometric polynomials. Let D = [0, 1] and let ν be the Lebesgue measure on [0, 1]. For j ∈ ℤ set

φ_j(t) = e^{2πijt}, t ∈ [0, 1].

Then for all j, l ∈ ℤ,

∫_0^1 φ_l(t) \overline{φ_j(t)} dt = δ_{jl}.

Furthermore the constant in (4.2) is K = 1, since ||φ_j||_∞ = 1. For a subset Γ ⊂ ℤ with |Γ| = N, consider trigonometric polynomials of the form

f(t) = ∑_{j∈Γ} x_j φ_j(t) = ∑_{j∈Γ} x_j e^{2πijt}.

Imposing sparsity on the coefficient vector x ∈ ℂ^N leads to the notion of k-sparse trigonometric polynomials.

Choose the sampling points t_1, ..., t_m independently and uniformly at random from [0, 1]. The entries of the associated structured random matrix A are

A_{lj} = φ_j(t_l) = e^{2πijt_l}, l = 1, ..., m, j ∈ Γ.

Such a matrix A is a Fourier-type matrix.

This example extends to multivariate trigonometric polynomials on [0, 1]^d with d ∈ ℕ. The monomials

φ_j(t) = e^{2πi⟨j,t⟩}, j ∈ ℤ^d, t ∈ [0, 1]^d,

form an orthonormal system.

Instead of complex exponentials we may take the real functions

φ_0(t) = 1,
φ_{2j}(t) = √2 cos(2πjt), j ∈ ℕ_+,
φ_{2j+1}(t) = √2 sin(2πjt), j ∈ ℕ.

They also form an orthonormal system on [0, 1] with respect to the Lebesgue measure, and the constant in (4.2) is K = √2.

• Haar wavelets and noiselets. Let us describe the Haar wavelets and the noiselets; a small numerical sketch of the resulting Haar basis matrix is given after this list. The Haar scaling function on ℝ is defined as the characteristic function of the interval [0, 1),

φ(x) = 1_{[0,1)}(x) = 1 if x ∈ [0, 1), and 0 otherwise.


The Haar wavelet is then defined as

ψ(x) = φ(2x) − φ(2x − 1) = 1 if x ∈ [0, 1/2), −1 if x ∈ [1/2, 1), and 0 otherwise.

Further, denote

ψ_{j,l}(x) = 2^{j/2} ψ(2^j x − l), φ_j(x) = φ(x − j), x ∈ ℝ, j, l ∈ ℤ.

The Haar wavelet system

Ψ_n := {φ_j : j ∈ ℤ} ∪ {ψ_{j,l} : l = 0, ..., 2^j − 1, j = 0, ..., n − 1}

forms, [32], an orthonormal basis of

V_n := {f ∈ L²([0, 1]) : f constant on [2^{-n}j, 2^{-n}(j + 1)), j = 0, ..., 2^n − 1}.

Let N = 2^n for some n ∈ ℕ. Since the functions ψ_{j,l}, j ≤ n − 1, are constant on intervals of the form [2^{-n}j, 2^{-n}(j + 1)), we conclude that the vectors with entries, for t = 0, ..., N − 1,

φ_t = 2^{-n/2} φ(t/N), ψ^{(j,l)}_t = 2^{-n/2} ψ_{j,l}(t/N)

form an orthonormal basis of ℂ^N. Note that the ψ^{(j,l)} are 2^n − 1 vectors. We collect these vectors as the columns of a unitary matrix Ψ ∈ M_N(ℂ).

Next we introduce the noiselet system on [0, 1]. Let g_1 = φ = 1_{[0,1)} be the Haar scaling function and define recursively, for r ≥ 1, the complex-valued functions

g_{2r}(x) = (1 − i) g_r(2x) + (1 + i) g_r(2x − 1),
g_{2r+1}(x) = (1 + i) g_r(2x) + (1 − i) g_r(2x − 1).

The functions {2^{-n/2} g_r : r = 2^n, ..., 2^{n+1} − 1} form an orthonormal basis of V_n, as shown in [11]. Since the functions g_{N+r}, r = 0, ..., N − 1, are constant on intervals of the form [2^{-n}j, 2^{-n}(j + 1)), it follows that the vectors

g^{(r)}_t = 2^{-n} g_{N+r}(t/N)

form an orthonormal basis of ℂ^N. We collect these vectors as the columns of a unitary matrix G ∈ M_N(ℂ).

The unitary matrix U = G*Ψ ∈ M_N(ℂ) satisfies condition (4.2) with K = 1, [11].
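The following sketch (Python/NumPy; the use of code here is purely illustrative and not part of the text) assembles the discretized Haar vectors φ_t = 2^{-n/2}φ(t/N) and ψ^{(j,l)}_t = 2^{-n/2}ψ_{j,l}(t/N) as the columns of Ψ and verifies that Ψ is indeed orthonormal.

```python
import numpy as np

def haar_matrix(n):
    """Columns: discretized Haar scaling function and wavelets on {0,...,N-1}, N = 2**n."""
    N = 2 ** n
    t = np.arange(N) / N                        # sample points t/N in [0, 1)
    phi = lambda x: ((0 <= x) & (x < 1)).astype(float)
    psi = lambda x: phi(2 * x) - phi(2 * x - 1)
    cols = [np.full(N, 2.0 ** (-n / 2))]        # phi_t = 2^{-n/2} * phi(t/N)
    for j in range(n):                          # psi_{j,l}(x) = 2^{j/2} psi(2^j x - l)
        for l in range(2 ** j):
            cols.append(2.0 ** (-n / 2) * 2.0 ** (j / 2) * psi(2 ** j * t - l))
    return np.column_stack(cols)                # Psi (real-valued in this case)

Psi = haar_matrix(3)
print(np.allclose(Psi.T @ Psi, np.eye(8)))      # True: the columns are orthonormal
```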

We will study how sparse vectors can be recovered by ℓ1 minimization when the measurement matrix A is a structured random matrix associated to a bounded orthonormal system. We will consider both uniform and nonuniform recovery results. A uniform recovery result means that, once a random matrix of the form (4.4) has been drawn, with high probability every sparse signal can be recovered by ℓ1 minimization, while in a nonuniform recovery result both the matrix and the sparse vector are fixed. Note that uniform recovery implies nonuniform recovery.


4.1 Nonuniform Recovery

We start with a nonuniform recovery result that assumes that the signs of the nonzero entries of the signal x are chosen at random.

Theorem 4.4 Let S ⊂ {1, ..., N}, |S| = k, let ε = (ε_l)_{l∈S} ∈ ℂ^k be a sequence of independent Rademacher variables, and let x be a random vector with values in ℝ^N which is k-sparse with support S and sgn(x_S) = ε. Let t_1, ..., t_m be random sampling points with values in D, chosen independently and distributed according to the measure ν, with m > k. Let A ∈ M_{m,N}(ℂ) be the sampling matrix (4.4) associated to an orthonormal system that satisfies the boundedness condition (4.2) for some constant K ≥ 1, and set y = Ax. Then with probability at least 1 − ε the vector x is the unique solution of the ℓ1 minimization problem (3.2), provided

m ≥ C K² k log²(6N/ε)   (4.6)

for a suitable constant C.

This result will be proved in Section 4.3.

This result is not optimal. In 2011 Candès and Plan [5] improved it, without making any assumption on the support or on the sign pattern of x. By Corollary 3.30 their result is optimal.

Theorem 4.5 Let x be a k-sparse vector in ℝ^N. Then with probability at least 1 − 5/N − e^{−β}, x is the unique solution of the ℓ1 minimization problem, provided

m ≥ C_β μ k log(N).

More precisely, C_β can be chosen as (1 + β)C_0 for some positive constant C_0.

In order to prove Theorem 4.5 we introduce some lemmas, proved in [5]. We denote by ‖·‖ the nuclear norm, that is, the Schatten 1-norm.

Lemma 4.6 Let S be a set of cardinality k. Then for δ > 0,

P[‖A_S^*A_S − I‖ ≥ δ] ≤ 2k exp( −(m/(μk)) · δ²/(2(1 + δ/3)) ).

In particular,

P[‖A_S^*A_S − I‖ ≥ 1/2] ≤ 2/N,

provided m ≥ (56/3) μk log(N).

Note that the condition ‖A_S^*A_S − I‖ ≤ δ implies ‖(A_S^*A_S)^{-1}‖ ≤ 1/(1 − δ). This fact will be useful later.

Lemma 4.7 Let S be a set of cardinality k. For any 0 ≤ t ≤ √k,

P[max_{i∈S^C} ‖A_S^* a_i‖_2 ≥ t] ≤ N exp( −mt²/(8μk) + 1/4 ),

where a_i denotes the i-th column of A. In particular,

P[max_{i∈S^C} ‖A_S^* a_i‖_2 ≥ 1] ≤ 1/N,

provided m ≥ 8μk(2 log(N) + 1/4).


Lemma 4.8 (Inexact duality) Let x be a vector with supp(x) = S, |S| = k, and assume that

‖(A_S^*A_S)^{-1}‖ ≤ 2 and max_{i∈S^C} ‖A_S^* a_i‖_2 ≤ 1.   (4.7)

Suppose there exists v ∈ ℝ^N in the row space of A obeying

‖v_S − sgn(x_S)‖_2 ≤ 1/4 and ‖v_{S^C}‖_∞ ≤ 1/4.

Then x is the unique ℓ1 minimizer.

Lemma 4.9 Under the hypotheses of Theorem 4.5, one can find v ∈ ℝ^N obeying the conditions of Lemma 4.8 with probability at least 1 − e^{−β} − 1/N, provided m ≥ C(1 + β)μk log(N).

In order to prove Theorem 4.5 we just need to verify conditions (4.7) of Lemma 4.8. By Lemma 4.6 and the observation following it, ‖(A_S^*A_S)^{-1}‖ ≤ 2 holds with probability at least 1 − 2/N provided m ≥ (56/3) μk log(N). Moreover, by Lemma 4.7, max_{i∈S^C} ‖A_S^* a_i‖_2 ≤ 1 holds with probability at least 1 − 1/N provided m ≥ 8μk(2 log(N) + 1/4). It follows that the probability of failure of conditions (4.7) is bounded above by 3/N, provided m ≥ μk(19 log(N) + 2). Hence, with probability at least 1 − 4/N − e^{−β}, x is the unique solution of the ℓ1 minimization problem provided m ≥ C(1 + β)μk log(N).

4.2 Uniform Recovery

The main theorem concerning the uniform recovery of sparse functions in bounded orthonormal systems from random samples is the following.

Theorem 4.10 Let t_1, ..., t_m be random sampling points chosen independently and distributed according to the measure ν. Let A ∈ M_{m,N}(ℂ) be the sampling matrix (4.4) associated to an orthonormal system that satisfies the boundedness condition (4.2) for some constant K ≥ 1. Then with probability at least 1 − ε every k-sparse vector x ∈ ℂ^N is recovered by ℓ1 minimization (3.2) from the samples

y = Ax = ( ∑_{j=1}^N x_j φ_j(t_l) )_{l=1,...,m},

provided

m/log(m) ≥ C K² k log²(k) log(N),   (4.8)
m ≥ D K² k log(ε^{-1}),

where C, D > 0 are universal constants.

This result is proved in Section 4.4.

We may choose ε such that m = DK²k log(ε^{-1}). Then (4.8) implies

D log(ε^{-1}) ≥ C log²(k) log(m) log(N),


which means

ε ≤ N^{−(C/D) log(m) log²(k)}.

Then condition (4.8) implies recovery with probability at least 1 − N^{−γ log(m) log²(k)}, where γ = C/D. Recalling that k, m ≤ N, we have recovery by ℓ1 minimization with probability at least 1 − N^{−γ log³(N)}, provided

m ≥ C K² k log⁴(N).

Remark 4.11 Under the assumptions of the above theorem, the following holds for every x ∈ ℂ^N, [8]. Let noisy samples y = Ax + e be given with ||e||_2 ≤ η√m, and let x* be a solution of

min ||z||_1 subject to ||Az − y||_2 ≤ η√m.

Then

||x − x*||_2 ≤ d η + c σ_k(x)_1/√k

for some constants c, d > 0.

4.3 Proof of Theorem 4.4

We will need the following results.

Proposition 4.12 Let A = (a_1| ... |a_N) ∈ M_{m,N}(ℂ) and define à = (1/√m) A. Let S ⊂ {1, ..., N}, |S| = k. Assume that Ã_S is injective (which implies k ≤ m) and that

||Ã_S^† ã_l||_2 = ||A_S^† a_l||_2 ≤ α < 1/√2, for all l ∉ S,   (4.9)

where A_S^† = (A_S^* A_S)^{-1} A_S^* is the Moore–Penrose pseudo-inverse of A_S. Let ε = (ε_j)_{j∈S} ∈ ℂ^k be a Rademacher sequence. Then every vector x ∈ ℂ^N with support S and sgn(x_S) = ε is the unique solution of the ℓ1 minimization problem (3.2) with probability at least

1 − 2^{3/4} (N − k) exp(−α^{-2}/2).

Proof. By Hoeffding's inequality (Corollary 2.18),

P[ max_{l∉S} |⟨A_S^† a_l, sgn x_S⟩| ≥ 1 ] ≤ ∑_{l∉S} P[ |⟨A_S^† a_l, sgn x_S⟩| ≥ 1 ]
  ≤ ∑_{l∉S} P[ |⟨A_S^† a_l, sgn x_S⟩| ≥ ||A_S^† a_l||_2 α^{-1} ]   by (4.9)
  ≤ 2^{3/4} (N − k) exp(−α^{-2}/2).

Using Corollary 3.15 we conclude the proof. □


Proposition 4.13 Let A ∈ M_{m,N}(ℂ) with coherence μ and let S ⊂ {1, ..., N}, |S| = k. Assume that

||A_S^* A_S − I||_{2→2} ≤ δ for some δ ∈ (0, 1).   (4.10)

Then

||A_S^† a_l||_2 ≤ √k μ/(1 − δ), for all l ∉ S.

Proof. By the definition of A_S^† and of the operator norm,

||A_S^† a_l||_2 = ||(A_S^* A_S)^{-1} A_S^* a_l||_2 ≤ ||(A_S^* A_S)^{-1}||_{2→2} ||A_S^* a_l||_2.

Furthermore, by the Neumann series (∑_{j≥0} (1 − x)^j = 1/x),

||(A_S^* A_S)^{-1}||_{2→2} = ||∑_{j=0}^∞ (I − A_S^* A_S)^j||_{2→2} ≤ ∑_{j=0}^∞ ||I − A_S^* A_S||_{2→2}^j ≤ ∑_{j=0}^∞ δ^j = 1/(1 − δ),   by (4.10),

and

||A_S^* a_l||_2 = √( ∑_{j∈S} |⟨a_j, a_l⟩|² ) ≤ μ√k.

Combining the two estimates we complete the proof. □

The following theorem gives an estimate of ||A_S^* A_S − I||_{2→2}; the estimate of the coherence will follow as a corollary.

Theorem 4.14 Let A ∈ M_{m,N}(ℂ) be the sampling matrix (4.4) associated to an orthonormal system that satisfies the boundedness condition (4.2) for some constant K ≥ 1. Let S ⊂ {1, ..., N}, |S| = k ≥ 2, and let t_1, ..., t_m be random sampling points chosen independently according to the measure ν. Let δ ∈ (0, 1/2]. Then the normalized matrix à = (1/√m) A satisfies

||Ã_S^* Ã_S − I||_{2→2} ≤ δ

with probability at least

1 − 2^{3/4} k exp( −mδ²/(C K² k) ),

where C = 8τ² ≈ 12.4, with τ = (1 + √26)/√24 as in the proof below.

Note that the theorem also holds for δ ∈ [1/2, 1) with a larger constant C.

Proof. Denote by X_l = (φ_j(t_l))_{j∈S} ∈ ℂ^k the l-th column of A_S^*. Since the t_l are independent, the X_l are i.i.d. random vectors and

||X_l||_2 = √( ∑_{j∈S} |φ_j(t_l)|² ) ≤ K√k,   (4.11)


by condition (4.2). Furthermore,

E[X_l X_l^*]_{ij} = E[ φ_i(t_l) \overline{φ_j(t_l)} ] = ∫_D φ_j(t) \overline{φ_i(t)} dν(t) = δ_{ij},

that is,

E[X_l X_l^*] = I.

For p ≥ 2 let

E_p := E[ ||Ã_S^* Ã_S − I||_{2→2}^p ].

Using the symmetrization inequality, Lemma 2.15, we have

E_p = E[ ||(1/m) ∑_{l=1}^m (X_l X_l^* − E[X_l X_l^*])||_{2→2}^p ] ≤ (2/m)^p E E_ε[ ||∑_{l=1}^m ε_l X_l X_l^*||_{2→2}^p ],

where ε = (ε_1, ..., ε_m) is a Rademacher sequence independent of X_1, ..., X_m. We note that A_S has rank at most k. Applying Lemma 2.28,

E_p ≤ (2/m)^p E E_ε[ ||∑_{l=1}^m ε_l X_l X_l^*||_{2→2}^p ]
  ≤ (2/m)^p 2^{3/4} k p^{p/2} e^{−p/2} E[ ||A_S^*||_{2→2}^p max_{l=1,...,m} ||X_l||_2^p ]
  ≤ (2/m)^p 2^{3/4} k p^{p/2} e^{−p/2} E[ ||A_S||_{2→2}^p ] K^p k^{p/2}   by (4.11)
  ≤ (2/√m)^p 2^{3/4} k p^{p/2} e^{−p/2} K^p k^{p/2} √( E[ ||Ã_S^* Ã_S||_{2→2}^p ] )
  ≤ (2/√m)^p 2^{3/4} k p^{p/2} e^{−p/2} K^p k^{p/2} √( E[ (||Ã_S^* Ã_S − I||_{2→2} + 1)^p ] ).

In the fourth line we used Lemma 2.4 and the fact that ‖A_S‖²_{2→2} = ‖A_S^* A_S‖_{2→2} = m ‖Ã_S^* Ã_S‖_{2→2}; indeed, by Jensen's inequality,

E[ ||A_S||_{2→2}^p ] = m^{p/2} E[ ||Ã_S^* Ã_S||_{2→2}^{p/2} ] ≤ m^{p/2} √( E[ ||Ã_S^* Ã_S||_{2→2}^p ] ).

Setting

D_{p,m,k} := 2K √(k/m) 2^{3/(4p)} k^{1/p} p^{1/2} e^{−1/2},

and since

E[ (||Ã_S^* Ã_S − I||_{2→2} + 1)^p ] = ‖ ‖Ã_S^* Ã_S − I‖_{2→2} + 1 ‖_p^p ≤ ( ‖ ‖Ã_S^* Ã_S − I‖_{2→2} ‖_p + 1 )^p = ( (E[ ||Ã_S^* Ã_S − I||_{2→2}^p ])^{1/p} + 1 )^p,

we obtain

E_p^{1/p} ≤ D_{p,m,k} √( (E[ ||Ã_S^* Ã_S − I||_{2→2}^p ])^{1/p} + 1 ) = D_{p,m,k} √( E_p^{1/p} + 1 ).


Squaring the inequality, we obtain

E_p^{2/p} − D_{p,m,k}² E_p^{1/p} ≤ D_{p,m,k}².

It follows that

( E_p^{1/p} − D_{p,m,k}²/2 )² ≤ D_{p,m,k}² + D_{p,m,k}⁴/4,

which means

E_p^{1/p} ≤ √( D_{p,m,k}² + D_{p,m,k}⁴/4 ) + D_{p,m,k}²/2.

Assuming D_{p,m,k} ≤ 1/√6, this yields

E_p^{1/p} ≤ D_{p,m,k} ( √(1 + D_{p,m,k}/4) + D_{p,m,k}/2 ) ≤ D_{p,m,k} (√26 + 1)/√24 = τ D_{p,m,k},

having set τ = (1 + √26)/√24. Then

( E[ min{1/2, ||Ã_S^* Ã_S − I||_{2→2}}^p ] )^{1/p} ≤ ( min{ (1/2)^p, E[ ||Ã_S^* Ã_S − I||_{2→2}^p ] } )^{1/p}
  ≤ min{ 1/2, ( E[ ||Ã_S^* Ã_S − I||_{2→2}^p ] )^{1/p} } ≤ E_p^{1/p} ≤ τ D_{p,m,k}
  = 2τK √(k/m) 2^{3/(4p)} k^{1/p} p^{1/2} e^{−1/2}.

It follows from Proposition 2.11, with u ≥ √2, γ = 2, β = 2^{3/4} k, α = 2τK √(k/m) e^{−1/2}, that

P( ||Ã_S^* Ã_S − I||_{2→2} ≥ 2τK √(k/m) u ) ≤ 2^{3/4} k e^{−u²/2}.

Hence for 2τK √(2k/m) ≤ δ ≤ 1/2,

P[ ||Ã_S^* Ã_S − I||_{2→2} ≥ δ ] ≤ 2^{3/4} k exp( −mδ²/(8τ²K²k) ),

having set δ = 2τK √(k/m) u.

We have then proved that

P[ ||Ã_S^* Ã_S − I||_{2→2} ≥ δ ] ≤ ε

provided

m ≥ (8τ²K²k/δ²) log(2^{3/4}k/ε) = (CK²k/δ²) log(2^{3/4}k/ε),

where C = 8τ² = 8((1 + √26)/√24)² ≈ 12.4. □


Corollary 4.15 Let A ∈ M_{m,N}(ℂ) be the sampling matrix (4.4) associated to an orthonormal system that satisfies the boundedness condition (4.2) for some constant K ≥ 1. Then the coherence of the normalized matrix à = (1/√m) A satisfies

μ ≤ √( 2CK² log(2^{3/4}N²/ε) / m )

with probability at least 1 − ε, provided the right-hand side is at most 1/2, where C is the constant of Theorem 4.14.

Proof. Let S = {j, l} be a set of two elements. Then the matrix Ã_S^* Ã_S has ⟨ã_j, ã_l⟩ as a matrix entry, so for j ≠ l,

|⟨ã_j, ã_l⟩| ≤ ||Ã_S^* Ã_S − I||_{2→2}.

By Theorem 4.14 the probability that the operator norm ||Ã_S^* Ã_S − I||_{2→2} is not bounded by δ ∈ (0, 1/2] is at most

2^{3/4} · 2 exp( −mδ²/(2CK²) ),

since k = 2. Since there are (N choose 2) = N(N − 1)/2 ≤ N²/2 subsets of {1, ..., N} of cardinality 2, we have

P[μ ≥ δ] ≤ P[ max_{S⊂{1,...,N}, |S|=2} ||Ã_S^* Ã_S − I||_{2→2} ≥ δ ]
  ≤ ∑_{S⊂{1,...,N}, |S|=2} P[ ||Ã_S^* Ã_S − I||_{2→2} ≥ δ ]
  ≤ 2^{3/4} · 2 exp( −mδ²/(2CK²) ) · N²/2
  ≤ 2^{3/4} N² exp( −mδ²/(2CK²) ).

Requiring

2^{3/4} N² exp( −mδ²/(2CK²) ) ≤ ε,

we obtain that, with probability at least 1 − ε,

μ ≤ √( 2CK² log(2^{3/4}N²/ε) / m ). □
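As a numerical illustration of Corollary 4.15, the following sketch compares the empirical coherence of à = A/√m with the bound. The sketch is written in Python/NumPy, uses the trigonometric system of the first example (K = 1), and takes C ≈ 12.4 as in Theorem 4.14; all of these concrete choices are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, m, K, eps, C = 32, 2000, 1.0, 0.01, 12.4    # illustrative sizes; C as in Theorem 4.14

# Sampling matrix A[l, j] = phi_j(t_l) = exp(2*pi*i*j*t_l) with t_l uniform on [0, 1]
t = rng.random(m)
A_tilde = np.exp(2 * np.pi * 1j * np.outer(t, np.arange(N))) / np.sqrt(m)

# Coherence of the normalized matrix: largest off-diagonal entry of |A~* A~|
G = np.abs(A_tilde.conj().T @ A_tilde)
np.fill_diagonal(G, 0.0)
mu = G.max()

bound = np.sqrt(2 * C * K**2 * np.log(2**0.75 * N**2 / eps) / m)
print(mu, bound)    # with probability at least 1 - eps the coherence stays below the bound
```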

We now prove Theorem 4.4.

Proof. Let A, x and y be as in the statement of the theorem. Let 0 < δ < 1/2 and t > 0 be such that

α := √k t/(1 − δ) ≤ 1/√2.

Propositions 4.12 and 4.13 ensure that the event that x is not a solution of the ℓ1 minimization problem has probability at most

2^{3/4} (N − k) exp(−α^{-2}/2) + P[ max_{l∉S} ||Ã_S^† ã_l||_2 ≥ α ]
  ≤ 2^{3/4} (N − k) exp(−α^{-2}/2) + P[ ||Ã_S^* Ã_S − I||_{2→2} > δ ] + P[μ > t],


where μ is the coherence of Ã. By Theorem 4.14, Ã = (1/√m) A satisfies

||Ã_S^* Ã_S − I||_{2→2} ≤ δ

with probability at least 1 − 2^{3/4} k exp(−mδ²/(CK²k)). This means that

P[ ||Ã_S^* Ã_S − I||_{2→2} > δ ] ≤ ε

as soon as 2^{3/4} k exp(−mδ²/(CK²k)) ≤ ε, that is,

m ≥ (CK²k/δ²) log(2^{3/4}k/ε).   (4.12)

By Corollary 4.15, the coherence μ of à satisfies

P[μ > t] ≤ ε,

provided t ≥ √( 2CK² log(2^{3/4}N²/ε) / m ), that is,

m ≥ (2CK²/t²) log(2^{3/4}N²/ε).   (4.13)

Setting t = δ √(1/(2k)), since N > k we have that (4.13) implies (4.12).

By Proposition 4.13,

α = √k t/(1 − δ) = (1/√2) · δ/(1 − δ).

If we require that x be the unique solution of the ℓ1 minimization problem with probability at least 1 − 3ε, we have to impose

2^{3/4} (N − k) exp(−α^{-2}/2) ≤ ε,

which is implied by

2^{3/4} N exp(−α^{-2}/2) ≤ ε.

We then need α^{-2} ≥ 2 log(2^{3/4}N/ε), which is equivalent to

δ^{-2} (1 − δ)² ≥ 4 log(2^{3/4}N/ε).

Choosing

δ^{-2} = 4 log(2^{3/4}N/ε),

we have thus proved that x is the unique solution of the ℓ1 minimization problem provided

m ≥ (2CK²/t²) log(2^{3/4}N²/ε), where δ^{-2} = 4 log(2^{3/4}N/ε), t = δ √(1/(2k)),

which means

m ≥ 16 C K² k log(2^{3/4}N/ε) [ log(2^{3/4}N/ε) + log(N) ].


Since log(2^{3/4}N/ε) + log(N) ≤ 2 log(2^{3/4}N/ε), this is implied by

m ≥ 32 C K² k log²(2^{3/4}N/ε).

We conclude by applying the argument with ε/3 in place of ε, so that the three failure events together have probability at most ε. The resulting constant in (4.6) is 32C ≈ 396.8. □

4.4 Proof of Theorem 4.10

Before proving Theorem 4.10 we introduce the following results. We start with a technical lemma.

Lemma 4.16 Let x_1, ..., x_m ∈ ℂ^N be such that ||x_l||_∞ ≤ K for l ∈ {1, ..., m}, and assume k ≤ m. Then

E[ |||∑_{l=1}^m ε_l x_l x_l^*|||_k ] ≤ C_1 K √k log(100k) √(log(4N) log(10m)) √( |||∑_{l=1}^m x_l x_l^*|||_k ),   (4.14)

where C_1 = 94.81. Furthermore, for p ≥ 2,

( E[ |||∑_{l=1}^m ε_l x_l x_l^*|||_k^p ] )^{1/p} ≤ β^{1/p} C_2 √p K √k log(100k) √(log(4N) log(10m)) √( |||∑_{l=1}^m x_l x_l^*|||_k ),   (4.15)

where C_2 ≈ 82.56 and β = 6.028 is the constant in Dudley's inequality.

The norm |||·|||_k is defined in (4.16) below.

Observe that in the above lemma the vectors x_l, l = 1, ..., m, are fixed and the expectation is taken with respect to the Rademacher variables ε_l, l = 1, ..., m.

The proof of Lemma 4.16 is at the end of the section.

We now consider the following theorem.

Theorem 4.17 Assume that the random sampling points t_1, ..., t_m are chosen independently according to the measure ν. Let A ∈ M_{m,N}(ℂ) be the sampling matrix (4.4) associated to an orthonormal system that satisfies the boundedness condition (4.2) for some constant K ≥ 1. Let k < m, ε ∈ (0, 1) and δ ∈ (0, 1/2] be such that

m/log(10m) ≥ D K² δ^{-2} k log²(100k) log(4N) log(7ε^{-1}),

where the constant D can be taken equal to 229740. Then with probability at least 1 − ε the restricted isometry constant of the renormalized matrix (1/√m) A satisfies δ_k ≤ δ.

Proof. We use the characterization of the restricted isometry constant

δ_k = max_{S⊂{1,...,N}, |S|≤k} ||Ã_S^* Ã_S − I||_{2→2}.


For any complex self-adjoint matrix B = B^* ∈ M_N(ℂ) we define the norm

|||B|||_k := sup_{z∈D²_{k,N}} |⟨Bz, z⟩|,   (4.16)

where

D²_{k,N} := {z ∈ Σ_k : ||z||_2 ≤ 1}.

Then

δ_k = |||Ã^* Ã − I|||_k.

Let X_l = (φ_j(t_l))_{j=1,...,N} ∈ ℂ^N be the random column vector associated to the sampling point t_l, l = 1, ..., m; thus X_l^* is a row of A. Since E[X_l X_l^*] = I by the orthonormality condition (4.1), we can express the restricted isometry constant of à as

δ_k = ||| (1/m) ∑_{l=1}^m X_l X_l^* − I |||_k = (1/m) ||| ∑_{l=1}^m (X_l X_l^* − E[X_l X_l^*]) |||_k.

For p ≥ 2, denote

E_p := (E[δ_k^p])^{1/p} = ( E[ ||| (1/m) ∑_{l=1}^m X_l X_l^* − I |||_k^p ] )^{1/p}.

Let ε = (ε_1, ..., ε_m) be a Rademacher sequence independent of the sampling points t_l, l = 1, ..., m. Using the symmetrization inequality, we estimate the moments of δ_k:

( E[ ||| ∑_{l=1}^m (X_l X_l^* − E[X_l X_l^*]) |||_k^p ] )^{1/p} ≤ 2 ( E[ ||| ∑_{l=1}^m ε_l X_l X_l^* |||_k^p ] )^{1/p}.   (4.17)

It follows that

E_p^p = E[δ_k^p] = (1/m^p) E[ ||| ∑_{l=1}^m (X_l X_l^* − E[X_l X_l^*]) |||_k^p ] ≤ (2^p/m^p) E E_ε[ ||| ∑_{l=1}^m ε_l X_l X_l^* |||_k^p ].

By Lemma 4.16, setting

D_{N,m,k,p} := β^{1/p} C_2 √p K √k log(100k) √(log(4N) log(10m)),

we have

E_p^p ≤ (2D_{N,m,k,p})^p (1/m^{p/2}) E[ ( (1/m) ||| ∑_{l=1}^m X_l X_l^* |||_k )^{p/2} ]   by (4.15)
  ≤ (2D_{N,m,k,p}/√m)^p E[ ||| (1/m) ∑_{l=1}^m X_l X_l^* |||_k^{p/2} ]
  ≤ (2D_{N,m,k,p}/√m)^p E[ ( ||| (1/m) ∑_{l=1}^m X_l X_l^* − I |||_k + 1 )^{p/2} ]
  ≤ (2D_{N,m,k,p}/√m)^p ( E[ ( ||| (1/m) ∑_{l=1}^m X_l X_l^* − I |||_k + 1 )^p ] )^{1/2}
  ≤ (2D_{N,m,k,p}/√m)^p ( ( E[ ||| (1/m) ∑_{l=1}^m X_l X_l^* − I |||_k^p ] )^{1/p} + 1 )^{p/2},


since |||Y|||_k ≤ |||Y − I|||_k + |||I|||_k = |||Y − I|||_k + 1. In the fourth line we used Jensen's inequality and in the last line Lemma 2.6. We conclude that

E[δ_k^p]^{1/p} = E_p ≤ (2D_{N,m,k,p}/√m) √(E_p + 1).

Similarly to the proof of Theorem 4.14, setting τ = (1 + √26)/√24, this yields

E[ min{1/2, δ_k}^p ]^{1/p} ≤ min{1/2, E_p} ≤ (2D_{N,m,k,p}/√m) τ
  = 2 β^{1/p} C_2 √p K √(k/m) log(100k) √(log(4N) log(10m)) τ.

By Proposition 2.11,

P[ min{1/2, δ_k} ≥ e^{1/2} α u ] ≤ β e^{−u²/2} < 7 e^{−u²/2},

where

α = 2 C_2 K √(k/m) log(100k) √(log(4N) log(10m)) τ, β = 6.028 < 7, γ = 2.

Setting ε = 7e^{−u²/2}, that is u² = 2 log(7ε^{-1}), and arguing as in the proof of Theorem 4.14, we obtain that δ_k ≤ δ ≤ 1/2 with probability at least 1 − ε provided

m ≥ D K² k δ^{-2} log²(100k) log(4N) log(10m) log(7ε^{-1}),

where D = 2(2e^{1/2} C_2 τ)² ≈ 229 738. □
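For very small N and k the characterization δ_k = max_{|S| ≤ k} ||Ã_S^*Ã_S − I||_{2→2} used in this proof can be evaluated exactly by enumerating all supports. The sketch below does this in Python/NumPy for a tiny instance of the trigonometric system; the sizes and the use of code are assumptions made only for illustration.

```python
import numpy as np
from itertools import combinations

def restricted_isometry_constant(A_tilde, k):
    """delta_k = max over supports S with |S| = k of ||A~_S^* A~_S - I||_{2->2} (brute force)."""
    N = A_tilde.shape[1]
    delta = 0.0
    for S in combinations(range(N), k):
        AS = A_tilde[:, S]
        delta = max(delta, np.linalg.norm(AS.conj().T @ AS - np.eye(k), 2))
    return delta

rng = np.random.default_rng(2)
N, m, k = 12, 60, 2                                       # tiny sizes so enumeration is feasible
t = rng.random(m)
A = np.exp(2 * np.pi * 1j * np.outer(t, np.arange(N)))    # bounded orthonormal system, K = 1
print(restricted_isometry_constant(A / np.sqrt(m), k))
```

The maximum over |S| = k equals the maximum over |S| ≤ k, since dropping columns can only decrease the operator norm of A_S^*A_S − I.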

The next theorem, combined with Corollary 3.11, immediately implies Theorem 4.10.

Theorem 4.18 Let A be the random sampling matrix (4.4) associated to random sampling in a bounded orthonormal system obeying (4.2) with some constant K ≥ 1. Let ε ∈ (0, 1), δ ∈ (0, 1/2]. If, for suitable constants C and D,

m/log(10m) ≥ C δ^{-2} K² k log²(100k) log(4N),   (4.18)
m ≥ D δ^{-2} K² k log(ε^{-1}),

then with probability at least 1 − ε the restricted isometry constant δ_k of (1/√m) A satisfies δ_k ≤ δ.

Proof. Set E := E[δ_k]. Using (4.17) and (4.14) and proceeding as in the previous theorem, we obtain

E ≤ (2D_{N,m,k,1}/√m) √(E + 1) =: G_{N,m,k} √(E + 1),


where

G_{N,m,k} = 2 C_1 K √(k/m) log(100k) √(log(4N) log(10m)).

It follows that

E ≤ √( (1/4) G_{N,m,k}⁴ + G_{N,m,k}² ) + G_{N,m,k}²/2.

By assumption (4.18),

G_{N,m,k} = 2 C_1 K √(k/m) log(100k) √(log(4N) log(10m))
  ≤ 2 C_1 K √k log(100k) √(log(4N)) · ( δ / (√C K √k log(100k) √(log(4N))) )
  = (2 C_1/√C) δ =: σ δ,

with δ ≤ 1/2. Hence

E ≤ G_{N,m,k} ( √( (1/4) G_{N,m,k}² + 1 ) + G_{N,m,k}/2 ) ≤ σ s δ,   (4.19)

where s = σ/4 + √(σ²/16 + 1). Since

|⟨X_l, z⟩|² = z^* X_l X_l^* z = ⟨X_l X_l^* z, z⟩,   (4.20)

the quantity ⟨ ∑_{l=1}^m (X_l X_l^* − I) z, z ⟩ is real. It follows that

m δ_k = ||| ∑_{l=1}^m (X_l X_l^* − I) |||_k = sup_{z∈D²_{k,N}} | ⟨ ∑_{l=1}^m (X_l X_l^* − I) z, z ⟩ | ≤ sup_{z,w∈D²_{k,N}} Re( ⟨ ∑_{l=1}^m (X_l X_l^* − I) z, w ⟩ ).

Moreover,

Re( ⟨ ∑_{l=1}^m (X_l X_l^* − I) z, w ⟩ ) ≤ | ⟨ ∑_{l=1}^m (X_l X_l^* − I) z, w ⟩ | ≤ ‖ ∑_{l=1}^m (X_l X_l^* − I) ‖_{2→2}.

Hence, using the characterization (3.7), we obtain

m δ_k = ||| ∑_{l=1}^m (X_l X_l^* − I) |||_k = sup_{z,w∈D²_{k,N}} ∑_{l=1}^m Re( ⟨ (X_l X_l^* − I) z, w ⟩ ).

We want to apply Theorem 2.35, so we consider the countable set

∪_{S⊂{1,...,N}, |S|=k} { z ∈ ℚ^N + iℚ^N : supp z ⊂ S } ⊂ Σ_k.

Introducing f_{z,w}(X) = Re( ⟨ (XX^* − I) z, w ⟩ ), we have

m δ_k = sup_{z,w∈D²_{k,N}} ∑_{l=1}^m f_{z,w}(X_l).


Since E[X_l X_l^*] = I, it follows that E[f_{z,w}(X_l)] = 0. Indeed,

E[f_{z,w}(X_l)] = Re( ⟨ E[X_l X_l^* − I] z, w ⟩ ) = Re( ⟨ (E[X_l X_l^*] − I) z, w ⟩ ) = 0.

Furthermore,

|f_{z,w}(X_l)| ≤ | ⟨ (X_l X_l^* − I) z, w ⟩ |
  ≤ ‖z‖_2 ‖w‖_2 ||X_l^S (X_l^S)^* − I||_{2→2} ≤ ||X_l^S (X_l^S)^* − I||_{2→2}
  ≤ ||X_l^S (X_l^S)^* − I||_{1→1}   by (2.2)
  = max_{j∈S} ∑_{j'∈S} | φ_j(t_l) \overline{φ_{j'}(t_l)} − δ_{jj'} |
  = max_{j∈S} ∑_{j'∈S, j'≠j} | φ_j(t_l) \overline{φ_{j'}(t_l)} |
  ≤ k K².

Moreover, E[|f_{z,w}(X_l)|²] ≤ E[ |⟨(X_l X_l^* − I) z, w⟩|² ]. Using the fact, proved in the Appendix, that ‖uu^*‖_{2→2} = ‖u‖_2²,

E[|f_{z,w}(X_l)|²] ≤ E[ |⟨(X_l X_l^* − I) z, w⟩|² ]
  = E[ w^* (X_l X_l^* − I) z ((X_l X_l^* − I) z)^* w ]
  ≤ ‖w‖_2² E[ ‖ (X_l X_l^* − I) z ((X_l X_l^* − I) z)^* ‖_{2→2} ]
  = E[ ‖ (X_l X_l^* − I) z ‖_2² ]
  = E[ ‖X_l‖_2² |⟨X_l, z⟩|² ] − 2 E[ |⟨X_l, z⟩|² ] + 1.

Observe that, since E[X_l X_l^*] = I,

E[ |⟨X_l, z⟩|² ] = E[ ⟨X_l X_l^* z, z⟩ ] = ⟨ E[X_l X_l^*] z, z ⟩ = ⟨z, z⟩ = 1.

We have thus proved that

E[|f_{z,w}(X_l)|²] ≤ E[ |⟨X_l, z⟩|² ||X_l||_2² ] − 1 ≤ k K² − 1 ≤ k K²,

since ||X_l||_2² restricted to the support S satisfies ∑_{j∈S, |S|≤k} |φ_j(t_l)|² ≤ K² k.

We have

P[δ_k ≥ δ] = P[ ||| ∑_{l=1}^m (X_l X_l^* − I) |||_k ≥ mδ ]
  = P[ ||| ∑_{l=1}^m (X_l X_l^* − I) |||_k ≥ E[ ||| ∑_{l=1}^m (X_l X_l^* − I) |||_k ] + t ],

where t = mδ − E[ ||| ∑_{l=1}^m (X_l X_l^* − I) |||_k ] > 0, which requires σs < 1.

Applying Theorem 2.35, with v = mkK² + 2 E[ ||| ∑_{l=1}^m (X_l X_l^* − I) |||_k ],

P[δ_k ≥ δ] ≤ exp( −t²/(2v + (2/3)t) )
  ≤ exp( −t²/( 2mkK² + 4 E[ ||| ∑_{l=1}^m (X_l X_l^* − I) |||_k ] + (2/3)t ) )
  = exp( −t²/( 2mkK² + (10/3) E[ ||| ∑_{l=1}^m (X_l X_l^* − I) |||_k ] + (2/3)mδ ) ).

Using the fact that E[ ||| ∑_{l=1}^m (X_l X_l^* − I) |||_k ] ≤ m σ s δ, we obtain

P[δ_k ≥ δ] ≤ exp( −mδ²(1 − σs)² / ( 2kK² + (10/3)σsδ + (2/3)δ ) )
  ≤ exp( −mδ²(1 − σs)² / ( 2kK² (1 + ((5/3)σs + 1/3) δ/(kK²)) ) )
  ≤ exp( −mδ²(1 − σs)² / ( 2kK² (4/3 + (5/3)σs) ) )   since δ/(kK²) < 1.

Then P[δ_k ≥ δ] ≤ ε provided

m ≥ (kK²/δ²) log(ε^{-1}) · (2/3) (4 + 5σs)/(1 − σs)².

In conclusion we have

m ≥ D kK²/δ² log(ε^{-1}),

where

D = (2/3) (4 + 5σs)/(1 − σs)², σ = 2C_1/√C,

s = σ/4 + √(σ²/16 + 1), and σs < 1. Table 4.1 below shows the relation between the constants C and D.

Table 4.1: Relation between the constants C and D depending on σ.

σ      s       σs      C             D
0.05   1.0126  0.0506  14382297.76   3.1459
0.10   1.0253  0.1025  3595574.44    3.7351
0.15   1.0382  0.1557  1598033.08    4.4694
0.20   1.0512  0.2102  898893.61     5.3992
0.25   1.0644  0.2661  575291.91     6.5982
0.30   1.0778  0.3233  399508.27     8.1781
0.35   1.0913  0.3819  293516.28     10.3146
0.40   1.1049  0.4419  224723.40     13.2960
0.45   1.1188  0.5035  177559.23     17.6228
0.50   1.1328  0.5664  143822.98     24.2246
0.55   1.1469  0.6308  118861.96     34.9892
0.60   1.1612  0.6967  99877.068     54.2385
0.65   1.1756  0.7641  85102.35      93.7324
0.70   1.1902  0.8331  73379.07      195.5178
0.75   1.2049  0.9037  63921.32      612.3085
0.80   1.2198  0.9758  56180.85      10143.8247
0.85   1.2348  1.0496  49765.74      2505.6204

□


In order to choose σ, and consequently the constants C and D, we observe that the conditions

m/log(10m) ≥ C δ^{-2} K² k log²(100k) log(4N),
m ≥ D δ^{-2} K² k log(ε^{-1}),

are homogeneous in δ^{-2}K²k. Hence it is sufficient to compare log²(100k) log(4N) with log(ε^{-1}). Fix, for instance, N = 1000, k = 10 and ε = 0.01. We obtain

log²(100k) log(4N) ≈ 395.77, log(ε^{-1}) ≈ 2.30.

From these values we conclude that it is convenient to choose σ so that the constant C is small. For example, a suitable choice is σ = 0.8, which gives C ≈ 56180.85 and D ≈ 10143.8. Analogous considerations apply for N = 100 or N = 10000 and ε = 0.01.
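The rows of Table 4.1 can be regenerated directly from the formulas s = σ/4 + √(σ²/16 + 1), σ = 2C_1/√C (hence C = (2C_1/σ)²) and D = (2/3)(4 + 5σs)/(1 − σs)² with C_1 = 94.81. A short sketch (Python is assumed here merely as a convenience) follows.

```python
import math

C1 = 94.81                                   # constant from Lemma 4.16

def constants(sigma):
    s = sigma / 4 + math.sqrt(sigma**2 / 16 + 1)
    C = (2 * C1 / sigma) ** 2                # from sigma = 2*C1/sqrt(C)
    D = (2 / 3) * (4 + 5 * sigma * s) / (1 - sigma * s) ** 2   # requires sigma*s < 1
    return s, sigma * s, C, D

for sigma in (0.05, 0.5, 0.8):
    print(sigma, constants(sigma))           # reproduces the corresponding rows of Table 4.1
```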

4.4.1 Proof of Lemma 4.16

Proof. Observe that

E_p := ( E[ ||| ∑_{l=1}^m ε_l x_l x_l^* |||_k^p ] )^{1/p} = ( E[ sup_{u∈D²_{k,N}} | ∑_{l=1}^m ε_l |⟨x_l, u⟩|² |^p ] )^{1/p},

since

|⟨x_l, u⟩|² = ⟨x_l x_l^* u, u⟩.   (4.21)

This is the moment of the supremum of the Rademacher process X_u = ∑_{l=1}^m ε_l |⟨x_l, u⟩|², u ∈ D²_{k,N}, that is,

E_p = ( E[ sup_{u∈D²_{k,N}} |X_u|^p ] )^{1/p},

with associated pseudo-metric

d(u, v) = ( E[|X_u − X_v|²] )^{1/2} = √( ∑_{l=1}^m ( |⟨x_l, u⟩|² − |⟨x_l, v⟩|² )² ).

For u, v ∈ D²_{k,N} we have

d(u, v) = ( ∑_{l=1}^m ( |⟨x_l, u⟩| − |⟨x_l, v⟩| )² ( |⟨x_l, u⟩| + |⟨x_l, v⟩| )² )^{1/2}
  ≤ max_{l∈{1,...,m}} | |⟨x_l, u⟩| − |⟨x_l, v⟩| | · sup_{u,v∈D²_{k,N}} √( ∑_{l=1}^m ( |⟨x_l, u⟩| + |⟨x_l, v⟩| )² )
  ≤ 2R max_{l∈{1,...,m}} |⟨x_l, u − v⟩|,

where

R = sup_{u∈D²_{k,N}} √( ∑_{l=1}^m |⟨x_l, u⟩|² ) = √( ||| ∑_{l=1}^m x_l x_l^* |||_k ),


using (4.21). Introducing the seminorm

||u||_X := max_{l∈{1,...,m}} |⟨x_l, u⟩|, u ∈ ℂ^N,   (4.22)

we have proved that d(u, v) ≤ 2R ||u − v||_X, which means that the rescaled process X_u/(2R) satisfies

( E[ |X_u/(2R) − X_v/(2R)|² ] )^{1/2} ≤ ||u − v||_X.

From now on we consider the rescaled process. By Hoeffding's inequality (Proposition 2.19), a Rademacher process satisfies condition (2.10). By Dudley's inequality (Theorem 2.30) with t_0 = 0,

E_p ≤ 2Rβ^{1/p}√p ( C ∫_0^{∆(D²_{k,N})} √( log(N(D²_{k,N}, ||·||_X, t)) ) dt + D ∆(D²_{k,N}) ).

By the Cauchy–Schwarz inequality and the hypothesis, for u ∈ D²_{k,N},

||u||_X = max_{l∈{1,...,m}} |⟨x_l, u⟩| ≤ ||u||_1 max_{l∈{1,...,m}} ||x_l||_∞ ≤ ||u||_1 K ≤ √k K ||u||_2 ≤ √k K.   (4.23)

Therefore the diameter ∆(D²_{k,N}) satisfies

∆(D²_{k,N}) = ∆(D²_{k,N}, ||·||_X) = sup_{u,v∈D²_{k,N}} ||u − v||_X ≤ 2K√k.

We combine the inequalities of Proposition 4.19 below, which provides an estimate of the covering numbers of D²_{k,N}, to estimate Dudley's integral. For any τ > 0 we obtain

I := ∫_0^{∆(D²_{k,N})} √( log(N(D²_{k,N}, ||·||_X, t)) ) dt
  ≤ √(2k) ∫_0^τ ( √((1/2) log(eN/k)) + √( log(1 + 2K√k/t) ) ) dt + 3K√(2k) √(log(10m) log(4N)) ∫_τ^{∆(D²_{k,N})} t^{-1} dt
  ≤ √(2k) √((1/2) log(eN/k)) τ + √(2k) 2K√k ∫_0^{τ/(2K√k)} √( log(1 + u^{-1}) ) du + 3K√(2k) √(log(10m) log(4N)) ∫_τ^{2K√k} t^{-1} dt
  ≤ √(2k) τ ( √((1/2) log(eN/k)) + √( log(e(1 + 2K√k/τ)) ) ) + 3K√(2k) √(log(10m) log(4N)) log(2K√k/τ),

where in the last step we applied Lemma 4.20 below. Choosing τ = K/5 yields

I ≤ (√(2k) K/5) ( √((1/2) log(eN/k)) + √( log(e(1 + 10√k)) ) ) + 3K√(2k) √(log(10m) log(4N)) log(10√k).


Observe that

log(eN/k) ≤ log(4N/k) ≤ log(4N) ≤ log²(100k) log(10m) log(4N) · 1/(log²(100) log(10)),

and

log(e(1 + 10√k)) ≤ log( e ( √(100k)/10 + √(100k) ) ) = log( (11/10) e √(100k) ) = log((11/10)e) + (1/2) log(100k)
  ≤ log²(100k) log(10m) log(4N) ( log((11/10)e)/(log²(100) log(10) log(4)) + 1/(2 log(100) log(10) log(4)) )
  = log²(100k) log(10m) log(4N) · log(11e)/(log²(100) log(10) log(4)).

It follows that

I ≤ √k K √2 C_0 log(100k) √(log(10m) log(4N)),

where

C_0 := (1/(5√2)) √( 1/(log²(100) log(10)) ) + (1/5) √( log(11e)/(log²(100) log(10) log(4)) ) + 3/2.

We can then conclude

E_p ≤ 2Rβ^{1/p}√p ( C ∫_0^{∆(D²_{k,N})} √( log(N(D²_{k,N}, ||·||_X, t)) ) dt + D ∆(D²_{k,N}) )
  = 2Rβ^{1/p}√p ( C I + D ∆(D²_{k,N}) )
  ≤ 2Rβ^{1/p}√p ( C √2 C_0 K√k √(log(10m) log(4N)) log(100k) + 2DK√k )
  ≤ Rβ^{1/p}√p √k K √(log(10m) log(4N)) log(100k) · 2 ( √2 C C_0 + 2D/(log(100) √(log(10) log(4))) )
  = Rβ^{1/p}√p √k K C_2 log(100k) √(log(10m) log(4N)),

where C and D are the constants appearing in Dudley's inequality and

C_2 = 2 ( √2 C C_0 + 2D/(log(100) √(log(10) log(4))) ) ≈ 44.6.

For p = 1 we obtain analogously

E[ ||| ∑_{l=1}^m ε_l x_l x_l^* |||_k ] ≤ R C_1 K √k log(100k) √(log(4N) log(10m)),

where C_1 ≈ 41.85 is defined in the same way in terms of the constants of the expectation version of Dudley's inequality. □


We now prove Proposition 4.19 and Lemma 4.20.

Proposition 4.19 Let ||·||_X be the seminorm on ℝ^{2N} defined in (4.22). Then

√( log(N(D²_{k,N}, ||·||_X, t)) ) ≤ √(2k) ( √((1/2) log(eN/k)) + √( log(1 + 2K√k/t) ) ), for t > 0,
√( log(N(D²_{k,N}, ||·||_X, t)) ) ≤ 3K√(2k) t^{-1} √( log(10m) log(4N) ), for 0 < t ≤ 2K√k.

Proof. We use the notation of Lemma 4.16. In order to prove the first inequality it is sufficient to show that

N(U, ||·||_X, t) ≤ (1 + 2/t)^{2N}   (4.24)

for U a subset of the unit ball B = {x ∈ ℝ^{2N} : ||x||_X ≤ 1}. Indeed, once (4.24) is established, we observe that, since ||u||_1 ≤ √k ||u||_2,

D²_{k,N} ⊂ √k D¹_{k,N} = ∪_{|S|=k} B_1^S, B_1^S = {x ∈ ℂ^N : ||x||_1 ≤ √k, supp x ⊂ S}.

By (4.23) we have ||u||_X ≤ K ||u||_1, so that

B_1^S ⊂ K B_X^S = {x ∈ ℂ^N : ||x||_X ≤ K, supp x ⊂ S}.

After identifying ℂ^k with ℝ^{2k}, it follows from (4.24) that

N(B_1^S, ||·||_X, t) ≤ N(B_X^S, ||·||_X, t/K) ≤ (1 + 2K/t)^{2k}.

Hence, by subadditivity of the covering numbers,

N(D²_{k,N}, ||·||_X, t) ≤ ∑_{|S|=k} N(B_1^S, ||·||_X, t/√k) ≤ (eN/k)^k (1 + 2K√k/t)^{2k},

where (eN/k)^k bounds the number of subsets of {1, ..., N} of cardinality k. Indeed, by Stirling's formula, e^{-k} k^k ≤ k!, so

(N choose k) = N(N − 1)···(N − k + 1)/k! ≤ N^k/k! = k^k N^k/(k^k k!) ≤ e^k N^k/k^k = (eN/k)^k.

It follows that, for t > 0,

√( log(N(D²_{k,N}, ||·||_X, t)) ) ≤ √(2k) √( (1/2) log(eN/k) + log(1 + 2K√k/t) ) ≤ √(2k) ( √((1/2) log(eN/k)) + √( log(1 + 2K√k/t) ) ).

We now prove (4.24). Without loss of generality we may assume that ||·||_X is a norm; otherwise we consider X = ℝ^{2N}/N, where N = {x ∈ ℝ^{2N} : ||x||_X = 0} = ker ||·||_X. Let {x_1, ..., x_{N_t}} ⊂ U be a maximal t-packing of U, that is, a maximal set satisfying

d(x_i, x_j) > t for every i ≠ j.


Then the balls

B(x_l, t/2) = {x ∈ ℝ^{2N} : ||x_l − x||_X ≤ t/2}

do not intersect and they are contained in the scaled unit ball (1 + t/2)B. Comparing the Lebesgue measures (volumes), we obtain

μ( (1 + t/2)B ) ≥ μ( ∪_{l=1}^{N_t} B(x_l, t/2) ) = N_t μ( B(x_l, t/2) ).

Note that μ(B) < ∞ since ||·||_X is a norm: all norms on ℝ^{2N} are equivalent, the balls are compact and hence have finite volume, [2]. Moreover, in ℝ^{2N} we have μ(tB) = t^{2N} μ(B). Then

(t/2)^{2N} N_t μ(B) ≤ (1 + t/2)^{2N} μ(B),

that is,

N_t ≤ (1 + 2/t)^{2N}.

To conclude the proof of (4.24), we observe that the balls B(x_l, t), l = 1, ..., N_t, cover U. Indeed, if some x ∈ U were not covered, then d(x, x_l) > t for all l = 1, ..., N_t, so x could be added to the packing, contradicting its maximality.

We now prove the second inequality. Introduce the norm

||z||_1^* := ∑_{j=1}^N ( |Re(z_j)| + |Im(z_j)| ), z ∈ ℂ^N.

We claim that, for U a subset of B^N_{||·||_1^*} := {x ∈ ℂ^N : ||x||_1^* ≤ 1} and 0 < t ≤ √2 K,

√( log(N(U, ||·||_X, t)) ) ≤ 3K t^{-1} √( log(10m) log(4N) ).   (4.25)

Once (4.25) is established, the second inequality follows. Indeed, since |Re(z_j)| + |Im(z_j)| ≤ √2 |z_j| gives ||z||_1^* ≤ √2 ||z||_1 ≤ √2 √k ||z||_2, we have

D²_{k,N} ⊂ √(2k) B^N_{||·||_1^*},

and together with (4.25) we get, for 0 < t ≤ 2K√k,

√( log(N(D²_{k,N}, ||·||_X, t)) ) ≤ 3K√(2k) t^{-1} √( log(10m) log(4N) ).

Let us prove (4.25). Fix x ∈ U and let (e_j)_{j=1,...,N} be the canonical basis of ℂ^N. Define a random vector Z ∈ ℂ^N that takes the value sgn(Re(x_j)) e_j with probability |Re(x_j)| and the value i sgn(Im(x_j)) e_j with probability |Im(x_j)|, for j = 1, ..., N, and the zero vector with probability 1 − ||x||_1^*. Since ||x||_1^* ≤ 1, this is a probability distribution. Note that

E[Z] = ∑_{j=1}^N ( sgn(Re(x_j)) |Re(x_j)| + i sgn(Im(x_j)) |Im(x_j)| ) e_j = ∑_{j=1}^N ( Re(x_j) + i Im(x_j) ) e_j = x.

Let Z_1, ..., Z_M be independent copies of Z, where M is to be determined later. We want to approximate x with the M-sparse vector

z = (1/M) ∑_{j=1}^M Z_j.

By the symmetrization inequality and the definition of ||·||_X,

E[ ||z − x||_X ] = E[ || (1/M) ∑_{j=1}^M (Z_j − E[Z_j]) ||_X ] ≤ (2/M) E[ || ∑_{j=1}^M ε_j Z_j ||_X ] = (2/M) E[ max_{l=1,...,m} | ∑_{j=1}^M ε_j ⟨x_l, Z_j⟩ | ],

where ε is a Rademacher sequence independent of Z_1, ..., Z_M.

We now fix a realization of Z_1, ..., Z_M and take expectation and probability with respect to ε only. Since ||x_l||_∞ ≤ K by hypothesis,

|⟨x_l, Z_j⟩| ≤ ||x_l||_∞ ||Z_j||_1 ≤ K.

It follows that

|| (⟨x_l, Z_j⟩)_{j=1,...,M} ||_2 = √( ∑_{j=1}^M |⟨x_l, Z_j⟩|² ) ≤ K√M, l = 1, ..., m.

By Hoeffding's inequality (Proposition 2.19), we obtain

P_ε[ | ∑_{j=1}^M ε_j ⟨x_l, Z_j⟩ | ≥ K√M u ] ≤ P_ε[ | ∑_{j=1}^M ε_j ⟨x_l, Z_j⟩ | ≥ || (⟨x_l, Z_j⟩)_{j=1,...,M} ||_2 u ] ≤ 2 exp( −u²/2 ).

By Lemma 2.13, with β = 2 and X_l = (1/(K√M)) ∑_{j=1}^M ε_j ⟨x_l, Z_j⟩,

E_ε[ max_{l=1,...,m} | ∑_{j=1}^M ε_j ⟨x_l, Z_j⟩ | ] ≤ C K√M √( log(8m) ),

where C = √2 + 1/(4√2 log(8)) < 3/2. By Fubini's theorem,

E[ ||z − x||_X ] ≤ (2/M) E E_ε[ max_{l=1,...,m} | ∑_{j=1}^M ε_j ⟨x_l, Z_j⟩ | ] ≤ (2/M) C K√M √( log(8m) ) ≤ (3K/√M) √( log(8m) ).

This implies that there exists a vector of the form

z = (1/M) ∑_{j=1}^M z_j,

with z_j ∈ {±e_k, ±ie_k : k = 1, ..., N}, such that

||z − x||_X ≤ (3K/√M) √( log(8m) ).

In particular ||z − x||_X ≤ t if

t ≥ (3K/√M) √( log(8m) ).   (4.26)


This means that every x ∈ U is contained in a ball of radius t centered at a suitable z of the above form. Since each z_j takes at most 4N + 1 values, z can take at most (4N + 1)^M values; in fact it takes fewer than (4N)^M values, since if some e_j appears more than once in the sum it always appears with the same sign. Hence U is covered by at most (4N)^M balls of radius t. Condition (4.26) means

M ≥ (9K²/t²) log(8m).

In particular we can choose

M = ⌊ (9K²/t²) log(10m) ⌋.

Indeed,

M ≥ (9K²/t²) log(10m) − 1 = (9K²/t²) log(10/8) + (9K²/t²) log(8m) − 1
  ≥ (9K²/t²) log(8m) + (9/2) log(10/8) − 1 ≥ (9K²/t²) log(8m),

since t ≤ √2 K and (9/2) log(10/8) > 1. Therefore,

√( log(N(U, ||·||_X, t)) ) ≤ √( log((4N)^M) ) = √( ⌊ (9K²/t²) log(10m) ⌋ log(4N) ) ≤ 3K t^{-1} √( log(10m) log(4N) ). □
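The packing argument used above (a maximal t-packing is automatically a t-covering) can be mimicked numerically. The sketch below, in Python/NumPy, is only an illustration under assumed data: it builds a greedy maximal packing of a finite point cloud and then verifies the covering property.

```python
import numpy as np

rng = np.random.default_rng(3)
points = rng.random((500, 2))                      # finite stand-in for the set U
t = 0.15

# Greedy maximal t-packing: keep a point only if it is > t away from all kept centers
centers = []
for p in points:
    if all(np.linalg.norm(p - c) > t for c in centers):
        centers.append(p)
centers = np.array(centers)

# Maximality implies the centers form a t-covering of the point cloud
dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
print(np.all(dists.min(axis=1) <= t))              # True
```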

Lemma 4.20 For α > 0,

∫_0^α √( log(1 + t^{-1}) ) dt ≤ α √( log(e(1 + α^{-1})) ).

Proof. We first apply the Cauchy–Schwarz inequality to obtain

∫_0^α √( log(1 + t^{-1}) ) dt ≤ ( ∫_0^α dt )^{1/2} ( ∫_0^α log(1 + t^{-1}) dt )^{1/2} = √α ( ∫_0^α log(1 + t^{-1}) dt )^{1/2}.

A change of variable and integration by parts give

∫_0^α log(1 + t^{-1}) dt = ∫_{α^{-1}}^∞ u^{-2} log(1 + u) du
  = [ −u^{-1} log(1 + u) ]_{α^{-1}}^∞ + ∫_{α^{-1}}^∞ u^{-1} · 1/(1 + u) du
  ≤ α log(1 + α^{-1}) + ∫_{α^{-1}}^∞ u^{-2} du
  = α log(1 + α^{-1}) + α = α ( log(1 + α^{-1}) + 1 ) = α log( e(1 + α^{-1}) ). □
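A quick numerical sanity check of Lemma 4.20 (Python/NumPy sketch, assuming nothing beyond the statement of the lemma):

```python
import numpy as np

def check(alpha, n=200_000):
    t = np.linspace(1e-12, alpha, n)                   # integrand is integrable near 0
    y = np.sqrt(np.log1p(1.0 / t))
    integral = np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(t))   # trapezoidal rule
    bound = alpha * np.sqrt(np.log(np.e * (1 + 1.0 / alpha)))
    return integral, bound

for alpha in (0.1, 1.0, 10.0):
    integral, bound = check(alpha)
    print(alpha, integral <= bound)                    # True for each alpha
```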


Chapter 5

Equivalent approach

5.1 Subgradient

Let F : ℝ^n → ℝ ∪ {∞} be a convex function, that is, for any x, y ∈ ℝ^n and λ ∈ [0, 1],

F(λx + (1 − λ)y) ≤ λF(x) + (1 − λ)F(y).

If F is convex, its domain dom F = {x ∈ ℝ^n : F(x) < ∞} is a convex set. Recall that every convex function F is continuous on the interior of its domain, [22].

Definition 5.1 Let F : ℝ^n → ℝ ∪ {∞} be a convex function and let x_0 ∈ dom F. The subgradient of F at x_0 is

∂F(x_0) = { v ∈ ℝ^n : F(x) ≥ F(x_0) + ⟨v, x − x_0⟩ for all x ∈ ℝ^n }.

If F is convex and x_0 ∈ dom F, then the subgradient at x_0 is nonempty, that is, ∂F(x_0) ≠ ∅.

In the following proposition we summarize the main properties of the subgradient, [15].

Proposition 5.2 The following facts hold:

1. A function F is differentiable at x if and only if ∂F(x) = {v} is a singleton; in this case v = ∇F(x).

2. If F is defined on ℝ, that is F : ℝ → ℝ, and x_0 ∈ dom F, then F admits left and right derivatives and ∂F(x_0) = [F'_−(x_0), F'_+(x_0)].

3. Let x_0 ∈ dom F; then x_0 is a minimizer of F if and only if 0 ∈ ∂F(x_0). If F is strictly convex, the minimizer is unique.

4. Let G : ℝ^n → ℝ ∪ {∞} be another convex function. Then for a, b ≥ 0, aF + bG is convex and, for any x ∈ ℝ^n,

∂(aF + bG)(x) = a ∂F(x) + b ∂G(x).

5. Let F be convex, A ∈ M_{m,n}(ℝ) and b ∈ ℝ^m; then the function x ↦ F(Ax + b) is convex and, for any x ∈ ℝ^n,

∂( F(A· + b) )(x) = A^* ∂F(Ax + b).


We now consider the subgradient of the ℓ1 norm. For x = (x_1, ..., x_n) ∈ ℝ^n, the ℓ1 norm is defined as

‖x‖_1 = ∑_{i=1}^n |x_i|.

Then, by properties 2 and 4, the subgradient is

∂‖x‖_1 = ∑_{i=1}^n ∂|x_i| e_i, where ∂|x_i| = {1} if x_i > 0, {−1} if x_i < 0, and [−1, 1] if x_i = 0.

5.2 The LASSO estimator

The ℓ1 minimization problem

min ||x||_1 subject to Ax = y   (5.1)

is, in the real case, equivalent to the linear program

min ∑_{j=1}^{2N} v_j subject to v ≥ 0, (A | −A)v = y.   (5.2)

The solution x* of (5.1) is obtained from a solution v* of (5.2) via

x* = (I | −I)v*.
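A minimal sketch of the linear-programming reformulation (5.2), using scipy.optimize.linprog; the use of SciPy and the problem sizes are assumptions made only for this illustration, since the text itself does not prescribe a solver.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(A, y):
    """Solve min ||x||_1 s.t. Ax = y via the LP (5.2): min sum(v) s.t. v >= 0, (A|-A)v = y."""
    m, N = A.shape
    res = linprog(c=np.ones(2 * N), A_eq=np.hstack([A, -A]), b_eq=y,
                  bounds=(0, None), method="highs")
    v = res.x
    return v[:N] - v[N:]                      # x* = (I | -I) v*

rng = np.random.default_rng(4)
m, N, k = 30, 80, 4                           # illustrative sizes
A = rng.standard_normal((m, N)) / np.sqrt(m)
x = np.zeros(N)
x[rng.choice(N, size=k, replace=False)] = rng.standard_normal(k)
x_star = basis_pursuit(A, A @ x)
print(np.allclose(x_star, x, atol=1e-6))      # typically True: exact recovery of the sparse x
```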

Let us consider the ℓ1-regularized least squares functional

F_λ(x) = (1/2) ||Ax − y||_2² + λ ||x||_1, x ∈ ℝ^N, λ > 0,   (5.3)

and let x_λ be its minimizer. This regularized version of least squares is called the LASSO (Least Absolute Shrinkage and Selection Operator). By the above proposition, x_λ is a minimizer of F_λ if and only if

0 ∈ ∂F_λ(x) = A^*(Ax − y) + λ ∂||x||_1,

which is equivalent to

(A^*(Ax − y))_i = −λ if x_i > 0,
(A^*(Ax − y))_i = λ if x_i < 0,
|(A^*(Ax − y))_i| ≤ λ if x_i = 0,

for all i = 1, ..., N.
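These optimality conditions are satisfied by the fixed points of the iterative soft-thresholding algorithm (ISTA). ISTA is not discussed in the text, so the following Python/NumPy sketch is only one illustrative way to compute a minimizer of F_λ; the data and parameters are assumed.

```python
import numpy as np

def soft_threshold(v, mu):
    return np.sign(v) * np.maximum(np.abs(v) - mu, 0.0)

def ista(A, y, lam, n_iter=5000):
    """Minimize F_lambda(x) = 0.5*||Ax - y||_2^2 + lam*||x||_1 by iterative soft thresholding."""
    tau = 1.0 / np.linalg.norm(A, 2) ** 2        # step size <= 1/||A||^2 ensures convergence
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft_threshold(x - tau * A.T @ (A @ x - y), tau * lam)
    return x

rng = np.random.default_rng(5)
A = rng.standard_normal((40, 100)) / np.sqrt(40)
x0 = np.zeros(100); x0[:5] = 1.0
y = A @ x0
x_lam = ista(A, y, lam=0.01)

# At a minimizer, |(A^T(Ax - y))_i| = lam on the support and <= lam elsewhere
g = A.T @ (A @ x_lam - y)
print(np.max(np.abs(g)), 0.01)                   # the maximum should not exceed lam (numerically)
```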

The following proposition provides the connection between the solutions of (5.1) and (5.3).

Proposition 5.3 Consider a sequence (λ_n)_{n∈ℕ} with λ_n → 0 and let x_n be the (unique) minimizer of F_{λ_n}. Then

lim_{λ_n→0} x_n = x*,

where x* is the solution of (5.1).


Proof. Let x_n be the minimizer of F_{λ_n} and x* the minimizer of (5.1), which implies F_0(x*) = (1/2)||Ax* − y||_2² = 0. It follows that

F_{λ_n}(x_n) ≤ F_{λ_n}(x*),

that is,

(1/2) ||Ax_n − y||_2² + λ_n ||x_n||_1 ≤ (1/2) ||Ax* − y||_2² + λ_n ||x*||_1 = λ_n ||x*||_1.

It follows that

||x_n||_1 ≤ ||x*||_1,   (5.4)
(1/2) ||Ax_n − y||_2² ≤ λ_n ||x*||_1.   (5.5)

Consequently, by (5.4) the sequence (x_n) is bounded, hence there exists a subsequence, which we still denote by (x_n), such that lim_{λ_n→0} x_n = x_0. Moreover, by (5.5), taking the limit as λ_n → 0 gives ||Ax_0 − y||_2² ≤ 0, which means Ax_0 = y. Since x* is the solution of (5.1) and (5.4) holds, we conclude that x_0 = x*. □


Chapter 6

Appendix

6.1

We want to show that

||x_l||_2² = ||x_l x_l^*||_{2→2}.

Since

|⟨x_l, u⟩|² = ⟨x_l x_l^* u, u⟩,

taking the supremum over ||u||_2 = 1 we obtain

||x_l||_2² ≥ ||x_l x_l^*||_{2→2}.

Furthermore,

||x_l x_l^*||_{2→2} = sup_{||u||_2=1} ||x_l x_l^* u||_2 ≥ || x_l x_l^* (x_l/||x_l||_2) ||_2 = || ||x_l||_2 x_l ||_2 = ||x_l||_2².
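A one-line numerical confirmation of this identity (Python/NumPy sketch, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.standard_normal(8) + 1j * rng.standard_normal(8)
lhs = np.linalg.norm(x) ** 2                      # ||x||_2^2
rhs = np.linalg.norm(np.outer(x, x.conj()), 2)    # operator norm of the rank-one matrix x x^*
print(np.allclose(lhs, rhs))                      # True
```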

6.2

Let A ∈ ℂ^{m×M} be of rank r with columns a_1, ..., a_M. Then ∑_j ||a_j||_2² a_j a_j^* has rank at most r. Indeed, define B ∈ ℂ^{m×M} with columns ||a_j||_2 a_j, for j = 1, ..., M; then

∑_j ||a_j||_2² a_j a_j^* = BB^*.

Since rk(B) = rk(B^*), we have

rk(BB^*) ≤ min{rk(B), rk(B^*)} = rk(B).

We show that if the columns of A are linearly dependent, so are the columns of B. Indeed, if the columns of A are dependent, there exist coefficients c_j, not all vanishing, such that

0 = ∑_{j=1}^M c_j a_j = ∑_{j: a_j≠0} c_j a_j = ∑_{j: a_j≠0} (c_j/||a_j||_2) ||a_j||_2 a_j,

where the coefficients c_j/||a_j||_2 do not all vanish. We have thus proved that if A has rank r, then BB^* has rank at most r.


Bibliography

[1] Olivier Bousquet, Concentration inequalities for sub-additive functions using the entropy method, Stochastic inequalities and applications, Progr. Probab., vol. 56, Birkhäuser, Basel, 2003, pp. 213–247. MR 2073435 (2006e:60023)

[2] Haïm Brezis, Analyse fonctionnelle, Collection Mathématiques Appliquées pour la Maîtrise [Collection of Applied Mathematics for the Master's Degree], Masson, Paris, 1983, Théorie et applications [Theory and applications]. MR 697382 (85a:46001)

[3] Artur Buchholz, Operator Khintchine inequality in non-commutative probability, Math. Ann. 319 (2001), no. 1, 1–16. MR 1812816 (2001m:46142)

[4] Emmanuel J. Candès, Compressive sampling, International Congress of Mathematicians. Vol. III, Eur. Math. Soc., Zürich, 2006, pp. 1433–1452. MR 2275736 (2008e:62033)

[5] Emmanuel J. Candès and Yaniv Plan, A probabilistic and RIPless theory of compressed sensing, IEEE Trans. Inform. Theory 57 (2011), no. 11, 7235–7254. MR 2883653

[6] Emmanuel J. Candès and Justin Romberg, Quantitative robust uncertainty principles and optimally sparse decompositions, Found. Comput. Math. 6 (2006), no. 2, 227–254. MR 2228740 (2007a:94035)

[7] Emmanuel J. Candès, Justin Romberg, and Terence Tao, Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information, IEEE Trans. Inform. Theory 52 (2006), no. 2, 489–509. MR 2236170 (2007e:94020)

[8] Emmanuel J. Candès, Justin K. Romberg, and Terence Tao, Stable signal recovery from incomplete and inaccurate measurements, Comm. Pure Appl. Math. 59 (2006), no. 8, 1207–1223. MR 2230846 (2007f:94007)

[9] Emmanuel J. Candès and Terence Tao, Near-optimal signal recovery from random projections: universal encoding strategies?, IEEE Trans. Inform. Theory 52 (2006), no. 12, 5406–5425. MR 2300700 (2008c:94009)

[10] Scott Shaobing Chen, David L. Donoho, and Michael A. Saunders, Atomic decomposition by basis pursuit, SIAM J. Sci. Comput. 20 (1998), no. 1, 33–61. MR 1639094 (99h:94013)

[11] R. Coifman, F. Geshwind, and Y. Meyer, Noiselets, Appl. Comput. Harmon. Anal. 10 (2001), no. 1, 27–44. MR 1808198 (2002c:42049)

[12] Geoffrey Davis, Stéphane Mallat, and Zhifeng Zhang, Adaptive time-frequency approximations with matching pursuits, Wavelets: theory, algorithms, and applications (Taormina, 1993), Wavelet Anal. Appl., vol. 5, Academic Press, San Diego, CA, 1994, pp. 271–293. MR 1321432

[13] David L. Donoho, Compressed sensing, IEEE Trans. Inform. Theory 52 (2006), no. 4, 1289–1306. MR 2241189 (2007e:94013)

[14] ———, For most large underdetermined systems of linear equations the minimal l1-norm solution is also the sparsest solution, Comm. Pure Appl. Math. 59 (2006), no. 6, 797–829. MR 2217606 (2007a:15004)

[15] Ivar Ekeland and Thomas Turnbull, Infinite-dimensional optimization and convexity, Chicago Lectures in Mathematics, University of Chicago Press, Chicago, IL, 1983. MR 769469 (86i:49001)

[16] Massimo Fornasier, Numerical methods for sparse recovery, Theoretical foundations and numerical methods for sparse recovery, Radon Ser. Comput. Appl. Math., vol. 9, Walter de Gruyter, Berlin, 2010, pp. 93–200. MR 2731598 (2012c:65088)

[17] Massimo Fornasier and Holger Rauhut, Compressive sensing, Handbook of Mathematical Methods in Imaging, Springer, O. Scherzer Ed., 2011.

[18] Simon Foucart, A note on guaranteed sparse recovery via ℓ1-minimization, Appl. Comput. Harmon. Anal. 29 (2010), no. 1, 97–103. MR 2647014 (2011h:94017)

[19] Simon Foucart, Alain Pajor, Holger Rauhut, and Tino Ullrich, The Gelfand widths of ℓp-balls for 0 < p ≤ 1, J. Complexity 26 (2010), no. 6, 629–640. MR 2735423 (2012b:41039)

[20] V. A. Kotelnikov, On the transmission capacity of the "ether" and wire in electro-communications, Modern sampling theory, Appl. Numer. Harmon. Anal., Birkhäuser Boston, Boston, MA, 2001, Translated from the Russian by V. E. Katsnelson, pp. 27–45. MR 1865680

[21] Xin Li and Chao-Ping Chen, Inequalities for the gamma function, JIPAM. J. Inequal. Pure Appl. Math. 8 (2007), no. 1, Article 28, 3 pp. (electronic). MR 2295722 (2008b:33004)

[22] Constantin P. Niculescu and Lars-Erik Persson, Convex functions and their applications, CMS Books in Mathematics/Ouvrages de Mathématiques de la SMC, 23, Springer, New York, 2006, A contemporary approach. MR 2178902 (2006m:26001)

[23] Harry Nyquist, Certain topics in telegraph transmission theory, AIEE Trans. 47 (1928), 617–644.

[24] Holger Rauhut, Random sampling of sparse trigonometric polynomials, Appl. Comput. Harmon. Anal. 22 (2007), no. 1, 16–42. MR 2287383 (2008d:62035)

[25] ———, Compressive sensing and structured random matrices, Theoretical foundations and numerical methods for sparse recovery, Radon Ser. Comput. Appl. Math., vol. 9, Walter de Gruyter, Berlin, 2010, pp. 1–92. MR 2731597

[26] Claude E. Shannon, Communication in the presence of noise, Proc. I.R.E. 37 (1949), 10–21. MR 0028549 (10,464e)

[27] Thomas Strohmer and Robert W. Heath, Jr., Grassmannian frames with applications to coding and communication, Appl. Comput. Harmon. Anal. 14 (2003), no. 3, 257–275. MR 1984549 (2004d:42053)

[28] Michel Talagrand, The generic chaining, Springer Monographs in Mathematics, Springer-Verlag, Berlin, 2005, Upper and lower bounds of stochastic processes. MR 2133757 (2006b:60006)

[29] Robert Tibshirani, Regression shrinkage and selection via the lasso, J. Roy. Statist. Soc. Ser. B 58 (1996), no. 1, 267–288. MR 1379242 (96j:62134)

[30] Jan Vybíral, Widths of embeddings in function spaces, J. Complexity 24 (2008), no. 4, 545–570. MR 2432104 (2009f:41034)

[31] Edmund T. Whittaker, On the functions which are represented by the expansions of the interpolation theory, Proc. Royal Soc. Edinburgh 35 (1915), 181–194.

[32] P. Wojtaszczyk, A mathematical introduction to wavelets, London Mathematical Society Student Texts, vol. 37, Cambridge University Press, Cambridge, 1997. MR 1436437 (98j:42025)