skript: random matrix theory (fs2019) · chapter 1 introduction random matrix theory is the study...

112
Skript: Random Matrix Theory (FS2019) David Belius

Upload: others

Post on 28-Jul-2020

12 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

Skript: Random Matrix Theory(FS2019)

David Belius

Page 2: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

I am grateful to Leon Fröber for his help in preparing these lecturenotes.

Page 3: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

Contents

Bibliography 5

Chapter 1. Introduction 61.1. Historical comments/applications/connections 6

Chapter 2. Definitions of random matrix distributions 92.1. IID and Wigner ensembles 92.2. The Wishart ensemble 102.3. The classical groups 11

Chapter 3. Basic properties 143.1. Wigner matrices 143.2. GOE and GUE 203.3. Upper bound on the operator norm of Wigner matrices 41

Chapter 4. Wigner’s semicircle law 474.1. Statement of the theorem 474.2. The Stieltjes transform 504.3. Reduction 544.4. Proof of main estimate 60

Chapter 5. Sample-covariance matrices 755.1. Motivation 755.2. The Marchenko-Pastur law 765.3. Sample covariance matrices for correlated vectors 92

Appendix A. Analysis 100

Appendix B. Linear algebra 101

Appendix C. Wahrscheinlichkeitstheorie 106Masstheorie 106Grundlegende Definitionen 106Verteilungen 107Weak and vague convergence 109Konvergenz von Zufallsvariablen 110Ungleichungen 111

3

Page 4: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

CONTENTS 4

Grenzwertsätze 112Momenterzeugende Funktion 112

Page 5: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

Bibliography

[1] Z. D. Bai and Y. Q. Yin. Necessary and sufficient conditions for almost sure con-vergence of the largest eigenvalue of a wigner matrix. Ann. Probab., 16(4):1729–1741, 10 1988.

[2] Zhi-Dong Bai, Jack W Silverstein, et al. No eigenvalues outside the support ofthe limiting spectral distribution of large-dimensional sample covariance matri-ces. The Annals of Probability, 26(1):316–345, 1998.

[3] Jean-Philippe Bouchaud and Marc Potters. Theory of financial risk and deriva-tive pricing: from statistical physics to risk management. Cambridge universitypress, 2003.

[4] Brian Hall. Lie groups, Lie algebras, and representations: an elementary intro-duction, volume 222. Springer, 2015.

[5] Rafał Latała. Some estimates of norms of random matrices. Proceedings of theAmerican Mathematical Society, 133(5):1273–1282, 2005.

[6] Madan Lal Mehta. Random matrices, volume 142. Elsevier, 2004.[7] Terence Tao. Topics in random matrix theory, volume 132. American Mathe-

matical Soc., 2012.[8] Yong Q Yin. Limiting spectral distribution for a class of random matrices. Jour-

nal of multivariate analysis, 20(1):50–68, 1986.

[7]

5

Page 6: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

CHAPTER 1

Introduction

Random matrix theory is the study of matrices whose entries arerandom variables. For instance

(1.0.1) X =

(X1,1 X1,2

X2,1 X2,2

),

with Xi,j, 1 ≤ i, j ≤ 2, independent and standard Gaussian (say) isa random 2 × 2 matrix. This is only one of many different ways todefine interesting probability distributions of random matrices, in thenext section we will see more (the law of the entries can be different,they can be dependent, the matrix can be symmetric or Hermitian ororthogonal...).

One is typically interested in spectral properties of the random ma-trices involved, that is about properties of the eigenvalues and eigen-vectors. For instance, one could ask what the distribution of the eigen-values of the random matrix X in (1.0.1) is. Often the results willbe for large random matrices (unlike X), that is asymptotic results ofn× n matrices in the limit n→∞.

An important theme in random matrix theory is universality. Thisis the phenomenon that, in the the asymptotic limit n → ∞, manyspectral properties do not depend on the particularities of the randommatrix distribution, but mainly on the symmetry class of the randommatrix (e.g. symmetric/Hermitian/...). This is somewhat analogous tohow in the central limit theorem the Gaussian fluctuations of the sumdepend only on moment conditions, not on the particular distributionof the summands.

1.1. Historical comments/applications/connections

In this course we will build up the theory of random matrices asa stand-alone mathematical theory for its interest of its own right.Historically, random matrix theory arose because of connections tomany important applications.

• Random matrices play an important role in physics. One suchconnection is due to Wigner (1902-1995), who proposed the

6

Page 7: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

1.1. HISTORICAL COMMENTS/APPLICATIONS/CONNECTIONS 7

statistics of eigenvalues of random matrices as a toy model for“nuclear resonances” of heavy atoms1. As we will see, his nameis attached to several important random matrix distributionsand results in random matrix theory.• Random matrices play an important role in statistics (andrelated fields such as machine learning). Here is one exam-ple: We want to study the correlations between p randomquantities, such as the returns on p different stocks. LetX1, X2, . . . , Xn be vectors in Rp modeling the return on n dif-ferent trading days. If one assumes that the the vectors havethe same distribution, then they have a common unknown co-variance matrix Σ. A natural way to estimate Σ from data isthe estimator

(1.1.1) Σij =n∑k=1

(Xki − Xi

) (Xkj − Xj

),

where

(1.1.2) Xi =1

n

n∑k=1

Xki,

is the empirical mean of the i-th component of the randomvectors. The estimator Σ is a p× p matrix, and if the Xi aremodeled as random vectors, then Σ is a random matrix. Whenassessing how good an estimator Σ is of the true unknown Σthe spectral properties of the random matrix Σ come into play.• There are connections between random matrix theory andnumber theory, for instance the following. Recall that theRiemann zeta function

(1.1.3) ζ : C→ C, defined by ζ (s) =∞∑n=1

1

nsfor Re (s) > 1,

(and by analytical continuation to all of C) has infinitely manyzeros on the critical line s = 1

2+it, t ∈ R. Let sn = 1

2+itn, n ≥

1 be an enumeration of the ones with positive imaginary partordered so that t1 < t2 < . . .. It has been conjectured by Mont-gomery, and numerically verified to a high degree of accuracyby Odlyzko, that the statistical distribution of the spacingstn+1 − tn (appropriately normalized) coincides with the dis-tribution of spacings between eigenvalues of a certain randommatrix.

1For more on this see [6, Section 1.1]

Page 8: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

1.1. HISTORICAL COMMENTS/APPLICATIONS/CONNECTIONS 8

• Let HN : Rn → R be a smooth random high-dimensional func-tion, which one may think of as a random landscape. Suchfunctions appear in settings such as statistical physics or theo-retical computer science. A lot of information about the localproperties of such a function (is a certain point x a local max-imum? minimum? saddle point?) can be gained from thegradient ∇HN (x), which at each point in space is a randomvector, and the Hessian∇2HN (x), which at each point in spaceis a random matrix. The properties of a critical point (saddlepoint vs. maximum vs. minimum) is given by the spectralproperties of this random matrix.

Page 9: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

CHAPTER 2

Definitions of random matrix distributions

In this chapter we define the most commonly studied randommatrixdistribution. Random matrix distributions are often called randommatrix ensembles, ensemble being a word originating in physics.

2.1. IID and Wigner ensembles

The simplest to define is the IID random matrix ensemble, whichis the distribution of a matrix with IID random entries.

Definition 2.1.1. (IID random matrix ensemble) Let n ≥ 1 and letY be a real or complex valued random variable. LetXi,j, 1 ≤ i, j ≤ n beIID random variables with the law of X. We say that X = (Xi,j)1≤i,j≤nis an IID random matrix.

Even when the distribution of the entries is real, the eigenvalues ofsuch a random matrix will in general be complex-valued. By imposinga symmetry condition we can ensure that the eigenvalues of the randommatrix will be real. The symmetric Wigner ensemble are symmetricrandom matrices with real-valued entries.

Definition 2.1.2. (Symmetric Wigner ensemble) Let n ≥ 1 andlet Y, Z be a real-valued random variables. Let Xi.j, 1 ≤ i ≤ j ≤ n, beindependent, such that Xi,j, i < j have the distribution of Y , and Xi,i

have the distribution of Z. Finally let Xi,j = Xj,i for i > j. We thensay that X is a symmetric Wigner random matrix.

The Hermitian Wigner ensemble are Hermitian random matriceswith complex-valued off-diagonal entries.

Definition 2.1.3. (Hermitian Wigner ensemble) Let n ≥ 1 andlet Y be a complex-valued random variable and let Z be real valuesrandom variable. Let Xi.j, 1 ≤ i ≤ j ≤ n, be independent, such thatXi,j, i < j have the distribution of Y , and Xi,i have the distribution ofZ. Finally let Xi,j = Xj,i for i > j. We then say that X is a HermitianWigner random matrix.

Both the symmetric and Hermitian Wigner ensembles are randommatrices with real eigenvalues.

9

Page 10: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

2.2. THE WISHART ENSEMBLE 10

The cases where the entries are Gaussian are an important spe-cial case, where certain exact computations are possible. The case ofa symmetric matrix with independent Gaussian entries is called theGaussian Orthogonal Ensemble (GOE for short).

Definition 2.1.4. (Gaussian Orthogonal Ensemble; GOE) Let Xbe a symmetric Wigner matrix such that Xi,j, i < j are iid N (0, 1) ran-dom variables, and Xii are iid N (0, 2) random variables. Then we saythat X is a random matrix from the Gaussian Orthogonal Ensemble/isa GOE random matrix.

Warning: A GOE random matrix is in general not orthogonal.The word orthogonal comes from the fact that such matrices are in-variant with respect to orthogonal conjugation (see below).

Note the special variance of the diagonal elements. In principle onecould of course make a different choice, but this one gives the GOEcertain nice properties. If X is a IID random matrix with standardGaussian entries, then X+XT

√2

is a GOE random matrix.For the case of a Hermitian Gaussian random matrix we need to

introduce the complex Gaussian distribution.

Definition 2.1.5. (Complex Gaussian Distribution) Let A,B beindependent Gaussians with variance 1

2. Let Y = A+ iB. We then say

that A follows the standard complex Gaussian distribution.

Note that E [Y ] = 0 and E[|Y |2

]= 1. Also note that Y has a

density with respect to Lebesgue measure on C (=Lebesgue measureon R2) given by f (y) = 1

πe−|y|

2

.The case of a Hermitian matrix with independent Gaussian entries

is called the Gaussian Unitary Ensemble (GUE for short).

Definition 2.1.6. (Gaussian Unitary Ensemble; GUE) Let X bea Hermitian Wigner matrix such that Xi,j, i < j are iid standard com-plex Gaussian random variables, and Xii are iid real N (0, 1) randomvariables. Then we say that X is a random matrix from the GaussianUnitary Ensemble/is a GUE random matrix.

Warning: Similarly to the GOE, a GUE random matrix is in gen-eral not unitary. The word unitary comes from the fact that suchmatrices are invariant with respect to unitary conjugation (see below).

If X is a IID random matrix with standard complex Gaussian en-tries, then X+XT

√2

is a GOE random matrix.

2.2. The Wishart ensemble

The Wishart random matrix distribution is important in statistics.

Page 11: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

2.3. THE CLASSICAL GROUPS 11

Definition 2.2.1. (Wishart random matrix distribution) Let Xbe a p × n random matrix whose rows are independent multivariateGaussian vectors of mean zero with covariance matrix Σ. Consider thep× p matrix

(2.2.1) M = n−1XXT .

We say that M has the Wishart random matrix distribution with co-variance Σ.

Note that

(2.2.2) Mij =1

n

p∑k=1

XkiXkj,

soM is the empirical covariance matrix of the vectors X·1, X·2, . . . , X·n(when the mean is assumed to be known to be zero), i.e. a statisticalestimator of Σ.

2.3. The classical groups

Another way to define random matrices is to use the “uniform distri-bution” on all matrices of a certain type (when this can be made senseof). For this we need some abstract machinery, namely Haar measure(i.e. “uniform measure”) on compact matrix groups. These matricesdo not have independent entries. We recall the existence theorem forHaar measure.

Theorem 2.3.1. (Existence of Haar measure) Let

• (G, ·) be a group– equipped with a topology s.t. (a, b) → a · b, a → a−1 are

continuous (G is a topological group)– G is Hausdorff in this topology– G is locally compact in this topology

then there exists• a measure µ on (G,B (G)), where B (G) denotes the Borelsigma-algebra– such that µ (aS) = µ (S) for all a ∈ G and measurableS ⊂ G (left-translation invariance),

– µ (K) <∞ for all compact K ⊂ G– µ is unique up to multiplicative constant

Haar measure on the group (Rn,+) is Lebesgue measure on Rn.Let

(2.3.1) Mn (R) = X : X is n× n real matrix

Page 12: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

2.3. THE CLASSICAL GROUPS 12

denote the linear space of all n× n matrices with real entries, and let

(2.3.2) Mn (C) = X : X is n× n complex matrix

denote the linear space of all n× n matrices with complex entries.Recall that the General Linear groups over R and C

(2.3.3)GL (n,R) = X : X is n× n real inverible matrix ⊂Mn (R) ,GL (n,C) = X : X is n× n complex inverible matrix ⊂Mn (C) ,

are indeed groups with group operation matrix multiplication, identityelement I, and the inverse element operation given by matrix inversion.They are topological groups with the topology inherited from Rn×n andCn×n.

Recall that the set

(2.3.4) O (n) =X : X is n× n real matrix, XXT = I

⊂ GL (n,R) ,

of n×n orthogonal real matrices with matrix multiplication is a (topo-logical) subgroup of GL (n,R), which is furthermore Hausdorff andcompact (it is closed and bounded as a subset of Rn×n). These matri-ces have det (X) = ±1. The set(2.3.5)

SO (n) =X : X is n× n real matrix, XXT = I, det (X) = 1

⊂ O (n) ,

of orthogonal matrices with determinant 1 forms is a (topological) sub-group named the special orthogonal group.

We can define Haar measure on SO (n), and since the group iscompact we can normalize it so that it becomes a probability measure.

Definition 2.3.2. (“Special orthogonal random matrix”1) Let Xbe random matrix whose probability distribution is Haar measure Pon SO (n), normalized so that P (SO (n)) = 1. We call X a “specialorthogonal random matrix”.

The set

(2.3.6) U (n) = X : X is n× n complex matrix, XX∗ = I ,⊂ GL (n,C) ,

1I am not aware of a standard name for this random matrix ensemble.

Page 13: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

2.3. THE CLASSICAL GROUPS 13

of unitary complex matrices is a compact topological subgroup ofGL (n,C),to which the Haar existence theorem applies. It also applies to the spe-cial unitary subgroup(2.3.7)SU (n) = X : X is n× n complex matrix, XX∗ = I, det (X) = 1 .⊂ U (n) .

We can therefore make the following definition.

Definition 2.3.3. (Circular Unitary Ensemble; CUE) Let X berandom matrix whose probability distribution is Haar measure P onU (n), normalized so that P (U (n)) = 1. We say X is a random matrixfrom the Circular Unitary Ensemble.

The “circular” in the name comes from the fact that the eigenvaluesof X must necessarily (since X is unitary) lie in the unit circle of thecomplex plane; the same holds for random matrices in O (n) , SO (n).

Page 14: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

CHAPTER 3

Basic properties

In this chapter we prove some basic properties about some of therandom matrix distributions defined in the previous chapter.

3.1. Wigner matrices

We first prove that a Wigner matrix with entries that have a density(and in particular GOE and GUE) are almost surely invertible. Thismeans there is almost surely no zero eigenvalue.

Lemma 3.1.1. (Almost sure invertiblity) Let X be a random matrixfrom the symmetric or Hermitian Wigner ensembles, or an IID ran-dom matrix, where the entries have a densities with respect to Lebesguemeasure on R or C. Then

(3.1.1) P (X is degenerate) = 0.

To see the intuition behind this result, note that if A is a “d − 1-dimensional” set in Rd, then the Lebesgue measure of A is zero (think ofa sphere with two-dimensional surface in R3). Roughly speaking, thisis because the set A has d − 1 degrees of freedom in a d dimensionalspace. Considering for instance the case of a IID random matrix withreal entries, the degrees of freedom in specifying a n×n such matrix isn2, while the degrees of freedom in specifying a n×n degenerate matrixof type this is n2−1. Essentially for this reason that, the Rn2-Lebesguemeasure of the set X : X is degenerate seen as a subset of Rn2 is zero.Since under the assumptions of the Lemma P is absolutely continuouswith respect to Lebesgue measure on Rn2 this implies (3.1.1). Thisargument can be made precise (particularly elegantly in the languageof differential geometry, since the matrix groups are also manifolds1),but we take a more elementary route.

Proof (IID Ensemble case). Let the entries Xij be IID withcommon density f (x) with respect to Lebesgue measure on R. If X is

1For instance, IID real case: For each k, the set X ∈MN (R) : rank (X) = k isa submanifold of Rn2

of dimension n2−(n− k)2. Since it has positive co-dimension,

it has measure zero. So X ∈MN (R) : rank (X) < n also has measure 0.

14

Page 15: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

3.1. WIGNER MATRICES 15

degenerate, then it must hold for some k that the k-th row Xk· lies inthe span of the k − 1 previous rows, i.e.

(3.1.2) Xk· ∈ 〈X1·, X2·, . . . , Xk−1,·〉 .

Therefore

(3.1.3) P (X is degenerate) ≤n∑k=1

P (Xk· ∈ 〈X1·, X2·, . . . , Xk−1,·〉) .

Now conditioning on S = 〈X1·, X2·, . . . , Xk−1,·〉 we have

(3.1.4) P (Xk· ∈ S) = P (P (Xk· ∈ S|X1·, . . . , Xk−1,·)) .

For any k − 1-dimensional linear space V ⊂ Rk we have that the Rk-Lebesgue measure of V is zero, so

(3.1.5) P (Xk· ∈ V ) =

∫1V

n∏i=1

f (yi) dyi = 0.

Thus P (Xk· ∈ S) = 0 for all k, so (3.1.1) follows.Replacing R by C everywhere gives a proof of the result for the IID

ensemble with complex valued entries.

Proof (Symmetric/Hermitian Wigner ensemble case). Firstassume X is a symmetric Wigner random matrix. Let Xk be the k× kupper right minor of X. If X is degenerate there must be a k such thatXk−1 is invertible and Xk is degenerate. Thus(3.1.6)

P (X is degenerate) ≤n∑k=1

P(Xk is degenerate, Xk−1 is invertible

).

We now write

(3.1.7)P(Xk is degenerate, Xk−1 is invertible

).

= P(P(Xk is degenerate|Xk−1

)1Xk−1 is invertible

).

Let Y be the row vector (Xk,1, . . . , Xk,k−1) and let Z = Xk,k. Then Xk

is degenerate iff (Y, Z) lies in the span of(Xk, Y T

), i.e. iff there exist

θ such that

(3.1.8)∑i

θiXki· = Y and θ · Y = Z,

i.e. iff

(3.1.9)((Xk)−1

Y)· Y = Z.

Page 16: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

3.1. WIGNER MATRICES 16

Now

(3.1.10)P(((

Xk)−1

Y)· Y = Z

)= P

(P(Z =

((Xk)−1

Y)· Y |Xk, Y

))= 0,

since(3.1.11) P (Z = a) = 0 for any a,if Z has a density.

The above argument is in fact independent of whether Y, Z takevalues in C or R. Therefore it applies in the Hermitian Wigner case.

Remark 3.1.2. The proof easily extends to non-iid entries as longas they have a joint density. But the fact that the entries do have adensity is important. If for instance the entries of X are IID randomvariables so that P (Xij = 0) = P (Xij = 1) = 1

2, then there is a positive,

albeit exponentially small, probability that X = 0, so

(3.1.12) P (X is degenerate) ≥ P (X = 0) =

(1

2

)n2

> 0.

Not only are random matrices (with absolutely continuous entries)generically invertible, they also have no repeated eigenvalues.

Lemma 3.1.3. (No repeated eigenvalues) Let X be a random matrixfrom the symmetric or Hermitian Wigner ensembles, where the entrieshave a densities with respect to Lebesgue measure on R as appropriate.Let λ1, λ2, . . . , λn ∈ R be the eigenvalues of X, ordered so that λ1 ≤. . . .λn. Then

(3.1.13) P (∃i 6= j : λi = λj) = 0.

Remark 3.1.4. This can also be understood at an intuitive levelby counting degrees of freedom. Take the symmetric case. One way tocount the degrees of freedom is to note that in the first row one has nchoices, in the second n−1 choices (to ensure symmetricity), etc. So intotal one has n+(n− 1)+ . . .+1 = n(n+1)

2degrees of freedom. Another

way to arrive at the same number is to note that any symmetric matrixcan be written as(3.1.14) OTDO,

whereD is diagonal and O is orthogonal (by the spectral theorem; The-orem B.1). In specifying D one has n degrees of freedom. In specifyingO one has n− 1 (the first row must have norm 1) + n− 2 (the secondrow must have norm 1 and also by orthogonal to the first row, etc) +(n− 3)+(n− 4)+ . . .+1 = (n−1)n

2degrees of freedom. That is, one has

Page 17: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

3.1. WIGNER MATRICES 17

in total (n−1)n2

+n = (n+1)n2

degrees of freedom, consistent with the pre-vious calculation. But now let us consider matrices with D11 = D22.One loses one degree of freedom, so intuitively the R

(n+1)n2 -Lebesgue

measure of the set of such matrices should be zero. All matrices withrepeated eigenvalues arise in a similar way, so the set of such matricesmust also have R

(n+1)n2 -Lebesgue measure zero.

In this case we do want to make this intuition into a proof. For thiswee need the next lemma.

Lemma 3.1.5. Let m < k and let A ⊂ Rm and f : A → Rk

be locally Lipschitz (f is Lipschitz on compact subsets of A). ThenLebRk (f (A)) = 0.

Remark 3.1.6. The Lipschitz assumption is important. There existcontinuous non-Lipschitz functions f : R→ R2 that are “space filling”in the sense that f (R) = R2.

Proof. Let B1, B2, . . . be boxes of the form ×mi=1 [ai, ai + 1] suchthat ∪Bi = Rm. Consider Ai = A ∩ Bi. Since f is locally Lipschitz,f |Ai is Lipschitz for all i.

There is a constant c = c (n), such that the following holds. Forall ε > 0 and i, there is an l ≤ c(n)

εnand x1, . . . , xl ∈ Bi such that

Bi ⊂ ∪lj=1B (xj, ε). Since

(3.1.15) f (B (xj, ε) ∩ Ai) ⊂ B (f (xj) , Kε) ,

where K is the Lipschitz constant of f |Ai , it hold that

(3.1.16) f (Ai) ⊂ ∪li=1B (f (xi) , Kε) .

Thus

(3.1.17)LebRkf (Ai) ≤

∑li=1 LebRkB (f (xi) , Kε)

≤ c (k) lεk

≤ c (k) c (n) εk−m.

Note that k −m > 0, and this is true for all ε > 0. Thus in fact

(3.1.18) LebRkf (Ai) = 0,

and so

(3.1.19) LebRkf (A) =∞∑i=1

LebRkf (Ai) = 0.

With this lemma, we can prove Lemma 3.1.3.

Page 18: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

3.1. WIGNER MATRICES 18

Proof (symmetric Wigner ensemble case; Optional). Weprove that

(3.1.20) Xreal symmetric : X has a repeated eigenvalue ,

has Lebesgue measure zero, when identified with a subset of Rn(n+1)

2 .This then implies that if under P the entries of X have absolutelycontinuous distribution with respect to Lebesgue measure, then the sethas probability zero under P.

Consider the map

(3.1.21) f : O (n)× Rn−1 → Symn (R) ,

where Symn (R) is the set of all n × n symmetric real matrices, givenby

(3.1.22) f (O, λ2, . . . , λn) = OTΛO,

where Λ is diagonal and Λ11 = Λ22 = λ2 and Λii = λi, i ≥ 3. Note thatby the spectral theorem (Theorem B.1)

(3.1.23) f (O (n)× Rn−1)= Xreal symmetric : X has a repeated eigenvalue .

Also note that f is locally Lipschitz in the Euclidean norm.We will now use the Implicit Function Theorem (Theorem A.2) to

construct maps from open subsets of Rn(n+1)

2 to O (n) that “cover” allof O (n).

Consider the map F : Rn2 → Rn(n+1)

2 given by F (X) =(XXT − I

)ij,i≤j.

Note that X ∈ O (n) ⇐⇒ F (X) = 0, where we identify Mn (R) withRn2 in the natural way. Write the map as F (X) = F (x, y) where X =

(x, y) for x ∈ Rn2−n(n+1)2 (the first n2 − n(n+1)

2entries) and y ∈ R

n(n+1)2

(the last n(n+1)2

entries). The map F is obviously continuously differen-

tiable and one can check that the Jacobian(∂Fi∂yj

)i,j=1,...,

n(n+1)2

with re-

spect to y is invertible for allX = (x, y) ∈ O (n). Therefore the implicitfunction theorem implies the existence for each X ∈ O (n) of an openset x ∈ VX ⊂ Rn2−n(n+1)

2 and y ∈ UX ⊂ Rn(n+1)

2 and a continuously dif-ferentiable gX : UX → VX such that gX (UX) = VX and for X ∈ VX×UXit holds that X ∈ O (n) ⇐⇒ F

(X)

= 0 ⇐⇒ y = gX (x) (thus

given x, the first n2− n(n+1)2

of a matrix, g (x) are the unique n(n+1)2

thattogether with those in x give an orthogonal matrix in the set VX×UX).Let

(3.1.24) sX : VX → Rn2

,

Page 19: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

3.1. WIGNER MATRICES 19

be given by

(3.1.25) sX (a) = (a, gX (a)) .

The map sX is a continuously differentiable bijection between the neigh-borhood VX ⊂ Rn2−n(n+1)

2 and a the neighborhood VX ×UX ⊂ O (n) ofRn2 .

Now ∪X∈O(n)VX ×UX is an open cover of O (n), so by compactnessof O (n) it has a finite subcover VXi × UXi , i = 1, . . . ,m. We thus have

(3.1.26) ∪mi=1f (UXi × Rn−1)= Xreal symmetric : X has a repeated eigenvalue .

To show that the latter set has measure 0 it thus suffices to check thatf (UXi × Rn−1) has measure zero for each i. To this end consider

(3.1.27) h : UXi × Rn → Symn (R) ,

given by

(3.1.28) h (a, λ) = f (sXi (a) , λ) .

Since g is continuously differentiable it is locally Lipschitz, and f isalso locally Lipschitz, so h is locally Lipschitz. Therefore by lemma3.1.5 the set h (UXi ,Rn−1) = f (VXi × Rn−1) has measure 0.

Proof (Hermitian Wigner ensemble case; Optional). In thiscase we define the map

(3.1.29) f : U (n)× Rn−1 → Hermn (C) ,

where Hermn (C) denotes the set of Hermitian n× n matrices by

(3.1.30) f (U,Λ) = U∗ΛU.

This is locally Lipschitz and

(3.1.31) f (U (n)× Rn−1)= Xhermitian : X has a repeated eigenvalue .

The map F : Rn+2n(n−1)

2 (∼= Rn×Cn(n−1)

2 )→ Rn+n(n−1)(∼= Rn×Cn(n−1)

2 )given by F (X) = (XX∗ − I)ij,i≤j satisfies X ∈ U (n) ⇐⇒ F (X) =

0. Write the map as F (X) = F (x, y) where X = (x, y) for x ∈Rn+2

n(n−1)2−(n+n(n−1)) and y ∈ Rn+n(n−1). The map F is obviously con-

tinuously differentiable and one can check that the Jacobian(∂Fi∂yj

)i,j=1,...,n+n(n−1)

with respect to y is invertible for all X = (x, y) ∈ U (n). Therefore theimplicit function theorem implies the existence for each X ∈ U (n) ofan open set x ∈ VX ⊂ Rn+2

n(n−1)2−(n+n(n−1)) and y ∈ UX ⊂ Rn(n+1) and

a continuously differentiable gX : UX → VX such that g (UX) = VX and

Page 20: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

3.2. GOE AND GUE 20

for X ∈ VX × UX it holds that X ∈ U (n) ⇐⇒ F(X)

= 0 ⇐⇒ y =

gX (x). Let

(3.1.32) sX : VX → Rn+n(n−1),

be given by

(3.1.33) sX (a) = (a, gX (a)) .

The map sX is a continuously differentiable bijection between the neigh-borhood VX ⊂ Rn+2

n(n−1)2−(n+n(n−1)) and the neighborhood VX ×UX ⊂

U (n) of Rn+2n(n−1)

2 .

The open cover ∪X∈O(n)VX × UX of the compact U (n) has a finitesubcover VXi × UXi , i = 1, . . . ,m,and

(3.1.34) ∪mi=1f (UXi × Rn−1)= Xhermitian : X has a repeated eigenvalue .

Consider

(3.1.35) h : UXi × Rn → Hermn (C) ,

given by

(3.1.36) h (a, λ) = f (sXi (a) , λ) .

The map h is locally Lipschitz and therefore by lemma 3.1.5 the seth (UXi ,Rn−1) = f (VXi × Rn−1) has measure 0, so so does

(3.1.37) X hermitian : X has a repeated eigenvalue ..

3.2. GOE and GUE

The Gaussian ensembles GOE and GUE have many nice specialproperties.

First compute a formula for the density of a GOE/GUE matrix Xwith respect to the entries of the matrix.

Lemma 3.2.1. a) Let X be a GOE random matrix. The densityof the on-and-above-diagonal entries (Xij)1≤i≤j≤n of X with respect to

Lebesgue measure on Rn(n+1)

2 can be written as

(3.2.1) f (x) = ce−14Tr(x2) = ce−

14

∑ni=1 λ

2i (x),

where λ1 (x) ≤ λ2 (x) ≤ . . . ≤ λn (x) denote the eigenvalues of thematrix represented by x ∈ R

(n−1)n2 and c = 1

2(n+3)n

4

1

π(n+1)n

4

.

Page 21: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

3.2. GOE AND GUE 21

b) Let X be a GUE random matrix. The density of the on-and-above-diagonal entries (Xij) of X with respect to Rn × C

(n−1)n2 can be

written as

(3.2.2) f (x) = ce−12Tr(x2) = ce−

12

∑ni=1 λ

2i (x),

for c = 1

2n2

1

πn22

.

Remark 3.2.2. 1. Note the similarity of the GOE formula (3.2.1)and the GUE formula (3.2.2), up to the factor 1

4vs. 1

2in the exponent.

2. Especially if one writes more informally that “the density of GOEX is proportional to

(3.2.3) e−14

∑ni=1 λ

2i ”,

one may be tempted to conclude that the eigenvalues λi are indepen-dent Gaussians. This is however not a valid conclusion, since the den-sity while being function only of the eigenvalues is the density for theentries of the matrix, not for its eigenvalues.

Proof. a) The entriesXij are independent and Gaussian with vari-ance 1 off the diagonal and variance 2 on the diagonal. Therefore thejoint density of Xij is

(3.2.4) f (x) =

(∏i

1√4πe−

x2ii4

)(∏i<j

1√2πe−

x2ij2

).

This can be rewritten as

(3.2.5)

f (x) = 12n

1

πn2e−

∑ni=1

x2ii4

1

2(n−1)n

4 π(n−1)n

4

e−∑i<j

x2ij2

= 1

2n+(n−1)n

4

1

πn2 +

(n−1)n4

e−∑ni=1

x2ii4−∑i<j

x2ij2

= 1

2(n+3)n

4

1

π(n+1)n

4

e−∑ni=1

x2ii4−∑i<j

x2ij2

Now note that

(3.2.6)n∑i=1

x2ii + 2

∑i<j

x2ij =

∑i,j

x2ij = Tr

(xxT

)= Tr

(x2)

and

(3.2.7) Tr(x2)

=n∑i=1

λi (x)2 ,

so

(3.2.8) f (x) =1

2(n+3)n

4

1

π(n+1)n

4

e−14

∑ni=1 λi(x)2

Page 22: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

3.2. GOE AND GUE 22

b)The entries Xij are independent and standard complex Gaussianoff the diagonal and standard real Gaussian on the diagonal. Thereforethe joint density of Xij is

(3.2.9) f (x) =

(∏i

1√2πe−

x2ii2

)(∏i<j

1

πe−|xij |

2

).

This can be rewritten as

(3.2.10)

f (x) = 1

2n2

1

πn2e−

12

∑ni=1 x

2ii 1

π(n−1)n

2

e−∑i<j |xij |

2

= 1

2n2

1

πn2 +

(n−1)n2

1e−12

∑ni=1 x

2ii−∑i<j |xij |

2

= 1

2n2

1

πn22

e−12

∑ni=1 x

2ii−∑i<j |xij |

2

Now note that

(3.2.11)n∑i=1

x2ii + 2

∑i<j

|xij|2 =∑i,j

xijxij = Tr(xxT

)= Tr

(x2)

and

(3.2.12) Tr(x2)

=n∑i=1

λi (x)2 ,

so

(3.2.13) f (x) =1

2n2

1

πn2

2

e−12

∑ni=1 λi(x)2

Next we prove the invariance properties of the GOE and GUE en-sembles, for GOE stating that if X is GOE and O is a fixed orthogonalmatrix then OTXO is also GOE. These properties motivate the “or-thogonal” and “unitary” in their names

Lemma 3.2.3. (Orthogonal invariance of GOE/GUE) a) Let X bea GOE random matrix. Let O be a fixed orthogonal matrix. ThenY = OTXO is also a GOE random matrix.

b) Let X be a GUE random matrix. Let U be a fixed unitary matrix.Then Y = U∗XU is also a GUE random matrix.

Remark 3.2.4. One can argue via a change of variables formulafor transformations of random vectors (Lemma C.25), saying that ifa random vector X has density fX and g is a differentiable bijection,then Y = g

(X)has a density given by

(3.2.14) fY (y) = fX (x) |det (Jg−1 (x))| for x = g−1 (y) ,

Page 23: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

3.2. GOE AND GUE 23

where J is the Jacobian matrix of g−1. In the GOE case say Ywould be the on-and-above-diagonal entries of Y , X the on-and-above-diagonal entries of X, g the map Y =

(OTXO

)i≤j, whose inverse

X =(OY OT

)i≤j can be seen to have Jacobian of determinant 1 since

O is rotation, thus preserving area. Thus X and Y have the same den-sity and are both GOE. The argument below is more simple-mindedbut less notationally heavy to write in detail.

Proof. a) We check that Y is symmetric (this is immediate, sinceX is), and that (up to symmetricity) the entries are Gaussians with thecorrect variance. Since the Yij =

∑k,lO

TikXklOlj are linear combina-

tions of jointly Gaussian random variables, Yij are also jointly Gaussian.Therefore their joint law is completely determined by their mean andcovariance. The mean is trivially zero. Note that the covariance of Xis given by

(3.2.15) E [XijXi′j′ ] = δij,i′j′ + δij,j′i′ .

Turning to the covariance of Y , we have

(3.2.16)

E [YijYi′j′ ]= E

[∑klk′l′ O

TikXklOljO

Ti′k′Xk′l′Ol′j′

]=∑

klk′l′ OTikOljO

Ti′k′Ol′j′E [XklXk′l′ ]

=∑

klk′l′ OTikOljO

Ti′k′Ol′j′ (δkl,k′l′ + δkl,l′k′)

=∑

klOTikOljO

Ti′kOlj′ +

∑klO

TikOljO

Ti′lOl′k

= (O·i ·O·i′) (O·j ·O·j′) + (O·i ·O·j′) (O·j ·O·i′) .Since O is an orthogonal matrix we have that O·a ·O·b = δab , so

(3.2.17) E [YijYi′j′ ] = δij,i′j′ + δij,j′i′ ,

which proves that Yij have the same joint law as Xij, so Y is a GOErandom matrix.

b) Omitted.

An important consequence of the invariance of GOE and GUE isthat “the eigenbasis and the eigenvalues are independent”. To see theintuition behind this in the GOE case, note that by the spectral theo-rem we can write for a GOE X

(3.2.18) X = OTΛO,

for a diagonal matrix Λ and orthogonal matrix O. Take any fixedorthogonal O′, then by the invariance

(3.2.19) (OO′)T

Λ (OO′) ,

Page 24: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

3.2. GOE AND GUE 24

is also a GOE. But then we can certainly also take a random Haardistributed O ∈ SO (n), independent of O,Λ, and obtain that

(3.2.20) OTΛO,

is a GOE random matrix, where

(3.2.21) O := OO.

Note that O has the same law as O and is independent of O,Λ, sincefor any A,B

(3.2.22)

P(O ∈ A,Λ ∈ B

)= P

(P(O ∈ A,Λ ∈ B|Λ, O

))= P

(P(O ∈ A|O

)1Λ∈B

)= P

(P(OO ∈ A|O

)P (Λ ∈ B)

)= P (P (O ∈ A)P (Λ ∈ B))= P (O ∈ A)P (Λ ∈ B) ,

where we used that by the invariance property of Haar measure for anyfixed O′ ∈ O (n)

(3.2.23) P(O′O ∈ A

)= P

(O ∈ A

)implying also that

(3.2.24) P(OO ∈ A|O

)= P

(O ∈ A

)Thus

(3.2.25) OTΛOlaw=(OO)T

Λ(OO)

= OTΛO,

which suggests that the pair (O,Λ) has the same law as(O,Λ

), which

would imply that the eigenbasis matrix O and the eigenvalue matrix Λare independent, and O is Haar distributed.

The above argument is not completely rigorous, and the problem isthat the spectral decomposition (3.2.18) is not unique. Thus O,Λ arenot well-defined random variables as stated (they are not well-definedfunctions of X). First of all the eigenvalues in Λ can be written in anyorder; this is easily dealt with by defining Λ so that λ1 ≤ λ2 ≤ . . . ≤λn. Secondly, if for instance λ1 = λ2 then there are infinitely manydifferent choices of two orthonormal vectors for the eigenspace of theeigenvectors λ1, λ2. This can be dealt with by working on the eventλ1 < . . . < λn, which we know has probability 1 by Lemma 3.1.3.Lastly even on this event, each row of O can be multiplied by ±1 to

Page 25: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

3.2. GOE AND GUE 25

obtain a different matrix that fulfills (3.2.18). This can be dealt withby considering equivalence classes of O:s.

Lemma 3.2.5. (Independence of GOE eigenbasis and eigenvalues/eigenbasisHaar distributed) Let X be a GOE random matrix. Let λ1 ≤ λ2 ≤ . . . ≤λn be the eigenvalues of X. Let [V ] be the (random) set of all orthogonalmatrices such that X = V TΛV for Λ = diag (λ1, . . . , λn). Then withprobability 1, [V ] is an equivalence class of (O (n) / ∼) where A ∼ Biff A = DB for a diagonal matrix D with ±1 on the diagonal.

Finally, [O] and (λ1, . . . , λn) are independent, and [O]law=[O]where

O is distributed according to Haar measure on O (n) .

Proof. On the event λ1 < . . . < λn the orthogonal matrix V inX = V TΛV is uniquely defined up to multiplication of each row by±1. Therefore, on this event [V ] is a well-defined equivalence class of(V (n) / ∼).

Let O be the matrix O ∈ [V ] such that Oii > 0∀i (and if Oii = 0,Oij > 0 for the first j s.t. Oij 6= 0;this is uniquely defined). Then wehave

(3.2.26) X = OTΛO.

Now for any fixed orthogonal O′ we have that

(3.2.27) (OO′)T

Λ (OO′) ,

is a GOE random matrix. Therefore it also holds if O has the Haardistribution on O (n) and is independent of (X,O,Λ), that

(3.2.28)(OO)T

Λ(OO),

is a GOE random matrix. But O := OO is independent of (Λ, O) andO

law= O (as in (3.2.22)) so

(3.2.29) OTΛOlaw=(OO)T

Λ(OO)

= OTΛO,

which by the uniqueness of the diagonalizing basis up to multiplyingby ±1 implies that

(3.2.30)([O],Λ)

= ([O] ,Λ) .

The former is an independent pair by construction, so the latter pairis also independent and [V ] has the desired law.

For a GUEX the spectral decomposition isX = U∗ΛU for a unitaryU , and the non-uniqueness coming from multiplying ±1 here becomesnon-uniqueness coming from multiplying each row by eiθ for any θ.

Page 26: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

3.2. GOE AND GUE 26

Lemma 3.2.6. (Independence of GUE eigenbasis and eigenvalues/eigenbasisHaar distributed) Let X be a GUE random matrix. Let λ1 ≤ λ2 ≤ . . . ≤λn be the eigenvalues of X. Let [U ] be the (random) set of all unitarymatrices such that X = UTΛU for Λ = diag (λ1, . . . , λn). Then withprobability 1, [U ] is an equivalence class of (U (n) / ∼) where A ∼ B iffA = DB iff A = DB for a diagonal matrix D with complex numbersof magnitude one on the diagonal.

Finally, [U ] and (λ1, . . . , λn) are independent, and [U ]law=[U]where

U is distributed according to Haar measure on U (n) .

Proof. Proof. On the event λ1 < . . . < λn the unitary ma-trix U in X = UTΛU is uniquely defined up to multiplication of eachrow by eiθ for some θ ∈ R. Therefore, on this event [U ] is a well-definedequivalence class of (U (n) / ∼).

Let U be the matrix U ∈ [U ] such that Im(Uii

)= 0 and Re

(Uii

)≥

0 (or if Uii = 0 such that the same holds for Uij for the first j such thatUij 6= 0; this is uniquely defined). Then we have

(3.2.31) X = UTΛU .

Now for any fixed unitary U we have that

(3.2.32)(UU)T

Λ(UU),

is a GUE random matrix. Therefore it also holds if U has the Haardistribution on U (n) and is independent of

(X, U,Λ

), then

(3.2.33)(U U)T

Λ(U U),

is a GUE random matrix. But U is independent of(

Λ, U)and U U law

=

U so

(3.2.34) UTΛUlaw=(U U)T

Λ(U U)

law= UTΛU ,

which by the uniqueness of the diagonalizing basis up to multiplyingthe rows implies that

(3.2.35)([U],Λ)

law=([U],Λ)

= ([U ] ,Λ) .

The former is an independent pair by construction, so the latter pairis also independent and [U ] has the desired law.

Page 27: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

3.2. GOE AND GUE 27

Finally we prove the important explicit formulas of the joint den-sities of the eigenvalues of GOE/GUE (now really the density of theeigenvalues rather than the entries, cf. Lemma 3.2.1)

Theorem 3.2.7. a) The eigenvalues λ1 ≤ . . . ≤ λn of a GOErandom matrix X have a density with respect to Lebesgue measure onRn given by

(3.2.36) f (λ1, . . . , λn) = c1λ1<...<λne−

∑ni=1 λ

2i

4

∏i<j

(λj − λi) ,

for a normalizing constant c = c (n).

b) The eigenvalues λ1 ≤ . . . ≤ λn of a GUE random matrix X havea density with respect to Lebesgue measure on Rn given by

(3.2.37) f (λ1, . . . , λn) = c1λ1<...<λne−

∑ni=1 λ

2i

2

∏i<j

(λj − λi)2 ,

for a normalizing constant c = c (n).

Remark 3.2.8. 1. Again, note the similarity of the formulas (3.2.36)and (3.2.37), up to the different exponents.

2. If one defines(λ1, . . . , λn

)to be an independent random per-

mutation of the ordered eigenvalues (λ1, . . . λn), one gets the formulasthat

(λ1, . . . , λn

)has density

(3.2.38) f (λ1, . . . , λn) = ce−∑ni=1 λ

2i

4

∏i<j

(λj − λi) ,

in the GOE case and

(3.2.39) f (λ1, . . . , λn) = ce−∑ni=1 λ

2i

2

∏i<j

(λj − λi)2 ,

in the GUE case.3. Note that without the factor

∏i<j (λj − λi)β , β = 1, 2 these

would be the densities of IID normal random variables of variance 2resp. 1. The extra factor modifies the distribution, making the λidependent, and in particular making them repulsive, since the factorgoes to zero as two eigenvalues approach each other. Thus repulsionis a crucial different between the distribution of eigenvalues and otherdistributions for an independent set of points, or random set of pointssuch as the Poisson point process.

For the proofs we need the following basic results about the matrixexponential.

Page 28: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

3.2. GOE AND GUE 28

Lemma 3.2.9. The maps

(3.2.40) exp : Mn (R)→Mn (R) ,

(3.2.41) exp : Mn (C)→Mn (C)

defined by the series

(3.2.42) exp (X) = I +∞∑k=1

Xk

k!,

(which is convergent for all X ∈Mn (C)) satisfies the following:a) exp is continuously differentiable when viewed as a map Rn×n →

Rn×n (resp. Cn×n → Cn×n).b) We have for all X

(3.2.43) exp (X) = I +X +O(|X|2HS e

|X|2HS).

c) There is a neighborhood V1 of 0 and a neighborhood V2 of I, suchthat

(3.2.44) exp : V1 → V2,

is invertible.d) If X ∈ Skewn (R) then exp (X) ∈ O (n) .e) If X ∈ Skewn (C) then exp (X) ∈ U (n).f) The neighborhoods V1, V2 in (3.2.44) can be chosen so that

(3.2.45) exp : V1 ∩ Skewn (R)→ V2 ∩O (n) ,

and

(3.2.46) exp : V1 ∩ Skewn (C)→ V2 ∩ U (n) ,

are surjective, so also invertible.

Proof. (Optional) See Lemmas B.4, B.6, B.7.

We first give an overview of the proof of Theorem 3.2.7 in the GOEcase. We already know from Lemma 3.2.1 that for a GOE the entries(Xij)1≤i≤j≤n have density

(3.2.47) e−14

∑ni=1 λi(X)2

.

Now the ordered eigenvalues are a function, albeit a complicated one,of (Xij)1≤i≤j≤n:

(3.2.48) gλ : Rn(n+1)

2 → Rn

(3.2.49) Λ := (λ1, . . . , λn) = gλ

((Xij)1≤i≤j≤n

).

Page 29: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

3.2. GOE AND GUE 29

Recall that the change of variables formula for densities (LemmaC.25) implies that if X ∈ R

n(n+1)2 has a density and g : R

n(n+1)2 →

Rn(n+1)

2 is invertible and continuously differentiable, then

(3.2.50) Y = g (X) ,

has a density given by

(3.2.51) fY (y) = |det J (y)| fX (x) for x = g−1 (y) ,

where J (y) is the Jacobian of the inverse g−1 at y.We would like to use this with g = gλ to conclude that Λ has a

density. Unfortunately gλ is not map from Rn(n+1)

2 to itself, but to thelower-dimensional space Rn.

The standard way to deal with this is to add auxiliary variables tothe right-hand side of (3.2.49), to make the map one from R

n(n+1)2 to

Rn(n+1)

2 . In doing so, one must ensure that the resulting g is invertibleand continuously differentiable.

There is an obvious candidate for the n(n−1)2

degrees of freedomwe must “add” to the right-hand side: The diagonalizing orthogonalmatrix O. We could thus imagine setting up a map

(3.2.52) Rn(n+1)

2 → Rn ×O (n)(Xij, 1 ≤ i ≤ j ≤ n) → (Λ, O)

This is not a map into Euclidean space. So we also need to parameterizeO (n) by R

n(n−1)2 , giving

(3.2.53) Rn(n+1)

2 → Rn ×O (n) → Rn × Rn(n−1)

2,

(Xij, 1 ≤ i ≤ j ≤ n) → (Λ, O) → (Λ, o) .

To keep all our maps as maps between Euclidean spaces we actuallyreplace O (n) by Rn×n, thus identifying a matrix O ∈ O (n) with itsvector of entries. Thus we seek to define a map

(3.2.54) Rn(n+1)

2 → Rn × Rn×n → Rn × Rn(n−1)

2

(Xij, 1 ≤ i ≤ j ≤ n) → (Λ, O) → (Λ, o) .

This map is easiest to specify in the reverse direction:

(3.2.55) Rn × Rn(n−1)

2 → Rn × Rn×n → Rn(n+1)

2

(Λ, o) → (Λ, O) → (Xij, 1 ≤ i ≤ j ≤ n) .

A natural map

(3.2.56) Rn(n−1)

2 → Rn×n (∼= O (n)) ,

Page 30: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

3.2. GOE AND GUE 30

is given by exponential map exp : Skewn (R) → O (n) (See Lemma3.2.9 d)). Thus we define

(3.2.57) e : Rn(n−1)

2 → Rn×n,

given by

(3.2.58) eij (x) = exp (SkewR (x))ij , 1 ≤ i, j ≤ n,

where

(3.2.59) SkewR (x)

is the Skew-symmetric matrix with entries above the diagonal given bythe entries of the vector x.

The natural map from Rn×Rn×n (∼= R×O (n)) to Rn(n+1)

2 is givenby (Λ, O)→ OTDΛO, where DΛ denotes the diagonal matrix with theentries of Λ on the diagonal. We can thus define

(3.2.60) h : Rn × Rn×n → Rn(n+1)

2 ,

by

(3.2.61) hij (λ,O) =(OTΛO

)ij

for 1 ≤ i ≤ j ≤ n.

We have thus set up our desired maps:(3.2.62)

Rn × Rn(n−1)

2 → Rn × Rn×n → Rn(n+1)

2

(Id, e) h(Λ, o) → (Λ, O) → (Xij, 1 ≤ i ≤ j ≤ n) ,

which can be composed to give

(3.2.63) w : Rn × Rn(n−1)

2 → Rn(n+1)

2 ,

given by

(3.2.64) w (Λ, o) =(e (o)T DΛe (o)

)1≤i≤j≤n

.

Since exp is continuously differentiable (Lemma 3.2.9 a)), it is easy tosee that w is also continuously differentiable.

Our map g should be the inverse of w. But unfortunately it is notclear if w is surjective or invertible2.

In fact, exp : Skewn (R)→ O (n) is not invertible, so e is not invert-ible. However, Lemma 3.2.9 guarantees invertiblity in a neighborhoodof 0. Thus setting

(3.2.65) A = [−ε, ε]n(n−1)

2 ,

2in fact it is surjective, since exp : Skewn (R) → SO (n) is surjective, but thisis a non-trivial result

Page 31: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

3.2. GOE AND GUE 31

we have that for small enough ε and

(3.2.66) V = e (A) ,

that the map

(3.2.67) e : A→ V given by e (o) = e (o)

is invertible.Also the map h is not invertible, since (Λ, O) → OTDΛO is not.

However, letting

(3.2.68) L = (λ1, . . . , λn) : λ1 < . . . < λn ,

and

(3.2.69) B =OTDΛO : Λ ∈ L,O ∈ V

the map (Λ, O)→ OTDΛO restricted to

(3.2.70) L× V → B,

is invertible by the spectral theorem (two symmetric matrices withdistinct eigenvalues have the same diagonalizing matrix O up to multi-plying rows by ±1; thus a matrix with distinct eigenvalues has at mostone O which satisfies Oii > 0∀i; by (3.2.43) all O in V are such O, if εis small enough).

Thus we now have maps

(3.2.71)L× A → L× V → B

(Id, e) h(Λ, o) → (Λ, O) → (Xij, 1 ≤ i ≤ j ≤ n) ,

that are all invertible, so the map

(3.2.72) w : L× A→ B,

defined by

(3.2.73) w (Λ, o) = w (Λ, o) =(e (o)T DΛe (o)

)1≤i≤j≤n

,

is invertible and can define

(3.2.74) g = w−1,

giving us a map

(3.2.75) g : B → L× A,

which is continuously differentiable. On the event

(3.2.76)

(Xij)1≤i≤j≤n ∈ B

= Λ ∈ L ∩ O ∈ V ,

Page 32: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

3.2. GOE AND GUE 32

it holds that

(3.2.77) λi = gi

((Xij)1≤i′≤j′≤n

)for i = 1, . . . , n.

If it was true that (Xij)1≤i′≤j′≤n ∈ B almost surely, we could now usethe change of variables formula to conclude that the random variablesgi

((Xij)1≤i′≤j′≤n

)have a joint density,

(3.2.78) fΛ,o

(λ1, . . . , λn, (oij)1≤i<j≤n

):= |det J (Λ, o)| e−

14

∑ni=1 λ

2i ,

where J (Λ, o) is the Jacobian of g at (Λ, o).Recall that we have proven that for a GOEX, the eigenvalues Λ and

the equivalence class [O] of the diagonalizing matrix are independent.Now we exploit this to actually allow us to use the change of variablesformula to that λ1, . . . , λn have a joint density.

Lemma 3.2.10. Let X be a GOE. Then the eigenvalues λ1 ≤ λ2 ≤. . . ≤ λn have a density with respect to Lebesgue measure on Rn, andit is given by

(3.2.79) fΛ (λ1, . . . , λn) = c1λ1<...<λnfΛ,o (λ1, . . . , λn, 0) ,

for a normalizing constant c > 0, where the latter is defined in (3.2.78).

Proof. Let (Ω,A,P) be a probability space on which we constructa random vector Λ distributed according to the distribution of theeigenvalues of a GOE sorted so that λ1 ≤ λ2 ≤ . . . ≤ λn, and a randommatrix O independent of Λ and distributed according to Haar measure.Let(3.2.80) X = OTΛO.

Then by Lemma 3.2.5 X is a GOE. So by Lemma 3.2.1, the entries(Xij)1≤i≤j≤n have the density

(3.2.81) fX (x) = ce−14

∑ni=1 λ

2i (x),

under the probability measure P.Recall the set B and note that since Λ and O are independent by

construction(3.2.82) P (X ∈ B) = P (Λ ∈ L,O ∈ V ) = P (Λ ∈ L)P (O ∈ V ) .

Furthermore by Lemma 3.1.3 we have(3.2.83) P (Λ ∈ L) = 1,

and since V contains an open neighborhood in O (n), so it must holdthat(3.2.84) P (O ∈ V ) > 0.

Page 33: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

3.2. GOE AND GUE 33

Therefore

(3.2.85) P (X ∈ B) > 0,

and we can define the conditional law

(3.2.86) Q (·) = P (·|X ∈ B) .

Furthermore for any F1, F2 it holds that

(3.2.87)

Q (Λ ∈ F1, O ∈ F2)

= P(Λ∈F1,Λ∈L,O∈F2,O∈V )P(Λ∈L,O∈V )

= P(Λ∈F1,Λ∈L)P(O∈F2,O∈V )P(Λ∈L)P(O∈V )

= P (Λ ∈ F1|Λ ∈ L)P (O ∈ F2|O ∈ V ) ,

or in other words: the independence of Λ and O under P implies thatunder Q they are also independent. Also,

(3.2.88) P (Λ ∈ F1|Λ ∈ L) = P (Λ ∈ F1) ,

since P (Λ ∈ L) = 1 which shows that the Q-law of Λ coincides withthe P-law of Λ. So it suffices to show that under Q the random vectorΛ has a density.

Note that for any measurable R ⊂ B

(3.2.89)

Q (X ∈ R) = P(X∈R,X∈B)P(B)

= P(X∈R)P(X∈B)

=∫R∩B e

− 14

∑ni=1 λ

2i (X)∏n

i=1 dλiP(X∈B)

,

since X has density under P, so the Q-law of (Xij)1≤i≤j≤n also has adensity

(3.2.90) ce−∑ni=1 λ

2i (X)

4 ,

(with a different normalizing constant c).Now recall that we have defined

(3.2.91) g : B → L× A.

We may thus define the random variables o ∈ Rn(n−1)

2 by

(3.2.92) (Λ, o) = g (Xij : 1 ≤ i ≤ j ≤ n) ,

and using the change of variables formula we get that the Q-law of

(3.2.93) (Λ, oij, 1 ≤ i < j ≤ n) ,

has a density with respect to Lebesgue measure that is given by fΛ,o (Λ, o).

Page 34: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

3.2. GOE AND GUE 34

But under Q we had that Λ and o are independent. Since thedensity of independent random variables factors, we therefore knowthat there exist fΛ and fo so that

(3.2.94) fΛ,o (Λ, o) = fΛ (Λ) fo (o) for almost all Λ ∈ L, o ∈ A.

Using this for o = 0 it implies that in particular

(3.2.95)fΛ (Λ) =

fΛ,o(Λ,0)

fo(0)

= cfΛ,o (Λ, 0)

= c |det J (Λ, 0)| e−∑ni=1 λ

2i (X)

4

for some normalizing c > 0. This implies that under P the Λ has adensity given by

(3.2.96) 1LfΛ (Λ) .

In the next lemma we compute |det J (Λ, 0)|.

Lemma 3.2.11. It holds that

(3.2.97) |det J (Λ, 0)| =∏

1≤k<l≤n

(λl − λk) .

Proof. The Jacobian matrix J = J (Λ, 0) of w at Λ, o = 0 is theunique matrix such that

(3.2.98) w(

Λ + Λ, o)

= w (Λ, 0) + J

(Λo

)+O

(∣∣∣Λ∣∣∣2 + |o|2),

where J is a n(n+1)2× n(n+1)

2matrix and o is a vector in R

n(n−1)2 . We

can write

(3.2.99) J =(J1 J2

),

for J1 a n(n+1)2×n matrix and J2 a n(n+1)

2× n(n−1)

2matrix and get that

J1, J2 are defined by

(3.2.100) w(

Λ + Λ, o)

= w (Λ, 0) + J1Λ + J2o+O

(∣∣∣Λ∣∣∣2 + |o|2).

Page 35: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

3.2. GOE AND GUE 35

Since(3.2.101)w(

Λ + Λ, o)

=(e (o)T DΛ+Λe (o)

)=((I + SkewR (o) +O

(|u|2))T

(DΛ +DΛ)(I + SkewR (o) +O

(|u|2)))

= Λ− SkewR (o)DΛ +DΛSkewR (u) +DΛ +O

(∣∣∣Λ∣∣∣2 + |o|2),

(since ST = −S for any Skew-symmetric matrix, in particular for S =SkewR (u)), and w (Λ, 0) = Λ we obtain that

(3.2.102) J1 =

(In×n

0

),

and

(3.2.103) J2 =

(0S

),

for

(3.2.104) S =

(∂wik∂ukl

)1≤i<j≤n,1≤k<l≤n

.

Since

(3.2.105) (DΛSkewR (o))ij = λioij

and

(3.2.106) (SkewR (o)DΛ)ij = λj oij,

we obtain that

(3.2.107)(−SkewC (o)DΛ +DΛSkewC (o))ij= (λi − λj) oij

and so

(3.2.108)∂wij∂ukl

= δij=kl (λi − λj) .

Thus we obtain

(3.2.109) |det J (Λ, 0)| = |detS| =∏

1≤i<j≤n

(λj − λi)2 .

Proof of Theorem 3.2.7 a) . This follows directly from Lem-mas 3.2.10 and 3.2.11.

We now adapt the proof to the GUE case. We identify a Hermitian matrix X with a vector of its entries lying in R^n × C^{n(n−1)/2}, which we in turn identify with R^n × R^{n(n−1)} = R^{n²} (by considering real and imaginary parts of complex entries, ordering the entries so that the n real diagonal entries come first, and so that the real and imaginary parts of each above-diagonal entry appear adjacent to one another). Firstly we have the map

(3.2.110) R^n × U(n) → R^{n²}, (Λ, U) ↦ U* D_Λ U.

To make it a map between Euclidean spaces we identify U(n) with a subset of C^{n²} ≅ R^{2n²} and define

(3.2.111) h_ij(Λ, U) = (U* D_Λ U)_ij for 1 ≤ i ≤ j ≤ n,

where h_ii is real and h_ij, i < j, is the 2-vector identified with the complex number in the (i, j)-th entry of U* D_Λ U. This gives a map

(3.2.112) h : R^n × R^{2n²} → R^{n²}, h(Λ, U) = U* D_Λ U.

To make the map invertible we can restrict to the set L from (3.2.68) and the subset Ũ(n) = {U ∈ U(n) : U_ii ∈ R, U_ii > 0}:

(3.2.113) h : L × Ũ(n) → R^{n²}, h(Λ, U) = U* D_Λ U.

We then want to parameterize Ũ(n) by a space with n(n−1) degrees of freedom. We use without proof that there exist an open subset A ⊂ R^{n²−n} with 0 ∈ A and a subset V ⊂ Ũ(n) such that, if U is distributed according to Haar measure on U(n), then

(3.2.114) P(p(U) ∈ V) > 0,

where p(U) denotes the map that multiplies each column of U by a complex number of unit norm so that U_ii ∈ R, U_ii > 0, and a continuously differentiable map

(3.2.115) e : A → V,

such that³

(3.2.116) e(u) = I + Skew_C(u) + O(|u|²),

where Skew_C(u) denotes the skew-Hermitian matrix with zeros on the diagonal and real and imaginary parts of the upper-triangular part given by the entries of u (we order the entries of u so that the real and imaginary parts of each entry appear adjacent).

Setting

(3.2.117) B = {e(u)* D_Λ e(u) : Λ ∈ L, u ∈ A},

we obtain an invertible and continuously differentiable map

(3.2.118) L × A → L × V → B, (Λ, u) ↦ (Λ, U) ↦ (X_ij, 1 ≤ i ≤ j ≤ n),

where the first arrow is (Id, e) and the second is h, given by

(3.2.119) w : L × A → R^{n²},

(3.2.120) w(Λ, u) = ((e(u)* D_Λ e(u))_ij, 1 ≤ i ≤ j ≤ n).

As discussed above, we interpret (e(u)* D_Λ e(u))_ij as a real vector of length 2 (real and imaginary part) when i < j and as a real number when i = j, which are combined to form the vector w(Λ, u), the real diagonal entries coming first.

We define

(3.2.121) g = w^{−1},

giving us a map

(3.2.122) g : B → L × A,

which is continuously differentiable. On the event

(3.2.123) {(X_ij)_{1≤i≤j≤n} ∈ B}

it holds that

(3.2.124) λ_i = g_i((X_{i′j′})_{1≤i′≤j′≤n}) for i = 1, . . . , n.

³The original intention when writing these notes was to parameterize U(n) by Skew_n(C) via the exponential map (see Lemma 3.2.9 d)). But this does not suffice, since one needs to parameterize Ũ(n) rather than U(n) to “quotient out” all the degrees of freedom causing non-uniqueness. Unfortunately the exponential map does not provide an explicit parametrization of Ũ(n).

Furthermore we define

(3.2.125) f_{Λ,u}(λ_1, . . . , λ_n, (u_ij)_{1≤i<j≤n}) := |det J(Λ, u)| e^{−(1/2) ∑_{i=1}^n λ_i²},

where J(Λ, u) is the Jacobian of w at (Λ, u), and (u_ij)_{1≤i<j≤n} is interpreted as vectors of 2 real numbers combined into one real vector.

Lemma 3.2.12. Let X be a GUE matrix. Then the eigenvalues λ_1 ≤ λ_2 ≤ . . . ≤ λ_n have a density with respect to Lebesgue measure on R^n, and it is given by

(3.2.126) f_Λ(λ_1, . . . , λ_n) = c 1_{λ_1<...<λ_n} f_{Λ,u}(λ_1, . . . , λ_n, 0)

for a normalizing constant c > 0, where the latter is defined in (3.2.125).

Proof. Let (Ω, A, P) be a probability space on which we construct a random vector Λ distributed according to the distribution of the eigenvalues of a GUE matrix, sorted so that λ_1 ≤ λ_2 ≤ . . . ≤ λ_n, and a random matrix U independent of Λ and distributed according to Haar measure. Let

(3.2.127) X = U* D_Λ U.

Then by Lemma 3.2.6, X is a GUE matrix. So by Lemma 3.2.1, the entries of X have the density

(3.2.128) f_X(x) = c e^{−(1/2) ∑_{i=1}^n λ_i²(x)}

under the probability measure P.

Recall the set B and note that, since Λ and U are independent by construction,

(3.2.129) P(X ∈ B) = P(Λ ∈ L, p(U) ∈ V) = P(Λ ∈ L) P(p(U) ∈ V) > 0

(recall (3.2.114)). We define

(3.2.130) Q(·) = P(·|X ∈ B).

As in (3.2.87) it holds that Λ and U are independent under Q, and that Λ has the same law under P and Q. As in (3.2.89) it holds that X (seen as a vector in R^{n²}) has density proportional to

(3.2.131) e^{−(1/2) ∑_{i=1}^n λ_i²(X)}

(with a different normalizing constant c) also under Q.

We define the random variable u by

(3.2.132) (Λ, u) = g(X).

By the change-of-variables formula,

(3.2.133) (Λ, (u_ij, 1 ≤ i < j ≤ n))

has a density with respect to Lebesgue measure under Q that is given by f_{Λ,u}(Λ, u).

Thus Λ also has a density f_Λ(Λ) under Q, and since Λ and u are independent under Q one has

(3.2.134) f_{Λ,u}(Λ, 0) = f_Λ(Λ) f_u(0),

giving

(3.2.135) f_Λ(Λ) = f_{Λ,u}(Λ, 0)/f_u(0) = c f_{Λ,u}(Λ, 0) = c |det J(Λ, 0)| e^{−(1/2) ∑_{i=1}^n λ_i²}.

Thus under P the vector Λ has density

(3.2.136) 1_L f_Λ(Λ).

In the next lemma we compute |det J(Λ, 0)|.

Lemma 3.2.13. It holds that

(3.2.137) |det J(Λ, 0)| = ∏_{1≤k<l≤n} (λ_l − λ_k)².

Proof. We order the outputs of w so that the first n outputs are

(3.2.138) w_ii(Λ, u), i = 1, . . . , n,

i.e. the n real values on the diagonal of the matrix encoded by w(Λ, u), and the next n(n−1) outputs are the real and imaginary parts of

(3.2.139) w_ij(Λ, u), 1 ≤ i < j ≤ n,

the complex above-diagonal entries.

Note that for all Λ ∈ L, u ∈ A we have

(3.2.140) w_ij(Λ, u) = (e(u)* D_Λ e(u))_ij.

The Jacobian matrix J = J(Λ, 0) of w at (Λ, u) = (Λ, 0) is the unique matrix such that

(3.2.141) w(Λ + Λ̃, u) = w(Λ, 0) + J·(Λ̃, u) + O(|Λ̃|² + |u|²),

where (Λ̃, u) is viewed as a column vector, J is an n² × n² matrix and u is a vector in R^{n²−n} given by

(3.2.142) (u_ij, 1 ≤ i < j ≤ n),

for u_ij complex but identified with vectors in R², stacked to create the vector u. We can write

(3.2.143) J = (J_1  J_2),

for J_1 an n² × n matrix and J_2 an n² × (n² − n) matrix, and get that J_1, J_2 are defined by

(3.2.144) w(Λ + Λ̃, u) = w(Λ, 0) + J_1 Λ̃ + J_2 u + O(|Λ̃|² + |u|²).

Since

(3.2.145) w(Λ + Λ̃, u) = (e(u)* D_{Λ+Λ̃} e(u))
= ((I + Skew_C(u)* + O(|u|²))(D_Λ + D_Λ̃)(I + Skew_C(u) + O(|u|²)))
= D_Λ̃ − Skew_C(u) D_Λ + D_Λ Skew_C(u) + D_Λ + O(|Λ̃|² + |u|²)

(since S* = −S for any skew-Hermitian matrix, in particular for S = Skew_C(u)), and w(Λ, 0) = Λ, we obtain that

(3.2.146) J_1 = (I_{n×n} ; 0),

and

(3.2.147) J_2 = (0 ; S),

for

(3.2.148) S = (∂w_ij/∂u_kl)_{1≤i<j≤n, 1≤k<l≤n},

where each

(3.2.149) ∂w_ij/∂u_kl

is a 2 × 2 matrix, the Jacobian of the map from C to C one obtains by tracking the change in w_ij as one varies only u_kl, both seen as elements of R². Since

(3.2.150) (D_Λ Skew_C(u))_ij = λ_i u_ij

and

(3.2.151) (Skew_C(u) D_Λ)_ij = λ_j u_ij,

we obtain that

(3.2.152) (−Skew_C(u) D_Λ + D_Λ Skew_C(u))_ij = (λ_i − λ_j) u_ij = (λ_i − λ_j) Re u_ij + i (λ_i − λ_j) Im u_ij,

and so

(3.2.153) ∂w_ij/∂u_kl = δ_{(i,j)=(k,l)} ( λ_i − λ_j   0 ; 0   λ_i − λ_j ).

Since

(3.2.154) det(∂w_ij/∂u_ij) = (λ_j − λ_i)²,

we obtain

(3.2.155) |det J(Λ, 0)| = |det S| = ∏_{1≤i<j≤n} (λ_j − λ_i)².

Proof of Theorem 3.2.7 b). This is an immediate consequence of Lemmas 3.2.12 and 3.2.13.
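An analogous sanity check for the GUE case, where the repulsion exponent is 2 as in Lemma 3.2.13, is sketched below (again an added illustration under assumed conventions: entry density proportional to e^{−Tr(X²)/2}, i.e. diagonal entries N(0,1) and off-diagonal real and imaginary parts N(0,1/2)); for n = 2 the resulting gap density is proportional to s²e^{−s²/4}, with mean 4/√π.

import numpy as np

rng = np.random.default_rng(1)
N = 200_000
# 2x2 GUE in the convention density ~ exp(-Tr(X^2)/2).
a = rng.normal(0.0, 1.0, N)                  # X_11
b = rng.normal(0.0, 1.0, N)                  # X_22
re = rng.normal(0.0, np.sqrt(0.5), N)        # Re X_12
im = rng.normal(0.0, np.sqrt(0.5), N)        # Im X_12
# Eigenvalue gap of the Hermitian matrix [[a, re+i*im], [re-i*im, b]].
gaps = np.sqrt((a - b) ** 2 + 4 * (re ** 2 + im ** 2))
# The squared Vandermonde factor gives a gap density ~ s^2*exp(-s^2/4),
# with mean 4/sqrt(pi) ~ 2.2568.
print("empirical mean gap :", gaps.mean())
print("predicted mean gap :", 4 / np.sqrt(np.pi))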

3.3. Upper bound on the operator norm of Wigner matrices

In this section we show an upper bound for the operator norm ‖X‖_op of a symmetric, Hermitian or independent-entry Wigner matrix X. For symmetric or Hermitian X this is equivalent to bounding the magnitude of the largest eigenvalue, since then

(3.3.1) ‖X‖_op = max_{i=1,...,n} |λ_i|.

We start with the real independent-entry case; the others are then easily deduced as corollaries.

Proposition 3.3.1 (Operator norm upper bound, independent real entries). Let X be a random matrix with independent real-valued entries X_ij such that

(3.3.2) E[X_ij] = 0

and such that there exist c, κ > 0 with

(3.3.3) P(|X_ij| ≥ u) ≤ c e^{−κu²} for all i, j and all u ≥ 0.

Then there exists a constant c = c(κ) such that

(3.3.4) P(‖X‖_op ≤ c√n) → 1.

Furthermore, there exist constants V = V(κ) and c = c(κ) such that for all v ≥ V

(3.3.5) P(‖X‖_op ≥ v√n) ≤ e^{−cv²n}.

Remark 3.3.2. a) Of course (3.3.5) implies (3.3.4).

b) Note that we do not require the entries to be IID, only that they are independent.

c) We do however require the tail condition (3.3.3) uniformly in the entries, with the same c, κ for the tail of each entry. The tail condition is satisfied by e.g. Gaussian or bounded random variables, but not by random variables with only exponential tails.

d) However, our tail condition is not optimal. In fact, with a more involved method one can show [5] that if the second and fourth moments are uniformly bounded (sup_{i,j=1,...,n; k=2,4} E[|X_ij|^k] ≤ c), then (3.3.4) holds.

e) Some moment or tail condition is indeed necessary. Take for instance X_ij IID with density c/(1 + x⁴), for some normalizing c. Then P(|X_ij| ≥ u) ≍ u^{−3}, so (3.3.3) is not satisfied, nor is the moment condition of d), since E[|X_ij|⁴] = ∞. Also the largest entry will be of order at least (log n)√n, since

(3.3.6) P(max_{i,j} |X_ij| ≥ (log n)√n) = 1 − P(|X_11| < (log n)√n)^{n²} ≥ 1 − (1 − c/((log n)√n)³)^{n²} → 1.

Since

(3.3.7) ‖X‖_op ≥ max_{i,j} |X_ij|

(we have |X_ij| = |e_j^T X e_i| ≤ |e_j| ‖X‖_op |e_i| = ‖X‖_op for the standard basis vectors e_i), this shows that the claim (3.3.4) of the proposition is not true in this case.

f) The mean-zero condition (3.3.2) is also necessary. Take for instance X_ij = 1 (deterministic). Then ‖X‖_op = n.
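Before turning to the proof, the √n scaling can be observed numerically; the sketch below (an added illustration, not part of the notes) computes the operator norm, i.e. the largest singular value, of square matrices with IID N(0,1) entries for a few sizes n. The ratio ‖X‖_op/√n appears to stabilize (near 2 for this normalization), consistent with the proposition's upper bound c√n.

import numpy as np

rng = np.random.default_rng(2)
for n in [100, 400, 1600]:
    X = rng.normal(size=(n, n))          # independent N(0,1) entries, not symmetric
    op_norm = np.linalg.norm(X, ord=2)   # operator norm = largest singular value
    print(f"n = {n:5d}   ||X||_op / sqrt(n) = {op_norm / np.sqrt(n):.3f}")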

Proof. We note that

(3.3.8) ‖X‖_op = sup_{a : |a|=1} |a^T X a|.

To upper bound ‖X‖_op we will upper bound the supremum on the right-hand side.

Step 0: There is a constant c = c(κ) such that for any θ ∈ R and all i, j

(3.3.9) E[exp(θX_ij)] ≤ e^{cθ²}.

Proof of Step 0: We consider two cases. First, if |θ| ≥ 1 we have

(3.3.10) E[exp(θX_ij)] ≤ ∑_{u≥1} e^{θ(u+1)} P(X_ij ≥ u) ≤ ∑_{u≥1} e^{θ(u+1) − κu²} ≤ ∫_1^∞ e^{θ(u+2) − κ(u−1)²} du ≤ e^{2θ + cθ²} ∫_R e^{−κ(u − cθ)²} du ≤ e^{cθ²},

where we used that for |θ| ≥ 1 it holds that |θ| ≤ θ². Secondly, if |θ| < 1, then we can expand the exponential and use E[X_ij] = 0 to get

(3.3.11) E[exp(θX_ij)] = 1 + ∑_{k=2}^∞ θ^k E[X_ij^k]/k! ≤ 1 + θ² ∑_{k=2}^∞ E[|X_ij|^k]/k!.

It holds that

(3.3.12) E[|X_ij|^k] ≤ 1 + ∑_{u=1}^∞ (u+1)^k P(|X_ij| ≥ u) ≤ 1 + c ∑_{u=1}^∞ (2u)^k e^{−κu²},

so that

(3.3.13) ∑_{k=2}^∞ E[|X_ij|^k]/k! ≤ c + ∑_{k=2}^∞ ∑_{u=1}^∞ (2u)^k e^{−κu²}/k! = c + ∑_{u=1}^∞ e^{−κu²} ∑_{k=2}^∞ (2u)^k/k! ≤ c + c ∑_{u=1}^∞ e^{−κu²} u² ∑_{k=2}^∞ (2u)^{k−2}/k!.

We have

(3.3.14) ∑_{k=2}^∞ (2u)^{k−2}/k! ≤ 1 + ∑_{k=1}^∞ (2u)^k/k! = e^{2u}.

Thus

(3.3.15) ∑_{k=2}^∞ E[|X_ij|^k]/k! ≤ c + c ∑_{u=1}^∞ e^{−κu²} (2u)² e^{2u} ≤ c,

and so

(3.3.16) E[exp(θX_ij)] ≤ 1 + cθ² ≤ e^{cθ²}.

Step 1: There is a constant c = c(κ) such that for all a with |a| = 1 it holds for all t ≥ 0 that

(3.3.17) P(a^T X a ≥ t) ≤ e^{−ct²}.

Proof of Step 1: By the exponential Chebyshev inequality,

(3.3.18) P(a^T X a ≥ t) ≤ E[exp(λ a^T X a)] e^{−λt}

for all λ > 0. Now

(3.3.19) E[exp(λ a^T X a)] = E[exp(λ ∑_{i,j} a_i a_j X_ij)] = E[∏_{i,j} exp(λ a_i a_j X_ij)] = ∏_{i,j} E[exp(λ a_i a_j X_ij)],

where we used independence of the entries X_ij. Applying the previous step with θ = λ a_i a_j we have

(3.3.20) E[exp(λ a^T X a)] ≤ ∏_{i,j} e^{cλ² a_i² a_j²} = e^{cλ² ∑_{i,j} a_i² a_j²} = e^{cλ²}

for all λ > 0, which implies that

(3.3.21) P(a^T X a ≥ t) ≤ e^{cλ² − λt}.

Now picking λ = ct for a small enough constant c we obtain

(3.3.22) P(a^T X a ≥ t) ≤ e^{−ct²}.

Step 2: There is a subset Σ ⊂ {a : |a| = 1} such that |Σ| ≤ 9ⁿ and such that for all a with |a| = 1 we have inf_{b∈Σ} |a − b| ≤ 1/4.

Proof of Step 2: Let Σ be a maximal subset of {a : |a| = 1} such that the balls B(b, 1/8), b ∈ Σ, are pairwise disjoint (keep adding points a with |a| = 1 to Σ until you cannot do so anymore without violating this condition). Then

(3.3.23) {a : |a| = 1} ⊂ ∪_{b∈Σ} B_n(b, 1/4),

since if for some a with |a| = 1 we had a ∉ ∪_{b∈Σ} B_n(b, 1/4), then B(a, 1/8) would be disjoint from B(b, 1/8) for all b ∈ Σ, so Σ would not be maximal. This implies inf_{b∈Σ} |a − b| ≤ 1/4 for all a with |a| = 1. Furthermore, since the balls B(b, 1/8), b ∈ Σ, are disjoint and

(3.3.24) ∪_{b∈Σ} B_n(b, 1/8) ⊂ B_n(0, 1 + 1/8),

we have

(3.3.25) |Σ| Vol(B_n(0, 1/8)) ≤ Vol(B_n(0, 1 + 1/8)),

which implies

(3.3.26) |Σ| ≤ Vol(B_n(0, 1 + 1/8)) / Vol(B_n(0, 1/8)) = (1 + 1/8)ⁿ / (1/8)ⁿ = 9ⁿ.
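The bound |Σ| ≤ 9ⁿ can be illustrated in low dimensions with a greedy randomized construction of a 1/4-separated subset of the unit sphere (a rough added sketch only; the randomized greedy set below is a packing that approximates, but need not equal, the maximal set used in the proof).

import numpy as np

rng = np.random.default_rng(3)

def greedy_separated_set(n, min_dist=0.25, trials=20_000):
    """Greedily collect unit vectors that are pairwise at distance >= min_dist."""
    pts = np.empty((0, n))
    for _ in range(trials):
        x = rng.normal(size=n)
        x /= np.linalg.norm(x)
        if pts.shape[0] == 0 or np.min(np.linalg.norm(pts - x, axis=1)) >= min_dist:
            pts = np.vstack([pts, x])
    return pts

for n in [2, 3]:
    sigma = greedy_separated_set(n)
    print(f"n = {n}: found {len(sigma)} separated points, volumetric bound 9^n = {9 ** n}")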

Step 3: ‖X‖_op ≤ 2 sup_{b∈Σ} |b^T X b|.

Proof of Step 3: For every a with |a| = 1 there is a b ∈ Σ such that |a − b| ≤ 1/4. Let a be such that

(3.3.27) |a^T X a| = ‖X‖_op.

Then for a b ∈ Σ with |a − b| ≤ 1/4 we have

(3.3.28) ‖X‖_op = |a^T X a| = |a^T X b + a^T X (a − b)| = |b^T X b + (a − b)^T X b + a^T X (a − b)|
≤ |b^T X b| + |(a − b)^T X b| + |a^T X (a − b)|
≤ |b^T X b| + 2 |a − b| ‖X‖_op
≤ |b^T X b| + (1/2) ‖X‖_op.

Thus

(3.3.29) (1/2) ‖X‖_op ≤ |b^T X b|,

which gives the claim.

Step 4: Conclusion. We have from Steps 1, 2 and 3 that

(3.3.30) P(‖X‖_op ≥ v√n) ≤ P(sup_{b∈Σ} |b^T X b| ≥ (1/2) v√n) ≤ |Σ| sup_{b∈Σ} P(|b^T X b| ≥ (1/2) v√n) ≤ 9ⁿ e^{−cv²n} = exp((c − cv²) n) ≤ e^{−cv²n}

for v ≥ V, for a constant V = V(κ).

The equivalent result for several other random matrix ensembles can easily be deduced.

Corollary 3.3.3 (Operator norm upper bound; complex entries, symmetric and Hermitian Wigner ensembles). Let X be a symmetric or Hermitian random matrix, with independent entries on and above the diagonal, such that there exists c > 0 with

(3.3.31) P(|X_ij| ≥ u) ≤ e^{−cu²} for all u ≥ 0.

Then there exists a constant c such that

(3.3.32) P(‖X‖_op ≤ c√n) → 1.

Furthermore, there exist constants V and c such that for all v ≥ V

(3.3.33) P(‖X‖_op ≥ v√n) ≤ e^{−cv²n}.

Remark 3.3.4. a) The tail condition (3.3.31) is satisfied for instance if the entries are Gaussian, so the theorem applies to the GOE and GUE.

b) For the GOE or GUE this bound would in principle be obtainable from the density formulas (3.2.36), (3.2.37) for the eigenvalues, since all information about the distribution of the maximum eigenvalue is contained in them. In practice, manipulating the resulting integral seems infeasible.

Proof. Note that the operator norm, being indeed a norm on the linear spaces M_n(R) resp. M_n(C), satisfies the triangle inequality

(3.3.34) ‖A + B‖_op ≤ ‖A‖_op + ‖B‖_op

for any real or complex matrices A and B.

a) Symmetric real matrices: Write

(3.3.35) X = A + B,

where A is upper triangular (A_ij = 0 for i > j) and B is strictly lower triangular (B_ij = 0 for i ≤ j). Then A and B are matrices with independent entries, so by (3.3.34) and the previous proposition

(3.3.36) P(‖X‖_op ≥ v√n) ≤ P(‖A‖_op ≥ v√n/2) + P(‖B‖_op ≥ v√n/2) ≤ e^{−cv²n}

for v ≥ V′ := 2V, where V is the constant whose existence is guaranteed by the previous proposition.

b) Complex matrices with independent entries: Write

(3.3.37) X = A + iB,

where A_ij = Re X_ij and B_ij = Im X_ij. Then

(3.3.38) ‖X‖_{op,M_n(C)} ≤ ‖A‖_{op,M_n(C)} + ‖iB‖_{op,M_n(C)} = ‖A‖_{op,M_n(C)} + ‖B‖_{op,M_n(C)} = ‖A‖_{op,M_n(R)} + ‖B‖_{op,M_n(R)},

where the subscript indicates whether we view the operator norm as a norm on M_n(C) or on M_n(R). The last equality holds since the operator norm is the largest singular value, which is the same whether a real matrix is viewed as acting on R^n or on C^n. Now we can deduce the bound (3.3.33) in the present case from the previous proposition as in a).

c) Hermitian random matrices: Write

(3.3.39) X = A + B,

where A is upper triangular and B is strictly lower triangular, both with independent complex entries. Then the bound follows from the bound for complex matrices with independent entries (proved in b)) together with (3.3.34).

CHAPTER 4

Wigner’s semicircle law

4.1. Statement of the theorem

In this chapter we prove the celebrated Wigner semicircle law. It gives the asymptotics as n → ∞ for the number of eigenvalues λ_i of an n × n symmetric real or complex Hermitian Wigner matrix that lie in an interval [a√n, b√n] for a < b (under appropriate moment conditions on the entries). The claim is that the number of eigenvalues in such an interval scales like c_{a,b} n, where c_{a,b} is given by the measure of the interval [a, b] under the semicircle density

(4.1.1) f_sc(x) = 1_{[−2,2]}(x) (1/(2π)) √(4 − x²),

i.e.

(4.1.2) c_{a,b} = ∫_a^b f_sc(x) dx.

More precisely, let

(4.1.3) N(a, b, X) = |{i : a ≤ λ_i(X) ≤ b}|

denote the number of eigenvalues of a matrix X that lie in the interval [a, b]. If X_n, n ≥ 1, is a sequence of symmetric or Hermitian random matrices, then the claim is that

(4.1.4) N(a, b, (1/√n)X_n)/n → ∫_a^b f_sc(x) dx,

where the convergence can be shown to happen almost surely (note that λ_i((1/√n)X_n) = (1/√n) λ_i(X_n)).

An elegant and equivalent way to express this claim involves the Empirical Spectral Distribution.

Definition 4.1.1. Let X be an n × n matrix. The Empirical Spectral Distribution (ESD) of X is the probability measure (on R or C)

(4.1.5) µ_X = (1/n) ∑_{i=1}^n δ_{λ_i(X)}.

Note that for any matrix X with real eigenvalues

(4.1.6) N(a, b, X) = n µ_X([a, b]).

Recall that the Portmanteau lemma (Lemma C.28) states that µ_{X_n} converges weakly to a measure µ (written µ_{X_n} →w µ; see Definition C.26) iff µ_{X_n}([a, b]) → µ([a, b]) for all a, b that are continuity points of µ (i.e. µ({a}) = µ({b}) = 0). If we

(4.1.7) define µ_sc to be the measure with density f_sc,

then all a, b are continuity points, so for any deterministic sequence of matrices X_n with real eigenvalues

(4.1.8) N(a, b, X_n)/n → ∫_a^b f_sc(x) dx for all a, b

iff

(4.1.9) µ_{X_n} →w µ_sc.

We will state the Wigner semicircle law as the result that almost surely µ_{(1/√n)X_n} →w µ_sc. Note that µ_{(1/√n)X_n} is then a random measure, not to be confused with the probability measure P; µ_{(1/√n)X_n} is a random variable that is itself a probability measure, i.e. a probability-measure-valued random variable.

Remark 4.1.2. (Optional) To be allowed to speak of µ_{(1/√n)X_n} as a random variable we must strictly speaking define it as a measurable function from Ω to a measure space. This requires some abstract machinery. Let P be the set of all probability measures on R. The Lévy-Prokhorov metric (see Definition C.29) defines a metric d : P × P → [0, ∞) (which is such that µ_n →w µ iff d(µ_n, µ) → 0). Thus (P, d) is a metric space, so one can define a Borel sigma-algebra B(P), which makes (P, B(P)) a measurable space. For a random matrix X defined on some probability space (Ω, A, P) we can therefore define µ_X as a function

(4.1.10) µ_X : Ω → P.

It can be shown (we omit the proof) that µ_X as defined by (4.1.5) and (4.1.10) is a measurable function, and therefore µ_X is a well-defined measure-valued random variable.

Theorem 4.1.3 (Wigner's semicircle law). Let

(4.1.11) X_ij, 1 ≤ i ≤ j < ∞,

be random variables such that X_ii, i ≥ 1, are IID and real-valued, X_ij, i < j, are IID and complex-valued, and set X_ji = X̄_ij. Consider the complex Hermitian (or, in the special case Im X_ij = 0, real symmetric) Wigner matrices

(4.1.12) X_n = (X_ij)_{1≤i,j≤n}.

If

(4.1.13) E[X_11] = E[X_12] = 0,

and

(4.1.14) E[|X_12|²] = 1,

and

(4.1.15) E[|X_11|^k], E[|X_12|^k] < ∞ for k ≤ 16,

then

(4.1.16) P(µ_{(1/√n)X_n} →w µ_sc) = 1

(i.e. the ESD of (1/√n)X_n converges weakly, almost surely, to the Wigner semicircle law).

Remark 4.1.4. a) This applies in particular to GOE and GUE random matrices. Deriving the result in this case from the density formulas (3.2.36) and (3.2.37) for the eigenvalues is in principle possible but seems infeasible.

b) The moment condition (4.1.15) is not optimal; with a more involved proof it can be weakened to only requiring sup_{i,j} E[|X_ij|²] < ∞.

c) Since µ_{(1/√n)X_n} →w µ_sc implies that for all ε > 0 we have N(2 − ε, 2, (1/√n)X_n)/n → ∫_{2−ε}^2 f_sc(x) dx > 0, the result (4.1.16) implies that

(4.1.17) sup_i λ_i((1/√n)X_n) ≥ 2 − ε for large enough n, almost surely,

giving, since ‖X_n‖_op = √n ‖(1/√n)X_n‖_op, that for all ε > 0

(4.1.18) ‖X_n‖_op ≥ √n (2 − ε) for large enough n, almost surely,

and of course also the weaker statement that for all ε > 0

(4.1.19) P(‖X_n‖_op ≥ √n (2 − ε)) → 1.

Thus we have a lower bound to accompany the upper bound of Corollary 3.3.3.

d) Since the semicircle law is supported on [−2, 2], it is tempting to also conclude that for all ε > 0

(4.1.20) ‖X_n‖_op ≤ √n (2 + ε) almost surely, for n large enough,

so that

(4.1.21) ‖X_n‖_op/√n → 2 almost surely.

Indeed, this is in fact true under our assumptions¹, but unfortunately it does not follow from Theorem 4.1.3; it requires a different proof. Theorem 4.1.3 does imply that for any ε > 0

(4.1.22) N(2 + ε, ∞, (1/√n)X_n)/n → 0,

which would give (4.1.20) if we could conclude that N(2 + ε, ∞, (1/√n)X_n) → 0. But (4.1.22) would be satisfied also if, for instance, N(2 + ε, ∞, (1/√n)X_n) ≈ log n → ∞.
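The theorem is easy to probe numerically; the following sketch (an added illustration, with an arbitrary choice of n and of Gaussian entries) samples one real symmetric Wigner matrix, rescales by 1/√n, and compares the fraction of eigenvalues in a few intervals [a, b] with ∫_a^b f_sc, as well as the largest eigenvalue with the edge value 2 discussed in c) and d) above.

import numpy as np

rng = np.random.default_rng(4)
n = 2000
# Real symmetric Wigner matrix: independent N(0,1) entries on and above
# the diagonal (the diagonal entries do not matter in the limit).
A = rng.normal(size=(n, n))
X = np.triu(A) + np.triu(A, 1).T
lam = np.linalg.eigvalsh(X / np.sqrt(n))

f_sc = lambda x: np.sqrt(np.clip(4 - x ** 2, 0, None)) / (2 * np.pi)

for a, b in [(-1.0, 1.0), (0.5, 1.5), (1.5, 2.0)]:
    empirical = np.mean((lam >= a) & (lam <= b))
    xs = np.linspace(a, b, 2001)
    theoretical = np.trapz(f_sc(xs), xs)
    print(f"[{a:4.1f},{b:4.1f}]  ESD mass {empirical:.4f}   semicircle mass {theoretical:.4f}")

print("largest eigenvalue of X/sqrt(n):", lam.max(), " (semicircle edge: 2)")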

4.2. The Stieltjes transform

A standard way to prove the convergence of a sequence of probability measures (or of distributions of random variables) is to prove the convergence of some transform of the distributions. The Central Limit Theorem can be proven by showing the convergence of the Laplace transform (the moment generating function) of the normalized sum, or by showing the convergence of its Fourier transform (the characteristic function). For proving the Wigner semicircle law a different transform turns out to be useful:

Definition 4.2.1 (Stieltjes transform). Let µ be a probability measure on R. Define the function

(4.2.1) s_µ : C\R → C

by

(4.2.2) s_µ(z) = ∫ µ(dx)/(x − z), Im z ≠ 0.

We call s_µ(z) the Stieltjes transform of µ.

¹For real symmetric matrices, the optimal assumptions under which this is true are E[|X_11|²] < ∞ and E[|X_12|⁴] < ∞; see [1].

The next lemma gives basic properties of the Stieltjes transform. Part c) shows that µ is determined by its transform, thus justifying calling it a transform.

Lemma 4.2.2. a) The integral defining s_µ(z) is well defined for Im z ≠ 0.

b) For any µ it holds that

(4.2.3) lim_{η↓0} (1/π) ∫_a^b Im s_µ(E + iη) dE = µ((a, b)) + (1/2) µ({a}) + (1/2) µ({b}).

c) If µ, ν are probability measures on R and s_µ(z) = s_ν(z) for all z ∈ C\R, then µ = ν (i.e. µ is uniquely determined by s_µ).

Proof. a) For z = E + iη with η ≠ 0 we have ∫ µ(dx)/|x − z| ≤ ∫ µ(dx)/|η| = 1/|η| < ∞.

b) Note that

(4.2.4) (1/π) Im s_µ(E + iη) = (1/π) ∫ Im(1/(x − E − iη)) µ(dx) = (1/π) ∫ η/((x − E)² + η²) µ(dx) = ∫ h_η(x − E) µ(dx),

where

(4.2.5) h_η(x) = (1/π) η/(x² + η²).

Recalling that

(4.2.6) ∫ 1/(1 + x²) dx = arctan(x) + C,

and using a change of variables, one can show that

(4.2.7) ∫ h_η(x) dx = 1.

(Note that for η small, h_η(x − E) is sharply concentrated around E; intuitively, if µ has a density f, one would expect (1/π) Im s_µ(E + iη) ≈ f(E) for η small.) From (4.2.4) it follows that

(4.2.8) (1/π) ∫_a^b Im s_µ(E + iη) dE = ∫_a^b ∫ h_η(x − E) µ(dx) dE = ∫ (∫_a^b h_η(x − E) dE) µ(dx) = ∫ H_{a,b}(x) µ(dx),

where

(4.2.9) H_{a,b}(x) = ∫_a^b h_η(x − E) dE.

From (4.2.6) it follows that

(4.2.10) H_{a,b}(x) = (1/π) (arctan((x − a)/η) − arctan((x − b)/η)).

Since

(4.2.11) lim_{η↓0} arctan((x − d)/η) = π/2 if x > d, 0 if x = d, −π/2 if x < d,

one sees that

(4.2.12) lim_{η↓0} H_{a,b}(x) = 1_{x∈(a,b)} + (1/2) 1_{x=a} + (1/2) 1_{x=b}.

Since |H_{a,b}(x)| ≤ 1 and µ is a probability measure, we can now conclude by dominated convergence that

(4.2.13) ∫ H_{a,b}(x) µ(dx) → ∫ (1_{x∈(a,b)} + (1/2) 1_{x=a} + (1/2) 1_{x=b}) µ(dx) = µ((a, b)) + (1/2) µ({a}) + (1/2) µ({b}).

c) First note that for any u

(4.2.14) lim_{ε↓0} µ((u, u + ε)) = 0,

and

(4.2.15) lim_{ε↓0} µ({u + ε}) = 0

(the latter since otherwise there would be a c > 0 and a sequence ε_n ↓ 0 such that µ({u + ε_n}) ≥ c for all n, which would imply µ(R) = ∞). Thus by b)

(4.2.16) µ({u}) = 2 lim_{ε↓0} lim_{η↓0} (1/π) ∫_u^{u+ε} Im s_µ(E + iη) dE.

Thus for any a < b we can write µ({a}), µ({b}) and therefore also µ((a, b)) as limits of integrals of Im s_µ(E + iη). Thus if s_µ = s_ν, then µ((a, b)) = ν((a, b)) for all a < b, showing that µ = ν.
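The inversion formula (4.2.3) and the role of the boundary atoms can be checked numerically on a small atomic measure; the sketch below (an added illustration; the chosen atoms, weights and value of η are arbitrary) approximates (1/π)∫_a^b Im s_µ(E + iη) dE for a = 0, b = 1 and compares it with µ((a, b)) + (1/2)µ({a}) + (1/2)µ({b}).

import numpy as np

# Test measure: mu = 0.5*delta_0 + 0.25*delta_{0.4} + 0.25*delta_{1.7}.
atoms = np.array([0.0, 0.4, 1.7])
weights = np.array([0.5, 0.25, 0.25])

a, b, eta = 0.0, 1.0, 1e-4
E = np.linspace(a, b, 200_001)   # grid fine enough to resolve the width eta
# Im s_mu(E + i*eta) = sum_j w_j * eta / ((x_j - E)^2 + eta^2)
im_s = (weights[:, None] * eta / ((atoms[:, None] - E[None, :]) ** 2 + eta ** 2)).sum(axis=0)
lhs = np.trapz(im_s / np.pi, E)

# mu((a,b)) + mu({a})/2 + mu({b})/2 = 0.25 + 0.5/2 + 0 = 0.5
print("(1/pi) int_a^b Im s_mu(E+i*eta) dE :", lhs)
print("mu((a,b)) + mu({a})/2 + mu({b})/2  :", 0.25 + 0.25)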

The next lemma relates weak convergence and convergence of Stieltjes transforms.

Lemma 4.2.3. Let µ_n, µ be probability measures on R. Then µ_n →w µ iff s_{µ_n}(z) → s_µ(z) for all z ∈ C\R.

Proof of Lemma 4.2.3, direction "⟹". For each z ∈ C\R the real and imaginary parts of the function f(x) = 1/(x − z) are continuous bounded functions, so by the definition of weak convergence s_{µ_n}(z) = ∫ f(x) µ_n(dx) → ∫ f(x) µ(dx) = s_µ(z).

Proof, direction "⟸". Since |H_{−K,K}(x)| ≤ 1, it holds for all η > 0 that

(4.2.17) (1/π) ∫_{−K}^{K} Im s_{µ_n}(E + iη) dE = ∫ H_{−K,K}(x) µ_n(dx) ≤ µ_n([−K, K]).

Now s_{µ_n}(z) → s_µ(z) and |s_{µ_n}(E + iη)| ≤ 1/η, so for any η > 0 bounded convergence implies that

(4.2.18) (1/π) ∫_{−K}^{K} Im s_µ(E + iη) dE ≤ liminf_{n→∞} µ_n([−K, K]).

Now taking η ↓ 0 and using part b) of Lemma 4.2.2 we get that

(4.2.19) lim_{η↓0} (1/π) ∫_{−K}^{K} Im s_µ(E + iη) dE ≤ liminf_{n→∞} µ_n([−K, K]),

which gives that

(4.2.20) µ((−K, K)) ≤ liminf_{n→∞} µ_n([−K, K]).

Since lim_{K→∞} µ((−K, K)) = 1, this shows that the sequence µ_n, n ≥ 1, is a tight sequence of measures.

Assume that µ_n does not converge weakly to µ. Then there is a continuous bounded function f such that ∫ f(x) µ_n(dx) does not converge to ∫ f(x) µ(dx). We can in fact pick an ε > 0 and a subsequence such that

(4.2.21) |∫ f(x) µ_{n_k}(dx) − ∫ f(x) µ(dx)| ≥ ε

for all n_k. But by Prokhorov's theorem this subsequence has a further subsequence n_l that is weakly convergent. Along this subsequence we have

(4.2.22) µ_{n_l} →w ν

for some ν, so the "⟹" part implies that s_{µ_{n_l}}(z) → s_ν(z) for all z ∈ C\R, so in fact s_ν = s_µ, so ν = µ by Lemma 4.2.2 c). Thus µ_{n_l} →w µ, contradicting (4.2.21). So it must hold that µ_n →w µ.

Thus to prove (4.1.16) it would suffice to show that

(4.2.23) P(∀ z ∈ C\R : s_{µ_{(1/√n)X_n}}(z) → s_{µ_sc}(z)) = 1.

The integral

(4.2.24) ∫_{−2}^{2} (1/(x − z)) (1/(2π)) √(4 − x²) dx

is not trivial to compute, but it can be computed explicitly (see exercise) and shown to equal

(4.2.25) s_{µ_sc}(z) = (−z + √(z² − 4))/2 for Im z > 0, and (−z − √(z² − 4))/2 for Im z < 0,

with the convention for the square root chosen so that Im √(z² − 4) ≥ 0 (i.e. √x = √r e^{iθ/2} for x = r e^{iθ} with 0 ≤ θ < 2π). In the proof of Theorem 4.1.3 we will obtain an indirect proof of (4.2.25) as a side product.
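The closed form (4.2.25) can be compared against a direct numerical evaluation of the integral (4.2.24); the following sketch (an added illustration with arbitrary test points) does this for a few values of z.

import numpy as np

def s_sc_closed(z):
    # (4.2.25), with the square root branch chosen so that Im(sqrt(w)) >= 0.
    root = np.sqrt(z ** 2 - 4 + 0j)
    if root.imag < 0:
        root = -root
    return (-z + root) / 2 if z.imag > 0 else (-z - root) / 2

def s_sc_numeric(z, n_grid=200_001):
    x = np.linspace(-2, 2, n_grid)
    f = np.sqrt(4 - x ** 2) / (2 * np.pi)      # semicircle density
    return np.trapz(f / (x - z), x)

for z in [0.3 + 1.0j, -1.2 + 0.5j, 0.7 - 2.0j]:
    print(z, s_sc_closed(z), s_sc_numeric(z))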

4.3. Reduction

The main step in proving Theorem 4.1.3 will be to show that for

(4.3.1) s_n(z) := s_{µ_{(1/√n)X_n}}(z)

we have for large n

(4.3.2) s_n(z) + 1/(z + s_n(z)) ≈ 0,

i.e. s_n(z) approximately satisfies the quadratic (in s) equation s + 1/(z + s) = 0. The formal version of (4.3.2) that we need to prove Theorem 4.1.3 is the following.

Proposition 4.3.1. Under the assumptions of Theorem 4.1.3 it holds for all z ∈ C\R that

(4.3.3) P(|s_n(z) + 1/(z + s_n(z))| → 0) = 1.

We will postpone the proof of Proposition 4.3.1 until later and first see how it implies Theorem 4.1.3. Roughly speaking, (4.3.2) suggests that s_n(z) converges to a solution of

(4.3.4) s + 1/(z + s) = 0.

This is a quadratic equation that can be solved explicitly to give

(4.3.5) s = (−z ± √(z² − 4))/2.

We will see that one of these solutions gives the Stieltjes transform of a probability measure, and that in fact

(4.3.6) s_{µ_sc}(z) = (−z + √(z² − 4))/2 for Im z > 0,

suggesting that

(4.3.7) s_n(z) → s_{µ_sc}(z),

and thus that µ_{(1/√n)X_n} →w µ_sc. The above argument has several technical gaps. We now close them with a more careful, rigorous argument.

Lemma 4.3.2. If for z ∈ C\R it holds that

(4.3.8) P(|s_n(z) + 1/(z + s_n(z))| → 0) = 1,

then

(4.3.9) P(lim_{n→∞} s_n(z) = s(z)) = 1,

where²

(4.3.10) s(z) = (−z + √(z² − 4))/2 if Im z > 0, and (−z − √(z² − 4))/2 if Im z < 0,

and we define the complex square root √x as √r e^{iθ/2} for x = r e^{iθ} with 0 ≤ θ < 2π (note that with this convention Im √x ≥ 0 for all x ∈ C).

²A previous version of the lecture notes erroneously defined s(z) = (−z + √(z² − 4))/2 for all z ∈ C\R.

Proof. Consider first z with Im z > 0, and let ω ∈ Ω be such that

(4.3.11) |s_n(ω, z) + 1/(z + s_n(ω, z))| → 0.

Let s_{n_k}(ω, z) be any convergent subsequence of s_n(ω, z). Its limit point s*(z) must satisfy

(4.3.12) s*(z) + 1/(z + s*(z)) = 0.

By solving the quadratic equation this implies that

(4.3.13) s*(z) = (−z + √(z² − 4))/2 or s*(z) = (−z − √(z² − 4))/2.

To distinguish between the two possibilities, recall that

(4.3.14) Im s_n(ω, E + iη) = ∫ η/((E − x)² + η²) µ_n(dx),

so that Im s_n(z) and Im z always have the same sign. This means that if Im z > 0, then Im s*(z) ≥ 0. In our convention it holds for Im z > 0 that

(4.3.15) Im((−z − √(z² − 4))/2) < 0,

which implies that in fact, if Im z > 0,

(4.3.16) s*(z) = (−z + √(z² − 4))/2 = s(z).

If Im z < 0, then the fact that

(4.3.17) Im((−z + √(z² − 4))/2) > 0

implies that

(4.3.18) s*(z) = (−z − √(z² − 4))/2 = s(z).

Thus any convergent subsequence of s_n(ω, z) converges to s(z). Since |s_n(ω, z)| ≤ 1/|Im z|, sequential compactness implies that in fact

(4.3.19) lim_{n→∞} s_n(ω, z) = s(z)

for all ω such that (4.3.11) holds.

We have not yet proven that s(z) is the Stieltjes transform of a measure. The next lemma does this.

Lemma 4.3.3. For all z ∈ C\R it holds that

(4.3.20) s_{µ_sc}(z) = s(z).

Proof. In this proof we work in the particular case that the X_ij are random variables that satisfy both the hypotheses of Theorem 4.1.3 and those of Corollary 3.3.3 (e.g. X_ij Gaussian). The bound (3.3.33) implies that for large enough c

(4.3.21) ∑_{n≥1} P(sup_{i=1,...,n} |λ_i((1/√n)X_n)| ≥ c) ≤ ∑_{n≥1} e^{−cn} < ∞,

so that by the Borel-Cantelli lemma

(4.3.22) P(sup_{i=1,...,n} |λ_i((1/√n)X_n)| ≤ c for all n large enough) = 1.

For such X_ij it must then hold that

(4.3.23) P({s_n(z) → s(z) for all z ∈ Q + i(Q ∩ (0, ∞))} ∩ {sup_{i=1,...,n} |λ_i((1/√n)X_n)| ≤ c for all n large enough}) = 1.

(A previous version of these lecture notes neglected the intersection over Q + i(Q ∩ (0, ∞)) and considered a fixed z. This was an error, since it ultimately only shows that for this specific z there is a measure ν such that s(z) = s_ν(z), not that there is a ν such that s(z) = s_ν(z) for all z.)

Pick any ω such that

(4.3.24) s_n(ω, z) → s(z) for all z ∈ Q + i(Q ∩ (0, ∞)),

and

(4.3.25) |λ_i((1/√n)X_n(ω))| ≤ c for all i and all n large enough.

The latter implies that

(4.3.26) µ_n(ω, [−c, c]^c) = 0 for n large enough,

so the sequence of measures µ_n(ω) is tight, and by Prokhorov's theorem it has a weakly convergent subsequence µ_{n_k}(ω) → ν. Then

(4.3.27) s_{µ_{n_k}(ω)}(z) → s_ν(z),

which implies that

(4.3.28) s_ν(z) = s(z) for z ∈ Q + i(Q ∩ (0, ∞)).

Now s(z) is continuous on {z : Im z > 0} (since z² − 4 never lies in (0, ∞), the set of discontinuity points of √·, for z in this set), and Q + i(Q ∩ (0, ∞)) is dense in {z : Im z > 0}, so in fact

(4.3.29) s_ν(z) = s(z) for z ∈ C with Im z > 0.

The fact that s_ν(z) is the complex conjugate of s_ν(z̄), and likewise s(z) of s(z̄), proves that in fact

(4.3.30) s_ν(z) = s(z) for z ∈ C\R,

so s(z) is indeed the Stieltjes transform of a measure ν. Note that this is a statement about the deterministic function s(z), so we can use this conclusion in the proof of Theorem 4.1.3 also when the hypotheses of Corollary 3.3.3 are not satisfied.

Since we now know that s(z) is the Stieltjes transform of ν, we can recover ν by using the inversion formula: with z = E + iη,

(4.3.31) ν((a, b)) + (1/2) ν({a}) + (1/2) ν({b}) = lim_{η↓0} ∫_a^b (1/π) Im s(z) dE.

Since |s(z)| ≤ (|z| + √|z² − 4|)/2 is bounded for E in the compact interval [a, b] and 0 < η ≤ 1, we can take the limit inside to get

(4.3.32) ν((a, b)) + (1/2) ν({a}) + (1/2) ν({b}) = ∫_a^b lim_{η↓0} (1/π) Im s(z) dE.

Now

(4.3.33) lim_{η↓0} Im s(z) = lim_{η↓0} Im((−z + √(z² − 4))/2) = lim_{η↓0} Im(√(z² − 4)/2).

With our convention √· is continuous away from (0, ∞), and one checks that

(4.3.34) lim_{η↓0} Im(√(z² − 4)) = 0 if |E| ≥ 2, and √(4 − E²) if |E| ≤ 2.

Thus

(4.3.35) ν((a, b)) + (1/2) ν({a}) + (1/2) ν({b}) = ∫_a^b (1/(2π)) 1_{x∈[−2,2]} √(4 − x²) dx.

Letting a = u and b = u + ε and taking ε ↓ 0 shows that ν({u}) = 0 for all u, so in fact

(4.3.36) ν((a, b)) = ∫_a^b (1/(2π)) 1_{x∈[−2,2]} √(4 − x²) dx

for all a < b, so in fact

(4.3.37) ν = µ_sc,

and so

(4.3.38) s_{µ_sc}(z) = s(z).

With this we have shown that

(4.3.39) ∀ z ∈ C\R : P(s_{µ_{(1/√n)X_n}}(z) → s_{µ_sc}(z)) = 1,

which is in fact slightly weaker than what we need to conclude the proof of Theorem 4.1.3, namely

(4.3.40) P(∀ z ∈ C\R : s_{µ_{(1/√n)X_n}}(z) → s_{µ_sc}(z)) = 1.

For general functions f_n(z), f(z) in place of s_{µ_{(1/√n)X_n}}(z), s_{µ_sc}(z), the statement (4.3.39) does not imply (4.3.40), not even if f_n, f are assumed to be continuous (counterexample for functions R → R: let U be uniform on [0, 1], set f_n(U) = 1 and f_n(z) = 0 for |z − U| ≥ 1/n, interpolate linearly in between, and let f(z) = 0). But for functions with a Lipschitz constant that is uniform in n the implication does hold, and since Stieltjes transforms are such functions we can prove that (4.3.39) implies (4.3.40) in the following lemma.

Lemma 4.3.4. Let µ_n, µ be random measures on R such that

(4.3.41) ∀ z ∈ C\R : P(s_{µ_n}(z) → s_µ(z)) = 1.

Then

(4.3.42) P(s_{µ_n}(z) → s_µ(z) for all z ∈ C\R) = 1.

Proof. Note that for any z, z′ ∈ C\R and any probability measure µ we have

(4.3.43) |s_µ(z) − s_µ(z′)| ≤ ∫ |1/(x − z) − 1/(x − z′)| µ(dx) = ∫ |(z − z′)/((x − z)(x − z′))| µ(dx) ≤ |z − z′|/(|Im z| |Im z′|)

(showing that s_µ(z) is locally Lipschitz with Lipschitz constant bounded by 2/|Im z|², independently of µ).

Consider the set A = {x + iy : x ∈ Q, y ∈ Q\{0}}. This is a countable set, so (4.3.41) implies that

(4.3.44) P(s_{µ_n}(z) → s_µ(z) for all z ∈ A) = 1.

Now fix an ω ∈ Ω such that

(4.3.45) s_{µ_n}(z, ω) → s_µ(z, ω) for all z ∈ A.

Pick any z ∈ C\R and fix an ε > 0. Then we can find a z′ ∈ A such that |z − z′| ≤ ε and |Im z′| ≥ |Im z|, so that by (4.3.43)

(4.3.46) |s_{µ_n}(z′, ω) − s_{µ_n}(z, ω)|, |s_µ(z′, ω) − s_µ(z, ω)| ≤ ε/|Im z|².

This shows that for all ε > 0

(4.3.47) |s_{µ_n}(z, ω) − s_µ(z, ω)| ≤ 2ε/|Im z|² + |s_{µ_n}(z′, ω) − s_µ(z′, ω)|

for all n, thus showing that in fact

(4.3.48) s_{µ_n}(z, ω) → s_µ(z, ω) for all z ∈ C\R.

Since this holds for any ω such that (4.3.45) holds, we obtain (4.3.42).

We can now complete the proof of Theorem 4.1.3, conditional on the not-yet-proven Proposition 4.3.1.

Proof of Theorem 4.1.3. Proposition 4.3.1 together with Lemmas 4.3.2 and 4.3.3 gives (4.3.39), and by the previous lemma (4.3.39) implies (4.3.40). With Lemma 4.2.3 this implies that

(4.3.49) P(µ_{(1/√n)X_n} →w µ_sc) = 1,

which is the claim of the theorem.

4.4. Proof of main estimate

In this section we prove Proposition 4.3.1, i.e., roughly speaking, that for large n

(4.4.1) s_n(z) + 1/(s_n(z) + z) ≈ 0.

The following identity is what makes the Stieltjes transform well adapted to studying the convergence of the ESD.

Lemma 4.4.1. Let X be an n × n matrix with real spectrum and ESD µ_X. Then for all z ∈ C\R the matrix X − zI is invertible and

(4.4.2) s_{µ_X}(z) = (1/n) Tr((X − zI)^{−1}).

Proof. Letting λ_i be the eigenvalues of X, we have that X − zI has eigenvalues λ_i − z. Since λ_i ∈ R and z ∈ C\R, none of these are zero, so X − zI is invertible and (X − zI)^{−1} has eigenvalues 1/(λ_i − z). Thus the result follows since

(4.4.3) (1/n) Tr((X − zI)^{−1}) = (1/n) ∑_{i=1}^n λ_i((X − zI)^{−1}) = (1/n) ∑_{i=1}^n 1/(λ_i − z) = ∫ 1/(x − z) µ_X(dx).

The key to deriving the equation (4.4.1) is the Schur complement formula.

Proposition 4.4.2 (Schur complement formula). Let

(4.4.4) X = ( A  B ; C  D )

be a matrix consisting of blocks A, B, C, D, where A is n × n and D is m × m. If X and D are invertible, then A − BD^{−1}C is also invertible, and letting

(4.4.5) Ψ = (A − BD^{−1}C)^{−1},

we have

(4.4.6) X^{−1} = ( Ψ   −ΨBD^{−1} ; −D^{−1}CΨ   D^{−1}CΨBD^{−1} + D^{−1} ).

Proof. We apply Gaussian elimination to X to make it block diagonal. Multiplying the second block row by BD^{−1} and subtracting it from the first block row we get

(4.4.7) ( A − BD^{−1}C  0 ; C  D ).

Since this operation can be achieved by pre-multiplying by

(4.4.8) ( I  −BD^{−1} ; 0  I ),

we have

(4.4.9) ( I  −BD^{−1} ; 0  I ) X = ( A − BD^{−1}C  0 ; C  D ).

Note that the matrix (4.4.8) is always invertible, and so is X by assumption, so the right-hand side of (4.4.9) must also be invertible, which implies that

(4.4.10) A − BD^{−1}C

is invertible. Next we multiply the second block column by D^{−1}C and subtract it from the first block column, i.e. we post-multiply by

(4.4.11) ( I  0 ; −D^{−1}C  I ),

giving

(4.4.12) ( I  −BD^{−1} ; 0  I ) X ( I  0 ; −D^{−1}C  I ) = ( A − BD^{−1}C  0 ; 0  D ).

Since

(4.4.13) ( I  U ; 0  I )^{−1} = ( I  −U ; 0  I ),

we have

(4.4.14) X = ( I  BD^{−1} ; 0  I ) ( A − BD^{−1}C  0 ; 0  D ) ( I  0 ; D^{−1}C  I ).

Taking the inverse of both sides we have

(4.4.15) X^{−1} = ( I  0 ; −D^{−1}C  I ) ( Ψ  0 ; 0  D^{−1} ) ( I  −BD^{−1} ; 0  I ),

and by multiplying out we get (4.4.6).
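The block formula (4.4.6) is easy to confirm numerically for random blocks (an added illustration; the block sizes are arbitrary).

import numpy as np

rng = np.random.default_rng(6)
n, m = 3, 4
A = rng.normal(size=(n, n)); B = rng.normal(size=(n, m))
C = rng.normal(size=(m, n)); D = rng.normal(size=(m, m))
X = np.block([[A, B], [C, D]])

Dinv = np.linalg.inv(D)
Psi = np.linalg.inv(A - B @ Dinv @ C)       # the inverse Schur complement (4.4.5)
X_inv_blocks = np.block([
    [Psi,               -Psi @ B @ Dinv],
    [-Dinv @ C @ Psi,    Dinv @ C @ Psi @ B @ Dinv + Dinv],
])
print(np.allclose(X_inv_blocks, np.linalg.inv(X)))   # True (generically)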

Recall that s_{µ_X}(z) = (1/n) Tr((X − zI)^{−1}). The next lemma gives an expression for the diagonal elements of (X − zI)^{−1}, which will facilitate the computation of its trace.

Lemma 4.4.3. Let X be an n × n Hermitian matrix and Im z > 0. Let X^{(i)} be the (n−1) × (n−1) matrix obtained by removing the i-th row and column from X. Then both X − zI and X^{(i)} − zI are invertible. Also, letting

(4.4.16) G = (X − zI)^{−1},

(4.4.17) G^{(i)} = (X^{(i)} − zI)^{−1},

and letting

(4.4.18) Ḡ^{(i)}

be the n × n matrix with Ḡ^{(i)}_kl = 0 for k = i or l = i, and all other entries equal to the corresponding entries of the (n−1) × (n−1) matrix G^{(i)}, we have

(4.4.19) G_ii = 1/(X_ii − z − ∑_{k,l≠i} X_ik Ḡ^{(i)}_kl X_li).

Proof. The matrices X − zI and X^{(i)} − zI are invertible since both X and X^{(i)} are Hermitian (so have real spectrum) and Im z > 0. By rearranging columns and rows we can assume w.l.o.g. that i = 1. Write

(4.4.20) X = ( X_11  B ; C  X^{(1)} ),

for

(4.4.21) B_j = X_{1,j+1}, j = 1, . . . , n − 1,

(4.4.22) C_j = X_{j+1,1}, j = 1, . . . , n − 1.

Then

(4.4.23) X − zI = ( A  B ; C  D ),

for

(4.4.24) A = X_11 − z,

and

(4.4.25) D = X^{(1)} − zI.

By the Schur complement formula it holds that

(4.4.26) (X − zI)^{−1} = ( Ψ  * ; *  * ),

where the entries marked * are not needed, A − BD^{−1}C ≠ 0 and

(4.4.27) Ψ = (A − BD^{−1}C)^{−1} = (X_11 − z − B(X^{(1)} − zI)^{−1}C)^{−1} = 1/(X_11 − z − B(X^{(1)} − zI)^{−1}C).

Since

(4.4.28) (X^{(1)} − zI)^{−1} = G^{(1)},

and

(4.4.29) B G^{(1)} C = ∑_{k,l≠1} X_{1k} Ḡ^{(1)}_{kl} X_{l1},

we get the result.
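Formula (4.4.19) can likewise be verified numerically (an added illustration; the matrix size, the index i and the test point z are arbitrary): for a random Hermitian X and Im z > 0, the diagonal resolvent entry G_ii agrees with the expression built from the minor X^{(i)}.

import numpy as np

rng = np.random.default_rng(7)
n, i, z = 6, 2, 0.3 + 0.9j
A = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
X = (A + A.conj().T) / 2                       # Hermitian matrix

G = np.linalg.inv(X - z * np.eye(n))
keep = [k for k in range(n) if k != i]
Xi = X[np.ix_(keep, keep)]                     # X^{(i)}: i-th row and column removed
Gi = np.linalg.inv(Xi - z * np.eye(n - 1))     # G^{(i)}

row = X[i, keep]                               # (X_ik)_{k != i}
col = X[keep, i]                               # (X_li)_{l != i}
rhs = 1.0 / (X[i, i] - z - row @ Gi @ col)     # right-hand side of (4.4.19)
print(G[i, i], rhs)                            # the two agree up to rounding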

With this formula and (4.4.2) we get the following: if X is a Hermitian Wigner matrix satisfying the conditions of Theorem 4.1.3 and

(4.4.30) s_n(z) = s_{µ_{(1/√n)X}}(z),

then with

(4.4.31) G = ((1/√n)X − zI)^{−1},

(4.4.32) G^{(i)} = ((1/√n)X^{(i)} − zI)^{−1},

and with Ḡ^{(i)} defined as the n × n version of G^{(i)} with a zero row and column i inserted, we have

(4.4.33) s_n(z) = −(1/n) ∑_{i=1}^n 1/(z + A_i), where A_i := −(1/√n)X_ii + (1/n) ∑_{k,l≠i} X_ik X̄_il Ḡ^{(i)}_kl.

Note that the i-th row and column X_i·, X_·i are independent of X^{(i)} and hence of G^{(i)}. We now try to get a rough idea of the typical size of A_i by considering Ḡ^{(i)} fixed and taking an expectation only over X_il, X_li, l = 1, . . . , n. Note that these entries are independent of X^{(i)}, and so also of G^{(i)}.

To lighten notation let us write, for fixed i,

(4.4.34) Ḡ^{(i)} = g.

Then we have

(4.4.35) A_i = −(1/√n)X_ii + (1/n) ∑_{k,l≠i} X_ik X̄_il g_kl.

We now take the conditional expectation given g to get

(4.4.36) E[A_i | g] = −E[(1/√n)X_ii | g] + (1/n) ∑_{k,l≠i} E[X_ik X̄_il | g] g_kl  (linearity of expectation)
= −E[(1/√n)X_ii] + (1/n) ∑_{k,l≠i} E[X_ik X̄_il] g_kl.  (independence)

Now E[X_ii] = 0, E[X_ik X̄_il] = 0 if k ≠ l, and E[X_ik X̄_ik] = E[|X_ik|²] = 1 (by (4.1.14)), so in fact

(4.4.37) E[A_i | g] = (1/n) ∑_{k≠i} g_kk = (1/n) Tr(g).

Thus it seems reasonable to expect that "usually"

(4.4.38) A_i ≈ (1/n) Tr(Ḡ^{(i)}) = (1/n) Tr(G^{(i)}).

Intuitively we expect (1/n) Tr(G) = s_n(z) to converge as n → ∞ (indeed this is part of what we are proving). Considering the approximation (1/√n)X^{(i)} − zI ≈ (1/√(n−1))X^{(i)} − zI, note that the latter is precisely the expression for G^{−1} for the (n−1) × (n−1) Wigner matrix X^{(i)}. Thus we may also expect that

(4.4.39) (1/n) Tr(G^{(i)}) ≈ (1/n) Tr(G),

in which case

(4.4.40) A_i ≈ (1/n) Tr(G) = s_n(z).

If these approximations are accurate then

(4.4.41) s_n(z) ≈ −1/(z + s_n(z)).

We have thus achieved a heuristic derivation of (4.4.1).

Our task is now to obtain a formal version of (4.4.41), by making the estimates (4.4.38) and (4.4.39) precise. The estimate (4.4.38) can be split into three parts:

(4.4.42) (1/√n)X_ii ≈ 0,

(4.4.43) (1/n) ∑_{k,l≠i, l≠k} X_ik X̄_il Ḡ^{(i)}_kl ≈ 0,

(4.4.44) (1/n) ∑_{k≠i} |X_ik|² Ḡ^{(i)}_kk ≈ (1/n) ∑_{k≠i} Ḡ^{(i)}_kk.

We start with the easiest estimate, namely (4.4.42).

Lemma 4.4.4. If X is as in Theorem 4.1.3, then it holds that

(4.4.45) max_{i=1,...,n} |(1/√n)X_ii| → 0 almost surely.

Proof. For any α > 0 and 1 ≤ p ≤ 16 (recall (4.1.15)) we have

(4.4.46) P(max_{i=1,...,n} |(1/√n)X_ii| ≥ n^{−α}) ≤ n P(|(1/√n)X_11| ≥ n^{−α}) = n P(|(1/√n)X_11|^p ≥ n^{−pα}) ≤ n E[|(1/√n)X_11|^p] n^{pα} ≤ c n^{1+αp−p/2}.

By picking e.g. p = 5 and α small enough but constant, the right-hand side is bounded above by n^{−1−ε}, which is summable, so Borel-Cantelli implies that

(4.4.47) P(max_{i=1,...,n} |(1/√n)X_ii| ≤ n^{−α} for all n large enough) = 1.

For (4.4.43) and (4.4.44) we need to use the Marcinkiewicz–Zygmund inequality.

Lemma 4.4.5 (Marcinkiewicz–Zygmund inequality). Let p ≥ 1. There is a constant c = c(p) such that for independent (complex) random variables Y_1, Y_2, . . . , Y_n with mean zero we have

(4.4.48) ‖∑_{i=1}^n Y_i‖_p ≤ c ‖√(∑_{i=1}^n |Y_i|²)‖_p.

Remark 4.4.6. If the Y_i were deterministic and could take any value, the best inequality we could hope for would be

(4.4.49) ‖∑_{i=1}^n Y_i‖_p = |Y · (1, . . . , 1)| ≤ |Y| √n = √n ‖√(∑_{i=1}^n |Y_i|²)‖_p

from Cauchy-Schwarz, where we write Y = (Y_1, . . . , Y_n). This is clearly much weaker than (4.4.48), showing that the conclusion of (4.4.48) holds only thanks to the concentration of the sum ∑_{i=1}^n Y_i coming from the fact that the Y_i are mean-zero and independent, cf. the Law of Large Numbers and the familiar formula

(4.4.50) E[|∑_{i=1}^n Y_i|²] = ∑_{i=1}^n E[|Y_i|²]

for independent mean-zero Y_i, which can be rewritten as

(4.4.51) ‖∑_{i=1}^n Y_i‖_2 = ‖√(∑_{i=1}^n |Y_i|²)‖_2.

Corollary 4.4.7. Let p ≥ 2. There is a constant c = c (p) suchthat for independent random variables Y1, Y2, . . . , Yn and any determin-istic vector a = (a1, . . . , an)

(4.4.52) ‖∑

akYk‖p ≤ c

(n

supk=1‖Yk‖p

)|a| .

Proof. Write

(4.4.53)

‖∑n

k=1 akYk‖pp = |a|p ‖∑n

k=1ak|a|Yk‖

pp

≤ c |a|p ‖√∑n

k=1

a2k

|a|2 |Yk|2‖p

= c |a|p E((∑n

k=1

a2k

|a|2 |Y2k |)p/2)

.

Now using Jensen’s inequality on the inner sum we get that this is atmost

(4.4.54) c |a|p E(∑n

k=1

a2k

|a|2 |Yk|p)

≤ c |a|p supnk=1 E [|Yk|p] .

We now move on to the second easiest, estimate (4.4.44), whichfollows directly from (4.4.48).

Lemma 4.4.8. For X as as in Theorem 4.1.3 and G(i) as in 4.4.32it holds for all z with Imz > 0 that

(4.4.55) maxi=1,...,n

∣∣∣∣∣ 1n∑k 6=i

|Xik|2 G(i)kk −

1

n

∑k 6=i

G(i)kk

∣∣∣∣∣ a.s.→ 0.

Proof. Let

(4.4.56) ∆_i = (1/n) ∑_{k≠i} |X_ik|² Ḡ^{(i)}_kk − (1/n) ∑_{k≠i} Ḡ^{(i)}_kk = (1/n) ∑_{k≠i} (|X_ik|² − 1) Ḡ^{(i)}_kk.

Now for fixed i define

(4.4.57) Y_k = |X_ik|² − 1, k ≠ i,

and

(4.4.58) a_k = (1/n) Ḡ^{(i)}_kk,

so that

(4.4.59) ∆_i = ∑_{k≠i} Y_k a_k,

for Y_k independent and independent of the a_k. By (4.1.14) we have

(4.4.60) E[Y_k] = 0.

Using (4.4.52), taking the underlying measure of the L^p-space to be the conditional measure given a (i.e. ‖·‖_p = (E[|·|^p | a])^{1/p}), we have

(4.4.61) (E[|∆_i|^p | a])^{1/p} ≤ c sup_{k≠i} ‖Y_k‖_p |a|.

The assumption (4.1.15) implies

(4.4.62) ‖Y_k‖_p ≤ c < ∞ for p ≤ 8.

Now crudely bounding

(4.4.63) |Ḡ^{(i)}_kk| ≤ ‖G^{(i)}‖_op = max_{j=1,...,n−1} |λ_j(G^{(i)})| = max_{j=1,...,n−1} 1/|λ_j((1/√n)X^{(i)}) − z| ≤ 1/Im z,

we obtain

(4.4.64) |a| = √(∑_{k≠i} (1/n²) |Ḡ^{(i)}_kk|²) ≤ (1/Im z)(1/√n),

so that

(4.4.65) E[|∆_i|^p] ≤ c (1/n^{p/2}) (1/Im z)^p,

and for any α > 0

(4.4.66) P(|∆_i| ≥ n^{−α}) ≤ c n^{αp} n^{−p/2} (1/Im z)^p.

Thus

(4.4.67) P(max_{i=1,...,n} |∆_i| ≥ n^{−α}) ≤ c n^{1+αp} n^{−p/2} (1/Im z)^p,

so setting e.g. p = 6 and α small enough, we get with Borel-Cantelli that

(4.4.68) P(max_{i=1,...,n} |∆_i| ≤ n^{−α} for all n large enough) = 1.

The estimate (4.4.43) needs a more sophisticated version of (4.4.52). We can define

(4.4.69) Y_kl = X_ik X̄_il, k ≠ l,

and

(4.4.70) a_kl = (1/n) Ḡ^{(i)}_kl,

so that

(4.4.71) ∑_{k,l≠i, l≠k} X_ik X̄_il (1/n) Ḡ^{(i)}_kl = ∑_{k,l≠i, l≠k} Y_kl a_kl.

We would then like to apply (4.4.52) to the right-hand side, but we cannot, since the Y_kl are not independent. The proof of the next lemma finds a trick to get around this issue.

Lemma 4.4.9. Let p ≥ 2. There is a constant c = c(p) such that for independent mean-zero random variables Y_1, Y_2, . . . , Y_n, V_1, . . . , V_m and any deterministic array a = (a_kl)_{1≤k≤n, 1≤l≤m} we have

a)

(4.4.72) ‖∑_{k,l} a_kl Y_k V_l‖_p ≤ c sup_k ‖Y_k‖_p sup_l ‖V_l‖_p √(∑_{k,l} |a_kl|²),

and b)

(4.4.73) ‖∑_{k≠l} a_kl Y_k Y_l‖_p ≤ c (sup_k ‖Y_k‖_p²) √(∑_{k,l} |a_kl|²).

Proof. We first prove (4.4.72). To do so we write

(4.4.74) a_k = ∑_{l=1}^m a_kl V_l,

so that

(4.4.75) ∑_{k,l} a_kl Y_k V_l = ∑_k a_k Y_k,

and applying (4.4.52) conditionally on V_1, . . . , V_m we get

(4.4.76) E[|∑_k a_k Y_k|^p | V_1, . . . , V_m] ≤ c (sup_k ‖Y_k‖_p)^p (∑_k |a_k|²)^{p/2}.

Taking the unconditional expectation on both sides we have

(4.4.77) E[|∑_{k,l} a_kl Y_k V_l|^p] ≤ c sup_k ‖Y_k‖_p^p E[(∑_k |a_k|²)^{p/2}].

We have

(4.4.78) E[(∑_k |a_k|²)^{p/2}] = ‖∑_k |a_k|²‖_{p/2}^{p/2}.

Using the triangle inequality we get

(4.4.79) ‖∑_k |a_k|²‖_{p/2} ≤ ∑_k ‖|a_k|²‖_{p/2} = ∑_k E[|a_k|^p]^{2/p}.

Thus

(4.4.80) E[|∑_{k,l} a_kl Y_k V_l|^p] ≤ c sup_k ‖Y_k‖_p^p (∑_k E[|a_k|^p]^{2/p})^{p/2}.

For each k we can use (4.4.52) on E[|a_k|^p] to obtain

(4.4.81) E[|a_k|^p] = E[|∑_l a_kl V_l|^p] ≤ c sup_l ‖V_l‖_p^p (∑_l |a_kl|²)^{p/2},

which gives

(4.4.82) E[|∑_{k,l} a_kl Y_k V_l|^p] ≤ c sup_k ‖Y_k‖_p^p sup_l ‖V_l‖_p^p (∑_k ∑_l |a_kl|²)^{p/2},

proving (4.4.72).

We now use (4.4.72) to prove (4.4.73). Here V_k = Y_k, so (4.4.72) does not apply directly. The trick to get around this is to write the sum as a sum of sums over disjoint subsets of k's and l's. Define for fixed k ≠ l

(4.4.83) Z_n = ∑_{K⊔L={1,...,n}} 1_{k∈K} 1_{l∈L}

(by symmetry Z_n does not depend on k, l), where the sum is over all disjoint non-empty K, L whose union is {1, . . . , n}, so that for all k ≠ l

(4.4.84) 1 = (1/Z_n) ∑_{K⊔L={1,...,n}} 1_{k∈K} 1_{l∈L}.

Then we can write

(4.4.85) ∑_{k≠l} Y_k Y_l a_kl = ∑_{k≠l} ((1/Z_n) ∑_{K⊔L={1,...,n}} 1_{k∈K} 1_{l∈L}) Y_k Y_l a_kl = (1/Z_n) ∑_{K⊔L={1,...,n}} ∑_{k∈K, l∈L} Y_k Y_l a_kl.

In the inner sum we can set V_l = Y_l for l ∈ L, and then (Y_k)_{k∈K} is independent of (V_l)_{l∈L}, so we can apply (4.4.72) to get that

(4.4.86) ‖∑_{k∈K, l∈L} Y_k Y_l a_kl‖_p ≤ c sup_{k∈K} ‖Y_k‖_p sup_{l∈L} ‖Y_l‖_p √(∑_{k∈K, l∈L} |a_kl|²).

Using the triangle inequality we have that

(4.4.87) ‖∑_{k≠l} Y_k Y_l a_kl‖_p ≤ (1/Z_n) ∑_{K⊔L={1,...,n}} ‖∑_{k∈K, l∈L} Y_k Y_l a_kl‖_p ≤ c sup_k ‖Y_k‖_p² √(∑_{k,l=1}^n |a_kl|²) · (1/Z_n) ∑_{K⊔L={1,...,n}} 1.

Now we can compute exactly

(4.4.88) ∑_{K⊔L={1,...,n}} 1 = 2ⁿ − 2

(the number of assignments of each i to K or L, except the two assignments where K or L is empty), and, fixing w.l.o.g. k = 1, l = 2, we have

(4.4.89) Z_n = ∑_{K⊔L={1,...,n}} 1_{1∈K} 1_{2∈L} = 2^{n−2}

(the number of assignments of the elements 3, . . . , n to K or L). Therefore

(4.4.90) (1/Z_n) ∑_{K⊔L={1,...,n}} 1 ≤ 4,

and the claimed result (4.4.73) follows.

We can now give the formal version of the estimate (4.4.43).

Lemma 4.4.10. For X as in Theorem 4.1.3 and z ∈ C with Im z > 0 we have

(4.4.91) max_{i=1,...,n} |(1/n) ∑_{k,l≠i, l≠k} X_ik X̄_il Ḡ^{(i)}_kl| → 0 almost surely.

Proof. Let

(4.4.92) ∆_i = (1/n) ∑_{k,l≠i, l≠k} X_ik X̄_il Ḡ^{(i)}_kl = ∑_{k,l∈{1,...,n}\{i}, k≠l} Y_k Ȳ_l a_kl,

where

(4.4.93) a_kl = (1/n) Ḡ^{(i)}_kl,

and

(4.4.94) Y_k = X_ik.

Then we can apply (4.4.73) to ∆_i (its proof applies verbatim when one of the two factors is conjugated) to obtain that

(4.4.95) ‖∆_i‖_p ≤ c sup_{i,j≥1} ‖X_ij‖_p² √(∑_{k,l} |a_kl|²).

If p ≤ 16, then by (4.1.15)

(4.4.96) ‖∆_i‖_p ≤ c √(∑_{k,l} |a_kl|²) = c (1/n) √(∑_{k,l} |Ḡ^{(i)}_kl|²).

Now

(4.4.97) ∑_{k,l} |Ḡ^{(i)}_kl|² = Tr(G^{(i)} (G^{(i)})*),

and

(4.4.98) Tr(G^{(i)} (G^{(i)})*) ≤ n ‖G^{(i)}‖_op² ≤ n/(Im z)²

(as in (4.4.63)), so

(4.4.99) ‖∆_i‖_p ≤ c (1/√n)(1/Im z).

Thus for p ≤ 16

(4.4.100) P(max_{i=1,...,n} |∆_i| ≥ n^{−α}) ≤ n (c (1/√n)(1/Im z))^p n^{αp},

so that e.g. for p = 5 and α small enough the right-hand side is at most n^{−1−ε}, hence summable, and we get

(4.4.101) P(max_{i=1,...,n} |∆_i| ≤ n^{−α} for all n large enough) = 1.

The approximation (4.4.39) remains. To make it precise we need the following.

Lemma 4.4.11 (Hoffman-Wielandt inequality). If A and B are n × n Hermitian matrices, then

(4.4.102) ∑_{i=1}^n (λ_i(A) − λ_i(B))² ≤ ‖A − B‖²_HS

(where λ_1(A) ≤ . . . ≤ λ_n(A) and λ_1(B) ≤ . . . ≤ λ_n(B)).
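A quick numerical check of the Hoffman-Wielandt inequality for random Hermitian matrices (an added illustration):

import numpy as np

rng = np.random.default_rng(8)
n = 50

def rand_herm(n):
    M = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
    return (M + M.conj().T) / 2

A, B = rand_herm(n), rand_herm(n)
lhs = np.sum((np.linalg.eigvalsh(A) - np.linalg.eigvalsh(B)) ** 2)   # sorted eigenvalues
rhs = np.linalg.norm(A - B, 'fro') ** 2                              # Hilbert-Schmidt norm squared
print(lhs <= rhs + 1e-9, lhs, rhs)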

We now make (4.4.39) formal.

Lemma 4.4.12. For X as in Theorem 4.1.3 we have, for all z ∈ C with Im z > 0,

(4.4.103) max_{i=1,...,n} |(1/n) Tr(G^{(i)}) − (1/n) Tr(G)| → 0 almost surely.

Proof. Note that the eigenvalues of G are

(4.4.104) 1/(λ_k((1/√n)X) − z), k = 1, . . . , n.

Let X̂^{(i)} be the matrix X with the i-th row and column set to zero (not removed). Then X̂^{(i)} has the same eigenvalues as X^{(i)}, and in addition the eigenvalue zero. The eigenvalues of Ḡ^{(i)} are

(4.4.105) {1/(λ_k((1/√n)X̂^{(i)}) − z) : 1 ≤ k ≤ n} \ {−1/z} ∪ {0},

and in particular Tr(Ḡ^{(i)}) = Tr(G^{(i)}). Thus

(4.4.106) |Tr(G^{(i)}) − Tr(G)| ≤ c/|Im z| + ∑_{k=1}^n |1/(λ_k((1/√n)X) − z) − 1/(λ_k((1/√n)X̂^{(i)}) − z)|.

Now

(4.4.107) ∑_{k=1}^n |1/(λ_k((1/√n)X) − z) − 1/(λ_k((1/√n)X̂^{(i)}) − z)|
= ∑_{k=1}^n |(λ_k((1/√n)X̂^{(i)}) − λ_k((1/√n)X)) / ((λ_k((1/√n)X) − z)(λ_k((1/√n)X̂^{(i)}) − z))|
≤ (1/(Im z)²) ∑_{k=1}^n |λ_k((1/√n)X̂^{(i)}) − λ_k((1/√n)X)|
≤ (1/(Im z)²) √n √(∑_{k=1}^n |λ_k((1/√n)X̂^{(i)}) − λ_k((1/√n)X)|²).

By Lemma 4.4.11 we get

(4.4.108) ∑_{k=1}^n |λ_k((1/√n)X̂^{(i)}) − λ_k((1/√n)X)|² ≤ (1/n) ‖X − X̂^{(i)}‖²_HS ≤ (1/n) · 2n · sup_{1≤k,l≤n} |X_kl|².

Thus

(4.4.109) max_{i=1,...,n} (1/n) |Tr(G^{(i)}) − Tr(G)| ≤ c (1/|Im z| + 1/(Im z)²) (1/n + (1/√n) sup_{1≤i,j≤n} |X_ij|).

Next we show that

(4.4.110) (1/√n) sup_{1≤i,j≤n} |X_ij| → 0 almost surely,

which completes the proof. To see this, note that

(4.4.111) P((1/√n) sup_{1≤i,j≤n} |X_ij| ≥ n^{−α}) ≤ n² sup_{i,j} P((1/√n)|X_ij| ≥ n^{−α}) = n² sup_{i,j} P((1/n^{p/2})|X_ij|^p ≥ n^{−pα}) ≤ n^{2−p/2+pα} sup_{i,j} E[|X_ij|^p] ≤ c n^{2−p/2+pα}

for 1 ≤ p ≤ 16 by (4.1.15). Now e.g. with p = 8 and α small enough this is summable, so by Borel-Cantelli (4.4.110) follows.

We can now finally prove Proposition 4.3.1, the formal version of the equation s_n(z) ≈ −1/(z + s_n(z)).

Proof of Proposition 4.3.1. It suffices to consider z with Im z > 0, since s_n(z̄) is the complex conjugate of s_n(z). We have, as in (4.4.33),

(4.4.112) s_n(z) = −(1/n) ∑_{i=1}^n 1/(z + A_i), A_i = −(1/√n)X_ii + (1/n) ∑_{k,l≠i} X_ik X̄_il Ḡ^{(i)}_kl.

Consider

(4.4.113) |(1/n) ∑_{i=1}^n 1/(z + A_i) − 1/(z + s_n(z))|,

and note that by (4.4.112) this quantity equals |s_n(z) + 1/(z + s_n(z))|. We have

(4.4.114) |(1/n) ∑_{i=1}^n 1/(z + A_i) − 1/(z + s_n(z))| = |(1/n) ∑_{i=1}^n (A_i − s_n(z)) / ((z + A_i)(z + s_n(z)))|.

We have

(4.4.115) max_{i=1,...,n} |A_i − s_n(z)| ≤ max_{i=1,...,n} |(1/√n)X_ii| + max_{i=1,...,n} |(1/n) ∑_{k,l≠i, l≠k} X_ik X̄_il Ḡ^{(i)}_kl|
+ max_{i=1,...,n} |(1/n) ∑_{k≠i} |X_ik|² Ḡ^{(i)}_kk − (1/n) ∑_{k≠i} Ḡ^{(i)}_kk| + max_{i=1,...,n} |(1/n) Tr(G^{(i)}) − (1/n) Tr(G)|,

so by Lemmas 4.4.4, 4.4.10, 4.4.8 and 4.4.12 we have

(4.4.116) max_{i=1,...,n} |A_i − s_n(z)| → 0 almost surely.

It holds that

(4.4.117) Im s_n(z) ≥ 0 if Im z ≥ 0,

which with (4.4.116) implies that min_i Im(z + A_i) ≥ (1/2) Im z for n large enough, so for n large enough

(4.4.118) |(1/n) ∑_{i=1}^n 1/(z + A_i) − 1/(z + s_n(z))| ≤ (c/(Im z)²) max_{i=1,...,n} |A_i − s_n(z)|,

which proves that

(4.4.119) |s_n(z) + 1/(z + s_n(z))| → 0 a.s.

This also concludes the proof of the Wigner semicircle law, i.e. Theorem 4.1.3.


CHAPTER 5

Sample-covariance matrices

In this chapter we study some questions related to sample covariance matrices. The motivation comes from multivariate statistics.

5.1. Motivation

Assume that Y ∈ R^m is a random column vector with covariance matrix

(5.1.1) Σ := (Cov[Y_i, Y_j])_{i,j=1,...,m}.

We consider a situation where Σ is unknown, and we wish to estimate it from data x_1, . . . , x_n ∈ R^m that we assume were sampled according to the distribution of Y. A natural estimator of Σ_ij is

(5.1.2) Σ̂_ij = (1/n) ∑_{k=1}^n (x_{k,i} − µ̂_i)(x_{k,j} − µ̂_j),

where

(5.1.3) µ̂_i = (1/n) ∑_{k=1}^n x_{k,i}.

If the samples X_k, k ≥ 1, are independent copies of Y, the Law of Large Numbers implies that for all i, j

(5.1.4) Σ̂_ij → Σ_ij almost surely.

If we make x_1, . . . , x_n the columns of a matrix

(5.1.5) X = (x_1 . . . x_n),

we can write in matrix form

(5.1.6) Σ̂ = (1/n)(X − µ̂)(X − µ̂)^T,

where µ̂ denotes the m × n matrix each of whose columns is the vector (µ̂_1, . . . , µ̂_m)^T. When the samples x_1, . . . , x_n are modeled as random variables, the estimator Σ̂ is a random matrix.

Now consider the question of assessing, based on the estimator, whether

(5.1.7) Σ = I or Σ ≠ I,

that is, whether the components of Y are uncorrelated with variance one, or not. One way to do so is simply to check whether the on-diagonal elements of Σ̂ are approximately 1 and the off-diagonal elements approximately zero. An attractive alternative is to consider the spectrum of Σ̂ and check whether all the eigenvalues are close to 1.

In the most standard situation, where the number of samples n is much larger than the dimension m of the vectors, one can indeed expect that if Σ = I, then the eigenvalues of Σ̂ will all be close to 1; see Figure 5.1.1. However, it is not uncommon for m also to be large, of size comparable to n. For instance, if one is estimating the covariances among the 500 stocks in the S&P 500 index during 10 years, or roughly 2500 trading days, then m = 500 and n = 2500. Figure 5.1.2 shows the distribution of eigenvalues of the estimator Σ̂ when m = 500 and n = 2500, computed from X_i with covariance I. We see that even for this synthetic data, where the covariance matrix is known to be I, there are eigenvalues far away from 1, so that "the eigenvalues cluster close to one" is a far too restrictive condition for testing whether Σ = I or not.¹

This motivates the study of the eigenvalue distribution of the estimator Σ̂. For simplicity, in what follows we will assume that E[X_i] = 0 and drop the empirical mean terms, thus studying simply (1/n)XX^T.

5.2. The Marchenko-Pastur law

In this Chapter we use the Stieltjes transform method to derive thelimiting ESD of 1

nXXT , which is known as the Marchenko-Pastur law.

Let Xij, 1 ≤ i, j <∞ be IID random variables with

(5.2.1) E [Xij] = 0,

(5.2.2) E[|Xij|2

]= 1.

Let α > 0 and

(5.2.3) m = bαnc

and let

(5.2.4) Xn = (Xij)1≤i≤m,1≤j≤n

be a bαnc × n matrix. Let

(5.2.5) sn (z) = sµ 1nXnX

Tn

(z) .

1Figure 9.1 [3] shows the spectrum when one computes the the empirical co-variance matrix from actual S&P data.

Page 77: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

5.2. THE MARCHENKO-PASTUR LAW 77

0.0 0.5 1.0 1.5 2.0 2.50.0

0.2

0.4

0.6

0.8

1.0

Eigenvalue distribution of a 3x3 sample covariance matrix of IID Gaussians for n=2500

(5.1.8)

1.00329 −0.00252662 −0.0492262−0.00252662 1.00908 0.00528853−0.0492262 0.00528853 0.958605

(5.1.9) 1.03594, 1.00826, 0.926774

Figure 5.1.1. 3 × 3 empirical covariance matrix andeigenvalues for independently sampled Gaussian vectorswith n = 2500.

0.5 1.0 1.5 2.00

5

10

15

20

25Eigenvalue distribution of a 500x500 Wishart distribution with m=2500

Figure 5.1.2. Eigenvalues of 500×500 empirical covari-ance matrix of independently sampled Gaussian vectorswith n = 2500.

Page 78: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

5.2. THE MARCHENKO-PASTUR LAW 78

Lemma 5.2.1. (Sherman–Morrison formula) Let A be an invertiblen × n matrix, and let u ∈ Rn. Then if 1 + uTA−1v 6= 0 the matrixA+ uvT is invertible and

(5.2.6)(A+ uvT

)−1= A−1 − A−1uvTA−1

1 + vTA−1u.

Proof. One can verify that

(5.2.7)(A+ uvT

)(A−1 − A−1uvTA−1

1 + vTA−1u

)= I.

We now prove the main estimate needed to prove the convergenceto the Marchenko-Pastur law, namely the approximate equation

(5.2.8)αsn (z)

1 + αsn (z)− zαsn (z)− α ≈ 0.

Proposition 5.2.2. (Main estimate) Provided there is an ε > 0such that

(5.2.9) E[|Xij|4+ε] ≤ c <∞,

we have for all z ∈ C with Imz > 0 that

(5.2.10)∣∣∣∣ αsn (z)

1 + αsn (z)− zαsn (z)− α

∣∣∣∣ a.s.→ 0.

Proof. Let xj, j ≥ 1, denote the column vectors (Xij)1≤i≤m , j ≥ 1.Let

(5.2.11) A = XnXTn =

n∑i=1

xixTi .

Let

(5.2.12) G = (A− znI)−1 .

Note that

(5.2.13) sn (z) = 1mTr(

1nXnX

Tn − zI

)−1

= αnTr (A− znI)−1 ,

where

(5.2.14) αn =m

n= α +O

(1

n

).

We have on the one hand

(5.2.15) Tr((A− nzI) (A− nzI)−1) = Tr (Im×m) = m.

Page 79: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

5.2. THE MARCHENKO-PASTUR LAW 79

On the other hand(5.2.16)

(A− nzI) (A− nzI)−1 = A (A− nzI)−1 − nz (A− nzI)−1

= AG− nzG,

and

(5.2.17) AG =n∑i=1

xixTi G.

Also xkxTkG is a rank one matrix, and so has trace equal to it’s onenon-zero eigenvalue, which is xk · (xkG) = xTkGxk (alternatively usingthe cyclic property the trace Tr

(xkx

TkG)

= Tr(xTkGxk

)). Therefore

(5.2.18) Tr (AG) =n∑i=1

xTi Gxi,

and we obtain the equation

(5.2.19) m =n∑i=1

xTi Gxi − nzTrG,

giving

(5.2.20) αn =1

n

n∑k=1

xTi Gxi − zTrG,

i.e.

(5.2.21)m

n=

1

n

n∑k=1

xTi Gxi − zαnsn (z) .

Consider

(5.2.22) A(i) =n∑

k=1,k 6=i

xTk xk,

and

(5.2.23) G(i) =(A(i) − nzI

)−1.

Using (5.2.6) we have

(5.2.24)G =

(A(i) − nzI + xix

Ti

)−1

= G(i) − G(i)xixTi G

(i)

1+xTi G(i)xi

.

Page 80: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

5.2. THE MARCHENKO-PASTUR LAW 80

This gives

(5.2.25)

xTi Gxi

= xTi G(i)xi − xTi G

(i)xixTi G

(i)xi1+xTi G

(i)xi.

= xTi G(i)xi

(1− xTi G

(i)xi1+xTi G

(i)xi

)= xTi G

(i)xi1

1+xTi G(i)xi

=xTkG

(i)xk1+xTi G

(i)xi.

Thus we get that

(5.2.26) αn =1

n

n∑i=1

xTi G(i)xi

1 + xTi G(i)xi− zαnsn (z) .

Claim:

(5.2.27) maxi

∣∣xTi G(i)xi −Tr(G(i)

)∣∣ a.s.→ 0.

Proof of claim: Let

(5.2.28) ∆diagi =

∑k

G(i)kk

|xi,k|2 − 1

,

and

(5.2.29) ∆6=i =∑k 6=l

G(i)kkxi,kxi,k.

Then for each i

(5.2.30)∣∣xTi G(i)xk − Tr

(G(i)

)∣∣ ≤ ∣∣∣∆diagi

∣∣∣+∣∣∣∆6=i ∣∣∣ .

By (4.4.52), used conditioned on G(i) we have,

(5.2.31) ‖∆diagi ‖p ≤

√∑k

∣∣∣G(i)kk

∣∣∣2,and by (4.4.73), also used conditioned on G(i),

(5.2.32) ‖∆6=i ‖p ≤√∑

k,l

∣∣∣G(i)kl

∣∣∣2.Now

(5.2.33)

√∑k

∣∣∣G(i)kk

∣∣∣2 ≤ √∑k,l

∣∣∣G(i)kl

∣∣∣2=

√Tr (G(i)G(i)∗)

≤√n‖G(i)‖2

op

≤√n 1nImz = 1√

nImz ,

Page 81: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

5.2. THE MARCHENKO-PASTUR LAW 81

since

(5.2.34) ImG(i) ≤ 1

nImz.

Therefore using the moment assumption (5.2.9)

(5.2.35)

P(∣∣∣∆diag

i

∣∣∣ , ∣∣∣∆6=i ∣∣∣ ≥ 1logn

for some i)

≤ nP(∣∣∣∆diag

i

∣∣∣p ≥ ( 1logn

)p)+ P

(∣∣∣∆6=i ∣∣∣p ≥ ( 1logn

)p)≤ n (logn)p

np2 (Imz)p

,

for some p > 4. But with p > 4 the LHS is summable, so by Borel-Cantelli

(5.2.36) maxi

∣∣∣∆diagi

∣∣∣+∣∣∣∆6=i ∣∣∣ a.s.→ 0.

This proves the claim.Claim:

(5.2.37)∣∣Tr (G(i)

)− αnsn (z)

∣∣ ≤ 1

nImz.

Proof of claim: The LHS equals

(5.2.38)∣∣Tr (G(i)

)− Tr (G)

∣∣ .By (5.2.24) this equals

(5.2.39)∣∣∣∣Tr(G(i)xix

Ti G

(i)

1 + xTi G(i)xi

)∣∣∣∣ .Since A(i) − znI is Hermitian also G(i) is giving

(5.2.40)Tr(G(i)xix

Ti G

(i))

= Tr(G(i)xi

(G(i)xi

)∗)=∣∣G(i)xi

∣∣2 ,so

(5.2.41)

∣∣Tr (G(i))− Tr (G)

∣∣ =|G(i)xi|2|1+xTi G

(i)xi|≤ |G(i)xi|2|ImxTi G(i)xi| .

Since G(i) is Hermitian it is diagonalizable. Let x denote the vector xiwritten in this bases and let 1

λk(A(i))−znbe the eigenvalues of G(i). We

Page 82: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

5.2. THE MARCHENKO-PASTUR LAW 82

have that the above expression equals

(5.2.42)

∑k

∣∣∣∣ 1

λk(A(i))−znxk

∣∣∣∣2∣∣∣∣∑k Im1

λk(A(i))−znix2k

∣∣∣∣ =

∑k

1

|λk(A(i))−zn|2 xk∣∣∣∣∑k Im1

λk(A(i))−znix2k

∣∣∣∣ .Note that

(5.2.43) Im1

λk (A(i))− nz=

nImz|λi (A(i))− nz|2

.

Thus (5.2.42) equals 1nImz . This proves the claim.

Now (5.2.27) and (5.2.37) and the fact that |sn (z)| ≤ 1Imz and

|αn − α| ≤ 1nshow that if we define

(5.2.44) ∆i = xTi G(i)xi − αsn (z) ,

then

(5.2.45) maxi|∆i|

a.s.→ 0.

From (5.2.26) we get that(5.2.46)α = 1

n

∑nk=1

αsn(z)+∆i

1+αsn(z)+∆i− zαsn (z) + (α− αn) + zsn (z) (α− αn)

= αsn(z)1+αsn(z)

− zαsn (z) + 1n

∑nk=1

∆i

(1+αsn(z)+∆i)(1+αsn(z))+ (α− αn) + zsn (z) (α− αn) ,

provided 1 + αsn (z) 6= 0 (that 1 + αsn (z) + ∆i 6= 0 is implicit in thestatement (5.2.26)). Now

(5.2.47)|1 + αsn (z)| = |z+αzsn(z)|

|z|

≥ Im(z+αzsn(z))|z|

≥ Imz|z| ,

where we used that Imsn (z) ≥ 0 for Imz > 0. Thus indeed 1+αsn (z) 6=0, and also by (5.2.45) we have for n large enough that

(5.2.48) |1 + αsn (z) + ∆i| ≥1

2

Imz|z|

.

These two inequalities allow us to bound

(5.2.49)

∣∣∣ 1n

∑nk=1

∆i

(1+sn(z)+∆i)(1+sn(z))

∣∣∣≤ cmaxi|∆i|

( Imz|z| )

2

a.s.→ 0.

Using this, |αn − α| ≤ 1nand |zsn (z)| ≤ |z| 1

|Imz| in (5.2.46) the claim(5.2.10) follows.

Page 83: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

5.2. THE MARCHENKO-PASTUR LAW 83

Remark 5.2.3. The equation

(5.2.50)αs

1 + αs− zαs− α = 0,

has solutions

(5.2.51) s± = −1

2+

1− α2αz

±

√(z + α− 1)2 − 4αz

(2zα)2 .

With the convention√reiθ =

√rei

θ2 for 0 ≤ θ < 2π, we have that

Im√x ≥ 0 for all x, and Im1−α

2zα= −1−α

2αImz|z|2 , so if Imz > 0 then

certainly Ims− < 0.

Since for any measure µ and z with Imz > 0 we have Imsµ (z) ≥ 0this disqualifies s− as a solution for such Imz > 0. It is convenient tofactorize(5.2.52) (αz + α− 1)2 − 4αz = − (z − aα) (bα − z) ,

where(5.2.53) aα =

(1−√α)2 and bα =

(1 +√α)2.

We thus define

(5.2.54) s (z) := −1

2+

1− α2αz

+

√− (z − aα) (bα − z)

(2zα)2 for Imz > 0,

and expect this to be the Stieltjes transform of the limiting ESD1nXnX

Tn (for Imz > 0).

We now show that (5.2.10) indeed implies that s (z) is the limit ofsn (z).

Lemma 5.2.4. Under the conditions of Proposition 5.2.2, it holdsfor all z ∈ C with Imz > 0 that

(5.2.55) P (sn (z)→ s (z)) = 1,

for s (z) as in (5.2.54).

Proof. By (5.2.10) we have for a.e. ω ∈ Ω that

(5.2.56)∣∣∣∣ αsn (ω, z)

1 + αsn (ω, z)− αzsn (ω, z)− α

∣∣∣∣→ 0.

Any convergent subsequence of sn (ω, z) must convergence to some swith 1 +αs 6= 0 (since if 1 +αsn (ω, z)→ 0 then the right-hand side of(5.2.56) diverges) and Ims ≥ 0 (since Imsn (ω, z) has the same sign asImz), and this s must satisfy

(5.2.57)αs

1 + αs− zαs− α = 0 and Ims ≥ 0.

Page 84: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

5.2. THE MARCHENKO-PASTUR LAW 84

As seen in Remark 5.2.3 the equation has two solutions s±, and Ims− <0 if Imz > 0, so then the limit point must be

(5.2.58) s+ = s (z) ,

(recall (5.2.52)). The sequence sn (ω, z) is sequentially compact (since|sµ (z)| ≤ 1

|Imz| for any µ, z) and each subsequence has limit s (z); thisproves that in fact

(5.2.59) sn (ω, z)→ s (z) ,

(and as a by-product we obtain that Ims+ ≥ 0). Furthermore thisholds for a.e. ω ∈ Ω.

Heuristically we can already recover the limit ESD by invertingthis Stieljes transform. Heuristically (cf. (5.2.52)), the density of thecontinuous part of a measure µ at E should be

(5.2.60) limη↓0

1

πImsµ (E + iη) .

Considering

(5.2.61) limη↓0

1

πIms (E + iη) ,

one easily sees that the first two terms having imaginary part tendingto zero, so this equals

(5.2.62) limη↓0

1

π

√− (z − aα) (bα − z)

(2zα)2 for z = E + iη.

We will see that for all z with Imz ≥ 0 it holds

(5.2.63) Im− (z − aα) (bα − z)

(2zα)2 ≥ 0,

so that the argument of√· is always in the region where

√· is contin-

uous. Thus we can take the limit inside the square root to get(5.2.64)

Im1

π

√− (E − aα) (bα − E)

(2Eα)2 =

√(E−aα)(bα−E)

2πEif E ∈ (aα, bα) ,

0 i f E /∈ (aα, bα) .

This motivates the definition

(5.2.65) fMP (x;α) =

√(x− aα) (bα − x)

2πx,

which we call the Marchenko-Pastur density.Moreover, the function s (z) has singularities on the real line. From

the example µ = cδu+ν, for a measure ν, which has Stieltjes transform

Page 85: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

5.2. THE MARCHENKO-PASTUR LAW 85

sµ (z) = cu−z + sν we see that a pole of the form c

u−z should correspondto an atom of weight c at u. The function s (z) has a pole at z = 0,which should indicate the presence of an atom at 0. To recover it’sweight, first note that one can check that for z close but with positiveimaginary part we have

(5.2.66)

√−(z−aα)(bα−z)

(2zα)2 ≈ −√aαbα2zα

= −√

(1−α)2

2zα

= − |1−α|2zα

.

Thus the pole of s (z) at z = 0 is approximately

(5.2.67)1zα−1−1

2− 1

z

|1−α−1|2

= 1−z

(1−α−1+|1−α−1|

2

).

Now since x+|x|2

= x1x≥0 we have that this equals 1−z (1− α−1) 1α>1.

We thus expect the limiting ESD to have an atom at 0 if α > 1 (indeedsimple considerations show that there must be an atom at 0 in this case,see exercise sheet) of weight 1− α−1. This motivates the definition

(5.2.68) µMP,α := δ0

(1− α−1

)1α>1 + fMP (x;α) dx,

of the Marchenko-Pastur law. We expect this to be the limiting ESD.One way we could proceed with a proof that the ESD of 1

nXnX

Tn

converges to µMP,α would be to verify that µMP,α really defines a prob-ability (check that the total mass is 1), compute its Stieltjes transformto verify that it’s really given by s (z), and then conclude from (5.2.4)(and Lemma 4.3.4) that µ 1

nXXT

w→ µMP,α. Similarly to what we did forthe semicircle law in Section 4.3, we will take slightly different routewith an auxiliary argument that verifies that s (z) is the Stieltjes trans-form of a probability, and then carrying out a rigorous version of theabove heuristic inversion of the Stieltjes transform.

To prove that s (z) is a Stieltjes transform we need to prove it satis-fies the following property (that any Stieltjes transform must satisfy).

Lemma 5.2.5. The function s (z) from (5.2.54) is continuous inz : Imz > 0.

Proof. The first two terms in (5.2.54) are clearly continuous forImz > 0. We must thus only check that

(5.2.69) z →

√− (z − aα) (bα − z)

(2zα)2 ,

Page 86: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

5.2. THE MARCHENKO-PASTUR LAW 86

is continuous for such z. The follows if the range of

(5.2.70) z → − (z − aα) (bα − z)

(2zα)2 =z2 − 2 (1 + α) z + (1− α)2

(2zα)2

does not include (0,∞) (where we used that aα + bα = 2 (1 + α) andaαbα = (1− α)2 or equivalently (setting z = 1

x) if the range of

(5.2.71) x→ 1− 2 (1 + α)x+ (1− α)2 x2,

for Imx > 0 does not include (0,∞). Writing this as

(5.2.72) (1− α)2

(x− 1 + α

(1− α)2

)2

+(1− α)2 − (1 + α)2

(1− α)4

,

we see that this is real only if x is real (which is ruled out) or x =1+α

(1−α)2 + iv for some v > 0, but then the expression equals

(5.2.73) (1− α)2

−v2 +

(1− α)2 − (1 + α)2

(1− α)4

,

which is negative for all v > 0. This shows that the range does notintersect (0,∞).

The proof that s (z) is indeed a Stieltjes transform also needs abound on the operator norm in a special case for the distribution ofXij. This is a version of 3.3.1 for non-square matrices, the proof isalmost identical.

Lemma 5.2.6. Let α > 0 and let Xij be IID Rademacher randomvariables, and Xn as (5.2.4). There is a constant c = c (α), such that

(5.2.74) P(‖Xn‖op ≤ c

√n for n large enough

)= 1.

Proof. Note that

(5.2.75) ‖Xn‖op = supa∈Rm,b∈Rn:|a|=1,|b|=1

aTXnb.

Let Σk ⊂ Rk be a set such that for all x ∈ Rk with |x| = 1 we have

(5.2.76) infy∈Σk|x− y| ≤ 1

10,

and

(5.2.77)∣∣Σk∣∣ ≤ e10k.

Let a, b be such that

(5.2.78) ‖Xn‖op = aTXnb.

Page 87: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

5.2. THE MARCHENKO-PASTUR LAW 87

Let a ∈ Σm, b ∈ Σn be the closest points to a, b in the correspondingsets, and note that

(5.2.79)

‖X‖op =∣∣aTXnb

∣∣=

∣∣∣aTXnb+ aTXn

(b− b

)∣∣∣=

∣∣∣aTXnb+ (a− a)T Xnb+ aTXn

(b− b

)∣∣∣≤

∣∣∣aTXnb∣∣∣+∣∣∣(a− a)T Xnb

∣∣∣+∣∣∣aTXn

(b− b

)∣∣∣≤

∣∣∣aTXnb∣∣∣+ 2

10‖X‖op.

So

(5.2.80) ‖X‖op ≤1

2sup

a∈Σm,b∈ΣnaTXnb.

Using the inequality ex+e−x

2≤ e

12x2 we have for any a, b with |a| = |b| =

1

(5.2.81)

P(aTXnb ≥ x

)= P

(∑ij ajbjXij ≥ x

)≤ E

[exp

(λ∑

ij ajbjXij

)]e−λx

=(∏

ijeλajbi+e−λajbi

2

)e−λx

≤ e12

∑ij λ

2a2j b

2i e−λx

≤ e12λ2−λx

≤ e−12x2 by setting λ = x.

Thus

(5.2.82)

P (‖Xn‖op ≥ u√n)

≤ P(supa∈Σ1,b∈Σ2 aTXnb ≥ 1

2u√n)

≤ |Σm| × |Σn| × e− 18u2n

≤ e10(m+n)− 18u2n

≤ e(20 max(1,α)− 1uu2)n

which for large enough u (depending on α) is summable.

Next we want to conclude that s (z) is a Stieltjes transform. Forthis we must define it also in the half-plane with negative imaginarypart. Since any Stieltjes transform must satisfy this relation we defineit via

(5.2.83) s (z) := ¯s (¯)z for Ims < 0.

Page 88: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

5.2. THE MARCHENKO-PASTUR LAW 88

(Since√x = −

√x in our convention this means that

(5.2.84) s (z) = −1

2+

1− α2zα

√− (z − aα) (bα − z)

(2zα)2 , for Imz < 0).

Lemma 5.2.7. The function s (z) is the Stieltjes transform of aprobability measure.

Proof. Consider the special case where Xij are Rademacher ran-dom variables (P (Xij = ±1) = 1

2). By Lemma 5.2.6 there is constant

C such that for a.e. ω we have

(5.2.85) ‖Xn‖op ≤ C√n for n large enough.

But

(5.2.86) ‖ 1

nXnX

Tn ‖op ≤

1

n‖Xn‖2

op,

so in fact

(5.2.87) µ 1nXn(ω)XT

n (ω)

([−C2, C2

]c)= 0 for n large enough.

By Lemma 5.2.4 we also have for each z with Imz > 0 that

(5.2.88) sµ 1nXn(ω)XTn (ω)

(z)→ s (z) a.s.

We thus have that the event

(5.2.89)µ 1nXnXT

nis tight

⋂z∈Q+iQ∩(0,∞)

sn (z)→ s (z)

,

is an almost sure event. For any ω in this event the sequence

(5.2.90) µ 1nXn(ω)XT

n (ω)

is a tight sequence of measures, so there is a subsequence nk and aprobability measure µ so that

(5.2.91)1

nkXnk (ω)XT

nk(ω)

w→ µ (ω) .

By Lemma 4.2.3 we have

(5.2.92) s 1nkXnk (ω)XT

nk(ω) (z)→ sµ(ω) (z) ,∀z ∈ C\R.

But also for all ω in the event (5.2.89) we have for all z ∈ Q+iQ∩(0,∞)

(5.2.93) s 1nkXnk (ω)XT

nk(ω) (z)→ s (z) .

This implies that

(5.2.94) sµ(ω) (z) = s (z) for z ∈ Q + iQ ∩ (0,∞) .

Page 89: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

5.2. THE MARCHENKO-PASTUR LAW 89

But both sµ(ω) (z) and s (z) are continuous for z : Imz > 0 (recallLemma 5.2.5), and Q + iQ ∩ (0,∞) is dense in this set, so in fact

(5.2.95) s (z) = sµ(ω) (z) for z with Im (z) > 0.

Since s (z) = ¯s (z) and sµ(ω) (z) = ¯sµ(ω) (¯)z this implies that in fact

(5.2.96) s (z) = sµ(ω) (z) for z ∈ C\R.

Now that we know s (z) is the Stieltjes transform of a probabil-ity measure, we can use the inversion formula (4.2.3) to recover theprobability measure.

Lemma 5.2.8. Let α > 0 and s (z) as in (5.2.54) (and (5.2.84))and µMP,α as in (5.2.68). Then

(5.2.97) sµMP ,α = s.

Proof. By Lemma 5.2.7 we know that there exists a µ such thatsµ = s. We now compute µ explicitly. By (4.2.3) we have for all a < b

(5.2.98)12µ (a) + µ ((a, b)) + 1

2µ (b)

= limη↓01π

∫Ims (E + iη) dE.

If 0 /∈ (a, b) then s (z) is bounded on the interval. Therefore we can inthis case take the limit inside to get

(5.2.99)12µ (a) + µ ((a, b)) + 1

2µ (b)

=∫

limη↓01πIms (E + iη) dE.

As in (5.2.61)-(5.2.65) we have

(5.2.100) limη↓0

1

πIms (E + iη) = fMP,α (E) .

By by taking the limit b ↓ a and noting that µMP,α (b) ↓ 0 as b ↓ 0(otherwise µMP,α can not be a probability measure) and µMP ((a, b)) ↓ 0we get that µMP (a) = 0 for all a 6= 0. Thus we have for all a < bwith 0 /∈ (a, b) that

(5.2.101) µ ((a, b)) =

∫ b

a

fMP,α (x) dx.

It thus only remains to compute

(5.2.102) µ (0) .For ε > 0 consider

(5.2.103) limη↓0

∫ ε

−ε

1

πIms (z) dE where z = E + iη.

Page 90: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

5.2. THE MARCHENKO-PASTUR LAW 90

Using that√uv = ±

√u√v one gets

(5.2.104)

√−(z−aα)(bα−z)

(2αz)2

=√

z2−2z(1+α)+(1−α)2

(2αz)2

= w (z)|1−α−1|

2z

√1 + z2−2z(1+α)

(1−α)2 ,

where w (z) ∈ −1, 1 can in principle depend on z. But

(5.2.105) Im(z2 − 2z (1 + α)

)= 2 (Rez − (1 + α)) Imz < 0

for ε chosen small enough depending on α, so for such ε

(5.2.106) z →

√1 +

z2 − 2z (1 + α)

(1− α)2 ,

is a continuous map in z : Imz > 0. Since z →√−(z−aα)(bα−z)

(2αz)2 is alsocontinuous (since s (z) is and all other terms in s (z) are) in this region,and also z → 1

zis , in fact w (z) must be continuous and thus constant

in this region z : Imz > 0. Thus we can determine the value of w (z)by computing it for any arbitrary z. By taking z → ∞ along theimaginary line we have that first square root on the LHS approaches 1,and the one one the RHS approaches

√z2 = z, from which we conclude

that in fact w (z) = 1. Furthermore (5.2.105) implies

(5.2.107)

∣∣∣∣∣√

1 +z2 − 2z (1 + α)

(1− α)2 − (−1)

∣∣∣∣∣ ≤ c |z| ,

so that we get

(5.2.108)

∣∣∣∣∣√− (z − aα) (bα − z)

(2αz)2 +|1− α−1|

2z

∣∣∣∣∣ ≤ c |z| .

Thus letting

(5.2.109) r =1− α−1

2+|1− α−1|

2=(1− α−1

)1α>1.

we have

(5.2.110)∣∣∣∣Ims (z)− r

−z

∣∣∣∣ ≤ c |z| ,

and so

(5.2.111)∣∣∣∣∫ ε

−ε

1

πIms (z) dE −

∫ ε

−ε

1

π

r

−zdE

∣∣∣∣ ≤ cε

Page 91: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

5.2. THE MARCHENKO-PASTUR LAW 91

Using e.g. the Residue theorem we have that for η ↓ 0

(5.2.112)1

π

∫ ε

−ε

1

π

r

−zdE → r.

Thus taking the limit η ↓ 0 we have

(5.2.113) |µ ((−ε, ε))− r| ≤ cε.

Taking now ε ↓ 0 proves that

(5.2.114) µ (0) =(1− α−1

)1α>1.

This shows that µ = µMP,α so sµMP,α= sµ = s.

Theorem 5.2.9. (Marchenko-Pastur law) Let Xij, 1 ≤ i, j <∞ beIID random variables as Proposition 5.2.2, that is

(5.2.115) E [Xij] = 0,

(5.2.116) E[|Xij|2

]= 1,

and for some ε > 0

(5.2.117) E[|Xij|4+ε] ≤ c <∞.

Let α > 0, and let Xn = (Xij)1≤i≤bαnc,1≤j≤n be a bαnc × n matrix. Wehave

(5.2.118) P(µ 1nXnXT

n

w→ µMP,α

)= 1.

Proof. By Lemma 5.2.4, Lemma 5.2.8 we have that

(5.2.119) P (sn (z)→ sMP,α (z)) ∀z ∈ C\R.

By Lemma 4.3.4 this implies

(5.2.120) P (sn (z)→ sMP,α (z) ∀z ∈ C\R) .

By Lemma 4.2.3 this implies that

(5.2.121) P(µ 1nXnXT

n

w→ µMP,α ∀z ∈ C\R).

As was the case when we proved the semi-circle law for Wigner ran-dom matrices, we are not able to conclude that the maximum eigen-value is close to the right end-point of the support of the Marchenko-Pastur density, i.e. we can not deduce that

(5.2.122) λmax

(1

nXnX

Tn

)→(1 +√α)2.

Page 92: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

5.3. SAMPLE COVARIANCE MATRICES FOR CORRELATED VECTORS 92

With a more involved computation, which we omit, it is however pos-sible to show that this result holds in almost sure sense as long asE[|Xij|4

]<∞.

Theorem 5.2.10 ([2]). If α ∈ (0,∞) and Xi,j, i ≥ 1, j ≥ 1 areindependent with E [Xij] = 0, E

[X2ij

]= 1, E

[|Xij|4

]< ∞ and Xn is

as in Theorem 5.2.9 then

(5.2.123) λmax

(1

nXnX

Tn

)a.s.→(1 +√α)2.

Remark 5.2.11. The moment condition is weaker than what werequired in Theorem 5.2.9, which is morally speaking a weaker result.But this is an artifact of the proof we presented for Theorem 5.2.9.With a different proof the same conclusion can be shown to hold truewithout any moment condition at all beyond E

[X2ij

]= 1 [8].

5.3. Sample covariance matrices for correlated vectors

The results of the last section tell us that if we compute an em-pirical covariance matrix in the regime where n is comparable to mand see eigenvalues distributed approximately distributed according tothe Marchenko-Pastur law, then this is consistent with the unknownunderlying covariance matrix Σ being the identity. In this section weexamine what effect the presence of correlations in the random vectorhas on the ESD.

5.3.1. One factor model. We consider a special case where β isa vector in Rm and

(5.3.1) xi = yi + βzi,

for y1, y2, . . . that are IID independent vector valued with E [y2i ] = 1,

z1, z2, . . . are IID and real with E [z2i ] = 1, and yi, i ≥ 1, zi, i ≥ 1, are

independent.. This can be thought of as a situation where the entriesof the random vector are affected by some “global” randomness zi, witha multiplier taken from β, and a component yi which is independent foreach entry. In a model of stock returns, one could think of zi are themarket return, and yi the excess return of an individual stock relativeto the market.

The covariance matrix of xi can be computed as

(5.3.2) Σ = I + ββT ,

which a rank one perturbation of the identity. The eigenvalues of thismatrix are

(5.3.3) 1 and 1 + |β|2 .

Page 93: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

5.3. SAMPLE COVARIANCE MATRICES FOR CORRELATED VECTORS 93

We see that if e.g. βj = 1 for all j, then the largest eigenvalue is 1 +m,which is much larger than the other eigenvalues. Furthermore notethat the eigenvector of this larger eigenvalue is β itself. Our goal is toinvestigate to what extent we can detect the presence of this eigenvaluein the empirical covariance matrix, when m = αn for α > 0.

To study the almost sure behavior of the empirical covariance ma-trix we define the random variables yi, zi for different n on the sameprobability space. Thus let

(5.3.4) Yij, i ≥ 1, j ≥ 1, Zi, i ≥ 1,

be independent real-valued, such that

(5.3.5) Yij are IID with E [Yij] = 0 and E[Y 2ij

]= 1,

and

(5.3.6) Zij are IID with E[Zij

]= 0 and E

[Z2ij

]= 1.

Let

(5.3.7) α > 0 and m = bαnc.

We construct yi,k = Yk,i so that

(5.3.8) Yn = (Yij)1≤i≤m,1≤j≤n,

is the matrix with y1, . . . , yn as columns. Furthermore let

(5.3.9) Zn =

Z1

. . .

Zn

,

Also let

(5.3.10) βij, i ≥ 1, k ≥ 1,

be deterministic numbers and

(5.3.11) βn =

β1,n

. . .

βm,n

.

Then

(5.3.12) Xn = Yn + βnZTn ,

be the matrix with the measurements x1, x2, . . . , xn as columns. Theempirical covariance matrix of the xi (assuming that it is known that

Page 94: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

5.3. SAMPLE COVARIANCE MATRICES FOR CORRELATED VECTORS 94

all columns have mean zero) is then(5.3.13)

1nXnX

Tn = 1

n

(Yn + βnZ

Tn

) (Yn + βnZ

Tn

)T= 1

nYnY

Tn + 1

nYnZnβ

Tn + 1

nβnZ

Tn Y

Tn +

1

nβnZnZ

Tn β

Tn︸ ︷︷ ︸

1n|Zn|2βnβTn .

.

We now investigate the relative sizes of the different matrices in theabove sum. If we assume that E

[|Yij|4

]<∞ then Theorem 5.2.10 ap-

plies to 1nYnY

Tn and we have that ‖ 1

nYnY

Tn ‖op

a.s.→ (1 +√α)

2 and by The-orem 5.2.9 (recall also Remark 5.2.11) that the eigenvalues of 1

nYnY

Tn

are approximately distributed according to the Marchenko-Pastur law.For the rank-1 matrix 1

n|Zn|2 βnβTn we have by the Strong Law of

Large Numbers that

(5.3.14)1

n|Zn|2 =

1

n

n∑i=1

Z2i

a.s.→ 1,

so that

(5.3.15) ‖ 1n|Zn|2 βnβTn ‖op = (1 + o (1)) ‖βnβTn ‖op

= (1 + o (1)) |βn|2 .Thus if e.g. βij = 1 the single eigenvalue and the operator norm ofthe last matrix is close to m, which dominates the operator norm of1nYnY

Tn .

The matrix

(5.3.16)1

nYnZnβ

Tn =

1

n(YnZn) βTn ,

and its transpose 1nβnZ

Tn Y

Tn are rank 1 with operator norm

(5.3.17) ‖ 1

nYnZnβ

Tn ‖op =

1

n(YnZn) · βn =

1

n

∑1≤i≤m

∑1≤j≤n

YijZjβi,n.

The RHS of has mean zero and variance

(5.3.18)1

n2

∑1≤i≤m

∑1≤j≤n

β2i,n =

|βn|2

n,

suggesting that it is typically of size |βn|√n, which is much smaller than

the operator norm of 1n|Zn|2 βnβTn (or if |βn| ↓ 0 fast then at least it is

smaller than the operator norm of 1nY Y T ).

We thus expect to be able to neglect the two matrices 1nβnZ

Tn Y

Tn

and 1nβnZ

Tn Y

Tn . The next lemma shows this rigorously in an almost

sure sense.

Page 95: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

5.3. SAMPLE COVARIANCE MATRICES FOR CORRELATED VECTORS 95

Lemma 5.3.1. If for some ε > 0, E[|Y11|2+ε] ,E [∣∣∣Z1

∣∣∣2+ε]< ∞

then we have

(5.3.19)‖ 1nβnZ

Tn Y

Tn ‖op

|βn|a.s.→ 0.

Proof. Write

(5.3.20)1

n

∑1≤i≤m

∑1≤j≤n

YijZjβi,n =1

n

∑j

UjVj,

for

(5.3.21) Uj =∑

1≤i≤m

βi,nYij,

and

(5.3.22) Vj = Zj.

Since Uj and Vj are independent we can apply the Marcinkiewicz-Zygmund inequality (4.4.72) to get that

(5.3.23) ‖∑j

UjVj‖p ≤ supj‖Uj‖p sup

j‖Vj‖p

√n.

We have supj ‖Vj‖p = ‖Z1‖p, and by the Marcinkiewicz-Zygmund in-equality (4.4.52)

(5.3.24) ‖Uj‖p ≤ ‖Y11‖p |βn| .Thus

(5.3.25) ‖ 1

n

∑j

UjVj‖p ≤1√n‖Z1‖p‖Y11‖p |βn| ,

so

(5.3.26) P

(∣∣∣∣∣ 1n∑j

UjVj

∣∣∣∣∣ ≥ |βn|log n

)≤ ‖Z1‖pp‖Y11‖pp

(log n√n

)p,

which is summable for p = 2 + ε.

This already allows us to show that the largest eigenvalue of theempirical covariance matrix 1

nXnX

Tn is indeed close to |βn|2 as long as

|βn| → ∞ (e.g. if βij = 1).

Lemma 5.3.2. If E[|Y11|4

]<∞, E

[|Z11|2+ε] <∞ for some ε > 0,

and if the βij are such that |βn| → ∞ then

(5.3.27)λmax

(1nXnX

Tn

)|βn|2

a.s.→ 1.

Page 96: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

5.3. SAMPLE COVARIANCE MATRICES FOR CORRELATED VECTORS 96

Proof. By the triangle inequality for the operator norm we have

(5.3.28) ‖A‖op − ‖B‖op ≤ ‖A+B‖op ≤ ‖A‖op + ‖B‖op.With A = 1

n|Zn|2 βnβTn and B = 1

nYnY

Tn + 1

nYnZnβ

Tn + 1

nβnZ

Tn Y

Tn we

have by (5.3.14) that

(5.3.29)‖A‖op|βn|2

a.s.→ 1,

and by Theorem 5.2.10 and (5.3.19)

(5.3.30)‖B‖op|βn|

≤2‖ 1

nYnZnβ

Tn ‖op + ‖ 1

nYnY

Tn ‖

|βn|→ 0,

which proves that

(5.3.31)‖ 1nXnX

Tn ‖op

|βn|2a.s→ 1.

Since the matrix 1nXnX

Tn is positive semi-definite all the eigenvalues

are non-negative and we have

(5.3.32) λmax

(1

nXnX

Tn

)= ‖ 1

nXnX

Tn ‖op,

so the result follows.

Next we aim to determine the limiting ESD of 1nXnX

Tn , which will

turn out to coincide with that of 1nYnY

Tn . To show this we will use

the classical Weyl interlacing inequalities, which we now prove startingwith another classical result of linear algebra:

Lemma 5.3.3. (Courant-Fisher min-max theorem) Let A be n × nHermitian matrix then

(5.3.33) λi (A) = supdim(V )=n−i+1

infv∈V :|v|=1

v∗Av,

(5.3.34) λi (A) = infdim(V )=i

supv∈V :|v|=1

v∗Av.

Proof. By a change of basis we may take A to be diagonal withdiagonal entries λ1 ≤ . . . ≤ λn (where λi = λi (A)). Then

(5.3.35) v∗Av =n∑k=1

|vk|2 λ2k.

For V = 〈ei, ei+1, . . . , en〉 we have trivially

(5.3.36) infv∈〈ei,ei+1,...,en〉:|v|=1

n∑k=1

|vk|2 λ2k = λi.

Page 97: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

5.3. SAMPLE COVARIANCE MATRICES FOR CORRELATED VECTORS 97

Since this V is n− i+ 1-dimensional we thus have

(5.3.37) λi ≤ supdim(V )=n−i+1

infv∈V :|v|=1

v∗Av.

Now let V be any n−i+1-dimensional subset. Consider U = 〈e1, . . . , ei〉.We must have that V ∩U 6= ∅, so there is a unit vector u ∈ V ∩U , andthis vector must satisfy

(5.3.38) u∗Au =n∑k=1

|uk|2 λ2k ≤ λi.

Thus for all n− i+ 1-dimensional V we have

(5.3.39) infv∈V :|v|=1

v∗Av ≤ infv∈V ∩U :|v|=1

v∗Av ≤ u∗Au ≤ λi,

and so

(5.3.40) supdim(V )=n−i+1

infv∈V :|v|=1

v∗Av ≤ λi.

Together with with (5.3.37) this proves the top line of (5.3.33).By applying (5.3.33) with −A in place of A and n− i + 1 in place

of i in the top line of (5.3.33) we get

(5.3.41) λn−i+1 (−A) = supdim(V )=i

infv∈V :|v|=1

v∗ (−A) v.

But λn−i+1 (−A) = −λi (A) and

(5.3.42)supdim(V )=i infv∈V :|v|=1 v

∗ (−A) v= − infdim(V )=i supv∈V :|v|=1 v

∗Av,

which gives the second line of (5.3.33).

Lemma 5.3.4. (Weyl’s eigenvalue interlacing inequality) Let H,Pbe Hermitian n× n matrices. Then

(5.3.43) λi (H) + λmin (P ) ≤ λi (H + P ) ≤ λi (H) + λmax (P ) ,

and more generally for 1 ≤ i ≤ n and any j ≥ 0 such that i + j ≤ n,we have

(5.3.44) λi (H + P ) ≤ λi+j (H) + λn−j (P ) ,

and for 1 ≤ i ≤ n and any 0 ≤ j ≤ n − 1 such that i − j + 1 ≥ 1, wehave

(5.3.45) λi−j+1 (H) + λ1+j (P ) ≤ λi (H + P )

Remark 5.3.5. Note that when H and P are simultaneously diag-onalizable (5.3.43) and (5.3.45) are trivial to prove.

Page 98: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

5.3. SAMPLE COVARIANCE MATRICES FOR CORRELATED VECTORS 98

Proof of (5.3.43). Note that for all unit v we have

(5.3.46) λmin (P ) ≤ v∗Pv ≤ λmax (P ) ,

which yields (5.3.43) when used in (5.3.33).

Proof of (5.3.44)-(5.3.45) (Not examinable). Note that by(5.3.34) there exists subspaces U and W of dimension i + j and n− jrespectively, such that for all v ∈ U

(5.3.47) v∗Hv ≤ λi−j+n (H) ,

and for all v ∈ W we have

(5.3.48) v∗Pv ≤ λj (P ) .

The co-dimension of U is n− i−j and the co-dimension ofW is j. If Vis any linear space of dimension n−i+1 and co-dimension i−1, then theco-dimension of U ∩W ∩V is at most (n− i− j) + j+ (i− 1) = n− 1,so U ∩W ∩ V is non-empty, so there is a v ∈ U ∩W ∩ V such that

(5.3.49) v∗ (M + P ) v ≤ λi−j+n (M) + λj (P ) .

With (5.3.33) the upper bound (5.3.44) follows, since this holds for alln− i+ 1-dimensional V .

The lower bound of (5.3.45) follows similarly.

Lemma 5.3.6. It holds deterministically for all 2 ≤ i ≤ n− 1 that

(5.3.50) λi−2

(1

nYnY

Tn

)≤ λi

(1

nXnX

Tn

)≤ λi+3

(1

nYnY

Tn

).

Proof (Not examinable). Letting

(5.3.51) M =1

nYnY

Tn ,

and

(5.3.52) P =1

nYnZnβ

Tn +

1

nβnZ

Tn Y

Tn +

1

n|Zn|2 βnβTn ,

we have that P has rank at most three, and at most two negativeeigenvalues, so

(5.3.53) λ3 (P ) = . . . = λn−3 (P ) = 0.

Therefore (5.3.50) follows by (5.3.44) with j = 3 and from (5.3.44) withj = 2.

From this one can show the following.

Page 99: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

5.3. SAMPLE COVARIANCE MATRICES FOR CORRELATED VECTORS 99

Lemma 5.3.7. Provided E[|Yi,j|4+ε] < ∞ for some ε > 0 it holds

that

(5.3.54) P(µ 1nXnXT

n

w→ µMP,α

)= 1.

Page 100: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

APPENDIX A

Analysis

Theorem A.1. (Inverse function theorem) Let F : A → Rn foran open A ⊂ Rn be a differentiable function with continuous partialderivatives. Let x ∈ A. If JF (x) =

(∂Fi∂xj

)1≤i,j≤n

|x is an invertible

matrix, there there is a neighborhood V ⊂ A of x and U ⊂ Rn of f (x)such that

(A.1) F |U,W : V → U,

is invertible and has continuous inverse

(A.2) G : U → V.

Theorem A.2. (Implicit function theorem) Let F (x, y) for x ∈Rn, y ∈ Rm be a continuously differentiable function F : Rn+m → Rm.Let a ∈ Rn, b ∈ Rm be such that F (a, b) = 0(∈ Rm) and such thatthe Jacobian

(∂Fi∂yj

)i,j=1,...,m

|(x,y)=(a,b) with respect to y at (a, b) is an

invertible square matrix. Then there exist open sets a ∈ U ⊂ Rn andb ∈ V ⊂ Rm and a unique continuously differentiable function g : U →V such that for x ∈ U, y ∈ V(A.3) F (x, y) = 0 ⇐⇒ y = g (x) .

100

Page 101: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

APPENDIX B

Linear algebra

We use the following notation:• XT is the transpose of the matrix X• X∗ is the Hermitian conjugate XT of the matrix X• Mn (R) is the linear space of real-valued matrices• Mn (C) is the linear space of complex-valued matrices• Symn (R) ⊂Mn (R) is the linear space of n×n symmetric realmatrices• Skewn (R) ⊂Mn (R) is the linear space of n×n skew-symmetricreal matrices (S ∈ Skewn (R) ⇐⇒ S = −ST )• Hermn(C) ⊂ Mn (C) is the linear space of n × n Hermitiancomplex matrices• Skewn(C) ⊂Mn (C) is the linear space of n×n skew-Hermitiancomplex matrices (S ∈ Skewn (C) ⇐⇒ S = −S∗)• On the Euclidean space Rn (resp. Cn) we denote the standardnorm of x ∈ Rn (resp. Cn) by |x|.• The operator norm on X ∈ Mn (R) (resp. Mn (C)) is definedby

(B.1) ‖X‖op = supx∈Rn(resp.Cn):|x|=1

|Ax| .

• The Hilbert-Schmidt norm on X ∈ Mn (R) (resp. Mn (C)) isdefined by

(B.2) ‖X‖HS =

√∑i,j

|Xi,j|2,

and is the same as the Euclidean norm when X ∈ Mn (R)(resp. Mn (C)) is identified with Rn×n (resp. Cn×n) in thenatural way.

The Hilbert-Schmidt norm and the operator norm are norms on thesame finite dimensional Euclidean space, so therefore they are equiva-lent: There are c1, c2 such that

(B.3) c1 ≤‖X‖op‖X‖HS

≤ c2.

101

Page 102: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

B. LINEAR ALGEBRA 102

(In fact, ‖X‖op ≤ ‖X‖HS, so the one can take c2 = 1).

Theorem B.1. (Spectral theorem for real symmetric matrices) IfX is a symmetric real matrix then there exists an orthogonal matrixO ∈ O (n) and a diagonal real matrix Λ such that

(B.4) X = OTΛO.

Theorem B.2. (Spectral theorem for complex Hermitian matrices)If X is a complex valued Hermitian matrix then there exists a unitarymatrix U ∈ U (n) and a diagonal real matrix Λ such that

(B.5) X = U∗ΛU.

Definition B.3. (Exponential map) The maps

(B.6) exp : Mn (R)→Mn (R) ,

and

(B.7) exp : Mn (C)→Mn (C) ,

are defined by

(B.8) exp (A) = I +∞∑k=1

Ak

k!.

Lemma B.4. (Basic properties of exponential map; Proposition 2.1,Proposition 2.3, [4])

a) The right-hand side of (B.8) converges for all A ∈Mn (R) (resp.Mn (C)).

b) When viewed as a map from Rn×n (resp. Cn×n) to Rn×n (resp.Cn×n), the map exp is smooth (continuous with continuous derivativesof all orders).

c) It holds that

(B.9) exp (A) = I + A+O(|A|2ope|A|op

).

d) If AB = BA then exp (A+B) = exp (A) exp (B)e) If A ∈ Skewn (R), then exp (A) ∈ O (n).f) If A ∈ Skewn (C), then exp (A) ∈ U (n).

Definition B.5. (Matrix logarithm) For A (real or complex) suchthat |A− I|HS ≤ 1, define

(B.10) logA =∞∑m=1

(−1)m+1 (A− I)m

m.

Page 103: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

B. LINEAR ALGEBRA 103

Lemma B.6. (Basic properties of matrix logarithm; Theorem 2.8[4])

a) The right-hand side of (B.10) is convergent for if |A− I|HS ≤ 1.b) For X s.t. |X − I|HS < 1 it holds that

(B.11) elogX = X.

c) For X s.t. |X − I|HS < log 2 it holds that

(B.12) log(eX)

= X.

Lemma B.7. a) There exists an ε > 0 such that

(B.13) X ∈ O (n) : |X − I|HS ≤ ε ⊂ exp (Skewn (R)) .

b) There exists an ε > 0 such that

(B.14) X ∈ U (n) : |X − I|HS ≤ ε ⊂ exp (Skewn (C)) .

Proof. (Following proof of Theorem 2.7 [4]) a) We can identifyMn (R) with Rn×n. We can identify Skewn (R) with a subset of A =Rn×n. The set A is a linear subspace of Rn×n. Let A⊥ denote itsorthogonal complement, and define the map:

(B.15) Φ : Rn×n → Rn×n,

by

(B.16) Φ (x) = exp (PAx) exp (PA⊥x) ,

where P· denotes the projection onto ·, and we apply exp to x ∈ Rn×n

via the identification of Rn×n with Mn (R). By Lemma B.4 b), andthe differentiability of the matrix product with respect to entries, Φ iscontinuously differentiable. Furthermore

(B.17)Φ (x) =

(I + PAx+O

(|x|2)) (

I + PA⊥x+O(|x|2))

= I + PAx+ PA⊥x+O(|x|2)

= I + x+O(|x|2),

which shows that the Jacobian of Φ at x = 0 is the identity matrixI. Thus by the inverse function theorem (Theorem A.1), there areneighborhoods U ⊂ Rn×n of 0 and V ⊂ Rn×n of Φ (0) = I such that

(B.18) Φ : V → U,

is invertible.Now consider O ∈ O (n), such that |O − I|HS ≤ ε. This O is

identified with an y ∈ Rn×n. By making ε > 0 small enough we canensure y ∈ U , so that we can define

(B.19) x = Φ−1 (y) .

Page 104: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

B. LINEAR ALGEBRA 104

Now assume for contradiction that for all ε > 0, it holds that there isa O ∈ O (n), such that |O − I|HS ≤ ε and for the corresponding y

(B.20) Φ−1 (y) /∈ A.If not then for some ε > 0 it holds that if O ∈ O (n) and |O− I|HS ≤ εthen for the corresponding y we have Φ−1 (y) ∈ A, implying that thefor the Skew-symmetric matrix S corresponding to Φ−1 (y) it holds thatexp (S) = O, which finishes the proof.

Now under the assumption, we can find a sequence yn ∈ Rn suchthat yn → 0 and

(B.21) exp (sn) exp (rn) = yn,

for sn ∈ A and rn ∈ A⊥, with sn → 0 and rn → 0 but rn 6= 0. Notethat exp (sn) ∈ O (n), and therefore

(B.22) exp (rn) = exp (−sn) yn ∈ O (n)∀n.Furthermore, it holds that

(B.23) exp (rn)→ I.

Pick a subsequence nk such that also rnk

|rnk |converges to a r with |r| = 1.

Next pick a t > 0. We can pick a sequence mk of integers such thatmk |rnk | → t. It holds that

(B.24) exp

(mk |rnk |

rnk|rnk |

)→ exp (tr) ,

by the continuity of exp, but also

(B.25) exp

(mk |rnk |

rnk|rnk |

)= exp (mkrnk) = exp (rnk)

mk ∈ O (n) ,

since exp (rnk) ∈ O (n). Since O (n) is closed, this implies that

(B.26) exp (tr) ∈ O (n) for all t ≥ 0.

Now since exp (tr) ∈ O (n) it holds that

(B.27)exp (tr) exp (tr)T = I=⇒ exp (tr) exp

(trT)

= I=⇒ exp

(t(r + rT

))= I

For a small enough t > 0, t(r − rT

)lies in a neighborhood of Φ : A→

U where exp is invertible. Then

(B.28)

exp(t(r + rT

))= I

=⇒ t(r + rT

)= 0

=⇒ r = −rT=⇒ r ∈ Skewn (R) .

Page 105: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

B. LINEAR ALGEBRA 105

But this implies that r ∈ A, which contradicts, the fact that rn ∈A⊥ for all n.

b) Similar.

Page 106: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

APPENDIX C

Wahrscheinlichkeitstheorie

In diesem Appendix wird an einge Grundbegriffe der Wahrschein-lichkeitsrechnung erinnert.

Masstheorie

Definition C.1. (Ring) Sei Ω eine Menge. Ein Ring ist eine eineMenge A von Untermengen von Ω, für die

a) A,B ∈ A =⇒ A ∪B ∈ Ab) A,B ∈ A =⇒ A\B ∈ A.

Definition C.2. (σ-algebra) Sei Ω eine Menge. Ein σ-Algebra isteine eine Menge A von Untermengen von Ω, für die

a) A1, A2, . . . ∈ A =⇒ ∪iAi ∈ Ab) A,B ∈ A =⇒ A\B ∈ Ac) Ω ∈ A

Definition C.3. (Mass) Sei Ω eine Menge und A eine σ-algebraauf Ω. Eine Funktion

(C.1) µ : A → [0,∞],

heisst Mass, falls∑∞

i=1 µ (Ai) = µ (∪∞i=1Ai) für alle disjunktenA1, A2, . . . ∈A gilt und µ (∅) = 0.

Definition C.4. (Lokal endliches Mass) Ein Mass µ auf Rd (mitdem Borel σ-algebra) heisst lokal endlich, falls µ (A) < ∞ für allebeschränkte messbare Mengen A.

Grundlegende Definitionen

Definition C.5. (Wahrscheinlichkeitsmass) Ein Mass µ auf eineMenge Ω heisst Wahrscheinlichkeitsmass, falls µ (Ω) = 1.

Definition C.6. (Wahrscheinlichkeitsraum) EinWahrscheinlichkeit-sraum ist ein Tripel (Ω,A,P), wobei Ω eine Menge ist (der Zustand-sraum), A eine σ-algebra auf Ω ist und P : A → R eine Wahrschein-lichkeitsmass ist.

106

Page 107: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

VERTEILUNGEN 107

Definition C.7. (Ereignis) Eine A-messbare Menge A ⊂ Ω (d.h.A ∈ A), für ein Wahrscheinlichkeitsraum (Ω,A,P), heisst ein Ereignis.

Definition C.8. (Fast sicher; f.s.) Ein Ereignis A ist fast sicherfalls P (A) = 1.

Definition C.9. (Zufallsvariable) Eine (relle) Zufallsvariabel aufeinen Wahrscheinlichkeitsraum (Ω,A,P) ist eineA-messbare FunktionX : Ω→ R.

Definition C.10. (Erwartungswert) Der Erwartungswert von einer(rellen) Zufallsvariabel X auf ein Wahrscheinlichkeitsraum (Ω,A,P) ist

(C.2) E [X] :=

∫X (ω)P (dω) .

Definition C.11. (Varianz) Die Varianz von einer (rellen) Zu-fallsvariabel X auf ein Wahrscheinlichkeitsraum (Ω,A,P) ist

(C.3) Var (X) := E[(X − E [X])2] = E

[X2]− E [X]2 .

Verteilungen

Definition C.12. (Verteilung) Sei X : Ω→ R eine Zufallsvariable.Die Verteilung von X ist der Wahrscheinlichkeitsraum (R,B (R) ,Q),wobei B der Borell-sigma-algebra ist, und Q definiert ist durch

(C.4) Q (A) = P (X ∈ A) für alle A ∈ B (R) .

Definition C.13. (Gleiche Verteilung) Zufallsvariablen X und Yhaben die gleiche Verteilung, falls

(C.5) P (X ∈ A) = P (Y ∈ A) für alle A ∈ B (R) .

Wir schreiben dann

(C.6) XD= Y.

Definition C.14. (Cumulative Distribution Function/Verteilungsfunktion)If ν is a probability measure on (R,B (R)), then we define the Cumu-lative Distribution Function (CDF) of ν as the function

(C.7) F : R→ [0, 1] ,

(C.8) F (z) = ν ((−∞, z]) .

Definition C.15. (Cumulative Distribution Function/Verteilungsfunktionof Random Variable) Die Verteilungsfunktion einer Zufallsvariable Xist die Funktion F : R→ [0, 1] gegeben durch

(C.9) F (z) = P (X ≤ z) .

Page 108: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

VERTEILUNGEN 108

Equivalently, it’s the Cumulative Distribtion Function of the distribu-tion/Verteilung (see (C.12)) of X.

Lemma C.16. Jede Funktion F : R → [0, 1] die monoton undrechtsstetig ist, und für die limz→∞ F (z) = 1 und limz→−∞ F (z) = 0,ist eine Verteilungsfunktion einer Zufallsvariable.

Lemma C.17. Seien X und Y Zufallsvariablen mit Verteilungsfunk-tionen FX = FY . Es gilt

(C.10) XD= Y ⇐⇒ FX = FY .

\for

Definition C.18. (Dichte) Eine Zufallsvariable X hat Dichte f :R→ R falls f messbar ist und

(C.11) P (X ∈ A) =

∫A

f (x) dx,

für alle messbare A ⊂ R.Lemma C.19. Jede messbare Funktion f : R → [0,∞) für die∫

f (x) dx = 1 ist die Dichte einer Zufallsvariable.

Definition C.20. (Triviale Verteilung) Eine triviale Verteilung,ist eine Verteilung mit Verteilungsfunktion F (z) = 1z≤x. Eine Zu-fallsvariable mit dieser Verteilung nimmt f.s. den Wert x an.

Definition C.21. (Uniforme Verteilung) Für Parameter −∞ <a < b < ∞ ist die uniforme Verteilung U [a, b] auf [a, b] die Verteilungmit Dichte 1

b−a1[a,b].

Definition C.22. (Exponentialverteilung) Für ein Parameter λ >0 ist die Exponentialverteilung Exp (λ) die Verteilung mit Verteilungs-

funktion F (z) =

1− e−λz für z ≥ 0,

0 sonst.

Definition C.23. (Normalverteilung) Für Parameter µ ∈ R, σ > 0

und ist die Normalverteilung die Verteilung mit Dichte 1√2πσ2

e−(x−µ)2

2σ2 .

Definition C.24. (IID) Eine Folge von Zufallsvariable X1, X2, . . .ist IID (Independent and identically distributed), falls die Zufallsvari-ablen unabhängig sind und alle die gleiche Verteilung haben.

Lemma C.25. (Change of variables formula for densities) Let X =(X1, . . . , Xn) be a random vector in Rn taking values in a measurableset B ⊂ Rn and with a joint density

(C.12) fX : Rn → [0,∞)

Page 109: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

WEAK AND VAGUE CONVERGENCE 109

with respect to Lebesgue measure. Let

(C.13) g : B → Rn,

be an injective function which is continuously differentiable. Assumefurther that the inverse g−1 : g (Rn)→ B has Jacobian J (z) =

(∂g−1i (z)

∂yi

)i,j=1,...,n

|y=z

which is invertible for all z ∈ g (Rn). Then the random vector

(C.14) Y = g (X) ,

has a density with respsect to Lebesgue measure on Rn, given by(C.15)

fY (y) =

fX (x) |det (J (x))| where x = g−1 (y) , if y ∈ g (B) ,

0 if y /∈ g (B) ,

where J .

Weak and vague convergence

Definition C.26. (Weak convergence of measures) Let νn be mea-sures on some measurable space X (such as R) with Borel sigma algebraB (X ). We say that

(C.16) νn → ν weakely,

if for all continuous bounded f : X → R it holds

(C.17)∫fνn (dx)→

∫fν (dx) .

Lemma C.27. (Topology of weak convergence)

Lemma C.28. (Portmanteau lemma) Let X be a topological spacewith Borel sigma algebra B (X ) and νn, ν probability measures on thisspace The following are equivalent

a)∫fνn (dx)→

∫fν (dx) for all continuous bounded f : X → R,

b) νn (A)→ ν (A) for all measurable A ⊂ X such that ν(A\Ao

)=

0.In the special case X = R then also the following are equivalent to

a) b)c) νn ([a, b])→ ν ([a, b]) for all a, b such that ν (a) = ν (b) = 0,d) Fn (x)→ F (x) for x where F is continuous, where Fn, F are the

cumulative distribution functions of νn, n.

Definition C.29. (Lévy–Prokhorov metric) Let (M,d) be a metricspace and B (M) its Borel sigma-algebra. For any A ⊂M and ε, define

(C.18) Aε = x ∈M : d (x, y) ≤ ε for some y ∈ A .

Page 110: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

KONVERGENZ VON ZUFALLSVARIABLEN 110

Let P be the set of probablity measures on (M,B (M)). Define

(C.19) π : P × P → [0,∞),

by(C.20)π (µ, ν) = inf ε > 0 : for all A ∈ B (M) , µ (A) ≤ ν (Aε) + ε and ν (A) ≤ µ (Aε) + ε .We call π the Lévy–Prokhorov metric on P .

Lemma C.30. (Properties of the Lévy–Prokhorov metric)a) The definition (C.20) gives a well-defined metric on Pb) For any probability νn, ν on (M,B (M)), it holds that νn → ν

weakely iff π (νn, ν)→ 0.

Konvergenz von Zufallsvariablen

Definition C.31. (Fast sichere Konvergenz) Eine Folge von Zu-fallsvarablen X1, X2, . . . auf demselben Wahrscheinlichkeitsraum kon-vergiert fast sicher gegen X, falls

(C.21) P (Xn → X) = 1.

Wenn dies gilt schreiben wir Xnf.s.→ X.

Definition C.32. (Konvergenz in Wahrscheinlichkeit) Eine Folgevon Zufallsvarablen X1, X2, . . . konvergiert in Wahrscheinlichkeit gegeneine Zufallsvariable X, falls

(C.22) limn→∞

P (|Xn −X| ≥ ε) = 0 ∀ε > 0.

Wir schreiben XnP→ X. Falls X = x ∈ R eine Konstante ist, ist eine

equivalente Bedingung das für jede a < x < b,

(C.23) limn→∞

P (a ≤ Xn ≤ b) = 1.

Die letzte Definition gilt auch für x =∞, d.h. XnP→∞ falls

(C.24) limn→∞

P (Xn ≥ a) = 1 ∀a > 0.

(Konvergenz in Verteilung) Eine Folge von Zufallsviablen X1, X2, . . .mit Verteilungsfunktionen F1, F2, . . . konvergiert in Verteilung gegeneine Zufallsvariable X mit Verteilung F falls

(C.25) limn→∞

Fn (x) = F (x) ,

für alle x ∈ R, so dass F stetig ist auf x. Wir schreiben dafür

(C.26) XnD→ X.

Page 111: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

UNGLEICHUNGEN 111

Lemma C.33. Für Zufallsvariablen X,X1, X2, . . . definiert auf densel-ben Wahrscheinlichkeitsraum gelten folgende Implikationen

(C.27) Xnf.s.→ X =⇒ Xn

P→ X =⇒ XnD→ X.

Definition C.34. (Straff) Eine Folge von ZufallsvarablenX1, X2, . . .ist Straff falls

(C.28) limK→∞

lim supn→∞

P (|Xn| ≥ K) = 0.

Theorem C.35. Falls X1, X2, . . . eine straffe Folge von Zufallsvari-ablen ist, gibt es eine Unterfolge Xi1 , Xi2 , . . ., und eine Verteilung F ,so dass

(C.29) Xik

D→ F.

Lemma C.36 ((Satz von Slutsky)). Seien A,A1, A2 . . . , B1, B2, . . .Folgen von Zufallsvariablen, so dass

(C.30) AnD→ A und Bn

P→ b ∈ R,dann gilt

(C.31) An +BnD→ A+ b und AnBn

D→ Ab.

Theorem C.37 (Satz von Skorohod). Seien A,A1, A2, . . . Zufallsvari-ablen, so dass

(C.32) AnD→ A.

Es gibt ein Wahrscheinlichkeitsraum (Ω,A,Q) und Zufallsvariablen A, A1, A2, . . .auf (Ω,A,Q) s.d. die Q-verteilung von An (bzw. A) die Verteilung vonAn (bzw. A) ist, und

(C.33) AnQ-f.s.→ A.

Ungleichungen

Lemma C.38. (Markov-ungleichung) Sei X ≥ 0 eine Zufallsvariableauf eine Wahrscheinlichkeitsraum (Ω,A,P). Dann gilt für jedes u > 0

(C.34) P (X ≥ u) ≤ E [X]

u.

Lemma C.39. (Tschebyscheff-ungleichung) Sei X eine Zufallsvari-able auf eine Wahrscheinlichkeitsraum (Ω,A,P) mit E [|X|] <∞. Danngilt für jedes u > 0

(C.35) P (|X − E [X]| ≥ u) ≤ Var (X)

u2.

Page 112: Skript: Random Matrix Theory (FS2019) · CHAPTER 1 Introduction Random matrix theory is the study of matrices whose entries are randomvariables. Forinstance (1.0.1) X= X 1;1 X 1;2

MOMENTERZEUGENDE FUNKTION 112

Grenzwertsätze

Theorem C.40. (Starker Gesetz der grossen Zahlen) Seien X1, X2, . . .IID Zufallsvariablen mit E [|X1|] <∞. Es gilt

(C.36)X1 + . . .+Xn

n

f.s.→ E [X1] .

Theorem C.41. (Zentralen Grenzwertssatz) Seien X1, X2, . . . IIDZufallsvariablen mit E [X2

1 ] = σ2 <∞ und E [X1] = µ. Es gilt

(C.37)X1 + . . .+Xn − nµ√

D→ N (0, 1) .

Momenterzeugende Funktion

Definition C.42. (Momentzerzeugende Funktion) Für eine reelleZufallsvariable Z wird die momenterzeugende Funktion MZ (t) : R →[0,∞)

(C.38) MZ (λ) := E [exp (λZ)] .

Lemma C.43. Falls X und Y Zufallsvariablen sind, und es eineε > 0 gibt, so dass MX (λ) ,MY (λ) < ∞ und MX (λ) = MY (λ) füralle λ ∈ (−ε, ε), so haben X und Y die gleiche Verteilung.

Lemma C.44. Falls X,X1, X2, . . . Zufallsvariablen sind, und es eineε > 0 gibt, so dass

(C.39) MXn (λ)→MX (λ) für alle λ ∈ (−ε, ε) ,dann gilt

(C.40) XnD→ X.

Lemma C.45. Die momenterzeugend Funktion von einer standard-normalverteilte Zufallsvariable X ist

(C.41) MX (λ) = eλ2

2 für λ ∈ R.