

Low Rank Matrix-valued Chernoff Bounds and Approximate Matrix Multiplication

Avner Magen    Anastasios Zouzias
Abstract. In this paper we develop algorithms for approximating matrix multiplication with respect to the spectral norm. Let $A \in \mathbb{R}^{n \times m}$ and $B \in \mathbb{R}^{n \times p}$ be two matrices and $\varepsilon > 0$. We approximate the product $A^\top B$ using two sketches $\tilde{A} \in \mathbb{R}^{t \times m}$ and $\tilde{B} \in \mathbb{R}^{t \times p}$, where $t \ll n$, such that
$$\|\tilde{A}^\top \tilde{B} - A^\top B\|_2 \le \varepsilon\, \|A\|_2 \|B\|_2$$
with high probability. We analyze two different sampling procedures for constructing $\tilde{A}$ and $\tilde{B}$; one of them is done by i.i.d. non-uniform sampling of rows from $A$ and $B$, and the other by taking random linear combinations of their rows. We prove bounds on $t$ that depend only on the intrinsic dimensionality of $A$ and $B$, that is, their rank and their stable rank.

For achieving bounds that depend on the rank when taking random linear combinations we employ standard tools from high-dimensional geometry such as concentration of measure arguments combined with elaborate $\varepsilon$-net constructions. For bounds that depend on the smaller parameter of stable rank this technology itself seems weak. However, we show that in combination with a simple truncation argument it is amenable to provide such bounds. To handle similar bounds for row sampling, we develop a novel matrix-valued Chernoff bound inequality which we call the low rank matrix-valued Chernoff bound. Thanks to this inequality, we are able to give bounds that depend only on the stable rank of the input matrices.

We highlight the usefulness of our approximate matrix multiplication bounds by supplying two applications. First we give an approximation algorithm for the $\ell_2$-regression problem that returns an approximate solution by randomly projecting the initial problem to dimensions linear in the rank of the constraint matrix. Second we give improved approximation algorithms for the low rank matrix approximation problem with respect to the spectral norm.

    1 Introduction

In many scientific applications, data is often naturally expressed as a matrix, and computational problems on such data are reduced to standard matrix operations including matrix multiplication, $\ell_2$-regression, and low rank matrix approximation.

University of Toronto, Department of Computer Science. Email: [email protected].

University of Toronto, Department of Computer Science. Email: [email protected].

In this paper we analyze several approximation algorithms with respect to these operations. All of our algorithms share a common underlying framework which can be described as follows: let $A$ be an input matrix on which we may want to apply a matrix computation in order to infer some useful information about the data that it represents. The main idea is to work with a sample of $A$ (a.k.a. a sketch), call it $\tilde{A}$, and hope that the information obtained from $\tilde{A}$ will be in some sense close to the information that would have been extracted from $A$.

In this generality, the above approach (sometimes called the Monte-Carlo method for linear algebraic problems) is ubiquitous, and is responsible for much of the development in fast matrix computations [FKV04, DKM06a, Sar06, DMM06, AM07, CW09, DR10].

As we sample $A$ to create a sketch $\tilde{A}$, our goal is twofold: (i) guarantee that $\tilde{A}$ resembles $A$ in the relevant measure, and (ii) achieve such an $\tilde{A}$ using as few samples as possible. The standard tool that provides a handle on these requirements when the objects are real numbers is the Chernoff bound inequality. However, since we deal with matrices, we would like to have an analogous probabilistic tool suitable for matrices. Quite recently a non-trivial generalization of Chernoff bound type inequalities for matrix-valued random variables was introduced by Ahlswede and Winter [AW02]. Such inequalities are suitable for the type of problems that we will consider here. However, this type of inequalities and their variants that have been proposed in the literature [GLF+09, Rec09, Gro09, Tro10] all suffer from the fact that their bounds depend on the dimensionality of the samples. We argue that in a wide range of applications, this dependency can be quite detrimental.

Specifically, whenever the following two conditions hold we typically provide stronger bounds compared with the existing tools: (a) the input matrix has low intrinsic dimensionality, such as rank or stable rank, and (b) the matrix samples themselves have low rank. The validity of condition (a) is very common in applications for the simple reason that viewing data using matrices typically leads to redundant representations. Typical sampling methods tend to rely on extremely simple


sampling matrices, i.e., samples that are supported on only one entry [AHK06, AM07, DZ10] or samples that are obtained by the outer product of the sampled rows or columns [DKM06a, RV07]; therefore condition (b) is often natural to assume. By incorporating the rank assumption of the matrix samples in the above matrix-valued inequalities we are able to develop a dimension-free matrix-valued Chernoff bound. See Theorem 1.1 for more details.

Fundamental to the applications we derive are two probabilistic tools that provide concentration bounds of certain random matrices. These tools are inherently different, where each pertains to a different sampling procedure. In the first, we multiply the input matrix by a random sign matrix, whereas in the second we sample rows according to a distribution that depends on the input matrix. In particular, the first method is oblivious (the probability space does not depend on the input matrix) while the second is not.

The first tool is the so-called subspace Johnson-Lindenstrauss lemma. Such a result was obtained in [Sar06] (see also [Cla08, Theorem 1.3]), although it appears implicitly in results extending the original Johnson-Lindenstrauss lemma (see [Mag07]). The techniques for proving such a result, with a possibly worse bound, are not new and can be traced back even to Milman's proof of Dvoretzky's theorem [Mil71].

Lemma 1.1. (Subspace JL lemma [Sar06]) Let $W \subseteq \mathbb{R}^d$ be a linear subspace of dimension $k$ and $\varepsilon \in (0, 1/3)$. Let $R$ be a $t \times d$ random sign matrix rescaled by $1/\sqrt{t}$, namely $R_{ij} = \pm 1/\sqrt{t}$ with equal probability. Then
$$\mathbb{P}\left( (1-\varepsilon)\|w\|_2^2 \le \|Rw\|_2^2 \le (1+\varepsilon)\|w\|_2^2,\ \forall\, w \in W \right) \ge 1 - c_2^{k}\exp(-c_1\varepsilon^2 t), \qquad (1.1)$$
where $c_1 > 0$, $c_2 > 1$ are constants.
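For intuition, the following small numpy experiment illustrates the lemma; it is only a sketch, and the dimensions, the sketch size, and the number of test vectors are arbitrary illustrative choices rather than the constants of Lemma 1.1.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, t = 1000, 10, 400                      # ambient dimension, subspace dimension, sketch size

# Orthonormal basis of a random k-dimensional subspace W of R^d.
W, _ = np.linalg.qr(rng.standard_normal((d, k)))

# Random sign matrix rescaled by 1/sqrt(t), as in Lemma 1.1.
R = rng.choice([-1.0, 1.0], size=(t, d)) / np.sqrt(t)

# Squared-norm distortion ||Rw||_2^2 / ||w||_2^2 on random unit vectors w in W.
coeffs = rng.standard_normal((k, 500))
vecs = W @ coeffs
vecs /= np.linalg.norm(vecs, axis=0)
ratios = np.linalg.norm(R @ vecs, axis=0) ** 2
print(ratios.min(), ratios.max())            # typically concentrated around 1
```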

The importance of such a tool is that it allows us to get bounds on the necessary dimensions of the random sign matrix in terms of the rank of the input matrices, see Theorem 3.2 (i.a).

While the assumption that the input matrices have low rank is a fairly reasonable assumption, one should be a little cautious as the property of having low rank is not robust. Indeed, if random noise is added to a matrix, even if low rank, the matrix obtained will have full rank almost surely. On the other hand, it can be shown that the added noise cannot distort the Frobenius and operator norm significantly, which makes the notion of stable rank robust, and so the assumption of low stable rank on the input is more applicable than the low rank assumption.

Given the above discussion, we resort to a different methodology, called matrix-valued Chernoff bounds. These are non-trivial generalizations of the standard Chernoff bounds over the reals and were first introduced in [AW02]. Part of the contribution of the current work is to show that such inequalities, similarly to their real-valued ancestors, provide powerful tools to analyze randomized algorithms. There is a rapidly growing line of research exploiting the power of such inequalities, including matrix approximation by sparsification [AM07, DZ10]; analysis of algorithms for matrix completion and decomposition of low rank matrices [CR07, Gro09, Rec09]; and semi-definite relaxation and rounding of quadratic maximization problems [Nem07, So09a, So09b].

The quality of these bounds can be measured by the number of samples needed in order to obtain small error probability. The original result of [AW02, Theorem 19] shows that¹ if $M$ is distributed according to some distribution over $n \times n$ matrices with zero mean², and if $M_1, \ldots, M_t$ are independent copies of $M$, then for any $\varepsilon > 0$,

$$\mathbb{P}\left( \left\| \frac{1}{t}\sum_{i=1}^{t} M_i \right\|_2 > \varepsilon \right) \le n \exp\left( -C\,\frac{\varepsilon^2 t}{\gamma^2} \right), \qquad (1.2)$$

where $\|M\|_2 \le \gamma$ holds almost surely and $C > 0$ is an absolute constant.

Notice that the number of samples in Ineq. (1.2) depends logarithmically on $n$. In general, unfortunately, such a dependency is inevitable: take for example a diagonal random sign matrix of dimension $n$. The operator norm of the sum of $t$ independent samples is precisely the maximum deviation among $n$ independent random walks of length $t$. In order to achieve a fixed bound on the maximum deviation with constant probability, it is easy to see that $t$ should grow logarithmically with $n$ in this scenario.

In their seminal paper, Rudelson and Vershynin provide a matrix-valued Chernoff bound that avoids the dependency on the dimensions by assuming that the matrix samples are the outer product $x \otimes x$ of a randomly distributed vector $x$ [RV07]. It turns out that this assumption is too strong in most applications, such as the ones we study in this work, and so we wish to relax it without increasing the bound significantly. In the following theorem we replace this assumption with that of having low rank. We should note that we

¹ For ease of presentation we actually provide the restatement presented in [WX08, Theorem 2.6], which is more suitable for this discussion.

² Zero mean means that the (matrix-valued) expectation is the zero $n \times n$ matrix.


are not aware of a simple way to extend Theorem 3.1 of [RV07] to the low rank case, even constant rank. The main technical obstacle is the use of the powerful Rudelson selection lemma, see [Rud99] or Lemma 3.5 of [RV07], which applies only to Rademacher sums of outer products of vectors. We bypass this obstacle by proving a more general lemma, see Lemma 5.2. The proof of Lemma 5.2 relies on the non-commutative Khintchine moment inequality [LP86, Buc01], which is also the backbone in the proof of Rudelson's selection lemma. With Lemma 5.2 at our disposal, the proof techniques of [RV07] can be adapted to support our more general condition.

Theorem 1.1. Let $0 < \varepsilon < 1$ and let $M$ be a random symmetric real matrix with $\|\mathbb{E}M\|_2 \le 1$ and $\|M\|_2 \le \gamma$ almost surely. Assume that each element in the support of $M$ has rank at most $r$. Set $t = \Omega(\gamma\log(\gamma/\varepsilon^2)/\varepsilon^2)$. If $r \le t$ holds almost surely, then
$$\mathbb{P}\left( \left\| \frac{1}{t}\sum_{i=1}^{t} M_i - \mathbb{E}M \right\|_2 > \varepsilon \right) \le \frac{1}{\mathrm{poly}(t)},$$
where $M_1, M_2, \ldots, M_t$ are i.i.d. copies of $M$.

Proof. See Appendix, page 12.

Remark 1. (Optimality) The above theorem cannot be improved in terms of the number of samples required without changing its form, since in the special case where the rank of the samples is one it is exactly the statement of Theorem 3.1 of [RV07], see [RV07, Remark 3.4].
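As a quick sanity check of the kind of concentration the theorem describes, the following numpy simulation averages i.i.d. rank-one symmetric samples whose expectation has unit spectral norm (the rank-one special case mentioned in Remark 1); the matrix sizes and the number of samples are arbitrary illustrative choices, not the theorem's bound.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, t = 2000, 20, 600                     # ambient size, rank of EM, number of samples

# U has orthonormal columns, so EM below equals U^T U = I_k and ||EM||_2 = 1.
U, _ = np.linalg.qr(rng.standard_normal((n, k)))
p = np.einsum('ij,ij->i', U, U) / k         # sampling probabilities over rows, sum to 1

idx = rng.choice(n, size=t, p=p)
# Each sample M_i = U^{(i)T} U^{(i)} / p_i is symmetric, PSD, and has rank one.
samples = np.einsum('ti,tj->tij', U[idx], U[idx]) / p[idx][:, None, None]

deviation = np.linalg.norm(samples.mean(axis=0) - np.eye(k), 2)
print(deviation)                            # spectral-norm deviation || (1/t) sum M_i - EM ||_2
```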

We highlight the usefulness of the above main tools by first proving a dimension-free approximation algorithm for matrix multiplication with respect to the spectral norm (Section 3.1). Utilizing this matrix multiplication bound we get an approximation algorithm for the $\ell_2$-regression problem which returns an approximate solution by randomly projecting the initial problem to dimensions linear in the rank of the constraint matrix (Section 3.2). Finally, in Section 3.3 we give improved approximation algorithms for the low rank matrix approximation problem with respect to the spectral norm, and moreover answer in the affirmative a question left open by the authors of [NDT09].

    2 Preliminaries and Definitions

The next discussion reviews several definitions and facts from linear algebra; for more details, see [SS90, GV96, Bha96]. We abbreviate the terms independently and identically distributed and almost surely with i.i.d. and a.s., respectively. We let $S^{n-1} := \{x \in \mathbb{R}^n \mid \|x\|_2 = 1\}$ be the $(n-1)$-dimensional sphere. A random Gaussian matrix is a matrix whose entries are i.i.d. standard Gaussians, and a random sign matrix is a matrix whose entries are independent Bernoulli random variables, that is, they take values from $\{\pm 1\}$ with equal probability. For a matrix $A \in \mathbb{R}^{n \times m}$, $A^{(i)}$ and $A_{(j)}$ denote the $i$-th row and the $j$-th column, respectively. For a matrix of rank $r$, the Singular Value Decomposition (SVD) of $A$ is the decomposition of $A$ as $U\Sigma V^\top$, where $U \in \mathbb{R}^{n \times r}$ and $V \in \mathbb{R}^{m \times r}$ have orthonormal columns, and $\Sigma = \mathrm{diag}(\sigma_1(A), \ldots, \sigma_r(A))$ is an $r \times r$ diagonal matrix. We further assume $\sigma_1 \ge \ldots \ge \sigma_r > 0$ and call these real numbers the singular values of $A$. By $A_k = U_k\Sigma_k V_k^\top$ we denote the best rank-$k$ approximation to $A$, where $U_k$ and $V_k$ are the matrices formed by the first $k$ columns of $U$ and $V$, respectively. We denote by $\|A\|_2 = \max\{\|Ax\|_2 \mid \|x\|_2 = 1\}$ the spectral norm of $A$, and by $\|A\|_F = \sqrt{\sum_{i,j} A_{ij}^2}$ the Frobenius norm of $A$. We denote by $A^\dagger$ the Moore-Penrose pseudo-inverse of $A$, i.e., $A^\dagger = V\Sigma^{-1}U^\top$. Notice that $\sigma_1(A) = \|A\|_2$. Also we define by $\mathrm{sr}(A) := \|A\|_F^2 / \|A\|_2^2$ the stable rank of $A$. Notice that the inequality $\mathrm{sr}(A) \le \mathrm{rank}(A)$ always holds. The orthogonal projection of a matrix $A$ onto the row space of a matrix $C$ is denoted by $P_C(A) = AC^\dagger C$. By $P_{C,k}(A)$ we denote the best rank-$k$ approximation of the matrix $P_C(A)$.
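The quantities defined above translate directly into a few lines of numpy; the sketch below is only illustrative (the matrix sizes and the noise level are arbitrary) and assumes nothing beyond the definitions just given.

```python
import numpy as np

rng = np.random.default_rng(2)

# A nearly low-rank matrix: the rank jumps to full after noise, the stable rank barely moves.
A = rng.standard_normal((500, 5)) @ rng.standard_normal((5, 300))
A += 1e-3 * rng.standard_normal(A.shape)

rank = np.linalg.matrix_rank(A)
stable_rank = np.linalg.norm(A, 'fro') ** 2 / np.linalg.norm(A, 2) ** 2
print(rank, stable_rank)                       # e.g. 300 vs. roughly 5

# Best rank-k approximation A_k from the SVD, and the Moore-Penrose pseudo-inverse.
k = 5
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
A_pinv = np.linalg.pinv(A)                     # equals V Sigma^{-1} U^T

# Projection of A onto the row space of a matrix C: P_C(A) = A C^+ C.
C = rng.standard_normal((50, 300))
P_C_A = A @ (np.linalg.pinv(C) @ C)
```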

    3 Applications

All the proofs of this section have been deferred to Section 4.

3.1 Matrix Multiplication. The seminal research of [FKV04] focuses on using non-uniform row sampling to speed up the running time of several matrix computations. The subsequent developments of [DKM06a, DKM06b, DKM06c] also study the performance of Monte-Carlo algorithms on primitive matrix algorithms including the matrix multiplication problem with respect to the Frobenius norm. Sarlos [Sar06] extended (and improved) this line of research using random projections. Most of the bounds for approximating matrix multiplication in the literature are with respect to the Frobenius norm [DKM06a, Sar06, CW09]. In some cases, the techniques that are utilized for bounding the Frobenius norm also imply weak bounds for the spectral norm, see [DKM06a, Theorem 4] or [Sar06, Corollary 11], which is similar to part (i.a) of Theorem 3.2.

In this section we develop approximation algorithms for matrix multiplication with respect to the spectral norm. The algorithms that will be presented in this section are based on the tools mentioned in Section 1.

Variants of matrix-valued inequalities:

Assumption on the sample $M$ | # of samples ($t$) | Failure prob. | Reference | Comment
$\|M\|_2 \le \gamma$ a.s. | $\Omega(\gamma^2\log(n)/\varepsilon^2)$ | $1/\mathrm{poly}(n)$ | [WX08] | Hoeffding
$\|M\|_2 \le \gamma$ a.s., $\|\mathbb{E}M^2\|_2 \le \rho^2$ | $\Omega((\rho^2 + \gamma\varepsilon/3)\log(n)/\varepsilon^2)$ | $1/\mathrm{poly}(n)$ | [Rec09] | Bernstein
$\|M\|_2 \le \gamma$ a.s., $M = x \otimes x$, $\|\mathbb{E}M\|_2 \le 1$ | $\Omega(\gamma\log(\gamma/\varepsilon^2)/\varepsilon^2)$ | $\exp(-\Omega(\varepsilon^2 t/(\gamma\log t)))$ | [RV07] | Rank one
$\|M\|_2 \le \gamma$, $\mathrm{rank}(M) \le t$ a.s., $\|\mathbb{E}M\|_2 \le 1$ | $\Omega(\gamma\log(\gamma/\varepsilon^2)/\varepsilon^2)$ | $1/\mathrm{poly}(t)$ | Theorem 1.1 | Low rank

Table 1: Summary of matrix-valued Chernoff bounds. $M$ is a probability distribution over symmetric $n \times n$ matrices. $M_1, \ldots, M_t$ are i.i.d. copies of $M$.

Before stating our main dimension-free matrix multiplication theorem (Theorem 3.2), we discuss the best possible bound that can be achieved using the currently known matrix-valued inequalities (to the best of our knowledge). Consider a direct application of Ineq. (1.2), where an analysis similar to that in the proof of Theorem 3.2 (ii) would allow us to achieve a bound of $\Omega(\tilde{r}^2\log(m+p)/\varepsilon^2)$ on the number of samples (details omitted). However, as the next theorem indicates (proof omitted), we can get linear dependency on the stable rank of the input matrices by gaining from the variance information of the samples; more precisely, this can be achieved by applying the matrix-valued Bernstein inequality, see e.g. [GLF+09], [Rec09, Theorem 3.2] or [Tro10, Theorem 2.10].

Theorem 3.1. Let $0 < \varepsilon < 1/2$ and let $A \in \mathbb{R}^{n\times m}$, $B \in \mathbb{R}^{n\times p}$ both have stable rank at most $\tilde{r}$. The following hold:

(i) Let $R$ be a $t\times n$ random sign matrix rescaled by $1/\sqrt{t}$. Denote by $\tilde{A} = RA$ and $\tilde{B} = RB$. If $t = \Omega(\tilde{r}\log(m+p)/\varepsilon^2)$ then
$$\mathbb{P}\left( \|\tilde{A}^\top\tilde{B} - A^\top B\|_2 \le \varepsilon\|A\|_2\|B\|_2 \right) \ge 1 - \frac{1}{\mathrm{poly}(\tilde{r})}.$$

(ii) Let $p_i = \|A^{(i)}\|_2\|B^{(i)}\|_2/S$, where $S = \sum_{i=1}^{n}\|A^{(i)}\|_2\|B^{(i)}\|_2$, be a probability distribution over $[n]$. If we form a $t\times m$ matrix $\tilde{A}$ and a $t\times p$ matrix $\tilde{B}$ by taking $t = \Omega(\tilde{r}\log(m+p)/\varepsilon^2)$ i.i.d. (row index) samples from $p_i$, then
$$\mathbb{P}\left( \|\tilde{A}^\top\tilde{B} - A^\top B\|_2 \le \varepsilon\|A\|_2\|B\|_2 \right) \ge 1 - \frac{1}{\mathrm{poly}(\tilde{r})}.$$

Notice that the above bounds depend linearly on the stable rank of the matrices and logarithmically on their dimensions. As we will see in the next theorem, we can remove the dependency on the dimensions and replace it with the stable rank. Recall that in most cases matrices do have low stable rank, which is much smaller than their dimensionality.

Theorem 3.2. Let $0 < \varepsilon < 1/2$ and let $A \in \mathbb{R}^{n\times m}$, $B \in \mathbb{R}^{n\times p}$ both have rank and stable rank at most $r$ and $\tilde{r}$, respectively. The following hold:

(i) Let $R$ be a $t\times n$ random sign matrix rescaled by $1/\sqrt{t}$. Denote by $\tilde{A} = RA$ and $\tilde{B} = RB$.

(a) If $t = \Omega(r/\varepsilon^2)$ then
$$\mathbb{P}\left( \forall\, x \in \mathbb{R}^m,\ y \in \mathbb{R}^p:\ |x^\top(\tilde{A}^\top\tilde{B} - A^\top B)y| \le \varepsilon\,\|Ax\|_2\|By\|_2 \right) \ge 1 - e^{-\Omega(r)}.$$

(b) If $t = \Omega(\tilde{r}/\varepsilon^4)$ then
$$\mathbb{P}\left( \|\tilde{A}^\top\tilde{B} - A^\top B\|_2 \le \varepsilon\|A\|_2\|B\|_2 \right) \ge 1 - e^{-\Omega(\tilde{r}/\varepsilon^2)}.$$

(ii) Let $p_i = \|A^{(i)}\|_2\|B^{(i)}\|_2/S$, where $S = \sum_{i=1}^{n}\|A^{(i)}\|_2\|B^{(i)}\|_2$, be a probability distribution over $[n]$. If we form a $t\times m$ matrix $\tilde{A}$ and a $t\times p$ matrix $\tilde{B}$ by taking $t = \Omega(\tilde{r}\log(\tilde{r}/\varepsilon^2)/\varepsilon^2)$ i.i.d. (row index) samples from $p_i$, then
$$\mathbb{P}\left( \|\tilde{A}^\top\tilde{B} - A^\top B\|_2 \le \varepsilon\|A\|_2\|B\|_2 \right) \ge 1 - \frac{1}{\mathrm{poly}(\tilde{r})}.$$

Remark 2. In part (ii), we can actually achieve the stronger bound of $t = \Omega\big(\sqrt{\mathrm{sr}(A)\,\mathrm{sr}(B)}\,\log(\mathrm{sr}(A)\,\mathrm{sr}(B)/\varepsilon^4)/\varepsilon^2\big)$ (see proof). However, for ease of presentation and comparison we give the above displayed bound.

Part (i.b) follows from (i.a) via a simple truncation argument. This was pointed out to us by Mark Rudelson (personal communication). To understand the significance of and the differences between the components of this theorem, we first note that the probabilistic event of part (i.a) is superior to the probabilistic events of (i.b) and (ii). Indeed, when $B = A$ the former implies that $|x^\top(\tilde{A}^\top\tilde{A} - A^\top A)x| \le \varepsilon\, x^\top A^\top A x$ for every $x$, which is stronger than $\|\tilde{A}^\top\tilde{A} - A^\top A\|_2 \le \varepsilon\|A\|_2^2$. We will heavily exploit this fact in Section 4.3 to prove Theorem 3.4 (i.a) and (ii). Also notice that part (i.b) is


essentially computationally inferior to (ii), as it gives the same bound while it is more expensive computationally to multiply the matrices by random sign matrices than to just sample their rows. However, the advantage of part (i) is that the sampling process is oblivious, i.e., it does not depend on the input matrices.
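To make the two constructions concrete, here is a minimal numpy sketch of both sampling procedures; the row rescaling by $1/\sqrt{t\,p_i}$ mirrors the construction used in the proof of part (ii), while the matrix sizes and the sketch dimension below are arbitrary choices made for this illustration only.

```python
import numpy as np

def sign_sketch(A, B, t, rng):
    """Oblivious sketch: multiply A and B by the same rescaled random sign matrix."""
    R = rng.choice([-1.0, 1.0], size=(t, A.shape[0])) / np.sqrt(t)
    return R @ A, R @ B

def row_sampling_sketch(A, B, t, rng):
    """Non-oblivious sketch: sample row indices with p_i proportional to ||A^(i)|| ||B^(i)||,
    rescaling each picked row by 1/sqrt(t p_i) so that E[sketch_A^T sketch_B] = A^T B."""
    norms = np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1)
    p = norms / norms.sum()
    idx = rng.choice(A.shape[0], size=t, p=p)
    scale = 1.0 / np.sqrt(t * p[idx])
    return A[idx] * scale[:, None], B[idx] * scale[:, None]

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    n, m, p_dim, t = 5000, 60, 40, 800
    A = rng.standard_normal((n, m)) @ np.diag(np.linspace(1.0, 0.01, m))
    B = rng.standard_normal((n, p_dim))
    exact = A.T @ B
    denom = np.linalg.norm(A, 2) * np.linalg.norm(B, 2)
    for sketch in (sign_sketch, row_sampling_sketch):
        As, Bs = sketch(A, B, t, rng)
        err = np.linalg.norm(As.T @ Bs - exact, 2) / denom
        print(sketch.__name__, err)            # spectral-norm error relative to ||A||_2 ||B||_2
```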

3.2 $\ell_2$-regression. In this section we present an approximation algorithm for the least-squares regression problem: given an $n \times m$, $n > m$, real matrix $A$ of rank $r$ and a real vector $b \in \mathbb{R}^n$, we want to compute $x_{\mathrm{opt}} = A^\dagger b$, which minimizes $\|Ax - b\|_2$ over all $x \in \mathbb{R}^m$. In their seminal paper [DMM06], Drineas et al. show that if we non-uniformly sample $t = \Omega(m^2/\varepsilon^2)$ rows from $A$ and $b$, then with high probability the optimum solution of the $t \times m$ sampled problem will be within $(1+\varepsilon)$ of the original problem. The main drawback of their approach is that finding or even approximating the sampling probabilities is computationally intractable. Sarlos [Sar06] improved the above to $t = \Omega(m\log m/\varepsilon^2)$ and gave the first $o(nm^2)$ relative error approximation algorithm for this problem.

In the next theorem we eliminate the extra $\log m$ factor from Sarlos' bounds and, more importantly, replace the dimension (number of variables) $m$ with the rank $r$ of the constraint matrix $A$. We should point out that, independently, the same bound as our Theorem 3.3 was recently obtained by Clarkson and Woodruff [CW09] (see also [DMMS09]). The proof of Clarkson and Woodruff uses heavy machinery and a completely different approach. In a nutshell, they manage to improve the matrix multiplication bound with respect to the Frobenius norm. They achieve this by bounding higher moments of the Frobenius norm of the approximation viewed as a random variable, instead of bounding the local differences for each coordinate of the product. To do so, they rely on intricate moment calculations spanning over four pages, see [CW09] for more. On the other hand, the proof of the present $\ell_2$-regression bound uses only basic matrix analysis, elementary deviation bounds and $\varepsilon$-net arguments. More precisely, we argue that Theorem 3.2 (i.a) immediately implies that randomly projecting to dimensions linear in the intrinsic dimensionality of the constraints, i.e., the rank

of $A$, is sufficient, as the following theorem indicates.

Theorem 3.3. Let $A \in \mathbb{R}^{n\times m}$ be a real matrix of rank $r$ and $b \in \mathbb{R}^n$. Let $\min_{x\in\mathbb{R}^m}\|b - Ax\|_2$ be the $\ell_2$-regression problem, where the minimum is achieved with $x_{\mathrm{opt}} = A^\dagger b$. Let $0 < \varepsilon$ …

3.3 Spectral Low Rank Matrix Approximation. Given an $n \times m$, $n > m$, real matrix $A$ of rank $r$, we wish to compute $A_k = U_k\Sigma_k V_k^\top$, which minimizes $\|A - X_k\|_2$ over the set of $n \times m$ matrices $X_k$ of rank $k$. The first additive bound for this problem was obtained in [RV07]. To the best of our knowledge the best relative bound was recently achieved in [NDT09, Theorem 1]. The latter result is not directly comparable with ours, since it uses a more restricted projection methodology and

so their bound is weaker compared to our results. The first algorithm randomly projects the rows of the input matrix onto $t$ dimensions. Here, we set $t$ to be either $\Omega(r/\varepsilon^2)$, in which case we get a $(1+\varepsilon)$ error guarantee, or $\Omega(k/\varepsilon^2)$, in which case we show a $(2 + \varepsilon\sqrt{(r-k)/k})$ error approximation. In both cases the algorithm succeeds with high probability. The second approximation algorithm samples non-uniformly $\Omega(r\log(r/\varepsilon^2)/\varepsilon^2)$ rows from $A$ in order to satisfy the $(1+\varepsilon)$ guarantee with high probability.
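As an illustration of the first (random projection) algorithm, the following numpy sketch forms $P_{RA,k}(A)$, the best rank-$k$ approximation of $A$ inside the row space of $RA$, and compares its spectral error against $\|A - A_k\|_2$; the sizes and the sketch dimension are arbitrary choices made for the example only.

```python
import numpy as np

def low_rank_via_projection(A, k, t, rng):
    """Return P_{RA,k}(A): project A onto the row space of RA, then truncate to rank k."""
    R = rng.choice([-1.0, 1.0], size=(t, A.shape[0])) / np.sqrt(t)
    Q, _ = np.linalg.qr((R @ A).T)               # orthonormal basis of rowspan(RA) in R^m
    P = A @ Q                                    # coordinates of the projected rows
    U, s, Vt = np.linalg.svd(P, full_matrices=False)
    P_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # best rank-k approximation of A Q Q^T
    return P_k @ Q.T

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    n, m, true_rank, k = 3000, 400, 30, 10
    A = rng.standard_normal((n, true_rank)) @ rng.standard_normal((true_rank, m))
    s = np.linalg.svd(A, compute_uv=False)
    baseline = s[k]                              # ||A - A_k||_2
    approx = low_rank_via_projection(A, k, t=200, rng=rng)
    print(np.linalg.norm(A - approx, 2) / baseline)   # typically a modest factor above 1
```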

The following lemma (Lemma 3.1) is essential for proving both relative error bounds of Theorem 3.4. It gives a sufficient condition that any matrix $\tilde{A}$ should satisfy in order to get a $(1+\varepsilon)$ spectral low rank matrix approximation of $A$ for every $k$, $1 \le k \le \mathrm{rank}(A)$.

Lemma 3.1. Let $A$ be an $n\times m$ matrix and $\varepsilon > 0$. If there exists a $t\times m$ matrix $\tilde{A}$ such that, for every $x \in \mathbb{R}^m$, $(1-\varepsilon)\,x^\top A^\top Ax \le x^\top\tilde{A}^\top\tilde{A}x \le (1+\varepsilon)\,x^\top A^\top Ax$, then
$$\|A - P_{\tilde{A},k}(A)\|_2 \le (1+\varepsilon)\,\|A - A_k\|_2,$$
for every $k = 1, \ldots, \mathrm{rank}(A)$.

The theorem below shows that it is possible to satisfy the conditions of Lemma 3.1 by randomly projecting $A$ onto $\Omega(r/\varepsilon^2)$ dimensions or by non-uniformly sampling $\Omega(r\log(r/\varepsilon^2)/\varepsilon^2)$ rows of $A$ i.i.d., as described in parts (i.a) and (ii), respectively.

Theorem 3.4. Let $0 < \varepsilon$ … where $c_1 > 0$, $c_2 > 1$ are constants.

Proof. (of Theorem 4.1) Let $A = U_A\Sigma_A V_A^\top$ and $B = U_B\Sigma_B V_B^\top$ be the singular value decompositions of $A$ and $B$, respectively. Notice that $U_A \in \mathbb{R}^{n\times r_A}$ and $U_B \in \mathbb{R}^{n\times r_B}$, where $r_A$ and $r_B$ are the ranks of $A$ and $B$, respectively.

Let $x_1 \in \mathbb{R}^m$, $x_2 \in \mathbb{R}^p$ be two arbitrary unit vectors. Let $w_1 = Ax_1$ and $w_2 = Bx_2$. Recall that $\|A^\top R^\top RB - A^\top B\|_2 =$


$$\sup_{x_1\in S^{m-1},\, x_2\in S^{p-1}} |x_1^\top(A^\top R^\top RB - A^\top B)x_2|.$$

We will bound the last term for arbitrary unit vectors. Denote by $\mathcal{V}$ the subspace³ spanned by $\mathrm{colspan}(U_A) \cup \mathrm{colspan}(U_B)$ in $\mathbb{R}^n$. Notice that $\dim(\mathcal{V}) \le r_A + r_B \le 2r$. Applying Lemma 1.1 to $\mathcal{V}$, we get that with probability at least $1 - c_2^{2r}\exp(-c_1\varepsilon^2 t)$,
$$(4.5)\qquad \forall\, v \in \mathcal{V}:\ \big|\, \|Rv\|_2^2 - \|v\|_2^2 \,\big| \le \varepsilon\,\|v\|_2^2.$$
Therefore we get that for any unit vectors $v_1, v_2 \in \mathcal{V}$:
$$(Rv_1)^\top Rv_2 = \frac{\|Rv_1 + Rv_2\|_2^2 - \|Rv_1 - Rv_2\|_2^2}{4} \le \frac{(1+\varepsilon)\|v_1 + v_2\|_2^2 - (1-\varepsilon)\|v_1 - v_2\|_2^2}{4}$$
$$= \frac{\|v_1 + v_2\|_2^2 - \|v_1 - v_2\|_2^2}{4} + \varepsilon\,\frac{\|v_1 + v_2\|_2^2 + \|v_1 - v_2\|_2^2}{4} = v_1^\top v_2 + \varepsilon\,\frac{\|v_1\|_2^2 + \|v_2\|_2^2}{2} = v_1^\top v_2 + \varepsilon,$$
where the first equality follows from the parallelogram law, the first inequality follows from Equation (4.5), and the last equality holds since $v_1, v_2$ are unit vectors. By similar considerations we get that $(Rv_1)^\top Rv_2 \ge v_1^\top v_2 - \varepsilon$. By linearity of $R$, we get that
$$\forall\, v_1, v_2 \in \mathcal{V}:\ |(Rv_1)^\top Rv_2 - v_1^\top v_2| \le \varepsilon\,\|v_1\|_2\|v_2\|_2.$$
Notice that $w_1, w_2 \in \mathcal{V}$, hence $|w_1^\top R^\top Rw_2 - w_1^\top w_2| \le \varepsilon\,\|w_1\|_2\|w_2\|_2 = \varepsilon\,\|Ax_1\|_2\|Bx_2\|_2$.

Part (b): We start with a technical lemma that bounds the spectral norm of any matrix $A$ when it is multiplied by a random sign matrix rescaled by $1/\sqrt{t}$.

Lemma 4.1. Let $A$ be an $n\times m$ real matrix, and let $R$ be a $t\times n$ random sign matrix rescaled by $1/\sqrt{t}$. If $t \ge \mathrm{sr}(A)$, then
$$(4.6)\qquad \mathbb{P}\left( \|RA\|_2 \ge 4\|A\|_2 \right) \le 2e^{-t/2}.$$
Proof. Without loss of generality assume that $\|A\|_2 = 1$. Then $\|A\|_F = \sqrt{\mathrm{sr}(A)}$. Let $G$ be a $t\times n$ Gaussian matrix. Then by the Gordon-Chevet inequality⁴,
$$\mathbb{E}\|GA\|_2 \le \|I_t\|_2\|A\|_F + \|I_t\|_F\|A\|_2 = \|A\|_F + \sqrt{t} \le 2\sqrt{t}.$$

³ We denote by $\mathrm{colspan}(A)$ the subspace generated by the columns of $A$, and by $\mathrm{rowspan}(A)$ the subspace generated by the rows of $A$.

⁴ For example, set $S = I_t$, $T = A$ in [HMT09, Proposition 10.1, p. 54].

The Gaussian distribution is symmetric, so $G_{ij}$ and $\sqrt{t}\,R_{ij}|G_{ij}|$, where $G_{ij}$ is a standard Gaussian random variable, have the same distribution. By Jensen's inequality and the fact that $\mathbb{E}|G_{ij}| = \sqrt{2/\pi}$, we get that $\sqrt{2/\pi}\;\mathbb{E}\|RA\|_2 \le \mathbb{E}\|GA\|_2/\sqrt{t}$. Define the function $f : \{\pm 1\}^{t\times n} \to \mathbb{R}$ by $f(S) = \frac{1}{\sqrt{t}}\|SA\|_2$. The calculation above shows that $\mathrm{median}(f) \le 2$. Since $f$ is convex and $(1/\sqrt{t})$-Lipschitz as a function of the entries of $S$, Talagrand's measure concentration inequality for convex functions yields
$$\mathbb{P}\left( \|RA\|_2 \ge \mathrm{median}(f) + \varepsilon \right) \le 2\exp(-\varepsilon^2 t/2).$$
Setting $\varepsilon = 1$ in the above inequality implies the lemma.

Now using the above lemma together with Theorem 3.2 (i.a) and a simple truncation argument we can prove part (i.b).

Proof. (of Theorem 3.2 (i.b)) Without loss of generality assume that $\|A\|_2 = \|B\|_2 = 1$. Set $\tilde{r} = 1600\max\{\mathrm{sr}(A), \mathrm{sr}(B)\}/\varepsilon^2$. Set $\bar{A} = A - A_{\tilde{r}}$ and $\bar{B} = B - B_{\tilde{r}}$. Since $\|A\|_F^2 = \sum_{j=1}^{\mathrm{rank}(A)}\sigma_j(A)^2$, we have $\|\bar{A}\|_2 \le \|A\|_F/\sqrt{\tilde{r}} \le \varepsilon/40$, and similarly $\|\bar{B}\|_2 \le \|B\|_F/\sqrt{\tilde{r}} \le \varepsilon/40$.

By the triangle inequality, it follows that
$$\|\tilde{A}^\top\tilde{B} - A^\top B\|_2 \le \|A_{\tilde{r}}^\top R^\top RB_{\tilde{r}} - A_{\tilde{r}}^\top B_{\tilde{r}}\|_2 \qquad (4.7)$$
$$+\ \|\bar{A}^\top R^\top RB_{\tilde{r}}\|_2 + \|A_{\tilde{r}}^\top R^\top R\bar{B}\|_2 + \|\bar{A}^\top R^\top R\bar{B}\|_2 \qquad (4.8)$$
$$+\ \|\bar{A}^\top B_{\tilde{r}}\|_2 + \|A_{\tilde{r}}^\top\bar{B}\|_2 + \|\bar{A}^\top\bar{B}\|_2. \qquad (4.9)$$
Choose the constant in Theorem 3.2 (i.a) so that the failure probability of the bound on the right hand side of (4.7) does not exceed $\exp(-c\varepsilon^2 t)$, where $c = c_1/32$. The same argument shows that $\mathbb{P}(\|RA_{\tilde{r}}\|_2 \ge 1+\varepsilon) \le \exp(-c\varepsilon^2 t)$ and $\mathbb{P}(\|RB_{\tilde{r}}\|_2 \ge 1+\varepsilon) \le \exp(-c\varepsilon^2 t)$. This, combined with Lemma 4.1 applied to $\bar{A}$ and $\bar{B}$, yields that the sum in (4.8) is less than $2(1+\varepsilon)\varepsilon/10 + \varepsilon^2/100$. Also, since $\|A_{\tilde{r}}\|_2, \|B_{\tilde{r}}\|_2 \le 1$, the sum in (4.9) is less than $2\varepsilon/10 + \varepsilon^2/100$. Combining the bounds for (4.7), (4.8) and (4.9) concludes the claim.

Row Sampling - Part (ii): By homogeneity, normalize $A$ and $B$ such that $\|A\|_2 = \|B\|_2 = 1$. Notice that $A^\top B = \sum_{i=1}^{n} A^{(i)\top}B^{(i)}$. Define $p_i = \frac{\|A^{(i)}\|_2\|B^{(i)}\|_2}{S}$, where $S = \sum_{i=1}^{n}\|A^{(i)}\|_2\|B^{(i)}\|_2$. Also

define a distribution over matrices in $\mathbb{R}^{(m+p)\times(m+p)}$ with $n$ elements by
$$\mathbb{P}\left( M = \frac{1}{p_i}\begin{bmatrix} 0 & A^{(i)\top}B^{(i)} \\ B^{(i)\top}A^{(i)} & 0 \end{bmatrix} \right) = p_i.$$
First notice that
$$\mathbb{E}M = \sum_{i=1}^{n} \frac{1}{p_i}\begin{bmatrix} 0 & A^{(i)\top}B^{(i)} \\ B^{(i)\top}A^{(i)} & 0 \end{bmatrix} p_i = \sum_{i=1}^{n}\begin{bmatrix} 0 & A^{(i)\top}B^{(i)} \\ B^{(i)\top}A^{(i)} & 0 \end{bmatrix} = \begin{bmatrix} 0 & A^\top B \\ B^\top A & 0 \end{bmatrix}.$$
This implies that $\|\mathbb{E}M\|_2 = \|A^\top B\|_2 \le 1$. Next notice that the spectral norm of the random matrix $M$ is upper bounded by $\sqrt{\mathrm{sr}(A)\,\mathrm{sr}(B)}$ almost surely. Indeed,
$$\|M\|_2 \le \sup_{i\in[n]} \left\| \frac{A^{(i)\top}B^{(i)}}{p_i} \right\|_2 = S\,\sup_{i\in[n]} \frac{\|A^{(i)\top}B^{(i)}\|_2}{\|A^{(i)}\|_2\|B^{(i)}\|_2} = S\cdot 1 = \sum_{i=1}^{n}\|A^{(i)}\|_2\|B^{(i)}\|_2 \le \|A\|_F\|B\|_F = \sqrt{\mathrm{sr}(A)\,\mathrm{sr}(B)} \le \frac{\mathrm{sr}(A)+\mathrm{sr}(B)}{2},$$
by the definition of $p_i$, properties of norms, the Cauchy-Schwarz inequality, and the arithmetic/geometric mean inequality. Notice that this quantity (since the spectral norms of both $A$ and $B$ are one) is at most $\tilde{r}$ by assumption. Also notice that every element in the support of the random variable $M$ has rank at most two. It is easy to see that, by setting $\gamma = \tilde{r}$, all the conditions of Theorem 1.1 are satisfied, and hence we get indices $i_1, i_2, \ldots, i_t$ from $[n]$, with $t = \Omega(\tilde{r}\log(\tilde{r}/\varepsilon^2)/\varepsilon^2)$, such that with high probability
$$\left\| \frac{1}{t}\sum_{j=1}^{t} \begin{bmatrix} 0 & \frac{1}{p_{i_j}}A^{(i_j)\top}B^{(i_j)} \\ \frac{1}{p_{i_j}}B^{(i_j)\top}A^{(i_j)} & 0 \end{bmatrix} - \begin{bmatrix} 0 & A^\top B \\ B^\top A & 0 \end{bmatrix} \right\|_2 \le \varepsilon.$$
The first sum can be rewritten as $\tilde{A}^\top\tilde{B}$, where
$$\tilde{A}^\top = \frac{1}{\sqrt{t}}\left[ \frac{1}{\sqrt{p_{i_1}}}A^{(i_1)\top}\ \ \frac{1}{\sqrt{p_{i_2}}}A^{(i_2)\top}\ \ \ldots\ \ \frac{1}{\sqrt{p_{i_t}}}A^{(i_t)\top} \right] \quad\text{and}\quad \tilde{B}^\top = \frac{1}{\sqrt{t}}\left[ \frac{1}{\sqrt{p_{i_1}}}B^{(i_1)\top}\ \ \frac{1}{\sqrt{p_{i_2}}}B^{(i_2)\top}\ \ \ldots\ \ \frac{1}{\sqrt{p_{i_t}}}B^{(i_t)\top} \right].$$
This concludes the theorem.

4.2 Proof of Theorem 3.3 ($\ell_2$-regression)

Proof. (of Theorem 3.3) The proof is similar to the proof in [Sar06]. Let $A = U\Sigma V^\top$ be the SVD of $A$. Let $b = Ax_{\mathrm{opt}} + w$, where $w \in \mathbb{R}^n$ and $w \perp \mathrm{colspan}(A)$. Also let $A(\tilde{x}_{\mathrm{opt}} - x_{\mathrm{opt}}) = Uy$, where $y \in \mathbb{R}^{\mathrm{rank}(A)}$. Our goal is to bound the quantity
$$\|b - A\tilde{x}_{\mathrm{opt}}\|_2^2 = \|b - Ax_{\mathrm{opt}} - A(\tilde{x}_{\mathrm{opt}} - x_{\mathrm{opt}})\|_2^2 = \|w - Uy\|_2^2 = \|w\|_2^2 + \|Uy\|_2^2 = \|w\|_2^2 + \|y\|_2^2, \qquad (4.10)$$
where we used that $w \perp \mathrm{colspan}(U)$ and that $U^\top U = I$. It suffices to bound the norm of $y$, i.e., to show that $\|y\|_2 \le 4\varepsilon\|w\|_2$. Recall that given $A$ and $b$ the vector $w$ is uniquely defined. On the other hand, the vector $y$ depends on the random projection $R$. Next we show the connection between $y$ and $w$ through the normal equations:
$$RA\tilde{x}_{\mathrm{opt}} = Rb + w_2 \ \Longrightarrow\ RA\tilde{x}_{\mathrm{opt}} = R(Ax_{\mathrm{opt}} + w) + w_2 \ \Longrightarrow\ RA(\tilde{x}_{\mathrm{opt}} - x_{\mathrm{opt}}) = Rw + w_2$$
$$\Longrightarrow\ U^\top R^\top RUy = U^\top R^\top Rw + U^\top R^\top w_2 \ \Longrightarrow\ U^\top R^\top RUy = U^\top R^\top Rw, \qquad (4.11)$$
where $w_2 \perp \mathrm{colspan}(RA)$, and we used this fact to derive Ineq. (4.11). A crucial observation is that $\mathrm{colspan}(U)$ is perpendicular to $w$. Set $A = B = U$ in Theorem 3.2 (i.a) with $t = \Omega(r/\varepsilon^2)$. Notice that $\mathrm{rank}(A) + \mathrm{rank}(B) \le 2r$, hence with constant probability we know that $1 - \varepsilon \le \sigma_i(RU) \le 1 + \varepsilon$. It follows that $\|U^\top R^\top RUy\|_2 \ge (1-\varepsilon)^2\|y\|_2$. A similar argument (set $A = U$ and $B = w$ in Theorem 3.2) guarantees that $\|U^\top R^\top Rw\|_2 = \|U^\top R^\top Rw - U^\top w\|_2 \le \varepsilon\|U\|_2\|w\|_2 = \varepsilon\|w\|_2$. Recall that $\|U\|_2 = 1$, since $U^\top U = I$. Therefore, taking Euclidean norms on both sides of Equation (4.11), we get that with high probability
$$\|y\|_2 \le \frac{\varepsilon}{(1-\varepsilon)^2}\|w\|_2 \le 4\varepsilon\|w\|_2.$$
Summing up, it follows from Equation (4.10) that, with constant probability, $\|b - A\tilde{x}_{\mathrm{opt}}\|_2^2 \le (1 + 16\varepsilon^2)\,\|b - Ax_{\mathrm{opt}}\|_2^2 \le (1 + 16\varepsilon)\,\|b - Ax_{\mathrm{opt}}\|_2^2$. This proves Ineq. (3.3).

Ineq. (3.4) follows directly from the bound on the norm of $y$ by repeating the above proof. First recall that $x_{\mathrm{opt}}$ is in the row span of $A$, since $x_{\mathrm{opt}} = V\Sigma^{-1}U^\top b$ and the columns of $V$ span the row space of $A$. Similarly for $\tilde{x}_{\mathrm{opt}}$, since the row span of $RA$ is contained in the row span of $A$. Indeed, $4\varepsilon\|w\|_2 \ge \|y\|_2 = \|Uy\|_2 = \|A(\tilde{x}_{\mathrm{opt}} - x_{\mathrm{opt}})\|_2 \ge \sigma_{\min}(A)\,\|\tilde{x}_{\mathrm{opt}} - x_{\mathrm{opt}}\|_2$.
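To connect the proof back to the algorithm it analyzes, here is a minimal numpy sketch of the randomized regression scheme: solve the projected problem $\min_x\|R(Ax - b)\|_2$ and compare its residual with the optimum. The sizes, the noise level, and the sketch dimension are illustrative choices only.

```python
import numpy as np

def sketched_least_squares(A, b, t, rng):
    """Solve min_x ||R(Ax - b)||_2 for a rescaled random sign matrix R."""
    R = rng.choice([-1.0, 1.0], size=(t, A.shape[0])) / np.sqrt(t)
    x_sketch, *_ = np.linalg.lstsq(R @ A, R @ b, rcond=None)
    return x_sketch

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    n, m, r = 20000, 100, 20
    A = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))   # rank-r constraint matrix
    b = A @ rng.standard_normal(m) + 0.1 * rng.standard_normal(n)
    x_opt, *_ = np.linalg.lstsq(A, b, rcond=None)
    x_tilde = sketched_least_squares(A, b, t=400, rng=rng)          # t on the order of r / eps^2
    ratio = np.linalg.norm(A @ x_tilde - b) / np.linalg.norm(A @ x_opt - b)
    print(ratio)                                                    # close to 1
```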


4.3 Proof of Theorems 3.4, 3.5 (Spectral Low Rank Matrix Approximation)

Proof. (of Lemma 3.1) By the assumption and using Lemma 5.1 we get that
$$(4.12)\qquad (1-\varepsilon)\lambda_i(A^\top A) \le \lambda_i(\tilde{A}^\top\tilde{A}) \le (1+\varepsilon)\lambda_i(A^\top A)$$
for all $i = 1, \ldots, \mathrm{rank}(A)$. Let $\tilde{\Pi}_k$ be the projection matrix onto the first $k$ right singular vectors of $\tilde{A}$, i.e., $\tilde{\Pi}_k = (\tilde{A}_k)^\dagger\tilde{A}_k$. It follows that for every $k = 1, \ldots, \mathrm{rank}(A)$,
$$\|A - P_{\tilde{A},k}(A)\|_2^2 \le \|A(I - \tilde{\Pi}_k)\|_2^2 = \sup_{x\in\mathbb{R}^m,\,\|x\|=1}\|A(I - \tilde{\Pi}_k)x\|_2^2 = \sup_{x\in\ker\tilde{\Pi}_k,\,\|x\|=1}\|Ax\|_2^2 = \sup_{x\in\ker\tilde{\Pi}_k,\,\|x\|=1} x^\top A^\top Ax$$
$$\le (1+\varepsilon)\sup_{x\in\ker\tilde{\Pi}_k,\,\|x\|_2=1} x^\top\tilde{A}^\top\tilde{A}x = (1+\varepsilon)\,\lambda_{k+1}(\tilde{A}^\top\tilde{A}) \le (1+\varepsilon)^2\lambda_{k+1}(A^\top A) = (1+\varepsilon)^2\|A - A_k\|_2^2,$$
using that $x \in \ker\tilde{\Pi}_k$ implies $(I - \tilde{\Pi}_k)x = x$, the left side of the hypothesis, Courant-Fischer applied to $\tilde{A}^\top\tilde{A}$ (see Eqn. (5.17)), Eqn. (4.12), and properties of singular values, respectively.

Proof of Theorem 3.4 (i):

Part (a): We are now ready to prove the first corollary of our matrix multiplication result, for the problem of computing an approximate low rank matrix approximation of a matrix with respect to the spectral norm (Theorem 3.4).

Proof. Set $\tilde{A} = \frac{1}{\sqrt{t}}RA$, where $R$ is an $\Omega(r/\varepsilon^2)\times n$ random sign matrix. Applying Theorem 3.2 (i.a) to $A$, we have with high probability that
$$(4.13)\qquad \forall\, x \in \mathbb{R}^m:\ (1-\varepsilon)\,x^\top A^\top Ax \le x^\top\tilde{A}^\top\tilde{A}x \le (1+\varepsilon)\,x^\top A^\top Ax.$$
Combining Lemma 3.1 with Ineq. (4.13) concludes the proof.

Part (b): The proof is based on the following lemma, which reduces the problem of low rank matrix approximation to the problem of bounding the norm of a random matrix. We restate it here for the reader's convenience and for completeness [NDT09, Lemma 8] (see also [HMT09, Theorem 9.1] or [BMD09]).

Lemma 4.2. Let $A = A_k + U_{r-k}\Sigma_{r-k}V_{r-k}^\top$, $H_k = U_{r-k}\Sigma_{r-k}$, and let $R$ be any $t\times n$ matrix. If the matrix $RU_k$ has full column rank, then the following inequality holds:
$$(4.14)\qquad \|A - P_{RA,k}(A)\|_2 \le 2\|A - A_k\|_2 + \|(RU_k)^\dagger RH_k\|_2.$$
Notice that the above lemma reduces the problem of spectral low rank matrix approximation to the problem of approximating the spectral norm of the random matrix $(RU_k)^\dagger RH_k$.

First notice that by setting $t = \Omega(k/\varepsilon^2)$ we can guarantee that the matrix $RU_k$ has full column rank with high probability. Actually, we can say something much stronger; applying Theorem 3.2 (i.a) with $A = U_k$ we can guarantee that all its singular values are within $1 \pm \varepsilon$ with high probability. Now, conditioning on the above event ($RU_k$ has full column rank), it follows from Lemma 4.2 that
$$\|A - P_{RA,k}(A)\|_2 \le 2\|A - A_k\|_2 + \|(RU_k)^\dagger RH_k\|_2 \le 2\|A - A_k\|_2 + \|(RU_k)^\dagger\|_2\|RH_k\|_2$$
$$\le 2\|A - A_k\|_2 + \frac{1}{1-\varepsilon}\|RH_k\|_2 \le 2\|A - A_k\|_2 + \frac{3}{2}\|RU_{r-k}\|_2\,\|\Sigma_{r-k}\|_2,$$
using the sub-multiplicativity of matrix norms and that $\varepsilon < 1/3$. Now, it suffices to bound the norm of $W := RU_{r-k}$. Recall that $R = \frac{1}{\sqrt{t}}G$, where $G$ is a $t\times n$ random Gaussian matrix. It is well known that the random matrix $GU_{r-k}$ (by rotational invariance of the Gaussian distribution) has entries which are also i.i.d. Gaussian random variables. Now, we can use the following fact about random sub-Gaussian matrices to give a bound on the spectral norm of $W$. Indeed, we have the following.

Theorem 4.2. [RV09, Proposition 2.3] Let $W$ be a $t\times(r-k)$ random matrix whose entries are independent mean-zero Gaussian random variables. Assume that $r - k \ge t$. Then
$$(4.15)\qquad \mathbb{P}\left( \|W\|_2 \ge \Delta\sqrt{r-k} \right) \le e^{-c_0\Delta^2(r-k)}$$
for any $\Delta > 0$, where $c_0$ is a positive constant.

Applying a union bound over the above theorem, with $\Delta$ a sufficiently large constant, and over the conditions of Lemma 4.2, we get that with high probability $\|W\|_2 \le C_3\sqrt{r-k}$ and $\|(RU_k)^\dagger\|_2 \le 1/(1-\varepsilon)$. Hence, Lemma 4.2 combined with the above discussion implies that

$$\|A - P_{RA,k}(A)\|_2 \le 2\|A - A_k\|_2 + \frac{3}{2}\|RU_{r-k}\|_2\,\|A - A_k\|_2 = 2\|A - A_k\|_2 + \frac{3}{2\sqrt{t}}\|GU_{r-k}\|_2\,\|A - A_k\|_2 \le \left( 2 + c_4\,\varepsilon\sqrt{\frac{r-k}{k}} \right)\|A - A_k\|_2,$$
where $c_4 > 0$ is an absolute constant. Rescaling $\varepsilon$ by $c_4$ concludes Theorem 3.4 (i.b).

Proof of Theorem 3.4 (ii). Here we prove that we can achieve the same relative error bound as with random projections by just sampling rows of $A$ through a judiciously selected distribution. However, there is a price to pay, and that is an extra logarithmic factor in the number of samples, as stated in Theorem 3.4, part (ii).

Proof. (of Theorem 3.4 (ii)) The proof follows closely the proof of [SS08], and is similar to the proof of part (a). Let $A = U\Sigma V^\top$ be the singular value decomposition of $A$. Define the projector matrix $\Pi = UU^\top$ of size $n\times n$. Clearly, the rank of $\Pi$ is equal to the rank of $A$, and $\Pi$ has the same image as $A$, since every element in the image of $A$ and of $\Pi$ is a linear combination of columns of $U$. Recall that for any projection matrix $\Pi^2 = \Pi$, and hence $\mathrm{sr}(\Pi) = \mathrm{rank}(A) = r$. Moreover, $\sum_{i=1}^{n}\|U^{(i)}\|_2^2 = \mathrm{tr}(UU^\top) = \mathrm{tr}(\Pi) = \mathrm{tr}(\Pi^2) = r$. Let $p_i = \Pi(i,i)/r = \|U^{(i)}\|_2^2/r$ be a probability distribution on $[n]$, where $U^{(i)}$ is the $i$-th row of $U$.

Define a $t\times n$ random matrix $S$ as follows: pick $t$ samples from $p_i$; if the $i$-th sample is equal to $j$ ($\in[n]$) then set $S_{ij} = 1/\sqrt{t\,p_j}$. Notice that $S$ has exactly one non-zero entry in each row, hence it has $t$ non-zero entries. Define $\tilde{A} = SA$.

It is easy to verify that $\mathbb{E}[\Pi S^\top S\Pi] = \Pi^2 = \Pi$. Apply Theorem 1.1 (alternatively we can use [RV07, Theorem 3.1], since the matrix samples are rank one) to the matrix $\Pi$; notice that $\|\Pi\|_F^2 = r$, $\|\Pi\|_2 = 1$ and $\|\mathbb{E}\Pi S^\top S\Pi\|_2 \le 1$, hence the stable rank of $\Pi$ is $r$. Therefore, if $t = \Omega(r\log(r/\varepsilon^2)/\varepsilon^2)$ then with high probability
$$(4.16)\qquad \|\Pi S^\top S\Pi - \Pi\|_2 \le \varepsilon.$$
It suffices to show that Ineq. (4.16) is equivalent to the condition of Lemma 3.1. Indeed,
$$\varepsilon \ge \sup_{x\in\mathbb{R}^n,\, x\neq 0}\frac{|x^\top(\Pi S^\top S\Pi - \Pi)x|}{x^\top x} \ge \sup_{x\perp\ker\Pi,\, x\neq 0}\frac{|x^\top(\Pi S^\top S\Pi - \Pi)x|}{x^\top x} = \sup_{y\in\mathrm{Im}(A),\, y\neq 0}\frac{|y^\top(S^\top S - I)y|}{y^\top y}$$
$$\ge \sup_{x\in\mathbb{R}^m,\, Ax\neq 0}\frac{|x^\top A^\top(S^\top S - I)Ax|}{x^\top A^\top Ax} = \sup_{x\in\mathbb{R}^m,\, Ax\neq 0}\frac{|x^\top(A^\top S^\top SA - A^\top A)x|}{x^\top A^\top Ax} = \sup_{x\in\mathbb{R}^m,\, Ax\neq 0}\frac{|x^\top(\tilde{A}^\top\tilde{A} - A^\top A)x|}{x^\top A^\top Ax},$$
since $x \perp \ker\Pi$ implies $x \in \mathrm{Im}(A)$, $\mathrm{Im}(A) = \mathrm{Im}(\Pi)$, and $\Pi A = A$. By rearranging terms we get Equation (4.13), and so the claim follows.

Proof of Theorem 3.5. The proof is similar to that of Theorem 3.4 (i.b). By following the proof of part (i.b), conditioning on the event that $RU_k$ has full column rank in Lemma 4.2, we get with high probability that
$$\|A - P_{\tilde{A},k}(A)\|_2 \le 2\|A - A_k\|_2 + \frac{\|U_k^\top R^\top RH_k\|_2}{(1-\varepsilon)^2},$$
using the fact that if $RU_k$ has full column rank then $(RU_k)^\dagger = ((RU_k)^\top RU_k)^{-1}U_k^\top R^\top$ and $\|((RU_k)^\top RU_k)^{-1}\|_2 \le 1/(1-\varepsilon)^2$. Now observe that $U_k^\top H_k = 0$. Since $\mathrm{sr}(H_k) \le k$, using Theorem 3.2 (i.b) with $t = \Omega(k/\varepsilon^4)$ we get that $\|U_k^\top R^\top RH_k\|_2 = \|U_k^\top R^\top RH_k - U_k^\top H_k\|_2 \le \varepsilon\|U_k\|_2\|H_k\|_2 = \varepsilon\|A - A_k\|_2$ with high probability. Rescaling $\varepsilon$ concludes the proof.

5 Acknowledgments

Many thanks go to Petros Drineas for many helpful discussions and for pointing out the connection of Theorem 3.2 with the $\ell_2$-regression problem. The second author would like to thank Mark Rudelson for his valuable comments on an earlier draft and also for sharing with us the proof of Theorem 3.2 (i.b).


    References

[AHK06] S. Arora, E. Hazan, and S. Kale. A Fast Random Sampling Algorithm for Sparsifying Matrices. In Proceedings of the International Workshop on Randomization and Approximation Techniques (RANDOM), pages 272–279, 2006. (Cited on page 2)

[AM07] D. Achlioptas and F. McSherry. Fast Computation of Low-rank Matrix Approximations. Journal of the ACM (JACM), 54(2):9, 2007. (Cited on pages 1, 2 and 5)

[AW02] R. Ahlswede and A. Winter. Strong Converse for Identification via Quantum Channels. IEEE Transactions on Information Theory, 48(3):569–579, 2002. (Cited on pages 1 and 2)

[Bha96] R. Bhatia. Matrix Analysis, volume 169 of Graduate Texts in Mathematics. Springer, first edition, 1996. (Cited on pages 3 and 13)

[BMD09] C. Boutsidis, M. W. Mahoney, and P. Drineas. An Improved Approximation Algorithm for the Column Subset Selection Problem. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 968–977, 2009. (Cited on page 9)

[Buc01] A. Buchholz. Operator Khintchine Inequality in Non-commutative Probability. Mathematische Annalen, 319(1):1–16, January 2001. (Cited on pages 3 and 13)

[Cla08] K. L. Clarkson. Tighter Bounds for Random Projections of Manifolds. In Proceedings of the ACM Symposium on Computational Geometry (SoCG), pages 39–48, 2008. (Cited on page 2)

[CR07] E. Candes and J. Romberg. Sparsity and Incoherence in Compressive Sampling. Inverse Problems, 23(3):969, 2007. (Cited on page 2)

[CW09] K. L. Clarkson and D. P. Woodruff. Numerical Linear Algebra in the Streaming Model. In Proceedings of the Symposium on Theory of Computing (STOC), pages 205–214, 2009. (Cited on pages 1, 3 and 5)

[DK03] P. Drineas and R. Kannan. Pass Efficient Algorithms for Approximating Large Matrices. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 223–232, 2003. (Cited on page 5)

[DKM06a] P. Drineas, R. Kannan, and M. W. Mahoney. Fast Monte Carlo Algorithms for Matrices I: Approximating Matrix Multiplication. Journal of the ACM (JACM), 36(1):132–157, 2006. (Cited on pages 1, 2 and 3)

[DKM06b] P. Drineas, R. Kannan, and M. W. Mahoney. Fast Monte Carlo Algorithms for Matrices II: Computing a Low-Rank Approximation to a Matrix. Journal of the ACM (JACM), 36(1):158–183, 2006. (Cited on page 3)

[DKM06c] P. Drineas, R. Kannan, and M. W. Mahoney. Fast Monte Carlo Algorithms for Matrices III: Computing a Compressed Approximate Matrix Decomposition. Journal of the ACM (JACM), 36(1):184–206, 2006. (Cited on page 3)

[DM10] P. Drineas and M. W. Mahoney. Effective Resistances, Statistical Leverage, and Applications to Linear Equation Solving. Available at arxiv:1005.3097, May 2010. (Cited on page 6)

[DMM06] P. Drineas, M. W. Mahoney, and S. Muthukrishnan. Sampling Algorithms for $\ell_2$-regression and Applications. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1127–1136, 2006. (Cited on pages 1 and 5)

[DMMS09] P. Drineas, M. W. Mahoney, S. Muthukrishnan, and T. Sarlos. Faster Least Squares Approximation. Available at arxiv:0710.1435, May 2009. (Cited on page 5)

[DR10] A. Deshpande and L. Rademacher. Efficient Volume Sampling for Row/column Subset Selection. In Proceedings of the Symposium on Foundations of Computer Science (FOCS), 2010. (Cited on page 1)

[DRVW06] A. Deshpande, L. Rademacher, S. Vempala, and G. Wang. Matrix Approximation and Projective Clustering via Volume Sampling. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1117–1126, 2006. (Cited on page 5)

[DZ10] P. Drineas and A. Zouzias. A Note on Element-wise Matrix Sparsification via Matrix-valued Chernoff Bounds. Available at arxiv:1006.0407, June 2010. (Cited on page 2)

[FKV04] A. Frieze, R. Kannan, and S. Vempala. Fast Monte-Carlo Algorithms for Finding Low-rank Approximations. Journal of the ACM (JACM), 51(6):1025–1041, 2004. (Cited on pages 1, 3 and 5)

[GLF+09] D. Gross, Y.-K. Liu, S. T. Flammia, S. Becker, and J. Eisert. Quantum State Tomography via Compressed Sensing. Available at arxiv:0909.3304, September 2009. (Cited on pages 1 and 4)

[Gro09] D. Gross. Recovering Low-rank Matrices from Few Coefficients in any Basis. Available at arxiv:0910.1879, December 2009. (Cited on pages 1 and 2)

[GV96] G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, third edition, October 1996. (Cited on pages 3 and 12)

[HMT09] N. Halko, P. G. Martinsson, and J. A. Tropp. Finding Structure with Randomness: Stochastic Algorithms for Constructing Approximate Matrix Decompositions. Available at arxiv:0909.4061, September 2009. (Cited on pages 5, 6, 7 and 9)

[LP86] F. Lust-Piquard. Inégalités de Khintchine dans $C_p$ ($1 < p < \infty$). C. R. Acad. Sci. Paris Sér. I Math., 303(7):289–292, 1986. (Cited on pages 3 and 13)

[LPP91] F. Lust-Piquard and G. Pisier. Non Commutative Khintchine and Paley Inequalities. Arkiv för Matematik, 29(1-2):241–260, December 1991. (Cited on page 13)

[LT91] M. Ledoux and M. Talagrand. Probability in Banach Spaces, volume 23 of Ergebnisse der Mathematik und ihrer Grenzgebiete (3). Springer-Verlag, 1991. Isoperimetry and Processes. (Cited on page 13)

[Mag07] A. Magen. Dimensionality Reductions in $\ell_2$ that Preserve Volumes and Distance to Affine Spaces. Discrete & Computational Geometry, 38(1):139–153, 2007. (Cited on page 2)

[Mil71] V. D. Milman. A new Proof of A. Dvoretzky's Theorem on Cross-sections of Convex Bodies. Funkcional. Anal. i Prilozhen., 5(4):28–37, 1971. (Cited on page 2)

[NDT09] N. H. Nguyen, T. T. Do, and T. D. Tran. A Fast and Efficient Algorithm for Low-rank Approximation of a Matrix. In Proceedings of the Symposium on Theory of Computing (STOC), pages 215–224, 2009. (Cited on pages 3, 5, 6, 9 and 13)

[Nem07] A. Nemirovski. Sums of Random Symmetric Matrices and Quadratic Optimization under Orthogonality Constraints. Mathematical Programming, 109(2):283–317, 2007. (Cited on page 2)

[Rec09] B. Recht. A Simpler Approach to Matrix Completion. Available at arxiv:0910.0651, October 2009. (Cited on pages 1, 2 and 4)

[RST09] V. Rokhlin, A. Szlam, and M. Tygert. A Randomized Algorithm for Principal Component Analysis. SIAM Journal on Matrix Analysis and Applications, 31(3):1100–1124, 2009. (Cited on page 5)

[Rud99] M. Rudelson. Random Vectors in the Isotropic Position. J. Funct. Anal., 164(1):60–72, 1999. (Cited on pages 3 and 13)

[RV07] M. Rudelson and R. Vershynin. Sampling from Large Matrices: An Approach through Geometric Functional Analysis. Journal of the ACM (JACM), 54(4):21, 2007. (Cited on pages 2, 3, 4, 5, 10 and 13)

[RV09] M. Rudelson and R. Vershynin. The Smallest Singular Value of a Random Rectangular Matrix. Communications on Pure and Applied Mathematics, 62(1-2):1707–1739, 2009. (Cited on page 9)

[Sar06] T. Sarlos. Improved Approximation Algorithms for Large Matrices via Random Projections. In Proceedings of the Symposium on Foundations of Computer Science (FOCS), pages 143–152, 2006. (Cited on pages 1, 2, 3, 5 and 8)

[So09a] A. Man-Cho So. Improved Approximation Bound for Quadratic Optimization Problems with Orthogonality Constraints. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1201–1209, 2009. (Cited on page 2)

[So09b] A. Man-Cho So. Moment Inequalities for Sums of Random Matrices and their Applications in Optimization. Mathematical Programming, December 2009. (Cited on page 2)

[SS90] G. W. Stewart and J. G. Sun. Matrix Perturbation Theory (Computer Science and Scientific Computing). Academic Press, June 1990. (Cited on page 3)

[SS08] D. A. Spielman and N. Srivastava. Graph Sparsification by Effective Resistances. In Proceedings of the Symposium on Theory of Computing (STOC), pages 563–568, 2008. (Cited on pages 6 and 10)

[Tro10] J. A. Tropp. User-Friendly Tail Bounds for Sums of Random Matrices. Available at arxiv:1004.4389, April 2010. (Cited on pages 1 and 4)

[WX08] A. Wigderson and D. Xiao. Derandomizing the Ahlswede-Winter Matrix-valued Chernoff Bound using Pessimistic Estimators, and Applications. Theory of Computing, 4(1):53–76, 2008. (Cited on pages 2 and 4)

    Appendix

The next lemma states that if a symmetric positive semi-definite matrix $\tilde{A}$ approximates the Rayleigh quotient of a symmetric positive semi-definite matrix $A$, then the eigenvalues of $\tilde{A}$ also approximate the eigenvalues of $A$.

Lemma 5.1. Let $0 < \varepsilon < 1$. Assume $A$ and $\tilde{A}$ are $n\times n$ symmetric positive semi-definite matrices such that the following inequality holds:
$$(1-\varepsilon)\,x^\top Ax \le x^\top\tilde{A}x \le (1+\varepsilon)\,x^\top Ax, \qquad \forall\, x \in \mathbb{R}^n.$$
Then, for $i = 1, \ldots, n$ the eigenvalues of $A$ and $\tilde{A}$ are the same up to an $\varepsilon$ error factor, i.e.,
$$(1-\varepsilon)\lambda_i(A) \le \lambda_i(\tilde{A}) \le (1+\varepsilon)\lambda_i(A).$$
Proof. The proof is an immediate consequence of the Courant-Fischer characterization of the eigenvalues. First notice that, by hypothesis, $A$ and $\tilde{A}$ have the same null space. Hence we can assume, without loss of generality, that $\lambda_i(A), \lambda_i(\tilde{A}) > 0$ for all $i = 1, \ldots, n$. Let $\lambda_i(A)$ and $\lambda_i(\tilde{A})$ be the eigenvalues (in non-decreasing order) of $A$ and $\tilde{A}$, respectively. The Courant-Fischer min-max theorem [GV96, p. 394] expresses the eigenvalues as
$$(5.17)\qquad \lambda_i(A) = \min_{S_i}\max_{x\in S_i}\frac{x^\top Ax}{x^\top x},$$
where the minimum is over all $i$-dimensional subspaces $S_i$.
Let $S_i^0$ and $S_i^1$ be subspaces where the minimum is achieved for the eigenvalues of $A$ and $\tilde{A}$, respectively. Then it follows that
$$\lambda_i(\tilde{A}) = \min_{S_i}\max_{x\in S_i}\frac{x^\top\tilde{A}x}{x^\top x} \le \max_{x\in S_i^0}\frac{x^\top\tilde{A}x}{x^\top Ax}\cdot\frac{x^\top Ax}{x^\top x} \le (1+\varepsilon)\lambda_i(A),$$
and similarly,
$$\lambda_i(A) = \min_{S_i}\max_{x\in S_i}\frac{x^\top Ax}{x^\top x} \le \max_{x\in S_i^1}\frac{x^\top Ax}{x^\top\tilde{A}x}\cdot\frac{x^\top\tilde{A}x}{x^\top x} \le \frac{\lambda_i(\tilde{A})}{1-\varepsilon}.$$
Therefore, it follows that for $i = 1, \ldots, n$,
$$(1-\varepsilon)\lambda_i(A) \le \lambda_i(\tilde{A}) \le (1+\varepsilon)\lambda_i(A).$$

Proof of Theorem 1.1. For notational convenience,

let $Z = \left\|\frac{1}{t}\sum_{i=1}^{t}M_i - \mathbb{E}M\right\|_2$ and define $E_p := \mathbb{E}_{M_1,M_2,\ldots,M_t}\, Z^p$. Moreover, if $X_1, X_2, \ldots, X_n$ are copies of a (matrix-valued) random variable $X$, we will denote $\mathbb{E}_{X_1,X_2,\ldots,X_n}$ by $\mathbb{E}_{X_{[n]}}$. Our goal is to give sharp bounds on the moments of the non-negative random


variable $Z$ and then use the moment method to give a concentration result for $Z$.

First we give a technical lemma of independent interest that bounds the $p$-th moment of $Z$ as a function of $p$, $r$ (the rank of the samples), and the $p/2$-th moment of the random variable $\|\sum_{j=1}^{t}M_j^2\|_2$. More formally, we have the following.

Lemma 5.2. Let $M_1, \ldots, M_t$ be i.i.d. copies of $M$, where $M$ is a symmetric matrix-valued random variable that has rank at most $r$ almost surely. Then for every $p \ge 2$,

$$(5.18)\qquad E_p \le \frac{rt\,(2B_p)^p}{t^p}\;\mathbb{E}_{M_{[t]}}\left\|\sum_{j=1}^{t}M_j^2\right\|_2^{p/2},$$
where $B_p$ is a constant that depends on $p$.

We need a non-commutative version of the Khintchine inequality due to F. Lust-Piquard [LP86], see also [LPP91] and [Buc01, Theorem 5]. We start with some preliminaries. Let $A \in \mathbb{R}^{n\times n}$ and denote by $C_p^n$ the $p$-th Schatten norm space, i.e., the Banach space of linear operators (or matrices in our setting) on $\mathbb{R}^n$ equipped with the norm
$$(5.19)\qquad \|A\|_{C_p^n} := \left(\sum_{i=1}^{n}\sigma_i(A)^p\right)^{1/p},$$
where the $\sigma_i(A)$ are the singular values of $A$; see [Bha96, Chapter IV, p. 92] for a discussion of Schatten norms. Notice that $\|A\|_2 = \sigma_1(A)$, hence we have the following inequality:
$$(5.20)\qquad \|A\|_2 \le \|A\|_{C_p^n} \le (\mathrm{rank}(A))^{1/p}\,\|A\|_2,$$
for any $p \ge 1$. Notice that when $p = \log_2(\mathrm{rank}(A))$, then $(\mathrm{rank}(A))^{1/\log_2(\mathrm{rank}(A))} = 2$. Therefore, in this case, the Schatten norm is essentially the spectral norm. We are now ready to state the matrix-valued Khintchine inequality. See e.g. [Rud99] or [NDT09, Lemma 8].

Theorem 5.1. Assume $2 \le p < \infty$, let $M_1, \ldots, M_t$ be fixed symmetric $n\times n$ matrices, and let $\varepsilon_1, \ldots, \varepsilon_t$ be i.i.d. Rademacher ($\pm 1$) random variables. Then
$$(5.21)\qquad \left(\mathbb{E}_{\varepsilon}\left\|\sum_{j=1}^{t}\varepsilon_j M_j\right\|_{C_p^n}^p\right)^{1/p} \le B_p\left\|\Big(\sum_{j=1}^{t}M_j^2\Big)^{1/2}\right\|_{C_p^n},$$
where $B_p$ is a constant depending only on $p$.

Proof. (of Lemma 5.2) By a standard symmetrization argument,
$$(5.23)\qquad E_p = \mathbb{E}_{M_{[t]}}\left\|\frac{1}{t}\sum_{j=1}^{t}(M_j - \mathbb{E}M)\right\|_2^p \le 2^p\,\mathbb{E}_{M_{[t]}}\mathbb{E}_{\varepsilon}\left\|\frac{1}{t}\sum_{j=1}^{t}\varepsilon_j M_j\right\|_2^p,$$
where $\varepsilon_1, \ldots, \varepsilon_t$ are i.i.d. Rademacher random variables. We first bound the inner Rademacher average


for any fixed symmetric matrices $M_1, M_2, \ldots, M_t$:
$$\left(\mathbb{E}_{\varepsilon}\left\|\frac{1}{t}\sum_{j=1}^{t}\varepsilon_j M_j\right\|_2^p\right)^{1/p} \le \frac{1}{t}\left(\mathbb{E}_{\varepsilon}\left\|\sum_{j=1}^{t}\varepsilon_j M_j\right\|_{C_p^n}^p\right)^{1/p} \le \frac{B_p}{t}\left\|\Big(\sum_{j=1}^{t}M_j^2\Big)^{1/2}\right\|_{C_p^n}$$
$$\le \frac{(rt)^{1/p}B_p}{t}\left\|\Big(\sum_{j=1}^{t}M_j^2\Big)^{1/2}\right\|_2 = \frac{(rt)^{1/p}B_p}{t}\left\|\sum_{j=1}^{t}M_j^2\right\|_2^{1/2}, \qquad (5.24)$$
taking $1/t$ outside the expectation and using the left part of Ineq. (5.20), Ineq. (5.21), the right part of Ineq. (5.20), and the fact that the matrix $\big(\sum_{j=1}^{t}M_j^2\big)^{1/2}$ has rank at most $rt$.

Now raising Ineq. (5.24) to the $p$-th power on both sides and taking the expectation with respect to $M_1, \ldots, M_t$, it follows from Ineq. (5.23) that
$$E_p \le \frac{2^p\,rt}{t^p}\,B_p^p\;\mathbb{E}_{M_{[t]}}\left\|\sum_{j=1}^{t}M_j^2\right\|_2^{p/2}.$$
This concludes the proof of Lemma 5.2.

Now we are ready to prove Theorem 1.1. First, we can assume without loss of generality that $M \succeq 0$ almost surely, losing only a constant factor in our bounds. Indeed, by the spectral decomposition theorem any symmetric matrix can be written as $M = \sum_j\lambda_j u_j u_j^\top$. Set $M_+ = \sum_{\lambda_j\ge 0}\lambda_j u_j u_j^\top$ and $M_- = M - M_+$. It is clear that $\|M_+\|_2, \|M_-\|_2 \le \|M\|_2$, $\|M_+\|_F, \|M_-\|_F \le \|M\|_F$ and $\mathrm{rank}(M_+), \mathrm{rank}(M_-) \le \mathrm{rank}(M)$. The triangle inequality tells us that
$$\left\|\frac{1}{t}\sum_{i=1}^{t}M_i - \mathbb{E}M\right\|_2 \le \left\|\frac{1}{t}\sum_{i=1}^{t}(M_i)_+ - \mathbb{E}M_+\right\|_2 + \left\|\frac{1}{t}\sum_{i=1}^{t}(M_i)_- - \mathbb{E}M_-\right\|_2$$
and one can bound each term of the right hand side separately. Hence, from now on we assume that $M \succeq 0$ a.s. Now use the fact that for every $j \in [t]$, $M_j^2 \preceq \gamma M_j$, since the $M_j$'s are positive semi-definite and $\|M_j\|_2 \le \gamma$ almost surely. Summing up all these inequalities we get that
$$(5.25)\qquad \left\|\sum_{j=1}^{t}M_j^2\right\|_2 \le \gamma\left\|\sum_{j=1}^{t}M_j\right\|_2.$$

It follows that
$$E_p \le \frac{rt\,(2B_p)^p}{t^p}\,\mathbb{E}_{M_{[t]}}\left\|\sum_{j=1}^{t}M_j^2\right\|_2^{p/2} \le \frac{rt\,(2B_p)^p\,\gamma^{p/2}}{t^p}\,\mathbb{E}_{M_{[t]}}\left\|\sum_{j=1}^{t}M_j\right\|_2^{p/2} = rt\,(2B_p)^p\left(\frac{\gamma}{t}\right)^{p/2}\mathbb{E}_{M_{[t]}}\left\|\frac{1}{t}\sum_{j=1}^{t}M_j\right\|_2^{p/2}$$
$$= rt\,(2B_p)^p\left(\frac{\gamma}{t}\right)^{p/2}\mathbb{E}_{M_{[t]}}\left\|\frac{1}{t}\sum_{j=1}^{t}M_j - \mathbb{E}M + \mathbb{E}M\right\|_2^{p/2} \le rt\,(2B_p)^p\left(\frac{\gamma}{t}\right)^{p/2}\left(\Big(\mathbb{E}\Big\|\frac{1}{t}\sum_{j=1}^{t}M_j - \mathbb{E}M\Big\|_2^{p/2}\Big)^{2/p} + 1\right)^{p/2}$$
$$\le rt\,(2B_p)^p\left(\frac{\gamma}{t}\right)^{p/2}\left(\Big(\mathbb{E}\Big\|\frac{1}{t}\sum_{j=1}^{t}M_j - \mathbb{E}M\Big\|_2^{p}\Big)^{1/p} + 1\right)^{p/2} = rt\,(2B_p)^p\left(\frac{\gamma}{t}\right)^{p/2}\left(E_p^{1/p} + 1\right)^{p/2},$$
using Lemma 5.2, Ineq. (5.25), Minkowski's inequality, Jensen's inequality, the definition of $E_p$, and the assumption $\|\mathbb{E}M\|_2 \le 1$. This implies the following inequality:
$$(5.26)\qquad E_p^{1/p} \le 2B_p\,\frac{\sqrt{\gamma}\,(rt)^{1/p}}{\sqrt{t}}\left(\sqrt{E_p^{1/p}} + 1\right),$$
using that $\sqrt{1+x} \le 1 + \sqrt{x}$ for $x \ge 0$. Let $a_p = 4B_p\,\frac{\sqrt{\gamma}\,(rt)^{1/p}}{\sqrt{t}}$.

Then it follows from the above inequality that $E_p^{1/p} \le \frac{a_p}{2}\big(\sqrt{E_p^{1/p}} + 1\big)$. It follows that⁶ $\min\{E_p^{1/p}, 1\} \le a_p$. Also notice that
$$(5.27)\qquad \left(\mathbb{E}\min\{Z, 1\}^p\right)^{1/p} \le \min\big(E_p^{1/p}, 1\big).$$
Now for any $0 < \varepsilon < 1$, $\mathbb{P}(Z > \varepsilon) = \mathbb{P}(\min\{Z, 1\} > \varepsilon)$.

⁶ Indeed, if $E_p^{1/p} \ge 1$ then $E_p^{1/p} \le \frac{a_p}{2}\big(E_p^{1/p} + E_p^{1/p}\big) = a_p E_p^{1/p}$, so $a_p \ge 1 \ge \min\{E_p^{1/p}, 1\}$; otherwise $\min\{E_p^{1/p}, 1\} = E_p^{1/p} \le \frac{a_p}{2}(1 + 1) = a_p$.

By the moment method we have that
$$\mathbb{P}\left(\min\{Z, 1\} > \varepsilon\right) = \mathbb{P}\left(\min\{Z, 1\}^p > \varepsilon^p\right) \le \inf_{p\ge 2}\frac{\mathbb{E}\min\{Z, 1\}^p}{\varepsilon^p} \overset{(5.27)}{\le} \inf_{p\ge 2}\frac{\min\{E_p^{1/p}, 1\}^p}{\varepsilon^p}$$
$$\le \inf_{p\ge 2}\left(\frac{a_p}{\varepsilon}\right)^p = \inf_{p\ge 2}\left(\frac{4B_p\sqrt{\gamma}\,(rt)^{1/p}}{\varepsilon\sqrt{t}}\right)^p = \inf_{p\ge 2}\left(\frac{C_2\sqrt{p\gamma}\,(rt)^{1/p}}{\varepsilon\sqrt{t}}\right)^p,$$
where $C_2 > 0$ is an absolute constant. Now assume that $r \le t$ and set $p = c_2\log t$, where $c_2 > 0$ is a sufficiently large constant, in the infimum expression of the above inequality; it follows that
$$\mathbb{P}\left(\left\|\frac{1}{t}\sum_{i=1}^{t}M_i - \mathbb{E}M\right\|_2 > \varepsilon\right) \le \left(\frac{C\sqrt{\gamma\log t}\,(rt)^{\frac{1}{\log t}}}{\varepsilon\sqrt{t}}\right)^{c_2\log t}.$$
We want to make the base of the above exponent smaller than one. It is easy to see that this is possible if we set $t = C_0\gamma/\varepsilon^2\,\log(C_0\gamma/\varepsilon^2)$, where $C_0$ is a sufficiently large absolute constant. Hence the above probability is at most $1/\mathrm{poly}(t)$. This concludes the proof.
