
Towards Provable Learning of Polynomial Neural Networks Using Low-Rank Matrix Estimation

Mohammadreza Soltani, Iowa State University, [email protected]

Chinmay Hegde, Iowa State University, [email protected]

Abstract

We study the problem of (provably) learning the weights of a two-layer neural network with quadratic activations. In particular, we focus on the under-parametrized regime where the number of neurons in the hidden layer is (much) smaller than the dimension of the input. Our approach uses a lifting trick, which enables us to borrow algorithmic ideas from low-rank matrix estimation. In this context, we propose two novel, non-convex training algorithms which do not need any extra tuning parameters other than the number of hidden neurons. We support our algorithms with rigorous theoretical analysis, and show that the proposed algorithms enjoy linear convergence, fast running time per iteration, and near-optimal sample complexity. Finally, we complement our theoretical results with several numerical experiments.

1 Introduction

The re-emergence of neural networks (spurred by the advent of deep learning) has had a remarkable impact on various sub-domains of artificial intelligence (AI), including object recognition in images, natural language processing, and automated drug discovery, among many others. However, despite the successful empirical performance of neural networks for these AI tasks, provable methods for learning neural networks remain relatively mysterious. Indeed, training a network of even moderate size requires solving a very large-scale, highly non-convex optimization problem.

In this paper, we (provably) resolve several algorithmic challenges that arise in the context of a special class of (shallow) neural networks by making connections to the better-studied problem of low-rank matrix estimation. Our hope is that a rigorous understanding of the fundamental limits of training shallow networks can be used as building blocks to obtain theoretical insights for more complex networks.

Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS) 2018, Lanzarote, Spain. PMLR: Volume 84. Copyright 2018 by the author(s).

Figure 1: Two-layer polynomial neural network ($p$ input nodes $x_1, \ldots, x_p$; a hidden layer of $r$ neurons with activation $\sigma(\cdot)$; output-layer weights $\alpha_1, \ldots, \alpha_r$; a single output node $\hat{y}$).

1.1 Setup

Consider a shallow (two-layer) neural network architecture, as illustrated in Figure 1. This network comprises $p$ input nodes, a single hidden layer with $r$ neurons with activation function $\sigma(z)$, first-layer weights $\{w_j\}_{j=1}^r \subset \mathbb{R}^p$, and an output layer comprising a single node and weights $\{\alpha_j\}_{j=1}^r \subset \mathbb{R}$. If $\sigma(z) = z^2$, then the above network is called a polynomial neural network (Livni et al., 2014). More precisely, the input-output relationship between an input $x \in \mathbb{R}^p$ and the corresponding output $\hat{y} \in \mathbb{R}$ is given by:

$$\hat{y} = \sum_{j=1}^r \alpha_j\, \sigma(w_j^T x) = \sum_{j=1}^r \alpha_j\, \langle w_j, x\rangle^2.$$

In this paper, our focus is on the so-called "under-parameterized" regime where $r \ll p$. Our goal is to learn this network, given a set of training input-output pairs $\{(x_i, y_i)\}_{i=1}^m$. We do so by finding a set of weights $\{\alpha_j, w_j\}_{j=1}^r$ that minimize the following empirical risk:

$$\min_{W \in \mathbb{R}^{r\times p},\, \alpha \in \mathbb{R}^r}\; F(W, \alpha) = \frac{1}{2m}\sum_{i=1}^m \big(y_i - \hat{y}_i\big)^2, \qquad (1)$$

where the rows of $W$ and the entries of $\alpha$ indicate the first- and second-layer weights, respectively. Numerous recent papers have explored (provable) algorithms to learn the weights of such a network under distributional assumptions on the input data (Livni et al., 2014; Lin & Ye, 2016; Janzamin et al., 2015; Tian, 2016; Zhong et al., 2016; Soltanolkotabi et al., 2017; Li & Yuan, 2017).^1

^1 While quadratic activation functions are rarely used in practice, stacking multiple such two-layer blocks can be used to simulate networks with higher-order polynomial and sigmoidal activations (Livni et al., 2014).

Clearly, the empirical risk defined in (1) is extremely nonconvex (involving fourth powers of the entries of $w_j$, coupled with the squares of $\alpha_j$). However, this can be circumvented using a clever lifting trick: if we define the matrix variable $L^* = \sum_{j=1}^r \alpha_j w_j w_j^T$, then the input-output relationship becomes:

$$y_i = x_i^T L^* x_i = \langle x_i x_i^T, L^* \rangle, \qquad (2)$$

where $x_i \in \mathbb{R}^p$ denotes the $i$-th training sample. Moreover, the variable $L^*$ is a rank-$r$ matrix of size $p \times p$. Therefore, (1) can be viewed as an instance of learning a fixed (but unknown) rank-$r$ symmetric matrix $L^* \in \mathbb{R}^{p\times p}$ with $r \ll p$ from a small number of rank-one linear observations given by $A_i = x_i x_i^T$. While still non-convex, low-rank matrix estimation problems such as (2) are much better understood.
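To make the lifting explicit, here is a minimal NumPy sketch (with arbitrary small dimensions and randomly chosen weights) checking that the network output $\sum_j \alpha_j \langle w_j, x\rangle^2$ coincides with the lifted rank-one observation $\langle x x^T, L^*\rangle$:

```python
import numpy as np

rng = np.random.default_rng(1)
p, r = 8, 3
W = rng.standard_normal((r, p))          # rows are the first-layer weights w_j
alpha = rng.standard_normal(r)           # second-layer weights alpha_j
x = rng.standard_normal(p)

net_out = np.sum(alpha * (W @ x) ** 2)   # sum_j alpha_j <w_j, x>^2
L_star = (W.T * alpha) @ W               # L* = sum_j alpha_j w_j w_j^T
lifted = x @ L_star @ x                  # <x x^T, L*> = x^T L* x
print(np.isclose(net_out, lifted))       # True
```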

Two specific instances in statistical learning include:

Matrix sensing and matrix completion. Reconstructing low-rank matrices from (noisy) linear measurements of the form $y_i = \langle X_i, L^*\rangle$ impacts several applications in control and system identification (Fazel, 2002), collaborative filtering (Candes & Recht, 2009; Recht et al., 2010), and imaging. Problem (2) specializes the matrix sensing problem to the case where the measurement matrices $X_i$ are constrained to be themselves rank-one.

Covariance sketching. Estimating a high-dimensional covariance matrix, given a stream of independent samples $\{s_t\}_{t=1}^{\infty}$, $s_t \in \mathbb{R}^p$, involves maintaining the empirical estimate $Q = \mathbb{E}[s_t s_t^T]$, which can require quadratic ($O(p^2)$) space complexity. Alternatively, one can record a sequence of $m \ll p^2$ linear sketches of each sample: $z_i = x_i^T s_t$ for $i = 1, \ldots, m$. At the conclusion of the stream, the sketches corresponding to a given vector $x_i$ are squared and aggregated to form a measurement: $y_i = \mathbb{E}[z_i^2] = \mathbb{E}[(x_i^T s_t)^2] = x_i^T Q x_i$, which is nothing but a linear sketch of $Q$ of the form (2). Again, several matrix estimation methods that "invert" such sketches exist; see Cai & Zhang (2015); Chen et al. (2015); Dasarathy et al. (2015).
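As an illustration of this covariance-sketching view (stream length and dimensions below are arbitrary), only $O(m)$ numbers per step need to be kept, and the aggregated squared sketches reduce exactly to rank-one measurements of the empirical covariance:

```python
import numpy as np

rng = np.random.default_rng(2)
p, m, T = 20, 15, 5000
X = rng.standard_normal((m, p))        # fixed sketching vectors x_1, ..., x_m

acc = np.zeros(m)                      # stream the samples s_t, keeping O(m) numbers
S = rng.standard_normal((T, p))
for s_t in S:
    acc += (X @ s_t) ** 2              # z_i = x_i^T s_t, accumulate z_i^2

y = acc / T                            # y_i = x_i^T Q_hat x_i
Q_hat = S.T @ S / T                    # empirical covariance (for checking only)
print(np.allclose(y, np.einsum('ij,jk,ik->i', X, Q_hat, X)))   # True (exact identity)
```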

1.2 Our contributions

In this paper, we make concrete algorithmic progress on solving low-rank matrix estimation problems of the form (2). In the context of learning polynomial neural networks, once we have estimated a rank-$r$ symmetric matrix $L^*$, we can always produce weights $\{\alpha_j, w_j\}$ by an eigendecomposition of $L^*$.
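Concretely, given an estimate of the symmetric matrix $L^*$, one admissible set of network weights can be read off from its eigenpairs. A small sketch (any eigendecomposition routine works; names are illustrative):

```python
import numpy as np

def weights_from_L(L, r):
    """Recover (alpha_j, w_j) with L = sum_j alpha_j w_j w_j^T from a symmetric rank-r L."""
    vals, vecs = np.linalg.eigh(L)
    keep = np.argsort(np.abs(vals))[::-1][:r]      # keep the r dominant eigenpairs
    return vals[keep], vecs[:, keep].T             # alpha_j = eigenvalue, w_j = eigenvector

# check: rebuild L from the recovered weights
rng = np.random.default_rng(3)
p, r = 10, 2
U = rng.standard_normal((p, r))
L = U @ np.diag([1.5, -0.7]) @ U.T                 # a symmetric rank-2 matrix (signs allowed)
alpha, W = weights_from_L(L, r)
print(np.allclose((W.T * alpha) @ W, L))           # True
```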

In general, a range of algorithms for solving (2) (or variants thereof) exist in the literature, and can be broadly classified into two categories: (i) convex approaches, all of which involve enforcing the rank-$r$ assumption in terms of a convex penalty term, such as the nuclear norm (Fazel, 2002; Recht et al., 2010; Cai & Zhang, 2015; Chen et al., 2015; Cai et al., 2010); and (ii) nonconvex approaches based on either alternating minimization (Zhong et al., 2015; Lin & Ye, 2016) or greedy approximation (Livni et al., 2014; Shalev-Shwartz et al., 2011).

Both types of approaches suffer from severe computational difficulties, particularly when the data dimension $p$ is large. Even the most computationally efficient convex approaches require multiple invocations of the singular value decomposition (SVD) of a (potentially) large $p \times p$ matrix, which can incur cubic ($O(p^3)$) running time. Moreover, even the best available non-convex approaches require a very accurate initialization, and also require that the underlying matrix $L^*$ is well-conditioned; if this is not the case, the running time of all available methods again inflates to $O(p^3)$, or worse.

In this paper, we take a different approach, and show how to leverage recent results in low-rank approximation to our advantage (Musco & Musco, 2015; Hegde et al., 2016). Our algorithm is also non-convex; however, unlike all earlier works, our method does not require any full SVD calculations. Specifically, we demonstrate that a careful concatenation of randomized, approximate SVD methods, coupled with appropriately defined gradient steps, leads to efficient and accurate matrix estimation.

To our knowledge, this work constitutes the first nearly-linear time method for low-rank matrix estimation from rank-one observations. Consequently, in the context of learning two-layer polynomial networks, our method is the first to exhibit nearly-linear running time, is nearly sample-optimal for fixed target rank $r$, and is unconditional (i.e., it makes no assumptions on the condition number of $L^*$ or the weight matrix $W$). Numerical experiments reveal that our methods yield a very attractive tradeoff between sample complexity and running time for efficient matrix estimation.

1.3 Techniques

At a high level, our method can be viewed as a variant of the seminal algorithms proposed in Jain et al. (2010) and Jain et al. (2014), which essentially perform projected (or proximal) gradient descent with respect to the space of rank-$r$ matrices. However, since computing the SVD in high dimensions can be a bottleneck, we cannot use this approach directly. To this end, we use the approximation-based matrix estimation framework proposed in Hegde et al. (2016). This work demonstrates how to carefully integrate approximate SVD methods into singular value projection (SVP)-based matrix estimation algorithms; in particular, algorithms that satisfy certain "head" and "tail" projection properties (explained below in Section 2) are sufficient to guarantee robust and fast convergence. Crucially, this framework removes the need to compute even a single SVD, as opposed to factorized methods which necessarily require one or multiple SVDs, together with stringent condition number assumptions.

However, a direct application of (projected) gradient descent does not succeed for matrix estimation problems obeying (2); two major obstacles arise.

Obstacle 1. It is well known that the measurement operator that maps $L^*$ to $y$ does not satisfy the so-called Restricted Isometry Property over rank-$r$ matrices (Cai & Zhang, 2015; Chen et al., 2015; Zhong et al., 2015); therefore, all statistical and algorithmic correctness arguments of Hegde et al. (2016) no longer apply.

Obstacle 2. The algebraic structure of the rank-one observations in (2) inflates the running time of computing even a simple gradient update to $O(p^3)$ (irrespective of the algorithmic cost of the rank-$r$ projection, whether done using exact or approximate SVDs).

We resolve Obstacle 1 by studying the concentration properties of certain linear operators of the form of rank-one projections, leveraging an approach first proposed in Zhong et al. (2015). We show that a non-trivial "bias correction" step, coupled with projected descent-type methods, within each iteration is sufficient to achieve fast (linear) convergence. To be more precise, define the operator $\mathcal{A}$ such that $(\mathcal{A}(L^*))_i = x_i^T L^* x_i$ for $i = 1, \ldots, m$, where $x_i$ is a standard normal random vector. A simple calculation shows that at any given iteration $t$, if $L_t$ is the current estimate of the underlying matrix variable, then we have:

$$\mathbb{E}\,\mathcal{A}^*\mathcal{A}(L_t - L^*) = 2(L_t - L^*) + \mathrm{Tr}(L_t - L^*)\, I,$$

where $\mathrm{Tr}(\cdot)$ denotes the trace of a matrix and the expectation is taken with respect to the randomness in the $x_i$'s. The left-hand side of this equation (roughly) corresponds to the expected value of the gradient in each iteration, and it is clear that while the gradient points in the "correct" direction $L_t - L^*$, it is additionally biased by the extra $\mathrm{Tr}(\cdot)$ term. Motivated by this, we develop a new descent scheme by carefully accounting for this bias. Interestingly, the sample complexity of our approach only increases by a mild factor (specifically, an extra factor of $r$ together with polylogarithmic terms) when compared to the best available techniques. Moreover, this scheme exhibits linear convergence in theory, and shows very competitive performance in experimental simulations.
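To see the bias numerically, here is a quick NumPy Monte Carlo check of the per-sample form of the identity above, $\mathbb{E}[(x^T M x)\, x x^T] = 2M + \mathrm{Tr}(M)\,I$ for symmetric $M$ and $x \sim \mathcal{N}(0, I)$ (dimensions and sample count are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
p, m = 25, 200_000

M = rng.standard_normal((p, p))                    # random symmetric "error direction"
M = (M + M.T) / 2                                  # (stand-in for L_t - L*)

X = rng.standard_normal((m, p))
q = np.einsum('ij,jk,ik->i', X, M, X)              # q_i = x_i^T M x_i
emp = (X * q[:, None]).T @ X / m                   # (1/m) sum_i q_i x_i x_i^T
target = 2 * M + np.trace(M) * np.eye(p)           # predicted bias: 2M + Tr(M) I

print(np.linalg.norm(emp - target) / np.linalg.norm(target))   # small, shrinks with m
```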

We resolve Obstacle 2 by carefully exploiting the rank-one structure of the observations. In particular, we develop a modification of the randomized block-Krylov SVD (or BK-SVD) algorithm of Musco & Musco (2015) to work for the case of certain "implicitly defined" matrices; specifically, we design a randomized SVD routine where the input is a linear operator that is constructed using vector-valued components. This modification, coupled with the tail- and head-projection arguments developed in Hegde et al. (2016), enables us to achieve a fast per-iteration computational complexity. In particular, our algorithm strictly improves over the (worst-case) per-iteration running time of all existing algorithms; see Table 1.

2 Main results

2.1 Preliminaries

Table 1: Summary of our contributions and comparison with existing algorithms. Here, $\kappa = \sigma_1/\sigma_r$ denotes the condition number of $L^*$.

Algorithm | Sample complexity $m$ | Total running time
Convex | $O(pr)$ | $O(p^3/\sqrt{\epsilon})$
GECO | N/A | $O(p^2 \log(p)\,\mathrm{poly}(r))$
AltMin-LRROM | $O\big(p r^4 \log^2(p)\, \kappa^2 \log(\frac{1}{\epsilon})\big)$ | $O\big(m p r \log(\frac{1}{\epsilon}) + p^3\big)$
gFM | $O\big(p r^3 \kappa^2 \log(\frac{1}{\epsilon})\big)$ | $O\big(m p r \log(\frac{1}{\epsilon}) + p^3\big)$
EP-ROM [This paper] | $O\big(p r^2 \log^4(p) \log(\frac{1}{\epsilon})\big)$ | $O\big(m p^2 \log(\frac{1}{\epsilon})\big)$
AP-ROM [This paper] | $O\big(p r^3 \log^4(p) \log(\frac{1}{\epsilon})\big)$ | $O\big(m p r \log(p) \log(\frac{1}{\epsilon})\big)$

Let us first introduce some notation. Throughout this paper, $\|\cdot\|_F$ and $\|\cdot\|_2$ denote the matrix Frobenius and spectral norms, respectively, and $\mathrm{Tr}(\cdot)$ denotes matrix trace. The phrase "with high probability" indicates an event whose failure rate is exponentially small. We assume that the training data samples $(x, y)$ obey a generative model (2) written as:

$$y = \sum_{j=1}^r \alpha_j^*\, \sigma(\langle w_j^*, x\rangle) = x^T L^* x + e_0, \qquad (3)$$

where $L^* \in \mathbb{R}^{p\times p}$ is the "ground-truth" matrix (with rank equal to $r$), and $e_0 \in \mathbb{R}$ is additive noise. Define $\mathcal{A} : \mathbb{R}^{p\times p} \to \mathbb{R}^m$ such that:

$$\mathcal{A}(L^*) = \big[x_1^T L^* x_1,\; x_2^T L^* x_2,\; \ldots,\; x_m^T L^* x_m\big]^T,$$

where each $x_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, I)$ is a normal random vector in $\mathbb{R}^p$ for $i = 1, \ldots, m$. The adjoint operator of $\mathcal{A}$ is defined as $\mathcal{A}^*(y) = \sum_{i=1}^m y_i x_i x_i^T$. The noise vector is denoted by $e \in \mathbb{R}^m$; throughout the paper (for the purpose of analysis) we assume that $e$ is zero-mean, subgaussian with i.i.d. entries, and independent of the $x_i$'s. The goal is to learn the rank-$r$ matrix parameter $L^*$ from as few samples as possible.
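In code, the forward operator $\mathcal{A}$ and its adjoint $\mathcal{A}^*$ can be written as follows (a sketch; note that forming $\mathcal{A}^*(y)$ explicitly already costs $O(mp^2)$, which motivates the implicit evaluation discussed later):

```python
import numpy as np

def A_op(L, X):
    """A(L) = [x_1^T L x_1, ..., x_m^T L x_m]."""
    return np.einsum('ij,jk,ik->i', X, L, X)

def A_adj(y, X):
    """A*(y) = sum_i y_i x_i x_i^T  (a dense p x p matrix)."""
    return (X * y[:, None]).T @ X

# quick consistency check of the adjoint: <A(L), y> = <L, A*(y)>
rng = np.random.default_rng(4)
p, m = 15, 40
X = rng.standard_normal((m, p))
L = rng.standard_normal((p, p)); L = (L + L.T) / 2
y = rng.standard_normal(m)
print(np.isclose(A_op(L, X) @ y, np.sum(L * A_adj(y, X))))    # True
```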

In our analysis, we require the operators $\mathcal{A}$ and $\mathcal{A}^*$ to satisfy the following regularity condition with respect to the set of low-rank matrices. We call this the Conditional Unbiased Restricted Isometry Property, abbreviated as CU-RIP($\rho$):

Definition 1. Consider fixed rank-$r$ matrices $L_1$ and $L_2$. Then, $\mathcal{A}$ is said to satisfy CU-RIP($\rho$) if there exists $0 < \rho < 1$ such that

$$\Big\| L_1 - L_2 - \Big(\frac{1}{2m}\mathcal{A}^*\mathcal{A}(L_1 - L_2) - \frac{1}{2m}\mathbf{1}^T\big(\mathcal{A}(L_1) - \mathcal{A}(L_2)\big)\, I\Big) \Big\|_2 \le \rho\, \big\|L_1 - L_2\big\|_2 .$$

Let $\mathbb{U}_r$ denote the set of all rank-$r$ matrix subspaces, i.e., subspaces of $\mathbb{R}^{p\times p}$ which are spanned by any $r$ atoms of the form $uv^T$, where $u, v \in \mathbb{R}^p$ are unit $\ell_2$-norm vectors. We use the idea of head and tail approximate projections with respect to $\mathbb{U}_r$, first proposed in Hegde et al. (2015) and instantiated in the context of low-rank approximation in Hegde et al. (2016).

Definition 2 (Approximate tail projection). $T : \mathbb{R}^{p\times p} \to \mathbb{U}_r$ is an $\varepsilon$-approximate tail projection algorithm if for all $L \in \mathbb{R}^{p\times p}$, $T$ returns a subspace $W = T(L)$ that satisfies $\|L - P_W L\|_F \le (1+\varepsilon)\|L - L_r\|_F$, where $L_r$ is the optimal rank-$r$ approximation of $L$.

Definition 3 (Approximate head projection). $H : \mathbb{R}^{p\times p} \to \mathbb{U}_r$ is an $\varepsilon$-approximate head projection if for all $L \in \mathbb{R}^{p\times p}$, the returned subspace $V = H(L)$ satisfies $\|P_V L\|_F \ge (1-\varepsilon)\|L_r\|_F$, where $L_r$ is the optimal rank-$r$ approximation of $L$.
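As a point of reference, both definitions are trivially met (with $\varepsilon = 0$) by an exact truncated SVD; the sketch below does exactly that, whereas the algorithms in this paper replace it with randomized, gap-independent approximations to avoid full SVDs:

```python
import numpy as np

def rank_r_project(L, r):
    """Exact rank-r projection: serves as both tail and head projection with epsilon = 0."""
    U, s, Vt = np.linalg.svd(L, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]             # best rank-r approximation L_r

rng = np.random.default_rng(5)
p, r = 30, 4
L = rng.standard_normal((p, p))
Lr = rank_r_project(L, r)
# tail: ||L - Lr||_F <= (1 + eps) ||L - L_r||_F and head: ||Lr||_F >= (1 - eps) ||L_r||_F,
# both with equality when eps = 0, since Lr is the optimal rank-r approximation.
print(np.linalg.matrix_rank(Lr) <= r)              # True
```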

2.2 Algorithms and theoretical results

We now propose methods to estimate $L^*$ given knowledge of $\{x_i, y_i\}_{i=1}^m$. Our first method is somewhat computationally inefficient, but achieves very good sample complexity and serves to illustrate the overall algorithmic approach. Consider the non-convex, constrained risk minimization problem:

$$\min_{L \in \mathbb{R}^{p\times p}}\; F(L) = \frac{1}{2m}\sum_{i=1}^m \big(y_i - x_i^T L x_i\big)^2 \quad \text{s.t.} \quad \mathrm{rank}(L) \le r. \qquad (4)$$

To solve this problem, we first propose an algorithm that we call Exact Projections for Rank-One Matrix estimation, or EP-ROM, described in pseudocode form in Algorithm 1.^2

We now analyze this algorithm. First, we provide a theoretical result which establishes statistical and optimization convergence rates of EP-ROM. More precisely, we derive an upper bound on the estimation error (measured using the spectral norm) of recovering $L^*$. Due to space constraints, we defer all the proofs to the appendix.

Theorem 4 (Linear convergence of EP-ROM). Consider the sequence of iterates $(L_t)$ obtained in EP-ROM. Assume that in each iteration the linear operator $\mathcal{A}$ satisfies CU-RIP($\rho$) for some $0 < \rho < \frac{1}{2}$; then EP-ROM outputs a sequence of estimates $L_t$ such that:

$$\|L_{t+1} - L^*\|_2 \le q\,\|L_t - L^*\|_2 + \frac{1}{2m}\big(|\mathbf{1}^T e| + \|\mathcal{A}^* e\|_2\big), \qquad (5)$$

where $0 < q < 1$.

The contraction factor $q$ in Equation (5) can be made arbitrarily small if we choose $m$ sufficiently large; we elaborate on this in Theorem 6. The second and third terms in (5) represent the statistical error rate. In the next theorem, we show that these error terms can be suitably bounded. Furthermore, Theorem 4 implies (via induction) that EP-ROM exhibits linear convergence; please see Corollary 7.

Algorithm 1 EP-ROM
Inputs: $y$, number of iterations $K$, independent data samples $\{x_1^t, x_2^t, \ldots, x_m^t\}$ for $t = 1, \ldots, K$, rank $r$
Output: estimate $\widehat{L}$
Initialization: $L_0 \leftarrow 0$, $t \leftarrow 0$
Calculate: $\bar{y} = \frac{1}{m}\sum_{i=1}^m y_i$
while $t \le K$ do
  $L_{t+1} = P_r\Big(L_t - \Big[\frac{1}{2m}\sum_{i=1}^m \big((x_i^t)^T L_t x_i^t - y_i\big)\, x_i^t (x_i^t)^T - \big(\frac{1}{2m}\mathbf{1}^T\mathcal{A}(L_t) - \frac{1}{2}\bar{y}\big)\, I\Big]\Big)$
  $t \leftarrow t + 1$
end while
Return: $\widehat{L} = L_K$

^2 In Algorithm 1, $P_r$ denotes the projection operator onto the set of rank-$r$ matrices.

Theorem 5 (Bounding the statistical error). Consider the generative model (3) with zero-mean subgaussian noise $e \in \mathbb{R}^m$ with i.i.d. entries (and independent of the $x_i$'s) such that $\tau = \max_{1\le j\le m} \|e_j\|_{\psi_2}$ (here, $\|\cdot\|_{\psi_2}$ denotes the $\psi_2$-norm; see Definition 11 in the appendix). Then, with probability at least $1-\delta$, we have:

$$\frac{1}{m}|\mathbf{1}^T e| + \Big\|\frac{1}{m}\mathcal{A}^* e\Big\|_2 \le C''_\tau \sqrt{\frac{p \log^2 p}{m}}\,\log\Big(\frac{p}{\delta}\Big), \qquad (6)$$

where $C''_\tau > 0$ is a constant which depends on $\tau$.

To establish linear convergence of EP-ROM, we assume that the CU-RIP holds at each iteration. The following theorem certifies this assumption.

Theorem 6 (Verifying CU-RIP). At any iteration $t$ of EP-ROM, with probability at least $1 - \xi$, CU-RIP($\rho$) is satisfied with parameter $\rho < \frac{1}{2}$, provided that $m = O\big(\frac{1}{\delta^2}\, p r^2 \log^3 p\, \log(\frac{p}{\xi})\big)$ for some $\delta > 0$.

Integrating the above results, together with the assumption of the availability of a batch of $m$ independent samples in each iteration, we obtain the following corollary formally establishing linear convergence. We acknowledge that this assumption of "fresh samples" is somewhat unrealistic and is an artifact of our proof techniques; nonetheless, it is a standard mechanism in proofs for non-convex low-rank matrix estimation (Hardt, 2014; Zhong et al., 2016).

Corollary 7. After $K$ iterations, with high probability the output of EP-ROM satisfies:

$$\|L_K - L^*\|_2 \le q^K \|L^*\|_2 + \frac{C''_\tau}{1-q}\sqrt{\frac{p \log^3 p}{m}}, \qquad (7)$$

where $C''_\tau$ is given in (6). Thus, under all the assumptions in Theorem 4, to achieve $\epsilon$-accuracy for the estimation of $L^*$ in terms of the spectral norm, EP-ROM needs $K = O\big(\log(\frac{\|L^*\|_2}{\epsilon})\big)$ iterations. Based on Theorems 5 and 6 and Corollary 7, the sample complexity of EP-ROM scales as $m = O\big(p r^2 \log^4 p\, \log(\frac{1}{\epsilon})\big)$.

While EP-ROM exhibits linear convergence, its per-iteration complexity is still high, since it requires a projection onto the space of rank-$r$ matrices, which necessitates the application of an SVD. In the absence of any spectral assumptions on the input to the SVD, the per-iteration running time of EP-ROM can be cubic, which can be prohibitive. Overall, we obtain a running time of $\widetilde{O}(p^3 r^2)$ in order to achieve $\epsilon$-accuracy (please see Section 5.3 in the appendix for a longer discussion).

To reduce the running time, one can instead replace a standard SVD routine with approximation heuristics such as Lanczos iterations (Lanczos, 1950); however, these may not result in algorithms with provable convergence guarantees. Instead, following Hegde et al. (2016), we can use a pair of inaccurate rank-$r$ projections (in particular, tail- and head-approximate projection operators). Based on this idea, we propose our second algorithm, which we call Approximate Projection for Rank-One Matrix estimation, or AP-ROM. We display the pseudocode of AP-ROM in Algorithm 2.

The specific choice of approximate SVD algorithms that simulate the operators $T(\cdot)$ and $H(\cdot)$ is flexible. We note that tail-approximate projections have been widely studied in the numerical linear algebra literature (Clarkson & Woodruff, 2013; Mahoney & Drineas, 2009; Rokhlin et al., 2009); however, head-approximate projection methods are less well-known. In our method, we use the randomized Block Krylov SVD (BK-SVD) method proposed by Musco & Musco (2015), which has been shown to satisfy both types of approximation guarantees (Hegde et al., 2016). One can alternatively use LazySVD, recently proposed by Allen-Zhu & Li (2016), which also satisfies both guarantees. The nice feature of these methods is that their running time is independent of the spectral gap of the matrix. We leverage this property to show asymptotic improvements over other fast SVD methods (such as the power method).
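For intuition, the sketch below uses plain randomized subspace (block power) iteration as a simplified stand-in for such methods; the actual BK-SVD additionally keeps the whole block Krylov space, which is what yields the gap-independent head/tail guarantees. Names and parameters are illustrative:

```python
import numpy as np

def randomized_range(A, r, n_iter=6, oversample=5, seed=0):
    """Orthonormal basis Z approximately spanning the top-r left singular subspace of A."""
    rng = np.random.default_rng(seed)
    Y = A @ rng.standard_normal((A.shape[1], r + oversample))
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(Y)                     # re-orthonormalize for stability
        Y = A @ (A.T @ Q)                          # subspace (block power) iteration on A A^T
    Z, _ = np.linalg.qr(Y)
    return Z

rng = np.random.default_rng(6)
p, r = 200, 5
A = rng.standard_normal((p, r)) @ rng.standard_normal((r, p)) + 0.01 * rng.standard_normal((p, p))
Z = randomized_range(A, r)
B = Z @ (Z.T @ A)                                  # B = Z Z^T A, the candidate projection of A
print(np.linalg.norm(A - B, 'fro') / np.linalg.norm(A, 'fro'))   # small residual
```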

We briefly discuss the BK-SVD algorithm. In particular, BK-SVD takes a $p \times p$ input matrix with rank $r$ and returns an $r$-dimensional subspace which approximates the span of the top $r$ singular vectors of the input. Mathematically, if $A \in \mathbb{R}^{p\times p}$ is the input, $A_r$ is the best rank-$r$ approximation to it, and $Z$ is a basis matrix that spans the subspace returned by BK-SVD, then the projection of $A$ onto $Z$, $B = ZZ^T A$, satisfies the following relations:

$$\|A - B\|_F \le (1+\varepsilon)\|A - A_r\|_F, \qquad \big|u_i^T A A^T u_i - z_i^T A A^T z_i\big| \le \varepsilon\, \sigma_{r+1}^2,$$

where $\varepsilon > 0$ is the tail and head projection approximation constant, and $u_i$ denotes the $i$-th left singular vector of $A$. In Appendix B of Hegde et al. (2016), it has been shown that the per-vector guarantee can be used to prove the approximate head projection property, i.e., $\|B\|_F \ge (1-\varepsilon)\|A_r\|_F$.

Algorithm 2 AP-ROM
Inputs: $y$, number of iterations $K$, independent data samples $\{x_1^t, x_2^t, \ldots, x_m^t\}$ for $t = 1, \ldots, K$, rank $r$
Output: estimate $\widehat{L}$
Initialization: $L_0 \leftarrow 0$, $t \leftarrow 0$
Calculate: $\bar{y} = \frac{1}{m}\sum_{i=1}^m y_i$
while $t \le K$ do
  $L_{t+1} = T\Big(L_t - H\Big(\frac{1}{2m}\sum_{i=1}^m \big((x_i^t)^T L_t x_i^t - y_i\big)\, x_i^t (x_i^t)^T - \big(\frac{1}{2m}\mathbf{1}^T\mathcal{A}(L_t) - \frac{1}{2}\bar{y}\big)\, I\Big)\Big)$
  $t \leftarrow t + 1$
end while
Return: $\widehat{L} = L_K$

We now establish that AP-ROM also exhibits linear convergence, while obeying similar statistical properties as EP-ROM. We have the following results:

Theorem 8 (Convergence of AP-ROM). Consider the sequence of iterates $(L_t)$ obtained in AP-ROM. Assume that in each iteration $t$, $\mathcal{A}$ satisfies CU-RIP($\rho'$) for some $0 < \rho' < 1$; then AP-ROM outputs a sequence of estimates $L_t$ such that:

$$\|L_{t+1} - L^*\|_F \le q'_1\, \|L_t - L^*\|_F + q'_2\,\big(|\mathbf{1}^T e| + \|\mathcal{A}^* e\|_2\big), \qquad (8)$$

where $q'_1 = (2+\varepsilon)\big(\rho' + \sqrt{1-\phi^2}\big)$, $q'_2 = \frac{\sqrt{r}}{2m}\Big(2-\varepsilon + \frac{\phi(2-\varepsilon)(2+\varepsilon)}{\sqrt{1-\phi^2}}\Big)$, and $\phi = (1-\varepsilon)(1-\rho') - \rho'$.

Similar to Theorem 6, we can show that CU-RIP is satisfied in each iteration of AP-ROM with probability at least $1 - \xi$, provided that $m = O\big(\frac{1}{\delta^2}\, p r^3 \log^3 p\, \log(\frac{p}{\xi})\big)$. Hence, we require a factor-$r$ increase compared to before. Overall, we have the following result:

Corollary 9. The output of AP-ROM satisfies the following after $K$ iterations with high probability:

$$\|L_K - L^*\|_F \le (q'_1)^K \|L^*\|_F + \frac{C''_\tau\, q'_2}{1 - q'_1}\sqrt{\frac{p\, r \log^3 p}{m}}, \qquad (9)$$

where $q'_1$ and $q'_2$ have been defined in Theorem 8.

Hence, under the assumptions in Theorem 8, in order to achieve $\epsilon$-accuracy in the estimation of $L^*$ in terms of the Frobenius norm, AP-ROM requires $K = O\big(\log(\frac{\|L^*\|_2}{\epsilon})\big)$ iterations. From Theorem 8 and Corollary 9, we observe that the sample complexity of AP-ROM (i.e., the number of samples $m$ needed to achieve a given accuracy) increases slightly, to $m = O\big(p r^3 \log^4 p\, \log(\frac{1}{\epsilon})\big)$.

2.3 Improving running time

The above analysis of AP-ROM shows that instead of using exact rank-$r$ projections (as in EP-ROM), one can use tail- and head-approximate projections, implemented via the BK-SVD method of Musco & Musco (2015). The running time of this projection is $\widetilde{O}(p^2 r)$ when $r \ll p$. While the running time of the projection step is gap-independent, the calculation of the gradient (i.e., the input to the head projection $H$) is itself the major bottleneck. In essence, this is related to the calculation of the adjoint operator, $\mathcal{A}^*(d) = \sum_{i=1}^m d_i\, x_i x_i^T$, which requires $O(p^2)$ operations for each sample. Coupled with the sample complexity of $m = \Omega(pr^3)$, this means that the per-iteration running time scales as $\Omega(p^3 r^3)$, which overshadows any gains achieved during the projection step (please see Section 5.3 in the appendix).
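The key structural observation behind the modification described next is that one never needs $\mathcal{A}^*(d)$ as a dense matrix: its action on a vector is $\mathcal{A}^*(d)\,v = \sum_i d_i x_i (x_i^T v) = X^T\big(d \odot (Xv)\big)$, which costs only $O(mp)$ per matrix-vector product. A small sketch (names illustrative):

```python
import numpy as np

def adj_matvec(d, X, v):
    """Apply A*(d) = sum_i d_i x_i x_i^T to a vector v without forming the p x p matrix."""
    return X.T @ (d * (X @ v))                     # O(mp) instead of O(mp^2 + p^2)

# check against the explicit construction
rng = np.random.default_rng(7)
p, m = 30, 500
X = rng.standard_normal((m, p))
d = rng.standard_normal(m)
v = rng.standard_normal(p)
dense = (X * d[:, None]).T @ X                     # explicit A*(d), for comparison only
print(np.allclose(adj_matvec(d, X, v), dense @ v)) # True
```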

To address this challenge, we propose a modified version of BK-SVD for head-approximate projection which uses the special rank-one structure involved in the calculation of the gradients. We call this method Modified BK-SVD, or MBK-SVD. The basic idea is to implicitly evaluate each Krylov-subspace iteration within BK-SVD, and avoid any explicit calculation of the adjoint operator $\mathcal{A}^*$ applied to the current estimate. Due to space constraints, the pseudocode as well as the running time analysis of MBK-SVD is deferred to the appendix. We have:

Theorem 10. AP-ROM (with modified BK-SVD) runs in total time $O\big(p^2 r^4 \log^2(\frac{1}{\epsilon})\, \mathrm{polylog}(p)\big)$.

3 Prior art

Due to space constraints, we only provide here a brief (and incomplete) review of related work, and describe how our method differs from earlier techniques.

Figure 2: Comparison of algorithms (compared methods: SVD, SVDS, AP-ROM, Convex, gFM). (a) Phase transition plot with $p = 100$ (probability of success versus the number of measurements $m$). (b) Evolution of the objective function $\log\big(\frac{1}{2}\|y - \mathcal{A}(L_t)\|_2^2\big)$ versus the number of iterations with $p = 100$, $m = 8500$, and noise level $\sigma = 0.1$. (c) Relative error versus running time (seconds) with $p = 1000$ and $m = 75000$.

Problems involving low-rank matrix estimation have received significant attention from the machine learning community over the last few years; see Davenport & Romberg (2016) for a recent survey. In early works on matrix estimation, the observation operator $\mathcal{A}$ is assumed to be parametrized by $m$ independent full-rank $p \times p$ matrices that satisfy certain restricted isometry conditions (Recht et al., 2010; Liu, 2011). In this setup, it has been established that $m = O(pr)$ observations are sufficient to recover an unknown rank-$r$ matrix $L^*$ in (3) (Candes & Plan, 2011), and this scaling of the sample complexity is statistically optimal.

In the context of provable methods for learning neural networks, two-layer networks have received special attention. For instance, Livni et al. (2014) considered a two-layer network with quadratic activation function (identical to the model proposed above), and proposed a greedy, improper learning algorithm: in each iteration, the algorithm adds one hidden neuron to the network until the risk falls below a threshold. While this algorithm is guaranteed to converge, its convergence rate is sublinear.

Recently, in Zhong et al. (2016), the authors have proposed a linearly convergent algorithm for learning two-layer networks for several classes of activation functions. They also derived an upper bound on the sample complexity of network learning which is linear in $p$ and depends polynomially on $r$ and other spectral properties of the ground-truth (planted) weights. However, their theory does not provide convergence guarantees for quadratic activation functions; this paper closes that gap. Note that our focus here is not the estimation of the weights $\{\alpha_j, w_j\}$ themselves, but rather of any network that gives the same input-output relationship. As a result, our guarantees are stated in terms of the low-rank matrix $L^*$. Furthermore, unlike their algorithm, our sample complexity does not depend on the spectral properties of the ground-truth weights.

Other works have also studied similar two-layer setups, including Janzamin et al. (2015); Tian (2016); Soltanolkotabi et al. (2017); Li & Yuan (2017). In contrast with these results, our framework does not assume the over-parameterized setting where the number of hidden neurons $r$ is greater than $p$. In addition, we explicitly derive a sample complexity that is linear in $p$, and demonstrate linear convergence. Also, observe that if we let $L^*$ be rank-one, then Problem (2) is known as generalized phase retrieval, for which several excellent algorithms are known (Candes et al., 2013, 2015; Netrapalli et al., 2013). However, our problem is more challenging, as it allows $L^*$ to have arbitrary rank $r$.

We now briefly contrast our method with other algorithmic techniques for low-rank matrix estimation. Broadly, two classes of such techniques exist. The first class of matrix estimation techniques can be categorized as approaches based on convex relaxation (Chen et al., 2015; Cai & Zhang, 2015; Kueng et al., 2017; Candes et al., 2013). For instance, the authors in Chen et al. (2015); Cai & Zhang (2015) demonstrate that the observation operator $\mathcal{A}$ satisfies a specialized mixed-norm isometry condition called the RIP-$\ell_2/\ell_1$. Further, they show that the sample complexity of matrix estimation using rank-one projections matches the optimal rate $O(pr)$. However, these methods advocate using either semidefinite programming (SDP) or proximal sub-gradient algorithms (Boyd & Vandenberghe, 2004; Goldstein et al., 2014, 2015), both of which are too slow for very high-dimensional problems.

The second class of techniques can be categorized as non-convex approaches, which are all based on a factorization approach initially advocated by Burer & Monteiro (2003). Here, the underlying low-rank matrix variable is factorized as $L^* = UV^T$, where $U, V \in \mathbb{R}^{p\times r}$ (Zheng & Lafferty, 2015; Tu et al., 2016). In the AltMin-LRROM method proposed by Zhong et al. (2015), $U$ and $V$ are updated in an alternating fashion. However, the setup in Zhong et al. (2015) is different from this paper, as it uses an asymmetric observation model, in which observation $y_i$ is given by $y_i = x_i^T L^* z_i$ with $x_i$ and $z_i$ being independent random vectors. Our goal is to analyze the more challenging case where the observation operator $\mathcal{A}$ is symmetric and defined according to (3). In a subsequent work (called the generalized factorization machine) by Lin et al. (2017), $U$ and $V$ are updated based on the construction of certain sequences of moment estimators.

Both the approaches of Zhong et al. (2015) and Lin et al. (2017) require a spectral initialization which involves running a rank-$r$ SVD on a given $p \times p$ matrix, and therefore the running time heavily depends on the condition number (i.e., the ratio of the maximum and the minimum nonzero singular values) of $L^*$. To our knowledge, only three works in the matrix estimation literature require no full SVDs (Bhojanapalli et al., 2016; Hegde et al., 2016; Ge et al., 2016). However, both Bhojanapalli et al. (2016) and Hegde et al. (2016) assume that the restricted isometry property is satisfied, which is not applicable in our setting. Moreover, Ge et al. (2016) makes stringent assumptions on the condition number, as well as the coherence, of the unknown matrix.

Finally, we mention that a matrix estimation scheme using approximate SVDs (based on Frank-Wolfe-type greedy approximation) has been proposed for learning polynomial neural networks (Shalev-Shwartz et al., 2011; Livni et al., 2014). Moreover, this approach has been shown to compare favorably to typical neural network learning methods (such as stochastic gradient descent). However, the rate of convergence is sublinear, and these works provide no sample-complexity guarantees. Indeed, the main motivating factor of our paper was to accelerate the running time of such greedy approximation techniques. We complete this line of work by providing (a) a rigorous statistical analysis that precisely establishes upper bounds on the number of samples required for learning such networks, and (b) an algorithm that provably exhibits linear convergence, as well as nearly-linear per-iteration running time.

4 Experimental results and discussion

We illustrate some experiments to support our proposed algorithms. We compare EP-ROM and AP-ROM with convex (nuclear norm) minimization as well as the gFM algorithm of Lin & Ye (2016). To solve the nuclear norm minimization, we use FASTA (Goldstein et al., 2014, 2015), which efficiently implements an accelerated proximal sub-gradient method. For AP-ROM, we consider our proposed modified BK-SVD method (MBK-SVD). In addition, SVD and SVDS denote the projection step being used in EP-ROM. In all the experiments, we generate a low-rank matrix $L^* = UU^T$ with $U \in \mathbb{R}^{p\times r}$ and $r = 5$, where the entries of $U$ are chosen randomly according to the standard normal distribution.

Figures 2(a) and 2(b) show the phase transition of successful estimation as well as the evolution of the objective function, $\frac{1}{2}\|y - \mathcal{A}(L_t)\|_2^2$, versus the iteration count $t$ for five algorithms. In Figure 2(a), we have used 50 Monte Carlo trials, and the phase transition plot is generated based on the empirical probability of success; here, success is declared when the relative error between $\widehat{L}$ (the estimate of $L^*$) and the ground truth $L^*$ (measured in terms of spectral norm) is less than 0.05. For solving the convex nuclear norm minimization using FASTA, we set the Lagrangian parameter $\mu$ (i.e., in $\mu\|L\|_* + \frac{1}{2}\|y - \mathcal{A}(L)\|_2^2$) via a grid search. In Figure 2(a), there is no additive noise. As we can see in this figure, the phase transition for the convex method is slightly better than those for the non-convex algorithms, which is consistent with known theoretical results. However, the convex method is improper, i.e., the rank of $\widehat{L}$ is much higher than the target rank. In Figure 2(b), we consider additive standard normal noise with standard deviation equal to 0.1, and average over 10 Monte Carlo trials. As illustrated in this plot, all non-convex algorithms have much better performance in decreasing the objective function compared to the convex method.

Finally, in Figure 2(c), we compare the algorithms in the high-dimensional regime where $p = 1000$, $m = 75000$, and $r = 5$ in terms of running time. We let all the algorithms run 15 iterations, and then compute the CPU time in seconds for each of them. The y-axis denotes the logarithm of the relative error in spectral norm, and we report averages over 10 Monte Carlo trials. As we can see, the convex method is the slowest (as expected); the non-convex methods are comparable to each other, while MBK-SVD is the fastest. This plot verifies that our modified head-approximate projection routine is faster than the other non-convex methods, which makes it a promising approach for high-dimensional matrix estimation applications.

Discussion. It seems plausible that the matrix-based techniques of this paper can be extended to learn networks with similar polynomial-like activation functions (such as the squared ReLU). Moreover, similar algorithms can plausibly be used to train multi-layer networks using a greedy (layer-by-layer) learning strategy. Finally, it will be interesting to integrate our methods with practical approaches such as stochastic gradient descent (SGD).

Acknowledgement. This work was supported in part by NSF grant CCF-1566281.


References

Allen-Zhu, Z. and Li, Y. LazySVD: Even faster SVD decomposition yet without agonizing pain. In Adv. Neural Inf. Proc. Sys. (NIPS), pp. 974–982, 2016.

Bhojanapalli, S., Neyshabur, B., and Srebro, N. Global optimality of local search for low rank matrix recovery. In Adv. Neural Inf. Proc. Sys. (NIPS), pp. 3873–3881, 2016.

Boyd, S. and Vandenberghe, L. Convex optimization. Cambridge University Press, 2004.

Burer, S. and Monteiro, R. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming, 95(2):329–357, 2003.

Cai, J., Candes, E., and Shen, Z. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.

Cai, T. and Zhang, A. ROP: Matrix recovery via rank-one projections. Ann. Stat., 43(1):102–138, 2015.

Candes, E. and Plan, Y. Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements. IEEE Trans. Inform. Theory, 57(4):2342–2359, 2011.

Candes, E. and Recht, B. Exact matrix completion via convex optimization. Found. Comput. Math., 9(6), 2009.

Candes, E., Strohmer, T., and Voroninski, V. PhaseLift: Exact and stable signal recovery from magnitude measurements via convex programming. Comm. Pure Appl. Math., 66(8):1241–1274, 2013.

Candes, E., Li, X., and Soltanolkotabi, M. Phase retrieval via Wirtinger flow: Theory and algorithms. IEEE Trans. Inform. Theory, 61(4):1985–2007, 2015.

Chen, Y., Chi, Y., and Goldsmith, A. Exact and stable covariance estimation from quadratic sampling via convex programming. IEEE Trans. Inform. Theory, 61(7):4034–4059, 2015.

Clarkson, K. L. and Woodruff, D. Low rank approximation and regression in input sparsity time. In Proc. ACM Symp. Theory of Comput., pp. 81–90. ACM, 2013.

Dasarathy, G., Shah, P., Bhaskar, B., and Nowak, R. Sketching sparse matrices, covariances, and graphs via tensor products. IEEE Trans. Inform. Theory, 61(3):1373–1388, 2015.

Davenport, M. and Romberg, J. An overview of low-rank matrix recovery from incomplete observations. IEEE J. Select. Top. Sig. Proc., 10(4):608–622, 2016.

Fazel, M. Matrix rank minimization with applications. PhD thesis, Stanford University, 2002.

Ge, R., Lee, J., and Ma, T. Matrix completion has no spurious local minimum. In Adv. Neural Inf. Proc. Sys. (NIPS), pp. 2973–2981, 2016.

Goldstein, T., Studer, C., and Baraniuk, R. A field guide to forward-backward splitting with a FASTA implementation. arXiv eprint, abs/1411.3406, 2014. URL http://arxiv.org/abs/1411.3406.

Goldstein, T., Studer, C., and Baraniuk, R. FASTA: A generalized implementation of forward-backward splitting, January 2015. http://arxiv.org/abs/1501.04979.

Hardt, M. Understanding alternating minimization for matrix completion. In Proc. IEEE Symp. Found. Comp. Science (FOCS), pp. 651–660. IEEE, 2014.

Hegde, C., Indyk, P., and Schmidt, L. Approximation algorithms for model-based compressive sensing. IEEE Trans. Inform. Theory, 61(9):5129–5147, 2015.

Hegde, C., Indyk, P., and Schmidt, L. Fast recovery from a union of subspaces. In Adv. Neural Inf. Proc. Sys. (NIPS), pp. 4394–4402, 2016.

Jain, P., Meka, R., and Dhillon, I. Guaranteed rank minimization via singular value projection. In Adv. Neural Inf. Proc. Sys. (NIPS), pp. 937–945, 2010.

Jain, P., Tewari, A., and Kar, P. On iterative hard thresholding methods for high-dimensional M-estimation. In Adv. Neural Inf. Proc. Sys. (NIPS), pp. 685–693, 2014.

Janzamin, M., Sedghi, H., and Anandkumar, A. Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods. arXiv preprint arXiv:1506.08473, 2015.

Kueng, R., Rauhut, H., and Terstiege, U. Low rank matrix recovery from rank one measurements. Appl. Comput. Harmon. Anal., 42(1):88–116, 2017.

Lanczos, C. An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. Journal of Research of the National Bureau of Standards, 45(4), 1950.

Li, Y. and Yuan, Y. Convergence analysis of two-layer neural networks with ReLU activation. In Adv. Neural Inf. Proc. Sys. (NIPS), 2017.

Lin, M. and Ye, J. A non-convex one-pass framework for generalized factorization machine and rank-one matrix sensing. In Adv. Neural Inf. Proc. Sys. (NIPS), pp. 1633–1641, 2016.

Lin, M., Qiu, S., Hong, B., and Ye, J. The second order linear model. arXiv preprint arXiv:1703.00598, 2017.

Liu, Y. Universal low-rank matrix recovery from Pauli measurements. In Adv. Neural Inf. Proc. Sys. (NIPS), pp. 1638–1646, 2011.

Livni, R., Shalev-Shwartz, S., and Shamir, O. On the computational efficiency of training neural networks. In Adv. Neural Inf. Proc. Sys. (NIPS), pp. 855–863, 2014.

Mahoney, M. and Drineas, P. CUR matrix decompositions for improved data analysis. Proc. Natl. Acad. Sci., 106(3):697–702, 2009.

Musco, C. and Musco, C. Randomized block Krylov methods for stronger and faster approximate singular value decomposition. In Adv. Neural Inf. Proc. Sys. (NIPS), pp. 1396–1404, 2015.

Netrapalli, P., Jain, P., and Sanghavi, S. Phase retrieval using alternating minimization. In Adv. Neural Inf. Proc. Sys. (NIPS), pp. 2796–2804, 2013.

Recht, B., Fazel, M., and Parrilo, P. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501, 2010.

Rokhlin, V., Szlam, A., and Tygert, M. A randomized algorithm for principal component analysis. SIAM J. Matrix Anal. Appl., 31(3):1100–1124, 2009.

Shalev-Shwartz, S., Gonen, A., and Shamir, O. Large-scale convex minimization with a low-rank constraint. In Proc. Int. Conf. Machine Learning, pp. 329–336, 2011.

Soltanolkotabi, M., Javanmard, A., and Lee, J. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. arXiv preprint arXiv:1707.04926, 2017.

Tian, Y. Symmetry-breaking convergence analysis of certain two-layered neural networks with ReLU nonlinearity. In Submitted to ICLR 2017, 2016.

Tropp, J. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1-2):1–230, 2015.

Tu, S., Boczar, R., Simchowitz, M., Soltanolkotabi, M., and Recht, B. Low-rank solutions of linear matrix equations via Procrustes flow. In Proc. Int. Conf. Machine Learning, pp. 964–973, 2016.

Vershynin, R. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.

Zheng, Q. and Lafferty, J. A convergent gradient descent algorithm for rank minimization and semidefinite programming from random linear measurements. In Adv. Neural Inf. Proc. Sys. (NIPS), pp. 109–117, 2015.

Zhong, K., Jain, P., and Dhillon, I. Efficient matrix sensing using rank-1 Gaussian measurements. In Int. Conf. on Algorithmic Learning Theory, pp. 3–18. Springer, 2015.

Zhong, K., Song, Z., Jain, P., Bartlett, P. L., and Dhillon, I. Recovery guarantees for one-hidden-layer neural networks. Proc. Int. Conf. Machine Learning, 2016.