
Convexified Convolutional Neural Networks

Yuchen Zhang¹   Percy Liang¹   Martin J. Wainwright²

Abstract

We describe the class of convexified convolutional neural networks (CCNNs), which capture the parameter sharing of convolutional neural networks in a convex manner. By representing the nonlinear convolutional filters as vectors in a reproducing kernel Hilbert space, the CNN parameters can be represented in terms of a low-rank matrix, and the rank constraint can be relaxed so as to obtain a convex optimization problem. For learning two-layer convolutional neural networks, we prove that the generalization error obtained by a convexified CNN converges to that of the best possible CNN. For learning deeper networks, we train CCNNs in a layer-wise manner. Empirically, we find that CCNNs achieve competitive or better performance than CNNs trained by backpropagation, SVMs, fully-connected neural networks, stacked denoising auto-encoders, and other baseline methods.

1. Introduction

Convolutional neural networks (CNNs) (LeCun et al., 1998) have proven successful across many tasks including image classification (LeCun et al., 1998; Krizhevsky et al., 2012), face recognition (Lawrence et al., 1997), speech recognition (Hinton et al., 2012), text classification (Wang et al., 2012), and game playing (Mnih et al., 2015; Silver et al., 2016). There are two principal advantages of a CNN over a fully-connected neural network: (i) sparsity—each nonlinear convolutional filter acts only on a local patch of the input, and (ii) parameter sharing—the same filter is applied to each patch.

However, as with most neural networks, the standard approach to training CNNs is based on solving a nonconvex optimization problem that is known to be NP-hard (Blum & Rivest, 1992). In practice, researchers use some flavor of stochastic gradient method, in which gradients are computed via backpropagation (Bottou, 1998). This approach has two drawbacks: (i) the rate of convergence, which is at best only to a local optimum, can be slow due to nonconvexity (for instance, see Fahlman (1988)), and (ii) its statistical properties are very difficult to understand, as the actual performance is determined by some combination of the CNN architecture along with the optimization algorithm.

¹Stanford University, CA, USA. ²University of California, Berkeley, CA, USA. Correspondence to: Yuchen Zhang <[email protected]>.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

In this paper, with the goal of addressing these two challenges, we propose a new model class known as convexified convolutional neural networks (CCNNs). These models have two desirable features. First, training a CCNN corresponds to a convex optimization problem, which can be solved efficiently and optimally via a projected gradient algorithm. Second, the statistical properties of CCNN models can be studied in a precise and rigorous manner. We obtain CCNNs by convexifying two-layer CNNs; doing so requires overcoming two challenges. First, the activation function of a CNN is nonlinear. To address this issue, we relax the class of CNN filters to a reproducing kernel Hilbert space (RKHS). This approach is inspired by Zhang et al. (2016a), who put forth a relaxation for fully-connected neural networks. Second, the parameter sharing induced by CNNs is crucial to their effectiveness and must be preserved. We show that CNNs with RKHS filters can be parametrized by a low-rank matrix. Relaxing this low-rank constraint to a nuclear norm constraint leads to our final formulation of CCNNs.

On the theoretical front, we prove an oracle inequality on the generalization error achieved by our class of CCNNs, showing that it is upper bounded by the best possible performance achievable by a two-layer CNN given infinite data—a quantity to which we refer as the oracle risk—plus a model complexity term that decays to zero polynomially in the sample size. Our results suggest that the sample complexity for CCNNs is significantly lower than that of convexified fully-connected neural networks (Zhang et al., 2016a), highlighting the importance of parameter sharing. For models with more than one hidden layer, our theory does not apply, but we provide encouraging empirical results using a greedy layer-wise training heuristic. Finally, we apply CCNNs to the MNIST handwritten digit dataset as well as four variation datasets (VariationsMNIST), and find that they achieve state-of-the-art accuracy.

Related work. With the empirical success of deep neural networks, there has been increasing interest in understanding their connection to convex optimization. Bengio et al. (2005) showed how to formulate neural network training as a convex optimization problem involving an infinite number of parameters. Aslan et al. (2013; 2014) proposed a method for learning multi-layer latent-variable models and showed that, for certain activation functions, the proposed method is a convex relaxation for learning fully-connected neural networks.

Past work has studied learning translation-invariant features without backpropagation. Mairal et al. (2014) present convolutional kernel networks: they propose a translation-invariant kernel whose feature mapping can be approximated by a composition of convolution, nonlinearity, and pooling operators, obtained through unsupervised learning. However, this method is not equipped with the optimality guarantees that we provide for CCNNs in this paper, even for learning one convolutional layer. The ScatNet method (Bruna & Mallat, 2013) uses translation- and deformation-invariant filters constructed by wavelet analysis; however, these filters are independent of the data, in contrast to CCNNs. Daniely et al. (2016) show that a randomly initialized CNN can extract features as powerful as kernel methods, but it is not clear how to provably improve the model from random initialization.

Notation. For any positive integer n, we use [n] as a shorthand for the discrete set {1, 2, . . . , n}. For a rectangular matrix A, let ‖A‖∗ be its nuclear norm, ‖A‖_2 be its spectral norm (i.e., maximal singular value), and ‖A‖_F be its Frobenius norm. We use ℓ2(N) to denote the set of countably infinite-dimensional vectors v = (v_1, v_2, . . .) such that ∑_{ℓ=1}^∞ v_ℓ² < ∞. For any vectors u, v ∈ ℓ2(N), the inner product ⟨u, v⟩ := ∑_{ℓ=1}^∞ u_ℓ v_ℓ and the ℓ2-norm ‖u‖_2 := √⟨u, u⟩ are well defined.

2. Background and problem setup

In this section, we formalize the class of convolutional neural networks to be learned and describe the associated nonconvex optimization problem.

2.1. Convolutional neural networks

At a high level, a two-layer CNN¹ is a function that maps an input vector x ∈ R^{d_0} (e.g., an image) to an output vector y ∈ R^{d_2} (e.g., classification scores for d_2 classes). This mapping is formed in the following manner:

¹Average pooling and multiple channels are also an integral part of CNNs, but they do not present any new technical challenges, so we defer these extensions to Section 4.

• First, we extract a collection of P vectors {z_p(x)}_{p=1}^P from the full input vector x. Each vector z_p(x) ∈ R^{d_1} is referred to as a patch, and these patches may depend on overlapping components of x.

• Second, given some choice of activation function σ : R → R and a collection of weight vectors {w_j}_{j=1}^r in R^{d_1}, we define the functions

    h_j(z) := σ(w_j⊤ z)  for each patch z ∈ R^{d_1}.   (1)

Each function h_j (for j ∈ [r]) is known as a filter; note that the same filters are applied to each patch—this corresponds to the parameter sharing of a CNN.

• Third, for each patch index p ∈ [P], filter index j ∈ [r], and output coordinate k ∈ [d_2], we introduce a coefficient α_{k,j,p} ∈ R that governs the contribution of the filter h_j on patch z_p(x) to the output f_k(x). The final form of the CNN is f(x) := (f_1(x), . . . , f_{d_2}(x)), where the k-th component is given by

    f_k(x) := ∑_{j=1}^r ∑_{p=1}^P α_{k,j,p} h_j(z_p(x)).   (2)

Taking the patch functions {z_p}_{p=1}^P and the activation function σ as fixed, the parameters of the CNN are the filter vectors w := {w_j ∈ R^{d_1} : j ∈ [r]} along with the collection of coefficient vectors α := {α_{k,j} ∈ R^P : k ∈ [d_2], j ∈ [r]}. We assume that all patch vectors z_p(x) ∈ R^{d_1} are contained in the unit ℓ2-ball. This assumption can be satisfied without loss of generality by normalization: multiplying every patch z_p(x) by a constant γ > 0 and multiplying the filter vectors w by 1/γ makes the assumption hold without changing the output of the network.
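To make the indexing in equations (1) and (2) concrete, here is a small NumPy sketch of the two-layer CNN map; the sliding-window patch extractor, the tanh activation, and all dimensions are illustrative assumptions rather than the configuration used in the experiments.

```python
import numpy as np

def extract_patches(x, d1):
    """Toy patch map: all contiguous windows of length d1 (stride 1)."""
    P = len(x) - d1 + 1
    return np.stack([x[p:p + d1] for p in range(P)])   # shape (P, d1)

def cnn_forward(x, W, alpha, sigma=np.tanh, d1=5):
    """Two-layer CNN of equation (2).

    W     : (r, d1)     filter weights w_j (shared across patches)
    alpha : (d2, r, P)  coefficients alpha_{k,j,p}
    """
    Z = extract_patches(x, d1)            # rows are the patches z_p(x)
    H = sigma(Z @ W.T)                    # H[p, j] = h_j(z_p(x)) = sigma(<w_j, z_p(x)>)
    # f_k(x) = sum_{j,p} alpha_{k,j,p} h_j(z_p(x))
    return np.einsum('kjp,pj->k', alpha, H)

rng = np.random.default_rng(0)
d0, d1, r, d2 = 20, 5, 3, 4
x = rng.normal(size=d0)
P = d0 - d1 + 1
W = rng.normal(size=(r, d1))
alpha = rng.normal(size=(d2, r, P))
print(cnn_forward(x, W, alpha))           # vector of d2 = 4 scores
```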

Given some positive radii B_1 and B_2, we consider the model class

    F_cnn(B_1, B_2) := { f of the form (2) : max_{j∈[r]} ‖w_j‖_2 ≤ B_1 and max_{k∈[d_2], j∈[r]} ‖α_{k,j}‖_2 ≤ B_2 }.   (3)

When the radii (B_1, B_2) are clear from context, we adopt F_cnn as a convenient shorthand.

2.2. Empirical risk minimization

Given an input-output pair (x, y) and a CNN f, we let L(f(x); y) denote the loss incurred when the output y is predicted via f(x). We assume that the loss function L is convex and L-Lipschitz in its first argument for any value of its second argument. As a concrete example, for multiclass classification with d_2 classes, the output y takes values in the discrete set [d_2] = {1, 2, . . . , d_2}. Given a vector f(x) = (f_1(x), . . . , f_{d_2}(x)) ∈ R^{d_2} of classification scores, the associated multiclass logistic loss for a pair (x, y) is given by

    L(f(x); y) := −f_y(x) + log( ∑_{y′=1}^{d_2} exp(f_{y′}(x)) ).

Given n training examples {(x_i, y_i)}_{i=1}^n, we would like to compute an empirical risk minimizer

    f̂_cnn ∈ argmin_{f∈F_cnn} ∑_{i=1}^n L(f(x_i); y_i).   (4)

Recalling that functions f ∈ F_cnn depend on the parameters w and α in a highly nonlinear way (2), this optimization problem is nonconvex. As mentioned earlier, heuristics based on stochastic gradient methods are used in practice, which makes it challenging to gain a theoretical understanding of their behavior. Thus, in the next section, we describe a relaxation of the class F_cnn for which empirical risk minimization is convex.

3. Convexifying CNNs

We now turn to the development of the class of convexified CNNs. We begin in Section 3.1 by illustrating the procedure for the special case of the linear activation function. Although the linear case is not of practical interest, it provides intuition for our more general convexification procedure, described in Section 3.2, which applies to nonlinear activation functions. In particular, we show how embedding the nonlinear problem into an appropriately chosen reproducing kernel Hilbert space (RKHS) allows us to again reduce to the linear setting.

3.1. Linear activation functions: low rank relaxations

In order to develop intuition for our approach, let us begin by considering the simple case of the linear activation function σ(t) = t. In this case, the filter h_j applied to the patch vector z_p(x) outputs a Euclidean inner product of the form h_j(z_p(x)) = ⟨z_p(x), w_j⟩. For each x ∈ R^{d_0}, we first define the P × d_1-dimensional matrix

    Z(x) := [ z_1(x)⊤ ; . . . ; z_P(x)⊤ ] ∈ R^{P×d_1},   (5)

whose p-th row is the patch z_p(x)⊤.

We also define the P-dimensional vector α_{k,j} := (α_{k,j,1}, . . . , α_{k,j,P})⊤. With this notation, we can rewrite equation (2) for the k-th output as

    f_k(x) = ∑_{j=1}^r ∑_{p=1}^P α_{k,j,p} ⟨z_p(x), w_j⟩ = ∑_{j=1}^r α_{k,j}⊤ Z(x) w_j = tr( Z(x) ( ∑_{j=1}^r w_j α_{k,j}⊤ ) ) = tr(Z(x) A_k),   (6)

where in the final step we have defined the d_1 × P-dimensional matrix A_k := ∑_{j=1}^r w_j α_{k,j}⊤. Observe that f_k now depends linearly on the matrix parameter A_k. Moreover, the matrix A_k has rank at most r, due to the parameter sharing of CNNs. See Figure 1 for a graphical illustration of this model structure.

Letting A := (A_1, . . . , A_{d_2}) be a concatenation of these matrices across all d_2 output coordinates, we can then define a function f_A : R^{d_0} → R^{d_2} of the form

    f_A(x) := ( tr(Z(x)A_1), . . . , tr(Z(x)A_{d_2}) ).   (7)

Note that these functions have a linear parameterization in terms of the underlying matrix A. Our model class corresponds to a collection of such functions obtained by imposing certain constraints on the underlying matrix A: in particular, we define F_cnn(B_1, B_2) to be the set of functions f_A that satisfy (C1) max_{j∈[r]} ‖w_j‖_2 ≤ B_1 and max_{k∈[d_2], j∈[r]} ‖α_{k,j}‖_2 ≤ B_2, and (C2) rank(A) = r. This is simply an alternative formulation of our original class of CNNs defined in equation (3). Notice that if the filter weights w_j are not shared across all patches, then constraint (C1) still holds, but constraint (C2) no longer does. Thus, the parameter sharing of CNNs is realized by the low-rank constraint (C2). The matrix A of rank r can be decomposed as A = UV⊤, where both U and V have r columns. The column space of the matrix A contains the convolution parameters {w_j}, and the row space of A contains the output parameters {α_{k,j}}.
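As a quick numerical sanity check of the reparameterization in equation (6), the snippet below (toy dimensions, linear activation) builds A_k = ∑_j w_j α_{k,j}⊤ for random parameters and confirms that tr(Z(x)A_k) reproduces the double sum in equation (2).

```python
import numpy as np

rng = np.random.default_rng(1)
P, d1, r = 6, 5, 3
Z = rng.normal(size=(P, d1))                     # rows z_p(x)^T, as in equation (5)
W = rng.normal(size=(r, d1))                     # filters w_j
alpha = rng.normal(size=(r, P))                  # alpha_{k,j,p} for one fixed output k

# Direct evaluation of the double sum in equation (2) with sigma(t) = t.
direct = sum(alpha[j, p] * Z[p] @ W[j] for j in range(r) for p in range(P))

# Low-rank reparameterization of equation (6): A_k = sum_j w_j alpha_{k,j}^T.
A_k = sum(np.outer(W[j], alpha[j]) for j in range(r))    # shape (d1, P), rank <= r
via_trace = np.trace(Z @ A_k)

print(np.allclose(direct, via_trace))             # True
print(np.linalg.matrix_rank(A_k))                 # at most r = 3
```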

The matrices satisfying constraints (C1) and (C2) form a nonconvex set. A standard convex relaxation is based on the nuclear norm ‖A‖∗, corresponding to the sum of the singular values of A. It is straightforward to verify that any matrix A satisfying constraints (C1) and (C2) must have nuclear norm bounded as ‖A‖∗ ≤ B_1 B_2 r √d_2. Consequently, if we define the function class

    F_ccnn := { f_A : ‖A‖∗ ≤ B_1 B_2 r √d_2 },   (8)

then we are guaranteed that F_ccnn ⊇ F_cnn.
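For completeness, here is the short calculation behind the bound ‖A‖∗ ≤ B_1 B_2 r √d_2; we write α̃_j ∈ R^{Pd_2} for the concatenation (α_{1,j}, . . . , α_{d_2,j}) (this concatenated notation is ours, not the paper's), so that A = ∑_{j=1}^r w_j α̃_j⊤:

```latex
\|A\|_* \;=\; \Big\| \sum_{j=1}^{r} w_j \tilde{\alpha}_j^\top \Big\|_*
\;\le\; \sum_{j=1}^{r} \big\| w_j \tilde{\alpha}_j^\top \big\|_*
\;=\; \sum_{j=1}^{r} \|w_j\|_2 \, \|\tilde{\alpha}_j\|_2
\;\le\; \sum_{j=1}^{r} B_1 \sqrt{\textstyle\sum_{k=1}^{d_2} \|\alpha_{k,j}\|_2^2}
\;\le\; B_1 B_2 \, r \sqrt{d_2},
```

using the triangle inequality for the nuclear norm and the fact that a rank-one matrix uv⊤ has nuclear norm ‖u‖_2 ‖v‖_2.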

We propose to minimize the empirical risk (4) over F_ccnn instead of F_cnn; doing so defines a convex optimization problem over this richer class of functions:

    f̂_ccnn := argmin_{f_A ∈ F_ccnn} ∑_{i=1}^n L(f_A(x_i); y_i).   (9)

In Section 3.3, we describe iterative algorithms that can be used to solve this convex problem in the more general setting of nonlinear activation functions.

3.2. Nonlinear activations: RKHS filters

For nonlinear activation functions σ, we relax the class of CNN filters to a reproducing kernel Hilbert space (RKHS). As we will show, this relaxation allows us to reduce the problem to the linear activation case.

Figure 1. The k-th output of a CNN, f_k(x) ∈ R, can be expressed as the product of a matrix Z(x) ∈ R^{P×d_1}, whose rows are features of the input patches, and a rank-r matrix A_k ∈ R^{d_1×P}, which is made up of the filter weights {w_j} and the coefficients {α_{k,j,p}}. Due to the parameter sharing intrinsic to CNNs, the matrix A_k inherits a low-rank structure, which can be encouraged via convex relaxation using the nuclear norm.

Let K : R^{d_1} × R^{d_1} → R be a positive semidefinite kernel function. For particular choices of kernels (e.g., the Gaussian RBF kernel) and a sufficiently smooth activation function σ, we are able to show that the filter h : z ↦ σ(⟨w, z⟩) is contained in the RKHS induced by the kernel function K. See Section 3.4 for the choice of the kernel function and the activation function. Let S := {z_p(x_i) : p ∈ [P], i ∈ [n]} be the set of patches in the training dataset. The representer theorem then implies that for any patch z_p(x_i) ∈ S, the function value can be represented as

    h(z_p(x_i)) = ∑_{(i′,p′) ∈ [n]×[P]} c_{i′,p′} K(z_p(x_i), z_{p′}(x_{i′}))   (10)

for some coefficients {c_{i′,p′}}_{(i′,p′)∈[n]×[P]}. Filters of the form (10) are members of the RKHS, because they are linear combinations of the basis functions z ↦ K(z, z_{p′}(x_{i′})). Such filters are parametrized by a finite set of coefficients {c_{i′,p′}}_{(i′,p′)∈[n]×[P]}, which can be estimated via empirical risk minimization.

Let K ∈ R^{nP×nP} be the symmetric kernel matrix, with rows and columns indexed by the example-patch index pairs (i, p) ∈ [n] × [P]. The entry at row (i, p) and column (i′, p′) of the matrix K is equal to K(z_p(x_i), z_{p′}(x_{i′})). To avoid re-deriving everything in the kernelized setting, we perform a reduction to the linear setting of Section 3.1. Consider a factorization K = QQ⊤ of the kernel matrix, where Q ∈ R^{nP×m}; one example is the Cholesky factorization with m = nP. We can interpret each row Q_{(i,p)} ∈ R^m as a feature vector in place of the original z_p(x_i) ∈ R^{d_1}, and rewrite equation (10) as

    h(z_p(x_i)) = ⟨Q_{(i,p)}, w⟩,  where  w := ∑_{(i′,p′)} c_{i′,p′} Q_{(i′,p′)}.

In order to learn the filter h, it suffices to learn the m-dimensional vector w. To do this, we define patch matrices Z(x_i) ∈ R^{P×m} for each i ∈ [n] so that the p-th row of Z(x_i) is Q_{(i,p)}. Then the problem reduces to learning a linear filter with coefficient vector w. Carrying out all of Section 3.1 and solving the ERM gives us a parameter matrix A ∈ R^{m×Pd_2}. The only difference is that the ℓ2-norm constraint (C1) needs to be adapted to the norm of the RKHS; see Appendix B for details.

At test time, given a new input x ∈ R^{d_0}, we can compute the patch matrix Z(x) ∈ R^{P×m} as follows:

• The p-th row of Z(x) is the feature vector for patch p, equal to Q†v(z_p(x)) ∈ R^m, where for any patch z the vector v(z) ∈ R^{nP} has (i, p)-th coordinate equal to K(z, z_p(x_i)). We note that if x is an instance x_i in the training set, then the vector Q†v(z_p(x)) is exactly equal to Q_{(i,p)}; thus the mapping Z(·) applies to both training and testing.

• We then compute the predictor f_k(x) = tr(Z(x)A_k) via equation (6). Note that we do not need to explicitly compute the filter values h_j(z_p(x)) in order to compute the output of the CCNN.
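The reduction above amounts to a few matrix operations; the sketch below builds the kernel matrix over training patches, factors it, and forms the patch matrices for training inputs as well as for a new input via Q†v(z). The Gaussian kernel, the exact eigendecomposition (m = nP), and the toy dimensions are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(Z1, Z2, gamma=1.0):
    """K(z, z') = exp(-gamma * ||z - z'||^2) for all pairs of rows."""
    sq = (np.square(Z1).sum(1)[:, None] + np.square(Z2).sum(1)[None, :]
          - 2.0 * Z1 @ Z2.T)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(2)
n, P, d1 = 8, 4, 5
patches = rng.normal(size=(n, P, d1))
patches /= np.linalg.norm(patches, axis=2, keepdims=True)   # unit-norm patches

S = patches.reshape(n * P, d1)            # all training patches, indexed by (i, p)
K = gaussian_kernel(S, S)                 # nP x nP kernel matrix
# Exact factorization K = Q Q^T via eigendecomposition (m = nP); a Nystrom or
# random-feature approximation would give a thinner Q instead.
evals, evecs = np.linalg.eigh(K)
evals = np.clip(evals, 0.0, None)
Q = evecs * np.sqrt(evals)                # nP x m
Q_pinv = np.linalg.pinv(Q)

# Training-time patch matrices: the (i, p)-th row of Q.
Z_train = Q.reshape(n, P, -1)             # Z(x_i) is the i-th P x m slice

# Test-time patch matrix for a new input: p-th row is Q^+ v(z_p(x)).
x_new_patches = rng.normal(size=(P, d1))
x_new_patches /= np.linalg.norm(x_new_patches, axis=1, keepdims=True)
V = gaussian_kernel(x_new_patches, S)     # P x nP, rows are v(z_p(x))
Z_new = V @ Q_pinv.T                      # P x m

# Consistency check: for a training input, the test-time map recovers Z(x_i)
# (up to numerical error).
V0 = gaussian_kernel(patches[0], S)
print(np.allclose(V0 @ Q_pinv.T, Z_train[0], atol=1e-5))     # True
```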

Retrieving filters. When we learn multi-layer CCNNs (Section 4), we need to compute the filters h_j explicitly in order to form the inputs to the next layer. Recall from Section 3.1 that the column space of the matrix A corresponds to parameters of the convolutional layer, and the row space of A corresponds to parameters of the output layer. Thus, once we obtain the parameter matrix Â, we compute a rank-r approximation Â ≈ Û V̂⊤ and set the j-th filter h_j to the mapping

    z ↦ ⟨Û_j, Q†v(z)⟩  for any patch z ∈ R^{d_1},   (11)

where Û_j ∈ R^m is the j-th column of the matrix Û, and Q†v(z) represents the feature vector for patch z.² The matrix V̂⊤ encodes parameters of the output layer and thus does not appear in the filter expression (11).

²If z is a patch in the training set, namely z = z_p(x_i), then we have Q†v(z) = Q_{(i,p)}.


Algorithm 1  Learning two-layer CCNNs

Input: Data {(x_i, y_i)}_{i=1}^n, kernel function K, regularization parameter R > 0, number of filters r.

1. Construct a kernel matrix K ∈ R^{nP×nP} such that the entry at column (i, p) and row (i′, p′) is equal to K(z_p(x_i), z_{p′}(x_{i′})). Compute a factorization K = QQ⊤, or an approximation K ≈ QQ⊤, where Q ∈ R^{nP×m}.

2. For each x_i, construct the patch matrix Z(x_i) ∈ R^{P×m} whose p-th row is the (i, p)-th row of Q, where Z(·) is defined in Section 3.2.

3. Solve the following optimization problem to obtain a matrix Â = (Â_1, . . . , Â_{d_2}):

    Â ∈ argmin_{‖A‖∗ ≤ R} L(A),  where  L(A) := ∑_{i=1}^n L( (tr(Z(x_i)A_1), . . . , tr(Z(x_i)A_{d_2})); y_i ).   (12)

4. Compute a rank-r approximation Û V̂⊤ ≈ Â, where Û ∈ R^{m×r} and V̂ ∈ R^{Pd_2×r}.

Output: predictor f̂_ccnn(x) := (tr(Z(x)Â_1), . . . , tr(Z(x)Â_{d_2})) and the convolutional layer output H(x) := Û⊤Z(x)⊤.

doesn’t appear in the filter expression (11). It is impor-tant to note that the filter retrieval is not unique, becausethe rank-r approximation of the matrix A is not unique.The heuristic we suggest is to form the singular value de-composition A = UΛV >, then define U to be the first rcolumns of U .

When we apply all of the r filters to all patches of an inputx ∈ Rd0 , the resulting output is H(x) := U>Z(x)> —this is an r×P matrix whose element at row j and columnp is equal to hj(zp(x)).
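A minimal sketch of this retrieval step, assuming a learned matrix A and the patch-feature map Z(·) of Section 3.2 are available; it uses the truncated-SVD heuristic suggested above, with toy dimensions.

```python
import numpy as np

def retrieve_filters(A, r):
    """Rank-r factorization A ~= U_hat V_hat^T via the top-r SVD components."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_hat = U[:, :r]                        # columns span the filter space
    V_hat = (Vt[:r].T) * s[:r]              # output-layer parameters (unused for filters)
    return U_hat, V_hat

def conv_layer_output(U_hat, Z_x):
    """H(x) = U_hat^T Z(x)^T, an r x P matrix with H[j, p] = h_j(z_p(x))."""
    return U_hat.T @ Z_x.T

rng = np.random.default_rng(3)
m, P, d2, r = 10, 4, 3, 2
A = rng.normal(size=(m, P * d2))
U_hat, V_hat = retrieve_filters(A, r)
Z_x = rng.normal(size=(P, m))               # stand-in for Z(x) built as in Section 3.2
print(conv_layer_output(U_hat, Z_x).shape)  # (2, 4): r filters applied to P patches
```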

3.3. Algorithm

The algorithm for learning a two-layer CCNN is summarized in Algorithm 1; it is a formalization of the steps described in Section 3.2. In order to solve the optimization problem (12), the simplest approach is via projected gradient descent: at iteration t, using a step size η_t > 0, we form the new matrix A^{t+1} based on the previous iterate A^t according to

    A^{t+1} = Π_R( A^t − η_t ∇_A L(A^t) ).   (13)

Here ∇_A L denotes the gradient of the objective function defined in (12), and Π_R denotes the Euclidean projection onto the nuclear norm ball {A : ‖A‖∗ ≤ R}. This nuclear norm projection can be obtained by first computing the singular value decomposition of A, and then projecting the vector of singular values onto the ℓ1-ball. The latter projection step can be carried out efficiently by the algorithm of Duchi et al. (2008). There are other efficient optimization algorithms (Duchi et al., 2011; Xiao & Zhang, 2014) for solving problem (12). All of these algorithms can be executed in a stochastic fashion, so that each gradient step processes a mini-batch of examples.
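A minimal NumPy sketch of the update (13): the nuclear-norm projection computes an SVD and projects the singular values onto the ℓ1-ball (a simple sort-based variant of the projection of Duchi et al. (2008)); the loss gradient is left abstract and replaced by a random matrix in the toy usage below.

```python
import numpy as np

def project_simplex_l1(s, R):
    """Euclidean projection of a nonnegative vector s onto {x >= 0 : sum(x) <= R}."""
    if s.sum() <= R:
        return s
    u = np.sort(s)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(u) + 1) > (css - R))[0][-1]
    theta = (css[rho] - R) / (rho + 1.0)
    return np.maximum(s - theta, 0.0)

def project_nuclear_ball(A, R):
    """Projection onto {A : ||A||_* <= R}: shrink the singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * project_simplex_l1(s, R)) @ Vt

def projected_gradient_step(A, grad, eta, R):
    """One update of equation (13): A <- Pi_R(A - eta * grad L(A))."""
    return project_nuclear_ball(A - eta * grad, R)

# Toy usage with a random "gradient"; in Algorithm 1 the gradient comes from the loss (12).
rng = np.random.default_rng(4)
A = rng.normal(size=(10, 12))
A = projected_gradient_step(A, rng.normal(size=A.shape), eta=0.1, R=5.0)
print(np.linalg.svd(A, compute_uv=False).sum() <= 5.0 + 1e-8)   # True
```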

The computational complexity of each iteration depends on the width m of the matrix Q. Setting m = nP allows us to solve the exact kernelized problem, but to improve computational efficiency, we can use a Nyström approximation (Drineas & Mahoney, 2005) or a random feature approximation (Rahimi & Recht, 2007); both are randomized methods for obtaining a tall-and-thin matrix Q ∈ R^{nP×m} such that K ≈ QQ⊤. Typically, the parameter m is chosen to be much smaller than nP. In order to compute the matrix Q, the Nyström method takes O(m²nP) time. The random feature approximation takes O(mnPd_1) time, which can be improved to O(mnP log d_1) using the fast Hadamard transform (Le et al., 2013). The time complexity of projected gradient descent also scales with m rather than with nP.
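For reference, the random feature approximation mentioned above (Rahimi & Recht, 2007) has the following shape for the Gaussian kernel (15): each patch is mapped to m random cosine features, giving a tall-and-thin Q with QQ⊤ ≈ K. The specific m, γ, and dimensions below are placeholders.

```python
import numpy as np

def random_fourier_features(patches, m=500, gamma=1.0, seed=0):
    """Map rows of `patches` (nP x d1) to m features so that Q @ Q.T ~= K_gaussian."""
    rng = np.random.default_rng(seed)
    d1 = patches.shape[1]
    Omega = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d1, m))   # random frequencies
    b = rng.uniform(0.0, 2.0 * np.pi, size=m)                      # random phases
    return np.sqrt(2.0 / m) * np.cos(patches @ Omega + b)          # nP x m

rng = np.random.default_rng(5)
S = rng.normal(size=(40, 6))
S /= np.linalg.norm(S, axis=1, keepdims=True)
Q = random_fourier_features(S, m=2000, gamma=0.5)
K_exact = np.exp(-0.5 * np.square(S[:, None] - S[None, :]).sum(-1))
print(np.abs(Q @ Q.T - K_exact).max())    # small, and decreases as m grows
```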

3.4. Theoretical results

In this section, we upper bound the generalization error of Algorithm 1, proving that it converges to the best possible generalization error of a CNN. We focus on the binary classification case, where the output dimension is d_2 = 1.³

³We can treat the multiclass case by performing a standard one-versus-all reduction to the binary case.

Learning a CCNN requires a kernel function K. We consider kernel functions whose associated RKHS is large enough to contain any function of the form z ↦ q(⟨w, z⟩), where q is an arbitrary polynomial function and w ∈ R^{d_1} is an arbitrary vector. As a concrete example, we consider the inverse polynomial kernel

    K(z, z′) := 1/(2 − ⟨z, z′⟩),  ‖z‖_2 ≤ 1, ‖z′‖_2 ≤ 1.   (14)

This kernel was studied by Shalev-Shwartz et al. (2011) for learning halfspaces, and by Zhang et al. (2016a) for learning fully-connected neural networks. We also consider the Gaussian RBF kernel

    K(z, z′) := e^{−γ‖z−z′‖_2²},  ‖z‖_2 = ‖z′‖_2 = 1.   (15)

As shown in Appendix A, the inverse polynomial kernel and the Gaussian kernel satisfy the above notion of richness. We focus on these two kernels for the theoretical analysis.
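In code, the inverse polynomial kernel (14) is a one-liner on unit-norm patches; the snippet below evaluates it on a small random sample and checks symmetry and numerical positive semidefiniteness (a sanity check only, not part of the analysis).

```python
import numpy as np

def inverse_poly_kernel(Z1, Z2):
    """K(z, z') = 1 / (2 - <z, z'>), assuming ||z||_2 <= 1 and ||z'||_2 <= 1."""
    return 1.0 / (2.0 - Z1 @ Z2.T)

rng = np.random.default_rng(6)
Z = rng.normal(size=(30, 8))
Z /= np.linalg.norm(Z, axis=1, keepdims=True)       # put patches on the unit sphere
K = inverse_poly_kernel(Z, Z)
print(np.allclose(K, K.T))                           # symmetric
print(np.linalg.eigvalsh(K).min() > -1e-10)          # numerically PSD
```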

Let f̂_ccnn be the CCNN that minimizes the empirical risk (12) using one of the two kernels above. Our main theoretical result is that, for suitably chosen activation functions, the generalization error of f̂_ccnn is comparable to that of the best CNN model. In particular, we consider the following types of activation functions σ:

(a) arbitrary polynomial functions (e.g., used by Chen & Manning (2014); Livni et al. (2014));

(b) the sinusoid activation function σ(t) := sin(t) (e.g., used by Sopena et al. (1999); Isa et al. (2010));

(c) the erf function σ_erf(t) := (2/√π) ∫_0^t e^{−z²} dz, which represents a close approximation to the sigmoid function (Zhang et al., 2016a);

(d) the smoothed hinge loss σ_sh(t) := ∫_{−∞}^t ½(σ_erf(z) + 1) dz, which represents a close approximation to the ReLU function (Zhang et al., 2016a).

To understand how these activation functions pair with our choice of kernels, we consider polynomial expansions of the above activation functions, σ(t) = ∑_{j=0}^∞ a_j t^j, and note that the smoothness of these functions is characterized by the rate at which their coefficients {a_j}_{j=0}^∞ converge to zero. If σ is a polynomial in category (a), then the richness of the RKHS guarantees that it contains the class of filters activated by σ. If σ is a non-polynomial function in categories (b), (c), (d), then, as Appendix A shows, the RKHS contains the filter only if the coefficients {a_j}_{j=0}^∞ converge quickly enough to zero (the criterion depends on the concrete choice of the kernel). Concretely, the inverse polynomial kernel is shown to capture all four categories of activations: thus, (a), (b), (c), and (d) are all referred to as valid activation functions for the inverse polynomial kernel. The Gaussian kernel induces a smaller RKHS, so only (a) and (b) are valid activation functions for the Gaussian kernel. In contrast, the sigmoid function and the ReLU function are not valid for either kernel, because their polynomial expansions fail to converge quickly enough, or, more intuitively, because they are not smooth enough to be contained in the RKHS.

We are ready to state the main theoretical result. In the theorem statement, we use K(X) ∈ R^{P×P} to denote the random kernel matrix obtained from an input vector X ∈ R^{d_0} drawn randomly from the population. More precisely, the (p, q)-th entry of K(X) is given by K(z_p(X), z_q(X)).

Theorem 1. Assume that the loss function L(·; y) is L-Lipschitz continuous for every y ∈ [d_2], and that K is the inverse polynomial kernel or the Gaussian kernel. For any valid activation function σ, there is a constant C_σ(B_1) such that by choosing the hyper-parameter R := C_σ(B_1) B_2 r in Algorithm 1, the expected generalization error is at most

    E_{X,Y}[L(f̂_ccnn(X); Y)] ≤ inf_{f ∈ F_cnn} E_{X,Y}[L(f(X); Y)] + c L C_σ(B_1) B_2 r √( log(nP) · E_X[‖K(X)‖_2] / n ),   (16)

where c > 0 is a universal constant.

Proof sketch. The proof of Theorem 1 consists of two parts. First, we consider a larger function class that contains the class of CNNs. This function class is defined as

    F_ccnn := { x ↦ ∑_{j=1}^{r*} ∑_{p=1}^P α_{j,p} h_j(z_p(x)) : r* < ∞   (17)
                and ∑_{j=1}^{r*} ‖α_j‖_2 ‖h_j‖_H ≤ C_σ(B_1) B_2 d_2 },   (18)

where ‖·‖_H is the norm of the RKHS associated with the kernel. This new function class relaxes the class of CNNs in two ways: 1) the filters are relaxed to belong to the RKHS, and 2) the ℓ2-norm bounds on the weight vectors are replaced by a single constraint on ‖α_j‖_2 and ‖h_j‖_H. We prove the following property for the predictor f̂_ccnn: it must be an empirical risk minimizer over F_ccnn. This property holds even though equation (18) defines a non-parametric function class F_ccnn, while Algorithm 1 optimizes f̂_ccnn over a parametric function class.

Second, we characterize the Rademacher complexity of this new function class F_ccnn, proving an upper bound for it based on matrix concentration theory. Combining this bound with classical Rademacher complexity theory (Bartlett & Mendelson, 2003), we conclude that the generalization loss of f̂_ccnn converges to the least possible generalization error over F_ccnn. The latter loss is bounded by the generalization loss of CNNs (because F_cnn ⊆ F_ccnn), which establishes the theorem. See the full version of this paper (Zhang et al., 2016b) for a rigorous proof of Theorem 1. □

Remark on activation functions. It is worth noting that the quantity C_σ(B_1) depends on the activation function σ, and more precisely on the convergence rate of the polynomial expansion of σ. Appendix A shows that if σ is a polynomial function of degree ℓ, then C_σ(B_1) = O(B_1^ℓ). If σ is the sinusoid function, the erf function, or the smoothed hinge loss, then the quantity C_σ(B_1) will be exponential in B_1. From an algorithmic perspective, we do not need to know the activation function in order to execute Algorithm 1. From a theoretical perspective, however, the choice of σ matters for Theorem 1, which compares f̂_ccnn with the best CNN, whose representation power is characterized by the choice of σ. Therefore, if a CNN with a low-degree polynomial σ performs well on a given task, then the CCNN also enjoys correspondingly strong generalization. Empirically, this is borne out: in Section 5, we show that the quadratic activation function performs almost as well as the ReLU function for digit classification.

Remark on parameter sharing. In order to demonstrate the importance of parameter sharing, consider a CNN without parameter sharing, so that we have filter weights w_{j,p} for each filter index j and patch index p. With this change, the new CNN output (2) is

    f(x) = ∑_{j=1}^r ∑_{p=1}^P α_{j,p} σ(w_{j,p}⊤ z_p(x)),

where α_{j,p} ∈ R and w_{j,p} ∈ R^{d_1}. Note that the hidden layer of this new network has P times more parameters than that of the convolutional neural network with parameter sharing. These networks without parameter sharing can be learned by the recursive kernel method proposed by Zhang et al. (2016a). Their paper shows that under the norm constraints ‖w_{j,p}‖_2 ≤ B′_1 and ∑_{j=1}^r ∑_{p=1}^P |α_{j,p}| ≤ B′_2, the excess risk of the recursive kernel method is at most O(L C_σ(B′_1) B′_2 √(K_max/n)), where K_max = max_{z:‖z‖_2≤1} K(z, z) is the maximal value of the kernel function. Plugging in the norm constraints of the function class F_cnn, we have B′_1 = B_1 and B′_2 = B_2 r √P. Thus, the expected risk of the estimated f̂ is bounded by

    E_{X,Y}[L(f̂(X); Y)] ≤ inf_{f ∈ F_cnn} E_{X,Y}[L(f(X); Y)] + c L C_σ(B_1) B_2 r √( P K_max / n ).   (19)

Comparing this bound to Theorem 1, we see that (apart from logarithmic terms) they differ in the multiplicative factors √(P K_max) versus √(E[‖K(X)‖_2]). Since the matrix K(X) is P-dimensional, we have

    ‖K(X)‖_2 ≤ max_{p∈[P]} ∑_{q∈[P]} |K(z_p(X), z_q(X))| ≤ P K_max.

This demonstrates that √(P K_max) is always greater than √(E[‖K(X)‖_2]). In general, the first term can be up to a factor of √P greater, which implies that the sample complexity of the recursive kernel method is up to P times greater than that of the CCNN. This difference is intuitive given that the recursive kernel method learns a model with P times more parameters. Although comparing upper bounds does not rigorously show that one method is better than the other, it gives intuition for the importance of parameter sharing.

4. Learning multi-layer CCNNs

In this section, we describe a heuristic method for learning CNNs with more layers. The idea is to estimate the parameters of the convolutional layers incrementally from bottom to top. Before presenting the multi-layer algorithm, we present two extensions: average pooling and multi-channel inputs.

Algorithm 2  Learning multi-layer CCNNs

Input: Data {(x_i, y_i)}_{i=1}^n, kernel function K, number of layers m, regularization parameters R_1, . . . , R_m, numbers of filters r_1, . . . , r_m.

Define H_1(x) = x. For each layer s = 2, . . . , m:
• Train a two-layer network by Algorithm 1, taking {(H_{s−1}(x_i), y_i)}_{i=1}^n as training examples and R_s, r_s as parameters. Let H_s be the output of the convolutional layer and f_s be the predictor.

Output: Predictor f_m and the top-layer output H_m.

Average pooling. Average pooling is a technique for reducing the output dimension of the convolutional layer from P × r to P′ × r with P′ < P. For the CCNN model, if we apply average pooling after the convolutional layer, then the k-th output of the CCNN model becomes tr(G Z(x) A_k), where G ∈ R^{P′×P} is the pooling matrix. Thus, performing a pooling operation requires only replacing every matrix Z(x_i) in problem (12) by the pooled matrix G Z(x_i). Note that the linearity of the CCNN allows us to effectively pool before convolution, even though for the CNN, pooling must be done after applying the nonlinear filters. The resulting ERM problem is still convex, and the number of parameters has been reduced P/P′-fold.
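Concretely, for non-overlapping average pooling the matrix G simply averages groups of rows of Z(x); a minimal sketch with toy sizes:

```python
import numpy as np

def average_pooling_matrix(P, pool_size):
    """G in R^{P' x P} that averages consecutive groups of `pool_size` patches."""
    P_prime = P // pool_size
    G = np.zeros((P_prime, P))
    for q in range(P_prime):
        G[q, q * pool_size:(q + 1) * pool_size] = 1.0 / pool_size
    return G

rng = np.random.default_rng(7)
P, m, pool_size = 8, 5, 2
Z_x = rng.normal(size=(P, m))
G = average_pooling_matrix(P, pool_size)
Z_pooled = G @ Z_x                 # P' x m; used in place of Z(x) in problem (12)
print(Z_pooled.shape)              # (4, 5)
```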

Processing multi-channel inputs. If our input has C channels (corresponding to RGB colors, for example), then the input becomes a matrix x ∈ R^{C×d_0}. The c-th row of the matrix x, denoted by x[c] ∈ R^{d_0}, is a vector representing the c-th channel. We define the multi-channel patch vector as a concatenation of the patch vectors for each channel:

    z_p(x) := (z_p(x[1]), . . . , z_p(x[C])) ∈ R^{Cd_1}.

Then we construct the feature matrix Z(x) using the concatenated patch vectors {z_p(x)}_{p=1}^P. From here, everything else in Algorithm 1 remains the same. We note that this approach learns a convex relaxation of filters taking the form σ(∑_{c=1}^C ⟨w_c, z_p(x[c])⟩), parametrized by the vectors {w_c}_{c=1}^C.

Multi-layer CCNN. Given these extensions, we are ready to present the algorithm for learning multi-layer CCNNs, summarized in Algorithm 2. For each layer s, we call Algorithm 1 using the output of the previous convolutional layer as input—note that this input consists of r channels (one per filter of the previous layer), so we must use the multi-channel extension. Algorithm 2 outputs a new convolutional layer along with a prediction function, which is kept only at the last layer. We optionally use average pooling after each successive layer to reduce the output dimension of the convolutional layers.
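A skeleton of Algorithm 2 as a plain loop; `train_two_layer_ccnn` is a hypothetical callable standing in for Algorithm 1 (assumed to return the layer's predictor and its convolutional-layer map), so this is a sketch of the control flow only.

```python
def train_multilayer_ccnn(X, y, kernel, Rs, rs, train_two_layer_ccnn):
    """Greedy layer-wise training (Algorithm 2, sketch).

    X: list of inputs; Rs, rs: per-layer nuclear-norm radii and filter counts.
    train_two_layer_ccnn: hypothetical implementation of Algorithm 1 returning
                          (predictor f_s, convolutional-layer map H_s).
    """
    H = X                                   # H_1(x) = x
    predictor = None
    for R_s, r_s in zip(Rs, rs):
        predictor, conv_map = train_two_layer_ccnn(H, y, kernel, R_s, r_s)
        H = [conv_map(h) for h in H]        # feed layer-s outputs to the next layer
    return predictor, H                     # keep only the last predictor
```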

5. Experiments

In this section, we compare the CCNN approach with other methods on the MNIST dataset and its more challenging variations (VariationsMNIST), including added white noise (rand), random rotation (rot), random image backgrounds (img), and the combination of the last two (img+rot). For all datasets, we use 10,000 images for training, 2,000 images for validation, and 50,000 images for testing. This 10k/2k/50k partitioning is standard for the MNIST variations (VariationsMNIST).

Table 1. Classification error on the basic MNIST dataset and its four variations. The best performance within each block is bolded. "ReLU" and "Quad" denote the ReLU and quadratic activation functions, respectively.

                                     basic     rand      rot       img       img+rot
  SVMrbf (Vincent et al., 2010)      3.03%     14.58%    11.11%    22.61%    55.18%
  NN-1 (Vincent et al., 2010)        4.69%     20.04%    18.11%    27.41%    62.16%
  CNN-1 (ReLU)                       3.37%     9.83%     18.84%    14.23%    45.96%
  CCNN-1                             2.35%     8.13%     13.45%    10.33%    42.90%
  TIRBM (Sohn & Lee, 2012)           -         -         4.20%     -         35.50%
  SDAE-3 (Vincent et al., 2010)      2.84%     10.30%    9.53%     16.68%    43.76%
  ScatNet-2 (Bruna & Mallat, 2013)   1.27%     12.30%    7.48%     18.40%    50.48%
  PCANet-2 (Chan et al., 2015)       1.06%     6.19%     7.37%     10.95%    35.48%
  CNN-2 (ReLU)                       2.11%     5.64%     8.27%     10.17%    32.43%
  CNN-2 (Quad)                       1.75%     5.30%     8.83%     11.60%    36.90%
  CCNN-2                             1.39%     4.64%     6.91%     7.44%     30.23%

For the CCNN method and the baseline CNN method, we train two-layer and three-layer models respectively. The models with k convolutional layers are denoted by CCNN-k and CNN-k. Each convolutional layer is constructed on 5 × 5 patches with unit stride, followed by 2 × 2 average pooling. The first and second convolutional layers contain 16 and 32 filters, respectively. The loss function is the 10-class logistic loss. We use the Gaussian kernel for the CCNN. The feature matrix Z(x) is constructed via random feature approximation (Rahimi & Recht, 2007) with dimension m = 500 for the first convolutional layer and m = 1000 for the second. Before training each CCNN layer, we preprocess the input vectors z_p(x_i) using local contrast normalization and ZCA whitening (Coates et al., 2010). The convex optimization problem is solved by projected SGD with mini-batches of size 50. Code and reproducible experiments are available on the CodaLab platform⁴.

As a baseline approach, the CNN models are activated by the ReLU function σ(t) = max{0, t} or the quadratic function σ(t) = t². We train them using mini-batch SGD. The input images are preprocessed by global contrast normalization and ZCA whitening (Srivastava et al., 2014). We compare our method against several alternative baselines. The CCNN-1 model is compared against an SVM with the Gaussian RBF kernel (SVMrbf) and a fully connected neural network with one hidden layer (NN-1). The CCNN-2 model is compared against methods that report state-of-the-art results on these datasets, including the translation-invariant RBM model (TIRBM) (Sohn & Lee, 2012), the stacked denoising auto-encoder with three hidden layers (SDAE-3) (Vincent et al., 2010), the ScatNet-2 model (Bruna & Mallat, 2013), and the PCANet-2 model (Chan et al., 2015).

⁴http://worksheets.codalab.org/worksheets/0x1468d91a878044fba86a5446f52aacde/

Table 1 shows the classification errors on the test set. The models are grouped with respect to the number of layers that they contain. For models with one convolutional layer, the errors of CNN-1 are significantly lower than those of NN-1, highlighting the benefits of local filters and parameter sharing. The CCNN-1 model outperforms CNN-1 on all datasets. For models with two or more hidden layers, the CCNN-2 model outperforms CNN-2 on all datasets and is competitive with the state of the art. In particular, it achieves the best accuracy on the rand, img, and img+rot datasets, and is comparable to the state of the art on the remaining two. Adding a third convolutional layer does not notably improve performance on these datasets.

In Section 3.4, we showed that if the activation function σ is a polynomial function, then the CCNN (which does not depend on σ) requires lower sample complexity to match the performance of the best possible CNN using σ. More precisely, if σ is a degree-ℓ polynomial, then C_σ(B) in the upper bound is controlled by O(B^ℓ). This motivates us to study the performance of low-degree polynomial activations. Table 1 shows that the CNN-2 model with a quadratic activation function achieves error rates comparable to those with a ReLU activation: CNN-2 (Quad) outperforms CNN-2 (ReLU) on the basic and rand datasets, and is only slightly worse on the rot and img datasets. Since the performance of the CCNN matches that of the best possible CNN, the good performance of the quadratic activation partly explains why the CCNN is also good.

Acknowledgements. MJW and YZ were partially supported by Office of Naval Research grant DOD ONR-N00014 and NSF grant NSF-DMS-1612948. PL and YZ were partially supported by the Microsoft Faculty Fellowship.


References

Aslan, Ozlem, Cheng, Hao, Zhang, Xinhua, and Schuurmans, Dale. Convex two-layer modeling. In Advances in Neural Information Processing Systems, pp. 2985–2993, 2013.

Aslan, Ozlem, Zhang, Xinhua, and Schuurmans, Dale. Convex deep learning via normalized kernels. In Advances in Neural Information Processing Systems, pp. 3275–3283, 2014.

Bartlett, Peter L and Mendelson, Shahar. Rademacher and Gaussian complexities: Risk bounds and structural results. The Journal of Machine Learning Research, 3:463–482, 2003.

Bengio, Yoshua, Roux, Nicolas L, Vincent, Pascal, Delalleau, Olivier, and Marcotte, Patrice. Convex neural networks. In Advances in Neural Information Processing Systems, pp. 123–130, 2005.

Blum, Avrim L and Rivest, Ronald L. Training a 3-node neural network is NP-complete. Neural Networks, 5(1):117–127, 1992.

Bottou, Leon. Online learning and stochastic approximations. On-line Learning in Neural Networks, 17(9):142, 1998.

Bruna, Joan and Mallat, Stephane. Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1872–1886, 2013.

Chan, Tsung-Han, Jia, Kui, Gao, Shenghua, Lu, Jiwen, Zeng, Zinan, and Ma, Yi. PCANet: A simple deep learning baseline for image classification? IEEE Transactions on Image Processing, 24(12):5017–5032, 2015.

Chen, Danqi and Manning, Christopher D. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), volume 1, pp. 740–750, 2014.

Coates, Adam, Lee, Honglak, and Ng, Andrew Y. An analysis of single-layer networks in unsupervised feature learning. Ann Arbor, 1001(48109):2, 2010.

Daniely, Amit, Frostig, Roy, and Singer, Yoram. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. arXiv preprint arXiv:1602.05897, 2016.

Drineas, Petros and Mahoney, Michael W. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. The Journal of Machine Learning Research, 6:2153–2175, 2005.

Duchi, John, Shalev-Shwartz, Shai, Singer, Yoram, and Chandra, Tushar. Efficient projections onto the ℓ1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, pp. 272–279. ACM, 2008.

Duchi, John, Hazan, Elad, and Singer, Yoram. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159, 2011.

Fahlman, Scott E. An empirical study of learning speed in back-propagation networks. Journal of Heuristics, 1988.

Hinton, Geoffrey, Deng, Li, Yu, Dong, Dahl, George E, Mohamed, Abdel-rahman, Jaitly, Navdeep, Senior, Andrew, Vanhoucke, Vincent, Nguyen, Patrick, Sainath, Tara N, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.

Isa, IS, Saad, Z, Omar, S, Osman, MK, Ahmad, KA, and Sakim, HA Mat. Suitable MLP network activation functions for breast cancer and thyroid disease detection. In 2010 Second International Conference on Computational Intelligence, Modelling and Simulation, pp. 39–44, 2010.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.

Lawrence, Steve, Giles, C Lee, Tsoi, Ah Chung, and Back, Andrew D. Face recognition: A convolutional neural-network approach. IEEE Transactions on Neural Networks, 8(1):98–113, 1997.

Le, Quoc, Sarlos, Tamas, and Smola, Alex. Fastfood: Approximating kernel expansions in loglinear time. In Proceedings of the International Conference on Machine Learning, 2013.

LeCun, Yann, Bottou, Leon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Livni, Roi, Shalev-Shwartz, Shai, and Shamir, Ohad. On the computational efficiency of training neural networks. In Advances in Neural Information Processing Systems, pp. 855–863, 2014.

Mairal, Julien, Koniusz, Piotr, Harchaoui, Zaid, and Schmid, Cordelia. Convolutional kernel networks. In Advances in Neural Information Processing Systems, pp. 2627–2635, 2014.

Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A, Veness, Joel, Bellemare, Marc G, Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K, Ostrovski, Georg, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Rahimi, Ali and Recht, Benjamin. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pp. 1177–1184, 2007.

Shalev-Shwartz, Shai, Shamir, Ohad, and Sridharan, Karthik. Learning kernel-based halfspaces with the 0-1 loss. SIAM Journal on Computing, 40(6):1623–1646, 2011.

Silver, David, Huang, Aja, Maddison, Chris J, Guez, Arthur, Sifre, Laurent, Van Den Driessche, George, Schrittwieser, Julian, Antonoglou, Ioannis, Panneershelvam, Veda, Lanctot, Marc, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

Sohn, Kihyuk and Lee, Honglak. Learning invariant representations with local transformations. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pp. 1311–1318, 2012.

Sopena, Josep M, Romero, Enrique, and Alquezar, Rene. Neural networks with periodic and monotonic activation functions: a comparative study in classification problems. In ICANN 99, pp. 323–328, 1999.

Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

VariationsMNIST. Variations on the MNIST digits. http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/MnistVariations, 2007.

Vincent, Pascal, Larochelle, Hugo, Lajoie, Isabelle, Bengio, Yoshua, and Manzagol, Pierre-Antoine. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. The Journal of Machine Learning Research, 11:3371–3408, 2010.

Wang, Tao, Wu, David J, Coates, Andrew, and Ng, Andrew Y. End-to-end text recognition with convolutional neural networks. In Pattern Recognition (ICPR), 2012 21st International Conference on, pp. 3304–3308. IEEE, 2012.

Xiao, Lin and Zhang, Tong. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.

Zhang, Yuchen, Lee, Jason D, and Jordan, Michael I. ℓ1-regularized neural networks are improperly learnable in polynomial time. In Proceedings of the 33rd International Conference on Machine Learning, 2016a.

Zhang, Yuchen, Liang, Percy, and Wainwright, Martin J. Convexified convolutional neural networks. CoRR, abs/1609.01000, 2016b. URL http://arxiv.org/abs/1609.01000.

Supplementary Materials

A Inverse polynomial kernel and Gaussian kernel

In this appendix, we describe the properties of the two types of kernels—the inverse polynomial kernel (14) and the Gaussian RBF kernel (15). We prove that the associated reproducing kernel Hilbert spaces (RKHSs) of these kernels contain filters of the form h : z ↦ σ(⟨w, z⟩) for particular activation functions σ.

A.1 Inverse polynomial kernel

We first verify that the function (14) is a kernel function. This holds because we can find a mapping ϕ : R^{d_1} → ℓ2(N) such that K(z, z′) = ⟨ϕ(z), ϕ(z′)⟩. We use z_i to represent the i-th coordinate of a vector z. The (k_1, . . . , k_j)-th coordinate of ϕ(z), where j ∈ N and k_1, . . . , k_j ∈ [d_1], is defined as 2^{−(j+1)/2} z_{k_1} · · · z_{k_j}. By this definition, we have

    ⟨ϕ(z), ϕ(z′)⟩ = ∑_{j=0}^∞ 2^{−(j+1)} ∑_{(k_1,...,k_j)∈[d_1]^j} z_{k_1} · · · z_{k_j} z′_{k_1} · · · z′_{k_j}.   (21)

Since ‖z‖_2 ≤ 1 and ‖z′‖_2 ≤ 1, the series on the right-hand side is absolutely convergent. The inner term on the right-hand side of equation (21) can be simplified to

    ∑_{(k_1,...,k_j)∈[d_1]^j} z_{k_1} · · · z_{k_j} z′_{k_1} · · · z′_{k_j} = (⟨z, z′⟩)^j.   (22)

Combining equations (21) and (22) and using the fact that |⟨z, z′⟩| ≤ 1, we have

    ⟨ϕ(z), ϕ(z′)⟩ = ∑_{j=0}^∞ 2^{−(j+1)} (⟨z, z′⟩)^j = 1/(2 − ⟨z, z′⟩) = K(z, z′),

which verifies that K is a kernel function and that ϕ is the associated feature map. Next, we prove that the associated RKHS contains the class of nonlinear filters. The lemma was proved by Zhang et al. [2]; we include the proof to make the paper self-contained.
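The geometric-series identity above is easy to check numerically: truncating ∑_j 2^{−(j+1)} ⟨z, z′⟩^j already matches 1/(2 − ⟨z, z′⟩) to machine precision for unit-norm z, z′ (a quick numerical check, not part of the proof).

```python
import numpy as np

rng = np.random.default_rng(8)
z, zp = rng.normal(size=5), rng.normal(size=5)
z, zp = z / np.linalg.norm(z), zp / np.linalg.norm(zp)

t = float(z @ zp)                                    # |<z, z'>| <= 1
series = sum(2.0 ** -(j + 1) * t ** j for j in range(60))
print(series, 1.0 / (2.0 - t))                       # agree to ~machine precision
```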

Lemma 1. Assume that the function σ has a polynomial expansion σ(t) = ∑_{j=0}^∞ a_j t^j. Let C_σ(λ) := √( ∑_{j=0}^∞ 2^{j+1} a_j² λ^{2j} ). If C_σ(‖w‖_2) < ∞, then the RKHS induced by the inverse polynomial kernel contains the function h : z ↦ σ(⟨w, z⟩), with Hilbert norm ‖h‖_H = C_σ(‖w‖_2).


Proof. Let ϕ be the feature map that we have defined for the inverse polynomial kernel. We define a vector w̄ ∈ ℓ2(N) as follows: the (k_1, . . . , k_j)-th coordinate of w̄, where j ∈ N and k_1, . . . , k_j ∈ [d_1], is equal to 2^{(j+1)/2} a_j w_{k_1} · · · w_{k_j}. By this definition, we have

    σ(⟨w, z⟩) = ∑_{j=0}^∞ a_j (⟨w, z⟩)^j = ∑_{j=0}^∞ a_j ∑_{(k_1,...,k_j)∈[d_1]^j} w_{k_1} · · · w_{k_j} z_{k_1} · · · z_{k_j} = ⟨w̄, ϕ(z)⟩,   (23)

where the first equation holds since σ has the polynomial expansion σ(t) = ∑_{j=0}^∞ a_j t^j, the second by expanding the inner product, and the third by the definitions of w̄ and ϕ(z). The ℓ2-norm of w̄ is equal to

    ‖w̄‖_2² = ∑_{j=0}^∞ 2^{j+1} a_j² ∑_{(k_1,...,k_j)∈[d_1]^j} w²_{k_1} w²_{k_2} · · · w²_{k_j} = ∑_{j=0}^∞ 2^{j+1} a_j² ‖w‖_2^{2j} = C_σ²(‖w‖_2) < ∞.   (24)

By the basic property of the RKHS, the Hilbert norm of h is equal to the ℓ2-norm of w̄. Combining equations (23) and (24), we conclude that h ∈ H and ‖h‖_H = ‖w̄‖_2 = C_σ(‖w‖_2). □

According to Lemma 1, it suffices to upper bound C_σ(λ) for a particular activation function σ. To make C_σ(λ) < ∞, the coefficients {a_j}_{j=0}^∞ must converge quickly to zero, meaning that the activation function must be sufficiently smooth. For polynomial functions of degree ℓ, the definition of C_σ implies that C_σ(λ) = O(λ^ℓ). For the sinusoid activation σ(t) := sin(t), we have

    C_σ(λ) = √( ∑_{j=0}^∞ (2^{2j+2} / ((2j+1)!)²) (λ²)^{2j+1} ) ≤ 2 e^{λ²}.

For the erf function and the smoothed hinge loss defined in Section 3.4, Zhang et al. [2, Proposition 1] proved that C_σ(λ) = O(e^{cλ²}) for a universal numerical constant c > 0.
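As a numerical illustration (not needed for the argument), plugging the Taylor coefficients of sin into the definition of C_σ for the inverse polynomial kernel gives values that indeed stay below the stated bound 2e^{λ²}:

```python
import math

def C_sigma_sin(lam, terms=30):
    """C_sigma(lambda) for sigma = sin under the inverse polynomial kernel:
    sqrt(sum_j 2^{2j+2} lambda^{4j+2} / ((2j+1)!)^2)."""
    total = sum(2.0 ** (2 * j + 2) * lam ** (4 * j + 2) / math.factorial(2 * j + 1) ** 2
                for j in range(terms))
    return math.sqrt(total)

for lam in (0.5, 1.0, 2.0):
    print(lam, C_sigma_sin(lam), 2.0 * math.exp(lam ** 2))   # value vs. upper bound
```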

A.2 Gaussian kernel

The Gaussian kernel also induces an RKHS that contains a particular class of nonlinear filters. The proof is similar to that of Lemma 1.

Lemma 2. Assume that the function σ has a polynomial expansion σ(t) = ∑_{j=0}^∞ a_j t^j. Let C_σ(λ) := √( ∑_{j=0}^∞ (j! e^{2γ} / (2γ)^j) a_j² λ^{2j} ). If C_σ(‖w‖_2) < ∞, then the RKHS induced by the Gaussian kernel contains the function h : z ↦ σ(⟨w, z⟩), with Hilbert norm ‖h‖_H = C_σ(‖w‖_2).

Proof. When ‖z‖_2 = ‖z′‖_2 = 1, it is well known [see, e.g., 1] that the following mapping ϕ : R^{d_1} → ℓ2(N) is a feature map for the Gaussian RBF kernel: the (k_1, . . . , k_j)-th coordinate of ϕ(z), where j ∈ N and k_1, . . . , k_j ∈ [d_1], is defined as e^{−γ} ((2γ)^j / j!)^{1/2} z_{k_1} · · · z_{k_j}. Similar to equation (23), we define a vector w̄ ∈ ℓ2(N) as follows: the (k_1, . . . , k_j)-th coordinate of w̄, where j ∈ N and k_1, . . . , k_j ∈ [d_1], is equal to e^{γ} ((2γ)^j / j!)^{−1/2} a_j w_{k_1} · · · w_{k_j}. By this definition, we have

    σ(⟨w, z⟩) = ∑_{j=0}^∞ a_j (⟨w, z⟩)^j = ∑_{j=0}^∞ a_j ∑_{(k_1,...,k_j)∈[d_1]^j} w_{k_1} · · · w_{k_j} z_{k_1} · · · z_{k_j} = ⟨w̄, ϕ(z)⟩.   (25)


The ℓ2-norm of w̄ is equal to

    ‖w̄‖_2² = ∑_{j=0}^∞ (j! e^{2γ} / (2γ)^j) a_j² ∑_{(k_1,...,k_j)∈[d_1]^j} w²_{k_1} w²_{k_2} · · · w²_{k_j} = ∑_{j=0}^∞ (j! e^{2γ} / (2γ)^j) a_j² ‖w‖_2^{2j} = C_σ²(‖w‖_2) < ∞.   (26)

Combining equations (25) and (26), we conclude that h ∈ H and ‖h‖_H = ‖w̄‖_2 = C_σ(‖w‖_2). □

Comparing Lemma 1 and Lemma 2, we find that the Gaussian kernel imposes a stronger condition on the smoothness of the activation function. For polynomial functions of degree ℓ, we still have C_σ(λ) = O(λ^ℓ). For the sinusoid activation σ(t) := sin(t), it can be verified that

    C_σ(λ) = √( e^{2γ} ∑_{j=0}^∞ (1/(2j+1)!) (λ²/(2γ))^{2j+1} ) ≤ e^{λ²/(4γ) + γ}.

However, the value of C_σ(λ) is infinite when σ is the erf function or the smoothed hinge loss, meaning that the Gaussian kernel's RKHS does not contain filters activated by these two functions.

B Convex relaxation for nonlinear activation

In this appendix, we provide a detailed derivation of the relaxation for nonlinear activation functions that we previously sketched in Section 3.2. Recall that the filter output is σ(⟨w_j, z⟩). Appendix A shows that given a sufficiently smooth activation function σ, we can find a kernel function K : R^{d_1} × R^{d_1} → R and a feature map ϕ : R^{d_1} → ℓ2(N) satisfying K(z, z′) ≡ ⟨ϕ(z), ϕ(z′)⟩, such that

    σ(⟨w_j, z⟩) ≡ ⟨w̄_j, ϕ(z)⟩.   (27)

Here w̄_j ∈ ℓ2(N) is a countable-dimensional vector and ϕ := (ϕ_1, ϕ_2, . . .) is a countable sequence of functions. Moreover, the ℓ2-norm of w̄_j is bounded as ‖w̄_j‖_2 ≤ C_σ(‖w_j‖_2) for a monotonically increasing function C_σ that depends on the kernel (see Lemma 1 and Lemma 2). As a consequence, we may use ϕ(z) as the vectorized representation of the patch z and use w̄_j as the linear transformation weights; the problem is then reduced to training a CNN with the identity activation function.

The filter is parametrized by an infinite-dimensional vector w̄_j. Our next step is to reduce the original ERM problem to a finite-dimensional one. In order to minimize the empirical risk, one only needs to be concerned with the outputs on the training data, that is, with ⟨w̄_j, ϕ(z_p(x_i))⟩ for all (i, p) ∈ [n] × [P]. Let T be the orthogonal projector onto the linear subspace spanned by the vectors {ϕ(z_p(x_i)) : (i, p) ∈ [n] × [P]}. Then we have

    ∀ (i, p) ∈ [n] × [P] :  ⟨w̄_j, ϕ(z_p(x_i))⟩ = ⟨w̄_j, Tϕ(z_p(x_i))⟩ = ⟨Tw̄_j, ϕ(z_p(x_i))⟩.

The last equation follows since the orthogonal projector T is self-adjoint. Thus, for empirical risk minimization, we can assume without loss of generality that w̄_j belongs to the linear subspace spanned by {ϕ(z_p(x_i)) : (i, p) ∈ [n] × [P]} and reparametrize it as

    w̄_j = ∑_{(i,p)∈[n]×[P]} β_{j,(i,p)} ϕ(z_p(x_i)).   (28)


Let β_j ∈ R^{nP} be a vector whose (i, p)-th coordinate is β_{j,(i,p)}. In order to estimate w̄_j, it suffices to estimate the vector β_j. By definition, the vector satisfies the relation β_j⊤ K β_j = ‖w̄_j‖_2², where K is the nP × nP kernel matrix defined in Section 3.2. As a consequence, if we can find a matrix Q such that QQ⊤ = K, then we have the norm constraint

    ‖Q⊤β_j‖_2 = √(β_j⊤ K β_j) = ‖w̄_j‖_2 ≤ C_σ(‖w_j‖_2) ≤ C_σ(B_1).   (29)

Let v(z) ∈ R^{nP} be a vector whose (i, p)-th coordinate is equal to K(z, z_p(x_i)). Then by equations (27) and (28), the filter output can be written as

    σ(⟨w_j, z⟩) ≡ ⟨w̄_j, ϕ(z)⟩ ≡ ⟨β_j, v(z)⟩.   (30)

For any patch z_p(x_i) in the training data, the vector v(z_p(x_i)) belongs to the column space of the kernel matrix K. Therefore, letting Q† represent the pseudo-inverse of the matrix Q, we have

    ∀ (i, p) ∈ [n] × [P] :  ⟨β_j, v(z_p(x_i))⟩ = β_j⊤ QQ† v(z_p(x_i)) = ⟨(Q⊤)†Q⊤β_j, v(z_p(x_i))⟩.

This means that if we replace the vector β_j on the right-hand side of equation (30) by the vector (Q⊤)†Q⊤β_j, the empirical risk does not change. Thus, for ERM we can parametrize the filters by

    h_j(z) := ⟨(Q⊤)†Q⊤β_j, v(z)⟩ = ⟨Q†v(z), Q⊤β_j⟩.   (31)

Let Z(x) be the P × nP matrix whose p-th row is equal to Q†v(z_p(x)). Similarly to the steps in equation (6), we have

    f_k(x) = ∑_{j=1}^r α_{k,j}⊤ Z(x) Q⊤β_j = tr( Z(x) ( ∑_{j=1}^r Q⊤β_j α_{k,j}⊤ ) ) = tr(Z(x) A_k),

where A_k := ∑_{j=1}^r Q⊤β_j α_{k,j}⊤. If we let A := (A_1, . . . , A_{d_2}) denote the concatenation of these matrices, then this larger matrix satisfies the constraints:

Constraint (C1): max_{j∈[r]} ‖Q⊤β_j‖_2 ≤ C_σ(B_1) and max_{(k,j)∈[d_2]×[r]} ‖α_{k,j}‖_2 ≤ B_2.

Constraint (C2): The matrix A has rank at most r.

We relax these two constraints to the nuclear norm constraint

    ‖A‖∗ ≤ C_σ(B_1) B_2 r √d_2.   (32)

By comparing constraints (8) and (32), we see that the only difference is that the term B_1 in the norm bound has been replaced by C_σ(B_1). This change is needed because we have used the kernel trick to handle nonlinear activation functions.

References

[1] I. Steinwart and A. Christmann. Support Vector Machines. Springer, New York, 2008.

[2] Y. Zhang, J. D. Lee, and M. I. Jordan. ℓ1-regularized neural networks are improperly learnable in polynomial time. In Proceedings of the 33rd International Conference on Machine Learning, 2016.
