Ivanov-Regularised Least-Squares Estimators over Large RKHSs and Their Interpolation Spaces

Stephen Page [email protected]
STOR-i
Lancaster University
Lancaster, LA1 4YF, United Kingdom

Steffen Grünewälder [email protected]

Department of Mathematics and Statistics

Lancaster University

Lancaster, LA1 4YF, United Kingdom

Editor:

Abstract

We study kernel least-squares estimation under a norm constraint. This form of regularisation is known as Ivanov regularisation and it provides far better control of the regression function than the well-established Tikhonov regularisation. In particular, the smoothness and the scale of the regression function are directly controlled by the constraint. This choice of estimator also allows us to dispose of the standard assumption that the reproducing kernel Hilbert space (RKHS) has a Mercer kernel. The Mercer assumption is restrictive as it usually requires compactness of the covariate set. Instead, we make only the minimal assumption that the RKHS is separable with a bounded and measurable kernel. We provide rates of convergence for the expected squared $L_2$ error of our estimator under the weak assumption that the variance of the response variables is bounded and the unknown regression function lies in an interpolation space between $L_2$ and the RKHS. We complement this result with a high probability bound under the stronger assumption that the response variables have subgaussian errors and that the regression function is bounded. Finally, we derive an adaptive version of the high probability bound under the assumption that the response variables are bounded. The rates we achieve are close to the optimal rates attained under the stronger Mercer kernel assumption.

Keywords: Ivanov Regularisation, RKHSs, Mercer Kernels

1. Introduction

One of the key problems to overcome in nonparametric regression is overfitting due to estimators coming from large hypothesis classes. To avoid this phenomenon it is common to ensure that both the empirical risk and some regularisation function are small when defining an estimator. There are three natural ways to achieve this goal. We can minimise the empirical risk subject to a constraint on the regularisation function, minimise the regularisation function subject to a constraint on the empirical risk, or minimise a linear combination of the two. These techniques are known as Ivanov regularisation, Morozov regularisation and Tikhonov regularisation respectively (Oneto, Ridella, and Anguita, 2016). Ivanov and Morozov regularisation can be viewed as dual problems, while Tikhonov regularisation can be viewed as the Lagrangian relaxation of either. Tikhonov regularisation has gained popularity as it provides a closed-form estimator in many situations. However, it is Ivanov regularisation which provides the greatest control over the hypothesis class and hence over the estimator it produces. For example, if the regularisation function is the norm of the space then the bound forces the estimator to lie in a ball of predefined radius inside the function space. In our case the function space is a reproducing kernel Hilbert space (RKHS). An RKHS norm measures the smoothness of a function and the norm constraint bounds the smoothness of the estimator. Besides enforcing these properties, the norm constraint also allows us to control empirical processes evaluated at the estimator by controlling the supremum of the process over a ball in the RKHS. This is vital to the proofs of the bounds in this paper. For example, see Lemmas 4 and 10.

Tikhonov regularisation has been extensively applied to RKHS regression in the context of support vector machines (SVMs) and other least-squares estimators (Steinwart and Christmann, 2008; Caponnetto and de Vito, 2007; Mendelson and Neeman, 2010; Steinwart, Hush, and Scovel, 2009). In these cases the regularisation function is some function of the norm of the RKHS. In this paper we instead use an Ivanov-regularised least-squares estimator with regularisation function equal to the norm of the RKHS. This choice of regularisation automatically gives us a bound on the norm of our estimator in the RKHS. Having a tight bound on the norm of an estimator in the RKHS is important for its analysis, so it is surprising that Ivanov-regularised least-squares estimators have not attracted more attention in the kernel setting. Although there is a relatively trivial bound on the norm of a Tikhonov-regularised least-squares estimator in an RKHS, it is often difficult to make this bound tight. For example, see the start of Chapter 7 of Steinwart and Christmann (2008). Furthermore, such estimators are often clipped so as to impose a uniform bound despite already being regularised (Steinwart and Christmann, 2008; Steinwart et al., 2009).

Using Ivanov regularisation we can avoid clipping of the estimator in Sections 3 and 4. Furthermore, this form of regularisation also allows us to dispose of the restrictive Mercer kernel assumption. This is because we control empirical processes over balls in the RKHS instead. On the other hand, the current theory of RKHS regression only applies when the RKHS $H$ has a Mercer kernel $k$ with respect to the covariate distribution $P$. This allows for a simple decomposition of $k$ and a succinct representation of $H$ as a subspace of $L_2(P)$. These descriptions are in terms of the eigenfunctions and eigenvalues of the kernel operator $T$ on $L_2(P)$. Many results have assumed a fixed rate of decay of these eigenvalues in order to produce estimators whose squared $L_2(P)$ error decreases quickly with the number of data points (Mendelson and Neeman, 2010; Steinwart et al., 2009). However, the assumptions necessary for $H$ to have a Mercer kernel are in general quite restrictive. The usual set of assumptions is that the covariate set $S$ is compact, the kernel $k$ of $H$ is continuous on $S \times S$ and the covariate distribution satisfies $\operatorname{supp} P = S$ (see Section 4.5 of Steinwart and Christmann, 2008). In particular, the assumption that the covariate set $S$ is compact is inconvenient, and there has been some research into how to relax this condition by Steinwart and Scovel (2012). By contrast, we provide results that hold under the significantly weaker assumption that the RKHS is separable with a bounded and measurable kernel.

In this paper we prove an expectation bound on the squared $L_2(P)$ error of our estimator of order $n^{-\beta/2}$ under the weak assumption that the response variables have bounded variance. Here $n$ is the number of data points and $\beta$ parameterises the interpolation space containing the regression function. The expected squared $L_2(P)$ error can be viewed as the expected squared error of our estimator at a new independent covariate with the same distribution $P$. We then move away from the average behaviour of the error towards something which tells us more about what happens in the worst case. We obtain high probability bounds of the same order under the stronger assumption that the response variables have subgaussian errors and that the regression function is bounded. Finally, we assume that the response variables are bounded to analyse an adaptive version of our estimator which does not require us to know which interpolation space contains the regression function. We use the high probability bound from our previous result to obtain a high probability bound in this setting which is of order $\log(n)^{1/2} n^{-\beta/2}$. For small $\beta > 0$ the order $n^{-\beta/2}$ is around a power of $1/2$ larger than the order of $n^{-\beta/(\beta+1)}$ of Steinwart et al. (2009) when $p = 1$. However, we do not make the restrictive assumption that $k$ is a Mercer kernel of $H$, as discussed above.

1.1 RKHSs and Their Interpolation Spaces

A Hilbert space $H$ of real-valued functions on $S$ is an RKHS if the evaluation functional $L_x : H \to \mathbb{R}$ given by $L_x h = h(x)$ is bounded for all $x \in S$. In this case $L_x \in H^*$, the dual of $H$, and the Riesz representation theorem tells us that there is some $k_x \in H$ such that $h(x) = \langle h, k_x \rangle_H$ for all $h \in H$. The kernel is then given by $k(x_1, x_2) = \langle k_{x_1}, k_{x_2} \rangle_H$ for $x_1, x_2 \in S$ and is symmetric and positive-definite.

We can define a range of spaces between $L_2(P)$ and $H$. Let $(Z, \|\cdot\|_Z)$ be a Banach space and $(V, \|\cdot\|_V)$ be a subspace of $Z$. Then the $K$-functional of $(Z, V)$ is
$$K(z, t) = \inf_{v \in V} \left( \|z - v\|_Z + t \|v\|_V \right)$$
for $z \in Z$ and $t > 0$. For $\beta \in (0, 1)$ and $1 \le q < \infty$ we can define
$$\|z\|_{\beta, q} = \left( \int_0^\infty \left( t^{-\beta} K(z, t) \right)^q t^{-1} \, dt \right)^{1/q}$$
and
$$\|z\|_{\beta, \infty} = \sup_{t > 0} \left( t^{-\beta} K(z, t) \right)$$
for $z \in Z$. The interpolation space $[Z, V]_{\beta, q}$ is defined to be the set of $z \in Z$ such that $\|z\|_{\beta, q} < \infty$ for $\beta \in (0, 1)$ and $1 \le q \le \infty$. Smaller values of $\beta$ give larger spaces. When $\beta$ is close to 1 the space is not much larger than $V$, but as $\beta$ decreases we obtain spaces which get closer to $Z$. Hence, we can define the interpolation spaces $[L_2(P), H]_{\beta, q}$. We will work with $q = \infty$, which gives the largest space of functions for a fixed $\beta \in (0, 1)$. Note that although $H$ is not a subspace of $L_2(P)$, the above definitions are still valid as there is a natural embedding of $H$ into $L_2(P)$.
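For intuition, the $K$-functional and the $\|\cdot\|_{\beta,\infty}$ norm can be approximated numerically in a toy finite-dimensional setting. The sketch below is purely illustrative and not part of the paper's construction: it takes $Z = \mathbb{R}^d$ with the Euclidean norm and $V = \mathbb{R}^d$ with a weighted norm standing in for the stronger RKHS norm, and approximates the infimum with a generic optimiser.

```python
# Illustrative numerical sketch of the K-functional and the (beta, infinity) norm
# for a toy finite-dimensional pair of spaces; all choices below are assumptions.
import numpy as np
from scipy.optimize import minimize

d = 5
rng = np.random.default_rng(0)
weights = np.linspace(1.0, 10.0, d)   # ||v||_V = ||weights * v||_2, a stronger norm than ||.||_2
z = rng.normal(size=d)

def K(z, t):
    """K(z, t) = inf_{v in V} ( ||z - v||_Z + t ||v||_V ), approximated numerically."""
    objective = lambda v: np.linalg.norm(z - v) + t * np.linalg.norm(weights * v)
    return minimize(objective, x0=np.zeros(d), method="Nelder-Mead").fun

def norm_beta_inf(z, beta, ts=np.logspace(-3, 3, 61)):
    """||z||_{beta, infinity} = sup_{t > 0} t^{-beta} K(z, t), approximated on a grid of t."""
    return max(t ** (-beta) * K(z, t) for t in ts)

print(norm_beta_inf(z, beta=0.5))
```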

1.2 Literature Review

The current literature assumes that the RKHS $H$ has a Mercer kernel $k$, as discussed above. Earlier research in this area, such as that of Caponnetto and de Vito (2007), assumes that the regression function is at least as smooth as an element of $H$. However, their paper still allows for regression functions of varying smoothness by letting the regression function be of the form $g = T^{(\beta-1)/2} h$ for $\beta \in [1, 2]$ and $h \in H$. Here $T : L_2(P) \to L_2(P)$ is the kernel operator and $P$ is the covariate distribution. By assuming that the $i$th eigenvalue of $T$ is of order $i^{-1/p}$ for $p \in (0, 1]$, the authors achieve a squared $L_2(P)$ error of order $n^{-\beta/(\beta+p)}$ with high probability using SVMs. This squared $L_2(P)$ error is shown to be of optimal order for $\beta \in (1, 2]$.

Later research focuses on the case in which the regression function is at most as smooth as an element of $H$. Often this research demands that the response variables are bounded. For example, Mendelson and Neeman (2010) assume that $g \in T^{\beta/2}(L_2(P))$ for $\beta \in (0, 1)$ to obtain a squared $L_2(P)$ error of order $n^{-\beta/(1+p)}$ with high probability using Tikhonov-regularised least-squares estimators. The authors also show that if the eigenfunctions of the kernel operator $T$ are uniformly bounded in $L_\infty(P)$ then the order can be improved to $n^{-\beta/(\beta+p)}$. Steinwart et al. (2009) relax the condition on the eigenfunctions to the condition
$$\|h\|_\infty \le C_p \|h\|_H^p \|h\|_{L_2(P)}^{1-p}$$
for all $h \in H$ and some constant $C_p > 0$. The same rate is attained using clipped Tikhonov-regularised least-squares estimators, including clipped SVMs, and is shown to be optimal. The authors assume that $g$ is in an interpolation space between $L_2(P)$ and $H$, which is slightly more general than the assumption of Mendelson and Neeman (2010). A detailed discussion about the image of $L_2(P)$ under powers of $T$ and interpolation spaces between $L_2(P)$ and $H$ is given by Steinwart and Scovel (2012).

Lately the assumption that the response variables must be bounded has been relaxed to allow for subexponential errors. However, the assumption that the regression function is bounded has been maintained. For example, Fischer and Steinwart (2017) assume that $g \in T^{\beta/2}(L_2(P))$ for $\beta \in (0, 2]$ and that $g$ is bounded. The authors also assume that $T^{\alpha/2}(L_2(P))$ is continuously embedded in $L_\infty(P)$ with respect to an appropriate norm on $T^{\alpha/2}(L_2(P))$ for some $\alpha < \beta$. This gives the same squared $L_2(P)$ error of order $n^{-\beta/(\beta+p)}$ with high probability using SVMs.

1.3 Contribution

In this paper we provide bounds on the squared $L_2(P)$ error of our Ivanov-regularised least-squares estimator when the regression function comes from an interpolation space between $L_2(P)$ and an RKHS $H$ which is separable with a bounded and measurable kernel $k$. We use the norm of the RKHS as our regularisation function. Under the weak assumption that the response variables have bounded variance we prove a bound on the expected squared $L_2(P)$ error of order $n^{-\beta/2}$. Under the stronger assumption that the response variables have subgaussian errors we show that the squared $L_2(P)$ error is of order $n^{-\beta/2}$ with high probability. Finally, we assume that the response variables are bounded. In this case we use training and validation on the data in order to select the size of the constraint on the norm of our estimator. This gives us an adaptive estimation result which does not require us to know which interpolation space contains the regression function. We obtain a squared $L_2(P)$ error of order $\log(n)^{1/2} n^{-\beta/2}$ with high probability.

2. Problem Definition and Assumptions

We now formally define our regression problem and the assumptions that we make in this paper. Assume that $(X_i, Y_i) \in S \times \mathbb{R}$ are i.i.d. for $1 \le i \le n$ where $X_i \sim P$ and
$$\mathbb{E}(Y_i \mid X_i) = g(X_i), \tag{1}$$
$$\operatorname{var}(Y_i \mid X_i) \le \sigma^2. \tag{2}$$
The conditional expectation used is that of Kolmogorov, defined using the Radon–Nikodym derivative. Here we assume that $g \in [L_2(P), H]_{\beta, \infty}$ with norm at most $B$, where $H$ is a separable RKHS and $\beta \in (0, 1)$. Finally, we assume that $H$ has kernel $k$ which is bounded and measurable with
$$\|k\|_\infty = \sup_{x \in S} k(x, x)^{1/2} < \infty.$$

Let $B_H$ be the unit ball of $H$ and $r > 0$. Our assumption that $g \in [L_2(P), H]_{\beta, \infty}$ with norm at most $B$ and $\beta \in (0, 1)$ provides us with the following result. Theorem 3.1 of Smale and Zhou (2003) shows that
$$\inf\left\{ \|h - g\|_{L_2(P)}^2 : h \in rB_H \right\} \le \frac{B^{2/(1-\beta)}}{r^{2\beta/(1-\beta)}}$$
when $H$ is dense in $L_2(P)$. This additional condition is present because these are the only interpolation spaces considered by the authors, and in fact the result holds by the same proof even when $H$ is not dense. Hence, for all $\alpha > 0$ there is some $h_{r,\alpha} \in rB_H$ such that
$$\|h_{r,\alpha} - g\|_{L_2(P)}^2 \le \frac{B^{2/(1-\beta)}}{r^{2\beta/(1-\beta)}} + \alpha. \tag{3}$$

We will estimate $h_{r,\alpha}$ for small $\alpha > 0$ by
$$\hat{h}_r = \operatorname*{arg\,min}_{f \in rB_H} \frac{1}{n} \sum_{i=1}^n (f(X_i) - Y_i)^2.$$
Note that the estimator $\hat{h}_r$ is not uniquely defined. It is a least-squares estimator with norm at most $r$ in $H$ if one exists. However, there is a unique least-squares estimator with minimal norm in $H$ which is measurable and lies in the span of the $k_{X_i}$. It is the limit of SVMs as the regularisation parameter tends to 0. We will use this least-squares estimator in the definition of $\hat{h}_r$. If this estimator has norm strictly bigger than $r$ then $\hat{h}_r$ is the SVM with regularisation parameter selected so that its norm is $r$ in $H$.
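For concreteness, one way to compute such an estimator numerically is sketched below. It follows the description above: take the minimal-norm least-squares fit if its RKHS norm is at most $r$, and otherwise locate the Tikhonov (kernel ridge) parameter at which the norm equals $r$ by bisection, using the fact that this norm is strictly decreasing in the regularisation parameter. The Gaussian kernel, tolerances and helper names are our own illustrative assumptions, not part of the paper.

```python
# Minimal sketch of the Ivanov-constrained kernel least-squares estimator.
# Assumptions: Gaussian kernel, geometric bisection over the Tikhonov parameter,
# and the standard kernel ridge coefficients alpha = (K + n*lam*I)^{-1} y.
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def ridge_coefficients(K, y, lam):
    n = K.shape[0]
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def rkhs_norm(K, alpha):
    return float(np.sqrt(alpha @ K @ alpha))

def ivanov_estimator(X, y, r, gamma=1.0, lam_min=1e-12, lam_max=1e6, iters=100):
    K = gaussian_kernel(X, X, gamma)
    alpha = np.linalg.lstsq(K, y, rcond=None)[0]   # minimal-norm least-squares fit
    if rkhs_norm(K, alpha) <= r:
        return alpha
    # The RKHS norm of the ridge solution is strictly decreasing in lam, so bisect.
    for _ in range(iters):
        lam = np.sqrt(lam_min * lam_max)
        if rkhs_norm(K, ridge_coefficients(K, y, lam)) > r:
            lam_min = lam
        else:
            lam_max = lam
    return ridge_coefficients(K, y, lam_max)       # feasible: norm at most r

# Usage on synthetic data:
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=50)
alpha_hat = ivanov_estimator(X, y, r=5.0)
```

The returned coefficients represent the function $\sum_i \alpha_i k_{X_i}$, which plays the role of $\hat{h}_r$ in this sketch.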

2.1 Measurability

Since $H$ has a measurable kernel $k$, it follows that all functions in $H$ are measurable (Lemma 4.24 of Steinwart and Christmann, 2008). Furthermore, the estimator $\hat{h}_r$ is measurable. The least-squares estimator used in the definition of $\hat{h}_r$ is measurable, as is its norm. An SVM is also measurable as long as its regularisation parameter is measurable. The norm of an SVM in $H$ is measurable in the data and strictly decreasing in its regularisation parameter. Hence, the regularisation parameter for which an SVM has norm $r$ is measurable and so is the corresponding SVM.

We will also require the measurability of certain suprema over subsets of $H$. By definition of $H$ being separable, it has a countable dense subset $H_0$. Furthermore, $H$ is continuously embedded in $L_2(P_n)$ and $L_2(P)$ because its kernel $k$ is bounded. Hence,
$$\sup_{f \in rB_H} \left| \|f\|_{L_2(P_n)}^2 - \|f\|_{L_2(P)}^2 \right| = \sup_{f \in rB_H \cap H_0} \left| \|f\|_{L_2(P_n)}^2 - \|f\|_{L_2(P)}^2 \right|.$$
This is a random variable since the right-hand side is a supremum of countably many random variables. Therefore this quantity has a well-defined expectation and we can also apply Talagrand's inequality to it.

3. Expectation Bound

In our first results section we prove an upper bound on the expectation of the squared $L_2(P)$ error of our estimator $\hat{h}_r$ under the weak assumption that the response variables have bounded variance. We recall that $g$ is the regression function. The $L_2(P)$ norm is appropriate here as, if $X_{n+1}$ is a new independent covariate with the same distribution $P$, then
$$\mathbb{E}\left( \|\hat{h}_r - g\|_{L_2(P)}^2 \right) = \mathbb{E}\left( (\hat{h}_r(X_{n+1}) - g(X_{n+1}))^2 \right),$$
the expected squared error of $\hat{h}_r$ at $X_{n+1}$. Recall that $r > 0$ is the level of the hard constraint on the norm of our estimator. By selecting $r$ of order $n^{(1-\beta)/4}$ we achieve a bound of order $n^{-\beta/2}$. Note that in the first bound of the following theorem, the second and third terms are the largest asymptotically, as we require $r \to \infty$ for our bound to tend to 0.

Theorem 1 We have
$$\mathbb{E}\left( \|\hat{h}_r - g\|_{L_2(P)}^2 \right) \le \frac{8\|k\|_\infty \sigma r}{n^{1/2}} + \frac{10 B^{2/(1-\beta)}}{r^{2\beta/(1-\beta)}} + \frac{64\|k\|_\infty^2 r^2}{n^{1/2}}.$$
Based on this bound the asymptotically optimal choice of $r$ is
$$\left( \frac{5\beta}{32(1-\beta)} \right)^{(1-\beta)/2} \|k\|_\infty^{-(1-\beta)} B n^{(1-\beta)/4},$$
which gives a bound of
$$\left( 10 \left( \frac{32(1-\beta)}{5\beta} \right)^{\beta} + 64 \left( \frac{5\beta}{32(1-\beta)} \right)^{1-\beta} \right) \|k\|_\infty^{2\beta} B^2 n^{-\beta/2} + 8 \left( \frac{5\beta}{32(1-\beta)} \right)^{(1-\beta)/2} \|k\|_\infty^{\beta} \sigma B n^{-(1+\beta)/4}.$$
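As a quick numerical illustration (not part of the paper), the bound of Theorem 1 and the asymptotically optimal $r$ can be evaluated for given problem constants; the values of $n$, $\beta$, $B$, $\|k\|_\infty$ and $\sigma$ below are arbitrary.

```python
# Evaluate the Theorem 1 bound and its asymptotically optimal r; constants are arbitrary.
def theorem1_bound(r, n, beta, B, k_inf, sigma):
    return (8 * k_inf * sigma * r / n**0.5
            + 10 * B**(2 / (1 - beta)) / r**(2 * beta / (1 - beta))
            + 64 * k_inf**2 * r**2 / n**0.5)

def optimal_r(n, beta, B, k_inf):
    return ((5 * beta / (32 * (1 - beta)))**((1 - beta) / 2)
            * k_inf**(-(1 - beta)) * B * n**((1 - beta) / 4))

n, beta, B, k_inf, sigma = 10_000, 0.5, 1.0, 1.0, 1.0
r_star = optimal_r(n, beta, B, k_inf)
print(r_star, theorem1_bound(r_star, n, beta, B, k_inf, sigma))  # decays like n^{-beta/2}
```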

We prove this result as follows. We already have an upper bound on the squared $L_2(P)$ norm of $h_{r,\alpha} - g$ from (3), so our main aim is to establish an upper bound on the expectation of the squared $L_2(P)$ norm of $\hat{h}_r - h_{r,\alpha}$. To do this we first bound the squared $L_2(P_n)$ norm of $\hat{h}_r - h_{r,\alpha}$, where $P_n$ is the empirical distribution of the $X_i$. The definition of $\hat{h}_r$ and some rearranging provide us with the following inequality.

Lemma 2 The definition of $\hat{h}_r$ shows
$$\|\hat{h}_r - h_{r,\alpha}\|_{L_2(P_n)}^2 \le \frac{4}{n} \sum_{i=1}^n (Y_i - g(X_i))(\hat{h}_r(X_i) - h_{r,\alpha}(X_i)) + 4\|h_{r,\alpha} - g\|_{L_2(P_n)}^2. \tag{4}$$

We continue by bounding the expectation of the right-hand side of (4). The second term has expectation
$$\mathbb{E}\left( \|h_{r,\alpha} - g\|_{L_2(P_n)}^2 \right) = \|h_{r,\alpha} - g\|_{L_2(P)}^2 \le \frac{B^{2/(1-\beta)}}{r^{2\beta/(1-\beta)}} + \alpha$$
by (3). The first term is more difficult to deal with, but we can use the method of Remark 6.1 of Sriperumbudur (2016). It can be bounded using an expression of the form
$$\sup_{f \in B_H} \frac{1}{n} \sum_{i=1}^n (Y_i - g(X_i)) f(X_i),$$
which is equal to
$$\left\| \frac{1}{n} \sum_{i=1}^n (Y_i - g(X_i)) k_{X_i} \right\|_H$$
by the reproducing kernel property and the Cauchy–Schwarz inequality. By writing the norm as the square root of the corresponding inner product and using the reproducing kernel property again we find that this is equal to
$$\left( \frac{1}{n^2} \sum_{i,j=1}^n (Y_i - g(X_i))(Y_j - g(X_j)) k(X_i, X_j) \right)^{1/2}.$$

We can take the expectation of this quantity inside the square root using Jensen's inequality and use the independence and variance assumptions on the data to obtain the following bound.

Lemma 3 By bounding the expectation of the right-hand side of (4) we have
$$\mathbb{E}\left( \|\hat{h}_r - h_{r,\alpha}\|_{L_2(P_n)}^2 \right) \le \frac{4\|k\|_\infty \sigma r}{n^{1/2}} + \frac{4B^{2/(1-\beta)}}{r^{2\beta/(1-\beta)}} + 4\alpha.$$
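As a side check (our own illustration, with a Gaussian kernel and random data as assumptions), the identity used above, namely that the supremum of $\frac{1}{n}\sum_i c_i f(X_i)$ over the unit ball of $H$ equals $\left\| \frac{1}{n}\sum_i c_i k_{X_i} \right\|_H = \left( \frac{1}{n^2} c^{\mathrm{T}} K c \right)^{1/2}$, can be verified numerically by evaluating the candidate maximiser.

```python
# Numerical check: sup over the unit ball of H of (1/n) sum_i c_i f(X_i) equals
# ((1/n^2) c^T K c)^{1/2}. Gaussian kernel and random data are illustrative.
import numpy as np

rng = np.random.default_rng(1)
n = 30
X = rng.normal(size=(n, 2))
c = rng.normal(size=n)                      # stands in for Y_i - g(X_i)

K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
rhs = np.sqrt(c @ K @ c) / n                # closed form for the supremum

alpha = c / np.sqrt(c @ K @ c)              # coefficients of the unit-norm maximiser f*
lhs = c @ (K @ alpha) / n                   # (1/n) sum_i c_i f*(X_i)

print(lhs, rhs)                             # agree up to floating point error
```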

The next step is to move our bound on the expectation of the squared $L_2(P_n)$ norm of $\hat{h}_r - h_{r,\alpha}$ to the expectation of the squared $L_2(P)$ norm of $\hat{h}_r - h_{r,\alpha}$. We can achieve this by bounding
$$\mathbb{E}\left( \sup_{f \in B_H} \left| \|f\|_{L_2(P_n)}^2 - \|f\|_{L_2(P)}^2 \right| \right).$$
By symmetrisation and the contraction principle for Rademacher processes it is enough to bound the expectation of
$$\sup_{f \in B_H} \left| \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(X_i) \right|,$$
which can be done in the same way as bounding the expectation of
$$\sup_{f \in B_H} \frac{1}{n} \sum_{i=1}^n (Y_i - g(X_i)) f(X_i),$$
as discussed earlier.

Lemma 4 Using Rademacher processes we can show
$$\mathbb{E}\left( \sup_{f \in B_H} \left| \|f\|_{L_2(P_n)}^2 - \|f\|_{L_2(P)}^2 \right| \right) \le \frac{8\|k\|_\infty^2}{n^{1/2}}.$$

Since $h_{r,\alpha}, \hat{h}_r \in rB_H$ we have
$$\|\hat{h}_r - h_{r,\alpha}\|_{L_2(P)}^2 \le \|\hat{h}_r - h_{r,\alpha}\|_{L_2(P_n)}^2 + 4r^2 \sup_{f \in B_H} \left| \|f\|_{L_2(P_n)}^2 - \|f\|_{L_2(P)}^2 \right|,$$
and the next result follows as a simple consequence of Lemmas 3 and 4.

Corollary 5 We have
$$\mathbb{E}\left( \|\hat{h}_r - h_{r,\alpha}\|_{L_2(P)}^2 \right) \le \frac{4\|k\|_\infty \sigma r}{n^{1/2}} + \frac{4B^{2/(1-\beta)}}{r^{2\beta/(1-\beta)}} + \frac{32\|k\|_\infty^2 r^2}{n^{1/2}} + 4\alpha.$$

All that remains to prove Theorem 1 is to combine Corollary 5 with (3) by using
$$\|\hat{h}_r - g\|_{L_2(P)}^2 \le \left( \|\hat{h}_r - h_{r,\alpha}\|_{L_2(P)} + \|h_{r,\alpha} - g\|_{L_2(P)} \right)^2 \le 2\|\hat{h}_r - h_{r,\alpha}\|_{L_2(P)}^2 + 2\|h_{r,\alpha} - g\|_{L_2(P)}^2$$
and let $\alpha \to 0$.

4. High Probability Bound

In this section we move away from the average behaviour of the squared $L_2(P)$ error of our estimator towards understanding what happens in the worst case. We do this by proving a high probability bound on the squared $L_2(P)$ error of our estimator. Our proof method follows that of the expectation bound. However, we must additionally prove concentration inequalities for all of the quantities involved. In order to achieve these concentration results we make the following further assumptions. We assume that the moment generating function of $Y_i - g(X_i)$ conditional on $X_i$ satisfies
$$\mathbb{E}\left( \exp(t(Y_i - g(X_i))) \,\middle|\, X_i \right) \le \exp(\sigma^2 t^2 / 2) \tag{5}$$
for all $t \in \mathbb{R}$ and $1 \le i \le n$. Note that the right-hand side of the inequality is the moment generating function of an $N(0, \sigma^2)$ random variable. The random variable $Y_i - g(X_i)$ is said to be $\sigma^2$-subgaussian conditional on $X_i$. This condition implies our weaker assumptions (1) and (2), so our results from Section 3 still apply. We also assume that $\|g\|_\infty \le C$. Again, we can select $r$ of order $n^{(1-\beta)/4}$ to achieve a bound of order $n^{-\beta/2}$.

Theorem 6 Let $t > 0$. With probability at least $1 - 4e^{-t}$ we have
$$\begin{aligned}
\|\hat{h}_r - g\|_{L_2(P)}^2 \le{} & \frac{8\|k\|_\infty \sigma r \left( (1 + t + 2(t^2 + t)^{1/2})^{1/2} + (2t)^{1/2} \right)}{n^{1/2}} \\
& + \frac{10 B^{2/(1-\beta)}}{r^{2\beta/(1-\beta)}} + \frac{8(C + \|k\|_\infty r)^2 (2t)^{1/2}}{n^{1/2}} + \frac{64\|k\|_\infty^2 r^2}{n^{1/2}} \\
& + 8 \left( \frac{2\|k\|_\infty^4 r^4}{n} + \frac{32\|k\|_\infty^4 r^4}{n^{3/2}} \right)^{1/2} t^{1/2} + \frac{16\|k\|_\infty^2 r^2 t}{3n}.
\end{aligned}$$
For $t \ge 1$ this can be simplified to
$$\|\hat{h}_r - g\|_{L_2(P)}^2 \le \frac{8(4C^2 + 5\|k\|_\infty \sigma r + 18\|k\|_\infty^2 r^2) t^{1/2}}{n^{1/2}} + \frac{10 B^{2/(1-\beta)}}{r^{2\beta/(1-\beta)}} + \frac{16\|k\|_\infty^2 r^2 t}{3n}.$$
Based on this bound the asymptotically optimal choice of $r$ is
$$\left( \frac{5\beta}{72(1-\beta)} \right)^{(1-\beta)/2} \|k\|_\infty^{-(1-\beta)} B t^{-(1-\beta)/4} n^{(1-\beta)/4},$$
which, ignoring factors involving only $\beta$, gives a bound of asymptotic order
$$\|k\|_\infty^{2\beta} B^2 t^{\beta/2} n^{-\beta/2}.$$

The concentration results we need are reasonably simple to show for normal random variables but are more complicated to derive in the general subgaussian case. We need the following result, which relates a quadratic form of subgaussians to that of centred normal random variables.

Lemma 7 For $1 \le i \le n$ let $\varepsilon_i$ be independent conditional on some sigma-algebra $\mathcal{G}$ and let
$$\mathbb{E}(\exp(t\varepsilon_i) \mid \mathcal{G}) \le \exp(\sigma^2 t^2 / 2)$$
for all $t \in \mathbb{R}$. Also let $\delta_i$ be independent of each other and of $\mathcal{G}$ with $\delta_i \sim N(0, \sigma^2)$. If $A$ is an $n \times n$ $\mathcal{G}$-measurable non-negative-definite matrix then
$$\mathbb{E}(\exp(t\varepsilon^{\mathrm{T}} A \varepsilon) \mid \mathcal{G}) \le \mathbb{E}(\exp(t\delta^{\mathrm{T}} A \delta) \mid \mathcal{G})$$
for all $t \ge 0$.

The above lemma can be proved by integrating the argument of the conditional moment generating function of the vector of $\varepsilon_i$ over a Gaussian kernel. Having established this relationship we can now obtain a probability bound on a quadratic form of subgaussians by using Chernoff bounding. The following is a conditional subgaussian version of the Hanson–Wright inequality.

Lemma 8 For $1 \le i \le n$ let $\varepsilon_i$ be independent conditional on some sigma-algebra $\mathcal{G}$ and let
$$\mathbb{E}(\exp(t\varepsilon_i) \mid \mathcal{G}) \le \exp(\sigma^2 t^2 / 2)$$
for all $t \in \mathbb{R}$. If $A$ is an $n \times n$ $\mathcal{G}$-measurable non-negative-definite matrix and $t \ge 0$ then
$$\varepsilon^{\mathrm{T}} A \varepsilon \le \sigma^2 \operatorname{tr}(A) + 2\sigma^2 \left( \max_{1 \le i \le n} \lambda_i \right) t + 2 \left( \sigma^4 \left( \max_{1 \le i \le n} \lambda_i \right)^2 t^2 + \sigma^4 \operatorname{tr}(A^2) t \right)^{1/2}$$
with probability at least $1 - e^{-t}$ conditional on $\mathcal{G}$. Here the $\lambda_i$ are the eigenvalues of $A$ for $1 \le i \le n$ and they are $\mathcal{G}$-measurable.
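As a sanity check of Lemma 8 (our own illustration), the tail bound can be simulated with Gaussian noise, which satisfies the conditional subgaussian assumption, and a fixed random non-negative-definite matrix; the matrix, the number of replications and the value of $t$ are arbitrary.

```python
# Monte Carlo sanity check of the Lemma 8 tail bound with Gaussian noise and a
# fixed non-negative-definite matrix A. All specific values are illustrative.
import numpy as np

rng = np.random.default_rng(2)
n, sigma, t = 20, 1.0, 2.0

W = rng.normal(size=(n, n))
A = W @ W.T / n                                # non-negative-definite
lam = np.linalg.eigvalsh(A)
lam_max, tr_A, tr_A2 = lam.max(), lam.sum(), (lam ** 2).sum()

threshold = (sigma**2 * tr_A + 2 * sigma**2 * lam_max * t
             + 2 * np.sqrt(sigma**4 * lam_max**2 * t**2 + sigma**4 * tr_A2 * t))

reps = 100_000
eps = sigma * rng.normal(size=(reps, n))
quad = np.einsum("ri,ij,rj->r", eps, A, eps)   # eps^T A eps for each replication
print((quad > threshold).mean(), "<=", np.exp(-t))
```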

We now proceed with the main part of the proof. As in Section 3, we first bound the squared $L_2(P_n)$ norm and then the squared $L_2(P)$ norm of $\hat{h}_r - h_{r,\alpha}$. Finally, we combine this with the bound on the squared $L_2(P)$ norm of $h_{r,\alpha} - g$ from (3) to achieve our goal. In order to prove the probability bound on the squared $L_2(P_n)$ norm of $\hat{h}_r - h_{r,\alpha}$ we bound the right-hand side of (4) in probability. As discussed in Section 3, the first term can be bounded by bounding
$$\left( \frac{1}{n^2} \sum_{i,j=1}^n (Y_i - g(X_i))(Y_j - g(X_j)) k(X_i, X_j) \right)^{1/2}.$$
This can be done by using our probability bound on a quadratic form of subgaussians from Lemma 8. The second term can be bounded using Hoeffding's inequality. This is because $h_{r,\alpha} \in rB_H$ implies that $\|h_{r,\alpha}\|_\infty \le \|k\|_\infty r$ and we assume that $\|g\|_\infty \le C$.

Lemma 9 Let $t > 0$. With probability at least $1 - 3e^{-t}$ we have
$$\|\hat{h}_r - h_{r,\alpha}\|_{L_2(P_n)}^2 \le \frac{4\|k\|_\infty \sigma r \left( (1 + t + 2(t^2 + t)^{1/2})^{1/2} + (2t)^{1/2} \right)}{n^{1/2}} + \frac{4B^{2/(1-\beta)}}{r^{2\beta/(1-\beta)}} + \frac{4(C + \|k\|_\infty r)^2 (2t)^{1/2}}{n^{1/2}} + 4\alpha.$$

We now move our probability bound on the squared $L_2(P_n)$ norm of $\hat{h}_r - h_{r,\alpha}$ to the squared $L_2(P)$ norm of $\hat{h}_r - h_{r,\alpha}$. We can achieve this by bounding
$$\sup_{f \in B_H} \left| \|f\|_{L_2(P_n)}^2 - \|f\|_{L_2(P)}^2 \right|$$
using Talagrand's inequality.

Lemma 10 Let $t > 0$. With probability at least $1 - e^{-t}$ we have
$$\sup_{f \in rB_H} \left| \|f\|_{L_2(P_n)}^2 - \|f\|_{L_2(P)}^2 \right| \le \frac{8\|k\|_\infty^2 r^2}{n^{1/2}} + \left( \frac{2\|k\|_\infty^4 r^4}{n} + \frac{32\|k\|_\infty^4 r^4}{n^{3/2}} \right)^{1/2} t^{1/2} + \frac{2\|k\|_\infty^2 r^2 t}{3n}.$$

The proof of Theorem 6 is then as follows. Since $h_{r,\alpha}, \hat{h}_r \in rB_H$ we have
$$\|\hat{h}_r - h_{r,\alpha}\|_{L_2(P)}^2 \le \|\hat{h}_r - h_{r,\alpha}\|_{L_2(P_n)}^2 + \sup_{f \in 2rB_H} \left| \|f\|_{L_2(P_n)}^2 - \|f\|_{L_2(P)}^2 \right|,$$
which can be bounded using Lemmas 9 and 10. Finally, we combine this bound with (3) in the same way as for Theorem 1 and let $\alpha \to 0$.

5. Adaptive Probability Bound

We now use training and validation on the data in order to select $r > 0$ without knowing which interpolation space contains the regression function. In order to do this we have to make the additional assumption that the response variables $Y_i$ are bounded in $[-M, M]$. Because of this our regression function satisfies $\|g\|_\infty \le M$, which in turn implies that $|Y_i - g(X_i)| \le 2M$. Hoeffding's lemma then implies that $Y_i - g(X_i)$ is $4M^2$-subgaussian conditional on $X_i$, as defined by (5). This means the results from Section 4 hold with $\sigma^2 = 4M^2$ and $C = M$. Similarly to Chapter 7 of Steinwart and Christmann (2008) and Steinwart et al. (2009), we define the projection $V : \mathbb{R} \to [-M, M]$. Since $|Y_1| \le M$ and
$$\|h - g\|_{L_2(P)}^2 = \mathbb{E}\left( (h(X_1) - Y_1)^2 \right) - \mathbb{E}\left( (g(X_1) - Y_1)^2 \right)$$
for $h \in H$, we find
$$\|Vh - g\|_{L_2(P)}^2 \le \|h - g\|_{L_2(P)}^2. \tag{6}$$
In particular, we find that our results for $\hat{h}_r$ in Section 4 also hold for $V\hat{h}_r$. We then have the following result based on Theorem 7.24 of Steinwart and Christmann (2008).

Theorem 11 Let $a \ge b > 0$, $I = \lceil (a/b)^2 (n^{1/2} - 1) \rceil$ and
$$R = \{ r_i = (a^2 + b^2 i)^{1/2} : 0 \le i \le I - 1 \} \cup \{ r_I = a n^{1/4} \}.$$
Also let $p \in (0, 1)$. For $n > p^{-1}$ the following result holds. Suppose that we only use the first $m = \lfloor pn \rfloor$ data points to calculate the $\hat{h}_r$ for $r \in R$. Let
$$\hat{r} = \operatorname*{arg\,min}_{r \in R} \frac{1}{n - m} \sum_{i=m+1}^n (V\hat{h}_r(X_i) - Y_i)^2,$$
with ties broken towards the smallest value of $r \in R$. Note that $\hat{r}$, $\hat{h}_{\hat{r}}$ and $V\hat{h}_{\hat{r}}$ are measurable. Then $\|V\hat{h}_{\hat{r}} - g\|_{L_2(P)}^2$ is of order at most
$$\log(n)^{1/2} n^{-\beta/2} + t^{1/2} n^{-\beta/2}$$
in $n$ and $t \ge 1$ with probability at least $1 - e^{-t}$.
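For concreteness, the selection procedure of Theorem 11 can be sketched in code, reusing the hypothetical `ivanov_estimator` and `gaussian_kernel` helpers from the earlier sketch; the grid parameters $a$ and $b$, the split proportion $p$, the bound $M$ and the kernel parameter are illustrative assumptions.

```python
# Sketch of the training/validation selection of r from Theorem 11. Reuses the
# hypothetical ivanov_estimator and gaussian_kernel helpers from the earlier sketch.
import numpy as np

def select_r(X, y, M, a=1.0, b=0.5, p=0.5, gamma=1.0):
    n = len(y)
    I = int(np.ceil((a / b) ** 2 * (np.sqrt(n) - 1)))
    R = [np.sqrt(a**2 + b**2 * i) for i in range(I)] + [a * n**0.25]
    m = int(np.floor(p * n))
    X_tr, y_tr, X_val, y_val = X[:m], y[:m], X[m:], y[m:]
    best_r, best_err = None, np.inf
    for r in R:                                # increasing grid; strict '<' breaks ties towards small r
        alpha = ivanov_estimator(X_tr, y_tr, r, gamma=gamma)
        preds = gaussian_kernel(X_val, X_tr, gamma) @ alpha
        preds = np.clip(preds, -M, M)          # the projection V onto [-M, M]
        err = np.mean((preds - y_val) ** 2)
        if err < best_err:
            best_r, best_err = r, err
    return best_r

# Usage with the synthetic data from the earlier sketch: r_hat = select_r(X, y, M=1.5)
```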

6. Discussion

In this paper we show how Ivanov regularisation can be used to produce smooth estimators which have a small squared $L_2(P)$ error. We consider the case in which the regression function lies in an interpolation space between $L_2(P)$ and an RKHS. We achieve these bounds without the standard assumption that the RKHS has a Mercer kernel, which means that our results apply even when the covariate set is not compact. In fact, our only assumption on the RKHS is that it is separable with a bounded and measurable kernel. Under the weak assumption that the response variables have bounded variance we prove an expectation bound on the squared $L_2(P)$ error of our estimator of order $n^{-\beta/2}$. Under the stronger assumption that the response variables have subgaussian errors we achieve a probability bound of the same order. Finally, we assume that the response variables are bounded and use training and validation on the data set. This allows us to select the size of the norm constraint for our Ivanov regularisation without knowing which interpolation space contains the regression function.

For small $\beta > 0$ the order $n^{-\beta/2}$ is around a power of $1/2$ larger than the order $n^{-\beta/(\beta+1)}$ of Steinwart et al. (2009) when $p = 1$. However, we do not make the restrictive assumption that $k$ is a Mercer kernel of $H$. This is because we use Ivanov regularisation instead of Tikhonov regularisation and control empirical processes over balls in the RKHS. By contrast, the current theory uses the embedding of $H$ in $L_2(P)$ from Mercer's theorem to achieve bounds on Tikhonov-regularised estimators (Mendelson and Neeman, 2010; Steinwart et al., 2009). For this reason, it seems unlikely that Tikhonov-regularised estimators such as SVMs would perform well in the absence of a Mercer kernel, although it would be interesting to investigate whether or not this is the case.

Lower bounds on estimation rates are useful for complementing upper bounds. Together they give a range in which the optimal estimation rate lies, and this can be used to assess the performance of estimators with good upper bounds. The lower bound of order $n^{-\beta/(\beta+1)}$ of Steinwart et al. (2009) when $p = 1$ certainly applies to our setting. Improving this lower bound would require constructing an element of an interpolation space between $L_2(P)$ and an RKHS which is difficult to estimate.

It is restrictive to assume that the response variables are bounded in order to achieve our adaptive estimation result, and it would be useful to be able to relax this assumption. We need the boundedness assumption to make use of the training and validation approach based on Theorem 7.24 of Steinwart and Christmann (2008). It is not obvious that this training and validation approach can be made to work without the boundedness assumption, and a different tool might be needed to remove this restriction.

Appendix A. Proof of Expectation Bound

Proof of Lemma 2 Since $h_{r,\alpha} \in rB_H$, the definition of $\hat{h}_r$ gives
$$\frac{1}{n} \sum_{i=1}^n (\hat{h}_r(X_i) - Y_i)^2 \le \frac{1}{n} \sum_{i=1}^n (h_{r,\alpha}(X_i) - Y_i)^2.$$
Expanding
$$(\hat{h}_r(X_i) - Y_i)^2 = \left( (\hat{h}_r(X_i) - h_{r,\alpha}(X_i)) + (h_{r,\alpha}(X_i) - Y_i) \right)^2,$$
substituting into the above and rearranging gives
$$\frac{1}{n} \sum_{i=1}^n (\hat{h}_r(X_i) - h_{r,\alpha}(X_i))^2 \le \frac{2}{n} \sum_{i=1}^n (Y_i - h_{r,\alpha}(X_i))(\hat{h}_r(X_i) - h_{r,\alpha}(X_i)).$$
Substituting
$$Y_i - h_{r,\alpha}(X_i) = (Y_i - g(X_i)) + (g(X_i) - h_{r,\alpha}(X_i))$$
into the above and applying the Cauchy–Schwarz inequality to the second term gives
$$\|\hat{h}_r - h_{r,\alpha}\|_{L_2(P_n)}^2 \le \frac{2}{n} \sum_{i=1}^n (Y_i - g(X_i))(\hat{h}_r(X_i) - h_{r,\alpha}(X_i)) + 2\|g - h_{r,\alpha}\|_{L_2(P_n)} \|\hat{h}_r - h_{r,\alpha}\|_{L_2(P_n)}.$$
For constants $a, b \in \mathbb{R}$ we have
$$x^2 \le a + 2bx \implies x^2 \le 2a + 4b^2$$
for $x \in \mathbb{R}$ by completing the square and rearranging. Applying this result to the above inequality proves the lemma.

Proof of Lemma 3 We have
$$\mathbb{E}\left( \|h_{r,\alpha} - g\|_{L_2(P_n)}^2 \right) = \|h_{r,\alpha} - g\|_{L_2(P)}^2 \le \frac{B^{2/(1-\beta)}}{r^{2\beta/(1-\beta)}} + \alpha$$
by (3) and
$$\mathbb{E}\left( \frac{1}{n} \sum_{i=1}^n (Y_i - g(X_i)) h_{r,\alpha}(X_i) \right) = \mathbb{E}\left( \frac{1}{n} \sum_{i=1}^n \mathbb{E}(Y_i - g(X_i) \mid X_i) h_{r,\alpha}(X_i) \right) = 0.$$
The remainder of this proof method is due to Remark 6.1 of Sriperumbudur (2016). Since $\hat{h}_r \in rB_H$ we have
$$\begin{aligned}
\frac{1}{n} \sum_{i=1}^n (Y_i - g(X_i)) \hat{h}_r(X_i) &\le \sup_{f \in rB_H} \frac{1}{n} \sum_{i=1}^n (Y_i - g(X_i)) f(X_i) \\
&= \sup_{f \in rB_H} \left\langle \frac{1}{n} \sum_{i=1}^n (Y_i - g(X_i)) k_{X_i}, f \right\rangle_H \\
&= r \left\| \frac{1}{n} \sum_{i=1}^n (Y_i - g(X_i)) k_{X_i} \right\|_H \\
&= r \left( \frac{1}{n^2} \sum_{i,j=1}^n (Y_i - g(X_i))(Y_j - g(X_j)) k(X_i, X_j) \right)^{1/2}
\end{aligned}$$
by the Cauchy–Schwarz inequality. Denote the vector of the $X_i \in S$ by $X \in S^n$. Then by Jensen's inequality and the independence of the $(X_i, Y_i)$ we have
$$\mathbb{E}\left( \frac{1}{n} \sum_{i=1}^n (Y_i - g(X_i)) \hat{h}_r(X_i) \,\middle|\, X \right) \le r \left( \frac{1}{n^2} \sum_{i,j=1}^n \operatorname{cov}(Y_i, Y_j \mid X) k(X_i, X_j) \right)^{1/2} \le r \left( \frac{\sigma^2}{n^2} \sum_{i=1}^n k(X_i, X_i) \right)^{1/2}$$
and again by Jensen's inequality we have
$$\mathbb{E}\left( \frac{1}{n} \sum_{i=1}^n (Y_i - g(X_i)) \hat{h}_r(X_i) \right) \le r \left( \frac{\sigma^2 \|k\|_\infty^2}{n} \right)^{1/2}.$$
The result follows from Lemma 2.

Proof of Lemma 4 Let the $\varepsilon_i$ be i.i.d. Rademacher random variables independent of the $X_i$. Lemma 2.3.1 of van der Vaart and Wellner (2013) shows
$$\mathbb{E} \sup_{f \in B_H} \left| \frac{1}{n} \sum_{i=1}^n f(X_i)^2 - \int f^2 \, dP \right| \le 2\, \mathbb{E} \sup_{f \in B_H} \left| \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(X_i)^2 \right|$$
by symmetrisation. Since
$$f(X_i) = \langle k_{X_i}, f \rangle_H \le \|k_{X_i}\|_H = k(X_i, X_i)^{1/2} \le \|k\|_\infty$$
for all $f \in B_H$, we find that
$$\frac{f(X_i)^2}{2\|k\|_\infty}$$
is a contraction vanishing at 0 as a function of $f(X_i)$ for all $1 \le i \le n$. By Theorem 3.2.1 of Giné and Nickl (2016) we have
$$\mathbb{E}\left( \sup_{f \in B_H} \left| \frac{1}{n} \sum_{i=1}^n \varepsilon_i \frac{f(X_i)^2}{2\|k\|_\infty} \right| \,\middle|\, X \right) \le 2\, \mathbb{E}\left( \sup_{f \in B_H} \left| \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(X_i) \right| \,\middle|\, X \right).$$
Therefore
$$\mathbb{E} \sup_{f \in B_H} \left| \frac{1}{n} \sum_{i=1}^n f(X_i)^2 - \int f^2 \, dP \right| \le 8\|k\|_\infty\, \mathbb{E}\left( \sup_{f \in B_H} \left| \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(X_i) \right| \right).$$
We now follow a similar argument to the proof of Lemma 3. We have
$$\sup_{f \in B_H} \left| \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(X_i) \right| = \sup_{f \in B_H} \left| \left\langle \frac{1}{n} \sum_{i=1}^n \varepsilon_i k_{X_i}, f \right\rangle_H \right| = \left\| \frac{1}{n} \sum_{i=1}^n \varepsilon_i k_{X_i} \right\|_H = \left( \frac{1}{n^2} \sum_{i,j=1}^n \varepsilon_i \varepsilon_j k(X_i, X_j) \right)^{1/2}.$$
By Jensen's inequality we have
$$\mathbb{E}\left( \sup_{f \in B_H} \left| \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(X_i) \right| \,\middle|\, X \right) \le \left( \frac{1}{n^2} \sum_{i,j=1}^n \operatorname{cov}(\varepsilon_i, \varepsilon_j \mid X) k(X_i, X_j) \right)^{1/2} = \left( \frac{1}{n^2} \sum_{i=1}^n k(X_i, X_i) \right)^{1/2}$$
and again by Jensen's inequality we have
$$\mathbb{E}\left( \sup_{f \in B_H} \left| \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(X_i) \right| \right) \le \left( \frac{\|k\|_\infty^2}{n} \right)^{1/2}.$$
The result follows.

Appendix B. Proof of High Probability Bound

Proof of Lemma 7 This proof method uses techniques from the proof of Lemma 9 of Abbasi-Yadkori, Pál, and Szepesvári (2011). We have
$$\mathbb{E}(\exp(t_i \varepsilon_i / \sigma) \mid \mathcal{G}) \le \exp(t_i^2 / 2)$$
for all $1 \le i \le n$ and $t_i \in \mathbb{R}$. Furthermore, the $\varepsilon_i$ are independent conditional on $\mathcal{G}$, so
$$\mathbb{E}(\exp(t^{\mathrm{T}} \varepsilon / \sigma) \mid \mathcal{G}) \le \exp(\|t\|_2^2 / 2).$$
Quintana and Rodríguez (2014) show that a measurable strictly-positive-definite matrix can be diagonalised by a measurable orthogonal matrix and a measurable diagonal matrix. The result holds for non-negative-definite matrices by adding the identity matrix before diagonalisation, so let $A$ have the $\mathcal{G}$-measurable square root $A^{1/2}$. Therefore we can replace $t$ with $sA^{1/2}u$ for $s \in \mathbb{R}$ and $u \in \mathbb{R}^n$ to get
$$\mathbb{E}(\exp(s u^{\mathrm{T}} A^{1/2} \varepsilon / \sigma) \mid \mathcal{G}) \le \exp(s^2 \|A^{1/2} u\|_2^2 / 2).$$
Integrating $u$ with respect to the distribution of $\delta$ gives
$$\mathbb{E}(\exp(s^2 \varepsilon^{\mathrm{T}} A \varepsilon / 2) \mid \mathcal{G}) \le \mathbb{E}(\exp(s^2 \delta^{\mathrm{T}} A \delta / 2) \mid \mathcal{G}).$$
The result follows.

Proof of Lemma 8 This proof method follows that of Theorem 3.1.9 of Giné and Nickl (2016). Quintana and Rodríguez (2014) show that a measurable strictly-positive-definite matrix can be diagonalised by a measurable orthogonal matrix and a measurable diagonal matrix. The result holds for non-negative-definite matrices by adding the identity matrix before diagonalisation, so let
$$A = QDQ^{\mathrm{T}},$$
where $Q$ is an $n \times n$ $\mathcal{G}$-measurable orthogonal matrix and $D$ is an $n \times n$ $\mathcal{G}$-measurable diagonal matrix with non-negative entries. Also let $\delta_i$ be independent of each other and of $\mathcal{G}$ with $\delta_i \sim N(0, \sigma^2)$. Then by Lemma 7 we have
$$\mathbb{E}(\exp(t\varepsilon^{\mathrm{T}} A \varepsilon) \mid \mathcal{G}) \le \mathbb{E}(\exp(t\delta^{\mathrm{T}} A \delta) \mid \mathcal{G}) = \mathbb{E}(\exp(t\delta^{\mathrm{T}} D \delta) \mid \mathcal{G})$$
for all $t \ge 0$. Furthermore,
$$\mathbb{E}(\exp(t\delta_i^2 / \sigma^2)) = \int_{-\infty}^{\infty} \frac{1}{(2\pi)^{1/2}} \exp(tx^2 - x^2/2) \, dx = \frac{1}{(1 - 2t)^{1/2}},$$
so
$$\mathbb{E}(\exp(t(\delta_i^2 / \sigma^2 - 1))) = \exp(-(\log(1 - 2t) + 2t)/2).$$
Hence,
$$-2(\log(1 - 2t) + 2t) \le \sum_{i=2}^{\infty} (2t)^i (2/i) \le \frac{4t^2}{1 - 2t}$$
for $0 \le t \le 1/2$. Therefore
$$\mathbb{E}(\exp(tD_{i,i}(\delta_i^2 - \sigma^2)) \mid \mathcal{G}) \le \exp\left( \frac{\sigma^4 D_{i,i}^2 t^2}{1 - 2\sigma^2 D_{i,i} t} \right)$$
for $0 \le t \le 1/(2\sigma^2 D_{i,i})$ for all $1 \le i \le n$, and
$$\mathbb{E}(\exp(t(\delta^{\mathrm{T}} D \delta - \sigma^2 \operatorname{tr}(D))) \mid \mathcal{G}) \le \exp\left( \frac{\sigma^4 \operatorname{tr}(D^2) t^2}{1 - 2\sigma^2 (\max_i D_{i,i}) t} \right)$$
for $0 \le t \le 1/(2\sigma^2 (\max_i D_{i,i}))$. By Chernoff bounding we have
$$\varepsilon^{\mathrm{T}} A \varepsilon - \sigma^2 \operatorname{tr}(A) > s$$
for $s \ge 0$ with probability at most
$$\exp\left( \frac{\sigma^4 \operatorname{tr}(A^2) t^2}{1 - 2\sigma^2 (\max_i \lambda_i) t} - ts \right)$$
conditional on $\mathcal{G}$ for $0 \le t \le 1/(2\sigma^2 (\max_i \lambda_i))$. Letting
$$t = \frac{s}{2\sigma^4 \operatorname{tr}(A^2) + 2\sigma^2 (\max_i \lambda_i) s}$$
gives the bound
$$\exp\left( -\frac{s^2}{4\sigma^4 \operatorname{tr}(A^2) + 4\sigma^2 (\max_i \lambda_i) s} \right).$$
Rearranging gives the result.
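To make the final rearrangement explicit (our addition, for the reader's convenience): setting the exponent equal to $-t$ and solving the resulting quadratic in $s$ gives
$$s^2 - 4\sigma^2 \left( \max_i \lambda_i \right) t\, s - 4\sigma^4 \operatorname{tr}(A^2) t = 0, \qquad s = 2\sigma^2 \left( \max_i \lambda_i \right) t + 2\left( \sigma^4 \left( \max_i \lambda_i \right)^2 t^2 + \sigma^4 \operatorname{tr}(A^2) t \right)^{1/2},$$
which is exactly the deviation term in the statement of Lemma 8.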

We prove Lemma 9 by proving the bound on each of the two terms on the right-hand side of (4) individually in the following two lemmas.

Lemma 12 Let $t > 0$. With probability at least $1 - 2e^{-t}$ we have
$$\frac{1}{n} \sum_{i=1}^n (Y_i - g(X_i))(\hat{h}_r(X_i) - h_{r,\alpha}(X_i)) \le \frac{\|k\|_\infty \sigma r \left( (1 + t + 2(t^2 + t)^{1/2})^{1/2} + (2t)^{1/2} \right)}{n^{1/2}}.$$

Proof From the proof of Lemma 3 we have
$$\frac{1}{n} \sum_{i=1}^n (Y_i - g(X_i)) \hat{h}_r(X_i) \le r \left( \frac{1}{n^2} \sum_{i,j=1}^n (Y_i - g(X_i))(Y_j - g(X_j)) k(X_i, X_j) \right)^{1/2}.$$
Let $K$ be the $n \times n$ matrix with $K_{i,j} = k(X_i, X_j)$ for $1 \le i, j \le n$ and let $\varepsilon$ be the vector of the $Y_i - g(X_i)$. Then
$$\frac{1}{n^2} \sum_{i,j=1}^n (Y_i - g(X_i))(Y_j - g(X_j)) k(X_i, X_j) = \varepsilon^{\mathrm{T}} (n^{-2} K) \varepsilon.$$
Furthermore, since $k$ is a measurable kernel we have that $n^{-2} K$ is measurable and non-negative-definite. Let $\lambda_i$ for $1 \le i \le n$ be the eigenvalues of $n^{-2} K$. Then
$$\max_i \lambda_i \le \operatorname{tr}(n^{-2} K) \le n^{-1} \|k\|_\infty^2$$
and
$$\operatorname{tr}((n^{-2} K)^2) = \|\lambda\|_2^2 \le \|\lambda\|_1^2 \le n^{-2} \|k\|_\infty^4.$$
Therefore by Lemma 8 we have
$$\varepsilon^{\mathrm{T}} (n^{-2} K) \varepsilon \le \|k\|_\infty^2 \sigma^2 n^{-1} (1 + t + 2(t^2 + t)^{1/2})$$
and
$$\frac{1}{n} \sum_{i=1}^n (Y_i - g(X_i)) \hat{h}_r(X_i) \le \frac{\|k\|_\infty \sigma r (1 + t + 2(t^2 + t)^{1/2})^{1/2}}{n^{1/2}}$$
with probability at least $1 - e^{-t}$. Finally,
$$-\frac{1}{n} \sum_{i=1}^n (Y_i - g(X_i)) h_{r,\alpha}(X_i)$$
is subgaussian given $X$ with parameter
$$\frac{1}{n^2} \sum_{i=1}^n \sigma^2 h_{r,\alpha}(X_i)^2 \le \frac{\|k\|_\infty^2 \sigma^2 r^2}{n}.$$
So for $t > 0$ we have
$$-\frac{1}{n} \sum_{i=1}^n (Y_i - g(X_i)) h_{r,\alpha}(X_i) \le \frac{\|k\|_\infty \sigma r (2t)^{1/2}}{n^{1/2}}$$
with probability at least $1 - e^{-t}$ by Chernoff bounding. The result follows.

Lemma 13 Let $t > 0$. With probability at least $1 - e^{-t}$ we have
$$\|h_{r,\alpha} - g\|_{L_2(P_n)}^2 \le \frac{B^{2/(1-\beta)}}{r^{2\beta/(1-\beta)}} + \frac{(C + \|k\|_\infty r)^2 (2t)^{1/2}}{n^{1/2}} + \alpha.$$

Proof We have
$$\frac{1}{n} \sum_{i=1}^n (h_{r,\alpha}(X_i) - g(X_i))^2 = \frac{1}{n} \sum_{i=1}^n \left( (h_{r,\alpha}(X_i) - g(X_i))^2 - \|h_{r,\alpha} - g\|_{L_2(P)}^2 \right) + \|h_{r,\alpha} - g\|_{L_2(P)}^2.$$
Furthermore,
$$\left| (h_{r,\alpha}(X_i) - g(X_i))^2 - \|h_{r,\alpha} - g\|_{L_2(P)}^2 \right| \le \max\left\{ (h_{r,\alpha}(X_i) - g(X_i))^2, \|h_{r,\alpha} - g\|_{L_2(P)}^2 \right\} \le (C + \|k\|_\infty r)^2,$$
since $h_{r,\alpha} \in rB_H$ implies that $\|h_{r,\alpha}\|_\infty \le \|k\|_\infty r$ and we assume that $\|g\|_\infty \le C$. Hence, by Hoeffding's inequality we have
$$\frac{1}{n} \sum_{i=1}^n \left( (h_{r,\alpha}(X_i) - g(X_i))^2 - \|h_{r,\alpha} - g\|_{L_2(P)}^2 \right) > t$$
for $t > 0$ with probability at most
$$\exp\left( -\frac{nt^2}{2(C + \|k\|_\infty r)^4} \right).$$
Therefore we have
$$\frac{1}{n} \sum_{i=1}^n \left( (h_{r,\alpha}(X_i) - g(X_i))^2 - \|h_{r,\alpha} - g\|_{L_2(P)}^2 \right) \le \frac{(C + \|k\|_\infty r)^2 (2t)^{1/2}}{n^{1/2}}$$
for $t > 0$ with probability at least $1 - e^{-t}$. The result follows from (3).

Proof of Lemma 10 Let
$$Z = \sup_{f \in rB_H} \left| \|f\|_{L_2(P_n)}^2 - \|f\|_{L_2(P)}^2 \right| = \sup_{f \in rB_H} \left| \sum_{i=1}^n n^{-1} \left( f(X_i)^2 - \|f\|_{L_2(P)}^2 \right) \right|.$$
We have
$$\mathbb{E}\left( n^{-1} \left( f(X_i)^2 - \|f\|_{L_2(P)}^2 \right) \right) = 0,$$
$$n^{-1} \left| f(X_i)^2 - \|f\|_{L_2(P)}^2 \right| \le \frac{\|k\|_\infty^2 r^2}{n},$$
$$\mathbb{E}\left( n^{-2} \left( f(X_i)^2 - \|f\|_{L_2(P)}^2 \right)^2 \right) \le \frac{\|k\|_\infty^4 r^4}{n^2}$$
for all $1 \le i \le n$. Furthermore, $rB_H$ is separable, so we can use Talagrand's inequality (Theorem A.9.1 of Steinwart and Christmann, 2008) to show
$$Z > \mathbb{E}(Z) + \left( 2t \left( \frac{\|k\|_\infty^4 r^4}{n} + \frac{2\|k\|_\infty^2 r^2\, \mathbb{E}(Z)}{n} \right) \right)^{1/2} + \frac{2t\|k\|_\infty^2 r^2}{3n}$$
for $t > 0$ with probability at most $e^{-t}$. From Lemma 4 we have
$$\mathbb{E}(Z) \le \frac{8\|k\|_\infty^2 r^2}{n^{1/2}}.$$
The result follows.

Appendix C. Proof of Adaptive Bound

Proof of Theorem 11 We follow the proof of Theorem 7.24 of Steinwart and Christmann (2008). Fix $r \in R$. From Theorem 6 and (6) we have
$$\|V\hat{h}_r - g\|_{L_2(P)}^2 \le \frac{8(4M^2 + 10\|k\|_\infty M r + 18\|k\|_\infty^2 r^2) t^{1/2}}{(pn - 1)^{1/2}} + \frac{10 B^{2/(1-\beta)}}{r^{2\beta/(1-\beta)}} + \frac{16\|k\|_\infty^2 r^2 t}{3(pn - 1)}$$
with probability at least $1 - 4e^{-t}$ for $t \ge 1$. Hence, the same inequality holds for all $r \in R$ with probability at least $1 - 4(1 + I)e^{-t}$ for $t \ge 1$. Theorem 7.2 and Example 7.3 of Steinwart and Christmann (2008) show
$$\|V\hat{h}_{\hat{r}} - g\|_{L_2(P)}^2 \le 6 \min_{r \in R} \|V\hat{h}_r - g\|_{L_2(P)}^2 + \frac{512 M^2 (t + \log(2 + I))}{(1 - p)n}$$
with probability at least $1 - e^{-t}$ for $t > 0$, which gives the upper bound
$$6 \min_{r \in R} \left( \frac{8(4M^2 + 10\|k\|_\infty M r + 18\|k\|_\infty^2 r^2) t^{1/2}}{(pn - 1)^{1/2}} + \frac{10 B^{2/(1-\beta)}}{r^{2\beta/(1-\beta)}} + \frac{16\|k\|_\infty^2 r^2 t}{3(pn - 1)} \right) + \frac{512 M^2 (t + \log(2 + I))}{(1 - p)n}$$
with probability at least $1 - (5 + 4I)e^{-t}$ for $t \ge 1$. Since $r_0 = a$, $r_I = an^{1/4}$ and $r_i^2 - r_{i-1}^2 \le b^2$ for all $1 \le i \le I$, there is some $0 \le i \le I$ such that
$$a^2 n^{(1-\beta)/2} \le r_i^2 \le a^2 n^{(1-\beta)/2} + b^2.$$
Bounding the minimum over $R$ using the value at $r_i$ shows
$$\begin{aligned}
\|V\hat{h}_{\hat{r}} - g\|_{L_2(P)}^2 \le{} & \frac{48(4M^2 + 10a\|k\|_\infty M n^{(1-\beta)/4} + 18a^2\|k\|_\infty^2 n^{(1-\beta)/2}) t^{1/2}}{(pn - 1)^{1/2}} + \frac{60 B^{2/(1-\beta)}}{a^{2\beta/(1-\beta)} n^{\beta/2}} \\
& + \frac{96 a^2 \|k\|_\infty^2 n^{(1-\beta)/2} t}{3(pn - 1)} + \frac{48(10b\|k\|_\infty M + 18b^2\|k\|_\infty^2) t^{1/2}}{(pn - 1)^{1/2}} + \frac{96 b^2 \|k\|_\infty^2 t}{3(pn - 1)} \\
& + \frac{512 M^2 (t + \log(2 + I))}{(1 - p)n}
\end{aligned}$$
for $t \ge 1$ with probability at least
$$1 - (5 + 4I)e^{-t} \ge 1 - (4(a/b)^2 n^{1/2} + 9)e^{-t},$$
since
$$I \le (a/b)^2 n^{1/2} + 1.$$
Replacing $t$ with
$$\log(4(a/b)^2 n^{1/2} + 9) + t$$
gives a bound on $\|V\hat{h}_{\hat{r}} - g\|_{L_2(P)}^2$ of order
$$\log(n)^{1/2} n^{-\beta/2} + t^{1/2} n^{-\beta/2}$$
in $n$ and $t \ge 1$ with probability at least $1 - e^{-t}$.

References

Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Proceedings of the 24th International Conference on Neural Information Processing Systems, pages 2312–2320, 2011.

Andrea Caponnetto and Ernesto de Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.

Simon Fischer and Ingo Steinwart. Sobolev norm learning rates for regularized least-squares algorithm. arXiv preprint arXiv:1702.07254, 2017.

Evarist Giné and Richard Nickl. Mathematical Foundations of Infinite-Dimensional Statistical Models. Cambridge University Press, 2016.

Shahar Mendelson and Joseph Neeman. Regularization in kernel learning. The Annals of Statistics, 38(1):526–565, 2010.

Luca Oneto, Sandro Ridella, and Davide Anguita. Tikhonov, Ivanov and Morozov regularization for support vector machine learning. Machine Learning, 103(1):103–136, 2016.

Yamilet Quintana and José M. Rodríguez. Measurable diagonalization of positive definite matrices. Journal of Approximation Theory, 185:91–97, 2014.

Steve Smale and Ding-Xuan Zhou. Estimating the approximation error in learning theory. Analysis and Applications, 1(1):17–41, 2003.

Bharath Sriperumbudur. On the optimal estimation of probability measures in weak and strong topologies. Bernoulli, 22(3):1839–1893, 2016.

Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer New York, 2008.

Ingo Steinwart and Clint Scovel. Mercer's theorem on general domains: On the interaction between measures, kernels, and RKHSs. Constructive Approximation, 35(3):363–417, 2012.

Ingo Steinwart, Don R. Hush, and Clint Scovel. Optimal rates for regularized least squares regression. In The 22nd Conference on Learning Theory, 2009.

Aad W. van der Vaart and Jon A. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer New York, 2013.
