
Journal of Machine Learning Research 1 (2001) 211-244. Submitted 5/00; Published 6/01

Sparse Bayesian Learning and the Relevance Vector Machine

Michael E. Tipping [email protected]

Microsoft Research

St George House, 1 Guildhall Street

Cambridge CB2 3NH, U.K.

Editor: Alex Smola

Abstract

This paper introduces a general Bayesian framework for obtaining sparse solutions to regression and classification tasks utilising models linear in the parameters. Although this framework is fully general, we illustrate our approach with a particular specialisation that we denote the 'relevance vector machine' (RVM), a model of identical functional form to the popular and state-of-the-art 'support vector machine' (SVM). We demonstrate that by exploiting a probabilistic Bayesian learning framework, we can derive accurate prediction models which typically utilise dramatically fewer basis functions than a comparable SVM while offering a number of additional advantages. These include the benefits of probabilistic predictions, automatic estimation of 'nuisance' parameters, and the facility to utilise arbitrary basis functions (e.g. non-'Mercer' kernels).

We detail the Bayesian framework and associated learning algorithm for the RVM, and give some illustrative examples of its application along with some comparative benchmarks. We offer some explanation for the exceptional degree of sparsity obtained, and discuss and demonstrate some of the advantageous features, and potential extensions, of Bayesian relevance learning.

1. Introduction

In supervised learning we are given a set of examples of input vectors {x_n}_{n=1}^N along with corresponding targets {t_n}_{n=1}^N, the latter of which might be real values (in regression) or class labels (classification). From this 'training' set we wish to learn a model of the dependency of the targets on the inputs with the objective of making accurate predictions of t for previously unseen values of x. In real-world data, the presence of noise (in regression) and class overlap (in classification) implies that the principal modelling challenge is to avoid 'over-fitting' of the training set.

Typically, we base our predictions upon some function y(x) defined over the input space, and 'learning' is the process of inferring (perhaps the parameters of) this function. A flexible and popular set of candidates for y(x) is that of the form:

y(x; w) = \sum_{i=1}^{M} w_i \phi_i(x) = w^T \phi(x),   (1)

where the output is a linearly-weighted sum of M, generally nonlinear and fixed, basis functions \phi(x) = (\phi_1(x), \phi_2(x), \ldots, \phi_M(x))^T. Analysis of functions of the type (1) is facilitated since the adjustable parameters (or 'weights') w = (w_1, w_2, \ldots, w_M)^T appear linearly, and the objective is to estimate 'good' values for those parameters.

In this paper, we detail a Bayesian probabilistic framework for learning in general models of the form (1). The key feature of this approach is that as well as offering good generalisation performance, the inferred predictors are exceedingly sparse in that they contain relatively few non-zero w_i parameters. The majority of parameters are automatically set to zero during the learning process, giving a procedure that is extremely effective at discerning those basis functions which are 'relevant' for making good predictions.

While the range of models of the type (1) that we can address is extremely broad, we concentrate here on a specialisation that we denote the 'relevance vector machine' (RVM), originally introduced by Tipping (2000). We consider functions of a type corresponding to those implemented by another sparse linearly-parameterised model, the support vector machine (SVM) (Boser et al., 1992; Vapnik, 1998; Schölkopf et al., 1999a). The SVM makes predictions based on the function:

y(x; w) = \sum_{i=1}^{N} w_i K(x, x_i) + w_0,   (2)

where K(x, x_i) is a kernel function, effectively defining one basis function for each example in the training set.¹ The key feature of the SVM is that, in the classification case, its target function attempts to minimise a measure of error on the training set while simultaneously maximising the 'margin' between the two classes (in the feature space implicitly defined by the kernel). This is a highly effective mechanism for avoiding over-fitting, which leads to good generalisation, and which furthermore results in a sparse model dependent only on a subset of kernel functions: those associated with training examples x_n (the "support vectors") that lie either on the margin or on the 'wrong' side of it. State-of-the-art results have been reported on many tasks where the SVM has been applied.

However, despite its success, we can identify a number of significant and practical disadvantages of the support vector learning methodology:

• Although relatively sparse, SVMs make unnecessarily liberal use of basis functions since the number of support vectors required typically grows linearly with the size of the training set. Some form of post-processing is often required to reduce computational complexity (Burges, 1996; Burges and Schölkopf, 1997).

• Predictions are not probabilistic. In regression the SVM outputs a point estimate, and in classification, a 'hard' binary decision. Ideally, we desire to estimate the conditional distribution p(t|x) in order to capture uncertainty in our prediction. In regression this may take the form of 'error-bars', but it is particularly crucial in classification where posterior probabilities of class membership are necessary to adapt to varying class priors and asymmetric misclassification costs. Posterior probability estimates have been coerced from SVMs via post-processing (Platt, 2000), although we argue that these estimates are unreliable (Appendix D.2).

1. Note that the SVM predictor is not defined explicitly in this form; rather, (2) emerges implicitly as a consequence of the use of the kernel function to define a dot-product in some notional feature space.


• It is necessary to estimate the error/margin trade-off parameter 'C' (and in regression, the insensitivity parameter 'ε' too). This generally entails a cross-validation procedure, which is wasteful both of data and computation.

• The kernel function K(x, x_i) must satisfy Mercer's condition. That is, it must be the continuous symmetric kernel of a positive integral operator.²

2. This restriction can be relaxed slightly to include conditionally positive kernels (Smola et al., 1998; Schölkopf, 2001).

The 'relevance vector machine' (RVM) is a Bayesian treatment³ of (2) which does not suffer from any of the above limitations. Specifically, we adopt a fully probabilistic framework and introduce a prior over the model weights governed by a set of hyperparameters, one associated with each weight, whose most probable values are iteratively estimated from the data. Sparsity is achieved because in practice we find that the posterior distributions of many of the weights are sharply (indeed infinitely) peaked around zero. We term those training vectors associated with the remaining non-zero weights 'relevance' vectors, in deference to the principle of automatic relevance determination which motivates the presented approach (MacKay, 1994; Neal, 1996). The most compelling feature of the RVM is that, while capable of generalisation performance comparable to an equivalent SVM, it typically utilises dramatically fewer kernel functions.

3. Note that our approach is not a Bayesian treatment of the SVM methodology per se, an area which has seen much recent interest (Sollich, 2000; Seeger, 2000; Kwok, 2000); here we treat the kernel function as simply defining a set of basis functions, rather than as a definition of a dot-product in some space.

In the next section, we introduce the Bayesian model, initially for regression, and define the procedure for obtaining hyperparameter values, and from them, the weights. The framework is then extended straightforwardly to the classification case in Section 3. In Section 4, we give some visualisable examples of application of the RVM in both scenarios, along with an illustration of some potentially powerful extensions to the basic model, before offering some benchmark comparisons with the SVM. We offer some theoretical insight into the reasons behind the observed sparsity of the technique in Section 5 before summarising in Section 6. To streamline the presentation within the main text, considerable theoretical and implementational details are reserved for the appendices.

2. Sparse Bayesian Learning for Regression

We now detail the sparse Bayesian regression model and associated inference procedures. The classification counterpart is considered in Section 3.

2.1 Model Specification

Given a data set of input-target pairs {x_n, t_n}_{n=1}^N, considering scalar-valued target functions only, we follow the standard probabilistic formulation and assume that the targets are samples from the model with additive noise:

t_n = y(x_n; w) + \epsilon_n,   (3)

where ε_n are independent samples from some noise process which is further assumed to be mean-zero Gaussian with variance σ². Thus p(t_n|x) = N(t_n | y(x_n), σ²), where the notation specifies a Gaussian distribution over t_n with mean y(x_n) and variance σ². The function y(x) is as defined in (2) for the SVM, where we identify our general basis functions with the kernel as parameterised by the training vectors: φ_i(x) ≡ K(x, x_i). Due to the assumption of independence of the t_n, the likelihood of the complete data set can be written as

p(t | w, \sigma^2) = (2\pi\sigma^2)^{-N/2} \exp\left\{ -\frac{1}{2\sigma^2} \| t - \Phi w \|^2 \right\},   (4)

where t = (t_1, \ldots, t_N)^T, w = (w_0, \ldots, w_N)^T and Φ is the N × (N+1) 'design' matrix with \Phi = [\phi(x_1), \phi(x_2), \ldots, \phi(x_N)]^T, wherein \phi(x_n) = [1, K(x_n, x_1), K(x_n, x_2), \ldots, K(x_n, x_N)]^T. For clarity, we omit to notate the implicit conditioning upon the set of input vectors {x_n} in (4) and subsequent expressions.
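To make the construction concrete, the design matrix and the likelihood (4) can be computed in a few lines. The following Python/NumPy sketch is illustrative only (it is not code from the paper; the Gaussian kernel of (30) and all function names are choices of this sketch), building Φ with a leading column of ones for the bias weight w_0 and evaluating the Gaussian log-likelihood:

import numpy as np

def gaussian_kernel(X1, X2, r=0.5):
    # K(x_m, x_n) = exp(-r^{-2} ||x_m - x_n||^2), cf. equation (30) later in the paper
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / r ** 2)

def design_matrix(X, kernel):
    # Phi is N x (N+1): a column of ones (bias term) followed by K(x_n, x_i)
    N = X.shape[0]
    return np.hstack([np.ones((N, 1)), kernel(X, X)])

def log_likelihood(t, Phi, w, sigma2):
    # log of (4): log N(t | Phi w, sigma^2 I)
    N = len(t)
    resid = t - Phi @ w
    return -0.5 * (N * np.log(2 * np.pi * sigma2) + (resid @ resid) / sigma2)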

With as many parameters in the model as training examples, we would expect maximum-likelihood estimation of w and σ² from (4) to lead to severe over-fitting. To avoid this, a common approach is to impose some additional constraint on the parameters, for example, through the addition of a 'complexity' penalty term to the likelihood or error function. This is implicitly effected by the inclusion of the 'margin' term in the SVM. Here, though, we adopt a Bayesian perspective, and 'constrain' the parameters by defining an explicit prior probability distribution over them.

We encode a preference for smoother (less complex) functions by making the popular choice of a zero-mean Gaussian prior distribution over w:

p(w | \alpha) = \prod_{i=0}^{N} N(w_i | 0, \alpha_i^{-1}),   (5)

with α a vector of N + 1 hyperparameters. Importantly, there is an individual hyperparameter associated independently with every weight, moderating the strength of the prior thereon.⁴

4. Note that although it is not a characteristic of this parameter prior in general, for the case of the RVM that we consider here, the overall implied prior over functions is data dependent due to the appearance of x_n in the basis functions K(x, x_n). This presents no practical difficulty, although we must take care in interpreting the "error-bars" implied by the model. In Appendix D.1 we consider this in further detail.

To complete the specification of this hierarchical prior, we must define hyperpriors over α, as well as over the final remaining parameter in the model, the noise variance σ². These quantities are examples of scale parameters, and suitable priors thereover are Gamma distributions (see, e.g., Berger, 1985):

p(\alpha) = \prod_{i=0}^{N} \mathrm{Gamma}(\alpha_i | a, b),

p(\beta) = \mathrm{Gamma}(\beta | c, d),

with β ≡ σ^{-2} and where

\mathrm{Gamma}(\alpha | a, b) = \Gamma(a)^{-1} b^a \alpha^{a-1} e^{-b\alpha},   (6)

with \Gamma(a) = \int_0^\infty t^{a-1} e^{-t}\, dt, the 'gamma function'. To make these priors non-informative (i.e. flat), we might fix their parameters to small values: e.g. a = b = c = d = 10^{-4}. However, by setting these parameters to zero, we obtain uniform hyperpriors (over a logarithmic scale). Since all scales are equally likely, a pleasing consequence of the use of such 'improper' hyperpriors here is that of scale-invariance: predictions are independent of linear scaling of both t and the basis function outputs so, for example, results do not depend on the unit of measurement of the targets. For completeness, the more detailed derivations offered in Appendix A will consider the case of general Gamma priors for α and β, but in the main body of the paper, all further analysis and presented results will assume uniform scale priors with a = b = c = d = 0.
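To spell out the logarithmic-scale uniformity (a small derivation added here for clarity; it is implicit rather than stated in the original text), setting a = b = 0 in (6) gives the improper density p(α_i) ∝ α_i^{-1}, and under the change of variables u = log α_i,

p(\log \alpha_i) = p(\alpha_i) \left| \frac{d\alpha_i}{d(\log \alpha_i)} \right| \propto \alpha_i^{-1} \cdot \alpha_i = \text{const},

so every order of magnitude of α_i (and likewise of β) is a priori equally probable, which is the scale-invariance property referred to above.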

This formulation of prior distributions is a type of automatic relevance determination (ARD) prior (MacKay, 1994; Neal, 1996). Using such priors in a neural network, individual hyperparameters would typically control groups of weights, namely those associated with each input dimension x (this idea has also been applied to the input variables in 'Gaussian process' models). Should the evidence from the data support such a hypothesis, using a broad prior over the hyperparameters allows the posterior probability mass to concentrate at very large values of some of these α variables, with the consequence that the posterior probability of the associated weights will be concentrated at zero, thus effectively 'switching off' the corresponding inputs, and so deeming them to be 'irrelevant'.

Here, the assignment of an individual hyperparameter to each weight, or basis function, is the key feature of the relevance vector machine, and is responsible ultimately for its sparsity properties. To introduce an additional N + 1 parameters to the model may seem counter-intuitive, since we have already conceded that we have too many parameters, but from a Bayesian perspective, if we correctly 'integrate out' all such 'nuisance' parameters (or can approximate such a procedure sufficiently accurately), then this presents no problem from a methodological perspective (see Neal, 1996, pp. 16-17). Any subsequently observed 'failure' in learning is attributable to the form, not the parameterisation, of the prior over functions.

2.2 Inference

Having defined the prior, Bayesian inference proceeds by computing, from Bayes' rule, the posterior over all unknowns given the data:

p(w, \alpha, \sigma^2 | t) = \frac{p(t | w, \alpha, \sigma^2)\, p(w, \alpha, \sigma^2)}{p(t)}.   (7)

Then, given a new test point x_*, predictions are made for the corresponding target t_*, in terms of the predictive distribution:

p(t_* | t) = \int p(t_* | w, \alpha, \sigma^2)\, p(w, \alpha, \sigma^2 | t)\, dw\, d\alpha\, d\sigma^2.   (8)

To those familiar, or even not-so-familiar, with Bayesian methods, it may come as no surprise to learn that we cannot perform these computations in full analytically, and must seek an effective approximation.

We cannot compute the posterior p(w, α, σ²|t) in (7) directly since we cannot perform the normalising integral on the right-hand-side, p(t) = \int p(t | w, \alpha, \sigma^2)\, p(w, \alpha, \sigma^2)\, dw\, d\alpha\, d\sigma^2. Instead, we decompose the posterior as:

p(w, \alpha, \sigma^2 | t) = p(w | t, \alpha, \sigma^2)\, p(\alpha, \sigma^2 | t),   (9)

and note that we can compute analytically the posterior distribution over the weights since its normalising integral, p(t | \alpha, \sigma^2) = \int p(t | w, \sigma^2)\, p(w | \alpha)\, dw, is a convolution of Gaussians. The posterior distribution over the weights is thus given by:⁵

p(w | t, \alpha, \sigma^2) = \frac{p(t | w, \sigma^2)\, p(w | \alpha)}{p(t | \alpha, \sigma^2)}   (10)

= (2\pi)^{-(N+1)/2} |\Sigma|^{-1/2} \exp\left\{ -\frac{1}{2} (w - \mu)^T \Sigma^{-1} (w - \mu) \right\},   (11)

where the posterior covariance and mean are respectively:

\Sigma = (\sigma^{-2} \Phi^T \Phi + A)^{-1},   (12)

\mu = \sigma^{-2} \Sigma \Phi^T t,   (13)

with A = diag(α_0, α_1, ..., α_N).

We are now forced to adopt some approximation, and do so by representing the second term on the right-hand-side in (9), the hyperparameter posterior p(α, σ²|t), by a delta-function at its mode⁶, i.e. at its most-probable values α_MP, σ²_MP. We do so on the basis that this point-estimate is representative of the posterior in the sense that functions generated utilising the posterior mode values are near-identical to those obtained by sampling from the full posterior distribution. It is important to realise that this does not necessitate that the entire mass of the posterior be accurately approximated by the delta-function. For predictive purposes, rather than requiring p(α, σ²|t) ≈ δ(α_MP, σ²_MP), we only desire

\int p(t_* | \alpha, \sigma^2)\, \delta(\alpha_{MP}, \sigma^2_{MP})\, d\alpha\, d\sigma^2 \approx \int p(t_* | \alpha, \sigma^2)\, p(\alpha, \sigma^2 | t)\, d\alpha\, d\sigma^2,   (14)

to be a good approximation. This notion may be visualised by a thought experiment where we consider that we are utilising two identical basis functions φ_i(x) and φ_j(x). It follows from (15) shortly that the mode of p(α, σ²|t) will not be unique, but will comprise an infinite 'ridge' where α_i^{-1} + α_j^{-1} is some constant value. No delta-function can be considered to reasonably approximate the probability mass associated with this ridge, yet any point along it implies an identical predictive distribution and so (14) holds. All the evidence from the experiments presented in this paper suggests that this predictive approximation is very effective in general.

Relevance vector 'learning' thus becomes the search for the hyperparameter posterior mode, i.e. the maximisation of p(α, σ²|t) ∝ p(t|α, σ²) p(α) p(σ²), with respect to α and σ².

5. Rather than evaluating (10) explicitly, there is a quicker way to obtain both the weight posterior (11) and the marginal likelihood (15) together. From Bayes' rule simply write p(w|t, α, σ²) p(t|α, σ²) = p(t|w, σ²) p(w|α). Then, expanding the known right-hand-side quantities, gather together all terms in w that appear within the exponential, and complete the square, introducing some new terms in t, to give by inspection the posterior Gaussian distribution p(w|t, α, σ²). Combining all the remaining terms in t then gives p(t|α, σ²) (15).

6. An alternative approach is to iteratively maximise a variational lower bound on p(t), via a factorised approximation to p(w, α, σ²|t), the joint posterior distribution over all the model parameters (Bishop and Tipping, 2000). This is a computationally intensive technique and in practice gives expected values for the hyperparameters which are identical to the point-estimates obtained by the method described here.


For the case of uniform hyperpriors (we consider the general case in Appendix A), we need only maximise the term p(t|α, σ²), which is computable and given by:

p(t | \alpha, \sigma^2) = \int p(t | w, \sigma^2)\, p(w | \alpha)\, dw

= (2\pi)^{-N/2} |\sigma^2 I + \Phi A^{-1} \Phi^T|^{-1/2} \exp\left\{ -\frac{1}{2} t^T (\sigma^2 I + \Phi A^{-1} \Phi^T)^{-1} t \right\}.   (15)

In related Bayesian models, this quantity is known as the marginal likelihood, and its maximisation known as the type-II maximum likelihood method (Berger, 1985). The marginal likelihood is also referred to as the "evidence for the hyperparameters" by MacKay (1992a), and its maximisation as the "evidence procedure".
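For readers who wish to compute (15) directly, the following NumPy sketch (an illustrative implementation rather than the author's code; the Cholesky-based evaluation is an implementation choice of this sketch) returns log p(t|α, σ²):

import numpy as np

def log_marginal_likelihood(t, Phi, alpha, sigma2):
    # log of (15): log N(t | 0, C) with C = sigma^2 I + Phi A^{-1} Phi^T
    N = len(t)
    C = sigma2 * np.eye(N) + (Phi / alpha) @ Phi.T   # Phi A^{-1} Phi^T via column scaling
    L = np.linalg.cholesky(C)                        # C = L L^T
    z = np.linalg.solve(L, t)                        # so that t^T C^{-1} t = z^T z
    log_det = 2.0 * np.log(np.diag(L)).sum()
    return -0.5 * (N * np.log(2 * np.pi) + log_det + z @ z)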

2.3 Optimising the Hyperparameters

Values of α and σ² which maximise (15) cannot be obtained in closed form, and here we summarise formulae for their iterative re-estimation. Further details concerning hyperparameter estimation, including alternative expectation-maximisation-based re-estimates, are given in Appendix A.

For α, differentiation of (15), equating to zero and rearranging, following the approach of MacKay (1992a), gives:

\alpha_i^{\mathrm{new}} = \frac{\gamma_i}{\mu_i^2},   (16)

where μ_i is the i-th posterior mean weight from (13) and we have defined the quantities γ_i by:

\gamma_i \equiv 1 - \alpha_i \Sigma_{ii},   (17)

with Σ_{ii} the i-th diagonal element of the posterior weight covariance from (12) computed with the current α and σ² values. Each γ_i ∈ [0, 1] can be interpreted as a measure of how 'well-determined' its corresponding parameter w_i is by the data (MacKay, 1992a). For α_i large, where w_i is highly constrained by the prior, Σ_{ii} ≈ α_i^{-1} and it follows that γ_i ≈ 0. Conversely, when α_i is small and w_i fits the data, γ_i ≈ 1.

For the noise variance σ², differentiation leads to the re-estimate:

(\sigma^2)^{\mathrm{new}} = \frac{\| t - \Phi \mu \|^2}{N - \sum_i \gamma_i}.   (18)

Note that the 'N' in the denominator refers to the number of data examples and not the number of basis functions.

The learning algorithm thus proceeds by repeated application of (16) and (18), concurrent with updating of the posterior statistics Σ and μ from (12) and (13), until some suitable convergence criteria have been satisfied (see Appendix A for some further implementation details). In practice, during re-estimation, we generally find that many of the α_i tend to infinity (or, in fact, become numerically indistinguishable from infinity given the machine accuracy)⁷. From (11), this implies that p(w_i|t, α, σ²) becomes highly (in principle, infinitely) peaked at zero; i.e. we are a posteriori 'certain' that those w_i are zero. The corresponding basis functions can thus be 'pruned', and sparsity is realised.

7. This is true only for the case of the uniform hyperparameter priors adopted here. The use of more general Gamma priors, detailed in the appendix, would typically lead to some α's taking on large, but finite, values, and so implying some small, but non-zero, weights. Sparsity would then be realised through thresholding of the weights.


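A minimal sketch of this procedure in Python/NumPy is given below. It is illustrative rather than the author's implementation: the initialisation, pruning threshold and convergence test are assumptions of the sketch. It alternates the posterior statistics (12)-(13) with the updates (16) and (18), pruning basis functions whose α_i become effectively infinite:

import numpy as np

def rvm_regression(Phi, t, n_iter=1000, alpha_prune=1e12, tol=1e-6):
    N, M = Phi.shape
    alpha = np.ones(M)            # one hyperparameter per basis function
    sigma2 = 0.1 * np.var(t)      # crude initialisation of the noise variance
    keep = np.arange(M)           # indices of surviving ('relevant') basis functions
    for _ in range(n_iter):
        P = Phi[:, keep]
        A = np.diag(alpha[keep])
        Sigma = np.linalg.inv(P.T @ P / sigma2 + A)              # (12)
        mu = Sigma @ P.T @ t / sigma2                            # (13)
        gamma = 1.0 - alpha[keep] * np.diag(Sigma)               # (17)
        alpha_new = gamma / np.maximum(mu ** 2, 1e-20)           # (16), guarding mu = 0
        resid = t - P @ mu
        sigma2 = (resid @ resid) / max(N - gamma.sum(), 1e-6)    # (18)
        delta = np.max(np.abs(np.log(alpha_new) - np.log(alpha[keep])))
        alpha[keep] = alpha_new
        keep = keep[alpha[keep] < alpha_prune]                   # prune 'infinite' alphas
        if delta < tol:
            break
    # recompute the posterior statistics for the final retained basis set
    P = Phi[:, keep]
    Sigma = np.linalg.inv(P.T @ P / sigma2 + np.diag(alpha[keep]))
    mu = Sigma @ P.T @ t / sigma2
    return keep, mu, Sigma, sigma2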

2.4 Making Predictions

At convergence of the hyperparameter estimation procedure, we make predictions based on the posterior distribution over the weights, conditioned on the maximising values α_MP and σ²_MP. We can then compute the predictive distribution, from (8), for a new datum x_* using (11):

p(t_* | t, \alpha_{MP}, \sigma^2_{MP}) = \int p(t_* | w, \sigma^2_{MP})\, p(w | t, \alpha_{MP}, \sigma^2_{MP})\, dw.   (19)

Since both terms in the integrand are Gaussian, this is readily computed, giving:

p(t_* | t, \alpha_{MP}, \sigma^2_{MP}) = N(t_* | y_*, \sigma_*^2),   (20)

with

y_* = \mu^T \phi(x_*),   (21)

\sigma_*^2 = \sigma^2_{MP} + \phi(x_*)^T \Sigma\, \phi(x_*).   (22)

So the predictive mean is intuitively y(x_*; μ), or the basis functions weighted by the posterior mean weights, many of which will typically be zero. The predictive variance (or 'error-bars') comprises the sum of two variance components: the estimated noise on the data and that due to the uncertainty in the prediction of the weights. In practice, then, we may thus choose to set our parameters w to fixed values μ for the purposes of point prediction, and retain Σ if required for computation of error bars (see Appendix D.1).
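In code, the predictive statistics (21) and (22) then require only the retained posterior quantities (continuing the illustrative NumPy sketch above; phi_star here denotes φ(x_*) evaluated for the retained basis functions only, which is a convention of the sketch rather than notation from the paper):

import numpy as np

def rvm_predict(phi_star, mu, Sigma, sigma2):
    # phi_star: basis functions evaluated at x_*, restricted to the retained set
    y_star = mu @ phi_star                              # (21)
    var_star = sigma2 + phi_star @ Sigma @ phi_star     # (22)
    return y_star, var_star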

3. Sparse Bayesian Classification

Relevance vector classification follows an essentially identical framework as detailed for regression in the previous section. We simply adapt the target conditional distribution (likelihood function) and the link function to account for the change in the target quantities. As a consequence, we must introduce an additional approximation step in the algorithm.

For two-class classification, it is desired to predict the posterior probability of membership of one of the classes given the input x. We follow statistical convention and generalise the linear model by applying the logistic sigmoid link function σ(y) = 1/(1 + e^{-y}) to y(x) and, adopting the Bernoulli distribution for P(t|x), we write the likelihood as:

P(t | w) = \prod_{n=1}^{N} \sigma\{y(x_n; w)\}^{t_n} \left[ 1 - \sigma\{y(x_n; w)\} \right]^{1 - t_n},   (23)

where, following from the probabilistic specification, the targets t_n ∈ {0, 1}. Note that there is no 'noise' variance here.

However, unlike the regression case, we cannot integrate out the weights analytically, and so are denied closed-form expressions for either the weight posterior p(w|t, α) or the marginal likelihood P(t|α). We thus choose to utilise the following approximation procedure, as used by MacKay (1992b), which is based on Laplace's method:


1. For the current, fixed, values of α, the 'most probable' weights w_MP are found, giving the location of the mode of the posterior distribution.

Since p(w|t, α) ∝ P(t|w) p(w|α), this is equivalent to finding the maximum, over w, of

\log\left\{ P(t|w)\, p(w|\alpha) \right\} = \sum_{n=1}^{N} \left[ t_n \log y_n + (1 - t_n) \log(1 - y_n) \right] - \frac{1}{2} w^T A w,   (24)

with y_n = σ{y(x_n; w)}. This is a standard procedure, since (24) is a penalised (regularised) logistic log-likelihood function, and necessitates iterative maximisation. Second-order Newton methods may be effectively applied, since the Hessian of (24), required next in step 2, is explicitly computed. We adapted the efficient 'iteratively-reweighted least-squares' algorithm (e.g. Nabney, 1999) to find w_MP.

2. Laplace's method is simply a quadratic approximation to the log-posterior around its mode. The quantity (24) is differentiated twice to give:

\nabla_w \nabla_w \log p(w | t, \alpha) \Big|_{w_{MP}} = -(\Phi^T B \Phi + A),   (25)

where B = diag(β_1, β_2, ..., β_N) is a diagonal matrix with β_n = σ{y(x_n; w)} [1 - σ{y(x_n; w)}]. This is then negated and inverted to give the covariance Σ for a Gaussian approximation to the posterior over weights centred at w_MP.

3. Using the statistics Σ and w_MP (in place of μ) of the Gaussian approximation, the hyperparameters α are updated using (16) in identical fashion to the regression case.

At the mode of p(w|t, α), using (25) and the fact that ∇_w log p(w|t, α)|_{w_MP} = 0, we can write:

\Sigma = (\Phi^T B \Phi + A)^{-1},   (26)

w_{MP} = \Sigma \Phi^T B t.   (27)

These equations are equivalent to the solution to a 'generalised least squares' problem (e.g. Mardia et al., 1979, p. 172). Compared with (12) and (13), it can be seen that the Laplace approximation effectively maps the classification problem to a regression one with data-dependent (heteroscedastic) noise, with the inverse noise variance for ε_n given by β_n = σ{y(x_n)} [1 - σ{y(x_n)}].
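The Laplace step can be sketched in a few lines of NumPy. The code below is illustrative only (it is a bare Newton/IRLS iteration with no step-size safeguards, and the initialisation and stopping rule are assumptions of the sketch): it maximises (24) for fixed α and returns w_MP together with the covariance (26) of the Gaussian approximation:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def laplace_mode(Phi, t, alpha, n_iter=50, tol=1e-6):
    # Newton (IRLS) maximisation of the penalised log-likelihood (24) for fixed alpha
    M = Phi.shape[1]
    A = np.diag(alpha)
    w = np.zeros(M)
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)
        g = Phi.T @ (t - y) - A @ w            # gradient of (24)
        B = np.diag(y * (1.0 - y))             # beta_n = y_n (1 - y_n)
        H = Phi.T @ B @ Phi + A                # negative Hessian, as in (25)
        step = np.linalg.solve(H, g)
        w = w + step
        if np.max(np.abs(step)) < tol:
            break
    Sigma = np.linalg.inv(H)                   # (26)
    return w, Sigma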

How accurate is the Laplace approximation? In the Bayesian treatment of multilayer neural networks, the Gaussian approximation is considered a weakness of the method, as the single mode of p(w|t, α) at w_MP can often be unrepresentative of the overall posterior mass, particularly when there are multiple such modes (as is often the case). Here in this linearly-parameterised model, we know that p(w|t, α) is log-concave (as the Hessian is negative-definite everywhere). Not only is the posterior thus unimodal, log-concavity also implies that its tails are no heavier than exp(-|w|), and so we expect much better accuracy.⁸

8. An alternative Gaussian approximation is realisable using the variational bound of Jaakkola and Jordan (1997), exploited in the variational RVM (Bishop and Tipping, 2000).


For polychotomous classification, where the number of classes K is greater than two, the likelihood (23) is generalised to the standard multinomial form:

P(t | w) = \prod_{n=1}^{N} \prod_{k=1}^{K} \sigma\{y_k(x_n; w_k)\}^{t_{nk}},   (28)

where a conventional "one-of-K" target coding for t is used and the classifier has multiple outputs y_k(x; w_k), each with its own parameter vector w_k and associated hyperparameters α_k (although the hyperparameters could be shared amongst outputs if desired). The modified Hessian is computed from this (see Nabney, 1999) and inference proceeds as shown above. There is no need to heuristically combine multiple classifiers as is the case with, for example, the SVM. However, the size of Σ scales with K, which is a highly disadvantageous consequence from a computational perspective.

4. Relevance Vector Examples

In this section we first present some visualisations of the relevance vector machine applied to simple example synthetic data sets in both regression (Section 4.1) and classification (Section 4.2), followed by another synthetic regression example to demonstrate some potential extensions of the approach (Section 4.3). We then offer some illustrative 'benchmark' results in Section 4.4.

4.1 Relevance Vector Regression: the `sinc' function

The function sinc(x) = sin(x)/x has been a popular choice to illustrate support vector regression (Vapnik et al., 1997; Vapnik, 1998), where in place of the classification margin, the ε-insensitive region is introduced, a 'tube' of ±ε around the function within which errors are not penalised. In this case, the support vectors lie on the edge of, or outside, this region. For example, using a univariate 'linear spline' kernel:

K(x_m, x_n) = 1 + x_m x_n + x_m x_n \min(x_m, x_n) - \frac{x_m + x_n}{2} \min(x_m, x_n)^2 + \frac{\min(x_m, x_n)^3}{3},   (29)

and with ε = 0.01, the approximation of sinc(x) based on 100 uniformly-spaced noise-free samples in [-10, 10] utilises 36 support vectors as shown in Figure 1 (left).
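For reference, the kernel (29) is straightforward to implement; a vectorised NumPy version (an illustrative sketch, with names chosen here) is:

import numpy as np

def linear_spline_kernel(x, y):
    # Univariate 'linear spline' kernel of equation (29)
    xm, xn = np.meshgrid(x, y, indexing='ij')
    m = np.minimum(xm, xn)
    return 1 + xm * xn + xm * xn * m - 0.5 * (xm + xn) * m ** 2 + m ** 3 / 3.0

# e.g. kernel basis columns for the sinc example: K = linear_spline_kernel(x_train, x_train)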

In the RVM, we model the same data with the same kernel (29), which is utilised to define a set of basis functions φ_n(x) = K(x, x_n), n = 1...N. Typically, we will be tackling problems where the target function has some additive noise component, the variance of which we attempt to estimate with σ². However, for the purposes of comparison with this 'function approximation' SVM example, we model the 'sinc' function with a relevance vector machine but fix the noise variance in this case at 0.01² and then re-estimate α alone. This setting of the noise standard deviation to 0.01 is intended to be analogous, in an approximate sense, to the setting of the ε-insensitivity to the same value in the SVM. Using this fixed σ, the RVM approximator is plotted in Figure 1 (right), and requires only 9 relevance vectors. The largest error is 0.0070, compared to 0.0100 in the support vector case, and we have obtained the dual benefit of both increased accuracy and sparsity.


More representative of 'real' problems, Figure 2 illustrates the case where uniform noise (i.e. not corresponding to the RVM noise model) in [-0.2, 0.2] is added to the targets. Again, a linear spline kernel was used. The trained RVM uses 6 relevance vectors, compared to 29 for the SVM. The root-mean-square (RMS) deviation from the true function for the RVM is 0.0245, while for the SVM it is 0.0291. Note that for the latter model, it was necessary to tune the parameters C and ε, in this case using 5-fold cross-validation. For the RVM, the analogues of these parameters (the α's and σ²) are automatically estimated by the learning procedure.

Figure 1: Support (left) and relevance (right) vector approximations to sinc(x) from 100 noise-free examples using 'linear spline' basis functions. The estimated functions are drawn as solid lines with support/relevance vectors shown circled.

Figure 2: Support (left) and relevance (right) vector approximations to sinc(x), based on 100 noisy samples. The estimated functions are drawn as solid lines, the true function in grey, and support/relevance vectors are again shown circled.


4.2 Relevance Vector Classification: Ripley's synthetic data

We utilise artificially-generated data in two dimensions in order to illustrate graphically the selection of relevance vectors for classification. Both class 1 and class 2 were generated from mixtures of two Gaussians by Ripley (1996), with the classes overlapping to the extent that the Bayes error is around 8%.

A relevance vector classifier is compared to its support vector counterpart, using a 'Gaussian' kernel which we define as

K(x_m, x_n) = \exp(-r^{-2} \| x_m - x_n \|^2),   (30)

with r the 'width' parameter, chosen here to be 0.5. A value of C for the SVM was selected using 5-fold cross-validation on the training set. The results for a 100-example training set (randomly chosen from Ripley's original 250) are given in Figure 3. The test error (from the associated 1000-example test set) for the RVM (9.3%) is slightly superior to the SVM (10.6%), but the remarkable feature of contrast is the complexity of the classifiers. The support vector machine utilises 38 kernel functions compared to just 4 for the relevance vector method. This considerable difference in sparsity between the two methods is typical, as the later results on benchmark data sets support.

Figure 3: SVM (left) and RVM (right) classifiers on 100 examples from Ripley's Gaussian-mixture data set. The decision boundary is shown dashed, and relevance/support vectors are shown circled to emphasise the dramatic reduction in complexity of the RVM model.

Of interest also is the fact that, unlike for the SVM, the relevance vectors are some distance from the decision boundary (in x-space), appearing more 'prototypical' or even 'anti-boundary' in character. Given further analysis, this observation can be seen to be consistent with the hyperparameter update equations given the form of the posterior induced by the Laplace approximation of Section 3. A more qualitative explanation is that the output of a basis function centred on or near the decision boundary is an unreliable indicator of class membership (i.e. its output is poorly-aligned with the data set in t-space; see Section 5.2 for an illustration of this concept), and such basis functions are naturally penalised (deemed 'irrelevant') under the Bayesian framework. Of course, there is no implication that the utilisation of either boundary-located or prototypically-located functions is 'correct' in any sense.

4.3 Extensions

Before giving some example results on benchmark data sets, we use another synthetic example to demonstrate the potential of two advantageous features of the sparse Bayesian approach: the ability to utilise arbitrary basis functions, and the facility to directly 'optimise' parameters within the kernel specification, such as those which moderate the input scales.

This latter feature is of considerable importance: in both the SVM and RVM, it is necessary to choose the type of kernel function, and also to determine appropriate values for any associated parameters, e.g. the input scale (width) parameter⁹ η = r^{-2} of the Gaussian kernel (30). In the examples of Section 4.4 which follow, η is estimated by cross-validation for both the SVM and the RVM. This is a sensible and practical approach for a single such parameter, but is inapplicable if it is desired to associate an individual η_i with each input variable. Use of such multiple input scale parameters within kernels (or other basis functions) is inherently sensible, and as will be seen, can be an effective way of dealing with irrelevant input variables.

Consider now the problem of estimating the following, quite simple, two-dimensional function

y(x_1, x_2) = \mathrm{sinc}(x_1) + 0.1 x_2,

based on 100 examples with additive Gaussian noise of standard deviation 0.1. There are two evident problems with direct application of a support or relevance vector model to this data:

• The function is linear in x_2, but this will be modelled rather unsatisfactorily by a superposition of nonlinear functions.

• The nonlinear element, sinc(x_1), is a function of x_1 alone, and so x_2 will simply add irrelevant 'noise' to the input, and thus output, of the basis functions and this will be reflected in the overall approximator.

These two features make the function difficult to learn accurately, and the function along with its SVM approximation (the RVM gives similar, only marginally superior, results here) is shown in Figure 4.

To improve upon the results shown in Figure 4 (right), we implement two modifications. We emphasised earlier that the RVM is really a specialisation of a Bayesian procedure defined for arbitrary basis sets, and as such we are free to modify the type and number of basis functions that we feel might be useful in a given problem.

9. In the Gaussian kernel, this input scale parameter is explicit, but all non-trivial kernel functions, even those which do not typically incorporate such a parameter, are sensitive to the scaling of the input variables. As such, there may be considered to be an input scale parameter η_i associated with each input dimension, even if these are all implicitly assumed to be equal to unity.


Figure 4: LEFT: the function sinc(x_1) + 0.1 x_2 along with the training data, and RIGHT: its approximation by a support vector model with a Gaussian kernel (r = 3).

Here, mirroring the general approach sometimes taken in Gaussian process regression (Rasmussen, 1996), we introduce the input variables as two extra 'functions'. This is achieved by simply appending two extra columns to the design matrix Φ containing the x_1 and x_2 values, and introducing corresponding weights and hyperparameters which are updated identically to all others. We would hope that the weight associated with x_1 would be zero (and indeed, would be pruned), while that corresponding to x_2 should be approximately 0.1. In fact, to complicate the problem further here, we also introduced three additional quadratic terms x_1², x_2² and x_1 x_2, which we hope to similarly prune.

A second modification is to directly optimise the marginal likelihood with respect to the kernel input scale parameters. For this problem we thus introduce the parameters η_1 and η_2 such that the kernel function becomes

K(x_m, x_n) = \exp\left\{ -\eta_1 (x_{m1} - x_{n1})^2 - \eta_2 (x_{m2} - x_{n2})^2 \right\}.   (31)

To estimate these parameters, at each iteration of the hyperparameter updates (16) a cycle of maximisation of the marginal likelihood (15) with respect to η_1 and η_2 was performed (using a gradient-based method). We give further implementation details in Appendix C.
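A rough sketch of how these two modifications might be coded is given below. It is illustrative only: the paper's own optimiser and schedule are described in Appendix C, whereas this sketch augments the design matrix with the extra linear and quadratic columns and then, between hyperparameter updates, re-optimises the log marginal likelihood (15) over the log input scales using scipy's generic derivative-free optimiser, a substitution for the gradient-based method used in the paper; it also assumes an alpha vector matching the augmented basis set:

import numpy as np
from scipy.optimize import minimize

def ard_gaussian_kernel(X1, X2, eta):
    # K(x_m, x_n) = exp(-sum_k eta_k (x_mk - x_nk)^2), as in (31)
    d2 = (eta * (X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2)

def augmented_design(X, eta):
    # bias + kernel columns + the extra linear and quadratic 'function' columns
    N = X.shape[0]
    x1, x2 = X[:, 0:1], X[:, 1:2]
    extras = np.hstack([x1, x2, x1 ** 2, x2 ** 2, x1 * x2])
    return np.hstack([np.ones((N, 1)), ard_gaussian_kernel(X, X, eta), extras])

def update_scales(X, t, alpha, sigma2, log_eta0):
    # one cycle of maximising (15) over the input scales, for fixed alpha and sigma^2;
    # log_marginal_likelihood is the function from the earlier sketch
    def neg_log_ml(log_eta):
        Phi = augmented_design(X, np.exp(log_eta))
        return -log_marginal_likelihood(t, Phi, alpha, sigma2)
    res = minimize(neg_log_ml, log_eta0, method='Nelder-Mead')
    return np.exp(res.x)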

Given these two modifications, the final, and much more accurate, RVM approximating function is shown in Figure 5. The error and sparsity of this modified model are compared with the SVM in Table 1, and the estimates of the 'interesting' RVM parameters are shown in Table 2.

While Figure 4 and Table 1 indicate both qualitatively and quantitatively the improvement obtained with the modified RVM, Table 2 confirms that this is a result of the model learning the 'correct' values of the newly introduced parameters. First, w_{x_2} is a good approximation of the true function's value, while all other candidate weights have been pruned. Second, η_2 ≪ η_1, such that K(x_m, x_n) ≈ exp{-η_1 (x_{m1} - x_{n1})²} and the basis functions depend approximately on input variable x_1 alone. One might fear that this procedure could 'over-fit' and set all the η_i to large values (i.e. shrink all the 'widths' in the Gaussian towards zero). We offer some schematic insight into why this does not occur in Section 5.2.


Figure 5: Approximation by a relevance vector model, with additional linear inputs and optimisation of the input scales. The noise estimate was σ = 0.101.

Model    error     # functions
SVM      0.0194     75
RVM      0.0053      8

Table 1: Root-mean-square error and number of basis functions required in approximation of the function sin(x_1)/x_1 + 0.1 x_2, using an SVM, and an RVM with additional linear input functions and optimised input scale parameters.

Parameter      estimate          Parameter       estimate
w_{x_1}        0                 η_1 × 10^4      997
w_{x_2}        0.102             η_2 × 10^4        2
w_{x_1^2}      0
w_{x_2^2}      0
w_{x_1 x_2}    0

Table 2: LEFT: RVM parameter estimates for the weights associated with the additional basis functions. RIGHT: estimates of the two scale parameters.


Finally, we underline that these additional parameters have been successfully estimated directly from the training set of 100 noisy samples, at the same time as estimating all other model parameters, including the data noise level. No cross-validation was required.


Although the above result is quite compelling, we must state some limitations to the approach:

• The interleaved two-stage training procedure leaves open the question of how exactly to combine optimisation over α and η (see Appendix C).

• The optimisation is computationally complex. While we can generally apply it to a single given data set, it was only practical to apply it to a single benchmark experiment in the following subsection.

4.4 Benchmark Comparisons

The tables which follow summarise regression and classification performance of the relevance vector machine on some example 'benchmark' data sets, comparing results for illustrative purposes with equivalent support vector machines. The number of training examples (N) and the number of input variables (d) are given for each data set, with further details regarding the data given in Appendix E. The prediction error obtained and the number of vectors (support or relevance) required, generally averaged over a number of repetitions, are given for both models. By way of summary, the RVM statistics were also normalised by those of the SVM and the overall average is displayed. A Gaussian kernel was utilised, and in all but a single case detailed shortly, its single input scale parameter was chosen by 5-fold cross-validation.

4.4.1 Regression

                                           errors             vectors
Data set                  N     d      SVM       RVM       SVM      RVM
Sinc (Gaussian noise)    100    1      0.378     0.326      45.2     6.7
Sinc (Uniform noise)     100    1      0.215     0.187      44.3     7.0
Friedman #2              240    4      4140      3505      110.3     6.9
Friedman #3              240    4      0.0202    0.0164    106.5    11.5
Boston Housing           481   13      8.04      7.46      142.8    39.0

Normalised Mean                        1.00      0.86       1.00    0.15

For the Friedman #1 data set, an additional benchmark result was obtained where the ten input scale parameters of the Gaussian kernel were optimised in the RVM: this is designated as 'η-RVM' below.

                               errors                       vectors
Data set        N    d     SVM     RVM    η-RVM        SVM      RVM    η-RVM
Friedman #1    240   10    2.92    2.80    0.27        116.6    59.4    11.5

The dramatic improvement of η-RVM is a consequence of the fact that the target function, as deliberately constructed by Friedman (1991), does not depend on input variables 6 to 10, and the influence of those distractor variables is suppressed by very low estimates for the corresponding η_k parameters. Unfortunately, computational resources limit the repetition of this extended η-optimisation procedure for the complete set of benchmark experiments presented here. Typically, however, we do observe improvements in individual regression and classification experiments, although not generally as dramatic as that shown above.

4.4.2 Classification

Some examples of classification performance are given in the table below, and further details of the data are given in Appendix E. All problems are two-class, with the exception of the 'U.S.P.S.' handwritten digit set where, for computational reasons, we mirrored the SVM strategy of training ten separate dichotomous classifiers (rather than use the multinomial likelihood).

                                      errors             vectors
Data set           N      d       SVM      RVM        SVM      RVM
Pima Diabetes      200     8     20.1%    19.6%       109        4
U.S.P.S.          7291   256      4.4%     5.1%      2540      316
Banana             400     2     10.9%    10.8%      135.2     11.4
Breast Cancer      200     9     26.9%    29.9%      116.7      6.3
Titanic            150     3     22.1%    23.0%       93.7     65.3
Waveform           400    21     10.3%    10.9%      146.4     14.6
German             700    20     22.6%    22.2%      411.2     12.5
Image             1300    18      3.0%     3.9%      166.6     34.6

Normalised Mean                   1.00     1.08       1.00     0.17

5. Perspectives on Sparsity

Both the illustrative and benchmark examples indicate that the presented Bayesian learning procedure is capable of producing highly sparse models. The purpose of this section is to offer further insight into the causes of this sparsity, and to this end we first look in more detail at the form of the weight prior distribution. Following that, we adopt a Gaussian process perspective in order to give a more graphical explanation.

5.1 The Prior over the Weights

From the Bayesian viewpoint, the relevance vector machine is sparse since most posterior probability mass is distributed over solutions with small numbers of basis functions, and the given learning algorithm finds one such solution. That sparse solutions are likely a posteriori relies of course on the prior: there must also be significant probability on sparse models a priori. Given the Gaussian specification of p(w|α), though, it does not appear that we are utilising such a prior. However, the hierarchical nature of the prior disguises its overall character, and we need to integrate out the hyperparameters to discover the true identity of the prior over the weights.

In Section 2, the marginal likelihood (15) was obtained by marginalising over the weights. Alternatively, for a Gamma prior over the hyperparameters, it is also possible to integrate out α instead, independently for each weight, to obtain the marginal, or what might be considered the 'true', weight prior:

p(w_i) = \int p(w_i | \alpha_i)\, p(\alpha_i)\, d\alpha_i

= \frac{b^a\, \Gamma(a + \frac{1}{2})}{(2\pi)^{1/2}\, \Gamma(a)} \left( b + \frac{w_i^2}{2} \right)^{-(a + \frac{1}{2})},   (32)

where Γ(·) is the gamma function as defined earlier. Equation (32) corresponds to the density of a Student-t distribution, and so the overall marginal weight prior is a product of independent Student-t distributions over the w_i. A visualisation of this Student-t prior, alongside a Gaussian, is given in Figure 6. For the case of the uniform hyperprior, with a = b = 0, we obtain the improper prior p(w_i) ∝ 1/|w_i|. Intuitively, this looks very much like a 'sparse' prior since it is sharply peaked at zero like the popular Laplace prior p(w_i) ∝ exp(-|w_i|), which has been utilised to obtain sparsity both in Bayesian contexts (Williams, 1995), and, taking the negative log, as the 'L1' regulariser \sum_i |w_i| elsewhere (e.g. Chen et al., 1995; Grandvalet, 1998; Smola et al., 1999). The implication is that although superficially we appear to be utilising a non-sparse Gaussian prior over the weights, in truth the hierarchical formulation implies that the real weight prior is one which can be clearly recognised as encouraging sparsity.
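For completeness (this intermediate step is added here; the original text quotes only the result), the marginalisation behind (32) is a standard Gamma integral:

p(w_i) = \int_0^\infty \left(\frac{\alpha_i}{2\pi}\right)^{1/2} e^{-\alpha_i w_i^2/2}\; \frac{b^a}{\Gamma(a)}\, \alpha_i^{a-1} e^{-b\alpha_i}\, d\alpha_i
= \frac{b^a}{(2\pi)^{1/2}\Gamma(a)} \int_0^\infty \alpha_i^{a-\frac{1}{2}}\, e^{-\alpha_i (b + w_i^2/2)}\, d\alpha_i
= \frac{b^a\, \Gamma(a+\frac{1}{2})}{(2\pi)^{1/2}\Gamma(a)} \left( b + \frac{w_i^2}{2} \right)^{-(a+\frac{1}{2})},

and setting a = b = 0 formally gives p(w_i) ∝ (w_i²/2)^{-1/2} ∝ 1/|w_i|, the improper prior quoted above.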

Figure 6: LEFT: an example Gaussian prior p(w|α) in two dimensions. RIGHT: the prior p(w), where the hyperparameters have been integrated out to give a product of Student-t distributions. Note that the probability mass is concentrated both at the origin and along 'spines' where one of the two weights is zero.

Unfortunately, we cannot continue the Bayesian analysis down this route to compute p(w|t), since the marginal p(w) is no longer Gaussian. However, we might pose the question: why not integrate out α explicitly and maximise over w, i.e. find the mode of p(w|t), instead of vice-versa as detailed in Section 2? This alternative approach would be equivalent to maximisation of a penalised likelihood function of the form:

\mathcal{L}(w) = -\frac{\beta}{2} \sum_{n=1}^{N} \left[ t_n - w^T \phi(x_n) \right]^2 - \sum_{i=0}^{N} \log |w_i|,   (33)

where we note that the presence of the log differentiates (33) from L1 regularisation. In fact, we must discount this alternative inference strategy since we typically find that L(w), and so p(w|t), is significantly multi-modal, often extremely so. These modes occur where the likelihood, which has the form of a Gaussian in w-space, overlaps the 'spines' (see Figure 6, right) of the prior. (We remind the reader here that the α-conditional posterior, p(w|t, α), which we maximise in step 1 of the classification case in Section 3, is log-concave and unimodal.) The implication therefore is that the marginalised weight posterior mode is highly unrepresentative of the distribution of posterior probability mass. A good illustration of this phenomenon is given by MacKay (1999) in the context of single-hyperparameter models. Conversely, as discussed in Section 2.2, all the experimental evidence for relevance vector learning suggests that α_MP is representative of the posterior p(α|t).

We finally note here that a model with a single hyperparameter governing the inverse prior variance of all weights would place less probability on sparse models than the 'relevance' prior we use here. Such a prior does not specify independence over the weights, the magnitudes of which are therefore coupled to all other weights through the common hyperparameter (so, a priori, two large weights would be more probable than one large and one small, for example).

5.2 A Gaussian Process View

We first note that relevance vector learning in regression is maximising the probability of the N-dimensional vector of target values t under the model of (15). This is a Gaussian process model (Rasmussen, 1996; MacKay, 1998; Williams, 1999): i.e. p(t) = N(t | 0, C), where we re-write C from (15) as:

C = \sigma^2 I + \sum_{i=0}^{N} \alpha_i^{-1} v_i v_i^T,   (34)

and v_i = (\phi_i(x_1), \phi_i(x_2), \ldots, \phi_i(x_N))^T is an N-vector containing the output of basis function i evaluated at all the training examples (whereas φ(x) earlier denoted the vector of all the basis functions evaluated at a single datum). This implicit, 'weightless', formulation of the Bayesian model can be adopted explicitly in other contexts to realise sparsity, for example in 'sparse kernel PCA' (Tipping, 2001). In Figure 7 we illustrate schematically an idealised two-dimensional 'projection' of the Gaussian process.

The basis functions v_i specify directions in t-space whose outer product contribution to the overall covariance is modulated by the inverse hyperparameters α_i^{-1}. Adjustment of each α_i thus changes both the size and shape of C, while adjusting σ² grows or shrinks it equally in all directions. The RVM learning procedure iteratively updates the noise level along with all the contributions of the vectors v_i in order to make the observed data set t most probable.


Figure 7: A Gaussian process view of the regression RVM, showing a two-dimensional 'projection' of the N-dimensional data set space. The data set to be modelled is marked with the cross, the weighted directions of some of the N basis vectors are shown by grey arrows emanating from the origin, the noise component of the covariance is denoted by a dashed grey circle, and the form of the overall covariance C is illustrated by the ellipse.

To see how sparsity can arise, consider, then, a trivial example with just a single basis function v_1, which is not well 'aligned' with t. In Figure 8 (left), one possible C is shown where v_1 contributes significantly. In Figure 8 (right), t is explained by the noise alone. In both cases, the normalisation term |C|^{-1/2} is the same, but based on the unit Mahalanobis covariance ellipses, the noise explanation is more probable. Intuitively, this occurs since it 'wastes' less probability mass in the direction of v_1.

Of course, in practice there will be N+1 v-vectors all competing with each other and the noise to explain the data in an N-dimensional space. The general point though still holds: if we consider increasing the contribution of a given v_i, and if the 'spread' in C orthogonal to t is greater than that in the direction of t, then the data can be better explained by increasing the noise variance σ², as this increases C identically in all directions (and so increases |C| less). Thus, at convergence of the α-optimisation, all the deleted basis functions are those for which the noise level (whether also optimised or set in advance) can better explain the data.
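This argument is easy to check numerically. The tiny NumPy example below is purely illustrative (the target vector, basis vectors, σ² and α⁻¹ are all arbitrary choices of the sketch, and it simply compares three covariances directly rather than reproducing Figure 8's equal-|C| construction): it compares log N(t|0, C) when C includes a basis function aligned with t, when it includes one orthogonal to t, and when it contains noise alone:

import numpy as np

def log_gauss_zero_mean(t, C):
    # log N(t | 0, C)
    N = len(t)
    L = np.linalg.cholesky(C)
    z = np.linalg.solve(L, t)
    return -0.5 * (N * np.log(2 * np.pi) + 2 * np.log(np.diag(L)).sum() + z @ z)

rng = np.random.default_rng(0)
N, sigma2, inv_alpha = 20, 1.0, 10.0
t = rng.normal(size=N)
v_aligned = t / np.linalg.norm(t)              # basis vector pointing along t
v_orth = rng.normal(size=N)
v_orth -= (v_orth @ v_aligned) * v_aligned     # basis vector orthogonal to t
v_orth /= np.linalg.norm(v_orth)

C_noise = sigma2 * np.eye(N)
C_aligned = C_noise + inv_alpha * np.outer(v_aligned, v_aligned)
C_orth = C_noise + inv_alpha * np.outer(v_orth, v_orth)

print(log_gauss_zero_mean(t, C_aligned),   # largest: the aligned basis function helps
      log_gauss_zero_mean(t, C_noise),     # middle: noise alone
      log_gauss_zero_mean(t, C_orth))      # smallest: a poorly-aligned basis function
# The marginal likelihood thus favours retaining well-aligned basis functions
# and pruning poorly-aligned ones in favour of the noise explanation.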

A further intuition that may be gleaned from this pictorial perspective is that we might expect v_i that lie more 'in the direction of' t to be more relevant¹⁰, and so in classification, basis functions centred near the decision boundary are unlikely to be retained. Note that this does not rule out functions located significantly on the 'wrong' side of the boundary, since v_i contributes to C in (34) as an outer product and so can also be relevant if it is well aligned with -t. In fact, an example of such a relevance vector may be seen in Figure 3 (right) earlier.

10. Unfortunately, as it would simplify learning if so, the R relevance vectors that are found do not correspond to the R most closely aligned v_i vectors.


Figure 8: Two potential explanations of a data set t. Left: using a basis function in conjunction with noise. Right: using noise alone. Covariance ellipses depicting unit Mahalanobis distances are shown. For both Gaussians, |C| is the same.


Finally here, we may glean some understanding of why it may be profitable to optimise the marginal likelihood with respect to input scale parameters as was outlined in Section 4.3. Figure 9 provides an approximate representation (again projected to two dimensions) of the marginal likelihood model which results from using a Gaussian kernel. For a very narrow kernel width, it can be seen from (34) that the Gaussian process covariance becomes diagonal. At the other extreme, a very large width implies that the data set is modelled only by the (isotropic) noise. It follows that the data set can become more probable at some intermediate width. While it is not necessarily the case that this is the 'optimal' width in terms of model accuracy, it can at least be seen why the marginal likelihood is not optimised by inappropriately shrinking the width of the kernel to zero, as would occur if maximising the conventional likelihood.

Figure 9: Effect of the width of a Gaussian kernel on the marginal probability of a data set t. Covariance ellipses depicting lines of equal probability are shown, with the intermediate Gaussian width (centre) giving the greatest likelihood for t.


6. Discussion

In this paper we have sought to provide the technical detail and experimental justi�cationfor a Bayesian approach to sparse learning in linearly-parameterised models, and we haveendeavoured to o�er some understanding of the mechanism through which sparsity is re-alised. Although we have concentrated on comparison with the popular `support vectormachine', we would underline once more that the presented method is a general one.

We do not seek to state that this approach is definitively superior to any other; rather that it is a regression and classification tool worthy of serious consideration and one which offers some quite compelling and advantageous features, some of which we summarise here:

• Generalisation is typically very good. While it was not the intention to present a comprehensive comparison with other methods, it was demonstrated in Section 4.4 that results comparable with the state-of-the-art can be obtained.

• 'Learned' models are typically highly sparse. Indeed, for the visualisable examples given in Section 4, the models appear almost optimally compact.

• In classification, the model gives estimates of the posterior probability of class membership. This is a highly important, but often overlooked, feature of any practical pattern recognition system.

• There are no 'nuisance' parameters to validate, in that the type-II maximum likelihood procedure automatically sets the 'regularisation' parameters, while the noise variance can be similarly estimated in regression.

• There is no constraint over the number or type of basis functions that may be used, although we note below the computational implications of using a large number thereof.

• We can optimise 'global' parameters, such as those which moderate the input variable scales. This is a very powerful feature, as it is impossible to set such scale parameters by cross-validation. On a single benchmark task and on isolated problems, we have found that this can be highly effective, though for computational reasons, we cannot yet present a more extended experimental assessment of the technique.

Another potential advantage of a Bayesian approach is that the fully marginalised probability of a given model (basis set) $\mathcal{M}$, $p(t|\mathcal{M}) = \int p(t|\alpha, \sigma^2, \mathcal{M})\,p(\alpha)\,p(\sigma^2)\,d\alpha\,d\sigma^2$, may be considered a measure of the merit of the model (e.g. it could provide a criterion for selection of the kernel). However, we have already indicated that this quantity cannot be computed analytically, and the simple and practical numerical approximations that we have employed have proved ineffective. In practice, we would use a cross-validation approach, and finding a criterion (which should require significantly less computational overhead) based on an effective approximation to $p(t|\mathcal{M})$ remains an open research question.

We finally note that the primary disadvantage of the sparse Bayesian method is the computational complexity of the learning algorithm. Although the presented update rules are very simple in form, their required memory and computation scale respectively with the square and cube of the number of basis functions. For the RVM, this implies that the algorithm presented here becomes less practical when the training examples number several thousand or more. In light of this we have recently developed a much more efficient strategy for maximising the marginal likelihood, which is discussed briefly in Appendix B.2, and we intend to publish further details shortly.

Acknowledgments

The author wishes to thank Chris Bishop, Chris Williams, Neil Lawrence, Ralf Herbrich, Anita Faul, John Platt and Bernhard Schölkopf for helpful discussions, and the latter two again for the use of their SVM code. In addition, the author was appreciative of many excellent comments from the reviewers.

Appendix A. Further Details of Relevance Vector Learning

Relevance vector learning involves the maximisation of the product of the marginal likelihood and the priors over $\alpha$ and $\sigma^2$ (or $\beta \equiv \sigma^{-2}$ for convenience). Equivalently, and more straightforwardly, we maximise the log of this quantity. In addition, we also choose to maximise with respect to $\log\alpha$ and $\log\beta$; this is convenient since in practice we assume uniform hyperpriors over a logarithmic scale, and the derivatives of the prior terms vanish in this space. So, retaining for now general Gamma priors over $\alpha$ and $\beta$, we maximise:

$\log p(t \mid \log\alpha, \log\beta) + \sum_{i=0}^{N} \log p(\log\alpha_i) + \log p(\log\beta), \quad (35)$

which, ignoring terms independent of $\alpha$ and $\beta$, and noting that $p(\log\alpha) = \alpha\,p(\alpha)$, gives the objective function:

$\mathcal{L} = -\frac{1}{2}\left[\log\left|\beta^{-1}I + \Phi A^{-1}\Phi^T\right| + t^T\left(\beta^{-1}I + \Phi A^{-1}\Phi^T\right)^{-1}t\right] + \sum_{i=0}^{N}\left(a\log\alpha_i - b\alpha_i\right) + c\log\beta - d\beta. \quad (36)$

Note that the latter terms disappear with $a$, $b$, $c$, $d$ set to zero. We now consider how to robustly and efficiently compute $\mathcal{L}$, outline the derivation of the hyperparameter updates, and consider the numerical difficulties involved.

A.1 Computing the Log Objective Function

The matrix $\beta^{-1}I + \Phi A^{-1}\Phi^T$, which appears in the first two terms in $\mathcal{L}$, is of size $N \times N$. However, computation of both of the terms of interest may be written as a function of the posterior weight covariance $\Sigma = (A + \beta\Phi^T\Phi)^{-1}$. This matrix is $M \times M$, where $M$ is the number of basis functions in the model. While initially $M = N + 1$, in practice many basis functions will be 'deleted' during optimisation (see B.1 shortly) and $M$ will decrease considerably, giving significant computational advantages as optimisation progresses.

We compute the first term by exploiting the determinant identity (see, e.g., Mardia et al., 1979, Appendix A):

$|A|\,\left|\beta^{-1}I + \Phi A^{-1}\Phi^T\right| = \left|\beta^{-1}I\right|\,\left|A + \beta\Phi^T\Phi\right|,$


giving

$\log\left|\beta^{-1}I + \Phi A^{-1}\Phi^T\right| = -\log|\Sigma| - N\log\beta - \log|A|. \quad (38)$

Using the Woodbury inversion identity:

$\left(\beta^{-1}I + \Phi A^{-1}\Phi^T\right)^{-1} = \beta I - \beta\Phi\left(A + \beta\Phi^T\Phi\right)^{-1}\Phi^T\beta,$

the second, data-dependent, term may be expressed as

$t^T\left(\beta^{-1}I + \Phi A^{-1}\Phi^T\right)^{-1}t = \beta t^T t - \beta t^T\Phi\Sigma\Phi^T\beta t = \beta t^T\left(t - \Phi\mu\right), \quad (40)$

with $\mu = \beta\Sigma\Phi^T t$, the posterior weight mean. Note that (40) may also be re-expressed as:

$\beta t^T\left(t - \Phi\mu\right) = \beta\|t - \Phi\mu\|^2 + \beta t^T\Phi\mu - \beta\mu^T\Phi^T\Phi\mu$
$\qquad\qquad\;\; = \beta\|t - \Phi\mu\|^2 + \mu^T\Sigma^{-1}\mu - \beta\mu^T\Phi^T\Phi\mu$
$\qquad\qquad\;\; = \beta\|t - \Phi\mu\|^2 + \mu^T A\mu, \quad (41)$

which corresponds to the penalised log-likelihood evaluated using the posterior mean weights. The terms in (38) are sometimes referred to as the "Ockham factors".

Computation of $\Sigma$, its determinant and $\mu$ is achieved robustly through Cholesky decomposition of $A + \beta\Phi^T\Phi$, an $O(M^3)$ procedure.
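As an illustration, a minimal NumPy/SciPy sketch of this computation follows. The routine name, and the conventions of passing $\alpha$ as a vector and $\beta = \sigma^{-2}$ as a scalar, are our own; the hyperprior terms (i.e. $a = b = c = d = 0$) are omitted.

import numpy as np
from scipy.linalg import cho_factor, cho_solve

def posterior_stats(Phi, t, alpha, beta):
    """Posterior mean/covariance and the marginal-likelihood part of (36),
    computed via (38) and (40) with a = b = c = d = 0."""
    N, M = Phi.shape
    H = np.diag(alpha) + beta * Phi.T @ Phi          # A + beta Phi^T Phi
    chol, low = cho_factor(H, lower=True)            # O(M^3) Cholesky decomposition
    Sigma = cho_solve((chol, low), np.eye(M))        # posterior covariance (M x M)
    mu = beta * (Sigma @ (Phi.T @ t))                # posterior mean, mu = beta Sigma Phi^T t
    log_det_Sigma = -2.0 * np.sum(np.log(np.diag(chol)))
    # Equation (38): log |beta^{-1} I + Phi A^{-1} Phi^T|
    log_det_C = -log_det_Sigma - N * np.log(beta) - np.sum(np.log(alpha))
    # Equation (40): t^T C^{-1} t = beta t^T (t - Phi mu)
    quad = beta * t @ (t - Phi @ mu)
    return mu, Sigma, -0.5 * (log_det_C + quad)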

A.2 Derivatives and Updates

A.2.1 The hyperparameters

The derivatives of (36) with respect to $\log\alpha$ are:

$\frac{\partial\mathcal{L}}{\partial\log\alpha_i} = \frac{1}{2}\left[1 - \alpha_i\left(\mu_i^2 + \Sigma_{ii}\right)\right] + a - b\alpha_i. \quad (42)$

Setting this to zero and solving for $\alpha_i$ gives a re-estimation rule:

$\alpha_i^{\rm new} = \frac{1 + 2a}{\mu_i^2 + \Sigma_{ii} + 2b}. \quad (43)$

This is equivalent to an expectation-maximisation (EM) update (see A.3 below) and so is guaranteed to locally maximise $\mathcal{L}$. However, setting (42) to zero, and following MacKay (1992a) in defining quantities $\gamma_i \equiv 1 - \alpha_i\Sigma_{ii}$, leads to the following update:

$\alpha_i^{\rm new} = \frac{\gamma_i + 2a}{\mu_i^2 + 2b}, \quad (44)$

which was observed to lead to much faster convergence, although it does not benefit from the guarantee of local maximisation of $\mathcal{L}$. In practice, we did not encounter any optimisation difficulties (i.e. 'downhill' steps) utilising (44).


A.2.2 The noise variance

Derivatives with respect to $\log\beta$ are:

$\frac{\partial\mathcal{L}}{\partial\log\beta} = \frac{1}{2}\left[N - \beta\|t - \Phi\mu\|^2 - \beta\,\mathrm{tr}\left(\Sigma\Phi^T\Phi\right)\right] + c - d\beta, \quad (45)$

and since $\mathrm{tr}(\Sigma\Phi^T\Phi)$ can be re-written as $\beta^{-1}\sum_i\gamma_i$, setting the derivative to zero and rearranging and re-expressing in terms of $\sigma^2$ gives:

$(\sigma^2)^{\rm new} = \frac{\|t - \Phi\mu\|^2 + 2d}{N - \sum_i\gamma_i + 2c}. \quad (46)$
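A sketch of one sweep of these re-estimation formulae follows, assuming $\mu$, $\Sigma$ and $\gamma$ are computed as above (a naive matrix inverse is used for brevity where the Cholesky-based routine would be preferred); the function name and argument conventions are our own. In practice one would iterate such sweeps, interleaved with the pruning of Appendix B.1, until the $\alpha$ values converge.

import numpy as np

def reestimate(Phi, t, alpha, sigma2, a=0.0, b=0.0, c=0.0, d=0.0):
    """One sweep of the updates (44) and (46)."""
    N, M = Phi.shape
    beta = 1.0 / sigma2
    Sigma = np.linalg.inv(np.diag(alpha) + beta * Phi.T @ Phi)   # naive; prefer Cholesky
    mu = beta * Sigma @ Phi.T @ t
    gamma = 1.0 - alpha * np.diag(Sigma)                         # gamma_i = 1 - alpha_i Sigma_ii
    alpha_new = (gamma + 2 * a) / (mu ** 2 + 2 * b)              # update (44)
    sigma2_new = (np.sum((t - Phi @ mu) ** 2) + 2 * d) / (N - np.sum(gamma) + 2 * c)  # update (46)
    return alpha_new, sigma2_new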

A.3 Expectation-Maximisation (EM) Updates

Another strategy to maximise (36) is to exploit an EM formulation, treating the weights as the 'hidden' variables and maximising $E_{w|t,\alpha,\beta}\left[\log p(t|w,\beta)\,p(w|\alpha)\,p(\alpha)\,p(\beta)\right]$, where the operator $E_{w|t,\alpha,\beta}[\cdot]$ denotes an expectation with respect to the distribution over the weights given the data and the hyperparameters, i.e. the posterior $p(w|t,\alpha,\beta)$.

For $\alpha$, ignoring terms in the logarithm independent thereof, we equivalently maximise:

$E_{w|t,\alpha,\beta}\left[\log p(w|\alpha)\,p(\alpha)\right], \quad (47)$

which through differentiation gives an update:

$\alpha_i = \frac{1 + 2a}{\langle w_i^2\rangle + 2b}, \quad (48)$

and where, from the posterior (11), $\langle w_i^2\rangle \equiv E_{w|t,\alpha,\beta}\left[w_i^2\right] = \Sigma_{ii} + \mu_i^2$. This is therefore equivalent to the (less efficient) gradient update (43) earlier.

Following the corresponding procedure for the noise level $\sigma^2$, we maximise:

$E_{w|t,\alpha,\beta}\left[\log p(t|w,\beta)\,p(\beta)\right], \quad (49)$

which gives

$(\sigma^2)^{\rm new} = \frac{\|t - \Phi\mu\|^2 + (\sigma^2)^{\rm old}\sum_i\gamma_i + 2d}{N + 2c}. \quad (50)$

Appendix B. Computational Considerations

B.1 Numerical Accuracy

The posterior covariance $\Sigma$ is computed as the inverse of the 'Hessian' matrix $H = A + \beta\Phi^T\Phi$ which, although positive definite in theory, may become numerically singular in practice. As optimisation of the hyperparameters progresses, the range of $\alpha$-values typically becomes highly extended as many tend towards very large values. Indeed, for $a, b, c, d = 0$, many $\alpha$'s would typically tend to infinity if machine precision permitted. In fact, ill-conditioning of the Hessian matrix becomes a problem when, approximately, the ratio of the smallest to largest $\alpha$-values is in the order of the 'machine precision'.


Consider the case of a single $\alpha_i \to \infty$, where for convenience of presentation we choose $i = 1$, the first hyperparameter. Using the expression for the inverse of a partitioned matrix, it can be shown that:

$\Sigma \rightarrow \begin{pmatrix} 0 & 0 \\ 0 & \left(A_{-i} + \beta\Phi_{-i}^T\Phi_{-i}\right)^{-1} \end{pmatrix}, \quad (51)$

where the subscript '$-i$' denotes the matrix with the appropriate $i$-th row and/or column removed. The term $(A_{-i} + \beta\Phi_{-i}^T\Phi_{-i})^{-1}$ is of course the posterior covariance computed with basis function $i$ 'pruned'. Furthermore, it follows from (51) and (13) that $\mu_i \to 0$ and, as a consequence of $\alpha_i \to \infty$, the model intuitively becomes exactly equivalent to one with basis function $\phi_i(x)$ excluded.

We may thus choose to avoid ill-conditioning by pruning the corresponding basis function from the model at that point (i.e. by deleting the appropriate column from $\Phi$). This sparsification of the model during optimisation implies that we typically experience a very considerable and advantageous acceleration of the learning algorithm. The potential disadvantage is that if we believed that the marginal likelihood might be increased by re-introducing those deleted basis functions (i.e. reducing $\alpha_i$ from infinity) at a later stage, then their permanent removal would be suboptimal. For reassurance, at the end of optimisation we can efficiently compute the sign of the gradient of the marginal likelihood with respect to all $\alpha$'s corresponding to deleted basis functions. A negative gradient would imply that reducing an infinite $\alpha$, and therefore reintroducing the corresponding basis function, would improve the likelihood. So far, no such case has been found.

The intuitive and reliable criterion we used for deletion of basis functions and their weights at each iteration in regression was to remove those whose 'well-determinedness' factors $\gamma_i$ fell below the machine precision ($\approx 2.22 \times 10^{-16}$ in our case, the smallest $\epsilon$ such that $1 + \epsilon \neq 1$). As a result, presumably, of inaccuracies introduced by the Laplace approximation step, in classification a slightly more robust method was required, with weights deleted for which $\alpha_i > 10^{12}$. Note that the action of deleting such a basis function should not in theory change the value of the objective function $\mathcal{L}$ (in practice, it should change only negligibly) since in (38), the infinite parts of the terms $\log|\Sigma| = \log|\Sigma_{-i}| - \log\alpha_i$ and $\log|A| = \log|A_{-i}| + \log\alpha_i$ cancel.
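A minimal sketch of this deletion rule (regression case) might look as follows; the function name and the choice to return a boolean mask are our own conventions.

import numpy as np

EPS = np.finfo(float).eps          # approximately 2.22e-16, the machine precision quoted above

def prune(Phi, alpha, gamma):
    """Delete basis functions whose well-determinedness factor gamma_i has
    collapsed below machine precision (the regression criterion above)."""
    keep = gamma > EPS
    return Phi[:, keep], alpha[keep], keep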

B.2 Algorithm Complexity

The update rules for the hyperparameters depend on computing the posterior weight covariance matrix, which requires an inverse operation (in fact, Cholesky decomposition) of order $O(M^3)$ complexity and $O(M^2)$ memory storage, with $M$ the number of basis functions. Although, typically, the pruning discussed above rapidly reduces $M$ to a manageable size in most problems, $M = N + 1$ for the RVM model at initialisation, and $N$ may be very large. This of course leads to extended training times, although the disadvantage of this is significantly offset by the lack of necessity to perform cross-validation over nuisance parameters, such as for $C$ and $\epsilon$ in the SVM. So, for example, with the exception of the larger data sets (e.g. roughly $N > 700$), the benchmark results in Section 4.4 were obtained more quickly for the RVM (this observation depends on the exact implementations and cross-validation schedules of course).


Even so, for large data sets, with computation scaling approximately in $O(N^3)$, the full RVM algorithm becomes prohibitively expensive to run. We have therefore developed an alternative algorithm to maximise the marginal likelihood which is 'constructive'. It starts with a single basis function, the bias, and both adds in further basis functions, or deletes current ones, as appropriate, rather than starting with all possible candidates and pruning. This is a much more efficient approach, as the number of basis functions included at any step in the algorithm tends to remain low. It is, however, a more 'greedy' optimisation strategy, although our preliminary results show little, if any, loss of accuracy compared to the standard algorithm. This appears a very promising mechanism for ensuring the sparse Bayesian approach remains practical even for very large basis function sets.

Appendix C. Adapting Input Scale Parameters

In Section 4.3 we considered how parameters within the kernel specification could be adapted to improve the marginal likelihood. We note that in general, we may still prefer to select an individual kernel parameter (for example, the 'width' parameter of a Gaussian) via cross-validation. Here, however, we consider the case of multiple input scale parameters, where cross-validation is not an option and where by optimising such parameters considerable performance gains may be achieved. The benefit is typically most notable in regression problems, as exemplified by the Friedman #1 data set benchmark in Section 4.4.1.

Assume that the basis functions take the form

$\phi_m(x) = \phi_m\left(\sqrt{\eta_1}\,x_1, \sqrt{\eta_2}\,x_2, \ldots, \sqrt{\eta_d}\,x_d\right), \quad (52)$

where the input vector $x$ comprises $d$ variables $(x_1, \ldots, x_d)$ and the corresponding scale (inverse squared 'width') parameters are $\eta = (\eta_1, \ldots, \eta_d)$. We notate $\Phi_{nm} = \phi_m(x_n; \eta)$ to be the elements of the design matrix $\Phi$ as before. The gradient of the likelihood $\mathcal{L}$ (36) with respect to the $k$-th scale parameter is first written in the form:

$\frac{\partial\mathcal{L}}{\partial\eta_k} = \sum_{n=1}^{N}\sum_{m=1}^{N} \frac{\partial\mathcal{L}}{\partial\Phi_{nm}}\,\frac{\partial\Phi_{nm}}{\partial\eta_k}. \quad (53)$

Note that $m = 0$ refers to the bias, and so, being independent of $x$, does not enter into (53). The first term in (53) is independent of the basis function parameterisation and it is convenient to collect all the terms into a matrix $D$ such that $D_{nm} = \partial\mathcal{L}/\partial\Phi_{nm}$. Evaluating these derivatives then gives:

$D = \left(C^{-1}t t^T C^{-1} - C^{-1}\right)\Phi A^{-1} \quad (54)$
$\;\;\;= \beta\left[(t - y)\mu^T - \Phi\Sigma\right], \quad (55)$

where $C = \sigma^2 I + \Phi A^{-1}\Phi^T$ and $y = \Phi\mu$. The form of (55) is intended to be a more intuitive re-writing of (54).

So for a set of Gaussian basis functions with shared scale parameters:

$\Phi_{nm} = \exp\left\{-\sum_{k=1}^{d}\eta_k\left(x_{mk} - x_{nk}\right)^2\right\}, \quad (56)$


and we have

$\frac{\partial\mathcal{L}}{\partial\eta_k} = \sum_{m=1}^{N}\sum_{n=1}^{N} -D_{nm}\,\Phi_{nm}\left(x_{mk} - x_{nk}\right)^2. \quad (57)$

The expression (57) can thus be used as the basis for a gradient-based local optimisation over $\eta$. Exactly how this is performed represents the primary implementation difficulty. A joint nonlinear optimisation over $\{\alpha, \eta\}$, using, for example, conjugate gradient methods, is prohibitively slow. Since the $\alpha$ update equation (44) is so effective, we chose to interleave those updates with a few cycles (e.g. 1 to 5) of optimisation of $\eta$ using a simple hill-climbing method. The exact quality of results is somewhat dependent on the ratio of the number of $\alpha$ to $\eta$ updates, although in nearly all cases a significant improvement (in generalisation error) is observed.
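The following sketch evaluates (55) and (57) for the Gaussian basis (56). It assumes the design matrix is built from the current $\eta$ and contains no bias column (the bias does not enter the gradient), and that $\mu$, $\Sigma$ and $\beta$ correspond to the current hyperparameter values; the function name is our own.

import numpy as np

def scale_gradient(X, Phi, mu, Sigma, beta, t):
    """Gradient (57) of L w.r.t. the input scales eta, for the Gaussian basis (56)
    with one basis function per training point and no bias column."""
    y = Phi @ mu
    D = beta * (np.outer(t - y, mu) - Phi @ Sigma)       # equation (55)
    d = X.shape[1]
    grad = np.empty(d)
    for k in range(d):
        sq = (X[:, None, k] - X[None, :, k]) ** 2        # (x_mk - x_nk)^2 for all n, m
        grad[k] = -np.sum(D * Phi * sq)                  # equation (57)
    return grad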

Clearly it would be more satisfactory to offer a definitive method for optimising $\eta$, and we would expect that there is a better mechanism for doing so than the one we have employed. Nevertheless, the results given for the Friedman #1 task and the 'toy' example of Section 4.3 indicate that, even if simplistically implemented, this is a potentially very powerful extension to the method.

Appendix D. Probabilistic Outputs

D.1 Error Bars in Regression

Note that, as alluded to earlier, care must be exercised when interpreting the error bars given by equation (22) since they are predicated on the prior over functions, which in turn depends on the basis. In the case of the RVM, the basis functions are data-dependent. To see the implication of this, we re-visit the link with Gaussian process models that was developed in Section 5.2.

A Gaussian process model is specified as a (usually zero mean) multivariate Gaussian distribution over observations, i.e. $p(t) = \mathcal{N}(t|0, C_{\rm GP})$ with

$C_{\rm GP} = \sigma^2 I + Q, \quad (58)$

where $Q_{mn} = Q(x_m, x_n)$ is the covariance function, defined over all $x$-pairs. There are no 'weights' in the model. Instead, predictions (in terms of predictive distributions) are obtained from Bayes' rule using $p(t_*|t) = p(t_*, t)/p(t)$. The covariance function must give rise to a positive definite $Q$, so like the SVM kernel, it must adhere to Mercer's condition. The 'Gaussian' (or negative squared exponential) is a common choice: $Q(x_m, x_n; \theta) = \theta_1\exp\left(-\theta_2\|x_m - x_n\|^2\right)$.

Clearly, the RVM is also a Gaussian process, with $C$ defined as in (34), and indeed we could also make predictions without recourse to any model 'weights'. The corresponding covariance function, read off from (34), for the RVM is:

$Q_{\rm RVM}(x_m, x_n) = \sum_{i=0}^{N}\alpha_i^{-1} K(x_m, x_i)\,K(x_n, x_i), \quad (59)$

where $K(x, x_0) \equiv 1$ denotes the bias. A feature of this prior is that it is data-dependent, in that the prior covariance between any two points $x_m$ and $x_n$ depends on all those $\{x_i\}$ in the training set. This has implications for computing error bars in RVM regression. To see this, consider that we use a Gaussian covariance function, and compute the predictive error bars $\sigma_*^2$ away from the data, where we would expect them to be significant. For the Gaussian process (Williams, 1999):

$\sigma_*^2 = \sigma^2 + Q(x_*, x_*) - k^T\left(C_{\rm GP}^{N}\right)^{-1}k \approx \sigma^2 + \theta_1, \quad (60)$

where $(k)_i = K(x_*, x_i)$, $C_{\rm GP}^{N}$ is (58) evaluated at the training data points, and we assume the test point to be sufficiently distant from the training data such that all $K(x_*, x_i)$, the elements of $k$, are very small. Using the identical kernel in a relevance vector model gives a predictive variance from (22) of:

$\sigma_*^2 = \sigma^2 + k^T\Sigma\,k \approx \sigma^2, \quad (61)$

and thus depends only on the noise variance. There is no contribution from the function prior, which may seem odd, but is of course consistent with its specification.
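For concreteness, a short sketch of the RVM predictive variance (61) for Gaussian kernels follows; the Gaussian form, the width argument and the omission of the bias term are our assumptions for illustration. Far from all relevance vectors the kernel evaluations vanish and the returned value reduces to the noise variance, as described above.

import numpy as np

def rvm_predictive_variance(x_star, X_rel, Sigma, sigma2, width):
    """Predictive variance at x_star for an RVM with Gaussian kernels centred
    on the relevance vectors X_rel; Sigma is the posterior covariance over the
    retained (non-bias) weights. Implements equation (61), bias omitted."""
    phi = np.exp(-np.sum((x_star - X_rel) ** 2, axis=1) / width ** 2)
    return sigma2 + phi @ Sigma @ phi    # tends to sigma2 away from the relevance vectors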

D.2 Posterior Probabilities in Classification

When performing 'classification' of some test example $x_*$, we would prefer our model to give an estimate of $p(t_* \in \mathcal{C}|x_*)$, the posterior probability of the example's membership of the class $\mathcal{C}$ given the features $x_*$. This quantity expresses the uncertainty in the prediction in a principled manner while facilitating the separation of 'inference' and 'decision' (Duda and Hart, 1973). In practical terms, these posterior probability estimates are necessary to correctly adapt to asymmetric misclassification costs (which nearly always apply in real applications) and varying class proportions, as well as allowing the rejection of 'uncertain' test examples if desired.

Importantly, the presented Bayesian classification formulation incorporates the Bernoulli output distribution (equivalent in the log-domain to the 'cross-entropy' error function), which in conjunction with the logistic sigmoid squashing function enables $\sigma\{y(x)\}$ to be interpreted as a consistent estimate of the posterior probability of class membership. Provided $y(x)$ is sufficiently flexible, in the infinite data limit this estimate becomes exact (see, e.g., Bishop, 1995).

By contrast, for a test point $x_*$ the SVM outputs a real number which is thresholded to give a 'hard' binary decision as to the class of the target $t_*$. The absence of posterior probabilities from the SVM is an acknowledged deficiency, and a recently proposed technique for tackling this involves the a posteriori fitting of a sigmoid function to the fixed SVM output $y(x)$ (Platt, 2000) to give an approximate probability of the form $\sigma\{A\,y(x) + B\}$. While this approach does at least produce predictive outputs in the range $[0, 1]$, it is important to realise that this does not imply that the output of the sigmoid is necessarily a good approximation of the posterior probability. The success of this post-processing strategy is predicated on the original output $y(x)$ of the SVM, with appropriate shifting and rescaling, being a good approximation to the 'inverse-sigmoid' of the desired posterior probability. This is the quantity $\log\{p(t \in \mathcal{C}_{+1}|x)/p(t \in \mathcal{C}_{-1}|x)\}$, referred to as the 'log-odds'. As a result of the nature of the SVM objective function, $y(x)$ cannot be expected to be a reliable model of this.

The simple example which follows in Figure 10 underlines this. The figure shows the output of trained classifiers on a simple two-class problem in one dimension. Class $\mathcal{C}_{+1}$ is uniform in $[0, 1]$ and class $\mathcal{C}_{-1}$ in $[0.5, 1.5]$, with the overlap implying a minimal Bayes error of 25%. The correct log-odds of $\mathcal{C}_{+1}$ over $\mathcal{C}_{-1}$ is shown in the figure. It should be zero in the overlap region $[0.5, 1]$ and infinite outside this region. The output $y(x)$ of both SVM and RVM (i.e. without the sigmoid function in the latter case) is shown for classifiers trained on a generous quantity of examples (1000) using a Gaussian kernel of width 0.1. Note that both models are optimal in the generalisation sense, since the decision boundary may lie anywhere in $[0.5, 1]$.
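A sketch of how such a data set and the corresponding true log-odds might be generated follows; equal class priors, 500 examples per class and the function name true_log_odds are our own illustrative choices.

import numpy as np

rng = np.random.default_rng(1)
n = 500                                      # 500 per class gives 1000 examples in total
x_pos = rng.uniform(0.0, 1.0, n)             # class C+1: uniform on [0, 1]
x_neg = rng.uniform(0.5, 1.5, n)             # class C-1: uniform on [0.5, 1.5]
x = np.concatenate([x_pos, x_neg])
t = np.concatenate([np.ones(n), np.zeros(n)])

def true_log_odds(x):
    """log p(C+1|x) / p(C-1|x) for equal class priors: zero on the overlap
    [0.5, 1], +inf on [0, 0.5), -inf on (1, 1.5]."""
    x = np.asarray(x, dtype=float)
    in_pos = (x >= 0.0) & (x <= 1.0)
    in_neg = (x >= 0.5) & (x <= 1.5)
    out = np.full(x.shape, np.nan)           # undefined outside both supports
    out[in_pos & in_neg] = 0.0
    out[in_pos & ~in_neg] = np.inf
    out[~in_pos & in_neg] = -np.inf
    return out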

[Figure 10 appears here: a plot over $x \in [0.4, 1.2]$ showing the correct log-odds, the RVM $y(x)$ and the SVM $y(x)$, with the vertical axis spanning $-6$ to $6$.]

Figure 10: Output of an RVM and SVM trained on overlapping, uniformly distributed data, along with the true log-odds of membership of $\mathcal{C}_{+1}$ over $\mathcal{C}_{-1}$. The SVM output has been rescaled without prejudice for ease of comparison with the RVM.

The key feature of Figure 10 is that while the RVM offers a reasonable approximation, the SVM output is highly unrepresentative of the true log-odds. It should be evident that no setting of the sigmoid parameters $A$ and $B$ can ever correct for this. This is exemplified in the region $x \in [0.5, 1]$, in the vicinity of the decision boundary where posterior accuracy is arguably most important, and where the log-odds should be constant. Practically, we note that if this were a real application with asymmetric misclassification losses, for example medical diagnosis or insurance risk assessment, then the absence of, or lack of accuracy in, posterior estimates could be very costly.


Appendix E. Benchmark Data Sets

E.1 Regression

Noisy sinc. Data was generated from the function $y(x) = \sin(x)/x$ at 100 equally-spaced $x$-values in $[-10, 10]$ with added noise from two distributions: first, Gaussian of standard deviation 0.1 and second, uniform in $[-0.1, 0.1]$. In both cases, results were averaged over 100 random instantiations of the noise, with the error being measured to the true function over a 1000-example test set of equally spaced samples in $[-10, 10]$.

Friedman functions. Synthetic functions taken from Friedman (1991):

$y_{\#1}(x) = 10\sin(\pi x_1 x_2) + 20(x_3 - 1/2)^2 + 10x_4 + 5x_5 + \sum_{k=6}^{10} 0\cdot x_k, \quad (62)$
$y_{\#2}(x) = \left[x_1^2 + \left(x_2 x_3 - 1/(x_2 x_4)\right)^2\right]^{1/2}, \quad (63)$
$y_{\#3}(x) = \tan^{-1}\left(\frac{x_2 x_3 - 1/(x_2 x_4)}{x_1}\right). \quad (64)$

For #1, inputs were generated at random from the 10-dimensional unit hypercube (note that input variables $x_6$ through $x_{10}$ do not contribute to the function) and Gaussian noise of unit standard deviation added to $y(x)$. For #2 and #3, random input samples were generated uniformly in the intervals $x_1 \in [0, 100]$, $x_2 \in [40\pi, 560\pi]$, $x_3 \in [0, 1]$ and $x_4 \in [1, 11]$. Additive Gaussian noise was generated on the output of standard deviation one-third of that of the signal $y(x)$. For all Friedman functions, average results for 100 repetitions were quoted, where 240 randomly generated training examples were utilised, along with 1000 noise-free test examples.
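For reference, a sketch of the Friedman #1 generation procedure as described above; the function name, random seed and noise flag are our own choices.

import numpy as np

def friedman1(n, rng, noise=True):
    """Friedman #1 inputs and targets as in (62); x6..x10 are pure nuisance inputs."""
    X = rng.uniform(size=(n, 10))
    y = (10 * np.sin(np.pi * X[:, 0] * X[:, 1])
         + 20 * (X[:, 2] - 0.5) ** 2
         + 10 * X[:, 3] + 5 * X[:, 4])
    if noise:
        y = y + rng.standard_normal(n)       # Gaussian noise of unit standard deviation
    return X, y

rng = np.random.default_rng(0)
X_train, t_train = friedman1(240, rng)                 # 240 training examples per repetition
X_test, y_test = friedman1(1000, rng, noise=False)     # 1000 noise-free test examples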

Boston housing. This popular regression benchmark data set was obtained from the StatLib archive at http://lib.stat.cmu.edu/datasets/boston. It comprises 506 examples with 14 variables, and results were given for the task of predicting the median house value from the remaining 13 variables. Averages were given over 100 repetitions, where 481 randomly chosen examples were used for training and the remaining 25 used for testing.

E.2 Classification

Pima Diabetes. This data was utilised as an example set by Ripley (1996), and re-used by Williams and Barber (1998). We used the single split of the data into 200 training and 332 test examples available at http://www.stats.ox.ac.uk/pub/PRNN/.

U.S.P.S. The U.S. Postal Service database of 16×16 images of digits '0' through '9' is that commonly utilised as a benchmark for SVM-related techniques, e.g. as used by Schölkopf et al. (1999b). The data utilised here was obtained from those authors, and comprises 7291 training examples along with a 2007-example test set.

Banana, Breast Cancer, Titanic, Waveform, German, Image. All these data sets were taken from the collection recently utilised by Rätsch et al. (2001). More details concerning the original provenance of the sets are available in a highly comprehensive online repository at http://ida.first.gmd.de/~raetsch/. A total of 100 training/test splits are provided by those authors: our results show averages over the first 10 of those.


References

J. O. Berger. Statistical decision theory and Bayesian analysis. Springer, second edition, 1985.

C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.

C. M. Bishop and M. E. Tipping. Variational relevance vector machines. In C. Boutilier and M. Goldszmidt, editors, Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, pages 46–53. Morgan Kaufmann, 2000.

B. Boser, I. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 144–152, 1992.

C. J. C. Burges. Simplified support vector decision rules. In L. Saitta, editor, Proceedings of the Thirteenth International Conference on Machine Learning, pages 71–77, Bari, Italy, 1996. Morgan Kaufmann.

C. J. C. Burges and B. Schölkopf. Improving the accuracy and speed of support vector machines. In M. C. Mozer, M. I. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9, pages 375–381. MIT Press, 1997.

S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. Technical Report 479, Department of Statistics, Stanford University, 1995.

R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. John Wiley, 1973.

J. H. Friedman. Multivariate adaptive regression splines. Annals of Statistics, 19(1):1–141, 1991.

Y. Grandvalet. Least absolute shrinkage is equivalent to quadratic penalisation. In L. Niklasson, M. Bodén, and T. Ziemske, editors, Proceedings of the Eighth International Conference on Artificial Neural Networks (ICANN98), pages 201–206. Springer, 1998.

T. S. Jaakkola and M. I. Jordan. Bayesian logistic regression: a variational approach. In D. Madigan and P. Smyth, editors, Proceedings of the 1997 Conference on Artificial Intelligence and Statistics, Ft Lauderdale, FL, 1997.

J. T.-K. Kwok. The evidence framework applied to support vector machines. IEEE Transactions on Neural Networks, 11(5):1162–1173, 2000.

D. J. C. MacKay. Bayesian interpolation. Neural Computation, 4(3):415–447, 1992a.

D. J. C. MacKay. The evidence framework applied to classification networks. Neural Computation, 4(5):720–736, 1992b.

D. J. C. MacKay. Bayesian methods for backpropagation networks. In E. Domany, J. L. van Hemmen, and K. Schulten, editors, Models of Neural Networks III, chapter 6, pages 211–254. Springer, 1994.


D. J. C. MacKay. Introduction to Gaussian processes. In C. M. Bishop, editor, Neural Networks and Machine Learning, pages 133–165. Springer, 1998.

D. J. C. MacKay. Comparison of approximate methods for handling hyperparameters. Neural Computation, 11(5):1035–1068, 1999.

K. V. Mardia, J. T. Kent, and J. M. Bibby. Multivariate Analysis. Probability and Mathematical Statistics. Academic Press, 1979.

I. T. Nabney. Efficient training of RBF networks for classification. In Proceedings of the Ninth International Conference on Artificial Neural Networks (ICANN99), pages 210–215. IEE, 1999.

R. M. Neal. Bayesian Learning for Neural Networks. Springer, 1996.

J. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In A. J. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers. MIT Press, 2000.

C. E. Rasmussen. Evaluation of Gaussian processes and other methods for non-linear regression. PhD thesis, Graduate Department of Computer Science, University of Toronto, 1996.

G. Rätsch, T. Onoda, and K.-R. Müller. Soft margins for AdaBoost. Machine Learning, 42(3):287–320, 2001.

B. D. Ripley. Pattern Recognition and Neural Networks. Cambridge University Press, 1996.

B. Schölkopf. The kernel trick for distances. In Advances in Neural Information Processing Systems 13. MIT Press, 2001.

B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors. Advances in Kernel Methods: Support Vector Learning. MIT Press, 1999a.

B. Schölkopf, S. Mika, C. J. C. Burges, P. Knirsch, K.-R. Müller, G. Rätsch, and A. J. Smola. Input space versus feature space in kernel-based methods. IEEE Transactions on Neural Networks, 10(5):1000–1017, 1999b.

M. Seeger. Bayesian model selection for support vector machines, Gaussian processes and other kernel classifiers. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12, pages 603–609. MIT Press, 2000.

A. J. Smola, B. Schölkopf, and K.-R. Müller. The connection between regularization operators and support vector kernels. Neural Networks, 11:637–649, 1998.

A. J. Smola, B. Schölkopf, and G. Rätsch. Linear programs for automatic accuracy control in regression. In Proceedings of the Ninth International Conference on Artificial Neural Networks (ICANN99), pages 575–580, 1999.


P. Sollich. Probabilistic methods for support vector machines. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12, pages 349–355. MIT Press, 2000.

M. E. Tipping. The Relevance Vector Machine. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12, pages 652–658. MIT Press, 2000.

M. E. Tipping. Sparse kernel principal component analysis. In Advances in Neural Information Processing Systems 13. MIT Press, 2001.

V. N. Vapnik. Statistical Learning Theory. Wiley, 1998.

V. N. Vapnik, S. E. Golowich, and A. J. Smola. Support vector method for function approximation, regression estimation and signal processing. In M. C. Mozer, M. I. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9. MIT Press, 1997.

C. K. I. Williams. Prediction with Gaussian processes: from linear regression to linear prediction and beyond. In M. I. Jordan, editor, Learning in Graphical Models, pages 599–621. MIT Press, 1999.

C. K. I. Williams and D. Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342–1351, 1998.

P. M. Williams. Bayesian regularisation and pruning using a Laplace prior. Neural Computation, 7(1):117–143, 1995.
