Learning Kernel Classifiers
Chap. 3.3 Relevance Vector Machine, Chap. 3.4 Bayes Point Machines
Summarized by Sang Kyun Lee, 13th May 2005

3.3 Relevance Vector Machine [M. Tipping, JMLR 2001]

● A modification of the Gaussian process (GP) model
– GP: prior, likelihood, posterior
– RVM: a different prior; the likelihood is the same as for the GP, which yields a different posterior

● Reasons
– To get a sparse representation of the weight vector, and hence of the classifier and its expected risk
– Thus, we favor weight vectors with a small number of non-zero coefficients
– One way to achieve this is to modify the prior: consider an individual variance θi for every weight wi (the reconstructed formulas are sketched at the end of these notes)
– Then, in the limit θi → 0, only wi = 0 remains possible
– Computation of the posterior is easier than before

● Prediction function
– The resulting prediction function is compared for the GP and the RVM

● How can we learn the sparse vector θ?
– To find the best θ, employ evidence maximization
– The evidence can be written down explicitly; the derived update rules are given in Appendix B.8

● Evidence maximization
– Interestingly, many of the θi decrease quickly toward zero, which leads to high sparsity in the weight vector
– For faster convergence, delete the ith column of the design matrix whenever θi falls below a pre-defined threshold
– After termination, set wi = 0 for every i whose θi is below the threshold; the remaining wi are set equal to the corresponding entries of the posterior mean (a minimal code sketch of this loop is given at the end of these notes)

● Application to classification
– Introduce latent target variables, one for each of the m training objects and one for the test object
– Compute the predictive distribution at the new object: apply a latent weight vector to all m+1 objects and marginalize over the latent variables
– Note: as in the GP case, we cannot solve this analytically because the posterior is no longer Gaussian
– Laplace approximation: approximate this density by a Gaussian whose mean is the mode of the true density and whose covariance is the inverse Hessian of its negative logarithm at that mode

● Kernel trick
– Consider the RKHS generated by the kernel
– Then the ith component of an object's representation is given by a kernel evaluation with the ith training object
– Now, think about regression: the weights become the expansion coefficients of the desired hyperplane, so the hyperplane is a kernel expansion over the training objects
– In this sense, all training objects with a non-zero expansion coefficient are termed relevance vectors

3.4 Bayes Point Machines [R. Herbrich, JMLR 2000]

● In GPs and RVMs we tried to solve the classification problem via regression estimation
– Before, we assumed a prior distribution over real-valued functions and used logit transformations to model the likelihood
– Now we try to model the classification likelihood directly

● Prior
– For classification, only the spatial direction of the weight vector matters: rescaling it by a positive constant leaves the induced classifier unchanged
– Thus we consider only the vectors on the unit sphere
– Then assume a uniform prior over this ball-shaped hypothesis space

● Likelihood
– Use the PAC likelihood (0-1 loss)

● Posterior
– Remark: using the PAC likelihood, the posterior is uniform over version space (the set of weight vectors that classify every training example correctly) and zero elsewhere

● Predictive distribution
– In the two-class case, the Bayesian decision can be written as an expectation over the posterior; that is, the Bayes classification strategy performs majority voting over all version space classifiers
– However, the expectation is hard to compute, hence we approximate it by a single classifier
– That is, the Bayes point is the projection of the Bayes classification strategy onto the single classifier that is optimal w.r.t. generalization error
– However, this too is intractable, because we would need to know the input distribution and the posterior
– Another reasonable approximation: take the center of mass of version space; the Bayes classification of a new object then equals the classification w.r.t. this single weight vector
– Estimate the center of mass by MCMC sampling (the kernel billiard algorithm); a toy illustration of the quantity being estimated is sketched at the end of these notes
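
The formula images on the "Reasons" slide are not reproduced in this transcript. In standard RVM notation (my reconstruction, so the symbols may differ slightly from the book's), the modified prior and the sparsity argument read:

```latex
% Modified prior: one variance parameter theta_i per weight w_i
% (standard RVM notation; reconstructed, not copied from the book)
\[
  \mathbf{w} \mid \boldsymbol{\theta} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Theta}),
  \qquad
  \boldsymbol{\Theta} = \operatorname{diag}(\theta_1, \ldots, \theta_n),
  \qquad
  p(\mathbf{w} \mid \boldsymbol{\theta}) = \prod_{i=1}^{n} \mathcal{N}(w_i \mid 0, \theta_i).
\]
% As theta_i -> 0 the i-th factor puts all of its mass at zero, so only
% weight vectors with w_i = 0 keep non-vanishing density:
\[
  \theta_i \to 0 \quad\Longrightarrow\quad w_i = 0 .
\]
```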
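
The evidence-maximization loop with pruning can be written out in a few lines. This is a minimal sketch, assuming Tipping's (2001) re-estimation rules for the regression case (the book derives its own updates in Appendix B.8, which I have not reproduced); the names rvm_regression and prune_thresh are mine, and the RBF design matrix in the usage example is only illustrative.

```python
import numpy as np

def rvm_regression(Phi, t, n_iter=200, prune_thresh=1e-6, sigma2=0.1):
    """Sparse Bayesian (RVM) regression via evidence maximization.

    Phi: (m, n) design matrix, t: (m,) targets.
    theta[i] is the prior variance of weight w_i; weights whose theta
    falls below prune_thresh are pruned (their column is deleted and
    w_i is fixed to zero), as described on the slides.
    """
    m, n = Phi.shape
    theta = np.ones(n)                 # prior variances, one per weight
    active = np.arange(n)              # indices of weights not yet pruned
    mu = np.zeros(n)

    for _ in range(n_iter):
        Phi_a = Phi[:, active]
        A = np.diag(1.0 / theta[active])                      # prior precision
        Sigma = np.linalg.inv(Phi_a.T @ Phi_a / sigma2 + A)   # posterior covariance
        mu_a = Sigma @ Phi_a.T @ t / sigma2                   # posterior mean

        gamma = 1.0 - np.diag(Sigma) / theta[active]          # "well-determinedness"
        theta[active] = mu_a ** 2 / np.maximum(gamma, 1e-12)  # re-estimate variances
        sigma2 = np.sum((t - Phi_a @ mu_a) ** 2) / max(m - gamma.sum(), 1e-12)

        # store the current posterior mean, then prune small-variance weights
        mu = np.zeros(n)
        mu[active] = mu_a
        active = active[theta[active] > prune_thresh]
        if active.size == 0:
            break

    # after termination: w_i = 0 wherever theta_i fell below the threshold,
    # the remaining w_i keep their posterior-mean values
    mask = np.zeros(n, dtype=bool)
    mask[active] = True
    mu[~mask] = 0.0
    return mu, theta, active

# toy usage: noisy sine regression with one RBF basis function per training input
rng = np.random.default_rng(0)
x = np.linspace(-5.0, 5.0, 50)
t = np.sin(x) + 0.1 * rng.standard_normal(x.size)
Phi = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)
w, theta, relevance = rvm_regression(Phi, t)
print("relevance vectors:", relevance.size, "out of", x.size)
```

The training inputs whose columns survive pruning play the role of the relevance vectors mentioned on the "Kernel trick" slide.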
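
The last bullet above refers to the kernel billiard algorithm, which I do not reproduce here. Purely to illustrate the quantity it estimates, the sketch below draws weight vectors uniformly from the unit sphere (the uniform prior), keeps those that lie in version space (the PAC likelihood), and returns their center of mass. The function bayes_point and the toy data are my own naive stand-in, not the book's algorithm.

```python
import numpy as np

def bayes_point(X, y, n_samples=200_000, seed=0):
    """Approximate the Bayes point: the center of mass of version space.

    X: (m, d) training inputs, y: (m,) labels in {-1, +1}.
    Weight vectors are sampled uniformly from the unit sphere; the PAC
    (0-1) likelihood keeps only those classifying every example correctly.
    This naive rejection sampler replaces the kernel billiard algorithm
    for illustration only and does not scale beyond toy problems.
    """
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((n_samples, X.shape[1]))
    w /= np.linalg.norm(w, axis=1, keepdims=True)        # uniform on the sphere

    in_version_space = np.all((X @ w.T) * y[:, None] > 0, axis=0)
    if not in_version_space.any():
        raise RuntimeError("no sample hit version space (data may not be separable)")

    w_cm = w[in_version_space].mean(axis=0)              # center of mass
    return w_cm / np.linalg.norm(w_cm)                   # only the direction matters

# toy usage: linearly separable 2-D data
rng = np.random.default_rng(1)
X = rng.standard_normal((40, 2))
y = np.where(X @ np.array([2.0, -1.0]) > 0, 1, -1)
w_bp = bayes_point(X, y)
print("training accuracy:", float(np.mean(np.sign(X @ w_bp) == y)))
```

Rejection sampling of this kind breaks down quickly as the dimension grows, because version space occupies a vanishing fraction of the sphere, which is exactly why the kernel billiard MCMC sampler is used in practice.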