Sparse Kernel Methods - Steve Gunn
Post on 21-Dec-2015
Optimal Separating Hyperplane
Given the data $(y_1, \mathbf{x}_1), \ldots, (y_l, \mathbf{x}_l)$, $\mathbf{x} \in \mathbb{R}^n$, $y \in \{-1, 1\}$,
separate the data with a hyperplane,
$\langle \mathbf{w}, \mathbf{x} \rangle + b = 0,$
such that the data is separated without error, and the distance
between the closest vector to the hyperplane is maximal.
Solution
The optimal hyperplane minimises
$\Phi(\mathbf{w}) = \frac{1}{2} \langle \mathbf{w}, \mathbf{w} \rangle,$
subject to the constraints
$y_i \left[ \langle \mathbf{w}, \mathbf{x}_i \rangle + b \right] \geq 1, \quad i = 1, \ldots, l,$
and is obtained by finding the saddle point of the Lagrange functional
$L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2} \langle \mathbf{w}, \mathbf{w} \rangle - \sum_{i=1}^{l} \alpha_i \left( y_i \left[ \langle \mathbf{w}, \mathbf{x}_i \rangle + b \right] - 1 \right).$
Finding the OSH
A Quadratic Programming problem:
$\min_{\boldsymbol{\alpha}} \; \frac{1}{2} \boldsymbol{\alpha}^T H \boldsymbol{\alpha} + c^T \boldsymbol{\alpha}$
subject to
$\boldsymbol{\alpha}^T Y = 0, \quad \alpha_i \geq 0, \quad i = 1, \ldots, l.$
Size is dependent upon training set size
Unique global minimum
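As a concrete sketch of how the QP data are assembled, the snippet below builds $H_{ij} = y_i y_j \langle \mathbf{x}_i, \mathbf{x}_j \rangle$ and the linear term for a toy linearly separable set. The dataset, the choice of a linear kernel, and the sign convention $c = -\mathbf{1}$ (so that minimising matches the usual maximisation of the dual) are illustrative assumptions, not taken from the slides; the QP solver itself is assumed to be supplied by a standard package.

```python
# Assemble the OSH dual QP matrices for a toy 2-D dataset (hypothetical data).
# H[i][j] = y_i * y_j * <x_i, x_j>; c = -1 so min 0.5*a'Ha + c'a is the dual.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def qp_matrices(X, y):
    l = len(X)
    H = [[y[i] * y[j] * dot(X[i], X[j]) for j in range(l)] for i in range(l)]
    c = [-1.0] * l  # linear term; remaining constraints: a'y = 0, a_i >= 0
    return H, c

X = [(1.0, 1.0), (2.0, 2.0), (-1.0, -1.0)]
y = [1, 1, -1]
H, c = qp_matrices(X, y)
print(H[0][2])   # y_0 * y_2 * <x_0, x_2> = 1 * -1 * (-2) = 2.0
```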
Support Vectors
Information contained in support vectors
Can throw away rest of training data
SVs have non-zero Lagrange multipliers
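The point that non-SV data can be thrown away can be checked directly: dropping every training point whose multiplier is zero leaves the decision function $f(\mathbf{x}) = \sum_i \alpha_i y_i \langle \mathbf{x}_i, \mathbf{x} \rangle + b$ unchanged. The multipliers and points below are made up for illustration.

```python
# Illustration (hypothetical numbers): points with zero Lagrange multipliers
# contribute nothing to f(x) and can be discarded.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def decision(alphas, ys, Xs, b, x):
    return sum(a * yi * dot(xi, x) for a, yi, xi in zip(alphas, ys, Xs)) + b

X = [(0.0, 1.0), (1.0, 0.0), (3.0, 3.0)]   # third point lies far from the margin
y = [1, -1, 1]
alpha = [0.5, 0.5, 0.0]                     # alpha_3 = 0: not a support vector
b = 0.0

query = (2.0, 2.0)
full = decision(alpha, y, X, b, query)
sv = [i for i, a in enumerate(alpha) if a > 0]
pruned = decision([alpha[i] for i in sv], [y[i] for i in sv],
                  [X[i] for i in sv], b, query)
print(full == pruned)   # True
```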
Non-Separable Case
Introduce slack variables $\xi_i \geq 0$ and minimise
$\Phi(\mathbf{w}, \boldsymbol{\xi}) = \frac{1}{2} \langle \mathbf{w}, \mathbf{w} \rangle + C \sum_{i=1}^{l} \xi_i,$
subject to the constraints
$y_i \left[ \langle \mathbf{w}, \mathbf{x}_i \rangle + b \right] \geq 1 - \xi_i, \quad i = 1, \ldots, l.$
$C$ is chosen a priori
and determines trade-off to non-separable case.
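For a fixed $(\mathbf{w}, b)$ the slacks are determined by the margin violations, $\xi_i = \max(0, 1 - y_i(\langle \mathbf{w}, \mathbf{x}_i \rangle + b))$, and the soft-margin objective can be evaluated directly. The data and the choice $C = 1$ below are illustrative, not from the slides.

```python
# Sketch: slack variables and the soft-margin objective for a fixed (w, b).
# All numbers are hypothetical.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def soft_margin_objective(w, b, X, y, C):
    # xi_i = max(0, 1 - y_i * (<w, x_i> + b)); objective = 0.5<w,w> + C*sum(xi)
    slacks = [max(0.0, 1.0 - yi * (dot(w, xi) + b)) for xi, yi in zip(X, y)]
    return 0.5 * dot(w, w) + C * sum(slacks), slacks

X = [(2.0, 0.0), (-2.0, 0.0), (0.5, 0.0)]   # third point violates the margin
y = [1, -1, 1]
obj, slacks = soft_margin_objective((1.0, 0.0), 0.0, X, y, C=1.0)
print(slacks)   # [0.0, 0.0, 0.5]
print(obj)      # 0.5 * 1 + 1.0 * 0.5 = 1.0
```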
Finding the GSH
A Quadratic Programming problem:
$\min_{\boldsymbol{\alpha}} \; \frac{1}{2} \boldsymbol{\alpha}^T H \boldsymbol{\alpha} + c^T \boldsymbol{\alpha}$
subject to
$\sum_{i=1}^{l} y_i \alpha_i = 0, \quad 0 \leq \alpha_i \leq C, \quad i = 1, \ldots, l.$
Size is dependent upon training set size
Unique global minimum
Non-Linear SVM
Map the input space to a high-dimensional feature space
Find the OSH or GSH in the feature space
Kernel Functions
Hilbert-Schmidt theory:
$K(x_i, x_j) = \sum_{m} \alpha_m k_m(x_i) \, k_m(x_j), \quad \alpha_m \geq 0.$
Mercer's condition:
$\iint K(x_i, x_j) \, g(x_i) \, g(x_j) \, dx_i \, dx_j \geq 0 \quad \text{for all } g \text{ with } \int g(x)^2 \, dx < \infty.$
$K(x_i, x_j)$ is a symmetric function
Polynomial Degree 2
The explicit feature map in two dimensions is
$k(\mathbf{x}) = \left( 1, \; \sqrt{2}\,x_1, \; \sqrt{2}\,x_2, \; x_1^2, \; x_2^2, \; \sqrt{2}\,x_1 x_2 \right)^T,$
so that
$K(\mathbf{x}, \mathbf{y}) = \langle k(\mathbf{x}), k(\mathbf{y}) \rangle = \left( 1 + \langle \mathbf{x}, \mathbf{y} \rangle \right)^2.$
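The identity between the explicit 6-dimensional feature map and the degree-2 polynomial kernel can be checked numerically; the test points below are arbitrary.

```python
import math

# Numerical check: <k(x), k(y)> = (1 + <x, y>)^2 for the explicit
# degree-2 feature map in two dimensions.

def feature_map(x):
    x1, x2 = x
    return (1.0, math.sqrt(2) * x1, math.sqrt(2) * x2,
            x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, y = (1.0, 2.0), (3.0, 0.5)
lhs = dot(feature_map(x), feature_map(y))
rhs = (1.0 + dot(x, y)) ** 2       # (1 + 4)^2 = 25
print(abs(lhs - rhs) < 1e-9)       # True
```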
Acceptable Kernel Functions
Polynomial:
$K(x_i, x_j) = \langle x_i, x_j \rangle^d \quad \text{or} \quad K(x_i, x_j) = \left( \langle x_i, x_j \rangle + 1 \right)^d, \quad d = 1, 2, \ldots$
Radial Basis Functions:
$K(x_i, x_j) = \exp\left( -\frac{\| x_i - x_j \|^2}{2 \sigma^2} \right)$
Multi-Layer Perceptrons:
$K(x_i, x_j) = \tanh\left( b \, \langle x_i, x_j \rangle - c \right)$
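A minimal sketch of the three kernel families above in plain Python; the parameter values (degree, sigma, b, c) are illustrative defaults, and the tanh kernel is only a valid Mercer kernel for certain choices of b and c.

```python
import math

# The three kernel families named above, with hypothetical default parameters.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(xi, xj, degree=2):
    return (dot(xi, xj) + 1.0) ** degree

def rbf_kernel(xi, xj, sigma=1.0):
    sq = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-sq / (2.0 * sigma ** 2))

def mlp_kernel(xi, xj, b=1.0, c=0.0):
    return math.tanh(b * dot(xi, xj) - c)

x, y = (1.0, 0.0), (0.0, 1.0)
print(poly_kernel(x, y))                     # (0 + 1)^2 = 1.0
print(rbf_kernel(x, x))                      # exp(0) = 1.0
print(rbf_kernel(x, y) == rbf_kernel(y, x))  # symmetric: True
```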
Regression
Given the data $(y_1, \mathbf{x}_1), \ldots, (y_l, \mathbf{x}_l)$, $\mathbf{x} \in \mathbb{R}^n$, $y \in \mathbb{R}$,
approximate the data with a hyperplane,
$f(\mathbf{x}, \mathbf{w}) = \langle \mathbf{w}, \mathbf{x} \rangle + b, \quad \mathbf{w} \in \mathbb{R}^n,$
and the SRM principle,
using a loss function, e.g.,
$L(y, f(\mathbf{x}, a)) = | y - f(\mathbf{x}, a) |.$
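One common loss choice in SVM regression (used later in the deck under the name "Insensitive") is the ε-insensitive loss, which charges nothing for errors inside a tube of width ε. A minimal sketch, with ε = 0.5 chosen arbitrarily:

```python
# Epsilon-insensitive loss: errors smaller than eps cost nothing,
# larger errors grow linearly. eps = 0.5 is an illustrative choice.

def eps_insensitive(y, f, eps=0.5):
    return max(0.0, abs(y - f) - eps)

print(eps_insensitive(1.0, 1.3))   # inside the tube -> 0.0
print(eps_insensitive(1.0, 2.0))   # |1 - 2| - 0.5 = 0.5
```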
Solution
Introduce slack variables $\xi_i, \xi_i^*$ and minimise
$\Phi(\mathbf{w}, \boldsymbol{\xi}, \boldsymbol{\xi}^*) = \frac{1}{2} \langle \mathbf{w}, \mathbf{w} \rangle + C \sum_{i=1}^{l} \xi_i + C \sum_{i=1}^{l} \xi_i^*,$
subject to the constraints
$y_i - \langle \mathbf{w}, \mathbf{x}_i \rangle - b \leq \varepsilon + \xi_i,$
$\langle \mathbf{w}, \mathbf{x}_i \rangle + b - y_i \leq \varepsilon + \xi_i^*,$
$\xi_i \geq 0, \quad \xi_i^* \geq 0, \quad i = 1, \ldots, l.$
Finding the Solution
A Quadratic Programming problem:
$\min_{\mathbf{x}} \; \frac{1}{2} \mathbf{x}^T H \mathbf{x} + c^T \mathbf{x}, \quad \text{where } \mathbf{x} = (\boldsymbol{\alpha}, \boldsymbol{\alpha}^*)^T,$
subject to
$\sum_{i=1}^{l} (\alpha_i - \alpha_i^*) = 0, \quad 0 \leq \alpha_i \leq C, \quad 0 \leq \alpha_i^* \leq C, \quad i = 1, \ldots, l.$
Size is dependent upon training set size
Unique global minimum
Part I: Summary
Unique Global Minimum
Addresses Curse of Dimensionality
Complexity dependent upon data set size
Information contained in Support Vectors
Cyclic Nature of Empirical Modelling
Induce → Validate → Interpret → Design
Induction: SVMs have strong theory,
good empirical performance,
and a solution of the form
$f(\mathbf{x}) = \sum_i \alpha_i K(\mathbf{x}, \mathbf{x}_i).$
Interpretation: input selection,
transparency,
SVs
Additive Representation
$f(\mathbf{x}) = f_0 + \sum_{i=1}^{n} f_i(x_i) + \sum_{i=1}^{n} \sum_{j > i}^{n} f_{i,j}(x_i, x_j) + \ldots + f_{1,2,\ldots,n}(x_1, x_2, \ldots, x_n)$
Additive structure
Transparent
Rejection of redundant inputs
Unique decomposition
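A toy evaluation of the additive structure for $n = 2$, $f(\mathbf{x}) = f_0 + f_1(x_1) + f_2(x_2) + f_{1,2}(x_1, x_2)$; every component function here is invented purely for illustration.

```python
# Hypothetical additive (ANOVA) decomposition for two inputs:
# constant + two univariate terms + one bivariate interaction term.

def f0():
    return 1.0

def f1(x1):
    return 2.0 * x1

def f2(x2):
    return x2 ** 2

def f12(x1, x2):
    return 0.5 * x1 * x2

def f(x):
    x1, x2 = x
    return f0() + f1(x1) + f2(x2) + f12(x1, x2)

print(f((1.0, 2.0)))   # 1 + 2 + 4 + 1 = 8.0
```

Each term can be plotted or inspected on its own, which is what gives the representation its transparency.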
Sparse Kernel Regression
Previously,
$f(\mathbf{x}) = \sum_i \alpha_i K(\mathbf{x}, \mathbf{x}_i).$
Now,
$f(\mathbf{x}) = \sum_i \alpha_i \sum_j c_j K_j(\mathbf{x}, \mathbf{x}_i), \quad c_j \geq 0.$
The Priors
“Different priors for different parameters”
Smoothness – controls “overfitting”
Sparseness – enables input selection and controls overfitting
Sparse Kernel Model
Replace the kernel with a weighted linear sum of kernels,
$f(\mathbf{x}) = \sum_i \alpha_i \sum_j c_j K_j(\mathbf{x}, \mathbf{x}_i), \quad c_j \geq 0,$
and minimise the number of non-zero multipliers, along with the standard support vector optimisation:
$\min_{\mathbf{c}} \; L(\mathbf{y}, K\mathbf{c}) + \lambda \| \mathbf{c} \|_0$ (solution sparse, optimisation hard)
$\min_{\mathbf{c}} \; L(\mathbf{y}, K\mathbf{c}) + \lambda \| \mathbf{c} \|_1$ (solution sparse, optimisation easier)
$\min_{\mathbf{c}} \; L(\mathbf{y}, K\mathbf{c}) + \lambda \| \mathbf{c} \|_2^2$ (solution NOT sparse, optimisation easier)
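The contrast between the 1-norm and 2-norm penalties can be seen in one dimension: minimising $\frac{1}{2}(c - a)^2 + \lambda |c|$ has the soft-threshold solution, which is exactly zero for small $a$, while the 2-norm penalty only shrinks $a$ towards zero. This is a standard illustration, not taken from the slides.

```python
# 1-D illustration of why the 1-norm penalty yields sparsity and the
# 2-norm penalty does not.

def l1_minimiser(a, lam):
    # argmin_c 0.5*(c - a)^2 + lam*|c|  -> soft threshold
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    return 0.0                         # exact zero: a sparse solution

def l2_minimiser(a, lam):
    # argmin_c 0.5*(c - a)^2 + lam*c^2 -> pure shrinkage
    return a / (1.0 + 2.0 * lam)       # never exactly zero for a != 0

print(l1_minimiser(0.3, 0.5))   # 0.0  (coefficient killed)
print(l2_minimiser(0.3, 0.5))   # 0.15 (coefficient only shrunk)
```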
Choosing the Sub-Kernels
• Avoid additional parameters if possible
• Sub-models should be flexible
Tensor Product Splines
The univariate spline which passes through the origin has a kernel of the form
$k(u, v) = uv + \frac{uv \min(u, v)}{2} - \frac{\min(u, v)^3}{6}.$
E.g. for a two-input problem the ANOVA kernel is given by
$K_{ANOVA}^2(\mathbf{u}, \mathbf{v}) = 1 + k_1(u_1, v_1) + k_2(u_2, v_2) + k_1(u_1, v_1) \, k_2(u_2, v_2),$
and the multivariate ANOVA kernel is given by
$K_{ANOVA}(\mathbf{u}, \mathbf{v}) = \prod_{d=1}^{n} \left( 1 + k_d(u_d, v_d) \right).$
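The snippet below implements the univariate spline kernel as reconstructed above and checks that the tensor-product form $\prod_d (1 + k_d)$ expands to the two-input ANOVA kernel $1 + k_1 + k_2 + k_1 k_2$; the test points are arbitrary.

```python
# Univariate spline kernel (through the origin) and the two-input ANOVA
# expansion check. Test points are hypothetical.

def spline_k(u, v):
    m = min(u, v)
    return u * v + u * v * m / 2.0 - m ** 3 / 6.0

def anova_product(u, v):
    out = 1.0
    for ud, vd in zip(u, v):
        out *= 1.0 + spline_k(ud, vd)
    return out

def anova_expanded(u, v):
    k1 = spline_k(u[0], v[0])
    k2 = spline_k(u[1], v[1])
    return 1.0 + k1 + k2 + k1 * k2

u, v = (0.2, 0.7), (0.5, 0.4)
print(abs(anova_product(u, v) - anova_expanded(u, v)) < 1e-12)   # True
```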
Sparse ANOVA Kernel
Introduce multipliers for each ANOVA term,
$K(\mathbf{u}, \mathbf{v}) = c_0 + c_1 k_1(u_1, v_1) + c_2 k_2(u_2, v_2) + c_3 k_1(u_1, v_1) \, k_2(u_2, v_2),$
and minimise the number of non-zero multipliers, along with the standard support vector optimisation:
$\min_{\mathbf{c}} \; L(\mathbf{y}, K\mathbf{c}) + \lambda \| \mathbf{c} \|_1.$
Algorithm
A 3+ stage technique:
Data → ANOVA Basis Selection → Sparse ANOVA Model Selection → Parameter Selection
Each stage consists of solving a convex, constrained optimisation problem. (QP or LP)
Auto-selection of parameters:
Capacity control parameter: cross-validation
Sparseness parameter: validation error (Stage I)
Sparse Basis Solution
Quadratic loss function (Quadratic Program):
$\min_{\mathbf{c}} \; \| \mathbf{y} - K\mathbf{c} \|_2^2 + \lambda \| \mathbf{c} \|_1 \quad \text{subject to} \quad c_i \geq 0,$
equivalently
$\min_{\mathbf{c}} \; \frac{1}{2} \mathbf{c}^T K^T K \mathbf{c} - \mathbf{y}^T K \mathbf{c} + \lambda \mathbf{1}^T \mathbf{c} \quad \text{subject to} \quad c_i \geq 0.$
ε-insensitive loss function (Linear Program):
$\min_{\mathbf{c}} \; \| \mathbf{y} - K\mathbf{c} \|_1^{\varepsilon} + \lambda \| \mathbf{c} \|_1 \quad \text{subject to} \quad c_i \geq 0,$
which with slack variables $\boldsymbol{\xi}$ becomes
$\min_{\mathbf{c}, \boldsymbol{\xi}} \; \lambda \mathbf{1}^T \mathbf{c} + \mathbf{1}^T \boldsymbol{\xi} \quad \text{subject to} \quad K\mathbf{c} - \mathbf{y} \leq \varepsilon \mathbf{1} + \boldsymbol{\xi}, \quad \mathbf{y} - K\mathbf{c} \leq \varepsilon \mathbf{1} + \boldsymbol{\xi}, \quad c_i \geq 0, \quad \xi_i \geq 0.$
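To make the two Stage-I objectives concrete, the sketch below evaluates the quadratic-loss and ε-insensitive objectives for a candidate multiplier vector on a toy problem; K, y, λ and ε are all illustrative, and a real run would hand these objectives to a QP or LP solver rather than evaluate them pointwise.

```python
# Evaluate the two Stage-I objectives for a hypothetical 2x2 problem.

def matvec(K, c):
    return [sum(kij * cj for kij, cj in zip(row, c)) for row in K]

def quadratic_objective(K, y, c, lam):
    r = [yi - fi for yi, fi in zip(y, matvec(K, c))]
    return sum(ri * ri for ri in r) + lam * sum(abs(ci) for ci in c)

def eps_objective(K, y, c, lam, eps):
    r = [abs(yi - fi) for yi, fi in zip(y, matvec(K, c))]
    return sum(max(0.0, ri - eps) for ri in r) + lam * sum(abs(ci) for ci in c)

K = [[1.0, 0.5], [0.5, 1.0]]
y = [1.0, 0.0]
c = [1.0, 0.0]
print(round(quadratic_objective(K, y, c, lam=0.1), 6))     # 0.35
print(round(eps_objective(K, y, c, lam=0.1, eps=0.3), 6))  # 0.3
```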
AMPG Problem
Predict automobile MPG (392 samples).
Inputs: no. of cylinders, displacement, horsepower, weight, acceleration, year.
Output: MPG.
[Figures: estimated additive ANOVA terms for the AMPG problem, plotting the MPG contribution against Horse Power, Weight, and Year.]
Network transparency through ANOVA representation.
SUPANOVA AMPG Results (ε = 2.5)
Estimated generalisation error by loss function (mean and variance):

Stage I loss    Stage III loss   Training (Mean, Var)   Testing (Mean, Var)   Linear Model (Mean, Var)
Quadratic       Quadratic        6.97, 7.39             7.08, 6.19            11.4, 11.0
ε-Insensitive   ε-Insensitive    0.48, 0.04             0.49, 0.03            1.80, 0.11
Quadratic       ε-Insensitive    1.10, 0.07             1.37, 0.10            -
ε-Insensitive   Quadratic        7.07, 6.52             7.13, 6.04            11.72, 10.94
AMPG Additive Terms

Term   Quadratic   ε-Insensitive   "Difference"
bias   50          50              0.00
C      3           1               0.08
D      35          8               0.66
H      2           20              0.44
W      50          50              0.00
Y      50          50              0.00
CD     9           26              0.54
CW     0           4               0.08
CA     1           11              0.24
CY     2           18              0.40
DW     35          44              0.38
DA     42          43              0.16
HY     10          5               0.18
WY     2           1               0.06
AY     50          47              0.06
CDW    0           1               0.02
CWA    0           1               0.02
CWY    0           1               0.02
CAY    0           7               0.14
DHW    1           2               0.06
HAY    50          49              0.02
WAY    0           4               0.08
CDWA   0           1               0.02
CDAY   4           0               0.08

(C: No. of Cylinders, D: Displacement, H: Horse Power, W: Weight, A: Acceleration, Y: Year. All remaining terms were zero.)
Summary
SUPANOVA is a global approach
Strong Basis (Kernel Methods)
Can control loss function and sparseness
Can impose limit on maximum variate terms
Generalisation + Transparency
Further Information
http://www.isis.ecs.soton.ac.uk/isystems/kernel/
SVM Technical Report
MATLAB SVM Toolbox
Sparse Kernel Paper
These Slides