Kernel-Based Methods
Presented by Jason Friedman and Lena Gorelick
Advanced Topics in Computer and Human Vision
Spring 2003
2
Agenda…
• Structural Risk Minimization (SRM)
• Support Vector Machines (SVM)
• Feature Space vs. Input Space
• Kernel PCA
• Kernel Fisher Discriminant Analysis (KFDA)
3
Agenda…
• Structural Risk Minimization (SRM)
• Support Vector Machines (SVM)
• Feature Space vs. Input Space
• Kernel PCA
• Kernel Fisher Discriminant Analysis (KFDA)
4
Structural Risk Minimization (SRM)
• Definition:
– A training set consists of l observations: (x₁,y₁), …, (x_l,y_l)
– Each observation is a pair: a pattern x_i ∈ Rⁿ and a label y_i ∈ {−1, +1}
– Example: 16×16 pixel digit images, so n = 16×16 = 256
5
Structural Risk Minimization (SRM)
• The task (“generalization”): find a mapping x ↦ y that performs well on unseen data
• Assumption: training and test data are drawn from the same probability distribution P(x,y), i.e. an unseen pair (x,y) is “similar” to (x₁,y₁), …, (x_l,y_l)
6
Structural Risk Minimization (SRM) – Learning Machine
• Definition:
– A learning machine is a family of functions {f(·,α)}, where α is a set of parameters.
– For a task of learning two classes, f(x,α) ∈ {−1, +1} ∀ x, α
• Example — the class of oriented lines in R²: f(x,α) = sign(α₁x₁ + α₂x₂ + α₃)  (a minimal sketch follows)
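To make the parametrized family concrete, here is a minimal Python sketch of the oriented-lines example (the function name and the parameter values are illustrative only):

```python
import numpy as np

def f(x, alpha):
    """One member of the family {f(., alpha)}: an oriented line in R^2."""
    a1, a2, a3 = alpha
    return np.sign(a1 * x[0] + a2 * x[1] + a3)

# Two parameter settings give two different classifiers from the same family.
x = np.array([1.0, 2.0])
print(f(x, (1.0, -1.0, 0.0)))   # -1.0
print(f(x, (-1.0, 1.0, 0.5)))   #  1.0
```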
7
Structural Risk Minimization (SRM) – Capacity vs. Generalization
• Definition:
– Capacity of a learning machine measures the ability to learn any training set without error.
– Too much capacity → overfitting (e.g. asking “Does it have the same number of leaves?”)
– Too little capacity → underfitting (e.g. asking “Is the color green?”)
8
Structural Risk Minimization (SRM) – Capacity vs. Generalization
• For small sample sizes, overfitting or underfitting might occur
• Best generalization = the right balance between accuracy and capacity
9
Structural Risk Minimization (SRM) – Capacity vs. Generalization
• Solution:
Restrict the complexity (capacity) of the function class.
• Intuition: a “simple” function that explains most of the data is preferable to a “complex” one.
10
Structural Risk Minimization (SRM) – VC Dimension
• What is a “simple”/“complex” function?
• Definition:
– Given l points (they can be labeled in 2^l ways)
– The set of points is shattered by the function class {f(·,α)} if for each labeling there is a member of the class that assigns those labels correctly.
11
Structural Risk Minimization (SRM) – VC Dimension
• Definition:
– The VC dimension of {f(·,α)} is the maximum number of points that can be shattered by {f(·,α)}; it is a measure of capacity.
12
Structural Risk Minimization (SRM) – VC Dimension
• Theorem: the VC dimension of the set of oriented hyperplanes in Rⁿ is n + 1 (a small brute-force check follows below)
• So for hyperplanes, a low number of parameters ⇒ a low VC dimension (this implication does not hold for arbitrary function classes)
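As a sanity check on the theorem in R² (VC dimension 3), the following sketch brute-forces all labelings with a hard-margin linear SVM; it assumes scikit-learn is available, and `can_shatter` is just an illustrative helper:

```python
import itertools
import numpy as np
from sklearn.svm import SVC

def can_shatter(points):
    """True if oriented lines in R^2 can realize every labeling of `points`."""
    for labels in itertools.product([-1, 1], repeat=len(points)):
        if len(set(labels)) == 1:
            continue  # constant labelings are trivially realizable by a constant sign
        clf = SVC(kernel="linear", C=1e6)  # hard-margin approximation
        clf.fit(points, labels)
        if clf.score(points, labels) < 1.0:
            return False
    return True

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])       # non-collinear
four_xor = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], float)  # XOR layout
print(can_shatter(three))     # True:  3 points can be shattered
print(can_shatter(four_xor))  # False: consistent with VC dimension 3 in R^2
```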
13
Structural Risk Minimization (SRM) – Bounds
• Definition:
– Actual risk: R(α) = ∫ ½ |y − f(x,α)| dP(x,y)
• Goal: minimize R(α)
• But we cannot measure the actual risk, since we do not know P(x,y)
14
Structural Risk Minimization (SRM) – Bounds
• Definition:
– Empirical risk: R_emp(α) = (1/2l) Σ_{i=1..l} |y_i − f(x_i,α)|
• R_emp(α) → R(α) as l → ∞, but for a small training set large deviations might occur
15
Structural Risk Minimization (SRM) – Bounds
• Risk bound: with probability 1 − η,
R(α) ≤ R_emp(α) + √( ( h (log(2l/h) + 1) − log(η/4) ) / l )
where h is the VC dimension of the function class and the square-root term is the confidence term (computed below)
• Note: the bound is independent of P(x,y)
• The bound is not valid for infinite VC dimension
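A small numeric sketch of the confidence term, using the bound as reconstructed above (`vc_confidence` is an illustrative name, not from the talk):

```python
import numpy as np

def vc_confidence(h, l, eta=0.05):
    """Confidence term sqrt((h*(log(2l/h) + 1) - log(eta/4)) / l)."""
    return np.sqrt((h * (np.log(2 * l / h) + 1) - np.log(eta / 4)) / l)

# The term grows with the VC dimension h and shrinks with the sample size l.
for h in (10, 100, 500):
    print(h, round(vc_confidence(h, l=1000), 3))
```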
16
Structural Risk Minimization (SRM) – Bounds
17
Structural Risk Minimization (SRM) – Bounds
18
Structural Risk Minimization (SRM) – Principled Method
• A principled method for choosing a learning machine for a given task:
19
SRM
• Divide the class of functions into nested subsets
• Either calculate h for each subset, or get a bound on it
• Train each subset to achieve minimal empirical error
• Choose the subset with the minimal risk bound
(Figure: nested subsets of increasing complexity and the resulting risk bound)
20
Agenda…
• Structural Risk Minimization (SRM)
• Support Vector Machines (SVM)
• Feature Space vs. Input Space
• Kernel PCA
• Kernel Fisher Discriminant Analysis (KFDA)
21
Support Vector Machines (SVM)
• Currently the “en vogue” approach to classification
• Successful applications in bioinformatics, text, handwriting recognition, image processing
• Introduced by Boser, Guyon and Vapnik, 1992
• SVMs are a particular instance of kernel machines
22
Linear SVM – Separable Case
• Two given classes are linearly separable
23
Linear SVM – Definitions
• Separating hyperplane H: w·x + b = 0
• w is normal to H
• |b|/||w|| is the perpendicular distance from H to the origin
• d+ (d-) is the shortest distance from H to the closest positive (negative) point.
24
Linear SVM – Definitions
25
Linear SVM – Definitions
• If H is a separating hyperplane, then x_i·w + b > 0 for y_i = +1 and x_i·w + b < 0 for y_i = −1
• Define H1: x·w + b = +1 and H2: x·w + b = −1, parallel to H and passing through the closest positive and negative points
• No training points fall between H1 and H2
26
Linear SVM – Definitions
• By scaling w and b, we can require that
x_i·w + b ≥ +1 for y_i = +1
x_i·w + b ≤ −1 for y_i = −1
or more simply: y_i (x_i·w + b) − 1 ≥ 0 ∀ i
• Equality holds ⇔ x_i lies on H1 or H2
27
Linear SVM – Definitions
• Note: w is no longer a unit vector
• The margin (the distance between H1 and H2) is now 2 / ‖w‖ (derived below)
• Find the hyperplane with the largest margin
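The margin value follows in one line from the constraints above (a standard derivation, added here for completeness):

```latex
% Distance from a point x1 on H1 (w.x + b = +1) to H (w.x + b = 0), and likewise for H2:
d_+ = \frac{|w \cdot x_1 + b|}{\|w\|} = \frac{1}{\|w\|}, \qquad
d_- = \frac{1}{\|w\|}
\quad\Rightarrow\quad
\text{margin} = d_+ + d_- = \frac{2}{\|w\|}.
```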
28
Linear SVM – Maximizing the Margin
• Maximizing the margin ⇔ minimizing ‖w‖²
• ⇒ more room for unseen points to fall on the correct side
• ⇒ restricts the capacity: the VC dimension of large-margin hyperplanes is bounded in terms of R² ‖w‖², where R is the radius of the smallest ball around the data
29
Linear SVM – Constrained Optimization
• Introduce Lagrange multipliers α_i ≥ 0, one per constraint
• “Primal” formulation:
L_P = ½ ‖w‖² − Σ_i α_i y_i (x_i·w + b) + Σ_i α_i
• Minimize L_P with respect to w and b, requiring α_i ≥ 0
30
Linear SVM – Constrained Optimization
• The objective function is quadratic
• Each linear constraint defines a convex set
• The intersection of convex sets is a convex set
• ⇒ we can formulate the “Wolfe dual” problem
31
• Maximize L_P with respect to the α_i
• Require that the gradient of L_P with respect to w and b vanish:
w = Σ_i α_i y_i x_i,   Σ_i α_i y_i = 0
• Substituting back into L_P gives the dual:
L_D = Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j (x_i · x_j)
• Maximize L_D with respect to the α_i
Linear SVM – Constrained Optimization – The Solution:
w = Σ_i α_i y_i x_i, and b follows from any support vector via y_i (x_i·w + b) = 1
33
Linear SVM – Constrained Optimization
• Using the Karush-Kuhn-Tucker conditions: α_i ( y_i (x_i·w + b) − 1 ) = 0 ∀ i
• If α_i > 0, then x_i lies either on H1 or on H2
⇒ the solution is sparse in the α_i
• Those training points are called “support vectors”; their removal would change the solution
34
SVM – Test Phase
• Given an unseen sample x, we take its class to be
f(x) = sign( w·x + b ) = sign( Σ_i α_i y_i (x_i·x) + b )  (checked against a library solver below)
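A brief sketch verifying that this test-phase rule matches what a standard solver produces; it assumes scikit-learn (whose SVC stores α_i y_i in `dual_coef_`), and the blob data is only a placeholder:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=40, centers=2, random_state=0)
y = np.where(y == 0, -1, 1)

clf = SVC(kernel="linear", C=1e3).fit(X, y)

# sign( sum_i alpha_i y_i (x_i . x) + b ), summing over the support vectors only.
x_new = X[0]
manual = np.sign(clf.dual_coef_[0] @ (clf.support_vectors_ @ x_new) + clf.intercept_[0])
print(manual == clf.predict([x_new])[0])  # True
```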
35
Linear SVM – Non-Separable Case
• The separable case corresponds to an empirical risk of zero
• For noisy data this need not minimize the actual risk (overfitting)
• The formulation above has no feasible solution in the non-separable case
36
Linear SVM – Non-Separable Case
• Relax the constraints by introducing positive slack variables ξ_i ≥ 0:
x_i·w + b ≥ +1 − ξ_i for y_i = +1
x_i·w + b ≤ −1 + ξ_i for y_i = −1
• Σ_i ξ_i is an upper bound on the number of training errors
37
Linear SVM – Non-Separable Case
• Assign an extra cost to errors
• Minimize ½ ‖w‖² + C Σ_i ξ_i
where C is a penalty parameter chosen by the user
38
Linear SVM – Non-Separable Case
• Lagrange formulation again, with an additional Lagrange multiplier μ_i for each ξ_i ≥ 0 constraint
• “Wolfe dual” problem — maximize:
L_D = Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j (x_i · x_j)
subject to: 0 ≤ α_i ≤ C and Σ_i α_i y_i = 0
• The solution: w = Σ_i α_i y_i x_i
40
Linear SVM – Non-Separable Case
• Using the Karush-Kuhn-Tucker conditions:
α_i ( y_i (x_i·w + b) − 1 + ξ_i ) = 0,   (C − α_i) ξ_i = 0
• The solution is again sparse in the α_i
41
Nonlinear SVM
• A nonlinear decision function might be needed
42
Nonlinear SVM – Feature Space
• Map the data to a high-dimensional (possibly infinite-dimensional) feature space F via Φ: R^d → F
• The dual solution depends only on dot products Φ(x_i)·Φ(x_j)
• If there were a function k(x_i, x_j) = Φ(x_i)·Φ(x_j) (a worked example follows),
⇒ no need to know Φ explicitly
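A worked instance of such a k (the standard quadratic map, shown here as an illustration): for x, y ∈ R², take Φ: R² → R³ with

```latex
\Phi(x) = \bigl(x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2\bigr)
\;\Longrightarrow\;
\Phi(x)\cdot\Phi(y)
  = x_1^2 y_1^2 + 2\,x_1 x_2\, y_1 y_2 + x_2^2 y_2^2
  = (x \cdot y)^2 = k(x,y),
```

so the feature-space dot product is computed entirely in the two-dimensional input space.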
43
Nonlinear SVM – Toy Example
(Figure: the data in input space and the mapped data in feature space)
44
Nonlinear SVM – Avoiding the Curse
• Curse of dimensionality: the difficulty of an estimation problem increases drastically with the dimension
• But! Learning in F may still be easy if one uses a low-complexity function class (hyperplanes)
45
Nonlinear SVM – Kernel Functions
• Kernel functions exist!
– They effectively compute dot products in feature space
• A kernel can be used without knowing Φ and F explicitly
• Given a kernel, Φ and F are not unique
• The F of smallest dimension is called the minimal embedding space
46
Nonlinear SVM – Kernel Functions
• Mercer’s condition: there exists a pair {Φ, F} with k(x,y) = Φ(x)·Φ(y)
iff for every g(x) such that ∫ g(x)² dx is finite,
∫∫ k(x,y) g(x) g(y) dx dy ≥ 0
47
Nonlinear SVM – Kernel Functions
• Formulation of the algorithm in terms of kernels: replace every dot product x_i·x_j in L_D by k(x_i, x_j), and classify with f(x) = sign( Σ_i α_i y_i k(x_i, x) + b )
48
Nonlinear SVM – Kernel Functions
• Kernels frequently used:
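The choices most commonly quoted in the SVM literature (the polynomial and Gaussian RBF kernels are the ones referred to later in this talk; the sigmoid kernel satisfies Mercer’s condition only for some parameter values):

```latex
k(x,y) = (x \cdot y + 1)^p, \qquad
k(x,y) = \exp\!\left(-\frac{\|x-y\|^2}{2\sigma^2}\right), \qquad
k(x,y) = \tanh(\kappa\, x \cdot y - \delta).
```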
49
Nonlinear SVM – Feature Space
• A hyperplane {w, b} in F requires dim(F) + 1 parameters
• Solving the SVM means adjusting only l + 1 parameters
• Example: d = 256, polynomial degree p = 4 ⇒ dim(F) = 183,181,376 (verified below)
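That dimension is the number of distinct degree-p monomials in d variables, C(d+p−1, p); a quick check:

```python
from math import comb

d, p = 256, 4
print(comb(d + p - 1, p))  # 183181376 monomials of degree 4 in 256 variables
```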
50
SVM – Solution
• L_D is convex ⇒ the solution is global
• Two types of non-uniqueness:
– {w, b} is not unique
– {w, b} is unique, but the set {α_i} is not; prefer the set with fewer support vectors (sparser)
51
Nonlinear SVM – Toy Example
52
Nonlinear SVM – Toy Example
53
Nonlinear SVM – Toy Example
54
Methods of Solution
• Quadratic programming packages
• Chunking
• Decomposition methods
• Sequential minimal optimization (SMO)
55
Applications – SVM
Extracting Support Data for a Given Task
B. Schölkopf, C. Burges, V. Vapnik
In Proceedings, First International Conference on Knowledge Discovery & Data Mining, 1995
56
Applications – SVM
• Input (USPS handwritten digits, 16×16 pixels):
– Training set: 7300
– Testing set: 2000
• Constructed:
– 10 class/non-class SVM classifiers (one per digit)
– Take the class with maximal output
57
Applications – SVM
• Three different types of classifiers for each digit:
58
Applications – SVM
59
Applications – SVM – Predicting the Optimal Decision Function via SRM
60
Applications – SVM
61
Agenda…
• Structural Risk Minimization (SRM)
• Support Vector Machines (SVM)
• Feature Space vs. Input Space
• Kernel PCA
• Kernel Fisher Discriminant Analysis (KFDA)
62
Feature Space vs. Input Space
• Suppose the solution is a linear combination of mapped points in feature space: w = Σ_i α_i Φ(x_i)
• One cannot generally say that such a point has a preimage z with Φ(z) = w
63
Feature Space vs. Input Space
• If there exists z such that Φ(z) = w, and k is an invertible function f_k of the dot product, k(x,y) = f_k(x·y), then z can be computed as
z = Σ_{j=1..N} f_k⁻¹( Σ_i α_i k(x_i, e_j) ) e_j
where {e₁, …, e_N} is an orthonormal basis of the input space.
64
Feature Space vs. Input Space
• The polynomial kernel k(x,y) = (x·y)^p is invertible when p is odd (t ↦ t^p is then invertible on R)
• The preimage of w is then given by the formula above
65
Feature Space vs. Input Space
• In general, the preimage cannot be found exactly
• Instead, look for an approximation z such that ‖Φ(z) − w‖² is small
66
Agenda…
• Structural Risk Minimization (SRM)
• Support Vector Machines (SVM)
• Feature Space vs. Input Space
• Kernel PCA
• Kernel Fisher Discriminant Analysis (KFDA)
67
PCA
• Regular PCA: find the direction u such that projecting the n points in d dimensions onto u gives the largest variance
• u is an eigenvector of the covariance matrix: Cu = λu
68
Kernel PCA
• Extension to feature space:
– compute the covariance matrix C = (1/l) Σ_i Φ(x_i) Φ(x_i)ᵀ (assuming centered data)
– solve the eigenvalue problem CV = λV
⇒ every eigenvector V with λ ≠ 0 lies in the span of Φ(x₁), …, Φ(x_l)
69
Kernel PCA
– Express V in terms of dot products: V = Σ_i α_i Φ(x_i)
– The problem then becomes: l λ α = K α
where K_ij = k(x_i, x_j) is the Gram matrix
• This eigenproblem is l × l rather than d × d
70
Kernel PCA – Extracting Features
• To extract features of a new pattern x with kernel PCA, project it onto the first n eigenvectors (a numpy sketch follows):
(Vᵏ · Φ(x)) = Σ_i α_iᵏ k(x_i, x),   k = 1, …, n
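A compact numpy sketch of the l × l eigenproblem above, using an RBF kernel and computing projections for the training set only (all names are illustrative):

```python
import numpy as np

def rbf_gram(X, sigma=1.0):
    """Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kernel_pca(X, n_components=2, sigma=1.0):
    """Kernel PCA: expansion coefficients alpha and training-set projections."""
    l = len(X)
    K = rbf_gram(X, sigma)
    # Center the data in feature space by centering the Gram matrix.
    O = np.full((l, l), 1.0 / l)
    Kc = K - O @ K - K @ O + O @ K @ O
    # l x l eigenproblem instead of a (possibly infinite-dimensional) one in F.
    mu, A = np.linalg.eigh(Kc)                       # ascending eigenvalues
    mu, A = mu[::-1][:n_components], A[:, ::-1][:, :n_components]
    A = A / np.sqrt(np.maximum(mu, 1e-12))           # normalize so each V^k has unit norm
    # Projection of training point i onto V^k is sum_j A[j, k] * Kc[i, j].
    return A, Kc @ A

X = np.random.RandomState(0).randn(100, 2)
alpha, features = kernel_pca(X, n_components=2, sigma=1.0)
print(features.shape)  # (100, 2)
```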
71
Kernel PCA
• Kernel PCA is used for:
– De-noising
– Compression
– Interpretation (visualization)
– Extracting features for classifiers
72
Kernel PCA – Toy Example
• Regular PCA:
73
Kernel PCA – Toy Example
74
Applications – Kernel PCA
Kernel PCA Pattern Reconstruction via Approximate Pre-Images
B. Schölkopf, S. Mika, A. Smola, G. Rätsch, and K.-R. Müller.
In L. Niklasson, M. Bodén, and T. Ziemke, editors, Proceedings of the 8th International Conference on Artificial Neural Networks,
Perspectives in Neural Computing, pages 147-152, Berlin,
1998. Springer Verlag.
75
Applications – Kernel PCA
• Recall: Vᵏ = Σ_i α_iᵏ Φ(x_i)
• Φ(x) can be reconstructed from its principal components
• Projection operator onto the first n components:
P_n Φ(x) = Σ_{k=1..n} ( Φ(x) · Vᵏ ) Vᵏ
76
Applications – Kernel PCA – Denoising
• When P_n Φ(x) ≠ Φ(x),
⇒ the existence of an exact pre-image cannot be guaranteed
• The first n directions capture the main structure; the remaining directions mostly capture noise
77
Applications – Kernel PCA
• Find the z that minimizes ‖Φ(z) − P_n Φ(x)‖²
• Use gradient descent, starting from z = x (sketched below)
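For an RBF kernel, k(z,z) is constant, so (ignoring the feature-space centering terms for brevity) minimizing ‖Φ(z) − P_nΦ(x)‖² amounts to maximizing ρ(z) = Σ_i γ_i k(z, x_i), where the γ_i collect the expansion coefficients of P_nΦ(x) over the training points. A hedged sketch of the resulting gradient-ascent step, assuming those γ_i have already been computed from the kernel PCA solution:

```python
import numpy as np

def rbf(z, X, sigma):
    return np.exp(-((X - z) ** 2).sum(axis=1) / (2.0 * sigma ** 2))

def denoise_preimage(x, X, gamma, sigma, lr=0.5, steps=200):
    """Gradient ascent on rho(z) = sum_i gamma_i k(z, x_i), starting from z = x."""
    z = x.astype(float).copy()
    for _ in range(steps):
        w = gamma * rbf(z, X, sigma)                          # gamma_i * k(z, x_i)
        grad = (w[:, None] * (X - z)).sum(axis=0) / sigma**2  # d rho / dz
        z = z + lr * grad
    return z
```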
78
Applications – Kernel PCA
• Input toy data: 3 point sources (100 points each) with Gaussian noise, σ = 0.1
• Using an RBF kernel
79
Applications – Kernel PCA
80
Applications – Kernel PCA
81
Applications – Kernel PCA
82
Applications – Kernel PCA
Compare to the linear PCA
83
Applications – Kernel PCA – Real-World Data
84
Agenda…
• Structural Risk Minimization (SRM)
• Support Vector Machines (SVM)
• Feature Space vs. Input Space
• Kernel PCA
• Kernel Fisher Discriminant Analysis (KFDA)
85
Fisher Linear Discriminant
• Finds a direction w such that, when projected onto it, the classes are “best” separated
86
Fisher Linear Discriminant
• For “best” separation, maximize
J = (μ̃₁ − μ̃₂)² / (s̃₁² + s̃₂²)
where μ̃_i is the projected mean of class i and s̃_i is the projected standard deviation
87
Fisher Linear Discriminant
• Equivalent to finding the w that maximizes:
J(w) = (wᵀ S_B w) / (wᵀ S_W w)
where S_B = (m₁ − m₂)(m₁ − m₂)ᵀ is the between-class scatter and S_W = Σ_{i=1,2} Σ_{x ∈ C_i} (x − m_i)(x − m_i)ᵀ is the within-class scatter
88
Fisher Linear Discriminant
• The solution is given by: w ∝ S_W⁻¹ (m₁ − m₂)  (sketched below)
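A minimal numpy sketch of that closed-form solution, on two Gaussian blobs used purely as placeholder data:

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher linear discriminant: w proportional to S_W^{-1} (m1 - m2)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)  # within-class scatter
    w = np.linalg.solve(S_W, m1 - m2)
    return w / np.linalg.norm(w)

rng = np.random.RandomState(0)
X1 = rng.randn(50, 2) + [2.0, 0.0]
X2 = rng.randn(50, 2) + [-2.0, 0.0]
print(fisher_direction(X1, X2))  # roughly aligned with the x-axis
```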
89
Kernel Fisher Discriminant
• Kernel formulation: maximize
J(w) = (wᵀ S_B^Φ w) / (wᵀ S_W^Φ w)
where S_B^Φ and S_W^Φ are the between- and within-class scatter matrices of the mapped data Φ(x₁), …, Φ(x_l)
90
Kernel Fisher Discriminant
• From the theory of reproducing kernels, w can be expanded as w = Σ_i α_i Φ(x_i)
• Substituting this into J(w) reduces the problem to maximizing:
J(α) = (αᵀ M α) / (αᵀ N α)
91
Kernel Fisher Discriminant
where M and N are the between- and within-class scatter matrices expressed through kernel evaluations only: M = (M₁ − M₂)(M₁ − M₂)ᵀ with (M_i)_j = (1/l_i) Σ_k k(x_j, x_k^i), and N = Σ_{i=1,2} K_i (I − 1_{l_i}) K_iᵀ
92
Kernel Fisher Discriminant
• The solution is found by solving the generalized eigenproblem M α = λ N α (a sketch follows)
• The projection of a new pattern x is then: w · Φ(x) = Σ_i α_i k(x_i, x)
• Find a suitable threshold on this projection to classify
• The problem is l × l rather than d × d
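A sketch of that construction following the Mika et al. formulation (M has rank one, so α can be taken as N⁻¹(M₁ − M₂); the small ridge on N and all names are my own additions for numerical stability and illustration):

```python
import numpy as np

def rbf_gram(X, Y, sigma=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kfd(X1, X2, sigma=1.0, reg=1e-3):
    """Kernel Fisher discriminant: expansion coefficients alpha and a scoring function."""
    X = np.vstack([X1, X2])
    l, l1, l2 = len(X), len(X1), len(X2)
    K1, K2 = rbf_gram(X, X1, sigma), rbf_gram(X, X2, sigma)  # l x l1, l x l2
    M1, M2 = K1.mean(axis=1), K2.mean(axis=1)                # (M_i)_j = mean_k k(x_j, x_k^i)
    # Within-class matrix N = sum_i K_i (I - 1_{l_i}) K_i^T, plus a ridge for stability.
    N = (K1 @ (np.eye(l1) - np.full((l1, l1), 1.0 / l1)) @ K1.T
         + K2 @ (np.eye(l2) - np.full((l2, l2), 1.0 / l2)) @ K2.T
         + reg * np.eye(l))
    alpha = np.linalg.solve(N, M1 - M2)
    # Projection of a new pattern: w . Phi(x) = sum_i alpha_i k(x_i, x).
    score = lambda x_new: rbf_gram(np.atleast_2d(x_new), X, sigma)[0] @ alpha
    return alpha, score

rng = np.random.RandomState(0)
X1 = rng.randn(30, 2) + [1.5, 0.0]
X2 = rng.randn(30, 2) + [-1.5, 0.0]
alpha, score = kfd(X1, X2)
print(score(X1[0]) > score(X2[0]))  # typically True: class-1 points score higher
```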
93
Kernel Fisher Discriminant – Constrained Optimization
• The problem can also be formulated as a constrained optimization
94
Kernel Fisher Discriminant – Toy Example
(Figure panels: KFDA projection, KPCA 1st eigenvector, KPCA 2nd eigenvector)
95
Applications – Fisher Discriminant Analysis
Fisher Discriminant Analysis with Kernels
S. Mika et al.
In Y.-H. Hu, J. Larsen, E. Wilson, and S. Douglas, editors, Neural Networks for Signal Processing IX, pages 41-48. IEEE, 1999.
96
Applications – Fisher Discriminant Analysis
• Input (USPS handwritten digits):
– Training set: 3000
• Constructed:
– 10 class/non-class KFD classifiers
– Take the class with maximal output
97
Applications – Fisher Discriminant Analysis
• Results: 3.7% error on the ten-class task, using an RBF kernel with width parameter 0.3 · 256
• Compare to 4.2% using SVM
• KFDA vs. SVM
98
Summary…
• Structural Risk Minimization (SRM)
• Support Vector Machines (SVM)
• Feature Space vs. Input Space
• Kernel PCA
• Kernel Fisher Discriminant Analysis (KFDA)