
Page 1:

Kernel-Based Methods

Presented by
Jason Friedman and Lena Gorelick

Advanced Topics in Computer and Human Vision
Spring 2003

Page 2:

Agenda…

• Structural Risk Minimization (SRM)
• Support Vector Machines (SVM)
• Feature Space vs. Input Space
• Kernel PCA
• Kernel Fisher Discriminant Analysis (KFDA)

Page 3:

Agenda…

• Structural Risk Minimization (SRM)
• Support Vector Machines (SVM)
• Feature Space vs. Input Space
• Kernel PCA
• Kernel Fisher Discriminant Analysis (KFDA)

Page 4:

Structural Risk Minimization (SRM)

• Definition:
– A training set consists of l observations: (x1, y1), …, (xl, yl)
– Each observation consists of a pair: a pattern xi ∈ Rn (e.g., a 16×16 = 256-pixel image) and a label yi ∈ {−1, 1}

Page 5:

Structural Risk Minimization (SRM)

• The task: “generalization” – find a mapping x ↦ y
• Assumption: training and test data are drawn from the same probability distribution p(x, y), i.e.
(x, y) is “similar” to (x1, y1), …, (xl, yl)

Page 6:

Structural Risk Minimization (SRM) – Learning Machine

• Definition:
– A learning machine is a family of functions {f(α)}, where α is a set of parameters.
– For the task of learning two classes, f(x, α) ∈ {−1, 1} ∀ x, α

Example: the class of oriented lines in R2, f(x, α) = sign(α1x1 + α2x2 + α3)
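To make the parameterized family concrete, here is a minimal Python sketch of the oriented-line machine (the function names are illustrative, not from the slides):

```python
import numpy as np

def oriented_line(alpha):
    # One member of the family {f(., alpha)}: an oriented line in R^2,
    # f(x, alpha) = sign(alpha1*x1 + alpha2*x2 + alpha3) in {-1, +1}.
    a1, a2, a3 = alpha
    return lambda x: int(np.sign(a1 * x[0] + a2 * x[1] + a3))

f = oriented_line((1.0, -1.0, 0.5))
print(f((2.0, 1.0)))  # 1: the point falls on the positive side of the line
```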

Page 7:

Structural Risk Minimization (SRM) – Capacity vs. Generalization

• Definition:
– The capacity of a learning machine measures its ability to learn any training set without error.

[Figure: too much capacity leads to overfitting (e.g., “Does it have the same # of leaves?”); too little capacity leads to underfitting (e.g., “Is the color green?”)]

Page 8:

Structural Risk Minimization (SRM) – Capacity vs. Generalization

• For small sample sizes, overfitting or underfitting may occur
• Best generalization = the right balance between accuracy and capacity

Page 9:

Structural Risk Minimization (SRM) – Capacity vs. Generalization

• Solution: restrict the complexity (capacity) of the function class.
• Intuition: a “simple” function that explains most of the data is preferable to a “complex” one.

Page 10:

Structural Risk Minimization (SRM) – VC Dimension

• What is a “simple”/“complex” function?
• Definition:
– Given l points (they can be labeled in 2^l ways)
– The set of points is shattered by the function class {f(α)} if for each labeling there is a function in the class which correctly assigns those labels.
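As a sanity check of the definition (a brute-force search, not a proof), the sketch below verifies that three points in general position in R2 are shattered by oriented lines; all names are illustrative:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # general position

def shattered_by_oriented_lines(points, trials=5000):
    # For each of the 2^l labelings, search for an oriented line
    # sign(a1*x1 + a2*x2 + a3) that realizes it.
    for labels in itertools.product([-1, 1], repeat=len(points)):
        if not any(np.array_equal(np.sign(points @ a[:2] + a[2]), labels)
                   for a in rng.normal(size=(trials, 3))):
            return False
    return True

print(shattered_by_oriented_lines(points))  # True, consistent with VC dim 3
```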

Page 11:

Structural Risk Minimization (SRM) – VC Dimension

• Definition:
– The VC dimension of {f(α)} is the maximum number of points that can be shattered by {f(α)}; it is a measure of capacity.

Page 12:

Structural Risk Minimization (SRM) – VC Dimension

• Theorem: the VC dimension of the set of oriented hyperplanes in Rn is n + 1.
• A low number of parameters ⇒ a low VC dimension

Page 13:

Structural Risk Minimization (SRM) – Bounds

• Definition: the actual risk is
R(α) = ∫ ½ |y − f(x, α)| dP(x, y)
• We want to minimize R(α)
• But we can’t measure the actual risk, since we don’t know p(x, y)

Page 14:

Structural Risk Minimization (SRM) – Bounds

• Definition: the empirical risk is
Remp(α) = (1 / 2l) Σi |yi − f(xi, α)|
• Remp(α) → R(α) as l → ∞, but for a small training set deviations may occur

Page 15:

Structural Risk Minimization (SRM) – Bounds

• Risk bound: with probability (1 − η),
R(α) ≤ Remp(α) + √( ( h (log(2l/h) + 1) − log(η/4) ) / l )
where h is the VC dimension of the function class; the square-root term is the confidence term.
• Note: the bound is independent of p(x, y)
• Not valid for infinite VC dimension
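The confidence term is simple to evaluate; a small sketch (η, h, l as on the slide; assuming h ≤ l so the logarithm stays positive):

```python
import math

def vc_confidence(h, l, eta=0.05):
    # sqrt( (h*(log(2l/h) + 1) - log(eta/4)) / l )
    return math.sqrt((h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l)

for h in (10, 100, 500):
    print(h, round(vc_confidence(h, l=1000), 3))  # grows with capacity h
```

It illustrates the trade-off behind SRM: for a fixed sample size, the bound loosens as the VC dimension grows.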

Page 16:

Structural Risk Minimization (SRM) – Bounds

Page 17:

Structural Risk Minimization (SRM) – Bounds

Page 18:

Structural Risk Minimization (SRM) – Principled Method

• A principled method for choosing a learning machine for a given task:

Page 19:

SRM

• Divide the class of functions into nested subsets
• Either calculate h for each subset, or get a bound on it
• Train each subset to achieve minimal empirical error
• Choose the subset with the minimal risk bound

[Figure: risk bound vs. complexity]

Page 20:

Agenda…

• Structural Risk Minimization (SRM)
• Support Vector Machines (SVM)
• Feature Space vs. Input Space
• Kernel PCA
• Kernel Fisher Discriminant Analysis (KFDA)

Page 21:

Support Vector Machines (SVM)

• Currently the “en vogue” approach to classification
• Successful applications in bioinformatics, text, handwriting recognition, and image processing
• Introduced by Boser, Guyon and Vapnik, 1992
• SVMs are a particular instance of kernel machines

Page 22:

Linear SVM – Separable Case

• The two given classes are linearly separable

Page 23:

Linear SVM – Definitions

• Separating hyperplane H: w · x + b = 0
• w is normal to H
• |b| / ||w|| is the perpendicular distance from H to the origin
• d+ (d−) is the shortest distance from H to the closest positive (negative) point

Page 24:

Linear SVM – Definitions

Page 25:

Linear SVM – Definitions

• If H is a separating hyperplane, then w · xi + b > 0 for yi = +1 and w · xi + b < 0 for yi = −1
• H1 and H2 are the hyperplanes parallel to H through the closest positive and negative points; no training points fall between H1 and H2

Page 26:

Linear SVM – Definitions

• By rescaling w and b, we can require that
w · xi + b ≥ +1 for yi = +1
w · xi + b ≤ −1 for yi = −1
or more simply: yi (w · xi + b) − 1 ≥ 0 ∀ i
• Equality holds ⇔ xi lies on H1 or H2

Page 27:

Linear SVM – Definitions

• Note: w is no longer a unit vector
• The margin is now 2 / ||w||
• Find the hyperplane with the largest margin.
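For a hyperplane in the canonical form above, the margin is easy to compute; a minimal numpy sketch (the data and names are illustrative):

```python
import numpy as np

def geometric_margin(w, b, X, y):
    # Distance from the hyperplane w.x + b = 0 to the closest point;
    # under the canonical scaling min_i y_i (w.x_i + b) = 1 this equals
    # 1/||w||, so the full margin between H1 and H2 is 2/||w||.
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)

X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1, 1, -1, -1])
print(geometric_margin(np.array([1.0, 1.0]), -1.0, X, y))  # ~2.121
```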

Page 28:

Linear SVM – Maximizing the Margin

• Maximizing the margin ⇔ minimizing ||w||²
• ⇒ more room for unseen points to fall into
• ⇒ restricts the capacity: the VC dimension of large-margin hyperplanes is bounded in terms of R²||w||², where R is the radius of the smallest ball around the data

Page 29:

Linear SVM – Constrained Optimization

• Introduce Lagrange multipliers αi ≥ 0
• “Primal” formulation:
LP = ½ ||w||² − Σi αi [yi (w · xi + b) − 1]
• Minimize LP with respect to w and b, requiring αi ≥ 0

Page 30:

Linear SVM – Constrained Optimization

• The objective function is quadratic
• The linear constraints define a convex set
• The intersection of convex sets is a convex set
• ⇒ can formulate the equivalent “Wolfe dual” problem

Page 31:

31

Linear SVM – Constrained Optimization

• Maximize LP with respect to the αi; require that the derivatives of LP with respect to w and b vanish:
∂LP/∂w = 0 ⇒ w = Σi αi yi xi (the solution for w)
∂LP/∂b = 0 ⇒ Σi αi yi = 0
• Substituting into LP gives the dual:
LD = Σi αi − ½ Σi,j αi αj yi yj (xi · xj)
• Maximize LD with respect to the αi
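The dual objective is straightforward to write down; a numpy sketch (illustrative; the constraints αi ≥ 0 and Σi αi yi = 0 are not enforced here):

```python
import numpy as np

def dual_objective(alpha, X, y):
    # L_D = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j (x_i . x_j)
    Yx = y[:, None] * X                      # row i is y_i * x_i
    return alpha.sum() - 0.5 * alpha @ (Yx @ Yx.T) @ alpha
```

Maximizing this over the feasible set (with any QP solver) yields the αi, and then w = Σi αi yi xi.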

Page 32:

Linear SVM – Constrained Optimization

• Using the Karush–Kuhn–Tucker (KKT) conditions:
αi [yi (w · xi + b) − 1] = 0 ∀ i
• If αi > 0 then xi lies either on H1 or H2
⇒ the solution is sparse in the αi
• Those training points are called “support vectors”; their removal would change the solution

Page 33:

SVM – Test Phase

• Given an unseen sample x, we take the class of x to be
f(x) = sign(w · x + b) = sign(Σi αi yi (xi · x) + b)
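In this form the test phase uses only the support vectors; a minimal sketch (a linear kernel is assumed as the default):

```python
import numpy as np

def svm_predict(x, sv_X, sv_y, sv_alpha, b, kernel=lambda u, v: u @ v):
    # class(x) = sign( sum_i alpha_i y_i k(x_i, x) + b )
    s = sum(a * yi * kernel(xi, x) for a, yi, xi in zip(sv_alpha, sv_y, sv_X))
    return int(np.sign(s + b))
```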

Page 34:

Linear SVM – Non-Separable Case

• The separable case corresponds to an empirical risk of zero
• For noisy data this might not be the minimum of the actual risk (overfitting)
• There is no feasible solution in the non-separable case

Page 35:

Linear SVM – Non-Separable Case

• Relax the constraints by introducing positive slack variables ξi:
yi (w · xi + b) ≥ 1 − ξi,  ξi ≥ 0
• Σi ξi is an upper bound on the number of training errors

Page 36:

Linear SVM – Non-Separable Case

• Assign an extra cost to errors
• Minimize ½ ||w||² + C Σi ξi, where C is a penalty parameter chosen by the user
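The resulting objective is the familiar hinge-loss form; a numpy sketch of evaluating it (names illustrative):

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    # 1/2 ||w||^2 + C * sum_i xi_i, with slack xi_i = max(0, 1 - y_i(w.x_i + b))
    xi = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * (w @ w) + C * xi.sum()
```

A large C punishes training errors heavily; a small C tolerates them, acting as stronger regularization.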

Page 37:

Linear SVM – Non-Separable Case

• Lagrange formulation again, with a multiplier αi for each constraint
• “Wolfe dual” problem – maximize:
LD = Σi αi − ½ Σi,j αi αj yi yj (xi · xj)
subject to: 0 ≤ αi ≤ C and Σi αi yi = 0
• The solution: w = Σi αi yi xi

Page 38:

Linear SVM – Non-Separable Case

• Using the Karush–Kuhn–Tucker conditions, the solution is again sparse in the αi

Page 39:

Nonlinear SVM

• A nonlinear decision function might be needed

Page 40:

Nonlinear SVM – Feature Space

• Map the data to a high-dimensional (possibly infinite-dimensional) feature space F via Φ: Rn → F
• The solution depends only on the dot products Φ(xi) · Φ(xj)
• If there were a function k(xi, xj) such that k(xi, xj) = Φ(xi) · Φ(xj)
⇒ no need to know Φ explicitly
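A worked check of this trick for the homogeneous quadratic kernel on R2, whose explicit feature map is Φ(x) = (x1², √2·x1x2, x2²):

```python
import numpy as np

def phi(x):
    # Explicit feature map for k(x, y) = (x . y)^2 on R^2
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, y = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print((x @ y) ** 2, phi(x) @ phi(y))  # both 16.0: k is the dot product in F
```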

Page 41:

Nonlinear SVM – Toy Example

[Figure: input space vs. feature space]

Page 42:

Nonlinear SVM – Avoiding the Curse

• Curse of dimensionality: the difficulty of estimating a problem increases drastically with its dimension
• But! Learning in F may be simple if one uses a low-complexity function class (hyperplanes)

Page 43:

Nonlinear SVM – Kernel Functions

• Kernel functions exist!
– They effectively compute dot products in feature space
• We can use a kernel without knowing Φ and F explicitly
• Given a kernel, Φ and F are not unique
• The F with the smallest dimension is called the minimal embedding space

Page 44:

Nonlinear SVM – Kernel Functions

• Mercer’s condition: there exists a pair {Φ, F} such that k(x, y) = Φ(x) · Φ(y) iff, for any g(x) such that ∫ g(x)² dx is finite,
∫∫ k(x, y) g(x) g(y) dx dy ≥ 0

Page 45:

Nonlinear SVM – Kernel Functions

• Formulation of the algorithm in terms of kernels: replace every dot product xi · xj with k(xi, xj), e.g.
LD = Σi αi − ½ Σi,j αi αj yi yj k(xi, xj)

Page 46:

Nonlinear SVM – Kernel Functions

• Kernels frequently used:
– Polynomial: k(x, y) = (x · y + 1)^p
– Gaussian RBF: k(x, y) = exp(−||x − y||² / 2σ²)
– Sigmoid: k(x, y) = tanh(κ (x · y) − δ)
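Sketches of the three kernels above in Python (the parameter defaults are illustrative):

```python
import numpy as np

def poly_kernel(x, y, p=3):
    return (x @ y + 1.0) ** p

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(x, y, kappa=1.0, delta=1.0):
    # Note: satisfies Mercer's condition only for some (kappa, delta)
    return np.tanh(kappa * (x @ y) - delta)
```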

Page 47:

Nonlinear SVM – Feature Space

• A hyperplane {w, b} requires dim(F) + 1 parameters
• Solving the SVM means adjusting only l + 1 parameters
• Example: d = 256, p = 4 ⇒ dim(F) = 183,181,376
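The dimension is the number of degree-p monomials in d variables, which can be verified directly:

```python
from math import comb

d, p = 256, 4
print(comb(d + p - 1, p))  # 183181376 monomials of degree p in d variables
```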

Page 48:

SVM – Solution

• LD is convex ⇒ the solution is global
• Two types of non-uniqueness:
– {w, b} is not unique
– {w, b} is unique, but the set {αi} is not; prefer the set with fewer support vectors (sparser)

Page 49:

Nonlinear SVM – Toy Example

Page 50:

Nonlinear SVM – Toy Example

Page 51:

Nonlinear SVM – Toy Example

Page 52:

Methods of Solution

• Quadratic programming packages
• Chunking
• Decomposition methods
• Sequential minimal optimization (SMO)

Page 53:

Applications – SVM

Extracting Support Data for a Given Task
Bernhard Schölkopf, Chris Burges, Vladimir Vapnik
Proceedings, First International Conference on Knowledge Discovery & Data Mining, 1995

Page 54:

Applications – SVM

• Input (USPS handwritten digits, 16×16 pixels):
– Training set: 7300
– Test set: 2000
• Constructed:
– 10 class/non-class SVM classifiers
– Take the class with the maximal output

Page 55:

Applications – SVM

• Three different types of classifiers for each digit:

Page 56:

Applications – SVM

Page 57:

Applications – SVM

Predicting the Optimal Decision Functions – SRM

Page 58:

Applications – SVM

Page 59:

Agenda…

• Structural Risk Minimization (SRM)
• Support Vector Machines (SVM)
• Feature Space vs. Input Space
• Kernel PCA
• Kernel Fisher Discriminant Analysis (KFDA)

Page 60:

Feature Space vs. Input Space

• Suppose the solution is a linear combination in feature space: w = Σi αi Φ(xi)
• One cannot generally say that each such point has a pre-image z with Φ(z) = w

Page 61:

Feature Space vs. Input Space

• If there exists z such that Φ(z) = Σi αi Φ(xi), and k is an invertible function fk of (x · y), i.e. k(x, y) = fk(x · y), then z can be computed as
z = Σj fk⁻¹( Σi αi k(xi, ej) ) ej
where {e1, …, eN} is an orthonormal basis of the input space.

Page 62:

Feature Space vs. Input Space

• The polynomial kernel k(x, y) = (x · y)^p is invertible when p is odd
• Then the pre-image of w is given by the expression above

Page 63:

Feature Space vs. Input Space

• In general, one cannot find an exact pre-image
• Instead, look for an approximation z such that
ρ(z) = ||Φ(z) − Σi αi Φ(xi)||²
is small

Page 64:

Agenda…

• Structural Risk Minimization (SRM)
• Support Vector Machines (SVM)
• Feature Space vs. Input Space
• Kernel PCA
• Kernel Fisher Discriminant Analysis (KFDA)

Page 65:

PCA

• Regular PCA: find the direction u such that projecting the n points in d dimensions onto u gives the largest variance.
• u is an eigenvector of the covariance matrix C: Cu = λu

Page 66:

Kernel PCA

• Extension to feature space:
– compute the covariance matrix C of the mapped data Φ(xi)
– solve the eigenvalue problem CV = λV
⇒ every eigenvector V with λ ≠ 0 lies in the span of Φ(x1), …, Φ(xl)

Page 67:

Kernel PCA

– Define V in terms of dot products: V = Σi αi Φ(xi)
– Then the problem becomes:
lλ α = K α, where Kij = Φ(xi) · Φ(xj) = k(xi, xj)
– An l × l problem rather than d × d
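A compact numpy sketch of the whole procedure, given a precomputed Gram matrix K (centering and normalization as in the kernel PCA literature; a sketch, not a reference implementation):

```python
import numpy as np

def kernel_pca(K, n_components):
    l = K.shape[0]
    J = np.full((l, l), 1.0 / l)
    Kc = K - J @ K - K @ J + J @ K @ J        # center the Phi(x_i) in F
    lam, A = np.linalg.eigh(Kc)               # eigenvalues in ascending order
    lam = lam[::-1][:n_components]
    A = A[:, ::-1][:, :n_components]
    # Normalize each alpha_k so V_k = sum_i alpha_ik Phi(x_i) has unit norm:
    # V_k . V_k = lam_k * ||alpha_k||^2 = 1   (using Kc alpha = lam alpha)
    return A / np.sqrt(np.maximum(lam, 1e-12))
```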

Page 68:

Kernel PCA – Extracting Features

• To extract features of a new pattern x with kernel PCA, project the pattern onto the first n eigenvectors:
Vk · Φ(x) = Σi αik k(xi, x)
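Continuing the sketch above, extracting features for a new pattern only needs its kernel values against the training set (test-point centering is omitted for brevity):

```python
def kpca_features(k_new, alphas):
    # k_new[i] = k(x_i, x) for the new pattern x; alphas from kernel_pca
    return k_new @ alphas  # the first n kernel principal components of x
```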

Page 69:

Kernel PCA

• Kernel PCA is used for:
– De-noising
– Compression
– Interpretation (visualization)
– Extracting features for classifiers

Page 70:

Kernel PCA – Toy Example

• Regular PCA:

Page 71:

Kernel PCA – Toy Example

Page 72:

Applications – Kernel PCA

Kernel PCA Pattern Reconstruction via Approximate Pre-Images
B. Schölkopf, S. Mika, A. Smola, G. Rätsch, and K.-R. Müller
In L. Niklasson, M. Bodén, and T. Ziemke, editors, Proceedings of the 8th International Conference on Artificial Neural Networks, Perspectives in Neural Computing, pages 147–152, Berlin, 1998. Springer Verlag.

Page 73:

Applications – Kernel PCA

• Recall w = Σi αi Φ(xi)
• Φ(x) can be reconstructed from its principal components
• Projection operator: Pn Φ(x) = Σk=1..n (Vk · Φ(x)) Vk

Page 74:

Applications – Kernel PCA

Denoising
• When Pn Φ(x) ≠ Φ(x), we can’t guarantee the existence of a pre-image
• The first n directions → the main structure
• The remaining directions → noise

Page 75:

Applications – Kernel PCA

• Find z that minimizes ||Φ(z) − Pn Φ(x)||²
• Use gradient descent starting with z = x
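For the RBF kernel, this minimization is usually run as the fixed-point iteration derived by Mika et al. (1999), which the gradient view reduces to; a hedged sketch (names illustrative):

```python
import numpy as np

def rbf_preimage(gamma, X, sigma, z0, iters=100):
    # Approximate pre-image z of P_n Phi(x) = sum_i gamma_i Phi(x_i) for an
    # RBF kernel: z <- sum_i w_i x_i / sum_i w_i with Gaussian weights w_i.
    z = z0.copy()
    for _ in range(iters):
        w = gamma * np.exp(-np.sum((X - z) ** 2, axis=1) / (2.0 * sigma ** 2))
        if abs(w.sum()) < 1e-12:   # degenerate starting point; give up
            break
        z = (w @ X) / w.sum()
    return z
```

Starting from z0 = x, as the slide suggests, this typically converges in a few iterations.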

Page 76:

Applications – Kernel PCA

• Input toy data: 3 point sources (100 points each) with Gaussian noise σ = 0.1
• Using an RBF kernel

Page 77:

Applications – Kernel PCA

Page 78:

Applications – Kernel PCA

Page 79:

Applications – Kernel PCA

Page 80:

Applications – Kernel PCA

Compare to linear PCA

Page 81:

Applications – Kernel PCA

Real-world data

Page 82:

Agenda…

• Structural Risk Minimization (SRM)
• Support Vector Machines (SVM)
• Feature Space vs. Input Space
• Kernel PCA
• Kernel Fisher Discriminant Analysis (KFDA)

Page 83:

Fisher Linear Discriminant

• Finds a direction w such that, projected onto it, the classes are “best” separated

Page 84:

Fisher Linear Discriminant

• For “best” separation, maximize
J(w) = (μ1 − μ2)² / (s1² + s2²)
where μi is the projected mean of class i and si is the projected standard deviation.

Page 85:

Fisher Linear Discriminant

• Equivalent to finding the w which maximizes
J(w) = (wᵀ SB w) / (wᵀ SW w)
where SB = (m1 − m2)(m1 − m2)ᵀ is the between-class scatter matrix and SW = Σj=1,2 Σx∈Cj (x − mj)(x − mj)ᵀ is the within-class scatter matrix.

Page 86:

Fisher Linear Discriminant

• The solution is given by w ∝ SW⁻¹ (m1 − m2)
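In numpy the closed-form solution is a few lines (names illustrative):

```python
import numpy as np

def fisher_direction(X1, X2):
    # w ~ S_W^{-1} (m1 - m2), the direction maximizing J(w)
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    return np.linalg.solve(Sw, m1 - m2)
```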

Page 87:

Kernel Fisher Discriminant

• Kernel formulation: maximize the same criterion J(w) in feature space F, where the scatter matrices are computed from the mapped data Φ(xi).

Page 88:

Kernel Fisher Discriminant

• From the theory of reproducing kernels, the solution can be expanded as w = Σi αi Φ(xi)
• Substituting this into J(w) reduces the problem to maximizing
J(α) = (αᵀ M α) / (αᵀ N α)

Page 89:

Kernel Fisher Discriminant

where M = (M1 − M2)(M1 − M2)ᵀ with (Mj)i = (1/lj) Σk k(xi, xk(j)), and N = Σj=1,2 Kj (I − 1lj) Kjᵀ, where Kj holds the kernel values against class j and 1lj is the lj × lj matrix with all entries 1/lj (Mika et al., 1999).

Page 90:

Kernel Fisher Discriminant

• The solution is the leading eigenvector of the generalized eigenproblem M α = λ N α
• The projection of a new pattern x is then
w · Φ(x) = Σi αi k(xi, x)
• Find a suitable threshold on these projections
• Again an l × l problem rather than d × d
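Because M = (M1 − M2)(M1 − M2)ᵀ has rank one, the leading generalized eigenvector is simply α ∝ N⁻¹(M1 − M2); a sketch with the regularization N + μI used by Mika et al. (the matrix layout is an assumption of this sketch):

```python
import numpy as np

def kfd_alpha(K1, K2, reg=1e-3):
    # K1, K2: the columns of the full l x l Gram matrix belonging to class 1
    # and class 2 (shapes l x l1 and l x l2).
    l, l1 = K1.shape
    l2 = K2.shape[1]
    d = K1.mean(axis=1) - K2.mean(axis=1)           # M1 - M2
    C1 = np.eye(l1) - np.full((l1, l1), 1.0 / l1)   # within-class centering
    C2 = np.eye(l2) - np.full((l2, l2), 1.0 / l2)
    N = K1 @ C1 @ K1.T + K2 @ C2 @ K2.T + reg * np.eye(l)
    return np.linalg.solve(N, d)                     # alpha ~ N^{-1}(M1 - M2)
```

The projection of a new pattern x is then Σi αi k(xi, x), thresholded as on the slide.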

Page 91:

Kernel Fisher Discriminant – Constrained Optimization

• The problem can also be formulated as constrained optimization

Page 92:

Kernel Fisher Discriminant – Toy Example

[Figure: projections found by KFDA vs. the 1st and 2nd KPCA eigenvectors]

Page 93:

Applications – Fisher Discriminant Analysis

Fisher Discriminant Analysis with Kernels
S. Mika et al.
In Y.-H. Hu, J. Larsen, E. Wilson, and S. Douglas, editors, Neural Networks for Signal Processing IX, pages 41–48. IEEE, 1999.

Page 94:

Applications – Fisher Discriminant Analysis

• Input (USPS handwritten digits):
– Training set: 3000
• Constructed:
– 10 class/non-class KFD classifiers
– Take the class with the maximal output

Page 95:

Applications – Fisher Discriminant Analysis

• Results: 3.7% error with the ten-class classifier, using an RBF kernel with width parameter 0.3 · 256
• Compare to 4.2% using SVM
• KFDA vs. SVM

Page 96:

Summary…

• Structural Risk Minimization (SRM)
• Support Vector Machines (SVM)
• Feature Space vs. Input Space
• Kernel PCA
• Kernel Fisher Discriminant Analysis (KFDA)