Kernel-Based Methods
Presented by Jason Friedman and Lena Gorelick
Advanced Topics in Computer and Human Vision
Spring 2003
2
Agenda…
• Structural Risk Minimization (SRM)
• Support Vector Machines (SVM)
• Feature Space vs. Input Space
• Kernel PCA
• Kernel Fisher Discriminant Analysis (KFDA)
3
Agenda…
• Structural Risk Minimization (SRM)
• Support Vector Machines (SVM)
• Feature Space vs. Input Space
• Kernel PCA
• Kernel Fisher Discriminant Analysis (KFDA)
4
Structural Risk Minimization (SRM)
• Definition:
– A training set consists of l observations: (x₁,y₁), …, (x_l,y_l)
– Each observation is a pair: a pattern x_i ∈ Rⁿ and a label y_i ∈ {−1, +1}
– Example: 16×16 pixel digit images, so n = 16×16 = 256
5
Structural Risk Minimization (SRM)
• The task (“generalization”): find a mapping x ↦ y that performs well on unseen data
• Assumption: training and test data are drawn from the same probability distribution P(x,y), i.e. an unseen pair (x,y) is “similar” to (x₁,y₁), …, (x_l,y_l)
6
Structural Risk Minimization (SRM) – Learning Machine
• Definition:
– A learning machine is a family of functions {f(·,α)}, where α is a set of parameters.
– For a task of learning two classes, f(x,α) ∈ {−1, +1} ∀ x, α
• Example — the class of oriented lines in R²: f(x,α) = sign(α₁x₁ + α₂x₂ + α₃)  (a minimal sketch follows)
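To make the parametrized family concrete, here is a minimal Python sketch of the oriented-lines example (the function name and the parameter values are illustrative only):

```python
import numpy as np

def f(x, alpha):
    """One member of the family {f(., alpha)}: an oriented line in R^2."""
    a1, a2, a3 = alpha
    return np.sign(a1 * x[0] + a2 * x[1] + a3)

# Two parameter settings give two different classifiers from the same family.
x = np.array([1.0, 2.0])
print(f(x, (1.0, -1.0, 0.0)))   # -1.0
print(f(x, (-1.0, 1.0, 0.5)))   #  1.0
```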
7
Structural Risk Minimization (SRM) – Capacity vs. Generalization
• Definition:
– Capacity of a learning machine measures the ability to learn any training set without error.
– Too much capacity → overfitting (e.g. asking “Does it have the same number of leaves?”)
– Too little capacity → underfitting (e.g. asking “Is the color green?”)
8
Structural Risk Minimization (SRM) – Capacity vs. Generalization
• For small sample sizes, overfitting or underfitting might occur
• Best generalization = the right balance between accuracy and capacity
9
Structural Risk Minimization (SRM) – Capacity vs. Generalization
• Solution:
Restrict the complexity (capacity) of the function class.
• Intuition: a “simple” function that explains most of the data is preferable to a “complex” one.
10
Structural Risk Minimization (SRM) – VC Dimension
• What is a “simple”/“complex” function?
• Definition:
– Given l points (they can be labeled in 2^l ways)
– The set of points is shattered by the function class {f(·,α)} if for each labeling there is a member of the class that assigns those labels correctly.
11
Structural Risk Minimization (SRM) – VC Dimension
• Definition:
– The VC dimension of {f(·,α)} is the maximum number of points that can be shattered by {f(·,α)}; it is a measure of capacity.
12
Structural Risk Minimization (SRM) – VC Dimension
• Theorem: the VC dimension of the set of oriented hyperplanes in Rⁿ is n + 1 (a small brute-force check follows below)
• So for hyperplanes, a low number of parameters ⇒ a low VC dimension (this implication does not hold for arbitrary function classes)
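As a sanity check on the theorem in R² (VC dimension 3), the following sketch brute-forces all labelings with a hard-margin linear SVM; it assumes scikit-learn is available, and `can_shatter` is just an illustrative helper:

```python
import itertools
import numpy as np
from sklearn.svm import SVC

def can_shatter(points):
    """True if oriented lines in R^2 can realize every labeling of `points`."""
    for labels in itertools.product([-1, 1], repeat=len(points)):
        if len(set(labels)) == 1:
            continue  # constant labelings are trivially realizable by a constant sign
        clf = SVC(kernel="linear", C=1e6)  # hard-margin approximation
        clf.fit(points, labels)
        if clf.score(points, labels) < 1.0:
            return False
    return True

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])       # non-collinear
four_xor = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], float)  # XOR layout
print(can_shatter(three))     # True:  3 points can be shattered
print(can_shatter(four_xor))  # False: consistent with VC dimension 3 in R^2
```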
13
Structural Risk Minimization (SRM) – Bounds
• Definition:
– Actual risk: R(α) = ∫ ½ |y − f(x,α)| dP(x,y)
• Goal: minimize R(α)
• But we cannot measure the actual risk, since we do not know P(x,y)
14
Structural Risk Minimization (SRM) – Bounds
• Definition:
– Empirical risk: R_emp(α) = (1/2l) Σ_{i=1..l} |y_i − f(x_i,α)|
• R_emp(α) → R(α) as l → ∞, but for a small training set large deviations might occur
15
Structural Risk Minimization (SRM) – Bounds
• Risk bound: with probability 1 − η,
R(α) ≤ R_emp(α) + √( ( h (log(2l/h) + 1) − log(η/4) ) / l )
where h is the VC dimension of the function class and the square-root term is the confidence term (computed below)
• Note: the bound is independent of P(x,y)
• The bound is not valid for infinite VC dimension
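A small numeric sketch of the confidence term, using the bound as reconstructed above (`vc_confidence` is an illustrative name, not from the talk):

```python
import numpy as np

def vc_confidence(h, l, eta=0.05):
    """Confidence term sqrt((h*(log(2l/h) + 1) - log(eta/4)) / l)."""
    return np.sqrt((h * (np.log(2 * l / h) + 1) - np.log(eta / 4)) / l)

# The term grows with the VC dimension h and shrinks with the sample size l.
for h in (10, 100, 500):
    print(h, round(vc_confidence(h, l=1000), 3))
```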
16
Structural Risk Minimization (SRM) – Bounds
17
Structural Risk Minimization (SRM) – Bounds
18
Structural Risk Minimization (SRM) – Principled Method
• A principled method for choosing a learning machine for a given task:
19
SRM
• Divide the class of functions into nested subsets
• Either calculate h for each subset, or get a bound on it
• Train each subset to achieve minimal empirical error
• Choose the subset with the minimal risk bound
(Figure: nested subsets of increasing complexity and the resulting risk bound)
20
Agenda…
• Structural Risk Minimization (SRM)
• Support Vector Machines (SVM)
• Feature Space vs. Input Space
• Kernel PCA
• Kernel Fisher Discriminant Analysis (KFDA)
21
Support Vector Machines (SVM)
• Currently the “en vogue” approach to classification
• Successful applications in bioinformatics, text, handwriting recognition, image processing
• Introduced by Boser, Guyon and Vapnik, 1992
• SVMs are a particular instance of kernel machines
22
Linear SVM – Separable Case
• Two given classes are linearly separable
23
Linear SVM – Definitions
• Separating hyperplane H: w·x + b = 0
• w is normal to H
• |b|/||w|| is the perpendicular distance from H to the origin
• d+ (d-) is the shortest distance from H to the closest positive (negative) point.
24
Linear SVM – Definitions
25
Linear SVM – Definitions
• If H is a separating hyperplane, then x_i·w + b > 0 for y_i = +1 and x_i·w + b < 0 for y_i = −1
• Define H1: x·w + b = +1 and H2: x·w + b = −1, parallel to H and passing through the closest positive and negative points
• No training points fall between H1 and H2
26
Linear SVM – Definitions
• By scaling w and b, we can require that
x_i·w + b ≥ +1 for y_i = +1
x_i·w + b ≤ −1 for y_i = −1
or more simply: y_i (x_i·w + b) − 1 ≥ 0 ∀ i
• Equality holds ⇔ x_i lies on H1 or H2
27
Linear SVM – Definitions
• Note: w is no longer a unit vector
• The margin (the distance between H1 and H2) is now 2 / ‖w‖ (derived below)
• Find the hyperplane with the largest margin
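The margin value follows in one line from the constraints above (a standard derivation, added here for completeness):

```latex
% Distance from a point x1 on H1 (w.x + b = +1) to H (w.x + b = 0), and likewise for H2:
d_+ = \frac{|w \cdot x_1 + b|}{\|w\|} = \frac{1}{\|w\|}, \qquad
d_- = \frac{1}{\|w\|}
\quad\Rightarrow\quad
\text{margin} = d_+ + d_- = \frac{2}{\|w\|}.
```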
28
Linear SVM – Maximizing the Margin
• Maximizing the margin ⇔ minimizing ‖w‖²
• ⇒ more room for unseen points to fall on the correct side
• ⇒ restricts the capacity: the VC dimension of large-margin hyperplanes is bounded in terms of R² ‖w‖², where R is the radius of the smallest ball around the data
29
Linear SVM – Constrained Optimization
• Introduce Lagrange multipliers α_i ≥ 0, one per constraint
• “Primal” formulation:
L_P = ½ ‖w‖² − Σ_i α_i y_i (x_i·w + b) + Σ_i α_i
• Minimize L_P with respect to w and b, requiring α_i ≥ 0
30
Linear SVM – Constrained Optimization
• The objective function is quadratic
• Each linear constraint defines a convex set
• The intersection of convex sets is a convex set
• ⇒ we can formulate the “Wolfe dual” problem
31
• Maximize L_P with respect to the α_i
• Require that the gradient of L_P with respect to w and b vanish:
w = Σ_i α_i y_i x_i,   Σ_i α_i y_i = 0
• Substituting back into L_P gives the dual:
L_D = Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j (x_i · x_j)
• Maximize L_D with respect to the α_i
Linear SVM – Constrained Optimization – The Solution:
w = Σ_i α_i y_i x_i, and b follows from any support vector via y_i (x_i·w + b) = 1
33
Linear SVM – Constrained Optimization
• Using the Karush-Kuhn-Tucker conditions: α_i ( y_i (x_i·w + b) − 1 ) = 0 ∀ i
• If α_i > 0, then x_i lies either on H1 or on H2
⇒ the solution is sparse in the α_i
• Those training points are called “support vectors”; their removal would change the solution
34
SVM – Test Phase
• Given an unseen sample x, we take its class to be
f(x) = sign( w·x + b ) = sign( Σ_i α_i y_i (x_i·x) + b )  (checked against a library solver below)
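A brief sketch verifying that this test-phase rule matches what a standard solver produces; it assumes scikit-learn (whose SVC stores α_i y_i in `dual_coef_`), and the blob data is only a placeholder:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=40, centers=2, random_state=0)
y = np.where(y == 0, -1, 1)

clf = SVC(kernel="linear", C=1e3).fit(X, y)

# sign( sum_i alpha_i y_i (x_i . x) + b ), summing over the support vectors only.
x_new = X[0]
manual = np.sign(clf.dual_coef_[0] @ (clf.support_vectors_ @ x_new) + clf.intercept_[0])
print(manual == clf.predict([x_new])[0])  # True
```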
35
Linear SVM – Non-Separable Case
• The separable case corresponds to an empirical risk of zero
• For noisy data this need not minimize the actual risk (overfitting)
• The formulation above has no feasible solution in the non-separable case
36
Linear SVM – Non-Separable Case
• Relax the constraints by introducing positive slack variables ξ_i ≥ 0:
x_i·w + b ≥ +1 − ξ_i for y_i = +1
x_i·w + b ≤ −1 + ξ_i for y_i = −1
• Σ_i ξ_i is an upper bound on the number of training errors
37
Linear SVM – Non-Separable Case
• Assign an extra cost to errors
• Minimize ½ ‖w‖² + C Σ_i ξ_i
where C is a penalty parameter chosen by the user
38
Linear SVM – Non-Separable Case
• Lagrange formulation again, with an additional Lagrange multiplier μ_i for each ξ_i ≥ 0 constraint
• “Wolfe dual” problem — maximize:
L_D = Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j (x_i · x_j)
subject to: 0 ≤ α_i ≤ C and Σ_i α_i y_i = 0
• The solution: w = Σ_i α_i y_i x_i
40
Linear SVM – Non-Separable Case
• Using the Karush-Kuhn-Tucker conditions:
α_i ( y_i (x_i·w + b) − 1 + ξ_i ) = 0,   (C − α_i) ξ_i = 0
• The solution is again sparse in the α_i
41
Nonlinear SVM
• A nonlinear decision function might be needed
42
Nonlinear SVM – Feature Space
• Map the data to a high-dimensional (possibly infinite-dimensional) feature space F via Φ: R^d → F
• The dual solution depends only on dot products Φ(x_i)·Φ(x_j)
• If there were a function k(x_i, x_j) = Φ(x_i)·Φ(x_j) (a worked example follows),
⇒ no need to know Φ explicitly
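A worked instance of such a k (the standard quadratic map, shown here as an illustration): for x, y ∈ R², take Φ: R² → R³ with

```latex
\Phi(x) = \bigl(x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2\bigr)
\;\Longrightarrow\;
\Phi(x)\cdot\Phi(y)
  = x_1^2 y_1^2 + 2\,x_1 x_2\, y_1 y_2 + x_2^2 y_2^2
  = (x \cdot y)^2 = k(x,y),
```

so the feature-space dot product is computed entirely in the two-dimensional input space.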
43
Nonlinear SVM – Toy Example
(Figure: the data in input space and the mapped data in feature space)
44
Nonlinear SVM – Avoiding the Curse
• Curse of dimensionality: the difficulty of an estimation problem increases drastically with the dimension
• But! Learning in F may still be easy if one uses a low-complexity function class (hyperplanes)
45
Nonlinear SVM – Kernel Functions
• Kernel functions exist!
– They effectively compute dot products in feature space
• A kernel can be used without knowing Φ and F explicitly
• Given a kernel, Φ and F are not unique
• The F of smallest dimension is called the minimal embedding space
46
Nonlinear SVM – Kernel Functions
• Mercer’s condition: there exists a pair {Φ, F} with k(x,y) = Φ(x)·Φ(y)
iff for every g(x) such that ∫ g(x)² dx is finite,
∫∫ k(x,y) g(x) g(y) dx dy ≥ 0
47
Nonlinear SVM – Kernel Functions
• Formulation of the algorithm in terms of kernels: replace every dot product x_i·x_j in L_D by k(x_i, x_j), and classify with f(x) = sign( Σ_i α_i y_i k(x_i, x) + b )
48
Nonlinear SVM – Kernel Functions
• Kernels frequently used:
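The choices most commonly quoted in the SVM literature (the polynomial and Gaussian RBF kernels are the ones referred to later in this talk; the sigmoid kernel satisfies Mercer’s condition only for some parameter values):

```latex
k(x,y) = (x \cdot y + 1)^p, \qquad
k(x,y) = \exp\!\left(-\frac{\|x-y\|^2}{2\sigma^2}\right), \qquad
k(x,y) = \tanh(\kappa\, x \cdot y - \delta).
```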
49
Nonlinear SVM – Feature Space
• A hyperplane {w, b} in F requires dim(F) + 1 parameters
• Solving the SVM means adjusting only l + 1 parameters
• Example: d = 256, polynomial degree p = 4 ⇒ dim(F) = 183,181,376 (verified below)
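That dimension is the number of distinct degree-p monomials in d variables, C(d+p−1, p); a quick check:

```python
from math import comb

d, p = 256, 4
print(comb(d + p - 1, p))  # 183181376 monomials of degree 4 in 256 variables
```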
50
SVM – Solution
• L_D is convex ⇒ the solution is global
• Two types of non-uniqueness:
– {w, b} is not unique
– {w, b} is unique, but the set {α_i} is not; prefer the set with fewer support vectors (sparser)
51
Nonlinear SVM – Toy Example
52
Nonlinear SVM – Toy Example
53
Nonlinear SVM – Toy Example
54
Methods of Solution
• Quadratic programming packages
• Chunking
• Decomposition methods
• Sequential minimal optimization (SMO)
55
Applications – SVM
Extracting Support Data for a Given Task
B. Schölkopf, C. Burges, V. Vapnik
In Proceedings, First International Conference on Knowledge Discovery & Data Mining, 1995
56
Applications – SVM
• Input (USPS handwritten digits, 16×16 pixels):
– Training set: 7300
– Testing set: 2000
• Constructed:
– 10 class/non-class SVM classifiers (one per digit)
– Take the class with maximal output
57
Applications – SVM
• Three different types of classifiers for each digit:
58
Applications – SVM
59
Applications – SVM – Predicting the Optimal Decision Function via SRM
60
Applications – SVM
61
Agenda…
• Structural Risk Minimization (SRM)
• Support Vector Machines (SVM)
• Feature Space vs. Input Space
• Kernel PCA
• Kernel Fisher Discriminant Analysis (KFDA)
62
Feature Space vs. Input Space
• Suppose the solution is a linear combination of mapped points in feature space: w = Σ_i α_i Φ(x_i)
• One cannot generally say that such a point has a preimage z with Φ(z) = w
63
Feature Space vs. Input Space
• If there exists z such that Φ(z) = w, and k is an invertible function f_k of the dot product, k(x,y) = f_k(x·y), then z can be computed as
z = Σ_{j=1..N} f_k⁻¹( Σ_i α_i k(x_i, e_j) ) e_j
where {e₁, …, e_N} is an orthonormal basis of the input space.
64
Feature Space vs. Input Space
• The polynomial kernel k(x,y) = (x·y)^p is invertible when p is odd (t ↦ t^p is then invertible on R)
• The preimage of w is then given by the formula above
65
Feature Space vs. Input Space
• In general, the preimage cannot be found exactly
• Instead, look for an approximation z such that ‖Φ(z) − w‖² is small
66
Agenda…
• Structural Risk Minimization (SRM)
• Support Vector Machines (SVM)
• Feature Space vs. Input Space
• Kernel PCA
• Kernel Fisher Discriminant Analysis (KFDA)
67
PCA
• Regular PCA: find the direction u such that projecting the n points in d dimensions onto u gives the largest variance
• u is an eigenvector of the covariance matrix: Cu = λu
68
Kernel PCA
• Extension to feature space:
– compute the covariance matrix C = (1/l) Σ_i Φ(x_i) Φ(x_i)ᵀ (assuming centered data)
– solve the eigenvalue problem CV = λV
⇒ every eigenvector V with λ ≠ 0 lies in the span of Φ(x₁), …, Φ(x_l)
69
Kernel PCA
– Express V in terms of dot products: V = Σ_i α_i Φ(x_i)
– The problem then becomes: l λ α = K α
where K_ij = k(x_i, x_j) is the Gram matrix
• This eigenproblem is l × l rather than d × d
70
Kernel PCA – Extracting Features
• To extract features of a new pattern x with kernel PCA, project it onto the first n eigenvectors (a numpy sketch follows):
(Vᵏ · Φ(x)) = Σ_i α_iᵏ k(x_i, x),   k = 1, …, n
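A compact numpy sketch of the l × l eigenproblem above, using an RBF kernel and computing projections for the training set only (all names are illustrative):

```python
import numpy as np

def rbf_gram(X, sigma=1.0):
    """Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kernel_pca(X, n_components=2, sigma=1.0):
    """Kernel PCA: expansion coefficients alpha and training-set projections."""
    l = len(X)
    K = rbf_gram(X, sigma)
    # Center the data in feature space by centering the Gram matrix.
    O = np.full((l, l), 1.0 / l)
    Kc = K - O @ K - K @ O + O @ K @ O
    # l x l eigenproblem instead of a (possibly infinite-dimensional) one in F.
    mu, A = np.linalg.eigh(Kc)                       # ascending eigenvalues
    mu, A = mu[::-1][:n_components], A[:, ::-1][:, :n_components]
    A = A / np.sqrt(np.maximum(mu, 1e-12))           # normalize so each V^k has unit norm
    # Projection of training point i onto V^k is sum_j A[j, k] * Kc[i, j].
    return A, Kc @ A

X = np.random.RandomState(0).randn(100, 2)
alpha, features = kernel_pca(X, n_components=2, sigma=1.0)
print(features.shape)  # (100, 2)
```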
71
Kernel PCA
• Kernel PCA is used for:
– De-noising
– Compression
– Interpretation (visualization)
– Extracting features for classifiers
72
Kernel PCA – Toy Example
• Regular PCA:
73
Kernel PCA – Toy Example
74
Applications – Kernel PCA
Kernel PCA Pattern Reconstruction via Approximate Pre-Images
B. Schölkopf, S. Mika, A. Smola, G. Rätsch, and K.-R. Müller.
In L. Niklasson, M. Bodén, and T. Ziemke, editors, Proceedings of the 8th International Conference on Artificial Neural Networks,
Perspectives in Neural Computing, pages 147-152, Berlin,
1998. Springer Verlag.
75
Applications – Kernel PCA
• Recall: Vᵏ = Σ_i α_iᵏ Φ(x_i)
• Φ(x) can be reconstructed from its principal components
• Projection operator onto the first n components:
P_n Φ(x) = Σ_{k=1..n} ( Φ(x) · Vᵏ ) Vᵏ
76
Applications – Kernel PCA – Denoising
• When P_n Φ(x) ≠ Φ(x),
⇒ the existence of an exact pre-image cannot be guaranteed
• The first n directions capture the main structure; the remaining directions mostly capture noise
77
Applications – Kernel PCA
• Find the z that minimizes ‖Φ(z) − P_n Φ(x)‖²
• Use gradient descent, starting from z = x (sketched below)
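For an RBF kernel, k(z,z) is constant, so (ignoring the feature-space centering terms for brevity) minimizing ‖Φ(z) − P_nΦ(x)‖² amounts to maximizing ρ(z) = Σ_i γ_i k(z, x_i), where the γ_i collect the expansion coefficients of P_nΦ(x) over the training points. A hedged sketch of the resulting gradient-ascent step, assuming those γ_i have already been computed from the kernel PCA solution:

```python
import numpy as np

def rbf(z, X, sigma):
    return np.exp(-((X - z) ** 2).sum(axis=1) / (2.0 * sigma ** 2))

def denoise_preimage(x, X, gamma, sigma, lr=0.5, steps=200):
    """Gradient ascent on rho(z) = sum_i gamma_i k(z, x_i), starting from z = x."""
    z = x.astype(float).copy()
    for _ in range(steps):
        w = gamma * rbf(z, X, sigma)                          # gamma_i * k(z, x_i)
        grad = (w[:, None] * (X - z)).sum(axis=0) / sigma**2  # d rho / dz
        z = z + lr * grad
    return z
```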
78
Applications – Kernel PCA
• Input toy data: 3 point sources (100 points each) with Gaussian noise, σ = 0.1
• Using an RBF kernel
79
Applications – Kernel PCA
80
Applications – Kernel PCA
81
Applications – Kernel PCA
82
Applications – Kernel PCA
Compare to the linear PCA
83
Applications – Kernel PCA – Real-World Data
84
Agenda…
• Structural Risk Minimization (SRM)
• Support Vector Machines (SVM)
• Feature Space vs. Input Space
• Kernel PCA
• Kernel Fisher Discriminant Analysis (KFDA)
85
Fisher Linear Discriminant
• Finds a direction w such that, when projected onto it, the classes are “best” separated
86
Fisher Linear Discriminant
• For “best” separation, maximize
J = (μ̃₁ − μ̃₂)² / (s̃₁² + s̃₂²)
where μ̃_i is the projected mean of class i and s̃_i is the projected standard deviation
87
Fisher Linear Discriminant
• Equivalent to finding the w that maximizes:
J(w) = (wᵀ S_B w) / (wᵀ S_W w)
where S_B = (m₁ − m₂)(m₁ − m₂)ᵀ is the between-class scatter and S_W = Σ_{i=1,2} Σ_{x ∈ C_i} (x − m_i)(x − m_i)ᵀ is the within-class scatter
88
Fisher Linear Discriminant
• The solution is given by: w ∝ S_W⁻¹ (m₁ − m₂)  (sketched below)
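A minimal numpy sketch of that closed-form solution, on two Gaussian blobs used purely as placeholder data:

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher linear discriminant: w proportional to S_W^{-1} (m1 - m2)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)  # within-class scatter
    w = np.linalg.solve(S_W, m1 - m2)
    return w / np.linalg.norm(w)

rng = np.random.RandomState(0)
X1 = rng.randn(50, 2) + [2.0, 0.0]
X2 = rng.randn(50, 2) + [-2.0, 0.0]
print(fisher_direction(X1, X2))  # roughly aligned with the x-axis
```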
89
Kernel Fisher Discriminant
• Kernel formulation: maximize
J(w) = (wᵀ S_B^Φ w) / (wᵀ S_W^Φ w)
where S_B^Φ and S_W^Φ are the between- and within-class scatter matrices of the mapped data Φ(x₁), …, Φ(x_l)
90
Kernel Fisher Discriminant
• From the theory of reproducing kernels, w can be expanded as w = Σ_i α_i Φ(x_i)
• Substituting this into J(w) reduces the problem to maximizing:
J(α) = (αᵀ M α) / (αᵀ N α)
91
Kernel Fisher Discriminant
where M and N are the between- and within-class scatter matrices expressed through kernel evaluations only: M = (M₁ − M₂)(M₁ − M₂)ᵀ with (M_i)_j = (1/l_i) Σ_k k(x_j, x_k^i), and N = Σ_{i=1,2} K_i (I − 1_{l_i}) K_iᵀ
92
Kernel Fisher Discriminant
• The solution is found by solving the generalized eigenproblem M α = λ N α (a sketch follows)
• The projection of a new pattern x is then: w · Φ(x) = Σ_i α_i k(x_i, x)
• Find a suitable threshold on this projection to classify
• The problem is l × l rather than d × d
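A sketch of that construction following the Mika et al. formulation (M has rank one, so α can be taken as N⁻¹(M₁ − M₂); the small ridge on N and all names are my own additions for numerical stability and illustration):

```python
import numpy as np

def rbf_gram(X, Y, sigma=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kfd(X1, X2, sigma=1.0, reg=1e-3):
    """Kernel Fisher discriminant: expansion coefficients alpha and a scoring function."""
    X = np.vstack([X1, X2])
    l, l1, l2 = len(X), len(X1), len(X2)
    K1, K2 = rbf_gram(X, X1, sigma), rbf_gram(X, X2, sigma)  # l x l1, l x l2
    M1, M2 = K1.mean(axis=1), K2.mean(axis=1)                # (M_i)_j = mean_k k(x_j, x_k^i)
    # Within-class matrix N = sum_i K_i (I - 1_{l_i}) K_i^T, plus a ridge for stability.
    N = (K1 @ (np.eye(l1) - np.full((l1, l1), 1.0 / l1)) @ K1.T
         + K2 @ (np.eye(l2) - np.full((l2, l2), 1.0 / l2)) @ K2.T
         + reg * np.eye(l))
    alpha = np.linalg.solve(N, M1 - M2)
    # Projection of a new pattern: w . Phi(x) = sum_i alpha_i k(x_i, x).
    score = lambda x_new: rbf_gram(np.atleast_2d(x_new), X, sigma)[0] @ alpha
    return alpha, score

rng = np.random.RandomState(0)
X1 = rng.randn(30, 2) + [1.5, 0.0]
X2 = rng.randn(30, 2) + [-1.5, 0.0]
alpha, score = kfd(X1, X2)
print(score(X1[0]) > score(X2[0]))  # typically True: class-1 points score higher
```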
93
Kernel Fisher Discriminant – Constrained Optimization
• The problem can also be formulated as a constrained optimization
94
Kernel Fisher Discriminant – Toy Example
(Figure panels: KFDA projection, KPCA 1st eigenvector, KPCA 2nd eigenvector)
95
Applications – Fisher Discriminant Analysis
Fisher Discriminant Analysis with Kernels
S. Mika et al.
In Y.-H. Hu, J. Larsen, E. Wilson, and S. Douglas, editors, Neural Networks for Signal Processing IX, pages 41-48. IEEE, 1999.
96
Applications – Fisher Discriminant Analysis
• Input (USPS handwritten digits):
– Training set: 3000
• Constructed:
– 10 class/non-class KFD classifiers
– Take the class with maximal output
97
Applications – Fisher Discriminant Analysis
• Results: 3.7% error on the ten-class task, using an RBF kernel with width parameter 0.3 · 256
• Compare to 4.2% using SVM
• KFDA vs. SVM
98
Summary…
• Structural Risk Minimization (SRM)
• Support Vector Machines (SVM)
• Feature Space vs. Input Space
• Kernel PCA
• Kernel Fisher Discriminant Analysis (KFDA)