TRANSCRIPT
Christopher M. Bishop, Pattern Recognition and Machine Learning
Outline Introduction to kernel methods Support vector machines (SVM) Relevance vector machines (RVM) Applications Conclusions
Supervised Learning
In machine learning, applications in which the training data comprises examples of the input vectors along with their corresponding target vectors are called supervised learning
[Slide figure: a model y(x) maps inputs to outputs; training examples (x, t) listed as (1, 60, pass), (2, 53, fail), (3, 77, pass), (4, 34, fail)]
Classification
[Slide figure: 2D example in (x1, x2) space; the boundary y = 0 separates the region y > 0 (class t = +1) from y < 0 (class t = −1)]
Regression
[Slide figure: target t (from −1 to 1) plotted against input x (from 0 to 1), with a prediction shown for a new x]
Linear Models
Linear models for regression and classification:
$y(x) = w_0 + w_1 x_1 + \dots + w_D x_D$, where $x = (x_1, \dots, x_D)$
If we apply feature extraction $\phi$:
$y(x) = \sum_{j=0}^{M-1} w_j \phi_j(x) = w^T \phi(x)$
where $x$ is the input and $w$ are the model parameters
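A minimal sketch of this model in Python (the quadratic feature map and weight values are illustrative assumptions, not from the slides):

```python
import numpy as np

def phi(x):
    # Example feature extraction: quadratic features of a 2D input
    return np.array([1.0, x[0], x[1], x[0]**2, x[0]*x[1], x[1]**2])

w = np.array([0.5, -1.0, 2.0, 0.1, 0.0, -0.3])  # model parameters (arbitrary)
x = np.array([1.0, 2.0])
y = w @ phi(x)  # y(x) = w^T phi(x)
```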
Problems with Feature Space
Why feature extraction? Working in high-dimensional feature spaces solves the problem of expressing complex functions
Problems:
- a computational problem (working with very large vectors)
- the curse of dimensionality
Kernel Methods (1)
Kernel function: an inner product in some feature space, i.e. a nonlinear similarity measure:
$k(x, x') = \phi(x)^T \phi(x')$
Examples:
- polynomial: $k(x, x') = (x^T x' + c)^d$
- Gaussian: $k(x, x') = \exp(-\lVert x - x' \rVert^2 / 2\sigma^2)$
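As a quick illustration, a sketch of these two kernels in Python (the parameter values c, d, and sigma are arbitrary assumptions):

```python
import numpy as np

def polynomial_kernel(x, xp, c=1.0, d=2):
    # k(x, x') = (x^T x' + c)^d
    return (x @ xp + c) ** d

def gaussian_kernel(x, xp, sigma=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma**2))
```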
Kernel Methods (2)
Many linear models can be reformulated using a "dual representation" in which the kernel functions arise naturally, so the model only requires inner products between data (input) points
Example for a 2D input:
$k(x, z) = (x^T z)^2 = (x_1 z_1 + x_2 z_2)^2$
$= x_1^2 z_1^2 + 2 x_1 z_1 x_2 z_2 + x_2^2 z_2^2$
$= (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)(z_1^2, \sqrt{2}\,z_1 z_2, z_2^2)^T$
$= \phi(x)^T \phi(z)$
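A minimal numerical check of this identity (a sketch; the test vectors are arbitrary):

```python
import numpy as np

def phi(v):
    # Explicit feature map recovered above for k(x, z) = (x^T z)^2
    return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

x, z = np.array([1.0, 2.0]), np.array([3.0, 4.0])
assert np.isclose((x @ z) ** 2, phi(x) @ phi(z))  # both equal 121.0
```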
Kernel Methods (3)
We can benefit from the kernel trick:
- choosing a kernel function is equivalent to choosing $\phi$, so there is no need to specify what features are being used
- we can save computation by not explicitly mapping the data to feature space, but just working out the inner product in the data space
Kernel Methods (4)
Kernel methods exploit information about the inner products between data items
We can construct kernels indirectly, by choosing a feature space mapping φ, or directly, by choosing a valid kernel function (a validity check is sketched below)
If a bad kernel function is chosen, it will map to a space with many irrelevant features, so we need some prior knowledge of the target
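One practical sanity check of validity (a sketch: a valid kernel must yield a positive semi-definite Gram matrix on any finite set of points):

```python
import numpy as np

def gram_is_psd(kernel, X, tol=1e-9):
    # Build the Gram matrix K[i, j] = k(x_i, x_j) and inspect its eigenvalues
    K = np.array([[kernel(a, b) for b in X] for a in X])
    return np.linalg.eigvalsh(K).min() >= -tol

X = np.random.RandomState(0).randn(10, 2)
gauss = lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2)
print(gram_is_psd(gauss, X))  # True: the Gaussian kernel is valid
```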
Kernel Methods (5)
Two basic modules for kernel methods:
- a general-purpose learning model
- a problem-specific kernel function
Kernel Methods (6)
Limitation: the kernel function k(xn,xm) must be evaluated for all possible pairs xn and xm of training points when making predictions for new data points
A sparse kernel machine makes predictions using only a subset of the training data points
Outline Introduction to kernel methods Support vector machines (SVM) Relevance vector machines (RVM) Applications Conclusions
Support Vector Machines (1)
Support Vector Machines are a system for efficiently training linear machines in kernel-induced feature spaces, while respecting the insights provided by generalization theory and exploiting optimization theory
Generalization theory describes how to control the learning machines to prevent them from overfitting
Support Vector Machines (2)
To avoid overfitting, SVM modify the error function to a "regularized form":
$E(w) = E_D(w) + \lambda E_W(w)$
where the hyperparameter $\lambda$ balances the trade-off
The aim of $E_W$ is to limit the estimated functions to smooth functions
As a side effect, SVM obtain a sparse model
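A small sketch of such a regularized objective (assuming, for illustration, a squared-error data term and a quadratic regularizer; these particular choices are not specified on the slide):

```python
import numpy as np

def regularized_error(w, Phi, t, lam):
    # E(w) = E_D(w) + lambda * E_W(w)
    E_D = 0.5 * np.sum((Phi @ w - t) ** 2)  # data-fit term
    E_W = 0.5 * w @ w                       # penalizes large weights -> smoother functions
    return E_D + lam * E_W
```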
Support Vector Machines (3)
[Fig. 1: Architecture of SVM]
SVM for Classification (1)
The mechanism to prevent overfitting in classification is the "maximum margin classifier"
SVM is fundamentally a two-class classifier
Maximum Margin Classifiers (1)
The aim of classification is to find a (D−1)-dimensional hyperplane to classify data in a D-dimensional space
[Slide figure: 2D example of candidate separating lines]
Maximum Margin Classifiers (2)
[Slide figure: the margin is the distance from the decision boundary to the closest data points, the support vectors]
Maximum Margin Classifiers (3)
[Slide figure: comparison of a small-margin and a large-margin separator]
Maximum Margin Classifiers (4)
Intuitively it is a "robust" solution: if we've made a small error in the location of the boundary, this gives us the least chance of causing a misclassification
The concept of max margin is usually justified using Vapnik's statistical learning theory
Empirically it works well
SVM for Classification (2)
After the optimization process, we obtain the prediction model:
$y(x) = \sum_{n=1}^{N} a_n t_n k(x, x_n) + b$
where $(x_n, t_n)$ are the $N$ training data points
We find that $a_n$ will be zero except for the support vectors → sparse
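A minimal sketch of this in practice using scikit-learn (an assumption; the slides do not name a library). After fitting, dual_coef_ holds the nonzero products $a_n t_n$ and support_vectors_ the corresponding $x_n$:

```python
import numpy as np
from sklearn.svm import SVC

# Toy two-class data (hypothetical)
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) - 2, rng.randn(20, 2) + 2])
t = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="rbf", gamma=0.5).fit(X, t)
# Only the support vectors contribute to y(x); all other a_n are zero
print(len(clf.support_vectors_), "support vectors out of", len(X))
print(clf.predict([[0.0, 0.0]]))
```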
SVM for Classification (3)
[Fig. 2: data from two classes in two dimensions, showing contours of constant y(x) obtained from an SVM with a Gaussian kernel function]
SVM for Classification (4)
For overlapping class distributions, SVM allow some of the training points to be misclassified, at a penalty → soft margin
SVM for Classification (5)
For multiclass problems, there are methods to combine multiple two-class SVMs (sketched below):
- one versus the rest
- one versus one (more training time)
[Fig. 3: Problems in multiclass classification using multiple SVMs]
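Both combination strategies can be sketched with scikit-learn's meta-estimators (an assumed setup on hypothetical three-class data):

```python
from sklearn.datasets import make_blobs
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, t = make_blobs(n_samples=90, centers=3, random_state=0)  # three classes
ovr = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, t)  # trains K classifiers
ovo = OneVsOneClassifier(SVC(kernel="rbf")).fit(X, t)   # trains K(K-1)/2 classifiers
```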
SVM for Regression (1)
For regression problems, the mechanism to prevent overfitting is the "ε-insensitive error function":
$E_\epsilon(y(x) - t) = 0$ if $|y(x) - t| < \epsilon$, and $|y(x) - t| - \epsilon$ otherwise
[Slide figure: the ε-insensitive error function compared with the quadratic error function]
SVM for Regression (2)
[Fig. 4: the ε-tube; points inside the tube incur no error, points outside incur error $|y(x) - t| - \epsilon$]
SVM for Regression (3)
After the optimization process, we obtain the prediction model:
$y(x) = \sum_{n=1}^{N} (a_n - \hat{a}_n) k(x, x_n) + b$
We find that the coefficients will be zero except for those of the support vectors → sparse
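A minimal sketch with scikit-learn's SVR, mirroring the classification sketch (an assumption; the data are a hypothetical noisy sinusoid, and epsilon sets the tube width):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, (40, 1)), axis=0)
t = np.sin(2 * np.pi * X).ravel() + 0.1 * rng.randn(40)

reg = SVR(kernel="rbf", epsilon=0.1).fit(X, t)
# Points strictly inside the epsilon-tube get zero coefficients
print(len(reg.support_vectors_), "support vectors out of", len(X))
```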
SVM for Regression (4)
[Fig. 5: Regression results; support vectors lie on the boundary of the ε-tube or outside the tube]
Disadvantages
- Not sparse enough: the number of support vectors required typically grows linearly with the size of the training set
- Predictions are not probabilistic
- The estimation of the error/margin trade-off parameters requires cross-validation, which wastes computation
- Kernel functions are limited (they must be valid, i.e. positive definite)
- Multiclass classification is problematic
Outline Introduction to kernel methods Support vector machines (SVM) Relevance vector machines (RVM) Applications Conclusions
Relevance Vector Machines (1)
The relevance vector machine (RVM) is a Bayesian sparse kernel technique that shares many of the characteristics of the SVM whilst avoiding its principal limitations
RVM are based on a Bayesian formulation and provide posterior probabilistic outputs, as well as having much sparser solutions than SVM
Relevance Vector Machines (2)
RVM intend to mirror the structure of the SVM and use a Bayesian treatment to remove the limitations of SVM:
$y(x) = \sum_{n=1}^{N} w_n k(x, x_n) + b$
The kernel functions are simply treated as basis functions, rather than as dot products in some feature space
Bayesian Inference
Bayesian inference allows one to model uncertainty about the world and outcomes of interest by combining common-sense knowledge and observational evidence.
Relevance Vector Machines (3)
In the Bayesian framework, we use a prior distribution over $w$ to avoid overfitting:
$p(w \mid \alpha) = \prod_{m=1}^{N} \left(\frac{\alpha}{2\pi}\right)^{1/2} \exp\left(-\frac{\alpha}{2} w_m^2\right)$
where $\alpha$ is a hyperparameter which controls the model parameters $w$
Relevance Vector Machines (4)
Goal: find the most probable $\alpha^*$ and $\beta^*$ to compute the predictive distribution over $t_{new}$ for a new input $x_{new}$, i.e.
$p(t_{new} \mid x_{new}, X, t, \alpha^*, \beta^*)$
Maximize the likelihood function to obtain $\alpha^*$ and $\beta^*$:
$p(t \mid X, \alpha, \beta)$
where $X$ and $t$ are the training data and their target values
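Once $\alpha^*$ and $\beta^*$ are found, the predictive distribution for a new input is Gaussian. A sketch of evaluating its mean and variance, assuming mu and Sigma denote the posterior mean and covariance of $w$ (standard formulas, not spelled out on the slide):

```python
import numpy as np

def predictive(phi_new, mu, Sigma, beta):
    # Gaussian predictive distribution N(mean, var) at a new basis vector phi(x_new)
    mean = mu @ phi_new
    var = 1.0 / beta + phi_new @ Sigma @ phi_new  # noise + weight uncertainty
    return mean, var
```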
Relevance Vector Machines (5)
RVM utilize "automatic relevance determination" to achieve sparsity:
$p(w \mid \alpha) = \prod_{m=1}^{N} \left(\frac{\alpha_m}{2\pi}\right)^{1/2} \exp\left(-\frac{\alpha_m}{2} w_m^2\right)$
where $\alpha_m$ represents the precision of $w_m$
In the procedure of finding $\alpha_m^*$, some $\alpha_m$ become infinite, which drives the corresponding $w_m$ to zero; the training points that remain are the relevance vectors!
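A compact sketch of the resulting re-estimation loop for RVM regression (a minimal rendition of Tipping's fixed-point updates; the Gaussian kernel width and pruning threshold are arbitrary assumptions):

```python
import numpy as np

def rvm_regression(X, t, n_iter=100, sigma=0.5, prune=1e6):
    # Design matrix: a bias column plus one Gaussian kernel column per training point
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    Phi = np.hstack([np.ones((len(X), 1)), np.exp(-sq / (2 * sigma**2))])
    N = len(X)
    alpha = np.ones(Phi.shape[1])   # per-weight precisions
    beta = 1.0                      # noise precision
    keep = np.arange(Phi.shape[1])  # surviving basis functions (0 = bias)
    mu = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        # Posterior over the surviving weights
        Sigma = np.linalg.inv(np.diag(alpha) + beta * Phi.T @ Phi)
        mu = beta * Sigma @ Phi.T @ t
        # Fixed-point updates: gamma_m measures how well-determined w_m is
        gamma = 1 - alpha * np.diag(Sigma)
        alpha = gamma / (mu**2 + 1e-12)
        beta = (N - gamma.sum()) / (np.sum((t - Phi @ mu) ** 2) + 1e-12)
        # Prune weights whose precision blows up (alpha_m -> infinity, w_m -> 0)
        mask = alpha < prune
        alpha, mu, Phi, keep = alpha[mask], mu[mask], Phi[:, mask], keep[mask]
    return mu, keep  # posterior mean weights and relevance-vector indices
```

In this sketch, a surviving index m > 0 in keep marks training point m − 1 as a relevance vector; index 0 is the bias term.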
Comparisons - Regression
[Slide figures: RVM fit, with a one-standard-deviation band of the predictive distribution, versus SVM fit on the same regression data]
Comparison - Classification
[Slide figures: RVM versus SVM decision boundaries on the same classification data]
Comparisons
- RVM are much sparser and make probabilistic predictions
- RVM give better generalization in regression
- SVM give better generalization in classification
- RVM are computationally demanding during learning
Outline Introduction to kernel methods Support vector machines (SVM) Relevance vector machines (RVM) Applications Conclusions
Applications (1)
SVM for face detection
Applications (2)
Marti Hearst, "Support Vector Machines", 1998
Applications (3)
In feature-matching-based object tracking, SVM are used to detect false feature matches
Weiyu Zhu et al., "Tracking of Object with SVM Regression", 2001
Applications (4)
Recovering 3D human poses by RVM
A. Agarwal and B. Triggs, "3D Human Pose from Silhouettes by Relevance Vector Regression", 2004
Outline Introduction to kernel methods Support vector machines (SVM) Relevance vector machines (RVM) Applications Conclusions
Conclusions
The SVM is a learning machine based on kernel methods and generalization theory, which can perform binary classification and real-valued function approximation tasks
The RVM has the same model form as the SVM but provides probabilistic predictions and sparser solutions
References
www.support-vector.net
N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, 2000
M. E. Tipping, "Sparse Bayesian Learning and the Relevance Vector Machine," Journal of Machine Learning Research, 2001
Underfitting and Overfitting
[Slide figure: an underfitting model (too simple) versus an overfitting model (too complex), evaluated on new data]
Adapted from http://www.dtreg.com/svm.htm