
CptS 570 – Machine Learning
School of EECS

Washington State University

CptS 570 - Machine Learning 1

Or, support vector machine (SVM)
Discriminant-based method
◦ Learn class boundaries
Support vector consists of the examples closest to the boundary
Kernel computes similarity between examples
◦ Maps instance space to a higher-dimensional space where (hopefully) linear models suffice
Choosing the right kernel is crucial
Kernel machines among best-performing learners

CptS 570 - Machine Learning 2

Likely to underfit using only hyperplanes
But we can map the data to a nonlinear space and use hyperplanes there
◦ Φ: Rd → F
◦ x → Φ(x)

CptS 570 - Machine Learning 3

[Figure: the mapping Φ from the input space to the feature space F]

Note we want ≥ +1, not ≥ 0
Want instances some distance from the hyperplane

CptS 570 - Machine Learning 4

X = {xt, rt} where rt = +1 if xt ∈ C1 and rt = −1 if xt ∈ C2

Find w and w0 such that

wTxt + w0 ≥ +1 for rt = +1
wTxt + w0 ≤ −1 for rt = −1

which can be rewritten as

rt(wTxt + w0) ≥ +1

Distance from instance xt to the hyperplane wTx + w0 = 0 is

Distance from hyperplane to closest instances is the margin

CptS 570 - Machine Learning 5

|wTxt + w0| / ‖w‖   or   rt(wTxt + w0) / ‖w‖

[Figure: separating hyperplane, with the margin as the distance to the closest instances]
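As an aside, a minimal numeric sketch of this distance computation (NumPy; the hyperplane and instance below are made up purely for illustration):

```python
import numpy as np

# Hypothetical hyperplane w^T x + w0 = 0 and a labeled instance (x_t, r_t).
w = np.array([2.0, 1.0])
w0 = -1.0
x_t = np.array([1.5, 2.0])
r_t = +1

# Unsigned distance from x_t to the hyperplane: |w^T x_t + w0| / ||w||
dist = abs(w @ x_t + w0) / np.linalg.norm(w)

# Signed version r^t (w^T x_t + w0) / ||w||: positive when x_t lies on the correct side.
signed_dist = r_t * (w @ x_t + w0) / np.linalg.norm(w)

print(dist, signed_dist)
```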

Optimal separating hyperplane is the one maximizing the margin

We want to choose w maximizing ρ such that

Infinite number of solutions by scaling w
Thus, we choose the solution minimizing ‖w‖

CptS 570 - Machine Learning 6

rt(wTxt + w0) / ‖w‖ ≥ ρ, ∀t

min ½‖w‖² subject to rt(wTxt + w0) ≥ +1, ∀t

Quadratic optimization problem with complexity polynomial in d (#features)

Kernel will eventually map d-dimensional space to higher-dimensional space

Prefer complexity not based on #dimensions

CptS 570 - Machine Learning 7

min ½‖w‖² subject to rt(wTxt + w0) ≥ +1, ∀t

Convert optimization problem to depend on the number of training examples N (not d)
◦ Still polynomial in N
But optimization will depend only on the closest examples (the support vectors)
◦ Typically ≪ N

CptS 570 - Machine Learning 8

Rewrite quadratic optimization problem using Lagrange multipliers αt, 1 ≤ t ≤ N

Minimize Lp

CptS 570 - Machine Learning 9

min ½‖w‖² subject to rt(wTxt + w0) ≥ +1, ∀t

Lp = ½‖w‖² − Σt αt [rt(wTxt + w0) − 1]
   = ½‖w‖² − Σt αt rt(wTxt + w0) + Σt αt

Equivalently, we can maximize Lp subject to the constraints:

Plugging these into Lp …

CptS 570 - Machine Learning 10

∂Lp/∂w = 0 ⇒ w = Σt αt rt xt
∂Lp/∂w0 = 0 ⇒ Σt αt rt = 0

Maximize Ld with respect to αt only
Complexity O(N³)

CptS 570 - Machine Learning 11

Ld = ½ wTw − wT Σt αt rt xt − w0 Σt αt rt + Σt αt
   = −½ wTw + Σt αt
   = −½ Σt Σs αt αs rt rs (xt)T xs + Σt αt

subject to Σt αt rt = 0 and αt ≥ 0, ∀t

Most αt = 0
◦ I.e., rt(wTxt + w0) > 1 (xt lie outside the margin)
Support vectors: xt such that αt > 0
◦ I.e., rt(wTxt + w0) = 1 (xt lie on the margin)
w = Σt αt rt xt
w0 = rt − wTxt for any support vector xt
◦ Typically average over all support vectors
Resulting discriminant is called the support vector machine (SVM)
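A small illustrative sketch of this recovery step, assuming the dual α's have already been obtained somehow (the data and α values below are placeholders, not real solver output):

```python
import numpy as np

# Placeholder training data and dual solution (the alphas would come from the QP solver).
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 1.0], [4.0, 4.0]])   # x^t
r = np.array([+1, +1, -1, -1])                                   # r^t
alpha = np.array([0.7, 0.0, 0.7, 0.0])                           # alpha^t (hypothetical)

sv = alpha > 1e-8                       # support vectors: alpha^t > 0
w = (alpha[sv] * r[sv]) @ X[sv]         # w = sum_t alpha^t r^t x^t
w0 = np.mean(r[sv] - X[sv] @ w)         # w0 = r^t - w^T x^t, averaged over support vectors

def g(x):
    """SVM discriminant g(x) = w^T x + w0."""
    return x @ w + w0
```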

CptS 570 - Machine Learning 12

CptS 570 - Machine Learning 13

[Figure: optimal separating hyperplane and margin; circled instances (O) are the support vectors]

Data not linearly separable
Find hyperplane with least error
Define slack variables ξt ≥ 0 storing deviation from the margin

CptS 570 - Machine Learning 14

rt(wTxt + w0) ≥ 1 − ξt

(a) Correctly classified example far from margin (ξt = 0)

(b) Correctly classified example on the margin (ξt = 0)

(c) Correctly classified example, but inside the margin (0 < ξt < 1)

(d) Incorrectly classified example (ξt ≥ 1)

Soft error = Σt ξt

CptS 570 - Machine Learning 15
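A brief sketch of how these slack values could be computed for a given hyperplane (the w, w0, and data below are made up for illustration; the max(0, ·) form follows from ξt ≥ 0 and the constraint above):

```python
import numpy as np

# Hypothetical hyperplane and labeled instances.
w, w0 = np.array([1.0, -1.0]), 0.5
X = np.array([[2.0, 0.0], [0.6, 0.0], [0.2, 0.0], [-1.0, 0.0]])
r = np.array([+1, +1, +1, +1])

# Slack: xi^t = max(0, 1 - r^t (w^T x^t + w0)), covering cases (a)-(d) above.
xi = np.maximum(0.0, 1.0 - r * (X @ w + w0))

soft_error = xi.sum()   # soft error = sum_t xi^t
print(xi, soft_error)
```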

CptS 570 - Machine Learning 16

[Figure: soft-margin hyperplane; circled instances (O) are the support vectors]

Lagrangian equation with slack variables

C is the penalty factor
μt ≥ 0 is a new set of Lagrange multipliers
Want to minimize Lp

CptS 570 - Machine Learning 17

Lp = ½‖w‖² + C Σt ξt − Σt αt [rt(wTxt + w0) − 1 + ξt] − Σt μt ξt

Minimize Lp by setting derivatives to zero

Plugging these into Lp yields the dual Ld
Maximize Ld with respect to αt

CptS 570 - Machine Learning 18

∂Lp/∂w = 0 ⇒ w = Σt αt rt xt
∂Lp/∂w0 = 0 ⇒ Σt αt rt = 0
∂Lp/∂ξt = 0 ⇒ C − αt − μt = 0

Quadratic optimization problem
Support vectors have αt > 0
◦ Examples on the margin: αt < C
◦ Examples inside the margin or misclassified: αt = C

CptS 570 - Machine Learning 19

Ld = −½ Σt Σs αt αs rt rs (xt)T xs + Σt αt

subject to Σt αt rt = 0 and 0 ≤ αt ≤ C, ∀t

C is a regularization parameter
◦ High C → high penalty for non-separable examples (overfit)
◦ Low C → less penalty (underfit)
◦ Determine using a validation set (C = 1 typical)

CptS 570 - Machine Learning 20

Ld = −½ Σt Σs αt αs rt rs (xt)T xs + Σt αt

subject to Σt αt rt = 0 and 0 ≤ αt ≤ C, ∀t
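As a rough sketch of how C might be chosen in practice with an off-the-shelf solver (scikit-learn's SVC, shown here on synthetic data purely for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data standing in for a real training set.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(100, 2) + [2, 2], rng.randn(100, 2) - [2, 2]])
y = np.array([+1] * 100 + [-1] * 100)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Higher C penalizes margin violations more (risking overfit); lower C is more tolerant.
for C in [0.01, 0.1, 1, 10, 100]:
    clf = SVC(kernel="linear", C=C).fit(X_tr, y_tr)
    print(C, clf.score(X_val, y_val))
```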

To use previous approaches, data must be near linearly separable

If not, perhaps a transformation φ(x) will help

φ(x) are basis functions

CptS 570 - Machine Learning 21


Transform d-dimensional x space to k-dimensional z space using basis functions φ(x)

z = φ(x) where zj = φj(x), j = 1,…,k
Instead of w0, assume z1 = φ1(x) ≡ 1

CptS 570 - Machine Learning 22

g(z) = wTz
g(x) = wTφ(x) = Σj=1..k wj φj(x)
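For concreteness, a small sketch of a discriminant that is linear in the transformed z-space (a hypothetical quadratic basis φ(x) = [1, x, x²] and placeholder weights are used here just as an example):

```python
import numpy as np

def phi(x):
    # Hypothetical basis functions: z1 = 1 (absorbs w0), z2 = x, z3 = x^2
    return np.array([1.0, x, x * x])

# g(x) = w^T phi(x) = sum_j w_j phi_j(x); the weights below are placeholders.
w = np.array([-1.0, 0.0, 1.0])

def g(x):
    return w @ phi(x)

# g(x) = x^2 - 1: nonlinear in the original x-space, but linear in z-space.
print(g(0.5), g(2.0))
```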

Replace inner product of basis functions φ(xt)Tφ(xs) with kernel function K(xt,xs)

CptS 570 - Machine Learning 23

Lp = ½‖w‖² + C Σt ξt − Σt αt [rt wTφ(xt) − 1 + ξt] − Σt μt ξt

Ld = −½ Σt Σs αt αs rt rs φ(xt)Tφ(xs) + Σt αt

subject to Σt αt rt = 0 and 0 ≤ αt ≤ C, ∀t

Ld = −½ Σt Σs αt αs rt rs K(xt, xs) + Σt αt

Kernel K(xt,xs) computes z-space product φ(xt)Tφ(xs) in x-space

Matrix of kernel values K, where Kts = K(xt,xs), called the Gram matrix

K should be symmetric and positive semidefinite

CptS 570 - Machine Learning 24

w = Σt αt rt zt = Σt αt rt φ(xt)

g(x) = wTφ(x) = Σt αt rt φ(xt)Tφ(x) = Σt αt rt K(xt, x)
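A small sketch of building the Gram matrix and checking these properties numerically (scikit-learn's precomputed-kernel interface is shown as one way to train directly on K; the data and kernel choice are arbitrary):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(1)
X = np.vstack([rng.randn(20, 2) + 2, rng.randn(20, 2) - 2])
y = np.array([+1] * 20 + [-1] * 20)

def kernel(a, b, s=1.0):
    # Gaussian kernel K(x^t, x^s) = exp(-||x^t - x^s||^2 / (2 s^2))
    return np.exp(-np.sum((a - b) ** 2) / (2 * s ** 2))

# Gram matrix K_ts = K(x^t, x^s)
K = np.array([[kernel(xt, xs) for xs in X] for xt in X])

print(np.allclose(K, K.T))                      # symmetric
print(np.min(np.linalg.eigvalsh(K)) >= -1e-8)   # eigenvalues >= 0 => positive semidefinite

clf = SVC(kernel="precomputed").fit(K, y)       # train directly on the Gram matrix
```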

Polynomial kernel of degree q

If q = 1, then use original features
For example, when q = 2 and d = 2:

CptS 570 - Machine Learning 25

K(xt, x) = (xTxt + 1)q

K(x, y) = (xTy + 1)²
        = (x1y1 + x2y2 + 1)²
        = 1 + 2x1y1 + 2x2y2 + 2x1x2y1y2 + x1²y1² + x2²y2²

which corresponds to the basis functions

φ(x) = [1, √2·x1, √2·x2, √2·x1x2, x1², x2²]T
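A quick numeric check of this identity, sketched in NumPy: the kernel value computed in x-space matches the explicit inner product of the basis expansions (the two test vectors are arbitrary):

```python
import numpy as np

def poly_kernel(x, y, q=2):
    # K(x, y) = (x^T y + 1)^q
    return (x @ y + 1.0) ** q

def phi2(x):
    # Degree-2 basis expansion for d=2: [1, sqrt(2)x1, sqrt(2)x2, sqrt(2)x1x2, x1^2, x2^2]
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, s * x1 * x2, x1 ** 2, x2 ** 2])

x = np.array([0.5, -1.0])
y = np.array([2.0, 0.3])

print(poly_kernel(x, y))          # kernel computed in x-space
print(phi2(x) @ phi2(y))          # same value via the explicit z-space inner product
```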

Polynomial kernel of degree 2

CptS 570 - Machine Learning 26

[Figure: decision boundary and margin for the degree-2 polynomial kernel; circled instances (O) are the support vectors]

Radial basis functions (Gaussian kernel)

xt is the center
s is the radius
Larger s implies smoother boundaries

CptS 570 - Machine Learning 27

K(xt, x) = exp(−‖x − xt‖² / (2s²))
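A tiny sketch of this kernel and the effect of the radius s (values chosen arbitrarily):

```python
import numpy as np

def gaussian_kernel(x, x_t, s=1.0):
    # K(x^t, x) = exp(-||x - x_t||^2 / (2 s^2))
    return np.exp(-np.linalg.norm(x - x_t) ** 2 / (2 * s ** 2))

x_t = np.array([0.0, 0.0])   # center
x = np.array([1.0, 1.0])

# Larger s => similarity decays more slowly with distance => smoother boundaries.
for s in [0.5, 1.0, 2.0]:
    print(s, gaussian_kernel(x, x_t, s))
```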

CptS 570 - Machine Learning 28

Sigmoidal functions

CptS 570 - Machine Learning 29

K(xt, x) = tanh(2xTxt + 1)

Kernel K(x,y) increases with similarity between x and y

Prior knowledge can be included in the kernel function

E.g., training examples are documents
◦ K(D1,D2) = # shared words
E.g., training examples are strings (e.g., DNA)
◦ K(S1,S2) = 1 / edit distance between S1 and S2
◦ Edit distance is the number of insertions, deletions, and/or substitutions needed to transform S1 into S2
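A minimal sketch of such domain-specific kernels, assuming the definitions above (the edit-distance helper is a straightforward dynamic-programming implementation written for illustration, with a guard for identical strings):

```python
def doc_kernel(d1, d2):
    # K(D1, D2) = number of shared words
    return len(set(d1.split()) & set(d2.split()))

def edit_distance(s1, s2):
    # Levenshtein distance: insertions, deletions, substitutions.
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (c1 != c2)))
        prev = cur
    return prev[-1]

def string_kernel(s1, s2):
    # K(S1, S2) = 1 / edit distance (identical strings treated as maximally similar)
    d = edit_distance(s1, s2)
    return 1.0 if d == 0 else 1.0 / d

print(doc_kernel("the cat sat", "the dog sat"))   # 2 shared words
print(string_kernel("ACGT", "AGGT"))              # edit distance 1 -> 1.0
```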

CptS 570 - Machine Learning 30

E.g., training examples are nodes in a graph (e.g., social network)

K(N1,N2) = 1 / length of shortest path connecting nodes

K(N1,N2) = # paths connecting nodes
Diffusion kernel

CptS 570 - Machine Learning 31

Training examples are graphs, not feature vectors
◦ E.g., carcinogenic vs. non-carcinogenic chemical structures
Compare substructures of graphs
◦ E.g., walks, paths, cycles, trees, subgraphs

K(G1,G2) = number of identical random walks in both graphs

K(G1,G2) = number of subgraphs shared by both graphs

CptS 570 - Machine Learning 32

Training data from multiple modalities (e.g., biometrics, social network, audio/visual)

Construct new kernels by combining simpler kernels

If K1(x,y) and K2(x,y) are valid kernels, and c > 0 is a constant, then the following are also valid kernels:

CptS 570 - Machine Learning 33

K(x, y) = cK1(x, y)
K(x, y) = K1(x, y) + K2(x, y)
K(x, y) = K1(x, y) K2(x, y)
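A short sketch of these closure rules at the Gram-matrix level (two simple base kernels are assumed just for illustration; the combined matrices could then be fed to a precomputed-kernel learner):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(5, 3)

K1 = X @ X.T                                  # linear kernel Gram matrix
K2 = (X @ X.T + 1.0) ** 2                     # polynomial (degree-2) kernel Gram matrix
c = 0.5

K_scaled = c * K1                             # c K1(x, y)
K_sum = K1 + K2                               # K1(x, y) + K2(x, y)
K_prod = K1 * K2                              # elementwise: K1(x, y) K2(x, y)

# Each combination stays symmetric and positive semidefinite.
for K in (K_scaled, K_sum, K_prod):
    print(np.allclose(K, K.T), np.min(np.linalg.eigvalsh(K)) >= -1e-8)
```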

Adaptive kernel combination

Learn both the αt's and the ηi's

CptS 570 - Machine Learning 34

K(x, y) = Σi=1..m ηi Ki(x, y)

Ld = Σt αt − ½ Σt Σs αt αs rt rs (Σi ηi Ki(xt, xs))

g(x) = Σt αt rt Σi ηi Ki(xt, x)

Learn K different kernel machines gi(x)
◦ Each uses one class as positive, remaining classes as negative
◦ Choose class i such that i = argmaxj gj(x)
◦ Best approach in practice

Learn K(K−1)/2 kernel machines
◦ Each uses one class as positive and another class as negative
◦ Easier (faster) learning per kernel machine

CptS 570 - Machine Learning 36

Learn all margins at once

◦ zt is the class index of xt

K*N variables to optimize (expensive)

CptS 570 - Machine Learning 37

min ½ Σi ‖wi‖² + C Σi Σt ξit

subject to wztTxt + wzt0 ≥ wiTxt + wi0 + 2 − ξit, ∀t, ∀i ≠ zt, with ξit ≥ 0

Normally, we would use squared error

For SVM, we use ε-sensitive loss
◦ Tolerate errors up to ε
◦ Errors beyond ε have only a linear effect

CptS 570 - Machine Learning 38

e(rt, f(xt)) = [rt − f(xt)]², where f(x) = wTx + w0

eε(rt, f(xt)) = 0 if |rt − f(xt)| < ε, and |rt − f(xt)| − ε otherwise
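A tiny sketch of the two losses as functions, compared on a few arbitrary residuals:

```python
import numpy as np

def eps_sensitive_loss(r, f_x, eps=0.5):
    # e_eps = 0 if |r - f(x)| < eps, else |r - f(x)| - eps
    d = np.abs(r - f_x)
    return np.where(d < eps, 0.0, d - eps)

def squared_loss(r, f_x):
    return (r - f_x) ** 2

residual = np.array([0.1, 0.4, 0.6, 2.0])     # |r - f(x)|
print(eps_sensitive_loss(residual, 0.0))      # small errors ignored, large ones grow linearly
print(squared_loss(residual, 0.0))            # squared error grows quadratically
```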

Use slack variables to account for deviations beyond ε
◦ ξt+ for positive deviations
◦ ξt− for negative deviations

CptS 570 - Machine Learning 39

min ½‖w‖² + C Σt (ξt+ + ξt−)

subject to
rt − (wTxt + w0) ≤ ε + ξt+
(wTxt + w0) − rt ≤ ε + ξt−
ξt+, ξt− ≥ 0

Non-support vectors (inside margin): αt+ = αt− = 0
Support vectors
◦ ⊗ on the margin: 0 < αt+ < C or 0 < αt− < C
◦ ⊠ outside margin (outlier): αt+ = C or αt− = C

CptS 570 - Machine Learning 40

Ld = −½ Σt Σs (αt+ − αt−)(αs+ − αs−)(xt)Txs − ε Σt (αt+ + αt−) + Σt rt (αt+ − αt−)

subject to 0 ≤ αt+ ≤ C, 0 ≤ αt− ≤ C, and Σt (αt+ − αt−) = 0

CptS 570 - Machine Learning 41

Fitted line f(x) is weighted sum of support vectors

Average w0 over:

CptS 570 - Machine Learning 42

rt = (wTxt + w0) + ε, if 0 < αt+ < C
rt = (wTxt + w0) − ε, if 0 < αt− < C

w = Σt (αt+ − αt−) xt

f(x) = wTx + w0 = Σt (αt+ − αt−)(xt)Tx + w0

CptS 570 - Machine Learning 43

Ld = −½ Σt Σs (αt+ − αt−)(αs+ − αs−) K(xt, xs) − ε Σt (αt+ + αt−) + Σt rt (αt+ − αt−)

subject to 0 ≤ αt+ ≤ C, 0 ≤ αt− ≤ C, and Σt (αt+ − αt−) = 0

f(x) = wTφ(x) + w0 = Σt (αt+ − αt−) K(xt, x) + w0
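As a practical sketch of this support vector regression machinery with an existing solver (scikit-learn's SVR on synthetic 1-D data; the kernel, C, and epsilon values are arbitrary):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, size=(60, 1)), axis=0)
r = np.sin(X).ravel() + 0.1 * rng.randn(60)

# epsilon is the width of the insensitive tube; C penalizes deviations beyond it.
model = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, r)

print(len(model.support_))             # only the support vectors determine f(x)
print(model.predict(np.array([[0.5]])))
```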

[Figure: kernel regression fit with a polynomial (quadratic) kernel]
[Figure: kernel regression fit with a Gaussian kernel]

CptS 570 - Machine Learning 44

Sequential Minimal Optimization (SMO)
Classification: SMO
Regression: SMOreg
Kernels
◦ Polynomial
◦ RBF
◦ String

CptS 570 - Machine Learning 45

Seek optimal separating hyperplane
Support vector machine (SVM) finds the hyperplane using only the closest examples
Kernel function allows SVM to operate in higher dimensions
Kernel regression
Choosing the correct kernel is crucial
Kernel machines among best-performing learners

CptS 570 - Machine Learning 46
