
CptS 570 – Machine Learning
School of EECS

Washington State University

CptS 570 - Machine Learning 1

Or, support vector machine (SVM)
Discriminant-based method
◦ Learn class boundaries
Support vector consists of the examples closest to the boundary
Kernel computes similarity between examples
◦ Maps instance space to a higher-dimensional space where (hopefully) linear models suffice
Choosing the right kernel is crucial
Kernel machines among best-performing learners

CptS 570 - Machine Learning 2

Likely to underfit using only hyperplanes
But we can map the data to a nonlinear space and use hyperplanes there
◦ Φ: Rd → F
◦ x → Φ(x)

CptS 570 - Machine Learning 3

[Figure: the mapping Φ from the input space to the feature space F]

Note we want ≥ +1, not ≥ 0
Want instances some distance from the hyperplane

CptS 570 - Machine Learning 4

X = {xt, rt} where rt = +1 if xt ∈ C1 and rt = −1 if xt ∈ C2

Find w and w0 such that

wTxt + w0 ≥ +1 for rt = +1
wTxt + w0 ≤ −1 for rt = −1

which can be rewritten as

rt(wTxt + w0) ≥ +1

Distance from instance xt to the hyperplane wTx + w0 = 0 is

Distance from hyperplane to closest instances is the margin

CptS 570 - Machine Learning 5

|wTxt + w0| / ‖w‖   or   rt(wTxt + w0) / ‖w‖

[Figure: separating hyperplane, with the margin as the distance to the closest instances]
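As an aside, a minimal numeric sketch of this distance computation (NumPy; the hyperplane and instance below are made up purely for illustration):

```python
import numpy as np

# Hypothetical hyperplane w^T x + w0 = 0 and a labeled instance (x_t, r_t).
w = np.array([2.0, 1.0])
w0 = -1.0
x_t = np.array([1.5, 2.0])
r_t = +1

# Unsigned distance from x_t to the hyperplane: |w^T x_t + w0| / ||w||
dist = abs(w @ x_t + w0) / np.linalg.norm(w)

# Signed version r^t (w^T x_t + w0) / ||w||: positive when x_t lies on the correct side.
signed_dist = r_t * (w @ x_t + w0) / np.linalg.norm(w)

print(dist, signed_dist)
```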

Optimal separating hyperplane is the one maximizing the margin

We want to choose w maximizing ρ such that

Infinite number of solutions by scaling w
Thus, we choose the solution minimizing ‖w‖

CptS 570 - Machine Learning 6

rt(wTxt + w0) / ‖w‖ ≥ ρ, ∀t

min ½‖w‖² subject to rt(wTxt + w0) ≥ +1, ∀t

Quadratic optimization problem with complexity polynomial in d (#features)

Kernel will eventually map d-dimensional space to higher-dimensional space

Prefer complexity not based on #dimensions

CptS 570 - Machine Learning 7

min ½‖w‖² subject to rt(wTxt + w0) ≥ +1, ∀t

Convert optimization problem to depend on the number of training examples N (not d)
◦ Still polynomial in N
But optimization will depend only on the closest examples (the support vectors)
◦ Typically ≪ N

CptS 570 - Machine Learning 8

Rewrite quadratic optimization problem using Lagrange multipliers αt, 1 ≤ t ≤ N

Minimize Lp

CptS 570 - Machine Learning 9

min ½‖w‖² subject to rt(wTxt + w0) ≥ +1, ∀t

Lp = ½‖w‖² − Σt αt [rt(wTxt + w0) − 1]
   = ½‖w‖² − Σt αt rt(wTxt + w0) + Σt αt

Equivalently, we can maximize Lp subject to the constraints:

Plugging these into Lp …

CptS 570 - Machine Learning 10

∂Lp/∂w = 0 ⇒ w = Σt αt rt xt
∂Lp/∂w0 = 0 ⇒ Σt αt rt = 0

Maximize Ld with respect to αt only
Complexity O(N³)

CptS 570 - Machine Learning 11

Ld = ½ wTw − wT Σt αt rt xt − w0 Σt αt rt + Σt αt
   = −½ wTw + Σt αt
   = −½ Σt Σs αt αs rt rs (xt)T xs + Σt αt

subject to Σt αt rt = 0 and αt ≥ 0, ∀t

Most αt = 0
◦ I.e., rt(wTxt + w0) > 1 (xt lie outside the margin)
Support vectors: xt such that αt > 0
◦ I.e., rt(wTxt + w0) = 1 (xt lie on the margin)
w = Σt αt rt xt
w0 = rt − wTxt for any support vector xt
◦ Typically average over all support vectors
Resulting discriminant is called the support vector machine (SVM)
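A small illustrative sketch of this recovery step, assuming the dual α's have already been obtained somehow (the data and α values below are placeholders, not real solver output):

```python
import numpy as np

# Placeholder training data and dual solution (the alphas would come from the QP solver).
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 1.0], [4.0, 4.0]])   # x^t
r = np.array([+1, +1, -1, -1])                                   # r^t
alpha = np.array([0.7, 0.0, 0.7, 0.0])                           # alpha^t (hypothetical)

sv = alpha > 1e-8                       # support vectors: alpha^t > 0
w = (alpha[sv] * r[sv]) @ X[sv]         # w = sum_t alpha^t r^t x^t
w0 = np.mean(r[sv] - X[sv] @ w)         # w0 = r^t - w^T x^t, averaged over support vectors

def g(x):
    """SVM discriminant g(x) = w^T x + w0."""
    return x @ w + w0
```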

CptS 570 - Machine Learning 12

CptS 570 - Machine Learning 13

[Figure: optimal separating hyperplane and margin; circled instances (O) are the support vectors]

Data not linearly separable
Find hyperplane with least error
Define slack variables ξt ≥ 0 storing deviation from the margin

CptS 570 - Machine Learning 14

rt(wTxt + w0) ≥ 1 − ξt

(a) Correctly classified example far from margin (ξt = 0)

(b) Correctly classified example on the margin (ξt = 0)

(c) Correctly classified example, but inside the margin (0 < ξt < 1)

(d) Incorrectly classified example (ξt ≥ 1)

Soft error = Σt ξt

CptS 570 - Machine Learning 15
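A brief sketch of how these slack values could be computed for a given hyperplane (the w, w0, and data below are made up for illustration; the max(0, ·) form follows from ξt ≥ 0 and the constraint above):

```python
import numpy as np

# Hypothetical hyperplane and labeled instances.
w, w0 = np.array([1.0, -1.0]), 0.5
X = np.array([[2.0, 0.0], [0.6, 0.0], [0.2, 0.0], [-1.0, 0.0]])
r = np.array([+1, +1, +1, +1])

# Slack: xi^t = max(0, 1 - r^t (w^T x^t + w0)), covering cases (a)-(d) above.
xi = np.maximum(0.0, 1.0 - r * (X @ w + w0))

soft_error = xi.sum()   # soft error = sum_t xi^t
print(xi, soft_error)
```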

CptS 570 - Machine Learning 16

[Figure: soft-margin hyperplane; circled instances (O) are the support vectors]

Lagrangian equation with slack variables

C is the penalty factor
μt ≥ 0 is a new set of Lagrange multipliers
Want to minimize Lp

CptS 570 - Machine Learning 17

Lp = ½‖w‖² + C Σt ξt − Σt αt [rt(wTxt + w0) − 1 + ξt] − Σt μt ξt

Minimize Lp by setting derivatives to zero

Plugging these into Lp yields the dual Ld
Maximize Ld with respect to αt

CptS 570 - Machine Learning 18

∂Lp/∂w = 0 ⇒ w = Σt αt rt xt
∂Lp/∂w0 = 0 ⇒ Σt αt rt = 0
∂Lp/∂ξt = 0 ⇒ C − αt − μt = 0

Quadratic optimization problem
Support vectors have αt > 0
◦ Examples on the margin: αt < C
◦ Examples inside the margin or misclassified: αt = C

CptS 570 - Machine Learning 19

Ld = −½ Σt Σs αt αs rt rs (xt)T xs + Σt αt

subject to Σt αt rt = 0 and 0 ≤ αt ≤ C, ∀t

C is a regularization parameter
◦ High C → high penalty for non-separable examples (overfit)
◦ Low C → less penalty (underfit)
◦ Determine using a validation set (C = 1 typical)

CptS 570 - Machine Learning 20

Ld = −½ Σt Σs αt αs rt rs (xt)T xs + Σt αt

subject to Σt αt rt = 0 and 0 ≤ αt ≤ C, ∀t
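As a rough sketch of how C might be chosen in practice with an off-the-shelf solver (scikit-learn's SVC, shown here on synthetic data purely for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data standing in for a real training set.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(100, 2) + [2, 2], rng.randn(100, 2) - [2, 2]])
y = np.array([+1] * 100 + [-1] * 100)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Higher C penalizes margin violations more (risking overfit); lower C is more tolerant.
for C in [0.01, 0.1, 1, 10, 100]:
    clf = SVC(kernel="linear", C=C).fit(X_tr, y_tr)
    print(C, clf.score(X_val, y_val))
```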

To use previous approaches, data must be near linearly separable

If not, perhaps a transformation φ(x) will help

φ(x) are basis functions

CptS 570 - Machine Learning 21


Transform d-dimensional x space to k-dimensional z space using basis functions φ(x)

z = φ(x) where zj = φj(x), j = 1,…,k
Instead of w0, assume z1 = φ1(x) ≡ 1

CptS 570 - Machine Learning 22

g(z) = wTz
g(x) = wTφ(x) = Σj=1..k wj φj(x)
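For concreteness, a small sketch of a discriminant that is linear in the transformed z-space (a hypothetical quadratic basis φ(x) = [1, x, x²] and placeholder weights are used here just as an example):

```python
import numpy as np

def phi(x):
    # Hypothetical basis functions: z1 = 1 (absorbs w0), z2 = x, z3 = x^2
    return np.array([1.0, x, x * x])

# g(x) = w^T phi(x) = sum_j w_j phi_j(x); the weights below are placeholders.
w = np.array([-1.0, 0.0, 1.0])

def g(x):
    return w @ phi(x)

# g(x) = x^2 - 1: nonlinear in the original x-space, but linear in z-space.
print(g(0.5), g(2.0))
```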

Replace inner product of basis functions φ(xt)Tφ(xs) with kernel function K(xt,xs)

CptS 570 - Machine Learning 23

Lp = ½‖w‖² + C Σt ξt − Σt αt [rt wTφ(xt) − 1 + ξt] − Σt μt ξt

Ld = −½ Σt Σs αt αs rt rs φ(xt)Tφ(xs) + Σt αt

subject to Σt αt rt = 0 and 0 ≤ αt ≤ C, ∀t

Ld = −½ Σt Σs αt αs rt rs K(xt, xs) + Σt αt

Kernel K(xt,xs) computes z-space product φ(xt)Tφ(xs) in x-space

Matrix of kernel values K, where Kts = K(xt,xs), called the Gram matrix

K should be symmetric and positive semidefinite

CptS 570 - Machine Learning 24

w = Σt αt rt zt = Σt αt rt φ(xt)

g(x) = wTφ(x) = Σt αt rt φ(xt)Tφ(x) = Σt αt rt K(xt, x)
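A small sketch of building the Gram matrix and checking these properties numerically (scikit-learn's precomputed-kernel interface is shown as one way to train directly on K; the data and kernel choice are arbitrary):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(1)
X = np.vstack([rng.randn(20, 2) + 2, rng.randn(20, 2) - 2])
y = np.array([+1] * 20 + [-1] * 20)

def kernel(a, b, s=1.0):
    # Gaussian kernel K(x^t, x^s) = exp(-||x^t - x^s||^2 / (2 s^2))
    return np.exp(-np.sum((a - b) ** 2) / (2 * s ** 2))

# Gram matrix K_ts = K(x^t, x^s)
K = np.array([[kernel(xt, xs) for xs in X] for xt in X])

print(np.allclose(K, K.T))                      # symmetric
print(np.min(np.linalg.eigvalsh(K)) >= -1e-8)   # eigenvalues >= 0 => positive semidefinite

clf = SVC(kernel="precomputed").fit(K, y)       # train directly on the Gram matrix
```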

Polynomial kernel of degree q

If q = 1, then use original features
For example, when q = 2 and d = 2:

CptS 570 - Machine Learning 25

K(xt, x) = (xTxt + 1)q

K(x, y) = (xTy + 1)²
        = (x1y1 + x2y2 + 1)²
        = 1 + 2x1y1 + 2x2y2 + 2x1x2y1y2 + x1²y1² + x2²y2²

which corresponds to the basis functions

φ(x) = [1, √2·x1, √2·x2, √2·x1x2, x1², x2²]T
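A quick numeric check of this identity, sketched in NumPy: the kernel value computed in x-space matches the explicit inner product of the basis expansions (the two test vectors are arbitrary):

```python
import numpy as np

def poly_kernel(x, y, q=2):
    # K(x, y) = (x^T y + 1)^q
    return (x @ y + 1.0) ** q

def phi2(x):
    # Degree-2 basis expansion for d=2: [1, sqrt(2)x1, sqrt(2)x2, sqrt(2)x1x2, x1^2, x2^2]
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, s * x1 * x2, x1 ** 2, x2 ** 2])

x = np.array([0.5, -1.0])
y = np.array([2.0, 0.3])

print(poly_kernel(x, y))          # kernel computed in x-space
print(phi2(x) @ phi2(y))          # same value via the explicit z-space inner product
```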

Polynomial kernel of degree 2

CptS 570 - Machine Learning 26

[Figure: decision boundary and margin for the degree-2 polynomial kernel; circled instances (O) are the support vectors]

Radial basis functions (Gaussian kernel)

xt is the center
s is the radius
Larger s implies smoother boundaries

CptS 570 - Machine Learning 27

K(xt, x) = exp(−‖x − xt‖² / (2s²))
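A tiny sketch of this kernel and the effect of the radius s (values chosen arbitrarily):

```python
import numpy as np

def gaussian_kernel(x, x_t, s=1.0):
    # K(x^t, x) = exp(-||x - x_t||^2 / (2 s^2))
    return np.exp(-np.linalg.norm(x - x_t) ** 2 / (2 * s ** 2))

x_t = np.array([0.0, 0.0])   # center
x = np.array([1.0, 1.0])

# Larger s => similarity decays more slowly with distance => smoother boundaries.
for s in [0.5, 1.0, 2.0]:
    print(s, gaussian_kernel(x, x_t, s))
```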

CptS 570 - Machine Learning 28

Sigmoidal functions

CptS 570 - Machine Learning 29

K(xt, x) = tanh(2xTxt + 1)

Kernel K(x,y) increases with similarity between x and y

Prior knowledge can be included in the kernel function

E.g., training examples are documents
◦ K(D1,D2) = # shared words
E.g., training examples are strings (e.g., DNA)
◦ K(S1,S2) = 1 / edit distance between S1 and S2
◦ Edit distance is the number of insertions, deletions, and/or substitutions needed to transform S1 into S2
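A minimal sketch of such domain-specific kernels, assuming the definitions above (the edit-distance helper is a straightforward dynamic-programming implementation written for illustration, with a guard for identical strings):

```python
def doc_kernel(d1, d2):
    # K(D1, D2) = number of shared words
    return len(set(d1.split()) & set(d2.split()))

def edit_distance(s1, s2):
    # Levenshtein distance: insertions, deletions, substitutions.
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (c1 != c2)))
        prev = cur
    return prev[-1]

def string_kernel(s1, s2):
    # K(S1, S2) = 1 / edit distance (identical strings treated as maximally similar)
    d = edit_distance(s1, s2)
    return 1.0 if d == 0 else 1.0 / d

print(doc_kernel("the cat sat", "the dog sat"))   # 2 shared words
print(string_kernel("ACGT", "AGGT"))              # edit distance 1 -> 1.0
```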

CptS 570 - Machine Learning 30

E.g., training examples are nodes in a graph (e.g., social network)

K(N1,N2) = 1 / length of shortest path connecting nodes

K(N1,N2) = # paths connecting nodes
Diffusion kernel

CptS 570 - Machine Learning 31

Training examples are graphs, not feature vectors
◦ E.g., carcinogenic vs. non-carcinogenic chemical structures
Compare substructures of graphs
◦ E.g., walks, paths, cycles, trees, subgraphs

K(G1,G2) = number of identical random walks in both graphs

K(G1,G2) = number of subgraphs shared by both graphs

CptS 570 - Machine Learning 32

Training data from multiple modalities (e.g., biometrics, social network, audio/visual)

Construct new kernels by combining simpler kernels

If K1(x,y) and K2(x,y) are valid kernels, and c > 0 is a constant, then the following are also valid kernels:

CptS 570 - Machine Learning 33

K(x, y) = cK1(x, y)
K(x, y) = K1(x, y) + K2(x, y)
K(x, y) = K1(x, y) K2(x, y)
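A short sketch of these closure rules at the Gram-matrix level (two simple base kernels are assumed just for illustration; the combined matrices could then be fed to a precomputed-kernel learner):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(5, 3)

K1 = X @ X.T                                  # linear kernel Gram matrix
K2 = (X @ X.T + 1.0) ** 2                     # polynomial (degree-2) kernel Gram matrix
c = 0.5

K_scaled = c * K1                             # c K1(x, y)
K_sum = K1 + K2                               # K1(x, y) + K2(x, y)
K_prod = K1 * K2                              # elementwise: K1(x, y) K2(x, y)

# Each combination stays symmetric and positive semidefinite.
for K in (K_scaled, K_sum, K_prod):
    print(np.allclose(K, K.T), np.min(np.linalg.eigvalsh(K)) >= -1e-8)
```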

Adaptive kernel combination

Learn both the αt's and the ηi's

CptS 570 - Machine Learning 34

K(x, y) = Σi=1..m ηi Ki(x, y)

Ld = Σt αt − ½ Σt Σs αt αs rt rs (Σi ηi Ki(xt, xs))

g(x) = Σt αt rt Σi ηi Ki(xt, x)

Learn K different kernel machines gi(x)
◦ Each uses one class as positive, remaining classes as negative
◦ Choose class i such that i = argmaxj gj(x)
◦ Best approach in practice

Learn K(K−1)/2 kernel machines
◦ Each uses one class as positive and another class as negative
◦ Easier (faster) learning per kernel machine

CptS 570 - Machine Learning 36

Learn all margins at once

◦ zt is the class index of xt

K*N variables to optimize (expensive)

CptS 570 - Machine Learning 37

min ½ Σi ‖wi‖² + C Σi Σt ξit

subject to wztTxt + wzt0 ≥ wiTxt + wi0 + 2 − ξit, ∀t, ∀i ≠ zt, with ξit ≥ 0

Normally, we would use squared error

For SVM, we use ε-sensitive loss
◦ Tolerate errors up to ε
◦ Errors beyond ε have only a linear effect

CptS 570 - Machine Learning 38

e(rt, f(xt)) = [rt − f(xt)]², where f(x) = wTx + w0

eε(rt, f(xt)) = 0 if |rt − f(xt)| < ε, and |rt − f(xt)| − ε otherwise
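A tiny sketch of the two losses as functions, compared on a few arbitrary residuals:

```python
import numpy as np

def eps_sensitive_loss(r, f_x, eps=0.5):
    # e_eps = 0 if |r - f(x)| < eps, else |r - f(x)| - eps
    d = np.abs(r - f_x)
    return np.where(d < eps, 0.0, d - eps)

def squared_loss(r, f_x):
    return (r - f_x) ** 2

residual = np.array([0.1, 0.4, 0.6, 2.0])     # |r - f(x)|
print(eps_sensitive_loss(residual, 0.0))      # small errors ignored, large ones grow linearly
print(squared_loss(residual, 0.0))            # squared error grows quadratically
```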

Use slack variables to account for deviations beyond ε
◦ ξt+ for positive deviations
◦ ξt− for negative deviations

CptS 570 - Machine Learning 39

min ½‖w‖² + C Σt (ξt+ + ξt−)

subject to
rt − (wTxt + w0) ≤ ε + ξt+
(wTxt + w0) − rt ≤ ε + ξt−
ξt+, ξt− ≥ 0

Non-support vectors (inside margin): αt+ = αt− = 0
Support vectors
◦ ⊗ on the margin: 0 < αt+ < C or 0 < αt− < C
◦ ⊠ outside margin (outlier): αt+ = C or αt− = C

CptS 570 - Machine Learning 40

Ld = −½ Σt Σs (αt+ − αt−)(αs+ − αs−)(xt)Txs − ε Σt (αt+ + αt−) + Σt rt (αt+ − αt−)

subject to 0 ≤ αt+ ≤ C, 0 ≤ αt− ≤ C, and Σt (αt+ − αt−) = 0

CptS 570 - Machine Learning 41

Fitted line f(x) is weighted sum of support vectors

Average w0 over:

CptS 570 - Machine Learning 42

rt = (wTxt + w0) + ε, if 0 < αt+ < C
rt = (wTxt + w0) − ε, if 0 < αt− < C

w = Σt (αt+ − αt−) xt

f(x) = wTx + w0 = Σt (αt+ − αt−)(xt)Tx + w0

CptS 570 - Machine Learning 43

Ld = −½ Σt Σs (αt+ − αt−)(αs+ − αs−) K(xt, xs) − ε Σt (αt+ + αt−) + Σt rt (αt+ − αt−)

subject to 0 ≤ αt+ ≤ C, 0 ≤ αt− ≤ C, and Σt (αt+ − αt−) = 0

f(x) = wTφ(x) + w0 = Σt (αt+ − αt−) K(xt, x) + w0
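As a practical sketch of this support vector regression machinery with an existing solver (scikit-learn's SVR on synthetic 1-D data; the kernel, C, and epsilon values are arbitrary):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, size=(60, 1)), axis=0)
r = np.sin(X).ravel() + 0.1 * rng.randn(60)

# epsilon is the width of the insensitive tube; C penalizes deviations beyond it.
model = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, r)

print(len(model.support_))             # only the support vectors determine f(x)
print(model.predict(np.array([[0.5]])))
```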

[Figure: kernel regression fit with a polynomial (quadratic) kernel]
[Figure: kernel regression fit with a Gaussian kernel]

CptS 570 - Machine Learning 44

Sequential Minimal Optimization (SMO)
Classification: SMO
Regression: SMOreg
Kernels
◦ Polynomial
◦ RBF
◦ String

CptS 570 - Machine Learning 45

Seek optimal separating hyperplane
Support vector machine (SVM) finds the hyperplane using only the closest examples
Kernel function allows SVM to operate in higher dimensions
Kernel regression
Choosing the correct kernel is crucial
Kernel machines among best-performing learners

CptS 570 - Machine Learning 46
