
Learning in Feature Space (Could Simplify the Classification Task)

Learning in a high-dimensional space could degrade generalization performance. This phenomenon is called the curse of dimensionality.

By using a kernel function, which represents the inner product of training examples in feature space, we never need to know explicitly what the nonlinear map is. We do not even need to know the dimensionality of the feature space.

There is no free lunch: we have to deal with a huge and dense kernel matrix.

A reduced kernel can avoid this difficulty.

[Figure: the nonlinear map $\phi$ sends training points from the input space $X$ to the feature space $F$.]

Linear Machine in Feature Space

Let $\phi : X \to F$ be a nonlinear map from the input space to some feature space.

The classifier will be in the form (Primal):

$f(x) = \left( \sum_{j=1}^{P} w_j \phi_j(x) \right) + b$

Make it in the dual form:

$f(x) = \left( \sum_{i=1}^{\ell} \alpha_i y_i \langle \phi(x_i) \cdot \phi(x) \rangle \right) + b$

The Perceptron Algorithm (Dual Form)

Given a linearly separable training set $S$:

$\alpha = 0,\ \alpha \in \mathbb{R}^{\ell};\quad b = 0;\quad R = \max_{1 \le i \le \ell} \|x_i\|$

Repeat:
  for $i = 1$ to $\ell$
    if $y_i \left( \sum_{j=1}^{\ell} \alpha_j y_j \langle x_j \cdot x_i \rangle + b \right) \le 0$ then
      $\alpha_i \leftarrow \alpha_i + 1;\quad b \leftarrow b + y_i R^2$
    end if
  end for
until no mistakes made within the for loop
return: $(\alpha, b)$

The corresponding weight vector is $w = \sum_{i=1}^{\ell} \alpha_i y_i x_i$.
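A minimal NumPy sketch of this dual-form update rule, run on a small linearly separable toy set (the data, the `dual_perceptron` name, and the epoch cap are illustrative assumptions, not from the slides):

```python
import numpy as np

def dual_perceptron(X, y, max_epochs=100):
    """Dual-form perceptron: learn multipliers alpha and bias b.
    X: (l, n) training inputs, y: (l,) labels in {-1, +1}."""
    l = len(y)
    alpha = np.zeros(l)
    b = 0.0
    R = np.max(np.linalg.norm(X, axis=1))
    G = X @ X.T                                   # Gram matrix of inner products <x_j . x_i>
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(l):
            if y[i] * (np.sum(alpha * y * G[:, i]) + b) <= 0:
                alpha[i] += 1.0
                b += y[i] * R ** 2
                mistakes += 1
        if mistakes == 0:                         # until no mistakes made within the for loop
            break
    return alpha, b

# Toy linearly separable data (illustrative)
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.0], [-1.0, -2.5]])
y = np.array([1, 1, -1, -1])
alpha, b = dual_perceptron(X, y)
w = (alpha * y) @ X                               # recover w = sum_i alpha_i y_i x_i
print(alpha, b, w)
```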

Kernel: Represent Inner Product in Feature Space

Definition: A kernel is a function $K : X \times X \to \mathbb{R}$ such that for all $x, z \in X$,

$K(x, z) = \langle \phi(x) \cdot \phi(z) \rangle$

where $\phi : X \to F$.

The classifier will become:

$f(x) = \left( \sum_{i=1}^{\ell} \alpha_i y_i K(x_i, x) \right) + b$
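To kernelize the machine, the inner product is simply replaced by a kernel evaluation. A minimal sketch of the resulting decision function (the helper name, placeholder multipliers, and the degree-2 polynomial kernel are illustrative assumptions):

```python
import numpy as np

def decision_function(x, X_train, y, alpha, b, K):
    """f(x) = sum_i alpha_i y_i K(x_i, x) + b, using only kernel evaluations."""
    return sum(a * yi * K(xi, x) for a, yi, xi in zip(alpha, y, X_train)) + b

# Example kernel: polynomial of degree 2, K(x, z) = <x, z>^2
poly2 = lambda x, z: float(np.dot(x, z)) ** 2

X_train = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1, -1])
alpha, b = np.array([1.0, 1.0]), 0.0        # placeholder multipliers for illustration
print(decision_function(np.array([2.0, 1.0]), X_train, y, alpha, b, poly2))
```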

A Simple Example of Kernel

Polynomial kernel of degree 2: $K(x, z) = \langle x, z \rangle^2$.

Let $x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix},\ z = \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} \in \mathbb{R}^2$ and the nonlinear map $\phi : \mathbb{R}^2 \to \mathbb{R}^3$ be defined by

$\phi(x) = \begin{bmatrix} x_1^2 \\ x_2^2 \\ \sqrt{2}\, x_1 x_2 \end{bmatrix}.$

Then $\langle \phi(x), \phi(z) \rangle = \langle x, z \rangle^2 = K(x, z)$.

There are many other nonlinear maps, $\psi(x)$, that satisfy the relation $\langle \psi(x), \psi(z) \rangle = \langle x, z \rangle^2 = K(x, z)$.
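A quick numerical check of this identity (the test points are arbitrary):

```python
import numpy as np

def phi(x):
    # Explicit map phi : R^2 -> R^3 from the slide: [x1^2, x2^2, sqrt(2) x1 x2]
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(phi(x) @ phi(z))           # inner product in feature space
print(float(np.dot(x, z)) ** 2)  # kernel value <x, z>^2 -- both print 1.0
```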

Power of the Kernel Technique

Consider a nonlinear map $\phi : \mathbb{R}^n \to \mathbb{R}^p$ that consists of distinct features of all the monomials of degree $d$. Then $p = \binom{n + d - 1}{d}$.

For example: $n = 11,\ d = 10,\ p = 92378$.

Is it necessary to compute $\phi$ explicitly? We only need to know $\langle \phi(x), \phi(z) \rangle$! This can be achieved by computing $K(x, z) = \langle x, z \rangle^d$.
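A sketch that makes the saving concrete for a small $n$ and $d$: the explicit map enumerates every degree-$d$ monomial, scaled by the square root of its multinomial coefficient (the same trick as the $\sqrt{2}\,x_1 x_2$ feature above), while the kernel needs only a single inner product. The dimensions and test values are illustrative assumptions.

```python
import itertools, math
import numpy as np

def phi(x, d):
    """Explicit degree-d monomial feature map, scaled so that
    <phi(x), phi(z)> = <x, z>**d."""
    n = len(x)
    feats = []
    for exps in itertools.product(range(d + 1), repeat=n):
        if sum(exps) != d:
            continue
        coef = math.factorial(d)                  # multinomial coefficient d! / (e1! ... en!)
        for e in exps:
            coef //= math.factorial(e)
        feats.append(math.sqrt(coef) * float(np.prod(np.asarray(x) ** exps)))
    return np.array(feats)

rng = np.random.default_rng(0)
n, d = 4, 3
x, z = rng.normal(size=n), rng.normal(size=n)

print(len(phi(x, d)))             # number of features p = C(n + d - 1, d) = 20 for n = 4, d = 3
print(phi(x, d) @ phi(z, d))      # explicit inner product in feature space
print(float(x @ z) ** d)          # same value from a single kernel evaluation
```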

Basic Properties of Kernel Function

Symmetric (inherited from the inner product):

$K(x, z) = \langle \phi(x) \cdot \phi(z) \rangle = \langle \phi(z) \cdot \phi(x) \rangle = K(z, x)$

Cauchy-Schwarz inequality:

$K(x, z)^2 \le K(x, x)\, K(z, z)$

These conditions are not sufficient to guarantee the existence of a feature space.

Characterization of Kernels: Motivation in Finite Input Space

Consider a finite space $X = \{x_1, x_2, \ldots, x_n\}$ and let $k(x, z)$ be a symmetric function on $X$.

Let $K \in \mathbb{R}^{n \times n}$ be the matrix defined by $K_{ij} = k(x_i, x_j)$ (symmetric).

There is an orthogonal matrix $V$ such that:

$K = V \Lambda V^{\top} = [v_1, v_2, \ldots, v_n] \begin{bmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_n \end{bmatrix} \begin{bmatrix} v_1^{\top} \\ \vdots \\ v_n^{\top} \end{bmatrix}$

Characterization of Kernels: Assume K Is Positive Semi-definite

Let $K = V \Lambda V^{\top} \in \mathbb{R}^{n \times n}$ be positive semi-definite, where $\lambda_i \ge 0,\ i = 1, 2, \ldots, n$, and

$K = [v_1, v_2, \ldots, v_n] \begin{bmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_n \end{bmatrix} \begin{bmatrix} v_1^{\top} \\ \vdots \\ v_n^{\top} \end{bmatrix}.$

Define $\phi : x_i \mapsto [\sqrt{\lambda_1}\, v_{1i},\ \sqrt{\lambda_2}\, v_{2i},\ \ldots,\ \sqrt{\lambda_n}\, v_{ni}]$.

Then

$K_{ij} = \langle \phi(x_i), \phi(x_j) \rangle = \left( [\sqrt{\lambda_1}\, v_1, \ldots, \sqrt{\lambda_n}\, v_n] \begin{bmatrix} \sqrt{\lambda_1}\, v_1^{\top} \\ \vdots \\ \sqrt{\lambda_n}\, v_n^{\top} \end{bmatrix} \right)_{ij}$
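A minimal sketch of this construction on a finite input space: build a symmetric positive semi-definite matrix, eigendecompose it, and recover an explicit feature map whose inner products reproduce $K$ exactly (the sample points and the Gaussian kernel used to build $K$ are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))                         # finite space {x_1, ..., x_6}
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq)                               # symmetric, positive semi-definite

lam, V = np.linalg.eigh(K)                          # K = V diag(lambda) V'
assert lam.min() > -1e-10                           # PSD: all eigenvalues >= 0
Phi = V * np.sqrt(np.clip(lam, 0, None))            # row i is phi(x_i) = (sqrt(lambda_k) v_ki)_k

print(np.allclose(Phi @ Phi.T, K))                  # <phi(x_i), phi(x_j)> == K_ij -> True
```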

Mercer's Conditions: Guarantee the Existence of a Feature Space

Let $X = \{x_1, x_2, \ldots, x_n\}$ be a finite space and $k(x, z)$ a symmetric function on $X$.

Then $k(x, z)$ is a kernel function if and only if $K \in \mathbb{R}^{n \times n}$, with $K_{ij} = k(x_i, x_j)$, is positive semi-definite.

What if $X$ is infinite (but compact)?

Mercer's conditions: for any finite subset of $X$, the corresponding matrix must be positive semi-definite.

Making Kernels: Kernels Satisfy a Number of Closure Properties

Let $k_1, k_2$ be kernels over $X \times X$, $X \subseteq \mathbb{R}^n$, $a > 0$, $f : X \to \mathbb{R}$, $\phi : X \to \mathbb{R}^m$, $k_3$ be a kernel over $\mathbb{R}^m \times \mathbb{R}^m$, and $B \in \mathbb{R}^{n \times n}$ be symmetric positive semi-definite.

Then the following functions are kernels:

1. $k(x, z) = k_1(x, z) + k_2(x, z)$
2. $k(x, z) = a\, k_1(x, z)$
3. $k(x, z) = k_1(x, z)\, k_2(x, z)$
4. $k(x, z) = f(x)\, f(z)$
5. $k(x, z) = k_3(\phi(x), \phi(z))$
6. $k(x, z) = x^{\top} B z$
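A small numerical illustration of properties 1-4 and 6 on a finite point set: every constructed Gram matrix remains symmetric positive semi-definite (the data, base kernels, and the `min_eig` helper are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))

# Two base kernels evaluated on X: a linear kernel and a Gaussian RBF
K1 = X @ X.T
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K2 = np.exp(-sq)

f = np.sin(X[:, 0])                     # values of some function f : X -> R
B = np.diag([1.0, 2.0, 0.5])            # symmetric positive semi-definite B

min_eig = lambda K: np.linalg.eigvalsh(K).min()

for K in (K1 + K2,                      # 1: sum of kernels
          3.0 * K1,                     # 2: positive scaling
          K1 * K2,                      # 3: element-wise (Schur) product
          np.outer(f, f),               # 4: f(x) f(z)
          X @ B @ X.T):                 # 6: x' B z
    print(min_eig(K) > -1e-10)          # all True: every Gram matrix is PSD
```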

Translation Invariant Kernels

The inner product (in the feature space) of two inputs is unchanged if both are translated by the same vector.

These kernels are of the form $k(x, z) = k(x - z)$. Some examples:

Gaussian RBF: $k(x, z) = e^{-\mu \|x - z\|_2^2}$
Multiquadric: $k(x, z) = (\|x - z\|_2^2 + c^2)^{1/2}$
Fourier: see Example 3.9 on p. 37
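Minimal implementations of the two closed-form examples above, including a check of translation invariance (the parameter values $\mu$ and $c$ and the test points are illustrative assumptions):

```python
import numpy as np

def gaussian_rbf(x, z, mu=1.0):
    """Gaussian RBF kernel: k(x, z) = exp(-mu * ||x - z||_2^2)."""
    return np.exp(-mu * np.sum((np.asarray(x) - np.asarray(z)) ** 2))

def multiquadric(x, z, c=1.0):
    """Multiquadric kernel: k(x, z) = (||x - z||_2^2 + c^2)^(1/2)."""
    return np.sqrt(np.sum((np.asarray(x) - np.asarray(z)) ** 2) + c ** 2)

x, z = np.array([1.0, 2.0]), np.array([0.0, 2.0])
print(gaussian_rbf(x, z), gaussian_rbf(x + 5, z + 5))   # equal: depends only on x - z
print(multiquadric(x, z), multiquadric(x + 5, z + 5))   # equal: depends only on x - z
```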

A Negative Definite Kernel

The kernel $k(x, z) = (-x^{\top} z - 1)^3$ is negative definite.

It does not satisfy Mercer's conditions.

In the Generalized Support Vector Machine, Olvi L. Mangasarian used this kernel to solve the XOR classification problem.
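A numerical sketch of why this kernel fails Mercer's conditions: its Gram matrix on the four XOR points has only negative eigenvalues, so it is not positive semi-definite (the $\{-1, +1\}^2$ encoding of the XOR points is an illustrative assumption):

```python
import numpy as np

# The four XOR points in the usual {-1, +1}^2 encoding (illustrative choice)
X = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])

def k(x, z):
    """k(x, z) = (-x'z - 1)^3, which is not a Mercer kernel."""
    return (-np.dot(x, z) - 1.0) ** 3

K = np.array([[k(xi, xj) for xj in X] for xi in X])
print(np.linalg.eigvalsh(K))   # all four eigenvalues are negative, so K is not PSD
```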