Learning in Feature Space (Could Simplify the Classification Task)
Learning in a high-dimensional space could degrade generalization performance. This phenomenon is called the curse of dimensionality.
By using a kernel function, which represents the inner product of training examples in feature space, we never need to know the nonlinear map explicitly; we do not even need to know the dimensionality of the feature space.
There is no free lunch: we have to deal with a huge and dense kernel matrix.
A reduced kernel can avoid this difficulty.
Linear Machine in Feature Space

Let $\phi : X \to F$ be a nonlinear map from the input space to some feature space.

The classifier will be in the form (Primal):
$$f(x) = \Big( \sum_{j=1}^{?} w_j \phi_j(x) \Big) + b$$

Make it in the dual form:
$$f(x) = \Big( \sum_{i=1}^{l} \alpha_i y_i \langle \phi(x_i) \cdot \phi(x) \rangle \Big) + b$$
The Perceptron Algorithm (Dual Form)

Given a linearly separable training set $S$; initialize $\alpha = 0$, $\alpha \in \mathbb{R}^l$, $b = 0$, $R = \max_{1 \le i \le l} \|x_i\|$.

Repeat:
  for $i = 1$ to $l$
    if $y_i \big( \sum_{j=1}^{l} \alpha_j y_j \langle x_j \cdot x_i \rangle + b \big) \le 0$ then
      $\alpha_i \leftarrow \alpha_i + 1$;  $b \leftarrow b + y_i R^2$
    end if
  end for
until no mistakes made within the for loop

Return $(\alpha, b)$, where $w = \sum_{i=1}^{l} \alpha_i y_i x_i$.
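A minimal NumPy sketch of this dual-form update; the function name dual_perceptron and the max_epochs safeguard are my additions, not part of the original algorithm.

```python
import numpy as np

def dual_perceptron(X, y, max_epochs=1000):
    """Dual-form perceptron on a linearly separable training set.

    X : (l, n) array of inputs; y : (l,) array of labels in {-1, +1}.
    Returns the dual coefficients alpha and the bias b.
    """
    l = X.shape[0]
    alpha = np.zeros(l)
    b = 0.0
    R = np.max(np.linalg.norm(X, axis=1))  # R = max_i ||x_i||
    G = X @ X.T                            # Gram matrix of inner products <x_j . x_i>
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(l):
            # functional margin of example i under the current (alpha, b)
            if y[i] * (np.sum(alpha * y * G[:, i]) + b) <= 0:
                alpha[i] += 1.0
                b += y[i] * R ** 2
                mistakes += 1
        if mistakes == 0:                  # repeat until no mistakes in a full pass
            break
    return alpha, b
```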
Kernel: Represent Inner Product in Feature Space

Definition: A kernel is a function $K : X \times X \to \mathbb{R}$ such that for all $x, z \in X$
$$K(x, z) = \langle \phi(x) \cdot \phi(z) \rangle,$$
where $\phi : X \to F$.

The classifier will become:
$$f(x) = \Big( \sum_{i=1}^{l} \alpha_i y_i K(x_i, x) \Big) + b$$
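As a sketch, this kernelized classifier can be evaluated directly from the dual coefficients; the helper name kernel_decision is hypothetical, and alpha, b are assumed to come from a dual trainer such as the perceptron above.

```python
def kernel_decision(x, X_train, y_train, alpha, b, K):
    # f(x) = (sum_i alpha_i y_i K(x_i, x)) + b; the predicted label is sign(f(x))
    return sum(a * yi * K(xi, x) for a, yi, xi in zip(alpha, y_train, X_train)) + b

# example with the degree-2 polynomial kernel K(x, z) = <x, z>^2
poly2 = lambda x, z: (x @ z) ** 2
```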
A Simple Example of Kernel

Polynomial kernel of degree 2: $K(x, z) = \langle x, z \rangle^2$

Let $x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix},\ z = \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} \in \mathbb{R}^2$ and let the nonlinear map $\phi : \mathbb{R}^2 \to \mathbb{R}^3$ be defined by
$$\phi(x) = \begin{bmatrix} x_1^2 \\ x_2^2 \\ \sqrt{2}\, x_1 x_2 \end{bmatrix}.$$

Then $\langle \phi(x), \phi(z) \rangle = \langle x, z \rangle^2 = K(x, z)$.

There are many other nonlinear maps, $\psi(x)$, that satisfy the relation: $\langle \psi(x), \psi(z) \rangle = \langle x, z \rangle^2 = K(x, z)$.
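A quick numerical check of this identity, assuming NumPy; the sample points x and z are chosen here only for illustration.

```python
import numpy as np

def phi(x):
    # explicit degree-2 feature map from R^2 to R^3
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
lhs = phi(x) @ phi(z)        # <phi(x), phi(z)>
rhs = (x @ z) ** 2           # <x, z>^2
assert np.isclose(lhs, rhs)  # both equal K(x, z)
```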
Power of the Kernel Technique

Consider a nonlinear map $\phi : \mathbb{R}^n \to \mathbb{R}^p$ that consists of distinct features of all the monomials of degree $d$. Then $p = \binom{n + d - 1}{d}$.

For example: $n = 10,\ d = 10,\ p = 92378$.

Is it necessary to compute $\phi(x)$ explicitly? No: we only need to know $\langle \phi(x), \phi(z) \rangle$! This can be achieved with
$$K(x, z) = \langle x, z \rangle^d$$
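A one-liner confirms the count, assuming Python's math.comb; the helper name monomial_count is mine.

```python
from math import comb

def monomial_count(n, d):
    # number of distinct monomials of degree d in n variables: C(n + d - 1, d)
    return comb(n + d - 1, d)

print(monomial_count(10, 10))  # 92378 -- the explicit feature space is huge
# K(x, z) = <x, z>**d costs only one n-dimensional inner product instead
```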
Basic Properties of Kernel Functions

Symmetric (inherited from the inner product):
$$K(x, z) = \langle \phi(x) \cdot \phi(z) \rangle = \langle \phi(z) \cdot \phi(x) \rangle = K(z, x)$$

Cauchy-Schwarz inequality:
$$K(x, z)^2 \le K(x, x)\, K(z, z)$$

These conditions are not sufficient to guarantee the existence of a feature space.
Characterization of Kernels: Motivation in Finite Input Space

Consider a finite space $X = \{x_1, x_2, \ldots, x_n\}$, and let $k(x, z)$ be a symmetric function on $X$.

Let $K \in \mathbb{R}^{n \times n}$ be the matrix defined as follows:
$$K_{ij} = k(x_i, x_j) \quad (\text{symmetric})$$

There is an orthogonal matrix $V$ such that:
$$K = V \Lambda V' = [v_1, v_2, \ldots, v_n] \begin{bmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_n \end{bmatrix} \begin{bmatrix} v_1' \\ \vdots \\ v_n' \end{bmatrix}$$
Characterization of Kernels: Assume $K$ Is Positive Semi-definite

Let
$$K = V \Lambda V' = [v_1, v_2, \ldots, v_n] \begin{bmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_n \end{bmatrix} \begin{bmatrix} v_1' \\ \vdots \\ v_n' \end{bmatrix},$$
where $K \in \mathbb{R}^{n \times n}$ and $\lambda_i \ge 0,\ i = 1, 2, \ldots, n$.

Define $\phi : x_i \mapsto [\sqrt{\lambda_1}\, v_{1i},\ \sqrt{\lambda_2}\, v_{2i},\ \ldots,\ \sqrt{\lambda_n}\, v_{ni}]$. Then
$$K_{ij} = \bigg( [\sqrt{\lambda_1}\, v_1, \ldots, \sqrt{\lambda_n}\, v_n] \begin{bmatrix} \sqrt{\lambda_1}\, v_1' \\ \vdots \\ \sqrt{\lambda_n}\, v_n' \end{bmatrix} \bigg)_{ij} = \langle \phi(x_i), \phi(x_j) \rangle$$
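A sketch of this construction in NumPy, assuming a symmetric PSD Gram matrix; feature_map_from_gram is an illustrative name, not a standard routine.

```python
import numpy as np

def feature_map_from_gram(K, tol=1e-10):
    """Recover a feature map phi from a symmetric PSD Gram matrix K.

    Row i of the returned matrix is phi(x_i), built from the
    eigendecomposition K = V Lambda V'.
    """
    lam, V = np.linalg.eigh(K)                 # eigenvalues lam, eigenvector columns V
    assert np.all(lam >= -tol), "K must be positive semi-definite"
    Phi = V * np.sqrt(np.clip(lam, 0, None))   # phi(x_i)_k = sqrt(lam_k) * v_{k,i}
    return Phi

# sanity check: <phi(x_i), phi(x_j)> reproduces K_ij
X = np.random.randn(5, 3)
K = (X @ X.T) ** 2                             # degree-2 polynomial kernel Gram matrix
Phi = feature_map_from_gram(K)
assert np.allclose(Phi @ Phi.T, K)
```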
Mercer's Conditions: Guarantee the Existence of a Feature Space

Let $X = \{x_1, x_2, \ldots, x_n\}$ be a finite space and $k(x, z)$ be a symmetric function on $X$.

Then $k(x, z)$ is a kernel function if and only if the matrix $K \in \mathbb{R}^{n \times n}$, $K_{ij} = k(x_i, x_j)$, is positive semi-definite.

What if $X$ is infinite (but compact)?

Mercer's conditions: for any finite subset of $X$, the corresponding matrix is positive semi-definite.
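A finite-sample Mercer check can be sketched as follows; the helper name is_psd_gram and the numerical tolerance are my assumptions.

```python
import numpy as np

def is_psd_gram(k, xs, tol=1e-10):
    # build the Gram matrix K_ij = k(x_i, x_j) on a finite sample
    # and test that all eigenvalues are (numerically) nonnegative
    K = np.array([[k(xi, xj) for xj in xs] for xi in xs])
    return np.all(np.linalg.eigvalsh(K) >= -tol)

xs = list(np.random.randn(20, 4))
assert is_psd_gram(lambda x, z: (x @ z) ** 3, xs)  # polynomial kernel passes
```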
Making Kernels: Kernels Satisfy a Number of Closure Properties

Let $k_1, k_2$ be kernels over $X \times X$, $X \subseteq \mathbb{R}^n$, $a > 0$, $f : X \to \mathbb{R}$, $\phi : X \to \mathbb{R}^m$, $k_3$ be a kernel over $\mathbb{R}^m \times \mathbb{R}^m$, and $B \in \mathbb{R}^{n \times n}$ be a symmetric positive semi-definite matrix.

Then the following functions are kernels (see the sketch below):
1. $k(x, z) = k_1(x, z) + k_2(x, z)$
2. $k(x, z) = a\,k_1(x, z)$
3. $k(x, z) = k_1(x, z)\,k_2(x, z)$
4. $k(x, z) = f(x)f(z)$
5. $k(x, z) = k_3(\phi(x), \phi(z))$
6. $k(x, z) = x' B z$
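A sketch of properties 1-3 as kernel combinators in Python; the combinator names are mine.

```python
import numpy as np

# closure-property combinators: each takes kernels and returns a new kernel
def k_sum(k1, k2):  return lambda x, z: k1(x, z) + k2(x, z)   # property 1
def k_scale(a, k1): return lambda x, z: a * k1(x, z)          # property 2
def k_prod(k1, k2): return lambda x, z: k1(x, z) * k2(x, z)   # property 3

linear = lambda x, z: x @ z               # the basic linear kernel
poly2 = lambda x, z: (x @ z) ** 2         # degree-2 polynomial kernel

# e.g. an inhomogeneous quadratic built from simpler kernels:
k = k_sum(k_scale(2.0, linear), poly2)    # k(x, z) = 2<x,z> + <x,z>^2
x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(k(x, z))                            # 2*(-1.5) + (-1.5)**2 = -0.75
```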
Translation Invariant Kernels

The inner product (in the feature space) of two inputs is unchanged if both are translated by the same vector.

These kernels are of the form: $k(x, z) = k(x - z)$

Some examples:
Gaussian RBF: $k(x, z) = e^{-\mu \|x - z\|_2^2}$
Multiquadric: $k(x, z) = (\|x - z\|_2^2 + c^2)^{1/2}$
Fourier: see Example 3.9 on p. 37
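A sketch of the first two examples, with a check of translation invariance; the parameter defaults mu=1.0 and c=1.0 are arbitrary choices of mine.

```python
import numpy as np

# two translation-invariant kernels from the examples above
def gaussian_rbf(x, z, mu=1.0):
    return np.exp(-mu * np.sum((x - z) ** 2))      # e^{-mu ||x - z||_2^2}

def multiquadric(x, z, c=1.0):
    return np.sqrt(np.sum((x - z) ** 2) + c ** 2)  # (||x - z||_2^2 + c^2)^{1/2}

x, z, t = np.random.randn(3, 4)
# translation invariance: shifting both inputs by t leaves the value unchanged
assert np.isclose(gaussian_rbf(x, z), gaussian_rbf(x + t, z + t))
assert np.isclose(multiquadric(x, z), multiquadric(x + t, z + t))
```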