examples lecture 7: kernels for classification and...
TRANSCRIPT
CS 194-10, F’11Lect. 7
Motivations
Linear classificationand regressionExamples
Generic form
The kernel trickLinear case
Nonlinear case
ExamplesPolynomial kernels
Other kernels
Kernels in practice
Lecture 7: Kernels for Classification and RegressionCS 194-10, Fall 2011
Laurent El Ghaoui
EECS DepartmentUC Berkeley
September 15, 2011
Outline
- Motivations
- Linear classification and regression: Examples; Generic form
- The kernel trick: Linear case; Nonlinear case
- Examples: Polynomial kernels; Other kernels; Kernels in practice
A linear regression problem

Linear auto-regressive model for a time series: y_t is a linear function of y_{t-1}, y_{t-2}:

y_t = w_1 + w_2 y_{t-1} + w_3 y_{t-2}, t = 1, ..., T.

This writes y_t = w^T x_t, with x_t the "feature vectors"

x_t := (1, y_{t-1}, y_{t-2}), t = 1, ..., T.

Model fitting via least-squares:

min_w ||X^T w - y||_2^2

Prediction rule: y_{T+1} = w_1 + w_2 y_T + w_3 y_{T-1} = w^T x_{T+1}.
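The fit above can be sketched in a few lines of numpy. The AR(2) coefficients (0.5, 0.6, -0.2), the series length, and the noise level are illustrative assumptions, not part of the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic AR(2) series (assumed example): y_t = 0.5 + 0.6 y_{t-1} - 0.2 y_{t-2} + noise
y = np.zeros(500)
for t in range(2, 500):
    y[t] = 0.5 + 0.6 * y[t - 1] - 0.2 * y[t - 2] + 0.01 * rng.standard_normal()

# Rows of X^T are the feature vectors x_t = (1, y_{t-1}, y_{t-2})
XT = np.column_stack([np.ones(498), y[1:499], y[0:498]])
target = y[2:500]

# Least-squares fit: min_w ||X^T w - y||_2^2
w, *_ = np.linalg.lstsq(XT, target, rcond=None)

# One-step-ahead prediction y_{T+1} = w^T x_{T+1}
y_next = w @ np.array([1.0, y[-1], y[-2]])
```

With enough data, the recovered w should be close to the coefficients used to generate the series.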
Nonlinear regression

Nonlinear auto-regressive model for a time series: y_t is a quadratic function of y_{t-1}, y_{t-2}:

y_t = w_1 + w_2 y_{t-1} + w_3 y_{t-2} + w_4 y_{t-1}^2 + w_5 y_{t-1} y_{t-2} + w_6 y_{t-2}^2.

This writes y_t = w^T φ(x_t), with φ(x_t) the augmented feature vectors

φ(x_t) := (1, y_{t-1}, y_{t-2}, y_{t-1}^2, y_{t-1} y_{t-2}, y_{t-2}^2).

Everything is the same as before, with x replaced by φ(x).
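A minimal sketch of "the same as before, with x replaced by φ(x)": build the augmented design matrix from φ and run the same least-squares fit. The toy sine series is an assumed example:

```python
import numpy as np

def phi(pair):
    """Augmented features for (y_{t-1}, y_{t-2}):
    (1, y_{t-1}, y_{t-2}, y_{t-1}^2, y_{t-1} y_{t-2}, y_{t-2}^2)."""
    a, b = pair
    return np.array([1.0, a, b, a * a, a * b, b * b])

# Fitting proceeds exactly as in the linear case, with rows phi(x_t)
y = np.sin(0.3 * np.arange(100))      # a toy series (assumed data)
Phi = np.vstack([phi((y[t - 1], y[t - 2])) for t in range(2, 100)])
w, *_ = np.linalg.lstsq(Phi, y[2:100], rcond=None)
```

A sinusoid satisfies an exact AR(2) recursion, so this augmented fit reproduces the series essentially exactly.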
Nonlinear classification

Non-linear (e.g., quadratic) decision boundary:

w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_1 x_2 + w_5 x_2^2 + b = 0.

This writes w^T φ(x) + b = 0, with φ(x) := (x_1, x_2, x_1^2, x_1 x_2, x_2^2).
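As an illustration of why this helps, a unit circle (not linearly separable in R^2) becomes a linear boundary in the φ-space. The particular weights below are an assumed example encoding x_1^2 + x_2^2 = 1:

```python
import numpy as np

def phi(x):
    """Quadratic feature map from the slide."""
    x1, x2 = x
    return np.array([x1, x2, x1 ** 2, x1 * x2, x2 ** 2])

# The circle x1^2 + x2^2 = 1 is the linear boundary w^T phi(x) + b = 0
w = np.array([0.0, 0.0, 1.0, 0.0, 1.0])
b = -1.0

inside = np.sign(w @ phi((0.1, 0.2)) + b)   # point inside the circle
outside = np.sign(w @ phi((1.5, 0.0)) + b)  # point outside the circle
```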
Challenges

In principle, it seems we can always augment the dimension of the feature space enough to make the data linearly separable. (See the video at http://www.youtube.com/watch?v=3liCbRZPrZA)

How do we do it in a computationally efficient manner?
Linear least-squares

min_w ||X^T w - y||_2^2 + λ||w||_2^2

where
- X = [x_1, ..., x_m] is the n × m matrix of data points,
- y ∈ R^m is the "response" vector,
- w contains the regression coefficients,
- λ ≥ 0 is a regularization parameter.

Prediction rule: y = w^T x, where x ∈ R^n is a new data point.
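Setting the gradient of the objective to zero gives the closed-form solution w = (XX^T + λI)^{-1} X y. A minimal sketch on random data (the dimensions and λ are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 5, 40                        # feature dimension, number of points
X = rng.standard_normal((n, m))     # columns are the data points x_1..x_m
y = rng.standard_normal(m)
lam = 0.1

# Closed-form solution of min_w ||X^T w - y||_2^2 + lam ||w||_2^2
w = np.linalg.solve(X @ X.T + lam * np.eye(n), X @ y)

# Prediction on a new data point x
x_new = rng.standard_normal(n)
y_pred = w @ x_new
```

The gradient 2X(X^T w - y) + 2λw vanishes at this w, which certifies optimality of the convex objective.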
Support vector machine (SVM)

min_w Σ_{i=1}^m max(0, 1 - y_i(w^T x_i + b)) + λ||w||_2^2

where
- X = [x_1, ..., x_m] is the n × m matrix of data points in R^n,
- y ∈ {-1, 1}^m is the label vector,
- w, b contain the classifier coefficients,
- λ ≥ 0 is a regularization parameter.

In the sequel, we'll ignore the bias term b (for simplicity only).

Classification rule: y = sign(w^T x + b), where x ∈ R^n is a new data point.
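A minimal subgradient-descent sketch of this training problem, ignoring the bias term as the slide does. The two Gaussian point clouds, step size, and iteration count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

# Two linearly separable point clouds in R^2, labels in {-1, +1} (assumed data)
Xpts = np.vstack([rng.standard_normal((30, 2)) + 2.5,
                  rng.standard_normal((30, 2)) - 2.5])
y = np.concatenate([np.ones(30), -np.ones(30)])
lam = 0.01

# Subgradient descent on sum_i max(0, 1 - y_i w^T x_i) + lam ||w||_2^2
w = np.zeros(2)
for _ in range(1000):
    margins = y * (Xpts @ w)
    active = margins < 1                 # points whose hinge term is nonzero
    grad = -(y[active, None] * Xpts[active]).sum(axis=0) + 2 * lam * w
    w -= 0.01 * grad

accuracy = np.mean(np.sign(Xpts @ w) == y)
```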
Generic form of problem

Many classification and regression problems can be written as

min_w L(X^T w, y) + λ||w||_2^2

where
- X = [x_1, ..., x_m] is the n × m matrix of data points,
- y ∈ R^m is a response vector (or label vector),
- w contains the classifier/regression coefficients,
- L is a "loss" function that depends on the problem considered,
- λ ≥ 0 is a regularization parameter.

The prediction/classification rule depends only on w^T x, where x ∈ R^n is a new data point.
Loss functions

- Squared loss (for linear least-squares regression):

  L(z, y) = ||z - y||_2^2.

- Hinge loss (for SVMs):

  L(z, y) = Σ_{i=1}^m max(0, 1 - y_i z_i).

- Logistic loss (for logistic regression):

  L(z, y) = Σ_{i=1}^m log(1 + e^{-y_i z_i}).
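The three losses above translate directly into numpy, with z the vector of predictions X^T w:

```python
import numpy as np

def squared_loss(z, y):
    """||z - y||_2^2, for least-squares regression."""
    return np.sum((z - y) ** 2)

def hinge_loss(z, y):
    """sum_i max(0, 1 - y_i z_i), for SVMs; y_i in {-1, +1}."""
    return np.sum(np.maximum(0.0, 1.0 - y * z))

def logistic_loss(z, y):
    """sum_i log(1 + exp(-y_i z_i)), for logistic regression."""
    return np.sum(np.log1p(np.exp(-y * z)))
```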
Key result

For the generic problem

min_w L(X^T w, y) + λ||w||_2^2,

the optimal w lies in the span of the data points x_1, ..., x_m:

w = Xv

for some vector v ∈ R^m.
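This can be checked numerically for ridge regression: even when n > m, the solution w of the regularized problem is exactly a linear combination of the columns of X. The dimensions and λ below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 50, 10                      # many features, few data points (n > m)
X = rng.standard_normal((n, m))
y = rng.standard_normal(m)
lam = 0.5

# Ridge solution in R^n
w = np.linalg.solve(X @ X.T + lam * np.eye(n), X @ y)

# Fit w by the columns of X: the residual vanishes, i.e. w = Xv exactly
v, *_ = np.linalg.lstsq(X, w, rcond=None)
residual = np.linalg.norm(X @ v - w)
```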
Proof

Any w ∈ R^n can be written as the sum of two orthogonal vectors:

w = Xv + r,

where X^T r = 0 (that is, r lies in the nullspace N(X^T)). Then X^T w = X^T X v, so the loss term does not depend on r, while ||w||_2^2 = ||Xv||_2^2 + ||r||_2^2 by orthogonality. Setting r = 0 can therefore only decrease the objective, so an optimal w has the form w = Xv.
The orthogonal decomposition rests on the fundamental theorem of linear algebra. Since for any subspace S it holds that S⊥⊥ = S, taking the orthogonal complement of both sides of N(A) = R(A^T)⊥ we obtain N(A)⊥ = R(A^T), hence

R^n = N(A) ⊕ N(A)⊥ = N(A) ⊕ R(A^T).

This means that the input space R^n can be decomposed as the direct sum of the two orthogonal subspaces N(A) and R(A^T). Since dim R(A^T) = dim R(A) = rank(A), and obviously dim R^n = n, we also obtain

dim N(A) + rank(A) = n.

With similar reasoning, we argue that

R(A)⊥ = {y ∈ R^m : y^T z = 0, ∀z ∈ R(A)} = {y ∈ R^m : y^T Ax = 0, ∀x ∈ R^n} = N(A^T),

hence, taking the orthogonal complement of both sides, we see that

R(A) = N(A^T)⊥.

Therefore, the output space R^m decomposes as

R^m = R(A) ⊕ R(A)⊥ = R(A) ⊕ N(A^T),

and

dim N(A^T) + rank(A) = m.

[Figure: illustration of the fundamental theorem of linear algebra in R^3, with A = [a_1 a_2]; the figure shows the case X = A = (a_1, a_2).]
Consequence of key result

For the generic problem

min_w L(X^T w, y) + λ||w||_2^2,

the optimal w can be written as w = Xv for some vector v ∈ R^m.

Hence the training problem depends on the data only through the m × m matrix K := X^T X:

min_v L(Kv, y) + λ v^T K v.
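For the squared loss this can be verified directly: setting the gradient of ||Kv - y||^2 + λ v^T K v to zero is solved by v = (K + λI)^{-1} y, and w = Xv recovers exactly the direct ridge solution. A sketch on random data (dimensions and λ are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 100, 15
X = rng.standard_normal((n, m))
y = rng.standard_normal(m)
lam = 0.3

# Kernel matrix: m x m instead of working in R^n
K = X.T @ X

# Kernelized training: min_v ||Kv - y||^2 + lam v^T K v,
# solved by v = (K + lam I)^{-1} y
v = np.linalg.solve(K + lam * np.eye(m), y)

# Recover w = Xv; it matches the direct ridge solution in R^n
w_kernel = X @ v
w_direct = np.linalg.solve(X @ X.T + lam * np.eye(n), X @ y)
```

The equality w_kernel = w_direct is the "push-through" identity (XX^T + λI)^{-1} X = X (X^T X + λI)^{-1}.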
Kernel matrix

The training problem depends only on the "kernel matrix" K = X^T X:

K_ij = x_i^T x_j.

K contains the scalar products between all pairs of data points.

The prediction/classification rule depends on the scalar products between the new point x and the data points x_1, ..., x_m:

w^T x = v^T X^T x = v^T k, where k := X^T x = (x^T x_1, ..., x^T x_m).
Computational advantages

Once K is formed (this takes O(m^2 n) operations, one O(n) scalar product per entry), the training problem has only m variables.

When n ≫ m, this leads to a dramatic reduction in problem size.
How about the nonlinear case?

In the nonlinear case, we simply replace the feature vectors x_i by "augmented" feature vectors φ(x_i), with φ a non-linear mapping.

Example: in classification with a quadratic decision boundary, we use

φ(x) := (x_1, x_2, x_1^2, x_1 x_2, x_2^2).

This leads to the modified kernel matrix

K_ij = φ(x_i)^T φ(x_j), 1 ≤ i, j ≤ m.
The kernel function

The kernel function associated with a mapping φ is

k(x, z) = φ(x)^T φ(z).

It provides information about the metric in the feature space, e.g.:

||φ(x) - φ(z)||_2^2 = k(x, x) - 2 k(x, z) + k(z, z).

The computational effort involved in
- solving the training problem, and
- making a prediction,

depends only on our ability to quickly evaluate such scalar products.

We can't choose k arbitrarily; it has to satisfy the above for some φ.
Quadratic kernels

Classification with quadratic boundaries involves feature vectors

φ(x) = (1, x_1, x_2, x_1^2, x_1 x_2, x_2^2).

Fact: given two vectors x, z ∈ R^2, if we scale the components of φ as φ(x) = (1, √2 x_1, √2 x_2, x_1^2, √2 x_1 x_2, x_2^2) (which spans the same set of quadratic functions), then

φ(x)^T φ(z) = (1 + x^T z)^2.
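The identity is easy to check numerically with the scaled map; the test points are arbitrary assumed values:

```python
import numpy as np

def phi(x):
    """Scaled quadratic feature map: with the sqrt(2) factors,
    phi(x)^T phi(z) equals (1 + x^T z)^2 exactly."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 ** 2, s * x1 * x2, x2 ** 2])

x = np.array([0.3, -1.2])
z = np.array([2.0, 0.5])

lhs = phi(x) @ phi(z)        # explicit feature-space scalar product
rhs = (1.0 + x @ z) ** 2     # kernel evaluation, no feature map needed
```

Expanding (1 + x_1 z_1 + x_2 z_2)^2 term by term reproduces exactly the six products in phi(x) @ phi(z).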
Polynomial kernels

More generally, when φ(x) is the vector formed with all products between the components of x ∈ R^n up to degree d (again with suitable scaling of the monomials), then for any two vectors x, z ∈ R^n,

φ(x)^T φ(z) = (1 + x^T z)^d.

Computational effort grows linearly in n.

This is a dramatic speedup over the "brute force" approach:
- form φ(x), φ(z);
- evaluate φ(x)^T φ(z).

There, the computational effort grows as O(n^d).
Other kernels

Gaussian kernel function:

k(x, z) = exp( -||x - z||_2^2 / (2σ^2) ),

where σ > 0 is a scale parameter. It allows us to effectively ignore points that are too far apart, and corresponds to a non-linear mapping φ to an infinite-dimensional feature space.

There is a large variety (a zoo?) of other kernels, some adapted to the structure of the data (text, images, etc.).
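A minimal sketch of forming the Gaussian kernel matrix for the columns of a data matrix X, using the standard pairwise-distance expansion ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 x_i^T x_j; the dimensions and σ are illustrative assumptions:

```python
import numpy as np

def gaussian_kernel(X, sigma):
    """Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
    for the columns x_i of X (shape n x m)."""
    sq = np.sum(X ** 2, axis=0)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X.T @ X   # pairwise squared distances
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

X = np.random.default_rng(5).standard_normal((3, 6))
K = gaussian_kernel(X, sigma=1.0)
```

K is symmetric with unit diagonal, and (as for any valid kernel) positive semidefinite.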
In practice
- Kernels need to be chosen by the user.
- The choice is not always obvious; Gaussian and polynomial kernels are popular.
- Control over-fitting via cross-validation (with respect to, say, the scale parameter of the Gaussian kernel, or the degree of the polynomial kernel).
- Kernel methods are not well adapted to l1-norm regularization.