examples lecture 7: kernels for classification and...
TRANSCRIPT
CS 194-10, F’11Lect. 7
Motivations
Linear classificationand regressionExamples
Generic form
The kernel trickLinear case
Nonlinear case
ExamplesPolynomial kernels
Other kernels
Kernels in practice
Lecture 7: Kernels for Classification and RegressionCS 194-10, Fall 2011
Laurent El Ghaoui
EECS DepartmentUC Berkeley
September 15, 2011
Outline
- Motivations
- Linear classification and regression: Examples; Generic form
- The kernel trick: Linear case; Nonlinear case
- Examples: Polynomial kernels; Other kernels; Kernels in practice
A linear regression problem

Linear auto-regressive model for a time series: y_t is a linear function of y_{t-1}, y_{t-2}:

y_t = w_1 + w_2 y_{t-1} + w_3 y_{t-2}, t = 1, ..., T.

This writes y_t = w^T x_t, with x_t the "feature vectors"

x_t := (1, y_{t-1}, y_{t-2}), t = 1, ..., T.

Model fitting via least-squares:

min_w ||X^T w - y||_2^2

Prediction rule: y_{T+1} = w_1 + w_2 y_T + w_3 y_{T-1} = w^T x_{T+1}.
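The fit above can be sketched in a few lines of numpy. The AR(2) coefficients (0.5, 0.6, -0.2), the series length, and the noise level are illustrative assumptions, not part of the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic AR(2) series (assumed example): y_t = 0.5 + 0.6 y_{t-1} - 0.2 y_{t-2} + noise
y = np.zeros(500)
for t in range(2, 500):
    y[t] = 0.5 + 0.6 * y[t - 1] - 0.2 * y[t - 2] + 0.01 * rng.standard_normal()

# Rows of X^T are the feature vectors x_t = (1, y_{t-1}, y_{t-2})
XT = np.column_stack([np.ones(498), y[1:499], y[0:498]])
target = y[2:500]

# Least-squares fit: min_w ||X^T w - y||_2^2
w, *_ = np.linalg.lstsq(XT, target, rcond=None)

# One-step-ahead prediction y_{T+1} = w^T x_{T+1}
y_next = w @ np.array([1.0, y[-1], y[-2]])
```

With enough data, the recovered w should be close to the coefficients used to generate the series.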
Nonlinear regression

Nonlinear auto-regressive model for a time series: y_t is a quadratic function of y_{t-1}, y_{t-2}:

y_t = w_1 + w_2 y_{t-1} + w_3 y_{t-2} + w_4 y_{t-1}^2 + w_5 y_{t-1} y_{t-2} + w_6 y_{t-2}^2.

This writes y_t = w^T φ(x_t), with φ(x_t) the augmented feature vectors

φ(x_t) := (1, y_{t-1}, y_{t-2}, y_{t-1}^2, y_{t-1} y_{t-2}, y_{t-2}^2).

Everything is the same as before, with x replaced by φ(x).
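A minimal sketch of "the same as before, with x replaced by φ(x)": build the augmented design matrix from φ and run the same least-squares fit. The toy sine series is an assumed example:

```python
import numpy as np

def phi(pair):
    """Augmented features for (y_{t-1}, y_{t-2}):
    (1, y_{t-1}, y_{t-2}, y_{t-1}^2, y_{t-1} y_{t-2}, y_{t-2}^2)."""
    a, b = pair
    return np.array([1.0, a, b, a * a, a * b, b * b])

# Fitting proceeds exactly as in the linear case, with rows phi(x_t)
y = np.sin(0.3 * np.arange(100))      # a toy series (assumed data)
Phi = np.vstack([phi((y[t - 1], y[t - 2])) for t in range(2, 100)])
w, *_ = np.linalg.lstsq(Phi, y[2:100], rcond=None)
```

A sinusoid satisfies an exact AR(2) recursion, so this augmented fit reproduces the series essentially exactly.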
Nonlinear classification

Non-linear (e.g., quadratic) decision boundary:

w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_1 x_2 + w_5 x_2^2 + b = 0.

This writes w^T φ(x) + b = 0, with φ(x) := (x_1, x_2, x_1^2, x_1 x_2, x_2^2).
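As an illustration of why this helps, a unit circle (not linearly separable in R^2) becomes a linear boundary in the φ-space. The particular weights below are an assumed example encoding x_1^2 + x_2^2 = 1:

```python
import numpy as np

def phi(x):
    """Quadratic feature map from the slide."""
    x1, x2 = x
    return np.array([x1, x2, x1 ** 2, x1 * x2, x2 ** 2])

# The circle x1^2 + x2^2 = 1 is the linear boundary w^T phi(x) + b = 0
w = np.array([0.0, 0.0, 1.0, 0.0, 1.0])
b = -1.0

inside = np.sign(w @ phi((0.1, 0.2)) + b)   # point inside the circle
outside = np.sign(w @ phi((1.5, 0.0)) + b)  # point outside the circle
```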
Challenges

In principle, it seems we can always augment the dimension of the feature space enough to make the data linearly separable. (See the video at http://www.youtube.com/watch?v=3liCbRZPrZA)

How do we do it in a computationally efficient manner?
Linear least-squares

min_w ||X^T w - y||_2^2 + λ||w||_2^2

where
- X = [x_1, ..., x_m] is the n × m matrix of data points,
- y ∈ R^m is the "response" vector,
- w contains the regression coefficients,
- λ ≥ 0 is a regularization parameter.

Prediction rule: y = w^T x, where x ∈ R^n is a new data point.
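Setting the gradient of the objective to zero gives the closed-form solution w = (XX^T + λI)^{-1} X y. A minimal sketch on random data (the dimensions and λ are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 5, 40                        # feature dimension, number of points
X = rng.standard_normal((n, m))     # columns are the data points x_1..x_m
y = rng.standard_normal(m)
lam = 0.1

# Closed-form solution of min_w ||X^T w - y||_2^2 + lam ||w||_2^2
w = np.linalg.solve(X @ X.T + lam * np.eye(n), X @ y)

# Prediction on a new data point x
x_new = rng.standard_normal(n)
y_pred = w @ x_new
```

The gradient 2X(X^T w - y) + 2λw vanishes at this w, which certifies optimality of the convex objective.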
Support vector machine (SVM)

min_w Σ_{i=1}^m max(0, 1 - y_i(w^T x_i + b)) + λ||w||_2^2

where
- X = [x_1, ..., x_m] is the n × m matrix of data points in R^n,
- y ∈ {-1, 1}^m is the label vector,
- w, b contain the classifier coefficients,
- λ ≥ 0 is a regularization parameter.

In the sequel, we'll ignore the bias term b (for simplicity only).

Classification rule: y = sign(w^T x + b), where x ∈ R^n is a new data point.
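A minimal subgradient-descent sketch of this training problem, ignoring the bias term as the slide does. The two Gaussian point clouds, step size, and iteration count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

# Two linearly separable point clouds in R^2, labels in {-1, +1} (assumed data)
Xpts = np.vstack([rng.standard_normal((30, 2)) + 2.5,
                  rng.standard_normal((30, 2)) - 2.5])
y = np.concatenate([np.ones(30), -np.ones(30)])
lam = 0.01

# Subgradient descent on sum_i max(0, 1 - y_i w^T x_i) + lam ||w||_2^2
w = np.zeros(2)
for _ in range(1000):
    margins = y * (Xpts @ w)
    active = margins < 1                 # points whose hinge term is nonzero
    grad = -(y[active, None] * Xpts[active]).sum(axis=0) + 2 * lam * w
    w -= 0.01 * grad

accuracy = np.mean(np.sign(Xpts @ w) == y)
```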
Generic form of problem

Many classification and regression problems can be written as

min_w L(X^T w, y) + λ||w||_2^2

where
- X = [x_1, ..., x_m] is the n × m matrix of data points,
- y ∈ R^m is a response vector (or label vector),
- w contains the classifier/regression coefficients,
- L is a "loss" function that depends on the problem considered,
- λ ≥ 0 is a regularization parameter.

The prediction/classification rule depends only on w^T x, where x ∈ R^n is a new data point.
Loss functions

- Squared loss (for linear least-squares regression):

  L(z, y) = ||z - y||_2^2.

- Hinge loss (for SVMs):

  L(z, y) = Σ_{i=1}^m max(0, 1 - y_i z_i).

- Logistic loss (for logistic regression):

  L(z, y) = Σ_{i=1}^m log(1 + e^{-y_i z_i}).
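The three losses above translate directly into numpy, with z the vector of predictions X^T w:

```python
import numpy as np

def squared_loss(z, y):
    """||z - y||_2^2, for least-squares regression."""
    return np.sum((z - y) ** 2)

def hinge_loss(z, y):
    """sum_i max(0, 1 - y_i z_i), for SVMs; y_i in {-1, +1}."""
    return np.sum(np.maximum(0.0, 1.0 - y * z))

def logistic_loss(z, y):
    """sum_i log(1 + exp(-y_i z_i)), for logistic regression."""
    return np.sum(np.log1p(np.exp(-y * z)))
```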
Key result

For the generic problem

min_w L(X^T w, y) + λ||w||_2^2,

the optimal w lies in the span of the data points x_1, ..., x_m:

w = Xv

for some vector v ∈ R^m.
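This can be checked numerically for ridge regression: even when n > m, the solution w of the regularized problem is exactly a linear combination of the columns of X. The dimensions and λ below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 50, 10                      # many features, few data points (n > m)
X = rng.standard_normal((n, m))
y = rng.standard_normal(m)
lam = 0.5

# Ridge solution in R^n
w = np.linalg.solve(X @ X.T + lam * np.eye(n), X @ y)

# Fit w by the columns of X: the residual vanishes, i.e. w = Xv exactly
v, *_ = np.linalg.lstsq(X, w, rcond=None)
residual = np.linalg.norm(X @ v - w)
```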
Proof

Any w ∈ R^n can be written as the sum of two orthogonal vectors:

w = Xv + r,

where X^T r = 0 (that is, r lies in the nullspace N(X^T)). Then X^T w = X^T X v, so the loss term does not depend on r, while ||w||_2^2 = ||Xv||_2^2 + ||r||_2^2 by orthogonality. Setting r = 0 can therefore only decrease the objective, so an optimal w has the form w = Xv.
The orthogonal decomposition rests on the fundamental theorem of linear algebra. Since for any subspace S it holds that S⊥⊥ = S, taking the orthogonal complement of both sides of N(A) = R(A^T)⊥ we obtain N(A)⊥ = R(A^T), hence

R^n = N(A) ⊕ N(A)⊥ = N(A) ⊕ R(A^T).

This means that the input space R^n can be decomposed as the direct sum of the two orthogonal subspaces N(A) and R(A^T). Since dim R(A^T) = dim R(A) = rank(A), and obviously dim R^n = n, we also obtain

dim N(A) + rank(A) = n.

With similar reasoning, we argue that

R(A)⊥ = {y ∈ R^m : y^T z = 0, ∀z ∈ R(A)} = {y ∈ R^m : y^T Ax = 0, ∀x ∈ R^n} = N(A^T),

hence, taking the orthogonal complement of both sides, we see that

R(A) = N(A^T)⊥.

Therefore, the output space R^m decomposes as

R^m = R(A) ⊕ R(A)⊥ = R(A) ⊕ N(A^T),

and

dim N(A^T) + rank(A) = m.

[Figure: illustration of the fundamental theorem of linear algebra in R^3, with A = [a_1 a_2]; the figure shows the case X = A = (a_1, a_2).]
Consequence of key result

For the generic problem

min_w L(X^T w, y) + λ||w||_2^2,

the optimal w can be written as w = Xv for some vector v ∈ R^m.

Hence the training problem depends on the data only through the m × m matrix K := X^T X:

min_v L(Kv, y) + λ v^T K v.
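For the squared loss this can be verified directly: setting the gradient of ||Kv - y||^2 + λ v^T K v to zero is solved by v = (K + λI)^{-1} y, and w = Xv recovers exactly the direct ridge solution. A sketch on random data (dimensions and λ are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 100, 15
X = rng.standard_normal((n, m))
y = rng.standard_normal(m)
lam = 0.3

# Kernel matrix: m x m instead of working in R^n
K = X.T @ X

# Kernelized training: min_v ||Kv - y||^2 + lam v^T K v,
# solved by v = (K + lam I)^{-1} y
v = np.linalg.solve(K + lam * np.eye(m), y)

# Recover w = Xv; it matches the direct ridge solution in R^n
w_kernel = X @ v
w_direct = np.linalg.solve(X @ X.T + lam * np.eye(n), X @ y)
```

The equality w_kernel = w_direct is the "push-through" identity (XX^T + λI)^{-1} X = X (X^T X + λI)^{-1}.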
Kernel matrix

The training problem depends only on the "kernel matrix" K = X^T X:

K_ij = x_i^T x_j.

K contains the scalar products between all pairs of data points.

The prediction/classification rule depends on the scalar products between the new point x and the data points x_1, ..., x_m:

w^T x = v^T X^T x = v^T k, where k := X^T x = (x^T x_1, ..., x^T x_m).
Computational advantages

Once K is formed (this takes O(m^2 n) operations, one O(n) scalar product per entry), the training problem has only m variables.

When n ≫ m, this leads to a dramatic reduction in problem size.
How about the nonlinear case?

In the nonlinear case, we simply replace the feature vectors x_i by "augmented" feature vectors φ(x_i), with φ a non-linear mapping.

Example: in classification with a quadratic decision boundary, we use

φ(x) := (x_1, x_2, x_1^2, x_1 x_2, x_2^2).

This leads to the modified kernel matrix

K_ij = φ(x_i)^T φ(x_j), 1 ≤ i, j ≤ m.
The kernel function

The kernel function associated with a mapping φ is

k(x, z) = φ(x)^T φ(z).

It provides information about the metric in the feature space, e.g.:

||φ(x) - φ(z)||_2^2 = k(x, x) - 2 k(x, z) + k(z, z).

The computational effort involved in
- solving the training problem, and
- making a prediction,

depends only on our ability to quickly evaluate such scalar products.

We can't choose k arbitrarily; it has to satisfy the above for some φ.
Quadratic kernels

Classification with quadratic boundaries involves feature vectors

φ(x) = (1, x_1, x_2, x_1^2, x_1 x_2, x_2^2).

Fact: given two vectors x, z ∈ R^2, if we scale the components of φ as φ(x) = (1, √2 x_1, √2 x_2, x_1^2, √2 x_1 x_2, x_2^2) (which spans the same set of quadratic functions), then

φ(x)^T φ(z) = (1 + x^T z)^2.
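The identity is easy to check numerically with the scaled map; the test points are arbitrary assumed values:

```python
import numpy as np

def phi(x):
    """Scaled quadratic feature map: with the sqrt(2) factors,
    phi(x)^T phi(z) equals (1 + x^T z)^2 exactly."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 ** 2, s * x1 * x2, x2 ** 2])

x = np.array([0.3, -1.2])
z = np.array([2.0, 0.5])

lhs = phi(x) @ phi(z)        # explicit feature-space scalar product
rhs = (1.0 + x @ z) ** 2     # kernel evaluation, no feature map needed
```

Expanding (1 + x_1 z_1 + x_2 z_2)^2 term by term reproduces exactly the six products in phi(x) @ phi(z).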
Polynomial kernels

More generally, when φ(x) is the vector formed with all products between the components of x ∈ R^n up to degree d (again with suitable scaling of the monomials), then for any two vectors x, z ∈ R^n,

φ(x)^T φ(z) = (1 + x^T z)^d.

Computational effort grows linearly in n.

This is a dramatic speedup over the "brute force" approach:
- form φ(x), φ(z);
- evaluate φ(x)^T φ(z).

There, the computational effort grows as O(n^d).
Other kernels

Gaussian kernel function:

k(x, z) = exp( -||x - z||_2^2 / (2σ^2) ),

where σ > 0 is a scale parameter. It allows us to effectively ignore points that are too far apart, and corresponds to a non-linear mapping φ to an infinite-dimensional feature space.

There is a large variety (a zoo?) of other kernels, some adapted to the structure of the data (text, images, etc.).
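A minimal sketch of forming the Gaussian kernel matrix for the columns of a data matrix X, using the standard pairwise-distance expansion ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 x_i^T x_j; the dimensions and σ are illustrative assumptions:

```python
import numpy as np

def gaussian_kernel(X, sigma):
    """Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
    for the columns x_i of X (shape n x m)."""
    sq = np.sum(X ** 2, axis=0)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X.T @ X   # pairwise squared distances
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

X = np.random.default_rng(5).standard_normal((3, 6))
K = gaussian_kernel(X, sigma=1.0)
```

K is symmetric with unit diagonal, and (as for any valid kernel) positive semidefinite.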
In practice
- Kernels need to be chosen by the user.
- The choice is not always obvious; Gaussian and polynomial kernels are popular.
- Control over-fitting via cross-validation (with respect to, say, the scale parameter of the Gaussian kernel, or the degree of the polynomial kernel).
- Kernel methods are not well adapted to l1-norm regularization.