
KNN, Kernels, Margins

Grzegorz Chrupala and Nicolas Stroppa

Google / Saarland University

META

ML for NLP, 2010


Outline

1 K-Nearest neighbors classifier

2 Kernels

3 SVM


KNN classifier

K-Nearest neighbors idea

When classifying a new example, find the k nearest training examples and assign the majority label

Also known as:
  - Memory-based learning
  - Instance- or exemplar-based learning
  - Similarity-based methods
  - Case-based reasoning
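A minimal Python sketch of this procedure (the function names and the choice of Euclidean distance are illustrative, not prescribed by the slides):

from collections import Counter
import math

def euclidean(x, z):
    # L2 distance between two equal-length numeric vectors
    return math.sqrt(sum((xi - zi) ** 2 for xi, zi in zip(x, z)))

def knn_classify(train_X, train_y, x_new, k=3):
    # sort the training examples by distance to the new example ...
    neighbors = sorted(zip(train_X, train_y), key=lambda pair: euclidean(pair[0], x_new))
    # ... and return the majority label among the k nearest ones
    labels = [label for _, label in neighbors[:k]]
    return Counter(labels).most_common(1)[0][0]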


Distance metrics in feature space

Euclidean distance or L2 norm in d-dimensional space:

D(x, x') = \sqrt{\sum_{i=1}^{d} (x_i - x'_i)^2}

L1 norm (Manhattan or taxicab distance):

L_1(x, x') = \sum_{i=1}^{d} |x_i - x'_i|

L∞ or maximum norm:

L_\infty(x, x') = \max_{i=1}^{d} |x_i - x'_i|

In general, the Lk norm:

L_k(x, x') = \left( \sum_{i=1}^{d} |x_i - x'_i|^k \right)^{1/k}
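The same metrics as a small Python sketch (function names are illustrative):

def minkowski(x, z, k):
    # general Lk norm: k=1 gives the Manhattan distance, k=2 the Euclidean distance
    return sum(abs(xi - zi) ** k for xi, zi in zip(x, z)) ** (1.0 / k)

def max_norm(x, z):
    # L-infinity norm: the largest per-dimension absolute difference
    return max(abs(xi - zi) for xi, zi in zip(x, z))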


Hamming distance

Hamming distance used to compare strings of symbolic attributes

Equivalent to L1 norm for binary strings

Defines the distance between two instances to be the sum of per-feature distances

For symbolic features the per-feature distance is 0 for an exact match and 1 for a mismatch.

\text{Hamming}(x, x') = \sum_{i=1}^{d} \delta(x_i, x'_i)        (1)

\delta(x_i, x'_i) = \begin{cases} 0 & \text{if } x_i = x'_i \\ 1 & \text{if } x_i \neq x'_i \end{cases}        (2)
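A short sketch of this overlap count for two tuples of symbolic features:

def hamming(x, z):
    # sum of per-feature distances: 0 for a match, 1 for a mismatch
    return sum(0 if xi == zi else 1 for xi, zi in zip(x, z))

For example, hamming(('NN', 'the', 'cat'), ('NN', 'a', 'cat')) is 1.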


IB1 algorithm

For a vector with a mixture of symbolic and numeric values, the above definition of per-feature distance is used for symbolic features, while for numeric ones we can use the scaled absolute difference

\delta(x_i, x'_i) = \frac{|x_i - x'_i|}{\max_i - \min_i}        (3)


IB1 with feature weighting

The per-feature distance is multiplied by the weight of the feature for which it is computed:

D_w(x, x') = \sum_{i=1}^{d} w_i \, \delta(x_i, x'_i)        (4)

where w_i is the weight of the i-th feature.

We'll describe two entropy-based methods and a χ2-based method to find a good weight vector w.
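A sketch of the weighted IB1 distance for instances mixing symbolic and numeric features (the feature-type bookkeeping via a ranges list is an assumption made for this sketch, not something the slides specify):

def per_feature_distance(xi, zi, lo=None, hi=None):
    # numeric features: scaled absolute difference; symbolic features: 0/1 overlap
    if isinstance(xi, (int, float)) and isinstance(zi, (int, float)):
        return abs(xi - zi) / (hi - lo) if hi is not None and hi != lo else abs(xi - zi)
    return 0 if xi == zi else 1

def weighted_distance(x, z, weights, ranges):
    # ranges[i] is (min_i, max_i) for numeric features and None for symbolic ones
    total = 0.0
    for i, (xi, zi) in enumerate(zip(x, z)):
        lo, hi = ranges[i] if ranges[i] is not None else (None, None)
        total += weights[i] * per_feature_distance(xi, zi, lo, hi)
    return total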


Information gain

A measure of how much knowing the value of a certain feature for an example decreases our uncertainty about its class, i.e. the difference in class entropy with and without information about the feature value.

w_i = H(Y) - \sum_{v \in V_i} P(v) H(Y|v)        (5)

where

  w_i is the weight of the i-th feature
  Y is the set of class labels
  V_i is the set of possible values for the i-th feature
  P(v) is the probability of value v
  the class entropy is H(Y) = -\sum_{y \in Y} P(y) \log_2 P(y)
  H(Y|v) is the conditional class entropy given that the feature value = v

Numeric values need to be temporarily discretized for this to work.


Gain ratio

IG assigns excessive weight to features with a large number of values. To remedy this bias, information gain can be normalized by the entropy of the feature values, which gives the gain ratio:

w_i = \frac{H(Y) - \sum_{v \in V_i} P(v) H(Y|v)}{H(V_i)}        (6)

For a feature with a unique value for each instance in the training set, the entropy of the feature values in the denominator will be maximally high, and will thus give it a low weight.
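Both weightings can be computed from counts of feature values against class labels; a compact sketch for one symbolic feature (function names are illustrative):

import math
from collections import Counter

def entropy(counts):
    # entropy of a distribution given as a list of counts
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def ig_and_gain_ratio(values, labels):
    # values: this feature's value for each training instance; labels: the class labels
    h_y = entropy(list(Counter(labels).values()))
    n = len(labels)
    h_y_given_f = 0.0
    for v, count in Counter(values).items():
        subset = [y for x, y in zip(values, labels) if x == v]
        h_y_given_f += (count / n) * entropy(list(Counter(subset).values()))
    ig = h_y - h_y_given_f                                 # equation (5)
    h_v = entropy(list(Counter(values).values()))
    return ig, (ig / h_v if h_v > 0 else 0.0)              # gain ratio, equation (6)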


χ2

The χ2 statistic for a problem with k classes and m values for feature F :

\chi^2 = \sum_{i=1}^{k} \sum_{j=1}^{m} \frac{(E_{ij} - O_{ij})^2}{E_{ij}}        (7)

where

  O_{ij} is the observed number of instances with the i-th class label and the j-th value of feature F
  E_{ij} is the expected number of such instances if the null hypothesis is true: E_{ij} = \frac{n_{i\cdot}\, n_{\cdot j}}{n_{\cdot\cdot}}
  n_{ij} is the frequency count of instances with the i-th class label and the j-th value of feature F
    n_{\cdot j} = \sum_{i=1}^{k} n_{ij}
    n_{i\cdot} = \sum_{j=1}^{m} n_{ij}
    n_{\cdot\cdot} = \sum_{i=1}^{k} \sum_{j=1}^{m} n_{ij}


χ2 example

Consider a spam detection task: your features are words present/absent in email messages

Compute χ2 for each word to use as weights for a KNN classifier

The statistic can be computed from a contingency table. E.g., these are (fake) counts of the word rock-hard in 2000 messages:

           rock-hard   ¬rock-hard
  ham              4          996
  spam           100          900

We need to sum (E_{ij} - O_{ij})^2 / E_{ij} over the four cells in the table:

\frac{(52-4)^2}{52} + \frac{(948-996)^2}{948} + \frac{(52-100)^2}{52} + \frac{(948-900)^2}{948} = 93.4761
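The same figure can be checked mechanically; a small sketch that derives the expected counts from the row and column totals of the table above:

observed = {("ham", "rock-hard"): 4, ("ham", "absent"): 996,
            ("spam", "rock-hard"): 100, ("spam", "absent"): 900}
row_totals = {"ham": 1000, "spam": 1000}
col_totals = {"rock-hard": 104, "absent": 1896}
n = 2000

chi2 = 0.0
for (cls, val), observed_count in observed.items():
    expected = row_totals[cls] * col_totals[val] / n   # E_ij under the null hypothesis
    chi2 += (expected - observed_count) ** 2 / expected
print(round(chi2, 4))   # 93.4761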


Distance-weighted class voting

So far all the instances in the neighborhood are weighted equally for computing the majority class

We may want to treat the votes from very close neighbors as more important than votes from more distant ones

A variety of distance weighting schemes have been proposed to implement this idea (one common scheme is sketched below)
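A minimal sketch of inverse-distance weighting (the 1/(distance + eps) form is an illustrative choice; the slides do not commit to a particular scheme):

from collections import defaultdict

def weighted_vote(neighbors, eps=1e-6):
    # neighbors: (distance, label) pairs for the k nearest training examples
    votes = defaultdict(float)
    for dist, label in neighbors:
        votes[label] += 1.0 / (dist + eps)   # closer neighbors contribute larger votes
    return max(votes, key=votes.get)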


KNN – summary

Non-parametric: makes no assumptions about the probability distribution the examples come from

Does not assume data is linearly separable

Derives decision rule directly from training data

“Lazy learning”:
  - During learning, little “work” is done by the algorithm: the training instances are simply stored in memory in some efficient manner.
  - During prediction, the test instance is compared to the training instances, the neighborhood is calculated, and the majority label is assigned.

No information discarded: “exceptional” and low-frequency training instances are available for prediction


Perceptron – dual formulation

The weight vector ends up being a linear combination of training examples:

w = \sum_{i=1}^{n} \alpha_i y^{(i)} x^{(i)}

where \alpha_i is the number of times the i-th training example was misclassified (0 if it never was).

The discriminant function then becomes:

g(x) = \left( \sum_{i=1}^{n} \alpha_i y^{(i)} x^{(i)} \right) \cdot x + b        (8)

     = \sum_{i=1}^{n} \alpha_i y^{(i)} (x^{(i)} \cdot x) + b        (9)


Dual Perceptron training

DualPerceptron(x^{1:N}, y^{1:N}, I):

1: α ← 0
2: b ← 0
3: for j = 1...I do
4:   for k = 1...N do
5:     if y^(k) ( \sum_{i=1}^{N} α_i y^(i) (x^(i) · x^(k)) + b ) ≤ 0 then
6:       α_k ← α_k + 1
7:       b ← b + y^(k)
8: return (α, b)
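The same training loop as a runnable Python sketch (assuming numeric feature vectors and labels in {-1, +1}; everything stays in terms of dot products, which is what makes the kernel generalization below possible):

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def dual_perceptron(X, y, iterations):
    # X: list of feature vectors, y: list of labels in {-1, +1}
    N = len(X)
    alpha = [0] * N
    b = 0.0
    for _ in range(iterations):
        for k in range(N):
            score = sum(alpha[i] * y[i] * dot(X[i], X[k]) for i in range(N)) + b
            if y[k] * score <= 0:       # example k is misclassified
                alpha[k] += 1
                b += y[k]
    return alpha, b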


Kernels

Note that in the dual formulation there is no explicit weight vector: the training algorithm and the classification are expressed in terms of dot products between training examples and the test example

We can generalize such dual algorithms to use kernel functions:
  - A kernel function can be thought of as a dot product in some transformed feature space:

    K(x, z) = \phi(x) \cdot \phi(z)

    where the map \phi projects the vectors in the original feature space onto the transformed feature space
  - It can also be thought of as a similarity function in the input object space


Kernel – example

Consider the following kernel function K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}:

K(x, z) = (x \cdot z)^2        (10)

        = \left( \sum_{i=1}^{d} x_i z_i \right) \left( \sum_{j=1}^{d} x_j z_j \right)        (11)

        = \sum_{i=1}^{d} \sum_{j=1}^{d} x_i x_j z_i z_j        (12)

        = \sum_{i,j=1}^{d} (x_i x_j)(z_i z_j)        (13)


Kernel vs feature map

Feature map φ corresponding to K for d = 2 dimensions

\phi(x_1, x_2) = (x_1 x_1,\; x_1 x_2,\; x_2 x_1,\; x_2 x_2)        (14)

Computing the feature map φ explicitly needs O(d^2) time

Computing K is linear O(d) in the number of dimensions
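A quick numeric check that the kernel and the explicit feature map agree, for d = 2 as on the slide:

def phi(x):
    # explicit quadratic feature map: all pairwise products x_i * x_j
    return [xi * xj for xi in x for xj in x]

def k_quad(x, z):
    # quadratic kernel: the squared dot product, computed in O(d)
    return sum(xi * zi for xi, zi in zip(x, z)) ** 2

x, z = [1.0, 2.0], [3.0, 0.5]
assert abs(k_quad(x, z) - sum(a * b for a, b in zip(phi(x), phi(z)))) < 1e-9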


Why does it matter

If you think of features as binary indicators, then the quadratic kernel above creates feature conjunctions

E.g. in NER, if x_1 indicates that the word is capitalized and x_2 indicates that the previous token is a sentence boundary, then with the quadratic kernel we efficiently compute the feature that both conditions are the case.

Geometric intuition: mapping points to a higher-dimensional space makes them easier to separate with a linear boundary


Separability in 2D and 3D

Figure: Two-dimensional classification example, non-separable in two dimensions, becomes separable when mapped to 3 dimensions by (x_1, x_2) \mapsto (x_1^2, 2 x_1 x_2, x_2^2)


Support Vector Machines

Margin: a decision boundary which is as far away from the training instances as possible improves the chance that if the position of the data points is slightly perturbed, the decision boundary will still be correct.

Results from Statistical Learning Theory confirm these intuitions: maintaining large margins leads to small generalization error (Vapnik 1995)

A perceptron algorithm finds any hyperplane which separates the classes: SVM finds the one that additionally has the maximum margin


Quadratic optimization formulation

The functional margin can be made larger just by rescaling the weights by a constant

Hence we can fix the functional margin to be 1 and minimize the norm of the weight vector

This is equivalent to maximizing the geometric margin

For linearly separable training instances ((x_1, y_1), ..., (x_n, y_n)), find the hyperplane (w, b) that solves the optimization problem:

minimize_{w,b}   \frac{1}{2} \|w\|^2
subject to       y_i (w \cdot x_i + b) \geq 1   \forall i \in 1..n        (15)

This hyperplane separates the examples with geometric margin 2/||w||


Support vectors

SVM finds a separating hyperplane with the largest margin to the nearest instance

This has the effect that the decision boundary is fully determined by a small subset of the training examples (the nearest ones on both sides)

Those instances are the support vectors
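An illustrative sketch using scikit-learn (assuming it is available; a very large C approximates the hard-margin problem (15) on separable data):

import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 2.0], [2.0, 3.0]])   # toy separable data
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]
print("geometric margin:", 2 / np.linalg.norm(w))
print("support vectors:", clf.support_vectors_)   # the nearest instances on both sides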


Separating hyperplane and support vectors

[Figure: scatter plot over axes x and y (both ranging from -4 to 4) showing the two classes, the maximum-margin separating hyperplane, and the support vectors.]


Soft margin

SVM with a soft margin works by relaxing the requirement that all data points lie outside the margin

For each offending instance there is a “slack variable” ξ_i which measures how much it would have to move to obey the margin constraint:

minimize_{w,b}   \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i
subject to       y_i (w \cdot x_i + b) \geq 1 - \xi_i,  \quad \xi_i \geq 0   \forall i \in 1..n        (16)

where \xi_i = \max(0, 1 - y_i (w \cdot x_i + b))

The hyper-parameter C trades off minimizing the norm of the weight vector versus classifying correctly as many examples as possible.

As the value of C tends towards infinity the soft-margin SVM approximates the hard-margin version.
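A small sketch of the slack values for a given linear model (w, b and the data below are hypothetical placeholders; the point is just the max(0, 1 - y_i(w · x_i + b)) computation):

import numpy as np

def slacks(w, b, X, y):
    # one slack value per instance; 0 means the margin constraint is satisfied
    return np.maximum(0.0, 1.0 - y * (X @ w + b))

w, b = np.array([1.0, 0.0]), -1.0
X = np.array([[3.0, 0.0], [1.2, 0.0]])
y = np.array([1, 1])
print(slacks(w, b, X, y))   # [0.  0.8]: the second point violates the margin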


Dual form

The dual formulation is in terms of support vectors, where SV is the set of their indices:

f(x; \alpha^*, b^*) = \mathrm{sign}\left( \sum_{i \in SV} y_i \alpha^*_i (x_i \cdot x) + b^* \right)        (17)

The weights in this decision function are the \alpha^*_i; they are obtained by solving the dual optimization problem:

maximize    W(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} y_i y_j \alpha_i \alpha_j (x_i \cdot x_j)
subject to  \sum_{i=1}^{n} y_i \alpha_i = 0,  \quad \alpha_i \geq 0   \forall i \in 1..n        (18)

The weights together with the support vectors determine (w, b):

w = \sum_{i \in SV} \alpha_i y_i x_i        (19)

b = y_k - w \cdot x_k   for any k such that \alpha_k \neq 0        (20)
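In scikit-learn's SVC the dual solution is exposed directly, so (19) and (20) can be checked against the primal coefficients (a sketch; dual_coef_ stores the products y_i α_i for the support vectors):

import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 2.0], [2.0, 3.0]])
y = np.array([-1, -1, 1, 1])
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.dual_coef_[0] @ clf.support_vectors_    # equation (19): sum of alpha_i y_i x_i
k = clf.support_[0]                             # index of one support vector
b = y[k] - w @ X[k]                             # equation (20)
print(w, clf.coef_[0])                          # should agree up to solver tolerance
print(b, clf.intercept_[0])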


Summary

With the kernel trick we can
  - use linear models with non-linearly separable data
  - use polynomial kernels to create implicit feature conjunctions

Large margin methods help us choose a linear separator with good generalization properties


Efficient averaged perceptron algorithm

Perceptron(x^{1:N}, y^{1:N}, I):

 1: w ← 0 ; w_a ← 0
 2: b ← 0 ; b_a ← 0
 3: c ← 1
 4: for i = 1...I do
 5:   for n = 1...N do
 6:     if y^(n) (w · x^(n) + b) ≤ 0 then
 7:       w ← w + y^(n) x^(n) ; b ← b + y^(n)
 8:       w_a ← w_a + c y^(n) x^(n) ; b_a ← b_a + c y^(n)
 9:     c ← c + 1
10: return (w − w_a/c, b − b_a/c)
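The same algorithm as a runnable Python sketch (numpy arrays, labels assumed in {-1, +1}):

import numpy as np

def averaged_perceptron(X, y, iterations):
    w, wa = np.zeros(X.shape[1]), np.zeros(X.shape[1])
    b, ba = 0.0, 0.0
    c = 1.0
    for _ in range(iterations):
        for xn, yn in zip(X, y):
            if yn * (w @ xn + b) <= 0:        # mistake: update both weight vectors
                w += yn * xn
                b += yn
                wa += c * yn * xn
                ba += c * yn
            c += 1                            # the counter advances on every example
    return w - wa / c, b - ba / c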


Problem: Average perceptron

Weight averaging

Show that the above algorithm performs weight averaging. Hints:

  - In the standard perceptron algorithm, the final weight vector (and bias) is the sum of the updates at each step.
  - In the averaged perceptron, the final weight vector should be the mean of the sum of partial sums of updates at each step.


Solution

Let’s formalize:

Basic perceptron: the final weights are the sum of the updates at each step (write f(x^{(i)}) for the update triggered by the i-th example, i.e. y^{(i)} x^{(i)} if it was misclassified and 0 otherwise):

w = \sum_{i=1}^{n} f(x^{(i)})        (21)

Naive weight averaging: the final weights are the mean of the sum of partial sums:

w = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{i} f(x^{(j)})        (22)

Efficient weight averaging:

w = \sum_{i=1}^{n} f(x^{(i)}) - \frac{1}{n} \sum_{i=1}^{n} i\, f(x^{(i)})        (23)


Show that equations 22 and 23 are equivalent. Note that we can rewrite the sum of partial sums by multiplying the update at each step by the factor indicating in how many of the partial sums it appears:

w = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{i} f(x^{(j)})        (24)

  = \frac{1}{n} \sum_{i=1}^{n} (n - i) f(x^{(i)})        (25)

  = \frac{1}{n} \left[ \sum_{i=1}^{n} n f(x^{(i)}) - i f(x^{(i)}) \right]        (26)

  = \frac{1}{n} \left[ n \sum_{i=1}^{n} f(x^{(i)}) - \sum_{i=1}^{n} i f(x^{(i)}) \right]        (27)

  = \sum_{i=1}^{n} f(x^{(i)}) - \frac{1}{n} \sum_{i=1}^{n} i f(x^{(i)})        (28)


Viterbi algorithm for second-order HMM

In the lecture we saw the formulation of the Viterbi algorithm for a first-order Hidden Markov Model. In a second-order HMM, transition probabilities depend on the two previous states instead of just the single previous state, that is, we use the following independence assumption:

P(y_i \mid x_1, \ldots, x_{i-1}, y_1, \ldots, y_{i-1}) = P(y_i \mid y_{i-2}, y_{i-1})

For this exercise you should formulate the Viterbi algorithm for decoding a second-order HMM.
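One possible dynamic-programming formulation, sketched in Python (the dense numpy arrays, the assumption of nonzero probabilities, and the handling of the first two positions via an initial pair distribution are choices made for this sketch, not part of the exercise statement):

import numpy as np

def viterbi_second_order(obs, pi2, trans2, emit):
    # obs            : sequence of observation indices, length T >= 2
    # pi2[u, v]      : P(y_1 = u, y_2 = v)
    # trans2[u, v, w]: P(y_t = w | y_{t-2} = u, y_{t-1} = v)
    # emit[s, o]     : P(x_t = o | y_t = s)
    S, T = emit.shape[0], len(obs)
    # delta[v, w]: best log-probability of a state sequence ending in (y_{t-1}, y_t) = (v, w)
    delta = (np.log(pi2)
             + np.log(emit[:, obs[0]])[:, None]
             + np.log(emit[:, obs[1]])[None, :])
    back = np.zeros((T, S, S), dtype=int)
    for t in range(2, T):
        new = np.empty((S, S))
        for v in range(S):                  # y_{t-1}
            for w in range(S):              # y_t
                scores = delta[:, v] + np.log(trans2[:, v, w])
                back[t, v, w] = int(np.argmax(scores))
                new[v, w] = scores[back[t, v, w]] + np.log(emit[w, obs[t]])
        delta = new
    states = [0] * T
    states[T - 2], states[T - 1] = np.unravel_index(int(np.argmax(delta)), (S, S))
    for t in range(T - 1, 1, -1):           # follow the back-pointers over state pairs
        states[t - 2] = back[t, states[t - 1], states[t]]
    return states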
