K Nearest Neighbor and Rocchio Algorithm
LING 572
Fei Xia
1/11/2007
Announcement
• Hw2 is online now. It is due on Jan 20.
• Hw1 is due at 11pm on Jan 13 (Sat).
• Lab session after the class.
• Read DT Tutorial before next Tues’ class.
K-Nearest Neighbor (kNN)
Instance-based (IB) learning
• No training: store all training instances. “Lazy learning”
• Examples:
  – kNN
  – Locally weighted regression
  – Radial basis functions
  – Case-based reasoning
  – …
• The most well-known IB method: kNN
kNN
• For a new document d:
  – Find the k training documents that are closest to d.
  – Perform majority voting or weighted voting.
• Properties:
  – A “lazy” classifier: no training.
  – Feature selection and the distance measure are crucial.
The algorithm
• Determine the parameter K.
• Calculate the distance between the query instance and all the training instances.
• Sort the distances and determine the K nearest neighbors.
• Gather the labels of the K nearest neighbors.
• Use simple majority voting or weighted voting.
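The steps above can be sketched in Python; the function and variable names (and the toy data) are illustrative, not from the slides:

```python
import math
from collections import Counter

def euclidean(a, b):
    # Euclidean distance between two feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(query, training, k):
    """training: list of (feature_vector, label) pairs."""
    # 1. distance from the query to every training instance
    dists = [(euclidean(query, vec), label) for vec, label in training]
    # 2. sort and keep the k nearest neighbors
    dists.sort(key=lambda pair: pair[0])
    neighbors = dists[:k]
    # 3. simple majority vote over the neighbors' labels
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

training = [((1.0, 1.0), "A"), ((1.2, 0.9), "A"),
            ((5.0, 5.0), "B"), ((5.1, 4.8), "B")]
print(knn_classify((1.1, 1.0), training, k=3))  # prints A
```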
Picking K
• Use N-fold cross validation: pick the one that minimizes cross validation error.
Normalizing attribute values
• Distance can be dominated by attributes with large ranges:
  – Ex: features: age, income
  – Original data: x1 = (35, 76K), x2 = (36, 80K), x3 = (70, 79K)
  – Assume: age ∈ [0, 100], income ∈ [0, 200K]
  – After normalization: x1 = (0.35, 0.38), x2 = (0.36, 0.40), x3 = (0.70, 0.395)
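A minimal sketch of this min-max normalization, using the assumed attribute ranges; the function name is illustrative:

```python
def min_max_normalize(data, ranges):
    # scale each attribute into [0, 1] using its assumed (min, max) range
    return [tuple((v - lo) / (hi - lo) for v, (lo, hi) in zip(row, ranges))
            for row in data]

data = [(35, 76000), (36, 80000), (70, 79000)]
ranges = [(0, 100), (0, 200000)]
normalized = min_max_normalize(data, ranges)
# normalized values: (0.35, 0.38), (0.36, 0.40), (0.70, 0.395)
```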
The Choice of Features
• Imagine there are 100 features, and only 2 of them are relevant to the target label.
• kNN is easily misled in high-dimensional space.
• Remedy: feature weighting or feature selection.
Feature weighting
• Stretch the j-th axis by weight wj.
• Use cross-validation to automatically choose the weights w1, …, wn.
• Setting wj to zero eliminates that dimension altogether.
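Stretching each axis by its weight amounts to a weighted distance; a sketch (the function name is illustrative):

```python
import math

def weighted_euclidean(a, b, w):
    # stretch the j-th axis by weight w[j]; w[j] = 0 drops that dimension
    return math.sqrt(sum(wj * (x - y) ** 2 for wj, x, y in zip(w, a, b)))

# with weights (1, 0) the second feature is eliminated entirely
print(weighted_euclidean((1, 100), (1, 0), (1, 0)))  # prints 0.0
```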
Similarity measure
• Euclidean distance: $d(\vec{x}, \vec{y}) = \sqrt{\sum_j (x_j - y_j)^2}$
• Weighted Euclidean distance: $d(\vec{x}, \vec{y}) = \sqrt{\sum_j w_j (x_j - y_j)^2}$
• Similarity measure: cosine, $\cos(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{\|\vec{x}\| \, \|\vec{y}\|}$
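The cosine similarity measure can be sketched directly from its definition:

```python
import math

def cosine_sim(a, b):
    # cos(a, b) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_sim((1, 0), (1, 1)))  # about 0.7071
```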
Voting
• Majority voting:
  $c^* = \arg\max_c \sum_i \delta(c, f_i(\vec{x}))$
• Weighted voting: weighting is on each neighbor:
  $c^* = \arg\max_c \sum_i w_i \, \delta(c, f_i(\vec{x}))$, where $w_i = 1/dist(\vec{x}, \vec{x}_i)$
• With weighted voting, we can use all the training examples.
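Distance-weighted voting with wi = 1/dist(x, xi) can be sketched as follows (names are illustrative):

```python
from collections import defaultdict

def weighted_vote(neighbors):
    """neighbors: list of (distance, label) pairs.
    Each neighbor's vote is weighted by 1/distance, so distant
    neighbors contribute little and all examples can be used."""
    scores = defaultdict(float)
    for dist, label in neighbors:
        scores[label] += 1.0 / dist
    return max(scores, key=scores.get)

# one very close "A" neighbor outweighs two distant "B" neighbors
print(weighted_vote([(0.1, "A"), (2.0, "B"), (2.5, "B")]))  # prints A
```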
Summary of kNN
• Strengths:
  – Simplicity (conceptual)
  – Efficiency at training time: no training
  – Handles multi-class problems
  – Stability and robustness: averaging over k neighbors
  – Prediction accuracy: good when the training data is large
• Weaknesses:
  – Efficiency at testing time: need to calculate all distances
  – Theoretical validity
  – It is not clear which distance measure and features to use.
Rocchio Algorithm
Relevance Feedback for IR
• The issue: “plane” vs. “aircraft”
• Take advantage of user feedback on the relevance of documents to improve IR results:
  – The user issues a short, simple query.
  – The user marks returned documents as relevant or non-relevant.
  – The system computes a better representation of the information need based on the feedback.
  – Relevance feedback can go through one or more iterations.
• Idea: it may be difficult to formulate a good query when you don’t know the collection well, so iterate.
Rocchio Algorithm
• The Rocchio algorithm incorporates relevance feedback information into the vector space model.
• We want to maximize $sim(Q, C_r) - sim(Q, C_{nr})$.
• The optimal query vector for separating relevant and non-relevant documents (with cosine similarity):
  $\vec{Q}_{opt} = \frac{1}{|C_r|}\sum_{\vec{d}_j \in C_r}\vec{d}_j \;-\; \frac{1}{N-|C_r|}\sum_{\vec{d}_j \notin C_r}\vec{d}_j$
• $Q_{opt}$ = optimal query; $C_r$ = set of relevant doc vectors; $N$ = collection size.
Rocchio 1971 Algorithm (SMART)
$\vec{q}_m = \alpha\,\vec{q}_0 + \frac{\beta}{|D_r|}\sum_{\vec{d}_j \in D_r}\vec{d}_j \;-\; \frac{\gamma}{|D_{nr}|}\sum_{\vec{d}_j \in D_{nr}}\vec{d}_j$
$q_m$ = modified query vector; $q_0$ = original query vector; $\alpha, \beta, \gamma$: weights; $D_r$ = set of known relevant doc vectors; $D_{nr}$ = set of known non-relevant doc vectors.
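The Rocchio feedback update can be sketched as follows; the default weights here (alpha = 1.0, beta = 0.75, gamma = 0.15) are common textbook choices, not values from these slides, and the function name is illustrative:

```python
def rocchio_update(q0, rel, nonrel, alpha=1.0, beta=0.75, gamma=0.15):
    """q0: original query vector; rel/nonrel: lists of document
    vectors (all the same length as q0).
    Returns the modified query: move toward the centroid of the
    relevant docs and away from the centroid of the non-relevant docs."""
    n = len(q0)
    qm = [alpha * q0[i] for i in range(n)]
    for i in range(n):
        if rel:
            qm[i] += beta * sum(d[i] for d in rel) / len(rel)
        if nonrel:
            qm[i] -= gamma * sum(d[i] for d in nonrel) / len(nonrel)
    return qm
```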
Relevance feedback assumptions
Relevance prototypes are “well-behaved”:
• Term distribution in relevant documents will be similar.
• Term distribution in non-relevant documents will be different from that in relevant documents:
  – Either: all relevant documents are tightly clustered around a single prototype.
  – Or: there are different prototypes, but they have significant vocabulary overlap.
  – Similarities between relevant and non-relevant documents are small.
Rocchio Algorithm for text classification
• Training time: construct a set of prototype vectors, one vector per class.
• Testing time: for a new document, find the most similar prototype vector.
Training time
$C_j$: the set of positive examples for class j.
$D$: the set of positive and negative examples.
$\vec{c}_j = \frac{\alpha}{|C_j|}\sum_{\vec{d} \in C_j}\vec{d} \;-\; \frac{\beta}{|D - C_j|}\sum_{\vec{d} \in D - C_j}\vec{d}$
$\alpha, \beta$: weights.
Why this formula?
• Rocchio showed that when $\alpha = \beta = 1$, each prototype vector maximizes the average similarity to its positive examples minus the average similarity to its negative examples.
• How does maximizing this quantity connect to classification accuracy?
Testing time
• Given a new document d, compute its similarity to each prototype vector and assign the class of the most similar prototype.
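The training and testing stages can be sketched together; the function names, the alpha = beta = 1 defaults, and the toy data are illustrative:

```python
import math

def cosine(a, b):
    # cosine similarity, guarding against zero vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def train_rocchio(docs_by_class, alpha=1.0, beta=1.0):
    """docs_by_class: {label: [doc vectors]}.
    Build one prototype per class: alpha * centroid of the class's
    documents minus beta * centroid of all other documents."""
    n = len(next(iter(docs_by_class.values()))[0])
    protos = {}
    for label, pos in docs_by_class.items():
        neg = [d for other, docs in docs_by_class.items()
               if other != label for d in docs]
        protos[label] = [
            alpha * sum(d[i] for d in pos) / len(pos)
            - (beta * sum(d[i] for d in neg) / len(neg) if neg else 0.0)
            for i in range(n)
        ]
    return protos

def classify(d, protos):
    # assign the class whose prototype vector is most similar to d
    return max(protos, key=lambda c: cosine(d, protos[c]))

protos = train_rocchio({"A": [(1.0, 0.0), (0.9, 0.1)],
                        "B": [(0.0, 1.0), (0.1, 0.9)]})
print(classify((0.8, 0.2), protos))  # prints A
```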
kNN vs. Rocchio
• kNN:
  – Lazy learning: no training.
  – Uses all the training instances at testing time.
• Rocchio algorithm:
  – At training time, calculate prototype vectors.
  – At testing time, use only the prototype vectors instead of all the training instances.
  – Linear classifier: not as expressive as kNN.
Summary of Rocchio
• Strengths:
  – Simplicity (conceptual)
  – Efficiency at training time
  – Efficiency at testing time
  – Handles multi-class problems
• Weaknesses:
  – Theoretical validity
  – Stability and robustness
  – Prediction accuracy: it does not work well when the categories are not linearly separable.
Additional slides
Three major design choices
• The weight of feature $f_k$ in document $d_i$: e.g., tf-idf
• The document length normalization
• The similarity measure: e.g., cosine
Extending Rocchio?
• Generalized instance set (GIS) algorithm (Lam and Ho, 1998).
• …