K Nearest Neighbor and Rocchio Algorithm
LING 572
Fei Xia
1/11/2007
Announcement
• Hw2 is online now. It is due on Jan 20.
• Hw1 is due at 11pm on Jan 13 (Sat).
• Lab session after the class.
• Read DT Tutorial before next Tues’ class.
K-Nearest Neighbor (kNN)
Instance-based (IB) learning
• No training: store all training instances. “Lazy learning”
• Examples:
  – kNN
  – Locally weighted regression
  – Radial basis functions
  – Case-based reasoning
  – …
• The most well-known IB method: kNN
kNN
• For a new document d:
  – Find the k training documents that are closest to d.
  – Perform majority voting or weighted voting.
• Properties:
  – A “lazy” classifier: no training.
  – Feature selection and the distance measure are crucial.
The algorithm
• Determine the parameter K.
• Calculate the distance between the query instance and all the training instances.
• Sort the distances and determine the K nearest neighbors.
• Gather the labels of the K nearest neighbors.
• Use simple majority voting or weighted voting.
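The steps above can be sketched in Python; the function and variable names (and the toy data) are illustrative, not from the slides:

```python
import math
from collections import Counter

def euclidean(a, b):
    # Euclidean distance between two feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(query, training, k):
    """training: list of (feature_vector, label) pairs."""
    # 1. distance from the query to every training instance
    dists = [(euclidean(query, vec), label) for vec, label in training]
    # 2. sort and keep the k nearest neighbors
    dists.sort(key=lambda pair: pair[0])
    neighbors = dists[:k]
    # 3. simple majority vote over the neighbors' labels
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

training = [((1.0, 1.0), "A"), ((1.2, 0.9), "A"),
            ((5.0, 5.0), "B"), ((5.1, 4.8), "B")]
print(knn_classify((1.1, 1.0), training, k=3))  # prints A
```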
Picking K
• Use N-fold cross validation: pick the one that minimizes cross validation error.
Normalizing attribute values
• Distance can be dominated by attributes with large ranges:
  – Ex: features: age, income
  – Original data: x1 = (35, 76K), x2 = (36, 80K), x3 = (70, 79K)
  – Assume: age ∈ [0, 100], income ∈ [0, 200K]
  – After normalization: x1 = (0.35, 0.38), x2 = (0.36, 0.40), x3 = (0.70, 0.395)
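A minimal sketch of this min-max normalization, using the assumed attribute ranges; the function name is illustrative:

```python
def min_max_normalize(data, ranges):
    # scale each attribute into [0, 1] using its assumed (min, max) range
    return [tuple((v - lo) / (hi - lo) for v, (lo, hi) in zip(row, ranges))
            for row in data]

data = [(35, 76000), (36, 80000), (70, 79000)]
ranges = [(0, 100), (0, 200000)]
normalized = min_max_normalize(data, ranges)
# normalized values: (0.35, 0.38), (0.36, 0.40), (0.70, 0.395)
```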
The Choice of Features
• Imagine there are 100 features, and only 2 of them are relevant to the target label.
• kNN is easily misled in high-dimensional space.
• Remedy: feature weighting or feature selection.
Feature weighting
• Stretch the j-th axis by weight wj.
• Use cross-validation to automatically choose the weights w1, …, wn.
• Setting wj to zero eliminates that dimension altogether.
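Stretching each axis by its weight amounts to a weighted distance; a sketch (the function name is illustrative):

```python
import math

def weighted_euclidean(a, b, w):
    # stretch the j-th axis by weight w[j]; w[j] = 0 drops that dimension
    return math.sqrt(sum(wj * (x - y) ** 2 for wj, x, y in zip(w, a, b)))

# with weights (1, 0) the second feature is eliminated entirely
print(weighted_euclidean((1, 100), (1, 0), (1, 0)))  # prints 0.0
```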
Similarity measure
• Euclidean distance: $d(\vec{x}, \vec{y}) = \sqrt{\sum_j (x_j - y_j)^2}$
• Weighted Euclidean distance: $d(\vec{x}, \vec{y}) = \sqrt{\sum_j w_j (x_j - y_j)^2}$
• Similarity measure: cosine, $\cos(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{\|\vec{x}\| \, \|\vec{y}\|}$
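The cosine similarity measure can be sketched directly from its definition:

```python
import math

def cosine_sim(a, b):
    # cos(a, b) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_sim((1, 0), (1, 1)))  # about 0.7071
```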
Voting
• Majority voting:
  $c^* = \arg\max_c \sum_i \delta(c, f_i(\vec{x}))$
• Weighted voting: weighting is on each neighbor:
  $c^* = \arg\max_c \sum_i w_i \, \delta(c, f_i(\vec{x}))$, where $w_i = 1/dist(\vec{x}, \vec{x}_i)$
• With weighted voting, we can use all the training examples.
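Distance-weighted voting with wi = 1/dist(x, xi) can be sketched as follows (names are illustrative):

```python
from collections import defaultdict

def weighted_vote(neighbors):
    """neighbors: list of (distance, label) pairs.
    Each neighbor's vote is weighted by 1/distance, so distant
    neighbors contribute little and all examples can be used."""
    scores = defaultdict(float)
    for dist, label in neighbors:
        scores[label] += 1.0 / dist
    return max(scores, key=scores.get)

# one very close "A" neighbor outweighs two distant "B" neighbors
print(weighted_vote([(0.1, "A"), (2.0, "B"), (2.5, "B")]))  # prints A
```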
Summary of kNN
• Strengths:
  – Simplicity (conceptual)
  – Efficiency at training time: no training
  – Handles multi-class problems
  – Stability and robustness: averaging over k neighbors
  – Prediction accuracy: good when the training data is large
• Weaknesses:
  – Efficiency at testing time: need to calculate all distances
  – Theoretical validity
  – It is not clear which distance measure and features to use.
Rocchio Algorithm
Relevance Feedback for IR
• The issue: “plane” vs. “aircraft”
• Take advantage of user feedback on the relevance of documents to improve IR results:
  – The user issues a short, simple query.
  – The user marks returned documents as relevant or non-relevant.
  – The system computes a better representation of the information need based on the feedback.
  – Relevance feedback can go through one or more iterations.
• Idea: it may be difficult to formulate a good query when you don’t know the collection well, so iterate.
Rocchio Algorithm
• The Rocchio algorithm incorporates relevance feedback information into the vector space model.
• We want to maximize $sim(Q, C_r) - sim(Q, C_{nr})$.
• The optimal query vector for separating relevant and non-relevant documents (with cosine similarity):
  $\vec{Q}_{opt} = \frac{1}{|C_r|}\sum_{\vec{d}_j \in C_r}\vec{d}_j \;-\; \frac{1}{N-|C_r|}\sum_{\vec{d}_j \notin C_r}\vec{d}_j$
• $Q_{opt}$ = optimal query; $C_r$ = set of relevant doc vectors; $N$ = collection size.
Rocchio 1971 Algorithm (SMART)
$\vec{q}_m = \alpha\,\vec{q}_0 + \frac{\beta}{|D_r|}\sum_{\vec{d}_j \in D_r}\vec{d}_j \;-\; \frac{\gamma}{|D_{nr}|}\sum_{\vec{d}_j \in D_{nr}}\vec{d}_j$
$q_m$ = modified query vector; $q_0$ = original query vector; $\alpha, \beta, \gamma$: weights; $D_r$ = set of known relevant doc vectors; $D_{nr}$ = set of known non-relevant doc vectors.
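The Rocchio feedback update can be sketched as follows; the default weights here (alpha = 1.0, beta = 0.75, gamma = 0.15) are common textbook choices, not values from these slides, and the function name is illustrative:

```python
def rocchio_update(q0, rel, nonrel, alpha=1.0, beta=0.75, gamma=0.15):
    """q0: original query vector; rel/nonrel: lists of document
    vectors (all the same length as q0).
    Returns the modified query: move toward the centroid of the
    relevant docs and away from the centroid of the non-relevant docs."""
    n = len(q0)
    qm = [alpha * q0[i] for i in range(n)]
    for i in range(n):
        if rel:
            qm[i] += beta * sum(d[i] for d in rel) / len(rel)
        if nonrel:
            qm[i] -= gamma * sum(d[i] for d in nonrel) / len(nonrel)
    return qm
```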
Relevance feedback assumptions
Relevance prototypes are “well-behaved”:
• Term distribution in relevant documents will be similar.
• Term distribution in non-relevant documents will be different from that in relevant documents:
  – Either: all relevant documents are tightly clustered around a single prototype.
  – Or: there are different prototypes, but they have significant vocabulary overlap.
  – Similarities between relevant and non-relevant documents are small.
Rocchio Algorithm for text classification
• Training time: construct a set of prototype vectors, one vector per class.
• Testing time: for a new document, find the most similar prototype vector.
Training time
$C_j$: the set of positive examples for class j.
$D$: the set of positive and negative examples.
$\vec{c}_j = \frac{\alpha}{|C_j|}\sum_{\vec{d} \in C_j}\vec{d} \;-\; \frac{\beta}{|D - C_j|}\sum_{\vec{d} \in D - C_j}\vec{d}$
$\alpha, \beta$: weights.
Why this formula?
• Rocchio showed that when $\alpha = \beta = 1$, each prototype vector maximizes the average similarity to its positive examples minus the average similarity to its negative examples.
• How does maximizing this quantity connect to classification accuracy?
Testing time
• Given a new document d, compute its similarity to each prototype vector and assign the class of the most similar prototype.
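The training and testing stages can be sketched together; the function names, the alpha = beta = 1 defaults, and the toy data are illustrative:

```python
import math

def cosine(a, b):
    # cosine similarity, guarding against zero vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def train_rocchio(docs_by_class, alpha=1.0, beta=1.0):
    """docs_by_class: {label: [doc vectors]}.
    Build one prototype per class: alpha * centroid of the class's
    documents minus beta * centroid of all other documents."""
    n = len(next(iter(docs_by_class.values()))[0])
    protos = {}
    for label, pos in docs_by_class.items():
        neg = [d for other, docs in docs_by_class.items()
               if other != label for d in docs]
        protos[label] = [
            alpha * sum(d[i] for d in pos) / len(pos)
            - (beta * sum(d[i] for d in neg) / len(neg) if neg else 0.0)
            for i in range(n)
        ]
    return protos

def classify(d, protos):
    # assign the class whose prototype vector is most similar to d
    return max(protos, key=lambda c: cosine(d, protos[c]))

protos = train_rocchio({"A": [(1.0, 0.0), (0.9, 0.1)],
                        "B": [(0.0, 1.0), (0.1, 0.9)]})
print(classify((0.8, 0.2), protos))  # prints A
```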
kNN vs. Rocchio
• kNN:
  – Lazy learning: no training.
  – Uses all the training instances at testing time.
• Rocchio algorithm:
  – At training time, calculate prototype vectors.
  – At testing time, use only the prototype vectors instead of all the training instances.
  – Linear classifier: not as expressive as kNN.
Summary of Rocchio
• Strengths:
  – Simplicity (conceptual)
  – Efficiency at training time
  – Efficiency at testing time
  – Handles multi-class problems
• Weaknesses:
  – Theoretical validity
  – Stability and robustness
  – Prediction accuracy: it does not work well when the categories are not linearly separable.
Additional slides
Three major design choices
• The weight of feature $f_k$ in document $d_i$: e.g., tf-idf
• The document length normalization
• The similarity measure: e.g., cosine
Extending Rocchio?
• Generalized instance set (GIS) algorithm (Lam and Ho, 1998).
• …