NewsWeeder: Learning to Filter Netnews By: Ken Lang Presented by Salah Omer


TRANSCRIPT

Page 1:

NewsWeeder: Learning to Filter Netnews

By: Ken Lang

Presented by Salah Omer

Page 2:

Introduction
Theoretical Framework
Approach
Analysis & Evaluation

Page 3:

Introduction:

NewsWeeder is a Netnews filtering system that addresses the classification problem by letting the user rate his or her interest level for each article as it is read.

The goal of an information filtering system is to sort through large volumes of information and present to the user those documents that are likely to satisfy his or her information requirement.

Page 4:

Introduction:

Several learning techniques are used in information filtering:

Rule-based user-driven mode.
Explicit-learning mode.
Implicit-learning mode.

Page 5:

Introduction:

Rule-based user-driven mode: This mode requires direct and explicit user input to establish and maintain the user model. SIFT (Yan & Garcia-Molina, 1995), Lens (Malone et al., 1987), and Infoscope (Fischer & Stevens, 1991) are examples of systems that use this approach. Some research systems have extended it with complex rule-based keyword-matching profiles; however, average users are unlikely to take the trouble to follow these complex systems.

Page 6:

Introduction:

Explicit-learning mode: This mode creates and updates the user model by eliciting explicit user feedback. Ratings on presented items, provided by users on one or more ordinal or qualitative scales, are often employed to directly indicate users’ preferences.

Page 7:

Introduction:

Implicit-learning mode: This mode aims to acquire users’ preferences with minimal or no additional effort from them. Kass and Finin (1988) defined the implicit mode as “observing the behavior of the user and inferring facts about the user from the observed behavior”.

Page 8:

Theoretical Framework

TF-IDF
MDL (Minimum Description Length)

Page 9:

Theoretical Framework: TF-IDF

NewsWeeder uses the information retrieval technique of term frequency–inverse document frequency (TF-IDF).

The technique, as we learned in class, is based on two empirical observations.

Page 10:

Theoretical Framework: TF-IDF

First, the more times a token appears in a document, i.e. the higher its term frequency (TF), the more likely the term is relevant to that document.

Second, the more documents in the collection a term appears in, the less it discriminates between documents; this is captured by the inverse document frequency (IDF).

Page 11:

Theoretical Framework: Minimum Description Length (MDL)

MDL is based on the following insight: any regularity in the data can be used to compress the data, i.e. to describe it using fewer symbols than the number of symbols needed to describe the data literally.

The more we are able to compress the data, the more we have learned about the data.

Page 12:

Theoretical Framework: Minimum Description Length (MDL)

MDL provides a framework for balancing the tradeoff between model complexity and training error.

In the NewsWeeder domain, the tradeoff involves the importance of each token: deciding which tokens to keep and which to drop.

Page 13:

Theoretical Framework: Minimum Description Length (MDL)

MDL is based on Bayes' rule:

P(H|D) = P(D|H)P(H) / P(D)

To find the hypothesis H that maximizes P(H|D), we need to maximize P(D|H)P(H).

Page 14:

Theoretical Framework: Minimum Description Length (MDL)

P(H|D) = P(D|H)P(H) / P(D)

Equivalently, we can minimize

-log(P(D|H)P(H))

which is equal to

-log(P(D|H)) - log(P(H))

Page 15:

Theoretical Framework: Minimum Description Length (MDL)

The MDL interpretation of

-log(P(D|H)) - log(P(H)),

following Shannon's information theory, is to find the hypothesis that minimizes the total encoding length: the bits needed to encode the data under the hypothesis plus the bits needed to encode the hypothesis itself.
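
As a worked illustration of this idea (not from the paper; the probabilities below are invented), the total description length of two hypothetical models can be compared directly in bits:

```python
import math

def description_length(p_data_given_h: float, p_h: float) -> float:
    """Total encoding length in bits: -log2 P(D|H) - log2 P(H) (Shannon code lengths)."""
    return -math.log2(p_data_given_h) - math.log2(p_h)

# Invented numbers: the complex model fits the data better (higher P(D|H))
# but is itself less probable a priori (lower P(H)), so it costs more bits to encode.
simple_model = description_length(p_data_given_h=0.010, p_h=0.200)
complex_model = description_length(p_data_given_h=0.030, p_h=0.002)

print(f"simple:  {simple_model:.2f} bits")   # ~8.97 bits
print(f"complex: {complex_model:.2f} bits")  # ~14.03 bits
# MDL prefers the hypothesis with the smaller total encoding length.
```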

Page 16:

Approach:

Representation: First, raw text is parsed into tokens. For NewsWeeder, tokens are kept at the word level.

Second, a vector of token counts is created for each document, using the bag-of-words approach and keeping words in their unstemmed form.
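
A minimal sketch of this representation step (my own illustration, not the paper's code): word-level tokenization with no stemming, followed by bag-of-words counts.

```python
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    """Split raw text into lowercase word-level tokens; no stemming is applied."""
    return re.findall(r"[a-z0-9']+", text.lower())

def bag_of_words(text: str) -> Counter:
    """Vector of token counts for one document."""
    return Counter(tokenize(text))

doc = "Filtering netnews: the system filters articles, and filtering is learned."
print(bag_of_words(doc))
# 'filters' and 'filtering' stay distinct because words are left unstemmed.
```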

Page 17:

Approach:

Learning: TF-IDF

NewsWeeder uses the weight derived from the product of tf and idf, as expressed in the formula

w(t,d) = tf_{t,d} · log(|N| / df_t)

The logarithm is used to normalize large values.
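
A hedged sketch of this weighting (illustrative only; the paper's exact token counting and normalization details may differ):

```python
import math
from collections import Counter

def tfidf_weights(docs: list[list[str]]) -> list[dict[str, float]]:
    """w(t,d) = tf_{t,d} * log(|N| / df_t) for each token t in each document d."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))  # document frequency
    weighted = []
    for doc in docs:
        tf = Counter(doc)  # term frequency within this document
        weighted.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weighted

docs = [["netnews", "filter", "filter"], ["netnews", "rating"], ["rating", "user"]]
for w in tfidf_weights(docs):
    print(w)  # a token present in every document would get weight log(1) = 0
```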

Page 18:

Approach:

Learning: TF-IDF

In NewsWeeder, the documents in each rating category are converted into TF-IDF vectors.

Page 19:

Approach:

Learning: TF-IDF

To classify a new document, it is first compared to each rating category's prototype vector and given a predicted rating based on the cosine similarity to each category.

Second, the output of the categorization procedure is converted to a continuous value using linear regression.
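
A minimal sketch of the prototype and cosine-similarity step (my illustration; prototype construction is simplified):

```python
import math

def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine similarity between two sparse TF-IDF vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def prototype(vectors: list[dict[str, float]]) -> dict[str, float]:
    """Average the TF-IDF vectors of one rating category into a prototype."""
    proto: dict[str, float] = {}
    for vec in vectors:
        for t, w in vec.items():
            proto[t] = proto.get(t, 0.0) + w / len(vectors)
    return proto

# The similarity of a new document to each category's prototype then feeds
# the linear-regression step described later.
```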

Page 20:

Approach:

Learning: MDL

First, perform the categorization step, then convert the categorization similarities into a continuous rating prediction using

argmax_{c_i} { p(c_i | T_d, l_d, D_train) } = argmin_{c_i} { -log(p(T_d | c_i, l_d, D_train)) - log(p(c_i | l_d, D_train)) }

where d is a document, T_d its token vector, l_d its length, and D_train the training data.

The most probable category c_i for d is the one that minimizes the bits needed to encode T_d plus the bits needed to encode c_i.

Page 21:

Approach:

Learning: MDL

Second, apply the probabilistic model to compute the similarity for each training document.

The probability of the data in a document, given its length and category, is the product of the individual token probabilities:

p(T_d | c_i, l_d, D_train) = ∏_{t∈d} p(t | c_i, l_d, D_train)
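
A hedged sketch of turning these token probabilities into an encoding length in bits, and hence a category choice (the probability model itself is a stand-in callable here):

```python
import math

def encoding_bits(token_probs: list[float]) -> float:
    """Bits to encode a document's tokens: -log2 of the product = sum of -log2 p."""
    return sum(-math.log2(p) for p in token_probs)

def best_category(doc_tokens, categories, prob):
    """Pick the category whose model encodes the document in the fewest bits.

    `prob(t, c)` is assumed to return p(t | c, document length, training data).
    """
    return min(categories,
               key=lambda c: encoding_bits([prob(t, c) for t in doc_tokens]))
```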

Page 22:

Approach:

Learning: MDL

To derive the probability estimate for t_{i,d}:

Compute the overall count of token i across the N documents: t_i = ∑_{j∈N} t_{i,j}

Compute the correlation r_{i,l} between t_{i,d} and the document length l_d.

Combine the token probability distribution that is independent of the document length with the one that depends on the document length, weighted by r_{i,l}.
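
A hedged sketch of this interpolation (my reading of the slide; the paper's exact mixing formula may differ):

```python
def token_probability(p_indep: float, p_len_dep: float, r: float) -> float:
    """Blend a length-independent estimate with a length-dependent one.

    `r` is the correlation between the token's count and document length;
    a stronger correlation shifts weight toward the length-dependent model.
    This particular mixing rule is an assumption, not the paper's formula.
    """
    r = abs(r)
    return r * p_len_dep + (1.0 - r) * p_indep
```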

Page 23:

Approach:

Learning: MDL

Each token is hypothesized either to have a specialized distribution for a category or to be unrelated to it. MDL chooses the category-specific hypothesis if the total bits saved by using it,

∑_{d∈N_ck} [-log(p(t_{i,d} | l_d))] - [-log(p(t_{i,d} | l_d, c_k))],

is greater than the complexity cost of including the extra parameters.
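
A minimal sketch of that model-selection rule (illustrative; the parameter cost is a stand-in constant):

```python
import math

def use_category_model(samples, p_generic, p_specific, param_cost_bits: float) -> bool:
    """Adopt the category-specific distribution only if the bits it saves on the
    category's training documents exceed the cost of encoding its extra parameters.

    `samples` is assumed to be (token count, document length) pairs for one
    token over one category's documents; `p_generic` and `p_specific` are the
    length-conditioned probability models being compared.
    """
    bits_saved = sum(-math.log2(p_generic(t, l)) + math.log2(p_specific(t, l))
                     for t, l in samples)
    return bits_saved > param_cost_bits
```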

Page 24:

Learning Algorithm Summary for TF-IDF and MDL

For the two approaches:

1. Divide the articles into training and test (unseen) sets.

2. Parse the training articles, throwing out tokens occurring less than three times total.

3. Compute t_i and r_{i,l} for each token.

Page 25:

Learning Algorithm Summary for TF-IDF and MDL

For TF-IDF

1. Throw out the M most frequent tokens over the entire training set.

2. Compute the term weights, normalize the weight vector for each article, and find the average of the vectors for each rating category.

3. Compute the similarity of each training document to each rating category prototype using the cosine similarity metric.

Page 26:

Learning Algorithm Summary for TF-IDF and MDL

For MDL:

1. Decide for each token t and category c whether to use p(t|l,c) = p(t|l) or a category-dependent model for when t occurs in c. Then pre-compute, for documents in each category, the encoding lengths for the case where no tokens occur.

2. Compute the similarity of each training document to each rating category by taking the inverse of the number of bits needed to encode T_d under the category's probabilistic model.

Page 27:

Learning Algorithm Summary for TF-IDF and MDL

For the two approaches:

1. Using the similarity measurements computed on the training data (step 3 of the TF-IDF procedure or step 2 of the MDL procedure), compute a linear regression from rating-category similarities to continuous rating predictions.

2. Apply the resulting models and similarity measurements to the test articles; a sketch of the regression step follows below.
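
A hedged sketch of the regression step (the feature layout and all numbers are my own illustration): ordinary least squares from per-category similarity scores to the user's continuous ratings.

```python
import numpy as np

# Rows: training articles; columns: similarity to each rating-category prototype.
# Targets: the user's numeric ratings. All values here are made up.
X = np.array([[0.80, 0.10, 0.05],
              [0.20, 0.70, 0.10],
              [0.05, 0.15, 0.75],
              [0.60, 0.30, 0.10]])
y = np.array([1.0, 3.0, 5.0, 2.0])

# Fit weights and bias minimizing ||Xw + b - y||^2 via least squares
# on a bias-augmented design matrix.
A = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

new_article = np.array([0.55, 0.35, 0.10, 1.0])  # similarities plus bias term
print(float(new_article @ coef))  # continuous predicted rating
```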

Page 28:

Result Evaluation:

Two methods are used to evaluate the performance of the results:

1. The precision metric (the ratio of relevant documents retrieved to all documents retrieved).

2. A confusion matrix of the errors generated by the text classifier (each column represents the instances in a predicted class, while each row represents the instances in an actual class).
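
A minimal sketch of both evaluation measures (my illustration):

```python
from collections import Counter

def precision(retrieved: list[str], relevant: set[str]) -> float:
    """Ratio of relevant documents retrieved to all documents retrieved."""
    return sum(doc in relevant for doc in retrieved) / len(retrieved)

def confusion_matrix(actual: list[int], predicted: list[int]) -> Counter:
    """Error counts keyed by (actual class, predicted class) pairs."""
    return Counter(zip(actual, predicted))

print(precision(["a", "b", "c", "d"], {"a", "c"}))   # 0.5
print(confusion_matrix([1, 2, 2, 3], [1, 2, 3, 3]))  # diagonal keys are correct calls
```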

Page 29:

Data

The “interesting” label indicates the article is rated 2 or better.

Rating | Label       | Intended Use
1      | Essential   | For articles not to be missed if at all possible
2      | Interesting | For articles of definite interest
3      | Borderline  | For articles the user is uncertain about his interest in and would rather not make a commitment on
4      | Boring      | For articles that are not interesting
5      | Gong        | For articles the user wants to heavily weight against seeing again, perhaps because they are so clearly irritating to have in the list
Skip   | Skip        | For articles the user does not even want to read (note that this category may cover several of the above ratings, as they can only be used if the user actually requests to see the article)

Table 1: Rating Labels

Page 30:

Data

Table 2 summarizes the data collected over one year. The NewsWeeder experiment used 40 individuals over that year. Since the model assumes a stable distribution pool and has no temporal dependence, user interests that lasted less than the one-year period will add some amount of error to the performance.

User B rated 16% of the articles as interesting (a rating of 1 or 2); it is possible for a considerably smaller percentage of interesting articles to be in the newsgroups read by User A.

Page 31:

Table 2: Article/Rating Data for Two Users

Rating                     | User A     | User B
1                          | 27 (1%)    | 29 (3%)
2                          | 475 (11%)  | 143 (14%)
3                          | 854 (20%)  | 67 (6%)
4                          | 935 (22%)  | 56 (5%)
5                          | 57 (1%)    | 17 (2%)
Skip                       | 1995 (46%) | 732 (70%)
Total                      | 4343       | 1044
Total Interesting (1 or 2) | 502 (12%)  | 172 (16%)

Page 32:

TF-IDF Performance Analysis:

Graph 1 shows the effect of removing the top N most frequent words on precision in the top 10% of articles by predicted rating.

Five trials were run with an 80/20 training/test split for each trial, removing from 100 to 400 words; the best precision, 43%, was reached at 300 words removed.

Page 33:

[Graph 1: precision in the top 10% vs. number of most frequent words removed]

Page 34:

MDL Performance Analysis:

Graph 2 shows that MDL outperforms TF-IDF for both User A and User B. Performance is measured as the percentage of the interesting articles found in the top 10% of articles with the highest predicted rating.

MDL reaches a precision of 44% for User A and 59% for User B, compared to TF-IDF (37% for A and 49% for B).

Page 35:

[Graph 2: percentage of interesting articles found in the top 10%, MDL vs. TF-IDF, Users A and B]

Page 36:

Table 3 shows the confusion matrix for the MDL categorization of articles for User A in a single trial.