Text Categorization Karl Rees Ling 580 April 2, 2001



Page 1: Text Categorization

Text Categorization

Karl Rees

Ling 580

April 2, 2001

Page 2: Text Categorization

What is Text Categorization?

Classify the topic or theme of a document by categories/classes based on its content.

Humans can do this intuitively, but how do you teach a machine to classify a text?

Page 3: Text Categorization

Useful Applications

Filter a stream of news for a particular interest group.

Spam vs. Interesting mail.

Determining Authorship.

Poetry vs. Fiction vs. Essay, etc…

Page 4: Text Categorization

So, For Example:

Ch. 16 of Foundations of Statistical Natural Language Processing.

We want to create an agent that can give us the probability of this document belonging to certain categories:

P(Poetry) = .015

P(Mathematics) = .198

P(Artificial Intelligence) = .732

P(Utterly Confusing) = .989

Page 5: Text Categorization

Training Sets

Corpus of documents for which we already know the category.

Essentially, we use this to teach a computer. It is like showing a child a few pictures of a dog and a few pictures of a cat, then pointing to your neighbor's pet and asking her/him what kind of animal it is.

Page 6: Text Categorization

Data Representation Model

Represent each object (document) in the training set in the form (x, c), where x is a vector of measurements and c is the class label.

In other words, each document is represented as a vector of potentially weighted word counts.
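As a minimal sketch of this representation, the Python below turns a text into a vector of raw word counts over a fixed vocabulary and pairs it with a class label. The function name `to_vector`, the three-word vocabulary, and the sample sentence are illustrative assumptions, not part of the slides; weighting of the counts is left out.

```python
from collections import Counter


def to_vector(text, vocabulary):
    """Return the word-count vector x for `text` over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


# Hypothetical vocabulary and one training pair (x, c).
vocabulary = ["profit", "loss", "poem"]
x = to_vector("Profit up as loss narrows and profit margins grow", vocabulary)
c = "earnings"
print((x, c))
```

Words outside the vocabulary are simply ignored, which is one way to keep the vectors short.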

Page 7: Text Categorization

Training Procedure

A procedure/function that selects a classifier from a family of classifiers (the model class). Typically the task is binary, with two classes: c1 and c2, where c2 is NOT(c1).

For example:

g(x) = w · x + w0

x is the vector of word counts, w · x is the dot product of x and a vector of weights w (because we may attach more importance to certain words), and w0 is some threshold.

Choose c1 for g(x) > 0, otherwise c2.
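The decision rule can be sketched in a few lines of Python. The weights and threshold below are made-up values for illustration only; in practice the training procedure would choose them.

```python
def g(x, w, w0):
    """Linear score: dot product of weights and word counts, plus the threshold term."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0


def classify(x, w, w0):
    """Choose c1 when g(x) > 0, otherwise c2."""
    return "c1" if g(x, w, w0) > 0 else "c2"


# Hypothetical weights for a three-word vocabulary, and a hypothetical threshold.
w = [1.0, 0.5, -2.0]
w0 = -1.5

print(classify([2, 1, 0], w, w0))
print(classify([0, 0, 1], w, w0))
```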

Page 8: Text Categorization

Test Set

After training the classifier, we want to test its accuracy on a test set.

Accuracy = Number of Objects Correctly Classified / Number of Objects Examined.

Precision = Number of Objects Correctly Assigned to a Specific Category / Number of Objects Assigned to that Category.

Fallout = Number of Objects Incorrectly Assigned to a Category / Number of Objects NOT Belonging to that Category.
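A small sketch of the three measures for a single category, assuming predictions and true labels are given as parallel lists of booleans (True means "in the category"). The function name `evaluate` and the toy data are assumptions for illustration.

```python
def evaluate(predicted, actual):
    """Return (accuracy, precision, fallout) for one category."""
    pairs = list(zip(predicted, actual))
    tp = sum(p and a for p, a in pairs)              # correctly assigned
    fp = sum(p and not a for p, a in pairs)          # incorrectly assigned
    tn = sum(not p and not a for p, a in pairs)      # correctly left out
    accuracy = (tp + tn) / len(pairs)
    # Ratios are undefined when their denominator is zero; report None then.
    precision = tp / (tp + fp) if tp + fp else None
    fallout = fp / (fp + tn) if fp + tn else None
    return accuracy, precision, fallout


predicted = [True, True, True, False, False, False, False, False]
actual = [True, True, False, True, False, False, False, False]
accuracy, precision, fallout = evaluate(predicted, actual)
print(accuracy, precision, fallout)
```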

Page 9: Text Categorization

Modeling

How should texts be represented?

Using all words leads to sparse statistics. Some words are indicative of a label.

One approach:

For each label, collect all words in texts with that label.

Apply a mean square error test to determine whether a word occurs by chance in the texts.

Sort all words by the mean square error test and take the top n (say 20).

The idea is to select words that are correlated with a label. Example: for the label earnings, words such as "profit."
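The selection step can be sketched as follows. Note the swap: the slide names a mean square error test, while this sketch uses the chi-square statistic on a 2x2 word/label contingency table, a standard test for whether a word and a label co-occur by chance. The function names and the four toy documents are assumptions for illustration.

```python
def chi_square(docs, labels, word, label):
    """Chi-square statistic for the 2x2 table (word present?) x (label matches?)."""
    o11 = sum(word in d and c == label for d, c in zip(docs, labels))
    o12 = sum(word in d and c != label for d, c in zip(docs, labels))
    o21 = sum(word not in d and c == label for d, c in zip(docs, labels))
    o22 = sum(word not in d and c != label for d, c in zip(docs, labels))
    n = o11 + o12 + o21 + o22
    denom = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return n * (o11 * o22 - o12 * o21) ** 2 / denom if denom else 0.0


def top_words(docs, labels, label, n=20):
    """Sort the vocabulary by the statistic and keep the top n words."""
    vocab = sorted(set().union(*docs))
    ranked = sorted(vocab, key=lambda w: chi_square(docs, labels, w, label), reverse=True)
    return ranked[:n]


# Toy corpus: each document is reduced to its set of words.
docs = [{"profit", "net", "vs"}, {"profit", "dlrs"}, {"poem", "love"}, {"game", "score", "net"}]
labels = ["earn", "earn", "other", "other"]
print(top_words(docs, labels, "earn", n=1))
```

The statistic is symmetric, so it also ranks highly words that are strongly associated with the absence of the label; a real selection step may want to check the direction of the association.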

Page 10: Text Categorization

Reuters Collection

A common dataset in text classification is the Reuters collection:

Articles categorized into about 100 topics. 9603 training examples, 3299 test examples. Short texts, annotated with SGML.

Available: http://www.research.att.com/~lewis/reuters21578.html

Page 11: Text Categorization

<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="5554" NEWID="11">

<DATE>26-FEB-1987 15:18:59.34</DATE>

<TOPICS><D>earn</D></TOPICS>

<TEXT>&#2;

<TITLE>COBANCO INC &lt;CBCO> YEAR NET</TITLE>

<DATELINE> SANTA CRUZ, Calif., Feb 26 -

</DATELINE><BODY>Shr 34 cts vs 1.19 dlrs

Net 807,000 vs 2,858,000

Assets 510.2 mln vs 479.7 mln

Deposits 472.3 mln vs 440.3 mln

Loans 299.2 mln vs 327.2 mln

Note: 4th qtr not available. Year includes 1985

extraordinary gain from tax carry forward of 132,000 dlrs, or

five cts per shr.

Reuter

&#3;</BODY></TEXT></REUTERS>

Page 12: Text Categorization

For Reuters, Label = Earnings:

Format of document vector file

Each entry consists of 25 lines:

1. the document id
2. is the document in the training set (T) or in the evaluation set (E)?
3. is the document in the core training set (C) or in the validation set (V)? (X where this doesn't apply.)
4. is the document in the earnings category (Y) or not (N)?
5. feature weight for "vs"
6. feature weight for "mln"
7. feature weight for "cts"
8. feature weight for ";"
9. feature weight for "&"
10. feature weight for "000"

Page 13: Text Categorization

For Reuters, Label = Earnings

11. feature weight for "loss"
12. feature weight for "'"
13. feature weight for """
14. feature weight for "3"
15. feature weight for "profit"
16. feature weight for "dlrs"
17. feature weight for "1"
18. feature weight for "pct"
19. feature weight for "is"
20. feature weight for "s"
21. feature weight for "that"
22. feature weight for "net"
23. feature weight for "lt"
24. feature weight for "at"
25. semicolon (separator between entries)

Page 14: Text Categorization

Vector For Example Document:

{ docid 11 T C Y 5 5 3 3 3 4 0 0 0 4 0 3 2 0 0 0 0 3 2 0 ; }
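A rough sketch of reading one such entry into named fields. It splits on whitespace, so it does not care whether the 25 fields sit on one line or on separate lines. The function name `parse_entry` and the dictionary keys are assumptions, and feature 13 is taken to be the double-quote character, which is garbled in the list above.

```python
# Feature words in file order (fields 5 to 24 of each entry).
# Field 13 is assumed to be the double-quote character.
FEATURE_WORDS = ["vs", "mln", "cts", ";", "&", "000", "loss", "'", '"', "3",
                 "profit", "dlrs", "1", "pct", "is", "s", "that", "net", "lt", "at"]


def parse_entry(entry):
    """Parse one document vector entry into a dictionary."""
    # Drop the braces, the literal "docid" tag, and the trailing separator.
    tokens = [t for t in entry.split() if t not in {"{", "}", "docid", ";"}]
    doc_id, split, core, label = tokens[:4]
    weights = [int(t) for t in tokens[4:]]
    return {
        "id": int(doc_id),
        "training": split == "T",
        "core": core,                      # "C", "V", or "X"
        "earnings": label == "Y",
        "weights": dict(zip(FEATURE_WORDS, weights)),
    }


doc = parse_entry("{ docid 11 T C Y 5 5 3 3 3 4 0 0 0 4 0 3 2 0 0 0 0 3 2 0 ; }")
print(doc["id"], doc["earnings"], doc["weights"]["vs"])
```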

Page 15: Text Categorization

Classification Techniques

Decision Trees

Maximum Entropy Modeling

Perceptrons (Neural Networks)

K-Nearest Neighbor Classification (kNN)

Naïve Bayes

Support Vector Machines

Page 16: Text Categorization

Decision Trees

Page 17: Text Categorization

Decision Trees

Page 18: Text Categorization

Information

Measure of how much we “know” about an object, document, decision, etc…

At each successive node we have more information about the object’s classification.

Page 19: Text Categorization

Information

p = Number of objects in a set that belong to a certain category.

n = Number of objects in a set that don’t belong to that category.

I = Measure of the amount of information (in bits) that we still need in order to classify an object from the set, given only p and n.

$$I\left(\frac{p}{p+n}, \frac{n}{p+n}\right) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$$
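The quantity I can be computed directly from the two counts. This is a minimal sketch; the function name `information` is an assumption, and a term with a zero count is treated as contributing 0, the usual convention for 0 · log 0.

```python
from math import log2


def information(p, n):
    """I(p/(p+n), n/(p+n)) in bits, for p objects in the category and n outside it."""
    total = p + n
    bits = 0.0
    for count in (p, n):
        if count:  # a zero count contributes nothing
            fraction = count / total
            bits -= fraction * log2(fraction)
    return bits


print(information(5, 5))   # evenly mixed set
print(information(10, 0))  # pure set
print(information(9, 1))   # mostly one category
```

An evenly mixed set gives the largest value (1 bit) and a pure set gives 0.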

Page 20: Text Categorization

Information Gain

The amount of Information we gain from making a decision.

Each decision we make splits the current set into new sets (two for a yes/no decision), each with its own Information value. We should know more about an object's category in these sets than in the previous set, so we build our tree based on Information Gain: the decisions with the highest gain come first.

Page 21: Text Categorization

Information Gain

Gain(A) = I( p/(p+n) , n/(p+n) ) - Remainder(A)

A is the decision (the test made at a node) being evaluated.

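Remainder(A) is the weighted average of the information in the resulting sets i = 1…v, where set i contains p_i objects in the category and n_i objects outside it:

$$\mathrm{Remainder}(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n}\, I\left(\frac{p_i}{p_i + n_i}, \frac{n_i}{p_i + n_i}\right)$$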
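Gain and Remainder can be sketched together. Each candidate decision is described here by the list of (p_i, n_i) counts of the sets it produces; that input shape and the function names are assumptions for illustration.

```python
from math import log2


def information(p, n):
    """I(p/(p+n), n/(p+n)) in bits; zero counts contribute nothing."""
    total = p + n
    return -sum(c / total * log2(c / total) for c in (p, n) if c)


def remainder(subsets):
    """Weighted average of the information in the resulting sets."""
    total = sum(pi + ni for pi, ni in subsets)
    return sum((pi + ni) / total * information(pi, ni) for pi, ni in subsets)


def gain(subsets):
    """Information of the parent set minus the remainder after the decision."""
    p = sum(pi for pi, _ in subsets)
    n = sum(ni for _, ni in subsets)
    return information(p, n) - remainder(subsets)


# Three hypothetical decisions over the same parent set (p = 5, n = 5).
print(gain([(5, 0), (0, 5)]))  # perfect split
print(gain([(4, 0), (1, 5)]))  # useful split
print(gain([(3, 3), (2, 2)]))  # uninformative split
```

When growing the tree, the decision with the largest gain would be placed first.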

Page 22: Text Categorization

Decision Trees

At the bottom of our trees, we have leaf nodes. At each of these nodes, we compute the percentage of objects belonging to the node that fit into the category we are looking at. If it is greater than a certain percentage (say 50%), we say that all documents that fit into this node are in this category. Hopefully, though, the tree will give us more confidence than 50%.

Page 23: Text Categorization

Pruning

After growing a tree, we want to prune it down to a smaller size. We may want to get rid of nodes/decisions that don't contribute any significant information (possibly nodes 6 and 7 in our example). We also want to get rid of decisions that are based on possibly insignificant details. These "overfit" the training set. For example, if there is only one document in the set that has both dlrs and pct, and this is in the earnings category, it would probably be a mistake to assume that all such documents are in earnings.

Page 24: Text Categorization

Decision Trees

Page 25: Text Categorization

Bagging / Boosting

Obviously, there are many different ways to prune. Also, there are many other algorithms besides Information Gain for growing a decision tree.

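Bagging and Boosting both mean generating many decision trees and combining (averaging or voting over) the results of running an object through each of these trees. Bagging trains each tree on a random resample of the training set; boosting trains the trees in sequence, giving more weight to the examples that earlier trees misclassified.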
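A compact sketch of bagging only (not boosting). To keep it short, each "tree" is a decision stump, that is, a one-node tree that tests whether a single word count is above zero; the stump learner, the function names, and the toy data are all assumptions for illustration.

```python
import random


def predict_stump(stump, x):
    """A stump (j, polarity) says 'in category' when (x[j] > 0) equals polarity."""
    j, polarity = stump
    return (x[j] > 0) == polarity


def train_stump(sample):
    """Pick the stump that classifies the most examples in `sample` correctly."""
    candidates = [(j, pol) for j in range(len(sample[0][0])) for pol in (True, False)]
    return max(candidates, key=lambda s: sum(predict_stump(s, x) == c for x, c in sample))


def bagging(data, n_trees=25, seed=0):
    """Train each stump on a bootstrap resample (same size, drawn with replacement)."""
    rng = random.Random(seed)
    return [train_stump([rng.choice(data) for _ in data]) for _ in range(n_trees)]


def predict(stumps, x):
    """Average the votes: fraction of stumps that put x in the category."""
    return sum(predict_stump(s, x) for s in stumps) / len(stumps)


# Toy data: word counts for (profit, dlrs, poem), True = earnings.
data = [([2, 1, 0], True), ([1, 3, 0], True), ([1, 0, 0], True),
        ([0, 0, 2], False), ([0, 1, 1], False), ([0, 0, 1], False)]
stumps = bagging(data)
print(predict(stumps, [3, 0, 0]), predict(stumps, [0, 0, 4]))
```

A score above 0.5 would be read as "in the category".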

Page 26: Text Categorization

Decision Trees

Page 27: Text Categorization

Maximum Entropy Modeling

Consult Slides at http://www.cs.jhu.edu/~hajic/courses/cs465/cs46520/ppframe.htm for more information about Maximum Entropy.

Page 28: Text Categorization

Perceptrons

Page 29: Text Categorization
Page 30: Text Categorization
Page 31: Text Categorization
Page 32: Text Categorization

kNN

Basic idea: Keep training set in memory.

Define a similarity metric.

At classification time, match unseen example against all examples in memory.

Select k best matches.

Predict unseen example label as majority label of the k retrieved examples.

Example similarity metric:
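The slide's own metric is not reproduced in this transcript. The sketch below follows the steps above and assumes cosine similarity between word-count vectors, a common choice that may differ from the one on the slide. Function names and the toy training set are illustrative.

```python
from collections import Counter
from math import sqrt


def cosine(x, y):
    """Cosine similarity between two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def knn_predict(training, x, k=3):
    """Match x against every stored example, keep the k best, return the majority label."""
    best = sorted(training, key=lambda example: cosine(example[0], x), reverse=True)[:k]
    return Counter(label for _, label in best).most_common(1)[0][0]


# The training set is simply kept in memory as (vector, label) pairs.
training = [([5, 3, 0], "earn"), ([4, 0, 1], "earn"), ([3, 2, 0], "earn"),
            ([0, 1, 6], "other"), ([1, 0, 4], "other")]
print(knn_predict(training, [2, 1, 0]))
print(knn_predict(training, [0, 0, 3]))
```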

Page 33: Text Categorization

kNN

Many variants on kNN.

Underlying idea is that abstraction (rules, parameters, etc.) is likely to lose information.

No abstraction, case based reasoning.

Training fast, testing can be slow.

Potentially large memory requirements.

Page 34: Text Categorization

Naïve Bayes

Assumption that features are independent of each other:

Here A is a document consisting of features A1 … An, and l is the document label. Fast training, fast evaluation. Good when features are independent.
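Under this assumption the probability of a document given a label factorizes over its features, presumably along the lines of $P(A \mid l) = \prod_{i=1}^{n} P(A_i \mid l)$ (the slide's own formula is not reproduced in this transcript). Below is a minimal sketch of a multinomial Naïve Bayes classifier over word-count vectors. Add-one smoothing, the function names, and the toy data are assumptions, not taken from the slides.

```python
from math import log


def train_nb(training):
    """Estimate log P(l) and log P(word j | l) from (vector, label) pairs."""
    n_features = len(training[0][0])
    prior, cond = {}, {}
    for label in sorted({c for _, c in training}):
        docs = [x for x, c in training if c == label]
        prior[label] = log(len(docs) / len(training))
        totals = [sum(x[j] for x in docs) for j in range(n_features)]
        denom = sum(totals) + n_features  # add-one smoothing (an assumption)
        cond[label] = [log((t + 1) / denom) for t in totals]
    return prior, cond


def predict_nb(model, x):
    """Return the label maximizing log P(l) + sum_j x_j * log P(word j | l)."""
    prior, cond = model

    def score(label):
        return prior[label] + sum(xj * lp for xj, lp in zip(x, cond[label]))

    return max(prior, key=score)


training = [([5, 3, 0], "earn"), ([4, 0, 1], "earn"), ([3, 2, 0], "earn"),
            ([0, 1, 6], "other"), ([1, 0, 4], "other")]
model = train_nb(training)
print(predict_nb(model, [2, 1, 0]), predict_nb(model, [0, 0, 3]))
```

Training is a single pass of counting and prediction is one sum per label, which is why both are fast.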

Page 35: Text Categorization

Support Vector Machines

SVMs are an interesting new classifier:

Similar to kNN. Similar (ish) to maxent.

Idea: Transform examples into a new space where they can be linearly separated. Group examples into regions that all share the same label. Base grouping in terms of training items (support vectors) that lie on the boundary. Best grouping found automatically.
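The idea can be illustrated on a deliberately tiny case; this is not a general SVM trainer. One-dimensional points that cannot be split by a single threshold are mapped with an assumed transformation phi(x) = x², after which they can. In that one-dimensional new space, the widest-margin boundary is the midpoint between the two closest points of opposite label, which play the role of support vectors. All names and data are illustrative.

```python
def phi(x):
    """Assumed transformation into the new space."""
    return x * x


def fit_max_margin(points):
    """Return (threshold, support_vectors) in the new space.

    Assumes that after phi every positive example lies above every negative one.
    """
    positives = [phi(x) for x, label in points if label]
    negatives = [phi(x) for x, label in points if not label]
    if max(negatives) >= min(positives):
        raise ValueError("examples are not separable in the new space")
    support = (max(negatives), min(positives))  # boundary items on each side
    return sum(support) / 2, support


def classify(x, threshold):
    return phi(x) > threshold


# Positives sit on both sides of the negatives, so no single threshold on x works.
points = [(-3, True), (-2, True), (-0.5, False), (0, False), (1, False), (2.5, True)]
threshold, support = fit_max_margin(points)
print(threshold, support)
print(classify(-4, threshold), classify(0.3, threshold))
```

Only the two boundary items determine the threshold; moving any other training point (without crossing the boundary) leaves the classifier unchanged.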