Text Categorization

Karl Rees

Ling 580

April 2, 2001

What is Text Categorization?

Classify the topic or theme of a document by categories/classes based on its content.

Humans can do this intuitively, but how do you teach a machine to classify a text?

Useful Applications

Filter a stream of news for a particular interest group.

Spam vs. Interesting mail.

Determining Authorship.

Poetry vs. Fiction vs. Essay, etc…

So, For Example:

Ch. 16 of Foundations of Statistical Natural Language Processing.

We want to create an agent that can give us the probability of this document belonging to certain categories:

P(Poetry) = .015

P(Mathematics) = .198

P(Artificial Intelligence) = .732

P(Utterly Confusing) = .989

Training Sets

Corpus of documents for which we already know the category.

Essentially, we use this to teach a computer. It is like showing a child a few pictures of a dog and a few pictures of a cat, then pointing to your neighbor's pet and asking the child what kind of animal it is.

Data Representation Model

Represent each object (document) in the training set in the form (x, c), where x is a vector of measurements and c is the class label.

In other words, each document is represented as a vector of potentially weighted word counts.

Training Procedure

A procedure/function that selects a classifier from a family of classifiers (the model class).

Typically there are two classes, c1 and c2, where c2 is NOT(c1).

For example:

g(x) = w * x + w0

x is the vector of word counts, w * x is the dot product of x and a vector of weights w (because we may attach more importance to certain words), and w0 is some threshold.

Choose c1 for g(x) > 0, otherwise c2.
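The decision rule above can be sketched in a few lines of Python. This is only an illustration: the three-word vocabulary, the weights and the threshold are made-up values, not ones taken from the slides.

```python
def g(x, w, w0):
    """Linear decision function g(x) = w . x + w0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0


def classify(x, w, w0):
    """Choose c1 when g(x) > 0, otherwise c2."""
    return "c1" if g(x, w, w0) > 0 else "c2"


# Hypothetical weights for a three-word vocabulary (assumed values).
w = [0.5, 0.25, -1.0]
w0 = -1.0

doc_a = [2, 4, 0]  # word counts for one document
doc_b = [0, 0, 1]  # word counts for another document

print(g(doc_a, w, w0), classify(doc_a, w, w0))
print(g(doc_b, w, w0), classify(doc_b, w, w0))
```

Training then amounts to finding values of w and w0 that put as many training documents as possible on the correct side of the threshold.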

Test Set

After training the classifier, we want to test its accuracy on a test set.

Accuracy = Number of Objects Correctly Classified / Number of Objects Examined

Precision = Number of Objects Correctly Assigned to a Category / Number of Objects Assigned to that Category

Fallout = Number of Objects Incorrectly Assigned to a Category / Number of Objects NOT Belonging to that Category
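As a rough sketch, the three measures can be computed from a list of predictions and a list of true labels for a single category. The function name and the toy data are assumptions made for illustration.

```python
def evaluate(predicted, actual):
    """Accuracy, precision and fallout for one category.

    predicted[i] and actual[i] are True when document i is assigned to /
    really belongs to the category.
    """
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)          # correctly assigned
    fp = sum(1 for p, a in pairs if p and not a)      # incorrectly assigned
    tn = sum(1 for p, a in pairs if not p and not a)  # correctly left out
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "fallout": fp / (fp + tn) if fp + tn else 0.0,
    }


# Toy test set of eight documents (made-up labels).
predicted = [True, True, True, False, False, False, False, False]
actual = [True, True, False, True, False, False, False, False]
scores = evaluate(predicted, actual)
print(scores)
```

Here 6 of 8 documents are classified correctly, 2 of the 3 documents assigned to the category belong there, and 1 of the 5 documents outside the category was assigned to it.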

Modeling

How should texts be represented?

Using all words leads to sparse statistics. Some words are indicative of a label.

One approach:

For each label, collect all words in texts with that label.

Apply a mean square error test to determine whether a word occurs by chance in the texts.

Sort all words by the mean square error test and take the top n (say 20).

The idea is to select words that are correlated with a label.

Example: for the label earnings, words such as “profit.”

Reuters Collection

A common dataset in text classification is the Reuters collection:

Articles categorized into about 100 topics. 9603 training examples, 3299 test examples. Short texts, annotated with SGML.

Available: http://www.research.att.com/~lewis/reuters21578.html


<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="5554" NEWID="11">

<DATE>26-FEB-1987 15:18:59.34</DATE>

<TOPICS><D>earn</D></TOPICS>

<TEXT>

<TITLE>COBANCO INC <CBCO> YEAR NET</TITLE>

<DATELINE> SANTA CRUZ, Calif., Feb 26 -

</DATELINE><BODY>Shr 34 cts vs 1.19 dlrs

Net 807,000 vs 2,858,000

Assets 510.2 mln vs 479.7 mln

Deposits 472.3 mln vs 440.3 mln

Loans 299.2 mln vs 327.2 mln

Note: 4th qtr not available. Year includes 1985

extraordinary gain from tax carry forward of 132,000 dlrs, or

five cts per shr.

Reuter

</BODY></TEXT></REUTERS>

For Reuters, Label = Earnings:

Format of document vector file

Each entry consists of 25 lines:

1. the document id
2. is the document in the training set (T) or in the evaluation set (E)?
3. is the document in the core training set (C) or in the validation set (V)? (X where this doesn't apply.)
4. is the document in the earnings category (Y) or not (N)?
5. feature weight for "vs"
6. feature weight for "mln"
7. feature weight for "cts"
8. feature weight for ";"
9. feature weight for "&"
10. feature weight for "000"
11. feature weight for "loss"
12. feature weight for "'"
13. feature weight for """
14. feature weight for "3"
15. feature weight for "profit"
16. feature weight for "dlrs"
17. feature weight for "1"
18. feature weight for "pct"
19. feature weight for "is"
20. feature weight for "s"
21. feature weight for "that"
22. feature weight for "net"
23. feature weight for "lt"
24. feature weight for "at"
25. semicolon (separator between entries)

Vector For Example Document:

{ docid 11 T C Y 5 5 3 3 3 4 0 0 0 4 0 3 2 0 0 0 0 3 2 0 ; }
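A minimal sketch of reading one such entry follows. It assumes the fields are separated by whitespace, that the braces and the word "docid" are only decoration, and that feature 13 is the double-quote character. The function name is hypothetical.

```python
# The 20 feature words in file order (fields 5-24).
# Feature 13 is assumed to be the double-quote character.
FEATURES = ["vs", "mln", "cts", ";", "&", "000", "loss", "'", '"', "3",
            "profit", "dlrs", "1", "pct", "is", "s", "that", "net", "lt", "at"]


def parse_entry(text):
    """Parse one document vector entry into a dictionary."""
    tokens = [t for t in text.split() if t not in ("{", "}", "docid")]
    if len(tokens) != 25 or tokens[-1] != ";":
        raise ValueError("expected 24 fields followed by ';'")
    doc_id, split, subset, label = tokens[:4]
    weights = [int(t) for t in tokens[4:24]]
    return {
        "id": int(doc_id),
        "training": split == "T",   # T = training set, E = evaluation set
        "subset": subset,           # C = core, V = validation, X = n/a
        "earnings": label == "Y",   # class label
        "weights": dict(zip(FEATURES, weights)),
    }


entry = parse_entry(
    "{ docid 11 T C Y 5 5 3 3 3 4 0 0 0 4 0 3 2 0 0 0 0 3 2 0 ; }"
)
print(entry["id"], entry["earnings"], entry["weights"]["vs"])
```

The pair (entry["weights"], entry["earnings"]) is the (x, c) representation described earlier: a vector of measurements plus a class label.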

Classification Techniques

Decision Trees

Maximum Entropy Modeling

Perceptrons (Neural Networks)

K-Nearest Neighbor Classification (kNN)

Naïve Bayes

Support Vector Machines

Decision Trees

Information

Measure of how much we “know” about an object, document, decision, etc…

At each successive node we have more information about the object’s classification.

Information

p = Number of objects in a set that belong to a certain category.

n = Number of objects in a set that don’t belong to that category.

I = Measure of the amount of information we still need in order to classify an object, given only that it is in the set (the entropy of the set).

$$I\left(\frac{p}{p+n}, \frac{n}{p+n}\right) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$$

Information Gain

The amount of Information we gain from making a decision.

Each decision we make splits the current set into new sets (two for a yes/no decision), each with its own Information value. The new sets should leave less uncertainty about the category than the previous set did, so we build our tree based on Information Gain. Those decisions with the highest gain come first.

Information Gain

Gain(A) = I( p/(p+n) , n/(p+n) ) - Remainder(A)

A is the decision (test) being evaluated.

Remainder(A) is the weighted average of the information in the resulting sets i = 1…v:

$$\mathrm{Remainder}(A) = \sum_{i=1}^{v} \frac{p_i+n_i}{p+n}\, I\left(\frac{p_i}{p_i+n_i}, \frac{n_i}{p_i+n_i}\right)$$
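The Information and Gain formulas can be checked with a short Python sketch. The function names and the document counts are hypothetical; each resulting set is given as a pair (p_i, n_i).

```python
from math import log2


def information(p, n):
    """I(p/(p+n), n/(p+n)) in bits; a term with count 0 contributes 0."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count:
            frac = count / total
            result -= frac * log2(frac)
    return result


def remainder(subsets, p, n):
    """Weighted average of the information in the resulting sets."""
    return sum((pi + ni) / (p + n) * information(pi, ni)
               for pi, ni in subsets)


def gain(subsets):
    """Gain(A) = I(p/(p+n), n/(p+n)) - Remainder(A)."""
    p = sum(pi for pi, _ in subsets)
    n = sum(ni for _, ni in subsets)
    return information(p, n) - remainder(subsets, p, n)


# Made-up example: 8 earnings and 8 non-earnings documents.
print(information(8, 8))          # evenly mixed set
print(gain([(8, 0), (0, 8)]))     # a decision that separates perfectly
print(gain([(6, 2), (2, 6)]))     # a decision that separates only partly
```

When growing the tree, the decision with the larger gain (here the perfect split) would be placed first.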

Decision Trees

At the bottom of our trees, we have leaf nodes. At each of these nodes, we compute the percentage of objects belonging to the node that fit into the category we are looking at. If it is greater than a certain percentage (say 50%), we say that all documents that fit into this node are in this category. Hopefully, though, the tree will give us more confidence than 50%.

Pruning

After growing a tree, we want to prune it down to a smaller size.

We may want to get rid of nodes/decisions that don’t contribute any significant information (possibly nodes 6 and 7 in our example).

We also want to get rid of decisions that are based on possibly insignificant details. These “overfit” the training set.

For example, if there is only one document in the set that has both dlrs and pct, and this is in the earnings category, it would probably be a mistake to assume that all such documents are in earnings.

Bagging / Boosting

Obviously, there are many different ways to prune. Also, there are many other algorithms besides Information Gain for growing a decision tree.

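Bagging and Boosting both mean generating many decision trees and combining (averaging or voting over) the results of running an object through each of these trees. Bagging trains each tree on a random resample of the training set; Boosting trains the trees in sequence, giving more weight to the examples that earlier trees got wrong.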
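A minimal sketch of bagging (not boosting) is shown below. To keep it short, each "tree" is a one-decision stump rather than a full decision tree, and the training data, function names and number of trees are all assumptions made for illustration.

```python
import random


def train_stump(data):
    """Stand-in for a decision tree: the single test x[f] > t that makes
    the fewest errors on the given training sample."""
    best = None
    for f in range(len(data[0][0])):
        for t in sorted({x[f] for x, _ in data}):
            errors = sum((x[f] > t) != label for x, label in data)
            if best is None or errors < best[0]:
                best = (errors, f, t)
    _, f, t = best
    return lambda x: x[f] > t


def bagging(data, n_trees=11, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        sample = [rng.choice(data) for _ in data]
        trees.append(train_stump(sample))
    return trees


def predict(trees, x):
    """Majority vote over all trees."""
    votes = sum(tree(x) for tree in trees)
    return votes > len(trees) / 2


# Made-up data: x = [count of "profit", count of "that"], label = earnings?
data = [([3, 1], True), ([2, 0], True), ([4, 1], True),
        ([0, 1], False), ([0, 0], False), ([0, 2], False)]
trees = bagging(data)
print(predict(trees, [5, 0]), predict(trees, [0, 1]))
```

Because every stump sees a slightly different sample, a decision based on an insignificant detail in one sample tends to be outvoted by the others.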

Maximum Entropy Modeling

Consult Slides at http://www.cs.jhu.edu/~hajic/courses/cs465/cs46520/ppframe.htm for more information about Maximum Entropy.


Perceptrons

kNN

Basic idea: Keep training set in memory.

Define a similarity metric.

At classification time, match unseen example against all examples in memory.

Select k best matches.

Predict the unseen example's label as the majority label of the k retrieved examples.

Example similarity metric:
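One common choice, assumed here for illustration, is the cosine of the angle between two word-count vectors. The sketch below uses it in a small kNN classifier; the training vectors, labels and function names are made up.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_predict(train, x, k=3):
    """Match x against every stored example, keep the k most similar,
    and return their majority label."""
    ranked = sorted(train, key=lambda ex: cosine(ex[0], x), reverse=True)
    labels = [label for _, label in ranked[:k]]
    return Counter(labels).most_common(1)[0][0]


# Training set kept in memory: (vector, label) pairs (hypothetical values).
train = [([5, 5, 3], "earn"), ([4, 3, 2], "earn"), ([3, 4, 4], "earn"),
         ([0, 1, 0], "other"), ([0, 0, 1], "other"), ([1, 0, 0], "other")]

print(knn_predict(train, [4, 4, 3], k=3))
print(knn_predict(train, [0, 2, 0], k=1))
```

Note that there is no training step at all: all the work happens at classification time, when the unseen example is compared against every stored example.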

kNN

Many variants on kNN.

The underlying idea is that abstraction (rules, parameters, etc.) is likely to lose information.

No abstraction: case-based reasoning.

Training fast, testing can be slow.

Potentially large memory requirements.

Naïve Bayes

Assumption that all features are independent of each other:

Here A is a document consisting of features A1 … An, and l is the document label.

Fast training, fast evaluation.

Good when features are independent.
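Assuming the standard naive Bayes rule, P(l | A) proportional to P(l) * P(A1 | l) * ... * P(An | l), a minimal sketch might look like the following. Treating each word as a feature, the add-one smoothing, the function names and the toy documents are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from math import log


def train(docs):
    """docs is a list of (list_of_words, label) pairs."""
    label_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in docs:
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab


def predict(model, words):
    """Return the label l that maximizes log P(l) + sum_i log P(Ai | l)."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        total_words = sum(word_counts[label].values())
        score = log(count / total_docs)
        for w in words:
            if w in vocab:  # words never seen in training are ignored
                # Add-one smoothing (an assumption) avoids log(0).
                score += log((word_counts[label][w] + 1)
                             / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Made-up training documents.
docs = [(["net", "profit", "dlrs"], "earn"),
        (["profit", "cts", "vs"], "earn"),
        (["trade", "talks", "that"], "other"),
        (["oil", "prices", "that"], "other")]
model = train(docs)
print(predict(model, ["profit", "vs"]), predict(model, ["that", "oil"]))
```

Training is a single counting pass over the documents, and evaluation is one sum per label, which is why both are fast.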

Support Vector Machines

SVMs are an interesting new classifier:

Similar to kNN.

Similar (ish) to maxent.

Idea:

Transform examples into a new space where they can be linearly separated.

Group examples into regions that all share the same label.

Base grouping in terms of training items (support vectors) that lie on the boundary.

Best grouping found automatically.
