Text categorization

Updated 11/1/2006

Upload: phillip-sherman

Post on 05-Jan-2016



Performance measures – binary classification

Accuracy: acc = (a+d)/(a+b+c+d)

Precision: p = a/(a+b)

Recall: r = a/(a+c)

F-measure: F_β = (β²+1)pr/(β²p+r)

Usually one uses F_1 = 2pr/(p+r)

Break-even point: the value at which p = r

Contingency table

                                Ground truth
                                True    False
Classifier assigned    True     a       b
                       False    c       d
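The measures on this slide can be computed directly from the four contingency counts. The following is a minimal sketch; the function name `binary_metrics` and the example counts are illustrative assumptions, not part of the original slides.

```python
def binary_metrics(a, b, c, d, beta=1.0):
    """Compute the slide's measures from contingency-table counts.

    a: assigned True,  ground truth True
    b: assigned True,  ground truth False
    c: assigned False, ground truth True
    d: assigned False, ground truth False
    """
    acc = (a + d) / (a + b + c + d)
    p = a / (a + b) if a + b else 0.0          # precision
    r = a / (a + c) if a + c else 0.0          # recall
    denom = beta ** 2 * p + r
    f = (beta ** 2 + 1) * p * r / denom if denom else 0.0
    return {"accuracy": acc, "precision": p, "recall": r, "f": f}


# Hypothetical counts, chosen only for illustration.
m = binary_metrics(a=8, b=2, c=4, d=86)
print(m)
```

With the default `beta=1.0` the `f` entry is F_1. Note that accuracy is high here mostly because d dominates, which is why precision and recall are usually reported for rare categories.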


Performance measures – multiple categories

Micro averaging

Macro averaging
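The two averaging schemes differ in where the average is taken: micro averaging pools the contingency counts of all categories and computes one score, while macro averaging computes a score per category and then averages the scores. A small sketch for precision follows; the per-category counts are hypothetical.

```python
def precision(a, b):
    """Precision from counts: a = correct assignments, b = wrong assignments."""
    return a / (a + b) if a + b else 0.0


def micro_precision(tables):
    """Pool counts over all categories, then compute one precision."""
    a = sum(t["a"] for t in tables)
    b = sum(t["b"] for t in tables)
    return precision(a, b)


def macro_precision(tables):
    """Compute precision per category, then take the unweighted mean."""
    return sum(precision(t["a"], t["b"]) for t in tables) / len(tables)


# Hypothetical counts: one large, easy category and one small, hard one.
tables = [{"a": 90, "b": 10}, {"a": 1, "b": 9}]
micro = micro_precision(tables)
macro = macro_precision(tables)
print(micro, macro)
```

Micro averaging is dominated by the large category, whereas macro averaging gives every category equal weight, so the two can differ considerably on skewed collections.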


Reuters 21578

The Reuters collection contains 9603 training articles and 3299 test articles.

The articles were sent over the Reuters newswire in 1987.

It contains about 100 categories, such as ‘mergers and acquisitions’, ‘interest rates’, ‘wheat’, ‘silver’, etc.

The distribution of articles among categories is highly non-uniform:

‘earning’ contains 2709 docs.

75 categories contain fewer than 10 docs each.


Example of a Reuters news story from category ‘earning’

<DATE>26-FEB-1987 15:18:59.34</DATE>
<TOPICS><D>earn</D></TOPICS>
<TEXT>
<TITLE>COBANCO INC &lt;CBCO> YEAR NET</TITLE>
<DATELINE> SANTA CRUZ, Calif., Feb 26 - </DATELINE>
<BODY>Shr 34 cts vs 1.19 dlrs
Net 807,000 vs 2,858,000
Assets 510.2 mln vs 479.7 mln
Deposits 472.3 mln vs 440.3 mln
Loans 299.2 mln vs 327.2 mln
Note: 4th qtr not available. Year includes 1985 extraordinary gain from tax carry forward of 132,000 dlrs, or five cts per shr.
Reuter</BODY></TEXT></REUTERS>


Categorization methods

Decision trees

Naïve Bayes

K-nearest neighbors (KNN)

Neural networks

Support Vector Machines (SVM)


Representation of documents

The most popular representation is ‘Bag of Words’, which ignores all structure of documents.

Document i is represented by a vector x_i ∈ R^n (n is the number of word types), where the j-th coordinate is the number of times word w_j appears in the document (the so-called ‘term frequency’, tf_j).
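As an illustration of the bag-of-words representation, the sketch below builds term-frequency vectors for two toy documents. The whitespace tokenizer, the lowercasing, and the helper names `build_vocabulary` and `tf_vector` are simplifying assumptions made for this example.

```python
from collections import Counter


def build_vocabulary(docs):
    """Map each word type to a coordinate index j (alphabetical order)."""
    word_types = sorted({w for doc in docs for w in doc.lower().split()})
    return {w: j for j, w in enumerate(word_types)}


def tf_vector(doc, vocab):
    """Return the term-frequency vector of doc; word order is discarded."""
    vec = [0] * len(vocab)
    for word, count in Counter(doc.lower().split()).items():
        if word in vocab:                      # unseen words are ignored
            vec[vocab[word]] = count
    return vec


# Toy documents, invented for this example.
docs = ["net profit vs net loss", "wheat prices rise"]
vocab = build_vocabulary(docs)
x0 = tf_vector(docs[0], vocab)
x1 = tf_vector(docs[1], vocab)
print(vocab)
print(x0, x1)
```

Any permutation of the words in a document yields the same vector, which is exactly the sense in which the representation ignores document structure.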


Decision trees

Earnings? 2301/7681 = 0.3 of all docs

  contains “cents” < 2 times: 694/5977 = 0.116

    contains “net” < 1 time: 272/5436 = 0.050 (“no”)

    contains “net” ≥ 1 time: 422/541 = 0.780

  contains “cents” ≥ 2 times: 1607/1704 = 0.943

    contains “versus” < 2 times: 209/301 = 0.694

    contains “versus” ≥ 2 times: 1398/1403 = 0.996 (“yes”)
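One reading of the tree on this slide, written as code: each test compares a word count against a threshold, and each leaf stores the fraction of its training documents that belong to the earnings category. The branch assignment follows from the counts (for example 272 + 422 = 694 and 5436 + 541 = 5977). The 0.5 decision threshold in `classify` is an assumption, not something stated on the slide.

```python
def earnings_probability(tf):
    """Walk the tree; tf maps a word to its count in the document.

    Returns the fraction of earnings documents at the leaf that is reached.
    """
    if tf.get("cents", 0) < 2:
        if tf.get("net", 0) < 1:
            return 272 / 5436      # leaf labelled "no" on the slide
        return 422 / 541
    if tf.get("versus", 0) < 2:
        return 209 / 301
    return 1398 / 1403             # leaf labelled "yes" on the slide


def classify(tf, threshold=0.5):
    """Assign the category when the leaf fraction reaches the threshold (assumed)."""
    return earnings_probability(tf) >= threshold


print(classify({"cents": 3, "versus": 2}))
print(classify({"wheat": 4}))
```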


Building decision trees

Information gain
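Information gain is the reduction in class entropy obtained by a split. The sketch below evaluates it for the root split on “cents”, using the document counts from the decision tree slide; the function names are illustrative.

```python
from math import log2


def entropy(pos, total):
    """Binary entropy (in bits) of a node with pos positive docs out of total."""
    h = 0.0
    for k in (pos, total - pos):
        if k:                                  # 0 * log(0) is taken as 0
            q = k / total
            h -= q * log2(q)
    return h


def information_gain(parent, children):
    """parent and each child are (positive_docs, all_docs) pairs."""
    pos, total = parent
    remainder = sum(n / total * entropy(k, n) for k, n in children)
    return entropy(pos, total) - remainder


# Root node: 2301 of 7681 docs are earnings stories.
# Split on "cents": 694/5977 in one branch, 1607/1704 in the other.
gain = information_gain((2301, 7681), [(694, 5977), (1607, 1704)])
print(round(gain, 3))
```

When growing a tree, the same quantity would be computed for every candidate word and threshold, and the split with the largest gain would be chosen.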


Decision Tree Pruning


Naïve Bayes

Multivariate Bernoulli model

Multinomial model
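The slide only names the two models, so the following is a generic sketch of the multinomial variant rather than the presenter's formulation. It assumes add-one (Laplace) smoothing and uses a tiny invented training set.

```python
from collections import Counter, defaultdict
from math import log


def train_multinomial_nb(docs):
    """docs: list of (words, label). Returns (log_prior, log_likelihood, vocab)."""
    class_docs = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in docs:
        word_counts[label].update(words)
        vocab.update(words)
    log_prior = {c: log(n / len(docs)) for c, n in class_docs.items()}
    log_like = {}
    for c in class_docs:
        # Add-one smoothing (an assumption): no word gets zero probability.
        total = sum(word_counts[c].values()) + len(vocab)
        log_like[c] = {w: log((word_counts[c][w] + 1) / total) for w in vocab}
    return log_prior, log_like, vocab


def predict(words, model):
    """Pick the class with the highest log posterior; unknown words are skipped."""
    log_prior, log_like, vocab = model
    scores = {
        c: log_prior[c] + sum(log_like[c][w] for w in words if w in vocab)
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Invented toy training data.
train = [
    ("net profit cents versus".split(), "earn"),
    ("cents net net".split(), "earn"),
    ("wheat harvest tonnes".split(), "grain"),
]
model = train_multinomial_nb(train)
print(predict("net cents".split(), model))
```

In the multinomial model each word occurrence contributes a factor, so repeated words count repeatedly; the multivariate Bernoulli model instead uses only the presence or absence of each vocabulary word.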


Precision recall curve


K-nearest neighbor
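A minimal sketch of k-nearest-neighbor categorization over term-frequency vectors. The slide does not specify a similarity measure or voting rule; cosine similarity and simple majority voting are assumed here, and the training vectors are invented.

```python
from collections import Counter
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def knn_predict(query, training, k=3):
    """Majority vote among the k training vectors most similar to query."""
    ranked = sorted(training, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Invented term-frequency vectors over a 3-word vocabulary.
training = [
    ([2, 1, 0], "earn"),
    ([3, 0, 0], "earn"),
    ([0, 0, 2], "grain"),
    ([0, 1, 3], "grain"),
]
print(knn_predict([1, 1, 0], training, k=3))
```

There is no training step beyond storing the vectors; all the work happens at query time, which makes the method simple but comparatively slow on large collections.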


Neural network

Perceptrons

Multi-layer perceptrons


SVM


Reuters 21578 – comparison*

*(Yiming Yang & Xin Liu, A re-examination of text categorization methods, SIGIR 99)