

ICS 278: Data Mining

Lecture 15: Text Classification

Padhraic Smyth
Department of Information and Computer Science

University of California, Irvine


Road Map for Lectures

• Lecture 15 (today): text classification

• Lectures 16, 17, 18, 19:
– Unsupervised learning from text: clustering and topic modeling
– Recommender systems
– Credit scoring applications
– Pattern-finding algorithms

• Lecture 20:
– Thursday, June 8th (2 weeks from Thursday)
– 5-minute project summary from each student
– More details on format to come later


Text Classification

• Text classification has many applications:
– Spam email detection
– Automated tagging of streams of news articles, e.g., Google News
– Automated creation of Web-page taxonomies

• Data representation:
– “Bag of words” is most commonly used: either counts or binary
– Can also use “phrases” for commonly occurring combinations of words
– (A small bag-of-words sketch follows this slide.)

• Classification methods:
– Naïve Bayes: widely used (e.g., for spam email)
  • Fast and reasonably accurate
– Support vector machines (SVMs)
  • Typically the most accurate method in research studies
  • But more computationally complex
– Logistic regression (regularized)
  • Not as widely used, but can be competitive with SVMs (e.g., Zhang and Oles, 2002)
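As an illustration of the bag-of-words representation, here is a minimal sketch using scikit-learn (my example, not from the original slides; the toy corpus is hypothetical):

  from sklearn.feature_extraction.text import CountVectorizer

  docs = ["stocks fell as interest rates rose",
          "the team won the final game",
          "interest rate cuts lifted stocks"]

  # Counts: entry (i, j) = number of times term j occurs in document i
  X_counts = CountVectorizer().fit_transform(docs)

  # Binary: entry (i, j) = 1 if term j occurs in document i, else 0
  X_binary = CountVectorizer(binary=True).fit_transform(docs)

  # "Phrases": include bigrams as extra features alongside single words
  X_phrases = CountVectorizer(ngram_range=(1, 2)).fit_transform(docs)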


Further Reading on Text Classification

• Web-related text mining in general:
– S. Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann, 2003 (see Chapter 5 for a discussion of text classification)

• General references on text and language modeling:
– C. Manning and H. Schütze, Foundations of Statistical Natural Language Processing, MIT Press, 1999
– D. Jurafsky and J. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Prentice Hall, 2000

• SVMs for text classification:
– T. Joachims, Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms, Kluwer, 2002


Common Data Sets used for Evaluation

• Reuters:
– 10,700 labeled documents
– 10% of documents have multiple class labels

• Yahoo! Science hierarchy:
– 95 disjoint classes with 13,598 pages

• 20 Newsgroups:
– 18,800 labeled USENET postings
– 20 leaf classes, 5 root-level classes

• WebKB:
– 8,300 documents in 7 categories such as “faculty”, “course”, “student”

• Industry:
– 6,449 company home pages partitioned into 71 classes


Trimming the Vocabulary

• Stopword removal:
– Remove “non-content” words: very frequent stop words such as “the”, “and”, ...
– Remove very rare words, e.g., those that occur only a few times in 100k documents
– Can remove 30% or more of the original unique words

• Stemming:
– Reduce all variants of a word to a single term, e.g., {draw, drawing, drawings} -> “draw”
– Porter stemming algorithm (1980): relies on a preconstructed suffix list with associated rules
  • e.g., if suffix = IZATION and the prefix contains at least one vowel followed by a consonant, replace with suffix = IZE (BINARIZATION => BINARIZE)

• This still often leaves p ~ O(10^4) terms => a very high-dimensional classification problem! (A small preprocessing sketch follows below.)
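A minimal vocabulary-trimming sketch, assuming NLTK’s PorterStemmer and scikit-learn’s CountVectorizer are available (my choice of tools; the slides do not prescribe any, and the toy corpus is hypothetical):

  from nltk.stem import PorterStemmer
  from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

  docs = ["the artist was drawing",
          "two drawings were sold",
          "artists draw daily"]

  stemmer = PorterStemmer()

  def preprocess(doc):
      # Stopword removal first, then Porter stemming of what remains,
      # so draw/drawing/drawings all map to the common stem "draw"
      return " ".join(stemmer.stem(w) for w in doc.split()
                      if w not in ENGLISH_STOP_WORDS)

  # min_df drops very rare terms; it would be set far higher than 2
  # on a 100k-document collection
  vec = CountVectorizer(min_df=2)
  X = vec.fit_transform([preprocess(d) for d in docs])
  print(vec.get_feature_names_out())   # e.g., 'draw' covers all three variants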


Feature Selection

• Performance of text classification algorithms can be improved by selecting only a subset of discriminative terms
– See the classification results later in these slides

• Greedy search:
– Start from the empty set or the full set and add/delete one term at a time
– Heuristics for adding/deleting:
  • Information gain (mutual information of a term with the class)
  • Chi-square
  • Other ideas
– Methods tend not to be particularly sensitive to the specific heuristic used for feature selection, but some form of feature selection often improves performance (see the sketch after this list)
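A small feature-selection sketch, assuming scikit-learn’s chi-square and mutual-information scorers stand in for the heuristics above (the toy counts are hypothetical):

  import numpy as np
  from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

  # Toy term-count matrix: rows = documents, columns = terms
  X = np.array([[3, 0, 1, 0],
                [2, 0, 2, 1],
                [0, 4, 0, 1],
                [0, 3, 1, 0]])
  y = np.array([0, 0, 1, 1])

  # Chi-square: keep the k terms that score highest against the class labels
  selector = SelectKBest(chi2, k=2).fit(X, y)
  print(selector.get_support(indices=True))      # indices of the selected terms

  # Information gain = mutual information of each term with the class
  print(mutual_info_classif(X, y, discrete_features=True))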


Example of Role of Feature Selection (from Chakrabarti, Chapter 5)

9,600 documents from the US Patent database; 20,000 raw features (terms)


Classifying Term Vectors

• Typically multiple different words may be helpful in classifying a particular class, e.g.:
– Class = “finance”
– Words = “stocks”, “return”, “interest”, “rate”, etc.
– Thus, classifiers that combine multiple features often do well, e.g., naïve Bayes, logistic regression, SVMs, etc.

• Linear classifiers often perform well in high dimensions:
– In many cases there are fewer documents in the training data than dimensions, i.e., n < p => the training data are linearly separable
– So again, naïve Bayes, logistic regression, and linear SVMs are all useful
– The question becomes: which linear discriminant to select?


Classification Issues

• Typically many features, p ~ O(10^4) terms

• Consider n sample points in p dimensions:
– Binary labels => 2^n possible labelings (or dichotomies)
– A labeling is linearly separable if we can separate the labels with a hyperplane
– Let f(n, p) = fraction of the 2^n possible labelings that are linearly separable:

  f(n, p) = 1                                   for n <= p + 1
  f(n, p) = (2 / 2^n) Σ_{i=0}^{p} C(n-1, i)     for n > p + 1

(A short computation of f(n, p) follows.)
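The count of separable dichotomies is easy to compute directly; here is a short sketch of f(n, p), my implementation of the formula above (Cover, 1965):

  from math import comb

  def f(n, p):
      # Fraction of the 2^n labelings of n points (in general position)
      # in p dimensions that are linearly separable
      if n <= p + 1:
          return 1.0
      return 2 * sum(comb(n - 1, i) for i in range(p + 1)) / 2 ** n

  print(f(100, 10000))   # 1.0: with n < p+1, every labeling is separable
  print(f(30, 10))       # ~0.07: well below 1 once n exceeds p+1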


If n < p+1, then points will be linearly separable (for large p)


Types of Classifiers

• Generative/probabilistic:
– Model p( x | c ) for each class, then compute p( c | x ) via Bayes’ rule
– e.g., the naïve Bayes model

• Conditional probability/regression:
– Model p( c | x ) directly
– e.g., logistic regression

• Discriminative:
– Look for decision boundaries in the input space x directly (no probabilities)
– e.g., perceptron, linear discriminants, SVMs, etc.

(A side-by-side sketch of the three styles follows.)
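The three styles map directly onto standard scikit-learn estimators; a side-by-side sketch (my pairing, with a hypothetical toy corpus):

  from sklearn.feature_extraction.text import CountVectorizer
  from sklearn.naive_bayes import MultinomialNB        # generative: models p(x | c)
  from sklearn.linear_model import LogisticRegression  # conditional: models p(c | x)
  from sklearn.svm import LinearSVC                    # discriminative: boundary only

  docs = ["stocks and interest rates", "rate cuts lift stocks",
          "the team won the game", "a close final game"]
  y = ["finance", "finance", "sports", "sports"]

  X = CountVectorizer().fit_transform(docs)
  for clf in (MultinomialNB(), LogisticRegression(), LinearSVC()):
      print(type(clf).__name__, clf.fit(X, y).predict(X))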


Probabilistic “Generative” Classifiers

• Model p( x | ck ) for each class and perform classification via Bayes’ rule:

  c = arg max_k { p( ck | x ) } = arg max_k { p( x | ck ) p( ck ) }

• How to model p( x | ck )?
– p( x | ck ) = probability of a “bag of words” x given a class ck
– Two commonly used approaches (for text):
  • Naïve Bayes: treat each term xj as conditionally independent given ck
  • Multinomial: model a document with N words as N tosses of a p-sided die
– Other models are possible but less common, e.g., model word order using a Markov chain for p( x | ck )


Naïve Bayes Classifier for Text

• Naïve Bayes classifier = conditional independence model
– Assumes conditional independence of the terms given the class:

  p( x | ck ) = ∏j p( xj | ck )

– Note that we model each term xj as a discrete random variable
– Binary terms (Bernoulli):

  p( x | ck ) = ∏{j: xj=1} p( xj = 1 | ck ) ∏{j: xj=0} p( xj = 0 | ck )

– Non-binary terms (counts): p( x | ck ) = ∏j p( xj | ck ), where each p( xj | ck ) is now a distribution over counts 0, 1, 2, ...
  • can use a parametric model (e.g., Poisson) or a non-parametric model (e.g., histogram) for these distributions

(A minimal Bernoulli sketch follows.)
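A minimal numpy sketch of the Bernoulli (binary-term) model, with Laplace-smoothed estimates (the toy data and the smoothing choice are mine, not from the slides):

  import numpy as np

  # Binary term-document matrix (1 = term present) and class labels
  X = np.array([[1, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 0, 1, 1],
                [0, 1, 1, 1]])
  y = np.array([0, 0, 1, 1])

  classes = np.unique(y)
  # theta[k, j] = smoothed estimate of p( xj = 1 | ck )
  theta = np.array([(X[y == c].sum(axis=0) + 1.0) / ((y == c).sum() + 2.0)
                    for c in classes])
  prior = np.array([(y == c).mean() for c in classes])

  def log_posterior(x):
      # log p(ck) + sum_j log p( xj | ck ), using conditional independence
      return np.log(prior) + (x * np.log(theta)
                              + (1 - x) * np.log(1 - theta)).sum(axis=1)

  x_new = np.array([1, 1, 0, 0])
  print(classes[log_posterior(x_new).argmax()])   # predicted class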


Multinomial Classifier for Text

• Multinomial classification model:
– Assume the data are generated by a p-sided die (multinomial model):

  p( x | ck ) = p( Nx | ck ) ∏j p( xj | ck )^nj

  where Nx = number of terms (total count) in document x, and nj = number of times term j occurs in the document
– p( Nx | ck ) = probability a document has length Nx, e.g., a Poisson model
  • Can be dropped if length is thought not to be class-dependent
– Here there is a single random variable per class, and the p( xj = i | ck ) probabilities sum to 1 over i (i.e., a multinomial model)
– Probabilities are typically only defined and evaluated for counts i = 1, 2, 3, ...
– But “zero counts” could also be modeled if desired
  • This would be equivalent to a naïve Bayes model with a geometric distribution on counts

(A small multinomial scoring sketch follows.)

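A matching numpy sketch for the multinomial model, dropping the length term p( Nx | ck ) as suggested above (toy counts and the smoothing choice are mine):

  import numpy as np

  # Term-count matrices for training documents in each class (toy data)
  X0 = np.array([[4, 1, 0], [3, 2, 1]])   # class 0
  X1 = np.array([[0, 1, 5], [1, 0, 4]])   # class 1

  # Smoothed die-face probabilities p( term j | ck ), summing to 1 over j
  theta0 = (X0.sum(axis=0) + 1.0) / (X0.sum() + X0.shape[1])
  theta1 = (X1.sum(axis=0) + 1.0) / (X1.sum() + X1.shape[1])

  def log_lik(x, theta):
      # log p( x | ck ) up to terms that do not depend on the class:
      # sum_j nj * log p( term j | ck )
      return x @ np.log(theta)

  x_new = np.array([2, 1, 0])
  print(log_lik(x_new, theta0), log_lik(x_new, theta1))  # class 0 wins here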


Highest Probability Terms in Multinomial Distributions


Comparing Naïve Bayes and Multinomial models

McCallum and Nigam (1998) found that the multinomial model outperformed naïve Bayes (with binary features) in text classification experiments (however, this may be more a result of using counts vs. binary features).

Notes on names used in the literature:
– “Bernoulli” (or “multivariate Bernoulli”) is sometimes used for the binary version of the naïve Bayes model
– The multinomial model is also referred to as the “unigram” model
– The multinomial model is also sometimes (confusingly) referred to as naïve Bayes


WebKB Data Set

• Train on ~5,000 hand-labeled web pages
– From Cornell, Washington, U. Texas, Wisconsin
• Crawl and classify a new site (CMU)
• Results:

               Student   Faculty   Person   Project   Course   Departmt
  Extracted      180        66       246       99       28        1
  Correct        130        28       194       72       25        1
  Accuracy       72%        42%      79%       73%      89%      100%


Comparing Bernoulli and Multinomial on Web KB Data


Comparing Multinomial and Bernoulli on Reuters Data

(from McCallum and Nigam, 1998)



Comparing Bernoulli and Multinomial (slide from Chris Manning, Stanford)

Results from classifying 13,589 Yahoo! Web pages in the Science subtree of the hierarchy into 95 different topics


Comments on Generative Models for Text
(comments applicable to both naïve Bayes and multinomial classifiers)

• Simple and fast => popular in practice
– e.g., linear in p, n, M for both training and prediction
– e.g., easy to use in situations where the classifier needs to be updated regularly (e.g., for spam email)

• Training = “smoothed” frequency counts, e.g.,

  p( xj = 1 | ck ) = njk / nk

  where njk = number of class-ck training documents containing term j and nk = number of class-ck training documents, with small pseudo-counts added to the numerator and denominator for smoothing

• Numerical issues:
– Typically work with log p( ck | x ), etc., to avoid numerical underflow
– Useful trick when computing log p( xj | ck ) for sparse data:
  • precompute Σj log p( xj = 0 | ck ) for each class, and
  • then add log p( xj = 1 | ck ) - log p( xj = 0 | ck ) only for the terms that actually occur (see the sketch below)

• Note: both models are “wrong”, but for classification they are often sufficient

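A sketch of the sparse-data trick above (the parameter values are hypothetical; the point is that scoring touches only the terms present in the document):

  import numpy as np

  # theta[k, j] = p( xj = 1 | ck ) for 2 classes and 4 terms (hypothetical)
  theta = np.array([[0.80, 0.10, 0.05, 0.20],
                    [0.10, 0.70, 0.30, 0.20]])

  # Precompute, per class, the log-likelihood of an all-zeros document
  base = np.log(1.0 - theta).sum(axis=1)

  # Per-term correction added only for terms with xj = 1
  delta = np.log(theta) - np.log(1.0 - theta)

  def log_lik(present):
      # 'present' lists the indices of terms occurring in the document;
      # cost is O(#present terms), not O(p) with p ~ 10^4
      return base + delta[:, present].sum(axis=1)

  print(log_lik([0, 2]))   # document containing terms 0 and 2 only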


Beyond Independence

• Naïve Bayes and multinomial models both assume conditional independence of words given the class

• Alternative approaches try to account for higher-order dependencies:
– Bayesian networks:
  • p( x | c ) = ∏j p( xj | parents(xj), c )
  • Equivalent to a directed graph where edges represent direct dependencies
  • Various algorithms search for a good network structure
  • Useful for improving the quality of the distribution model
  • ... however, this does not always translate into better classification
– Maximum entropy models:
  • p( x | c ) = (1/Z) ∏S fS( subsetS(x) | c ), a product of functions defined on subsets of terms
  • Equivalent to an undirected graphical model
  • Estimation is equivalent to a maximum entropy assumption
  • Feature selection is crucial (which f terms to include)
  • Can provide high-accuracy classification
  • ... however, tends to be computationally complex to fit (estimating Z is difficult)


Linear Classifiers

• Linear classifier (two-class case): predict class 1 if wT x + w0 > 0
– w is a p-dimensional vector of weights (learned from the data)
– w0 is a threshold (also learned from the data)
– Equation of the linear hyperplane (decision boundary): wT x + w0 = 0
– Distance of a point x from the hyperplane:

  (1/||w||) ( wT x + w0 )

(A one-line numpy sketch follows.)
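The distance computation in a few lines of numpy (the weights are hypothetical):

  import numpy as np

  w = np.array([2.0, -1.0])   # learned weight vector (hypothetical values)
  w0 = 0.5                    # learned threshold

  def signed_distance(x):
      # (1/||w||) (w^T x + w0): the sign gives the predicted class,
      # the magnitude gives the distance from the decision boundary
      return (w @ x + w0) / np.linalg.norm(w)

  print(signed_distance(np.array([1.0, 1.0])))   # positive side of the boundary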


Geometry of Linear Classifiers

wT x + w0 = 0

Direction of w vector

Distance of x from the boundary is (1/||w||) ( wT x + w0 )


Optimal Hyperplane, Support Vectors, and Margin

Circles = support vectors = points on the convex hull that are closest to the hyperplane

M = margin = distance of the support vectors from the hyperplane

The goal is to find the weight vector that maximizes M

Theory tells us that the max-margin hyperplane leads to good generalization (see work by Vapnik in the 1990s)


Optimal Separating Hyperplane

• Solution to a constrained optimization problem:

  max_{w, w0} M   subject to   (1/||w||) yi ( wT xi + w0 ) >= M,   i = 1, ..., n

  (here yi ∈ {-1, 1} is the binary class label for example i)

• Wlog, let ||w|| = 1/M; the problem becomes

  min_{w, w0} ||w||   subject to   yi ( wT xi + w0 ) >= 1,   i = 1, ..., n

• Unique solution for a linearly separable data set
• Margin M of the classifier:
– the distance between the separating hyperplane and the closest training samples
– optimal separating hyperplane <=> maximum margin

(A minimal sketch using an off-the-shelf SVM solver follows.)
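A minimal sketch of the max-margin solution using scikit-learn’s SVC; a large C approximates the hard-margin (separable) problem above (the toy points are mine):

  import numpy as np
  from sklearn.svm import SVC

  # A small linearly separable data set with labels in {-1, +1}
  X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 2.0], [2.0, 3.0]])
  y = np.array([-1, -1, 1, 1])

  clf = SVC(kernel="linear", C=1e6).fit(X, y)   # hard-margin approximation
  w, w0 = clf.coef_[0], clf.intercept_[0]

  print("margin M =", 1.0 / np.linalg.norm(w))  # M = 1/||w|| at the optimum
  print("support vectors:", clf.support_vectors_)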


Sketch of Optimization Problem

• Define the Lagrangian as a function of the w vector, w0, and the multipliers αi:

  L( w, α ) = (1/2) ||w||^2 - Σ_{i=1}^{n} αi [ yi ( wT xi + w0 ) - 1 ]

• The solution must satisfy

  αi [ yi ( wT xi + w0 ) - 1 ] = 0,   i = 1, ..., n

• Points with αi > 0 are called support vectors; their distance from the hyperplane is M
• This results in a quadratic programming optimization problem
– Good news:
  • convex function of the unknowns, so there is a unique optimum
  • a variety of well-known algorithms can find this optimum
– Bad news:
  • quadratic programming in general scales as O(n^3)
  • in practice it takes O(n^a), where a ~ 1.6 to 2 (see Chakrabarti, Chapter 5, p. 166)


Timing results on text classification (from Chakrabarti, Chapter 5, 2002)


Support Vector Machines

• If αi > 0, then the distance of xi from the separating hyperplane is M
– Support vectors = the points with associated αi > 0

• The decision function f(x) is computed from the support vectors as

  f( x ) = Σ_{i=1}^{n} αi yi ( xiT x ) + w0

  => prediction can be fast, since only the support vectors have αi > 0

• Non-linearly-separable case: can generalize to allow “slack” constraints

• Non-linear SVMs: replace the original x vector with non-linear functions of x
– the “kernel trick”: can solve the high-dimensional problem without working directly in high dimensions

• Computational speedups can reduce training time to near-linear
– e.g., Platt’s SMO algorithm, Joachims’ SVMlight

(A sketch recovering f(x) from a fitted model’s support vectors follows.)
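The support-vector expansion can be checked against a fitted model; in scikit-learn’s SVC, dual_coef_ stores the products αi·yi for the support vectors (continuing the toy data from the previous sketch):

  import numpy as np
  from sklearn.svm import SVC

  X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 2.0], [2.0, 3.0]])
  y = np.array([-1, -1, 1, 1])
  clf = SVC(kernel="linear", C=1e6).fit(X, y)

  # f(x) = sum_i alpha_i y_i (xi^T x) + w0, summing over support vectors only
  x_new = np.array([1.0, 2.0])
  f = clf.dual_coef_[0] @ (clf.support_vectors_ @ x_new) + clf.intercept_[0]

  print(f, clf.decision_function([x_new])[0])   # the two values agree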


Classic Reuters Data Set

• 21,578 documents, labeled manually
• 9,603 training and 3,299 test articles (the “ModApte” split)
• 118 categories
– An article can be in more than one category
– Learn 118 binary category distinctions

• Example “interest rate” article:

  2-APR-1987 06:35:19.50 west-germany b f BC-BUNDESBANK-LEAVES-CRE 04-02 0052
  FRANKFURT, March 2 The Bundesbank left credit policies unchanged after today's regular meeting of its council, a spokesman said in answer to enquiries. The West German discount rate remains at 3.0 pct, and the Lombard emergency financing rate at 5.0 pct.

• Common categories (#train, #test):
– Earn (2877, 1087)
– Acquisitions (1650, 179)
– Money-fx (538, 179)
– Grain (433, 149)
– Crude (389, 189)
– Trade (369, 119)
– Interest (347, 131)
– Ship (197, 89)
– Wheat (212, 71)
– Corn (182, 56)


Dumais et al. 1998: Reuters - Accuracy

               NBayes   Trees   LinearSVM
  earn          95.9%   97.8%     98.2%
  acq           87.8%   89.7%     92.8%
  money-fx      56.6%   66.2%     74.0%
  grain         78.8%   85.0%     92.4%
  crude         79.5%   85.0%     88.3%
  trade         63.9%   72.5%     73.5%
  interest      64.9%   67.1%     76.3%
  ship          85.4%   74.2%     78.0%
  wheat         69.7%   92.5%     89.7%
  corn          65.3%   91.8%     91.1%

  Avg Top 10    81.5%   88.4%     91.4%
  Avg All Cat   75.2%   n/a       86.4%


Precision-recall for SVM (linear), naïve Bayes, and NN (from Dumais, 1998), using the Reuters data set


Comparison of accuracy across three classifiers: Naive Bayes, Maximum Entropy and Linear SVM, using three data sets: 20 newsgroups, the Recreation sub-tree of the Open Directory, and University Web pages from WebKB. From Chakrabarti, 2003, Chapter 5.


Comparing Text Classifiers

• Naïve Bayes models (Bernoulli or multinomial):
– Low time complexity (a single linear pass through the data)
– Generally good, but not always the best
– Widely used for spam email filtering

• Linear SVMs:
– Often produce the best results in research studies
– But computationally complex to train; not as widely used in practice as naïve Bayes

• Others:
– Logistic regression, decision trees: less widely used, but can be useful


Learning with Labeled and Unlabeled Documents

• In practice, obtaining labels for documents is time-consuming, expensive, and error-prone
– Typical application: a small number of labeled docs and a very large number of unlabeled docs

• Idea:
– Build a probabilistic model on the labeled docs
– Classify the unlabeled docs, getting p( class | doc ) for each class and doc
  • This is equivalent to the E-step of the EM algorithm
– Now relearn the probabilistic model using the new “soft labels”
  • This is equivalent to the M-step of the EM algorithm
– Continue to iterate until convergence (e.g., until the class probabilities no longer change significantly)
– This EM approach to classification shows that unlabeled data can help classification performance, compared to using labeled data alone (a small sketch follows)
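A compact numpy sketch of this EM loop for a multinomial model (the toy counts, Laplace smoothing, and fixed iteration count are my choices, not from the slides):

  import numpy as np

  def m_step(X, resp):
      # Relearn priors and smoothed term probabilities from soft labels:
      # resp[i, k] = p( class k | doc i ), X[i, j] = count of term j in doc i
      prior = resp.sum(axis=0) / resp.sum()
      counts = resp.T @ X                     # expected term counts per class
      theta = (counts + 1.0) / (counts.sum(axis=1, keepdims=True) + X.shape[1])
      return prior, theta

  def e_step(X, prior, theta):
      # Soft labels p( class k | doc i ) under the current model
      log_post = np.log(prior) + X @ np.log(theta).T
      log_post -= log_post.max(axis=1, keepdims=True)    # underflow guard
      post = np.exp(log_post)
      return post / post.sum(axis=1, keepdims=True)

  # A few labeled docs, several unlabeled docs (toy term counts)
  X_lab = np.array([[4, 0, 1], [0, 5, 1]]);  y_lab = np.array([0, 1])
  X_unl = np.array([[3, 1, 0], [1, 4, 2], [2, 0, 1]])
  resp_lab = np.eye(2)[y_lab]                 # labeled docs keep hard labels

  prior, theta = m_step(X_lab, resp_lab)      # initial model: labeled data only
  for _ in range(20):                         # iterate E and M steps
      resp_unl = e_step(X_unl, prior, theta)
      prior, theta = m_step(np.vstack([X_lab, X_unl]),
                            np.vstack([resp_lab, resp_unl]))

  print(e_step(X_unl, prior, theta).argmax(axis=1))   # inferred labels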


Learning with Labeled and Unlabeled Data
(graph from “Semi-supervised text classification using EM”, Nigam, McCallum, and Mitchell, 2006)


Other Issues in Text Classification

• Real-time constraints:
– Being able to update classifiers as new data arrive
– Being able to make predictions very quickly in real time

• Document length:
– Varying document length can be a problem for some classifiers
– The multinomial model tends to handle it better than the Bernoulli model, for example

• Multi-labels and multiple classes:
– Text documents can have more than one label
– SVMs, for example, can only handle binary data (see the one-vs-rest sketch after this list)

• Feature selection:
– Experiments have shown that feature selection (e.g., by greedy algorithms using information gain) can often improve results

• Linked documents:
– Can view Web documents as nodes in a directed graph
– Classification can then leverage the link structure
  • Heuristic: the class labels of linked pages are more likely to be the same
– The optimal solution is to classify all documents jointly rather than individually
– The resulting “global classification” problem is typically computationally complex
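For the multi-label case, the standard workaround is one binary classifier per category (“binary relevance”, as with the 118 Reuters category distinctions); a sketch with scikit-learn (the toy documents and labels are mine):

  from sklearn.feature_extraction.text import CountVectorizer
  from sklearn.multiclass import OneVsRestClassifier
  from sklearn.preprocessing import MultiLabelBinarizer
  from sklearn.svm import LinearSVC

  docs = ["wheat and corn prices rose", "bank raises interest rate",
          "grain shipments by ship", "corn exports and trade"]
  labels = [["grain", "corn"], ["interest"], ["grain", "ship"], ["corn", "trade"]]

  # Turn label sets into a binary indicator matrix: one column per category
  mlb = MultiLabelBinarizer()
  Y = mlb.fit_transform(labels)

  # One independent binary SVM per category: a doc can receive several labels
  X = CountVectorizer().fit_transform(docs)
  clf = OneVsRestClassifier(LinearSVC()).fit(X, Y)
  print(mlb.classes_, clf.predict(X))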


Further Reading on Text Classification

• Web-related text mining in general:
– S. Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann, 2003 (see Chapter 5 for a discussion of text classification)

• General references on text and language modeling:
– C. Manning and H. Schütze, Foundations of Statistical Natural Language Processing, MIT Press, 1999
– D. Jurafsky and J. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Prentice Hall, 2000

• SVMs for text classification:
– T. Joachims, Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms, Kluwer, 2002