Text Categorization With Support Vector Machines: Learning With Many Relevant Features
By Thorsten Joachims
Presented by Meghneel Gore
Goal of Text Categorization
Classify documents into a number of pre-defined categories.
Documents can be in multiple categories.
Documents can be in none of the categories.
Applications of Text Categorization
Categorization of news stories for online retrieval
Finding interesting information from the WWW
Guiding a user's search through hypertext
Representation of Text
Removal of stop words
Reduction of each word to its stem
Preparation of the feature vector
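The three preprocessing steps on this slide can be sketched in a few lines of Python. This is only an illustration: the stop-word list is a tiny hypothetical one, and the `stem` helper is a crude suffix stripper standing in for a real stemming algorithm, not the one used in the paper.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; real systems use longer lists.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "for", "with"}

# Crude suffix stripping, standing in for a real stemmer (an assumption).
SUFFIXES = ("ing", "ers", "er", "ed", "es", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def document_vector(text):
    """Map raw text to a sparse {stem: count} feature vector."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = [stem(t) for t in tokens if t not in STOP_WORDS]
    return Counter(stems)


vec = document_vector("Buying a computer: the computers with memory and processing")
print(vec)
```

Storing only the non-zero counts keeps the vector sparse, which matters because the vocabulary of a real corpus has thousands of stems while a single document uses few of them.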
Representation of Text
[Figure: a document mapped to its document vector of stem counts: 2 comput, 1 process, 2 buy, 3 memory, ...]
What's Next...
Appropriateness of support vector machines for this application
Support vector machine theory
Conventional learning methods
Experiments
Results
Conclusions
Why SVMs?
High-dimensional input space
Few irrelevant features
Sparse document vectors
Most text categorization problems are linearly separable
Support Vector Machines
[Figure: visualization of a support vector machine]
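Support Vector Machines
Structural risk minimization
$$P(\mathrm{error}(h)) \le \mathrm{train\_error}(h) + 2\sqrt{\frac{d\left(\ln\frac{2n}{d} + 1\right) - \ln\frac{\eta}{4}}{n}}$$
Here n is the number of training examples, d is the VC dimension of the hypothesis space, and the bound holds with probability at least 1 − η.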
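To get a feel for the structural risk minimization bound, the sketch below evaluates its second term, 2·sqrt((d(ln(2n/d) + 1) − ln(η/4)) / n), for a few VC dimensions. The function names and the numbers (10,000 training examples, 5% training error, η = 0.05) are hypothetical and chosen only for illustration.

```python
import math


def vc_confidence(n, d, eta):
    """Second term of the bound: 2 * sqrt((d * (ln(2n/d) + 1) - ln(eta/4)) / n)."""
    return 2.0 * math.sqrt((d * (math.log(2.0 * n / d) + 1.0) - math.log(eta / 4.0)) / n)


def risk_bound(train_error, n, d, eta=0.05):
    """Upper bound on the true error that holds with probability at least 1 - eta."""
    return train_error + vc_confidence(n, d, eta)


# Hypothetical numbers: 10,000 training examples, 5% training error.
for d in (10, 100, 1000):
    print(d, round(risk_bound(0.05, n=10_000, d=d), 3))
```

The bound grows with d and shrinks with n, which is why a hypothesis space of low VC dimension is preferred when it fits the training data equally well.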
Support Vector Machines
We define a structure of hypothesis spaces $H_i$ such that their respective VC dimensions $d_i$ increase with $i$.
Support Vector Machines
Lemma [Vapnik, 1982]
Consider hyperplanes $h(\vec{d}) = \mathrm{sign}\{\vec{w} \cdot \vec{d} + b\}$ as hypotheses.
Support Vector Machines
If all example vectors $\vec{d}_i$ are contained in a hypersphere of radius $R$ and it is required that
$$|\vec{w} \cdot \vec{d}_i + b| \ge 1, \quad \text{with } \|\vec{w}\| = A$$
Support Vector Machines
Then this set of hyperplanes has a VC dimension $d$ bounded by
$$d \le \min([R^2 A^2], n) + 1$$
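The practical point of the lemma is that the bound min([R²A²], n) + 1 can be far smaller than the size of the input space. The sketch below evaluates it under two labeled assumptions: n is read as the number of features, and the bracket [·] is read as rounding up to an integer. The function name and the example values are hypothetical.

```python
import math


def vc_dimension_bound(radius, norm_bound, n_features):
    """Evaluate min([R^2 A^2], n) + 1, reading [.] as the ceiling (an assumption)."""
    return min(math.ceil(radius ** 2 * norm_bound ** 2), n_features) + 1


# Hypothetical values: unit-length document vectors (R = 1), ||w|| = A = 3,
# and a 10,000-term vocabulary.
print(vc_dimension_bound(1.0, 3.0, 10_000))  # 10
print(vc_dimension_bound(1.0, 200.0, 10_000))  # 10001
```

With a small norm A (a large margin), the bound is governed by R²A² and does not depend on the number of features, which is what makes the approach attractive for high-dimensional text data.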
Support Vector Machines
Minimize $\|\vec{w}\|$ such that
$$\forall i: \; y_i [\vec{w} \cdot \vec{d}_i + b] \ge 1$$
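The optimization problem on this slide (minimize the norm of w subject to y_i[w · d_i + b] ≥ 1 for every example) is normally solved as a quadratic program. The sketch below does not solve it; it only checks the constraints for a few hand-picked candidate hyperplanes on hypothetical toy data and keeps the feasible one with the smallest norm, to show what the objective prefers.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def satisfies_constraints(w, b, docs, labels):
    """True if y_i * (w . d_i + b) >= 1 holds for every training example."""
    return all(y * (dot(w, d) + b) >= 1 for d, y in zip(docs, labels))


def margin(w):
    """Distance from the hyperplane to the closest admissible point: 1 / ||w||."""
    return 1.0 / math.sqrt(dot(w, w))


# Hypothetical toy data: two documents per class, two features.
docs = [(2.0, 0.0), (3.0, 1.0), (-2.0, 0.0), (-3.0, -1.0)]
labels = [1, 1, -1, -1]

# Hand-picked candidates (w, b); a real solver would search over all of them.
candidates = [((1.0, 0.0), 0.0), ((0.5, 0.0), 0.0), ((0.25, 0.0), 0.0)]
feasible = [(w, b) for w, b in candidates if satisfies_constraints(w, b, docs, labels)]
best_w, best_b = min(feasible, key=lambda wb: dot(wb[0], wb[0]))
print(best_w, margin(best_w))
```

Among the feasible candidates, the one with the smaller norm has the larger margin, so minimizing the norm of w is the same as maximizing the margin.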
Conventional Learning Methods
Naïve Bayes classifier
Rocchio algorithm
k-nearest neighbors
Decision tree classifier
Naïve Bayes Classifier
Consider a document vector with attributes $a_1, a_2, \ldots, a_n$ and target values $v_j \in V$.
Bayesian approach:
$$v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \ldots, a_n)$$
Naïve Bayes Classifier
We can rewrite that using Bayes' theorem as
$$v_{MAP} = \arg\max_{v_j \in V} \frac{P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)}{P(a_1, a_2, \ldots, a_n)} = \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)$$
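Naïve Bayes Classifier
The naïve Bayes method assumes that the attributes are conditionally independent given the target value:
$$v_{NB} = \arg\max_{v_j \in \{like,\, dislike\}} P(v_j) \prod_{i=1}^{n} P(a_i \mid v_j)$$
$$= \arg\max_{v_j \in \{like,\, dislike\}} P(v_j)\, P(a_1 = \text{"Mary"} \mid v_j)\, P(a_2 = \text{"had"} \mid v_j) \cdots P(a_{11} = \text{"snow"} \mid v_j)$$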
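The naïve Bayes decision rule (pick the target value that maximizes the prior times the product of per-word probabilities) can be sketched as follows. The training sentences are invented for illustration, the function names are hypothetical, and add-one smoothing is an assumption added here so that unseen words do not zero out the product; the slides do not specify a smoothing scheme.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """examples: list of (list_of_words, label). Returns priors, word counts, vocabulary."""
    label_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for words, label in examples:
        word_counts[label].update(words)
        vocabulary.update(words)
    priors = {label: count / len(examples) for label, count in label_counts.items()}
    return priors, word_counts, vocabulary


def classify(words, priors, word_counts, vocabulary):
    """Return argmax_v of log P(v) + sum_i log P(a_i | v), with add-one smoothing."""
    scores = {}
    for label, prior in priors.items():
        total = sum(word_counts[label].values())
        score = math.log(prior)
        for word in words:
            score += math.log((word_counts[label][word] + 1) / (total + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical training data, invented for illustration.
examples = [
    ("mary had a little lamb".split(), "like"),
    ("fleece was white as snow".split(), "like"),
    ("the wolf ate the lamb".split(), "dislike"),
    ("cold grey snow and mud".split(), "dislike"),
]
model = train_naive_bayes(examples)
print(classify("mary had white fleece".split(), *model))
print(classify("the wolf and mud".split(), *model))
```

Summing logarithms instead of multiplying probabilities gives the same argmax and avoids numerical underflow on long documents.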
Experiments
Datasets
Performance measures
Results
Datasets
Reuters-21578 dataset: 9603 training documents, 3299 testing documents
Ohsumed corpus: 10000 training documents, 10000 testing documents
Performance Measures
Precision: probability that a document predicted to be in class ‘x’ truly belongs to that class
Recall: probability that a document belonging to class ‘x’ is classified into that class
Precision/recall break-even point
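The two measures and their break-even point can be computed as in the sketch below. The document ids and scores are hypothetical, and the break-even routine uses one simple convention as an assumption: sweep the cut-off over the ranked documents and, if precision and recall never coincide exactly, average the closest pair.

```python
def precision_recall(predicted, actual):
    """predicted, actual: sets of document ids assigned to / truly in class x."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


def breakeven_point(scores, actual):
    """Sweep a cut-off over ranked documents; return the value where precision == recall.

    scores: {doc_id: classifier score}. If the two never coincide exactly,
    return the average of the closest pair (a convention assumed here).
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    best = None
    for k in range(1, len(ranked) + 1):
        p, r = precision_recall(set(ranked[:k]), actual)
        if best is None or abs(p - r) < abs(best[0] - best[1]):
            best = (p, r)
    return (best[0] + best[1]) / 2


# Hypothetical scores for six documents; d1, d2, and d4 truly belong to class x.
scores = {"d1": 0.9, "d2": 0.8, "d3": 0.7, "d4": 0.6, "d5": 0.3, "d6": 0.1}
actual = {"d1", "d2", "d4"}
print(precision_recall({"d1", "d2", "d3"}, actual))
print(breakeven_point(scores, actual))
```

Precision and recall are equal exactly when the number of documents assigned to the class matches the number that truly belong to it, which is why a single break-even number can summarize the trade-off between the two.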
Results
Precision/recall break-even point on Ohsumed dataset
Results
Precision/recall break-even point on Reuters dataset
Conclusions
Introduces SVMs for text categorization
Theoretical and empirical evidence that SVMs are well suited for text categorization
Consistent improvement in accuracy over other methods