Smart RSS Aggregator
A text classification problem
Alban Scholer & Markus Kirsten, 2005
Introduction
● Smart RSS aggregator
● Predicts how interesting a user finds an unread article
● Presents news articles depending on the prediction
Issues
● Extremely high dimensional data
● Lots of unlabeled data
● Few training examples
● Only clickthrough information
● Multiuser environment
Support Vector Machine
● Support Vector Machine
● Max-margin for generalization
● Linear but easily extended to non-linear classification
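A minimal sketch using scikit-learn (a modern library, not part of the original 2005 work) of how the same SVM interface covers both the linear separator and a non-linear classifier, simply by swapping the kernel:

```python
# Illustrative only: toy data and parameters are assumptions, not from the slides.
from sklearn.svm import SVC

# Toy 2-D training data
X_train = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
y_train = [-1, -1, 1, 1]

linear_svm = SVC(kernel="linear", C=1.0)  # max-margin linear separator
rbf_svm = SVC(kernel="rbf", C=1.0)        # non-linear decision surface via the RBF kernel

linear_svm.fit(X_train, y_train)
rbf_svm.fit(X_train, y_train)

print(linear_svm.predict([[0.2, 0.8]]))
print(rbf_svm.predict([[0.2, 0.8]]))
```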
Max-margin separator
[Figure: labeled data points from two classes and the maximum-margin separating hyperplane]
SVM
● The problem of finding the optimal w can be reduced to the following quadratic program (QP):
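For reference, a standard soft-margin form of this QP (the exact notation used on the original slide is assumed):

```latex
\begin{aligned}
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \quad
  & \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2} + C \sum_{i=1}^{n} \xi_{i} \\
\text{subject to} \quad
  & y_{i}\,(\mathbf{w} \cdot \mathbf{x}_{i} + b) \ge 1 - \xi_{i},
  \qquad \xi_{i} \ge 0, \qquad i = 1, \dots, n
\end{aligned}
```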
Transductive SVM (TSVM)
● Semi-supervised learning vs. supervised learning
● TSVM is well suited for problems where:
– There are few labeled data available
– There are lots of unlabeled data
● Information lying in the unlabeled data is captured and modifies the decision surface.
TSVM vs. SVM
TSVM optimization problem
● New set of optimized variables: the labels yi* of the unlabeled examples
● New set of slack variables
● New user-specified variable: C*
● Very difficult optimization problem:
– Intractable when the number of unlabeled examples is greater than 10
– Approximate solution proposed by Joachims (a sketch of the objective follows)
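A sketch of Joachims' transductive objective, consistent with the new variables listed above (notation assumed): the labels yj* of the k unlabeled examples are optimized jointly with the separating hyperplane.

```latex
\begin{aligned}
\min_{y_{1}^{*}, \dots, y_{k}^{*},\; \mathbf{w},\, b,\, \boldsymbol{\xi},\, \boldsymbol{\xi}^{*}} \quad
  & \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2}
    + C \sum_{i=1}^{n} \xi_{i}
    + C^{*} \sum_{j=1}^{k} \xi_{j}^{*} \\
\text{subject to} \quad
  & y_{i}\,(\mathbf{w} \cdot \mathbf{x}_{i} + b) \ge 1 - \xi_{i},
    \qquad \xi_{i} \ge 0 \quad \text{(labeled)} \\
  & y_{j}^{*}\,(\mathbf{w} \cdot \mathbf{x}_{j}^{*} + b) \ge 1 - \xi_{j}^{*},
    \qquad \xi_{j}^{*} \ge 0,\; y_{j}^{*} \in \{-1, +1\} \quad \text{(unlabeled)}
\end{aligned}
```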
Text Classification
● Joachims T., “Transductive Inference for Text Classification using Support Vector Machines”
● Characteristics of the Text Classification problem
● Why are SVM and TSVM well suited for this kind of problem?
● Feature selection for text classification using SVM
Characteristics of the Text Classification problem
● High-dimensional input space
– One dimension for each word in the vocabulary (10 000 words)
● Sparse input vector
– In any one text, only a tiny proportion of the full vocabulary is used
Why (T)SVM?
● SVM has been shown to perform well in these conditions and can outperform other classifiers.
● Transductive SVM, exploiting information in test data, can outperform SVM when few training samples but lots of test data are available.
Feature selection for Text Classification using SVM
● Feature selection is a central problem in many machine learning applications.
● Poor feature selection leads to poor accuracy.
Feature selection (cont)
● For the text classification problem:
– The number of dimensions of the document vector is the number of words in the vocabulary. (Huge number of dimensions!)
– Each component of the document vector is the number of times the corresponding word occurs in the document.
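A minimal, library-free sketch of this bag-of-words document vector (the toy vocabulary and function names are illustrative, not from the slides):

```python
from collections import Counter

vocabulary = ["text", "classification", "task", "news", "sport"]  # toy vocabulary

def count_vector(document: str, vocab: list[str]) -> list[int]:
    """One component per vocabulary word; most components are zero."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocab]

print(count_vector("text classification is a text mining task", vocabulary))
# -> [2, 1, 1, 0, 0]
```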
Feature selection (cont)
● Refinement of the feature selection:
– Joachims adds to this document vector the Inverse Document Frequency (IDF) of each relevant word in the document.
– The IDF can be computed from the Document Frequency DF(w):
● IDF(w) = log(n / DF(w))
● where n is the total number of documents
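A small sketch of this IDF weighting over a toy corpus (the document set and function names are illustrative): DF(w) counts the documents containing w, and IDF(w) = log(n / DF(w)).

```python
import math

documents = [
    "the text classification task",
    "news about sport",
    "text mining and text classification",
]

def idf(word: str, docs: list[str]) -> float:
    n = len(docs)
    df = sum(1 for doc in docs if word in doc.split())
    return math.log(n / df) if df else 0.0

print(idf("text", documents))   # appears in 2 of 3 documents -> log(3/2)
print(idf("sport", documents))  # appears in 1 of 3 documents -> log(3)
```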
Feature selection (cont)
● Other refinements:
– Stopword elimination
– Word stemming
Feature selection (cont)
● Example: “the text classification task is characterized by a special set of characteristics. The text classification problem...”
● Transformation of the above text into a feature vector
Feature selection (example)
text            2
classification  2
task            1
charact         2
● The document vector is very sparse
● The words characteristics and characterized have the same stem: charact
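A rough end-to-end sketch of this example: stopword elimination, a very crude hand-rolled "stemmer" (purely illustrative; a real stemmer's rules and exact stems may differ), and word counting.

```python
from collections import Counter

STOPWORDS = {"the", "is", "by", "a", "of", "and"}  # assumed stopword list

def crude_stem(word: str) -> str:
    # Hypothetical rule: strip a few suffixes so that e.g. "characteristics"
    # and "characterized" collapse to the same stem.
    for suffix in ("eristics", "erized", "ization", "ing", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

def features(text: str) -> Counter:
    tokens = [t.strip(".,") for t in text.lower().split()]
    stems = [crude_stem(t) for t in tokens if t and t not in STOPWORDS]
    return Counter(stems)

print(features("the text classification task is characterized by a special "
               "set of characteristics. The text classification problem"))
# e.g. text: 2, classification: 2, charact: 2, task: 1, ...
```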
Smart stuff
● WordNet
● Combinations of words
● Putting users into clusters
● Using additional features (links, dates, author, source etc.)
● Active learning
Conclusion
● TSVM is well suited for text classification problems
● Feature selection is crucial
● To boost accuracy to a reasonable level, we have to combine techniques.
References
● Simon Haykin, Neural Networks, Second Edition, chapter 6, Pearson Education, 1999
● Thorsten Joachims, Transductive Inference for Text Classification using Support Vector Machines, Proceedings of ICML-99, 16th International Conference on Machine Learning, 1999
References (cont)
● Tom M. Mitchell, Machine Learning, chapter 6, McGraw-Hill International Editions, 1997
● K. Nigam, A. K. McCallum, S. Thrun, T. Mitchell, Text Classification from Labeled and Unlabeled Documents using EM, Machine Learning, Kluwer Academic Publishers, Boston, 1999