PEBL: Web Page Classification without Negative Examples
Hwanjo Yu, Jiawei Han, Kevin Chen-Chuan Chang
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, JAN 2004
Outline
Problem statement
Motivation
Related work
Main contribution
Technical details
Experiments
Summary
Problem Statement
Classify web pages into “user-interesting” classes, e.g. a “Home-Page Classifier” or a “Call for Papers Classifier”.
Negative samples are not given explicitly.
Only positive and unlabeled samples are available.
Motivation
Collecting negative samples may be delicate and arduous.
  Negative samples must uniformly represent the universal set.
  Manually collected negative training examples could be biased.
Predefined classes usually do not match users’ diverse and changing search targets.
Challenges
Collecting unbiased unlabeled data from the universal set.
  Random sampling of web pages on the Internet.
Achieving classification accuracy from positive and unlabeled data as high as from labeled data.
  PEBL framework (Mapping-Convergence algorithm using SVM).
Related Work
Semisupervised Learning
  Requires a sample of labeled (+/-) and unlabeled data.
  EM algorithm
  Transductive SVM
Single-Class Learning or Classification
  Rule-based (k-DNF)
    Not tolerant of sparse, high-dimensional data.
    Requires knowledge of the proportion of positive instances in the universal set.
  Probability-based
    Requires prior probabilities for each class.
    Assumes linear separation.
  OSVM, Neural Networks
Main Contribution
Collecting just positive samples speeds up the process of building classifiers.
The universal set of unlabeled samples can be reused for training different classifiers.
  This supports example-based queries on the Internet.
PEBL achieves accuracy as high as that of a typical framework without loss of efficiency in testing.
SVM Overview
Mapping-Convergence Algorithm
Mapping Stage
  A weak classifier (1) draws an initial approximation of “strong” negative data.
  Classifier 1 must not generate false negatives.
Convergence Stage
  Runs iteratively, using a second base classifier (2) that maximizes the margin to make progressively better approximations of the negative data.
  Classifier 2 must maximize the margin.
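The convergence loop described above can be sketched on a one-dimensional toy problem. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names and toy data are hypothetical, the strong negatives are taken as already produced by the mapping stage, and a 1-D midpoint separator stands in for the margin-maximizing second classifier (PEBL uses an SVM for that role).

```python
def fit_max_margin_1d(pos, neg):
    """Hard-margin separator on a line: midpoint between the closest pair.

    Assumption for this toy: every positive value lies to the right of
    every negative value, so the data are separable by a threshold.
    """
    return (min(pos) + max(neg)) / 2.0


def mapping_convergence(pos, strong_neg, remaining):
    """Grow the negative set until no remaining point is classified negative.

    pos        -- positive samples
    strong_neg -- "strong" negatives found by the mapping stage
    remaining  -- unlabeled samples not yet assigned to the negative set
    """
    neg = list(strong_neg)
    pool = list(remaining)
    while True:
        # Retrain on the positives versus all negatives accumulated so far.
        threshold = fit_max_margin_1d(pos, neg)
        new_neg = [x for x in pool if x < threshold]
        if not new_neg:
            # Converged: nothing left in the pool falls on the negative side.
            return threshold, neg
        neg.extend(new_neg)
        pool = [x for x in pool if x >= threshold]


# Hypothetical toy data: positives cluster near 8-10 on the line.
pos = [8.0, 9.0, 10.0]
strong_neg = [0.0, 1.0]
remaining = [2.0, 3.0, 5.0, 6.0, 8.5, 9.5]

threshold, neg = mapping_convergence(pos, strong_neg, remaining)
# The boundary moves 4.5 -> 5.5 -> 6.5 -> 7.0 and then stops.
print(threshold, sorted(neg))
```

Each pass pushes the boundary closer to the positive data, because the newly absorbed negatives lie nearer to the positives than the previous ones did. The loop ends when a pass adds nothing.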
Mapping Stage
Comparing the frequency of each feature within the positive and the unlabeled samples gives a list of positive features.
Filter out all the unlabeled samples that contain any positive feature, leaving behind just the “strong” negative samples.
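The mapping stage can be sketched as follows. This is a minimal sketch, assuming that each page is represented as a set of tokens and that a feature counts as "positive" when it occurs in a larger fraction of positive pages than of unlabeled pages. The function names and the toy pages are hypothetical.

```python
def positive_features(pos_docs, unlabeled_docs):
    """Return features that are relatively more frequent in the positive set.

    Assumption: "more frequent" means a strictly larger document fraction.
    """
    vocab = set().union(*pos_docs, *unlabeled_docs)

    def freq(feature, docs):
        # Fraction of documents that contain the feature.
        return sum(feature in doc for doc in docs) / len(docs)

    return {f for f in vocab if freq(f, pos_docs) > freq(f, unlabeled_docs)}


def strong_negatives(unlabeled_docs, pos_feats):
    """Keep only unlabeled documents that contain no positive feature."""
    return [doc for doc in unlabeled_docs if not (doc & pos_feats)]


# Hypothetical toy data: two positive "homepage" examples, four unlabeled pages.
pos_docs = [{"home", "my", "page"}, {"my", "homepage", "welcome"}]
unlabeled_docs = [
    {"my", "home", "family"},
    {"stock", "prices", "today"},
    {"call", "for", "papers"},
    {"welcome", "page", "news"},
]

feats = positive_features(pos_docs, unlabeled_docs)
negatives = strong_negatives(unlabeled_docs, feats)
print(sorted(feats))
print(negatives)
```

The filter is deliberately conservative: a single positive feature keeps a page out of the strong-negative set. That matches the requirement that this stage must not generate false negatives, at the cost of a small initial negative set.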
Experiments
LIBSVM for the SVM implementation.
Gaussian kernels for better text categorization accuracy.
Experiment 1: The Internet
  2388 pages from DMOZ (unlabeled dataset)
  368 personal homepages, 449 non-homepages
  192 college admission pages, 450 non-admission pages
  188 resume pages, 533 non-resume pages
Experiments
Experiment 2: University CS Department
  4093 pages from WebKB (unlabeled dataset)
  1641 student pages, 662 non-student pages
  504 project pages, 753 non-project pages
  1124 faculty pages, 729 non-faculty pages
Precision-Recall (P-R) breakeven point is used as the performance measure.
Compared against:
  TSVM: Traditional SVM
  OSVM: One-Class SVM
  UN: treating unlabeled data as negative instances
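The P-R breakeven point used as the performance measure is the value at which precision equals recall. One common way to compute it, shown here as a hedged sketch rather than the evaluation code used in the experiments, is to rank test pages by classifier score and predict as positive exactly as many pages as there are true positives. The function name and the toy scores are hypothetical.

```python
def pr_breakeven(scores, labels):
    """Precision (= recall) when the top-k scored items are predicted positive,
    where k is the number of truly positive items.

    Assumes labels are 0/1 and that at least one label is 1.
    """
    n_pos = sum(labels)
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = sum(label for _, label in ranked[:n_pos])
    # With k = n_pos predictions, precision = TP / k and recall = TP / n_pos,
    # so the two measures coincide by construction.
    return true_pos / n_pos


# Hypothetical classifier scores and true labels for five test pages.
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
labels = [1, 0, 1, 1, 0]

breakeven = pr_breakeven(scores, labels)
print(round(breakeven, 3))  # 2 of the top 3 are positive, so 0.667
```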
Summary
Classifying web pages into a class of interest requires laborious preprocessing.
The PEBL framework eliminates the need for negative training samples.
The M-C algorithm achieves accuracy as high as a traditional SVM.
Training time incurs an additional multiplicative logarithmic factor on top of that of SVM.