Combining labeled and unlabeled data for text categorization with a large number of categories
Rayid Ghani
KDD Lab Project
Supervised Learning with Labeled Data
Labeled data is required in large quantities and can be very expensive to collect.
Why use Unlabeled data?
Very cheap in the case of text: web pages, newsgroups, email messages
May not be as useful as labeled data, but is available in enormous quantities
Goal
Make learning more efficient and easier by reducing the amount of labeled data required for text classification with a large number of categories
• ECOC: very accurate and efficient for text categorization with a large number of classes
• Co-Training: useful for combining labeled and unlabeled data with a small number of classes
Related research with unlabeled data
Using EM in a generative model (Nigam et al. 1999)
Transductive SVMs (Joachims 1999)
Co-Training type algorithms (Blum & Mitchell 1998, Collins & Singer 1999, Nigam & Ghani 2000)
What is ECOC?
Solve multiclass problems by decomposing them into multiple binary problems (Dietterich & Bakiri 1995)
Use a learner to learn the binary problems
Training ECOC
     f1  f2  f3  f4  f5
A    0   0   1   1   0
B    1   0   1   0   0
C    0   1   1   1   0
D    0   1   0   0   1
Testing ECOC
The binary classifiers f1–f5 predict a codeword for test example X:
X    1   1   1   1   0
X is assigned the class whose codeword is nearest in Hamming distance (here C, at distance 1).
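The decoding step can be sketched as follows, using the slide's 4-class, 5-bit code matrix. This is a minimal illustration, not the original implementation; the `fit_binary` learner interface in `train_ecoc` is an assumption for exposition.

```python
import numpy as np

# 5-bit code matrix for 4 classes (A-D), as on the slide.
CODES = np.array([
    [0, 0, 1, 1, 0],  # A
    [1, 0, 1, 0, 0],  # B
    [0, 1, 1, 1, 0],  # C
    [0, 1, 0, 0, 1],  # D
])
CLASSES = ["A", "B", "C", "D"]

def train_ecoc(X, y, fit_binary):
    """Train one binary classifier per code bit.
    `fit_binary(X, bits)` is any binary learner (assumed interface);
    y holds class indices into CODES."""
    return [fit_binary(X, CODES[y, bit]) for bit in range(CODES.shape[1])]

def decode(codeword):
    """Assign the class whose codeword is nearest in Hamming distance."""
    dists = np.abs(CODES - np.asarray(codeword)).sum(axis=1)
    return CLASSES[int(np.argmin(dists))]

# The test example X from the slide: predicted codeword 1 1 1 1 0
# is closest to class C's codeword (distance 1).
print(decode([1, 1, 1, 1, 0]))  # -> C
```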
The Co-training algorithm
Loop (while unlabeled documents remain):
  Build classifiers A and B (using Naïve Bayes)
  Classify the unlabeled documents with A & B
  Add the most confident A predictions and the most confident B predictions as labeled training examples
[Blum & Mitchell 1998]
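A minimal sketch of this loop, assuming two aligned feature views of the same documents and using scikit-learn's `MultinomialNB` as the Naïve Bayes learner. The `cotrain` interface and the confidence tie-break are assumptions for illustration, not the original implementation.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def cotrain(XA, XB, y, UA, UB, rounds=5, k=1):
    """XA/XB: the two feature views of the labeled pool;
    UA/UB: the same views of the unlabeled pool (row-aligned)."""
    XA, XB, y = np.array(XA), np.array(XB), np.array(y)
    UA, UB = np.array(UA), np.array(UB)
    while rounds > 0 and len(UA) > 0:
        A = MultinomialNB().fit(XA, y)
        B = MultinomialNB().fit(XB, y)
        # Each view nominates its k most confident unlabeled examples.
        picks = set()
        for clf, U in ((A, UA), (B, UB)):
            conf = clf.predict_proba(U).max(axis=1)
            picks.update(np.argsort(conf)[-k:].tolist())
        idx = sorted(picks)
        # Label each picked example by whichever view is more confident.
        new_y = []
        for i in idx:
            pa = A.predict_proba(UA[i:i + 1])
            pb = B.predict_proba(UB[i:i + 1])
            p = pa if pa.max() >= pb.max() else pb
            new_y.append(int(A.classes_[p.argmax()]))
        # Move the picked examples into the labeled pool.
        XA, XB = np.vstack([XA, UA[idx]]), np.vstack([XB, UB[idx]])
        y = np.concatenate([y, new_y])
        keep = np.setdiff1d(np.arange(len(UA)), idx)
        UA, UB = UA[keep], UB[keep]
        rounds -= 1
    return MultinomialNB().fit(XA, y), MultinomialNB().fit(XB, y)
```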
The Co-training Algorithm
[Diagram: two Naïve Bayes classifiers, one per feature split. Learn from labeled data; Naïve Bayes on A and Naïve Bayes on B each estimate labels for the unlabeled documents; the most confident predictions from each are selected and added to the labeled data.]
[Blum & Mitchell, 1998]
One Intuition behind co-training
A and B are redundant: the A features are independent of the B features given the class
Co-training is like learning with random classification noise: the most confident A prediction gives B an essentially random example, mislabeled only at A's small misclassification rate
ECOC + CoTraining = ECoTrain
ECOC decomposes multiclass problems into binary problems
Co-Training works great with binary problems
ECOC + Co-Train = Learn each binary problem in ECOC with Co-Training
[Diagram: example categories — Sports, Science, Arts, Health, Politics, Law]
What happens with sparse data?
[Figure: percent decrease in error (y-axis, roughly 30–70%) vs. training size (x-axis, 0–100), for 15-bit, 31-bit, and 63-bit codes]
Datasets
Hoovers-255: collection of 4285 corporate websites; each company is classified into one of 255 categories; baseline 2%
Jobs-65 (from WhizBang): job postings with two feature sets (title, description); 65 categories; baseline 11%
[Figure: class distribution — percentage of examples per class, ranging from 0 to 12%]
Results
Dataset      NB 10%   NB 100%   ECOC 10%   ECOC 100%   EM 10%   Co-Train 10%   ECOC+Co-Train 10%
Jobs-65      50.1     68.2      59.3       71.2        58.2     54.1           64.5
Hoovers-255  15.2     32.0      24.8       36.5        9.1      10.2           27.6
(NB and ECOC use no unlabeled data; 10%/100% is the fraction of labeled data used)
Results
[Figure: precision (y-axis, 0–80) vs. recall (x-axis, 0–100) for ECOC + CoTrain, Naïve Bayes, and EM]
What Next?
Use an improved version of co-training (gradient descent): less prone to random fluctuations; uses all unlabeled data at every iteration
Use Co-EM (Nigam & Ghani 2000): a hybrid of EM and Co-Training
Summary
The Co-training setting
[Diagram: web pages linked by anchor text — "…My advisor…" and "…Professor Blum…" point to pages for Tom Mitchell ("Fredkin Professor of AI…") and Avrim Blum ("My research interests are…"), while "…My grandson…" points to Johnny's page ("I like horsies!"). Classifier A sees the anchor text; Classifier B sees the page text.]
Learning from Labeled and Unlabeled Data:
Using Feature Splits
Co-training [Blum & Mitchell 98]
Meta-bootstrapping [Riloff & Jones 99]
coBoost [Collins & Singer 99]
Unsupervised WSD [Yarowsky 95]
Consider this the co-training setting
Learning from Labeled and Unlabeled Data:
Extending supervised learning
MaxEnt Discrimination [Jaakkola et al. 99]
Expectation Maximization [Nigam et al. 98]
Transductive SVMs [Joachims 99]
Using Unlabeled Data with EM
Estimate labels of the unlabeled documents
Use all documents to build a new naïve Bayes classifier
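A minimal sketch of this EM loop, using scikit-learn's `MultinomialNB` as the naïve Bayes learner. Note the deck's EM uses probabilistic labels; this sketch hard-labels the unlabeled documents for brevity, which is an assumption, not the original algorithm.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def em_nb(X_l, y_l, X_u, iters=5):
    """X_l/y_l: labeled documents; X_u: unlabeled documents."""
    X_l, y_l, X_u = np.array(X_l), np.array(y_l), np.array(X_u)
    clf = MultinomialNB().fit(X_l, y_l)   # initially learn from labeled only
    for _ in range(iters):
        y_u = clf.predict(X_u)            # estimate labels of unlabeled docs
        clf = MultinomialNB().fit(        # rebuild from ALL documents
            np.vstack([X_l, X_u]),
            np.concatenate([y_l, y_u]))
    return clf
```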
Co-training vs. EM
Co-training: uses feature split; incremental labeling; hard labels
EM: ignores feature split; iterative labeling; probabilistic labels
Which differences matter?
Hybrids of Co-training and EM
                 Uses Feature Split?
Labeling         Yes            No
Incremental      co-training    self-training
Iterative        co-EM          EM
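For contrast with co-training, the incremental, no-feature-split cell of this grid — self-training — can be sketched as follows (assumed interface, using scikit-learn's `MultinomialNB`):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def self_train(X, y, U, rounds=5, k=1):
    """One classifier repeatedly adds its own k most confident
    predictions on the unlabeled pool U to the labeled pool."""
    X, y, U = np.array(X), np.array(y), np.array(U)
    while rounds > 0 and len(U) > 0:
        clf = MultinomialNB().fit(X, y)
        conf = clf.predict_proba(U).max(axis=1)
        idx = np.argsort(conf)[-k:]            # most confident predictions
        X = np.vstack([X, U[idx]])
        y = np.concatenate([y, clf.predict(U[idx])])
        U = np.delete(U, idx, axis=0)
        rounds -= 1
    return MultinomialNB().fit(X, y)
```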
[Diagrams: co-training keeps separate Naïve Bayes classifiers on A and on B and adds only the best predictions to the labeled pool; EM uses a single Naïve Bayes on A & B, labels all documents, and learns from all of them.]
Learning from Unlabeled Data using Feature Splits
coBoost [Collins & Singer 99]
Meta-bootstrapping [Riloff & Jones 99]
Unsupervised WSD [Yarowsky 95]
Co-training [Blum & Mitchell 98]
Intuition behind Co-training
A and B are redundant: the A features are independent of the B features given the class
Co-training is like learning with random classification noise: the most confident A prediction gives B an essentially random example, mislabeled only at A's small misclassification rate
Using Unlabeled Data with EM
Estimate labels of the unlabeled documents
Use all documents to rebuild the naïve Bayes classifier
(Initially learn from labeled data only.)
[Nigam, McCallum, Thrun & Mitchell, 1998]
Co-EM
[Diagram: initialize with labeled data; Naïve Bayes on A estimates labels for all documents, and a naïve Bayes classifier on B is built with all the data; then the roles swap, iterating.]

Use Feature Split?
             No     Yes
Label All    EM     co-EM
Label Few           co-training
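A minimal sketch of co-EM under stated assumptions (two row-aligned feature views, scikit-learn's `MultinomialNB`, hard labels for brevity where the original uses probabilistic labels):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_em(XA, XB, y, UA, UB, iters=5):
    """Like EM, but the two feature views take turns labeling
    ALL unlabeled documents for each other."""
    XA, XB, y = np.array(XA), np.array(XB), np.array(y)
    UA, UB = np.array(UA), np.array(UB)
    A = MultinomialNB().fit(XA, y)        # initialize with labeled data
    for _ in range(iters):
        yu = A.predict(UA)                # A labels all unlabeled docs
        B = MultinomialNB().fit(np.vstack([XB, UB]),
                                np.concatenate([y, yu]))
        yu = B.predict(UB)                # B labels all unlabeled docs
        A = MultinomialNB().fit(np.vstack([XA, UA]),
                                np.concatenate([y, yu]))
    return A, B
```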