[Slide 1]
Combining labeled and unlabeled data for text categorization with a large number of categories
Rayid Ghani
KDD Lab Project
[Slide 2]
Supervised Learning with Labeled Data
Labeled data is required in large quantities and can be very expensive to collect.
[Slide 3]
Why use Unlabeled data?
Very cheap in the case of text: web pages, newsgroups, email messages
May not be as useful as labeled data, but is available in enormous quantities
[Slide 4]
Goal
Make learning easier and more efficient by reducing the amount of labeled data required for text classification with a large number of categories
[Slide 5]
• ECOC: very accurate and efficient for text categorization with a large number of classes
• Co-Training: useful for combining labeled and unlabeled data with a small number of classes
[Slide 6]
Related research with unlabeled data
Using EM in a generative model (Nigam et al. 1999)
Transductive SVMs (Joachims 1999)
Co-Training type algorithms (Blum & Mitchell 1998, Collins & Singer 1999, Nigam & Ghani 2000)
[Slide 7]
What is ECOC?
Solve multiclass problems by decomposing them into multiple binary problems (Dietterich & Bakiri 1995)
Use a learner to learn the binary problems
[Slide 8]

Training ECOC

|   | f1 | f2 | f3 | f4 | f5 |
|---|----|----|----|----|----|
| A | 0  | 0  | 1  | 1  | 0  |
| B | 1  | 0  | 1  | 0  | 0  |
| C | 0  | 1  | 1  | 1  | 0  |
| D | 0  | 1  | 0  | 0  | 1  |

Testing ECOC

| X | 1  | 1  | 1  | 1  | 0  |
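To make the codeword table concrete, here is a minimal ECOC sketch in Python. It is an illustration, not the original implementation: the choice of scikit-learn's MultinomialNB as the binary learner, the dense word-count input, and all function names are my assumptions.

```python
# Minimal ECOC sketch: one binary learner per codeword column,
# decoding by Hamming distance to the class codewords.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Codeword matrix from the slide: rows = classes A-D, columns = f1-f5.
CODES = np.array([
    [0, 0, 1, 1, 0],   # A
    [1, 0, 1, 0, 0],   # B
    [0, 1, 1, 1, 0],   # C
    [0, 1, 0, 0, 1],   # D
])
CLASSES = ["A", "B", "C", "D"]

def train_ecoc(X, y):
    """Train one binary naive Bayes classifier per codeword column.
    X: dense word-count matrix; y: list of class names (assumes every
    column sees training examples of both bit values)."""
    learners = []
    for col in range(CODES.shape[1]):
        bits = [CODES[CLASSES.index(label), col] for label in y]
        clf = MultinomialNB()
        clf.fit(X, bits)
        learners.append(clf)
    return learners

def predict_ecoc(learners, X):
    """Predict one bit per learner, then decode to the nearest codeword."""
    bits = np.column_stack([clf.predict(X) for clf in learners])
    # Hamming distance from each predicted bit vector to every codeword.
    dists = (bits[:, None, :] != CODES[None, :, :]).sum(axis=2)
    return [CLASSES[i] for i in dists.argmin(axis=1)]
```

For the test row above, the predicted bit vector 1 1 1 1 0 is closest to C's codeword 0 1 1 1 0 (Hamming distance 1), so X would be decoded as class C.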
[Slide 9]
The Co-training algorithm
Loop (while unlabeled documents remain):
  Build classifiers A and B (using naïve Bayes)
  Classify the unlabeled documents with A and B
  Add the most confident A predictions and the most confident B predictions as labeled training examples
[Blum & Mitchell 1998]
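A compact sketch of this loop, under several assumptions of mine: two dense word-count views (A and B), naïve Bayes as the base learner, and a simplified "k most confident picks per classifier" selection rather than the original fixed numbers of positive and negative picks drawn from a smaller pool.

```python
# Sketch of the co-training loop described above (illustrative names).
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_train(Xa_lab, Xb_lab, y_lab, Xa_unlab, Xb_unlab, k=1, max_iter=50):
    # Work on copies so the caller's pools are left untouched.
    Xa_lab, Xb_lab, y_lab = list(Xa_lab), list(Xb_lab), list(y_lab)
    Xa_unlab, Xb_unlab = list(Xa_unlab), list(Xb_unlab)
    clf_a, clf_b = MultinomialNB(), MultinomialNB()
    for _ in range(max_iter):
        if not Xa_unlab:                      # no unlabeled documents remain
            break
        # Build classifiers A and B (naive Bayes) from the labeled pool.
        clf_a.fit(np.array(Xa_lab), y_lab)
        clf_b.fit(np.array(Xb_lab), y_lab)
        # Classify the unlabeled documents with A and B.
        proba_a = clf_a.predict_proba(np.array(Xa_unlab))
        proba_b = clf_b.predict_proba(np.array(Xb_unlab))
        # Take the k most confident predictions of each classifier
        # (A's choice wins if both pick the same document).
        picks = {}
        for proba, clf in ((proba_b, clf_b), (proba_a, clf_a)):
            for i in proba.max(axis=1).argsort()[-k:]:
                picks[int(i)] = clf.classes_[proba[i].argmax()]
        # Move the picked documents into the labeled pool
        # (highest index first so earlier pops don't shift later ones).
        for i in sorted(picks, reverse=True):
            Xa_lab.append(Xa_unlab.pop(i))
            Xb_lab.append(Xb_unlab.pop(i))
            y_lab.append(picks[i])
    # Final classifiers trained on the enlarged labeled pool.
    clf_a.fit(np.array(Xa_lab), y_lab)
    clf_b.fit(np.array(Xb_lab), y_lab)
    return clf_a, clf_b
```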
[Slide 10]
The Co-training Algorithm
[Diagram: learn naïve Bayes classifiers on views A and B from the labeled data; each estimates labels for the unlabeled documents; the most confident predictions from each are selected and added to the labeled data, and the loop repeats]
[Blum & Mitchell, 1998]
[Slide 11]
One Intuition behind co-training
A and B are redundant: A's features are independent of B's features
Co-training is like learning with random classification noise:
  the most confident A prediction gives B an (almost) random example
  with only a small chance of being mislabeled (A's error)
[Slide 12]
ECOC + CoTraining = ECoTrain
ECOC decomposes multiclass problems into binary problems
Co-Training works great with binary problems
ECOC + Co-Train = Learn each binary problem in ECOC with Co-Training
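A sketch of the combination, reusing CODES, CLASSES, and co_train() from the sketches above. How the two views' predictions are merged into one bit per column at test time is my own choice (averaging the posteriors); the slides do not specify it.

```python
# Sketch of ECoTrain: each ECOC codeword column is a binary problem,
# and each binary problem is learned with co-training over the two views.
import numpy as np

def train_ecotrain(Xa_lab, Xb_lab, y_lab, Xa_unlab, Xb_unlab):
    learners = []
    for col in range(CODES.shape[1]):
        # Relabel the labeled documents with this column's bit.
        bits = [int(CODES[CLASSES.index(label), col]) for label in y_lab]
        learners.append(co_train(Xa_lab, Xb_lab, bits, Xa_unlab, Xb_unlab))
    return learners

def predict_ecotrain(learners, Xa, Xb):
    cols = []
    for clf_a, clf_b in learners:
        # Average the two views' probabilities of the bit being 1; threshold at 0.5.
        pa = clf_a.predict_proba(Xa)[:, list(clf_a.classes_).index(1)]
        pb = clf_b.predict_proba(Xb)[:, list(clf_b.classes_).index(1)]
        cols.append(((pa + pb) / 2 > 0.5).astype(int))
    bits = np.column_stack(cols)
    # Decode exactly as in plain ECOC: nearest codeword by Hamming distance.
    dists = (bits[:, None, :] != CODES[None, :, :]).sum(axis=2)
    return [CLASSES[i] for i in dists.argmin(axis=1)]
```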
[Slide 13]

[Diagram: example categories such as SPORTS, SCIENCE, ARTS, HEALTH, POLITICS, LAW]
[Slide 14]

What happens with sparse data?

[Plot: percent decrease in error (roughly 30–70%) vs. training set size (0–100), for 15-bit, 31-bit, and 63-bit codes]
[Slide 15]
Datasets
Hoovers-255: a collection of 4285 corporate websites; each company is classified into one of 255 categories; baseline 2%
Jobs-65 (from WhizBang): job postings with two feature sets (title, description); 65 categories; baseline 11%
[Slide 16]

[Chart: percentage of examples per class; y-axis 0–12%]
[Slide 17]
Results
| Dataset | Naïve Bayes (no unlabeled data), 10% labeled | Naïve Bayes (no unlabeled data), 100% labeled | ECOC (no unlabeled data), 10% labeled | ECOC (no unlabeled data), 100% labeled | EM, 10% labeled | Co-Training, 10% labeled | ECOC + Co-Training, 10% labeled |
|---|---|---|---|---|---|---|---|
| Jobs-65 | 50.1 | 68.2 | 59.3 | 71.2 | 58.2 | 54.1 | 64.5 |
| Hoovers-255 | 15.2 | 32.0 | 24.8 | 36.5 | 9.1 | 10.2 | 27.6 |
[Slide 18]
Results
[Plot: precision vs. recall curves for ECOC + CoTrain, Naïve Bayes, and EM]
[Slide 19]
What Next?
Use an improved version of co-training (gradient descent): less prone to random fluctuations; uses all unlabeled data at every iteration
Use Co-EM (Nigam & Ghani 2000) - a hybrid of EM and Co-Training
[Slide 20]
Summary
[Slide 21]
The Co-training setting
[Diagram: example web pages about people (Tom Mitchell: "Fredkin Professor of AI…"; Avrim Blum: "My research interests are…"; Johnny: "I like horsies!"), each also described by the text of hyperlinks pointing to it ("…My advisor…", "…Professor Blum…", "…My grandson…"); Classifier A uses one feature set and Classifier B uses the other]
[Slide 22]
Learning from Labeled and Unlabeled Data:
Using Feature Splits
Co-training [Blum & Mitchell 98]
Meta-bootstrapping [Riloff & Jones 99]
coBoost [Collins & Singer 99]
Unsupervised WSD [Yarowsky 95]
Consider this the co-training setting
[Slide 23]
Learning from Labeled and Unlabeled Data:
Extending supervised learning
MaxEnt Discrimination [Jaakkola et al. 99]
Expectation Maximization [Nigam et al. 98]
Transductive SVMs [Joachims 99]
[Slide 24]
Using Unlabeled Data with EM

Estimate labels of the unlabeled documents
Use all documents to build a new naïve Bayes classifier
[Slide 25]
Co-training vs. EM
Co-training: uses the feature split; incremental labeling; hard labels
EM: ignores the feature split; iterative labeling; probabilistic labels
Which differences matter?
[Slide 26]
Hybrids of Co-training and EM
| Labeling | Uses feature split: Yes | Uses feature split: No |
|---|---|---|
| Incremental | co-training | self-training |
| Iterative | co-EM | EM |

[Diagram: each variant learns naïve Bayes on view A, view B, or A & B together, labels the unlabeled documents (all of them, or only the best predictions), and learns from all the data]
[Slide 27]
Learning from Unlabeled Data using Feature Splits
coBoost [Collins & Singer 99]
Meta-bootstrapping [Riloff & Jones 99]
Unsupervised WSD [Yarowsky 95]
Co-training [Blum & Mitchell 98]
[Slide 28]
Intuition behind Co-training
A and B are redundant: A's features are independent of B's features
Co-training is like learning with random classification noise:
  the most confident A prediction gives B an (almost) random example
  with only a small chance of being mislabeled (A's error)
[Slide 29]
Using Unlabeled Data with EM
Initially, learn from the labeled data only
Estimate labels of the unlabeled documents
Use all documents to rebuild the naïve Bayes classifier

[Nigam, McCallum, Thrun & Mitchell, 1998]
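A sketch of this EM loop, assuming dense count matrices and scikit-learn's MultinomialNB. Approximating the probabilistic labels by replicating each unlabeled document once per class with a posterior-weighted sample weight is my own device; the original maintains fractional counts directly.

```python
# Sketch of EM with naive Bayes: initialize from the labeled documents,
# then alternate estimating labels (E) and rebuilding the classifier
# from all documents (M).
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def em_naive_bayes(X_lab, y_lab, X_unlab, n_iter=10):
    clf = MultinomialNB()
    clf.fit(X_lab, y_lab)              # initially learn from labeled data only
    classes = clf.classes_
    for _ in range(n_iter):
        # E-step: probabilistic labels for the unlabeled documents.
        proba = clf.predict_proba(X_unlab)
        # M-step: rebuild naive Bayes from all documents; each unlabeled
        # document appears once per class, weighted by its posterior.
        X_all = np.vstack([X_lab] + [X_unlab] * len(classes))
        y_all = np.concatenate([y_lab] + [np.full(len(X_unlab), c) for c in classes])
        w_all = np.concatenate([np.ones(len(X_lab))] +
                               [proba[:, j] for j in range(len(classes))])
        clf = MultinomialNB()
        clf.fit(X_all, y_all, sample_weight=w_all)
    return clf
```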
[Slide 30]
Co-EM
[Diagram: initialize with the labeled data; naïve Bayes on view A estimates labels for all unlabeled documents, then naïve Bayes on view B is built with all the data; B estimates labels, then A is rebuilt with all the data; and so on]

| | Uses feature split: No | Uses feature split: Yes |
|---|---|---|
| Label all | EM | co-EM |
| Label few | | co-training |
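A sketch of the co-EM loop as described above, with two simplifications of mine: hard labels instead of the probabilistic labels the actual algorithm uses, and scikit-learn naïve Bayes as a stand-in base learner.

```python
# Sketch of co-EM: like EM, every round labels *all* the unlabeled documents;
# like co-training, the feature split is kept, and each view's classifier
# labels the data that the other view then learns from.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_em(Xa_lab, Xb_lab, y_lab, Xa_unlab, Xb_unlab, n_iter=10):
    clf_a, clf_b = MultinomialNB(), MultinomialNB()
    # Initialize with the labeled data.
    clf_a.fit(Xa_lab, y_lab)
    clf_b.fit(Xb_lab, y_lab)
    for _ in range(n_iter):
        # A labels all the unlabeled documents; B is rebuilt from all data.
        y_hat = clf_a.predict(Xa_unlab)
        clf_b.fit(np.vstack([Xb_lab, Xb_unlab]), np.concatenate([y_lab, y_hat]))
        # B labels all the unlabeled documents; A is rebuilt from all data.
        y_hat = clf_b.predict(Xb_unlab)
        clf_a.fit(np.vstack([Xa_lab, Xa_unlab]), np.concatenate([y_lab, y_hat]))
    return clf_a, clf_b
```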