Partially Supervised Classification of Text Documents
by Bing Liu, Philip Yu, and Xiaoli Li
Presented by: Rick Knowles
7 April 2005
Agenda
  Problem Statement
  Related Work
  Theoretical Foundations
  Proposed Technique
  Evaluation
  Conclusions
Problem Statement: Common Approach
  Text categorization: automated assignment of text documents to pre-defined classes
  Common approach: supervised learning
    Manually label a set of documents with pre-defined classes
    Use a learning algorithm to build a classifier
[Figure: labeled training documents, positive (+) and negative (_)]
Problem Statement: Common Approach (cont.)
  Problem: the bottleneck of labeling the large number of training documents needed to build the classifier
  Nigam et al. have shown that using a large dose of unlabeled data can help
[Figure: a few labeled positive (+) and negative (_) documents among many unlabeled (.) documents]
A different approach: partially supervised classification
  Two-class problem: positive and unlabeled
  Key feature is that there are no labeled negative documents
  Can be posed as a constrained optimization problem
    Develop a function that correctly classifies all positive docs and minimizes the number of mixed docs classified as positive
    Such a function will have an expected error rate of no more than ...
  Example: finding matching (i.e., positive) documents in a large collection such as the Web
    Matching documents are positive
    All others are negative
[Figure: positive (+) documents and unlabeled (.) documents; no labeled negatives]
Related Work
  Text classification techniques
    Naïve Bayesian
    K-nearest neighbor
    Support vector machines
    Each requires labeled data for all classes
  Problem is similar to traditional information retrieval
    Rank-orders documents according to their similarities to the query document
    Does not perform document classification
Theoretical Foundations
  Some discussion regarding the theoretical foundations. Focused primarily on
    Minimization of the probability of error
    Expected recall and precision of functions

$$\Pr[f(X) \neq Y] = \Pr[f(X)=1] - \Pr[Y=1] + 2\Pr[f(X)=0 \mid Y=1]\,\Pr[Y=1] \tag{1}$$

  Painful, painful… but it did show you can build accurate classifiers with high probability when sufficient documents in P (the positive document set) and M (the unlabeled set) are available.
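A short derivation of (1), added here as a sketch. It assumes only that f(X) and Y both take values in {0,1}, and it is written out for the reader rather than taken from the slides:

```latex
\begin{align*}
\Pr[f(X) \neq Y] &= \Pr[f(X)=1, Y=0] + \Pr[f(X)=0, Y=1] \\
\Pr[f(X)=1, Y=0] &= \Pr[f(X)=1] - \Pr[f(X)=1, Y=1] \\
                 &= \Pr[f(X)=1] - \bigl(\Pr[Y=1] - \Pr[f(X)=0, Y=1]\bigr) \\
\Rightarrow\ \Pr[f(X) \neq Y] &= \Pr[f(X)=1] - \Pr[Y=1] + 2\Pr[f(X)=0, Y=1] \\
                 &= \Pr[f(X)=1] - \Pr[Y=1] + 2\Pr[f(X)=0 \mid Y=1]\,\Pr[Y=1]
\end{align*}
```

Since Pr[Y=1] does not depend on f, keeping Pr[f(X)=0 | Y=1] small (high recall on positives) while minimizing Pr[f(X)=1] keeps the error small, which matches the constrained optimization view stated earlier.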
Theoretical Foundations (cont.)
  Two serious practical drawbacks to the theoretical method
    Constrained optimization problem may not be easy to solve for the function class in which we are interested
    Not easy to choose a desired recall level that will give a good classifier using the function class we are using
Proposed Technique
  Theory be darned! Paper introduces a practical technique based on the naïve Bayes classifier and the Expectation-Maximization (EM) algorithm
  After introducing a general technique, the authors offer an enhancement using spies
Proposed Technique: Terms
  D is the set of training documents
  V = <w1, w2, …, w|V|> is the set of all words considered for classification
  wdi,k is the word in position k in document di
  N(wt, di) is the number of times wt occurs in di
  C = {c1, c2} is the set of predefined classes
  P is the set of positive documents
  M is the mixed set of unlabeled documents
  S is the set of spy documents
  Posterior probability Pr[cj|di] ∈ {0,1} depends on the class label of the document
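As a small illustration of the terms above, the following sketch builds V and the counts N(wt, di) for two toy documents. The function names `build_vocabulary` and `count_matrix` and the toy documents are hypothetical, not from the paper.

```python
from collections import Counter

def build_vocabulary(docs):
    """Return V: the sorted list of distinct words over all tokenized docs."""
    return sorted({word for doc in docs for word in doc})

def count_matrix(docs, vocab):
    """Return counts where counts[i][t] = N(w_t, d_i)."""
    rows = []
    for doc in docs:
        tally = Counter(doc)
        rows.append([tally[word] for word in vocab])
    return rows

# Toy documents (assumed, already tokenized).
docs = [["cheap", "pills", "cheap"], ["meeting", "notes"]]
V = build_vocabulary(docs)
N = count_matrix(docs, V)
```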
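Proposed Technique: naïve Bayesian classifier (NB-C)

$$\Pr[c_j] = \frac{\sum_{i=1}^{|D|} \Pr[c_j \mid d_i]}{|D|} \tag{2}$$

$$\Pr[w_t \mid c_j] = \frac{1 + \sum_{i=1}^{|D|} \Pr[c_j \mid d_i]\, N(w_t, d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} \Pr[c_j \mid d_i]\, N(w_s, d_i)} \tag{3}$$

and, assuming the words are independent given the class,

$$\Pr[c_j \mid d_i] = \frac{\Pr[c_j] \prod_{k=1}^{|d_i|} \Pr[w_{d_i,k} \mid c_j]}{\sum_{r=1}^{|C|} \Pr[c_r] \prod_{k=1}^{|d_i|} \Pr[w_{d_i,k} \mid c_r]} \tag{4}$$

  The class with the highest Pr[cj|di] is assigned as the class of the doc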
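Equations (2) to (4) can be sketched in plain Python as below. This is a minimal illustration, not the authors' code: the names `estimate_parameters` and `posterior` are hypothetical, documents are assumed to be given as count rows `counts[i][t] = N(w_t, d_i)`, and the product over word positions in (4) is computed as a sum of logs over the vocabulary, which gives the same value.

```python
import math

def estimate_parameters(counts, post):
    """Eq. (2) and (3): class priors and smoothed word probabilities.

    counts[i][t] = N(w_t, d_i); post[i][j] = Pr[c_j | d_i].
    """
    n_docs, n_words, n_classes = len(counts), len(counts[0]), len(post[0])
    prior = [sum(post[i][j] for i in range(n_docs)) / n_docs
             for j in range(n_classes)]
    word_prob = []
    for j in range(n_classes):
        mass = [sum(post[i][j] * counts[i][t] for i in range(n_docs))
                for t in range(n_words)]
        total = n_words + sum(mass)          # |V| + weighted word count
        word_prob.append([(1 + m) / total for m in mass])
    return prior, word_prob

def posterior(row, prior, word_prob):
    """Eq. (4): Pr[c_j | d] for one document given as a count row."""
    scores = []
    for p_class, probs in zip(prior, word_prob):
        if p_class <= 0:
            scores.append(float("-inf"))     # class with no mass
        else:
            scores.append(math.log(p_class)
                          + sum(n * math.log(p) for n, p in zip(row, probs)))
    top = max(scores)
    unnorm = [math.exp(s - top) for s in scores]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Toy data (assumed): two words, one document per class, hard labels.
counts = [[2, 0], [0, 2]]
post = [[1.0, 0.0], [0.0, 1.0]]
prior, word_prob = estimate_parameters(counts, post)
p = posterior([2, 0], prior, word_prob)
```

Working in log space avoids underflow when documents are long; the "1 +" and "|V| +" terms are the smoothing already present in (3).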
Proposed Technique: EM algorithm
  Popular class of iterative algorithms for maximum likelihood estimation in problems with incomplete data
  Two steps
    Expectation: fills in the missing data
    Maximization: parameters are estimated
    Rinse and repeat
  Using a NB-C, (4) equates to the E step, and (2) and (3) are the M step
  Probability of a class now takes a value in [0,1] instead of {0,1}
Proposed Technique: EM algorithm (cont.)
  All positive documents have the class value c1
  Need to determine the class value of each doc in the mixed set
  EM can help assign a probabilistic class label to each document dj in the mixed set
    Pr[c1|dj] and Pr[c2|dj]
  After a number of iterations, all the probabilities will converge
Proposed Technique: Step 1 - Reinitialization (I-EM)
  Build an initial NB-C using the document sets M and P
    For each document dj in P, Pr[c1|dj] = 1 and Pr[c2|dj] = 0
    For each document dj in M, Pr[c1|dj] = 0 and Pr[c2|dj] = 1
  Loop while classifier parameters change
    For each document dj ∈ M
      Compute Pr[c1|dj] using the current NB-C
      Pr[c2|dj] = 1 - Pr[c1|dj]
    Update Pr[wt|c1] and Pr[c1] given the probabilistically assigned class for dj (Pr[c1|dj]) and P (a new NB-C is being built in the process)
  Works well on easy datasets
  Problem is that our initialization is strongly biased towards positive documents
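The I-EM loop above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: documents are count rows over a shared vocabulary, the helper names `fit`, `prob_c1` and `i_em` are hypothetical, and "loop while classifier parameters change" is approximated by stopping when the labels of M move by less than `tol` or after `max_iter` passes.

```python
import math

def fit(counts, c1_prob):
    """M step: Pr[c1] and smoothed Pr[w_t | c_j] from soft labels."""
    n_words = len(counts[0])
    prior1 = sum(c1_prob) / len(counts)
    word_prob = []
    for weights in (c1_prob, [1 - p for p in c1_prob]):
        mass = [sum(w * row[t] for w, row in zip(weights, counts))
                for t in range(n_words)]
        total = n_words + sum(mass)
        word_prob.append([(1 + m) / total for m in mass])
    return prior1, word_prob

def prob_c1(row, prior1, word_prob):
    """E step for one document: Pr[c1 | d] under the current NB-C."""
    def log_score(prior, probs):
        if prior <= 0:
            return float("-inf")
        return math.log(prior) + sum(n * math.log(p)
                                     for n, p in zip(row, probs))
    s1 = log_score(prior1, word_prob[0])
    s2 = log_score(1 - prior1, word_prob[1])
    top = max(s1, s2)
    e1, e2 = math.exp(s1 - top), math.exp(s2 - top)
    return e1 / (e1 + e2)

def i_em(pos, mixed, max_iter=8, tol=1e-6):
    """Run I-EM; pos and mixed are non-empty lists of count rows."""
    counts = pos + mixed
    labels = [0.0] * len(mixed)              # every doc in M starts in c2
    prior1, word_prob = fit(counts, [1.0] * len(pos) + labels)
    for _ in range(max_iter):
        new_labels = [prob_c1(row, prior1, word_prob) for row in mixed]
        prior1, word_prob = fit(counts, [1.0] * len(pos) + new_labels)
        change = max(abs(a - b) for a, b in zip(new_labels, labels))
        labels = new_labels
        if change < tol:
            break
    return prior1, word_prob, labels

# Toy data (assumed): word 0 marks positives, word 1 marks negatives.
pos = [[3, 0], [2, 0]]
mixed = [[2, 0], [0, 3], [0, 2]]
prior1, word_prob, labels = i_em(pos, mixed)
```

Documents in P keep Pr[c1|d] = 1 throughout; only the labels of M are re-estimated on each pass.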
Proposed Technique: Step 1 - Spies
  Problem is that our initialization is strongly biased towards positive documents
  Need to identify some very likely negative documents from the mixed set
  We do this by taking “spy” documents from the positive set P and putting them in the mixed set M (10% of P was used)
  A threshold t is set, and those documents with a probabilistic label less than t are identified as negative
    15% was the threshold used
[Figure: spy technique. Diagram labels: c1 (positive), c2 (mix, spies); then c1 (positive, spies), c2 (likely negative), unlabeled]
Proposed Technique: Step 1 - Spies (cont)
  N = U = ∅ (N: most likely negative docs; U: unlabeled docs)
  S = sample(P, s%) (S: spies)
  MS = M ∪ S
  P = P - S
  Assign every document di in P the class c1
  Assign every document dj in MS the class c2
  Run I-EM(MS, P)
  Classify each document dj in MS
  Determine the probability threshold t using S
  For each document dj in M
    If its probability Pr[c1|dj] < t
      N = N ∪ {dj}
    Else
      U = U ∪ {dj}
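The spy step can be sketched as below. Several pieces are assumptions: `score_fn` is a hypothetical callback standing in for "run I-EM(MS, P) and classify each document in MS" (it must return Pr[c1|d] for every document in MS), and the threshold rule, picking t so that roughly a `noise` fraction of spies score below it, is one plausible reading of "determine the probability threshold t using S" with the 15% figure from the slides.

```python
import random

def spy_step(pos, mixed, score_fn, spy_ratio=0.10, noise=0.15, seed=0):
    """Split M into likely negatives N and remaining unlabeled U."""
    rng = random.Random(seed)
    n_spies = max(1, round(spy_ratio * len(pos)))
    spy_idx = set(rng.sample(range(len(pos)), n_spies))
    spies = [d for i, d in enumerate(pos) if i in spy_idx]      # S
    rest = [d for i, d in enumerate(pos) if i not in spy_idx]   # P - S
    ms = mixed + spies                                          # MS = M ∪ S

    scores = score_fn(rest, ms)              # Pr[c1 | d] for each d in MS
    spy_scores = sorted(scores[len(mixed):])
    # Assumed rule: about `noise` of the spies fall strictly below t.
    t = spy_scores[int(noise * len(spy_scores))]

    likely_neg, unlabeled = [], []
    for doc, score in zip(mixed, scores[:len(mixed)]):
        if score < t:
            likely_neg.append(doc)           # N = N ∪ {dj}
        else:
            unlabeled.append(doc)            # U = U ∪ {dj}
    return likely_neg, unlabeled, spies, t

# Toy usage (assumed): each "document" is just its own score.
pos = [0.9, 0.8, 0.95, 0.85, 0.7, 0.99, 0.75, 0.88, 0.92, 0.81]
mixed = [0.05, 0.1, 0.999, 0.6]
likely_neg, unlabeled, spies, t = spy_step(
    pos, mixed, score_fn=lambda rest, ms: list(ms))
```

The idea is that spies are known positives, so their scores show how true positives hidden in M behave; documents scoring below nearly all spies are treated as likely negatives.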
Proposed Technique: Step 2 - Building the final classifier
  Use P, N and U as developed in the previous step
  Put all the spy documents S back in P
  Assign Pr[c1|di] = 1 for all documents in P
  Assign Pr[c2|di] = 1 for all documents in N. This will change with each iteration of EM
  Each doc dk in U is not assigned a label initially. At the end of the first iteration, it will have a probabilistic label Pr[c1|dk]
  Run EM using the document sets P, N and U until it converges
  When EM stops, the final classifier has been produced
  This two-step technique is called S-EM (Spy EM)
Proposed Technique: Selecting a classifier
  The local maximum that the final classifier converges to may not cleanly separate the positive and negative documents
    Likely if there are many local clusters
  If so, from the set of classifiers developed over each iteration, select the one with the least probability of error
    Refer to (1):

$$\Pr[f(X) \neq Y] = \Pr[f(X)=1] - \Pr[Y=1] + 2\Pr[f(X)=0 \mid Y=1]\,\Pr[Y=1] \tag{1}$$
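As a sanity check on (1), the sketch below compares the right-hand side with the directly counted error rate on a toy labeled sample. The function names and data are hypothetical, and the check uses true labels purely for illustration; in the partially supervised setting the true labels of the mixed set are not available, so the terms have to be estimated instead.

```python
def error_direct(pred, truth):
    """Fraction of documents where f(X) != Y."""
    return sum(p != y for p, y in zip(pred, truth)) / len(truth)

def error_eq1(pred, truth):
    """Right-hand side of (1), computed from empirical frequencies."""
    n = len(truth)
    pr_f1 = sum(pred) / n                       # Pr[f(X)=1]
    pr_y1 = sum(truth) / n                      # Pr[Y=1]
    on_pos = [p for p, y in zip(pred, truth) if y == 1]
    miss_rate = on_pos.count(0) / len(on_pos)   # Pr[f(X)=0 | Y=1]
    return pr_f1 - pr_y1 + 2 * miss_rate * pr_y1

# Toy labels (assumed): 1 = positive, 0 = negative.
truth = [1, 1, 1, 0, 0, 0, 0, 0]
pred = [1, 1, 0, 1, 0, 0, 0, 0]
direct = error_direct(pred, truth)
via_eq1 = error_eq1(pred, truth)
```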
Evaluation: Measurements
  Breakeven point
    The point where p - r = 0, where p is precision and r is recall
    Only evaluates sorting order of class probabilities of documents
    Not appropriate here
  F score
    F = 2pr / (p + r)
    Measures performance on a particular class
    Reflects average effect of both precision and recall
    Only when both p and r are large will F be large
  Accuracy
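A minimal sketch of these measurements for the positive class follows. The function names and toy labels are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here, not something stated in the slides.

```python
def precision_recall_f(pred, truth):
    """Precision, recall and F score for the positive class (label 1)."""
    tp = sum(1 for p, y in zip(pred, truth) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(pred, truth) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(pred, truth) if p == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

def accuracy(pred, truth):
    """Fraction of documents classified correctly, over both classes."""
    return sum(p == y for p, y in zip(pred, truth)) / len(truth)

# Toy labels (assumed): 1 = positive, 0 = negative.
truth = [1, 1, 1, 0, 0, 0, 0, 0]
pred = [1, 1, 0, 1, 0, 0, 0, 0]
p, r, f = precision_recall_f(pred, truth)
acc = accuracy(pred, truth)
```

With few positives, accuracy can stay high even when the positive class is handled poorly, which is why F is reported alongside it.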
Evaluation: Results
  2 large document corpora
    20NG: removed Usenet headers and subject lines
    WebKB: HTML tags removed
  (F) = F score, (A) = accuracy; I-EM8 = I-EM after 8 iterations

           Pos size  M size  Pos in M  NB(F)  NB(A)  I-EM8(F)  I-EM8(A)  S-EM(F)  S-EM(A)
  Average       405    4471       811  43.93  84.52     68.58     87.54    76.61    92.16
Evaluation: Results (cont)
  Also varied the % of positive documents both in P (a%) and in M (b%)

  Setting       Pos size  M size  Pos in M  NB(F)  NB(A)  I-EM8(F)  I-EM8(A)  S-EM(F)  S-EM(A)
  a=20% b=20%        405    3985       324  60.66  94.41     68.08     91.96    76.93    95.96
  a=50% b=20%       1013    3863       203  72.09  95.94     63.63     86.81    73.61    95.28
  a=50% b=50%       1013    4167       507  73.81  93.12     71.25     85.79    81.85    94.32
Conclusions
  This paper studied the problem of classification with only partial information: one class of labeled (positive) documents and a set of mixed documents
  Technique
    Naïve Bayes classifier
    Expectation-Maximization algorithm
    Reinitialized using the positive documents and the most likely negative documents to compensate for bias
    Use an estimate of classification error to select a good classifier
  Accurate results: in the reported averages, S-EM reached an F score of 76.61 and an accuracy of 92.16, versus 43.93 and 84.52 for NB