A Simple Probabilistic Approach to Learning from Positive and Unlabeled Examples
Dell Zhang (BBK) and Wee Sun Lee (NUS)
Problem
Supervised Learning
Problem
Semi-Supervised Learning
Problem
PU Learning
Problem
Unlabeled Examples Help
Problem
PU Learning
To distinguish the interesting instances (the positive class C+) from other instances (the negative class C-) by learning a classifier from a set of positive examples P and a set of unlabeled examples U.
There are no labeled negative examples!
Applications
To automatically filter web pages according to a user's preference
    the browsed or bookmarked pages can be used as positive examples, while unlabeled examples can be easily collected from the web
To automatically find machine learning literature
    the ICML papers can be used as positive examples, while unlabeled examples can be easily collected from the ACM or IEEE digital library
To automatically identify cancer patients
    the patients known to have cancer can be used as positive examples, while unlabeled examples can be easily collected from the patient database
To automatically discover future customers for direct marketing
    the current customers of the company can be used as positive examples, while unlabeled examples can be purchased at a low cost compared with obtaining negative examples
……
Approaches
Existing Approaches
    PNB (Denis et al. 2002); PNCT (Denis et al. 2003)
    S-EM (Liu et al. 2002); RC-SVM (Li & Liu 2003)
    PEBL (Yu et al. 2004); SVMC (Yu 2005)
    PN-SVM (Fung et al. 2005)
    W-LR (Lee & Liu 2003); B-SVM (Liu et al. 2003)
Our Proposed Approach
    B-Pr
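Our Approach
A Probabilistic Model
[Diagram: an example $\mathbf{x} \in C_+$ goes to P with probability $1 - p$ and to U with probability $p$; an example $\mathbf{x} \in C_-$ goes to U with probability 1]
$\Pr[P \mid \mathbf{x}] = \Pr[C_+ \mid \mathbf{x}]\,(1 - p)$
$\Pr[U \mid \mathbf{x}] = \Pr[C_+ \mid \mathbf{x}]\,p + \Pr[C_- \mid \mathbf{x}]$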
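To make the labeling model concrete, here is a minimal sketch (not the authors' code) of how the two observable posteriors follow from the class posterior. It assumes p is the probability that a positive example is left unlabeled, so that a positive example lands in P with probability 1 - p and every negative example lands in U. The function name `pu_posteriors` and the example values are hypothetical.

```python
import math
from typing import Tuple


def pu_posteriors(pos_posterior: float, p: float) -> Tuple[float, float]:
    """Return (Pr[P|x], Pr[U|x]) given Pr[C+|x] and p.

    Assumption: p is the probability that a positive example ends up
    unlabeled; negatives are always unlabeled.
    """
    if not 0.0 <= pos_posterior <= 1.0:
        raise ValueError("pos_posterior must lie in [0, 1]")
    if not 0.0 <= p < 1.0:
        raise ValueError("p must lie in [0, 1)")
    neg_posterior = 1.0 - pos_posterior            # Pr[C-|x]
    pr_P = pos_posterior * (1.0 - p)               # positive and labeled
    pr_U = pos_posterior * p + neg_posterior       # positive but unlabeled, or negative
    return pr_P, pr_U


# Illustrative values only: Pr[C+|x] = 0.8 and p = 0.5.
pr_P, pr_U = pu_posteriors(0.8, 0.5)
assert math.isclose(pr_P + pr_U, 1.0)  # every example is in P or in U
```

The final assertion reflects that, under this model, each example falls in exactly one of P and U, so the two posteriors sum to one.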
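Our Approach
$\Pr[C_+ \mid \mathbf{x}] - \Pr[C_- \mid \mathbf{x}] = \frac{1 + p}{1 - p}\,\Pr[P \mid \mathbf{x}] - \Pr[U \mid \mathbf{x}]$
$f(\mathbf{x}) = \operatorname{sgn}\left(\Pr[C_+ \mid \mathbf{x}] - \Pr[C_- \mid \mathbf{x}]\right)$
$f(\mathbf{x}) = \operatorname{sgn}\left(b\,\Pr[P \mid \mathbf{x}] - \Pr[U \mid \mathbf{x}]\right)$
$b = (1 + p) / (1 - p)$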
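The decision rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: it takes an already estimated Pr[P|x] as input, assumes Pr[U|x] = 1 - Pr[P|x] (every example is in P or in U), assumes b = (1 + p)/(1 - p) with p the probability that a positive example is left unlabeled, and resolves the tie sgn(0) as -1. The names `bias` and `classify` are hypothetical.

```python
def bias(p: float) -> float:
    """Bias factor b = (1 + p) / (1 - p) for 0 <= p < 1."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must lie in [0, 1)")
    return (1.0 + p) / (1.0 - p)


def classify(pr_P_given_x: float, p: float) -> int:
    """Return +1 (class C+) or -1 (class C-) for one example.

    Computes sgn(b * Pr[P|x] - Pr[U|x]); a score of exactly zero is
    mapped to -1, which is an arbitrary convention chosen here.
    """
    pr_U_given_x = 1.0 - pr_P_given_x
    score = bias(p) * pr_P_given_x - pr_U_given_x
    return 1 if score > 0 else -1


# Illustrative values only: with p = 0.5 the bias factor is 3.
print(bias(0.5))
print(classify(0.4, 0.5))  # Pr[C+|x] = 0.4 / (1 - 0.5) = 0.8, so positive
print(classify(0.2, 0.5))  # Pr[C+|x] = 0.2 / (1 - 0.5) = 0.4, so negative
```

Under these assumptions the score equals 2 Pr[C+|x] - 1, so thresholding it at zero is the same as thresholding Pr[C+|x] at 0.5, even though Pr[C+|x] is never estimated directly.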
Our Approach
Biased PrTFIDF (B-Pr)
Estimate $\Pr[P \mid \mathbf{x}]$ and $\Pr[U \mid \mathbf{x}]$
    PrTFIDF (Joachims 1997)
Estimate $b$
    Maximize $pr / \Pr[C_+] = r^2 / \Pr[f(\mathbf{x}) = 1]$ on a held-out validation set (Lee & Liu 2003)
Linear Time Complexity!
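Selecting b on a validation set might look like the following sketch. It is an illustration, not the authors' procedure: it does not implement PrTFIDF and instead takes hypothetical precomputed estimates of Pr[P|x] as plain floats. It further assumes that recall r is estimated on held-out positive examples, that Pr[f(x) = 1] is estimated on held-out unlabeled examples, that Pr[U|x] = 1 - Pr[P|x], and that b is chosen from a small candidate grid. All function names and data values are made up for the example.

```python
from typing import Sequence


def predict(q: float, b: float) -> int:
    """sgn(b * Pr[P|x] - Pr[U|x]) with q = Pr[P|x]; ties map to -1."""
    return 1 if b * q - (1.0 - q) > 0 else -1


def criterion(b: float, q_pos: Sequence[float], q_unl: Sequence[float]) -> float:
    """Validation score r^2 / Pr[f(x) = 1] for a given b."""
    # Recall: fraction of held-out positives predicted positive.
    recall = sum(predict(q, b) == 1 for q in q_pos) / len(q_pos)
    # Fraction of held-out unlabeled examples predicted positive.
    frac_pos = sum(predict(q, b) == 1 for q in q_unl) / len(q_unl)
    if frac_pos == 0.0:
        return 0.0  # nothing predicted positive; treat as the worst score
    return recall ** 2 / frac_pos


def select_b(candidates: Sequence[float],
             q_pos: Sequence[float],
             q_unl: Sequence[float]) -> float:
    """Return the candidate b with the highest validation score."""
    return max(candidates, key=lambda b: criterion(b, q_pos, q_unl))


# Hypothetical estimates of Pr[P|x] on a tiny validation set.
q_pos = [0.6, 0.5, 0.4]           # held-out positive examples
q_unl = [0.6, 0.4, 0.1, 0.05]     # held-out unlabeled examples
best_b = select_b([1.0, 2.0, 20.0], q_pos, q_unl)
print(best_b)
```

Each candidate costs one pass over the validation examples, so a fixed-size grid keeps the selection step linear in the number of validation examples.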
Experiments
Reuters-21578
B-Pr>RC-SVM>PEBL (p=0.55)
RC-SVM>B-Pr>PEBL (p=0.85)
Experiments
20NewsGroups
B-Pr>W-LR>S-EM (p=0.3)
B-Pr>W-LR>S-EM (p=0.7)
Conclusion
A New Approach to Learning from Positive and Unlabeled Examples
    As effective as the state-of-the-art approaches
    Yet simpler and faster
Thank you
Questions? Comments? Suggestions? ……