IDA 2015: Efficient Model Selection for Regularized Classification by Exploiting Unlabeled Data
TRANSCRIPT
Introduction Quantification The proposed approach Experiment Framework Conclusion
Efficient Model Selection for Regularized Classification by Exploiting Unlabeled Data
Georgios Balikas1 Ioannis Partalas2 Eric Gaussier1 Rohit Babbar3 Massih-Reza Amini1
1University Grenoble Alpes
2Viseo R&D
3Max Planck Institute for Intelligent Systems
Intelligent Data Analysis 2015, Saint-Etienne
1/17 Balikas et al. Efficient Model Selection by Exploiting Unlabeled Data
Outline
1 Introduction
2 Quantification
3 The proposed approach
4 Experiment Framework
5 Conclusion
Model selection for text classification
[Diagram: documents Doc1, …, DocN are mapped by feature extraction to vectors d_1, …, d_N ∈ R^d; learning then selects h_θ ∈ H, where θ denotes the hyper-parameters and R(θ) ∈ R the associated error.]
The task
Efficiently select the hyper-parameter value which minimizes the generalization error (using the empirical error as a proxy).
Traditional Model Selection Methods
Valid. Train Train Train Train
Train Valid. Train Train Train
Train Train Train Train Valid.
Figure: 5-fold Cross-Validation
Train Valid.
Figure: Hold-out
Extensions of the above such as Leave-one-out, etc.
M. Mohri et al., Foundations of Machine Learning, MIT Press, 2012
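As a concrete illustration of the splitting scheme behind k-fold CV (a minimal sketch in Python, not the authors' code; function and variable names are ours):

```python
import random

def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_indices, valid_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    fold_size = n_samples // k
    for i in range(k):
        start = i * fold_size
        # the last fold absorbs the remainder when n_samples % k != 0
        end = start + fold_size if i < k - 1 else n_samples
        valid = indices[start:end]
        train = indices[:start] + indices[end:]
        yield train, valid

# Example: 10 samples, 5 folds -> each sample is validated exactly once
folds = list(k_fold_splits(10, k=5))
assert sorted(i for _, valid in folds for i in valid) == list(range(10))
```

Hold-out is the degenerate case of a single train/validation split; either way, each candidate hyper-parameter value requires retraining on every fold, which is what the proposed approach avoids.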
The issues
In large-scale problems:
Resource intensive: ∼10^6–10^8 free parameters; optimized k-CV can take up to several days.
Power-law distribution of examples: only a few instances for small classes, so splitting them results in loss of information.
[Figure: number of labeled documents per class follows a power law.]
R. Babbar, I. Partalas, E. Gaussier, M.-R. Amini, Re-ranking approach to classification in large-scale power-law distributed category systems, SIGIR 2014
Our contribution
We propose a bound that motivates efficient model selection.
Leverages unlabeled data for model selection
Performs on par with (if not better than) traditional methods
Is k times faster than k-fold cross-validation.
Quantification
Definition
In many classification scenarios, the real goal is determining the prevalence of each class in the test set, a task called quantification.
Given a dataset:
? How many people liked the new iPhone?
? How many instances belong to class y_i?
A. Esuli and F. Sebastiani, Optimizing text quantifiers for multivariate loss functions, arXiv preprint arXiv:1502.05491
Quantification using general purpose learners
Classify and Count
Aggregative method
Classify each instance first
Count instances per class
Probabilistic Classify and Count
Also an aggregative method
Get scores/probabilities for each instance
Sum over probabilities/class
G. Forman, Counting positives accurately despite inaccurate classification, ECML 2005
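The two quantifiers above can be sketched as follows (illustrative Python under the assumption that each instance comes with a per-class posterior; function names are ours, not from the slides):

```python
from collections import Counter

def classify_and_count(posteriors):
    """CC: classify each instance first (argmax), then count instances per class."""
    labels = [max(p, key=p.get) for p in posteriors]
    counts = Counter(labels)
    n = len(posteriors)
    return {y: counts[y] / n for y in posteriors[0]}

def probabilistic_classify_and_count(posteriors):
    """PCC: sum the posterior probability mass assigned to each class."""
    n = len(posteriors)
    return {y: sum(p[y] for p in posteriors) / n for y in posteriors[0]}

# Toy example: 3 documents, 2 classes
post = [{"a": 0.9, "b": 0.1}, {"a": 0.6, "b": 0.4}, {"a": 0.2, "b": 0.8}]
print(classify_and_count(post))               # a: 2/3, b: 1/3
print(probabilistic_classify_and_count(post)) # a: 1.7/3, b: 1.3/3
```

PCC smooths the hard decisions of CC: borderline instances contribute fractional mass to every class instead of a whole count to one.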
Our setting
Mono-label, multi-class classification
Observations x ∈ X ⊆ R^d, labels y ∈ Y, |Y| > 2
(x, y) i.i.d. according to a fixed, unknown distribution D over X × Y
Strain = {(x^{(i)}, y^{(i)})}_{i=1}^{N}, S = {x^{(i)}}_{i=N+1}^{M}
Regularized classification: w = arg min_w R_emp(w) + λ Reg(w)
h_θ ∈ H; e.g., for SVMs θ = λ, chosen from a set λ_values
p̂_y, p_y^{C(S)}: the class prior estimated on Strain, and its estimate obtained by quantification on S
Accuracy bound
Theorem
Let S = {x^{(j)}}_{j=1}^{M} be a set generated i.i.d. with respect to D_X, p_y the true prior probability for category y ∈ Y, and p̂_y = N_y / N its empirical estimate obtained on Strain. We consider here a classifier C trained on Strain and we assume that the quantification method used is accurate, in the sense that:

∃ε, ε ≪ min{p_y, p̂_y, p_y^{C(S)}}, ∀y ∈ Y: |p_y^{C(S)} − M_y^{C(S)} / |S|| ≤ ε

where M_y^{C(S)} denotes the number of documents of S assigned to category y by C. Let B_A^{C(S)} be defined as:

B_A^{C(S)} = (1 / |S|) ∑_{y∈Y} min{p̂_y × |S|, p_y^{C(S)} × |S|}

Then for any δ ∈ (0, 1], with probability at least 1 − δ:

A_{C(S)} ≤ B_A^{C(S)} + |Y| (√((log |Y| + log(1/δ)) / (2N)) + ε)
Intuition
B_A^{C(S)} = (1 / |S|) ∑_{y∈Y} min{p̂_y × |S|, p_y^{C(S)} × |S|}

where p̂_y is the empirical prior probability of y and p_y^{C(S)} the estimated probability of y on S.

In power-law distributed category systems this is an upper bound:
– p̂_y will be used for large classes due to false positives, and
– p_y^{C(S)} will be used for small classes due to false negatives.
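The quantity B_A can be sketched in a few lines (our own illustrative code, assuming both prior estimates are available as dictionaries keyed by class):

```python
def accuracy_bound_term(p_hat, p_quant, s_size):
    """B_A = (1/|S|) * sum_y min(p_hat_y * |S|, p_quant_y * |S|).

    p_hat:   empirical class priors estimated on the training set.
    p_quant: class priors estimated by quantification on the unlabeled set S.
    """
    return sum(min(p_hat[y] * s_size, p_quant[y] * s_size) for y in p_hat) / s_size

# Toy example: the per-class minimum penalizes any mismatch between the two estimates
p_hat = {"large": 0.7, "small": 0.3}
p_quant = {"large": 0.75, "small": 0.25}  # quantifier over-predicts the large class
print(accuracy_bound_term(p_hat, p_quant, s_size=1000))  # 0.95
```

Taking the minimum per class means that over-predicting one class necessarily costs mass on another, so B_A reaches 1 only when the quantified priors match the training priors exactly.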
Model selection using the bound

Training Data → Estimate class priors
Quantification on unseen data:

for λ in λ_values do
    Train on Strain
    Estimate p_y^{C(S)} on S
end for

Calculate the bound
Select the hyper-parameter value with the best bound
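The selection loop can be sketched as follows (illustrative: the quantified priors per λ below are made-up stand-ins for the output of training plus quantification at each λ; names are ours):

```python
def select_lambda(lambda_values, p_hat, quantified_priors, s_size):
    """Pick the λ whose trained classifier yields the largest accuracy bound B_A.

    quantified_priors maps each λ to the class priors estimated by
    quantification on the unlabeled set S with the classifier trained at that λ.
    """
    def bound(p_quant):
        return sum(min(p_hat[y], p_quant[y]) * s_size for y in p_hat) / s_size

    return max(lambda_values, key=lambda lam: bound(quantified_priors[lam]))

# Hypothetical run: λ = 1.0 reproduces the training priors best, so it is selected
p_hat = {"a": 0.6, "b": 0.4}
quantified = {
    0.01: {"a": 0.9, "b": 0.1},   # over-regularized: collapses onto the big class
    1.0:  {"a": 0.62, "b": 0.38},
    100:  {"a": 0.4, "b": 0.6},
}
print(select_lambda([0.01, 1.0, 100], p_hat, quantified, s_size=500))  # 1.0
```

Note that each candidate λ requires one training run instead of k, which is where the k-fold speed-up comes from.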
Datasets
Dataset    #Training  #Quantification  #Test  #Features  #Parameters
dmoz250    1,542      2,401            1,023  55,610     13.9M
dmoz500    2,137      3,042            1,356  77,274     38.6M
dmoz1000   6,806      10,785           4,510  138,879    138.8M
dmoz1500   9,039      14,002           5,958  170,828    256.2M
dmoz2500   12,832     19,188           8,342  212,073    530.1M
– Similar experimental settings on Wikipedia data
– SVMs and Logistic Regression, λ ∈ {10^{−4}, …, 10^{4}}
– 5-CV, Hold-out (70%–30%), BoundUN, BoundTest
Results (1/2)
[Plot: accuracy (0.0–0.8) against λ values (10^{−4} to 10^{3}); curves for 5-CV, Hold-out, MaF, CC and PCC.]
Figure: MaF measure optimization for wiki1500 for SVM.
Results (2/2)
           BoundUn          BoundTest        Hold-out                          5-CV
Dataset    Acc     MaF      Acc     MaF      Acc             MaF               Acc     MaF
dmoz250    .8260   .6242    .8270   .6243    .8260 (±.0000)  .6242 (±.0000)    .8260   .6242
dmoz500    .7227   .5584    .7227   .5584    .7221 (±.0005)  .5558 (±.0022)    .7220   .5562
dmoz1000   .7302   .4883    .7302   .4892    .7301 (±.0001)  .4835 (±.0155)    .7299   .4883
dmoz1500   .7132   .4715    .7132   .4715    .6958 (±.0457)  .4065 (±.0998)    .7132   .4715
dmoz2500   .6352   .4301    .6350   .4306    .6350 (±.0001)  .3949 (±.0686)    .6352   .4301
wiki1500 for SVM on 4 cores: BoundUn (302 sec), 5-CV (1310 sec).
Conclusions
– Performs equally well as, or better than, traditional model selection methods.
– Is k times faster than k-CV.
– Requires unlabeled data from the same distribution as the training data.
Thank you
Georgios [email protected]
Ioannis [email protected]
Eric [email protected]
Rohit [email protected]
Massih-Reza [email protected]
This work is partially supported by the CIFRE N 28/2015 and by the LabEx PERSYVAL-Lab ANR-11-LABX-0025.