Full versus incomplete cross-validation
Roman Hornung et al.
Outline: Introduction · Addon procedures · Full versus incomplete CV · New measure CVIIM · Illustration · Summary & Conclusion
Full versus incomplete cross-validation: measuring the impact of imperfect separation between training and test sets in prediction error estimation
Roman Hornung
Joint work with Christoph Bernau, Caroline Truntzer, Thomas Stadler and Anne-Laure Boulesteix
LMU Munich, Department of Medical Informatics, Biometry and Epidemiology
August 28, 2014
1 Introduction
2 Addon procedures
3 Full versus incomplete CV
4 New measure CVIIM
5 Illustration
6 Summary & Conclusion
Introduction
Modern technologies, most prominently microarrays, enable the measurement of the expression of thousands or many thousands of genes for each unit of investigation.
Data sets are generated in which such measurements are collected for units (patients, tissues, ...) affected by a disease of interest and for unaffected controls.
Introduction
These data contain empirical information on (previously) unknown systematic differences between the two groups.
With the aid of classification methods we can thus empirically construct prediction rules for predicting the status (diseased or healthy) of new units.
Due to limited sample sizes and the limited information contained in gene expression, such prediction rules make errors.
Introduction
Our general interest lies in correctly estimating the expected error frequency.
Procedures that tend to produce too-small estimated error frequencies can lead to overoptimistic (⇒ dangerous!) conclusions regarding the performance of a prediction rule.
Prediction rule in general
Sample: S = {(X_1, Y_1), ..., (X_n, Y_n)} ~ P^n
X_i: (long) vector of (raw!) gene expressions ("covariates")
Y_i: status of patient ("class") (i = 1, ..., n)
Prediction rule fitted on "training data" S:
g_S : X → Y = {1, 2}
Predicted status of a new patient with gene expression vector x ∈ X:
g_S(x) = y
Exemplary analysis for obtaining a prediction rule
Data material: 100 DNA microarrays of breast tissues, 50 affected with early-stage breast cancer and 50 unaffected
Analysis:
1 Normalization using the RMA method ⇒ 47,000 expression variables
2 t-test based selection of the 500 most informative variables
3 Cross-validation based selection of the optimal cost parameter for the Support Vector Machine (SVM) classification method
4 Fitting the SVM classification method using the optimized cost parameter
⇒ Three preliminary steps before fitting the actual classification method; these steps are part of the prediction rule g_S(·).
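The steps above can be sketched in miniature (a hypothetical pure-Python illustration, not the authors' code: a nearest-centroid classifier stands in for the SVM, and the normalization and tuning steps are omitted). The point is that the variable indices selected on the training data are baked into the returned rule g_S:

```python
import math

def t_statistic(values0, values1):
    """Two-sample t-statistic (equal-variance form) for one variable."""
    n0, n1 = len(values0), len(values1)
    m0 = sum(values0) / n0
    m1 = sum(values1) / n1
    v0 = sum((v - m0) ** 2 for v in values0) / (n0 - 1)
    v1 = sum((v - m1) ** 2 for v in values1) / (n1 - 1)
    sp = math.sqrt(((n0 - 1) * v0 + (n1 - 1) * v1) / (n0 + n1 - 2))
    return (m1 - m0) / (sp * math.sqrt(1 / n0 + 1 / n1) + 1e-12)

def build_rule(X, y, n_select=2):
    """Fit the whole pipeline on training data S; return g_S as a closure."""
    p = len(X[0])
    # step 2: t-test based variable selection on the TRAINING data only
    scores = []
    for j in range(p):
        col0 = [X[i][j] for i in range(len(X)) if y[i] == 0]
        col1 = [X[i][j] for i in range(len(X)) if y[i] == 1]
        scores.append(abs(t_statistic(col0, col1)))
    selected = sorted(range(p), key=lambda j: -scores[j])[:n_select]
    # step 4: a nearest-centroid classifier standing in for the SVM
    cent = {}
    for c in (0, 1):
        rows = [[X[i][j] for j in selected] for i in range(len(X)) if y[i] == c]
        cent[c] = [sum(col) / len(col) for col in zip(*rows)]
    def g_S(x):
        z = [x[j] for j in selected]  # same variables as in training
        d = {c: sum((a - b) ** 2 for a, b in zip(z, cent[c])) for c in cent}
        return min(d, key=d.get)
    return g_S
```

Because `selected` is computed inside `build_rule`, applying `g_S` to a new patient automatically reuses the training-data selection; this is the behaviour the addon procedures below formalize.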
“Addon procedures” for preliminary steps
In general: for the purpose of prediction, each step performed to obtain a prediction rule has to be carried out on new units as well
Naive approach: 1) pool the training data with the new units, 2) re-perform all preliminary steps, 3) fit the classification method anew on the training data
a) Impossible for steps whose fitting requires the target variable; b) a prediction rule is commonly kept fixed
⇒ All steps have to be integrated into the constructed prediction rule g_S(·).
“Addon procedures” for preliminary steps
“Addon procedures”: new units are subjected to exactly the same procedure as those in the training data, but the new units are not involved in adapting the procedure to the training data.
Example - addon procedure for variable selection:
The same variables are chosen for new units as for those in the training data, but only the training data is used to determine which variables are used.
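The same idea applies to preliminary steps that disregard the target variable. A sketch of an addon for standardization (an illustrative step chosen here for brevity; function names are hypothetical): the means and standard deviations are learned on the training data only and then reused for any new unit.

```python
import math

def fit_standardizer(X_train):
    """Learn per-variable means and SDs on the training data only."""
    cols = list(zip(*X_train))
    means = [sum(c) / len(c) for c in cols]
    # constant variables get SD 1.0 so the transform stays defined
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
           for c, m in zip(cols, means)]
    def transform(x):
        # addon: apply the TRAINING parameters to any new unit
        return [(v - m) / s for v, m, s in zip(x, means, sds)]
    return transform
```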
Estimating the error of prediction rules by cross-validation (CV)
Procedure (in practice: repeat and average the results):
1 Split the data set s into K (approximately) equally sized folds s_1, ..., s_K
2 For k = 1, ..., K: use the units in s \ s_k for constructing the prediction rule and the units in s_k as test data
3 Average the misclassification rates over the K splits from (2)
Incomplete CV: one or more preliminary steps performed before the CV
Correct CV is termed full CV.
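The contrast between full and incomplete CV can be sketched as one loop that differs only in where the preliminary step is fitted (a hypothetical illustration; `select` and `classify` are placeholder interfaces, not the authors' implementation):

```python
def kfold_indices(n, K):
    """Split indices 0..n-1 into K roughly equal folds."""
    folds = [[] for _ in range(K)]
    for i in range(n):
        folds[i % K].append(i)
    return folds

def cv_error(X, y, K, select, classify, incomplete=False):
    """K-fold CV error; `select(X, y)` returns variable indices,
    `classify(Xtr, ytr, Xte)` returns predictions for Xte."""
    folds = kfold_indices(len(X), K)
    if incomplete:  # step done ONCE on the whole data set, before the CV
        sel_all = select(X, y)
    errs = []
    for k in range(K):
        test = folds[k]
        train = [i for i in range(len(X)) if i not in test]
        # full CV: redo the preliminary step inside every fold
        sel = sel_all if incomplete else select([X[i] for i in train],
                                                [y[i] for i in train])
        Xtr = [[X[i][j] for j in sel] for i in train]
        Xte = [[X[i][j] for j in sel] for i in test]
        pred = classify(Xtr, [y[i] for i in train], Xte)
        errs.append(sum(p != y[i] for p, i in zip(pred, test)) / len(test))
    return sum(errs) / K
```

With `incomplete=True` the preliminary step sees the test units before the folds are formed, which is precisely the violation of the training/test separation discussed next.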
Incomplete cross-validation
Implication: part of the prediction rule is already constructed on the whole data set ⇒ violation of the training/test set separation principle of CV
Can lead to severe underestimation of the true error on independent data
Incomplete CV is known to be severely downwardly biased in the case of variable selection (e.g., Ambroise and McLachlan, 2002)
Incomplete cross-validation
To our knowledge, this issue has not previously been examined in the literature for other preliminary steps
Preliminary steps are almost always conducted before the CV
Examples: normalization of gene expression data, imputation of missing values, variable filtering by variance, dichotomization of continuous variables, data-driven determination of the powers of fractional polynomials, sophisticated preprocessing steps for imaging data, ...
No certainty about the extent of the downward bias of incomplete CV with respect to such steps
A new measure of the impact of CV incompleteness (CVIIM)
Full CV can be computationally intensive, and procedures to integrate certain steps into CV are often not implemented.
For some steps, the extent to which the true error is underestimated by incomplete CV is marginal.
Desirable: identify cases where full CV can generally be avoided, and cases where incomplete CV is especially dangerous.
⇒ Development of a simple measure of the degree of bias induced by incomplete CV with respect to specific steps.
A new measure of the impact of CV incompleteness (CVIIM)
Our new measure CVIIM (standing for “Cross-Validation Incompleteness Impact Measure”) is estimated by

CVIIM_{s,n,K} = 1 − (incomplete CV error estimate) / (full CV error estimate)

Set to zero if the incomplete CV error exceeds the full CV error, or if the full CV error equals 0
CVIIM_{s,n,K} ∈ [0, 1]; larger values of CVIIM_{s,n,K} indicate a stronger underestimation of the true error
Interpretation: relative reduction of the estimated error when performing incomplete CV in comparison to full CV
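The estimator and its truncation rule fit in a few lines (a sketch):

```python
def cviim(e_incomplete, e_full):
    """CVIIM = 1 - e_incomplete / e_full, truncated to [0, 1].
    Returns 0 if the full-CV error is 0 or the incomplete error exceeds it."""
    if e_full == 0 or e_incomplete > e_full:
        return 0.0
    return 1.0 - e_incomplete / e_full
```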
A new measure of the impact of CV incompleteness (CVIIM)
Rules of thumb: CVIIM_{s,n,K} ∈ [0, 0.02] ⇒ ~ no bias; ]0.02, 0.1] ⇒ weak bias; ]0.1, 0.2] ⇒ medium bias; ]0.2, 0.4] ⇒ strong bias; ]0.4, 1] ⇒ very strong bias
CVIIM_{s,n,K} depends on the data distribution P ⇒ calculate CVIIM_{s,n,K} for several data sets before drawing general conclusions regarding the investigated step.
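The rules of thumb above map directly onto a small lookup function (a sketch using the stated cut-offs):

```python
def bias_category(cviim_value):
    """Map a CVIIM value in [0, 1] to the rule-of-thumb label."""
    cuts = [(0.02, "~ no bias"), (0.1, "weak bias"), (0.2, "medium bias"),
            (0.4, "strong bias"), (1.0, "very strong bias")]
    for upper, label in cuts:
        if cviim_value <= upper:
            return label
    return "very strong bias"
```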
Illustration
Various real-life data sets, mostly gene-expression data
Investigated preliminary steps:
1 Variable selection based on t-tests
2 Variable filtering by variance
3 Choice of tuning parameters for various classification methods
4 Imputation using a variant of k-Nearest-Neighbors
5 Normalization with the RMA method
6 Principal Component Analysis
Illustration
Considered classification methods: Nearest Shrunken Centroids, (Diagonal) Linear Discriminant Analysis, Random Forests
In each case the (incomplete) CVs were repeated 300 times and the results averaged.
Splitting ratios between the sizes of the training and test sets: 2:1 (3-fold CV), 4:1 (5-fold CV) and 9:1 (10-fold CV)
Overview of used data sets
Name               | samples | variables | % diseased | type of variables | disease
ProstatecTranscr   | 102     | 12,625    | 51%        | transcriptomic    | prostate cancer
HeadNeckcTranscr   | 50      | 22,011    | 50%        | transcriptomic    | head and neck squamous
LungcTranscr       | 100     | 22,277    | 49%        | transcriptomic    | lung adenocarcinoma
SLETranscr         | 36      | 47,231    | 56%        | transcriptomic    | systemic lupus erythematosus
GenitInfCoww0      | 51      | 21        | 71%        | various           | genital infection in cows
GenitInfCoww1      | 51      | 24        | 71%        | various           | genital infection in cows
GenitInfCoww2      | 51      | 27        | 71%        | various           | genital infection in cows
GenitInfCoww3      | 51      | 26        | 71%        | various           | genital infection in cows
GenitInfCoww4      | 51      | 27        | 71%        | various           | genital infection in cows
ProstatecMethyl    | 70      | 222       | 41%        | methylation       | prostate cancer
ColoncTranscr      | 47      | 22,283    | 53%        | transcriptomic    | colon cancer
WilmsTumorTranscr  | 100     | 22,283    | 42%        | transcriptomic    | Wilms’ tumor
Results: Variable selection and variable filtering by variance
[Figure: plots of CVIIM_{s,n,K} (y-axis 0.0 to 1.0) for K = 3, 5, 10; panels: psel = 5, psel = 10, psel = 20, psel = p/2, and psel = p/2 with filtering by variance; data sets: (1) ProstatecTranscr, (2) HeadNeckcTranscr, (3) LungcTranscr, (4) SLETranscr]
Results: Optimization of tuning parameters
[Figure: plots of CVIIM_{s,n,K} (y-axis −0.2 to 0.2) for K = 3, 5, 10; panels: Boosting, kNN, Lasso, NSC, PLS-LDA, Random Forest, SVM; data sets: (1) ProstatecTranscr, (2) HeadNeckcTranscr, (3) LungcTranscr, (4) SLETranscr]
Results: Imputation of missing values, RMA normalization and principal component analysis
[Figure: plots of CVIIM_{s,n,K} (y-axis −0.1 to 0.3) for K = 3, 5, 10; panels: Imputation (data sets GenitInfCoww0, GenitInfCoww1, GenitInfCoww2, GenitInfCoww3, GenitInfCoww4, ProstatecMethyl), Normalization (ColoncTranscr, WilmsTumorTranscr), PCA with 5 components (ProstatecTranscr, HeadNeckcTranscr, LungcTranscr, SLETranscr)]
Summary & Conclusion
Data preparation steps are very often done before CV ⇒ violation of the separation of training and test data ⇒ over-optimistic conclusions possible
Impact very different for different steps; expected to be smaller for steps disregarding the target variable, but this is not necessarily the case (relatively high for PCA in our illustration)
New measure CVIIM to assess this impact
Constantly arising new types of molecular data will requirespecialized data preparation steps, for which the impact ofCV incompleteness will have to be assessed.
Thank you for your attention!
References
Ambroise, C. and McLachlan, G. J. (2002).
Selection bias in gene extraction on the basis of microarray gene-expression data.
Proc. Nat. Acad. Sci. 99, 6562–6566.
Kostka, D. and Spang, R. (2008).
Microarray based diagnosis profits from better documentation of gene expression signatures.
PLoS Computational Biology 4, e22.
Simon, R., Radmacher, M. D., Dobbin, K., and McShane, L. M. (2003).
Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification.
Journal of the National Cancer Institute 95, 14–18.
References
Varma, S. and Simon, R. (2006).
Bias in error estimation when using cross-validation for model selection.
BMC Bioinformatics 7, 91.
Technical Report available online:
Roman Hornung, Christoph Bernau, Caroline Truntzer, Thomas Stadler, and Anne-Laure Boulesteix (2014).
Full versus incomplete cross-validation: measuring the impact of imperfect separation between training and test sets in prediction error estimation.
Department of Statistics, LMU, Technical Report 159.