Full versus incomplete cross-validation
Roman Hornung et al.
Outline: Introduction · Addon procedures · Full versus incomplete CV · New measure CVIIM · Illustration · Summary & Conclusion
Full versus incomplete cross-validation: measuring the impact of imperfect separation between training and test sets in prediction error estimation
Roman Hornung
Joint work with Christoph Bernau, Caroline Truntzer, Thomas Stadler and Anne-Laure Boulesteix
LMU Munich, Department of Medical Informatics, Biometry and Epidemiology
August 28, 2014
1 Introduction
2 Addon procedures
3 Full versus incomplete CV
4 New measure CVIIM
5 Illustration
6 Summary & Conclusion
Introduction
Modern technologies, most prominently microarrays, enable the measurement of the expression of thousands or many thousands of genes for each unit of investigation.
Data sets are generated in which such measurements are collected for units (patients, tissues, ...) affected by a disease of interest and for unaffected controls.
Introduction
These data contain empirical information on (previously) unknown systematic differences between the two groups.
With the aid of classification methods we can thus empirically construct prediction rules for predicting the status (diseased or healthy) of new units.
Due to limited sample sizes and the limited information contained in gene expression, such prediction rules make errors.
Introduction
Our general interest lies in correctly estimating the expected error frequency.
Procedures that tend to produce too-small estimated error frequencies can lead to overoptimistic (⇒ dangerous!) conclusions regarding the performance of a prediction rule.
Prediction rule in general
Sample: S = {(X_1, Y_1), ..., (X_n, Y_n)} ~ P^n
X_i: (long) vector of (raw!) gene expressions ("covariates")
Y_i: status of patient ("class") (i = 1, ..., n)
Prediction rule fitted on "training data" S:
g_S : X → Y = {1, 2}
Predicted status of a new patient with gene expression vector x ∈ X:
g_S(x) = y
Exemplary analysis for obtaining a prediction rule
Data material: 100 DNA microarrays of breast tissues, 50 affected with early-stage breast cancer and 50 unaffected
Analysis:
1 Normalization using the RMA method ⇒ 47,000 expression variables
2 t-test based selection of the 500 most informative variables
3 Cross-validation based selection of the optimal cost parameter for the Support Vector Machine (SVM) classification method
4 Fitting the SVM classification method using the optimized cost parameter
⇒ Three preliminary steps before fitting the actual classification method; these steps are part of the prediction rule g_S(·).
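The steps above can be sketched in miniature (a hypothetical pure-Python illustration, not the authors' code: a nearest-centroid classifier stands in for the SVM, and the normalization and tuning steps are omitted). The point is that the variable indices selected on the training data are baked into the returned rule g_S:

```python
import math

def t_statistic(values0, values1):
    """Two-sample t-statistic (equal-variance form) for one variable."""
    n0, n1 = len(values0), len(values1)
    m0 = sum(values0) / n0
    m1 = sum(values1) / n1
    v0 = sum((v - m0) ** 2 for v in values0) / (n0 - 1)
    v1 = sum((v - m1) ** 2 for v in values1) / (n1 - 1)
    sp = math.sqrt(((n0 - 1) * v0 + (n1 - 1) * v1) / (n0 + n1 - 2))
    return (m1 - m0) / (sp * math.sqrt(1 / n0 + 1 / n1) + 1e-12)

def build_rule(X, y, n_select=2):
    """Fit the whole pipeline on training data S; return g_S as a closure."""
    p = len(X[0])
    # step 2: t-test based variable selection on the TRAINING data only
    scores = []
    for j in range(p):
        col0 = [X[i][j] for i in range(len(X)) if y[i] == 0]
        col1 = [X[i][j] for i in range(len(X)) if y[i] == 1]
        scores.append(abs(t_statistic(col0, col1)))
    selected = sorted(range(p), key=lambda j: -scores[j])[:n_select]
    # step 4: a nearest-centroid classifier standing in for the SVM
    cent = {}
    for c in (0, 1):
        rows = [[X[i][j] for j in selected] for i in range(len(X)) if y[i] == c]
        cent[c] = [sum(col) / len(col) for col in zip(*rows)]
    def g_S(x):
        z = [x[j] for j in selected]  # same variables as in training
        d = {c: sum((a - b) ** 2 for a, b in zip(z, cent[c])) for c in cent}
        return min(d, key=d.get)
    return g_S
```

Because `selected` is computed inside `build_rule`, applying `g_S` to a new patient automatically reuses the training-data selection; this is the behaviour the addon procedures below formalize.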
“Addon procedures” for preliminary steps
In general: for the purpose of prediction, each step performed to obtain a prediction rule has to be carried out on new units as well
Naive approach: 1) pool the training data with the new units, 2) re-perform all preliminary steps, 3) fit the classification method anew on the training data
a) Impossible for steps whose fitting requires the target variable; b) a prediction rule is commonly kept fixed
⇒ All steps have to be integrated into the constructed prediction rule g_S(·).
“Addon procedures” for preliminary steps
“Addon procedures”: new units are subjected to exactly the same procedure as those in the training data, but the new units are not involved in adapting the procedure to the training data.
Example - addon procedure for variable selection:
The same variables are chosen for new units as for those in the training data, but only the training data is used to determine which variables are used.
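The same idea applies to preliminary steps that disregard the target variable. A sketch of an addon for standardization (an illustrative step chosen here for brevity; function names are hypothetical): the means and standard deviations are learned on the training data only and then reused for any new unit.

```python
import math

def fit_standardizer(X_train):
    """Learn per-variable means and SDs on the training data only."""
    cols = list(zip(*X_train))
    means = [sum(c) / len(c) for c in cols]
    # constant variables get SD 1.0 so the transform stays defined
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
           for c, m in zip(cols, means)]
    def transform(x):
        # addon: apply the TRAINING parameters to any new unit
        return [(v - m) / s for v, m, s in zip(x, means, sds)]
    return transform
```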
Estimating the error of prediction rules by cross-validation (CV)
Procedure (in practice: repeat and average the results):
1 Split the data set s into K (approximately) equally sized folds s_1, ..., s_K
2 For k = 1, ..., K: use the units in s \ s_k for constructing the prediction rule and the units in s_k as test data
3 Average the misclassification rates over the K splits from (2)
Incomplete CV: one or more preliminary steps performed before the CV
Correct CV is termed full CV.
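The contrast between full and incomplete CV can be sketched as one loop that differs only in where the preliminary step is fitted (a hypothetical illustration; `select` and `classify` are placeholder interfaces, not the authors' implementation):

```python
def kfold_indices(n, K):
    """Split indices 0..n-1 into K roughly equal folds."""
    folds = [[] for _ in range(K)]
    for i in range(n):
        folds[i % K].append(i)
    return folds

def cv_error(X, y, K, select, classify, incomplete=False):
    """K-fold CV error; `select(X, y)` returns variable indices,
    `classify(Xtr, ytr, Xte)` returns predictions for Xte."""
    folds = kfold_indices(len(X), K)
    if incomplete:  # step done ONCE on the whole data set, before the CV
        sel_all = select(X, y)
    errs = []
    for k in range(K):
        test = folds[k]
        train = [i for i in range(len(X)) if i not in test]
        # full CV: redo the preliminary step inside every fold
        sel = sel_all if incomplete else select([X[i] for i in train],
                                                [y[i] for i in train])
        Xtr = [[X[i][j] for j in sel] for i in train]
        Xte = [[X[i][j] for j in sel] for i in test]
        pred = classify(Xtr, [y[i] for i in train], Xte)
        errs.append(sum(p != y[i] for p, i in zip(pred, test)) / len(test))
    return sum(errs) / K
```

With `incomplete=True` the preliminary step sees the test units before the folds are formed, which is precisely the violation of the training/test separation discussed next.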
Incomplete cross-validation
Implication: part of the prediction rule is already constructed on the whole data set ⇒ violation of the training/test set separation principle of CV
Can lead to severe underestimation of the true error on independent data
Incomplete CV is known to be severely downwardly biased in the case of variable selection (e.g., Ambroise and McLachlan, 2002)
Incomplete cross-validation
To our knowledge, this issue has not previously been examined in the literature for other preliminary steps
Preliminary steps are almost always conducted before the CV
Examples: normalization of gene expression data, imputation of missing values, variable filtering by variance, dichotomization of continuous variables, data-driven determination of the powers of fractional polynomials, sophisticated preprocessing steps for imaging data, ...
No certainty about the extent of the downward bias of incomplete CV with respect to such steps
A new measure of the impact of CV incompleteness (CVIIM)
Full CV can be computationally intensive, and procedures to integrate certain steps into CV are often not implemented.
For some steps, the extent to which the true error is underestimated by incomplete CV is marginal.
Desirable: identify cases where full CV can generally be avoided, and cases where incomplete CV is especially dangerous.
⇒ Development of a simple measure of the degree of bias induced by incomplete CV with respect to specific steps.
A new measure of the impact of CV incompleteness (CVIIM)
Our new measure CVIIM (standing for “Cross-Validation Incompleteness Impact Measure”) is estimated by

CVIIM_{s,n,K} = 1 − (incomplete CV error estimate) / (full CV error estimate)

Set to zero if the incomplete CV error exceeds the full CV error, or if the full CV error equals 0
CVIIM_{s,n,K} ∈ [0, 1]; larger values of CVIIM_{s,n,K} indicate a stronger underestimation of the true error
Interpretation: relative reduction of the estimated error when performing incomplete CV in comparison to full CV
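The estimator and its truncation rule fit in a few lines (a sketch):

```python
def cviim(e_incomplete, e_full):
    """CVIIM = 1 - e_incomplete / e_full, truncated to [0, 1].
    Returns 0 if the full-CV error is 0 or the incomplete error exceeds it."""
    if e_full == 0 or e_incomplete > e_full:
        return 0.0
    return 1.0 - e_incomplete / e_full
```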
A new measure of the impact of CV incompleteness (CVIIM)
Rules of thumb: CVIIM_{s,n,K} ∈ [0, 0.02] ⇒ ~ no bias; ]0.02, 0.1] ⇒ weak bias; ]0.1, 0.2] ⇒ medium bias; ]0.2, 0.4] ⇒ strong bias; ]0.4, 1] ⇒ very strong bias
CVIIM_{s,n,K} depends on the data distribution P ⇒ calculate CVIIM_{s,n,K} for several data sets before drawing general conclusions regarding the investigated step.
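The rules of thumb above map directly onto a small lookup function (a sketch using the stated cut-offs):

```python
def bias_category(cviim_value):
    """Map a CVIIM value in [0, 1] to the rule-of-thumb label."""
    cuts = [(0.02, "~ no bias"), (0.1, "weak bias"), (0.2, "medium bias"),
            (0.4, "strong bias"), (1.0, "very strong bias")]
    for upper, label in cuts:
        if cviim_value <= upper:
            return label
    return "very strong bias"
```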
Illustration
Various real-life data sets, mostly gene-expression data
Investigated preliminary steps:
1 Variable selection based on t-tests
2 Variable filtering by variance
3 Choice of tuning parameters for various classification methods
4 Imputation using a variant of k-Nearest-Neighbors
5 Normalization with the RMA method
6 Principal Component Analysis
Illustration
Considered classification methods: Nearest Shrunken Centroids, (Diagonal) Linear Discriminant Analysis, Random Forests
In each case the (incomplete) CVs were repeated 300 times and the results averaged.
Splitting ratios between the sizes of the training and test sets: 2:1 (3-fold CV), 4:1 (5-fold CV) and 9:1 (10-fold CV)
Overview of used data sets
Name               | samples | variables | % diseased | type of variables | disease
ProstatecTranscr   | 102     | 12,625    | 51%        | transcriptomic    | prostate cancer
HeadNeckcTranscr   | 50      | 22,011    | 50%        | transcriptomic    | head and neck squamous
LungcTranscr       | 100     | 22,277    | 49%        | transcriptomic    | lung adenocarcinoma
SLETranscr         | 36      | 47,231    | 56%        | transcriptomic    | systemic lupus erythematosus
GenitInfCoww0      | 51      | 21        | 71%        | various           | genital infection in cows
GenitInfCoww1      | 51      | 24        | 71%        | various           | genital infection in cows
GenitInfCoww2      | 51      | 27        | 71%        | various           | genital infection in cows
GenitInfCoww3      | 51      | 26        | 71%        | various           | genital infection in cows
GenitInfCoww4      | 51      | 27        | 71%        | various           | genital infection in cows
ProstatecMethyl    | 70      | 222       | 41%        | methylation       | prostate cancer
ColoncTranscr      | 47      | 22,283    | 53%        | transcriptomic    | colon cancer
WilmsTumorTranscr  | 100     | 22,283    | 42%        | transcriptomic    | Wilms’ tumor
Results: Variable selection and variable filtering by variance
[Figure: plots of CVIIM_{s,n,K} (y-axis 0.0 to 1.0) for K = 3, 5, 10; panels: psel = 5, psel = 10, psel = 20, psel = p/2, and psel = p/2 with filtering by variance; data sets: (1) ProstatecTranscr, (2) HeadNeckcTranscr, (3) LungcTranscr, (4) SLETranscr]
Results: Optimization of tuning parameters
[Figure: plots of CVIIM_{s,n,K} (y-axis −0.2 to 0.2) for K = 3, 5, 10; panels: Boosting, kNN, Lasso, NSC, PLS-LDA, Random Forest, SVM; data sets: (1) ProstatecTranscr, (2) HeadNeckcTranscr, (3) LungcTranscr, (4) SLETranscr]
Results: Imputation of missing values, RMA normalization and principal component analysis
[Figure: plots of CVIIM_{s,n,K} (y-axis −0.1 to 0.3) for K = 3, 5, 10; panels: Imputation (data sets GenitInfCoww0, GenitInfCoww1, GenitInfCoww2, GenitInfCoww3, GenitInfCoww4, ProstatecMethyl), Normalization (ColoncTranscr, WilmsTumorTranscr), PCA with 5 components (ProstatecTranscr, HeadNeckcTranscr, LungcTranscr, SLETranscr)]
Summary & Conclusion
Data preparation steps are very often done before CV ⇒ violation of the separation of training and test data ⇒ over-optimistic conclusions possible
Impact very different for different steps; expected to be smaller for steps disregarding the target variable, but this is not necessarily the case (relatively high for PCA in our illustration)
New measure CVIIM to assess this impact
Constantly arising new types of molecular data will requirespecialized data preparation steps, for which the impact ofCV incompleteness will have to be assessed.
Thank you for your attention!
References
Ambroise, C. and McLachlan, G. J. (2002).
Selection bias in gene extraction on the basis of microarray gene-expression data.
Proc. Nat. Acad. Sci. 99, 6562–6566.
Kostka, D. and Spang, R. (2008).
Microarray based diagnosis profits from better documentation of gene expression signatures.
PLoS Computational Biology 4, e22.
Simon, R., Radmacher, M. D., Dobbin, K., and McShane, L. M. (2003).
Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification.
Journal of the National Cancer Institute 95, 14–18.
References
Varma, S. and Simon, R. (2006).
Bias in error estimation when using cross-validation for model selection.
BMC Bioinformatics 7, 91.
Technical Report available online:
Roman Hornung, Christoph Bernau, Caroline Truntzer, Thomas Stadler, and Anne-Laure Boulesteix (2014).
Full versus incomplete cross-validation: measuring the impact of imperfect separation between training and test sets in prediction error estimation.
Department of Statistics, LMU, Technical Report 159.