LMO & Jackknife


Page 1: LMO & Jackknife

If a QSPR/QSAR model has a high average q² in LMO validation, it can reasonably be concluded that the obtained model is robust.

Leave-many-out (LMO) validation:

An internal validation procedure, like LOO.

LMO employs a smaller training set than LOO and can be repeated many more times, owing to the large number of possible combinations when leaving many compounds out of the training set.

With n objects in the data set, form G cancellation groups of equal size (G = n/m_j, with 2 < G < 10).

For a large number of groups: n − m_j objects in the training set and m_j objects in the validation set ⇒ a q² estimate from the m_j left-out objects.
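
As a rough illustration, a minimal MATLAB sketch of one LMO round (the variable names and the MLR fit via pinv are illustrative assumptions, with descriptor matrix D and response vector y):

n = size(D,1);  G = 3;  m = n/G;       % G cancellation groups of equal size
idx = randperm(n);                     % random assignment of objects to groups
press = 0;  ssy = 0;
for g = 1:G
    val = idx((g-1)*m+1 : g*m);        % m objects held out for validation
    trn = setdiff(idx, val);           % n - m objects in the training set
    b = pinv(D(trn,:)) * y(trn);       % MLR coefficients, b = D+ y
    press = press + sum((y(val) - D(val,:)*b).^2);
    ssy   = ssy   + sum((y(val) - mean(y(trn))).^2);
end
q2 = 1 - press/ssy;                    % q2 for this LMO round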

Page 2: LMO & Jackknife

Jackknife:

Training set → a number of subsamples, SubSampNo > G.

Each subsample → SubTrain and SubValid.

⇒ SubSampNo estimations of the model parameters (instead of a time-consuming repetition of the experiment), used along with LMO cross-validation (internal validation).

Page 3: LMO & Jackknife

Figure: LMO scheme for n = 6, m = 2, G = 3. Each split assigns four molecules to a SubTrain set and two to a SubValid set (SubTrain1/SubValid1, ..., SubTrain4/SubValid4, ...). For each split i, q²i is computed over the left-out objects,

    q²i = 1 − Σ(yj − ŷj)² / Σ(yj − ȳ)²

and the individual values are combined into q²TOT. The number of subsamples is much larger than the number of molecules in the training set.

Page 4: LMO & Jackknife

Figure: Jackknife scheme for n = 6, m = 2, G = 3, with the number of subsamples much larger than the number of molecules in the training set: SubTrain1 → b1, SubTrain2 → b2, SubTrain3 → b3, SubTrain4 → b4, ..., SubTrain_sn → b_sn.

Page 5: LMO & Jackknife

>> for i = 1:subSampNo
       PERMUT(i,:) = randperm(Dr);
   end

for i = 1:9                      % 9 subsamples
    PERMUT(i,:) = randperm(6);   % 6 molecules in Train
end

PERMUT =
     6     5     2     4     3     1
     1     6     3     5     2     4
     5     2     6     4     3     1
     5     4     2     1     6     3
     5     4     1     6     2     3
     2     6     5     1     3     4
     1     2     6     5     3     4
     6     2     1     5     4     3
     4     5     1     6     3     2

In each row, the first four columns form SubTrain and the last two form SubValid.

Page 6: LMO & Jackknife

SubTrain sets → b; SubValid sets → q²:

6 5 2 4 → b1    3 1 → q²1
1 6 3 5 → b2    2 4 → q²2
5 2 6 4 → b3    3 1 → q²3
5 4 2 1 → b4    6 3 → q²4
5 4 1 6 → b5    2 3 → q²5
2 6 5 1 → b6    3 4 → q²6
1 2 6 5 → b7    3 4 → q²7
6 2 1 5 → b8    4 3 → q²8
4 5 1 6 → b9    3 2 → q²9

For each SubTrain, the coefficients are obtained from Db = y as b = D⁺y; the individual q²i values combine into q²TOT, and the collected b values can be displayed as a histogram.
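
A minimal MATLAB sketch of this loop (assuming D, y and the PERMUT matrix from the previous slide; pinv gives the pseudoinverse D⁺):

for i = 1:9
    trn = PERMUT(i,1:4);                 % SubTrain: first four columns
    val = PERMUT(i,5:6);                 % SubValid: last two columns
    b(:,i) = pinv(D(trn,:)) * y(trn);    % b = D+ y for this subsample
    q2(i) = 1 - sum((y(val) - D(val,:)*b(:,i)).^2) ...
              / sum((y(val) - mean(y(trn))).^2);
end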


Page 7: LMO & Jackknife

Figure: each SubTrain yields a coefficient vector over the descriptors d1 d2 d3 d4 d5 d6 … dn (6 5 2 4 → b1, 1 6 3 5 → b2, 5 2 6 4 → b3, 5 4 2 1 → b4, 5 4 1 6 → b5, 2 6 5 1 → b6, 1 2 6 5 → b7, 6 2 1 5 → b8, 4 5 1 6 → b9); collecting one coefficient across all subsamples gives, for example, the distribution of b for the 3rd descriptor.

Page 8: LMO & Jackknife

Jackknife on all 31 molecules and all 53 descriptors, 200 subsamples (using MLR).

Figure: b values versus descriptor number, and histograms of b (frequency versus b) for descriptor No 25 and descriptor No 15.

Page 9: LMO & Jackknife

Jackknife on all 31 samples and all 53 descriptors (using MLR).

Figure: histograms of b (frequency versus b) for descriptor No 25 and descriptor No 15; the latter plotted with

>> histfit(bJACK(:,15),20);

Page 10: LMO & Jackknife

What is the probability that 0.0 differs from the population merely by chance?

To determine this probability, all data in the population, and the value 0.0, should be standardized to z:

    z = (x − x̄) / s

Page 11: LMO & Jackknife

>> disttool

z = −1.5

Probability that −1.5 differs from μ by chance.

Page 12: LMO & Jackknife

>> disttool

>> cdf gives the area to the left of z, here 0.0668.

× 2 ⇒ 0.134 = p (two-tailed). The probability that the difference between −1.5 and μ is due to random error is not < 0.05 (p > 0.05), so −1.5 is not significantly different from the population.

p < 0.05 ⇒ significant difference.
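
The same numbers can be reproduced in MATLAB (normcdf is from the Statistics Toolbox; doubling the one-sided area gives the two-tailed p, as above):

z = -1.5;
pLeft = normcdf(z);           % area to the left of z: 0.0668
p = 2 * normcdf(-abs(z));     % two-tailed p: 0.1336 (~0.134)
signif = p < 0.05;            % false: not significantly different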

Page 13: LMO & Jackknife

All descriptors, MLR.

Figure: b versus descriptor number; b versus subsample number; histogram of the q² values; p-value versus descriptor number.

q²TOT = −407.46; descriptors with p < 0.05: 0 ⇒ no significant descriptors.

Page 14: LMO & Jackknife

All descriptors, PLS, lv = 14.

Figure: b versus descriptor number; b versus subsample number; histogram of the q² values; p-value versus descriptor number.

q²TOT = −0.0988; descriptors with p < 0.05: 28 ⇒ 28 significant descriptors.

Page 15: LMO & Jackknife

All descriptors, PLS, lv = 14; q²TOT = −0.0988; descriptors with p < 0.05: 28.

Figure: b versus descriptor number, and p-value versus descriptor number.

Significant descriptors with p < 0.05 can be sorted according to p value, for doing a forward selection:

---------------------------------
Desc No    p
---------------------------------
51         1.4002e-022
37         1.383e-010
35         8.605e-009
38         9.1021e-009
39         1.8559e-008
36         8.7005e-008
15         0.00027689
1          0.00038808
2          0.00040547
45         0.00059674
32         0.00063731
---------------------------------

Page 16: LMO & Jackknife

q²TOT at different numbers of latent variables in PLS (applying all descriptors), from four runs of the program:

lv    run 1    run 2    run 3    run 4
 8   −.0411    .0776   −.0431    .0270
 9    .2200    .2340    .3641    .2576
10    .1721    .1147    .2391    .1434   (37 signif. variables)
11    .2855    .1948    .0667    .2372
12    .1847    .1275    .2390    .2184
13   −.0343   −.1439    .0120    .0049
14   −.2578   −.2460   −.3010   −.0989   (28 signif. variables)

Figure: q²TOT versus number of latent variables (8–14); at high lv, q²TOT drops (overfitting; information ↓).

Page 17: LMO & Jackknife

for lv = 6:13           % number of latent variables in PLS
    for i = lv:18       % number of descriptors, from the p-sorted list
        [p, Z, q2TOTbox(lv,i), q2, bJACK] = ...
            jackknife(D(:,SRTDpDESC(1:i,1)), y, 150, 27, 2, lv);
    end
end

Figure: q²TOT as a function of lv and the number of descriptors.

Max q²TOT at lv = 7 and #desc = 7.
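
A small follow-up sketch (assuming the q2TOTbox matrix filled by the loop above; cells the loop never visits stay zero) to locate that maximum:

[maxq2, ind] = max(q2TOTbox(:));                      % best q2TOT over the grid
[lvBest, nDescBest] = ind2sub(size(q2TOTbox), ind);   % its lv and #descriptors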


Page 18: LMO & Jackknife

D = Dini(:,[38 50 3]);
[q2, bJACK] = jackknife(D, y, 500, 27)

Figure: histogram of the q² values and distributions of the bJACK coefficients for the three descriptors (one of them on a ×10⁻⁵ scale).

Three significant descriptors with p < 0.05, as an example.

Page 19: LMO & Jackknife

[p, Z, q2TOTbox(lv,i), q2, bJACK] = ...
    jackknife(D(:,[34 38 45 51]), y, 150, 27, 2, 7);

[34 38 45 51]: selected descriptors
150: number of subset samples in the jackknife
27: number of samples in the training set of each subset
2: calibration method (1 = MLR; 2 = PLS)
7: number of latent variables in PLS

Besides LMO CV, the jackknife is a method for determining the significant descriptors, as internal validation, and can be applied for descriptor selection.

Page 20: LMO & Jackknife

Exercise:

Apply the jackknife to a selected set of descriptors using MLR, and determine the results and the significance of the descriptors.
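
A possible starting point, assuming the jackknife function of the previous slide (with MLR as the calibration method the latent-variable argument should be irrelevant; passing it anyway is an assumption about the function's interface):

[p, Z, q2TOT, q2, bJACK] = ...
    jackknife(D(:,[34 38 45 51]), y, 150, 27, 1, 0);   % 1 = MLR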

Page 21: LMO & Jackknife

Cross model validation (CMV)

Anderssen et al., Reducing over-optimism in variable selection by cross model validation, Chemom Intell Lab Syst (2006) 84, 69-74.

Validation during variable selection, not posterior to it.

Gidskehaug et al., Cross model validation and optimization of bilinear regression models, Chemom Intell Lab Syst (2008) 93, 1-10.

CMV:

Data set → a number of Train and Test sets.

Train → SubSamples → SubTrain and SubValid.

Page 22: LMO & Jackknife

Figure: CMV scheme for n = 15, m = 3, G = 3. Within each Train block, the jackknife selects the variables and the number of latent variables; the resulting PLS model (b1) predicts the corresponding Test block, and the Σ(yi − ŷi)² terms over the G test groups give q²CMV1, q²CMV2, ...

The Test set makes no contribution to the variable and lv selection process.

Page 23: LMO & Jackknife

Figure: repeating the scheme over all splits yields q²CMV1, ..., q²CMVm from the Σ(yi − ŷi)² terms.

CMV ⇒ effective external validation.

Page 24: LMO & Jackknife

[q2TOT, q2CMV] = crossmv(trainD, trainy, testD, testy, selVAR, 7)

selVAR: set of selected descriptors (the applied calibration method is PLS)
7: number of latent variables in PLS

CMV is an effective external validation method.

Page 25: LMO & Jackknife

Bootstrapping:

Bootstrap re-sampling: another approach to internal validation.

Wehrens et al., The bootstrap: a tutorial, Chemom Intell Lab Syst (2002) 54, 35-52.

There is only one data set, and it should be representative of the population from which it was drawn.

Bootstrapping simulates random selection: generation of K groups of size n by repeated random selection of n objects from the original data set (i.e. sampling with replacement).

Page 26: LMO & Jackknife

Some objects can be included in the same random sample several times, while others will never be selected.

The model obtained on the n randomly selected objects is used to predict the target properties for the excluded samples, with q² estimation as in LMO.

Page 27: LMO & Jackknife

for i = 1:10                     % number of subsamples in the bootstrap
    for j = 1:6                  % Dr = 6, number of molecules in Train
        RND = randperm(6);
        bootSamp(i,j) = RND(1);  % one molecule drawn at random, with replacement
    end
end

bootSamp =
  5 5 6 3 6 1 → b1     SubValid: 2 4 → q²1
  4 2 6 3 2 6 → b2     SubValid: 1 5 → q²2
  2 5 3 1 2 4 → b3     SubValid: 6 → q²3
  2 3 1 4 4 1 → b4     SubValid: 5 6 → q²4
  3 3 2 6 3 3 → b5     SubValid: 1 4 5 → q²5
  5 5 6 4 4 3 → b6     SubValid: 1 2 → q²6
  4 3 6 1 1 2 → b7     SubValid: 5 → q²7
  2 2 5 4 5 1 → b8     SubValid: 3 6 → q²8
  3 3 2 3 3 5 → b9     SubValid: 1 4 6 → q²9
  2 3 1 6 4 6 → b10    SubValid: 5 → q²10

Each row is a SubTrain with the same number of molecules as Train; the molecules not present in a SubTrain form its SubValid.
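
A minimal sketch of how these subsamples could be turned into q²TOT (assumed names; MLR via pinv stands in for the calibration step):

press = 0;  ssy = 0;
for i = 1:size(bootSamp,1)
    sub = bootSamp(i,:);                 % SubTrain, drawn with replacement
    val = setdiff(1:6, sub);             % molecules never drawn: SubValid
    b = pinv(D(sub,:)) * y(sub);
    press = press + sum((y(val) - D(val,:)*b).^2);
    ssy   = ssy   + sum((y(val) - mean(y(sub))).^2);
end
q2TOT = 1 - press/ssy;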

Page 28: LMO & Jackknife

Figure: bootstrap b values versus subsample number (200 subsamples), and bBOOT versus descriptor number for the three descriptors 38, 50 and 15.

Page 29: LMO & Jackknife

The distributions of the b values are not normal ⇒ nonparametric estimation of the confidence limits.

Figure: sorted bBOOT values versus sorted subsample number for each descriptor, and a histogram of the b values.

With 200 subsamples, 200 × 0.025 = 5 ⇒ the 5th value from the left and the 5th value from the right are the 95% confidence limits. Intervals that exclude zero ⇒ significant; intervals that include zero ⇒ not significant.
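
A minimal sketch of these nonparametric limits (assuming a 200 × 3 bBOOT matrix of bootstrap coefficients):

k = 1;                               % descriptor index (e.g. 1 -> No 38)
K = size(bBOOT,1);                   % 200 subsamples
m = round(0.025 * K);                % 200 x 0.025 = 5
bSort = sort(bBOOT(:,k));            % sorted b values for descriptor k
lo = bSort(m);                       % 5th value from the left
hi = bSort(K - m + 1);               % 5th value from the right
signif = (lo > 0) || (hi < 0);       % interval excluding zero => significant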

Page 30: LMO & Jackknife

Figure: 95% confidence limits per descriptor:

Desc No   lower       upper
38        −12e−5      −1.5e−5    small effect, but significant
50        0.1113      0.5131     significant
15        −0.0181     0.0250     not significant

Page 31: LMO & Jackknife

[bBOOT] = bootstrp(trainD, trainy, 1000, 2, 7)

1000: number of subset samples in the bootstrapping (number of molecules in each SubTrain = number of molecules in Train)
2: calibration method (1 = MLR; 2 = PLS)
7: number of latent variables in PLS

The bootstrap is a method for determining the confidence intervals of the descriptors.

Page 32: LMO & Jackknife

Model validation

Y-randomization:

Random shuffling of the dependent-variable vector, and development of a new QSAR model using the original independent-variable (descriptor) matrix.

The process is repeated a number of times.

Expected: QSAR models with low R² and LOO q² values, i.e. no acceptable model should be obtainable from the randomized data.

Sometimes high q² values are obtained ⇒ chance correlation or structural redundancy of the training set.
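
A minimal y-randomization sketch (assumed names; MLR via pinv as the model builder):

nPerm = 100;
R2 = zeros(nPerm,1);
for i = 1:nPerm
    yRand = y(randperm(numel(y)));      % shuffle the dependent variable
    b = pinv(D) * yRand;                % refit on the original descriptors
    R2(i) = 1 - sum((yRand - D*b).^2) / sum((yRand - mean(yRand)).^2);
end
% consistently low R2 (and q2) values support the original model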

Page 33: LMO & Jackknife

Training and test

External validation. Selecting training and test sets:

a. Finding a new experimentally tested set: not a simple task.
b. Splitting the data set into a training set (for establishing the QSAR model) and a test set (for external validation).

Both training and test sets should separately span the whole descriptor space occupied by the entire data set. Ideally, each member of the test set should be close to at least one point in the training set.

Page 34: LMO & Jackknife

Approaches for creating training and test sets:

1. Straightforward random selection

Yasri et al., Toward an optimal procedure for variable selection and QSAR model building, J Chem Inf Comput Sci (2001) 41, 1218-1227.

2. Activity sampling

Kauffman et al., QSAR and k-nearest neighbor classification analysis of selective cyclooxygenase-2 inhibitors using topologically based numerical descriptors, J Chem Inf Comput Sci (2001) 41, 1553-1560.

Mattioni et al., Development of QSAR and classification models for a set of carbonic anhydrase inhibitors, J Chem Inf Comput Sci (2002) 42, 94-102.

Page 35: LMO & Jackknife

3. Systematic clustering techniques

Burden et al., Use of automatic relevance determination in QSAR studies using Bayesian neural networks, J Chem Inf Comput Sci (2000) 40, 1423-1430.

Snarey et al., Comparison of algorithms for dissimilarity-based compound selection, J Mol Graph Model (1997) 15, 372-385.

4. Self-organizing maps (SOMs): better than random selection

Gramatica et al., QSAR study on the tropospheric degradation of organic compounds, Chemosphere (1999) 38, 1371-1378.

Page 36: LMO & Jackknife

Kohonen map on the 53 × 31 Selwood data matrix, with the columns (molecules) as input: sampling from all regions of the molecule (column) space. Molecules falling in the same map cell (e.g. 19, 18; 4, 23, 14; 3, 20; 15, 16) are neighbors.

Arrangement 1: 19, 18, 3, 20, 4, 23, 14, 15, 16 → Test; the other molecules → Train.
Arrangement 2: 27, 12, 3, 7, 30, 23, 11, 16 → Test; the others → Train.

Page 37: LMO & Jackknife

                                                             RMSECV   RMSEP
Sample selection (Kohonen), descriptor selection (Kohonen)   0.4384   0.6251
Sample selection (Kohonen), descriptor selection (p-value)   0.4205   0.6432

Descriptor selection using p-value (correlation with activity): 51, 37, 35, 38, 39, 36, 15.
Descriptor selection using Kohonen correlation map (correlation with activity): 35, 36, 37, 40, 44, 43, 51, 15.

Page 38: LMO & Jackknife

5. Kennard-Stone

Kennard et al., Computer aided design of experiments, Technometrics (1969) 11, 137-148.

Bourguignon et al., Optimization in irregularly shaped regions: pH and solvent strength in reversed-phase HPLC separations, Anal Chem (1994) 66, 893-904.

6. Factorial and D-optimal design

Eriksson et al., Multivariate design and modeling in QSAR. Tutorial, Chemometr Intell Lab Syst (1996) 34, 1-19.

Mitchell et al., An algorithm for the construction of "D-optimal" experimental designs, Technometrics (2000) 42, 48-54.
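
Of these, Kennard-Stone is simple enough to sketch in a few lines of MATLAB (an illustrative assumption, not code from these slides; pdist/squareform are Statistics Toolbox functions): start from the two most distant objects, then repeatedly add the object farthest from the already-selected set.

k = 20;                                  % desired training-set size
dist2 = squareform(pdist(D)).^2;         % pairwise squared distances
[~, ind] = max(dist2(:));                % the two most distant objects
[i1, i2] = ind2sub(size(dist2), ind);
sel = [i1 i2];
while numel(sel) < k
    rest = setdiff(1:size(D,1), sel);
    [~, j] = max(min(dist2(rest,sel), [], 2));   % farthest from selected set
    sel = [sel rest(j)];
end
% sel -> training set; the unselected objects -> test set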

Page 39: LMO & Jackknife

Gramatica et al., QSAR modeling of bioconcentration factors by theoretical molecular descriptors, Quant Struct-Act Relat (2003) 22, 374-385.

D-optimal design:

Selection of the samples that maximize the determinant |X'X|, the variance-covariance (information) matrix of the independent variables (descriptors), or of the independent plus dependent variables.

The selected samples span the whole area occupied by the representative points and constitute the training set; the points not selected are used as the test set.

⇒ well-balanced structural diversity and representativity of the entire data space (descriptors and responses).

Page 40: LMO & Jackknife

trainD1 = [D(1:3:end,:); D(2:3:end,:)];
trainD2 = D([1:2 5:13 17 21 22 25:end],:);

detCovDy for the selected descriptors (trainD1 / trainD2):

D = Dini;  % all descriptors          −3.48e−236 / 2.13e−243  (!!)
D = Dini(:,[51 37 35 38 39 36 15]);    2.18e53 / 2.66e53
D = Dini(:,[38 50 3]);                 5.90e08 / 4.45e08

Optimum selection of the descriptors and of the molecules in the training set can be performed using detCovDy (D-optimal design).
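
A minimal sketch of how such a detCovDy value could be computed (an assumed formulation: the determinant of the covariance of the training descriptors augmented with the response; Dini, y and the trainD2 rows as above):

D = Dini(:,[38 50 3]);                   % an example descriptor subset
trainIdx = [1:2 5:13 17 21 22 25:31];    % the rows of trainD2 (31 molecules)
Dy = [D(trainIdx,:) y(trainIdx)];        % descriptors plus response
detCovDy = det(cov(Dy));                 % criterion to maximize (D-optimal)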

Page 41: LMO & Jackknife

Leverage

Model applicability domain:

No matter how robust, significant and validated a QSAR model may be, it cannot be expected to reliably predict the modeled property for the entire universe of chemicals!

Leverage is a criterion for determining whether a query compound lies within the applicability domain of the model:

    leverage = xᵀ (XᵀX)⁻¹ x

x: descriptor vector of the query compound
X: matrix of the training-set independent variables
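
A minimal sketch (assumed variable names):

% leverage = x'(X'X)^(-1)x for each query compound
% X: training descriptor matrix; Xq: query (test) compounds as rows
lev = diag(Xq * pinv(X' * X) * Xq');   % pinv guards against a singular X'X

With 53 descriptors and only 31 training samples, X'X is rank-deficient, so a pseudoinverse (or a reduced descriptor subset) is needed.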

Page 42: LMO & Jackknife

Figure: leverages of the training samples and of the test samples when all descriptors are used; the test-sample leverages are on the order of 10¹³–10¹⁴.

Using all descriptors, the leverages of all test samples are very high: the test samples are not in the space of the training samples and cannot be predicted.


Page 43: LMO & Jackknife

Figure: leverages of the training samples and of the test samples when the descriptor subset 38, 50, 3, 13, 24 is used; both now fall roughly in the range 0–1.4.

Using this reduced set of descriptors (38, 50, 3, 13, 24), the leverages of the test samples are similar to those of the training samples: the test samples lie in the space of the training samples and can be predicted.