DEFINING MULTIVARIATE CALIBRATION MODEL COMPLEXITY FOR MODEL SELECTION AND COMPARISON

John Kalivas

Department of Chemistry

Idaho State University

Pocatello, Idaho

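MULTIVARIATE CALIBRATION MODEL

$$\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{e}$$

$\mathbf{y}$ ($m \times 1$): quantitative information of the prediction property for $m$ samples

$\mathbf{X}$ ($m \times p$): respective values for $p$ predictor variables (wavelengths for spectral data)

$\mathbf{b}$ ($p \times 1$): unknown regression coefficients

$\mathbf{e}$ ($m \times 1$): errors with mean zero and covariance $\sigma^2\mathbf{I}$

$$\hat{y}_{\mathrm{unk}} = \mathbf{x}_{\mathrm{unk}}^{T}\hat{\mathbf{b}}$$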

REGRESSION VECTOR SOLUTION

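$$\min \|\mathbf{X}\mathbf{b} - \mathbf{y}\|_2^2$$

MLR solution, requires $m \ge p$ (variable selection) and nearly orthogonal $\mathbf{X}$:

$$\hat{\mathbf{b}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$

Biased regression methods require selection of meta-parameter(s):

$$\hat{\mathbf{b}} = \mathbf{X}^{+}\mathbf{y}$$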
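As a rough illustration of the MLR case, the sketch below solves the normal equations and compares the result with the pseudoinverse solution. The simulated data, the seed, and all variable names are assumptions made for this example only; they are not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
m, p = 20, 4                      # m >= p, so the MLR solution exists
X = rng.normal(size=(m, p))       # simulated predictor matrix (assumption)
b_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ b_true + 0.01 * rng.normal(size=m)

# MLR via the normal equations: b = (X^T X)^{-1} X^T y
b_mlr = np.linalg.solve(X.T @ X, X.T @ y)

# Same solution written with the pseudoinverse: b = X^+ y
b_pinv = np.linalg.pinv(X) @ y

# Prediction for one unknown sample: y_unk = x_unk^T b
x_unk = rng.normal(size=p)
y_unk_hat = x_unk @ b_mlr
```

For full-column-rank X the two routes agree; the biased methods on the following slides replace the plain pseudoinverse with a filtered one.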

BIASED MODELING METHODS

PLS

PCR

Ridge regression (RR)

Generalized RR

Cyclic subspace regression

Continuum regression

Ridge PCR and PLS

Generalized ridge PCR and PLS

Etc.

GENERIC EXPRESSION

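$$\hat{\mathbf{b}} = \mathbf{X}^{+}\mathbf{y} = \mathbf{V}\mathbf{F}\boldsymbol{\Sigma}^{-1}\mathbf{U}^T\mathbf{y} \quad \text{where} \quad \mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T$$

$$\hat{\mathbf{b}} = \sum_{i=1}^{k} f_i \, \frac{\mathbf{u}_i^T\mathbf{y}}{\sigma_i} \, \mathbf{v}_i$$

$$\hat{\mathbf{b}} = \sum_{i=1}^{k} \beta_i \mathbf{v}_i$$

where $k = \mathrm{rank}(\mathbf{X}) \le \min(m, p)$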
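The generic expression can be sketched in a few lines of NumPy. The helper name `filtered_solution`, the rank tolerance, and the simulated data are hypothetical choices for illustration; only the formula itself comes from the slide.

```python
import numpy as np

def filtered_solution(X, y, f):
    """Return b = sum_i f_i (u_i^T y / sigma_i) v_i and the weights beta_i."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    k = int(np.sum(s > 1e-12 * s[0]))          # numerical rank of X (assumed tolerance)
    U, s, Vt = U[:, :k], s[:k], Vt[:k]
    beta = np.asarray(f, dtype=float)[:k] * (U.T @ y) / s   # weights in basis V
    return Vt.T @ beta, beta

rng = np.random.default_rng(0)
X = rng.normal(size=(15, 5))
y = rng.normal(size=15)

# All filter values equal to 1 reproduce the least-squares solution.
b_ls, _ = filtered_solution(X, y, np.ones(5))

# PCR with d = 2: keep the first two basis vectors, delete the rest.
b_pcr, beta_pcr = filtered_solution(X, y, [1, 1, 0, 0, 0])
```

Different modeling methods then differ only in the vector of filter values passed to the helper.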

PCR, RR, AND PLS FILTER VALUES

PCR: fi = 1 for retained basis vectors and fi = 0 for deleted basis vectors

RR: 0 ≤ fi ≤ 1 depending on ridge value

PLS: 0 ≤ fi < ∞ depending on PLS factor model

$$\hat{\mathbf{b}} = \sum_{i=1}^{k} f_i \, \frac{\mathbf{u}_i^T\mathbf{y}}{\sigma_i} \, \mathbf{v}_i$$

RR AND PLS FILTER VALUES

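RR

$$f_i = \frac{\sigma_i^2}{\sigma_i^2 + \lambda}, \quad \lambda \ge 0$$

PLS

$$f_i^{(d)} = 1 - \prod_{j=1}^{d}\left(1 - \frac{\sigma_i^2}{\theta_j^{(d)}}\right) \quad \text{for the } d\text{th factor model}$$

$\theta_j^{(d)}$ are the eigenvalues of $\mathbf{X}^T\mathbf{X}$ restricted to the Krylov subspace

$$\mathcal{K}_d(\mathbf{X}^T\mathbf{X}, \mathbf{X}^T\mathbf{y}) = \mathrm{span}\left\{\mathbf{X}^T\mathbf{y},\ \mathbf{X}^T\mathbf{X}\mathbf{X}^T\mathbf{y},\ \ldots,\ (\mathbf{X}^T\mathbf{X})^{d-1}\mathbf{X}^T\mathbf{y}\right\}$$

$$\hat{\mathbf{b}} = \sum_{i=1}^{k} f_i \, \frac{\mathbf{u}_i^T\mathbf{y}}{\sigma_i} \, \mathbf{v}_i$$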
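A minimal numerical check of both filter-value formulas, assuming uncentered simulated data, a hypothetical ridge value, and d = 2. The PLS part assumes the d-factor PLS regression vector is the least-squares solution restricted to the Krylov subspace, and builds that subspace explicitly rather than running a PLS algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(12, 5))
y = rng.normal(size=12)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
proj = (U.T @ y) / s                      # u_i^T y / sigma_i

# RR: f_i = sigma_i^2 / (sigma_i^2 + lambda)
lam = 2.0                                 # hypothetical ridge value
f_rr = s**2 / (s**2 + lam)
b_rr_filter = Vt.T @ (f_rr * proj)
b_rr_direct = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

# PLS, d factors: f_i = 1 - prod_j (1 - sigma_i^2 / theta_j)
d = 2
A, g = X.T @ X, X.T @ y
krylov = np.column_stack([np.linalg.matrix_power(A, j) @ g for j in range(d)])
Q, _ = np.linalg.qr(krylov)               # orthonormal basis of K_d
theta = np.linalg.eigvalsh(Q.T @ A @ Q)   # eigenvalues of X^T X restricted to K_d
f_pls = 1.0 - np.prod(1.0 - s[:, None] ** 2 / theta[None, :], axis=1)
b_pls_filter = Vt.T @ (f_pls * proj)

# Reference: least squares restricted to the Krylov subspace
coef = np.linalg.lstsq(X @ Q, y, rcond=None)[0]
b_pls_krylov = Q @ coef
```

Unlike the RR values, the PLS filter values are not confined to the interval from 0 to 1, which matches the range stated on the previous slide.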

A CALIBRATION ASSESSMENT PROBLEM

$s^2$ (the estimate of $\sigma^2$) is given by MSEC

$$\mathrm{RMSEC} = \sqrt{\frac{\sum_i (\hat{y}_i - y_i)^2}{m - df}}$$

Need degrees of freedom or fitting degrees of freedom (df)

df = p for the particular MLR model, which requires m > p
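The RMSEC formula translates directly into code. The function name and the small numeric example are illustrative assumptions.

```python
import numpy as np

def rmsec(y, y_hat, df):
    """Root mean square error of calibration with df fitting degrees of freedom."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    m = y.size
    if m <= df:
        raise ValueError("RMSEC needs more samples than fitting degrees of freedom")
    return float(np.sqrt(np.sum((y_hat - y) ** 2) / (m - df)))

# Four calibration samples and a model assumed to use df = 2
value = rmsec([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8], df=2)
```

The result depends directly on the df that is plugged in, which is why a defensible df (or effective rank) is needed before RMSEC values from different methods can be compared.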

MORE df PROBLEMS

df = d, the number of factors (basis vectors), for PCR and PLS, but models can be represented in any basis set:

$$\hat{\mathbf{b}} = \mathbf{B}\boldsymbol{\beta}$$

the same model in different basis sets requires a different number of basis vectors

RR and others are not factor based and/or use multiple meta-parameters

ANOTHER CALIBRATION ASSESSMENT PROBLEM

Useful to plot results for different modeling methods on one plot

Example: a plot of RMSEV against number of factors (basis vectors) is possible for PCR and PLS

RR cannot be included in the plot

still have improper comparison of PCR and PLS, as factors are in different basis sets

NECESSITY

Effective rank (ER) for inter-model comparison of $\hat{\mathbf{b}}$ from $\mathbf{y} = \mathbf{X}\mathbf{b}$, where $\hat{\mathbf{b}}$ is from factor based methods such as PCR or PLS, non-factor based methods such as RR, and/or methods based on multiple meta-parameters

smaller ER, more parsimonious model?

SOLUTIONS

Develop ER in a common basis set using information on how the basis vectors are used

Develop ER that is basis set independent

COMMON BASIS SET

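Use filter values ($f_i$) in eigenvector basis set $\mathbf{V}$

$$\hat{\mathbf{b}} = \sum_{i} f_i \, \frac{\mathbf{u}_i^T\mathbf{y}}{\sigma_i} \, \mathbf{v}_i$$

$$f\mathrm{ER} = \sum_i f_i$$

Gilliam, et al., Inverse Problems, 6 (1990) 725

$$f\mathrm{ER} = \mathrm{tr}(\mathbf{X}\mathbf{X}^{+}) = \mathrm{tr}(\mathbf{H}) \quad \text{where} \quad \hat{\mathbf{y}} = \mathbf{H}\mathbf{y}$$

Hansen, Rank-Deficient and Discrete Ill-Posed Problems, SIAM, (1998)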
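For RR the two routes to fER can be checked against each other numerically. This is a sketch with simulated data and a hypothetical ridge value; it assumes the hat matrix H is formed from the ridge solution.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 6))
lam = 3.0                                  # hypothetical ridge value
s = np.linalg.svd(X, compute_uv=False)

# RR: sum of filter values vs. trace of the hat matrix H (y_hat = H y)
f_rr = s**2 / (s**2 + lam)
H_rr = X @ np.linalg.solve(X.T @ X + lam * np.eye(6), X.T)
fER_rr = f_rr.sum()

# PCR with d retained factors: each f_i is 1 or 0, so fER equals d
d = 3
f_pcr = np.r_[np.ones(d), np.zeros(6 - d)]
fER_pcr = f_pcr.sum()
```

Because fER is defined through the filter values in the common basis set V, a non-integer value for RR can be placed on the same axis as the integer values for PCR.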

BASIS SET INDEPENDENT

$h_i$ = change in the fitted value $\hat{y}_i$ depending on the change in the observed value $y_i$

the larger the $h_i$, the more $\hat{y}_i$ will change if $y_i$ changes (fluctuations around the expected value due to random noise)

Add normally distributed noise $\boldsymbol{\delta}$ to $\mathbf{y}$ $N$ times

Obtain $\hat{\mathbf{y}}$ vectors from models with perturbed $\mathbf{y}$

Calculate $\hat{h}_i$ for the $i$th sample (sensitivity of a fitted value to perturbation in the respective observed value) as the regression slope to:

$$\hat{y}_i = \alpha + \hat{h}_i \, \delta_{n,i}, \quad n = 1, \ldots, N$$

$$\mathrm{ER}_{\mathrm{GDF}} = \sum_{i=1}^{m} \hat{h}_i$$

Ye, Journal American Statistical Association, 93 (1998) 120
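The perturbation steps above can be sketched as a small Monte Carlo routine. The function names, the default noise size (half the standard deviation of y), the number of perturbations, and the ridge model used as the test case are all assumptions for illustration, not settings from the slides.

```python
import numpy as np

def gdf_effective_rank(fit_predict, X, y, n_perturb=2000, noise_sd=None, seed=0):
    """Estimate sum_i h_i, where h_i is the slope of fitted value i on its perturbation."""
    rng = np.random.default_rng(seed)
    m = len(y)
    if noise_sd is None:
        noise_sd = 0.5 * np.std(y)          # assumed perturbation size
    delta = rng.normal(scale=noise_sd, size=(n_perturb, m))
    fitted = np.array([fit_predict(X, y + d) for d in delta])   # N x m fitted values
    dc = delta - delta.mean(axis=0)         # centering handles the intercept alpha
    fc = fitted - fitted.mean(axis=0)
    h = (dc * fc).sum(axis=0) / (dc**2).sum(axis=0)   # per-sample regression slopes
    return h.sum(), h

def ridge_fit_predict(X, y, lam=5.0):
    """Hypothetical model under test: ridge regression fitted values."""
    b = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    return X @ b

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 5))
y = X @ np.array([1.0, 0.5, -1.0, 2.0, 0.0]) + rng.normal(scale=0.3, size=30)
er_gdf, h = gdf_effective_rank(ridge_fit_predict, X, y)
trace_H = np.trace(X @ np.linalg.solve(X.T @ X + 5.0 * np.eye(5), X.T))
```

For a linear method such as RR the estimate should land close to the trace of the hat matrix, up to Monte Carlo error; the appeal of the approach is that `fit_predict` can be any modeling method.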

BASIS SET INDEPENDENT

Van der Voet, Journal Chemometrics, 13 (1999) 111

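$$\mathrm{ER}_{\mathrm{VDV}} = m\left(1 - \sqrt{\frac{\mathrm{SSE}/m}{\mathrm{RMSECV}^2}}\right)$$

$\mathrm{ER}_{\mathrm{VDV}}$ is based on error estimates, which themselves contain error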
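A short sketch of the Van der Voet pseudo-degrees-of-freedom calculation, assuming it takes the form m(1 - sqrt((SSE/m) / RMSECV^2)) with SSE the calibration sum of squared errors. The function name and the example numbers are hypothetical.

```python
import math

def er_vdv(sse, rmsecv, m):
    """Effective rank from the ratio of resubstitution to cross-validation error."""
    mse_cal = sse / m                 # resubstitution mean squared error
    return m * (1.0 - math.sqrt(mse_cal / rmsecv**2))

# Hypothetical numbers: 50 samples, SSE = 32, RMSECV = 1.0
example = er_vdv(32.0, 1.0, 50)
```

Both inputs are themselves noisy estimates, so the resulting ER inherits that uncertainty.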

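BASIS SET INDEPENDENT

Know $\hat{\mathbf{b}} = \mathbf{V}\boldsymbol{\beta} = \mathbf{T}\boldsymbol{\delta} = \mathbf{I}\boldsymbol{\gamma}$ for eigenvector basis set $\mathbf{V}$, PLS basis set $\mathbf{T}$, and std. basis set $\mathbf{I}$, with $\boldsymbol{\beta}$, $\boldsymbol{\delta}$, and $\boldsymbol{\gamma}$ being the respective weight vectors for a model in that basis set

Eigenvector basis set:

$$\sum_i \beta_i^2 = \|\boldsymbol{\beta}\|_2^2 = \|\hat{\mathbf{b}}\|_2^2$$

PLS basis set:

$$\sum_i \delta_i^2 = \|\boldsymbol{\delta}\|_2^2 = \|\hat{\mathbf{b}}\|_2^2$$

Std. basis set:

$$\sum_i \gamma_i^2 = \|\boldsymbol{\gamma}\|_2^2 = \|\hat{\mathbf{b}}\|_2^2$$

$$\mathrm{ER}_{\hat{\mathbf{b}}} = \mathrm{rank}(\mathbf{X}) \, \frac{\|\hat{\mathbf{b}}\|_2^2}{\|\hat{\mathbf{b}}_{\mathrm{LS}}\|_2^2}$$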

DATA SETS

CARBONIC ANHYDRASE (CA) INHIBITORS: CA contributes to production of eye humor, which with excess secretion causes permanent damage and diseases (macular edema and open-angle glaucoma).

142 compounds assayed for inhibition of CA isozymes CA I, CA II, & CA IV. Inhibition values Log(Ki) modeled with 63 (full) & 8 (subset) descriptors deemed best for an ANN (Mattioni & Jurs, J. Chem. Inf. Comput. Sci., 42 (2002) 94)

DIHYDROFOLATE REDUCTASE (DHFR) INHIBITORS: inhibition of DHFR important in combating diseases from pathogens Pneumocystis carinii (pc) and Toxoplasma gondii (tg) in unhealthy immune systems

334, 320, & 340 compounds assayed for inhibition of (pc) DHFR, (tg) DHFR, & mammalian standard rlDHFR. Log of 50% inhibition concentration values (IC50) modeled with 84, 83, & 84 (full) & 10 (subset) descriptors deemed best for an ANN (Mattioni & Jurs, J. Mol. Graphics Modeling, 21 (2002) 391)

[Figure: CA IV (full) RMSEC plots. Panels: PLS and PCR RMSEC (df = d) against d; PCR (df = d = fER), PLS (df = d), and PLS (df = fER); PCR and PLS (df = fER) against fER; PCR, PLS, and RR (df = fER) against fER]

[Figure: CA IV (full) RMSEV plots. Panels: PLS and PCR RMSEV against d; PLS and PCR against fER; PLS, PCR, and RR against fER]

BIAS/VARIANCE CONSIDERATION

[Figure: prediction error against model complexity, showing the bias and variance contributions]

GENERAL TIKHONOV REGULARIZATION

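$$\min \|\mathbf{X}\mathbf{b} - \mathbf{y}\|_2^2 + \lambda\|\mathbf{L}\mathbf{b}\|_2^2$$

$$\hat{\mathbf{b}} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{L}^T\mathbf{L})^{-1}\mathbf{X}^T\mathbf{y}$$

λ is a meta-parameter that must be optimized

L is a matrix of values, usually a derivative operator

$$\mathbf{L}_1 = \begin{bmatrix} 1 & -1 & & \\ & \ddots & \ddots & \\ & & 1 & -1 \end{bmatrix} \in \mathbb{R}^{(p-1) \times p}, \qquad \mathbf{L}_2 = \begin{bmatrix} 1 & -2 & 1 & & \\ & \ddots & \ddots & \ddots & \\ & & 1 & -2 & 1 \end{bmatrix} \in \mathbb{R}^{(p-2) \times p}$$

Tikhonov, Soviet Math. Dokl., 4 (1963) 1035

L can be the spectral error covariance matrix for removal of undesired spectral variation (wavelength selection)

Kalivas, Anal. Chim. Acta, 505 (2004) 9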
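A minimal sketch of general Tikhonov regularization with derivative operators, assuming the penalty enters as λ times the squared 2-norm of Lb and that the operators are the usual first and second difference matrices. Function names, data, and the λ value are illustrative assumptions.

```python
import numpy as np

def first_derivative_operator(p):
    """(p-1) x p first-difference matrix with rows [..., 1, -1, ...]."""
    return np.eye(p - 1, p) - np.eye(p - 1, p, k=1)

def second_derivative_operator(p):
    """(p-2) x p second-difference matrix with rows [..., 1, -2, 1, ...]."""
    return np.eye(p - 2, p) - 2.0 * np.eye(p - 2, p, k=1) + np.eye(p - 2, p, k=2)

def tikhonov(X, y, L, lam):
    """b = (X^T X + lam L^T L)^{-1} X^T y; assumes X and L share no null vector."""
    return np.linalg.solve(X.T @ X + lam * (L.T @ L), X.T @ y)

rng = np.random.default_rng(4)
m, p = 10, 20                     # fewer samples than variables, as with spectra
X = rng.normal(size=(m, p))
y = rng.normal(size=m)
L2 = second_derivative_operator(p)
b_hat = tikhonov(X, y, L2, lam=1.0)
```

Setting L to the identity matrix in `tikhonov` gives ordinary RR.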

STANDARDIZED TIKHONOV REGULARIZATION

Hansen, Rank-Deficient and Discrete Ill-Posed Problems, SIAM, (1998)

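$$\min \|\mathbf{X}\mathbf{b} - \mathbf{y}\|_2^2 + \lambda\|\mathbf{L}\mathbf{b}\|_2^2$$

$$\min \|\bar{\mathbf{X}}\bar{\mathbf{b}} - \bar{\mathbf{y}}\|_2^2 + \lambda\|\bar{\mathbf{b}}\|_2^2 \quad \text{where } \bar{\mathbf{X}}, \bar{\mathbf{b}}, \bar{\mathbf{y}} \text{ are } \mathbf{X}, \mathbf{b}, \mathbf{y} \text{ in std. form}$$

$$\hat{\bar{\mathbf{b}}} = (\bar{\mathbf{X}}^T\bar{\mathbf{X}} + \lambda\mathbf{I})^{-1}\bar{\mathbf{X}}^T\bar{\mathbf{y}}$$

back-transform $\hat{\bar{\mathbf{b}}}$ to $\hat{\mathbf{b}}$ in general form

$$\|\hat{\bar{\mathbf{b}}}\|_2^2 = \|\mathbf{L}\hat{\mathbf{b}}\|_2^2 \quad \text{and} \quad \|\bar{\mathbf{X}}\hat{\bar{\mathbf{b}}} - \bar{\mathbf{y}}\|_2^2 = \|\mathbf{X}\hat{\mathbf{b}} - \mathbf{y}\|_2^2$$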

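STANDARDIZED TIKHONOV REGULARIZATION

Simple case: L is square and invertible

$$\bar{\mathbf{y}} = \mathbf{y}, \quad \bar{\mathbf{X}} = \mathbf{X}\mathbf{L}^{-1}, \quad \text{and} \quad \bar{\mathbf{b}} = \mathbf{L}\mathbf{b}$$

$$\hat{\bar{\mathbf{b}}} = (\bar{\mathbf{X}}^T\bar{\mathbf{X}} + \lambda\mathbf{I})^{-1}\bar{\mathbf{X}}^T\bar{\mathbf{y}}$$

$$\hat{\mathbf{b}} = \mathbf{L}^{-1}\hat{\bar{\mathbf{b}}}$$

L = I: RR

$$\hat{\mathbf{b}} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$$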
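The simple case can be verified numerically: solve the problem in standard form, back-transform, and compare with the general-form solution. The diagonal L, the λ value, and the simulated data below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(15, 4))
y = rng.normal(size=15)
lam = 0.7                                  # hypothetical meta-parameter value
L = np.diag([1.0, 2.0, 0.5, 3.0])          # hypothetical square, invertible L

# Standard form: y_bar = y, X_bar = X L^{-1}, b_bar = L b
X_bar = X @ np.linalg.inv(L)
y_bar = y
b_bar_hat = np.linalg.solve(X_bar.T @ X_bar + lam * np.eye(4), X_bar.T @ y_bar)

# Back-transform to general form: b = L^{-1} b_bar
b_hat = np.linalg.solve(L, b_bar_hat)

# Direct general-form solution for comparison
b_general = np.linalg.solve(X.T @ X + lam * (L.T @ L), X.T @ y)
```

The two norms used for the harmonious plot are unchanged by the transformation, so a model can be characterized in either form.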

HARMONIOUS (PARETO) PLOT

For graphical characterization of Tikhonov regularization, plot a variance indicator against a bias criterion to reduce the chance of overfitting or underfitting

Curve will have an L-shape (L-curve)

Ideal model at corner with the proper bias/variance trade-off (harmonious model)

PCR and PLS: best number of factors

RR: best ridge value

etc.

Intra- and inter-model comparison

Lawson, et al., Solving Least-Squares Problems, Prentice-Hall, (1974)
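The points of such a curve for RR can be generated as below. This sketch assumes the 2-norm of the regression vector as the variance indicator and the residual norm as the bias criterion; the data and the grid of ridge values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(25, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.5, size=25)

ridge_values = np.logspace(-2, 3, 30)      # hypothetical grid of ridge values
b_norm, resid_norm = [], []
for lam in ridge_values:
    b = np.linalg.solve(X.T @ X + lam * np.eye(8), X.T @ y)
    b_norm.append(np.linalg.norm(b))               # variance indicator
    resid_norm.append(np.linalg.norm(X @ b - y))   # bias criterion
b_norm = np.array(b_norm)
resid_norm = np.array(resid_norm)
```

Plotting `b_norm` against `resid_norm` traces the curve: small ridge values sit at the overfitting end (large regression vector norm, small residual) and large ridge values at the underfitting end.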

EXAMPLE PLOT

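[Figure: L-curve of $\|\hat{\mathbf{b}}\|_2$ plotted against $\|\hat{\mathbf{y}} - \mathbf{y}\|_2$, with the underfitting and overfitting regions marked and the best model at the corner]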

VARIANCE EXPRESSIONS

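Faber, et al., Journal of Chemometrics, 11 (1997) 181

Lorber, et al., Journal of Chemometrics, 2 (1988) 93

$$V(\hat{y}_{\mathrm{unk}}) = \left(\frac{1}{m} + h_{\mathrm{unk}}\right)\left(s_e^2 + s_y^2 + \|\hat{\mathbf{b}}\|_2^2 \, s_X^2\right) + s_e^2 + \|\hat{\mathbf{b}}\|_2^2 \, s_{x_{\mathrm{unk}}}^2$$

$$s_{\mathrm{eff}}^2 = s_e^2 + s_y^2 + \|\hat{\mathbf{b}}\|_2^2 \, s_X^2 \approx \frac{\sum_i (\hat{y}_i - y_i)^2}{df}$$

$$V(\hat{y}_{\mathrm{unk}}) \approx s^2 h_{\mathrm{unk}} + \|\hat{\mathbf{b}}\|_2^2 \, s_{x_{\mathrm{unk}}}^2$$

$$s^2 = \frac{\sum_i (\hat{y}_i - y_i)^2}{df}$$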

EXPERIMENTAL APPROACH

Intra- and inter-model comparison of RR, PLS, and PCR with QSAR data for the most harmonious and parsimonious models

LOOCV tends to overfit

Use mean values from LMOCV

Data sets randomly split 300 times with v validation and m – v calibration samples, where v ≈ 0.6m

Shao, J., J. Am. Statist. Assoc., 88 (1993) 486-494
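A sketch of the leave-multiple-out cross-validation (LMOCV) scheme with repeated random splits. The function names, the MLR fit used as the model, the definition of RMSEV as the root mean squared validation error, and the simulated data are assumptions for illustration.

```python
import numpy as np

def lmocv_rmsev(fit, X, y, n_splits=300, frac_val=0.6, seed=0):
    """Mean RMSEV over random splits with v validation and m - v calibration samples."""
    rng = np.random.default_rng(seed)
    m = len(y)
    v = int(round(frac_val * m))            # v is about 0.6 m
    rmsev = []
    for _ in range(n_splits):
        perm = rng.permutation(m)
        val, cal = perm[:v], perm[v:]
        b = fit(X[cal], y[cal])             # model built on calibration samples only
        rmsev.append(np.sqrt(np.mean((X[val] @ b - y[val]) ** 2)))
    return float(np.mean(rmsev))

def mlr_fit(X, y):
    """Hypothetical model under test: least-squares regression vector."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(7)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -1.0, 2.0]) + rng.normal(scale=0.1, size=50)
mean_rmsev = lmocv_rmsev(mlr_fit, X, y)
```

Any function that returns a regression vector can be passed as `fit`, so RR, PLS, and PCR models could be compared on the same splits by reusing the seed.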

CA IV HARMONIOUS RR(750), PLS(6), AND PCR(8) PLOTS FOR 63 DESCRIPTORS

[Figure: $\|\hat{\mathbf{b}}\|_2$ plotted against RMSEC and RMSEV; ridge value range: 45 - 7050]

GENERAL APPROACH TO OPTIMIZATION OF PARETO CURVE

$\hat{\mathbf{b}} = \mathbf{V}\boldsymbol{\beta}$ for basis set $\mathbf{V}$ and weights $\boldsymbol{\beta}$, or any basis set with respective weights

Use an optimization algorithm (simplex, simulated annealing, etc.) adjusting weight values in $\boldsymbol{\beta}$ while minimizing the distance to target values of variance and bias measures

Models converge to RR models

CA IV MODEL VALUES FOR 63 DESCRIPTORS

Model^a     ‖b̂‖₂     RMSEC   RMSEV   R²cal   R²val   ER
RR (750)    0.0721   0.667   0.774   0.870   0.751   7.97
PLS (6)     0.0855   0.679   0.791   0.855   0.747   8.08
PCR (8)     0.0948   0.679   0.795   0.853   0.747   8

^a Parentheses contain ridge values for RR and the number of factors for PLS and PCR.

CA IV HARMONIOUS RR(6), PLS(4), AND PCR(5) PLOTS FOR 8 DESCRIPTORS

[Figure: $\|\hat{\mathbf{b}}\|_2$ plotted against RMSEC and RMSEV; ridge value range: 0.2 - 126]

CA IV MODEL VALUES FOR 8 DESCRIPTORS

Model^a     ‖b̂‖₂     RMSEC   RMSEV   R²cal   R²val   ER
RR (6)      0.460    0.593   0.671   0.880   0.816   5.45
PLS (4)     0.463    0.602   0.676   0.875   0.813   5.20
PCR (5)     0.464    0.605   0.677   0.879   0.814   5
MLR         3.553    0.584   0.700   0.894   0.802   8

^a Parentheses contain ridge values for RR and the number of factors for PLS and PCR.

CA IV MODEL VALUES

Model^a     No. of Descriptors   ‖b̂‖₂     RMSEC   RMSEV   R²cal   R²val   ER
RR (750)    63                   0.0721   0.667   0.774   0.870   0.751   7.97
PLS (6)     63                   0.0855   0.679   0.791   0.855   0.747   8.08
PCR (8)     63                   0.0948   0.679   0.795   0.853   0.747   8
RR (6)      8                    0.460    0.593   0.671   0.880   0.816   5.45
PLS (4)     8                    0.463    0.602   0.676   0.875   0.813   5.20
PCR (5)     8                    0.464    0.605   0.677   0.879   0.814   5
MLR         8                    3.553    0.584   0.700   0.894   0.802   8

^a Parentheses contain ridge values for RR and the number of factors for PLS and PCR.

CA IV HARMONY/PARSIMONY PLOTS: PLS(6) AND PCR(8) FOR 63 DESCRIPTORS

[Figure: two panels with axes RMSEV, fER, and $\|\hat{\mathbf{b}}\|_2$]

CA IV HARMONY/PARSIMONY PLOTS: PLS(6) 63 DESCRIPTORS AND PLS(4) 8 DESCRIPTORS

[Figure: plot with axes $\|\hat{\mathbf{b}}\|_2$, fER, and RMSEV]

CA I MODEL VALUES

Model^a     No. of Descriptors   ‖b̂‖₂     RMSEC   RMSEV   R²cal   R²val   ER
RR (670)    63                   0.0667   0.566   0.695   0.868   0.717   8.09
PLS (6)     63                   0.0767   0.577   0.703   0.851   0.720   8.03
PCR (7)     63                   0.0742   0.591   0.705   0.835   0.717   7
RR (22)     8                    0.209    0.624   0.699   0.792   0.720   3.54
PLS (3)     8                    0.227    0.641   0.705   0.787   0.717   3.65
PCR (4)     8                    0.267    0.641   0.711   0.790   0.712   4
MLR         8                    9.722    0.532   0.657   0.878   0.762   8

^a Parentheses contain ridge values for RR and the number of factors for PLS and PCR.

tgDHFR MODEL VALUES USING 10 DESCRIPTORS

Model^a     ‖b̂‖₂     RMSEC   RMSEV   R²cal   R²val   ER
RR (11)     0.597    0.857   0.919   0.677   0.580   6.51
PLS (5)     0.607    0.852   0.929   0.657   0.578   6.62
PCR (6)     0.646    0.867   0.927   0.657   0.581   6
MLR         8.664    0.765   0.902   0.766   0.634   10

^a Parentheses contain ridge values for RR and the number of factors for PLS and PCR.

pcDHFR MODEL VALUES USING 10 DESCRIPTORS

Model^a     ‖b̂‖₂     RMSEC   RMSEV   R²cal   R²val   ER
RR (17)     0.421    0.915   0.979   0.605   0.478   5.89
PLS (5)     0.500    0.916   0.996   0.603   0.478   6.50
PCR (6)     0.500    0.916   0.993   0.600   0.478   6
MLR         679      0.870   1.020   0.603   0.495   10

^a Parentheses contain ridge values for RR and the number of factors for PLS and PCR.

SUMMARY

ER necessary for fair intra- and inter-model comparison

RMSEC and RMSEV plot overlays are possible for different modeling methods

Harmonious plots allow proper determination of meta-parameters and validation

Fair intra- and inter-model comparisons are possible (plot overlays are possible)

In optimal model region of harmonious curve, differences in models are small

ER assesses the true nature of variable selection for improved parsimony

Harmony/parsimony compromise

FUTURE WORK

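Use ER with multiple variance and bias indicators for better characterization of the harmony/parsimony tradeoff for intra- and inter-model comparison with full and/or variable subsets

Include variable selection in the modeling process

$$\min\left(\|\mathbf{X}\mathbf{b} - \mathbf{y}\|_2^2 + \lambda\|\mathbf{L}\mathbf{b}\|_1\right)$$

Include L = second derivative operator in Tikhonov regularization

a form of RR with smoothing

smooth spectral noise and temperature influences

Use standardization approach with PCR and PLS

PCR (d factors): from

$$\min \|\mathbf{X}\mathbf{b} - \mathbf{y}\|_2^2$$

to

$$\min \|\bar{\mathbf{X}}\bar{\mathbf{b}} - \bar{\mathbf{y}}\|_2^2$$

PLS (d factors): from

$$\min \|\mathbf{X}\mathbf{b} - \mathbf{y}\|_2^2 \quad \text{subject to} \quad \mathbf{b} \in \mathcal{K}_d(\mathbf{X}^T\mathbf{X}, \mathbf{X}^T\mathbf{y})$$

to

$$\min \|\bar{\mathbf{X}}\bar{\mathbf{b}} - \bar{\mathbf{y}}\|_2^2 \quad \text{subject to} \quad \bar{\mathbf{b}} \in \mathcal{K}_d(\bar{\mathbf{X}}^T\bar{\mathbf{X}}, \bar{\mathbf{X}}^T\bar{\mathbf{y}})$$

$$\text{where} \quad \mathcal{K}_d(\mathbf{X}^T\mathbf{X}, \mathbf{X}^T\mathbf{y}) = \mathrm{span}\left\{\mathbf{X}^T\mathbf{y},\ \mathbf{X}^T\mathbf{X}\mathbf{X}^T\mathbf{y},\ \ldots,\ (\mathbf{X}^T\mathbf{X})^{d-1}\mathbf{X}^T\mathbf{y}\right\}$$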

ACKNOWLEDGEMENTS

Forrest Stout and Heather Seipel

Peter Jurs and Brian Mattioni provided QSAR data sets

National Science Foundation

STANDARDIZATION PROCESS

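For $\mathbf{L} \in \mathbb{R}^{s \times p}$ with rank(L) = s < p, obtain a QR factorization of $\mathbf{L}^T$

$$\mathbf{L}^T = \mathbf{K}\mathbf{R} = \begin{bmatrix}\mathbf{K}_s & \mathbf{K}_o\end{bmatrix}\begin{bmatrix}\mathbf{R}_s \\ \mathbf{0}\end{bmatrix}$$

Form $\mathbf{X}\mathbf{K}_o \in \mathbb{R}^{m \times (p-s)}$ and perform a QR factorization of $\mathbf{X}\mathbf{K}_o$

$$\mathbf{X}\mathbf{K}_o = \mathbf{H}\mathbf{T} = \begin{bmatrix}\mathbf{H}_o & \mathbf{H}_q\end{bmatrix}\begin{bmatrix}\mathbf{T}_o \\ \mathbf{0}\end{bmatrix}$$

Compute standardized data

$$\bar{\mathbf{X}} = \mathbf{H}_q^T\mathbf{X}\mathbf{L}^{+} = \mathbf{H}_q^T\mathbf{X}\mathbf{K}_s\mathbf{R}_s^{-T}, \qquad \bar{\mathbf{y}} = \mathbf{H}_q^T\mathbf{y}$$

Perform back-transformation

$$\hat{\mathbf{b}} = \mathbf{L}^{+}\hat{\bar{\mathbf{b}}} + \mathbf{K}_o\mathbf{T}_o^{-1}\mathbf{H}_o^T\left(\mathbf{y} - \mathbf{X}\mathbf{L}^{+}\hat{\bar{\mathbf{b}}}\right)$$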
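The standardization steps can be sketched with two complete QR factorizations. The function name `standardize`, the first-derivative L, the λ value, and the simulated data are assumptions for illustration; the sketch also assumes m ≥ p - s and that X K_o has full column rank.

```python
import numpy as np

def standardize(X, y, L):
    """Transform a general-form Tikhonov problem (s x p L, rank s < p) to standard form."""
    s, p = L.shape
    K, R = np.linalg.qr(L.T, mode="complete")        # L^T = K R
    K_s, K_o, R_s = K[:, :s], K[:, s:], R[:s, :]
    H, T = np.linalg.qr(X @ K_o, mode="complete")    # X K_o = H T
    H_o, H_q, T_o = H[:, :p - s], H[:, p - s:], T[:p - s, :]
    L_pinv = K_s @ np.linalg.inv(R_s).T              # L^+ = K_s R_s^{-T}
    X_bar = H_q.T @ X @ L_pinv
    y_bar = H_q.T @ y

    def back_transform(b_bar):
        r = y - X @ (L_pinv @ b_bar)
        return L_pinv @ b_bar + K_o @ np.linalg.solve(T_o, H_o.T @ r)

    return X_bar, y_bar, back_transform

rng = np.random.default_rng(8)
m, p, lam = 15, 8, 0.5
X = rng.normal(size=(m, p))
y = rng.normal(size=m)
L1 = np.eye(p - 1, p) - np.eye(p - 1, p, k=1)        # first-derivative operator

X_bar, y_bar, back_transform = standardize(X, y, L1)
b_bar = np.linalg.solve(X_bar.T @ X_bar + lam * np.eye(p - 1), X_bar.T @ y_bar)
b_std = back_transform(b_bar)
b_general = np.linalg.solve(X.T @ X + lam * (L1.T @ L1), X.T @ y)
```

The back-transformed vector should agree with the direct general-form solution, which gives a quick check that the factorization blocks were split correctly.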
