DEFINING MULTIVARIATE CALIBRATION MODEL COMPLEXITY FOR MODEL SELECTION AND COMPARISON
John Kalivas
Department of Chemistry
Idaho State University
Pocatello, Idaho
MULTIVARIATE CALIBRATION MODEL
$y = Xb + e$
y (m × 1): quantitative information of prediction property for m samples
X (m × p): respective values for p predictor variables (wavelengths for spectral data)
b (p × 1): unknown regression coefficients
e (m × 1): errors with mean zero and covariance $\sigma^2 I$
Prediction for an unknown sample: $\hat{y}_{unk} = x_{unk}^T \hat{b}$
REGRESSION VECTOR SOLUTION
$\min \|Xb - y\|_2^2$
MLR solution, $\hat{b} = (X^T X)^{-1} X^T y$, requires m ≥ p (variable selection) and nearly orthogonal X
Biased regression methods, $\hat{b} = X^{+} y$, require selection of meta-parameter(s)
BIASED MODELING METHODS
PLS, PCR, ridge regression (RR), generalized RR, cyclic subspace regression, continuum regression, ridge PCR and PLS, generalized ridge PCR and PLS, etc.
GENERIC EXPRESSION
$\hat{b} = X^{+} y = V F \Sigma^{-1} U^T y$ where $X = U \Sigma V^T$
$\hat{b} = \sum_{i=1}^{k} f_i \frac{u_i^T y}{\sigma_i} v_i = \sum_{i=1}^{k} \beta_i v_i$
where k = rank(X) ≤ min(m,p)
PCR, RR, AND PLS FILTER VALUES
PCR: fi = 1 for retained basis vectors and fi = 0 for deleted basis vectors
RR: 0 ≤ fi ≤ 1 depending on ridge value
PLS: 0 ≤ fi < ∞ depending on PLS factor model
$\hat{b} = \sum_{i=1}^{k} f_i \frac{u_i^T y}{\sigma_i} v_i$
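The filter-value expression can be checked numerically. The sketch below is illustrative only: the data are synthetic, the function names are made up for this example, and the RR filter is assumed to take the form $f_i = \sigma_i^2 / (\sigma_i^2 + \lambda)$ with ridge value $\lambda$ entering as $X^T X + \lambda I$.

```python
import numpy as np

def filtered_regression_vector(X, y, f):
    """b_hat = sum_i f_i * (u_i^T y / sigma_i) * v_i for filter values f."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt.T @ (f * (U.T @ y) / s)

def pcr_filters(k, d):
    """PCR: f_i = 1 for the d retained basis vectors, 0 for the deleted ones."""
    f = np.zeros(k)
    f[:d] = 1.0
    return f

def rr_filters(s, lam):
    """RR (assumed form): f_i = sigma_i^2 / (sigma_i^2 + lam)."""
    return s**2 / (s**2 + lam)

# Synthetic calibration data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=20)
s = np.linalg.svd(X, compute_uv=False)

lam = 2.0  # hypothetical ridge value
b_rr = filtered_regression_vector(X, y, rr_filters(s, lam))
b_rr_direct = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

# Retaining every basis vector in PCR reproduces the least-squares solution
b_pcr_full = filtered_regression_vector(X, y, pcr_filters(5, 5))
b_ls = np.linalg.lstsq(X, y, rcond=None)[0]
```

Here `b_rr` and `b_rr_direct` should agree, as should `b_pcr_full` and `b_ls`, which is the sense in which one expression covers both methods.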
RR AND PLS FILTER VALUES
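RR: $f_i = \frac{\sigma_i^2}{\sigma_i^2 + \lambda}$, $\lambda \ge 0$
PLS: $f_i^{(d)} = 1 - \prod_{j=1}^{d} \frac{\theta_j^{(d)} - \sigma_i^2}{\theta_j^{(d)}}$ for the dth factor model
$\theta_j^{(d)}$ are the eigenvalues of $X^T X$ restricted to the Krylov subspace
$K_d(X^T X, X^T y) = \mathrm{span}\{X^T y, (X^T X) X^T y, \ldots, (X^T X)^{d-1} X^T y\}$
$\hat{b} = \sum_{i=1}^{k} f_i \frac{u_i^T y}{\sigma_i} v_i$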
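A minimal numerical sketch of the PLS filter values follows. It assumes the d-factor PLS regression vector is the least-squares solution restricted to the Krylov subspace $K_d(X^T X, X^T y)$ and that the filter values take the form $f_i = 1 - \prod_j (\theta_j - \sigma_i^2)/\theta_j$. The data are synthetic, there is no mean centering, and all function names are hypothetical.

```python
import numpy as np

def krylov_basis(A, v, d):
    """Orthonormal basis Q of span{v, A v, ..., A^(d-1) v}."""
    cols = [v / np.linalg.norm(v)]
    for _ in range(d - 1):
        w = A @ cols[-1]
        cols.append(w / np.linalg.norm(w))  # rescaling leaves the span unchanged
    Q, _ = np.linalg.qr(np.column_stack(cols))
    return Q

def pls_filter_values(X, y, d):
    """Filter values from the eigenvalues of X^T X restricted to the Krylov subspace."""
    A = X.T @ X
    Q = krylov_basis(A, X.T @ y, d)
    theta = np.linalg.eigvalsh(Q.T @ A @ Q)          # restricted eigenvalues
    sigma = np.linalg.svd(X, compute_uv=False)
    ratios = (theta[None, :] - sigma[:, None] ** 2) / theta[None, :]
    return 1.0 - np.prod(ratios, axis=1), Q

# Synthetic data, for illustration only
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 6))
y = X @ rng.normal(size=6) + 0.1 * rng.normal(size=20)

d = 3  # hypothetical number of PLS factors
f, Q = pls_filter_values(X, y, d)

# Regression vector minimizing ||Xb - y|| over the Krylov subspace
b_krylov = Q @ np.linalg.lstsq(X @ Q, y, rcond=None)[0]

# Same vector assembled from the filter values in the eigenvector basis set V
U, s, Vt = np.linalg.svd(X, full_matrices=False)
b_filter = Vt.T @ (f * (U.T @ y) / s)
```

The two vectors should agree to numerical precision. Building the Krylov basis by repeated multiplication is adequate for small d but becomes ill-conditioned as d grows.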
A CALIBRATION ASSESSMENT PROBLEM
$\sigma^2$ is estimated by $s^2$ = MSEC
Need degrees of freedom or fitting degrees of freedom (df)
df = p for the particular MLR model, which requires m > p
$\mathrm{RMSEC} = \sqrt{\frac{\sum_i (\hat{y}_i - y_i)^2}{m - df}}$
MORE df PROBLEMS
df = d, the number of factors (basis vectors), for PCR and PLS, but models can be represented in any basis set: $\hat{b} = B\beta$
The same model in different basis sets requires a different number of basis vectors
RR and others are not factor based and/or use multiple meta-parameters
ANOTHER CALIBRATION ASSESSMENT PROBLEM
Useful to plot results for different modeling methods on one plot
Example: a plot of RMSEV against number of factors (basis vectors) is possible for PCR and PLS
RR cannot be included in the plot
Still have improper comparison of PCR and PLS as factors are in different basis sets
NECESSITY
Effective rank (ER) for inter-model comparison of $\hat{b}$ from y = Xb, where $\hat{b}$ is from factor based methods such as PCR or PLS, non-factor based methods such as RR, and/or methods based on multiple meta-parameters
Smaller ER, more parsimonious model?
SOLUTIONS
Develop ER in a common basis set using information on how the basis vectors are used
Develop ER that is basis set independent
COMMON BASIS SET
Use filter values ($f_i$) in eigenvector basis set V: $\hat{b} = \sum_i f_i \frac{u_i^T y}{\sigma_i} v_i$
fER = $\sum_i f_i$ (Gilliam, et al., Inverse Problems, 6 (1990) 725)
fER = $\mathrm{tr}(X X^{+}) = \mathrm{tr}(H)$ where $\hat{y} = Hy$ (Hansen, Rank-Deficient and Discrete Ill-Posed Problems, SIAM, (1998))
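For RR the two fER definitions can be compared directly. The sketch below assumes the RR filter form $f_i = \sigma_i^2/(\sigma_i^2 + \lambda)$ and the hat matrix $H = X(X^T X + \lambda I)^{-1} X^T$. The data, ridge value, and function names are illustrative.

```python
import numpy as np

def ridge_hat_matrix(X, lam):
    """H such that y_hat = H y for ridge regression with ridge value lam."""
    p = X.shape[1]
    return X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)

def f_er_from_filters(X, lam):
    """fER as the sum of RR filter values."""
    s = np.linalg.svd(X, compute_uv=False)
    return np.sum(s**2 / (s**2 + lam))

def rmsec(y, y_hat, df):
    """Root mean square error of calibration with df fitting degrees of freedom."""
    return np.sqrt(np.sum((y_hat - y) ** 2) / (len(y) - df))

# Synthetic data, for illustration only
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 8))
y = X @ rng.normal(size=8) + 0.2 * rng.normal(size=30)

lam = 5.0  # hypothetical ridge value
H = ridge_hat_matrix(X, lam)
fer_trace = np.trace(H)
fer_filters = f_er_from_filters(X, lam)
rmsec_rr = rmsec(y, H @ y, df=fer_filters)  # df = fER in the RMSEC denominator
```

With `lam = 0` the filter sum equals the rank of X, so fER reduces to the MLR df = p.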
BASIS SET INDEPENDENT
$\hat{h}_i$ = change in the fitted value $\hat{y}_i$ depending on the change in the observed value $y_i$
The larger the $\hat{h}_i$, the more $\hat{y}_i$ will change if $y_i$ changes (fluctuations around the expected value due to random noise)
Add normally distributed noise δ to y N times
Obtain $\hat{y}$ vectors from models with perturbed y
Calculate $\hat{h}_i$ for the ith sample (sensitivity of a fitted value to perturbation in the respective observed value) as the regression slope to: $\hat{y}_i = \alpha + \hat{h}_i \delta_{n,i}$, n = 1, ..., N
GDF ER = $\sum_{i=1}^{m} \hat{h}_i$
Ye, Journal American Statistical Association, 93 (1998) 120
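The perturbation steps above can be sketched as a small Monte Carlo routine. RR is used here only because its hat matrix gives a reference value to compare against. The noise level, number of perturbations, data, and function names are all assumptions of this example.

```python
import numpy as np

def ridge_fit_predict(X, y, lam):
    """Fitted values from ridge regression with ridge value lam."""
    p = X.shape[1]
    b = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    return X @ b

def gdf_effective_rank(fit_predict, y, n_perturb=2000, noise_sd=0.1, seed=0):
    """Sum over samples of the slope of fitted value against added noise."""
    rng = np.random.default_rng(seed)
    m = len(y)
    deltas = rng.normal(scale=noise_sd, size=(n_perturb, m))   # noise added to y
    fitted = np.array([fit_predict(y + d) for d in deltas])    # shape (N, m)
    h = np.empty(m)
    for i in range(m):
        # regression slope (with intercept) of fitted_i on delta_i
        h[i] = np.polyfit(deltas[:, i], fitted[:, i], 1)[0]
    return h.sum(), h

# Synthetic data, for illustration only
rng = np.random.default_rng(3)
X = rng.normal(size=(30, 5))
y = X @ rng.normal(size=5) + 0.2 * rng.normal(size=30)
lam = 3.0  # hypothetical ridge value

er_gdf, h = gdf_effective_rank(lambda yy: ridge_fit_predict(X, yy, lam), y)

# Reference for a linear method: trace of the hat matrix
trace_H = np.trace(X @ np.linalg.solve(X.T @ X + lam * np.eye(5), X.T))
```

For a linear method such as RR the estimate should fall close to the trace of the hat matrix. The routine only needs a fit-and-predict function, so it does not depend on a basis set.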
BASIS SET INDEPENDENT
Van der Voet, Journal Chemometrics, 13 (1999) 111
VDV ER = $m \left( 1 - \sqrt{\frac{\mathrm{SSE}/m}{\mathrm{RMSECV}^2}} \right)$
VDV ER is based on error estimates which themselves contain error
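A minimal sketch of this error-based ER is given below. It assumes the expression $m(1 - \sqrt{(\mathrm{SSE}/m)/\mathrm{RMSECV}^2})$, with RMSECV taken from leave-one-out cross-validation of an RR model. The choice of cross-validation scheme, the data, and the function names are assumptions of the example.

```python
import numpy as np

def ridge_coef(X, y, lam):
    """Ridge regression vector for ridge value lam."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def vdv_effective_rank(X, y, lam):
    """ER from the ratio of the calibration error to the cross-validation error."""
    m = len(y)
    sse = np.sum((X @ ridge_coef(X, y, lam) - y) ** 2)   # calibration SSE
    press = 0.0
    for i in range(m):
        keep = np.arange(m) != i                          # leave sample i out
        b_i = ridge_coef(X[keep], y[keep], lam)
        press += (X[i] @ b_i - y[i]) ** 2
    msecv = press / m                                     # RMSECV squared
    return m * (1.0 - np.sqrt((sse / m) / msecv))

# Synthetic data, for illustration only
rng = np.random.default_rng(4)
X = rng.normal(size=(25, 6))
y = X @ rng.normal(size=6) + 0.3 * rng.normal(size=25)

er_vdv = vdv_effective_rank(X, y, lam=2.0)  # hypothetical ridge value
```

Because both error estimates are noisy, the result varies with the particular calibration set.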
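BASIS SET INDEPENDENT
Know $\hat{b} = V\beta = T\delta = I\gamma$ for eigenvector basis set V, PLS basis set T, and std. basis set I with β, δ, and γ being respective weight vectors for a model in that basis set
Eigenvector basis set: $\|\beta\|_2^2 = \sum \beta_i^2 = \|\hat{b}\|_2^2$
PLS basis set: $\|\delta\|_2^2 = \sum \delta_i^2 = \|\hat{b}\|_2^2$
Std. basis set: $\|\gamma\|_2^2 = \sum \gamma_i^2 = \|\hat{b}\|_2^2$
$\mathrm{ER}_{\|\hat{b}\|} = \mathrm{rank}(X) \frac{\|\hat{b}\|_2^2}{\|\hat{b}_{LS}\|_2^2}$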
DATA SETS
CARBONIC ANHYDRASE (CA) INHIBITORS: CA contributes to production of eye humor, which with excess secretion causes permanent damage and diseases (macular edema and open-angle glaucoma).
142 compounds assayed for inhibition of CA isozymes CA I, CA II, & CA IV. Inhibition values Log(Ki) modeled with 63 (full) & 8 (subset) descriptors deemed best for an ANN (Mattioni & Jurs, J. Chem. Inf. Comput. Sci., 42 (2002) 94)
DIHYDROFOLATE REDUCTASE (DHFR) INHIBITORS: inhibition of DHFR important in combating diseases from pathogens Pneumocystis carinii (pc) and Toxoplasma gondii (tg) in unhealthy immune systems
334, 320, & 340 compounds assayed for inhibition of (pc) DHFR, (tg) DHFR, & mammalian standard rlDHFR. Log of 50% inhibition concentration values (IC50) modeled with 84, 83, & 84 (full) & 10 (subset) descriptors deemed best for an ANN (Mattioni & Jurs, J. Mol. Graphics Modeling, 21 (2002) 391)
CA IV (FULL): PLS & PCR RMSEC (df = d) AGAINST d
[Figure panels: PCR (df = d = fER), PLS (df = d), & PLS (df = fER); PCR & PLS (df = fER) against fER; PCR, PLS, & RR (df = fER) against fER]
CA IV (FULL): PLS & PCR RMSEV AGAINST d
[Figure panels: PLS & PCR against fER; PLS, PCR, & RR against fER]
BIAS/VARIANCE CONSIDERATION
[Figure: prediction error against model complexity, with bias and variance curves]
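GENERAL TIKHONOV REGULARIZATION
$\min \left( \|Xb - y\|_2^2 + \lambda \|Lb\|_2^2 \right)$
$\hat{b} = (X^T X + \lambda L^T L)^{-1} X^T y$
λ is the meta-parameter that must be optimized
L is a matrix of values, usually a derivative operator (Tikhonov, Soviet Math. Dokl., 4 (1963) 1035)
L can be the spectral error covariance matrix for removal of undesired spectral variation (wavelength selection) (Kalivas, Anal. Chim. Acta, 505 (2004) 9)
First derivative operator, (p − 1) × p: $L_1 = \begin{pmatrix} 1 & -1 & & \\ & \ddots & \ddots & \\ & & 1 & -1 \end{pmatrix}$
Second derivative operator, (p − 2) × p: $L_2 = \begin{pmatrix} 1 & -2 & 1 & & \\ & \ddots & \ddots & \ddots & \\ & & 1 & -2 & 1 \end{pmatrix}$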
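As an illustration of Tikhonov regularization with derivative operators, the sketch below builds first and second difference matrices and solves $(X^T X + \lambda L^T L)\hat{b} = X^T y$. The sign convention of the rows, the unsquared λ, the data, and the function names are assumptions of this example.

```python
import numpy as np

def derivative_operator(p, order):
    """Finite-difference operator with p columns.

    order=1 gives rows (1, -1); order=2 gives rows (1, -2, 1).
    """
    L = np.eye(p)
    for _ in range(order):
        L = L[:-1] - L[1:]   # difference of neighboring rows
    return L

def tikhonov(X, y, L, lam):
    """General-form Tikhonov regression vector."""
    return np.linalg.solve(X.T @ X + lam * (L.T @ L), X.T @ y)

# Synthetic data, for illustration only
rng = np.random.default_rng(5)
m, p = 25, 12
X = rng.normal(size=(m, p))
y = X @ rng.normal(size=p) + 0.2 * rng.normal(size=m)

L1 = derivative_operator(p, 1)   # (p - 1) x p
L2 = derivative_operator(p, 2)   # (p - 2) x p

b_rough = tikhonov(X, y, L2, lam=0.01)    # weak smoothing
b_smooth = tikhonov(X, y, L2, lam=100.0)  # strong smoothing
```

A larger λ penalizes curvature in the regression vector more heavily, so $\|L_2 \hat{b}\|_2$ shrinks as λ grows.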
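STANDARDIZED TIKHONOV REGULARIZATION
Hansen, Rank-Deficient and Discrete Ill-Posed Problems, SIAM, (1998)
General form: $\min \left( \|Xb - y\|_2^2 + \lambda \|Lb\|_2^2 \right)$
Standard form: $\min \left( \|\bar{X}\bar{b} - \bar{y}\|_2^2 + \lambda \|\bar{b}\|_2^2 \right)$ where $\bar{X}$, $\bar{b}$, $\bar{y}$ are X, b, y in std. form
$\hat{\bar{b}} = (\bar{X}^T \bar{X} + \lambda I)^{-1} \bar{X}^T \bar{y}$; back-transform $\hat{\bar{b}}$ to $\hat{b}$ in general form
$\|\hat{\bar{b}}\|_2^2 = \|L\hat{b}\|_2^2$ and $\|\bar{X}\hat{\bar{b}} - \bar{y}\|_2^2 = \|X\hat{b} - y\|_2^2$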
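STANDARDIZED TIKHONOV REGULARIZATION
Simple case: L is square and invertible
$\bar{y} = y$, $\bar{X} = X L^{-1}$, and $\bar{b} = Lb$
$\hat{\bar{b}} = (\bar{X}^T \bar{X} + \lambda I)^{-1} \bar{X}^T \bar{y}$
$\hat{b} = L^{-1} \hat{\bar{b}}$
L = I gives RR: $\hat{b} = (X^T X + \lambda I)^{-1} X^T y$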
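The simple case can be verified numerically. This sketch assumes $\bar{X} = X L^{-1}$, $\bar{y} = y$, $\bar{b} = Lb$ and compares the back-transformed standard-form solution with the general-form solution $(X^T X + \lambda L^T L)^{-1} X^T y$. The matrix L below is an arbitrary invertible example, not one taken from the source.

```python
import numpy as np

# Synthetic data, for illustration only
rng = np.random.default_rng(6)
m, p = 20, 6
X = rng.normal(size=(m, p))
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=m)

# Arbitrary square, invertible L (unit upper triangular), chosen for the example
L = np.eye(p) + 0.3 * np.triu(np.ones((p, p)), k=1)
lam = 1.5  # hypothetical meta-parameter value

# Standard form: X_bar = X L^{-1}, y_bar = y
X_bar = X @ np.linalg.inv(L)
b_bar = np.linalg.solve(X_bar.T @ X_bar + lam * np.eye(p), X_bar.T @ y)

# Back-transform to general form: b_hat = L^{-1} b_bar
b_back = np.linalg.solve(L, b_bar)

# General-form solution computed directly
b_general = np.linalg.solve(X.T @ X + lam * (L.T @ L), X.T @ y)
```

The two regression vectors should agree, and the norm of `b_bar` should equal the norm of `L @ b_back`, so the standard-form problem can be handled with any RR routine.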
HARMONIOUS (PARETO) PLOT
For graphical characterization of Tikhonov regularization, plot a variance indicator against a bias criterion to reduce the chance of overfitting or underfitting
Curve will have an L-shape (L-curve)
Ideal model at corner with the proper bias/variance trade-off (harmonious model)
PCR and PLS: best number of factors
RR: best ridge value
etc.
Intra- and inter-model comparison
Lawson, et al., Solving Least-Squares Problems, Prentice-Hall, (1974)
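The points of a harmonious curve for RR can be generated as in the sketch below, which pairs a bias criterion (here a root mean square calibration error with m in the denominator, a simplification) with a variance indicator ($\|\hat{b}\|_2$) over a grid of ridge values. The grid, data, and function name are assumptions of the example, and corner selection is left out.

```python
import numpy as np

def harmonious_curve(X, y, ridge_values):
    """Return rows of (calibration RMSE, ||b||_2), one per ridge value."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    uty = U.T @ y
    points = []
    for lam in ridge_values:
        f = s**2 / (s**2 + lam)                 # RR filter values (assumed form)
        b = Vt.T @ (f * uty / s)
        rmse = np.sqrt(np.mean((X @ b - y) ** 2))
        points.append((rmse, np.linalg.norm(b)))
    return np.array(points)

# Synthetic data, for illustration only
rng = np.random.default_rng(7)
X = rng.normal(size=(30, 10))
y = X @ rng.normal(size=10) + 0.5 * rng.normal(size=30)

ridge_values = np.logspace(-2, 3, 20)   # hypothetical grid, increasing
curve = harmonious_curve(X, y, ridge_values)
```

As the ridge value increases, the error column rises while the norm column falls, which traces out the L-shape. Points from PCR or PLS models can be appended to the same array for an overlay.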
EXAMPLE PLOT
[Figure: L-curve of $\|b\|_2$ against $\|\hat{y} - y\|_2$, with underfitting and overfitting regions and the best model at the corner]
VARIANCE EXPRESSIONS
Faber, et al., Journal of Chemometrics, 11 (1997) 181
Lorber, et al., Journal of Chemometrics, 2 (1988) 93
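$V(\hat{y}_{unk}) = \left( s_e^2 + s_y^2 + \|\hat{b}\|_2^2 s_X^2 \right) \left( \frac{1}{m} + h_{unk} \right) + s_e^2 + \|\hat{b}\|_2^2 s_{x_{unk}}^2$
$s_{eff}^2 = s_e^2 + s_y^2 + \|\hat{b}\|_2^2 s_X^2 = \sum_i (\hat{y}_i - y_i)^2 / df$
$V(\hat{y}_{unk}) = s^2 h_{unk} + \|\hat{b}\|_2^2 s_{x_{unk}}^2$
$s^2 = \sum_i (\hat{y}_i - y_i)^2 / df$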
EXPERIMENTAL APPROACH
Intra- and inter-model comparison of RR, PLS, and PCR with QSAR data for the most harmonious and parsimonious models
LOOCV tends to overfit
Use mean values from LMOCV
Data sets randomly split 300 times with v validation and m − v calibration samples where v ≈ 0.6m
Shao, J., J. Am. Statist. Assoc., 88 (1993) 486-494
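The leave-multiple-out splitting scheme can be sketched as follows for an RR model. The 300 splits and v ≈ 0.6m come from the slide. The absence of centering or scaling, the ridge value, the data, and the function names are assumptions of this example.

```python
import numpy as np

def ridge_coef(X, y, lam):
    """Ridge regression vector for ridge value lam."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def lmocv_mean_rmsev(X, y, lam, n_splits=300, val_fraction=0.6, seed=0):
    """Mean RMSEV over random splits into v validation and m - v calibration samples."""
    rng = np.random.default_rng(seed)
    m = len(y)
    v = int(round(val_fraction * m))
    rmsev = np.empty(n_splits)
    for k in range(n_splits):
        perm = rng.permutation(m)
        val, cal = perm[:v], perm[v:]
        b = ridge_coef(X[cal], y[cal], lam)          # fit on calibration samples only
        rmsev[k] = np.sqrt(np.mean((X[val] @ b - y[val]) ** 2))
    return rmsev.mean()

# Synthetic data, for illustration only
rng = np.random.default_rng(8)
X = rng.normal(size=(50, 5))
y = X @ rng.normal(size=5) + 0.5 * rng.normal(size=50)

mean_rmsev = lmocv_mean_rmsev(X, y, lam=1.0)  # hypothetical ridge value
```

Repeating the call over a grid of ridge values (or numbers of factors for PCR and PLS) gives the mean RMSEV values used for comparison.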
CA IV HARMONIOUS RR(750), PLS(6), AND PCR(8) PLOTS FOR 63 DESCRIPTORS
[Figure: harmonious plots with axes $\|b\|_2$, RMSEC, and RMSEV; ridge value range: 45 - 7050]
GENERAL APPROACH TO OPTIMIZATION OF PARETO CURVE
$\hat{b} = V\beta$ for basis set V and weights β, or any basis set with respective weights
Use an optimization algorithm (simplex, simulated annealing, etc.) adjusting weight values in β while minimizing the distance to target values of variance and bias measures
Models converge to RR models
CA IV MODEL VALUES FOR 63 DESCRIPTORS
Model^a     ||b||_2   RMSEC   RMSEV   R^2_cal   R^2_val   ER
RR (750)    0.0721    0.667   0.774   0.870     0.751     7.97
PLS (6)     0.0855    0.679   0.791   0.855     0.747     8.08
PCR (8)     0.0948    0.679   0.795   0.853     0.747     8
^a Parentheses contain ridge values for RR and the number of factors for PLS and PCR.
CA IV HARMONIOUS RR(6), PLS(4), AND PCR(5) PLOTS FOR 8 DESCRIPTORS
[Figure: harmonious plots with axes $\|b\|_2$, RMSEC, and RMSEV; ridge value range: 0.2 - 126]
CA IV MODEL VALUES FOR 8 DESCRIPTORS
Model^a     ||b||_2   RMSEC   RMSEV   R^2_cal   R^2_val   ER
RR (6)      0.460     0.593   0.671   0.880     0.816     5.45
PLS (4)     0.463     0.602   0.676   0.875     0.813     5.20
PCR (5)     0.464     0.605   0.677   0.879     0.814     5
MLR         3.553     0.584   0.700   0.894     0.802     8
^a Parentheses contain ridge values for RR and the number of factors for PLS and PCR.
CA IV MODEL VALUES
Model^a     No. of Descriptors   ||b||_2   RMSEC   RMSEV   R^2_cal   R^2_val   ER
RR (750)    63                   0.0721    0.667   0.774   0.870     0.751     7.97
PLS (6)     63                   0.0855    0.679   0.791   0.855     0.747     8.08
PCR (8)     63                   0.0948    0.679   0.795   0.853     0.747     8
RR (6)      8                    0.460     0.593   0.671   0.880     0.816     5.45
PLS (4)     8                    0.463     0.602   0.676   0.875     0.813     5.20
PCR (5)     8                    0.464     0.605   0.677   0.879     0.814     5
MLR         8                    3.553     0.584   0.700   0.894     0.802     8
^a Parentheses contain ridge values for RR and the number of factors for PLS and PCR.
CA IV HARMONY/PARSIMONY PLOTS: PLS(6) AND PCR(8) FOR 63 DESCRIPTORS
[Figure: axes RMSEV, fER, and $\|b\|_2$]
CA IV HARMONY/PARSIMONY PLOTS: PLS(6) 63 DESCRIPTORS AND PLS(4) 8 DESCRIPTORS
[Figure: axes RMSEV, fER, and $\|b\|_2$]
CA I MODEL VALUES
Model^a     No. of Descriptors   ||b||_2   RMSEC   RMSEV   R^2_cal   R^2_val   ER
RR (670)    63                   0.0667    0.566   0.695   0.868     0.717     8.09
PLS (6)     63                   0.0767    0.577   0.703   0.851     0.720     8.03
PCR (7)     63                   0.0742    0.591   0.705   0.835     0.717     7
RR (22)     8                    0.209     0.624   0.699   0.792     0.720     3.54
PLS (3)     8                    0.227     0.641   0.705   0.787     0.717     3.65
PCR (4)     8                    0.267     0.641   0.711   0.790     0.712     4
MLR         8                    9.722     0.532   0.657   0.878     0.762     8
^a Parentheses contain ridge values for RR and the number of factors for PLS and PCR.
tgDHFR MODEL VALUES USING 10 DESCRIPTORS
Model^a     ||b||_2   RMSEC   RMSEV   R^2_cal   R^2_val   ER
RR (11)     0.597     0.857   0.919   0.677     0.580     6.51
PLS (5)     0.607     0.852   0.929   0.657     0.578     6.62
PCR (6)     0.646     0.867   0.927   0.657     0.581     6
MLR         8.664     0.765   0.902   0.766     0.634     10
^a Parentheses contain ridge values for RR and the number of factors for PLS and PCR.
pcDHFR MODEL VALUES USING 10 DESCRIPTORS
Model^a     ||b||_2   RMSEC   RMSEV   R^2_cal   R^2_val   ER
RR (17)     0.421     0.915   0.979   0.605     0.478     5.89
PLS (5)     0.500     0.916   0.996   0.603     0.478     6.50
PCR (6)     0.500     0.916   0.993   0.600     0.478     6
MLR         679       0.870   1.020   0.603     0.495     10
^a Parentheses contain ridge values for RR and the number of factors for PLS and PCR.
SUMMARY
ER necessary for fair intra- and inter-model comparison
RMSEC and RMSEV plot overlays are possible for different modeling methods
Harmonious plots allow proper determination of meta-parameters and validation
Fair intra- and inter-model comparisons are possible (plot overlays are possible)
In optimal model region of harmonious curve, differences in models are small
ER assesses the true nature of variable selection for improved parsimony
Harmony/parsimony compromise
FUTURE WORK
Use ER with multiple variance and bias indicators for better characterization of the harmony/parsimony tradeoff for intra- and inter-model comparison with full and/or variable subsets
Include variable selection in the modeling process: $\min \left( \|Xb - y\|_2^2 + \lambda \|Lb\|_1 \right)$
Include L = second derivative operator in Tikhonov regularization: a form of RR with smoothing
Smooth spectral noise and temperature influences
Use standardization approach with PCR and PLS
PCR: from $\min \|X_d b - y\|_2^2$ to $\min \|\bar{X}_d \bar{b} - \bar{y}\|_2^2$
PLS: from $\min \|Xb - y\|_2^2$ subject to $b \in K_d(X^T X, X^T y)$ to $\min \|\bar{X}\bar{b} - \bar{y}\|_2^2$ subject to $\bar{b} \in K_d(\bar{X}^T \bar{X}, \bar{X}^T \bar{y})$
where $K_d(X^T X, X^T y) = \mathrm{span}\{X^T y, (X^T X) X^T y, \ldots, (X^T X)^{d-1} X^T y\}$
ACKNOWLEDGEMENTS
Forrest Stout and Heather Seipel
Peter Jurs and Brian Mattioni provided QSAR data sets
National Science Foundation
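STANDARDIZATION PROCESS
For L (s × p) with rank(L) = s < p, obtain a QR factorization of $L^T$: $L^T = KR = \begin{pmatrix} K_s & K_o \end{pmatrix} \begin{pmatrix} R_s \\ 0 \end{pmatrix}$
Form $XK_o$ (m × (p − s)) and perform a QR factorization of $XK_o$: $XK_o = HT = \begin{pmatrix} H_o & H_q \end{pmatrix} \begin{pmatrix} T_o \\ 0 \end{pmatrix}$
Compute standardized data: $\bar{X} = H_q^T X L^{+} = H_q^T X K_s R_s^{-T}$ and $\bar{y} = H_q^T y$
Perform back-transformation: $\hat{b} = L^{+} \hat{\bar{b}} + K_o T_o^{-1} H_o^T (y - X L^{+} \hat{\bar{b}})$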
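The standardization steps can be sketched with two full QR factorizations. The sketch assumes $L^{+} = K_s R_s^{-T}$, $\bar{X} = H_q^T X L^{+}$, $\bar{y} = H_q^T y$, and the back-transformation $\hat{b} = L^{+}\hat{\bar{b}} + K_o T_o^{-1} H_o^T (y - X L^{+}\hat{\bar{b}})$. The first-derivative L, the data, the unsquared λ, and the function names are assumptions of this example.

```python
import numpy as np

def standardize(X, y, L):
    """Return standard-form data and a back-transformation for an s x p operator L."""
    s, p = L.shape
    K, R = np.linalg.qr(L.T, mode="complete")        # L^T = K R
    K_s, K_o, R_s = K[:, :s], K[:, s:], R[:s, :]
    L_pinv = K_s @ np.linalg.inv(R_s).T              # L^+ = K_s R_s^{-T}
    H, T = np.linalg.qr(X @ K_o, mode="complete")    # X K_o = H T
    n_o = p - s
    H_o, H_q, T_o = H[:, :n_o], H[:, n_o:], T[:n_o, :]
    X_bar = H_q.T @ X @ L_pinv
    y_bar = H_q.T @ y

    def back_transform(b_bar):
        resid = y - X @ (L_pinv @ b_bar)
        return L_pinv @ b_bar + K_o @ np.linalg.solve(T_o, H_o.T @ resid)

    return X_bar, y_bar, back_transform

# Synthetic data, for illustration only
rng = np.random.default_rng(9)
m, p = 20, 6
X = rng.normal(size=(m, p))
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=m)
L = np.eye(p)[:-1] - np.eye(p)[1:]                   # first-derivative operator, s = p - 1
lam = 0.8                                            # hypothetical meta-parameter value

X_bar, y_bar, back_transform = standardize(X, y, L)
s_dim = L.shape[0]
b_bar = np.linalg.solve(X_bar.T @ X_bar + lam * np.eye(s_dim), X_bar.T @ y_bar)
b_std = back_transform(b_bar)

# General-form solution computed directly, for comparison
b_direct = np.linalg.solve(X.T @ X + lam * (L.T @ L), X.T @ y)
```

The back-transformed vector should match the direct general-form solution, and the norm of `b_bar` should equal the norm of `L @ b_std`. The sketch requires m ≥ p − s and $XK_o$ of full column rank.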