Unsupervised Forward Selection
A data reduction algorithm for use with very large data sets
David Whitley†, Martyn Ford† and David Livingstone†‡
†Centre for Molecular Design, University of Portsmouth; ‡ChemQuest
Outline
• Variable selection issues
• Pre-processing strategy
• Dealing with multicollinearity
• Unsupervised forward selection
• Model selection strategy
• Applications
Variable Selection Issues
• Relevance
  – statistically significant correlation with the response
  – non-small variance
• Redundancy: linear dependence, Σ aᵢvᵢ = 0
  – some variables have no unique information
• Multicollinearity: near linear dependence, Σ aᵢvᵢ ≈ 0
  – some variables have little unique information
Pre-processing Strategy
• Identify variables with a significant correlation with the response
• Remove variables with small variance
• Remove variables with no unique information
• Identify a set of variables on which to construct a model
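A minimal sketch of the first three filtering steps, assuming standardised numeric descriptors; the threshold names (`r_crit`, `var_min`) and their values are illustrative stand-ins, not from the original:

```python
import numpy as np

def preprocess(X, y, r_crit=0.45, var_min=1.0):
    """Sketch of the pre-processing filters: keep descriptors whose
    correlation with the response exceeds a critical value (a stand-in
    for a 5%-level significance test) and whose variance is not small.
    Thresholds are illustrative."""
    r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    variance = X.var(axis=0, ddof=1)
    keep = (np.abs(r) >= r_crit) & (variance >= var_min)
    return np.flatnonzero(keep)
```

The remaining redundancy among the surviving variables is then handled by UFS.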
Effect of Multicollinearity
Build regression models of the form

  yᵢ = β₀ + β₁x₁ᵢ + β₂x₂ᵢ + β₃x₃ᵢ + β₄x₄ᵢ + β₅x₅ᵢ + eᵢ

where

  x₅ = x₁ + δzᵢ

and x₁–x₄, y, zᵢ and eᵢ are random N(0,1). Increasing δ reduces the collinearity between x₅ and x₁.
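This setup can be sketched directly, writing δ for the coefficient of zᵢ that controls the collinearity:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
x1 = rng.normal(size=n)   # x1 ~ N(0,1) (x2-x4, y, e not needed here)
z = rng.normal(size=n)    # z ~ N(0,1)

def x5(delta):
    # x5 = x1 + delta * z: the larger delta, the weaker the collinearity
    return x1 + delta * z

for delta in (0.1, 1.0, 10.0):
    r = np.corrcoef(x1, x5(delta))[0, 1]
    print(f"delta = {delta:>4}: corr(x1, x5) = {r:.3f}")
```

In theory corr(x₁, x₅) = 1/√(1 + δ²), so the printed sample correlations fall from roughly 0.99 towards 0.1 as δ grows.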
Effect of Multicollinearity
[Plot: Q² of the fitted models]
Dealing with Multicollinearity
• Examine pair-wise correlations between variables, and remove one from each pair with high correlation
• Corchop (Livingstone & Rahr, 1989) aims to remove the smallest number of variables while breaking the largest number of pair-wise collinearities
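The pair-wise approach can be sketched as a greedy filter (this is the naive strategy Corchop improves on, not the Corchop algorithm itself; the threshold is illustrative):

```python
import numpy as np

def drop_correlated_pairs(X, r2max=0.8):
    """Walk the columns in order; keep a column only if its squared
    correlation with every already-kept column stays below r2max."""
    R2 = np.corrcoef(X, rowvar=False) ** 2
    kept = []
    for j in range(X.shape[1]):
        if all(R2[j, k] < r2max for k in kept):
            kept.append(j)
    return kept
```

The weakness of this greedy pass is that the variables it keeps depend on column order, which is why Corchop instead chooses which member of each collinear pair to drop so that as few variables as possible are removed.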
Unsupervised Forward Selection

1. Select the first two variables: the pair with the smallest pair-wise correlation coefficient
2. Reject variables whose squared pair-wise correlation with the selected columns exceeds rsqmax
3. Select as the next variable the one with the smallest squared multiple correlation coefficient with those previously selected
4. Reject variables with squared multiple correlation coefficients greater than rsqmax
5. Repeat 3–4 until all variables are selected or rejected
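The five steps can be sketched as follows. This is a simplified reading, not the authors' reference implementation: squared multiple correlations are computed by least squares, and step 2's pair-wise rejection is folded into the same multiple-correlation test (which rejects at least as strongly):

```python
import numpy as np

def ufs(X, rsqmax=0.99):
    """Unsupervised forward selection sketch. X: (samples, variables).
    Returns the indices of the selected columns."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)        # standardise columns
    p = Xs.shape[1]
    R2 = np.corrcoef(Xs, rowvar=False) ** 2
    np.fill_diagonal(R2, np.inf)
    # Step 1: start from the least-correlated pair of variables
    i, j = np.unravel_index(np.argmin(R2), R2.shape)
    selected = [int(i), int(j)]
    candidates = [k for k in range(p) if k not in selected]
    while candidates:
        S = Xs[:, selected]
        # Squared multiple correlation of each candidate with the selected set
        r2 = []
        for k in candidates:
            beta, *_ = np.linalg.lstsq(S, Xs[:, k], rcond=None)
            resid = Xs[:, k] - S @ beta
            r2.append(1.0 - (resid @ resid) / (Xs[:, k] @ Xs[:, k]))
        r2 = np.asarray(r2)
        # Steps 2/4: reject candidates the selected set already explains
        keep = r2 <= rsqmax
        candidates = [k for k, ok in zip(candidates, keep) if ok]
        r2 = r2[keep]
        if not candidates:
            break
        # Step 3: select the least-explained candidate; step 5: repeat
        selected.append(candidates.pop(int(np.argmin(r2))))
    return selected
```

An exact duplicate of a selected column has squared multiple correlation 1 with the selected set, so it is always rejected, which is the redundancy-elimination property the slides describe.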
Continuum Regression

• A regression procedure with the generalized criterion function

  F = (c′X′y)² (c′X′Xc)^((2α−1)/(1−α))

• Varying the continuous parameter 0 ≤ α ≤ 1.5 adjusts the balance between the covariance of the response with the descriptors and the variance of the descriptors, so that
  – α = 0 is equivalent to ordinary least squares
  – α = 0.5 is equivalent to partial least squares
  – α = 1.0 is equivalent to principal components regression
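As a sanity check on the special cases, the criterion can be written with exponent (2α − 1)/(1 − α) on the variance term (one parameterisation consistent with the α values listed above; assumed here, and undefined at α = 1, where the exponent diverges). At α = 0 the maximiser over unit vectors c is the OLS direction (X′X)⁻¹X′y, and at α = 0.5 it is the PLS weight X′y:

```python
import numpy as np

def cr_criterion(c, X, y, alpha):
    # F = (c'X'y)^2 * (c'X'Xc)^((2*alpha - 1)/(1 - alpha)), c normalised;
    # exponent parameterisation assumed, valid for alpha != 1
    c = c / np.linalg.norm(c)
    cov2 = float(c @ X.T @ y) ** 2
    var = float(c @ X.T @ X @ c)
    return cov2 * var ** ((2 * alpha - 1) / (1 - alpha))

rng = np.random.default_rng(7)
X = rng.normal(size=(40, 5)); X = X - X.mean(axis=0)
y = rng.normal(size=40); y = y - y.mean()

ols_dir = np.linalg.solve(X.T @ X, X.T @ y)   # global maximiser at alpha = 0
pls_dir = X.T @ y                             # global maximiser at alpha = 0.5
for _ in range(200):
    c = rng.normal(size=5)
    assert cr_criterion(ols_dir, X, y, 0.0) >= cr_criterion(c, X, y, 0.0) - 1e-9
    assert cr_criterion(pls_dir, X, y, 0.5) >= cr_criterion(c, X, y, 0.5) - 1e-9
```

At α = 0 the criterion reduces to cov²/var, whose maximiser is the least-squares direction; at α = 0.5 the variance term drops out entirely, leaving the PLS covariance criterion.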
Model Selection Strategy
• For α = 0.0, 0.1, …, 1.5, build a CR model for the set of variables selected by UFS with rsqmax = 0.1, 0.2, …, 0.9, 0.99
• Select the model with the rsqmax and α maximizing Q² (leave-one-out cross-validated R²)
  – Apply n-fold cross-validation to check predictive ability
  – Apply a randomization test (1000 permutations of the response scores) to guard against chance correlation
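The selection statistic can be sketched for an ordinary least-squares stand-in model (the real procedure scores CR models on the UFS-selected variables; this only shows the leave-one-out Q² computation):

```python
import numpy as np

def loo_q2(X, y):
    """Leave-one-out cross-validated R^2 (Q^2): refit the model n times,
    each time predicting the single held-out observation."""
    n = len(y)
    press = 0.0
    for i in range(n):
        mask = np.arange(n) != i
        beta, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
        press += float(y[i] - X[i] @ beta) ** 2
    return 1.0 - press / float(((y - y.mean()) ** 2).sum())
```

A grid search would then evaluate this Q² at every (rsqmax, alpha) grid point and keep the maximiser, confirming the winner afterwards with n-fold cross-validation and a permutation test.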
Pyrethroid Data Set
• 70 physicochemical descriptors to predict killing activity (KA) of 19 pyrethroid insecticides
• Only 6 descriptors are correlated with KA at the 5% level
• Optimal models
  – 4-variable, 2-component model with R² = 0.775, Q² = 0.773, obtained when rsqmax = 0.7, α = 1.2
  – 3-variable, 1-component model with R² = 0.81, Q² = 0.76, obtained when rsqmax = 0.6, α = 0.2
Optimal Model I
• Standard errors are bootstrap estimates based on 5000 bootstraps
• Randomization test tail probabilities below 0.0003 for fit and 0.0071 for prediction
[Regression equation: KA as a function of the four selected descriptors (two A terms, MIZ and DVX), coefficients with bootstrap standard errors in parentheses]
Optimal Model II
• Standard errors are bootstrap estimates based on 5000 bootstraps
• Randomization test tail probabilities below 0.0001 for fit and 0.0052 for prediction
[Regression equation: KA as a function of A, MIZ and DVX, coefficients with bootstrap standard errors in parentheses]
N-Fold Cross-Validation
[Plots: 3-variable model and 4-variable model]
Feature Recognition
• Important explanatory variables may not be selected for inclusion in the model
  – force some variables in, then continue the UFS algorithm
• The component loadings for the original variables can be examined to identify variables highly correlated with the components in the model
Loadings for the 1-component pyrethroid model with tail probability < 0.01
variable   loading
A5          0.756
A3          0.723
A8          0.619
NS16       −0.605
DVX        −0.603
ES12       −0.584
MIZ         0.567
Steroid Data Set
• 21 steroid compounds from SYBYL CoMFA tutorial to model binding affinity to human TBG
• Initial data set has 1248 variables with values below 30 kcal/mol
• Removed 858 variables not significantly correlated with response (5% level)
• Removed 367 variables with variance below 1.0 kcal/mol
• Leaving 23 variables to be processed by UFS/CR
Optimal models
• UFS/CR produces a 3-variable, 1-component model with R² = 0.85, Q² = 0.83 at rsqmax = 0.3, α = 0.3
• CoMFA tutorial produces a 5-component model with R² = 0.98, Q² = 0.6
N-Fold Cross-Validation
[Plots: CoMFA tutorial model and UFS/CR model]
Putative Pharmacophore
Selwood Data Set
• 53 descriptors to predict biological activity of 31 antifilarial antimycin analogues
• 12 descriptors are correlated with the response variable at the 5% level
• Optimal models
  – 2-variable, 1-component model with R² = 0.42, Q² = 0.41, obtained when rsqmax = 0.1, α = 1.0
  – 12-variable, 1-component model with R² = 0.85, Q² = 0.5, obtained when rsqmax = 0.99, α = 0.0 (omitting compound M6)
N-Fold Cross-Validation
[Plots: 2-variable model and 12-variable model]
Summary
• Multicollinearity is a potential cause of poor predictive power in regression.
• The UFS algorithm eliminates redundancy and reduces multicollinearity, thus improving the chances of obtaining robust, low-dimensional regression models.
• Chance correlation can be addressed by eliminating variables that are uncorrelated with the response.
Summary
• UFS can be used to adjust the balance between reducing multicollinearity and including relevant information.
• Case studies show that leave-one-out cross-validation should be supplemented by n-fold cross-validation, in order to obtain accurate and precise estimates of predictive ability (Q²).
Acknowledgements
• AstraZeneca
• GlaxoSmithKline
• MSI
• Unilever
BBSRC Cooperation with Industry Project: Improved Mathematical Methods for Drug Design
Reference
D. C. Whitley, M. G. Ford and D. J. Livingstone. Unsupervised forward selection: a method for eliminating redundant variables. J. Chem. Inf. Comput. Sci., 2000, 40, 1160–1168.
UFS software available from: http://www.cmd.port.ac.uk
CR is a component of Paragon (available summer 2001)