Multivariate Analysis
Multivariate Calibration – part 2
Prof. Dr. Anselmo E de Oliveira
anselmo.quimica.ufg.br
anselmo.disciplinas@gmail.com
Linear Latent Variables
• An essential concept in multivariate data analysis is the mathematical combination of several variables into a new variable that has a certain desired property. In chemometrics such a new variable is often called a latent variable
• Linear latent variable:
$$u = b_1 x_1 + b_2 x_2 + \dots + b_m x_m$$
u: score, the value of the linear latent variable
b_j: loadings, coefficients describing the influence of the variables on the score
x_j: variables, features.
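A minimal numerical sketch in R (hypothetical loadings and data, not taken from the lecture) of how a score is computed as a weighted sum of the variables:

# Score of a linear latent variable: u = b1*x1 + b2*x2 + ... + bm*xm
# (hypothetical loadings b and data matrix X, for illustration only)
X <- matrix(rnorm(5 * 3), nrow = 5, ncol = 3)   # 5 objects, 3 variables
b <- c(0.7, 0.5, -0.2)                          # loadings
u <- as.vector(X %*% b)                         # one score per object
u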
Calibration
• Linear models:
$$y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_j x_j + \dots + b_m x_m + e$$
b_0 is called the intercept
b_1 to b_m the regression coefficients
m the number of variables
e the residual (error term)
– Often mean-centered data are used, and then b_0 becomes zero
• This model corresponds to a linear latent variable.
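As a small check of the statement about mean-centering (hypothetical numbers, only to illustrate the idea), fitting the model to centered data yields an intercept of essentially zero:

# With mean-centered x and y the fitted intercept becomes (numerically) zero
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
xc <- x - mean(x)
yc <- y - mean(y)
coef(lm(yc ~ xc))   # intercept ~ 0; the slope is unchanged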
Calibration
• The parameters of a model are estimated from a calibration set (training set) containing the values of the x-variables and y for n samples (objects)
• The resulting model is evaluated with a test set (with known values for x and y)
• Because modeling and prediction of the y-data is a defined aim of the data analysis, this type of data treatment is called supervised learning.
Calibration
• All regression methods aim at the minimization of residuals, for instance minimization of the sum of the squared residuals
• It is essential to focus on minimal prediction errors for new cases – the test set – and not only on the calibration set from which the model has been created
– It is relatively easy to create a model – especially with many variables and possibly nonlinear features – that fits the calibration data very well; however, it may be useless for new cases. This effect of overfitting is a crucial topic in model creation.
Calibration
• Regression can be performed directly with the values of the variables (ordinary least-squares regression, OLS), but in the most powerful methods, such as principal component regression (PCR) and partial least-squares regression (PLS), it is done via a small set of intermediate linear latent variables (the components). This approach has important advantages:
– Data with highly correlated x-variables can be used
– Data sets with more variables than samples can be used
– The complexity of the model can be controlled by the number of components, and thus overfitting can be avoided and maximum prediction performance for test-set data can be approached.
Calibration
• Depending on the type of data, different methods are available:

Number of x-variables | Number of y-variables | Name         | Methods
1                     | 1                     | Simple       | Simple OLS, robust regression
Many                  | 1                     | Multiple     | PLS, PCR, multiple OLS, robust regression, Ridge regression, Lasso regression
Many                  | Many                  | Multivariate | PLS2, Canonical Correlation Analysis (CCA)
Calibration
$$y = 1.418 + 4.423\,x_1 + 4.101\,x_2 - 0.0357\,x_3$$
Calibration
OLS model using only x_1 and x_2:
$$y = b_0 + b_1 x_1 + b_2 x_2$$
Performance of Regression Models
• Any model for prediction makes sense only if appropriate criteria are defined and applied to measure the performance of the model
• For models based on regression, the residuals (prediction errors)
$$e_i = y_i - \hat{y}_i$$
are the basis for the performance measures, with y_i the given (experimental, "true") value and ŷ_i the predicted (modeled) value of an object i
• An often-used performance measure estimates the standard deviation of the prediction errors (standard error of prediction, SEP).
Performance of Regression Models
• Using the same objects for calibration and test should be strictly avoided
• Depending on the size of the data set (the number of objects available) – and on the amount of work invested – different strategies are possible
• The following levels are ordered by typical application to data sets of decreasing size, and also by decreasing reliability of the results.
Performance of Regression Models
1. If data from many objects are available, a split into three sets is best:
i. Training set (ca. 50% of the objects) for creating models
ii. Validation set (ca. 25% of the objects) for optimizing the model to obtain good prediction performance
iii. Test set (prediction set, approximately 25%) for testing the final model to obtain a realistic estimate of the prediction performance for new cases
iv. The three sets are treated separately
v. Applications in chemistry rarely allow this strategy because too few objects are available.
Performance of Regression Models
2. The data are split into a calibration set used for model creation and optimization, and a test set (prediction set) used to obtain a realistic estimate of the prediction performance for new cases
i. The calibration set is divided into a training set and a validation set by cross validation (CV) or bootstrap
ii. First the optimum complexity (for instance the optimum number of PLS components) of the model is estimated, and then a model is built from the whole calibration set applying the found optimum complexity; this model is applied to the test set.
Performance of Regression Models
3. CV or bootstrap is used to split the data set into different calibration sets and test sets
i. A calibration set is used as described in (2) to create an optimized model, and this model is applied to the corresponding test set
ii. In principle, all objects are used in training, validation, and test sets; however, an object is never simultaneously used for model creation and for testing
iii. This strategy (double CV, double bootstrap, or a combination of CV and bootstrap) is applicable to a relatively small number of objects; furthermore, the process can be repeated many times with different random splits, resulting in a high number of test-set-predicted values.
Performance of Regression Models
• Mostly, the split of the objects into training, validation, and test sets is performed by simple random sampling
• More sophisticated procedures – related to experimental design and the theory of sampling – are available
– Kennard-Stone algorithm
• It aims to set up a calibration set that is representative of the population, to cover the x-space as uniformly as possible, and to give more weight to objects outside the center
• This aim is reached by selecting objects with maximum distances (for instance Euclidean distances) in the x-space.
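A minimal sketch of such a maximum-distance selection in R, assuming plain Euclidean distances and a start from the two most distant objects (my own simplified version, not a reference implementation of the Kennard-Stone algorithm):

# Select k objects spread over the x-space by maximizing distances
kennard_stone_like <- function(X, k) {
  D <- as.matrix(dist(X))                                    # Euclidean distance matrix
  sel <- as.vector(which(D == max(D), arr.ind = TRUE)[1, ])  # start: the two most distant objects
  while (length(sel) < k) {
    remaining <- setdiff(seq_len(nrow(X)), sel)
    dmin <- apply(D[remaining, sel, drop = FALSE], 1, min)   # distance to nearest selected object
    sel <- c(sel, remaining[which.max(dmin)])                # add the farthest remaining object
  }
  sel
}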
Performance of Regression Models
– The methods CV and bootstrap are called resampling strategies
• They are intended for small data sets
• They are applied to obtain a reasonably high number of predictions
• The larger the data set and the more reliable the data are, the better the prediction performance can be estimated:
size of the data × reliability of the data × uncertainty of the performance measure = constant.
Overfitting and Underfitting
• The more complex (the "larger") a model is, the better it is capable of fitting the given data (the calibration set)
• The prediction error for the calibration set in general decreases with increasing complexity of the model
– A sufficiently complex model can fit almost any data with almost zero deviations (residuals) between the experimental (true) y and the modeled (predicted) ŷ
– Such models are not necessarily useful for new cases, because they are probably overfitted.
Overfitting and Underfitting
Image source: Dr. Frank Dieterle
Overfitting and Underfitting
Image source: holehouse.org
Overfitting and Underfitting
• Prediction errors for new cases are high for "small" models (underfitting, low complexity, too simple models) but also for overfitted models
• Determination of the optimum complexity of a model is an important but not always easy task, because the minimum of the prediction-error measures for test sets is often not well marked
– In chemometrics, the complexity is typically controlled by the number of PLS or PCA components (latent variables), and the optimum complexity is estimated by CV
– CV or bootstrap allows an estimation of the prediction error for each object of the calibration set at each considered model complexity.
Performance Criteria
• The basis of all performance criteria are the prediction errors (residuals) y_i − ŷ_i
• The classical standard deviation of the prediction errors is widely used as a measure of the spread of the error distribution, and is called the standard error of prediction (SEP), defined by
$$SEP = \sqrt{\frac{1}{z-1}\sum_{i=1}^{z}\left(y_i - \hat{y}_i - bias\right)^2}$$
with
$$bias = \frac{1}{z}\sum_{i=1}^{z}\left(y_i - \hat{y}_i\right)$$
where y_i are the given (experimental, "true") values, ŷ_i are the predicted (modeled) values, and z is the number of predictions.
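A small R helper reflecting these two formulas (a generic function; the names are my own):

# bias and SEP from given values y and predictions y_hat
sep_bias <- function(y, y_hat) {
  e    <- y - y_hat                                   # prediction errors
  bias <- mean(e)                                     # arithmetic mean of the errors
  sep  <- sqrt(sum((e - bias)^2) / (length(e) - 1))   # standard deviation of the errors
  c(bias = bias, SEP = sep)
}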
Performance Criteria
• The bias is the arithmetic mean of the prediction errors and should be near zero
– A systematic error (a nonzero bias) may appear if, for instance, a calibration model is applied to data that have been produced by another instrument
• In the case of a normal distribution, about 95% of the prediction errors are within the tolerance interval ±2·SEP
– The measure SEP and the tolerance interval are given in the units of y.
Performance Criteria
• The standard error of calibration (SEC) is the analog of SEP applied to predictions of the calibration set
• The mean squared error (MSE) is the arithmetic mean of the squared errors
$$MSE = \frac{1}{z}\sum_{i=1}^{z}\left(y_i - \hat{y}_i\right)^2$$
• MSEC refers to results from a calibration set, MSECV to results obtained in CV, and MSEP to results from a prediction/test set
• MSE minus the squared bias gives the squared SEP:
$$SEP^2 = MSE - bias^2$$
Performance Criteria
• The root mean squared error (RMSE) is the square root of the MSE, and can again be given for calibration (RMSEC), CV (RMSECV), or prediction/test (RMSEP)
$$RMSE = \sqrt{\frac{1}{z}\sum_{i=1}^{z}\left(y_i - \hat{y}_i\right)^2}$$
• MSE is preferably used during the development and optimization of models, but it is less useful for practical applications because it does not have the units of the predicted property
– A similar, widely used measure is the predicted residual error sum of squares (PRESS), the sum of the squared errors
$$PRESS = \sum_{i=1}^{z}\left(y_i - \hat{y}_i\right)^2 = z \cdot MSE$$
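The corresponding error measures in R (a generic helper with my own names):

# MSE, RMSE, and PRESS from prediction errors
error_measures <- function(y, y_hat) {
  e   <- y - y_hat
  mse <- mean(e^2)          # mean squared error
  c(MSE = mse,
    RMSE = sqrt(mse),       # root mean squared error
    PRESS = sum(e^2))       # predicted residual error sum of squares = z * MSE
}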
Performance Criteria
• Correlation measures between the experimental y and the predicted ŷ are frequently used to characterize the model performance
– Mostly used is the squared Pearson correlation coefficient.
Criteria for Models with Different Numbers of Variables
• The model should not contain too few variables, because this leads to poor prediction performance. On the other hand, it should also not contain too many variables, because this results in overfitting and thus again in poor prediction performance
• The adjusted R-square, R²_adj or R′:
$$R^2_{adj} = 1 - \frac{n-1}{n-m-1}\left(1 - R^2\right)$$
where n is the number of objects, m is the number of regressor variables, and R² is the coefficient of determination, expressing the proportion of variance that is explained by the model.
Criteria for Models with Different Numbers of Variables
• Another, equivalent representation of R²_adj is
$$R^2_{adj} = 1 - \frac{RSS/(n-m-1)}{TSS/(n-1)}$$
with the residual sum of squares (RSS), the sum of the squared residuals,
$$RSS = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$
and the total sum of squares (TSS), the sum of the squared differences from the mean ȳ of y,
$$TSS = \sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2$$
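A short R sketch that computes R² and the adjusted R² from RSS and TSS of a fitted lm model (assuming a fit such as res <- lm(y ~ x1 + x2)):

r2_adj <- function(res) {
  y   <- model.response(model.frame(res))
  n   <- length(y)
  m   <- length(coef(res)) - 1                 # number of regressor variables
  rss <- sum(residuals(res)^2)                 # residual sum of squares
  tss <- sum((y - mean(y))^2)                  # total sum of squares
  c(R2 = 1 - rss / tss,
    R2adj = 1 - (rss / (n - m - 1)) / (tss / (n - 1)))
}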
Cross Validation (CV)
• CV is the most used resampling strategy to obtain a reasonably large number of predictions
• It is also often applied to optimize the complexity of a model
– The optimum number of PLS or PCA components
– To split the data into calibration sets and test sets.
Cross Validation (CV)
• The procedure of CV applied to model optimization:
– The available set with n objects is randomly split into s segments (parts) of approximately equal size
• The number of segments can be 2 to n; often values between 4 and 10 are used
– One segment is left out as a validation set
– The other s − 1 segments are used as a training set to create models of increasing complexity (for instance 1, 2, 3, …, a_MAX PLS components)
– The models are separately applied to the objects of the validation set, resulting in predicted values connected to the different model complexities
– This procedure is repeated so that each segment is used as the validation set once
– The result is a matrix with n rows and a_MAX columns containing the predicted values ŷ_CV (predicted by CV) for all objects and all considered model complexities
– From this matrix and the known y-values, a residual matrix is computed
– An error measure (for instance MSECV) is calculated from the residuals, and the lowest MSECV or a similar criterion indicates the optimum model complexity.
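A minimal R sketch of the bookkeeping of this procedure, assuming a generic model-fitting function fit_fun(X, y, a) that returns a model of complexity a and works with predict(); this is only an outline, not a ready-made PLS cross-validation:

cv_mse <- function(X, y, fit_fun, a_max, s = 7) {
  n    <- nrow(X)
  seg  <- sample(rep(1:s, length.out = n))          # random split into s segments
  pred <- matrix(NA, n, a_max)                      # n x a_max matrix of CV predictions
  for (k in 1:s) {
    val <- which(seg == k)                          # current validation segment
    for (a in 1:a_max) {
      m <- fit_fun(X[-val, , drop = FALSE], y[-val], a)   # train on the other segments
      pred[val, a] <- predict(m, X[val, , drop = FALSE])
    }
  }
  colMeans((y - pred)^2)                            # MSECV for each model complexity
}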
Cross Validation (CV)
K. Baumann, TrAC 2003, 22(6), 395
Cross Validation (CV)
• A single CV gives n predictions
• For many data sets in chemistry, n is too small for a visualization of the error distribution
• The performance measure depends on the split of the objects into segments
• It is therefore recommended to repeat the CV with different random splits into segments (repeated CV), and to summarize the results.
Cross Validation (CV)
• If the number of segments is equal to the number of objects (each segment contains only one object), the method is called leave-one-out CV or full CV
– Randomization of the sequence of objects is pointless; therefore only one CV run is necessary, resulting in n predictions
– The number of created models is n, which may be time-consuming for large data sets
– Depending on the data, full CV may give too optimistic results, especially if pairwise similar objects are in the data set, for instance from duplicate measurements
– Full CV is easier to apply than repeated CV or bootstrap
– In many cases, full CV gives a reasonable first estimate of the model performance.
Bootstrap
โข As a verb, bootstrap has survived into the computer age, being the origin of the phrase "booting a computer." "Boot" is a shortened form of the word "bootstrap," and refers to the initial commands necessary to load the rest of the computer's operating system. Thus, the system is "booted," or "pulled up by its own bootstraps"
Bootstrap
• Within multivariate analysis, bootstrap is a resampling method that can be used as an alternative to CV, for instance to estimate the prediction performance of a model or to estimate the optimum complexity
• In general, bootstrap can be used to estimate the distribution of model parameters
• Basic ideas of bootstrapping are resampling with replacement, and using calibration sets with the same number of objects, n, as in the available data set
– A calibration set is obtained by randomly selecting objects and copying (not moving) them into the calibration set
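A minimal R sketch of drawing one bootstrap calibration set (X and y are assumed to hold the available data; the objects never drawn, the "out-of-bag" objects, can serve as the corresponding test set):

n   <- nrow(X)
idx <- sample(n, size = n, replace = TRUE)     # resampling with replacement (copies allowed)
X_cal  <- X[idx, , drop = FALSE]; y_cal  <- y[idx]
oob    <- setdiff(seq_len(n), unique(idx))     # out-of-bag objects
X_test <- X[oob, , drop = FALSE]; y_test <- y[oob]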
Ordinary Least-Squares Regression
Simple OLS
$$\mathbf{y} = b_0 + b\,\mathbf{x} + \mathbf{e}$$
• b and b_0 are the regression parameters (regression coefficients)
– b_0 is the intercept and b is the slope
• Since the data will in general not follow a perfect linear relationship, the vector e contains the residuals (errors) e_1, e_2, …, e_n.
Simple OLS
• The predicted (modeled) property ŷ_i for sample i and the prediction error e_i are calculated by
$$\hat{y}_i = b_0 + b\,x_i, \qquad e_i = y_i - \hat{y}_i$$
• The ordinary least-squares (OLS) approach minimizes the sum of the squared residuals e_i² to estimate the model parameters b and b_0:
$$b = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b\,\bar{x}$$
– For mean-centered data (x̄ = 0, ȳ = 0), b_0 = 0 and
$$b = \frac{\sum_{i=1}^{n} x_i\,y_i}{\sum_{i=1}^{n} x_i^2} = \frac{\mathbf{x}^T\mathbf{y}}{\mathbf{x}^T\mathbf{x}}$$
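These formulas can be checked directly in R; using the small x/y data set from the R session shown further below, the hand-computed coefficients agree with lm():

x  <- c(1.5, 2, 2.5, 2.9, 3.4, 3.7, 4, 4.2, 4.6, 5, 5.5, 5.7, 6.6)
y  <- c(3.5, 6.1, 5.6, 7.1, 6.2, 7.2, 8.9, 9.1, 8.5, 9.4, 9.5, 11.3, 11.1)
b  <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b * mean(x)
c(b0 = b0, b = b)     # approximately 2.353 and 1.413
coef(lm(y ~ x))       # the same values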
Simple OLS
• The described model best fits the given (calibration) data, but it is not necessarily optimal for predictions
• The least-squares approach can become very unreliable if outliers are present in the data
• Assumptions for obtaining reliable estimates:
– Errors are only in y, not in x
– Residuals are uncorrelated and normally distributed with mean 0 and constant variance σ² (homoscedasticity).
Simple OLS
• The following scatterplots are plots of x_i (measurement) vs. i (observation number), with the sample mean marked by a red horizontal line. The measurement is plotted on the vertical axis; the observation number is plotted on the horizontal axis
– Unbiased: the average of the observations in every thin vertical strip is the same all the way across the scatterplot
– Biased: the average of the observations changes, depending on which thin vertical strip you pick
– Homoscedastic: the variation (σ) of the observations is the same in every thin vertical strip all the way across the scatterplot
– Heteroscedastic: the variation (σ) of the observations in a thin vertical strip changes, depending on which vertical strip you pick
(a) unbiased and homoscedastic (b) unbiased and heteroscedastic (c) biased and homoscedastic (d) biased and heteroscedastic (e) unbiased and heteroscedastic (f) biased and homoscedastic
Simple OLS
• Besides estimating the regression coefficients, it is also of interest to estimate the variation of the measurements around the fitted regression line. This means that the residual variance σ² has to be estimated:
$$s_e^2 = \frac{1}{n-2}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 = \frac{1}{n-2}\sum_{i=1}^{n} e_i^2$$
The denominator n − 2 is used here because two parameters are necessary for a fitted straight line, and this makes s_e² an unbiased estimator for σ²
• Confidence intervals for intercept and slope:
$$b_0 \pm t_{n-2;p}\,s_{b_0}, \qquad b \pm t_{n-2;p}\,s_b$$
• Standard deviations of b_0 and b:
$$s_{b_0} = s_e\sqrt{\frac{\sum_{i=1}^{n} x_i^2}{n\sum_{i=1}^{n}(x_i - \bar{x})^2}}, \qquad s_b = \frac{s_e}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$$
t_{n−2;p} is the p-quantile of the t-distribution with n − 2 degrees of freedom, with for instance p = 0.025 for a 95% confidence interval.
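In R these intervals and standard deviations are available directly from a fitted lm object (assuming a fit such as res <- lm(y ~ x)):

confint(res, level = 0.95)                   # t-based confidence intervals for b0 and b
summary(res)$coefficients[, "Std. Error"]    # s_b0 and s_b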
Simple OLS
• Confidence interval for the residual variance σ²:
$$\frac{(n-2)\,s_e^2}{\chi^2_{n-2;\,1-p}} < \sigma^2 < \frac{(n-2)\,s_e^2}{\chi^2_{n-2;\,p}}$$
where χ²_{n−2; 1−p} and χ²_{n−2; p} are the appropriate quantiles of the chi-square distribution (ref 1, table) with n − 2 degrees of freedom (e.g., p = 0.025 for a 95% confidence interval)
• Null hypothesis H₀: b_0 = 0; test statistic
$$T_{b_0} = \frac{b_0}{s_{b_0}}$$
– H₀ is rejected at the significance level α if |T_{b_0}| > t_{n−2; 1−α/2}
– The test for b = 0 is analogous:
$$T_b = \frac{b}{s_b}$$
Simple OLS
• Often it is of interest to obtain a confidence interval for the prediction at a new value x:
$$\hat{y} \pm s_e\,\sqrt{2\,F_{2,n-2;\,p}}\;\sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$$
with F_{2,n−2; p} the p-quantile of the F-distribution with 2 and n − 2 degrees of freedom
– The best predictions are possible in the middle part of the range of x, where most information is available.
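In R, predict() returns intervals at new x values; note that it uses the pointwise t-based interval, which differs slightly from the F-based band given above (res is again assumed to be a fit such as res <- lm(y ~ x)):

new <- data.frame(x = c(2, 4, 6))
predict(res, newdata = new, interval = "confidence")   # interval for the mean response
predict(res, newdata = new, interval = "prediction")   # interval for a new observation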
Simple OLS
• Using the open-source software R (see "An Introduction to R")
> x=c(1.5,2,2.5,2.9,3.4,3.7,4,4.2,4.6,5,5.5,5.7,6.6)
> y=c(3.5,6.1,5.6,7.1,6.2,7.2,8.9,9.1,8.5,9.4,9.5,11.3,11.1)
> res<-lm(y~x)   # linear model for y on x; the symbol ~ allows to construct a formula for the relation
> plot(x,y)
> abline(res)
Simple OLS
> summary(res)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-0.9724 -0.5789 -0.2855  0.8124  0.9211

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.3529     0.6186   3.804  0.00293 **
x             1.4130     0.1463   9.655 1.05e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7667 on 11 degrees of freedom
Multiple R-squared: 0.8945, Adjusted R-squared: 0.8849
F-statistic: 93.22 on 1 and 11 DF, p-value: 1.049e-06

Reading the output:
– Estimate column: b_0 and b; Std. Error column: s_b0 and s_b; t value column: the corresponding test statistics
– Since the p-values are much smaller than a reasonable significance level, e.g. α = 0.05, both intercept and slope are important in our regression model
– Residual standard error: s_e
– Multiple R-squared (R²): in the case of univariate x and y it equals the squared Pearson correlation coefficient, and it is a measure of model fit
– F-statistic: tests whether all parameters are zero, against the alternative that at least one regression parameter is different from zero; since the p-value is ≈ 0, at least one of intercept or slope contributes to the regression model
– 1Q (first quartile): the value below which 25% of the numbers lie
Multiple OLS
$$y_1 = b_0 + b_1 x_{11} + b_2 x_{12} + \dots + b_m x_{1m} + e_1$$
$$y_2 = b_0 + b_1 x_{21} + b_2 x_{22} + \dots + b_m x_{2m} + e_2$$
$$\vdots$$
$$y_n = b_0 + b_1 x_{n1} + b_2 x_{n2} + \dots + b_m x_{nm} + e_n$$
or
$$\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{e}$$
• Multiple Linear Regression Model
• X is of size n × (m + 1) and includes in its first column n values of 1
• The residuals are calculated by e = y − ŷ
• The regression coefficients b = (b_0, b_1, …, b_m)ᵀ result from the OLS estimation minimizing the sum of squared residuals eᵀe:
$$\mathbf{b} = \left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{y}$$
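A small R sketch that solves the normal equations directly and compares the result with lm() (hypothetical data, for illustration only):

set.seed(1)
x1 <- rnorm(20); x2 <- rnorm(20)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(20, sd = 0.1)
X  <- cbind(1, x1, x2)                  # first column of 1s for the intercept
b  <- solve(t(X) %*% X, t(X) %*% y)     # b = (X^T X)^-1 X^T y
cbind(normal_equations = as.vector(b), lm = coef(lm(y ~ x1 + x2)))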
Multiple OLS
• Confidence Intervals and Statistical Tests
– The following assumptions must be fulfilled:
• The errors e are independent and n-dimensionally normally distributed
• with mean vector 0
• and covariance matrix σ²·I_n
– An unbiased estimator for the residual variance σ² is
$$s_e^2 = \frac{1}{n-m-1}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 = \frac{1}{n-m-1}\left(\mathbf{y} - \mathbf{X}\mathbf{b}\right)^T\left(\mathbf{y} - \mathbf{X}\mathbf{b}\right)$$
– The null hypothesis b_j = 0 against the alternative b_j ≠ 0 can be tested with the test statistic
$$z_j = \frac{b_j}{s_e\sqrt{d_j}}$$
where d_j is the jth diagonal element of (XᵀX)⁻¹
– The distribution of z_j is t_{n−m−1}, and thus a large absolute value of z_j will lead to a rejection of the null hypothesis
– An F-test can also be constructed to test the null hypothesis b_0 = b_1 = … = b_m = 0 against the alternative b_j ≠ 0 for any j = 0, 1, …, m.
Multiple OLS
• Using the open-source software R
> T=c(80,93,100,82,90,99,81,96,94,93,97,95,100,85,86,87)
> V=c(8,9,10,12,11,8,8,10,12,11,13,11,8,12,9,12)
> y=c(2256,2340,2426,2293,2330,2368,2250,2409,2364,2379,2440,2364,2404,2317,2309,2328)
> res<-lm(y~T+V) # linear model for y on T and V. The symbol ~
# allows to construct a formula for the relation
> summary(res)
Call:
lm(formula = y ~ T + V)
Residuals:
Min 1Q Median 3Q Max
-21.4972 -13.1978 -0.4736 10.5558 25.4299
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1566.0778 61.5918 25.43 1.80e-12 ***
T 7.6213 0.6184 12.32 1.52e-08 ***
V 8.5848 2.4387 3.52 0.00376 **
---
Signif. codes: 0 โ***โ 0.001 โ**โ 0.01 โ*โ 0.05 โ.โ 0.1 โ โ 1
Residual standard error: 16.36 on 13 degrees of freedom
Multiple R-squared: 0.927, Adjusted R-squared: 0.9157
F-statistic: 82.5 on 2 and 13 DF, p-value: 4.1e-08
Hat Matrix
• The hat matrix H combines the observed and the predicted y-values,
$$\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}$$
and so it "puts the hat" on y. The hat matrix is defined as
$$\mathbf{H} = \mathbf{X}\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T$$
– The diagonal elements h_ii of the n × n matrix H reflect the influence of each value y_i on its own prediction ŷ_i.
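A short check in R (res is assumed to be a fitted lm model, e.g. res <- lm(y ~ T + V) from the example above):

X <- model.matrix(res)                                # n x (m+1) design matrix incl. the column of 1s
H <- X %*% solve(t(X) %*% X) %*% t(X)                 # H = X (X^T X)^-1 X^T
all.equal(unname(diag(H)), unname(hatvalues(res)))    # diagonal elements h_ii = leverages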
Multivariate OLS
• Multivariate Linear Regression relates several y-variables to several x-variables:
$$\mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{E}$$
• In terms of a single y-variable:
$$\mathbf{y}_j = \mathbf{X}\mathbf{b}_j + \mathbf{e}_j$$
• OLS estimator for b_j:
$$\mathbf{b}_j = \left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{y}_j$$
resulting in
$$\mathbf{B} = \left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{Y}$$
and
$$\hat{\mathbf{Y}} = \mathbf{X}\mathbf{B}$$
• Matrix B consists of q loading vectors, each defining a direction in the x-space for a linear latent variable that has maximum Pearson correlation between y_j and ŷ_j for j = 1, …, q
• The regression coefficients for all y-variables can be computed at once, however only for noncollinear x-variables and if m < n
• m x-variables, n observations, and q y-variables
• Alternative methods are PLS2 and CCA.
Variable Selection
• Feature selection
• For multiple regression, all available variables x_1, x_2, …, x_m were used to build a linear model for the prediction of the y-variable
– This is useful as long as the number m of regressor variables is small, say not more than 10
• OLS regression is no longer computable if the
– regressor variables are highly correlated
– number of objects is lower than the number of variables
• PCR and PLS can handle such data.
Variable Selection
• Arguments against the use of all available regressor variables:
– Use of all variables will produce a better fit of the model for the training data
• The residuals become smaller and thus the R² measure increases
• We are usually not interested in maximizing the fit for the training data but in maximizing the prediction performance for the test data
• Reduction of the regressor variables can avoid the effects of overfitting
– A regression model with a high number of variables is practically impossible to interpret.
Variable Selection
• Univariate and Bivariate Selection Methods
– Criteria for the elimination of regressor variables:
• A considerable percentage of the variable values is missing or below a small threshold
• All or nearly all variable values are equal
• The variable includes many and severe outliers
• Compute the correlation between pairs of regressor variables. If the correlation is high (positive or negative), exclude the variable having the larger sum of (absolute) correlation coefficients to all remaining regressor variables.
Variable Selection
– Criteria for the identification of potentially useful regressor variables:
• High variance of the variable
• High (absolute) correlation coefficient with the y-variable.
Variable Selection
• Stepwise Selection Methods
– Add or drop one variable at a time
• Forward selection
– Start with the empty model (or with preselected variables) and add the variable that optimizes a criterion
– Continue to add variables until a stopping rule becomes active
• Backward elimination
– Start with the full model...
• Both directions.
Variable Selection
– An often-used version of stepwise variable selection works as follows:
• Select the variable with the highest absolute correlation coefficient with the y-variable; the number of selected variables is m_0 = 1
• Add each of the remaining x-variables separately to the selected variable; the number of variables in each subset is m_1 = 2
• Calculate F:
$$F = \frac{(RSS_0 - RSS_1)/(m_1 - m_0)}{RSS_1/(n - m_1 - 1)}$$
with RSS being the sum of the squared residuals, Σ(y_i − ŷ_i)²
• Consider the added variable which gives the highest F, and if the decrease of RSS is significant, take this variable as the second selected one
– Significance: F > F_{m_1−m_0, n−m_1−1; 0.95}
• Forward selection of variables would continue in the same way until no significant change occurs
– Disadvantage: a selected variable cannot be removed later on
» Usually the better strategy is to continue with a backward step
» Another forward step is done, followed by another backward step, and so on, until no significant change of RSS occurs or a defined maximum number of variables is reached.
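A sketch of stepwise selection in R; note that the built-in step() function moves in both directions but uses AIC rather than the F-criterion described above (a data frame dat with the response y and the candidate x-variables is assumed):

full  <- lm(y ~ ., data = dat)     # model with all regressor variables
empty <- lm(y ~ 1, data = dat)     # intercept-only model
sel   <- step(empty, scope = formula(full), direction = "both", trace = FALSE)
summary(sel)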
Variable Selection
• Best-Subset/All-Subsets Regression
– Allows excluding complete branches in the tree of all possible subsets, and thus finding the best subset for data sets with up to about 30-40 variables
– Leaps-and-bounds algorithm or regression-tree methods
– Model selection criteria:
• Adjusted R²
• Akaike's Information Criterion (AIC)
$$AIC = n\,\log\!\left(\frac{RSS}{n}\right) + 2m$$
• Bayes Information Criterion (BIC)
• Mallows' Cp.
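A sketch of a best-subset search using the leaps package (assumed to be installed; dat is again a hypothetical data frame with response y):

library(leaps)
best <- regsubsets(y ~ ., data = dat, nvmax = 10)   # leaps-and-bounds search
summary(best)$adjr2                                 # adjusted R^2 of the best subset of each size
summary(best)$bic                                   # BIC of the best subset of each size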
Variable Selection
[Tree of all subsets of x_1, x_2, x_3 used in the branch-and-bound search: full model x_1 + x_2 + x_3; two-variable subsets x_1 + x_2 (AIC = 10), x_1 + x_3, and x_2 + x_3 (AIC = 20); single-variable models x_1 (AIC > 8), x_2 (AIC > 18), and x_3 (AIC > 18)]
Since we want to select the model that gives the smallest value of the AIC, the complete branch below x_2 + x_3 can be ignored, because any submodel in this branch is worse (AIC > 18) than the model x_1 + x_2 with AIC = 10.
Variable Selection
• Variable Selection Based on PCA or PLS Models
– These methods form new latent variables as linear combinations of the regressor variables
$$b_1 x_1 + b_2 x_2 + \dots + b_m x_m$$
– The coefficients/loadings reflect the importance of an x-variable for the new latent variable
– The absolute size of the coefficients can be used as a criterion for variable selection.
Variable Selection
• Genetic Algorithms (GAs)
– A natural-computation method
Variable Selection
[Diagram: genes form chromosomes; a set of chromosomes forms the population]
• Delete chromosomes with poor fitness (selection)
• Create new chromosomes from pairs of good chromosomes (crossover)
• Change a few genes randomly (mutation)
→ New (better) population
Variable Selection
– Crossover
• Two chromosomes are cut at a random position and the parts are connected in a crossover scheme, resulting in two new chromosomes
Variable Selection
• Cluster Analysis of Variables
– Cluster analysis tries to identify homogeneous groups in the data
– If it is applied to the correlation matrix of the regressor variables, one may obtain groups of variables that are strongly related, while variables in different groups have only weak correlations.
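A minimal R sketch of such a clustering of variables, using 1 − |correlation| as the dissimilarity (X is assumed to be the matrix of x-variables):

d  <- as.dist(1 - abs(cor(X)))      # strongly correlated variables get small dissimilarities
hc <- hclust(d, method = "average")
plot(hc)                            # dendrogram: groups of strongly related variables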