Multivariate Analysis
Multivariate Calibration – part 2
Prof. Dr. Anselmo E de Oliveira
anselmo.quimica.ufg.br
anselmo.disciplinas@gmail.com
Linear Latent Variables
• An essential concept in multivariate data analysis is the mathematical combination of several variables into a new variable that has a certain desired property. In chemometrics such a new variable is often called a latent variable
• Linear latent variable:
$$u = b_1 x_1 + b_2 x_2 + \dots + b_m x_m$$
u: score, the value of the linear latent variable
b_j: loadings, coefficients describing the influence of the variables on the score
x_j: variables, features.
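A minimal numerical sketch in R (hypothetical loadings and data, not taken from the lecture) of how a score is computed as a weighted sum of the variables:

# Score of a linear latent variable: u = b1*x1 + b2*x2 + ... + bm*xm
# (hypothetical loadings b and data matrix X, for illustration only)
X <- matrix(rnorm(5 * 3), nrow = 5, ncol = 3)   # 5 objects, 3 variables
b <- c(0.7, 0.5, -0.2)                          # loadings
u <- as.vector(X %*% b)                         # one score per object
u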
Calibration
• Linear models:
$$y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_j x_j + \dots + b_m x_m + e$$
b_0 is called the intercept
b_1 to b_m the regression coefficients
m the number of variables
e the residual (error term)
– Often mean-centered data are used, and then b_0 becomes zero
• This model corresponds to a linear latent variable.
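As a small check of the statement about mean-centering (hypothetical numbers, only to illustrate the idea), fitting the model to centered data yields an intercept of essentially zero:

# With mean-centered x and y the fitted intercept becomes (numerically) zero
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
xc <- x - mean(x)
yc <- y - mean(y)
coef(lm(yc ~ xc))   # intercept ~ 0; the slope is unchanged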
Calibration
• The parameters of a model are estimated from a calibration set (training set) containing the values of the x-variables and y for n samples (objects)
• The resulting model is evaluated with a test set (with known values for x and y)
• Because modeling and prediction of the y-data is a defined aim of the data analysis, this type of data treatment is called supervised learning.
Calibration
• All regression methods aim at the minimization of residuals, for instance minimization of the sum of the squared residuals
• It is essential to focus on minimal prediction errors for new cases – the test set – and not only on the calibration set from which the model has been created
– It is relatively easy to create a model – especially with many variables and possibly nonlinear features – that fits the calibration data very well; however, it may be useless for new cases. This effect of overfitting is a crucial topic in model creation.
Calibration
• Regression can be performed directly with the values of the variables (ordinary least-squares regression, OLS), but in the most powerful methods, such as principal component regression (PCR) and partial least-squares regression (PLS), it is done via a small set of intermediate linear latent variables (the components). This approach has important advantages:
– Data with highly correlated x-variables can be used
– Data sets with more variables than samples can be used
– The complexity of the model can be controlled by the number of components, and thus overfitting can be avoided and maximum prediction performance for test-set data can be approached.
Calibration
• Depending on the type of data, different methods are available:

Number of x-variables | Number of y-variables | Name         | Methods
1                     | 1                     | Simple       | Simple OLS, robust regression
Many                  | 1                     | Multiple     | PLS, PCR, multiple OLS, robust regression, Ridge regression, Lasso regression
Many                  | Many                  | Multivariate | PLS2, Canonical Correlation Analysis (CCA)
Calibration
$$y = 1.418 + 4.423\,x_1 + 4.101\,x_2 - 0.0357\,x_3$$
Calibration
OLS model using only x_1 and x_2:
$$y = b_0 + b_1 x_1 + b_2 x_2$$
Performance of Regression Models
• Any model for prediction makes sense only if appropriate criteria are defined and applied to measure the performance of the model
• For models based on regression, the residuals (prediction errors)
$$e_i = y_i - \hat{y}_i$$
are the basis for the performance measures, with y_i the given (experimental, "true") value and ŷ_i the predicted (modeled) value of an object i
• An often-used performance measure estimates the standard deviation of the prediction errors (standard error of prediction, SEP).
Performance of Regression Models
• Using the same objects for calibration and test should be strictly avoided
• Depending on the size of the data set (the number of objects available) – and on the amount of work invested – different strategies are possible
• The following levels are ordered by typical application to data sets of decreasing size, and also by decreasing reliability of the results.
Performance of Regression Models
1. If data from many objects are available, a split into three sets is best:
i. Training set (ca. 50% of the objects) for creating models
ii. Validation set (ca. 25% of the objects) for optimizing the model to obtain good prediction performance
iii. Test set (prediction set, approximately 25%) for testing the final model to obtain a realistic estimate of the prediction performance for new cases
iv. The three sets are treated separately
v. Applications in chemistry rarely allow this strategy because too few objects are available.
Performance of Regression Models
2. The data are split into a calibration set used for model creation and optimization, and a test set (prediction set) used to obtain a realistic estimate of the prediction performance for new cases
i. The calibration set is divided into a training set and a validation set by cross validation (CV) or bootstrap
ii. First the optimum complexity (for instance the optimum number of PLS components) of the model is estimated, and then a model is built from the whole calibration set applying the found optimum complexity; this model is applied to the test set.
Performance of Regression Models
3. CV or bootstrap is used to split the data set into different calibration sets and test sets
i. A calibration set is used as described in (2) to create an optimized model, and this model is applied to the corresponding test set
ii. In principle, all objects are used in training, validation, and test sets; however, an object is never simultaneously used for model creation and for testing
iii. This strategy (double CV, double bootstrap, or a combination of CV and bootstrap) is applicable to a relatively small number of objects; furthermore, the process can be repeated many times with different random splits, resulting in a high number of test-set-predicted values.
Performance of Regression Models
• Mostly, the split of the objects into training, validation, and test sets is performed by simple random sampling
• More sophisticated procedures – related to experimental design and the theory of sampling – are available
– Kennard-Stone algorithm
• It aims to set up a calibration set that is representative of the population, to cover the x-space as uniformly as possible, and to give more weight to objects outside the center
• This aim is reached by selecting objects with maximum distances (for instance Euclidean distances) in the x-space.
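A minimal sketch of such a maximum-distance selection in R, assuming plain Euclidean distances and a start from the two most distant objects (my own simplified version, not a reference implementation of the Kennard-Stone algorithm):

# Select k objects spread over the x-space by maximizing distances
kennard_stone_like <- function(X, k) {
  D <- as.matrix(dist(X))                                    # Euclidean distance matrix
  sel <- as.vector(which(D == max(D), arr.ind = TRUE)[1, ])  # start: the two most distant objects
  while (length(sel) < k) {
    remaining <- setdiff(seq_len(nrow(X)), sel)
    dmin <- apply(D[remaining, sel, drop = FALSE], 1, min)   # distance to nearest selected object
    sel <- c(sel, remaining[which.max(dmin)])                # add the farthest remaining object
  }
  sel
}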
Performance of Regression Models
– The methods CV and bootstrap are called resampling strategies
• They are intended for small data sets
• They are applied to obtain a reasonably high number of predictions
• The larger the data set and the more reliable the data are, the better the prediction performance can be estimated:
size of the data × reliability of the data × uncertainty of the performance measure = constant.
Overfitting and Underfitting
• The more complex (the "larger") a model is, the better it is capable of fitting the given data (the calibration set)
• The prediction error for the calibration set in general decreases with increasing complexity of the model
– A sufficiently complex model can fit almost any data with almost zero deviations (residuals) between the experimental (true) y and the modeled (predicted) ŷ
– Such models are not necessarily useful for new cases, because they are probably overfitted.
Overfitting and Underfitting
Image source: Dr. Frank Dieterle
Overfitting and Underfitting
Image source: holehouse.org
Overfitting and Underfitting
• Prediction errors for new cases are high for "small" models (underfitting, low complexity, too simple models) but also for overfitted models
• Determination of the optimum complexity of a model is an important but not always easy task, because the minimum of the prediction-error measures for test sets is often not well marked
– In chemometrics, the complexity is typically controlled by the number of PLS or PCA components (latent variables), and the optimum complexity is estimated by CV
– CV or bootstrap allows an estimation of the prediction error for each object of the calibration set at each considered model complexity.
Performance Criteria
• The basis of all performance criteria are the prediction errors (residuals) y_i − ŷ_i
• The classical standard deviation of the prediction errors is widely used as a measure of the spread of the error distribution, and is called the standard error of prediction (SEP), defined by
$$SEP = \sqrt{\frac{1}{z-1}\sum_{i=1}^{z}\left(y_i - \hat{y}_i - bias\right)^2}$$
with
$$bias = \frac{1}{z}\sum_{i=1}^{z}\left(y_i - \hat{y}_i\right)$$
where y_i are the given (experimental, "true") values, ŷ_i are the predicted (modeled) values, and z is the number of predictions.
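A small R helper reflecting these two formulas (a generic function; the names are my own):

# bias and SEP from given values y and predictions y_hat
sep_bias <- function(y, y_hat) {
  e    <- y - y_hat                                   # prediction errors
  bias <- mean(e)                                     # arithmetic mean of the errors
  sep  <- sqrt(sum((e - bias)^2) / (length(e) - 1))   # standard deviation of the errors
  c(bias = bias, SEP = sep)
}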
Performance Criteria
• The bias is the arithmetic mean of the prediction errors and should be near zero
– A systematic error (a nonzero bias) may appear if, for instance, a calibration model is applied to data that have been produced by another instrument
• In the case of a normal distribution, about 95% of the prediction errors are within the tolerance interval ±2·SEP
– The measure SEP and the tolerance interval are given in the units of y.
Performance Criteria
• The standard error of calibration (SEC) is the analog of SEP applied to predictions of the calibration set
• The mean squared error (MSE) is the arithmetic mean of the squared errors
$$MSE = \frac{1}{z}\sum_{i=1}^{z}\left(y_i - \hat{y}_i\right)^2$$
• MSEC refers to results from a calibration set, MSECV to results obtained in CV, and MSEP to results from a prediction/test set
• MSE minus the squared bias gives the squared SEP:
$$SEP^2 = MSE - bias^2$$
Performance Criteria
• The root mean squared error (RMSE) is the square root of the MSE, and can again be given for calibration (RMSEC), CV (RMSECV), or prediction/test (RMSEP)
$$RMSE = \sqrt{\frac{1}{z}\sum_{i=1}^{z}\left(y_i - \hat{y}_i\right)^2}$$
• MSE is preferably used during the development and optimization of models, but it is less useful for practical applications because it does not have the units of the predicted property
– A similar, widely used measure is the predicted residual error sum of squares (PRESS), the sum of the squared errors
$$PRESS = \sum_{i=1}^{z}\left(y_i - \hat{y}_i\right)^2 = z \cdot MSE$$
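The corresponding error measures in R (a generic helper with my own names):

# MSE, RMSE, and PRESS from prediction errors
error_measures <- function(y, y_hat) {
  e   <- y - y_hat
  mse <- mean(e^2)          # mean squared error
  c(MSE = mse,
    RMSE = sqrt(mse),       # root mean squared error
    PRESS = sum(e^2))       # predicted residual error sum of squares = z * MSE
}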
Performance Criteria
• Correlation measures between the experimental y and the predicted ŷ are frequently used to characterize the model performance
– Mostly used is the squared Pearson correlation coefficient.
Criteria for Models with Different Numbers of Variables
• The model should not contain too few variables, because this leads to poor prediction performance. On the other hand, it should also not contain too many variables, because this results in overfitting and thus again in poor prediction performance
• The adjusted R-square, R²_adj or R′:
$$R^2_{adj} = 1 - \frac{n-1}{n-m-1}\left(1 - R^2\right)$$
where n is the number of objects, m is the number of regressor variables, and R² is the coefficient of determination, expressing the proportion of variance that is explained by the model.
Criteria for Models with Different Numbers of Variables
• Another, equivalent representation of R²_adj is
$$R^2_{adj} = 1 - \frac{RSS/(n-m-1)}{TSS/(n-1)}$$
with the residual sum of squares (RSS), the sum of the squared residuals,
$$RSS = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$
and the total sum of squares (TSS), the sum of the squared differences from the mean ȳ of y,
$$TSS = \sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2$$
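A short R sketch that computes R² and the adjusted R² from RSS and TSS of a fitted lm model (assuming a fit such as res <- lm(y ~ x1 + x2)):

r2_adj <- function(res) {
  y   <- model.response(model.frame(res))
  n   <- length(y)
  m   <- length(coef(res)) - 1                 # number of regressor variables
  rss <- sum(residuals(res)^2)                 # residual sum of squares
  tss <- sum((y - mean(y))^2)                  # total sum of squares
  c(R2 = 1 - rss / tss,
    R2adj = 1 - (rss / (n - m - 1)) / (tss / (n - 1)))
}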
Cross Validation (CV)
• CV is the most used resampling strategy to obtain a reasonably large number of predictions
• It is also often applied to optimize the complexity of a model
– The optimum number of PLS or PCA components
– To split the data into calibration sets and test sets.
Cross Validation (CV)
• The procedure of CV applied to model optimization:
– The available set with n objects is randomly split into s segments (parts) of approximately equal size
• The number of segments can be 2 to n; often values between 4 and 10 are used
– One segment is left out as a validation set
– The other s − 1 segments are used as a training set to create models of increasing complexity (for instance 1, 2, 3, …, a_MAX PLS components)
– The models are separately applied to the objects of the validation set, resulting in predicted values connected to the different model complexities
– This procedure is repeated so that each segment is used as the validation set once
– The result is a matrix with n rows and a_MAX columns containing the predicted values ŷ_CV (predicted by CV) for all objects and all considered model complexities
– From this matrix and the known y-values, a residual matrix is computed
– An error measure (for instance MSECV) is calculated from the residuals, and the lowest MSECV or a similar criterion indicates the optimum model complexity.
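A minimal R sketch of the bookkeeping of this procedure, assuming a generic model-fitting function fit_fun(X, y, a) that returns a model of complexity a and works with predict(); this is only an outline, not a ready-made PLS cross-validation:

cv_mse <- function(X, y, fit_fun, a_max, s = 7) {
  n    <- nrow(X)
  seg  <- sample(rep(1:s, length.out = n))          # random split into s segments
  pred <- matrix(NA, n, a_max)                      # n x a_max matrix of CV predictions
  for (k in 1:s) {
    val <- which(seg == k)                          # current validation segment
    for (a in 1:a_max) {
      m <- fit_fun(X[-val, , drop = FALSE], y[-val], a)   # train on the other segments
      pred[val, a] <- predict(m, X[val, , drop = FALSE])
    }
  }
  colMeans((y - pred)^2)                            # MSECV for each model complexity
}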
Cross Validation (CV)
K. Baumann, TrAC 2003, 22(6), 395
Cross Validation (CV)
• A single CV gives n predictions
• For many data sets in chemistry, n is too small for a visualization of the error distribution
• The performance measure depends on the split of the objects into segments
• It is therefore recommended to repeat the CV with different random splits into segments (repeated CV), and to summarize the results.
Cross Validation (CV)
• If the number of segments is equal to the number of objects (each segment contains only one object), the method is called leave-one-out CV or full CV
– Randomization of the sequence of objects is pointless; therefore only one CV run is necessary, resulting in n predictions
– The number of created models is n, which may be time-consuming for large data sets
– Depending on the data, full CV may give too optimistic results, especially if pairwise similar objects are in the data set, for instance from duplicate measurements
– Full CV is easier to apply than repeated CV or bootstrap
– In many cases, full CV gives a reasonable first estimate of the model performance.
Bootstrap
โข As a verb, bootstrap has survived into the computer age, being the origin of the phrase "booting a computer." "Boot" is a shortened form of the word "bootstrap," and refers to the initial commands necessary to load the rest of the computer's operating system. Thus, the system is "booted," or "pulled up by its own bootstraps"
Bootstrap
• Within multivariate analysis, bootstrap is a resampling method that can be used as an alternative to CV, for instance to estimate the prediction performance of a model or to estimate the optimum complexity
• In general, bootstrap can be used to estimate the distribution of model parameters
• Basic ideas of bootstrapping are resampling with replacement, and using calibration sets with the same number of objects, n, as in the available data set
– A calibration set is obtained by randomly selecting objects and copying (not moving) them into the calibration set
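A minimal R sketch of drawing one bootstrap calibration set (X and y are assumed to hold the available data; the objects never drawn, the "out-of-bag" objects, can serve as the corresponding test set):

n   <- nrow(X)
idx <- sample(n, size = n, replace = TRUE)     # resampling with replacement (copies allowed)
X_cal  <- X[idx, , drop = FALSE]; y_cal  <- y[idx]
oob    <- setdiff(seq_len(n), unique(idx))     # out-of-bag objects
X_test <- X[oob, , drop = FALSE]; y_test <- y[oob]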
Ordinary Least-Squares Regression
Simple OLS
$$\mathbf{y} = b_0 + b\,\mathbf{x} + \mathbf{e}$$
• b and b_0 are the regression parameters (regression coefficients)
– b_0 is the intercept and b is the slope
• Since the data will in general not follow a perfect linear relationship, the vector e contains the residuals (errors) e_1, e_2, …, e_n.
Simple OLS
• The predicted (modeled) property ŷ_i for sample i and the prediction error e_i are calculated by
$$\hat{y}_i = b_0 + b\,x_i, \qquad e_i = y_i - \hat{y}_i$$
• The ordinary least-squares (OLS) approach minimizes the sum of the squared residuals e_i² to estimate the model parameters b and b_0:
$$b = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b\,\bar{x}$$
– For mean-centered data (x̄ = 0, ȳ = 0), b_0 = 0 and
$$b = \frac{\sum_{i=1}^{n} x_i\,y_i}{\sum_{i=1}^{n} x_i^2} = \frac{\mathbf{x}^T\mathbf{y}}{\mathbf{x}^T\mathbf{x}}$$
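These formulas can be checked directly in R; using the small x/y data set from the R session shown further below, the hand-computed coefficients agree with lm():

x  <- c(1.5, 2, 2.5, 2.9, 3.4, 3.7, 4, 4.2, 4.6, 5, 5.5, 5.7, 6.6)
y  <- c(3.5, 6.1, 5.6, 7.1, 6.2, 7.2, 8.9, 9.1, 8.5, 9.4, 9.5, 11.3, 11.1)
b  <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b * mean(x)
c(b0 = b0, b = b)     # approximately 2.353 and 1.413
coef(lm(y ~ x))       # the same values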
Simple OLS
• The described model best fits the given (calibration) data, but it is not necessarily optimal for predictions
• The least-squares approach can become very unreliable if outliers are present in the data
• Assumptions for obtaining reliable estimates:
– Errors are only in y, not in x
– Residuals are uncorrelated and normally distributed with mean 0 and constant variance σ² (homoscedasticity).
Simple OLS
• The following scatterplots are plots of x_i (measurement) vs. i (observation number), with the sample mean marked by a red horizontal line. The measurement is plotted on the vertical axis; the observation number is plotted on the horizontal axis
– Unbiased: the average of the observations in every thin vertical strip is the same all the way across the scatterplot
– Biased: the average of the observations changes, depending on which thin vertical strip you pick
– Homoscedastic: the variation (σ) of the observations is the same in every thin vertical strip all the way across the scatterplot
– Heteroscedastic: the variation (σ) of the observations in a thin vertical strip changes, depending on which vertical strip you pick
(a) unbiased and homoscedastic (b) unbiased and heteroscedastic (c) biased and homoscedastic (d) biased and heteroscedastic (e) unbiased and heteroscedastic (f) biased and homoscedastic
Simple OLS
• Besides estimating the regression coefficients, it is also of interest to estimate the variation of the measurements around the fitted regression line. This means that the residual variance σ² has to be estimated:
$$s_e^2 = \frac{1}{n-2}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 = \frac{1}{n-2}\sum_{i=1}^{n} e_i^2$$
The denominator n − 2 is used here because two parameters are necessary for a fitted straight line, and this makes s_e² an unbiased estimator for σ²
• Confidence intervals for intercept and slope:
$$b_0 \pm t_{n-2;p}\,s_{b_0}, \qquad b \pm t_{n-2;p}\,s_b$$
• Standard deviations of b_0 and b:
$$s_{b_0} = s_e\sqrt{\frac{\sum_{i=1}^{n} x_i^2}{n\sum_{i=1}^{n}(x_i - \bar{x})^2}}, \qquad s_b = \frac{s_e}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$$
t_{n−2;p} is the p-quantile of the t-distribution with n − 2 degrees of freedom, with for instance p = 0.025 for a 95% confidence interval.
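In R these intervals and standard deviations are available directly from a fitted lm object (assuming a fit such as res <- lm(y ~ x)):

confint(res, level = 0.95)                   # t-based confidence intervals for b0 and b
summary(res)$coefficients[, "Std. Error"]    # s_b0 and s_b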
Simple OLS
• Confidence interval for the residual variance σ²:
$$\frac{(n-2)\,s_e^2}{\chi^2_{n-2;\,1-p}} < \sigma^2 < \frac{(n-2)\,s_e^2}{\chi^2_{n-2;\,p}}$$
where χ²_{n−2; 1−p} and χ²_{n−2; p} are the appropriate quantiles of the chi-square distribution (ref 1, table) with n − 2 degrees of freedom (e.g., p = 0.025 for a 95% confidence interval)
• Null hypothesis H₀: b_0 = 0; test statistic
$$T_{b_0} = \frac{b_0}{s_{b_0}}$$
– H₀ is rejected at the significance level α if |T_{b_0}| > t_{n−2; 1−α/2}
– The test for b = 0 is analogous:
$$T_b = \frac{b}{s_b}$$
Simple OLS
• Often it is of interest to obtain a confidence interval for the prediction at a new value x:
$$\hat{y} \pm s_e\,\sqrt{2\,F_{2,n-2;\,p}}\;\sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$$
with F_{2,n−2; p} the p-quantile of the F-distribution with 2 and n − 2 degrees of freedom
– The best predictions are possible in the middle part of the range of x, where most information is available.
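In R, predict() returns intervals at new x values; note that it uses the pointwise t-based interval, which differs slightly from the F-based band given above (res is again assumed to be a fit such as res <- lm(y ~ x)):

new <- data.frame(x = c(2, 4, 6))
predict(res, newdata = new, interval = "confidence")   # interval for the mean response
predict(res, newdata = new, interval = "prediction")   # interval for a new observation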
Simple OLS
• Using the open-source software R (see "An Introduction to R")
> x=c(1.5,2,2.5,2.9,3.4,3.7,4,4.2,4.6,5,5.5,5.7,6.6)
> y=c(3.5,6.1,5.6,7.1,6.2,7.2,8.9,9.1,8.5,9.4,9.5,11.3,11.1)
> res<-lm(y~x)   # linear model for y on x; the symbol ~ allows to construct a formula for the relation
> plot(x,y)
> abline(res)
Simple OLS
> summary(res)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-0.9724 -0.5789 -0.2855  0.8124  0.9211

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.3529     0.6186   3.804  0.00293 **
x             1.4130     0.1463   9.655 1.05e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7667 on 11 degrees of freedom
Multiple R-squared: 0.8945, Adjusted R-squared: 0.8849
F-statistic: 93.22 on 1 and 11 DF, p-value: 1.049e-06

Reading the output:
– Estimate column: b_0 and b; Std. Error column: s_b0 and s_b; t value column: the corresponding test statistics
– Since the p-values are much smaller than a reasonable significance level, e.g. α = 0.05, both intercept and slope are important in our regression model
– Residual standard error: s_e
– Multiple R-squared (R²): in the case of univariate x and y it equals the squared Pearson correlation coefficient, and it is a measure of model fit
– F-statistic: tests whether all parameters are zero, against the alternative that at least one regression parameter is different from zero; since the p-value is ≈ 0, at least one of intercept or slope contributes to the regression model
– 1Q (first quartile): the value below which 25% of the numbers lie
Multiple OLS
$$y_1 = b_0 + b_1 x_{11} + b_2 x_{12} + \dots + b_m x_{1m} + e_1$$
$$y_2 = b_0 + b_1 x_{21} + b_2 x_{22} + \dots + b_m x_{2m} + e_2$$
$$\vdots$$
$$y_n = b_0 + b_1 x_{n1} + b_2 x_{n2} + \dots + b_m x_{nm} + e_n$$
or
$$\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{e}$$
• Multiple Linear Regression Model
• X is of size n × (m + 1) and includes in its first column n values of 1
• The residuals are calculated by e = y − ŷ
• The regression coefficients b = (b_0, b_1, …, b_m)ᵀ result from the OLS estimation minimizing the sum of squared residuals eᵀe:
$$\mathbf{b} = \left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{y}$$
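A small R sketch that solves the normal equations directly and compares the result with lm() (hypothetical data, for illustration only):

set.seed(1)
x1 <- rnorm(20); x2 <- rnorm(20)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(20, sd = 0.1)
X  <- cbind(1, x1, x2)                  # first column of 1s for the intercept
b  <- solve(t(X) %*% X, t(X) %*% y)     # b = (X^T X)^-1 X^T y
cbind(normal_equations = as.vector(b), lm = coef(lm(y ~ x1 + x2)))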
Multiple OLS
• Confidence Intervals and Statistical Tests
– The following assumptions must be fulfilled:
• The errors e are independent and n-dimensionally normally distributed
• with mean vector 0
• and covariance matrix σ²·I_n
– An unbiased estimator for the residual variance σ² is
$$s_e^2 = \frac{1}{n-m-1}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 = \frac{1}{n-m-1}\left(\mathbf{y} - \mathbf{X}\mathbf{b}\right)^T\left(\mathbf{y} - \mathbf{X}\mathbf{b}\right)$$
– The null hypothesis b_j = 0 against the alternative b_j ≠ 0 can be tested with the test statistic
$$z_j = \frac{b_j}{s_e\sqrt{d_j}}$$
where d_j is the jth diagonal element of (XᵀX)⁻¹
– The distribution of z_j is t_{n−m−1}, and thus a large absolute value of z_j will lead to a rejection of the null hypothesis
– An F-test can also be constructed to test the null hypothesis b_0 = b_1 = … = b_m = 0 against the alternative b_j ≠ 0 for any j = 0, 1, …, m.
Multiple OLS
• Using the open-source software R
> T=c(80,93,100,82,90,99,81,96,94,93,97,95,100,85,86,87)
> V=c(8,9,10,12,11,8,8,10,12,11,13,11,8,12,9,12)
> y=c(2256,2340,2426,2293,2330,2368,2250,2409,2364,2379,2440,2364,2404,2317,2309,2328)
> res<-lm(y~T+V) # linear model for y on T and V. The symbol ~
# allows to construct a formula for the relation
> summary(res)
Call:
lm(formula = y ~ T + V)
Residuals:
Min 1Q Median 3Q Max
-21.4972 -13.1978 -0.4736 10.5558 25.4299
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1566.0778 61.5918 25.43 1.80e-12 ***
T 7.6213 0.6184 12.32 1.52e-08 ***
V 8.5848 2.4387 3.52 0.00376 **
---
Signif. codes: 0 โ***โ 0.001 โ**โ 0.01 โ*โ 0.05 โ.โ 0.1 โ โ 1
Residual standard error: 16.36 on 13 degrees of freedom
Multiple R-squared: 0.927, Adjusted R-squared: 0.9157
F-statistic: 82.5 on 2 and 13 DF, p-value: 4.1e-08
Hat Matrix
• The hat matrix H combines the observed and the predicted y-values,
$$\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}$$
and so it "puts the hat" on y. The hat matrix is defined as
$$\mathbf{H} = \mathbf{X}\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T$$
– The diagonal elements h_ii of the n × n matrix H reflect the influence of each value y_i on its own prediction ŷ_i.
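A short check in R (res is assumed to be a fitted lm model, e.g. res <- lm(y ~ T + V) from the example above):

X <- model.matrix(res)                                # n x (m+1) design matrix incl. the column of 1s
H <- X %*% solve(t(X) %*% X) %*% t(X)                 # H = X (X^T X)^-1 X^T
all.equal(unname(diag(H)), unname(hatvalues(res)))    # diagonal elements h_ii = leverages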
Multivariate OLS
• Multivariate Linear Regression relates several y-variables to several x-variables:
$$\mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{E}$$
• In terms of a single y-variable:
$$\mathbf{y}_j = \mathbf{X}\mathbf{b}_j + \mathbf{e}_j$$
• OLS estimator for b_j:
$$\mathbf{b}_j = \left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{y}_j$$
resulting in
$$\mathbf{B} = \left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{Y}$$
and
$$\hat{\mathbf{Y}} = \mathbf{X}\mathbf{B}$$
• Matrix B consists of q loading vectors, each defining a direction in the x-space for a linear latent variable that has maximum Pearson correlation between y_j and ŷ_j for j = 1, …, q
• The regression coefficients for all y-variables can be computed at once, however only for noncollinear x-variables and if m < n
• m x-variables, n observations, and q y-variables
• Alternative methods are PLS2 and CCA.
Variable Selection
• Feature selection
• For multiple regression, all available variables x_1, x_2, …, x_m were used to build a linear model for the prediction of the y-variable
– This is useful as long as the number m of regressor variables is small, say not more than 10
• OLS regression is no longer computable if the
– regressor variables are highly correlated
– number of objects is lower than the number of variables
• PCR and PLS can handle such data.
Variable Selection
• Arguments against the use of all available regressor variables:
– Use of all variables will produce a better fit of the model for the training data
• The residuals become smaller and thus the R² measure increases
• We are usually not interested in maximizing the fit for the training data but in maximizing the prediction performance for the test data
• Reduction of the regressor variables can avoid the effects of overfitting
– A regression model with a high number of variables is practically impossible to interpret.
Variable Selection
• Univariate and Bivariate Selection Methods
– Criteria for the elimination of regressor variables:
• A considerable percentage of the variable values is missing or below a small threshold
• All or nearly all variable values are equal
• The variable includes many and severe outliers
• Compute the correlation between pairs of regressor variables. If the correlation is high (positive or negative), exclude the variable having the larger sum of (absolute) correlation coefficients to all remaining regressor variables.
Variable Selection
– Criteria for the identification of potentially useful regressor variables:
• High variance of the variable
• High (absolute) correlation coefficient with the y-variable.
Variable Selection
• Stepwise Selection Methods
– Add or drop one variable at a time
• Forward selection
– Start with the empty model (or with preselected variables) and add the variable that optimizes a criterion
– Continue to add variables until a stopping rule becomes active
• Backward elimination
– Start with the full model...
• Both directions.
Variable Selection
– An often-used version of stepwise variable selection works as follows:
• Select the variable with the highest absolute correlation coefficient with the y-variable; the number of selected variables is m_0 = 1
• Add each of the remaining x-variables separately to the selected variable; the number of variables in each subset is m_1 = 2
• Calculate F:
$$F = \frac{(RSS_0 - RSS_1)/(m_1 - m_0)}{RSS_1/(n - m_1 - 1)}$$
with RSS being the sum of the squared residuals, Σ(y_i − ŷ_i)²
• Consider the added variable which gives the highest F, and if the decrease of RSS is significant, take this variable as the second selected one
– Significance: F > F_{m_1−m_0, n−m_1−1; 0.95}
• Forward selection of variables would continue in the same way until no significant change occurs
– Disadvantage: a selected variable cannot be removed later on
» Usually the better strategy is to continue with a backward step
» Another forward step is done, followed by another backward step, and so on, until no significant change of RSS occurs or a defined maximum number of variables is reached.
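A sketch of stepwise selection in R; note that the built-in step() function moves in both directions but uses AIC rather than the F-criterion described above (a data frame dat with the response y and the candidate x-variables is assumed):

full  <- lm(y ~ ., data = dat)     # model with all regressor variables
empty <- lm(y ~ 1, data = dat)     # intercept-only model
sel   <- step(empty, scope = formula(full), direction = "both", trace = FALSE)
summary(sel)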
Variable Selection
• Best-Subset/All-Subsets Regression
– Allows excluding complete branches in the tree of all possible subsets, and thus finding the best subset for data sets with up to about 30-40 variables
– Leaps-and-bounds algorithm or regression-tree methods
– Model selection criteria:
• Adjusted R²
• Akaike's Information Criterion (AIC)
$$AIC = n\,\log\!\left(\frac{RSS}{n}\right) + 2m$$
• Bayes Information Criterion (BIC)
• Mallows' Cp.
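A sketch of a best-subset search using the leaps package (assumed to be installed; dat is again a hypothetical data frame with response y):

library(leaps)
best <- regsubsets(y ~ ., data = dat, nvmax = 10)   # leaps-and-bounds search
summary(best)$adjr2                                 # adjusted R^2 of the best subset of each size
summary(best)$bic                                   # BIC of the best subset of each size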
Variable Selection
[Tree of all subsets of x_1, x_2, x_3 used in the branch-and-bound search: full model x_1 + x_2 + x_3; two-variable subsets x_1 + x_2 (AIC = 10), x_1 + x_3, and x_2 + x_3 (AIC = 20); single-variable models x_1 (AIC > 8), x_2 (AIC > 18), and x_3 (AIC > 18)]
Since we want to select the model that gives the smallest value of the AIC, the complete branch below x_2 + x_3 can be ignored, because any submodel in this branch is worse (AIC > 18) than the model x_1 + x_2 with AIC = 10.
Variable Selection
• Variable Selection Based on PCA or PLS Models
– These methods form new latent variables as linear combinations of the regressor variables
$$b_1 x_1 + b_2 x_2 + \dots + b_m x_m$$
– The coefficients/loadings reflect the importance of an x-variable for the new latent variable
– The absolute size of the coefficients can be used as a criterion for variable selection.
Variable Selection
• Genetic Algorithms (GAs)
– A natural-computation method
Variable Selection
[Diagram: genes form chromosomes; a set of chromosomes forms the population]
• Delete chromosomes with poor fitness (selection)
• Create new chromosomes from pairs of good chromosomes (crossover)
• Change a few genes randomly (mutation)
→ New (better) population
Variable Selection
– Crossover
• Two chromosomes are cut at a random position and the parts are connected in a crossover scheme, resulting in two new chromosomes
Variable Selection
• Cluster Analysis of Variables
– Cluster analysis tries to identify homogeneous groups in the data
– If it is applied to the correlation matrix of the regressor variables, one may obtain groups of variables that are strongly related, while variables in different groups have only weak correlations.
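A minimal R sketch of such a clustering of variables, using 1 − |correlation| as the dissimilarity (X is assumed to be the matrix of x-variables):

d  <- as.dist(1 - abs(cor(X)))      # strongly correlated variables get small dissimilarities
hc <- hclust(d, method = "average")
plot(hc)                            # dendrogram: groups of strongly related variables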