Multivariate Analysis – Multivariate Calibration, part 2
Prof. Dr. Anselmo E. de Oliveira
anselmo.quimica.ufg.br
[email protected]


Page 1: Multivariate Analysis - UFG

Multivariate Analysis – Multivariate Calibration, part 2

Prof. Dr. Anselmo E de Oliveira

anselmo.quimica.ufg.br

[email protected]

Page 2: Multivariate Analysis - UFG

Linear Latent Variables

• An essential concept in multivariate data analysis is the mathematical combination of several variables into a new variable that has a certain desired property. In chemometrics such a new variable is often called a latent variable

• Linear latent variable:

  $u = b_1 x_1 + b_2 x_2 + \cdots + b_m x_m$

  $u$ – score, the value of the linear latent variable

  $b_j$ – loadings, coefficients describing the influence of the variables on the score

  $x_j$ – variables, features.

Page 3: Multivariate Analysis - UFG

Calibration

• Linear models:

  $y = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_j x_j + \cdots + b_m x_m + e$

  $b_0$ is called the intercept

  $b_1$ to $b_m$ are the regression coefficients

  $m$ is the number of variables

  $e$ is the residual (error term)

  – Often mean-centered data are used, and then $b_0$ becomes zero

• This model corresponds to a linear latent variable.

Page 4: Multivariate Analysis - UFG

Calibration

• The parameters of a model are estimated from a calibration set (training set) containing the values of the $x$-variables and $y$ for $n$ samples (objects)

• The resulting model is evaluated with a test set (with known values for $x$ and $y$)

• Because modeling and prediction of the $y$-data is a defined aim of data analysis, this type of data treatment is called supervised learning.

Page 5: Multivariate Analysis - UFG

Calibration

• All regression methods aim at the minimization of residuals, for instance minimization of the sum of the squared residuals

• It is essential to focus on minimal prediction errors for new cases – the test set – and not only on the calibration set from which the model has been created
  – It is relatively easy to create a model – especially with many variables and possibly nonlinear features – that fits the calibration data very well; however, it may be useless for new cases. This effect of overfitting is a crucial topic in model creation.

Page 6: Multivariate Analysis - UFG

Calibration

• Regression can be performed directly with the values of the variables (ordinary least-squares regression, OLS), but in the most powerful methods, such as principal component regression (PCR) and partial least-squares regression (PLS), it is done via a small set of intermediate linear latent variables (the components). This approach has important advantages:
  – Data with highly correlating $x$-variables can be used
  – Data sets with more variables than samples can be used
  – The complexity of the model can be controlled by the number of components, and thus overfitting can be avoided and maximum prediction performance for test set data can be approached.

Page 7: Multivariate Analysis - UFG

Calibration

• Depending on the type of data, different methods are available

  Number of x-variables | Number of y-variables | Name         | Methods
  ----------------------|-----------------------|--------------|--------------------------------------------------------------
  1                     | 1                     | Simple       | Simple OLS, robust regression
  Many                  | 1                     | Multiple     | PLS, PCR, multiple OLS, robust regression, Ridge regression, Lasso regression
  Many                  | Many                  | Multivariate | PLS2, Canonical Correlation Analysis (CCA)

Page 8: Multivariate Analysis - UFG

Calibration

$y = 1.418 + 4.423 x_1 + 4.101 x_2 - 0.0357 x_3$

Page 9: Multivariate Analysis - UFG

Calibration

Page 10: Multivariate Analysis - UFG

Calibration

OLS model using only $x_1$ and $x_2$:

$y = 1.353 + 4.433 x_1 + 4.126 x_2$

Page 11: Multivariate Analysis - UFG

Performance of Regression Models

• Any model for prediction makes sense only if appropriate criteria are defined and applied to measure the performance of the model

• For models based on regression, the residuals (prediction errors)

  $e_i = y_i - \hat{y}_i$

  are the basis for performance measures, with $y_i$ the given (experimental, "true") value and $\hat{y}_i$ the predicted (modeled) value of object $i$

• An often-used performance measure estimates the standard deviation of the prediction errors (standard error of prediction, SEP).

Page 12: Multivariate Analysis - UFG

Performance of Regression Models

• Using the same objects for calibration and test should be strictly avoided

• Depending on the size of the data set (the number of objects available) – and on the effort of work – different strategies are possible

• The following levels are ordered by typical application to data of decreasing size, and also by decreasing reliability of the results.

Page 13: Multivariate Analysis - UFG

Performance of Regression Models

1. If data from many objects are available, a split into three sets is best:

i. Training set (ca. 50% of the objects) for creating models

ii. Validation set (ca. 25% of the objects) for optimizing the model to obtain good prediction performance

iii. Test set (prediction set, approximately 25%) for testing the final model to obtain a realistic estimation of the prediction performance for new cases

iv. The three sets are treated separately

v. Applications in chemistry rarely allow this strategy because the number of available objects is too small.

Page 14: Multivariate Analysis - UFG

Performance of Regression Models

2. The data is split into a calibration set used for model creation and optimization and a test set (prediction set) to obtain a realistic estimation of the prediction performance for new cases

i. The calibration set is divided into a training set and a validation set by cross validation (CV) or bootstrap

ii. First the optimum complexity (for instance the optimum number of PLS components) of the model is estimated, and then a model is built from the whole calibration set applying the found optimum complexity; this model is applied to the test set.

Page 15: Multivariate Analysis - UFG

Performance of Regression Models

3. CV or bootstrap is used to split the data set into different calibration sets and test sets

i. A calibration set is used as described in (2) to create an optimized model, and this is applied to the corresponding test set

ii. In principle, all objects are used in the training set, validation set, and test set; however, an object is never used simultaneously for model creation and for testing

iii. This strategy (double CV, double bootstrap, or a combination of CV and bootstrap) is applicable to a relatively small number of objects; furthermore, the process can be repeated many times with different random splits, resulting in a high number of test-set-predicted values.

Page 16: Multivariate Analysis - UFG

Performance of Regression Models

• Mostly the split of the objects into training, validation, and test sets is performed by simple random sampling

• More sophisticated procedures – related to experimental design and the theory of sampling – are available
  – Kennard-Stone algorithm
    • It aims to set up a calibration set that is representative of the population, to cover the $x$-space as uniformly as possible, and to give more weight to objects outside the center
    • This aim is reached by selecting objects with maximum distances (for instance Euclidean distances) in the $x$-space (a minimal sketch follows).
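The following is a minimal R sketch of the Kennard-Stone idea on Euclidean distances (invented data; packaged implementations exist, for instance in the prospectr package): start from the two most distant objects, then repeatedly add the object whose smallest distance to the already selected objects is largest.

kennard_stone <- function(X, k) {
  D <- as.matrix(dist(X))                                   # pairwise Euclidean distances
  sel <- as.vector(which(D == max(D), arr.ind = TRUE)[1, ]) # two most distant objects
  while (length(sel) < k) {
    rest <- setdiff(seq_len(nrow(X)), sel)
    dmin <- apply(D[rest, sel, drop = FALSE], 1, min)       # distance to selected set
    sel <- c(sel, rest[which.max(dmin)])
  }
  sel
}
set.seed(1)
X <- matrix(rnorm(40), ncol = 2)   # invented data: 20 objects, 2 variables
kennard_stone(X, 5)                # indices of 5 calibration objects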

Page 17: Multivariate Analysis - UFG

Performance of Regression Models

– The methods CV and bootstrap are called resampling strategies
  • They are suited to small data sets
  • They are applied to obtain a reasonably high number of predictions
  • The larger the data set and the more reliable the data are, the better the prediction performance can be estimated:

    size of the data × friendliness of the data × uncertainty of the performance measure = constant.

Page 18: Multivariate Analysis - UFG

Overfitting and Underfitting

• The more complex (the "larger") a model is, the better it is capable of fitting the given data (the calibration set)

• The prediction error for the calibration set in general decreases with increasing complexity of the model

  – An appropriately highly complicated model can fit almost any data with almost zero deviations (residuals) between the experimental (true) $y$ and the modeled (predicted) $\hat{y}$

  – Such models are not necessarily useful for new cases, because they are probably overfitted.

Page 19: Multivariate Analysis - UFG

Overfitting and Underfitting

Image source: Dr. Frank Dieterle

Page 20: Multivariate Analysis - UFG

Overfitting and Underfitting

Image source: holehouse.org

Page 21: Multivariate Analysis - UFG

Overfitting and Underfitting

• Prediction errors for new cases are high for "small" models (underfitting, low complexity, too simple models) but also for overfitted models

• Determination of the optimum complexity of a model is an important but not always easy task, because the minimum of the prediction-error measures for test sets is often not well marked
  – In chemometrics, the complexity is typically controlled by the number of PLS or PCA components (latent variables), and the optimum complexity is estimated by CV
  – CV or bootstrap allows an estimation of the prediction error for each object of the calibration set at each considered model complexity.

Page 22: Multivariate Analysis - UFG

Performance Criteria

• The basis of all performance criteria are the prediction errors (residuals) $y_i - \hat{y}_i$

• The classical standard deviation of the prediction errors is widely used as a measure of the spread of the error distribution, and is called the standard error of prediction (SEP), defined by

  $SEP = \sqrt{\frac{1}{z-1} \sum_{i=1}^{z} (y_i - \hat{y}_i - bias)^2}$

  with

  $bias = \frac{1}{z} \sum_{i=1}^{z} (y_i - \hat{y}_i)$

  where $y_i$ are the given (experimental, "true") values, $\hat{y}_i$ are the predicted (modeled) values, and $z$ is the number of predictions.

Page 23: Multivariate Analysis - UFG

Performance Criteria

• The bias is the arithmetic mean of the prediction errors and should be near zero
  – A systematic error (a nonzero bias) may appear if, for instance, a calibration model is applied to data that have been produced by another instrument

• In the case of a normal distribution, about 95% of the prediction errors are within the tolerance interval $\pm 2 \cdot SEP$
  – The measure SEP and the tolerance interval are given in the units of $y$.

Page 24: Multivariate Analysis - UFG

Performance Criteria

• The standard error of calibration (SEC) is similar to SEP but applied to predictions of the calibration set

• The mean squared error (MSE) is the arithmetic mean of the squared errors:

  $MSE = \frac{1}{z} \sum_{i=1}^{z} (y_i - \hat{y}_i)^2$

• MSEC refers to results from a calibration set, MSECV to results obtained in CV, and MSEP to results from a prediction/test set

• MSE minus the squared bias gives the squared SEP:

  $SEP^2 = MSE - bias^2$.

Page 25: Multivariate Analysis - UFG

Performance Criteria

• The root mean squared error (RMSE) is the square root of the MSE, and can again be given for calibration (RMSEC), CV (RMSECV), or prediction/test (RMSEP):

  $RMSE = \sqrt{\frac{1}{z} \sum_{i=1}^{z} (y_i - \hat{y}_i)^2}$

• MSE is preferably used during the development and optimization of models but is less useful for practical applications because it does not have the units of the predicted property
  – A similar widely used measure is the predicted residual error sum of squares (PRESS), the sum of the squared errors:

  $PRESS = \sum_{i=1}^{z} (y_i - \hat{y}_i)^2 = z \cdot MSE$.
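The same invented residuals as in the SEP example illustrate MSE, RMSE, PRESS, and the relation $MSE = SEP^2 + bias^2$ (exact when SEP is computed with the 1/z convention):

y    <- c(2.1, 3.0, 4.2, 5.1)
yhat <- c(2.0, 3.2, 4.0, 5.3)
e    <- y - yhat
z    <- length(e)
MSE   <- mean(e^2)      # (1/z) * sum(e^2)
RMSE  <- sqrt(MSE)      # in the units of y
PRESS <- sum(e^2)       # equals z * MSE
MSE - mean(e)^2         # squared SEP (1/z convention)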

Page 26: Multivariate Analysis - UFG

Performance Criteria

• Correlation measures between the experimental $y$ and the predicted $\hat{y}$ are frequently used to characterize the model performance
  – Mostly used is the squared Pearson correlation coefficient.

Page 27: Multivariate Analysis - UFG

Criteria for Models with Different Numbers of Variables

• The model should not contain too small a number of variables, because this leads to poor prediction performance. On the other hand, it should also not contain too large a number of variables, because this results in overfitting and thus again poor prediction performance

• The adjusted R-square, $R^2_{adj}$ (also written $R'$), is

  $R^2_{adj} = 1 - \frac{n-1}{n-m-1} (1 - R^2)$

  where $n$ is the number of objects, $m$ is the number of regressor variables (not counting the intercept), and $R^2$ is the coefficient of determination, expressing the proportion of variance that is explained by the model.

Page 28: Multivariate Analysis - UFG

Criteria for Models with Different Numbers of Variables

• Another, equivalent representation of $R^2_{adj}$ is

  $R^2_{adj} = 1 - \frac{RSS/(n-m-1)}{TSS/(n-1)}$

  with the residual sum of squares (RSS), the sum of the squared residuals,

  $RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

  and the total sum of squares (TSS), the sum of the squared differences from the mean $\bar{y}$ of $y$,

  $TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2$.

Page 29: Multivariate Analysis - UFG

Cross Validation (CV)

• CV is the most used resampling strategy to obtain a reasonably large number of predictions

• It is also often applied to optimize the complexity of a model
  – to find the optimum number of PLS or PCA components
  – to split the data into calibration sets and test sets.

Page 30: Multivariate Analysis - UFG

Cross Validation (CV)

• The procedure of CV applied to model optimization (a minimal R sketch follows the list)
  – The available set with $n$ objects is randomly split into $s$ segments (parts) of approximately equal size
    • The number of segments can be 2 to $n$; often values between 4 and 10 are used
  – One segment is left out as a validation set
  – The other $s-1$ segments are used as a training set to create models of increasing complexity (for instance, 1, 2, 3, ..., $a_{max}$ PLS components)
  – The models are separately applied to the objects of the validation set, resulting in predicted values connected to the different model complexities
  – This procedure is repeated so that each segment is a validation set once
  – The result is a matrix with $n$ rows and $a_{max}$ columns containing the predicted values $\hat{y}_{CV}$ (predicted by CV) for all objects and all considered model complexities
  – From this matrix and the known $y$-values, a residual matrix is computed
  – An error measure (for instance MSECV) is calculated from the residuals, and the lowest MSECV or a similar criterion indicates the optimum model complexity.
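A minimal R sketch of this procedure (invented data; the polynomial degree of a univariate model stands in for the number of PLS/PCA components):

set.seed(1)
n <- 60
x <- runif(n, 0, 10)
y <- 2 + 1.5 * x - 0.1 * x^2 + rnorm(n)
s    <- 5                                      # number of segments
seg  <- sample(rep(1:s, length.out = n))       # random split into segments
amax <- 4                                      # maximum considered complexity
pred <- matrix(NA_real_, n, amax)              # n x amax matrix of CV predictions
for (k in 1:s) {
  val <- seg == k                              # current validation segment
  for (a in 1:amax) {
    fit <- lm(y ~ poly(x, a), subset = !val)   # train on the other s-1 segments
    pred[val, a] <- predict(fit, newdata = data.frame(x = x[val]))
  }
}
MSECV <- colMeans((y - pred)^2)                # error measure per complexity
which.min(MSECV)                               # optimum model complexity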

Page 32: Multivariate Analysis - UFG

Cross Validation (CV)

• A single CV gives $n$ predictions

• For many data sets in chemistry $n$ is too small for a visualization of the error distribution

• The performance measure depends on the split of the objects into segments

• It is therefore recommended to repeat the CV with different random splits into segments (repeated CV), and to summarize the results.

Page 33: Multivariate Analysis - UFG

Cross Validation (CV)

• If the number of segments is equal to the number of objects (each segment contains only one object), the method is called leave-one-out CV or full CV
  – Randomization of the sequence of objects is pointless; therefore only one CV run is necessary, resulting in $n$ predictions
  – The number of created models is $n$, which may be time-consuming for large data sets
  – Depending on the data, full CV may give too optimistic results, especially if pairwise similar objects are in the data set, for instance from duplicate measurements
  – Full CV is easier to apply than repeated CV or bootstrap
  – In many cases, full CV gives a reasonable first estimate of the model performance.

Page 34: Multivariate Analysis - UFG

Bootstrap

• As a verb, bootstrap has survived into the computer age, being the origin of the phrase "booting a computer". "Boot" is a shortened form of the word "bootstrap", and refers to the initial commands necessary to load the rest of the computer's operating system. Thus, the system is "booted", or "pulled up by its own bootstraps".

Page 35: Multivariate Analysis - UFG

Bootstrap

• Within multivariate analysis, bootstrap is a resampling method that can be used as an alternative to CV, for instance to estimate the prediction performance of a model or to estimate the optimum complexity

• In general, bootstrap can be used to estimate the distribution of model parameters

• Basic ideas of bootstrapping are resampling with replacement, and using calibration sets with the same number of objects, $n$, as in the available data set
  – A calibration set is obtained by randomly selecting objects and copying (not moving) them into the calibration set (a minimal sketch follows).

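A minimal R sketch of the resampling-with-replacement idea (invented data; the statistic followed here is the slope of a simple OLS model):

set.seed(1)
n <- 25
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
B <- 200                                   # number of bootstrap replications
slopes <- numeric(B)
for (b in 1:B) {
  idx <- sample(n, replace = TRUE)         # copy objects into the calibration set
  slopes[b] <- coef(lm(y[idx] ~ x[idx]))[2]
}
quantile(slopes, c(0.025, 0.975))          # spread of the estimated slope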

Page 36: Multivariate Analysis - UFG

Ordinary Least-Squares Regression

Page 37: Multivariate Analysis - UFG

Simple OLS

๐’š = ๐‘0 + ๐‘๐’™ + ๐’†

โ€ข ๐‘ and ๐‘0 are the regression parameters (regression coefficients)

โ€“ ๐‘0 is the intercept and ๐‘ is the slope

โ€ข Since the data will in general not follow a perfect linear relation, the vector ๐’† contains the residuals (errors) ๐‘’1, ๐‘’2, โ€ฆ , ๐‘’๐‘› .

Page 38: Multivariate Analysis - UFG

Simple OLS

• The predicted (modeled) property $\hat{y}_i$ for sample $i$ and the prediction error $e_i$ are calculated by

  $\hat{y}_i = b_0 + b x_i$
  $e_i = y_i - \hat{y}_i$

• The ordinary least-squares (OLS) approach minimizes the sum of the squared residuals $\sum e_i^2$ to estimate the model parameters $b$ and $b_0$:

  $b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$

  $b_0 = \bar{y} - b\bar{x}$

  – For mean-centered data ($\bar{x} = 0$, $\bar{y} = 0$), $b_0 = 0$ and

  $b = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2} = \frac{\mathbf{x}^T\mathbf{y}}{\mathbf{x}^T\mathbf{x}}$.

Page 39: Multivariate Analysis - UFG

Simple OLS

• The described model best fits the given (calibration) data, but is not necessarily optimal for predictions

• The least-squares approach can become very unreliable if outliers are present in the data

• Assumptions for obtaining reliable estimates:
  – Errors are only in $y$ but not in $x$
  – Residuals are uncorrelated and normally distributed with mean 0 and constant variance $\sigma^2$ (homoscedasticity).

Page 40: Multivariate Analysis - UFG

Simple OLS

• The following scatterplots are plots of $x_i$ (measurement) vs. $i$ (observation number), with the sample mean marked by a red horizontal line. The measurement is plotted on the vertical axis; the observation number on the horizontal axis
  – Unbiased: the average of the observations in every thin vertical strip is the same all the way across the scatterplot
  – Biased: the average of the observations changes, depending on which thin vertical strip you pick
  – Homoscedastic: the variation ($\sigma$) of the observations is the same in every thin vertical strip all the way across the scatterplot
  – Heteroscedastic: the variation ($\sigma$) of the observations in a thin vertical strip changes, depending on which vertical strip you pick

[Panels: (a) unbiased and homoscedastic; (b) unbiased and heteroscedastic; (c) biased and homoscedastic; (d) biased and heteroscedastic; (e) unbiased and heteroscedastic; (f) biased and homoscedastic]

Page 41: Multivariate Analysis - UFG

Simple OLS

• Besides estimating the regression coefficients, it is also of interest to estimate the variation of the measurements around the fitted regression line. This means that the residual variance $\sigma^2$ has to be estimated:

  $s_e^2 = \frac{1}{n-2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \frac{1}{n-2} \sum_{i=1}^{n} e_i^2$

  The denominator $n-2$ is used here because two parameters are necessary for a fitted straight line, and this makes $s_e^2$ an unbiased estimator for $\sigma^2$

• Confidence intervals for intercept and slope:

  $b_0 \pm t_{n-2;p}\, s_{b_0}$
  $b \pm t_{n-2;p}\, s_b$

• Standard deviations of $b_0$ and $b$:

  $s_{b_0} = s_e \sqrt{\frac{\sum_{i=1}^{n} x_i^2}{n \sum_{i=1}^{n} (x_i - \bar{x})^2}}$

  $s_b = \frac{s_e}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$

  $t_{n-2;p}$ is the $p$-quantile of the $t$-distribution with $n-2$ degrees of freedom, with for instance $p = 0.025$ for a 95% confidence interval.

Page 42: Multivariate Analysis - UFG

Simple OLS

• Confidence interval for the residual variance $\sigma^2$:

  $\frac{(n-2)\, s_e^2}{\chi^2_{n-2;1-p}} < \sigma^2 < \frac{(n-2)\, s_e^2}{\chi^2_{n-2;p}}$

  where $\chi^2_{n-2;1-p}$ and $\chi^2_{n-2;p}$ are the appropriate quantiles of the chi-square distribution (ref 1, table) with $n-2$ degrees of freedom (e.g., $p = 0.025$ for a 95% confidence interval)

• Null hypothesis $H_0$: $b_0 = 0$, with test statistic

  $T_{b_0} = \frac{b_0}{s_{b_0}}$

  – $H_0$ is rejected at the significance level $\alpha$ if $|T_{b_0}| > t_{n-2;1-\alpha/2}$
  – The test for $b = 0$ is equivalent:

  $T_b = \frac{b}{s_b}$.

Page 43: Multivariate Analysis - UFG

Simple OLS

• Often it is of interest to obtain a confidence interval for the prediction at a new value $x$:

  $\hat{y} \pm s_e \sqrt{2 F_{2,n-2;p}} \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$

  with $F_{2,n-2;p}$ the $p$-quantile of the $F$-distribution with 2 and $n-2$ degrees of freedom

  – The best predictions are possible in the mid part of the range of $x$, where most information is available.
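In R, pointwise confidence intervals for the fitted line at new x values are available via predict(); note that predict() uses a t-quantile (pointwise band), whereas the F-based formula above gives the slightly wider band that holds for the whole line simultaneously:

x <- c(1.5,2,2.5,2.9,3.4,3.7,4,4.2,4.6,5,5.5,5.7,6.6)
y <- c(3.5,6.1,5.6,7.1,6.2,7.2,8.9,9.1,8.5,9.4,9.5,11.3,11.1)
res <- lm(y ~ x)
predict(res, newdata = data.frame(x = c(2, 4, 6)),
        interval = "confidence", level = 0.95)   # fit, lwr, upr at each new x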

Page 44: Multivariate Analysis - UFG

Simple OLS

• Using the open source software R (An Introduction to R):

> x=c(1.5,2,2.5,2.9,3.4,3.7,4,4.2,4.6,5,5.5,5.7,6.6)
> y=c(3.5,6.1,5.6,7.1,6.2,7.2,8.9,9.1,8.5,9.4,9.5,11.3,11.1)
> res<-lm(y~x)  # linear model for y on x; the symbol ~
>               # allows to construct a formula for the relation
> plot(x,y)
> abline(res)

Page 45: Multivariate Analysis - UFG

Simple OLS

> summary(res)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-0.9724 -0.5789 -0.2855  0.8124  0.9211

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.3529     0.6186   3.804  0.00293 **
x             1.4130     0.1463   9.655 1.05e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7667 on 11 degrees of freedom
Multiple R-squared: 0.8945, Adjusted R-squared: 0.8849
F-statistic: 93.22 on 1 and 11 DF, p-value: 1.049e-06

Reading the output:
– Estimate: $b_0$ and $b$; Std. Error: $s_{b_0}$ and $s_b$; t value: $T_{b_0}$ and $T_b$. Since the $p$-values are much smaller than a reasonable significance level, e.g. $\alpha = 0.05$, both intercept and slope are important in our regression model
– Residual standard error: $s_e$
– Multiple R-squared: in the case of univariate $x$ and $y$ this is the same as the squared Pearson correlation coefficient, and it is a measure of model fit
– F-statistic: tests whether all parameters are zero, against the alternative that at least one regression parameter is different from zero. Since the $p$-value is ~0, at least one of intercept or slope contributes to the regression model
– 1Q: the first quartile is the value below which 25% of the numbers lie.

Page 46: Multivariate Analysis - UFG

Multiple OLS

๐‘ฆ1 = ๐‘0 + ๐‘1๐‘ฅ11 + ๐‘2๐‘ฅ12 +โ‹ฏ+ ๐‘๐‘š๐‘ฅ1๐‘š + ๐‘’1 ๐‘ฆ2 = ๐‘0 + ๐‘1๐‘ฅ21 + ๐‘2๐‘ฅ22 +โ‹ฏ+ ๐‘๐‘š๐‘ฅ2๐‘š + ๐‘’2

โ‹ฎ ๐‘ฆ๐‘› = ๐‘0 + ๐‘1๐‘ฅ๐‘›1 + ๐‘2๐‘ฅ๐‘›2 +โ‹ฏ+ ๐‘๐‘š๐‘ฅ๐‘›๐‘š + ๐‘’๐‘›

or ๐’š = ๐‘ฟ๐’ƒ + ๐’†

โ€ข Multiple Linear Regression Model โ€ข ๐‘ฟ of size ๐‘› ร— ๐‘š + 1 which includes in its first column ๐‘› values of

1 โ€ข The residuals are calculated by ๐’† = ๐’š โˆ’ ๐’š โ€ข The regression coefficients ๐’ƒ = ๐‘0, ๐‘1, โ€ฆ , ๐‘๐‘š

๐‘‡ result from the OLS estimation minimizing the sum of squared residuals ๐’†๐‘ป๐’†

๐’ƒ = ๐‘ฟ๐‘ป๐‘ฟโˆ’๐Ÿ๐‘ฟ๐‘ป๐’š .

Page 47: Multivariate Analysis - UFG

Multiple OLS

• Confidence intervals and statistical tests
  – The following assumptions must be fulfilled:
    • The errors $\mathbf{e}$ are independent and $n$-dimensionally normally distributed
    • with mean vector $\mathbf{0}$
    • and covariance matrix $\sigma^2 \mathbf{I}_n$
  – An unbiased estimator for the residual variance $\sigma^2$ is

    $s_e^2 = \frac{1}{n-m-1} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \frac{1}{n-m-1} (\mathbf{y} - \mathbf{X}\mathbf{b})^T (\mathbf{y} - \mathbf{X}\mathbf{b})$

  – The null hypothesis $b_j = 0$ against the alternative $b_j \neq 0$ can be tested with the test statistic

    $z_j = \frac{b_j}{s_e \sqrt{d_j}}$

    where $d_j$ is the $j$th diagonal element of $(\mathbf{X}^T\mathbf{X})^{-1}$
  – The distribution of $z_j$ is $t_{n-m-1}$, and thus a large absolute value of $z_j$ will lead to a rejection of the null hypothesis
  – An $F$-test can also be constructed to test the null hypothesis $b_0 = b_1 = \cdots = b_m = 0$ against the alternative $b_j \neq 0$ for any $j = 0, 1, \ldots, m$.

Page 48: Multivariate Analysis - UFG

Multiple OLS

• Using the open source software R:

> T=c(80,93,100,82,90,99,81,96,94,93,97,95,100,85,86,87)
> V=c(8,9,10,12,11,8,8,10,12,11,13,11,8,12,9,12)
> y=c(2256,2340,2426,2293,2330,2368,2250,2409,2364,2379,2440,2364,2404,2317,2309,2328)
> res<-lm(y~T+V)  # linear model for y on T and V; the symbol ~
>                 # allows to construct a formula for the relation
> summary(res)

Call:
lm(formula = y ~ T + V)

Residuals:
     Min       1Q   Median       3Q      Max
-21.4972 -13.1978  -0.4736  10.5558  25.4299

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 1566.0778    61.5918   25.43 1.80e-12 ***
T              7.6213     0.6184   12.32 1.52e-08 ***
V              8.5848     2.4387    3.52  0.00376 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.36 on 13 degrees of freedom
Multiple R-squared: 0.927, Adjusted R-squared: 0.9157
F-statistic: 82.5 on 2 and 13 DF, p-value: 4.1e-08

Page 49: Multivariate Analysis - UFG

Hat Matrix

• The hat matrix $\mathbf{H}$ combines the observed and predicted $y$-values,

  $\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}$

  and so it "puts the hat" on $\mathbf{y}$. The hat matrix is defined as

  $\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$

  – The diagonal elements $h_{ii}$ of the $n \times n$ matrix $\mathbf{H}$ reflect the influence of each value $y_i$ on its own prediction $\hat{y}_i$.
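A quick R check (invented data) that the explicit hat matrix reproduces the fitted values, and that its diagonal equals the leverages reported by hatvalues():

set.seed(1)
n <- 10
X0 <- matrix(rnorm(n * 2), n, 2)
y  <- as.vector(1 + X0 %*% c(1, 2) + rnorm(n))
X  <- cbind(1, X0)
H  <- X %*% solve(t(X) %*% X) %*% t(X)                       # n x n hat matrix
all.equal(as.vector(H %*% y), unname(fitted(lm(y ~ X0))))    # yhat = H y
all.equal(unname(diag(H)), unname(hatvalues(lm(y ~ X0))))    # h_ii = leverages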

Page 50: Multivariate Analysis - UFG

Multivariate OLS

• Multivariate linear regression relates several $y$-variables to several $x$-variables:

  $\mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{E}$

• In terms of a single $y$-variable: $\mathbf{y}_j = \mathbf{X}\mathbf{b}_j + \mathbf{e}_j$

• The OLS estimator for $\mathbf{b}_j$ is

  $\mathbf{b}_j = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}_j$

  resulting in

  $\mathbf{B} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$

  and

  $\hat{\mathbf{Y}} = \mathbf{X}\mathbf{B}$

• Matrix $\mathbf{B}$ consists of $q$ loading vectors, each defining a direction in the $x$-space for a linear latent variable which has maximum Pearson correlation coefficient between $\mathbf{y}_j$ and $\hat{\mathbf{y}}_j$ for $j = 1, \ldots, q$

• The regression coefficients for all $y$-variables can be computed at once, however only for noncollinear $x$-variables and if $m < n$

• $m$ $x$-variables, $n$ observations, $q$ $y$-variables

• Alternative methods are PLS2 and CCA.
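All q coefficient vectors can be computed at once in R (invented data with m = 3 x-variables, q = 2 y-variables, and m < n):

set.seed(1)
n <- 15; m <- 3; q <- 2
X0 <- matrix(rnorm(n * m), n, m)
Y  <- X0 %*% matrix(rnorm(m * q), m, q) + matrix(rnorm(n * q), n, q)
X  <- cbind(1, X0)                               # column of ones for the intercept
B  <- solve(t(X) %*% X, t(X) %*% Y)              # (m+1) x q matrix, one column per y
Yhat <- X %*% B                                  # predicted Y
all.equal(unname(B), unname(coef(lm(Y ~ X0))))   # lm() also fits all y at once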

Page 51: Multivariate Analysis - UFG

Variable Selection

• Feature selection
• For multiple regression all available variables $x_1, x_2, \ldots, x_m$ were used to build a linear model for the prediction of the $y$-variable
  – Useful as long as the number $m$ of regressor variables is small, say not more than 10
• OLS regression is no longer computable if
  – the regressor variables are highly correlated
  – the number of objects is lower than the number of variables
• PCR and PLS can handle such data.

Page 52: Multivariate Analysis - UFG

Variable Selection

• Arguments against the use of all available regressor variables
  – Use of all variables will produce a better fit of the model for the training data
    • The residuals become smaller and thus the $R^2$ measure increases
    • We are usually not interested in maximizing the fit for the training data but in maximizing the prediction performance for the test data
    • Reduction of the regressor variables can avoid the effects of overfitting
  – A regression model with a high number of variables is practically impossible to interpret.

Page 53: Multivariate Analysis - UFG

Variable Selection

• Univariate and Bivariate Selection Methods
  – Criteria for the elimination of regressor variables:
    • A considerable percentage of the variable values is missing or below a small threshold
    • All or nearly all variable values are equal
    • The variable includes many and severe outliers
    • Compute the correlation between pairs of regressor variables. If the correlation is high (positive or negative), exclude the variable having the larger sum of (absolute) correlation coefficients to all remaining regressor variables (see the sketch below).
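A sketch of this pairwise rule in R (invented data; x4 is constructed to be nearly collinear with x1):

set.seed(1)
X <- matrix(rnorm(100 * 4), 100, 4)
X[, 4] <- X[, 1] + rnorm(100, sd = 0.1)          # x4 nearly collinear with x1
R <- abs(cor(X))
diag(R) <- 0
pair <- which(R == max(R), arr.ind = TRUE)[1, ]  # most correlated pair
pair[which.max(rowSums(R)[pair])]                # exclude the variable with the
                                                 # larger total absolute correlation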

Page 54: Multivariate Analysis - UFG

Variable Selection

– Criteria for the identification of potentially useful regressor variables:
  • High variance of the variable
  • High (absolute) correlation coefficient with the $y$-variable.

Page 55: Multivariate Analysis - UFG

Variable Selection

• Stepwise Selection Methods
  – Add or drop one variable at a time
• Forward selection
  – Start with the empty model (or with preselected variables) and add the variable that optimizes a criterion
  – Continue to add variables until a stopping rule becomes active
• Backward elimination
  – Start with the full model...
• Both directions.

Page 56: Multivariate Analysis - UFG

Variable Selection

– An often-used version of stepwise variable selection works as follows (see the anova() sketch after this list):
  • Select the variable with the highest absolute correlation coefficient with the $y$-variable; the number of selected variables is $m_0 = 1$
  • Add each of the remaining $x$-variables separately to the selected variable; the number of variables in each subset is $m_1 = 2$
  • Calculate

    $F = \frac{(RSS_0 - RSS_1)/(m_1 - m_0)}{RSS_1/(n - m_1 - 1)}$

    with $RSS$ being the sum of the squared residuals, $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
  • Consider the added variable which gives the highest $F$, and if the decrease of $RSS$ is significant, take this variable as the second selected one
    – Significance: $F > F_{m_1 - m_0,\, n - m_1 - 1;\, 0.95}$
  • Forward selection of variables would continue in the same way until no significant change occurs
    – Disadvantage: a selected variable cannot be removed later on
      » Usually the better strategy is to continue with a backward step
      » Another forward step is done, followed by another backward step, and so on, until no significant change of $RSS$ occurs or a defined maximum number of variables is reached.
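The F statistic above is exactly what R's anova() reports when comparing two nested OLS models (invented data):

set.seed(1)
n <- 40
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 + 0.5 * x2 + rnorm(n)
fit0 <- lm(y ~ x1)        # m0 = 1 selected variable, residual sum RSS0
fit1 <- lm(y ~ x1 + x2)   # m1 = 2 variables, residual sum RSS1
anova(fit0, fit1)         # F = ((RSS0-RSS1)/(m1-m0)) / (RSS1/(n-m1-1))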

Page 57: Multivariate Analysis - UFG

Variable Selection

• Best-Subset/All-Subsets Regression
  – Allows excluding complete branches in the tree of all possible subsets, and thus finding the best subset for data sets with up to about 30-40 variables
  – Leaps-and-bounds algorithm or regression-tree methods
  – Model selection criteria:
    • Adjusted $R^2$
    • Akaike's Information Criterion (AIC):

      $AIC = n \log\left(\frac{RSS}{n}\right) + 2m$

    • Bayes Information Criterion (BIC)
    • Mallows' $C_p$.
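In R, AIC-guided stepwise selection is available through step(); a small invented example where x3 is pure noise (R's extractAIC differs from the formula above only by an additive constant, which does not change the ranking of models):

set.seed(1)
n <- 50
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(n)
full <- lm(y ~ x1 + x2 + x3, data = dat)
step(full, direction = "both")    # drops the noise variable x3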

Page 58: Multivariate Analysis - UFG

Variable Selection

๐‘ฅ1 + ๐‘ฅ2 + ๐‘ฅ3

๐‘ฅ1 + ๐‘ฅ2

๐ด๐ผ๐ถ = 10

๐‘ฅ1

๐ด๐ผ๐ถ > 8

๐‘ฅ1 + ๐‘ฅ3

๐‘ฅ2

๐ด๐ผ๐ถ > 18

๐‘ฅ2 + ๐‘ฅ3

๐ด๐ผ๐ถ = 20

๐‘ฅ3

๐ด๐ผ๐ถ > 18

Since we want to select the model which gives the smallest value of the ๐ด๐ผ๐ถ, the complete branch with ๐‘ฅ2 + ๐‘ฅ3 can be ignored, bacause any submodel in this branch is worse (๐ด๐ผ๐ถ > 18) than the model ๐‘ฅ1 + ๐‘ฅ2 with ๐ด๐ผ๐ถ = 10

Page 59: Multivariate Analysis - UFG

Variable Selection

• Variable Selection Based on PCA or PLS Models
  – These methods form new latent variables as linear combinations of the regressor variables:

    $b_1 x_1 + b_2 x_2 + \cdots + b_m x_m$

  – The coefficients/loadings reflect the importance of an $x$-variable for the new latent variable
  – The absolute size of the coefficients can be used as a criterion for variable selection.

Page 60: Multivariate Analysis - UFG

Variable Selection

• Genetic Algorithms (GAs)
  – A natural computation method

Page 61: Multivariate Analysis - UFG

Variable Selection

[Diagram: a population of chromosomes, where each gene encodes whether a variable is selected, is evolved into a new (better) population by three operations:]

• Delete chromosomes with poor fitness (selection)

• Create new chromosomes from pairs of good chromosomes (crossover)

• Change a few genes randomly (mutation)

Page 62: Multivariate Analysis - UFG

Variable Selection

– Crossover
  • Two chromosomes are cut at a random position and the parts are connected in a crossover scheme, resulting in two new chromosomes

Page 63: Multivariate Analysis - UFG

Variable Selection

• Cluster Analysis of Variables
  – Cluster analysis tries to identify homogeneous groups in the data
  – If it is applied to the correlation matrix of the regressor variables, one may obtain groups of variables that are strongly related, while variables in different groups have only weak correlation.
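A minimal R sketch (invented data): hierarchical clustering of the variables with 1 - |r| as the distance, so that strongly correlated variables end up in the same group:

set.seed(1)
X <- matrix(rnorm(100 * 4), 100, 4,
            dimnames = list(NULL, paste0("x", 1:4)))
X[, "x4"] <- X[, "x1"] + rnorm(100, sd = 0.2)  # x4 strongly related to x1
d  <- as.dist(1 - abs(cor(X)))                 # small distance = high |correlation|
hc <- hclust(d, method = "average")
cutree(hc, k = 2)                              # x1 and x4 fall into the same group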