Multivariate Analysis PCR and PLS
Prof. Dr. Anselmo E de Oliveira
anselmo.quimica.ufg.br
Multicollinearity
• Variable selection is one way to reduce the number of regressor variables and remove multicollinearity
The PCR Model
• PCR also solves the problem of data collinearity and reduces the number of regressor variables, but the regressor variables are no longer the original measured 𝑥-variables but linear combinations thereof
– The linear combinations are the principal component scores of the 𝑥-variables
– PCR is a combination of PCA and OLS
The PCR Model
• PCA decomposes a (centered) data matrix 𝑿 into scores 𝑻 and loadings 𝑷
• For a certain number 𝑎 of PCs, usually less than the rank of the data matrix, this decomposition is
𝑿 = 𝑻𝑷^𝑇 + 𝑬
with an error matrix 𝑬
• The score matrix 𝑻 contains the maximum amount of information of 𝑿 among all matrices that are orthogonal projections on 𝑎 linear combinations of the 𝑥-data
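As a sanity check, this decomposition can be reproduced in a few lines of base R. The data here are synthetic and the choice 𝑎 = 2 is only for illustration:

```r
# PCA decomposition X = T P^T + E of a small synthetic, column-centered matrix
set.seed(1)
X <- scale(matrix(rnorm(20 * 5), 20, 5), center = TRUE, scale = FALSE)

a   <- 2                       # number of PCs kept, a < rank(X)
pca <- prcomp(X, center = FALSE)
T_  <- pca$x[, 1:a]            # score matrix T (20 x a)
P   <- pca$rotation[, 1:a]     # loading matrix P (5 x a)
E   <- X - T_ %*% t(P)         # error matrix E

# keeping all PCs makes the reconstruction exact (E = 0)
max(abs(X - pca$x %*% t(pca$rotation)))
```

Note that the columns of 𝑻 are orthogonal, which is what makes the later regression step well behaved.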
The PCR Model
• In an MLR model
𝒚 = 𝑿𝒃 + 𝒆
we replace the matrix 𝑿 by the score matrix 𝑻 and thus include the major information of the 𝑥-data for regression on 𝒚:
𝒚 = 𝑿𝒃 + 𝒆, with 𝑿 = 𝑻𝑷^𝑇
𝒚 = 𝑻𝑷^𝑇𝒃 + 𝒆_𝑇
𝒈 = 𝑷^𝑇𝒃
𝒚 = 𝑻𝒈 + 𝒆_𝑇
with the new regression coefficients 𝒈 and the error term 𝒆_𝑇
– The information of the highly correlated 𝑥-variables is compressed into a few score vectors that are uncorrelated: this solves the problem with data collinearity
– The complexity of the model can be optimized by the number of PCs used
The PCR Model
• OLS regression: 𝒈 = (𝑻^𝑇𝑻)^−1𝑻^𝑇𝒚, where 𝑻^𝑇𝑻 is a diagonal matrix because the score vectors are orthogonal
• The final coefficients for the original model: 𝒃_𝑃𝐶𝑅 = 𝑷𝒈
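The whole PCR chain (PCA scores, OLS on the scores, back-transformation 𝒃_𝑃𝐶𝑅 = 𝑷𝒈) can be sketched in base R on synthetic collinear data; all names here are illustrative and not from the pls package:

```r
# PCR sketch: regress y on PCA scores, then map g back to coefficients for x
set.seed(42)
n  <- 50
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)             # nearly collinear with x1
X  <- scale(cbind(x1, x2, x3 = rnorm(n)), scale = FALSE)
y  <- as.vector(X %*% c(1, 1, 0.5)) + rnorm(n, sd = 0.1)
y  <- y - mean(y)

a   <- 2                                   # model complexity: number of PCs
pca <- prcomp(X, center = FALSE)
T_  <- pca$x[, 1:a]                        # uncorrelated score vectors
P   <- pca$rotation[, 1:a]

g     <- solve(crossprod(T_), crossprod(T_, y))  # g = (T^T T)^-1 T^T y
b_pcr <- P %*% g                                 # b_PCR = P g

# crossprod(T_) is (numerically) diagonal, so the OLS step is well conditioned
```

Plain OLS on 𝑿 would be nearly singular here because of the x1/x2 collinearity; the two-score regression is not.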
PCA
[Figure: a sample 𝑥_𝐼 plotted in the original feature space (coordinates 𝑥_{𝐼,1}, 𝑥_{𝐼,2} on features 1 and 2, built from column vectors 𝒙_𝟏 and 𝒙_𝟐) and again in the rotated coordinate system of PC1 and PC2, where it has scores 𝑧_{𝐼,1}, 𝑧_{𝐼,2}]
PCA
[Figure: PC1 and PC2 drawn in the feature space; in this example only the PC1 direction is needed]
Lesson 2: Linear Regression
The PCR Model
Number of PCA Components
• The number of components has to be optimized for the best possible prediction of the 𝑦-variable (PCA itself only considers the total variance of 𝑿)
• Simple strategies
– Selection of the first PCA scores which cover a certain percentage of the total variance of 𝑿 (for instance, 99%)
– Selection of the PCA scores with maximum correlation to 𝒚
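The first strategy (cover, say, 99% of the total variance of 𝑿) is a one-liner in base R; the data below are synthetic and only illustrate the selection rule:

```r
# pick the smallest a whose first a PCs explain >= 99% of the total variance
set.seed(7)
X   <- scale(matrix(rnorm(30 * 6), 30, 6), scale = FALSE)
pca <- prcomp(X, center = FALSE)

expl_var <- pca$sdev^2 / sum(pca$sdev^2)   # fraction of total variance per PC
a        <- which(cumsum(expl_var) >= 0.99)[1]
a
```

The second strategy would instead rank the score vectors by their correlation with 𝒚 and keep the best-correlated ones.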
PLS
• PLS stands for partial least squares or, equivalently, projection to latent structures by means of partial least squares
• PLS is the most widely used method in chemometrics for multivariate calibration – Web of Science, 07/09/13
• PCR: 1,048
• PLS: 8,731
• Herman Wold, 1975
PLS
• Essentially, the model structures of PLS and PCR are the same
– The 𝑥-data are first transformed into a set of a few intermediate linear latent variables (components), and these new variables are used for regression (by OLS) with a dependent variable 𝑦
– PCR uses principal component scores (derived solely from 𝑿)
– PLS uses components that are related to 𝒚
• Maximum covariance between scores and 𝑦
– PLS and PCR are linear methods (although nonlinear versions exist)
PLS
• PLS is a powerful linear regression method
– Insensitive to collinear variables
– Handles a large number of variables
• The first PLS component is calculated as the latent variable which has maximum covariance between the scores and the modeled property 𝑦
• Next, the information (variance) of this component is removed from the 𝑥-data (peeling or deflation)
– This is a projection of the 𝑥-space onto a (hyper-)plane that is orthogonal to the direction of the found component
• From the residual matrix, the next PLS component is derived
• This procedure is continued until no improvement in modeling 𝑦 is achieved
• The number of PLS components defines the complexity of the model
– The optimum number of components is usually estimated by CV
• PLS2: PLS with a matrix 𝒀 instead of a vector 𝒚
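The loop described above (maximum-covariance component, then deflation) can be written as a bare-bones PLS1 in base R. This is a didactic NIPALS-style sketch, not the algorithm the pls package actually uses:

```r
# minimal PLS1: each weight vector w maximizes covariance of the score t = Xw
# with y, then the extracted component is deflated (peeled) from X
pls1 <- function(X, y, a) {
  X <- scale(X, scale = FALSE)
  y <- y - mean(y)
  W <- P <- matrix(0, ncol(X), a)
  q <- numeric(a)
  for (k in 1:a) {
    w <- crossprod(X, y)                 # direction of maximum covariance with y
    w <- w / sqrt(sum(w^2))
    t <- X %*% w                         # score vector of this component
    p <- crossprod(X, t) / sum(t^2)      # x-loading
    q[k] <- sum(t * y) / sum(t^2)        # inner OLS regression of y on t
    X <- X - t %*% t(p)                  # deflation: remove component from x-data
    y <- y - t * q[k]
    W[, k] <- w
    P[, k] <- p
  }
  W %*% solve(t(P) %*% W, q)             # coefficients for the centered x-variables
}
```

The returned coefficients apply to mean-centered 𝑥-data; in practice one would use plsr() from the pls package, which also handles centering and cross-validation.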
PLS
• Partial least squares (PLS) models are based on principal components of both the independent data 𝑿 and the dependent data 𝒀. The central idea is to calculate the principal component scores of the 𝑿 and 𝒀 data matrices and to set up a regression model between the scores (and not the original data).
• Thus the matrix 𝑿 is decomposed into a score matrix 𝑻 and a loadings matrix 𝑷′ plus an error matrix 𝑬, and the matrix 𝒀 is decomposed into 𝑼 and 𝑸 plus the error term 𝑭. These two equations are called the outer relations. The goal of the PLS algorithm is to minimize the norm of 𝑭 while keeping the correlation between 𝑿 and 𝒀 through the inner relation 𝑼 = 𝑩𝑻.
Statistics4u.com
PLS
• References
– H. Abdi, The University of Texas at Dallas
• Partial Least Squares (PLS) Regression
– StatSoft, PLS
– Bob Collins, LPAC group meeting, PLS
– ST02: Multivariate Data Analysis and Chemometrics
PCR and PLS with “R”
R-bloggers: Posts Tagged ‘ "R" Chemometrics ’, page 6
> library(ChemometricsWithR)
> data(gasoline)
> summary(gasoline$octane)
Min. 1st Qu. Median Mean 3rd Qu. Max.
83.40 85.88 87.75 87.18 88.45 89.60
> sd(gasoline$octane)
[1] 1.530078
> hist(gasoline$octane)
The gasoline data set contains the NIR spectra of 60 samples, acquired by diffuse reflectance from 900 to 1700 nm.
"R": Looking at the Data (Gasoline) – 001
> data(gasoline, package="pls")
> wavelengths<-seq(900, 1700,by=2)
> matplot(wavelengths, t(gasoline$NIR), type="l", lty=1, xlab="wavelengths (nm)", ylab="log(1/R)")
"R": Plotting the spectra (Gasoline) – 002
> gaspcr <- pcr(octane~NIR, ncomp = 10,data = gasoline, validation = "LOO")
> summary(gaspcr)
Data: X dimension: 60 401
Y dimension: 60 1
Fit method: svdpc
Number of components considered: 10
VALIDATION: RMSEP
Cross-validated using 60 leave-one-out segments.
(Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps 9 comps 10 comps
CV 1.543 1.447 1.474 1.255 0.2501 0.2503 0.2578 0.2646 0.2724 0.2474 0.2508
adjCV 1.543 1.446 1.474 1.255 0.2496 0.2500 0.2575 0.2643 0.2733 0.2471 0.2508
TRAINING: % variance explained
1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps 9 comps 10 comps
X 72.57 83.90 90.86 95.46 96.70 97.66 98.16 98.52 98.85 99.09
octane 18.99 19.62 46.50 97.69 97.78 97.79 97.79 97.79 98.33 98.38
adjCV is the bias-corrected RMSEP, which in the case of "LOO" is almost the same as the uncorrected RMSEP
"R": PCR Regression (Gasoline) – 003
One way to better decide the number of components to use is to plot the RMSEPs:
> plot(RMSEP(gaspcr), legendpos = "topright")
The plot suggests four components, giving an RMSEP of 0.250
Prediction plot:
> plot(gaspcr, ncomp = 4, asp = 1, line = TRUE)
> R2(gaspcr)
(Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps 9 comps 10 comps
-0.03419 0.09043 0.05573 0.31590 0.97284 0.97279 0.97113 0.96959 0.96777 0.97341 0.97267
> plot(R2(gaspcr),legendpos = "topright")
"R": PCR Regression (Gasoline) – 004
> explvar(gaspcr)
Comp 1 Comp 2 Comp 3 Comp 4 Comp 5 Comp 6 Comp 7 Comp 8 Comp 9 Comp 10
72.5651378 11.3380191 6.9542569 4.5998259 1.2402978 0.9668303 0.4940023 0.3625086 0.3321849 0.2322125
> plot(gaspcr,comps=1:4,plottype = c("scores"))
"R": PCR Regression (Gasoline) – 005
> plot(gaspcr,comps=1:2,plottype = c("scores"))
We can change "scores" to "loadings" and get the plot of the 4 loadings together:
> plot(gaspcr, comps=1:4, plottype = c("loadings"), legendpos = "topleft")
We can also plot the regression coefficient spectrum,
> coefplot(gaspcr,comps=1,legend = "topleft")
or see the values as numbers:
> coef(gaspcr,comp=1)
, , Comp 1
octane
900 nm -0.0341446314
902 nm -0.0327240249
904 nm -0.0350492088
906 nm -0.0395840447
908 nm -0.0415126609
910 nm -0.0449274757
912 nm -0.0434251293
914 nm -0.0451249879
916 nm -0.0416846176
918 nm -0.0385643706
920 nm -0.0375470475
922 nm -0.0365454999
924 nm -0.0358456375
...
> residuals(gaspcr)[,,"4 comps"]
1 2 3 4 5
0.03704204 0.33750933 0.22115505 -0.28487712 -0.68587280
6 7 8 9 10
0.01653984 -0.03306587 -0.18136291 -0.20117289 -0.17286853
11 12 13 14 15
0.46050412 0.41050510 -0.02757601 -0.10609027 0.02278801
16 17 18 19 20
-0.08819202 0.50414416 0.20199013 -0.26221683 0.26851495
21 22 23 24 25
0.16139618 -0.34945544 0.01459367 -0.26995777 -0.13343275
26 27 28 29 30
0.02590811 0.04002573 -0.12586186 -0.38973620 -0.13456302
31 32 33 34 35
-0.25159097 -0.06826080 0.07190096 0.18064040 0.11086376
36 37 38 39 40
0.07786016 -0.03073137 0.32008155 0.05643676 0.20914842
41 42 43 44 45
-0.14730593 -0.34578297 0.04821531 0.07854058 -0.05146090
46 47 48 49 50
-0.19527580 -0.19275490 -0.05877137 0.14775051 0.12837901
51 52 53 54 55
0.06664761 0.32048790 0.08680848 0.17854041 -0.11500208
56 57 58 59 60
0.27214465 -0.33164973 -0.25354120 0.38776329 0.02360415
Note: residuals(gaspcr) without indexing shows the residuals for all 10 components
> plot(residuals(gaspcr)[,,"4 comps"],xlab="sample",ylab="error")
> qqnorm(residuals(gaspcr)[,,"4 comps"])
> qqline(residuals(gaspcr)[,,"4 comps"])
We divide the whole Set into a Train Set and a Test Set
> gasTrain<-gasoline[1:50,]
> gasTest<-gasoline[51:60,]
Let's develop the PCR with the Train Set and LOO CV
> gaspcr1<-pcr(octane~NIR,ncomp=10,data=gasTrain,validation="LOO")
> summary(gaspcr1)
Data: X dimension: 50 401
Y dimension: 50 1
Fit method: svdpc
Number of components considered: 10
VALIDATION: RMSEP
Cross-validated using 50 leave-one-out segments.
(Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps 9 comps 10 comps
CV 1.545 1.472 1.483 0.2894 0.2522 0.2622 0.2681 0.2386 0.2328 0.2416 0.2423
adjCV 1.545 1.471 1.482 0.2879 0.2518 0.2618 0.2677 0.2373 0.2323 0.2411 0.2415
TRAINING: % variance explained
1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps 9 comps 10 comps
X 79.86 88.12 93.54 96.54 97.74 98.38 98.75 99.06 99.28 99.42
octane 16.99 21.36 97.00 97.71 97.73 97.77 98.47 98.54 98.62 98.83
For this exercise we choose 4 components
Let's predict our Test Set with this 4-component model
> predict(gaspcr1,ncomp=4,newdata=gasTest)
, , 4 comps
octane
51 88.07381
52 87.36530
53 88.30914
54 85.00247
55 85.33157
56 84.59513
57 87.56126
58 86.90745
59 89.21833
60 87.08905
> predplot(gaspcr1,ncomp=4,newdata=gasTest,asp=1,line=TRUE)
Let's look at the RMSEP statistic. This is a very nice tool to decide whether 4 components is fine or whether we should choose more or fewer components
> RMSEP(gaspcr1,newdata=gasTest)
(Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps 9 comps 10 comps
1.5369 1.3226 1.2568 0.4634 0.2241 0.2283 0.2600 0.2795 0.2434 0.2290 0.2881
The CV for the Model with 4 components was 0.252
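For reference, the test-set RMSEP reported by the pls package is just the root mean squared prediction error. A hand-rolled check (with made-up numbers, not the actual gasoline values) looks like this:

```r
# RMSEP = sqrt(mean((predicted - observed)^2))
rmsep <- function(pred, obs) sqrt(mean((pred - obs)^2))

obs  <- c(88.1, 87.4, 88.3, 85.0, 85.3)   # illustrative octane values only
pred <- c(88.0, 87.5, 88.2, 85.2, 85.1)
rmsep(pred, obs)
```

Comparing this quantity across component counts on the independent Test Set is exactly what RMSEP(gaspcr1, newdata=gasTest) does above.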
RMSEP
PCR:
> gaspcr <- pcr(octane~NIR, ncomp = 10, data = gasoline, validation = "LOO")
PLSR:
> gasplsr <- plsr(octane~NIR, ncomp = 10, data = gasoline, validation = "LOO")
...
• References
– An Introduction to R
– Multivariate Statistical Analysis using the R package chemometrics
– R-bloggers: Posts Tagged ‘ "R" Chemometrics ’
– R Tutorial: an R Introduction to Statistics