Multivariate Analysis PCR and PLS
Prof. Dr. Anselmo E de Oliveira
anselmo.quimica.ufg.br
Multicollinearity
• Variable selection is one way to reduce the number of regressor variables and remove multicollinearity
The PCR Model
• PCR also solves the problem of data collinearity and reduces the number of regressor variables, but the regressor variables are no longer the original measured 𝑥-variables but linear combinations thereof
– The linear combinations are the principal component scores of the 𝑥-variables
– PCR is a combination of PCA and OLS
The PCR Model
• PCA decomposes a (centered) data matrix 𝑿 into scores 𝑻 and loadings 𝑷
• For a certain number 𝑎 of PCs, usually less than the rank of the data matrix, this decomposition is
𝑿 = 𝑻𝑷^𝑇 + 𝑬
with an error matrix 𝑬
• The score matrix 𝑻 contains the maximum amount of information of 𝑿 among all matrices that are orthogonal projections on 𝑎 linear combinations of the 𝑥-data
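As a sanity check, this decomposition can be reproduced in a few lines of base R. The data here are synthetic and the choice 𝑎 = 2 is only for illustration:

```r
# PCA decomposition X = T P^T + E of a small synthetic, column-centered matrix
set.seed(1)
X <- scale(matrix(rnorm(20 * 5), 20, 5), center = TRUE, scale = FALSE)

a   <- 2                       # number of PCs kept, a < rank(X)
pca <- prcomp(X, center = FALSE)
T_  <- pca$x[, 1:a]            # score matrix T (20 x a)
P   <- pca$rotation[, 1:a]     # loading matrix P (5 x a)
E   <- X - T_ %*% t(P)         # error matrix E

# keeping all PCs makes the reconstruction exact (E = 0)
max(abs(X - pca$x %*% t(pca$rotation)))
```

Note that the columns of 𝑻 are orthogonal, which is what makes the later regression step well behaved.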
The PCR Model
• In an MLR model
𝒚 = 𝑿𝒃 + 𝒆
we replace the matrix 𝑿 by the score matrix 𝑻 and thus include the major information of the 𝑥-data for regression on 𝒚:
𝒚 = 𝑿𝒃 + 𝒆, with 𝑿 = 𝑻𝑷^𝑇
𝒚 = 𝑻𝑷^𝑇𝒃 + 𝒆_𝑇
𝒈 = 𝑷^𝑇𝒃
𝒚 = 𝑻𝒈 + 𝒆_𝑇
with the new regression coefficients 𝒈 and the error term 𝒆_𝑇
– The information of the highly correlated 𝑥-variables is compressed into a few score vectors that are uncorrelated: this solves the problem with data collinearity
– The complexity of the model can be optimized by the number of PCs used
The PCR Model
• OLS regression: 𝒈 = (𝑻^𝑇𝑻)^−1𝑻^𝑇𝒚, where 𝑻^𝑇𝑻 is a diagonal matrix because the score vectors are orthogonal
• The final coefficients for the original model: 𝒃_𝑃𝐶𝑅 = 𝑷𝒈
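The whole PCR chain (PCA scores, OLS on the scores, back-transformation 𝒃_𝑃𝐶𝑅 = 𝑷𝒈) can be sketched in base R on synthetic collinear data; all names here are illustrative and not from the pls package:

```r
# PCR sketch: regress y on PCA scores, then map g back to coefficients for x
set.seed(42)
n  <- 50
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)             # nearly collinear with x1
X  <- scale(cbind(x1, x2, x3 = rnorm(n)), scale = FALSE)
y  <- as.vector(X %*% c(1, 1, 0.5)) + rnorm(n, sd = 0.1)
y  <- y - mean(y)

a   <- 2                                   # model complexity: number of PCs
pca <- prcomp(X, center = FALSE)
T_  <- pca$x[, 1:a]                        # uncorrelated score vectors
P   <- pca$rotation[, 1:a]

g     <- solve(crossprod(T_), crossprod(T_, y))  # g = (T^T T)^-1 T^T y
b_pcr <- P %*% g                                 # b_PCR = P g

# crossprod(T_) is (numerically) diagonal, so the OLS step is well conditioned
```

Plain OLS on 𝑿 would be nearly singular here because of the x1/x2 collinearity; the two-score regression is not.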
PCA
[Figure: a sample 𝑥_𝐼 plotted in the original feature space (coordinates 𝑥_{𝐼,1}, 𝑥_{𝐼,2} on features 1 and 2, built from column vectors 𝒙_𝟏 and 𝒙_𝟐) and again in the rotated coordinate system of PC1 and PC2, where it has scores 𝑧_{𝐼,1}, 𝑧_{𝐼,2}]
PCA
[Figure: PC1 and PC2 drawn in the feature space; in this example only the PC1 direction is needed]
Lesson 2: Linear Regression
The PCR Model
Number of PCA Components
• The number of components has to be optimized for the best possible prediction of the 𝑦-variable (PCA itself only considers the total variance of 𝑿)
• Simple strategies
– Selection of the first PCA scores which cover a certain percentage of the total variance of 𝑿 (for instance, 99%)
– Selection of the PCA scores with maximum correlation to 𝒚
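The first strategy (cover, say, 99% of the total variance of 𝑿) is a one-liner in base R; the data below are synthetic and only illustrate the selection rule:

```r
# pick the smallest a whose first a PCs explain >= 99% of the total variance
set.seed(7)
X   <- scale(matrix(rnorm(30 * 6), 30, 6), scale = FALSE)
pca <- prcomp(X, center = FALSE)

expl_var <- pca$sdev^2 / sum(pca$sdev^2)   # fraction of total variance per PC
a        <- which(cumsum(expl_var) >= 0.99)[1]
a
```

The second strategy would instead rank the score vectors by their correlation with 𝒚 and keep the best-correlated ones.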
PLS
• PLS stands for partial least squares or, equivalently, projection to latent structures by means of partial least squares
• PLS is the most widely used method in chemometrics for multivariate calibration – Web of Science, 07/09/13
• PCR: 1,048
• PLS: 8,731
• Herman Wold, 1975
PLS
• Essentially, the model structures of PLS and PCR are the same
– The 𝑥-data are first transformed into a set of a few intermediate linear latent variables (components), and these new variables are used for regression (by OLS) with a dependent variable 𝑦
– PCR uses principal component scores (derived solely from 𝑿)
– PLS uses components that are related to 𝒚
• Maximum covariance between scores and 𝑦
– PLS and PCR are linear methods (although nonlinear versions exist)
PLS
• PLS is a powerful linear regression method
– Insensitive to collinear variables
– Handles a large number of variables
• The first PLS component is calculated as the latent variable which has maximum covariance between the scores and the modeled property 𝑦
• Next, the information (variance) of this component is removed from the 𝑥-data (peeling or deflation)
– This is a projection of the 𝑥-space onto a (hyper-)plane that is orthogonal to the direction of the found component
• From the residual matrix, the next PLS component is derived
• This procedure is continued until no improvement in modeling 𝑦 is achieved
• The number of PLS components defines the complexity of the model
– The optimum number of components is usually estimated by CV
• PLS2: PLS with a matrix 𝒀 instead of a vector 𝒚
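The loop described above (maximum-covariance component, then deflation) can be written as a bare-bones PLS1 in base R. This is a didactic NIPALS-style sketch, not the algorithm the pls package actually uses:

```r
# minimal PLS1: each weight vector w maximizes covariance of the score t = Xw
# with y, then the extracted component is deflated (peeled) from X
pls1 <- function(X, y, a) {
  X <- scale(X, scale = FALSE)
  y <- y - mean(y)
  W <- P <- matrix(0, ncol(X), a)
  q <- numeric(a)
  for (k in 1:a) {
    w <- crossprod(X, y)                 # direction of maximum covariance with y
    w <- w / sqrt(sum(w^2))
    t <- X %*% w                         # score vector of this component
    p <- crossprod(X, t) / sum(t^2)      # x-loading
    q[k] <- sum(t * y) / sum(t^2)        # inner OLS regression of y on t
    X <- X - t %*% t(p)                  # deflation: remove component from x-data
    y <- y - t * q[k]
    W[, k] <- w
    P[, k] <- p
  }
  W %*% solve(t(P) %*% W, q)             # coefficients for the centered x-variables
}
```

The returned coefficients apply to mean-centered 𝑥-data; in practice one would use plsr() from the pls package, which also handles centering and cross-validation.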
PLS
• Partial least squares (PLS) models are based on principal components of both the independent data 𝑿 and the dependent data 𝒀. The central idea is to calculate the principal component scores of the 𝑿 and 𝒀 data matrices and to set up a regression model between the scores (and not the original data).
• Thus the matrix 𝑿 is decomposed into a score matrix 𝑻 and a loadings matrix 𝑷′ plus an error matrix 𝑬, and the matrix 𝒀 is decomposed into 𝑼 and 𝑸 plus the error term 𝑭. These two equations are called the outer relations. The goal of the PLS algorithm is to minimize the norm of 𝑭 while keeping the correlation between 𝑿 and 𝒀 through the inner relation 𝑼 = 𝑩𝑻.
Statistics4u.com
PLS
• References
– H. Abdi, The University of Texas at Dallas
• Partial Least Squares (PLS) Regression
– StatSoft, PLS
– Bob Collins, LPAC group meeting, PLS
– ST02: Multivariate Data Analysis and Chemometrics
PCR and PLS with “R”
R-bloggers: Posts Tagged ‘ "R" Chemometrics ’, page 6
> library(ChemometricsWithR)
> data(gasoline)
> summary(gasoline$octane)
Min. 1st Qu. Median Mean 3rd Qu. Max.
83.40 85.88 87.75 87.18 88.45 89.60
> sd(gasoline$octane)
[1] 1.530078
> hist(gasoline$octane)
The gasoline data set contains the NIR spectra of 60 samples, acquired by diffuse reflectance from 900 to 1700 nm.
"R": Looking at the Data (Gasoline) – 001
> data(gasoline, package="pls")
> wavelengths<-seq(900, 1700,by=2)
> matplot(wavelengths, t(gasoline$NIR), type="l", lty=1, xlab="wavelengths (nm)", ylab="log(1/R)")
"R": Plotting the spectra (Gasoline) – 002
> gaspcr <- pcr(octane~NIR, ncomp = 10,data = gasoline, validation = "LOO")
> summary(gaspcr)
Data: X dimension: 60 401
Y dimension: 60 1
Fit method: svdpc
Number of components considered: 10
VALIDATION: RMSEP
Cross-validated using 60 leave-one-out segments.
(Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps 9 comps 10 comps
CV 1.543 1.447 1.474 1.255 0.2501 0.2503 0.2578 0.2646 0.2724 0.2474 0.2508
adjCV 1.543 1.446 1.474 1.255 0.2496 0.2500 0.2575 0.2643 0.2733 0.2471 0.2508
TRAINING: % variance explained
1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps 9 comps 10 comps
X 72.57 83.90 90.86 95.46 96.70 97.66 98.16 98.52 98.85 99.09
octane 18.99 19.62 46.50 97.69 97.78 97.79 97.79 97.79 98.33 98.38
adjCV is the bias-corrected RMSEP, which in the case of "LOO" is almost the same as the uncorrected RMSEP
"R": PCR Regression (Gasoline) – 003
One way to better decide the number of components to use is to plot the RMSEPs:
> plot(RMSEP(gaspcr), legendpos = "topright")
The plot suggests four components, giving an RMSEP of 0.250
Prediction plot:
> plot(gaspcr, ncomp = 4, asp = 1, line = TRUE)
> R2(gaspcr)
(Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps 9 comps 10 comps
-0.03419 0.09043 0.05573 0.31590 0.97284 0.97279 0.97113 0.96959 0.96777 0.97341 0.97267
> plot(R2(gaspcr),legendpos = "topright")
"R": PCR Regression (Gasoline) – 004
> explvar(gaspcr)
Comp 1 Comp 2 Comp 3 Comp 4 Comp 5 Comp 6 Comp 7 Comp 8 Comp 9 Comp 10
72.5651378 11.3380191 6.9542569 4.5998259 1.2402978 0.9668303 0.4940023 0.3625086 0.3321849 0.2322125
> plot(gaspcr,comps=1:4,plottype = c("scores"))
"R": PCR Regression (Gasoline) – 005
> plot(gaspcr,comps=1:2,plottype = c("scores"))
We can change "scores" to "loadings" and get the plot of the 4 loadings together:
> plot(gaspcr, comps=1:4, plottype = c("loadings"), legendpos = "topleft")
We can also plot the regression coefficient spectrum,
> coefplot(gaspcr,comps=1,legend = "topleft")
or see the values as numbers:
> coef(gaspcr,comp=1)
, , Comp 1
octane
900 nm -0.0341446314
902 nm -0.0327240249
904 nm -0.0350492088
906 nm -0.0395840447
908 nm -0.0415126609
910 nm -0.0449274757
912 nm -0.0434251293
914 nm -0.0451249879
916 nm -0.0416846176
918 nm -0.0385643706
920 nm -0.0375470475
922 nm -0.0365454999
924 nm -0.0358456375
...
> residuals(gaspcr)[,,"4 comps"]
1 2 3 4 5
0.03704204 0.33750933 0.22115505 -0.28487712 -0.68587280
6 7 8 9 10
0.01653984 -0.03306587 -0.18136291 -0.20117289 -0.17286853
11 12 13 14 15
0.46050412 0.41050510 -0.02757601 -0.10609027 0.02278801
16 17 18 19 20
-0.08819202 0.50414416 0.20199013 -0.26221683 0.26851495
21 22 23 24 25
0.16139618 -0.34945544 0.01459367 -0.26995777 -0.13343275
26 27 28 29 30
0.02590811 0.04002573 -0.12586186 -0.38973620 -0.13456302
31 32 33 34 35
-0.25159097 -0.06826080 0.07190096 0.18064040 0.11086376
36 37 38 39 40
0.07786016 -0.03073137 0.32008155 0.05643676 0.20914842
41 42 43 44 45
-0.14730593 -0.34578297 0.04821531 0.07854058 -0.05146090
46 47 48 49 50
-0.19527580 -0.19275490 -0.05877137 0.14775051 0.12837901
51 52 53 54 55
0.06664761 0.32048790 0.08680848 0.17854041 -0.11500208
56 57 58 59 60
0.27214465 -0.33164973 -0.25354120 0.38776329 0.02360415
Note: residuals(gaspcr) without indexing shows the residuals for all 10 components
> plot(residuals(gaspcr)[,,"4 comps"],xlab="sample",ylab="error")
> qqnorm(residuals(gaspcr)[,,"4 comps"])
> qqline(residuals(gaspcr)[,,"4 comps"])
We divide the whole Set into a Train Set and a Test Set
> gasTrain<-gasoline[1:50,]
> gasTest<-gasoline[51:60,]
Let's develop the PCR with the Train Set and LOO CV
> gaspcr1<-pcr(octane~NIR,ncomp=10,data=gasTrain,validation="LOO")
> summary(gaspcr1)
Data: X dimension: 50 401
Y dimension: 50 1
Fit method: svdpc
Number of components considered: 10
VALIDATION: RMSEP
Cross-validated using 50 leave-one-out segments.
(Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps 9 comps 10 comps
CV 1.545 1.472 1.483 0.2894 0.2522 0.2622 0.2681 0.2386 0.2328 0.2416 0.2423
adjCV 1.545 1.471 1.482 0.2879 0.2518 0.2618 0.2677 0.2373 0.2323 0.2411 0.2415
TRAINING: % variance explained
1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps 9 comps 10 comps
X 79.86 88.12 93.54 96.54 97.74 98.38 98.75 99.06 99.28 99.42
octane 16.99 21.36 97.00 97.71 97.73 97.77 98.47 98.54 98.62 98.83
For this exercise we choose 4 components
Let's predict our Test Set with this 4-component model
> predict(gaspcr1,ncomp=4,newdata=gasTest)
, , 4 comps
octane
51 88.07381
52 87.36530
53 88.30914
54 85.00247
55 85.33157
56 84.59513
57 87.56126
58 86.90745
59 89.21833
60 87.08905
> predplot(gaspcr1,ncomp=4,newdata=gasTest,asp=1,line=TRUE)
Let's look at the RMSEP statistic. This is a very nice tool to decide whether 4 components is fine or whether we should choose more or fewer components
> RMSEP(gaspcr1,newdata=gasTest)
(Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps 9 comps 10 comps
1.5369 1.3226 1.2568 0.4634 0.2241 0.2283 0.2600 0.2795 0.2434 0.2290 0.2881
The CV for the Model with 4 components was 0.252
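For reference, the test-set RMSEP reported by the pls package is just the root mean squared prediction error. A hand-rolled check (with made-up numbers, not the actual gasoline values) looks like this:

```r
# RMSEP = sqrt(mean((predicted - observed)^2))
rmsep <- function(pred, obs) sqrt(mean((pred - obs)^2))

obs  <- c(88.1, 87.4, 88.3, 85.0, 85.3)   # illustrative octane values only
pred <- c(88.0, 87.5, 88.2, 85.2, 85.1)
rmsep(pred, obs)
```

Comparing this quantity across component counts on the independent Test Set is exactly what RMSEP(gaspcr1, newdata=gasTest) does above.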
RMSEP
PCR:
> gaspcr <- pcr(octane~NIR, ncomp = 10, data = gasoline, validation = "LOO")
PLSR:
> gasplsr <- plsr(octane~NIR, ncomp = 10, data = gasoline, validation = "LOO")
...
• References
– An Introduction to R
– Multivariate Statistical Analysis using the R package chemometrics
– R-bloggers: Posts Tagged ‘ "R" Chemometrics ’
– R Tutorial: an R Introduction to Statistics