![Page 1: PCA Example The data set “Lakes” consists of five year average of water quality parameters measurements at 48 lakes in Texas for the period 1975-2010](https://reader036.vdocument.in/reader036/viewer/2022072008/56649d6e5503460f94a4fced/html5/thumbnails/1.jpg)
PCA Example
The data set “Lakes” consists of five-year averages of water quality parameter measurements at 48 lakes in Texas for the period 1975-2010. Several lakes have golden algae bloom records during this period. Are differences in water quality parameters driving the golden algae blooms in these lakes? Do the water quality parameters differ from one time period to another?
R data “Lakes”
Variables:
Name – name of the lake
Bloom – presence or absence of golden algae blooms
Year – the first year of the five-year period
Temp – water temperature, degrees Celsius
SpCond – specific conductance, microsiemens per centimeter
DO – dissolved oxygen, mg/L
pH – water pH
Chloride – chloride concentration, mg/L
Sulfate – sulfate concentration, mg/L
Read the data:
> Lakes=read.csv("E:/Multivariate_analysis/Data/Lakes.csv",header=TRUE)
Remove the first three columns of the data and keep only the water quality (WQ) parameters:
> Lk=Lakes[,-c(1:3)]
Calculate the variance for each WQ parameter:
> round(sapply(Lk,var),2)
      Temp     SpCond         DO         pH   Chloride    Sulfate
      9.98 2395416.94       0.83       0.14  162220.42   55044.98
Normalize the data:
> NLk=scale(Lk)
Calculate the correlation matrix of the normalized data:
> round(cor(NLk),2)
          Temp SpCond    DO    pH Chloride Sulfate
Temp      1.00  -0.02 -0.57 -0.20     0.00    0.01
SpCond   -0.02   1.00 -0.07  0.29     0.85    0.96
DO       -0.57  -0.07  1.00  0.35    -0.09   -0.10
pH       -0.20   0.29  0.35  1.00     0.21    0.26
Chloride  0.00   0.85 -0.09  0.21     1.00    0.81
Sulfate   0.01   0.96 -0.10  0.26     0.81    1.00
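The raw variances span several orders of magnitude (SpCond is in the millions, pH is 0.14), which is why the data are standardized before the PCA. The following is a minimal sketch on simulated data, not the Lakes data: the data frame `toy` and its values are hypothetical. It illustrates that `scale()` gives every column unit variance and that the correlation matrix is the same before and after scaling.

```r
# Hypothetical stand-in for a few water quality columns; values are simulated.
set.seed(1)
toy <- data.frame(
  Temp   = rnorm(50, mean = 20, sd = 3),
  SpCond = rnorm(50, mean = 2000, sd = 1500),
  pH     = rnorm(50, mean = 8, sd = 0.4)
)

# Raw variances differ by orders of magnitude.
raw_var <- sapply(toy, var)

# scale() centers each column and divides it by its standard deviation.
toy_std <- scale(toy)
std_var <- apply(toy_std, 2, var)

# Correlation does not depend on the units, so both matrices agree.
same_cor <- isTRUE(all.equal(cor(toy), cor(toy_std)))

print(round(raw_var, 2))
print(std_var)
print(same_cor)
```

Because correlation is scale-invariant, `cor(Lk)` and `cor(NLk)` should give the same matrix; the scaling step matters for methods that work on the covariance matrix.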
Calculate the eigenvectors and eigenvalues of the correlation matrix:
> eigen(cor(NLk))
$values
[1] 2.86256674 1.75834470 0.75470254 0.38378574 0.20896051 0.03163976

$vectors
            [,1]        [,2]        [,3]        [,4]         [,5]         [,6]
[1,] -0.01709341  0.60859261  0.53586341 -0.58462716  0.002062515  0.019495627
[2,]  0.57759456  0.03011448 -0.08663364 -0.03882989  0.304075712  0.751000965
[3,] -0.03239882 -0.66764588 -0.05186583 -0.74159388  0.023311888 -0.002075523
[4,]  0.22462025 -0.41945689  0.82114464  0.30853194 -0.060881157 -0.020607442
[5,]  0.54026408  0.06014976 -0.14527739 -0.09287685 -0.814058298 -0.109882586
[6,]  0.56806964  0.05826710 -0.08526964 -0.05407189  0.490502635 -0.650472377
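The eigenvalues translate directly into the share of variance carried by each component. A short sketch using only the six eigenvalues printed in this example (the variable names `ev`, `prop_var`, and `cum_var` are illustrative):

```r
# Eigenvalues copied from the eigen() output in this example.
ev <- c(2.86256674, 1.75834470, 0.75470254,
        0.38378574, 0.20896051, 0.03163976)

# For a correlation matrix the eigenvalues sum to the number of variables (6).
total <- sum(ev)

# Share of the total variance carried by each component, and the running total.
prop_var <- ev / total
cum_var  <- cumsum(prop_var)

print(round(total, 4))
print(round(prop_var, 4))
print(round(cum_var, 4))
```

The first three proportions computed this way agree with the "Proportion of Variance" row that `summary()` reports for Comp.1 to Comp.3, so the first three components together carry roughly 90% of the variance.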
Extract the principal components from the correlation matrix:
> Lakes_PCA=princomp(NLk,corr=TRUE)
> summary(Lakes_PCA,loadings=TRUE)
Importance of components:
                          Comp.1    Comp.2    Comp.3
Standard deviation     1.6870433 1.3222100 0.8662362
Proportion of Variance 0.4770945 0.2930575 0.1257838
Cumulative Proportion  0.4770945 0.7701519 0.8959357

Loadings:
         Comp.1 Comp.2 Comp.3
Temp             0.609  0.536
SpCond    0.578
DO              -0.668
pH        0.225 -0.419  0.821
Chloride  0.540        -0.145
Sulfate   0.568
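One detail worth checking: the documented argument of `princomp` is `cor`, so `corr=TRUE` is most likely passed through `...` and ignored. Because `NLk` is already standardized, the components themselves are unaffected, but the standard deviations then come from a covariance matrix with divisor n rather than n-1. The sketch below assumes n = 174 rows (the last row index shown in the scores listing of this example) and uses a simulated matrix `x`, which is hypothetical, for the second half.

```r
# Part 1: reproduce the reported standard deviations from the eigenvalues.
ev <- c(2.86256674, 1.75834470, 0.75470254)  # first three eigenvalues of cor(NLk)
n  <- 174                                    # assumed number of rows
sdev_n <- sqrt(ev * (n - 1) / n)
print(round(sdev_n, 7))

# Part 2: the same relation on simulated data.
set.seed(2)
x <- matrix(rnorm(200), ncol = 4)
m <- nrow(x)
ev_x <- eigen(cor(x))$values

# PCA on pre-scaled data: variances are eigenvalues times (m - 1) / m.
sd_scaled <- unname(princomp(scale(x))$sdev)

# PCA with the documented cor = TRUE argument: variances are the eigenvalues.
sd_cor <- unname(princomp(x, cor = TRUE)$sdev)

print(all.equal(sd_scaled, sqrt(ev_x * (m - 1) / m)))
print(all.equal(sd_cor, sqrt(ev_x)))
```

The proportions of variance are the same either way, since the factor (n - 1) / n cancels in the ratio.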
Plot the variance of each principal component:
[Figure: scree plot titled “Lakes”; x-axis: Comp.1 through Comp.6; y-axis: Variances (ticks from 0.0 to 2.5)]
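The slides do not show the call that produced this plot. One way to draw such a plot, as a sketch on simulated data (the data frame `toy` is hypothetical), is `screeplot()` from the stats package, which is also what `plot()` calls for a `princomp` object:

```r
# Simulated stand-in data with one strongly correlated pair of columns.
set.seed(3)
toy <- data.frame(Temp = rnorm(60), SpCond = rnorm(60), DO = rnorm(60))
toy$Sulfate <- toy$SpCond + rnorm(60, sd = 0.3)

# PCA on the correlation matrix.
pca <- princomp(toy, cor = TRUE)

# Bar plot of the component variances (the squared standard deviations).
screeplot(pca, main = "Lakes")

# The plotted heights are these values.
comp_var <- pca$sdev^2
print(round(comp_var, 3))
```

A common reading of the plot is to keep the components before the bars level off.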
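Write the equations of the first three principal components, using the loadings reported by summary() and applied to the standardized variables:

$$
\begin{aligned}
y_1 &= 0.578\,\mathrm{SpCond} + 0.225\,\mathrm{pH} + 0.540\,\mathrm{Chloride} + 0.568\,\mathrm{Sulfate} \\
y_2 &= 0.609\,\mathrm{Temp} - 0.668\,\mathrm{DO} - 0.419\,\mathrm{pH} \\
y_3 &= 0.536\,\mathrm{Temp} + 0.821\,\mathrm{pH} - 0.145\,\mathrm{Chloride}
\end{aligned}
$$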
SpCond, Chloride, and Sulfate have important loadings on the first principal axis, Temp, DO, and pH contribute significantly to the second principal axis, and Temp, pH, and Chloride are important loadings on the third principal axis.
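Each principal component is a weighted sum of the standardized variables, with the loadings as weights. The sketch below checks this on simulated data (the data frame `toy` is hypothetical, and `cor = TRUE` is used here as an assumption rather than the call from the slides): multiplying the standardized data by the loading matrix reproduces the scores that `princomp` stores.

```r
# Simulated stand-in data.
set.seed(4)
toy <- data.frame(
  Temp = rnorm(40, mean = 20, sd = 3),
  DO   = rnorm(40, mean = 8,  sd = 1),
  pH   = rnorm(40, mean = 8,  sd = 0.4)
)

pca <- princomp(toy, cor = TRUE)

# Standardize with the center and scale that princomp itself used.
z <- scale(as.matrix(toy), center = pca$center, scale = pca$scale)

# Scores are the standardized data times the loading matrix.
manual <- z %*% unclass(pca$loadings)

# Largest absolute difference from the stored scores (numerically zero).
max_diff <- max(abs(manual - pca$scores))
print(max_diff)
```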
Calculate the scores for each principal axis for the PCA diagram:
> Lakes_PCA$scores
             Comp.1       Comp.2        Comp.3       Comp.4        Comp.5        Comp.6
  [1,] -0.840940601 -0.129405032  0.3724851406  0.282009798 -0.0159568183 -0.0245294336
  [2,] -1.129223226 -0.998041022 -1.6138622935 -0.108292698  0.1506844836 -0.0994239039
  [3,] -0.803185195  2.824527084  1.7646947373  0.388294337 -0.0344049191  0.0068578657
  [4,] -0.800500726  0.178911098  0.2942787158  0.185687561 -0.0006769275  0.0366685692
  [5,] -0.984180111 -0.638164386 -0.2893359063 -0.977367586  0.0675853680  0.0448307344
  [6,] -0.726931303  1.823618015  1.8948715840 -0.672342526 -0.0255919528 -0.0070631567
  [7,] -0.768218704 -1.767306989 -0.0040369230 -0.037962664 -0.0485309836 -0.0275789371
  …
[174,]  1.629960036 -0.579185868  0.3245963801 -0.470741904 -0.0852283736  0.7779341323
Make a PC1 vs PC2 diagram showing each year with a different color:
> year1=which(Lakes[,3]==sort(unique(Lakes[,3]))[1])
> year2=which(Lakes[,3]==sort(unique(Lakes[,3]))[2])
> year3=which(Lakes[,3]==sort(unique(Lakes[,3]))[3])
> year4=which(Lakes[,3]==sort(unique(Lakes[,3]))[4])
> year5=which(Lakes[,3]==sort(unique(Lakes[,3]))[5])
> year6=which(Lakes[,3]==sort(unique(Lakes[,3]))[6])
> year7=which(Lakes[,3]==sort(unique(Lakes[,3]))[7])
> plot(Lakes_PCA$scores[year1,1],Lakes_PCA$scores[year1,2],xlab="PC1",ylab="PC2",pch=15,xlim=range(Lakes_PCA$scores[,1])*c(0.98,1.3),ylim=range(Lakes_PCA$scores[,2]))
> points(Lakes_PCA$scores[year2,1],Lakes_PCA$scores[year2,2],pch=15,col="red")
> points(Lakes_PCA$scores[year3,1],Lakes_PCA$scores[year3,2],pch=15,col="blue")
> points(Lakes_PCA$scores[year4,1],Lakes_PCA$scores[year4,2],pch=15,col="green")
> points(Lakes_PCA$scores[year5,1],Lakes_PCA$scores[year5,2],pch=15,col="pink")
> points(Lakes_PCA$scores[year6,1],Lakes_PCA$scores[year6,2],pch=15,col="yellow")
> points(Lakes_PCA$scores[year7,1],Lakes_PCA$scores[year7,2],pch=15,col="brown")
> legend(11,2,legend=as.character(sort(unique(Lakes[,3]))),bty="n",pch=15,col=c("black","red","blue","green","pink","yellow","brown"))
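The per-year `which()` and `points()` calls can also be collapsed into a single `plot()` call by indexing a color vector with the year treated as a factor. A minimal sketch on simulated scores follows; the objects `years`, `scores`, and `cols` are hypothetical stand-ins for `Lakes[,3]`, `Lakes_PCA$scores`, and the color list used in the slides.

```r
# Simulated stand-ins: 10 observations for each of 7 five-year periods.
set.seed(5)
years  <- rep(seq(1975, 2005, by = 5), each = 10)
scores <- cbind(PC1 = rnorm(70), PC2 = rnorm(70))

# One color per period, in the same order as the sorted years.
year_f <- factor(years)
cols   <- c("black", "red", "blue", "green", "pink", "yellow", "brown")

# as.integer() on a factor gives the level index, which picks the color.
point_col <- cols[as.integer(year_f)]

plot(scores[, 1], scores[, 2], xlab = "PC1", ylab = "PC2",
     pch = 15, col = point_col)
legend("topright", legend = levels(year_f), bty = "n", pch = 15, col = cols)
```

This keeps the legend and the point colors tied to the same factor levels, so they cannot drift apart if a period is added or removed.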
PC1 vs PC2 diagram:
[Figure: scatter plot of PC2 (y-axis) against PC1 (x-axis), one color per five-year period; legend: 1975, 1980, 1985, 1990, 1995, 2000, 2005]
Several lakes have different water quality in the periods starting in 1975, 1980, and 1985 (the isolated black, red, and blue points, respectively).
Make a PC1 vs PC3 diagram showing each year with a different color:
> plot(Lakes_PCA$scores[year1,1],Lakes_PCA$scores[year1,3],xlab="PC1",ylab="PC3",pch=15,xlim=range(Lakes_PCA$scores[,1])*c(0.98,1.3),ylim=range(Lakes_PCA$scores[,3]))
> points(Lakes_PCA$scores[year2,1],Lakes_PCA$scores[year2,3],pch=15,col="red")
> points(Lakes_PCA$scores[year3,1],Lakes_PCA$scores[year3,3],pch=15,col="blue")
> points(Lakes_PCA$scores[year4,1],Lakes_PCA$scores[year4,3],pch=15,col="green")
> points(Lakes_PCA$scores[year5,1],Lakes_PCA$scores[year5,3],pch=15,col="pink")
> points(Lakes_PCA$scores[year6,1],Lakes_PCA$scores[year6,3],pch=15,col="yellow")
> points(Lakes_PCA$scores[year7,1],Lakes_PCA$scores[year7,3],pch=15,col="brown")
> legend("topright",legend=as.character(sort(unique(Lakes[,3]))),bty="n",pch=15,col=c("black","red","blue","green","pink","yellow","brown"))
PC1 vs PC3 diagram:
[Figure: scatter plot of PC3 (y-axis) against PC1 (x-axis), one color per five-year period; legend: 1975, 1980, 1985, 1990, 1995, 2000, 2005]
The five-year period starting in 1985 shows different water quality in several lakes (blue dots). A few lakes show differences in 1975 and 1980 compared to the rest of the group.
Make a PC1 vs PC2 diagram showing lakes with algae bloom records in blue:
> algae=which(Lakes[,2]=="Algae")
> noalgae=which(Lakes[,2]=="NoAlgae")
> plot(Lakes_PCA$scores[noalgae,1],Lakes_PCA$scores[noalgae,2],xlim=range(Lakes_PCA$scores[,1])*c(0.98,1.3),ylim=range(Lakes_PCA$scores[,2]),xlab="PC1",ylab="PC2",pch=15)
> points(Lakes_PCA$scores[algae,1],Lakes_PCA$scores[algae,2],pch=15,col="blue")
> legend(10,6,legend=c("no-algae","algae"),bty="n",pch=15,col=c("black","blue"))
Make a PC1 vs PC2 diagram showing algae and no-algae lakes:
[Figure: scatter plot of PC2 (y-axis) against PC1 (x-axis); legend: no-algae (black), algae (blue)]
Clear separation between lakes with and without golden algae blooms on the PC1 axis.
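A visual separation like this can be backed up with a simple numeric summary of the PC1 scores per group. The sketch below uses simulated values only: the vectors `bloom` and `pc1` are hypothetical stand-ins for `Lakes[,2]` and `Lakes_PCA$scores[,1]`, and the group sizes and means are made up for illustration.

```r
# Simulated group labels and PC1 scores (not the Lakes results).
set.seed(6)
bloom <- rep(c("Algae", "NoAlgae"), times = c(15, 45))
pc1   <- c(rnorm(15, mean = 4), rnorm(45, mean = -1))

# Mean and range of the PC1 score within each group.
group_means  <- tapply(pc1, bloom, mean)
group_ranges <- tapply(pc1, bloom, range)

print(round(group_means, 2))
print(group_ranges)

# Side-by-side box plots make any overlap between the groups visible.
boxplot(pc1 ~ bloom, xlab = "Bloom", ylab = "PC1 score")
```

If the two ranges barely overlap, that supports reading PC1 as the axis separating the groups; it does not by itself establish that the variables loading on PC1 cause the blooms.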
Make a PC1 vs PC3 diagram showing lakes with algae bloom records in blue:
> algae=which(Lakes[,2]=="Algae")
> noalgae=which(Lakes[,2]=="NoAlgae")
> plot(Lakes_PCA$scores[noalgae,1],Lakes_PCA$scores[noalgae,3],xlim=range(Lakes_PCA$scores[,1])*c(0.98,1.3),ylim=range(Lakes_PCA$scores[,3]),xlab="PC1",ylab="PC3",pch=15)
> points(Lakes_PCA$scores[algae,1],Lakes_PCA$scores[algae,3],pch=15,col="blue")
> legend(10,2,legend=c("no-algae","algae"),bty="n",pch=15,col=c("black","blue"))
PC1 vs PC3 diagram:
[Figure: scatter plot of PC3 (y-axis) against PC1 (x-axis); legend: no-algae (black), algae (blue)]
The separation between algae lakes and no-algae lakes is given by PC1.
Biplot of the first two principal components.
[Figure: biplot of Comp.2 (y-axis) against Comp.1 (x-axis); observations labeled with abbreviated lake names; arrows for Temp, SpCond, DO, pH, Chloride, and Sulfate]
Separation of algae lakes from no-algae lakes is determined by the variables Chloride, Sulfate, and SpCond. The loadings of these three variables on the first two components are so close in value that their arrows overlap.
> biplot(Lakes_PCA,xlabs=abbreviate(Lakes[,1]),xlim=c(-0.1,0.3),ylim=c(-0.2,0.3))
Biplot of the first two principal components:
[Figure: biplot of Comp.2 (y-axis) against Comp.1 (x-axis) without lake labels; arrows for Temp, SpCond, DO, pH, Chloride, and Sulfate]
> biplot(Lakes_PCA,xlabs=rep("",dim(Lakes)[1]),xlim=c(-0.1,0.3),ylim=c(-0.2,0.2))
> points(Lakes_PCA$scores[noalgae,1],Lakes_PCA$scores[noalgae,2],col="black",pch=16)
> points(Lakes_PCA$scores[algae,1],Lakes_PCA$scores[algae,2],col="blue",pch=16)