
Upload: donna-douglas

Post on 21-Dec-2015


PCA Example

The data set “Lakes” consists of five-year averages of water quality parameters measured at 48 lakes in Texas for the period 1975-2010. Several lakes have golden algae bloom records during this period. Are differences in water quality parameters driving the golden algae blooms in these lakes? Do the water quality parameters in the lakes differ from one time period to another?

R data “Lakes”


Variables:

Name – name of the lake
Bloom – presence or absence of golden algae blooms
Year – the first year of the five-year period
Temp – water temperature, degrees Celsius
SpCond – specific conductance, microsiemens per centimeter
DO – dissolved oxygen, mg/L
pH – water pH
Chloride – chloride concentration, mg/L
Sulfate – sulfate concentration, mg/L


Read the data:

> Lakes=read.csv("E:/Multivariate_analysis/Data/Lakes.csv",header=T)

Remove the first three columns of the data and keep only the water quality (WQ) parameters:

> Lk=Lakes[,-c(1:3)]

Calculate the variance for each WQ parameter:

> round(sapply(Lk,var),2)
      Temp     SpCond         DO         pH   Chloride    Sulfate
      9.98 2395416.94       0.83       0.14  162220.42   55044.98
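The variances above span several orders of magnitude (0.14 for pH against about 2.4 million for SpCond), so a PCA on the raw covariance matrix would be dominated by SpCond. Standardizing with scale(), as done in the next step, puts the variables on a common footing. A minimal sketch of what scale() does, on simulated data (the data frame `toy` and its columns `x1`, `x2`, `x3` are made up for illustration and are not the Lakes data):

```r
# Simulated stand-in for a water quality table (NOT the Lakes data).
set.seed(1)
toy <- data.frame(
  x1 = rnorm(50, mean = 20, sd = 3),      # small-variance variable
  x2 = rnorm(50, mean = 2000, sd = 1500)  # large-variance variable
)
toy$x3 <- 0.5 * toy$x2 + rnorm(50, sd = 300)  # correlated with x2

# scale() centers each column and divides it by its sample standard deviation.
ntoy <- scale(toy)

# After scaling, every column has mean 0 and variance 1.
col_means <- colMeans(ntoy)
col_vars  <- apply(ntoy, 2, var)

# Correlations are unchanged by scaling, and the covariance matrix of the
# scaled data equals the correlation matrix of the raw data.
same_cor   <- all.equal(cor(ntoy), cor(toy))
cov_is_cor <- all.equal(cov(ntoy), cor(toy))
```

Because the covariance matrix of standardized data is the correlation matrix, an eigenanalysis of either gives the same components.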


Normalize the data:

> NLk=scale(Lk)

Calculate the correlation matrix of the normalized data:

> round(cor(NLk),2)
          Temp SpCond    DO    pH Chloride Sulfate
Temp      1.00  -0.02 -0.57 -0.20     0.00    0.01
SpCond   -0.02   1.00 -0.07  0.29     0.85    0.96
DO       -0.57  -0.07  1.00  0.35    -0.09   -0.10
pH       -0.20   0.29  0.35  1.00     0.21    0.26
Chloride  0.00   0.85 -0.09  0.21     1.00    0.81
Sulfate   0.01   0.96 -0.10  0.26     0.81    1.00

Calculate the eigenvectors and eigenvalues of the correlation matrix:

> eigen(cor(NLk))
$values
[1] 2.86256674 1.75834470 0.75470254 0.38378574 0.20896051 0.03163976

$vectors
            [,1]        [,2]        [,3]        [,4]         [,5]         [,6]
[1,] -0.01709341  0.60859261  0.53586341 -0.58462716  0.002062515  0.019495627
[2,]  0.57759456  0.03011448 -0.08663364 -0.03882989  0.304075712  0.751000965
[3,] -0.03239882 -0.66764588 -0.05186583 -0.74159388  0.023311888 -0.002075523
[4,]  0.22462025 -0.41945689  0.82114464  0.30853194 -0.060881157 -0.020607442
[5,]  0.54026408  0.06014976 -0.14527739 -0.09287685 -0.814058298 -0.109882586
[6,]  0.56806964  0.05826710 -0.08526964 -0.05407189  0.490502635 -0.650472377
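The eigenvalues translate directly into the share of variance each component explains. A short sketch using the six eigenvalues printed above (the variable names `ev`, `prop`, and `cum_prop` are just illustrative):

```r
# Eigenvalues of the correlation matrix, copied from the eigen() output.
ev <- c(2.86256674, 1.75834470, 0.75470254,
        0.38378574, 0.20896051, 0.03163976)

# For a correlation matrix the eigenvalues sum to the number of variables (6).
total <- sum(ev)

# Share of the total variance carried by each component, and the running total.
prop     <- ev / total
cum_prop <- cumsum(prop)

round(prop[1:3], 4)      # 0.4771 0.2931 0.1258
round(cum_prop[1:3], 4)  # 0.4771 0.7702 0.8959
```

The first three components together account for roughly 90% of the total variance, which is why the rest of the example works with them only.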


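Extract the principal components from the correlation matrix:

> Lakes_PCA=princomp(NLk,corr=TRUE)
> summary(Lakes_PCA,loadings=TRUE)
Importance of components:
                          Comp.1    Comp.2    Comp.3
Standard deviation     1.6870433 1.3222100 0.8662362
Proportion of Variance 0.4770945 0.2930575 0.1257838
Cumulative Proportion  0.4770945 0.7701519 0.8959357

Loadings:
         Comp.1 Comp.2 Comp.3
Temp             0.609  0.536
SpCond    0.578
DO              -0.668
pH        0.225 -0.419  0.821
Chloride  0.540        -0.145
Sulfate   0.568

Note: the argument of princomp is named cor, so corr=TRUE is not matched to it and the call analyses the covariance matrix of NLk. Because NLk is already standardized, the loadings and proportions of variance agree with the eigenanalysis of the correlation matrix. The standard deviations equal sqrt(eigenvalue × 173/174), since princomp divides by n = 174 rather than n − 1.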
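A small check of how the standard deviations reported by princomp relate to the eigenvalues from eigen(), run on simulated data (the matrix `x` and the names `v1` to `v4` are hypothetical, not the Lakes variables):

```r
# Simulated data with some built-in correlation (NOT the Lakes data).
set.seed(2)
n <- 60
x <- matrix(rnorm(n * 4), ncol = 4)
x[, 2] <- x[, 1] + 0.3 * x[, 2]
colnames(x) <- paste0("v", 1:4)

ev <- eigen(cor(x))$values

# princomp on the correlation matrix: component variances equal the eigenvalues.
p_cor <- princomp(x, cor = TRUE)

# princomp on the covariance matrix of standardized data: princomp uses
# divisor n, so the variances are the eigenvalues times (n - 1) / n.
p_cov <- princomp(scale(x))

match_cor <- all.equal(unname(p_cor$sdev^2), ev)
match_cov <- all.equal(unname(p_cov$sdev^2), ev * (n - 1) / n)

# The proportions of variance are identical either way.
same_prop <- all.equal(unname(p_cov$sdev^2 / sum(p_cov$sdev^2)), ev / sum(ev))
```

Either call gives the same loadings (up to sign) and the same proportions of variance; only the scale of the reported standard deviations differs.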


Plot the variance of each principal component:

[Figure: bar plot titled “Lakes” showing the variances of Comp.1 through Comp.6; y-axis “Variances”, from 0.0 to 2.5.]
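The command that produced this plot is not shown. One way to draw such a plot for a princomp object is screeplot() (calling plot() on the object does the same). A sketch on simulated data, with a hypothetical object name `pc`:

```r
# Simulated data (NOT the Lakes table).
set.seed(5)
x <- matrix(rnorm(50 * 4), ncol = 4)
x[, 2] <- x[, 1] + 0.2 * x[, 2]
pc <- princomp(scale(x))

# Draw to a null device so the sketch runs without opening a window.
pdf(NULL)
screeplot(pc, main = "Simulated example")  # bar heights are pc$sdev^2
invisible(dev.off())

# The plotted values: component variances, in decreasing order.
bar_heights <- pc$sdev^2
```

Remove the pdf(NULL) and dev.off() lines to see the plot interactively.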


Write the equations of the first three principal components:

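\[
\begin{aligned}
y_1 &= 0.57\,\mathrm{SpCond} + 0.22\,\mathrm{pH} + 0.54\,\mathrm{Chloride} + 0.56\,\mathrm{Sulfate}\\
y_2 &= 0.60\,\mathrm{Temp} - 0.66\,\mathrm{DO} - 0.41\,\mathrm{pH}\\
y_3 &= 0.53\,\mathrm{Temp} + 0.81\,\mathrm{pH} - 0.14\,\mathrm{Chloride}
\end{aligned}
\]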

SpCond, Chloride, and Sulfate have important loadings on the first principal axis. Temp, DO, and pH contribute significantly to the second principal axis. Temp and pH have the largest loadings on the third principal axis, with a smaller contribution from Chloride.
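Each equation is a weighted sum of the standardized variables, with the weights taken from the corresponding eigenvector. A sketch of that computation on simulated data (the matrix `x` and the variable names `u`, `v`, `w` are hypothetical):

```r
# Simulated data with three correlated variables (NOT the Lakes data).
set.seed(3)
x <- matrix(rnorm(40 * 3), ncol = 3,
            dimnames = list(NULL, c("u", "v", "w")))
x[, "w"] <- x[, "u"] - x[, "v"] + 0.5 * x[, "w"]

z <- scale(x)        # standardized data
e <- eigen(cor(x))   # columns of e$vectors hold the coefficients

# Score on component k = sum over variables of coefficient * standardized value.
scores <- z %*% e$vectors

# First score of the first observation, written out as in the equations.
y1_first <- sum(e$vectors[, 1] * z[1, ])

# The score variances reproduce the eigenvalues.
score_var <- apply(scores, 2, var)
```

The scores computed this way are mutually uncorrelated, which is what makes the component axes convenient for plotting.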


Calculate the scores for each principal axis for the PCA diagram:

> Lakes_PCA$scores
              Comp.1       Comp.2        Comp.3       Comp.4        Comp.5        Comp.6
  [1,] -0.840940601 -0.129405032  0.3724851406  0.282009798 -0.0159568183 -0.0245294336
  [2,] -1.129223226 -0.998041022 -1.6138622935 -0.108292698  0.1506844836 -0.0994239039
  [3,] -0.803185195  2.824527084  1.7646947373  0.388294337 -0.0344049191  0.0068578657
  [4,] -0.800500726  0.178911098  0.2942787158  0.185687561 -0.0006769275  0.0366685692
  [5,] -0.984180111 -0.638164386 -0.2893359063 -0.977367586  0.0675853680  0.0448307344
  [6,] -0.726931303  1.823618015  1.8948715840 -0.672342526 -0.0255919528 -0.0070631567
  [7,] -0.768218704 -1.767306989 -0.0040369230 -0.037962664 -0.0485309836 -0.0275789371
  ...
[174,]  1.629960036 -0.579185868  0.3245963801 -0.470741904 -0.0852283736  0.7779341323

Make a PC1 vs PC2 diagram showing each year with a different color:

> year1=which(Lakes[,3]==sort(unique(Lakes[,3]))[1])
> year2=which(Lakes[,3]==sort(unique(Lakes[,3]))[2])
> year3=which(Lakes[,3]==sort(unique(Lakes[,3]))[3])
> year4=which(Lakes[,3]==sort(unique(Lakes[,3]))[4])
> year5=which(Lakes[,3]==sort(unique(Lakes[,3]))[5])
> year6=which(Lakes[,3]==sort(unique(Lakes[,3]))[6])
> year7=which(Lakes[,3]==sort(unique(Lakes[,3]))[7])

> plot(Lakes_PCA$scores[year1,1],Lakes_PCA$scores[year1,2],xlab="PC1",ylab="PC2",pch=15,xlim=range(Lakes_PCA$scores[,1])*c(0.98,1.3),ylim=range(Lakes_PCA$scores[,2]))
> points(Lakes_PCA$scores[year2,1],Lakes_PCA$scores[year2,2],pch=15,col="red")
> points(Lakes_PCA$scores[year3,1],Lakes_PCA$scores[year3,2],pch=15,col="blue")
> points(Lakes_PCA$scores[year4,1],Lakes_PCA$scores[year4,2],pch=15,col="green")
> points(Lakes_PCA$scores[year5,1],Lakes_PCA$scores[year5,2],pch=15,col="pink")
> points(Lakes_PCA$scores[year6,1],Lakes_PCA$scores[year6,2],pch=15,col="yellow")
> points(Lakes_PCA$scores[year7,1],Lakes_PCA$scores[year7,2],pch=15,col="brown")
> legend(11,2,legend=as.character(sort(unique(Lakes[,3]))),bty="n",pch=15,col=c("black","red","blue","green","pink","yellow","brown"))
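The seven which() and points() calls can be condensed by turning the year column into a factor and using it to index a color vector. This is an alternative sketch, not the code used for the slides; `yr` stands in for Lakes[,3] and `sc` for Lakes_PCA$scores, both simulated here:

```r
# Hypothetical stand-ins: `yr` for the Year column, `sc` for the score matrix.
set.seed(4)
yr <- rep(seq(1975, 2005, by = 5), each = 5)
sc <- matrix(rnorm(length(yr) * 2), ncol = 2)

cols   <- c("black", "red", "blue", "green", "pink", "yellow", "brown")
grp    <- factor(yr)              # levels are the sorted unique years
pt_col <- cols[as.integer(grp)]   # one color per observation

# Draw to a null device so the sketch runs without opening a window.
pdf(NULL)
plot(sc[, 1], sc[, 2], xlab = "PC1", ylab = "PC2", pch = 15, col = pt_col)
legend("topright", legend = levels(grp), bty = "n", pch = 15, col = cols)
invisible(dev.off())
```

Using the factor levels for the legend keeps the legend order and the point colors in step with each other.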


PC1 vs PC2 diagram:

[Figure: PC2 plotted against PC1, with points colored by the first year of each five-year period; legend: 1975, 1980, 1985, 1990, 1995, 2000, 2005.]

Several lakes have different water quality in the five-year periods starting in 1975, 1980, and 1985 (the isolated black, red, and blue points, respectively).

Make a PC1 vs PC3 diagram showing each year with a different color:

> plot(Lakes_PCA$scores[year1,1],Lakes_PCA$scores[year1,3],xlab="PC1",ylab="PC3",pch=15,xlim=range(Lakes_PCA$scores[,1])*c(0.98,1.3),ylim=range(Lakes_PCA$scores[,3]))
> points(Lakes_PCA$scores[year2,1],Lakes_PCA$scores[year2,3],pch=15,col="red")
> points(Lakes_PCA$scores[year3,1],Lakes_PCA$scores[year3,3],pch=15,col="blue")
> points(Lakes_PCA$scores[year4,1],Lakes_PCA$scores[year4,3],pch=15,col="green")
> points(Lakes_PCA$scores[year5,1],Lakes_PCA$scores[year5,3],pch=15,col="pink")
> points(Lakes_PCA$scores[year6,1],Lakes_PCA$scores[year6,3],pch=15,col="yellow")
> points(Lakes_PCA$scores[year7,1],Lakes_PCA$scores[year7,3],pch=15,col="brown")
> legend("topright",legend=as.character(sort(unique(Lakes[,3]))),bty="n",pch=15,col=c("black","red","blue","green","pink","yellow","brown"))


PC1 vs PC3 diagram:

[Figure: PC3 plotted against PC1, with points colored by the first year of each five-year period; legend: 1975, 1980, 1985, 1990, 1995, 2000, 2005.]

The five-year period starting in 1985 shows different water quality in several lakes (blue dots). A few lakes show differences in 1975 and 1980 compared to the rest of the group.


Make a PC1 vs PC2 diagram showing lakes with algae bloom records in blue:

> algae=which(Lakes[,2]=="Algae")
> noalgae=which(Lakes[,2]=="NoAlgae")
> plot(Lakes_PCA$scores[noalgae,1],Lakes_PCA$scores[noalgae,2],xlim=range(Lakes_PCA$scores[,1])*c(0.98,1.3),ylim=range(Lakes_PCA$scores[,2]),xlab="PC1",ylab="PC2",pch=15)
> points(Lakes_PCA$scores[algae,1],Lakes_PCA$scores[algae,2],pch=15,col="blue")
> legend(10,6,legend=c("no-algae","algae"),bty="n",pch=15,col=c("black","blue"))


PC1 vs PC2 diagram showing algae and no-algae lakes:

[Figure: PC2 plotted against PC1; no-algae lakes in black, algae lakes in blue.]

Clear separation between lakes with and without golden algae blooms on the PC1 axis.


Make a PC1 vs PC3 diagram showing lakes with algae bloom records in blue:

> algae=which(Lakes[,2]=="Algae")
> noalgae=which(Lakes[,2]=="NoAlgae")
> plot(Lakes_PCA$scores[noalgae,1],Lakes_PCA$scores[noalgae,3],xlim=range(Lakes_PCA$scores[,1])*c(0.98,1.3),ylim=range(Lakes_PCA$scores[,3]),xlab="PC1",ylab="PC3",pch=15)
> points(Lakes_PCA$scores[algae,1],Lakes_PCA$scores[algae,3],pch=15,col="blue")
> legend(10,2,legend=c("no-algae","algae"),bty="n",pch=15,col=c("black","blue"))


PC1 vs PC3 diagram:

[Figure: PC3 plotted against PC1; no-algae lakes in black, algae lakes in blue.]

The separation between algae lakes and no-algae lakes is given by PC1.


Biplot of the first two principal components.

[Figure: biplot of Comp.1 against Comp.2. Observations are labeled with abbreviated lake names; arrows show the variables Temp, SpCond, DO, pH, Chloride, and Sulfate.]

Separation of algae lakes from no-algae lakes is determined by the variables Chloride, Sulfate, and SpCond. The eigenvector coefficients of these three variables are so close in value that their arrows overlap.

> biplot(Lakes_PCA,xlabs=abbreviate(Lakes[,1]),xlim=c(-0.1,0.3),ylim=c(-0.2,0.3))


Biplot of the first two principal components:

[Figure: biplot of Comp.1 against Comp.2 without lake labels; no-algae lakes drawn as black points and algae lakes as blue points, with arrows for Temp, SpCond, DO, pH, Chloride, and Sulfate.]

> biplot(Lakes_PCA,xlabs=rep("",dim(Lakes)[1]),xlim=c(-0.1,0.3),ylim=c(-0.2,0.2))
> points(Lakes_PCA$scores[noalgae,1],Lakes_PCA$scores[noalgae,2],col="black",pch=16)
> points(Lakes_PCA$scores[algae,1],Lakes_PCA$scores[algae,2],col="blue",pch=16)