Introduction to Biostatistics The 3rd Lecture
2020/5/1
Principal component analysis
Experiments in agriculture and life sciences often measure multiple characteristics of the same samples. For example, in field trials of crops, various traits related to yield are examined simultaneously, even when the purpose is yield assessment. It is often the case that you can discover some kind of knowledge by drawing and viewing scatter plots of the multiple characteristics measured simultaneously. However, if the number of characteristics measured is large, it becomes difficult to grasp the variation of the data with a scatter plot. The number of dimensions that humans can intuitively grasp using a scatter plot is at most several, and it is not easy to grasp the variation of the data when there are more than 10 measured characteristics. Principal component analysis, described in this lecture, is a method for summarizing the variation contained in multidimensional data into low-dimensional features (i.e., principal component scores) while losing as little information as possible. For example, the marker genotype data shown as an example in this lecture demonstrate that data in 1,311 dimensions can be summarized in four dimensions. Principal component analysis is a very effective method for efficiently extracting the information contained in variables when the number of variables is large.

In this lecture, we will use a rice dataset (Zhao et al. 2011) as an example. We will use marker genotype data (RiceDiversityGeno.csv) as well as line (accession) data (RiceDiversityLine.csv) and phenotypic data (RiceDiversityPheno.csv). The marker genotype data consist of the genotypes of 1,311 SNPs that Zhao et al. (2010, PLoS One 5: e10780) used in their analysis. All data are based on the data downloaded from the Rice Diversity web page http://www.ricediversity.org/data/index.cfm. As for the marker data, missing values were imputed using the software fastPHASE (Scheet and Stephens 2006).
Load the three types of data and combine them.

# this data set was analyzed in Zhao 2011 (Nature Communications 2:467)
line <- read.csv("RiceDiversityLine.csv")
pheno <- read.csv("RiceDiversityPheno.csv")
geno <- read.csv("RiceDiversityGeno.csv")
line.pheno <- merge(line, pheno, by.x = "NSFTV.ID", by.y = "NSFTVID")
alldata <- merge(line.pheno, geno, by.x = "NSFTV.ID", by.y = "NSFTVID")
Let's analyze the variation of panicle length and flag leaf length in the varieties and lines included in the rice germplasm by principal component analysis. First, the data for both traits are extracted from alldata. Then, samples that have a missing value in either trait are excluded.

# extract panicle length and flag leaf length
mydata <- data.frame(
    panicle.length = alldata$Panicle.length,
    leaf.length = alldata$Flag.leaf.length
)
head(mydata, 3)
## panicle.length leaf.length
## 1 20.48182 28.37500
## 2 26.83333 39.00833
## 3 23.53333 27.68333
dim(mydata)
## [1] 374 2
missing <- apply(is.na(mydata), 1, sum) > 0
mydata <- mydata[!missing, ]
dim(mydata)
## [1] 341 2
Let's check the variation of panicle length and flag leaf length with a scatter plot.

# look at the relationship between the two variables
plot(mydata)
lim <- range(mydata)
plot(mydata, xlim = lim, ylim = lim)
Look at the scatter plot: when one trait becomes large, the other tends to become large as well. Let's calculate the variance-covariance matrix and the correlation matrix of the two traits to confirm this numerically.

# statistics for measuring the relationship between two variables
cov(mydata)

## panicle.length leaf.length
## panicle.length 12.67168 11.57718
## leaf.length 11.57718 33.41344
cor(mydata)
## panicle.length leaf.length
## panicle.length 1.000000 0.562633
## leaf.length 0.562633 1.000000
Both the correlation and the covariance have positive values. Thus, we can confirm that the two traits tend to vary together.

In order to simplify the following calculations and explanations, each trait is centered so that its mean is zero (the average is subtracted from the original variable).

# subtract the mean from each column to shift the center of data to the origin
mydata <- sweep(mydata, 2, apply(mydata, 2, mean))
summary(mydata)

## panicle.length leaf.length
## Min. :-8.8442 Min. :-15.1798
## 1st Qu.:-2.1048 1st Qu.: -4.0048
## Median :-0.1358 Median : -0.6548
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 2.0892 3rd Qu.: 3.9036
## Max. :11.2058 Max. : 18.8480
cov(mydata)
## panicle.length leaf.length
## panicle.length 12.67168 11.57718
## leaf.length 11.57718 33.41344

cor(mydata)

## panicle.length leaf.length
## panicle.length 1.000000 0.562633
## leaf.length 0.562633 1.000000
lim <- range(mydata)
plot(mydata, xlim = lim, ylim = lim)
abline(h = 0, v = 0)

Note that centering the variables does not change the relationships between them represented by the variance-covariance matrix and the correlation matrix.
Let's perform principal component analysis and plot the obtained principal component scores.

# perform principal component analysis (PCA) and draw a scatter plot
res <- prcomp(mydata)
lim <- range(res$x)
plot(res$x, xlim = lim, ylim = lim)
abline(h = 0, v = 0)

Let's check the relationship between the principal component scores and the original variables by drawing the two scatter plots side by side.

# show graphs side by side
op <- par(mfrow = c(1, 2))
lim <- range(mydata)
plot(mydata, xlim = lim, ylim = lim)
abline(h = 0, v = 0)
lim <- range(res$x)
plot(res$x, xlim = lim, ylim = lim)
abline(h = 0, v = 0)
7
7
par(op)
The side-by-side comparison shows that the principal component scores are a rotation (and reflection) of the original variables. Principal component analysis is a method for representing the variation of the original variables with new variables in as few dimensions as possible. For example, the horizontal axis in the right figure expresses the variation of the two variables, panicle length and flag leaf length, as the variation of one new variable (i.e., the first principal component). It can be seen that the first principal component alone explains most of the variation of the original variables. The scores of the second principal component represent the variation that could not be explained by the first principal component.

Now let's check how much of the variation each principal component actually explains.

# show the result of PCA
summary(res)

## Importance of components:
## PC1 PC2
## Standard deviation 6.2117 2.7385
## Proportion of Variance 0.8373 0.1627
## Cumulative Proportion 0.8373 1.0000
The first principal component accounts for 83.7% of the total variation, and the second principal component accounts for the remaining 16.3%. That is, more than 80% of the variation in panicle length and flag leaf length can be represented by a single variable (the first principal component).

Let's look at the results in more detail.
res
## Standard deviations (1, .., p=2):
## [1] 6.211732 2.738524
##
## Rotation (n x k) = (2 x 2):
## PC1 PC2
## panicle.length -0.4078995 -0.9130268
## leaf.length -0.9130268 0.4078995
Standard deviations are the standard deviations of the first and second principal components, the new variables. Rotation contains the unit vectors representing the orientation of the axes of the first and second principal components (these unit vectors are the eigenvectors, as will be described later).

The results mentioned above can also be retrieved separately as follows.
res$sdev
## [1] 6.211732 2.738524
res$rotation
## PC1 PC2
## panicle.length -0.4078995 -0.9130268
## leaf.length -0.9130268 0.4078995
Let's draw graphs of the result of the principal component analysis.

# draw graphs for PCA
op <- par(mfrow = c(1, 2))
plot(res)
biplot(res)
par(op)
The left graph shows the variances of the principal component scores, which are the squares of the standard deviations of the principal components (note that the variance of a principal component score is an eigenvalue of the variance-covariance matrix). The right graph is a biplot that shows the relationship between the principal component scores and the variables. In the biplot, the arrows of both flag leaf length (leaf.length) and panicle length (panicle.length) point to the left, and samples with smaller scores on the first principal component (the horizontal axis) have larger values of both traits. That is, the first principal component can be interpreted as a variable that represents "size". On the other hand, the arrow of flag leaf length points (slightly) upward while the arrow of panicle length points downward: a sample with a longer flag leaf has a larger second-component score, while a sample with a longer panicle has a smaller one. That is, the second principal component can be interpreted as a variable representing the "ratio" of flag leaf length to panicle length. Details about the biplot will be explained later.
Formulation of principal component analysis
Here, I will outline the formulation of principal component analysis, using the two-dimensional data mentioned earlier as an example. First, let us consider representing the variation of the two variables, panicle length and flag leaf length, with one new variable. The orientation of the axis representing this new variable is tentatively assumed to be $(1/\sqrt{2}, 1/\sqrt{2})$ here. The value of the new variable for a data point corresponds to the position of the foot of the perpendicular dropped from that point onto the axis. In the figure drawn below, the red line is the axis representing the new variable, the gray lines are the perpendiculars from the data points to the new axis, and the green crosses are the feet of the perpendiculars; the position of the foot of the perpendicular is the value of the new variable.

# plot again
lim <- range(mydata)
plot(mydata, xlim = lim, ylim = lim)
abline(h = 0, v = 0)
# arbitrary line
u.temp <- c(1 / sqrt(2), 1 / sqrt(2))
abline(0, u.temp[2] / u.temp[1], col = "red")
# draw scores
score.temp <- as.matrix(mydata) %*% u.temp
x <- score.temp * u.temp[1]
y <- score.temp * u.temp[2]
segments(x, y, mydata$panicle.length, mydata$leaf.length, col = "gray")
points(x, y, pch = 4, col = "green")
Now, let's focus on one sample and examine the relationship between the values of the original variables and the value of the new variable in more detail. Here, let's highlight the sample that has the longest flag leaf.

# plot again
lim <- range(mydata)
plot(mydata, xlim = lim, ylim = lim)
abline(h = 0, v = 0)
abline(0, u.temp[2] / u.temp[1], col = "red")
id <- which.max(mydata$leaf.length)
arrows(0, 0, mydata$panicle.length[id], mydata$leaf.length[id], col = "purple")
arrows(x[id], y[id], mydata$panicle.length[id], mydata$leaf.length[id], col = "pink")
arrows(0, 0, x[id], y[id], col = "blue")
Now, if the original variables are represented by the new variable as in the figure above, the information indicated by the pink arrow is lost. Writing the vector of the original variables as $\mathbf{x}_i$, the vector representing the new variable as $\mathbf{y}_i$, and the vector representing the lost information as $\mathbf{e}_i$, the squared length of $\mathbf{x}_i$ is

$$
\begin{aligned}
|\mathbf{x}_i|^2 &= |\mathbf{y}_i + \mathbf{e}_i|^2 \\
&= (\mathbf{y}_i + \mathbf{e}_i)^T (\mathbf{y}_i + \mathbf{e}_i) \\
&= \mathbf{y}_i^T \mathbf{y}_i + \mathbf{e}_i^T \mathbf{y}_i + \mathbf{y}_i^T \mathbf{e}_i + \mathbf{e}_i^T \mathbf{e}_i \\
&= |\mathbf{y}_i|^2 + |\mathbf{e}_i|^2 + 2\mathbf{e}_i^T \mathbf{y}_i \\
&= |\mathbf{y}_i|^2 + |\mathbf{e}_i|^2 \qquad (1)
\end{aligned}
$$

where the last step uses the fact that $\mathbf{e}_i$ is orthogonal to $\mathbf{y}_i$, so $\mathbf{e}_i^T \mathbf{y}_i = 0$.

That is, the squared variation of the original variables can be divided into the squared variation of the new variable and the squared variation that is lost. This means that minimizing the lost information is equivalent to maximizing the variability of the new variable.
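This equivalence can be illustrated with a small numerical sketch. The data below are simulated (not the rice dataset): we scan many candidate unit vectors and check that the direction maximizing the variance of the projected scores coincides with the first principal component axis found by prcomp.

```r
# A sketch with simulated, centered two-variable data (hypothetical values):
# scan candidate axis directions and check that the direction maximizing
# the variance of the projected scores matches PC1 from prcomp().
set.seed(1)
n <- 200
x1 <- rnorm(n, sd = 3)
x2 <- 0.8 * x1 + rnorm(n, sd = 2)
X <- scale(cbind(x1, x2), scale = FALSE)   # center only, as in the text

theta <- seq(0, pi, length.out = 1801)     # candidate directions (unit vectors)
U <- rbind(cos(theta), sin(theta))
proj.var <- apply(U, 2, function(u) var(as.vector(X %*% u)))

u.best <- U[, which.max(proj.var)]         # direction with maximal variance
u.pc1 <- prcomp(X)$rotation[, 1]
abs(sum(u.best * u.pc1))                   # close to 1: same axis up to sign
```

The grid search is only for illustration; the eigenvalue derivation that follows finds the same axis analytically.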
How can we find the axis that maximizes the variability of the new variable? Consider a vector $\mathbf{u}_1$ that determines the orientation of the axis, and seek the one that maximizes the variation of the new variable. Since there are infinitely many candidates if vectors of any length are allowed, we restrict $\mathbf{u}_1$ to be a unit vector. That is,

$$|\mathbf{u}_1|^2 = \mathbf{u}_1^T \mathbf{u}_1 = u_{11}^2 + u_{12}^2 = 1 \qquad (2)$$

Under this condition, we maximize the variance of the values $z_{1i}$ of the new variable,

$$\frac{1}{n-1} \sum_{i=1}^{n} z_{1i}^2 \qquad (3)$$

(the mean of the $z_{1i}$ is zero because the data are centered). Here $z_{1i}$, the position of the foot of the perpendicular, is the inner product of $\mathbf{u}_1$ and $\mathbf{x}_i$:

$$z_{1i} = \mathbf{x}_i^T \mathbf{u}_1 = u_{11} x_{1i} + u_{12} x_{2i} \qquad (4)$$

Note that the relationship with equation (1) is

$$\mathbf{y}_i = z_{1i} \mathbf{u}_1$$
To maximize equation (3) under the constraint of equation (2), we use the method of Lagrange multipliers. That is, we maximize

$$L(\mathbf{u}_1, \lambda) = \frac{1}{n-1} \sum_{i=1}^{n} (u_{11} x_{1i} + u_{12} x_{2i})^2 - \lambda (u_{11}^2 + u_{12}^2 - 1)$$

First, partially differentiate $L$ with respect to $u_{11}$ and $u_{12}$, and set the derivatives to zero:

$$\frac{\partial L}{\partial u_{11}} = \frac{1}{n-1} \sum_{i=1}^{n} 2 (u_{11} x_{1i} + u_{12} x_{2i}) x_{1i} - 2\lambda u_{11} = 0$$

$$\frac{\partial L}{\partial u_{12}} = \frac{1}{n-1} \sum_{i=1}^{n} 2 (u_{11} x_{1i} + u_{12} x_{2i}) x_{2i} - 2\lambda u_{12} = 0$$
These equations can be rearranged as

$$\frac{1}{n-1} \left( u_{11} \sum_{i=1}^{n} x_{1i}^2 + u_{12} \sum_{i=1}^{n} x_{1i} x_{2i} \right) = \lambda u_{11}$$

$$\frac{1}{n-1} \left( u_{11} \sum_{i=1}^{n} x_{1i} x_{2i} + u_{12} \sum_{i=1}^{n} x_{2i}^2 \right) = \lambda u_{12}$$
Expressing the two equations with a matrix gives

$$
\frac{1}{n-1}
\begin{pmatrix}
\sum_{i=1}^{n} x_{1i}^2 & \sum_{i=1}^{n} x_{1i} x_{2i} \\
\sum_{i=1}^{n} x_{1i} x_{2i} & \sum_{i=1}^{n} x_{2i}^2
\end{pmatrix}
\begin{pmatrix} u_{11} \\ u_{12} \end{pmatrix}
= \lambda \begin{pmatrix} u_{11} \\ u_{12} \end{pmatrix}
$$

Note that in the matrix on the left-hand side (together with the factor $1/(n-1)$), the diagonal elements are variances and the off-diagonal elements are covariances. Writing this variance-covariance matrix as $\mathbf{V}$ and the vector representing the orientation of the axis as $\mathbf{u}_1$, we obtain

$$\mathbf{V}\mathbf{u}_1 = \lambda \mathbf{u}_1 \qquad (5)$$
For equation (5), $\mathbf{u}_1 = \mathbf{0}$ is a trivial solution, but it is not the solution we seek. Finding the non-trivial solutions of equation (5) for the matrix $\mathbf{V}$ is called the eigenvalue problem. The vector $\mathbf{u}_1$ is called an eigenvector of $\mathbf{V}$, and $\lambda$ its eigenvalue.

To summarize: "finding the new variable that best describes the variation of the original variables" eventually corresponds to "computing the variance-covariance matrix of the original variables and finding its first eigenvector".
Equation (5) can be rewritten as follows:

$$(\mathbf{V} - \lambda \mathbf{I})\mathbf{u}_1 = \mathbf{0}$$

For this equation to have a solution other than $\mathbf{u}_1 = \mathbf{0}$, the determinant of the coefficient matrix must be zero. That is,

$$|\mathbf{V} - \lambda \mathbf{I}| = 0$$

This equation is called the eigen (or characteristic) equation.
Although the case of two variables has been described here as an example, in general, if there are $m$ variables, $\mathbf{V}$ is an $m \times m$ variance-covariance matrix. Since $\mathbf{V}$ is an $m \times m$ symmetric matrix (a variance-covariance matrix is always symmetric), it has $m$ real eigenvalues $\lambda_1, \ldots, \lambda_m$ and corresponding eigenvectors $\mathbf{u}_1, \ldots, \mathbf{u}_m$, which are unit vectors with real elements that are orthogonal to each other. If the eigenvectors are arranged in descending order of their eigenvalues ($\lambda_1 \geq \ldots \geq \lambda_m$), then $\mathbf{u}_1, \ldots, \mathbf{u}_m$ become the axes of the first, ..., $m$-th principal components.
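These properties of a symmetric matrix can be checked numerically. The sketch below uses a small random positive semi-definite matrix (arbitrary simulated values, for illustration only), and verifies that eigen returns the eigenvalues in decreasing order and that the eigenvectors are orthonormal.

```r
# Sketch: eigen() on a symmetric matrix returns real eigenvalues in
# decreasing order and orthonormal eigenvectors (t(U) %*% U = I).
set.seed(2)
A <- crossprod(matrix(rnorm(30), 10, 3))    # a 3 x 3 symmetric matrix
eig <- eigen(A)
all(diff(eig$values) <= 0)                  # eigenvalues sorted decreasingly
max(abs(crossprod(eig$vectors) - diag(3)))  # ~0: eigenvectors are orthonormal
```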
Let's perform principal component analysis following the calculation procedure described above. First, calculate the variance-covariance matrix $\mathbf{V}$.

# calculate covariance matrix
cov <- var(mydata)
cov

## panicle.length leaf.length
## panicle.length 12.67168 11.57718
## leaf.length 11.57718 33.41344
Next, perform an eigenvalue decomposition of the variance-covariance matrix. We use the function eigen for this.

# eigenvalue decomposition
eig <- eigen(cov)
eig

## eigen() decomposition
## $values
## [1] 38.585610 7.499513
##
## $vectors
## [,1] [,2]
## [1,] 0.4078995 -0.9130268
## [2,] 0.9130268 0.4078995
The eigenvalue decomposition yields the eigenvalues $\lambda_1, \ldots, \lambda_m$ (values) and the eigenvectors $\mathbf{u}_1, \ldots, \mathbf{u}_m$ (vectors).

Let's compare the results obtained with the function eigen with those obtained with the function prcomp.

# compare results
res <- prcomp(mydata)
res

## Standard deviations (1, .., p=2):
## [1] 6.211732 2.738524
##
## Rotation (n x k) = (2 x 2):
## PC1 PC2
## panicle.length -0.4078995 -0.9130268
## leaf.length -0.9130268 0.4078995
sqrt(eig$values)
## [1] 6.211732 2.738524
The standard deviations are the square roots of the eigenvalues. This is because the variance of the new variables (the principal component scores) equals the eigenvalue, as described later. The rotation is given by the eigenvectors, and the two results are identical except for sign. The sign of the coefficients depends on which side of the axis is taken as positive; since there is no rule that determines this uniquely, the sign may be flipped. In the present results, prcomp and eigen give opposite signs for the first principal component.

Now let's calculate the values of the new variables, i.e., the principal component scores. They can be calculated using equation (4). For example, the principal component score of the first sample can be calculated as follows:

# calculate principal component scores
mydata[1, ]

## panicle.length leaf.length
## 1 -3.995677 -2.221425

eig$vectors[, 1]

## [1] 0.4078995 0.9130268
mydata[1,1] * eig$vectors[1,1] + mydata[1,2] * eig$vectors[2,1]
## [1] -3.658055
res$x[1,1]
## [1] 3.658055
To calculate the principal component scores for all samples and all principal components at once, compute the product of the data matrix and the matrix of eigenvectors.

score <- as.matrix(mydata) %*% eig$vectors
head(score)

## [,1] [,2]
## 1 -3.658055 2.7420420
## 2 8.641243 1.2802695
## 3 -3.044854 -0.3262037
## 4 1.665770 -4.1692908
## 5 8.389476 -3.3045123
## 6 5.897673 2.4732600

head(res$x)

## PC1 PC2
## 1 3.658055 2.7420420
## 2 -8.641243 1.2802695
## 3 3.044854 -0.3262037
## 4 -1.665770 -4.1692908
## 5 -8.389476 -3.3045123
## 6 -5.897673 2.4732600
The principal component scores obtained here and those obtained using prcomp are identical except that the signs of the first principal component scores are reversed (for the reason described above).

Let's examine the variances and covariances of the principal component scores, and compare the elements of the resulting matrix with the eigenvalues.

# variance of scores = eigenvalues
var(score)

## [,1] [,2]
## [1,] 3.858561e+01 6.763202e-16
## [2,] 6.763202e-16 7.499513e+00
eig$values
## [1] 38.585610 7.499513
The above results show two important points. One is that the covariance between the first and second principal components is zero: there is no redundancy between the two components. The other is that the variance of the principal component scores matches the eigenvalue of each component. This relationship can be derived as follows:

$$
\begin{aligned}
\frac{1}{n-1} \sum_{i=1}^{n} z_{ji}^2 &= \frac{1}{n-1} \mathbf{z}_j^T \mathbf{z}_j \\
&= \frac{1}{n-1} (\mathbf{X}\mathbf{u}_j)^T (\mathbf{X}\mathbf{u}_j) \\
&= \frac{1}{n-1} \mathbf{u}_j^T \mathbf{X}^T \mathbf{X} \mathbf{u}_j \\
&= \mathbf{u}_j^T \mathbf{V} \mathbf{u}_j && \left(\because \tfrac{1}{n-1}\mathbf{X}^T\mathbf{X} = \mathbf{V}\right) \\
&= \lambda_j \mathbf{u}_j^T \mathbf{u}_j && (\because \mathbf{V}\mathbf{u}_j = \lambda_j \mathbf{u}_j) \\
&= \lambda_j && (\because \mathbf{u}_j^T \mathbf{u}_j = 1)
\end{aligned}
$$

where $z_{ji}$ is the $j$-th principal component score of the $i$-th sample, $\mathbf{z}_j = (z_{j1}, \ldots, z_{jn})^T$ is the column vector of the $j$-th principal component scores of all samples, and $\mathbf{X}$ is the $n \times m$ data matrix whose $i$-th row contains the $m$ centered original variables of the $i$-th sample. Because of this relationship, as mentioned earlier, the standard deviations of the principal component scores reported by prcomp coincide with the square roots of the eigenvalues reported by eigen.
Let's check another important relationship: the sum of the eigenvalues (i.e., the sum of the variances of the principal components) matches the sum of the variances of the original variables, as shown below.

# sum of variances
sum(eig$values)
## [1] 46.08512
sum(diag(cov))
## [1] 46.08512
Therefore, the ratio of the eigenvalue of the $j$-th principal component to the sum of the eigenvalues of all principal components equals the ratio of the variance of the $j$-th principal component to the sum of the variances of the original variables. This ratio is called the contribution of the $j$-th principal component. The sum of the contributions from the first to the $j$-th principal component is called the cumulative contribution of the $j$-th principal component. Contribution and cumulative contribution provide a good basis for determining the number of effective (necessary) components, as discussed later. Now let's calculate the contributions and cumulative contributions of the principal components.

# contribution
eig$values / sum(eig$values)

## [1] 0.8372682 0.1627318

cumsum(eig$values) / sum(eig$values)

## [1] 0.8372682 1.0000000
summary(res)
## Importance of components:
## PC1 PC2
## Standard deviation 6.2117 2.7385
## Proportion of Variance 0.8373 0.1627
## Cumulative Proportion 0.8373 1.0000

You can see that the first principal component accounts for 83.7% of the total variation (the sum of the variances of the original variables). This matches the result displayed by applying the function summary to the prcomp result.
Principal Component Analysis Based on Correlation Matrix
So far, we have discussed principal component analysis based on the variance-covariance matrix. This approach is not appropriate when the variables are measured on different scales, because it is difficult to give meaning to the covariance between variables with different measurement scales.

For example, when considering the covariance between a length and a count, the size of the covariance differs depending on whether the length is measured in meters or in centimeters (in the latter case it is 100 times larger). Therefore, the result of a principal component analysis based on the variance-covariance matrix of these two variables changes depending on the unit of measurement.

Also, even when both variables are lengths, if one varies much more than the other, the larger variable mainly determines the magnitude of the covariance. Principal component analysis then yields results that depend mainly on the variation of the larger variable.
The problems described above arise because the estimate of the covariance has the following form:

$$\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

Now let's examine the problem concretely. First, let's draw a scatter plot of panicle length (Panicle.length) and the number of florets per panicle (Florets.per.panicle). Panicle length is measured in centimeters (cm), while the number of florets per panicle is a count.

# extract panicle length and florets per panicle
mydata <- data.frame(
    panicle.length = alldata$Panicle.length,
    panicle.florets = alldata$Florets.per.panicle
)
missing <- apply(is.na(mydata), 1, sum) > 0
mydata <- mydata[!missing, ]
# look at the relationship between the two variables
plot(mydata)
Next, let's do a principal component analysis based on the variance-covariance matrix. Note that the following analysis is an "incorrect analysis example".

# the following analysis is wrong
res <- prcomp(mydata)
res

## Standard deviations (1, .., p=2):
## [1] 3.5623427 0.2901551
##
## Rotation (n x k) = (2 x 2):
## PC1 PC2
## panicle.length 0.99926174 -0.03841834
## panicle.florets 0.03841834 0.99926174
The eigenvectors show that the first principal component mainly explains the variation in panicle length.

Let's see what happens if panicle length is measured in meters (this is also an "incorrect analysis example").

# if panicle length is measured in meter unit
mydata$panicle.length <- mydata$panicle.length / 100
res.2 <- prcomp(mydata)
res.2

## Standard deviations (1, .., p=2):
## [1] 0.32097715 0.03220266
##
## Rotation (n x k) = (2 x 2):
## PC1 PC2
## panicle.length 0.04750446 -0.99887103
## panicle.florets 0.99887103 0.04750446
Now the first principal component mainly explains the variation in the number of florets. In other words, the result of the principal component analysis changes completely because the measurement scale is different.

How can we solve this problem? One way is to perform the principal component analysis after scaling each variable to mean 0 and variance 1. With this normalization, the analysis is not influenced by differences in the magnitude of variation among the variables. Let's actually calculate it.

# scaling
mydata.scaled <- scale(mydata)
var(mydata.scaled)

## panicle.length panicle.florets
## panicle.length 1.0000000 0.4240264
## panicle.florets 0.4240264 1.0000000

res.scaled <- prcomp(mydata.scaled)
res.scaled

## Standard deviations (1, .., p=2):
## [1] 1.1933258 0.7589292
##
## Rotation (n x k) = (2 x 2):
## PC1 PC2
## panicle.length 0.7071068 -0.7071068
## panicle.florets 0.7071068 0.7071068
The first principal component is now a variable that explains both variables becoming large together, and the second principal component explains one becoming smaller while the other becomes larger.

Note that the variance-covariance matrix calculated from the scaled variables matches the correlation matrix calculated from the variables before scaling. In other words, the eigenvalue decomposition of the correlation matrix produces the same result as the eigenvalue decomposition of the variance-covariance matrix of the scaled data. Let's confirm this using the function eigen.

# cov and cor
eigen(cov(mydata.scaled))
## eigen() decomposition
## $values
## [1] 1.4240264 0.5759736
##
## $vectors
## [,1] [,2]
## [1,] 0.7071068 -0.7071068
## [2,] 0.7071068 0.7071068

eigen(cor(mydata))

## eigen() decomposition
## $values
## [1] 1.4240264 0.5759736
##
## $vectors
## [,1] [,2]
## [1,] 0.7071068 -0.7071068
## [2,] 0.7071068 0.7071068
The prcomp function performs principal component analysis based on the correlation matrix when the option scale = T is specified.

# perform principal component analysis on scaled data
res.scaled.2 <- prcomp(mydata, scale = T)
res.scaled.2

## Standard deviations (1, .., p=2):
## [1] 1.1933258 0.7589292
##
## Rotation (n x k) = (2 x 2):
## PC1 PC2
## panicle.length 0.7071068 -0.7071068
## panicle.florets 0.7071068 0.7071068

res.scaled

## Standard deviations (1, .., p=2):
## [1] 1.1933258 0.7589292
##
## Rotation (n x k) = (2 x 2):
## PC1 PC2
## panicle.length 0.7071068 -0.7071068
## panicle.florets 0.7071068 0.7071068
In fact, a principal component analysis based on the correlation matrix of two variables always yields the same eigenvectors. Moreover, if the correlation between the two variables is $r$, the eigenvalues are always $1 + r$ and $1 - r$. The mechanism can be understood from the derivation shown below.
Assume that the correlation matrix of the two variables is

$$\mathbf{R} = \begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix}$$

The values of $\lambda$ that make the eigen (characteristic) equation zero satisfy

$$|\mathbf{R} - \lambda \mathbf{I}| = 0 \iff (1-\lambda)^2 - r^2 = 0$$

which is solved by

$$\lambda_1 = 1 + r, \qquad \lambda_2 = 1 - r$$

For the eigenvalue $\lambda_1$, the eigenvector satisfies $\mathbf{R}\mathbf{u}_1 = \lambda_1\mathbf{u}_1$. That is,

$$u_{11} + r u_{12} = (1 + r) u_{11}$$
$$r u_{11} + u_{12} = (1 + r) u_{12}$$

Solving these under the unit-length constraint gives

$$u_{11} = u_{12} = \frac{1}{\sqrt{2}} \approx 0.71$$

Similarly, the eigenvector $\mathbf{u}_2$ for the eigenvalue $\lambda_2$ is obtained as

$$u_{21} = -\frac{1}{\sqrt{2}}, \qquad u_{22} = \frac{1}{\sqrt{2}}$$
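This closed-form result is easy to verify numerically. The correlation value below is an arbitrary illustrative choice (close to the correlation seen for panicle length and floret number above).

```r
# Check: for a 2 x 2 correlation matrix the eigenvalues are 1 + r and 1 - r,
# and every eigenvector element has absolute value 1 / sqrt(2).
r <- 0.424
R <- matrix(c(1, r, r, 1), 2, 2)
eig <- eigen(R)
eig$values       # 1 + r and 1 - r
abs(eig$vectors) # all entries equal to 1 / sqrt(2), about 0.71
```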
Application to multivariate data
So far, I have explained principal component analysis using examples with two variables. In most real applications, however, data consisting of a large number of variables are analyzed. Here, analyzing data that consist of seven variables, we will explain how to determine the number of principal components and how to interpret their meaning.

First, extract the seven variables (leaf.length, leaf.width, plant.height, panicle.number, panicle.length, seed.length, seed.width).

# multivariate (>3) analysis
mydata <- data.frame(
    leaf.length = alldata$Flag.leaf.length,
    leaf.width = alldata$Flag.leaf.width,
    plant.height = alldata$Plant.height,
    panicle.number = alldata$Panicle.number,
    panicle.length = alldata$Panicle.length,
    seed.length = alldata$Seed.length,
    seed.width = alldata$Seed.width
)
missing <- apply(is.na(mydata), 1, sum) > 0
mydata <- mydata[!missing, ]
Let's perform principal component analysis based on the correlation matrix.

# PCA based on a correlation matrix
res <- prcomp(mydata, scale = T)
summary(res)

## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.5626 1.2797 1.0585 0.77419 0.7251 0.64540 0.50854
## Proportion of Variance 0.3488 0.2339 0.1601 0.08562 0.0751 0.05951 0.03694
## Cumulative Proportion 0.3488 0.5827 0.7428 0.82844 0.9035 0.96306 1.00000
plot(res)
While seven principal components are calculated for data consisting of seven variables, how many of them should be used to summarize the data? Various methods have been proposed for determining the number of effective principal components; here I present some simple rules.

1. Adopt the number of principal components whose cumulative contribution exceeds a pre-specified percentage. 70% to 90% is often used as the threshold.
2. Adopt the principal components whose contribution exceeds the average explanatory power per original variable. When the number of variables is q, this means adopting the components whose contribution exceeds 1/q.
3. In the case of a correlation matrix, the above rule amounts to adopting the principal components whose eigenvalue exceeds 1. This criterion is often too strict, however; a threshold of about 0.7 has also been reported to be appropriate.
4. In the graph of the eigenvalues, take as the number of components the point where the decline changes from abrupt to gentle.
Assuming the percentage for the first rule is 80%, the first four principal components, with a cumulative contribution of 82.8%, are selected. Based on the second rule, the first three principal components, whose contributions exceed 1/7 = 14.3%, are selected. The third rule gives the same answer (if the threshold is instead an eigenvalue of 0.7, the first five principal components are selected). Finally, under the fourth rule, the eigenvalue decreases rapidly up to the fourth principal component and then decreases gradually, so the first four principal components are chosen. Combining the above, the first three or four principal components are considered an appropriate number.
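The first three rules can be applied mechanically to the standard deviations reported by summary(res) above. The sketch below re-enters those values by hand (so it does not depend on the fitted object) and reproduces the counts just discussed.

```r
# Apply rules 1-3 to the standard deviations from summary(res); q = 7 variables.
sdev <- c(1.5626, 1.2797, 1.0585, 0.77419, 0.7251, 0.64540, 0.50854)
prop <- sdev^2 / sum(sdev^2)    # contributions (proportion of variance)
which(cumsum(prop) >= 0.8)[1]   # rule 1 (80% threshold): 4 components
sum(prop > 1 / 7)               # rule 2 (contribution > 1/q): 3 components
sum(sdev^2 > 1)                 # rule 3 (eigenvalue > 1): 3 components
```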
Let's draw scatter plots of the first four principal components, coloring the points by subpopulation (Sub.population) in order to see the relationship with the genetic structure.

# scatter plot of principal component scores
subpop <- alldata$Sub.population[!missing]
op <- par(mfrow = c(1, 2))
plot(res$x[, 1:2], col = as.numeric(subpop))
legend("topleft", levels(subpop), col = 1:nlevels(subpop), pch = 1)
plot(res$x[, 3:4], col = as.numeric(subpop))
par(op)
# draw an interactive 3D scatter plot (requires the plotly package)
library(plotly)
df <- data.frame(subpop = subpop, res$x[, 1:3])
plot_ly(data = df, x = ~PC1, y = ~PC2, z = ~PC3, color = ~subpop,
        type = "scatter3d", mode = "markers")
When you draw the scatter plots, you can see that dots of the same color are plotted close together. This suggests that there is some relationship between the principal component scores and the genetic background.

What kind of variation do the first four principal components capture? First, let's look at their eigenvectors.

# understand the meaning of components (eigenvectors)
res$rotation[, 1:4]
## PC1 PC2 PC3 PC4
## leaf.length 0.46468878 -0.0863622 0.33735685 -0.28604795
## leaf.width 0.26873998 -0.5447022 0.19011509 -0.36565484
## plant.height 0.43550572 0.2369710 0.35223063 0.55790981
## panicle.number -0.03342277 0.6902669 0.07073948 -0.15465539
## panicle.length 0.56777431 0.1140531 0.01542783 0.07533158
## seed.length 0.27961838 -0.2343565 -0.67403236 0.42404985
## seed.width -0.34714081 -0.3086850 0.51615742 0.51361303
Looking at the eigenvectors, the first principal component has positive weights for all variables except panicle number (panicle.number) and seed width (seed.width); it can be interpreted as a variable explaining overall "size" excluding seed width. In the second principal component, panicle number has a relatively large weight.

Interpreting the meaning of a principal component from the raw coefficients in this way is difficult. To visualize it, draw a biplot.

# understand the meaning of components (biplot)
op <- par(mfrow = c(1, 2))
biplot(res, choices = 1:2)
biplot(res, choices = 3:4)
par(op)
The length and orientation of the arrow labeled with a variable name indicate the relationship between each principal component and that variable. For example, from the left figure, samples with large scores on the first principal component (for example, 174 and 230) have larger values of plant height, panicle length, and flag leaf length. Among the variables measured as lengths, only seed width points in the opposite direction, suggesting that the first principal component takes a smaller score when the seed is wider. The arrow of panicle number is almost orthogonal to the first principal component axis, suggesting that it is not involved in that component. The third and fourth principal components can be interpreted in the same way.

There is a statistic called the factor loading, which represents the relationship between an original variable and a principal component. The factor loading is the correlation coefficient between the values of the original variable and the principal component scores. An absolute value close to 1 indicates a strong relationship, while a value close to 0 indicates that the relationship is weak or absent.

Now let's calculate the factor loadings of the variables.
# calculate factor loadings
factor.loadings <- cor(mydata, res$x[, 1:4])
factor.loadings

## PC1 PC2 PC3 PC4
## leaf.length 0.72611038 -0.1105166 0.35710695 -0.22145439
## leaf.width 0.41992598 -0.6970483 0.20124513 -0.28308496
## plant.height 0.68050970 0.3032487 0.37285150 0.43192611
## panicle.number -0.05222553 0.8833255 0.07488083 -0.11973208
## panicle.length 0.88718910 0.1459523 0.01633103 0.05832068
## seed.length 0.43692427 -0.2999030 -0.71349267 0.32829357
## seed.width -0.54243303 -0.3950201 0.54637516 0.39763215
Let's draw graphs of this result. Since factor loadings always fit within a circle of radius 1, they can be drawn as follows.

# draw factor loadings
theta <- seq(0, 2 * pi, 0.1)
op <- par(mfrow = c(1, 2))
# plot pc1 vs. 2
plot(factor.loadings[, 1:2], xlim = c(-1, 1), ylim = c(-1, 1), pch = 4)
text(factor.loadings[, 1:2], rownames(factor.loadings), col = "red")
lines(cos(theta), sin(theta), col = "gray")
abline(v = 0, h = 0)
# plot pc3 vs. 4
plot(factor.loadings[, 3:4], xlim = c(-1, 1), ylim = c(-1, 1), pch = 4)
text(factor.loadings[, 3:4], rownames(factor.loadings), col = "red")
lines(cos(theta), sin(theta), col = "gray")
abline(v = 0, h = 0)
par(op)
In a principal component analysis based on the correlation matrix, the factor loadings can also be calculated as follows.

factor.loadings <- t(res$sdev * t(res$rotation))[, 1:4]
head(factor.loadings, 3)

## PC1 PC2 PC3 PC4
## leaf.length 0.7261104 -0.1105166 0.3571069 -0.2214544
## leaf.width 0.4199260 -0.6970483 0.2012451 -0.2830850
## plant.height 0.6805097 0.3032487 0.3728515 0.4319261
Finally,let’sperformprincipalcomponentanalysisofmarkergenotypedata.First,weextractmarkergenotypedata.
# prepare multivariate data
mydata <- alldata[, 50:ncol(alldata)]
dim(mydata)
## [1] 374 1311
head(mydata,3)[,1:5]
##   id1000223 id1000556 id1000673 id1000830 id1000955
## 1         2         0         0         0         0
## 2         0         2         0         2         2
## 3         0         2         2         2         2
This dataset has a very large number of variables (1,311). Another feature is that the number of variables is greater than the number of samples. Let's use these data to perform principal component analysis based on the variance-covariance matrix.
# perform PCA
res.pca <- prcomp(mydata)
# summary(res.pca)
plot(res.pca)
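A side note on this p > n situation: when the number of variables (1,311) exceeds the number of samples (374), the centered data cloud spans at most n − 1 dimensions, so at most n − 1 principal components have nonzero variance. A small sketch with simulated data (hypothetical numbers, chosen only for illustration):

```r
# Sketch: with more variables (p) than samples (n), the covariance matrix
# has rank at most n - 1, so PCA yields at most n - 1 nonzero standard
# deviations.
set.seed(1)
n <- 5
p <- 20
x <- matrix(rnorm(n * p), nrow = n)
res <- prcomp(x)
sum(res$sdev > 1e-8)              # at most n - 1 = 4
```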
According to the fourth rule shown above, it can be judged that it is better to use the first four principal components (the other rules make the number of principal components too large).
Let's draw scatter plots of the first four principal components.
# plot principal component scores
subpop <- alldata$Sub.population
op <- par(mfrow = c(1,2))
plot(res.pca$x[,1:2], col = as.numeric(subpop))
plot(res.pca$x[,3:4], col = as.numeric(subpop))
legend(-10, 20, levels(subpop), col = 1:nlevels(subpop), pch = 1, cex = 0.5)
par(op)
The varieties can be divided into five groups (plus one group) based on the scores of the first to fourth principal components.
Let's also draw a 3D scatter plot.
# plot them with plotly (requires the plotly package)
library(plotly)
df <- data.frame(subpop = subpop, res.pca$x[,1:4])
plot_ly(data = df, x = ~PC1, y = ~PC2, z = ~PC3, color = ~subpop,
        type = "scatter3d", mode = "markers")
plot_ly(data = df, x = ~PC2, y = ~PC3, z = ~PC4, color = ~subpop,
        type = "scatter3d", mode = "markers")
Finally, let's compare PC1-4 included in alldata with the first to fourth principal components calculated here.
# correlation between PC1-4 in alldata and PC1-4 just calculated
cor(alldata[,c("PC1","PC2","PC3","PC4")], res.pca$x[,1:4])
##              PC1         PC2         PC3         PC4
## PC1  0.988907541 -0.11988231 -0.03045304 -0.03589106
## PC2  0.006737731 -0.07579808  0.96846220 -0.18191250
## PC3 -0.129282100 -0.97046613 -0.08514082 -0.03141488
## PC4  0.012470575 -0.02915991  0.16366284  0.87422607
Although the match is not perfect because the calculation methods differ slightly, PC1-4 contained in alldata (rows) and the first to fourth principal components calculated here (columns) provide almost the same information.
Multidimensional scaling method
Although it is not possible to directly measure the characteristics of samples, it may be possible to assess differences in characteristics between samples. In other words, although the characteristics of samples cannot be measured as points in multidimensional space, it may be possible to measure distances among samples.
For example, in population genetics, genetic distances between populations are calculated based on polymorphisms of genetic markers. In this case, although the distances between the populations are known, the genetic characteristics of the populations are not measured as multivariate data. Another example is the similarity of human impressions of certain objects. For instance, suppose you show photographs of flowers of many varieties of pansy to 100 people and ask them to classify the flowers into any number of groups. When the flowers of two varieties are classified into the same group by many people, the flowers of the two varieties are similar. Conversely, if the flowers of two varieties are classified into different groups by many people, they were judged as dissimilar. Using such data, the number of people who classified two varieties into different groups can be taken as the distance between those varieties. Again, in this case, although we do not characterize the flower of each variety in the multidimensional space of human impressions, we can measure the differences in human impressions among the varieties as distances.
Here, I will outline methods for summarizing the variation of samples with low-dimensional variables based on data measured as distances. There are a variety of such methods; here I will introduce classical multidimensional scaling.
In what follows, the distances among rice accessions (lines and varieties) are calculated based on the marker genotype data, and an analysis by multidimensional scaling is performed based on the resulting distance matrix.
First, let's extract the marker genotype data and calculate the distance matrix based on it.
# extract marker data
mydata <- alldata[, 50:ncol(alldata)]
D <- dist(mydata)
Let's perform multidimensional scaling based on the calculated distance matrix.
# perform MDS
res.mds <- cmdscale(D, k = 10, eig = T)
# res.mds
Eigenvalues are calculated as in principal component analysis. In addition, the contribution and the cumulative contribution can be calculated based on the eigenvalues.
# eigenvalues and contributions
res.mds$eig[1:10]
##  [1] 115029.059  40849.407  20648.953  11530.683  10069.314   6591.745
##  [7]   4996.271   4819.066   3932.298   3581.676
res.mds$eig[1:10] / sum(res.mds$eig)
##  [1] 0.310125555 0.110132562 0.055670871 0.031087446 0.027147501 0.017771758
##  [7] 0.013470260 0.012992505 0.010601722 0.009656423
cumsum(res.mds$eig[1:10]) / sum(res.mds$eig)
## [1] 0.3101256 0.4202581 0.4759290 0.5070164 0.5341639 0.5519357 0.5654060
## [8] 0.5783985 0.5890002 0.5986566
barplot(res.mds$eig[1:10])
According to the rule based on the bar chart of eigenvalues (shown in the figure above), as in principal component analysis, the variation contained in the data should be represented in four dimensions.
Now let's draw scatter plots that show the arrangement of the samples in the four-dimensional space.
# draw the result of MDS
subpop <- alldata$Sub.population
op <- par(mfrow = c(1,2))
plot(res.mds$points[,1:2], col = as.numeric(subpop))
plot(res.mds$points[,3:4], col = as.numeric(subpop))
legend(5, -10, levels(subpop), col = 1:nlevels(subpop), pch = 1, cex = 0.5)
par(op)
Let's draw a three-dimensional scatter plot, as we did for the principal component analysis.
# draw the result of MDS with plotly
df <- data.frame(subpop = subpop, res.mds$points[,1:4])
plot_ly(data = df, x = ~X1, y = ~X2, z = ~X3, color = ~subpop,
        type = "scatter3d", mode = "markers")
plot_ly(data = df, x = ~X2, y = ~X3, z = ~X4, color = ~subpop,
        type = "scatter3d", mode = "markers")
The results of multidimensional scaling look very similar to the results of principal component analysis. Let's calculate the correlation between the coordinates obtained by multidimensional scaling and the principal component scores.
## correlation between PC1-4 and scores in MDS
cor(res.pca$x[,1:4], res.mds$points[,1:4])
##              [,1]          [,2]          [,3]          [,4]
## PC1 -1.000000e+00  6.392355e-15 -3.624340e-16  1.685324e-16
## PC2  1.946379e-14  1.000000e+00  7.058496e-16 -2.099081e-16
## PC3 -4.655895e-16  8.457081e-16 -1.000000e+00 -9.848098e-16
## PC4 -4.493208e-16 -4.497497e-16  8.746548e-16 -1.000000e+00
You can see that the first four principal components (rows) and the first four dimensions of the multidimensional scaling (columns) are perfectly correlated (-1 or 1) with each other. Viewed from a slightly different angle, this means that even if the values of the original variables are not given, the Euclidean distance matrix computed from those values can be used to perform the same analysis as principal component analysis.
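This equivalence can be sketched on any small dataset: classical MDS applied to the Euclidean distance matrix reproduces the principal component scores up to the sign of each axis. A toy check with R's built-in iris data (not the lecture's rice data):

```r
# Sketch: classical MDS on Euclidean distances reproduces the scores of a
# covariance-matrix PCA, up to an arbitrary sign per axis.
x <- as.matrix(iris[, 1:4])
pca <- prcomp(x)                       # PCA on the raw (unscaled) data
mds <- cmdscale(dist(x), k = 2)        # classical MDS, first 2 dimensions
abs(diag(cor(pca$x[, 1:2], mds)))      # both correlations equal 1
```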
Given a distance matrix, being able to represent sample variation with fewer variables, as in principal component analysis, is one of the most useful aspects of multidimensional scaling. Let me explain this point with a specific example. In the second lecture, it was shown that there is a strong relationship between the principal component scores (PC1-4), which represent the genetic background, and plant height (Plant.height). Using multidimensional scaling, a variable with the same variation as the principal component scores can be calculated from the relationships among accessions (varieties and lines). Based on that, you can check whether the genetic relationship affects plant height. In other words, by combining multidimensional scaling with another analysis (e.g., multiple regression analysis), you can examine whether the distance relationships among samples are related to the variation in another characteristic of those samples.
Finally, let's perform a concrete calculation and check the above-mentioned points.
# prepare data
mydata <- data.frame(
    plant.height = alldata$Plant.height,
    res.mds$points[,1:4]
)
mydata <- na.omit(mydata)
# analyze data
model <- lm(plant.height ~ ., data = mydata)
anova(model)
## Analysis of Variance Table
## 
## Response: plant.height
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## X1          1  31015 31015.5  97.663 < 2.2e-16 ***
## X2          1   4351  4351.2  13.701 0.0002496 ***
## X3          1  11370 11370.2  35.803 5.494e-09 ***
## X4          1   5598  5597.8  17.627 3.429e-05 ***
## Residuals 343 108929   317.6                      
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(mydata$plant.height, predict(model))
Formulation of the multidimensional scaling method
Here, let's consider the problem of finding the arrangement of $n$ samples distributed in $q$-dimensional space from the Euclidean distances between samples. Let the coordinates of the $i$-th sample in the $q$-dimensional space be represented by a column vector

$$\mathbf{x}_i = (x_{i1}, \ldots, x_{iq})^T$$

Using the $n \times q$ matrix $\mathbf{X} = (\mathbf{x}_1, \ldots, \mathbf{x}_n)^T$ obtained by bundling and transposing the column vectors of the $n$ samples, the matrix whose elements are the inner products between the samples can be expressed as

$$\mathbf{B} = \mathbf{X}\mathbf{X}^T$$

Here, the $(i, j)$ element of the inner product matrix $\mathbf{B}$ is the inner product of the $i$-th and $j$-th samples. That is,

$$b_{ij} = \mathbf{x}_i^T \mathbf{x}_j = \sum_{k=1}^{q} x_{ik} x_{jk}$$
The Euclidean distance $d_{ij}$ between the $i$-th sample and the $j$-th sample can then be expressed as

$$d_{ij}^2 = \sum_{k=1}^{q} (x_{ik} - x_{jk})^2 = \sum_{k=1}^{q} x_{ik}^2 + \sum_{k=1}^{q} x_{jk}^2 - 2 \sum_{k=1}^{q} x_{ik} x_{jk} = b_{ii} + b_{jj} - 2 b_{ij}$$
If the center of gravity of the $\mathbf{x}_i$ is located at the origin, then

$$\sum_{i=1}^{n} x_{ik} = 0$$
If $b_{ij}$ is summed over $i$ (and likewise over $j$),

$$\sum_{i=1}^{n} b_{ij} = \sum_{i=1}^{n} \sum_{k=1}^{q} x_{ik} x_{jk} = \sum_{k=1}^{q} x_{jk} \sum_{i=1}^{n} x_{ik} = 0$$
Therefore, the following hold:

$$\sum_{i=1}^{n} d_{ij}^2 = \sum_{i=1}^{n} (b_{ii} + b_{jj} - 2 b_{ij}) = \sum_{i=1}^{n} b_{ii} + n b_{jj}$$

$$\sum_{j=1}^{n} d_{ij}^2 = \sum_{j=1}^{n} (b_{ii} + b_{jj} - 2 b_{ij}) = \sum_{j=1}^{n} b_{jj} + n b_{ii} = \sum_{i=1}^{n} b_{ii} + n b_{ii}$$

$$\sum_{i=1}^{n} \sum_{j=1}^{n} d_{ij}^2 = \sum_{i=1}^{n} \sum_{j=1}^{n} (b_{ii} + b_{jj} - 2 b_{ij}) = n \sum_{i=1}^{n} b_{ii} + n \sum_{j=1}^{n} b_{jj} = 2 n \sum_{i=1}^{n} b_{ii}$$
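These summation identities are easy to verify numerically. A small sketch with a random, column-centered data matrix (toy data, not the rice markers):

```r
# Sketch: numeric check of the identities for a column-centered X:
#   sum_i d_ij^2 = sum_i b_ii + n * b_jj   (for each j)
#   sum_i sum_j d_ij^2 = 2 * n * sum_i b_ii
set.seed(1)
n <- 6
X <- scale(matrix(rnorm(n * 3), nrow = n), scale = FALSE)  # center columns
B <- X %*% t(X)                     # inner product matrix
D2 <- as.matrix(dist(X))^2          # squared Euclidean distances
all.equal(colSums(D2), sum(diag(B)) + n * diag(B), check.attributes = FALSE)
all.equal(sum(D2), 2 * n * sum(diag(B)))
```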
Using the above equations, the following relationship holds:

$$b_{ij} = -\frac{1}{2} \left( d_{ij}^2 - d_{i.}^2 - d_{.j}^2 + d_{..}^2 \right) \tag{6}$$

where

$$d_{i.}^2 = \frac{1}{n} \sum_{j=1}^{n} d_{ij}^2 = \frac{1}{n} \sum_{i=1}^{n} b_{ii} + b_{ii}$$

$$d_{.j}^2 = \frac{1}{n} \sum_{i=1}^{n} d_{ij}^2 = \frac{1}{n} \sum_{i=1}^{n} b_{ii} + b_{jj}$$

$$d_{..}^2 = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} d_{ij}^2 = \frac{2}{n} \sum_{i=1}^{n} b_{ii}$$
Equation (6) means that if the distance matrix $\mathbf{D} = \{d_{ij}\}$ is given, the inner product matrix $\mathbf{B}$ can be calculated back from $\mathbf{D}$. The calculation procedure is as follows. First calculate the squares $d_{ij}^2$ of the elements of the distance matrix, then calculate their row means $d_{i.}^2$, column means $d_{.j}^2$, and total mean $d_{..}^2$. Finally, the inner product matrix $\mathbf{B}$ is obtained by computing each $b_{ij}$ according to equation (6).
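This double-centering step can be sanity-checked on a small example: starting from column-centered data X, the B recovered from the distance matrix should equal XXᵀ exactly. A sketch with toy data:

```r
# Sketch: double-centering the squared distance matrix (equation 6)
# recovers the inner product matrix B = X t(X) of column-centered data.
set.seed(2)
X <- scale(matrix(rnorm(5 * 2), nrow = 5), scale = FALSE)  # centered data
D2 <- as.matrix(dist(X))^2
B <- -0.5 * (sweep(sweep(D2, 1, rowMeans(D2)), 2, colMeans(D2)) + mean(D2))
all.equal(B, X %*% t(X), check.attributes = FALSE)         # TRUE
```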
Once the inner product matrix $\mathbf{B}$ is obtained, it suffices to find $\mathbf{X}$ that satisfies $\mathbf{B} = \mathbf{X}\mathbf{X}^T$. To find $\mathbf{X}$, perform a spectral decomposition of the matrix $\mathbf{B}$, as shown below.

Suppose the symmetric matrix $\mathbf{B}$ has $q$ eigenvalues and eigenvectors. That is,

$$\mathbf{B}\mathbf{u}_1 = \lambda_1 \mathbf{u}_1, \ldots, \mathbf{B}\mathbf{u}_q = \lambda_q \mathbf{u}_q$$

Bundling these expressions, we obtain

$$\mathbf{B}(\mathbf{u}_1, \ldots, \mathbf{u}_q) = (\mathbf{u}_1, \ldots, \mathbf{u}_q) \begin{pmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_q \end{pmatrix}$$

Let $\mathbf{U} = (\mathbf{u}_1, \ldots, \mathbf{u}_q)$ and $\mathbf{\Lambda} = \mathrm{diag}(\lambda_1, \ldots, \lambda_q)$, a diagonal matrix with diagonal elements $\lambda_1, \ldots, \lambda_q$; then the above equation becomes

$$\mathbf{B}\mathbf{U} = \mathbf{U}\mathbf{\Lambda} \tag{7}$$

Since the eigenvectors are unit vectors ($\mathbf{u}_i^T \mathbf{u}_i = 1$) and mutually orthogonal ($\mathbf{u}_i^T \mathbf{u}_j = 0$), the matrix $\mathbf{U}$ satisfies

$$\mathbf{U}^T \mathbf{U} = \mathbf{I} \Leftrightarrow \mathbf{U}^T = \mathbf{U}^{-1} \Leftrightarrow \mathbf{U}\mathbf{U}^T = \mathbf{I} \tag{8}$$

That is, the inverse of the matrix $\mathbf{U}$ is simply the transpose of $\mathbf{U}$.

From equations (7) and (8), the matrix $\mathbf{B}$ is

$$\mathbf{B} = \mathbf{U}\mathbf{\Lambda}\mathbf{U}^T$$

Setting $\mathbf{u}_i^* = \lambda_i^{1/2} \mathbf{u}_i$, the above equation becomes

$$\mathbf{B} = \mathbf{U}^* \mathbf{U}^{*T}$$

where $\mathbf{U}^* = (\mathbf{u}_1^*, \ldots, \mathbf{u}_q^*)$. Therefore, an $\mathbf{X}$ that satisfies $\mathbf{B} = \mathbf{X}\mathbf{X}^T$ is

$$\mathbf{X} = \mathbf{U}\mathbf{\Lambda}^{1/2}$$

where $\mathbf{\Lambda}^{1/2} = \mathrm{diag}(\lambda_1^{1/2}, \ldots, \lambda_q^{1/2})$.
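The decomposition above can be sketched in a few lines of R: eigen() on a symmetric matrix returns orthonormal eigenvectors, so the inverse of U is its transpose, B is recovered as UΛUᵀ, and scaling the eigenvectors by the square roots of the eigenvalues gives an X with XXᵀ = B. A toy example with a random matrix (not the lecture data):

```r
# Sketch: spectral decomposition of a symmetric positive semi-definite
# matrix B, and reconstruction of an X satisfying B = X t(X).
set.seed(3)
A <- matrix(rnorm(16), 4, 4)
B <- A %*% t(A)                        # symmetric, non-negative eigenvalues
eig <- eigen(B)
U <- eig$vectors                       # orthonormal: t(U) %*% U = I
L <- diag(eig$values)
max(abs(U %*% L %*% t(U) - B))         # ~ 0 : B = U Lambda t(U)
X <- U %*% diag(sqrt(eig$values))      # X = U Lambda^(1/2)
max(abs(X %*% t(X) - B))               # ~ 0
```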
Thus, classical multidimensional scaling eventually reduces to the eigenvalue problem of the inner product matrix $\mathbf{B}$ obtained from the distance matrix $\mathbf{D}$. By solving the eigenvalue problem, we can find the arrangement of the $n$ samples in the $q$-dimensional space.
Now let's carry out classical multidimensional scaling by following the procedure described above, without using the function cmdscale.
First, prepare the data. As before, we use the marker genotype data. Calculate the Euclidean distances using the function dist and prepare the distance matrix $\mathbf{D}$.
# prepare data again
mydata <- alldata[, 50:ncol(alldata)]
D <- dist(mydata)
First, each element of the distance matrix $\mathbf{D}$ is squared. Next, find the row means, column means, and total mean of the squared elements. The row and column means can easily be calculated using the function apply. Finally, calculate the inner product matrix $\mathbf{B}$ according to equation (6). The R code is as follows.
# obtain B matrix
D2 <- as.matrix(D^2)
D2i. <- apply(D2, 1, mean)
D2.j <- apply(D2, 2, mean)
D2.. <- mean(D2)
B <- -0.5 * (sweep(sweep(D2, 1, D2i.), 2, D2.j) + D2..)
Perform the eigenvalue decomposition of the inner product matrix B, and calculate the coordinate values as $\mathbf{X} = \mathbf{U}\mathbf{\Lambda}^{1/2}$, as derived above.
# eigenvalue decomposition of B matrix
eig <- eigen(B)
eval <- eig$values[1:10]
evec <- eig$vectors[,1:10]
points <- evec * rep(sqrt(eval), each = nrow(evec))
Let's compare this with the result calculated using the function cmdscale. You will see that the results agree (the sign of each axis is arbitrary, so some columns may be flipped).
# compare results
head(points, 4)
##             [,1]       [,2]       [,3]        [,4]       [,5]        [,6]
## [1,]  20.7513541 -14.528382   0.326152   0.5387409 -0.2488285   2.6312328
## [2,] -22.7959000  -2.141193  11.735408  -0.7558878 -0.8349865   0.8724051
## [3,] -20.7507303  -1.412188 -10.140763   4.6532487  0.6355251   2.0224133
## [4,]   0.4567869   1.313713  -9.419251 -23.2895091 -3.0974254 -11.0658283
##            [,7]       [,8]       [,9]      [,10]
## [1,]  2.3738573  2.4469507  0.7739924 -0.2665672
## [2,] -8.4519348  4.1096465  1.4409389  0.1896848
## [3,] -1.9286834 -0.6098694 -0.5289895 -1.8454303
## [4,]  0.2631901  1.7704509 -0.1214236 -1.7151304
head(res.mds$points, 4)
##             [,1]       [,2]       [,3]        [,4]       [,5]        [,6]
## [1,]  20.7513541 -14.528382   0.326152   0.5387409 -0.2488285   2.6312328
## [2,] -22.7959000  -2.141193  11.735408  -0.7558878 -0.8349865   0.8724051
## [3,] -20.7507303  -1.412188 -10.140763   4.6532487  0.6355251   2.0224133
## [4,]   0.4567869   1.313713  -9.419251 -23.2895091 -3.0974254 -11.0658283
##            [,7]       [,8]       [,9]      [,10]
## [1,] -2.3738573  2.4469507  0.7739924 -0.2665672
## [2,]  8.4519348  4.1096465  1.4409389  0.1896848
## [3,]  1.9286834 -0.6098694 -0.5289895 -1.8454303
## [4,] -0.2631901  1.7704509 -0.1214236 -1.7151304
Finally, let's draw the graphs. You will see the same picture as Figure 15. Let's check it.
# draw graph
subpop <- alldata$Sub.population
op <- par(mfrow = c(1,2))
plot(points[,1:2], col = as.numeric(subpop))
plot(points[,3:4], col = as.numeric(subpop))
legend(5, -10, levels(subpop), col = 1:nlevels(subpop), pch = 1, cex = 0.5)
par(op)
Report assignment

1. Consider a principal component analysis based on panicle.number.per.plant, panicle.length, primary.panicle.branch.number, seed.number.per.panicle, and florets.per.panicle. Answer whether this principal component analysis should be done based on the covariance matrix or on the correlation matrix.
2. Perform the principal component analysis in 1 and draw a diagram showing the magnitude of the variance of each principal component.
3. How many principal components should be selected if each selected component's contribution proportion must exceed the average explanatory power per original variable (i.e., if the number of variables is q, the contribution proportion exceeds 1/q)?
4. For the principal component analysis in 1, draw a scatter plot of the principal component scores of the first and second principal components. In doing so, color each subpopulation based on the variable Sub.population.
5. Based on the figure in 4, answer what kinds of values (large or small, positive or negative, etc.) of the first and second principal component scores TEJ and TRJ take.
6. Draw a biplot for the first and second principal components.
7. Calculate the factor loadings of the first and second principal components and draw a figure of the factor loadings.
8. Based on the figures in 6 and 7, answer for each trait whether it has a large or small value when the first principal component score is large.
Submission method:

• Create a report as a PDF file and submit it to ITC-LMS.
• If you cannot submit your report to ITC-LMS for some reason, send the report to [email protected]
• Make sure to write your affiliation, student number, and name at the beginning of the report.
• The deadline for submission is May 15th.