handbooks for genebanks no. 9 statistical analysis · main r self-extracting executable file in it:...
TRANSCRIPT
Handbooks for Genebanks No. 9
Statistical Analysis for Plant Genetic Resources: Clustering and indices in made simple
Mikkel Grum and Fredrick Atieno
Handbooks for Genebanks No. 9
Mikkel Grum and Fredrick Atieno
Bioversity, c/o ICRAF, P.O. Box 30677, 00100 Nairobi, Kenya
Statistical Analysis for Plant Genetic Resources: Clustering and indices in made simple
ii
Handbooks for Genebanks No. 9
Bioversity International is an independent international scientific organization that seeks to improve the well-being of present and future generations of people by enhancing conservation and the deployment of agricultural biodiversity on farms and in forests. It is one of 15 centres supported by the Consultative Group on International Agricultural Research (CGIAR), an association of public and private members who support efforts to mobilize cutting-edge science to reduce hunger and poverty, improve human nutrition and health, and protect the environment. Bioversity has its headquarters in Maccarese, near Rome, Italy, with offices in more than 20 other countries worldwide. The Institute operates through four programmes: Diversity for Livelihoods; Understanding and Managing Biodiversity; Global Partnerships; and Commodities for Livelihoods. The international status of Bioversity is conferred under an Establishment Agreement which, by January 2007, had been signed by the Governments of Algeria, Australia, Belgium, Benin, Bolivia, Brazil, Burkina Faso, Cameroon, Chile, China, Congo, Costa Rica, Côte d’Ivoire, Cyprus, Czech Republic, Denmark, Ecuador, Egypt, Greece, Guinea, Hungary, India, Indonesia, Iran, Israel, Italy, Jordan, Kenya, Malaysia, Mali, Mauritania, Morocco, Norway, Pakistan, Panama, Peru, Poland, Portugal, Romania, Russia, Senegal, Slovakia, Sudan, Switzerland, Syria, Tunisia, Turkey, Uganda and Ukraine. Financial support for Bioversity’s research is provided by more than 150 donors, including governments, private foundations and international organizations. For details of donors and research activities please see Bioversity’s Annual Reports, which are available in printed form on request from [email protected] or from Bioversity’s Web site (www.bioversityinternational.org). The geographical designations employed and the presentation of material in this publication do not imply the expression of any opinion whatsoever on the part of Bioversity or the CGIAR concerning the legal status of any country, territory, city or area or its authorities, or concerning the delimitation of its frontiers or boundaries. Similarly, the views expressed are those of the authors and do not necessarily reflect the views of these organizations. Mention of a proprietary name does not constitute endorsement of the product and is given only for information. The rights of the owners of any trademarks used in the text are acknowledged.
Citation: Grum M and Atieno F. 2007. Statistical analysis for plant genetic resources: clustering and indices in R made simple. Handbooks for Genebanks, No. 9. Bioversity International, Rome, Italy.
ISBN: 978-92-9043-735-2
Bioversity InternationalVia dei Tre Denari, 472/a00057 Maccarese Rome, Italy
© Bioversity International, 2007
All rights reserved. Reproduction and dissemination of material in this information product for educational or other non-commercial purposes are authorized without any prior written permission from the copyright holders provided the source is fully acknowledged. Modification, translation or reproduction of material in this information product for resale or other commercial purposes is prohibited without written permission of the copyright holders. Applications for such permission should be addressed to the Information Dissemination and Communications Manager, Bioversity International, Via dei Tre Denari 472/a, 00057 Maccarese, Rome, Italy. E-mail: [email protected]
iii
Statistical Analysis for Plant Genetic Resources
iii
TAble oF coNTeNTS
Introduction 1
Chapter1DownloadingandinstallingR 3
Installation 5
Chapter2HierarchicalClustering 8
Basics 8
Interpretingahierarchicalclusterdendrogram 12
Importingyourowndata 15
Lightdatavalidation 17
Datatypes–criticaldistinctions 19
Savingyourwork,exitingRandgettingback 21
Selectingasubsetofyourdatatoanalyse 22
Clusteranalysiswithmixeddatatypes 23
Howgoodistheclusterdendrogram? 24
Pruningdendrograms 26
Advancedclustering 29
References 30
Chapter3Diversityindices 31
Richness,Shannon-WeaverandSimpson 31
Rollingyourownfunction 34
Dissimilaritymeasuresforestimatingdiversity 36
Summary 38
References 38
Chapter4Troubleshooting 39
iv
Handbooks for Genebanks No. 9
�
INTRoDUcTIoNThis handbook aims to give researchers toolsandmethodstoanalysemathematicallycomplexmorphologicalcharacterizationdata.
Risapowerfulsoftwarelanguageandenvironmentforstatisticalcomputingandgraphics,developedby volunteers from around the world, workingtogetherovertheInternet.Thecoredevelopmentteamconsistsoffewerthantwentyprogrammerswho maintain and develop the general Renvironment,withcontributionsfrommanyothers.A much larger number of contributors writeindividual statisticalpackages forR that addressspecificneeds.
Statisticians and researchers write many ofthesestatisticalpackagesfortheirownresearchpurposes, ensuring that R is in the forefrontof developments in statistics. Another of R’sstrengths is the ease with which well-designedpublication-quality plots can be produced,including mathematical symbols and formulaewhereneeded.
On the down side, if you are used to point-and-click interfaces, you may find that R hasa relatively steep learning curve. As one of thedevelopers of R once put it, R is not designedtomakeeasythingssimple,buttomakedifficultthingspossible.Nonetheless,severalprojectsareworkingonmakingReasiertouse,andinterfacesshouldimproveinthenearfuture.
WewouldnotproposeRhadwenotfoundthatithasmanyfacilitiesusefultotheanalysisofplantgenetic resources data, facilities not commonlyfound in other statistical packages. We alsobelievethatwithastep-by-stepguide,providingexamples of typical analyses required in plantgeneticresourcesresearch,mostresearcherswillbeabletoimplementsuchanalysesquitesimplyontheirowndata.
IntroductionChapter 1 Downloading
and installing RInstallation
Chapter 2 Hierarchical ClusteringBasicsInterpretingahierarchical
clusterdendrogramImportingyourowndataLightdatavalidationDatatypes–critical
distinctionsSavingyourwork,exiting
RandgettingbackSelectingasubsetofyour
datatoanalyseClusteranalysiswithmixed
datatypesHowgoodisthecluster
dendrogram?PruningdendrogramsAdvancedclusteringReferences
Chapter 3 Diversity indicesRichness,Shannon-
WeaverandSimpsonRollingyourownfunctionDissimilaritymeasuresfor
estimatingdiversitySummaryReferences
Chapter 4 Troubleshooting
Intr
oducti
on
�
Handbooks for Genebanks No. 9
Thistutorialprovidessuchaguide,beginningwithsimpleexamplesthat soon demonstrate the power of R, and allows users togradually become acquainted with the more complex aspects.Justenoughexplanationofthestatisticsisgivenforreaderswithabasicunderstandingofthesubjecttounderstandtheoutputoftheanalyses.
ForbroaderandmorecompleteintroductionstoR,ormanualsonspecific aspects, or reference works, the reader is referred to thewiderangeoffreelyavailablemanualsthatcanbedownloadedbygoing to www.r-project.org and clicking on the link “Contributed”undertheheading“Documentation”(lefthandcolumnofthepage).
�
cHAPTeR � DowNloADING AND INSTAllING RR can be downloaded from the main websitehttp://www.cran.r-project.org
Under the heading Download and Install R,the Windows bullet is the hypertext link fordownloading the binary version of R for a PCrunningWindows.ThebasesubdirectoryhasthemainRself-extractingexecutablefileinit:
IntroductionChapter 1 Downloading
and installing RInstallation
Chapter 2 Hierarchical ClusteringBasicsInterpretingahierarchical
clusterdendrogramImportingyourowndataLightdatavalidationDatatypes–critical
distinctionsSavingyourwork,exiting
RandgettingbackSelectingasubsetofyour
datatoanalyseClusteranalysiswithmixed
datatypesHowgoodisthecluster
dendrogram?PruningdendrogramsAdvancedclusteringReferences
Chapter 3 Diversity indicesRichness,Shannon-
WeaverandSimpsonRollingyourownfunctionDissimilaritymeasuresfor
estimatingdiversitySummaryReferences
Chapter 4 Troubleshooting
chapte
r �
�
Handbooks for Genebanks No. 9
Clickingthehypertextlinkbasegivesthescreen:
TheR-X.X.X-win32.exefilecanbesavedtoyourlocaldesktopbya“right mouse-click” on the R-X.X.X-win32.exe hypertext link. In theexampleabove,thefileisR-2.6.1-win32.exe,butnewversionsofRarereleasedatleasteverysixmonthsandthenumberchangeswitheachrelease.Select“SaveTargetAs ...”;browsethe filesystemtofindtheDesktopandthen“Save”.
�
Statistical Analysis for Plant Genetic Resources
InstallationDoubleclickingthisdownloadedfileshouldgivethefollowingseriesofinstallationscreens:
Choose the language you would like to use during installation andpressOK.Youwillbetakentothenextscreen.
�
Handbooks for Genebanks No. 9
Select the Next button. Continue to choose the default selectionsin the installationwindow. If installation isproceedingproperly, youwillseetheinstallationscreenbelow,fromwhichyoucanselectthecomponentsofRtoinstall.However,werecommendsimplychoosingthedefaultselections.Onceinstallationisfinished,anRiconwillbeplacedontheDesktop.
ClickingNextwillgiveyou thewindowbelow,whichallowsyou toeitherchoosetocustomizethestartupornot.WerecommendyouchooseYES,forcustomizedstartup.ThiswillenableyoutoconfigureaccesstoHelpandtheInternetforinstallingcontributedpackages.
�
Statistical Analysis for Plant Genetic Resources
Choose Next, which will give you the window below to select theR interface.While it issafetogowiththedefault installationoption(MDI),werecommendSDI,whichwillgiveyouseparatewindowsforgraphicsandtheconsole,makingyourworkeasier.
InthenextwindowforRsetupchooseDefaultandpressNext,whichbringsyoutothewindowbelow,whereyouaretochoosethetypeofInternetconnectiontouse.WerecommendyouchooseInternet2,whichwillallowyoutousetheproxysettingsofyourlocalserverincaseyouareworkingbehindafirewall.
From here onwards, just press Next at every step, choosing thedefaultsuntil the installation iscomplete.ThenpressFINISH.AnRiconwillbeplacedonthedesktop.YouarenowreadytouseR.Congratulations!!
�
IntroductionChapter 1 Downloading
and installing RInstallation
Chapter 2 Hierarchical ClusteringBasicsInterpretingahierarchical
clusterdendrogramImportingyourowndataLightdatavalidationDatatypes–critical
distinctionsSavingyourwork,exiting
RandgettingbackSelectingasubsetofyour
datatoanalyseClusteranalysiswithmixed
datatypesHowgoodisthecluster
dendrogram?PruningdendrogramsAdvancedclusteringReferences
Chapter 3 Diversity indicesRichness,Shannon-
WeaverandSimpsonRollingyourownfunctionDissimilaritymeasuresfor
estimatingdiversitySummaryReferences
Chapter 4 Troubleshooting
chapte
r � cHAPTeR �
HIeRARcHIcAl clUSTeRING
basicsOneofthemostfrequentlyusedstatisticalanalysesinplantgeneticresourcesworkishierarchicalclusteranalysis.Basedonpassportdata,characterizationdata or evaluation data, this analysis produces adendrogramshowingtheapproximaterelationshipsamongasetofaccessions.
R has a very wide range of options for clusteranalysis.Thissectionlooksatthemostcommonlyusedandstraightforwardones.
We will start with an example using Anderson’sclassical data set on Iris (Anderson, 1935). ThisdatasetcomesasastandardwithR.Trytypingthefollowingattheprompttoloadandviewthedata(press“Enter”tosubmiteachline)¹(Figure1):
data(iris)iris
¹ Note that R is case sensitive and you must therefore uselower-caseandcapitallettersconsistently.
Figure 1. TheRinterface.
9
Statistical Analysis for Plant Genetic Resources
Youwill see that there are150accessionsbelonging to3differentspecies. Each accession has been characterized for four differentdescriptors:sepal length,sepalwidth,petal lengthandpetalwidth.Theseareallstraightforwardnumericalvariables;sotodothemostcommonhierarchicalclusteranalysisjusttype:
distiris<-dist(iris[,1:4])hclustiris<-hclust(distiris)plot(hclustiris)
Why cluster?Cluster analysis may serve a number of purposes in plant geneticresourceswork:
Where at least some characterization data exists for an entirecollection,agenebankcuratormayuseclusteringtodecidetheorderinwhich theplotsare laidout in the field, so thatmorphologicallysimilaraccessions are next to each other. This facilitates the comparison ofsimilarvarietiesinthefieldandthedetectionofduplicates.Theclusteringmayalsoindicateaccessionswheremolecularmarkersmaybenecessarytodetectduplicates.
Clusteringcanbeusedinselectingacorecollection.Morphologicaldataalonewouldbeaninadequatebasisfordeterminingacorecollection,butcanbecombinedwithgeographicaldatatocreatearobustprocedure.Oneapproachforthis istodividethecollection intogeographicalunitsandsampleoneaccessionfromeachmorphologicalclusterwithineachgeographicalunit.Thedesiredsizeofthecorecollectiondeterminesthenumberofgeographicalunitsandclusterstouse.ThisisconsistentwiththeapproachesrecommendedbyvanHintumet al.(2000).
Forthebreeder,clusteringgivesasenseoftherelationshipsamongaccessions and can help the selection of parents needed to maintainadequatediversityinthebreedingprogramme.Itcanalsohelpdeterminewhichvarietiestoevaluateinwhichlocalities,asinformationisgatheredonwhichmorphotypesdobestinwhichareas.
For the researcher, clustering can help understand the structure ofdiversityinrelationtoalargenumberoffactors,e.g.farmervarietynames;distribution of diversity among farmers, villages, districts or other units;andthelevelsofsimilarity,dissimilarityordiversityoftheaccessionsfoundin different units. Finally, clustering may be used in calculating diversityindices,asdescribedinthenextchapter.Thisprovidesanunderstandingthatisessentialforthedevelopmentofconservationstrategies,prioritizinginterventionsandfacilitatingtheuseofgermplasm.
Nonetheless, it is worth mentioning a few caveats. Clustering isonlyanexploratorytechnique,andonethat ishighlydependentonthesubjectivechoiceofmethodsforcalculatingdistancesandagglomeration.Itdoesnotdirectlyshowyouwhichvariablescontributetotheobservedstructure. Ordination methods may be more useful in understandingthe underlying variables in clusters, though posterior tabulation andregressionmayalsohelptoidentifycontributingvariables.
�0
Handbooks for Genebanks No. 9
The three lines of code above illustrate the three steps of ahierarchical cluster analysis: creating a distance or dissimilaritymatrix [dist()]; clustering or agglomeration [hclust()]; andplotting [plot()]. The terms distiris and hclustiris arenamesthatwehavegiventheoutputsoftheeachoftheanalyses.We have tried to make these names descriptive, so that we caneasily remember them, but we could have used any name. In R,theseoutputsareknownasobjects.
The command iris[,1:4] indicates that you are analysing thedataincolumnsonetofouroftheirisdataset.Thesquarebracketsareusedtoselectasubsetofthedataset,withnumbersbeforethecomma (in this case,none) indicating rows, andnumbersafter thecommaindicatingthecolumnstoselect.
ThisanalysiswillgiveyouFigure2.Youcanclickthemaximizebuttoninthetoprightcornerofthescreentostretchthegraphhorizontally.Toprintthegraphlegibly,youwillneedtochangetoLandscapemodeunder“Properties...”inthe“Print”dialoguebox.
Inthisexample.weusedthedefaultoptionsforhierarchicalclusteranalysis inR.However, toget theanalysis thatwe reallywant,wewouldneedtochangesomeoftheseoptions.Thehclust()function
Figure 2. DefaultclusterdendrogramforAnderson’sIrisdata.
��
Statistical Analysis for Plant Genetic Resources
Remember that capitalization and accurate spelling are important. If you write iris$species you will not get your labels. You need the capital S in Species, because that is how it is spelt in the data set. Even worse, if you write “Average” with a capital A or make a spelling mistake in that option, R will use the default “complete” option and you may never notice the mistake!
inRuses“complete”linkageasitsdefaultmethod.Weprefertouseeither the “average” (akin to the popular UPGMA or Unweightedpair-groupmethod,arithmeticaverage)or“ward”(Ward’smethodisanalternativeapproachbasedonminimizingthesumofsquares).
Wewouldalsolikethelabelsonthedendrogramtobemoreinformativeandthewritingsmaller,sothatthelabelsdonotoverlap.Todothis,usethearrow-upkeytorecallthecommandsyoupreviouslytyped,andeditthetexttomatchthelinesbelow.Usetheleftarrowkeytogettothepartyouwanttoedit.Herewehavechangedthemethodto“average”,addedthespeciesnamesaslabelsandreducedtheirfontsizewiththeoptioncex = 0.5:
hclustiris<-hclust(distiris,method=”average”)plot(hclustiris,labels=iris$Species,cex=0.5)
ThiswillgiveyouFigure3below.
Figure 3. Cluster dendrogram for Anderson’s Iris data, using the“average”methodofagglomeration.Thelabelsarespeciesnames.
seto
sase
tosa
seto
sase
tosa
seto
sase
tosa
seto
sase
tosa
seto
sase
tosa
seto
sa seto
sase
tosa
seto
sase
tosa
seto
sase
tosa
seto
sase
tosa
seto
sase
tosa
seto
sase
tosa
seto
sase
tosa
seto
sase
tosa
seto
sa seto
sase
tosa
seto
sase
tosa
seto
sase
tosa
seto
sase
tosa se
tosa
seto
sa seto
sase
tosa
seto
sase
tosa
seto
sase
tosa
seto
sase
tosa
seto
sase
tosa
seto
sase
tosa vi
rgin
ica
virg
inic
avi
rgin
ica vi
rgin
ica
virg
inic
avi
rgin
ica
virg
inic
avi
rgin
ica
virg
inic
avi
rgin
ica
virg
inic
avi
rgin
ica
virg
inic
avi
rgin
ica
virg
inic
avi
rgin
ica
virg
inic
avi
rgin
ica
virg
inic
avi
rgin
ica
virg
inic
avi
rgin
ica
virg
inic
avi
rgin
ica
virg
inic
avi
rgin
ica
virg
inic
avi
rgin
ica
virg
inic
avi
rgin
ica
virg
inic
avi
rgin
ica
virg
inic
avi
rgin
ica
virg
inic
avi
rgin
ica
vers
icol
orve
rsic
olor
vers
icol
orve
rsic
olor
vers
icol
orve
rsic
olor
vers
icol
orve
rsic
olor
vers
icol
orve
rsic
olor
vers
icol
orve
rsic
olor
vers
icol
orve
rsic
olor
vers
icol
orve
rsic
olor
vers
icol
orve
rsic
olor
vers
icol
orve
rsic
olor vers
icol
orve
rsic
olor
vers
icol
orvi
rgin
ica
vers
icol
orve
rsic
olor virg
inic
avi
rgin
ica
virg
inic
avi
rgin
ica
virg
inic
avi
rgin
ica
vers
icol
orvi
rgin
ica
virg
inic
avi
rgin
ica
virg
inic
avi
rgin
ica vers
icol
orve
rsic
olor
virg
inic
avi
rgin
ica
vers
icol
orve
rsic
olor
vers
icol
orve
rsic
olor vers
icol
orve
rsic
olor
vers
icol
orve
rsic
olor
vers
icol
orve
rsic
olor
vers
icol
orve
rsic
olor
vers
icol
orve
rsic
olor
vers
icol
orve
rsic
olor
vers
icol
orve
rsic
olor
vers
icol
orve
rsic
olor
vers
icol
orve
rsic
olor0
12
34
Cluster Dendrogram
hclust (*, "average")distiris
Hei
ght
��
Handbooks for Genebanks No. 9
Wecannowseethat theclusteranalysisseparatestheaccessionsof Iris setosa clearly from the other two species, but that with thelimiteddatawehave,wearenotabletoseparatetheaccessionsofI. versicolorandI. virginicaeffectively.
Tryaddingtheoptionhang=-1totheplotfunction(justdonotaskwhy–1),byusing theupwardarrowonyourkeyboard to recall thelinewiththeplot()function,addinghang=-1asindicatedbelowandpressing‘Enter’:
plot(hclustiris,labels=iris$Species,cex=0.5, hang=-1)
Noticethatthe labelsarenowallaligned.This isapurelycosmeticchange,whichsomeprefer. Itdoesnotchange theanalysis inanyway. You can also try changing the cex value, which determinesrelativefontsizes.
Now try doing the analysis using the option method=“ward” byusingtheupwardarrowonyourkeyboardtorecallthelinewiththehclust() function, replacing theword “average”with “ward”,andpressing‘Enter’.Usetheupwardarrowagaintorecallthelinewith the plot() function and press ‘Enter’. How did the figurechange?
Interpreting a hierarchical cluster dendrogramThe hierarchical cluster dendrogram represents the relationshipsamong the accessions in terms of approximate distances, basedon morphological traits. The distance between two accessions isapproximatelyproportional to theheightof thehorizontal line thatjoinsthetwogroupsthattheaccessionsarein.SolookingatFigure2, produced earlier, the distance between accessions 63 and 42isapproximately4.Notethat thenumberofaccessions‘between’twoaccessionsisnotatallimportant.Thus,thedistancebetweenaccessions107and23isalsoapproximately4.
Whatistheapproximatedifferencebetweenaccessions107and63?
In hierarchical cluster displays, the program needs to decide ateach join to specify which sub-tree should go on the left andwhichontheright.Thealgorithmusedin‘hclust’istoorderthesub-treesothatthetighterclusterisontheleft(thelast,i.e.mostrecent,joinoftheleftsub-treeisatalowervaluethanthelastjoinoftherightsub-tree).
We also sneaked in the $ sign in iris$Species. This is used when you want to select a specific component of the dataset, so in this case iris$Species indicates the variable Species in the dataset iris.
��
Statistical Analysis for Plant Genetic Resources
R is written by a large number of people around the globe that each maintains a package (or library) of functions (or procedures). Some of the functions that you typed above, like hclust() and plot() are base functions and can be called without loading any library, but the function reorder.hclust() is in the package gclus, which therefore needs to be downloaded and loaded before the function can be used.
It ispossible inR to re-order theclusters togiveamore intuitivedisplay, where clusters are ordered so that the accession in oneclusterthathasthesmallestdistancetotheaccessionsinthenextcluster istheaccessionthat isplacedadjacenttothenextcluster(seeFigure5).Inordertodothis,yourequirethepackagegclus,whichdoesnotautomaticallycomewithR,but iswhat iscalleda“contributed”package.
If you are connected to the Internet it is easy enough to installgclus. Just click “Packages|Install package(s)…”, select gclusfromthemenuandclickOK(Figure4).
If you are not connected to the Internet, but have the softwarepackage on a CD, the procedure is almost the same. Just click“Installpackage(s) from localzip file…”andnavigate to the foldercontainingthedownloadedpackages.
Figure 4. InstallinganewpackagefromtheInternet
��
Handbooks for Genebanks No. 9
Whenyouhavedownloadedthepackage,compareFigures2and3withthefollowingplot(Figure5):
library(gclus)hclustreordered<-reorder.hclust(hclustiris,distiris)plot(hclustreordered,cex=0.5)
Withthisre-ordering,youcanseethataccession99ismoresimilartoaccession42than107isto42.Sincethedendrogramstillshowsboth distances as approximately Height 4, you have no way oftellinghowmuchcloser the first twoaccessionsare.Thisdisplaycouldbeveryusefulfororderingaccessionsindemonstrationplotsor for further characterization,where there is advantage inhavingsimilaraccessionsclosertoeachother.
Figure 5. Thedendrogramre-orderedsothatmore-similaraccessionsareclosertoeachother,evenwhentheyareindifferentclusters.
15 1634 33
2314 43
39 9 46 2 1310 35
26 31 30 4 483
712 25
4522 20
4749 11
176 19 37
32 2124 27 44
38 5 4118 1 28 29 40 8
5036
4299
58 9461
6391 56
67 8562
96 97 89 100
9568
93 8365 80
82 8170 90 54
6010
711
512
210
214
311
412
712
414
7 7313
484 15
012
813
971
120
69 8878 77
87 53 5166 76
59 5572
98 75 7992 64
74 52 5786
135
109
116
149
137
101
145
141
121
144
125
105
133
129
112
104
117
138 11
114
811
314
0 146
142
132
118
110
130
126 10
313
110
813
610
612
311
901
23
4
Cluster Dendrogram
hclust (*, "average")distiris
Hei
ght
��
Statistical Analysis for Plant Genetic Resources
Importing your own dataEnteringyourdatadirectly intoR isa tediousprocess, tosay theleast; it is simpler to enter yourdata into aprogram likeExcel orAccess first, in rows and columns, and then export/save it as atab-delimitedtextfile.Thistutorialassumesthatyouhaveone(andonlyone)rowatthetopthatgivesthenameofeachvariable(seeFigure6). For the following exercises we will use a data set fromAppaRaoandMushonga (1987) that containsbothpassport andmorphological characterization data (see Table1 for details). Thisdatasetcanbedownloadedfromhttp://www.bioversityinternational.org/Publications/Sorghum_Data.xls.
Figure 6. ThefileSorghumDatainExcel.
A word of warning about Excel: if some cells outside the data area contain, or have previously contained, information this can make it impossible to import the exported text file into R. The solution is always to copy just the part of the spreadsheet containing the data that you want onto a new, clean spreadsheet before saving this as a tab-delimited text file. I recommend using tabs rather than a comma or a space to separate variables, because tabs are rarely found in normal text.
Figure 7. The tab-delimited file Sorghum_Data.txt opened innotepad.
��
Handbooks for Genebanks No. 9
Table 1. Variablesinthesorghumdataset.
Variable Descriptor R mode
KEY Key Integer
LON Latitude Numeric
LAT Longitude Numeric
ALT Altitude Integer
TGR Collectorno. Factor
DATE Collectingdate Factor
PROV Province Factor
TOWN Town/village Factor
KM Distancefromtown Integer
DR Directionfromtown Factor
SS Integer
ST Integer
LOCAL.NAME Farmervarietyname Factor
REMARKS Remarks Factor
EVALUATION.DATA Presenceofevaluationdata Integer
PC Plantcolouratharvest Factor
MC Leafmidribcolour Factor
SJ Stalkjuiciness Factor
JQ Juicequality Factor
LD Lodging Orderedfactor
SN Synchronyofflowering Factor
EN Headexertion Numeric
BD Birddamage Orderedfactor
PA Overallplantaspect Orderedfactor
EV Earlyvigour Orderedfactor
FL Daysto50%flowering Integer
HT Plantheight Integer
HL Headlength Integer
HW Headwidth Integer
PT Productivetillers(#) Integer
KW Kernelweight Integer
EX Endospermtexture Factor
EC Endospermcolour Factor
ET Endospermtype Factor
KL Kernellustre Factor
SC Sub-coat Factor
KP Kernelplumpness Factor
F1-F12 Meanmonthlyrainfall Numeric
F13-F24 Meantemperature Numeric
F25-F36 Meandiurnaltemperaturefluctuation Numeric
��
Statistical Analysis for Plant Genetic Resources
InR,select“File|Changedir…|Browse”fromthemenutosetyourworkingdirectoryandpointRtowardsthedirectorywhereyouhaveyourfiles(Figure8).
NowimportyourdatafiletoRwith:
sorghum<-read.table(“Sorghum_Data.txt”, header=T,sep=”\t”)
substituting thenamesSorghum_Data.txtandsorghumwithyour own choice of file names. You have just created a dataobject (‘sorghum’ or whatever name you gave it) of the typeR calls a dataframe. The option header=T or header=TRUEindicates that you have a header row with variable names. Theoptionsep=”\t” tellstheprogramthatyouhavedelimitedyourvariablesusingtabs.
R has a long list of other ways to import data, including linkingdirectly to and from almost any database, e.g. Access. For moreontheseotheroptions,click“Help|Manuals|RDataImport/Export”andreadthemanual, though inmostcases Iwouldstickwiththeapproachsuggestedabove.
Figure 8. ChangingyourworkingdirectoryinRforWindows.
��
Handbooks for Genebanks No. 9
light data validationBefore going on to analyse your data, you should look closely atyournewdataframetosearchfortrendsanderrorsinthedata.Thecommandsbelowwillhelpyoudothis.
dim(sorghum)
willgiveyou:
[1] 137 74
whichisthenumberofrowsandcolumnsyouhave.
Nowtry:
sorghum[1:10,1:10]sorghum[128:137,65:74]
Thiswillgiveyouthedatainthetopleftcorner(thetoptenrowsandthefirsttencolumns)andthebottomrightcorner(thelasttenrowsandthelasttencolumns)ofthetablerespectively.
str(sorghum)
willtellyouhowmanyobservationsandvariablesyouhave,listthevariables inyourdataset,saywhat thevariabletype is,andevenshowyousamplesofthedatainthevariable.Wewillreturntothevariable type in the next section, because if you want to do thisanalysisright,thereareafewthingsthatyouwillneedtodo.
summary(sorghum)
isanextremelyusefulcommandforspottingtrendsanderrors inyour data. Look carefully at the maximum and minimum valuesforeachnumericalvariable.Are theywithin theexpected range?Forfactorvariableswithonlyafewlevels,youwillseehowmanyentries there are for each level. For factor variables with manylevels,e.g.alistofvarietynameswiththevariablenameLOCAL.NAME, the table() command may be more useful (remembercapitalization!):
table(sorghum$LOCAL.NAME)
There are several examples of inconsistently spelled names. Canyouspotsomeofthem?
�9
Statistical Analysis for Plant Genetic Resources
Bedhlana Bimba Bimbe Chetacheya Chibedlani
1 1 1 2 1
Chibedyane Chibuku Chibulanye Chifumbata Chikombe
1 1 1 8 2
Chikondoma Chikota Chikrochani Chimhondoro Chimondo
1 7 1 1 1
Chimutode Chinzetu Chiota Chiponda Chitate
1 1 1 1 2Chiunda Chivedyana Chivendaw Compact
headDewe
1 2 1 1 1
Fumbata Gangara Gangare Gwusi Impala
1 4 1 1 1
Isifumbata Jayta Kabhuruwayo Kanzwono Loose head
1 1 2 1 1
Maboyana Mabumbe Malandisa Mashava Masintiri
1 1 3 2 1
Matipa Matserendende Mavoyana Mayakayaka Mayere
2 1 1 1 1
Mbwende Mhando Mhonda Mhonde Mososo
1 1 1 1 1
Muchaini Muise Mungonellengo Mungore Murara
2 1 2 2 1
Mururu Musoso Mutanda Mutode Ngaima
1 2 1 3 1Ngumane Nzende Nzetu Pumbata
WhiteRed Swazi
4 1 2 1 1
Rongwe Rudebare Rugare Rundende Rundewda
1 2 1 3 1
Rundunde Rushutura Rusutura Segobane Tswedani
1 1 1 1 2
Tsweta Udayakayaka Umkumbe Vedyana Zangewaya
17 1 1 3 1
Data types – critical distinctionsWe indicated earlier that we would talk more about data types,so here goes. The states that form a descriptor may be ofqualitativeorquantitativenature,assuchdescriptorstakeseveralmathematical types. In the IPGRI/Bioversity descriptor lists,the descriptors are classified as biological (characterizationand evaluation descriptors), taxonomic (species, sub-species),geographical (longitude, latitude, topography), ethnobotanical(ethnic group, vernacular name, plant uses), etc. However, themathematicalclassificationofdescriptorsisindependentofthesediscipline-basedgroupings.
�0
Handbooks for Genebanks No. 9
The mathematical type of a descriptor determines the statisticalfunctions that can be employed for its analysis. For example,parametric correlations (Pearson’s r) may be calculated betweenquantitative descriptors, while non-parametric correlations (suchas Kendall’s τ ) may be used on ordered descriptors that are notquantitative, and chi-squared for qualitative descriptors. This is acritical point. It is not just that, for example, correlation does notworkwithfactors,butthatitdoesnotmakesense.Thethreemainmathematical types of descriptors that we will deal with here areclassifiedasNumeric,FactorandOrderedfactorsinR.Table2showshowtheserelatetosomecommonmorphologicaldescriptors.
It is regrettably common to see unordered qualitative (nominal orcategorical)descriptorsanalysedasiftheywerenumeric,simplybyreplacing,forexample,eachcolourwithanumber,i.e.red,blueandgreenwith1,2and3.Thisistantamounttosayingthatredisclosertobluethanredistogreen,whichcanobviouslydistorttheresultsofastatisticalanalysisthatismadebasedonthispremise.
Table 2.DifferentmathematicaltypesofdescriptorsandhowtheyaredefinedinR.
R-mode Descriptor types Examples
Factor Binary(twostates,presence-absence) Planttype(indeterminate,determinate)
Nodulationactivity(absent,present)
Prominenceofleafvein(yes,no)
Multi-state(manystates)
Non-ordered(qualitative,nominal,attributes)
Leafpubescence(uppersurfaceofleaf,lowersurface,bothupperandlowersurfaces,onlyleafmargin)
Seedshape(oblate,triangular,rhomboid,square,obtriangular,spherical)
Seedcoatcolour(grey,brown,yellow-green,pink,red-purple,grey-mottled)
Orderedfactor
Ordered
Semi-quantitative(rank-ordered,ordinal)
Plantgrowthhabit(prostrate,spreading,semi-erect,erect)
Leafcolour(lightgreen,green,darkgreen)
Soilparticlesizeclasses(clay,finesilt,coarsesilt,veryfinesand,finesand,mediumsand,coarsesand,verycoarsesand)
NumericorInteger
Quantitative(metric,measurement)
Discontinuous(meristic,discrete) Numberofseedperpod(integers)
Continuous Maturationperiod(days)
Pedunclelength(cm)
Seedvolume(cm³)
Seedyieldperplant(g)
��
Statistical Analysis for Plant Genetic Resources
Abitofgoodpractice:solvetheproblematitsrootanddonotcodelevels of qualitative factors as 1, 2, 3. In the example above use“red”,“blue”and“green”.Thatwayyouavoidforgettingtorecodethevariablesandavoidforgettingwhatthecodingmeans!
Rwill automatically identifyalphanumericvariables (variableswithletters,orlettersandnumbers)asFactorsandwillidentifynumericvariables as either numeric (if they have decimals) or integers (iftheyhavenodecimals).However,whenFactorvariablesarecodedusingnumbers,Rcanobviouslynotdetectthisbyitself.Rcanalsonotdetect ifavariable isanordered factor,ordetect theorderofunnumbered variable states. In these cases, we must tell R whatvariabletypeswehavebeforeweproceedwiththeanalysis.
Inthecaseofoursorghumdata,thevariablesweredescribedinTable1.VariablesPCtoKParemorphologicaldescriptorsandcontainthedatathatweareinterestedinanalysinginthefirstinstance.
However,someofthesevariablesarecodedwithnumbers,thoughtheyare in fact factorsororderedfactors.Weneedto letRknowthis,whichisdonewiththefollowingcommands:
sorghum$LD<-ordered(sorghum$LD)sorghum$SN<-factor(sorghum$SN)sorghum$EN<-ordered(sorghum$EN)sorghum$BD<-ordered(sorghum$BD)sorghum$PA<-ordered(sorghum$PA)sorghum$EX<-ordered(sorghum$EX)sorghum$EC<-factor(sorghum$EC)sorghum$ET<-factor(sorghum$ET)sorghum$KL<-factor(sorghum$KL)sorghum$SC<-factor(sorghum$SC)sorghum$KP<-factor(sorghum$KP)
Nowtrylookingatthestructureoncemore:
str(sorghum)
Saving your work, exiting R and getting backAtthispointyoumaywanttosaveyourwork,closeRandtakearest.Click “File|Savehistory…”andall yourcommands, includingeverymistakethatyoumadeontheway,getsavedtoatextfile.Ifyougivethefileanamewitha.Rextension,e.g.temp.R,youcanopenthisfileinR’stexteditorwith“File|Opensript…”.Youmightwanttoedityourerrorsordeletecertainlines.
��
Handbooks for Genebanks No. 9
You can exit R with “File|Exit”. When prompted to save yourworkspace image (Figure9),myadvice is toclick “No”.Thatwayyouavoidbuildingupgarbageinyourworkspace.
WhenyoustartRagain,yousimplyset theworkingdirectoryandopenyoursavedscript.YourunthecommandsinRbyhighlightingtheminR’seditorandpressingCtrl-R.Thiswillgetyourightbacktowhereyouleftoff.Youcanrunthewholefile,oryoucanusejustthelinesthatyouneedtogetyouwhereyouwanttobe.
Selecting a subset of your data to analyseForalmostallanalysesthatyoudo,youwillwanttoselectasubsetof the data. In this case, you want to select the variables withquantitativemorphologicalcharacterizationdata.
quantsorghum<-subset(sorghum,select=EV:KW)
Seewhatthisgavewith:
str(quantsorghum)
Youcangetmoreinformationandfurtherexamplesbytyping:
?subset
Nowanalysethequantitativedataaswiththeirisdataset:
distquant<-dist(quantsorghum)hclustquant<-hclust(distquant)plot(hclustquant)
Figure 9. Makesurethatyouhavesavedthecommandsthatyouwanttouseagainbeforeclicking“No”onexit.
��
Statistical Analysis for Plant Genetic Resources
Though you get a pretty nice dendrogram, it was based on onlyone-third of your variables, and even they should probably havebeenstandardizedsothatavariabledoesnotbecomeathousandtimesmoreimportantbecauseyoumeasureit inmillimetresratherthanmetres.
Thefunctiondist() calculatestheEuclidiandistance(thesquareroot of the sum of the squared distances along each dimension/variable) by default, but can also calculate maximum, Manhattan,Canberraorbinarydistances;noneofthemaresuitableforFactorvariables. In practical terms, this means that we cannot use thefunctiondist().ButRhassolutions,soreadon!
cluster analysis with mixed data typesIt is disturbingly common to see a cluster analysis of mixed datatypes carried out as though all the data was numeric by simplygivinganumbertoeachvariablestate,e.g.red=1,blue=2,green=3.Mathematically,thisislikesayingthatredisclosertobluethanitistogreen.Thisisofcoursenonsensethatcanseriouslydistortyouranalysis.
As an alternative, R provides a function daisy() in the librarycluster (Struyff et al., 1996) based on a general coefficient ofdissimilarity proposed by Gower (1971). This combines differenttypesofdescriptors,andprocesseseachoneaccordingtoitsownmathematicaltype.
So this time round you want to select all the morphologicalvariables.ThefirstoftheseisPCandthelastisKPandincludesallthevariablesinbetween,socanbewritten:
morphsorghum<-subset(sorghum,select=PC:KP)
Sincethedaisy() function is inthecluster library,youmust firstloadtheclusterlibrary.Todothis,type:
library(cluster)daisymorph<-daisy(morphsorghum)hclustmorph<-hclust(daisymorph)plot(hclustmorph)
Before going on, select ‘History|Recording’ on the menu in thegraphicswindow.Thiswillstorethegraphsthatyoucreatefortherestof thesessionandyoucanmovebackand forth through thesession’sgraphsbyusingthepageupandpagedownkeys.
��
Handbooks for Genebanks No. 9
Nowtryembellishingthelasttwolinesabittochangethemethodofagglomeration,addvarietynamesaslabelsandreducethefontsize.There isnoneed toexecute the first two linesagainas youhave already loaded the library cluster and created the objectdaisymorph.Theoptionylab=givesthelabelforthey-axis.
hclustmorph<-hclust(daisymorph,method=”average”) plot(hclustmorph,labels=sorghum$LOCAL.NAME,cex=0.5, ylab=”Dissimilarity”)library(gclus)hclustreord<-reorder.hclust(hclustmorph,daisymorph) plot(hclustreord,labels=sorghum$LOCAL.NAME,cex=0.5,
ylab=”Dissimilarity”)
Have a careful look at how the farmer variety names are spreadacross the cluster dendrogram in Figure10. This shows just howloosely farmer variety names are used in Zimbabwe. The namesoftenrefertoonetothreespecifictraits,andallotheraspectsoftheplantcanvaryconsiderablyfromonesitetoanother.Havealookateachaccession’stownoforigin,bychangingthelastlineto:
plot(hclustreord,labels=sorghum$TOWN,cex=0.5, ylab=”Dissimilarity”)
How good is the cluster dendrogram?Withsomanydifferentwaysofdoingclusteranalyses,andeachonegivingadifferentresult,itisnaturaltoaskwhichisthebestmethod?Theansweristhattherereallyisnowayofknowingforcertain,butthereareafewwaysinwhichyoucanattempttoanswerpartofthe
What does daisy() do?ThefunctionslightlyextendsGower’sdefinitionsoastocoverordinalandratio variables in addition to quantitative and multi-state variables. Thedefinitionofthedissimilarityindaisy(),d(i,j),is
d(i,j) = ∈ [0,1]
where
d=contributionofvariableftod(i,j),whichdependsonitstype:• fbinaryornominal:d=0ifxif=xjf,andd =1otherwise,
• fintervalscaled:d = ,
• fordinalorratio-scaled:computeranksrifandzif= andtreatthesezifasinterval-scaled,
andδ = weight of variable f , where:δ = 0, if xif or xjf is missing;δ=0,ifxif=xjf=0andvariablefisasymmetricbinary;otherwiseδ=1.
p
f=1∑ δ d(f )ij
(f )ij
p
f=1∑ δ f ij
(f )ij
(f )ij
(f )ij
| xif–xjf |
maxh xhf–minhfxjf rif–1
maxh rhf–1
(f )ij
(f )ij
(f )ij
(f )ij
��
Statistical Analysis for Plant Genetic Resources
questionwiththehelpofR.
Thefirstpartofthequestioniswhetherthemeasureofdissimilarityordistanceisthemostappropriate.Thoughthiscaninfluenceyourresultsconsiderably,wearenotgoingtoenterintothisdiscussionherebeyondthepointswehavealreadymadeaboutusingmeasuresappropriatetothetypeofdatathatyouhave.AgoodintroductorydiscussionofthisissueandotheraspectsoftheanalysisofgeneticdiversitycanbefoundinMohammadiandPrasanna(2003)andinmoredetailinLegendreandLegendre(1998).
Thesecondpartofthequestionrelatestohowwellthedistanceordissimilarity matrix is represented graphically. One way of lookingatthisisthatthedistancesrepresentedonthediagramshouldbeascloseaspossibletothedistancesinthedistanceordissimilaritymatrix. The following test extracts the distances (height) in thediagramusingthefunctioncophenetic() andthencorrelatesthiswiththeoriginaldistanceordissimilaritymatrix:
Figure 10. Thedendrogramshowingfarmervarietynames.
Dew
eM
atip
aLo
ose
head
Mut
ode
Mho
nda
Chi
fum
bata
Rug
are
Mui
seM
ayer
eM
osos
oM
avoy
ana
Bedh
lana
Mab
oyan
aU
mku
mbe
Ngu
man
eM
ungo
reM
hand
oR
ongw
eM
ucha
ini
Mur
ara
Rud
ebar
eM
asin
tiri
Mho
nde
Jayt
aIs
ifum
bata
Chi
bula
nye
Mas
hava
Fum
bata
Vedy
ana
Chi
tate
Chi
bedy
ane
Tsw
eta
Chi
kom
beTs
wet
aC
hifu
mba
taC
heta
chey
aTs
wet
aN
gum
ane
Sego
bane
Gw
usi
Chi
pond
aTs
wet
aN
gum
ane
Tsw
eta
Tsw
eta R
udeb
are
Tsw
eta
Tsw
eta
Tsw
eta
May
akay
aka
Tsw
edan
iVe
dyan
aC
hifu
mba
taC
hita
teC
him
ondo
Com
pact
hea
dC
hikr
ocha
niVe
dyan
aN
gum
ane
Tsw
eta
Che
tach
eya
Mat
sere
nden
deM
alan
disa
Mal
andi
saM
alan
disa
Mut
ode
Chi
fum
bata
Impa
laR
ushu
tura
Tsw
eta
Chi
vedy
ana
Chi
vedy
ana
Chi
bedl
ani
Run
dend
eM
atip
aM
ungo
reM
ucha
ini
Uda
yaka
yaka
Chi
fum
bata
Run
dend
eR
unde
nde
Gan
gara
Zang
eway
aM
ungo
nelle
ngo
Mun
gone
lleng
oKa
bhur
uway
oPu
mba
ta W
hite
Mbw
ende
Chi
fum
bata
Tsw
eta
Chi
fum
bata
Mur
uru
Tsw
eta
Chi
fum
bata Run
dund
eM
asha
vaC
hiun
daTs
wet
aG
anga
raG
anga
raR
unde
wda
Chi
kota
Chi
vend
awN
gaim
aR
ed S
waz
iG
anga
raM
usos
oM
usos
oC
hiko
taC
hiko
ndom
aN
zetu
Nze
tuC
hiko
taC
hiko
taKa
nzw
ono
Bim
beM
utan
daC
hiko
taC
hiko
taC
hiko
taC
hiot
aC
him
hond
oro
Nze
nde
Mab
umbe
Bim
baC
hinz
etu
Tsw
edan
iM
utod
eR
usut
ura
Tsw
eta
Chi
kom
beTs
wet
aC
hibu
kuG
anga
reKa
bhur
uway
oTs
wet
aC
him
utod
e
0.00
0.05
0.10
0.15
0.20
0.25
0.30
0.35
Cluster Dendrogram
hclust (*, "average")daisymorph
Dis
sim
ilarit
y
��
Handbooks for Genebanks No. 9
cophenmorph<-cophenetic(hclustreord)
First download the package vegan as described earlier for thepackagegclus,andloadthelibrary.
library(vegan)mantel(cophenmorph,daisymorph)
Themanteltestwillgiveyouthecorrelationbetweentwomatrices.Bygoingbackandreconstructingtheobjecthclustmorphusingdifferentmethodsofclusteringoragglomeration,wecancomparetheabilityofthesemethodstoreflectthedistancesordissimilaritiesthatwehavecalculated,e.g.:
hclustward<-hclust(daisymorph,method=”ward”)hclustreward<-reorder.hclust(hclustward,daisymorph)cophenreward<-cophenetic(hclustreward)mantel(cophenreward,daisymorph)
Whenwecomparethecoefficientsobtainedfromthetwomanteltests,wesee that in thiscase themethod“average”,commonlyknownasUPGMA(UnweightedPairGroupMethodwithArithmeticmean),hasbyfarthehighestcorrelationcoefficient,whichmakesit easy to interpret and probably goes a long way to explainingits popularity. This does not necessarily imply that it correctlyaddresses a third question, namely “How well does it clusteror agglomerate the accessions?” This is a much more difficultquestiontoanswer,asthere isnocorrectresultagainstwhichtotestdifferentmethods.
Pruning dendrogramsWith larger data sets, dendrograms become unwieldy and canbe impossible to display on one page. It can then be useful toshow just the upper part of the dendrogram. This can be donewithanumberofdifferent functions inR,butmyownpreferenceis the function clip.clust() from the package maptree. Aswith vegan,maptree is a contributed package and needs to bedownloadedandinstalledpriortouse.Sincesomeofthecommandsinmaptreedependon functions in thepackagecombinat, thispackageisdownloadedautomaticallywithmaptree.
How do you decide where to prune a hierarchical cluster tree?Kelley et al. (1996) proposed a method that can help you decidebycalculatingapenaltyfunctionforahierarchicalclustertree.Thisfunctioncomparesthemeanacrossallclusterswiththemeanwithinclustersofthedissimilaritymeasure.
TheRfunctioniskgs()andalsocomesinthepackagemaptree.
��
Statistical Analysis for Plant Genetic Resources
butinadditionrequiresthatthepackagecombinat isloaded.Tocalculatethepenaltyfunction,type:
library(maptree)kgsmorph<-kgs(hclustmorph,daisymorph,maxclust=50)plot(names(kgsmorph),kgsmorph)
The lowest value shows you the optimum number of clusters,in this case 9 clusters (see Figure 11)². Like with so much incluster analysis, do not take this as the definitive answer. Youmay well have other reasons to select a different number, suchas a requirement to select a specific number of accessions thatrepresent a collection as well as possible. There are almostcertainly other methods that will give you other conclusions.Nonetheless,thisisausefulguide.
So we have chosen to cut the dendrogram into 9 clusters(Figure12):
²Theoption‘maxclust=’setsthemaximumnumberofclustersforwhichtocomputethemeasure,asitisnotusefultocalculateitforveryhighnumbersandthecalculationstakealongtime.
Figure 11. Kelley-Gardner-Sutcliffepenaltyfunctionforthesorghumdatashowingthattheoptimumnumberofclustersisnine.
10 20 30 40 50
2030
4050
names(kgsmorph)
kgsm
orph
��
Handbooks for Genebanks No. 9
clipreord<-clip.clust(hclustreord,sorghum,k=9)plot(clipreord)
Now, itwouldbeuseful toknowhowmanyaccessionsbelong toeachofthegroups.Thefollowingcode:
groupk9<-group.clust(hclustreord,k=9)table(groupk9)
givesthenumberofaccessionsineachgroup:
groupk9 1 2 3 4 5 6 7 8 9 6 10 2 55 17 23 13 8 3
Typingtheobjectnamegroupk9willgiveyougroupmembershipforeachaccession in theorder that theaccessionsappear in thedataframe:
groupk9
The output can be a bit difficult to decipher, so try binding thegroups to the original data file so that you can see clearly whichaccessionsbelongtowhichgroup:
sorghum<-cbind(sorghum,groupk9)str(sorghum)
Figure 12. Cluster dendrogram pruned to eight clusters usingclip.clust().
1 2
3
4 5
6 7
8
9
0.00
0.02
0.04
0.06
0.08
0.10
0.12
Cluster Dendrogram
hclust (*, "average")daisymorph
Hei
ght
�9
Statistical Analysis for Plant Genetic Resources
Youcanseegroupmembershipagainsteachaccessionnumberwith:
subset(sorghum,select=c(TGR,groupk9))
The select statement selects the variables to display. Thisarrangement of the dataframe now opens up table views thatcangiveabetter ideaof thepropertiesofeachcluster.Using thefunction table(), you can now look at how different qualitativetraitsarespreadoverthedifferentclusters,e.g.:
table(sorghum$MC,sorghum$groupk9)
whichwillgiveyou:
1 2 3 4 5 6 7 8 9C 1 0 1 52 17 23 13 8 3D 4 10 1 3 0 0 0 0 0Y 1 0 0 0 0 0 0 0 0
Tryothervariablestootolookforinterestingpatternsinthedataset.
Advanced clusteringTheexamplesofhierarchicalclusteringthatwehavegonethroughaboveareinrealityonlyafewofverymanydifferentwaysofdoingclustering.Forexample,becausethesizeofthedissimilaritymatrixrises exponentially with the number of accessions, hierarchicalclustering quickly runs into problems with very large datasets(approximatelyover500accessions).Roffersaverylargenumberofotherpossibilities,beingperhapsbestknownforthefunctionsinthepackagecluster,describedbyStruyffet al.(1996).
Adding molecular markersThis tutorial focuses on the analysis of morphological characterizationdata,astherearemanyotherguidestotheanalysisofmoleculardata.However, theremaybe timeswhenyouwouldwant todoacombinedanalysisofyourmorphologicalandmoleculardata.Thefunctiondaisy()allowsyoutodothisbydefiningyourmolecularvariablesasasymmetricbinary(seetheboxondaisy()above).Thisisequivalenttothesimplematchingcoefficient,whichisfrequentlyusedwithmoleculardata.
Contrary to the way that other variables are defined in thedataframe itself, asymmetric binary variables are defined whenexecuting daisy(). For example, if the data set sorghum hadcontained data for twenty molecular marker variables calledmm1 tomm20,wewouldhavehad:
morphmm<-subset(sorghum,select=c(PC:KP,mm1:mm20) daimorpmm<-daisy(morphmm,type=list(asymm=c(mm1:mm20)))
�0
Handbooks for Genebanks No. 9
Toseewhatotherfunctionsthepackagecontainstrytyping:
library(help=cluster)
Another package with a more eclectic range of functions thatperform what is called model-based clustering is the packagemclust,describedbyFraleyandRaftery(2002).
R isdevelopingsoquicklythat it isnotpossibletogiveanup-to-dateoverviewofallthefunctionsforclustering,butacompletelistcanbeobtainedbydownloadingallthecontributedpackagesandusingtheHTMLhelp:click‘Help|HTMLhelp’,thenclickonthelink‘SearchEngineandKeywords’andmovedownthepagetothelink‘cluster’.
ReferencesAnderson E. 1935. The irises of the Gaspe Peninsula. Bulletin of the
American Iris Society,59:2–5.
Appa Rao S, Mushonga JN. 1987. A catalogue of passport andcharacterization data of sorghum, pearl millet and finger milletgermplasmfromZimbabwe.IBPGR,Rome,Italy.
Fraley C, Raftery AE. 2002. MCLUST: Software for model-basedclustering, density estimation and discriminant analysis.Technical Report, Department of Statistics, University ofWashington, Seattle, Washington, USA. Available from:http://www.stat.washington.edu/mclust Date accessed: 19 July2007.
Gower JC. 1971. A general coefficient of similarity and some of itsproperties.Biometrics27:857–871.
Kelley LA, Gardner SP, Sutcliffe MJ. 1996. An automated approachfor clustering an ensemble of NMR-derived protein structuresinto conformationally-related subfamilies. Protein Engineering,9:1063–1065.
LegendreP,LegendreL.1998.Numericalecology.2ndEnglishedition.ElsevierScienceBV,Amsterdam,TheNetherlands.xv+853p.
Mohammadi SA, Prasanna BM. 2003. Analysis of genetic diversityin crop plants – salient statistical tools and considerations. Crop Science,43:1235–1248.
Struyff A, Hubert M, Rousseeuw PJ. 1996. Clustering in an object-oriented environment. Journal of Statistical Software 1(4) [on-line].Available from: http://www.jstatsoft.org/v01/i04/paper/clus.pdf.Dateaccessed:19July2007.
Van Hintum ThJL, Brown AHD, Spillane C, Hodgkin T. 2000. Corecollectionsofplantgeneticresources.IPGRITechnical BulletinNo.3.BioversityInternational,Rome,Italy.
��
cHAPTeR � DIveRSITy INDIceS
By Mikkel Grum
The following commands from the previouschapterwillgetyouwhereyouneedtobetocarryouttheexercisesinthischapter.
sorghum<-read.table(“Sorghum_Data .txt”,header=T,sep=”\t”)sorghum$LD<-ordered(sorghum$LD)sorghum$SN<-factor(sorghum$SN)sorghum$EN<-ordered(sorghum$EN)sorghum$BD<-ordered(sorghum$BD)sorghum$PA<-ordered(sorghum$PA)sorghum$EX<-ordered(sorghum$EX)sorghum$EC<-factor(sorghum$EC)sorghum$ET<-factor(sorghum$ET)sorghum$KL<-factor(sorghum$KL)sorghum$SC<-factor(sorghum$SC)sorghum$KP<-factor(sorghum$KP)
morphsorghum<-subset(sorghum, select=PC:KP)library(cluster)daisymorph<-daisy(morphsorghum)hclustmorph<-hclust(daisymorph, method=”average”)library(gclus)hclustreord<-reorder.hclust(hclust morph,daisymorph)library(maptree)groupk9<-group.clust(hclustreord,k=9)
sorghum<-cbind(sorghum,groupk9)
Ready!
Richness, Shannon-weaver and SimpsonIn ecological research, a number of indices areoftenusedtomeasurespeciesdiversityindifferentcommunities. After ‘richness’, which is simply thenumber of species represented in the community,the two most common indices are probably theShannon-Weaver index and the Simpson index.
IntroductionChapter 1 Downloading
and installing RInstallation
Chapter 2 Hierarchical ClusteringBasicsInterpretingahierarchical
clusterdendrogramImportingyourowndataLightdatavalidationDatatypes–critical
distinctionsSavingyourwork,exiting
RandgettingbackSelectingasubsetofyour
datatoanalyseClusteranalysiswithmixed
datatypesHowgoodisthecluster
dendrogram?PruningdendrogramsAdvancedclusteringReferences
Chapter 3 Diversity indicesRichness,Shannon-
WeaverandSimpsonRollingyourownfunctionDissimilaritymeasuresfor
estimatingdiversitySummaryReferences
Chapter 4 Troubleshooting
chapte
r �
��
Handbooks for Genebanks No. 9
The Shannon-Weaver index combines a measure of richness witha measure of evenness. Higher values indicate more diversity. TheSimpson index is between zero and one. It measures evenness ofspeciesorgroupmembership.Itcanalsobeinterpretedasthelikelihoodoftworandomlyselectedaccessionsbeingdifferentfromeachother.
Whenlookingatintra-specificdiversitywecouldusemorphologicalclusters instead of species. Magurran (1998) and Kindt and Coe(2005) provide more detailed descriptions of diversity indices andare good places to continue for the reader who is interested inworkingfurtherwiththemeasurementofbiodiversity.
Theoutputtablefromthecommand:
provk9<-table(sorghum$PROV,sorghum$groupk9)provk9
gives us the number of accessions collected from each group ineachprovince.Thecolumnsrepresentthegroupsandtherowstheprovinces:
1 2 3 4 5 6 7 8 9 MDL 0 7 0 19 2 4 0 5 1 MLC 2 1 0 2 3 3 0 0 1 MLN 0 0 0 8 1 1 0 0 0 MLW 0 0 1 4 0 4 4 0 0 MNL 1 1 0 3 3 7 9 1 1 MSV 3 1 1 19 8 4 0 2 0
Thistypeoftable,withthenumbersorfrequenciesofeachspeciesorgroupmembershipforeachsiteorarea,isknownasacommunitydata matrix. The matrix can be transposed to give a view that ismoreeasilycomparedtotheplotsthatfollowusingthecommand:
t(provk9)
Richness,definedhereasthenumberofgroupsrepresentedineachprovince,canbecalculatedusing:
specnumber(provk9)
withtheresults:
MDL MLC MLN MLW MNL MSV 6 6 3 4 8 7
Wecangettoseethecodebehindthesecalculationsbytyping:
specnumber
whichgivesus:
function (x, MARGIN = 1) {
��
Statistical Analysis for Plant Genetic Resources
apply(x > 0, MARGIN, sum)}
Thefunctionsimplyappliesthefunctionsumalongthe‘margin 1’,i.e. inour example, it calculates the sumof varieties found (morethan0times)ineachprovinceinthetableprovk9.
For comparison, the number of accessions collected in eachprovincecanbeobtainedusing:
table(sorghum$PROV)MDL MLC MLN MLW MNL MSV 38 12 10 13 26 38
The package vegan has a function, diversity() that cancalculateboth theShannon-Weaver indexand theSimpson indexon a community data matrix. The default is the Shannon-Weaverindex(Figure13):
library(vegan)swi<-diversity(provk9)barplot(swi)
Figure 13. Shannon-Weaver index for intra-specific sorghumdiversitybasedonninemorphologicalgroups.
MDL MLC MLN MLW MNL MSV
0.0
0.5
1.0
1.5
��
Handbooks for Genebanks No. 9
³Thesquarerootofthevalueisused,becausethisgivesanalmostlinearcorrelationwithrichnessinmostcasesandrichnessisthediversitymeasurethatmostpeoplefeelmostcomfortablewith.
Addcoloursusingthefollowingcommands:
barplot(swi,col=rainbow(20))barplot(swi,col=”lightblue”)
Thefirstcommandwillgiveyourbarsarangeofcolours fromtherainbow,andthesecondwillmakeyourbarslightblue.
TogettheSimpsonindex,youneedtospecifyit:
simp<-diversity(provk9,index=”simpson”)barplot(simp)
YoucangetthecorrelationcoefficientbetweentheShannon-WeaverandSimpsonindiceswith:
cor(swi,simp)
Whatisthecorrelationcoefficientbetweenthetwoifyoucalculatethemforeachtown,ratherthanprovince?Whatdoesthistellyouaboutthetwoindices?
Rolling your own functionIfyourobjectiveweretoevaluateatown’scontributiontoZimbabwe’sconservation effort, then the richness, diversity and evennessmeasuresabovedonottakeaccountofhowcommontheclustersareinZimbabweasawhole.
SinceIcouldnotfindanyfunctioninanysoftwarethattackledthisquestiontomysatisfaction, Idevelopedmyown index.This isanindexthatweightseachclusterforrareness,bydividingthenumberof observations of that cluster in the province (or town, farm, orwhateverotherunityoumightwanttodotheanalysison)withthetotalnumberofobservationsofthatclusterinyourstudyarea(inthiscaseZimbabwe).Thesquarerootofthevalue³foreachtownisthensummedacrossallvarietiesasin:
CI=∑√where
aij isthenumberofobservationsofvarietyiinprovincej,andAiisthetotalnumberofobservationsofvarietyiinthestudyarea.
Wewritethisasafunctionwiththefollowingscript:
naij
Ai
��
Statistical Analysis for Plant Genetic Resources
ci <- function(x, MARGIN = 2){ x <- as.matrix(x) Ai <- apply(x, MARGIN, sum) qq <- function (aij) { sqrt(aij/Ai) } Q <- apply(x, 1, qq) P <- apply(Q, 2, sum) P}
Oncethefunction iscreatedbyrunningthe linesabove, itcanbeusedtocalculateournewindexinthesamewaythatthediversityfunctionwasused.
ci(provk9)
Plottingthesevaluesagainstrichnessusing
plot(specnumber(provk9),ci(provk9), type="n")text(specnumber(provk9),ci(provk9),row. names(provk9))
Figure 14. Plot of CI scores against richness in each province inZimbabwe.
3 4 5 6 7 8
1.0
1.5
2.0
2.5
3.0
3.5
4.0
specnumber(provk9)
ci(p
rovk
9)
MDL
MLC
MLN
MLW
MNL
MSV
��
Handbooks for Genebanks No. 9
wesee thatalthoughbothMDLandMLChavesixclusterseach,MDL scores higher on the CI scale, because the clusters in theprovincearerarerandthereforecontributemoretotheoveralllevelof sorghum diversity in Zimbabwe. This would seem like valuableinformationinprioritizingconservationareas!
Forcodecomparisonandabetterunderstandingofhowyoumightconstructyourownfunction,youmightwanttotakea lookatthecodeforthediversity()functionwith:
diversity
Frequentuseofthepossibilityofseeingthecodeusedinfunctionsisoneofthebestwaysoflearninghowtomakeyourownfunctions.
Dissimilarity measures for estimating diversityMostoftheseindicesdonottakegeneticdistancesintoaccount,sotwocloselyrelatedgroupsareconsideredasdiverseastworelativelyunrelatedgroups.Thisisafairlyacceptableapproachwhenusedforspecies diversity, because barriers to gene flow among speciesresult in fairlydiscreteunits.However,whenused for theanalysisof intra-specificdiversity theassumptionofdistinctclasseswithinspecies is less well-founded, partly because there may be quitesignificantgeneflowwithinspeciesandpartlybecauseofdifficultiesin fixing criteria for intra-specific classification. Therefore a betterapproachmaybetomakedirectuseofthedissimilaritymeasurestoobtainestimatesofdiversity.
Oneanalysis inR thatgivesausefulpictureofdiversityacrossanumberofunits is the functionanosim() in thepackagevegan.This analysis ranks all the dissimilarities among accessions andproducesaboxplotoftheranksofdissimilaritieswithinagivenunit,e.g.provinces.Thecommands:
anomorph<-anosim(daisymorph,sorghum$PROV)plot(anomorph)
willgivetheplotinFigure15.Thisplotisveryinformative,althoughits interpretation can be a bit challenging. The boxes show theminimum,firstquartile,median,thirdquartileandmaximumrankofthedissimilaritieswithineachprovince,e.g.themediandissimilaritywithin the Manicaland Province (MNL) is ranked 3837th and thehighestdissimilarity is ranked9295thof the total9316dissimilarityvaluesinthematrix.Theboxesaredrawnwithwidthsproportionaltothesquare-rootsofthenumberofobservationsinthegroups.
One observation that we make in Figure 15 is that the mediandissimilarity rank for MLC is substantially higher than any of the
��
Statistical Analysis for Plant Genetic Resources
others, indicating that diversity may in fact be high in MLC. Withfewer accessions collected, 12 accessions in MLC against 26accessionsinMNL,thelowerShannon-WeaverindexforMLCmightsimplybearesultofasmallsamplesize.
Figure 15. Boxplotoftherankorderofdissimilaritiesbyprovince.
Between MDL MLC MLN MLW MNL MSV
020
0040
0060
0080
00
R = 0.082 , P = 0.001
Figure 16. BoxplotoftherankorderofdissimilaritiesbyspeciesinAnderson'sirisdataset.
Between setosa versicolor virginica
020
0040
0060
0080
0010
000
R = 0.879 , P = < 0.001
��
Handbooks for Genebanks No. 9
Anotherquickexampleshowshowdifferenttheresultscanbe:
data(iris)distiris<-dist(iris[,1:4])anoiris<-anosim(distiris,iris$Species)plot(anoiris)
In this example, Figure 16 shows, not surprisingly, that most ofthe dissimilarity is between species, with much less dissimilaritywithin species. Most of the samples of the species Iris setosaareverysimilar toeachother,but therearea fewoutliersthataremorphologicallydistinctfromtherestofthesamplesofthespecies.Their higher dissimilarity is shown as dots above the whiskers ofthebox.
SummaryInthischapterweextendedtheworkfromthepreviouschaptertolook at how we could use some of the common diversity indiceswithintra-specificdiversitybasedonmorphologicaldata.Webrieflylooked at how we might develop our own function, in this caseto take account of each region’s contribution to overall diversityin the study area. We also looked at how we could deal with thecontinuous nature of intra-specific diversity, which rarely exists indiscreteunits.
This tutorial has been a stepwise guide through the analysis ofmorphologicalcharacterizationdataandthespecificproblemsthattheir mathematical properties present. I have seen users with noexperienceofRgothroughthesestepsonelineatatime,modifythecodetodealwith theirowndataandget the results that theywant. At the same time, this is not a complete introduction to R,and thoseusers thatwould like tomoveon tootheranalyseswillneedadditionalreferences.ThesoftwareRcomeswithabeginnersmanualentitled“AnintroductiontoR”thatcanbefoundinthemenuunder“Help|Manuals(inPDF)”.ManyotherreferencescanbefoundontheRProjectwebsite:www.r-project.org/.
ReferencesMagurranAE. 1988. Ecological diversity and its measurement.Croom
Helm,London,UK.
KindtR,CoeR.2005.Treediversityanalysis:amanualandsoftwareforcommonstatisticalmethodsforecologicalandbiodiversitystudies.ICRAF,Nairobi,Kenya.203p.
�9
Statistical Analysis for Plant Genetic Resources
cHAPTeR � TRoUbleSHooTING
Thissection listssomeofthecommonproblemsanderrormessageswecameacrossinusingthistutorial. For other error messages encounteredwhenusingR,thebestoptionistouseanyoneofthesearchenginesfoundontheRprojectwebsiteforhelp(http://www.r-project.org/).
1. “Error: Object " . . . " not found”Youhavetriedtoexecutea functionwithout firstloading the library or package that contains thefunction.Solution:loadthelibraryorpackage.
2. I can’t read the text in the cluster diagrams; it’s too small and cramped.Have you maximized the graphics window?Double-clickingonthebluebaracrossthetopofthegraphicswindowwilldothis.ForprintingfromR,youcansetthepagetolandscapebyclickingonthepropertiesbuttonintheprintdialogueboxand selecting ’landscape‘. You can also use theoptioncex=0.5oranothernumbertoreducethesizeofthefont.
3. CapitalizationR is sensitive to capitalization; if the variable iscallediris$Species, theniris$specieswillnot work. If options are capitalized incorrectly,e.g.writingmethod=Average,ormethod=Ward,thedefaultmethodisused.Inboththeexamplesused,themethodshouldhavebeenwrittenwithoutcapital letters,sothedefaultmethod=completeisusedinstead.Nowarningisgiven.
4. “Error in file(file, "r") : unable to open connection”You forgot to set the working directory. Clicking’File|Changedir…|Browse‘fromthemenuwillleadyouthroughthis(seeFigure8).
IntroductionChapter 1 Downloading
and installing RInstallation
Chapter 2 Hierarchical ClusteringBasicsInterpretingahierarchical
clusterdendrogramImportingyourowndataLightdatavalidationDatatypes–critical
distinctionsSavingyourwork,exiting
RandgettingbackSelectingasubsetofyour
datatoanalyseClusteranalysiswithmixed
datatypesHowgoodisthecluster
dendrogram?PruningdendrogramsAdvancedclusteringReferences
Chapter 3 Diversity indicesRichness,Shannon-
WeaverandSimpsonRollingyourownfunctionDissimilaritymeasuresfor
estimatingdiversitySummaryReferences
Chapter 4 Troubleshooting
chapte
r �
�0
Handbooks for Genebanks No. 9
5. My data won’t import (a)Servesyouright!YouareusingExcelanddidnotbelievemewhenIsaidthatyoushouldcutandpasteyourdata,andonlythedata,toanewspreadsheetbeforesavingitasatab-delimitedfile.
6. My data won’t import (b)The only other problem I know of that messes up importing of asimpledatasetofrowsandcolumnsiswhenyouhaveapostrophesinsomeofyourvariables,e.g.namesliken’goni.AsfarasIknow,you have to remove the apostrophes before importing the data.Accentsandmostothersymbolspresentnoproblem.
ISBN 978-92-9043-735-2
IPGRIandINIBAPoperateunderthenameBioversityInternational
SupportedbytheCGIAR