handbooks for genebanks no. 9 statistical analysis · main r self-extracting executable file in it:...

Handbooks for Genebanks No. 9

Statistical Analysis for Plant Genetic Resources: Clustering and indices in made simple

Mikkel Grum and Fredrick Atieno


Mikkel Grum and Fredrick Atieno

Bioversity, c/o ICRAF, P.O. Box 30677, 00100 Nairobi, Kenya

Statistical Analysis for Plant Genetic Resources: Clustering and indices in made simple

ii


Bioversity International is an independent international scientific organization that seeks to improve the well-being of present and future generations of people by enhancing conservation and the deployment of agricultural biodiversity on farms and in forests. It is one of 15 centres supported by the Consultative Group on International Agricultural Research (CGIAR), an association of public and private members who support efforts to mobilize cutting-edge science to reduce hunger and poverty, improve human nutrition and health, and protect the environment. Bioversity has its headquarters in Maccarese, near Rome, Italy, with offices in more than 20 other countries worldwide. The Institute operates through four programmes: Diversity for Livelihoods; Understanding and Managing Biodiversity; Global Partnerships; and Commodities for Livelihoods. The international status of Bioversity is conferred under an Establishment Agreement which, by January 2007, had been signed by the Governments of Algeria, Australia, Belgium, Benin, Bolivia, Brazil, Burkina Faso, Cameroon, Chile, China, Congo, Costa Rica, Côte d’Ivoire, Cyprus, Czech Republic, Denmark, Ecuador, Egypt, Greece, Guinea, Hungary, India, Indonesia, Iran, Israel, Italy, Jordan, Kenya, Malaysia, Mali, Mauritania, Morocco, Norway, Pakistan, Panama, Peru, Poland, Portugal, Romania, Russia, Senegal, Slovakia, Sudan, Switzerland, Syria, Tunisia, Turkey, Uganda and Ukraine. Financial support for Bioversity’s research is provided by more than 150 donors, including governments, private foundations and international organizations. For details of donors and research activities please see Bioversity’s Annual Reports, which are available in printed form on request from [email protected] or from Bioversity’s Web site (www.bioversityinternational.org). The geographical designations employed and the presentation of material in this publication do not imply the expression of any opinion whatsoever on the part of Bioversity or the CGIAR concerning the legal status of any country, territory, city or area or its authorities, or concerning the delimitation of its frontiers or boundaries. Similarly, the views expressed are those of the authors and do not necessarily reflect the views of these organizations. Mention of a proprietary name does not constitute endorsement of the product and is given only for information. The rights of the owners of any trademarks used in the text are acknowledged.

Citation: Grum M and Atieno F. 2007. Statistical analysis for plant genetic resources: clustering and indices in R made simple. Handbooks for Genebanks, No. 9. Bioversity International, Rome, Italy.

ISBN: 978-92-9043-735-2

Bioversity InternationalVia dei Tre Denari, 472/a00057 Maccarese Rome, Italy

© Bioversity International, 2007

All rights reserved. Reproduction and dissemination of material in this information product for educational or other non-commercial purposes are authorized without any prior written permission from the copyright holders provided the source is fully acknowledged. Modification, translation or reproduction of material in this information product for resale or other commercial purposes is prohibited without written permission of the copyright holders. Applications for such permission should be addressed to the Information Dissemination and Communications Manager, Bioversity International, Via dei Tre Denari 472/a, 00057 Maccarese, Rome, Italy. E-mail: [email protected]

iii

Statistical Analysis for Plant Genetic Resources

iii

TAble oF coNTeNTS

Introduction 1

Chapter1DownloadingandinstallingR 3

Installation 5

Chapter2HierarchicalClustering 8

Basics 8

Interpretingahierarchicalclusterdendrogram 12

Importingyourowndata 15

Lightdatavalidation 17

Datatypes–criticaldistinctions 19

Savingyourwork,exitingRandgettingback 21

Selectingasubsetofyourdatatoanalyse 22

Clusteranalysiswithmixeddatatypes 23

Howgoodistheclusterdendrogram? 24

Pruningdendrograms 26

Advancedclustering 29

References 30

Chapter3Diversityindices 31

Richness,Shannon-WeaverandSimpson 31

Rollingyourownfunction 34

Dissimilaritymeasuresforestimatingdiversity 36

Summary 38

References 38

Chapter4Troubleshooting 39

iv


�

INTRoDUcTIoNThis handbook aims to give researchers toolsandmethodstoanalysemathematicallycomplexmorphologicalcharacterizationdata.

Risapowerfulsoftwarelanguageandenvironmentforstatisticalcomputingandgraphics,developedby volunteers from around the world, workingtogetherovertheInternet.Thecoredevelopmentteamconsistsoffewerthantwentyprogrammerswho maintain and develop the general Renvironment,withcontributionsfrommanyothers.A much larger number of contributors writeindividual statisticalpackages forR that addressspecificneeds.

Statisticians and researchers write many ofthesestatisticalpackagesfortheirownresearchpurposes, ensuring that R is in the forefrontof developments in statistics. Another of R’sstrengths is the ease with which well-designedpublication-quality plots can be produced,including mathematical symbols and formulaewhereneeded.

On the down side, if you are used to point-and-click interfaces, you may find that R hasa relatively steep learning curve. As one of thedevelopers of R once put it, R is not designedtomakeeasythingssimple,buttomakedifficultthingspossible.Nonetheless,severalprojectsareworkingonmakingReasiertouse,andinterfacesshouldimproveinthenearfuture.

WewouldnotproposeRhadwenotfoundthatithasmanyfacilitiesusefultotheanalysisofplantgenetic resources data, facilities not commonlyfound in other statistical packages. We alsobelievethatwithastep-by-stepguide,providingexamples of typical analyses required in plantgeneticresourcesresearch,mostresearcherswillbeabletoimplementsuchanalysesquitesimplyontheirowndata.

IntroductionChapter 1 Downloading

and installing RInstallation

Chapter 2 Hierarchical ClusteringBasicsInterpretingahierarchical

clusterdendrogramImportingyourowndataLightdatavalidationDatatypes–critical

distinctionsSavingyourwork,exiting

RandgettingbackSelectingasubsetofyour

datatoanalyseClusteranalysiswithmixed

datatypesHowgoodisthecluster

dendrogram?PruningdendrogramsAdvancedclusteringReferences

Chapter 3 Diversity indicesRichness,Shannon-

WeaverandSimpsonRollingyourownfunctionDissimilaritymeasuresfor

estimatingdiversitySummaryReferences

Chapter 4 Troubleshooting

Intr

oducti

on

�


Thistutorialprovidessuchaguide,beginningwithsimpleexamplesthat soon demonstrate the power of R, and allows users togradually become acquainted with the more complex aspects.Justenoughexplanationofthestatisticsisgivenforreaderswithabasicunderstandingofthesubjecttounderstandtheoutputoftheanalyses.

ForbroaderandmorecompleteintroductionstoR,ormanualsonspecific aspects, or reference works, the reader is referred to thewiderangeoffreelyavailablemanualsthatcanbedownloadedbygoing to www.r-project.org and clicking on the link “Contributed”undertheheading“Documentation”(lefthandcolumnofthepage).

�

cHAPTeR � DowNloADING AND INSTAllING RR can be downloaded from the main websitehttp://www.cran.r-project.org

Under the heading Download and Install R,the Windows bullet is the hypertext link fordownloading the binary version of R for a PCrunningWindows.ThebasesubdirectoryhasthemainRself-extractingexecutablefileinit:














chapte

r �

�


Clickingthehypertextlinkbasegivesthescreen:

TheR-X.X.X-win32.exefilecanbesavedtoyourlocaldesktopbya“right mouse-click” on the R-X.X.X-win32.exe hypertext link. In theexampleabove,thefileisR-2.6.1-win32.exe,butnewversionsofRarereleasedatleasteverysixmonthsandthenumberchangeswitheachrelease.Select“SaveTargetAs ...”;browsethe filesystemtofindtheDesktopandthen“Save”.

�


InstallationDoubleclickingthisdownloadedfileshouldgivethefollowingseriesofinstallationscreens:

Choose the language you would like to use during installation andpressOK.Youwillbetakentothenextscreen.

�


Select the Next button. Continue to choose the default selectionsin the installationwindow. If installation isproceedingproperly, youwillseetheinstallationscreenbelow,fromwhichyoucanselectthecomponentsofRtoinstall.However,werecommendsimplychoosingthedefaultselections.Onceinstallationisfinished,anRiconwillbeplacedontheDesktop.

ClickingNextwillgiveyou thewindowbelow,whichallowsyou toeitherchoosetocustomizethestartupornot.WerecommendyouchooseYES,forcustomizedstartup.ThiswillenableyoutoconfigureaccesstoHelpandtheInternetforinstallingcontributedpackages.

�


Choose Next, which will give you the window below to select theR interface.While it issafetogowiththedefault installationoption(MDI),werecommendSDI,whichwillgiveyouseparatewindowsforgraphicsandtheconsole,makingyourworkeasier.

InthenextwindowforRsetupchooseDefaultandpressNext,whichbringsyoutothewindowbelow,whereyouaretochoosethetypeofInternetconnectiontouse.WerecommendyouchooseInternet2,whichwillallowyoutousetheproxysettingsofyourlocalserverincaseyouareworkingbehindafirewall.

From here onwards, just press Next at every step, choosing thedefaultsuntil the installation iscomplete.ThenpressFINISH.AnRiconwillbeplacedonthedesktop.YouarenowreadytouseR.Congratulations!!

�














chapte

r � cHAPTeR �

HIeRARcHIcAl clUSTeRING

basicsOneofthemostfrequentlyusedstatisticalanalysesinplantgeneticresourcesworkishierarchicalclusteranalysis.Basedonpassportdata,characterizationdata or evaluation data, this analysis produces adendrogramshowingtheapproximaterelationshipsamongasetofaccessions.

R has a very wide range of options for clusteranalysis.Thissectionlooksatthemostcommonlyusedandstraightforwardones.

We will start with an example using Anderson’sclassical data set on Iris (Anderson, 1935). ThisdatasetcomesasastandardwithR.Trytypingthefollowingattheprompttoloadandviewthedata(press“Enter”tosubmiteachline)¹(Figure1):

data(iris)iris

¹ Note that R is case sensitive and you must therefore uselower-caseandcapitallettersconsistently.

Figure 1. TheRinterface.

9


Youwill see that there are150accessionsbelonging to3differentspecies. Each accession has been characterized for four differentdescriptors:sepal length,sepalwidth,petal lengthandpetalwidth.Theseareallstraightforwardnumericalvariables;sotodothemostcommonhierarchicalclusteranalysisjusttype:

distiris<-dist(iris[,1:4])hclustiris<-hclust(distiris)plot(hclustiris)

Why cluster?Cluster analysis may serve a number of purposes in plant geneticresourceswork:

Where at least some characterization data exists for an entirecollection,agenebankcuratormayuseclusteringtodecidetheorderinwhich theplotsare laidout in the field, so thatmorphologicallysimilaraccessions are next to each other. This facilitates the comparison ofsimilarvarietiesinthefieldandthedetectionofduplicates.Theclusteringmayalsoindicateaccessionswheremolecularmarkersmaybenecessarytodetectduplicates.

Clusteringcanbeusedinselectingacorecollection.Morphologicaldataalonewouldbeaninadequatebasisfordeterminingacorecollection,butcanbecombinedwithgeographicaldatatocreatearobustprocedure.Oneapproachforthis istodividethecollection intogeographicalunitsandsampleoneaccessionfromeachmorphologicalclusterwithineachgeographicalunit.Thedesiredsizeofthecorecollectiondeterminesthenumberofgeographicalunitsandclusterstouse.ThisisconsistentwiththeapproachesrecommendedbyvanHintumet al.(2000).

Forthebreeder,clusteringgivesasenseoftherelationshipsamongaccessions and can help the selection of parents needed to maintainadequatediversityinthebreedingprogramme.Itcanalsohelpdeterminewhichvarietiestoevaluateinwhichlocalities,asinformationisgatheredonwhichmorphotypesdobestinwhichareas.

For the researcher, clustering can help understand the structure ofdiversityinrelationtoalargenumberoffactors,e.g.farmervarietynames;distribution of diversity among farmers, villages, districts or other units;andthelevelsofsimilarity,dissimilarityordiversityoftheaccessionsfoundin different units. Finally, clustering may be used in calculating diversityindices,asdescribedinthenextchapter.Thisprovidesanunderstandingthatisessentialforthedevelopmentofconservationstrategies,prioritizinginterventionsandfacilitatingtheuseofgermplasm.

Nonetheless, it is worth mentioning a few caveats. Clustering isonlyanexploratorytechnique,andonethat ishighlydependentonthesubjectivechoiceofmethodsforcalculatingdistancesandagglomeration.Itdoesnotdirectlyshowyouwhichvariablescontributetotheobservedstructure. Ordination methods may be more useful in understandingthe underlying variables in clusters, though posterior tabulation andregressionmayalsohelptoidentifycontributingvariables.

�0


The three lines of code above illustrate the three steps of ahierarchical cluster analysis: creating a distance or dissimilaritymatrix [dist()]; clustering or agglomeration [hclust()]; andplotting [plot()]. The terms distiris and hclustiris arenamesthatwehavegiventheoutputsoftheeachoftheanalyses.We have tried to make these names descriptive, so that we caneasily remember them, but we could have used any name. In R,theseoutputsareknownasobjects.

The command iris[,1:4] indicates that you are analysing thedataincolumnsonetofouroftheirisdataset.Thesquarebracketsareusedtoselectasubsetofthedataset,withnumbersbeforethecomma (in this case,none) indicating rows, andnumbersafter thecommaindicatingthecolumnstoselect.

ThisanalysiswillgiveyouFigure2.Youcanclickthemaximizebuttoninthetoprightcornerofthescreentostretchthegraphhorizontally.Toprintthegraphlegibly,youwillneedtochangetoLandscapemodeunder“Properties...”inthe“Print”dialoguebox.

Inthisexample.weusedthedefaultoptionsforhierarchicalclusteranalysis inR.However, toget theanalysis thatwe reallywant,wewouldneedtochangesomeoftheseoptions.Thehclust()function

Figure 2. DefaultclusterdendrogramforAnderson’sIrisdata.

��


Remember that capitalization and accurate spelling are important. If you write iris$species you will not get your labels. You need the capital S in Species, because that is how it is spelt in the data set. Even worse, if you write “Average” with a capital A or make a spelling mistake in that option, R will use the default “complete” option and you may never notice the mistake!

inRuses“complete”linkageasitsdefaultmethod.Weprefertouseeither the “average” (akin to the popular UPGMA or Unweightedpair-groupmethod,arithmeticaverage)or“ward”(Ward’smethodisanalternativeapproachbasedonminimizingthesumofsquares).

Wewouldalsolikethelabelsonthedendrogramtobemoreinformativeandthewritingsmaller,sothatthelabelsdonotoverlap.Todothis,usethearrow-upkeytorecallthecommandsyoupreviouslytyped,andeditthetexttomatchthelinesbelow.Usetheleftarrowkeytogettothepartyouwanttoedit.Herewehavechangedthemethodto“average”,addedthespeciesnamesaslabelsandreducedtheirfontsizewiththeoptioncex = 0.5:

hclustiris<-hclust(distiris,method=”average”)plot(hclustiris,labels=iris$Species,cex=0.5)

ThiswillgiveyouFigure3below.

Figure 3. Cluster dendrogram for Anderson’s Iris data, using the“average”methodofagglomeration.Thelabelsarespeciesnames.

seto

sase

tosa

seto

sase

tosa

seto

sase

tosa

seto

sase

tosa

seto

sase

tosa

seto

sa seto

sase

tosa

seto

sase

tosa

seto

sase

tosa

seto

sase

tosa

seto

sase

tosa

seto

sase

tosa

seto

sase

tosa

seto

sase

tosa

seto

sa seto

sase

tosa

seto

sase

tosa

seto

sase

tosa

seto

sase

tosa se

tosa

seto

sa seto

sase

tosa

seto

sase

tosa

seto

sase

tosa

seto

sase

tosa

seto

sase

tosa

seto

sase

tosa vi

rgin

ica

virg

inic

avi

rgin

ica vi

rgin

ica

virg

inic

avi

rgin

ica

virg

inic

avi

rgin

ica

virg

inic

avi

rgin

ica

virg

inic

avi

rgin

ica

virg

inic

avi

rgin

ica

virg

inic

avi

rgin

ica

virg

inic

avi

rgin

ica

virg

inic

avi

rgin

ica

virg

inic

avi

rgin

ica

virg

inic

avi

rgin

ica

virg

inic

avi

rgin

ica

virg

inic

avi

rgin

ica

virg

inic

avi

rgin

ica

virg

inic

avi

rgin

ica

virg

inic

avi

rgin

ica

virg

inic

avi

rgin

ica

vers

icol

orve

rsic

olor

vers

icol

orve

rsic

olor

vers

icol

orve

rsic

olor

vers

icol

orve

rsic

olor

vers

icol

orve

rsic

olor

vers

icol

orve

rsic

olor

vers

icol

orve

rsic

olor

vers

icol

orve

rsic

olor

vers

icol

orve

rsic

olor

vers

icol

orve

rsic

olor vers

icol

orve

rsic

olor

vers

icol

orvi

rgin

ica

vers

icol

orve

rsic

olor virg

inic

avi

rgin

ica

virg

inic

avi

rgin

ica

virg

inic

avi

rgin

ica

vers

icol

orvi

rgin

ica

virg

inic

avi

rgin

ica

virg

inic

avi

rgin

ica vers

icol

orve

rsic

olor

virg

inic

avi

rgin

ica

vers

icol

orve

rsic

olor

vers

icol

orve

rsic

olor vers

icol

orve

rsic

olor

vers

icol

orve

rsic

olor

vers

icol

orve

rsic

olor

vers

icol

orve

rsic

olor

vers

icol

orve

rsic

olor

vers

icol

orve

rsic

olor

vers

icol

orve

rsic

olor

vers

icol

orve

rsic

olor

vers

icol

orve

rsic

olor0

12

34

Cluster Dendrogram

hclust (*, "average")distiris

Hei

ght

��


Wecannowseethat theclusteranalysisseparatestheaccessionsof Iris setosa clearly from the other two species, but that with thelimiteddatawehave,wearenotabletoseparatetheaccessionsofI. versicolorandI. virginicaeffectively.

Tryaddingtheoptionhang=-1totheplotfunction(justdonotaskwhy–1),byusing theupwardarrowonyourkeyboard to recall thelinewiththeplot()function,addinghang=-1asindicatedbelowandpressing‘Enter’:

plot(hclustiris,labels=iris$Species,cex=0.5, hang=-1)

Noticethatthe labelsarenowallaligned.This isapurelycosmeticchange,whichsomeprefer. Itdoesnotchange theanalysis inanyway. You can also try changing the cex value, which determinesrelativefontsizes.

Now try doing the analysis using the option method=“ward” byusingtheupwardarrowonyourkeyboardtorecallthelinewiththehclust() function, replacing theword “average”with “ward”,andpressing‘Enter’.Usetheupwardarrowagaintorecallthelinewith the plot() function and press ‘Enter’. How did the figurechange?

Interpreting a hierarchical cluster dendrogramThe hierarchical cluster dendrogram represents the relationshipsamong the accessions in terms of approximate distances, basedon morphological traits. The distance between two accessions isapproximatelyproportional to theheightof thehorizontal line thatjoinsthetwogroupsthattheaccessionsarein.SolookingatFigure2, produced earlier, the distance between accessions 63 and 42isapproximately4.Notethat thenumberofaccessions‘between’twoaccessionsisnotatallimportant.Thus,thedistancebetweenaccessions107and23isalsoapproximately4.

Whatistheapproximatedifferencebetweenaccessions107and63?

In hierarchical cluster displays, the program needs to decide ateach join to specify which sub-tree should go on the left andwhichontheright.Thealgorithmusedin‘hclust’istoorderthesub-treesothatthetighterclusterisontheleft(thelast,i.e.mostrecent,joinoftheleftsub-treeisatalowervaluethanthelastjoinoftherightsub-tree).

We also sneaked in the $ sign in iris$Species. This is used when you want to select a specific component of the dataset, so in this case iris$Species indicates the variable Species in the dataset iris.

��


R is written by a large number of people around the globe that each maintains a package (or library) of functions (or procedures). Some of the functions that you typed above, like hclust() and plot() are base functions and can be called without loading any library, but the function reorder.hclust() is in the package gclus, which therefore needs to be downloaded and loaded before the function can be used.

It ispossible inR to re-order theclusters togiveamore intuitivedisplay, where clusters are ordered so that the accession in oneclusterthathasthesmallestdistancetotheaccessionsinthenextcluster istheaccessionthat isplacedadjacenttothenextcluster(seeFigure5).Inordertodothis,yourequirethepackagegclus,whichdoesnotautomaticallycomewithR,but iswhat iscalleda“contributed”package.

If you are connected to the Internet it is easy enough to installgclus. Just click “Packages|Install package(s)…”, select gclusfromthemenuandclickOK(Figure4).

If you are not connected to the Internet, but have the softwarepackage on a CD, the procedure is almost the same. Just click“Installpackage(s) from localzip file…”andnavigate to the foldercontainingthedownloadedpackages.

Figure 4. InstallinganewpackagefromtheInternet

��


Whenyouhavedownloadedthepackage,compareFigures2and3withthefollowingplot(Figure5):

library(gclus)hclustreordered<-reorder.hclust(hclustiris,distiris)plot(hclustreordered,cex=0.5)

Withthisre-ordering,youcanseethataccession99ismoresimilartoaccession42than107isto42.Sincethedendrogramstillshowsboth distances as approximately Height 4, you have no way oftellinghowmuchcloser the first twoaccessionsare.Thisdisplaycouldbeveryusefulfororderingaccessionsindemonstrationplotsor for further characterization,where there is advantage inhavingsimilaraccessionsclosertoeachother.

Figure 5. Thedendrogramre-orderedsothatmore-similaraccessionsareclosertoeachother,evenwhentheyareindifferentclusters.

15 1634 33

2314 43

39 9 46 2 1310 35

26 31 30 4 483

712 25

4522 20

4749 11

176 19 37

32 2124 27 44

38 5 4118 1 28 29 40 8

5036

4299

58 9461

6391 56

67 8562

96 97 89 100

9568

93 8365 80

82 8170 90 54

6010

711

512

210

214

311

412

712

414

7 7313

484 15

012

813

971

120

69 8878 77

87 53 5166 76

59 5572

98 75 7992 64

74 52 5786

135

109

116

149

137

101

145

141

121

144

125

105

133

129

112

104

117

138 11

114

811

314

0 146

142

132

118

110

130

126 10

313

110

813

610

612

311

901

23

4

Cluster Dendrogram

hclust (*, "average")distiris

Hei

ght

��


Importing your own dataEnteringyourdatadirectly intoR isa tediousprocess, tosay theleast; it is simpler to enter yourdata into aprogram likeExcel orAccess first, in rows and columns, and then export/save it as atab-delimitedtextfile.Thistutorialassumesthatyouhaveone(andonlyone)rowatthetopthatgivesthenameofeachvariable(seeFigure6). For the following exercises we will use a data set fromAppaRaoandMushonga (1987) that containsbothpassport andmorphological characterization data (see Table1 for details). Thisdatasetcanbedownloadedfromhttp://www.bioversityinternational.org/Publications/Sorghum_Data.xls.

Figure 6. ThefileSorghumDatainExcel.

A word of warning about Excel: if some cells outside the data area contain, or have previously contained, information this can make it impossible to import the exported text file into R. The solution is always to copy just the part of the spreadsheet containing the data that you want onto a new, clean spreadsheet before saving this as a tab-delimited text file. I recommend using tabs rather than a comma or a space to separate variables, because tabs are rarely found in normal text.

Figure 7. The tab-delimited file Sorghum_Data.txt opened innotepad.

��


Table 1. Variablesinthesorghumdataset.

Variable Descriptor R mode

KEY Key Integer

LON Latitude Numeric

LAT Longitude Numeric

ALT Altitude Integer

TGR Collectorno. Factor

DATE Collectingdate Factor

PROV Province Factor

TOWN Town/village Factor

KM Distancefromtown Integer

DR Directionfromtown Factor

SS Integer

ST Integer

LOCAL.NAME Farmervarietyname Factor

REMARKS Remarks Factor

EVALUATION.DATA Presenceofevaluationdata Integer

PC Plantcolouratharvest Factor

MC Leafmidribcolour Factor

SJ Stalkjuiciness Factor

JQ Juicequality Factor

LD Lodging Orderedfactor

SN Synchronyofflowering Factor

EN Headexertion Numeric

BD Birddamage Orderedfactor

PA Overallplantaspect Orderedfactor

EV Earlyvigour Orderedfactor

FL Daysto50%flowering Integer

HT Plantheight Integer

HL Headlength Integer

HW Headwidth Integer

PT Productivetillers(#) Integer

KW Kernelweight Integer

EX Endospermtexture Factor

EC Endospermcolour Factor

ET Endospermtype Factor

KL Kernellustre Factor

SC Sub-coat Factor

KP Kernelplumpness Factor

F1-F12 Meanmonthlyrainfall Numeric

F13-F24 Meantemperature Numeric

F25-F36 Meandiurnaltemperaturefluctuation Numeric

��


InR,select“File|Changedir…|Browse”fromthemenutosetyourworkingdirectoryandpointRtowardsthedirectorywhereyouhaveyourfiles(Figure8).

NowimportyourdatafiletoRwith:

sorghum<-read.table(“Sorghum_Data.txt”, header=T,sep=”\t”)

substituting thenamesSorghum_Data.txtandsorghumwithyour own choice of file names. You have just created a dataobject (‘sorghum’ or whatever name you gave it) of the typeR calls a dataframe. The option header=T or header=TRUEindicates that you have a header row with variable names. Theoptionsep=”\t” tellstheprogramthatyouhavedelimitedyourvariablesusingtabs.

R has a long list of other ways to import data, including linkingdirectly to and from almost any database, e.g. Access. For moreontheseotheroptions,click“Help|Manuals|RDataImport/Export”andreadthemanual, though inmostcases Iwouldstickwiththeapproachsuggestedabove.

Figure 8. ChangingyourworkingdirectoryinRforWindows.

��


light data validationBefore going on to analyse your data, you should look closely atyournewdataframetosearchfortrendsanderrorsinthedata.Thecommandsbelowwillhelpyoudothis.

dim(sorghum)

willgiveyou:

[1] 137 74

whichisthenumberofrowsandcolumnsyouhave.

Nowtry:

sorghum[1:10,1:10]sorghum[128:137,65:74]

Thiswillgiveyouthedatainthetopleftcorner(thetoptenrowsandthefirsttencolumns)andthebottomrightcorner(thelasttenrowsandthelasttencolumns)ofthetablerespectively.

str(sorghum)

willtellyouhowmanyobservationsandvariablesyouhave,listthevariables inyourdataset,saywhat thevariabletype is,andevenshowyousamplesofthedatainthevariable.Wewillreturntothevariable type in the next section, because if you want to do thisanalysisright,thereareafewthingsthatyouwillneedtodo.

summary(sorghum)

isanextremelyusefulcommandforspottingtrendsanderrors inyour data. Look carefully at the maximum and minimum valuesforeachnumericalvariable.Are theywithin theexpected range?Forfactorvariableswithonlyafewlevels,youwillseehowmanyentries there are for each level. For factor variables with manylevels,e.g.alistofvarietynameswiththevariablenameLOCAL.NAME, the table() command may be more useful (remembercapitalization!):

table(sorghum$LOCAL.NAME)

There are several examples of inconsistently spelled names. Canyouspotsomeofthem?

�9


Bedhlana Bimba Bimbe Chetacheya Chibedlani

1 1 1 2 1

Chibedyane Chibuku Chibulanye Chifumbata Chikombe

1 1 1 8 2

Chikondoma Chikota Chikrochani Chimhondoro Chimondo

1 7 1 1 1

Chimutode Chinzetu Chiota Chiponda Chitate

1 1 1 1 2Chiunda Chivedyana Chivendaw Compact

headDewe

1 2 1 1 1

Fumbata Gangara Gangare Gwusi Impala

1 4 1 1 1

Isifumbata Jayta Kabhuruwayo Kanzwono Loose head

1 1 2 1 1

Maboyana Mabumbe Malandisa Mashava Masintiri

1 1 3 2 1

Matipa Matserendende Mavoyana Mayakayaka Mayere

2 1 1 1 1

Mbwende Mhando Mhonda Mhonde Mososo

1 1 1 1 1

Muchaini Muise Mungonellengo Mungore Murara

2 1 2 2 1

Mururu Musoso Mutanda Mutode Ngaima

1 2 1 3 1Ngumane Nzende Nzetu Pumbata

WhiteRed Swazi

4 1 2 1 1

Rongwe Rudebare Rugare Rundende Rundewda

1 2 1 3 1

Rundunde Rushutura Rusutura Segobane Tswedani

1 1 1 1 2

Tsweta Udayakayaka Umkumbe Vedyana Zangewaya

17 1 1 3 1

Data types – critical distinctionsWe indicated earlier that we would talk more about data types,so here goes. The states that form a descriptor may be ofqualitativeorquantitativenature,assuchdescriptorstakeseveralmathematical types. In the IPGRI/Bioversity descriptor lists,the descriptors are classified as biological (characterizationand evaluation descriptors), taxonomic (species, sub-species),geographical (longitude, latitude, topography), ethnobotanical(ethnic group, vernacular name, plant uses), etc. However, themathematicalclassificationofdescriptorsisindependentofthesediscipline-basedgroupings.

�0


The mathematical type of a descriptor determines the statisticalfunctions that can be employed for its analysis. For example,parametric correlations (Pearson’s r) may be calculated betweenquantitative descriptors, while non-parametric correlations (suchas Kendall’s τ ) may be used on ordered descriptors that are notquantitative, and chi-squared for qualitative descriptors. This is acritical point. It is not just that, for example, correlation does notworkwithfactors,butthatitdoesnotmakesense.Thethreemainmathematical types of descriptors that we will deal with here areclassifiedasNumeric,FactorandOrderedfactorsinR.Table2showshowtheserelatetosomecommonmorphologicaldescriptors.

It is regrettably common to see unordered qualitative (nominal orcategorical)descriptorsanalysedasiftheywerenumeric,simplybyreplacing,forexample,eachcolourwithanumber,i.e.red,blueandgreenwith1,2and3.Thisistantamounttosayingthatredisclosertobluethanredistogreen,whichcanobviouslydistorttheresultsofastatisticalanalysisthatismadebasedonthispremise.

Table 2.DifferentmathematicaltypesofdescriptorsandhowtheyaredefinedinR.

R-mode Descriptor types Examples

Factor Binary(twostates,presence-absence) Planttype(indeterminate,determinate)

Nodulationactivity(absent,present)

Prominenceofleafvein(yes,no)

Multi-state(manystates)

Non-ordered(qualitative,nominal,attributes)

Leafpubescence(uppersurfaceofleaf,lowersurface,bothupperandlowersurfaces,onlyleafmargin)

Seedshape(oblate,triangular,rhomboid,square,obtriangular,spherical)

Seedcoatcolour(grey,brown,yellow-green,pink,red-purple,grey-mottled)

Orderedfactor

Ordered

Semi-quantitative(rank-ordered,ordinal)

Plantgrowthhabit(prostrate,spreading,semi-erect,erect)

Leafcolour(lightgreen,green,darkgreen)

Soilparticlesizeclasses(clay,finesilt,coarsesilt,veryfinesand,finesand,mediumsand,coarsesand,verycoarsesand)

NumericorInteger

Quantitative(metric,measurement)

Discontinuous(meristic,discrete) Numberofseedperpod(integers)

Continuous Maturationperiod(days)

Pedunclelength(cm)

Seedvolume(cm³)

Seedyieldperplant(g)

��


Abitofgoodpractice:solvetheproblematitsrootanddonotcodelevels of qualitative factors as 1, 2, 3. In the example above use“red”,“blue”and“green”.Thatwayyouavoidforgettingtorecodethevariablesandavoidforgettingwhatthecodingmeans!

Rwill automatically identifyalphanumericvariables (variableswithletters,orlettersandnumbers)asFactorsandwillidentifynumericvariables as either numeric (if they have decimals) or integers (iftheyhavenodecimals).However,whenFactorvariablesarecodedusingnumbers,Rcanobviouslynotdetectthisbyitself.Rcanalsonotdetect ifavariable isanordered factor,ordetect theorderofunnumbered variable states. In these cases, we must tell R whatvariabletypeswehavebeforeweproceedwiththeanalysis.

Inthecaseofoursorghumdata,thevariablesweredescribedinTable1.VariablesPCtoKParemorphologicaldescriptorsandcontainthedatathatweareinterestedinanalysinginthefirstinstance.

However,someofthesevariablesarecodedwithnumbers,thoughtheyare in fact factorsororderedfactors.Weneedto letRknowthis,whichisdonewiththefollowingcommands:

sorghum$LD<-ordered(sorghum$LD)sorghum$SN<-factor(sorghum$SN)sorghum$EN<-ordered(sorghum$EN)sorghum$BD<-ordered(sorghum$BD)sorghum$PA<-ordered(sorghum$PA)sorghum$EX<-ordered(sorghum$EX)sorghum$EC<-factor(sorghum$EC)sorghum$ET<-factor(sorghum$ET)sorghum$KL<-factor(sorghum$KL)sorghum$SC<-factor(sorghum$SC)sorghum$KP<-factor(sorghum$KP)

Nowtrylookingatthestructureoncemore:

str(sorghum)

Saving your work, exiting R and getting backAtthispointyoumaywanttosaveyourwork,closeRandtakearest.Click “File|Savehistory…”andall yourcommands, includingeverymistakethatyoumadeontheway,getsavedtoatextfile.Ifyougivethefileanamewitha.Rextension,e.g.temp.R,youcanopenthisfileinR’stexteditorwith“File|Opensript…”.Youmightwanttoedityourerrorsordeletecertainlines.

��


You can exit R with “File|Exit”. When prompted to save yourworkspace image (Figure9),myadvice is toclick “No”.Thatwayyouavoidbuildingupgarbageinyourworkspace.

WhenyoustartRagain,yousimplyset theworkingdirectoryandopenyoursavedscript.YourunthecommandsinRbyhighlightingtheminR’seditorandpressingCtrl-R.Thiswillgetyourightbacktowhereyouleftoff.Youcanrunthewholefile,oryoucanusejustthelinesthatyouneedtogetyouwhereyouwanttobe.

Selecting a subset of your data to analyseForalmostallanalysesthatyoudo,youwillwanttoselectasubsetof the data. In this case, you want to select the variables withquantitativemorphologicalcharacterizationdata.

quantsorghum<-subset(sorghum,select=EV:KW)

Seewhatthisgavewith:

str(quantsorghum)

Youcangetmoreinformationandfurtherexamplesbytyping:

?subset

Nowanalysethequantitativedataaswiththeirisdataset:

distquant<-dist(quantsorghum)hclustquant<-hclust(distquant)plot(hclustquant)

Figure 9. Makesurethatyouhavesavedthecommandsthatyouwanttouseagainbeforeclicking“No”onexit.

��


Though you get a pretty nice dendrogram, it was based on onlyone-third of your variables, and even they should probably havebeenstandardizedsothatavariabledoesnotbecomeathousandtimesmoreimportantbecauseyoumeasureit inmillimetresratherthanmetres.

Thefunctiondist() calculatestheEuclidiandistance(thesquareroot of the sum of the squared distances along each dimension/variable) by default, but can also calculate maximum, Manhattan,Canberraorbinarydistances;noneofthemaresuitableforFactorvariables. In practical terms, this means that we cannot use thefunctiondist().ButRhassolutions,soreadon!

cluster analysis with mixed data typesIt is disturbingly common to see a cluster analysis of mixed datatypes carried out as though all the data was numeric by simplygivinganumbertoeachvariablestate,e.g.red=1,blue=2,green=3.Mathematically,thisislikesayingthatredisclosertobluethanitistogreen.Thisisofcoursenonsensethatcanseriouslydistortyouranalysis.

As an alternative, R provides a function daisy() in the librarycluster (Struyff et al., 1996) based on a general coefficient ofdissimilarity proposed by Gower (1971). This combines differenttypesofdescriptors,andprocesseseachoneaccordingtoitsownmathematicaltype.

So this time round you want to select all the morphologicalvariables.ThefirstoftheseisPCandthelastisKPandincludesallthevariablesinbetween,socanbewritten:

morphsorghum<-subset(sorghum,select=PC:KP)

Sincethedaisy() function is inthecluster library,youmust firstloadtheclusterlibrary.Todothis,type:

library(cluster)daisymorph<-daisy(morphsorghum)hclustmorph<-hclust(daisymorph)plot(hclustmorph)

Before going on, select ‘History|Recording’ on the menu in thegraphicswindow.Thiswillstorethegraphsthatyoucreatefortherestof thesessionandyoucanmovebackand forth through thesession’sgraphsbyusingthepageupandpagedownkeys.

��


Nowtryembellishingthelasttwolinesabittochangethemethodofagglomeration,addvarietynamesaslabelsandreducethefontsize.There isnoneed toexecute the first two linesagainas youhave already loaded the library cluster and created the objectdaisymorph.Theoptionylab=givesthelabelforthey-axis.

hclustmorph<-hclust(daisymorph,method=”average”) plot(hclustmorph,labels=sorghum$LOCAL.NAME,cex=0.5, ylab=”Dissimilarity”)library(gclus)hclustreord<-reorder.hclust(hclustmorph,daisymorph) plot(hclustreord,labels=sorghum$LOCAL.NAME,cex=0.5,

ylab=”Dissimilarity”)

Have a careful look at how the farmer variety names are spreadacross the cluster dendrogram in Figure10. This shows just howloosely farmer variety names are used in Zimbabwe. The namesoftenrefertoonetothreespecifictraits,andallotheraspectsoftheplantcanvaryconsiderablyfromonesitetoanother.Havealookateachaccession’stownoforigin,bychangingthelastlineto:

plot(hclustreord,labels=sorghum$TOWN,cex=0.5, ylab=”Dissimilarity”)

How good is the cluster dendrogram?Withsomanydifferentwaysofdoingclusteranalyses,andeachonegivingadifferentresult,itisnaturaltoaskwhichisthebestmethod?Theansweristhattherereallyisnowayofknowingforcertain,butthereareafewwaysinwhichyoucanattempttoanswerpartofthe

What does daisy() do?ThefunctionslightlyextendsGower’sdefinitionsoastocoverordinalandratio variables in addition to quantitative and multi-state variables. Thedefinitionofthedissimilarityindaisy(),d(i,j),is

d(i,j) = ∈ [0,1]

where

d=contributionofvariableftod(i,j),whichdependsonitstype:• fbinaryornominal:d=0ifxif=xjf,andd =1otherwise,

• fintervalscaled:d = ,

• fordinalorratio-scaled:computeranksrifandzif= andtreatthesezifasinterval-scaled,

andδ = weight of variable f , where:δ = 0, if xif or xjf is missing;δ=0,ifxif=xjf=0andvariablefisasymmetricbinary;otherwiseδ=1.

p

f=1∑ δ d(f )ij

(f )ij

p

f=1∑ δ f ij

(f )ij

(f )ij

(f )ij

| xif–xjf |

maxh xhf–minhfxjf rif–1

maxh rhf–1

(f )ij

(f )ij

(f )ij

(f )ij

��


questionwiththehelpofR.

Thefirstpartofthequestioniswhetherthemeasureofdissimilarityordistanceisthemostappropriate.Thoughthiscaninfluenceyourresultsconsiderably,wearenotgoingtoenterintothisdiscussionherebeyondthepointswehavealreadymadeaboutusingmeasuresappropriatetothetypeofdatathatyouhave.AgoodintroductorydiscussionofthisissueandotheraspectsoftheanalysisofgeneticdiversitycanbefoundinMohammadiandPrasanna(2003)andinmoredetailinLegendreandLegendre(1998).

Thesecondpartofthequestionrelatestohowwellthedistanceordissimilarity matrix is represented graphically. One way of lookingatthisisthatthedistancesrepresentedonthediagramshouldbeascloseaspossibletothedistancesinthedistanceordissimilaritymatrix. The following test extracts the distances (height) in thediagramusingthefunctioncophenetic() andthencorrelatesthiswiththeoriginaldistanceordissimilaritymatrix:

Figure 10. Thedendrogramshowingfarmervarietynames.

Dew

eM

atip

aLo

ose

head

Mut

ode

Mho

nda

Chi

fum

bata

Rug

are

Mui

seM

ayer

eM

osos

oM

avoy

ana

Bedh

lana

Mab

oyan

aU

mku

mbe

Ngu

man

eM

ungo

reM

hand

oR

ongw

eM

ucha

ini

Mur

ara

Rud

ebar

eM

asin

tiri

Mho

nde

Jayt

aIs

ifum

bata

Chi

bula

nye

Mas

hava

Fum

bata

Vedy

ana

Chi

tate

Chi

bedy

ane

Tsw

eta

Chi

kom

beTs

wet

aC

hifu

mba

taC

heta

chey

aTs

wet

aN

gum

ane

Sego

bane

Gw

usi

Chi

pond

aTs

wet

aN

gum

ane

Tsw

eta

Tsw

eta R

udeb

are

Tsw

eta

Tsw

eta

Tsw

eta

May

akay

aka

Tsw

edan

iVe

dyan

aC

hifu

mba

taC

hita

teC

him

ondo

Com

pact

hea

dC

hikr

ocha

niVe

dyan

aN

gum

ane

Tsw

eta

Che

tach

eya

Mat

sere

nden

deM

alan

disa

Mal

andi

saM

alan

disa

Mut

ode

Chi

fum

bata

Impa

laR

ushu

tura

Tsw

eta

Chi

vedy

ana

Chi

vedy

ana

Chi

bedl

ani

Run

dend

eM

atip

aM

ungo

reM

ucha

ini

Uda

yaka

yaka

Chi

fum

bata

Run

dend

eR

unde

nde

Gan

gara

Zang

eway

aM

ungo

nelle

ngo

Mun

gone

lleng

oKa

bhur

uway

oPu

mba

ta W

hite

Mbw

ende

Chi

fum

bata

Tsw

eta

Chi

fum

bata

Mur

uru

Tsw

eta

Chi

fum

bata Run

dund

eM

asha

vaC

hiun

daTs

wet

aG

anga

raG

anga

raR

unde

wda

Chi

kota

Chi

vend

awN

gaim

aR

ed S

waz

iG

anga

raM

usos

oM

usos

oC

hiko

taC

hiko

ndom

aN

zetu

Nze

tuC

hiko

taC

hiko

taKa

nzw

ono

Bim

beM

utan

daC

hiko

taC

hiko

taC

hiko

taC

hiot

aC

him

hond

oro

Nze

nde

Mab

umbe

Bim

baC

hinz

etu

Tsw

edan

iM

utod

eR

usut

ura

Tsw

eta

Chi

kom

beTs

wet

aC

hibu

kuG

anga

reKa

bhur

uway

oTs

wet

aC

him

utod

e

0.00

0.05

0.10

0.15

0.20

0.25

0.30

0.35

Cluster Dendrogram

hclust (*, "average")daisymorph

Dis

sim

ilarit

y

��


cophenmorph<-cophenetic(hclustreord)

First download the package vegan as described earlier for thepackagegclus,andloadthelibrary.

library(vegan)mantel(cophenmorph,daisymorph)

Themanteltestwillgiveyouthecorrelationbetweentwomatrices.Bygoingbackandreconstructingtheobjecthclustmorphusingdifferentmethodsofclusteringoragglomeration,wecancomparetheabilityofthesemethodstoreflectthedistancesordissimilaritiesthatwehavecalculated,e.g.:

hclustward<-hclust(daisymorph,method=”ward”)hclustreward<-reorder.hclust(hclustward,daisymorph)cophenreward<-cophenetic(hclustreward)mantel(cophenreward,daisymorph)

Whenwecomparethecoefficientsobtainedfromthetwomanteltests,wesee that in thiscase themethod“average”,commonlyknownasUPGMA(UnweightedPairGroupMethodwithArithmeticmean),hasbyfarthehighestcorrelationcoefficient,whichmakesit easy to interpret and probably goes a long way to explainingits popularity. This does not necessarily imply that it correctlyaddresses a third question, namely “How well does it clusteror agglomerate the accessions?” This is a much more difficultquestiontoanswer,asthere isnocorrectresultagainstwhichtotestdifferentmethods.

Pruning dendrogramsWith larger data sets, dendrograms become unwieldy and canbe impossible to display on one page. It can then be useful toshow just the upper part of the dendrogram. This can be donewithanumberofdifferent functions inR,butmyownpreferenceis the function clip.clust() from the package maptree. Aswith vegan,maptree is a contributed package and needs to bedownloadedandinstalledpriortouse.Sincesomeofthecommandsinmaptreedependon functions in thepackagecombinat, thispackageisdownloadedautomaticallywithmaptree.

How do you decide where to prune a hierarchical cluster tree?Kelley et al. (1996) proposed a method that can help you decidebycalculatingapenaltyfunctionforahierarchicalclustertree.Thisfunctioncomparesthemeanacrossallclusterswiththemeanwithinclustersofthedissimilaritymeasure.

TheRfunctioniskgs()andalsocomesinthepackagemaptree.

��


butinadditionrequiresthatthepackagecombinat isloaded.Tocalculatethepenaltyfunction,type:

library(maptree)kgsmorph<-kgs(hclustmorph,daisymorph,maxclust=50)plot(names(kgsmorph),kgsmorph)

The lowest value shows you the optimum number of clusters,in this case 9 clusters (see Figure 11)². Like with so much incluster analysis, do not take this as the definitive answer. Youmay well have other reasons to select a different number, suchas a requirement to select a specific number of accessions thatrepresent a collection as well as possible. There are almostcertainly other methods that will give you other conclusions.Nonetheless,thisisausefulguide.

So we have chosen to cut the dendrogram into 9 clusters(Figure12):

²Theoption‘maxclust=’setsthemaximumnumberofclustersforwhichtocomputethemeasure,asitisnotusefultocalculateitforveryhighnumbersandthecalculationstakealongtime.

Figure 11. Kelley-Gardner-Sutcliffepenaltyfunctionforthesorghumdatashowingthattheoptimumnumberofclustersisnine.

10 20 30 40 50

2030

4050

names(kgsmorph)

kgsm

orph

��


clipreord<-clip.clust(hclustreord,sorghum,k=9)plot(clipreord)

Now, itwouldbeuseful toknowhowmanyaccessionsbelong toeachofthegroups.Thefollowingcode:

groupk9<-group.clust(hclustreord,k=9)table(groupk9)

givesthenumberofaccessionsineachgroup:

groupk9 1 2 3 4 5 6 7 8 9 6 10 2 55 17 23 13 8 3

Typingtheobjectnamegroupk9willgiveyougroupmembershipforeachaccession in theorder that theaccessionsappear in thedataframe:

groupk9

The output can be a bit difficult to decipher, so try binding thegroups to the original data file so that you can see clearly whichaccessionsbelongtowhichgroup:

sorghum<-cbind(sorghum,groupk9)str(sorghum)

Figure 12. Cluster dendrogram pruned to eight clusters usingclip.clust().

1 2

3

4 5

6 7

8

9

0.00

0.02

0.04

0.06

0.08

0.10

0.12

Cluster Dendrogram

hclust (*, "average")daisymorph

Hei

ght

�9


Youcanseegroupmembershipagainsteachaccessionnumberwith:

subset(sorghum,select=c(TGR,groupk9))

The select statement selects the variables to display. Thisarrangement of the dataframe now opens up table views thatcangiveabetter ideaof thepropertiesofeachcluster.Using thefunction table(), you can now look at how different qualitativetraitsarespreadoverthedifferentclusters,e.g.:

table(sorghum$MC,sorghum$groupk9)

whichwillgiveyou:

1 2 3 4 5 6 7 8 9C 1 0 1 52 17 23 13 8 3D 4 10 1 3 0 0 0 0 0Y 1 0 0 0 0 0 0 0 0

Tryothervariablestootolookforinterestingpatternsinthedataset.

Advanced clusteringTheexamplesofhierarchicalclusteringthatwehavegonethroughaboveareinrealityonlyafewofverymanydifferentwaysofdoingclustering.Forexample,becausethesizeofthedissimilaritymatrixrises exponentially with the number of accessions, hierarchicalclustering quickly runs into problems with very large datasets(approximatelyover500accessions).Roffersaverylargenumberofotherpossibilities,beingperhapsbestknownforthefunctionsinthepackagecluster,describedbyStruyffet al.(1996).

Adding molecular markersThis tutorial focuses on the analysis of morphological characterizationdata,astherearemanyotherguidestotheanalysisofmoleculardata.However, theremaybe timeswhenyouwouldwant todoacombinedanalysisofyourmorphologicalandmoleculardata.Thefunctiondaisy()allowsyoutodothisbydefiningyourmolecularvariablesasasymmetricbinary(seetheboxondaisy()above).Thisisequivalenttothesimplematchingcoefficient,whichisfrequentlyusedwithmoleculardata.

Contrary to the way that other variables are defined in thedataframe itself, asymmetric binary variables are defined whenexecuting daisy(). For example, if the data set sorghum hadcontained data for twenty molecular marker variables calledmm1 tomm20,wewouldhavehad:

morphmm<-subset(sorghum,select=c(PC:KP,mm1:mm20) daimorpmm<-daisy(morphmm,type=list(asymm=c(mm1:mm20)))

�0


Toseewhatotherfunctionsthepackagecontainstrytyping:

library(help=cluster)

Another package with a more eclectic range of functions thatperform what is called model-based clustering is the packagemclust,describedbyFraleyandRaftery(2002).

R isdevelopingsoquicklythat it isnotpossibletogiveanup-to-dateoverviewofallthefunctionsforclustering,butacompletelistcanbeobtainedbydownloadingallthecontributedpackagesandusingtheHTMLhelp:click‘Help|HTMLhelp’,thenclickonthelink‘SearchEngineandKeywords’andmovedownthepagetothelink‘cluster’.

ReferencesAnderson E. 1935. The irises of the Gaspe Peninsula. Bulletin of the

American Iris Society,59:2–5.

Appa Rao S, Mushonga JN. 1987. A catalogue of passport andcharacterization data of sorghum, pearl millet and finger milletgermplasmfromZimbabwe.IBPGR,Rome,Italy.

Fraley C, Raftery AE. 2002. MCLUST: Software for model-basedclustering, density estimation and discriminant analysis.Technical Report, Department of Statistics, University ofWashington, Seattle, Washington, USA. Available from:http://www.stat.washington.edu/mclust Date accessed: 19 July2007.

Gower JC. 1971. A general coefficient of similarity and some of itsproperties.Biometrics27:857–871.

Kelley LA, Gardner SP, Sutcliffe MJ. 1996. An automated approachfor clustering an ensemble of NMR-derived protein structuresinto conformationally-related subfamilies. Protein Engineering,9:1063–1065.

LegendreP,LegendreL.1998.Numericalecology.2ndEnglishedition.ElsevierScienceBV,Amsterdam,TheNetherlands.xv+853p.

Mohammadi SA, Prasanna BM. 2003. Analysis of genetic diversityin crop plants – salient statistical tools and considerations. Crop Science,43:1235–1248.

Struyff A, Hubert M, Rousseeuw PJ. 1996. Clustering in an object-oriented environment. Journal of Statistical Software 1(4) [on-line].Available from: http://www.jstatsoft.org/v01/i04/paper/clus.pdf.Dateaccessed:19July2007.

Van Hintum ThJL, Brown AHD, Spillane C, Hodgkin T. 2000. Corecollectionsofplantgeneticresources.IPGRITechnical BulletinNo.3.BioversityInternational,Rome,Italy.

��

cHAPTeR � DIveRSITy INDIceS

By Mikkel Grum

The following commands from the previouschapterwillgetyouwhereyouneedtobetocarryouttheexercisesinthischapter.

sorghum<-read.table(“Sorghum_Data .txt”,header=T,sep=”\t”)sorghum$LD<-ordered(sorghum$LD)sorghum$SN<-factor(sorghum$SN)sorghum$EN<-ordered(sorghum$EN)sorghum$BD<-ordered(sorghum$BD)sorghum$PA<-ordered(sorghum$PA)sorghum$EX<-ordered(sorghum$EX)sorghum$EC<-factor(sorghum$EC)sorghum$ET<-factor(sorghum$ET)sorghum$KL<-factor(sorghum$KL)sorghum$SC<-factor(sorghum$SC)sorghum$KP<-factor(sorghum$KP)

morphsorghum<-subset(sorghum, select=PC:KP)library(cluster)daisymorph<-daisy(morphsorghum)hclustmorph<-hclust(daisymorph, method=”average”)library(gclus)hclustreord<-reorder.hclust(hclust morph,daisymorph)library(maptree)groupk9<-group.clust(hclustreord,k=9)

sorghum<-cbind(sorghum,groupk9)

Ready!

Richness, Shannon-weaver and SimpsonIn ecological research, a number of indices areoftenusedtomeasurespeciesdiversityindifferentcommunities. After ‘richness’, which is simply thenumber of species represented in the community,the two most common indices are probably theShannon-Weaver index and the Simpson index.














chapte

r �

��


The Shannon-Weaver index combines a measure of richness witha measure of evenness. Higher values indicate more diversity. TheSimpson index is between zero and one. It measures evenness ofspeciesorgroupmembership.Itcanalsobeinterpretedasthelikelihoodoftworandomlyselectedaccessionsbeingdifferentfromeachother.

Whenlookingatintra-specificdiversitywecouldusemorphologicalclusters instead of species. Magurran (1998) and Kindt and Coe(2005) provide more detailed descriptions of diversity indices andare good places to continue for the reader who is interested inworkingfurtherwiththemeasurementofbiodiversity.

Theoutputtablefromthecommand:

provk9<-table(sorghum$PROV,sorghum$groupk9)provk9

gives us the number of accessions collected from each group ineachprovince.Thecolumnsrepresentthegroupsandtherowstheprovinces:

1 2 3 4 5 6 7 8 9 MDL 0 7 0 19 2 4 0 5 1 MLC 2 1 0 2 3 3 0 0 1 MLN 0 0 0 8 1 1 0 0 0 MLW 0 0 1 4 0 4 4 0 0 MNL 1 1 0 3 3 7 9 1 1 MSV 3 1 1 19 8 4 0 2 0

Thistypeoftable,withthenumbersorfrequenciesofeachspeciesorgroupmembershipforeachsiteorarea,isknownasacommunitydata matrix. The matrix can be transposed to give a view that ismoreeasilycomparedtotheplotsthatfollowusingthecommand:

t(provk9)

Richness,definedhereasthenumberofgroupsrepresentedineachprovince,canbecalculatedusing:

specnumber(provk9)

withtheresults:

MDL MLC MLN MLW MNL MSV 6 6 3 4 8 7

Wecangettoseethecodebehindthesecalculationsbytyping:

specnumber

whichgivesus:

function (x, MARGIN = 1) {

��


apply(x > 0, MARGIN, sum)}

Thefunctionsimplyappliesthefunctionsumalongthe‘margin 1’,i.e. inour example, it calculates the sumof varieties found (morethan0times)ineachprovinceinthetableprovk9.

For comparison, the number of accessions collected in eachprovincecanbeobtainedusing:

table(sorghum$PROV)MDL MLC MLN MLW MNL MSV 38 12 10 13 26 38

The package vegan has a function, diversity() that cancalculateboth theShannon-Weaver indexand theSimpson indexon a community data matrix. The default is the Shannon-Weaverindex(Figure13):

library(vegan)swi<-diversity(provk9)barplot(swi)

Figure 13. Shannon-Weaver index for intra-specific sorghumdiversitybasedonninemorphologicalgroups.

MDL MLC MLN MLW MNL MSV

0.0

0.5

1.0

1.5

��


³Thesquarerootofthevalueisused,becausethisgivesanalmostlinearcorrelationwithrichnessinmostcasesandrichnessisthediversitymeasurethatmostpeoplefeelmostcomfortablewith.

Addcoloursusingthefollowingcommands:

barplot(swi,col=rainbow(20))barplot(swi,col=”lightblue”)

Thefirstcommandwillgiveyourbarsarangeofcolours fromtherainbow,andthesecondwillmakeyourbarslightblue.

TogettheSimpsonindex,youneedtospecifyit:

simp<-diversity(provk9,index=”simpson”)barplot(simp)

YoucangetthecorrelationcoefficientbetweentheShannon-WeaverandSimpsonindiceswith:

cor(swi,simp)

Whatisthecorrelationcoefficientbetweenthetwoifyoucalculatethemforeachtown,ratherthanprovince?Whatdoesthistellyouaboutthetwoindices?

Rolling your own functionIfyourobjectiveweretoevaluateatown’scontributiontoZimbabwe’sconservation effort, then the richness, diversity and evennessmeasuresabovedonottakeaccountofhowcommontheclustersareinZimbabweasawhole.

SinceIcouldnotfindanyfunctioninanysoftwarethattackledthisquestiontomysatisfaction, Idevelopedmyown index.This isanindexthatweightseachclusterforrareness,bydividingthenumberof observations of that cluster in the province (or town, farm, orwhateverotherunityoumightwanttodotheanalysison)withthetotalnumberofobservationsofthatclusterinyourstudyarea(inthiscaseZimbabwe).Thesquarerootofthevalue³foreachtownisthensummedacrossallvarietiesasin:

CI=∑√where

aij isthenumberofobservationsofvarietyiinprovincej,andAiisthetotalnumberofobservationsofvarietyiinthestudyarea.

Wewritethisasafunctionwiththefollowingscript:

naij

Ai

��


ci <- function(x, MARGIN = 2){ x <- as.matrix(x) Ai <- apply(x, MARGIN, sum) qq <- function (aij) { sqrt(aij/Ai) } Q <- apply(x, 1, qq) P <- apply(Q, 2, sum) P}

Oncethefunction iscreatedbyrunningthe linesabove, itcanbeusedtocalculateournewindexinthesamewaythatthediversityfunctionwasused.

ci(provk9)

Plottingthesevaluesagainstrichnessusing

plot(specnumber(provk9),ci(provk9), type="n")text(specnumber(provk9),ci(provk9),row. names(provk9))

Figure 14. Plot of CI scores against richness in each province inZimbabwe.

3 4 5 6 7 8

1.0

1.5

2.0

2.5

3.0

3.5

4.0

specnumber(provk9)

ci(p

rovk

9)

MDL

MLC

MLN

MLW

MNL

MSV

��


wesee thatalthoughbothMDLandMLChavesixclusterseach,MDL scores higher on the CI scale, because the clusters in theprovincearerarerandthereforecontributemoretotheoveralllevelof sorghum diversity in Zimbabwe. This would seem like valuableinformationinprioritizingconservationareas!

Forcodecomparisonandabetterunderstandingofhowyoumightconstructyourownfunction,youmightwanttotakea lookatthecodeforthediversity()functionwith:

diversity

Frequentuseofthepossibilityofseeingthecodeusedinfunctionsisoneofthebestwaysoflearninghowtomakeyourownfunctions.

Dissimilarity measures for estimating diversityMostoftheseindicesdonottakegeneticdistancesintoaccount,sotwocloselyrelatedgroupsareconsideredasdiverseastworelativelyunrelatedgroups.Thisisafairlyacceptableapproachwhenusedforspecies diversity, because barriers to gene flow among speciesresult in fairlydiscreteunits.However,whenused for theanalysisof intra-specificdiversity theassumptionofdistinctclasseswithinspecies is less well-founded, partly because there may be quitesignificantgeneflowwithinspeciesandpartlybecauseofdifficultiesin fixing criteria for intra-specific classification. Therefore a betterapproachmaybetomakedirectuseofthedissimilaritymeasurestoobtainestimatesofdiversity.

Oneanalysis inR thatgivesausefulpictureofdiversityacrossanumberofunits is the functionanosim() in thepackagevegan.This analysis ranks all the dissimilarities among accessions andproducesaboxplotoftheranksofdissimilaritieswithinagivenunit,e.g.provinces.Thecommands:

anomorph<-anosim(daisymorph,sorghum$PROV)plot(anomorph)

willgivetheplotinFigure15.Thisplotisveryinformative,althoughits interpretation can be a bit challenging. The boxes show theminimum,firstquartile,median,thirdquartileandmaximumrankofthedissimilaritieswithineachprovince,e.g.themediandissimilaritywithin the Manicaland Province (MNL) is ranked 3837th and thehighestdissimilarity is ranked9295thof the total9316dissimilarityvaluesinthematrix.Theboxesaredrawnwithwidthsproportionaltothesquare-rootsofthenumberofobservationsinthegroups.

One observation that we make in Figure 15 is that the mediandissimilarity rank for MLC is substantially higher than any of the

��


others, indicating that diversity may in fact be high in MLC. Withfewer accessions collected, 12 accessions in MLC against 26accessionsinMNL,thelowerShannon-WeaverindexforMLCmightsimplybearesultofasmallsamplesize.

Figure 15. Boxplotoftherankorderofdissimilaritiesbyprovince.

Between MDL MLC MLN MLW MNL MSV

020

0040

0060

0080

00

R = 0.082 , P = 0.001

Figure 16. BoxplotoftherankorderofdissimilaritiesbyspeciesinAnderson'sirisdataset.

Between setosa versicolor virginica

020

0040

0060

0080

0010

000

R = 0.879 , P = < 0.001

��


Anotherquickexampleshowshowdifferenttheresultscanbe:

data(iris)distiris<-dist(iris[,1:4])anoiris<-anosim(distiris,iris$Species)plot(anoiris)

In this example, Figure 16 shows, not surprisingly, that most ofthe dissimilarity is between species, with much less dissimilaritywithin species. Most of the samples of the species Iris setosaareverysimilar toeachother,but therearea fewoutliersthataremorphologicallydistinctfromtherestofthesamplesofthespecies.Their higher dissimilarity is shown as dots above the whiskers ofthebox.

SummaryInthischapterweextendedtheworkfromthepreviouschaptertolook at how we could use some of the common diversity indiceswithintra-specificdiversitybasedonmorphologicaldata.Webrieflylooked at how we might develop our own function, in this caseto take account of each region’s contribution to overall diversityin the study area. We also looked at how we could deal with thecontinuous nature of intra-specific diversity, which rarely exists indiscreteunits.

This tutorial has been a stepwise guide through the analysis ofmorphologicalcharacterizationdataandthespecificproblemsthattheir mathematical properties present. I have seen users with noexperienceofRgothroughthesestepsonelineatatime,modifythecodetodealwith theirowndataandget the results that theywant. At the same time, this is not a complete introduction to R,and thoseusers thatwould like tomoveon tootheranalyseswillneedadditionalreferences.ThesoftwareRcomeswithabeginnersmanualentitled“AnintroductiontoR”thatcanbefoundinthemenuunder“Help|Manuals(inPDF)”.ManyotherreferencescanbefoundontheRProjectwebsite:www.r-project.org/.

ReferencesMagurranAE. 1988. Ecological diversity and its measurement.Croom

Helm,London,UK.

KindtR,CoeR.2005.Treediversityanalysis:amanualandsoftwareforcommonstatisticalmethodsforecologicalandbiodiversitystudies.ICRAF,Nairobi,Kenya.203p.

�9


cHAPTeR � TRoUbleSHooTING

Thissection listssomeofthecommonproblemsanderrormessageswecameacrossinusingthistutorial. For other error messages encounteredwhenusingR,thebestoptionistouseanyoneofthesearchenginesfoundontheRprojectwebsiteforhelp(http://www.r-project.org/).

1. “Error: Object " . . . " not found”Youhavetriedtoexecutea functionwithout firstloading the library or package that contains thefunction.Solution:loadthelibraryorpackage.

2. I can’t read the text in the cluster diagrams; it’s too small and cramped.Have you maximized the graphics window?Double-clickingonthebluebaracrossthetopofthegraphicswindowwilldothis.ForprintingfromR,youcansetthepagetolandscapebyclickingonthepropertiesbuttonintheprintdialogueboxand selecting ’landscape‘. You can also use theoptioncex=0.5oranothernumbertoreducethesizeofthefont.

3. CapitalizationR is sensitive to capitalization; if the variable iscallediris$Species, theniris$specieswillnot work. If options are capitalized incorrectly,e.g.writingmethod=Average,ormethod=Ward,thedefaultmethodisused.Inboththeexamplesused,themethodshouldhavebeenwrittenwithoutcapital letters,sothedefaultmethod=completeisusedinstead.Nowarningisgiven.

4. “Error in file(file, "r") : unable to open connection”You forgot to set the working directory. Clicking’File|Changedir…|Browse‘fromthemenuwillleadyouthroughthis(seeFigure8).














chapte

r �

�0


5. My data won’t import (a)Servesyouright!YouareusingExcelanddidnotbelievemewhenIsaidthatyoushouldcutandpasteyourdata,andonlythedata,toanewspreadsheetbeforesavingitasatab-delimitedfile.

6. My data won’t import (b)The only other problem I know of that messes up importing of asimpledatasetofrowsandcolumnsiswhenyouhaveapostrophesinsomeofyourvariables,e.g.namesliken’goni.AsfarasIknow,you have to remove the apostrophes before importing the data.Accentsandmostothersymbolspresentnoproblem.

ISBN 978-92-9043-735-2

IPGRIandINIBAPoperateunderthenameBioversityInternational

SupportedbytheCGIAR

handbooks for genebanks no. 9 statistical analysis · main r self-extracting executable file in it:...

Documents