Multivariate analysis

Daniel Borcard
Département de sciences biologiques
Université de Montréal
C.P. 6128, succursale Centre-Ville
Montréal QC H3C 3J7, Canada
[email protected]

Foreword: this document is heavily based on the following book, with permission of Pierre Legendre:

Legendre, P. & L. Legendre. 1998. Numerical ecology. Second English Edition. Elsevier, Amsterdam.

This book is a MUST! It contains, among many other topics, all the mathematical developments that have been deliberately excluded from this summary. Many of the paragraphs, phrases, and several figures and tables come directly from this book. To Pierre Legendre I express my deepest thanks for his permission to use this material, as well as for his willingness to answer many, sometimes contorted questions.

i. Additional references, software and definitions

i.1 Additional references

Jongman, R. H. G., C. J. F. ter Braak & O. F. R. van Tongeren. 1995. Data analysis in community and landscape ecology. Cambridge University Press, Cambridge.

Mainly regression, ordination and spatial analysis.

Legendre, L. & P. Legendre. 1984. Ecologie numérique. Vol. 1 & 2. Masson, Paris.

The French edition, still useful; many important topics were not available at that time, though.

Université Laval Multivariate analysis - February 2006 2

Daniel Borcard Université de Montréal

Ter Braak, C. J. F. & P. Smilauer. 2002. CANOCO reference manual and CanoDraw for Windows user's guide: software for canonical community ordination (version 4.5). Microcomputer Power, Ithaca.

Much more than a simple user's manual of the latest version of the time-honored program Canoco. Very important for people interested in canonical analysis of experimental data coming from ANOVA designs.

i.2 Software

• Excel XP for Windows (2002 or X for Mac OS)

Preparation of tabular data, simple statistics and graphics.

• R 2.1.0 for Windows, Mac OS X or Linux

General statistics, matrix algebra, multivariate statistics (cluster analysis, ordination...). Open source, clone of S-Plus.

Packages dedicated to numerical ecology: vegan, labdsv, ade4

http://stat.ethz.ch/CRAN/

http://cc.oulu.fi/~jarioksa/softhelp/vegan.html

http://labdsv.nr.usu.edu/

http://pbil.univ-lyon1.fr/R/rplus/

• CANOCO 4.5 for Windows (3.1 for Mac OS)

Constrained and unconstrained ordination

Commercial software developed by Cajo Ter Braak

http://www.plant.dlo.nl/default.asp?section=products&page=/products/canoco/right.htm

http://www.microcomputerpower.com


• R Package 4.0d8 for Mac OS (Classic environment)

Association matrices (many coefficients), constrained and unconstrained clustering, unconstrained ordination, spatial analysis, ordination graphical support, Mantel test. Freeware, work in progress (P. Legendre and Ph. Casgrain). Not to be confused with the R language!

http://www.bio.umontreal.ca/legendre/

i.3 Definitions

Numerical ecology: "the field of quantitative ecology devoted to the numerical analysis of ecological data sets. (...) The purpose of numerical ecology is to describe and interpret the structure of data sets by combining a variety of numerical approaches. Numerical ecology differs from descriptive or inferential biological statistics in that it extensively uses non-statistical procedures, and systematically combines relevant multidimensional statistical methods with non-statistical numerical techniques (e.g. cluster analysis) (...)" (Legendre & Legendre, 1998).

Let us add that a great number of the methods in numerical ecology, especially the new approaches developed since the 1980s, have been devised by ecologists (and not pure statisticians), in response to specific ecological problems.

Multivariate, multidimensional analysis: methods of numerical analysis addressing whole data tables where every observation, i.e. every sampling or experimental unit, is characterised by several variables: species abundances, climatic measures, and so on.


1. The data

1.1 Data matrices

Instead of treating dependent variables one at a time, multivariate analysis considers data tables. The ecological data table is generally a rectangular matrix of the following form (Table I):

Table I - Structure of an ecological data table

                          Descriptors
Objects     Variable 1  Variable 2  ...  Variable j  ...  Variable p
Object 1    y11         y12         ...  y1j         ...  y1p
Object 2    y21         y22         ...  y2j         ...  y2p
...
Object i    yi1         yi2         ...  yij         ...  yip
...
Object n    yn1         yn2         ...  ynj         ...  ynp

The objects are the observations (sites, relevés...).

The best-known example of an ecological data table is the one where the variables are species (represented as counts, presence-absence, or any appropriate form of numerical coding) and the objects are sites, vegetation relevés, field observations, traps, and so on.

An ecological data table can also be made of environmental variables (climate, chemical variables...) that will be used either to explain the structure of a species table, or directly to characterise a group of sites.

Finally, another such table may contain the geographical coordinates or any appropriate coding of the spatial structure of the data set.


[Figure: three data matrices side by side, each with objects 1 to n as rows; the first has species 1 to p as columns, the second environmental variables 1 to m, the third spatial variables 1 to q. All columns are descriptors.]

Figure 1 - The ecologist's data matrices.

The methods addressed in this document are aimed at:

- measuring resemblance among objects or variables of a data table;

- clustering the objects or variables according to these resemblances;

- ordinating them in a reduced space that reveals their main structures (especially gradients);

- modelling the relationships between response data tables and explanatory variables;

- testing these relationships for statistical significance.


1.2 Data transformation

There are instances where one needs to transform the data prior to analysis. The main reasons are given below.

1. Make comparable descriptors that have been measured in different units

This is often done using ranging or standardization of the variables. It is useful because many methods are sensitive to the scale of measurement of the variables. While this is sometimes a desirable property, in other cases one prefers to assess the ecological structures independently of the units of the variables.

Ranging consists of two operations: a) subtract the minimum observed in each variable; b) divide by the range. This reduces the values of the variable to the interval [0;1]:

y'i = (yi − ymin) / (ymax − ymin)

The transformation above is used on variables where the zero value is chosen arbitrarily (called interval-scale variables; an example is the Celsius temperature scale).

For variables with a true zero and no negative values (called relative-scale variables), ranging simplifies to

y'i = yi / ymax
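As an illustration (a minimal plain-Python sketch; variable names and example values are ours), the two ranging variants can be applied as follows:

```python
def range_interval(values):
    """Ranging for interval-scale variables: (y - ymin) / (ymax - ymin)."""
    ymin, ymax = min(values), max(values)
    return [(y - ymin) / (ymax - ymin) for y in values]

def range_relative(values):
    """Simplified ranging for relative-scale variables: y / ymax."""
    ymax = max(values)
    return [y / ymax for y in values]

temperatures = [-5.0, 0.0, 10.0, 15.0]   # interval scale (arbitrary zero)
abundances = [0, 4, 8, 16]               # relative scale (true zero)
print(range_interval(temperatures))      # [0.0, 0.25, 0.75, 1.0]
print(range_relative(abundances))        # [0.0, 0.25, 0.5, 1.0]
```

Both variants reduce the variable to the interval [0;1], so descriptors measured in different units become directly comparable.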

Standardization: subtract the mean of the variable from each value (i.e. centre the variable), and divide the results by the standard deviation of the variable (i.e. scale the variable). This yields the so-called "z-scores":


y'i = zi = (yi − ȳ) / sy

This results in a variable that has zero mean and unit variance (and hence standard deviation = 1 as well). Therefore, all variables that have been standardized can be directly compared and used together in methods that are sensitive to differences in scales of measurement, since they are now dimensionless and expressed as standard deviation units.
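A minimal sketch of standardization (plain Python; the sample standard deviation with n − 1 denominator is assumed):

```python
import math

def standardize(values):
    """z-scores: centre on the mean, then divide by the standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((y - mean) ** 2 for y in values) / (n - 1))
    return [(y - mean) / sd for y in values]

z = standardize([2.0, 4.0, 6.0, 8.0])
# the z-scores are dimensionless, with mean 0 and standard deviation 1
```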

2. Normalize the data and stabilize their variance

This is done to make the frequency distribution of the values look like a normal curve - or, at least, as symmetric as possible. This is done because several multivariate methods used in ecology have been developed under the assumption that the variables are normally distributed. Full normality is generally not necessary, but the performance of these methods is better with unskewed data.

Normalizing can be done in different ways that require examination of the frequency distribution of the data. In many cases, ecologists encounter data that are strongly skewed to the right (long tail in the high values) because, in a sample set of species abundances, a species is abundant in a few observation units, fairly abundant in more, present in even more, and absent from many units. Depending on the skewness observed and the types of data, various correcting transformations can be applied.

• Square root transformation (Figure 2): y'i = √(yi + c). The least drastic transformation, used when the data have a Poisson distribution; the constant c must be added to the data if there are negative values. So one first makes a translation of the data (c being equal to the absolute value of the most negative observation) prior to the transformation itself.


[Figure: two histograms of the number of observation hours, plotted against the number of syrphid Diptera (0 to 12) and against its square root (0.5 to 4).]

Figure 2 - The square root transformation.

• Log transformation: y'i = ln(yi + c). Frequently applied to species abundance data, of which many tend to follow a lognormal distribution. The base of the logarithm is not important, but the most frequently used are the Napierian (natural) logarithms. The constant c is added to make the data strictly positive. With species abundance data, c is generally set equal to 1. Thus, zero values are translated to 1, and become zero again with the log transformation.

[Figure: two histograms of the number of soil cores containing the mite Oppiella nova, plotted against the number of individuals (0 to 70) and against ln(number of individuals + 1) (0 to 4.5).]

Figure 3 - The log transformation.


• Arcsine transformation: y'i = arcsin √yi. Appropriate for percentages or proportions (which are generally platykurtic), but the analytical results based on arcsine-transformed data may be difficult to interpret.

[Figure: two histograms of the number of test tubes, plotted against % fertility (0 to 100) and against the arcsine-transformed values in degrees (0 to 80).]

Figure 4 - The arcsine transformation. Data: Sokal & Rohlf (1981).
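The three normalizing transformations above can be sketched as follows (plain Python; the constants c are the ones discussed in the text):

```python
import math

def sqrt_transform(y, c=0.0):
    """Square root transformation: y' = sqrt(y + c); c translates negative data."""
    return math.sqrt(y + c)

def log_transform(y, c=1.0):
    """Log transformation: y' = ln(y + c); with c = 1, zero abundances stay 0."""
    return math.log(y + c)

def arcsine_transform(p):
    """Arcsine transformation: y' = arcsin(sqrt(p)) for a proportion p in [0, 1]."""
    return math.asin(math.sqrt(p))

counts = [0, 1, 8, 63]
print([log_transform(y) for y in counts])  # zeros remain zero after ln(y + 1)
```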

• Box-Cox transformation: when there is no a priori reason to select one of the transformations above, the Box-Cox method allows one to empirically (and iteratively) estimate the most appropriate exponent λ of the following general transformation function:

y'i = (yiλ − 1)/λ   (for λ ≠ 0)

y'i = ln(yi)   (for λ = 0)

Normalizing transformations generally also have the property of stabilizing the variances; homoscedasticity (stability or homogeneity of variances) is an essential property of the data for several analyses, including ANOVA and its multivariate counterparts, even if the tests are conducted using permutations (see Chapter 5).


3. Linearize the relationships among variables

Comparison coefficients like covariance or Pearson correlation are made to detect linear relationships. Thus, if the relationships among variables are monotonic but nonlinear, a transformation may be applied. For instance, if a dependent variable is an exponential function of an independent variable, then the dependent variable may be log-transformed. The reverse may occur also. Note that it will be easier to interpret the results if the transformation applied is grounded in ecological theory. An example is the Malthusian exponential growth curve:

Nt = N0 e^(rt)

Data of a time series showing this curve may be log-transformed so that ln(Nt) becomes linearly related to time t: ln(Nt) = ln(N0) + rt.
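A quick numerical check (hypothetical values N0 = 10 and r = 0.5) shows that the log-transformed series is indeed linear in t:

```python
import math

N0, r = 10.0, 0.5                                # hypothetical initial size and rate
times = list(range(5))
series = [N0 * math.exp(r * t) for t in times]   # Nt = N0 * e^(r*t)

logged = [math.log(N) for N in series]           # ln(Nt) = ln(N0) + r*t
slopes = [logged[i + 1] - logged[i] for i in range(len(times) - 1)]
# every successive difference of ln(Nt) equals r: the relation is linear
```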

4. Modify the weights of the variables or objects

Standardization, log transformation or exponential transformation also have the effect of modifying the relative weight of the variables. Other transformations may also explicitly change the weight of the observations, for instance the normalization of the object or variable vectors to 1 (not to be confused with the normalizing transformations above!). This operation consists in dividing each value of the vector by the vector's length (called the norm of the vector), which is defined following Pythagoras' formula:

Vector norm = √(b1² + b2² + ... + bn²)

where the b are the observations and 1, 2... n are the object indices (so this example deals with a variable).

The normalized vector is thus defined as:

[b1/√(b1² + b2² + ... + bn²), b2/√(b1² + b2² + ... + bn²), ..., bn/√(b1² + b2² + ... + bn²)] = (1/√(b1² + b2² + ... + bn²)) · [b1, b2, ..., bn]

The length of any normalized vector, in the n-dimensional space, is 1.
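A minimal sketch of this normalization (plain Python):

```python
import math

def normalize(vector):
    """Divide each value by the vector's norm (Pythagoras' formula)."""
    norm = math.sqrt(sum(b ** 2 for b in vector))
    return [b / norm for b in vector]

v = normalize([3.0, 4.0])                    # norm = 5, so v = [0.6, 0.8]
length = math.sqrt(sum(b ** 2 for b in v))   # length of the normalized vector
```

Whatever the number of dimensions, the resulting vector has length 1.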

5. Recode semi-quantitative variables as quantitative

In many instances variables are measured on a semi-quantitative scale, generally because the added precision of a quantitative measurement would not justify the additional cost or difficulty of gathering it. Such semi-quantitative measurements are often devised in such a way that the intervals between the classes follow a known distribution (for instance a variable of abundance classes going from 0 for "absent" to 5 for "very abundant" may follow a logarithmic transformation of the real abundances). In such cases a back-transformation is possible, but one has to be conscious that this operation does not restore a precision that the original measurements did not have in the first place!

A complex example is the transformation of Braun-Blanquet's phytosociological scale into quantitative values (Van Der Maarel 1977):

Table II - Transformation of Braun-Blanquet's scores into quantitative scores.


6. Binary coding of nominal variables

Many analyses incorrectly interpret or do not accept multistate nominal variables (see Section 2.2) whose classes are coded as incremental numbers or as character strings. One must therefore recode these variables into a series of dummy binary variables (Table III):

Table III - Binary coding of a nominal variable. Note that here 3 dummy variables are sufficient, the fourth one being collinear with the others. The fourth one is often discarded by computer programs, or the analysis simply cannot be run with it.

One nominal variable       4 dummy binary variables
Modality      Code    Calcosol  Brunisol  Neoluvisol  Calcisol
Calcosol      1       1         0         0           0
Brunisol      2       0         1         0           0
Neoluvisol    3       0         0         1           0
Calcisol      4       0         0         0           1
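A sketch of this recoding (plain Python, using the soil types of Table III):

```python
def dummy_code(values, levels):
    """Recode a nominal variable into one binary (0/1) variable per level."""
    return [[1 if v == level else 0 for level in levels] for v in values]

levels = ["Calcosol", "Brunisol", "Neoluvisol", "Calcisol"]
observed = ["Brunisol", "Calcosol", "Calcisol"]
rows = dummy_code(observed, levels)
# each object gets exactly one 1, so any one column is collinear
# with the three others and can be dropped before analysis
```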


2. Association matrices and coefficients

2.1 Association matrices

A large majority of the methods of multivariate analysis, especially ordination and most clustering techniques, are explicitly or implicitly based on a comparison of all possible pairs of objects or descriptors.

When the pairs of objects are compared, the analysis is said to be in Q-mode. When the pairs of descriptors are compared, the analysis is said to be in R-mode.

This distinction is important because the comparison is based on association coefficients, and the coefficients of Q and R analyses are not the same.

In Q-mode, the coefficients used measure the distance or the similarity between pairs of objects. Examples: Euclidean distance, Jaccard similarity. In R-mode, one rather uses coefficients of dependence among variables, like for instance covariance or correlation.

Computing all possible comparisons among pairs of objects produces a square and symmetrical association matrix, of dimensions n × n (Q-mode) or p × p (R-mode):

[Figure: two square association matrices. In Q-mode, A is n × n, with elements a11 ... a1i ... a1n in the first row down to an1 ... ani ... ann in the last. In R-mode, A is p × p, with elements a11 ... a1j ... a1p down to ap1 ... apj ... app.]

Figure 5 - Association matrices.


Every value in these matrices yields a comparison between two objects or descriptors whose location in the raw data matrix is given by the subscripts: ain is the comparison measure between object i and object n. Ecological association matrices are usually symmetrical since ain = ani. The values located on the diagonal compare the objects or variables with themselves. In Q-mode, the diagonal is usually made of 0 (when the coefficient is a distance) or 1 (when the coefficient is a similarity). In R-mode the diagonal gives a measure of dependence of a variable with itself: for instance this value equals 1 if the measure is a Pearson correlation coefficient, or it equals the variance of the variable if the measure is a covariance.

All the useful information of an association matrix is thus given in the triangle located above or below the diagonal (without the diagonal itself). The number of comparisons of all possible pairs of n objects is thus equal to n(n−1)/2.
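As a small Q-mode illustration (plain Python; the Euclidean distance is used here only as a stand-in for any distance coefficient):

```python
import math

def euclidean(a, b):
    """Distance between two objects described by quantitative variables."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

objects = [[1.0, 2.0], [4.0, 6.0], [1.0, 2.0]]   # n = 3 objects, p = 2 descriptors
n = len(objects)
A = [[euclidean(objects[i], objects[j]) for j in range(n)] for i in range(n)]

# symmetric matrix with a zero diagonal; only n(n-1)/2 entries are informative
informative_pairs = n * (n - 1) // 2
```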

2.2. Types of descriptors

Before reviewing the available categories of coefficients of association, one must specify the mathematical type of variables to which these coefficients will be applied. Figure 6 (below) summarises these types in the form of a hierarchy of complexity, starting with the binary type (the simplest one: 1-0, yes-no, present-absent, open-closed...) and ending with the continuous quantitative type. In data analysis, one can simplify the information at hand (e.g. recode species abundance data into presence-absence data), but usually not the reverse. Note that it often happens that the information required from an analysis can be obtained without the variables being measured with the maximum possible precision. Frequently, a large number of objects characterised by measurements made with a limited precision is preferred over a small number of objects whose variables are measured with a very high precision.


Binary: 1 - 0, present - absent

Multi-state:
- Nonordered, nominal: e.g. colors, type of soil...
- Ordered:
  - Semiquantitative, ordinal, rank-ordered: e.g. size classes (0-10 cm, 10-50 cm, more than 50 cm...), rank in a race.
  - Quantitative:
    - discontinuous, meristic, discrete: e.g. number of persons in this room, nb. of individuals per species...
    - continuous: e.g. temperature, length, ...

[Figure: two example tables. Left: relevés 1-3 described by the presence (1) or absence (0) of species 1-3, illustrating the binary descriptor "species". Right: the same relevés described by counts of individuals of each species in size classes 1-3, illustrating quantitative descriptors.]

Figure 6 - Mathematical types of descriptors used in ecology.


2.3. The double-zero problem

In the following sections, the association coefficients will be grouped into categories depending on the type of objects or descriptors to which they are applied. Before this review, it is necessary to bring up a problem pertaining to the comparison of objects when a descriptor has the value "zero" in a pair of objects.

In certain cases, the zero value has the same meaning as any other value on the scale of the descriptor. The absence (0 mg/L) of dissolved oxygen in the deep layers of a lake is ecologically meaningful information.

On the contrary, the zero value in a matrix of species abundances (or presence-absence) is much more tricky to interpret. The presence of a species at a given site generally implies that this site provides a set of minimal conditions allowing the species to survive (the dimensions of its ecological niche). The absence of a species from a relevé or site, however, may be due to a variety of causes: the species' niche may be occupied by a replacement species; the absence may be due to adverse conditions on any of the important dimensions of its ecological niche; the species may have been missed because of a purely stochastic component of its spatial distribution; or the species may not show a regular distribution on the site under study. The key here is that the absence from two sites cannot readily be counted as an indication of resemblance between the two sites, because this double absence may be due to completely different reasons.

The information "presence" thus has a clearer interpretation than the information "absence". This is why one can distinguish two classes of association coefficients based on this problem: the coefficients that consider the double zero (sometimes also called "negative match") as a resemblance (like any other value) are said to be symmetrical, the others asymmetrical. It is preferable to use asymmetrical coefficients when analysing species data.


The following sections review the main categories of coefficients with several examples. For a comprehensive review and keys to help choose the appropriate coefficient, see Legendre & Legendre (1998). All the indices listed in that book are available in the R package for Macintosh of Legendre, Casgrain and Vaudor at the following web address: <http://www.bio.umontreal.ca/legendre/>.

The choice of an appropriate coefficient is fundamental, because all the subsequent analyses will be done on the resulting association matrix. Therefore, the structures revealed by the analyses will be those of the association matrix.

2.4. Q mode: resemblance between objects

The most frequently used coefficients for the comparison of objects are similarity or distance measures. Depending on the above-mentioned characteristics of the variables in the data table, these coefficients can be classified as follows (Figure 7):

Data:
- binary
  - symmetrical coefficients
  - asymmetrical coefficients
- quantitative
  - symmetrical coefficients
  - asymmetrical coefficients

Figure 7 - Types of association coefficients in Q-mode analysis.


2.4.1. Symmetrical binary similarity coefficients

This expression means that these coefficients are made for binary data (and not that the values of the index itself are binary!) and that these coefficients treat a double zero in the same way as a double 1.

For binary coefficients, depending on the value taken by a variable in a pair of objects, one can represent the observations in a 2 × 2 contingency table (Figure 8), where a is the number of variables equal to 1 in both objects, b and c the numbers of variables equal to 1 in one object and 0 in the other, and d the number of variables equal to 0 in both objects.

(a + b + c + d) is the total number of descriptors.

The most typical index of this category is the simple matching coefficient S1 [the numbering of the coefficients is the one of Legendre & Legendre (1998)]. It is the number of variables that take the same value in both objects (i.e. double 1s + double 0s) divided by the total number of variables in the matrix. It is thus built as follows (Figure 9):


        Var.1  Var.2  Var.3  Var.4  Var.5  Var.6
Obj.1   1      1      0      0      1      0
Obj.2   1      0      1      0      0      1
Match   a      b      c      d      b      c

S1 = (a + d) / (a + b + c + d)

Figure 9 - Computation of the simple matching coefficient.

In this example, the simple matching coefficient is:

S1 = (1 + 1) / (1 + 2 + 2 + 1) = 2/6 = 0.333

which means that two of the six descriptors have the same value (0 or 1) in the two objects considered.

This coefficient, as well as the others of this category, is used to compare objects described by binary variables other than species presence-absence.
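A sketch of S1 reproducing the example of Figure 9 (plain Python):

```python
def simple_matching(obj1, obj2):
    """S1 = (a + d) / (a + b + c + d) for two binary vectors."""
    a = sum(1 for x, y in zip(obj1, obj2) if x == 1 and y == 1)
    d = sum(1 for x, y in zip(obj1, obj2) if x == 0 and y == 0)
    return (a + d) / len(obj1)

s1 = simple_matching([1, 1, 0, 0, 1, 0], [1, 0, 1, 0, 0, 1])
# a = 1 (Var.1) and d = 1 (Var.4), so s1 = 2/6 ≈ 0.333
```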

2.4.2. Asymmetrical binary similarity coefficients

This category has the same role as the previous one, but for presence-absence species data. The formulas are the same as those of the category above, but ignore the d (double zeros). The best known coefficients of this category are the Jaccard community index (S7) and the Sørensen index (S8).


S7 = a / (a + b + c)

S8 = 2a / (2a + b + c)

The use of these two coefficients is widespread in botany as well as in zoology.

2.4.3. Symmetrical quantitative similarity coefficients

One example in this category is a form of the simple matching coefficient where the variables are multiclass nominal instead of "only" binary. The index is the number of descriptors having the same state in both objects, divided by the total number of descriptors.

Other coefficients of this category are interesting because they allow one to compare, within a single coefficient, descriptors of different mathematical types. The trick consists in computing partial similarities for each descriptor, and then taking the average of these partial similarities. Among the coefficients of this kind, let us mention those of Estabrook & Rogers (S16) and Gower (S15).

2.4.4. Asymmetrical quantitative similarity coefficients

This category, adapted to species abundance data, comprises some of the most frequently used coefficients. Let us mention two of them: the Steinhaus index S17 (also well known in its distance form as the Bray-Curtis index, D14), and the χ2 similarity S21.

The S17 index (Figure 10) compares for each species the smallest abundance to the mean abundance in the two objects:

S17 = W / [(A + B)/2] = 2W / (A + B)


Example:

                 Species abundances      Sum
Object 1 (A)     70   3   4   5   1     83
Object 2 (B)     64   4   7   4   3     82
Minima (W)       64   3   4   4   1     76

S17 = (2 × 76) / (83 + 82) = 0.921

Figure 10 - Computation of the Steinhaus coefficient S17.

A caveat about S17 is that, by construction, it gives the same importance to a difference of, say, 10 individuals, whether this means a difference between 1 and 11 individuals or between 301 and 311 individuals. This goes against intuition (and, in many cases, against ecological theory), and many users prefer to log-transform their data prior to an S17-based analysis.
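A sketch of S17 reproducing Figure 10 (plain Python):

```python
def steinhaus(x1, x2):
    """S17 = 2W / (A + B), with W the sum of the per-species minima."""
    W = sum(min(a, b) for a, b in zip(x1, x2))
    A, B = sum(x1), sum(x2)
    return 2 * W / (A + B)

s17 = steinhaus([70, 3, 4, 5, 1], [64, 4, 7, 4, 3])
# A = 83, B = 82, W = 76, so s17 = 152/165 ≈ 0.921
```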

Another similarity measure adapted to species data, the χ2 similarity S21, is related to the χ2 measure used to study contingency tables. The species abundances are transformed into profiles of conditional probability; thereafter one computes a weighted Euclidean distance among sites. S21 is the complement (S21 = 1 − D15) of this distance. The formula for D15 is given in Section 2.4.5.2 below.


2.4.5. Distance measures in Q-mode

2.4.5.1 Distance measures for qualitative binary or multiclass descriptors

All similarity coefficients can be converted into distances by one of the following formulas:

D = 1 − S    D = √(1 − S)    D = √(1 − S²)    D = 1 − S/Smax

These conversions provide appropriate coefficients in the case of indices for qualitative binary or multiclass descriptors.

2.4.5.2 Distance measures for quantitative descriptors

Contrary to similarity coefficients, distance measures give a maximum value to two completely different objects, and a minimum value (0) to two identical objects. One can define three categories of indices depending on their geometrical properties:

• The metrics, which share the following four properties:

1. Minimum 0: if a = b then D(a,b) = 0

2. Positiveness: if a ≠ b then D(a,b) > 0

3. Symmetry: D(a,b) = D(b,a)

4. Triangle inequality: D(a,b) + D(b,c) ≥ D(a,c)

• The semimetrics (or pseudometrics), which do not follow the triangle inequality axiom. These measures cannot directly be used to order points in a metric or Euclidean space because, for three points (a, b and c), the sum of the distances from a to b and from b to c may be smaller than the distance between a and c.

• The nonmetrics, a group of measures that can take negative values, thus violating the second property above (positiveness).


Among the metric distance measures, the most obvious is the Euclidean distance (D1). Every descriptor is considered as a dimension of a Euclidean space, the objects are positioned in this space according to the value taken by each descriptor, and the distance between two objects is computed using Pythagoras' formula:

D1(x1, x2) = √( Σj=1..p (y1j − y2j)² )

When there are only two descriptors, this expression becomes the measure of the hypotenuse of a right-angled triangle (Figure 11):

D1(x1, x2) = √( (y11 − y21)² + (y12 − y22)² )

[Figure: two objects x1 and x2 plotted in the plane of descriptors y1 and y2, with D1(x1, x2) as the hypotenuse of the right-angled triangle joining them.]

Figure 11 - Graphical representation of the Euclidean distance D1.

This measure has no upper limit; its value increases indefinitely with the number of descriptors, and, an important point, the value depends on the scale of each descriptor. The problem may be avoided by


computing the Euclidean distance on standardized variables instead of the original data. Standardization is not necessary when D1 is applied to a group of dimensionally homogeneous variables.

For clustering purposes, the square of D1 is sometimes used. The squared D1 is semimetric, however, making it less suitable for ordination.

D1 is the essential linear measure! It is linked to the Euclidean space where a large majority of the usual statistical techniques are defined: regression, ANOVA... One consequence is that this measure is not adapted to species data: in Euclidean space, zero is a value like any other. Two objects with zero abundances of a given species will be as close to one another as if the species had, for instance, 50 individuals in each object, all other values being equal. Therefore, methods respecting the Euclidean distance among objects cannot generally be used on species data without proper adaptations. Some of these adaptations are pre-transformations of species data (see Chapter 4: ordination); some adaptations can be embedded into the Euclidean distance itself.

D3, the chord distance, for instance, is a Euclidean distance computed on site vectors scaled to length 1 (= normalized vectors). It can be computed as D1 after normalizing the site vectors to 1, or directly on the raw data through the following formula:

D3(x1, x2) = √( 2 [1 − Σj=1..p y1j y2j / √( (Σj=1..p y1j²)(Σj=1..p y2j²) )] )

This trick provides a distance measure that is insensitive to double zeros, making it suitable for species abundance data.


The chord distance is equivalent to the length of a chord joining twopoints within a segment of a sphere or hypersphere of radius 1. If onlytwo descriptors are involved, the sphere becomes a circle and thechord distance can be represented as follows:

Figure 12 - Graphical representation of the chord distance D3.

The chord distance is maximum when the species at the two sites are completely different (no common species). In that case, the normalised site vectors are at 90° from each other, and the distance between the two sites is √2. The chord distance is metric.
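The two computation routes (normalizing then applying D1, versus the direct formula) and the √2 upper bound can be checked numerically. The following plain-Python sketch is our own illustration, with invented site vectors:

```python
import math

def euclidean(a, b):
    # D1: ordinary Euclidean distance between two vectors
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def chord(a, b):
    # D3 computed directly from the raw abundances
    cross = sum(ai * bi for ai, bi in zip(a, b))
    norms = math.sqrt(sum(ai * ai for ai in a) * sum(bi * bi for bi in b))
    return math.sqrt(2 * (1 - cross / norms))

def normalize(v):
    # scale a site vector to length 1
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

site1 = [10, 20, 30, 0]
site2 = [1, 2, 3, 0]   # same profile, ten times fewer individuals

# The two computation routes agree ...
assert abs(chord(site1, site2) - euclidean(normalize(site1), normalize(site2))) < 1e-9
# ... and proportional profiles are at distance 0 (totals do not matter)
assert chord(site1, site2) < 1e-9
# Sites with no species in common reach the maximum, sqrt(2)
assert abs(chord([5, 0, 0], [0, 3, 4]) - math.sqrt(2)) < 1e-9
```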

In Section 2.4.4, devoted to asymmetrical quantitative similarity coefficients, we mentioned S21, the χ2 similarity. This coefficient is actually the similarity counterpart of the χ2 metric D15, which is computed using the following equation:

$$D_{15}(x_1, x_2) = \sqrt{\sum_{j=1}^{p} \frac{1}{y_{+j}}\left(\frac{y_{1j}}{y_{1+}} - \frac{y_{2j}}{y_{2+}}\right)^{2}}$$

where y+j is the sum of abundances of species j over all sites, and y1+ and y2+ are the sums of species abundances in sites 1 and 2, respectively.


A related measure is the χ2 distance D16, where all the terms of the sums of squares are divided by the relative frequency of each species in the overall table instead of its absolute frequency. In other words, it is identical to the χ2 metric multiplied by √y++, where y++ is the grand total of the data table:

$$D_{16}(x_1, x_2) = \sqrt{\sum_{j=1}^{p} \frac{1}{y_{+j}/y_{++}}\left(\frac{y_{1j}}{y_{1+}} - \frac{y_{2j}}{y_{2+}}\right)^{2}} = \sqrt{y_{++}\sum_{j=1}^{p} \frac{1}{y_{+j}}\left(\frac{y_{1j}}{y_{1+}} - \frac{y_{2j}}{y_{2+}}\right)^{2}}$$

The χ2 distance is the distance preserved in correspondence analysis (CA, Chapter 4). This measure has no upper limit.
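The relation D16 = √y++ · D15 is easy to verify numerically. The sketch below (plain Python, invented toy abundances) computes both coefficients directly from their formulas:

```python
import math

# Toy abundance table: 3 sites (rows) x 4 species (columns); values are invented.
Y = [
    [10, 20, 30, 0],
    [ 5, 10,  0, 5],
    [ 0,  0, 10, 10],
]
col_sums = [sum(row[j] for row in Y) for j in range(len(Y[0]))]  # y+j
grand = sum(col_sums)                                            # y++

def d15(r1, r2):
    # chi-square metric between sites r1 and r2 (absolute species frequencies)
    s1, s2 = sum(Y[r1]), sum(Y[r2])   # y1+, y2+
    return math.sqrt(sum(
        (Y[r1][j] / s1 - Y[r2][j] / s2) ** 2 / col_sums[j]
        for j in range(len(col_sums)) if col_sums[j] > 0))

def d16(r1, r2):
    # chi-square distance: relative species frequencies y+j / y++
    s1, s2 = sum(Y[r1]), sum(Y[r2])
    return math.sqrt(sum(
        (Y[r1][j] / s1 - Y[r2][j] / s2) ** 2 / (col_sums[j] / grand)
        for j in range(len(col_sums)) if col_sums[j] > 0))

# D16 = sqrt(y++) * D15, as in the formula above, for every pair of sites
for a in range(3):
    for b in range(a + 1, 3):
        assert abs(d16(a, b) - math.sqrt(grand) * d15(a, b)) < 1e-9
```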

A coefficient related to D15 and D16 is the Hellinger distance D17, for which the formula is:

$$D_{17}(x_1, x_2) = \sqrt{\sum_{j=1}^{p} \left(\sqrt{\frac{y_{1j}}{y_{1+}}} - \sqrt{\frac{y_{2j}}{y_{2+}}}\right)^{2}}$$

We shall mention interesting uses of this distance measure, as well as of the chord distance, in Chapter 4.
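A useful property is that the Hellinger distance equals the chord distance computed on square-root-transformed abundances. A plain-Python sketch (invented data, our own illustration) verifying this:

```python
import math

def chord(a, b):
    # D3, chord distance computed directly on raw vectors
    cross = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return math.sqrt(2 * (1 - cross / norms))

def hellinger(a, b):
    # D17 computed from the formula above
    sa, sb = sum(a), sum(b)
    return math.sqrt(sum(
        (math.sqrt(x / sa) - math.sqrt(y / sb)) ** 2 for x, y in zip(a, b)))

s1 = [10, 20, 30, 0]
s2 = [5, 10, 0, 5]
# The Hellinger distance equals the chord distance of the
# square-root-transformed abundances:
sq1 = [math.sqrt(x) for x in s1]
sq2 = [math.sqrt(x) for x in s2]
assert abs(hellinger(s1, s2) - chord(sq1, sq2)) < 1e-9
```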

Finally, among the semimetric distance measures, the most frequently used is D14, the Bray and Curtis distance, which is the one-complement of the Steinhaus similarity coefficient: D14 = 1 − S17. It is therefore adapted to species data.

2.5 R mode: coefficients of dependence

When one compares descriptors on the basis of their values in a series of objects, one generally wants to describe the way these descriptors vary with respect to one another. Once again we have to distinguish between the case where the descriptors are species abundances and the other cases.


2.5.1 Descriptors other than species abundances

Qualitative descriptors: their comparison can be done using two-way contingency tables and their χ2 statistic.

Semi-quantitative descriptors: if a pair of such descriptors is in a monotonic relationship, their resemblance can be measured using Spearman's r and Kendall's τ nonparametric correlation coefficients. If the relationship is not expected to be monotonic, then it may be preferable to use the χ2 statistic for contingency tables. The semi-quantitative information is lost, but the relationship can be detected.

Quantitative descriptors: their relationship is generally measured by usual parametric dependence measures like covariance or Pearson's correlation. Remember also that Pearson's correlation is the covariance measured on standardized variables. Note that covariance and correlation are only adapted to descriptors whose relationships are linear.
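The identity between Pearson's correlation and the covariance of standardized variables can be verified in a few lines. This is a plain-Python sketch on invented values:

```python
import math

def mean(v):
    return sum(v) / len(v)

def covariance(x, y):
    # sample covariance (division by n - 1)
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def pearson(x, y):
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

def standardize(v):
    # centre to mean 0, scale to standard deviation 1
    m, s = mean(v), math.sqrt(covariance(v, v))
    return [(a - m) / s for a in v]

x = [1.0, 2.0, 4.0, 7.0, 11.0]
y = [2.0, 1.0, 5.0, 8.0, 12.0]
# Pearson's r is the covariance of the standardized variables
assert abs(pearson(x, y) - covariance(standardize(x), standardize(y))) < 1e-9
```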

2.5.2 Species abundances: biological associations

Analyzing species abundances in R mode causes the same problem as in Q mode: what to do with double zeros?

Double absences are frequent in ecological communities because these contain many rare species and only a few dominant ones. Since one generally wants to define biological associations on the basis of all (or most) species present, the data matrix contains a large number of zeros. However, we know that the zeros do not have an unequivocal interpretation. Therefore, it is not recommended to use the covariance or correlation coefficients mentioned above (including the nonparametric ones!), since these treat the zero like any other value. Furthermore, correlation or covariance coefficients measure linear relationships, so that species that are always found together but whose abundances are not in a linear relationship would not be recognised as belonging to the same association by these coefficients. The same


holds for nonparametric correlation coefficients, which detect monotonic relationships only.

If one only has access to such coefficients, several options are available to minimize their adverse effects:

- eliminate from the study the least frequent species, so as to reduce the number of double zeros;

- eliminate the zeros (by declaring them as missing values);

- eliminate double zeros only from the computation of the correlation or covariance matrix; this must generally be programmed separately.
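The third option can be sketched as follows (plain Python; the filtering helper and the data are our own illustration, not a standard routine). Note how three shared absences make two perfectly opposed species appear positively correlated:

```python
import math

def pearson(x, y):
    # ordinary Pearson correlation on all sites
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def correlation_without_double_zeros(x, y):
    # keep only the sites where at least one of the two species is present
    pairs = [(a, b) for a, b in zip(x, y) if not (a == 0 and b == 0)]
    xs, ys = zip(*pairs)
    return pearson(list(xs), list(ys))

sp1 = [0, 0, 0, 3, 6, 9]
sp2 = [0, 0, 0, 9, 6, 3]

r_all = pearson(sp1, sp2)                           # inflated by double zeros
r_filtered = correlation_without_double_zeros(sp1, sp2)

assert abs(r_all - 0.5) < 1e-9       # shared absences suggest association
assert abs(r_filtered + 1.0) < 1e-9  # where both occur, they are opposed
```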

Another method is to use the S21 coefficient among variables (species): as an exception, this coefficient can be applied in R mode as well as in Q mode.

Yet another approach is to apply Goodall's probabilistic coefficient (S23) to species. This allows one to set an "objective", probabilistic limit to associations, such as: "all species that are related at a probability level p ≥ 0.95 are members of the association".

Alternatively, one can also define species groups by clustering the species scores of an ordination.

Presence-absence data: in several instances it may be preferable to define species associations on the basis of presence-absence data, for instance in cases where quantitative data do not reflect the true proportions among species (because of sampling biases, identification problems, and so on). Biological associations are then defined on the basis of the co-occurrence of species instead of the relationships between fluctuations in abundances. In this case there is another exception to the rule that Q-mode coefficients cannot be used in R mode: the Jaccard community coefficient S7 or the Sørensen coefficient S8 can be applied to species vectors (in R mode). Otherwise, Fager's coefficient (S24) or Krylov's probabilistic coefficient (S25) can be used. See Legendre & Legendre (1998) for more details.


Recently, Legendre (2005)1 proposed to use Kendall's W coefficient of concordance, together with permutation tests, to identify species associations: "An overall test of independence of all species is first carried out. If the null hypothesis is rejected, one looks for groups of correlated species and, within each group, tests the contribution of each species to the overall statistic, using a permutation test." The simulations accompanying the paper show that "when the number of judges [= species] is small, which is the case in most real-life applications of Kendall's test of concordance, the classical χ2 test is overly conservative, whereas the permutation test has correct Type I error; power of the permutation test is thus also higher."

Permutation tests are addressed in Chapter 5.

2.6 Choice of a coefficient

Legendre & Legendre (1998), p. 299-301, provide tables to help choose an appropriate similarity, distance or dependence coefficient. These tables are extremely helpful because of the many criteria to consider and the vast number of available coefficients.

1 Legendre, P. 2005. Species Associations: The Kendall Coefficient of Concordance Revisited. Journal of Agricultural, Biological, and Environmental Statistics 10 (2): 226–245.


3. Cluster analysis

3.1. Overview

Clustering requires the recognition of discontinuous subsets in an environment that is sometimes discrete (as in taxonomy), but most often continuous in ecology. To cluster is to recognise that objects are sufficiently similar to be put in the same group, and also to identify distinctions or separations between groups. The present chapter discusses methods used to decide whether objects are similar enough to be allocated to a group.

Clustering is an operation of multidimensional analysis which consists in partitioning the collection of objects (or descriptors in R mode) in the study. A partition is a division of a set (collection) into subsets, such that each object or descriptor belongs to one and only one subset for that partition (for instance, a species cannot belong simultaneously to two genera!). Depending on the clustering model, the result can be a single partition or a series of hierarchically nested partitions.

Note that the large majority of clustering techniques work on association matrices, which stresses the importance of the choice of an appropriate association coefficient.

One can classify the families of clustering methods as follows:

1. Sequential or simultaneous algorithms. Most methods are sequential and consist in the repetition of a given procedure until all objects have found their place: progressive division of a collection of objects, or progressive agglomeration of objects into groups. The less frequent simultaneous algorithms, on the contrary, find the solution in a single step.

2. Agglomerative or divisive. Among the sequential algorithms, agglomerative procedures begin with the discontinuous collection of objects, which are successively grouped into larger and larger clusters until a single, all-encompassing cluster is obtained. Divisive methods, on the contrary, start with the collection of objects considered as one single group, and divide it into subgroups, and so on until the objects are completely separated. In either


case it is left to the user to decide which of the intermediate partitions is to be retained, given the problem under study.

3. Monothetic versus polythetic. Divisive methods may be monothetic or polythetic. Monothetic methods use a single descriptor (the one that is considered the best for that level) at each step for partitioning, whereas polythetic methods use several descriptors which, in most cases, are combined into an association matrix.

4. Hierarchical versus non-hierarchical methods. In hierarchical methods, the members of inferior-ranking clusters become members of larger, higher-ranking clusters. Most of the time, hierarchical methods produce non-overlapping clusters. Non-hierarchical methods (including the K-means method presented in Section 3.6) produce one single partition, without any hierarchy among the groups. For instance, one can ask for 5 or 10 groups for which the partition optimises the intragroup homogeneity.

5. Probabilistic versus non-probabilistic methods. Probabilistic methods define groups in such a way that the within-group association matrices have a given probability of being homogeneous. They are sometimes used to define species associations.

3.2. Single-linkage agglomerative clustering

Also called nearest neighbour clustering, this method is sequential, agglomerative, polythetic, hierarchical and non-probabilistic, like most of the methods that will be presented here. Based on a matrix of similarities or distances, it proceeds as follows (see example further down):

1. The matrix of association is rewritten in decreasing order of similarities (or increasing order of distances).

2. The clusters are formed hierarchically, starting with the two most similar objects (first row of the rewritten association matrix). Then the second row forms a new group (if it contains two new objects) or aggregates itself to the first group (if one of the objects is a member of the first group formed above), and so on. The objects aggregate and the size of the groups increases as the similarity criterion relaxes.


Table IV (below) is a matrix of Euclidean distances (D1) among five fictitious objects; it will be the basis for the clustering examples that follow.

Table IV - D1 association matrix among five objects

______________________________________________________________

        2      3      4      5
1     0.20   0.25   0.45   0.80
2            0.40   0.35   0.50
3                   0.30   0.60
4                          0.70

______________________________________________________________

First step of the single linkage clustering: the association matrix (Table IV) is rewritten in order of increasing distances:

D1      Pair of objects formed
________________________________________
0.20    1 - 2
0.25    1 - 3
0.30    3 - 4
0.35    2 - 4
0.40    2 - 3
0.45    1 - 4
0.50    2 - 5
0.60    3 - 5
0.70    4 - 5
0.80    1 - 5


Second step: the groups are formed by extending the distance progressively:

a. First group to be formed: pair 1 - 2, distance 0.20.

b. Object 3 rejoins the group above at distance 0.25.

c. Object 4 rejoins the group above at distance 0.30.

d. Object 5 rejoins the group above at distance 0.50.

The name single linkage clustering comes from the fact that the fusion of an object (or a group) with a group at a given similarity (or distance) level only requires that one object of each of the two groups about to agglomerate be linked to one another at this level. We shall see in Section 3.3 that, at the opposite end of the spectrum, complete linkage clustering demands, for the two groups to agglomerate, that all objects be related at the given similarity.
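The procedure can be sketched in a few lines of plain Python (a naive implementation for illustration, not an efficient algorithm); it reproduces the fusion levels of steps a-d on the Table IV matrix:

```python
# Distance matrix of Table IV (objects 1-5), keyed by ordered pairs
D = {(1, 2): 0.20, (1, 3): 0.25, (1, 4): 0.45, (1, 5): 0.80,
     (2, 3): 0.40, (2, 4): 0.35, (2, 5): 0.50,
     (3, 4): 0.30, (3, 5): 0.60, (4, 5): 0.70}

def dist(i, j):
    return D[(min(i, j), max(i, j))]

def single_linkage(objects):
    # at each step, merge the two clusters whose CLOSEST members
    # are at the smallest distance (one link suffices)
    clusters = [frozenset([o]) for o in objects]
    fusion_levels = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist(i, j) for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        fusion_levels.append(d)
        merged = clusters[a] | clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return fusion_levels

# Reproduces steps a-d above: fusions at 0.20, 0.25, 0.30 and 0.50
assert single_linkage([1, 2, 3, 4, 5]) == [0.20, 0.25, 0.30, 0.50]
```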

The result of a hierarchical clustering is generally presented in the form of a dendrogram. The dendrogram resulting from the example above is the following (Figure 13):


Figure 13: Dendrogram of the single linkage clustering of the data shown in Table IV. The scale represents Euclidean distances (the coefficient used in the association matrix). "Proportional linkage" and "connectedness": see text.


3.3. Complete linkage agglomerative clustering

Contrary to single linkage clustering, complete linkage clustering (also called furthest neighbour sorting) allows an object (or a group) to agglomerate with another group only at the similarity corresponding to that of the most distant pair of objects (thus, a fortiori, all members of both groups are linked).

The procedure and results for Table IV data are as follows:

First step: the association matrix is rewritten in order of increasing distances (same as single linkage).

Second step: agglomeration based on the criterion exposed above:

a. First group to form: pair 1 - 2, distance 0.20.

b. A second group forms, independent of the first: pair 3 - 4, distance 0.30. Indeed, neither object 3 nor object 4 is at a shorter distance than 0.30 from the furthest member of group 1 - 2 (object 3 is at 0.25 from object 1, but at 0.40 from object 2).

c. Fusion of the two pairs formed above (1-2 and 3-4) can occur only at the distance separating the members that are furthest apart. Here this distance is 0.45 (between objects 1 and 4). The two groups join at this level (since no external object is closer to one of the groups than 0.45).

d. Object 5 can join the group only at the distance of the member that is furthest from it, i.e. 0.80 (distance between object 5 and object 1).
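The same naive sketch as before, with the minimum inter-cluster link replaced by a maximum, reproduces the complete linkage fusion levels (plain Python, for illustration only):

```python
# Distance matrix of Table IV (objects 1-5), keyed by ordered pairs
D = {(1, 2): 0.20, (1, 3): 0.25, (1, 4): 0.45, (1, 5): 0.80,
     (2, 3): 0.40, (2, 4): 0.35, (2, 5): 0.50,
     (3, 4): 0.30, (3, 5): 0.60, (4, 5): 0.70}

def dist(i, j):
    return D[(min(i, j), max(i, j))]

def complete_linkage(objects):
    # merge the two clusters whose FARTHEST members are at the
    # smallest distance (the fusion requires "unanimity")
    clusters = [frozenset([o]) for o in objects]
    fusion_levels = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = max(dist(i, j) for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        fusion_levels.append(d)
        merged = clusters[a] | clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return fusion_levels

# Reproduces steps a-d above: fusions at 0.20, 0.30, 0.45 and 0.80
assert complete_linkage([1, 2, 3, 4, 5]) == [0.20, 0.30, 0.45, 0.80]
```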

The resulting dendrogram (Figure 14) has quite a different aspect from the previous one:



Figure 14: Dendrogram of the complete linkage clustering of the data shown in Table IV.

The comparison between the two dendrograms shows the difference in the philosophy and the results of the two methods: single linkage allows an object to agglomerate easily to a group, since a link to one single object of the group suffices to induce the fusion. This is a "closest friend" procedure, so to say. As a result, single linkage clustering has a tendency to produce a chaining of objects: a pair forms, then an object joins the pair, and another, and so on. The resulting dendrogram does not show clearly separated groups, but can be used to identify gradients in the data.

At the opposite, complete linkage clustering is much more constraining (and contrasting). A group admits a new member only at a distance corresponding to the furthest object of the group: one could say that the admission requires unanimity of the members of the group! It follows that, the larger a group is, the more difficult it is to agglomerate with it. Complete linkage therefore tends to produce many small separate groups, which agglomerate at large distances. This method is thus interesting for seeking discontinuities in data that are a priori quite compact. In other words, single linkage clustering contracts the reference space around a cluster, while complete linkage clustering dilates it.


3.4. Intermediate linkage clustering

This expression covers all the intermediates between the above extremes, i.e. algorithms where group fusion occurs when a definite proportion of links is established between the members of the two groups. This proportion is called the connectedness. Connectedness varies from 0 (single linkage) to 1 (complete linkage). Often in ecology, appropriate solutions are found at intermediate connectedness values (0.3 to 0.7), where the clustering algorithm approximately conserves the metric properties of the reference space.

The study of this family of linkage techniques shows that it has great flexibility. This quality could lead the reader to think that one can impose one's preconceived ideas on the data. In reality you must remember the following points:

1. It is preferable to define what you expect from a clustering before running the computation. To show a possible gradient? To reveal faint discontinuities? An intermediate, "neutral" clustering?

2. Whatever the chosen method is, the structures revealed do indeed exist in the association matrix. Even a complete linkage clustering will not produce small, compact groups from an association matrix describing only a strong gradient, and the converse is also true.

Therefore, it is extremely important that one chooses the appropriate association coefficient and the appropriate clustering method to extract the desired information from the data.

3.5. Average agglomerative clustering

The four methods of this family are commonly used in numerical taxonomy (but less in ecology). Their names in this discipline are mentioned in parentheses in Table V below. These methods are not based on the number of links between groups or objects, but rather on


average similarities among objects or on centroids of clusters. The difference among them pertains to the way of computing the position of the groups (arithmetic average versus centroids) and to the weighting or non-weighting of the groups according to the number of objects that they contain. Table V summarises this:

Table V - The four methods of average agglomerative clustering

                  Arithmetic average            Centroid clustering

Equal weights     Unweighted arithmetic         Unweighted centroid
                  average clustering (UPGMA)    clustering (UPGMC)

Unequal weights   Weighted arithmetic           Weighted centroid
                  average clustering (WPGMA)    clustering (WPGMC)

Unweighted arithmetic average clustering (UPGMA)

Also called group average sorting or Unweighted Pair-Group Method using Arithmetic averages, this technique must be applied with caution: because it gives equal weights to the original similarities, it assumes that the objects in each group form a representative sample of the corresponding larger groups of objects in the reference population under study. For this reason, UPGMA clustering should only be used in connection with simple random or systematic sampling designs if the results are to be extrapolated to a larger reference population.

UPGMA allows an object to join a group at the average of the distances between this object and all members of the group. When two groups join, they do so at the average of the distances between all members of one group and all members of the other. This gives, using our example (Table IV):


- objects 1 and 2 join at 0.20;

- object 3 is at distance 0.25 from object 1, and 0.40 from object 2. The average of these distances is 0.325, i.e., larger than the distance between objects 3 and 4 (0.30). Therefore, the latter join at distance 0.30 as a distinct group;

- object 5 being very far, the two groups 1 - 2 and 3 - 4 join at the average of the inter-group distances, i.e. [D1(1-3) + D1(1-4) + D1(2-3) + D1(2-4)]/4 = (0.25 + 0.45 + 0.40 + 0.35)/4 = 0.3625;

- similarly, object 5 joins the group at the average of its distances with all the members of the group, i.e. (0.50 + 0.60 + 0.70 + 0.80)/4 = 0.65.
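Replacing the extreme inter-cluster link of the previous sketches by the average over all pairs of members gives UPGMA; the following plain-Python sketch (our own illustration) reproduces the fusion levels computed above:

```python
# Distance matrix of Table IV (objects 1-5), keyed by ordered pairs
D = {(1, 2): 0.20, (1, 3): 0.25, (1, 4): 0.45, (1, 5): 0.80,
     (2, 3): 0.40, (2, 4): 0.35, (2, 5): 0.50,
     (3, 4): 0.30, (3, 5): 0.60, (4, 5): 0.70}

def dist(i, j):
    return D[(min(i, j), max(i, j))]

def upgma(objects):
    # merge, at each step, the two clusters with the smallest AVERAGE
    # distance over all pairs of members (one from each cluster)
    clusters = [frozenset([o]) for o in objects]
    fusion_levels = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                d = sum(dist(i, j) for i, j in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        fusion_levels.append(d)
        merged = clusters[a] | clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return fusion_levels

# Reproduces the fusion levels derived above: 0.20, 0.30, 0.3625, 0.65
expected = [0.20, 0.30, 0.3625, 0.65]
assert all(abs(u - e) < 1e-9 for u, e in zip(upgma([1, 2, 3, 4, 5]), expected))
```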


Figure 15: Dendrogram of the UPGMA clustering of the data shown in Table IV.

Unweighted centroid clustering (UPGMC)

The same caveat as in UPGMA, about the representativeness of the sample, applies to UPGMC.

In a cluster of points, the centroid is the point that has the average coordinates of all the objects of the cluster. UPGMC joins the objects or groups that have the highest similarity (or the smallest distance), by


replacing all the objects of the group produced by the centroid of the group. This centroid is considered as a single object at the next clustering step.

A simple manner to achieve this is to replace, in the similarity matrix, the two rows and columns corresponding to the two objects about to join by a single series obtained by computing the averages of the similarities of the two objects with all the others. Nowadays, however, one uses a slightly more complex formula, which is given in Legendre & Legendre (1998), p. 322.

The dendrogram of the UPGMC clustering of our example data has the following aspect (Figure 16):


Figure 16: Dendrogram of the UPGMC clustering of the data shown in Table IV, showing a reversal.

UPGMC, as well as WPGMC, can sometimes produce reversals in the dendrogram. This situation occurred in our example. This happens when:

1. Two objects about to join (let us call them A and B) are closer to one another than each of them is to a third object C: AB < AC; AB < BC.


2. After the fusion of A and B, the centroid of the new group A-B is closer to C than A was to B before the fusion: D(AB, C) < D(A, B).

This result is due to a violation of the ultrametric property, which states that the distance between two objects A and B must be smaller than or equal to the larger of the distances between A and a third object C and between B and C: D(A, B) ≤ max[D(A, C), D(B, C)]. See Legendre & Legendre (1998), p. 324, for further explanations.

This dendrogram is tricky to interpret. In fact, one cannot consider it as a classification sensu stricto.

WPGMA and WPGMC

The weighted counterparts of the two methods above, i.e. WPGMA and WPGMC, are not detailed here. They can be used in cases where groups of objects representing different situations (and thus likely to form different groups) are represented by unequal numbers of objects. In these cases the two unweighted methods above may be distorted when a fusion of a large and a small group of objects occurs. The solution consists in giving equal weights, when computing fusion similarities, to the two branches of the dendrogram that are about to fuse.

3.5b Ward's minimum variance clustering method

This method is related to UPGMC and WPGMC: cluster centroids play an important role. The method minimizes an objective function: the "squared error" of ANOVA. At the beginning, the n objects each form a cluster, and the sum of squared distances between objects and centroids is 0. As clusters form, the centroids move away from actual object coordinates and the sum of the squared distances from the objects to the centroids increases.


The sum of squared distances is the same quantity as the one called "error" in ANOVA. At each clustering step, Ward's method finds the pair of objects or clusters whose fusion increases as little as possible the sum, over all objects, of the squared distances between objects and cluster centroids. The within-cluster sum of squared errors can be computed either from the raw data, or as the mean of the squared distances among cluster members. Therefore, Ward's method can be applied to raw data or to distance matrices. In the latter case the Ward method, originally based on the Euclidean distance, can be extended to any distance coefficient.

Dendrograms can be represented using various scales. The topology remains the same:

• squared distances;
• square root of the fusion distances (removes the distortion created by squaring the distances); used in the R Package of Legendre, Casgrain & Vaudor;
• sum of squared errors.


Figure 16b - Result of a Ward clustering on the distance matrix of the 5 objects of Table IV. The scale is given as the square root of the fusion distances.


3.6 Partitioning by K-means

Partitioning consists in finding a single partition of a set of objects. The problem has been stated as follows: given n objects in a p-dimensional space, determine a partition of the objects into K groups, or clusters, such that the objects within each cluster are more similar to one another than to objects in the other clusters. The number of groups, K, is determined by the user. The K-means method uses the local structure of the data to delineate clusters: groups are formed by identifying high-density regions in the data. To achieve this, the method iteratively minimises an objective function called the total error sum of squares (E²K or TESS). This quantity is the sum, over the K groups, of the means of the squared distances among objects in their respective groups.

To begin the computation, one has to provide an initial configuration, i.e. to attribute the objects to the K groups as a starting point for the optimisation process. This initial configuration can be random (this is generally offered by the computer programs running K-means), or it can be, for instance, a partition derived from a hierarchical clustering computed on the same data, or it may be provided by an ecological hypothesis.

Some programs ask for "group seeds", i.e. objects around which to start the clustering process. Whenever possible, it is recommended to choose objects that are as close as possible to the expected centroids of the final groups. In a closely related method (partitioning around medoids, PAM) these group seeds are called "medoids".

A problem often encountered in such iterative algorithms, and present here, is that the final solution somewhat depends on the initial configuration, since the iterative procedure may encounter a local minimum during the process. This is why one often tries several runs, each one based on a different initial configuration, and retains the one that yields the lowest TESS at the end of the process.
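A minimal K-means sketch with several random starts follows (plain Python; the data and parameter values are invented, and TESS is computed here as the sum of squared distances of objects to their group centroid, an equivalent formulation of the quantity described above):

```python
import random

def kmeans(points, k, n_starts=10, n_iter=100, seed=42):
    # naive K-means with several random initial configurations;
    # the run with the lowest TESS is kept
    rng = random.Random(seed)
    best = None
    for _ in range(n_starts):
        centroids = rng.sample(points, k)   # random initial configuration
        for _ in range(n_iter):
            # assign each object to the nearest centroid
            groups = [[] for _ in range(k)]
            for p in points:
                i = min(range(k), key=lambda c: sum(
                    (a - b) ** 2 for a, b in zip(p, centroids[c])))
                groups[i].append(p)
            # recompute the centroids
            new_centroids = [
                tuple(sum(x) / len(g) for x in zip(*g)) if g else centroids[i]
                for i, g in enumerate(groups)]
            if new_centroids == centroids:
                break
            centroids = new_centroids
        tess = sum(
            sum((a - b) ** 2 for a, b in zip(p, centroids[i]))
            for i, g in enumerate(groups) for p in g)
        if best is None or tess < best[0]:
            best = (tess, groups)
    return best

tess, groups = kmeans([(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
                       (5.0, 5.0), (5.1, 5.2), (5.2, 4.9)], k=2)
# the two well-separated clouds of three points each are recovered
assert sorted(len(g) for g in groups) == [3, 3]
```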


K-means partitioning may be computed from either a raw data table or a distance matrix (since TESS can be computed directly from distances among objects). If one wishes to use K-means on species abundance data, one has to compute a distance matrix using an appropriate, asymmetrical measure (see Chapter 2). If the computation is run from the raw data table, then the double zeros are counted as resemblances among objects, which is inappropriate. An alternative is to pre-transform the species data as shown in Section 4.3 below before running K-means on the matrix of objects by (transformed) species.

This overview is far from covering all the available clustering methods. But it shows that numerous approaches exist, that these methods address different questions, and that they focus on different aspects of the data and therefore do not necessarily yield the same results. The choice depends on the researcher's aims.


4. Ordination in reduced space

4.1. Generalities

Contrary to most clustering techniques, which aim at revealing discontinuities in the data, ordination mainly displays gradients. A detailed account of how to compute an ordination goes beyond this short introduction. Here we shall present an overview of the most useful methods available, with an intuitive explanation of the way they work.

Suppose you have a series of observations (objects) characterised by two variables. The objects could be represented in a two-dimensional space, each dimension being one of the variables (Figure 17):

Figure 17 - Ordination of six objects in the space of two variables.


A matrix of raw data (for instance objects by physico-chemical measurements) generally contains many more than two variables. In this case it becomes difficult, cumbersome and not very informative to draw the objects in a series of planes defined by all possible pairs of descriptors. For instance, if the matrix contains 10 descriptors, the number of planes to draw would be equal to (10 × 9)/2 = 45. Such a series of drawings would neither allow one to bring out the most important structures of the data, nor to visualise the relationships among descriptors (which, in general, are not linearly independent from one another anyway).

The aim of the ordination methods is to represent the data along a reduced number of orthogonal axes, constructed in such a way that they represent, in decreasing order, the main trends of the data. Here we shall mention four basic techniques: principal component analysis (PCA), correspondence analysis (CA), principal coordinate analysis (PCoA) and nonmetric multidimensional scaling (NMDS).

4.2. Principal component analysis (PCA)

Imagine, again, a data set made of n objects by p variables. The n objects can be represented as a cluster of points in a p-dimensional space. Now, this cluster is generally not completely spheroidal: it is elongated in some directions, flattened in others. These directions are not necessarily aligned with one single dimension (= one single variable) of the multidimensional space. The direction where the cluster is most elongated corresponds to the direction of largest variance of the cluster.

PCA performs a rigid rotation of the original system of axes, such that the successive new axes (called principal components) are orthogonal to one another, and correspond to the successive dimensions of maximum variance of the scatter of points. The principal components


give the positions of the objects in the new system of coordinates (Figure 18):

Figure 18 - PCA rotation of the 6 objects of Figure 17.

Each principal component is actually a linear combination of the original variables. Therefore, one can interpret the axes of a PCA by verifying which variable(s) contribute most to the first few principal components. One can also represent the variables on the PCA diagram representing the objects. The variables take the form of vectors (Figure 19). Note, however, that when one is specifically interested in the relationships among variables, another type of projection is preferable (see later on).

Each principal component is built on what is called an eigenvector, which has an associated eigenvalue. This eigenvalue gives the amount of variance that is represented on the axis. The eigenvalues always come in decreasing order, i.e. the first axis represents the largest part of the


variance, the second axis less than the first, and so on. There are as many principal components as there are variables in the original data set.

The total variance is given in several programs by the total sum of squares (total SS, i.e. the variance without the division by degrees of freedom). In some programs, like Canoco, the ordination summary also presents the results with the total SS scaled to 1, so that the eigenvalues can readily be interpreted as proportions of variance: an eigenvalue of 0.705 means that the axis represents 70.5% of the total SS of the data.
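The whole procedure can be sketched in a few lines of Python (a minimal illustration, not part of the original text; numpy is assumed and the toy data are invented):

```python
import numpy as np

def pca(Y, scale=False):
    """PCA by eigendecomposition of the covariance (scale=False) or
    correlation (scale=True) matrix among the p variables."""
    Yc = Y - Y.mean(axis=0)                # centre each variable
    if scale:                              # standardize -> correlation PCA
        Yc = Yc / Yc.std(axis=0, ddof=1)
    S = np.cov(Yc, rowvar=False, ddof=1)   # covariance (or correlation) matrix
    eigval, eigvec = np.linalg.eigh(S)     # eigh: for symmetric matrices
    order = np.argsort(eigval)[::-1]       # eigenvalues in decreasing order
    eigval, eigvec = eigval[order], eigvec[:, order]
    scores = Yc @ eigvec                   # positions of objects on the PCs
    return eigval, eigvec, scores

# toy data: 6 objects, 2 variables (invented numbers)
Y = np.array([[2., 1.], [3., 4.], [5., 0.], [7., 6.], [9., 2.], [1., 2.]])
eigval, eigvec, scores = pca(Y)
print(eigval / eigval.sum())   # proportion of the total SS on each axis
```

The sum of the eigenvalues equals the total variance of the (possibly standardized) data, and the component scores are uncorrelated, which is exactly the rigid-rotation property described above.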

Figure 19 - PCA diagram of the data of Figures 17 and 18, with projection of the original variables. There were only two variables in the data, thus there are only two PCA axes. Scaling type 1.

Technical parenthesis - According to a very important theorem in statistics called the central limit theorem, when a random variable results from several independent and additive effects, of which none has a dominant variance, then this variable tends towards a normal distribution even if the effects are not themselves normally distributed. This can be applied to ecological variables. It follows that, taken together, the ecological variables tend to follow a multinormal distribution. Now, the first principal axis, giving the orientation of the first principal component discussed above, actually goes through the greatest dimension of the concentration ellipsoid describing the multinormal distribution. In the same way, the following principal axes (orthogonal to one another, i.e. at right angles to one another, and successively shorter) go through the following dimensions of the p-dimensional ellipsoid. A maximum of p principal axes can be derived from a data table containing p variables.


For practical purposes, it must be known that one can conduct a PCA and display its results in different ways. In its basic form, PCA (1) is computed on the raw (centred but otherwise untransformed) variables, and (2) the result respects the Euclidean distance among objects. One can act on these two properties, however.

Covariance or correlation? - Covariance and correlation are the association measures used to compare all the pairs of variables in PCA. Both are linear measures. One important decision is on which of these association matrices the PCA will be computed. This matters because of the Euclidean property of the analysis: remember that the Euclidean distance is very sensitive to the scales of the variables. Therefore, conducting a PCA on the raw (actually centred) variables (= PCA on a covariance matrix) is only appropriate when these variables are dimensionally homogeneous. Otherwise, it is advisable to eliminate the effect of the differences in scale among the variables. This can be done by running the PCA on a correlation matrix, since correlation is a covariance computed on standardized variables.

Scaling - As mentioned above, both the objects and the variables can be represented on the same diagram, called a biplot. Two types of biplots can be used to represent PCA results:

• PCA Scaling 1 = distance biplot: the eigenvectors are scaled to unit length; the main properties of the biplot are the following: (1) Distances among objects in the biplot are approximations of their Euclidean distances in multidimensional space. (2) Projecting an object at right angle on a descriptor approximates the position of the object along that descriptor. (3) Since descriptors have length 1 in the full-dimensional space, the length of the projection of a descriptor in the reduced space indicates how much it contributes to that space. (4) The angles among descriptor vectors are meaningless.

• PCA Scaling 2 = correlation biplot: the eigenvectors are scaled to the square root of their eigenvalue. The main properties of the biplot


are the following: (1) Distances among objects in the biplot are not approximations of their Euclidean distances in multidimensional space. (2) Projecting an object at right angle on a descriptor approximates the position of the object along that descriptor. (3a) PCA on a covariance matrix: descriptors have length sj (= their standard deviation) in the full-dimensional space; therefore, the length of the projection of a descriptor in the reduced space is an approximation of its standard deviation. (3b) PCA on a correlation matrix: all the descriptors have unit variance (s = 1); the length of the projection of a descriptor in the reduced space reflects its contribution to that space. (4) The angles between descriptors in the biplot reflect their covariances or correlations.

Thus, if the main interest of the analysis is to interpret the relationships among objects, choose scaling 1. If the main interest focuses on the relationships among descriptors, choose scaling 2.
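Both sets of biplot coordinates come from the same eigendecomposition; only the way the eigenvectors are stretched differs. A minimal Python sketch (not part of the original text; numpy assumed, invented toy data, eigenvalues assumed nonzero), following the scaling conventions just described:

```python
import numpy as np

def pca_biplot_scores(Y):
    """Object and variable coordinates for the two PCA biplot scalings."""
    Yc = Y - Y.mean(axis=0)
    eigval, U = np.linalg.eigh(np.cov(Yc, rowvar=False, ddof=1))
    order = np.argsort(eigval)[::-1]
    eigval, U = eigval[order], U[:, order]
    # Scaling 1 (distance biplot): eigenvectors kept at unit length
    objects1, variables1 = Yc @ U, U
    # Scaling 2 (correlation biplot): eigenvectors stretched to sqrt(eigenvalue)
    objects2 = Yc @ U @ np.diag(eigval ** -0.5)
    variables2 = U @ np.diag(eigval ** 0.5)
    return eigval, (objects1, variables1), (objects2, variables2)

Y = np.array([[2., 1.], [3., 4.], [5., 0.], [7., 6.], [9., 2.], [1., 2.]])
eigval, (obj1, var1), (obj2, var2) = pca_biplot_scores(Y)
```

In scaling 1 the inter-object distances equal the Euclidean distances of the centred data; in scaling 2 the products among the variable vectors reproduce the covariance matrix, which is why their angles reflect correlations.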

Equilibrium contribution circle - In three of the four options above (i.e. all the options except covariance PCA + scaling 2), it is possible to draw, on a plane made of two principal components, a circle representing the equilibrium contribution of the variables. The equilibrium contribution is the length that a descriptor-vector would have if it contributed equally to all the dimensions (principal axes) of the PCA. Variables that contribute little to a given reduced space (say, the 1×2 plane) have vectors that are shorter than the radius of the equilibrium contribution circle. Variables that contribute more have vectors whose lengths exceed the radius of that circle. The circle has a radius equal to √(d/p), where d is the number of dimensions of the reduced space considered (usually d=2) and p is the total number of descriptors (and hence of principal components) in the analysis. In a covariance PCA with scaling 2, the equilibrium contribution must be computed separately for each descriptor, and is equal to sj√(d/p), where sj is the standard deviation of the descriptor considered. Figure 20 shows an example.
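A small sketch of the radius computation and of the test "does this descriptor contribute more than its share to the reduced space?" (Python assumed; function names are hypothetical):

```python
import numpy as np

def equilibrium_circle_radius(p, d=2):
    """Radius of the equilibrium contribution circle: sqrt(d / p)."""
    return np.sqrt(d / p)

def contributing_variables(var_scores, d=2):
    """True for variables whose vectors, projected on the first d axes of a
    scaling-1 PCA, are longer than the equilibrium contribution radius."""
    p = var_scores.shape[0]
    lengths = np.linalg.norm(var_scores[:, :d], axis=1)
    return lengths > equilibrium_circle_radius(p, d)

# with p = 5 descriptors and the usual d = 2 axes, the radius is sqrt(2/5)
print(round(float(equilibrium_circle_radius(5)), 3))  # → 0.632
```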



Figure 20 - PCA on a correlation matrix of Hellinger-transformed species data. Scaling type 2. Axes 1 × 2. Circle of equilibrium contribution (red). Circle of radius 1 (black): maximum length possible for a vector in a PCA on a correlation matrix.

Number of axes to interpret - PCA is not a statistical test, but a heuristic procedure: it aims at representing the major features of the data on a reduced number of axes (hence the often-used expression "ordination in reduced space"). Usually, the user examines the eigenvalues, and decides how many axes are worth representing and


displaying on the basis of the amount of variance explained. The decision can be completely arbitrary (for instance, interpret the number of axes necessary to represent 75% of the variance of the data), or helped by one of several procedures proposed to set a limit between the axes that represent interesting features of the data and axes that merely display the remaining, essentially random variance. One of these procedures is to compute the average of all eigenvalues and interpret only the axes whose eigenvalues are larger than that average. Another is to compute a model called the broken stick model, which randomly divides a stick of unit length into the same number of pieces as there are PCA axes. The pieces are then put in order of decreasing length and compared to the eigenvalues. One interprets only the axes whose eigenvalues are larger than the length of the corresponding piece of stick.
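The two stopping rules can be sketched as follows (Python assumed, not part of the original text; the expected length of the k-th longest broken-stick piece out of p is (1/p)·Σ(1/i) for i = k … p, and the example eigenvalues are invented):

```python
import numpy as np

def broken_stick(p):
    """Expected proportions of a unit stick broken at random into p pieces,
    sorted in decreasing order: b_k = (1/p) * sum_{i=k}^{p} 1/i."""
    return np.array([sum(1.0 / i for i in range(k, p + 1)) / p
                     for k in range(1, p + 1)])

def axes_to_interpret(eigenvalues):
    """Number of axes kept by the mean-eigenvalue rule and by the
    broken-stick rule, respectively."""
    ev = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    n_mean = int(np.sum(ev > ev.mean()))
    props = ev / ev.sum()
    n_bs = 0
    # keep leading axes while the observed proportion exceeds the model
    for obs, exp in zip(props, broken_stick(len(ev))):
        if obs > exp:
            n_bs += 1
        else:
            break
    return n_mean, n_bs

print(axes_to_interpret([2.9, 1.2, 0.5, 0.3, 0.1]))  # → (2, 1)
```

In this invented example the mean-eigenvalue rule keeps two axes, while the more conservative broken-stick rule keeps only the first.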

4.3 Pre-transformation of species data

Principal component analysis is very useful for the ordination of matrices of environmental data. On the contrary, since it is a linear method working in a Euclidean space, it is not adapted to raw species abundance data, since a zero is treated like any other value. However, Legendre & Gallagher (2001)1 have shown how to overcome this problem. The trick is to pre-transform the species data in such a way that, after PCA, the distance preserved among objects is no longer the Euclidean distance, but an ecologically meaningful one, i.e. a distance that does not take the double zeros into account in the computation of resemblances between objects. These transformations can be devised to obtain any distance measure that contains a Euclidean component. The transformations proposed in that paper are devised to obtain the following distance coefficients: chord distance (D3), χ2 metric (D15), χ2

distance (D16), distance between species profiles and Hellinger distance

1 Legendre, P. & E. D. Gallagher. 2001. Ecologically meaningful transformations for ordination of species data. Oecologia 129: 271-280.


(D17). Table VI gives the transformations to apply to species data so that a Euclidean distance applied to the sites respects the distance considered.

Table VI - Pre-transformation of species abundance data to respect ecologically meaningful distances among sites when using linear analytical methods like PCA, RDA, K-means clustering, and so on.

Distance to be respected             Transformation

Chord distance (D3)                  y'ij = yij / √(Σj yij²)

χ2 metric (D15)                      y'ij = yij / (yi+ √y+j)

χ2 distance (D16)                    y'ij = √y++ · yij / (yi+ √y+j)

Distance between species profiles    y'ij = yij / yi+

Hellinger distance (D17)             y'ij = √(yij / yi+)

where y'ij is the transformed value of the j-th species in the i-th object; yij is the raw abundance of the j-th species in the i-th object; yi+ is the sum of abundances of all species in the i-th object; y+j is the sum of abundances of the j-th species in all objects; y++ is the grand total, i.e. the sum of all abundances in the raw data table.
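The five transformations of Table VI can be sketched as follows (Python assumed, not part of the original text; the function name is hypothetical, and the demonstration data are invented):

```python
import numpy as np

def transform_species(Y, method="hellinger"):
    """Pre-transformations of Table VI for an objects-by-species abundance
    matrix Y. After transformation, Euclidean distances among the rows
    equal the chosen ecological distance on the raw data."""
    Y = np.asarray(Y, dtype=float)
    row = Y.sum(axis=1, keepdims=True)      # y_i+
    col = Y.sum(axis=0, keepdims=True)      # y_+j
    tot = Y.sum()                           # y_++
    if method == "chord":
        return Y / np.sqrt((Y ** 2).sum(axis=1, keepdims=True))
    if method == "chi_metric":
        return Y / (row * np.sqrt(col))
    if method == "chi_distance":
        return np.sqrt(tot) * Y / (row * np.sqrt(col))
    if method == "profiles":
        return Y / row
    if method == "hellinger":
        return np.sqrt(Y / row)
    raise ValueError(method)

# Hellinger transformation of a small invented abundance table
Y = np.array([[1., 5., 2.], [4., 4., 6.], [3., 3., 0.]])
H = transform_species(Y, "hellinger")
```

A quick check of the advertised property: the Euclidean distance between two rows of H equals the Hellinger distance between the corresponding rows of Y.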


4.4 Correspondence analysis (CA)

CA is actually a PCA on a species data table that has been transformed into a table of contributions to the Pearson χ2 statistic. The raw data are first transformed into profiles of conditional probabilities weighted by the row and column sums, and the resulting table is submitted to a PCA. The result is an ordination where it is the χ2 distance (D16) that is preserved among sites instead of the Euclidean distance D1. The χ2 distance does not consider the double zeros. Therefore, CA is a method adapted to the analysis of species abundance data. Note that the data submitted to a CA must be dimensionally homogeneous and equal to 0 or positive (which is the case for species counts or presence-absence data).

For technical reasons not developed here, CA ordination produces one axis fewer than min[n,p]. As in PCA, the orthogonal axes are ranked in decreasing order of the variation they represent, but instead of the total SS of the data, the variation is measured as a quantity called the total inertia. Individual eigenvalues are always smaller than 1. To know the amount of variation represented on an axis, one divides the eigenvalue of this axis by the total inertia of the species data matrix.

In CA, both the objects and the species are generally represented as points on the same joint plot. Two scalings are most useful in ecology. They are explained here for data matrices where objects are rows and species are columns:

• CA scaling type 1: rows are at the centroids of columns. This scaling is the most appropriate if one is primarily interested in the ordination of objects (sites). In the multidimensional space, the χ2 distance is preserved among objects. See Figure 21 below. Interpretation: (1) The distances among objects in the reduced space approximate their χ2 distances. Thus, object points that are close to one another are likely to be relatively similar in their species relative frequencies. (2) Any object found near the point representing a species is likely to have a high contribution of that species. For presence-absence data, the object is more likely to possess the state "1" for that species.


• CA scaling type 2: columns are at the centroids of rows. This scaling is the most appropriate if one is primarily interested in the ordination of species. In the multidimensional space, the χ2 distance is preserved among species. Interpretation: (1) The distances among species in the reduced space approximate their χ2 distances. Thus, species points that are close to one another are likely to have relatively similar relative frequencies in the objects. (2) Any species that lies close to the point representing an object is more likely to be found in that object, or to have a higher frequency there than in objects that are further away in the joint plot.

The following example (Table VII) will be submitted to a correspondence analysis:

Table VII - Artificial data for CA

        Spec.1   Spec.2   Spec.3

Obj.1      1        5        2
Obj.2      4        4        6
Obj.3      3        3        0
Obj.4      1        3        5
Obj.5      2        2        4
Obj.6      4        1        0

Since there are 6 objects and 3 species, the number of CA axes is min(6,3) – 1 = 2.

Using scaling 1, one obtains the following joint plot (Figure 21):


Figure 21 - 1 × 2 plane of the CA of the data shown in Table VII. Scaling type 1.

In this example, the eigenvalues are equal to 0.2295 and 0.0857. Since there are only two axes, the total inertia (sum of all eigenvalues) equals 0.2295 + 0.0857 = 0.3152, which, for each axis, amounts to the following proportions of variation:

0.2295/0.3152 = 72.8% for axis 1

0.0857/0.3152 = 27.2% for axis 2
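These values can be reproduced with a standard CA algorithm: an SVD of the table of relative frequencies, centred and weighted by the row and column sums. The sketch below (Python with numpy, not part of the original text) recovers the eigenvalues and total inertia of the Table VII data:

```python
import numpy as np

def ca_eigenvalues(Y):
    """CA eigenvalues via SVD of the matrix of standardized deviations
    from the row-by-column independence model (a standard CA algorithm)."""
    Y = np.asarray(Y, dtype=float)
    P = Y / Y.sum()                         # relative frequencies
    r = P.sum(axis=1)                       # row weights
    c = P.sum(axis=0)                       # column weights
    Qbar = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    s = np.linalg.svd(Qbar, compute_uv=False)
    return s ** 2                           # eigenvalues, decreasing order

# Table VII data (6 objects x 3 species)
Y = np.array([[1, 5, 2], [4, 4, 6], [3, 3, 0],
              [1, 3, 5], [2, 2, 4], [4, 1, 0]])
ev = ca_eigenvalues(Y)
print(np.round(ev[:2], 4))      # ≈ [0.2295 0.0857]
print(round(ev[:2].sum(), 4))   # total inertia ≈ 0.3152
```

The third singular value is numerically zero, illustrating that min(6,3) – 1 = 2 axes carry all the inertia; the total inertia also equals the χ2 statistic of the table divided by the grand total.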

If scaling type 2 were used, the biplot would be the following (Fig. 22):


Figure 22 - 1 × 2 plane of the CA of the data shown in Table VII. Scaling type 2.

Words of caution

Correspondence analysis was first described to analyse contingency tables. Therefore, it tends to overemphasise extreme values, and, as an ordination method, it is very sensitive to rare species, which tend to be located at extreme positions in the ordination diagram. Consequently, it may be advisable to eliminate the rarest species from the data table.

Arch and horseshoe effects - Long environmental gradients often support a succession of species (Figure 23). Since the species that are controlled by environmental factors tend to have unimodal


distributions, a long gradient may encompass sites that, at both ends of the gradient, have no species in common; thus, their distance reaches a maximum value (or their similarity is 0). But if one looks at either end of the succession, the sites still represent a continuation of the ecological succession, so contiguous sites continue to grow more different from each other. Therefore, instead of a linear trend, the gradient is represented on a pair of CA axes as an arch (Figure 24A). Several detrending techniques have been proposed to counter this effect, leading to detrended correspondence analysis (DCA):

- detrending by segments: axis 1 is divided into a number of segments, and, within each one, the mean of the object scores along axis 2 is made equal to zero. This method has been strongly rejected by many authors; actually, the resulting scores on the second axis are essentially meaningless;

- detrending by polynomials: another line of reasoning about the origin of the arch effect leads to the observation that, when an arch occurs, the second axis can be seen as quadratically related to the first (i.e. it is a second-order polynomial of the first). This accounts for the parabolic shape of the scatter of points. Hence, a solution is to make the second axis not only linearly, but also quadratically independent of the first. Although intuitively attractive, this method of detrending has to be applied with caution because it actually imposes a more constraining model on the data.

Note that the arch-like pattern is even stronger in PCA. There the extreme sites tend to be actually closer to one another as the number of nonoverlapping species increases, because the double zeros involved are considered in the Euclidean space as a resemblance between the sites. Thus, the extreme sites become closer as the number of double zeros increases. One can clearly see that this is ecological nonsense. This pattern is called the horseshoe effect (Figure 24B), because the extremities of the arch bend inwards.



Figure 23 - Succession of species along an ideal gradient (species packing model).

Figure 24 - CA and PCA on the data of Figure 23; 1 × 2 plane, objects, type 1 scaling in both cases. (A) CA: arch effect. (B) PCA: horseshoe effect.


4.5 Principal coordinate analysis (PCoA)

PCA as well as CA (at least in their usual forms) impose the distance preserved among objects: the Euclidean distance for PCA and the χ2 distance for CA (remember, however, that one can modify this to some extent by using a pre-transformation of the data; see Section 4.3). But if one would like to ordinate objects on the basis of yet another distance measure, more appropriate to the problem at hand, then PCoA is the method to apply. It allows one to obtain a Euclidean representation of a set of objects whose relationships are measured by any similarity or distance coefficient chosen by the user. For example, if the coefficient is S16, which can combine descriptors of many mathematical types into a single measure of resemblance, then the ordination will represent the relationships among the objects based upon these many different variables. This would not be possible with PCA or CA.

Like PCA and CA, PCoA produces a set of orthogonal axes whose importance is measured by eigenvalues. Since it is based on an association matrix, it can only represent the relationships among objects (if the association matrix was in Q mode) or variables (if the association matrix was in R mode), but not both at the same time.

In the case of Euclidean association measures, PCoA behaves in a Euclidean manner. For instance, computing a Euclidean distance among sites and running a PCoA will yield the same results as running a PCA on a covariance matrix with scaling 1 on the same data. But if the association coefficient used is nonmetric, semimetric or has other problems of "non-Euclideanarity", then PCoA will react by producing several negative eigenvalues in addition to the positive ones (and a null one in between). The negative eigenvalues can be seen as representing the non-Euclidean part of the structure of the association matrix, and that part is, of course, not representable on "real" ordination axes. In most cases this does not affect the representation of the objects on the first several principal axes, but in several applications it can lead to problems. There are technical solutions to


this problem (e.g. the Lingoes or Cailliez corrections), but they are not always recommendable, and they go beyond the scope of this introduction.

The ordination axes of a PCoA can be interpreted like those of a CA: proximity of objects represents similarity in the sense of the association measure used (Figure 25).
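The computation itself is short: square the distances, multiply by −0.5, double-centre the result (Gower centring), and extract the eigenvectors. A minimal sketch (Python assumed, function name hypothetical; not part of the original text):

```python
import numpy as np

def pcoa(D):
    """Principal coordinate analysis (Gower). D is a symmetric distance
    matrix. Returns all eigenvalues and the object coordinates on the axes
    with positive eigenvalues; negative eigenvalues signal a non-Euclidean
    distance matrix."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    A = -0.5 * D ** 2
    J = np.eye(n) - np.ones((n, n)) / n     # centring matrix
    G = J @ A @ J                           # Gower-centred matrix
    eigval, eigvec = np.linalg.eigh(G)
    order = np.argsort(eigval)[::-1]
    eigval, eigvec = eigval[order], eigvec[:, order]
    pos = eigval > 1e-9
    coords = eigvec[:, pos] * np.sqrt(eigval[pos])
    return eigval, coords
```

Fed a Euclidean distance matrix, the coordinates reproduce the original inter-object distances exactly and no negative eigenvalues appear, in line with the "Euclidean behaviour" described above.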

Figure 25 - 1 × 2 plane of a PCoA on a D14 matrix of the data of Table VII. The first two axes have eigenvalues of 0.359 and 0.095; they represent 59.5% and 20.2% of the variance, respectively. This PCoA gives 4 positive, one zero and one negative eigenvalue.

4.6 Nonmetric multidimensional scaling (NMDS or MDS)

If the user's priority is not to preserve the exact distances among objects, but rather to represent as well as possible the ordering relationships among objects in a small and specified number of axes, then NMDS may be the solution. Like PCoA, NMDS is not limited to Euclidean distance matrices. It can produce ordinations of objects from


any distance matrix. The method can also proceed with missing distance estimates, as long as there are enough measures left to position an object with respect to a few others.

NMDS is not an eigenvalue technique, and it does not maximise the variability associated with individual axes of the ordination. As a result, plots may arbitrarily be rotated, centred, or inverted. The procedure goes as follows (very schematically; for details see Legendre & Legendre p. 445 sq.):

1. Specify the number m of axes (dimensions) desired.

2. Construct an initial configuration of the objects in the m dimensions, to be used as a starting point of an iterative adjustment process. This is a tricky step, since the end result may depend on the starting configuration.

3. An iterative procedure seeks to position the objects in the desired number of dimensions in such a way as to minimize a stress function (scaled from 0 to 1), which measures how far the reduced-space configuration is from being monotonic to the original distances in the association matrix.

4. The adjustment goes on until the stress value can no longer be diminished, or until it attains a predefined value (tolerated lack of fit).

5. Most NMDS programs rotate the final solution using PCA for easier interpretation.

For a given and small number of axes (e.g. 2 or 3), NMDS often achieves a less deformed representation of the relationships among objects than a PCoA can show on the same number of axes. But NMDS remains a computer-intensive solution, exposed to the risk of suboptimal solutions in the iterative process (because the objective function to minimize may have reached a local minimum).
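Step 3 is the heart of the method. The sketch below (Python assumed, not part of the original text; a textbook pool-adjacent-violators routine stands in for the monotone regression used by real NMDS programs) computes Kruskal's stress-1 for a given configuration:

```python
import numpy as np

def pava(y):
    """Pool-adjacent-violators: least-squares nondecreasing fit to y."""
    vals, wts = [], []
    for v in map(float, y):
        vals.append(v); wts.append(1.0)
        while len(vals) > 1 and vals[-2] > vals[-1]:   # merge violators
            w = wts[-2] + wts[-1]
            m = (vals[-2] * wts[-2] + vals[-1] * wts[-1]) / w
            vals[-2:], wts[-2:] = [m], [w]
    out = []
    for v, w in zip(vals, wts):
        out.extend([v] * int(w))
    return np.array(out)

def stress1(dissim, config):
    """Kruskal stress-1: how far the configuration distances are from being
    monotonic to the original dissimilarities (0 = perfect fit)."""
    n = dissim.shape[0]
    iu = np.triu_indices(n, 1)
    d = np.linalg.norm(config[:, None] - config[None, :], axis=-1)[iu]
    d_sorted = d[np.argsort(dissim[iu])]    # order by original dissimilarities
    d_hat = pava(d_sorted)                  # best monotone approximation
    return np.sqrt(((d_sorted - d_hat) ** 2).sum() / (d_sorted ** 2).sum())
```

A configuration whose distances are already perfectly monotonic to the dissimilarities yields a stress of 0; the iterative NMDS algorithm moves the points to drive this quantity down.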


4.7 Canonical ordination: redundancy analysis (RDA) and canonical correspondence analysis (CCA)

The ordination methods reviewed above are meant to represent the variation of a data matrix in a reduced number of dimensions. Interpretation of the structures is done a posteriori, hence the expression indirect gradient analysis used for this approach. For instance, one can interpret the CA ordination axes (one at a time) by regressing the object scores on one or several environmental variables. The ordination procedure itself has not been influenced by these external variables, which become involved only after the computation. One lets the data matrix express itself without constraint. This is an exploratory, descriptive approach.

Constrained ordination (RDA and CCA), on the contrary, explicitly puts two matrices into relationship: one dependent (response) matrix and one explanatory matrix. Both are involved at the stage of the ordination. This approach is called direct gradient analysis, and it integrates the techniques of ordination and multiple regression (Table VIII):

Table VIII - Relationship between ordination and regression

Data to explain Explanatory variables Analysis

1 variable        1 variable       Simple regression
1 variable        m variables      Multiple regression
p variables       -                Simple ordination
p variables       m variables      Canonical ordination

In RDA and CCA, the ordination process is directly influenced by a set of explanatory variables: the ordination seeks the axes that are best explained by a linear combination of the explanatory variables. In other words, these methods seek the combinations of explanatory variables that best explain the variation of the dependent matrix. It is therefore a constrained ordination process. The difference with an


unconstrained ordination is important: the matrix of explanatory variables conditions the "weight" (eigenvalues), the orthogonality and the direction of the ordination axes. Here one can say that the axes explain (in the statistical sense) the variation of the dependent matrix.

A constrained ordination produces as many canonical axes as there are explanatory variables, but each of these axes is a linear combination (a multiple regression model) of all the explanatory variables. Examination of the canonical coefficients (i.e., the regression coefficients of the models) of the explanatory variables on each axis allows one to identify which variable(s) is or are most important in explaining the first, second... axis.

The variation of the data matrix that cannot be explained by the environmental variables is expressed on a series of unconstrained axes following the canonical ones.

Because in many cases the explanatory variables are not dimensionally homogeneous, canonical ordinations are usually done with standardized explanatory variables. In RDA, this does not affect the choice between running the analysis on a covariance or a correlation matrix, however, since this choice relates to the response (y) variables.

Depending on the algorithm used, the search for the optimal linear combinations of explanatory variables that represent the orthogonal canonical axes is done sequentially (axis by axis, using an iterative algorithm) or in one step (direct algorithm). Figure 26, which is Figure 11.2 of Legendre & Legendre (1998, p. 581), summarises the steps of a redundancy analysis (RDA) using the direct algorithm:

- regress the p dependent variables, each one separately, on the explanatory variables; compute the fitted and residual values of the regressions;

- run a PCA of the matrix of fitted values of these regressions;


- use the matrix of canonical eigenvectors to compute two sorts ofordinations:

- an ordination in the space of the dependent variables (species space); this yields the "sample scores" and the "species scores" of Canoco; the ordination axes are not orthogonal in this ordination;

- an ordination in the space of the explanatory variables; this yields the fitted site scores, called "Sample scores which are linear combinations of environmental variables" in Canoco; the canonical axes obtained here are orthogonal to one another;

- use the matrix of residuals from the multiple regressions to compute an unconstrained ordination (a PCA in the case of an RDA).
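The steps above can be sketched directly (Python assumed, not part of the original text; real programs such as Canoco or vegan's rda() apply additional scaling conventions, so only the structure of the computation is shown):

```python
import numpy as np

def rda(Y, X):
    """Direct-algorithm RDA sketch: regress Y on X, PCA of the fitted
    values (canonical part), PCA of the residuals (unconstrained part)."""
    Y = Y - Y.mean(axis=0)                  # centred responses
    X = X - X.mean(axis=0)                  # centred explanatory variables
    B = np.linalg.lstsq(X, Y, rcond=None)[0]
    Yhat = X @ B                            # fitted values of the regressions
    Yres = Y - Yhat                         # residual values
    # PCA of the fitted values -> canonical eigenvalues and eigenvectors U
    eigval, U = np.linalg.eigh(np.cov(Yhat, rowvar=False, ddof=1))
    order = np.argsort(eigval)[::-1]
    eigval, U = eigval[order], U[:, order]
    site_scores = Y @ U                     # ordination in the space of Y
    fitted_site_scores = Yhat @ U           # linear combinations of X (orthogonal)
    # PCA of the residuals -> unconstrained axes
    res_eigval = np.linalg.eigvalsh(np.cov(Yres, rowvar=False, ddof=1))
    return eigval, site_scores, fitted_site_scores, np.sort(res_eigval)[::-1]
```

With m explanatory variables, only min[p, m, n–1] canonical eigenvalues are non-zero (Table IX), and the fitted site scores come out mutually orthogonal, as stated above.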

Redundancy analysis (RDA) is the canonical version of principal component analysis (PCA). Canonical correspondence analysis (CCA) is the canonical version of correspondence analysis (CA).

Due to various technical constraints, the maximum numbers of canonical and non-canonical axes differ (Table IX):

Table IX - Maximum number of non-zero eigenvalues and corresponding eigenvectors that may be obtained from canonical analysis of a matrix of response variables Y(n×p) and a matrix of explanatory variables X(n×m) using redundancy analysis (RDA) or canonical correspondence analysis (CCA). This is Table 11.1 from Legendre & Legendre (1998, p. 588).

        Canonical eigenvalues      Non-canonical eigenvalues
        and eigenvectors           and eigenvectors

RDA     min[p, m, n–1]             min[p, n–1]

CCA     min[p–1, m, n–1]           min[p–1, n–1]


Contents of Figure 26, schematically:

Data table Y (centred response variables) and data table X (centred explanatory variables). Regress each variable y on table X and compute the fitted (ŷ) and residual (yres) values.

Fitted values from the multiple regressions: Ŷ = X[X'X]⁻¹X'Y. A PCA of Ŷ yields U, the matrix of canonical eigenvectors. ŶU = ordination in the space of the explanatory variables X; YU = ordination in the space of the response variables Y.

Residual values from the multiple regressions: Yres = Y – Ŷ. A PCA of Yres yields Ures, the matrix of eigenvectors of the residuals; YresUres = ordination in the space of the residuals.


Figure 26 - The steps of redundancy analysis using a direct algorithm. This is Figure 11.2 of Legendre & Legendre (1998).

Graphically, the results of RDA and CCA are presented in the form of biplots or triplots, i.e. scattergrams showing the objects, response variables (usually species) and explanatory variables on the same diagram. The explanatory variables can be qualitative (the multiclass ones are coded as a series of binary variables) or quantitative. A qualitative explanatory variable is represented on the bi- or triplot as the centroid of the sites that have the state "1" for that variable ("Centroids of environmental variables" in Canoco), and the quantitative ones are represented as vectors (the vector apices are given under the name "Biplot scores of environmental variables" in Canoco). The analytical choices are the same as for PCA and CA with respect to the analysis on a covariance or correlation matrix (RDA) and the scaling types (RDA and CCA). Interpretation for RDA:

• RDA Scaling 1 = distance biplot: the eigenvectors are scaled to unit length; the main properties of the biplot are the following:

(1) Distances among objects in the biplot are approximations of their Euclidean distances in multidimensional space.

(2) Projecting an object at right angle on a response variable or a quantitative explanatory variable approximates the position of the object along that variable.

(3) The angles among response vectors are meaningless.

(4) The angles between response and explanatory variables in the biplot reflect their correlations.

(5) The relationship between the centroid of a qualitative explanatory variable and a response variable (species) is found by projecting the centroid at right angle on the variable (as for individual objects).

(6) Distances among centroids, and between centroids and individual objects, approximate Euclidean distances.


• RDA Scaling 2 = correlation biplot: the eigenvectors are scaled to the square root of their eigenvalue. The main properties of the biplot are the following:

(1) Distances among objects in the biplot are not approximations of their Euclidean distances in multidimensional space.

(2) Projecting an object at right angle on a response or an explanatory variable approximates the value of the object along that variable.

(3) The angles in the biplot between response and explanatory variables, and between response variables themselves or explanatory variables themselves, reflect their correlations.

(4) The angles between descriptors in the biplot reflect their covariances or correlations.

(5) The relationship between the centroid of a qualitative explanatory variable and a response variable (species) is found by projecting the centroid at right angle on the variable (as for individual objects).

(6) Distances among centroids, and between centroids and individual objects, do not approximate Euclidean distances.

In CCA, one can use the same types of scalings as in CA. Objects and response variables are plotted as points on the triplot. For the species and objects, the interpretation is the same as in CA. Interpretation of the explanatory variables:

• CCA Scaling type 1 (focus on sites): (1) The position of an object on a quantitative explanatory variable can be obtained by projecting the object at right angle on the variable. (2) An object found near the point representing the centroid of a qualitative explanatory variable is more likely to possess the state "1" for that variable.

• CCA Scaling type 2 (focus on species): (1) The optimum of a species along a quantitative environmental variable can be obtained by projecting the species at right angle on the variable. (2) A species


found near the centroid of a qualitative environmental variable is likely to be found frequently (or in larger abundances) in the sites possessing the state "1" for that variable.

Figure 27 provides a fictitious example of a CCA triplot. Figure 28 is a real example of an RDA biplot showing the first two axes of a canonical ordination of 143 sites, 63 bird species, 15 quantitative environmental variables and 9 classes of qualitative variables. This figure is shown merely to demonstrate that a biplot can become rather crowded when the data set is large. In this case, the 143 sites were not represented on the scatterplot.

Figure 27: CCA triplot showing the objects (black dots), the response variables (species, white squares), the quantitative explanatory variables (arrows) and the qualitative (binary) explanatory variables (stars). Type 2 scaling: explanations in the text.


Figure 28 - Real example of an RDA biplot (RDA on a covariance matrix, scaling 2) showing the first two axes of a canonical ordination of 143 sites (not represented), 63 bird species (headless or full-headed arrows), 15 quantitative environmental variables (indented arrows) and 9 classes of qualitative variables (circles, squares and triangles).

4.8a Partial canonical ordination - Variation partitioning

In the same way as one can compute a partial regression, it is possible to run partial canonical ordinations. It is thus possible to run, for instance, a CCA of a species data matrix (Y matrix), explained by a matrix of climatic variables (X), controlling for the edaphic variables (W). Such an analysis would allow the user to assess how much species variation can be uniquely attributed to climate when the effect of the soil factors


has been removed. This possibility led Borcard et al. (1992)¹ to devise a procedure called variation partitioning in a context of spatial analysis. One explanatory matrix, X, contains the environmental variables, and the other (W) contains the x-y geographical coordinates of the sites, augmented by the terms of a third-order polynomial function:

b0 + b1x + b2y + b3x² + b4xy + b5y² + b6x³ + b7x²y + b8xy² + b9y³

The procedure aims at partitioning the variation of a Y matrix of species data into the following fractions (Figure 29):

[a] variation explainable only by matrix X

[b] variation explainable both by matrix X and matrix W

[c] variation explainable only by matrix W

[d] unexplained variation.

If run with RDA, the partitioning is done under a linear model, the total SS of the Y matrix is partitioned, and it corresponds strictly to what is obtained by multiple regression if the Y matrix contains only one response variable. If run under CCA, the partitioning is done on the total inertia of the Y matrix.
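For a single response variable, the fractions reduce to differences of multiple-regression R² values, which makes the logic easy to sketch numerically. The snippet below is only an illustration with simulated data; the "environmental" matrix X and "spatial" matrix W are invented for the example, not taken from the text:

```python
import numpy as np

def r_squared(y, X):
    """Unadjusted R^2 of a least-squares fit of y on X (intercept included)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    resid = y - X1 @ np.linalg.lstsq(X1, y, rcond=None)[0]
    total = y - y.mean()
    return 1.0 - (resid @ resid) / (total @ total)

rng = np.random.default_rng(42)
X = rng.normal(size=(30, 2))                   # hypothetical environmental variables
W = 0.5 * X[:, :1] + rng.normal(size=(30, 1))  # hypothetical spatial variable, correlated with X
y = X @ np.array([1.0, -0.5]) + W[:, 0] + rng.normal(size=30)

ab  = r_squared(y, X)                          # [a] + [b]
bc  = r_squared(y, W)                          # [b] + [c]
abc = r_squared(y, np.column_stack([X, W]))    # [a] + [b] + [c]

a = abc - bc        # variation explained only by X
c = abc - ab        # variation explained only by W
b = ab + bc - abc   # shared fraction, obtained by subtraction
d = 1.0 - abc       # unexplained variation
```

Because W is correlated with X here, fraction [b] comes out positive; with orthogonal X and W it would be zero.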

More recently, Borcard & Legendre (2002)², Borcard et al. (2004)³ and Legendre & Borcard (2006)⁴ have proposed to replace the spatial polynomial by a much more powerful representation of space. The method is called PCNM analysis; the acronym stands for Principal Coordinates of Neighbour Matrices. See Chapter 6.

¹ Borcard, D., P. Legendre & P. Drapeau. 1992. Partialling out the spatial component of ecological variation. Ecology 73(3): 1045-1055.
² Borcard, D. & P. Legendre. 2002. All-scale spatial analysis of ecological data by means of principal coordinates of neighbour matrices. Ecological Modelling 153: 51-68.
³ Borcard, D., P. Legendre, C. Avois-Jacquet & H. Tuomisto. 2004. Dissecting the spatial structure of ecological data at all scales. Ecology 85(7): 1826-1832.
⁴ Legendre, P. & D. Borcard. 2006. Quelles sont les échelles spatiales importantes dans un écosystème? In: J.-J. Droesbeke, M. Lejeune & G. Saporta (eds), Analyse statistique de données spatiales. Editions TECNIP, Paris.


Total variation of Y matrix
|------ Matrix X ------|
[  [a]  |  [b]  |  [c]  ]        [d]
        |------ Matrix W ------|

Figure 29 - The fractions of variation obtained by partitioning a response data set Y with two explanatory data matrices X and W.

Fractions [a]+[b], [b]+[c], [a] alone and [c] alone can be obtained by canonical or partial canonical analyses. Fraction [b] does not correspond to a fitted fraction of variation and can only be obtained by subtraction of some of the fractions obtained by ordinations.

The procedure must be run as follows if one is interested in the R² values of the four fractions:

1. RDA (or CCA) of Y explained by X. This yields fractions [a]+[b].

2. RDA (or CCA) of Y explained by W. This yields fractions [b]+[c].

3. RDA (or CCA) of Y explained by both X and W. This yields fractions [a]+[b]+[c].

The R² values obtained above are unadjusted, i.e. they do not take into account the numbers of explanatory variables used in matrices X and W. In canonical ordination as in regression analysis, R² always increases when an explanatory variable xi is added to the model,


regardless of the real meaning of this variable. In the case of regression, to obtain a better estimate of the population coefficient of determination (ρ²), Zar (1999, p. 423)⁵, among others, proposes the use of an adjusted coefficient of determination:

R²adj = 1 − [(n − 1) / (n − m − 1)] (1 − R²)

As Peres-Neto et al.⁶ have shown using extensive simulations, this formula can be applied to the fractions obtained above in the case of RDA (but not CCA), yielding adjusted fractions: ([a]+[b])adj, ([b]+[c])adj and ([a]+[b]+[c])adj. These adjusted fractions can then be used to obtain the individual adjusted fractions:

4. Fraction [a]adj is obtained by subtracting ([b]+[c])adj from ([a]+[b]+[c])adj.

5. Fraction [b]adj is obtained by subtracting [a]adj from ([a]+[b])adj.

6. Fraction [c]adj is obtained by subtracting ([a]+[b])adj from ([a]+[b]+[c])adj.

7. Fraction [d]adj is obtained by subtracting ([a]+[b]+[c])adj from 1 (i.e. the total variation of Y).
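Steps 4 to 7 are simple arithmetic once the three adjusted R² values are available. A minimal sketch, where the function implements the adjustment formula above and the three unadjusted R² values and the sizes n, mX and mW are invented for the illustration:

```python
def adj_r2(r2, n, m):
    """Adjusted R^2 (Zar 1999) for n objects and m explanatory variables."""
    return 1.0 - (n - 1) / (n - m - 1) * (1.0 - r2)

# hypothetical unadjusted canonical R^2 values for n = 50 sites
n, mX, mW = 50, 3, 2
ab_adj  = adj_r2(0.45, n, mX)        # ([a]+[b])adj
bc_adj  = adj_r2(0.30, n, mW)        # ([b]+[c])adj
abc_adj = adj_r2(0.55, n, mX + mW)   # ([a]+[b]+[c])adj

a_adj = abc_adj - bc_adj             # step 4
b_adj = ab_adj - a_adj               # step 5
c_adj = abc_adj - ab_adj             # step 6
d_adj = 1.0 - abc_adj                # step 7
```

By construction the four adjusted fractions sum to 1, the total variation of Y.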

We strongly advocate the use of the adjusted coefficient of determination, together with RDA, for the partitioning of variation of ecological data matrices.

Alternatively, if one is interested in the fitted site scores for fractions [a] and [c], the partitioning can be run using partial canonical ordinations. Note, however, that it is not possible to obtain the adjusted R² on this basis:

1. RDA (or CCA) of Y explained by X. This yields fractions [a]+[b].

⁵ Zar, J. H. 1999. Biostatistical analysis. Fourth Edition. Prentice Hall, Upper Saddle River, NJ.
⁶ Peres-Neto, P. R., P. Legendre, S. Dray & D. Borcard. In revision. Variation partitioning of species data matrices: estimation and comparison of fractions.


2. Partial RDA (or CCA) of Y explained by X, controlling for W. This yields fraction [a].

3. Partial RDA (or CCA) of Y explained by W, controlling for X. This yields fraction [c].

4. Fraction [b] is obtained by subtracting [a] from [a]+[b].

5. Fraction [d] is obtained by subtracting [a]+[b]+[c] from 1 (RDA) or from the total inertia of Y (CCA).

It must be emphasised here that fraction [b] has nothing to do with the interaction of an ANOVA! In ANOVA, an interaction measures the effect that an explanatory variable (a factor) has on the influence of the other explanatory variable(s) on the dependent variable. An interaction can have a non-zero value when the two explanatory variables are orthogonal, which is precisely the situation where fraction [b] is equal to zero. Fraction [b] arises because there is some correlation between matrices X and W. Note that in some cases fraction [b] can even take negative values. This happens, for instance, if matrices X and W have strong opposite effects on matrix Y while being positively correlated to one another.

This variation partitioning procedure can be extended to more than two explanatory matrices, and can be applied outside the spatial context.


4.8b Partial canonical ordination - Forward selection of environmental variables

There are situations where one wants to reduce the number of explanatory variables in a regression or canonical ordination model. Canoco and some functions in the R language allow this with a procedure of forward selection of explanatory variables. This is how it works:

1. Compute the independent contribution of each of the m explanatory variables to the explanation of the variation of the response data table. This is done by running m separate canonical analyses.

2. Test the significance of the contribution of the best variable.

3. If it is significant, include it in the model as the first explanatory variable.

4. Compute (one at a time) the partial contributions (conditional effects) of the m–1 remaining explanatory variables, controlling for the effect of the one already in the model.

5. Test the significance of the best partial contribution among the m–1 variables.

6. If it is significant, include this variable in the model.

7. Compute the partial contributions of the m–2 remaining explanatory variables, controlling for the effect of the two already in the model.

8. The procedure goes on until no more significant partial contribution is found.
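The loop above can be sketched for a single response variable, where each "canonical analysis" reduces to a regression R². This is only an illustration of the selection logic: it tests each step with simple raw-data permutations, whereas proper tests of conditional effects with covariables would permute residuals (the function names and data are hypothetical):

```python
import numpy as np

def r2(y, cols):
    """Unadjusted R^2 of y regressed on the given columns (intercept included)."""
    X = np.column_stack([np.ones(len(y))] + cols)
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    total = y - y.mean()
    return 1.0 - (resid @ resid) / (total @ total)

def forward_select(y, X, n_perm=199, alpha=0.05, seed=0):
    """Forward selection with a permutation test of the best contribution at each step."""
    rng = np.random.default_rng(seed)
    selected, remaining = [], list(range(X.shape[1]))
    while remaining:
        in_model = [X[:, k] for k in selected]
        base = r2(y, in_model) if in_model else 0.0
        # partial (conditional) contribution of each remaining candidate
        gains = {j: r2(y, in_model + [X[:, j]]) - base for j in remaining}
        best = max(gains, key=gains.get)
        # permutation test of the best candidate's contribution
        count = 1                                  # the observed value counts as one realisation
        for _ in range(n_perm):
            yp = rng.permutation(y)
            base_p = r2(yp, in_model) if in_model else 0.0
            count += r2(yp, in_model + [X[:, best]]) - base_p >= gains[best]
        if count / (n_perm + 1) > alpha:
            break                                  # best contribution not significant: stop
        selected.append(best)
        remaining.remove(best)
    return selected
```

With a strong signal on one variable and noise on the others, the procedure admits the informative variable first and then stops.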

In Canoco 4.5, forward selection can be run either manually (at each step, the user asks for the test and decides whether or not to include a variable) or automatically. In the latter case, however, the program tests all the variables and includes them all in the model, significant or not. The user then has to ask for the forward selection summary (FS summary button), examine the conditional effects and


their probability, and rerun the analysis, retaining only the k first variables whose conditional effects are significant at a pre-established probability level.

Remarks

a) The tests are run by random permutations. See Sections 5.2 and 5.3.

b) Like all selection procedures (forward, backward or stepwise), this one does not guarantee that the best model is found. From the second step on, the inclusion of variables is conditioned by the nature of the variables that are already in the model.

c) As in all regression models, the presence of strongly intercorrelated explanatory variables renders the regression/canonical coefficients unstable. Forward selection does not necessarily eliminate this problem, since even strongly correlated variables may be admitted into a model.

d) Forward selection can help when several candidate explanatory variables are strongly correlated but the choice has no a priori ecological validity. In this case it is often advisable to eliminate one of the intercorrelated variables on an ecological rather than a statistical basis.

e) Forward selection is a rather conservative procedure when compared to backward elimination (see below): it tends to admit a smaller set of explanatory variables. In absolute terms, however, it is relatively liberal.

f) If one wants to select an even larger subset of variables, another choice is backward elimination, where one starts with all the variables included, and removes one by one the variables whose partial contributions are not significant. The partial contributions must also be recomputed at each step. Backward elimination is offered neither by Canoco nor by Dray's R function, however, and would need to be programmed separately.


g) In cases where several correlated explanatory variables are present, without clear a priori reasons to eliminate one or the other, one can examine the variance inflation factors (VIF) offered by Canoco.

h) The variance inflation factors (VIF) measure how much the variance of the canonical coefficients is inflated by the presence of correlations among explanatory variables. This in fact measures the instability of the regression model. As a rule of thumb, ter Braak recommends that variables with a VIF larger than 20 be removed from the analysis. Beware: always remove the variables one at a time and recompute the analysis, since the VIF of every variable depends on all the others!
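As an illustration of remark h), the VIF of a variable can be computed as 1/(1 − R²) of that variable regressed on all the others. The sketch below (with invented data) shows how a nearly duplicated variable inflates the VIFs of both copies far beyond ter Braak's threshold of 20:

```python
import numpy as np

def vifs(X):
    """Variance inflation factors: VIF_j = 1 / (1 - R^2) of column j
    regressed on all the other columns of X (intercept included)."""
    n, m = X.shape
    out = []
    for j in range(m):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        xj = X[:, j]
        resid = xj - others @ np.linalg.lstsq(others, xj, rcond=None)[0]
        total = xj - xj.mean()
        r2 = 1.0 - (resid @ resid) / (total @ total)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(5)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
x3 = x1 + 0.05 * rng.normal(size=50)   # nearly a duplicate of x1
v = vifs(np.column_stack([x1, x2, x3]))
```

Here v[0] and v[2] (the two near-duplicates) are very large, while v[1] stays close to 1; removing either x1 or x3 and recomputing would bring the remaining VIFs down, which is why variables should be removed one at a time.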

4.9 Distance-based redundancy analysis (db-RDA)

For cases where the user does not want to base the comparisons among objects on the distances that are preserved in CCA or RDA (including the species pre-transformations), another approach to canonical ordination is possible: db-RDA (Legendre & Anderson 1999)⁷. Described in the framework of multivariate ANOVA testing, the steps of a db-RDA are as follows:

1. Compute a distance matrix from the raw data using the most appropriate association coefficient.

2. Compute a PCoA of the matrix obtained in 1. If necessary, correct for negative eigenvalues (Lingoes or Cailliez correction), because the aim here is to conserve all the data variation.

3. Compute an RDA, using the objects × principal coordinates as the dependent (Y) matrix and the matrix of explanatory variables as the X matrix.

⁷ Legendre, P. & M. J. Anderson. 1999. Distance-based redundancy analysis: testing multi-species responses in multi-factorial ecological experiments. Ecological Monographs 69(1): 1-24.


Figure 30 summarises the method:

Raw data (replicates × species)
→ Distance matrix (Bray-Curtis, etc.)
→ Principal coordinate analysis (PCoA), with correction for negative eigenvalues
→ Matrix Y (replicates × principal coordinates) and Matrix X (dummy variables for the factor)
→ Redundancy analysis (RDA): F* statistic, test of one factor in a single-factor model
→ Test of F* by permutation

Figure 30 - The steps of a db-RDA. Adapted from Legendre & Anderson (1999).

Note that nowadays, thanks to the transformations proposed by Legendre & Gallagher (2001) for species data matrices, which allow the direct application of RDA to species data, db-RDA is less used in this case.
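The three steps of db-RDA can be sketched numerically as below. This is a bare-bones illustration, not a package implementation: negative eigenvalues are simply dropped rather than corrected by the Lingoes or Cailliez method, and the function names are ours:

```python
import numpy as np

def pcoa(D):
    """Principal coordinates of a distance matrix D (Gower double-centring).
    Negative eigenvalues are simply dropped here; the Lingoes or Cailliez
    correction would instead adjust D so that all eigenvalues become non-negative."""
    n = len(D)
    J = np.eye(n) - np.ones((n, n)) / n
    G = -0.5 * J @ (D ** 2) @ J
    w, V = np.linalg.eigh(G)
    keep = w > 1e-9 * w.max()
    return V[:, keep] * np.sqrt(w[keep])

def dbrda_fraction(D, X):
    """Fraction of the PCoA variation explained by the explanatory matrix X (step 3)."""
    Y = pcoa(D)
    Yc = Y - Y.mean(axis=0)
    X1 = np.column_stack([np.ones(len(D)), X])
    fitted = X1 @ np.linalg.lstsq(X1, Yc, rcond=None)[0]
    return np.sum(fitted ** 2) / np.sum(Yc ** 2)
```

When D contains Euclidean distances, the PCoA coordinates are a rotation of the centred raw data, so this fraction coincides with the R² of an ordinary RDA on the raw data; the interest of db-RDA lies in using non-Euclidean coefficients such as Bray-Curtis.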


4.10 Orthogonal factors: coding an ANOVA for RDA

As mentioned above, RDA is a linear method. It is the direct extension of multiple regression to multivariate response variables. On the other hand, ANOVA can be computed using a multiple regression approach if the factors and interactions are coded in an appropriate manner. Therefore, using the same coding, it is possible to run a multivariate ANOVA using RDA, with great advantages over traditional MANOVA: there is no limitation on the number of response variables with respect to the number of objects; the ANOVA can be tested using permutations, which alleviates the problems of distribution of the data (see Chapter 5); and the results can be shown and interpreted with the help of biplots. Furthermore, using the pre-transformations of species data, one can now compute a MANOVA on species data. This is of great interest to ecologists, who use experimental approaches more and more.

The two following pages show how to code two orthogonal factors, first without interaction (when there is only one experimental or observational unit for each combination of the two factors) and then with interactions (in the case of more than one object, here two, per combination). This coding works for balanced experimental designs.
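One common way to code such a design is with Helmert contrasts, whose columns sum to zero and which, in a balanced design, make the two main effects and their interaction mutually orthogonal. The sketch below (a hypothetical 2 × 3 design with 2 replicates per cell; the helper names are ours) builds the three blocks of coding variables:

```python
import numpy as np

def helmert(levels):
    """Helmert contrasts for one factor: levels-1 columns, each summing to zero."""
    H = np.zeros((levels, levels - 1))
    for j in range(levels - 1):
        H[: j + 1, j] = 1.0
        H[j + 1, j] = -(j + 1.0)
    return H

# balanced 2 x 3 design with 2 replicates per cell (hypothetical layout)
a_lev, b_lev, reps = 2, 3, 2
A = np.repeat(np.arange(a_lev), b_lev * reps)            # factor A level of each object
B = np.tile(np.repeat(np.arange(b_lev), reps), a_lev)    # factor B level of each object
XA = helmert(a_lev)[A]                                   # codes the main effect of A
XB = helmert(b_lev)[B]                                   # codes the main effect of B
# interaction: all products of one column of XA with one column of XB
XAB = np.einsum('ij,ik->ijk', XA, XB).reshape(len(A), -1)
```

Because the design is balanced, the cross-products between the three blocks are all zero, so the variation explained by A, B and the A×B interaction can be tested independently in the RDA.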


5. Statistical tests for multivariate data

Ecological data are difficult to handle when it comes to statistical testing. All the methods above, used as they are presented, are descriptive or explanatory, but as yet no statistical test has been presented to assess the significance of the relationships or structures. Here we shall present two tests, both in the general framework of permutation testing: the test on canonical axes and the Mantel test on distance matrices.

5.1 Parametric tests

Classical, parametric testing has many constraints and generally supposes that several conditions are fulfilled for the test to be valid. One fundamental assumption is that the observations must be independent of one another (i.e. the probability of obtaining a given value of the response variable in one observation is independent of the values found in other observations). Autocorrelated data violate this principle, their error terms being correlated across observations. This topic will be discussed again below in the context of spatial analysis. Another frequent requirement of classical testing is the conformity of the distribution of the data to some well-known theoretical distribution, most often the normal distribution.

When the conditions of a given test are fulfilled, an auxiliary variable constructed on the basis of one or several parameters estimated from the data (for instance an F or t statistic) has a known behaviour under the null hypothesis. It is thus possible to ascertain whether the observed value of that statistic is likely or not to occur if H0 is true. If the observed value is as extreme or more extreme than the value of the reference statistic for a pre-established probability level (usually α = 0.05), then H0 is rejected. If not, H0 is not rejected (Figure 31).


Figure 31 - Decision in statistical testing. S is some test statistic (e.g. Student's t statistic). Adapted from course notes by Pierre Legendre.


5.2 Permutation tests

The parametric procedure is rarely usable with ecological data, mainly because these data rarely fulfil the assumptions related to distribution. Furthermore, even data transformations often do not manage to normalize the data. In these conditions, another, very elegant but computationally more intensive approach is available: testing by random permutations.

Principle of permutation testing: if no theoretical reference distribution is available, then generate a reference distribution under H0 from the data themselves. This is achieved by permuting the data randomly in a scheme that ensures H0 to be true, and recomputing the test statistic. Repeat the procedure a large number of times. The observed test statistic is then compared to the set of test statistics obtained by permutations. If the observed value is as extreme or more extreme than, say, the 5% most extreme values obtained under permutations, then it is considered too extreme for H0 to be true. H0 is rejected.

An example can be constructed based on the Pearson correlation coefficient between two quantitative variables (Table X):

Table X - Example of data for permutation test: Pearson's r

Var.1   Var.2   Perm.1   Perm.2   Perm.3   ...
  1       4        4       10       11
  3       5        5        8        8
  2       3        8        4       14
  4       6       10        6        5
  3       8       11        3        3
  6       7        9        7       10
  7       9        3        5        6
  5      10       14       14        9
  8      11        7        9        7
  9      14        6       11        4

Pearson r: 0.890     r*: -0.081   0.288   -0.474   ...


In Table X, the two leftmost columns represent the original, unpermuted data. 0.890 is the value of Pearson's r correlation coefficient between the two variables. Perm.1, Perm.2 and so on are permutations of Var.2. The r* values are the values of r between Var.1 and the permuted Perm.* columns. Since these columns have had their values permuted randomly, there is no relationship between Var.1 and Perm.1, Perm.2, and so on. These are thus realisations of H0, the null hypothesis of the test: there is no linear relationship between the two variables.
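The procedure illustrated in Table X can be sketched directly (here as a one-tailed test to the right; the observed value is counted as one realisation of the reference distribution):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's r correlation coefficient."""
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))

def perm_test(x, y, n_perm=999, seed=0):
    """One-tailed (right) permutation test of Pearson's r."""
    rng = np.random.default_rng(seed)
    r_obs = pearson_r(x, y)
    count = 1                      # the observed value belongs to the reference distribution
    for _ in range(n_perm):
        count += pearson_r(x, rng.permutation(y)) >= r_obs
    return r_obs, count / (n_perm + 1)

# Var.1 and Var.2 of Table X
var1 = np.array([1, 3, 2, 4, 3, 6, 7, 5, 8, 9], dtype=float)
var2 = np.array([4, 5, 3, 6, 8, 7, 9, 10, 11, 14], dtype=float)
r_obs, p = perm_test(var1, var2)   # r_obs = 0.890
```

With 999 permutations, the probability returned is exactly the ranking-based value discussed in the next paragraph.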

In permutation testing, the observed (true) value must a priori be considered as belonging to the reference distribution. Therefore, it is customary to ask for 99, 999 or 9999 random permutations. It is then easy to establish the rank of the observed value with respect to the permuted ones, and to transform this rank into a probability value for H0 (Figure 32):


Figure 32 - Examples of comparison of true test values with reference distributions generated by random permutations. In A, the true value (arrow) is quite extreme: 5% or fewer of the random values are larger than the true one. H0 would be rejected at the 5% one-tailed probability level. In B, the true value lies amidst the random ones: H0 is accepted. Adapted from course notes by Pierre Legendre.


If the test is two-tailed, H0 is rejected at the 0.05 probability level when

Pcomp = ([per < −obs] + [per = −obs] + [per = obs] + [per > obs]) / (Nb. permutations + 1) ≤ 0.05

If the test is one-tailed (right), H0 is rejected at the 0.05 probability level when

Pcomp = ([per = obs] + [per > obs]) / (Nb. permutations + 1) ≤ 0.05

Table XI gives some examples of numerical summaries of permutation tests, with the probability of H0 derived from the results, for two- and one-tailed tests.

Table XI - Examples based on Pearson's r:

Two-tailed tests:

[per<-|obs|]  [per=-|obs|]  [-|obs|<per<|obs|]  [per=|obs|]  [per>|obs|]   P(H0)
      6             1               969               1            23      0.031
     21            20               926               1            32      0.074
      0             0               999               1             0      0.001
      0             0               990               1             9      0.010
      0             0                99               1             0      0.01
      0             1                98               1             0      0.02


One-tailed test, right tail:

[per<-|obs|]  [per=-|obs|]  [-|obs|<per<|obs|]  [per=|obs|]  [per>|obs|]   P(H0)
      6             1               969               1            23      0.024
     21            20               926               1            32      0.033
      0             0               999               1             0      0.001
      0             1                98               1             0      0.01

One-tailed test, left tail:

[per<-|obs|]  [per=-|obs|]  [-|obs|<per<|obs|]  [per=|obs|]  [per>|obs|]   P(H0)
      6             1               969               1            23      0.007
     21            20               926               1            32      0.041
      0             1               999               0             0      0.001
      0             1                98               1             0      0.01

Words of caution (permutation tests)

Elegant as it may seem, the method of permutations does not solve all the problems related to statistical testing.

1. Beyond simple cases like the one above, other problems may require different and more complicated permutation schemes than the simple random scheme applied here. This is, in particular, the case with the tests of the main factors of an ANOVA coded as proposed in Section 4.10, where the permutations for factor A must be limited to within the levels of factor B, and vice versa.

2. Permutation tests solve several, but not all, distributional problems. In particular, they do not solve distributional problems linked to the hypothesis being tested. For instance, permutational ANOVA does not require normality, but it still does require


homogeneity of variances, because this relates to the Behrens-Fisher problem linked to comparisons of means: actually two hypotheses are tested simultaneously in ANOVA, i.e. equality of the means and equality of the variances.

3. Contrary to popular belief, permutation tests do not solve the problem of independence of observations. This problem still has to be addressed by special solutions, differing from case to case, and often related to the correction of degrees of freedom.

4. Although many statistics can be tested directly by permutations (e.g. Pearson's r above), it is generally advised to use a pivotal statistic whenever possible (for r it would be a Student t statistic). A pivotal statistic has a distribution under the null hypothesis which remains the same for any value of the measured effect.

5. Observe that it is not the statistic itself which determines whether a test is parametric or not: it is the reference to a theoretical distribution (which requires assumptions about the parameters of the statistical population from which the data have been extracted) or to permutations.

5.3 Tests of an RDA or CCA

5.3.1 Principle

Remember that the eigenvalue of a canonical axis represents the amount of variation of the response data explained by the axis. If one wants to test one single axis at a time, the idea of the test is to verify whether an equal or larger eigenvalue can be obtained under the null hypothesis of no relationship between the response matrix and the explanatory matrix. But normally one first tests the significance of the analysis globally. The basis is then the sum of all canonical eigenvalues. The hypotheses are thus:


- H0: there is no linear relationship between the response matrix and the explanatory matrix;

- H1: there is a linear relationship between the response matrix and the explanatory matrix.

Originally, the test statistic was the eigenvalue or sum of canonical eigenvalues itself. Now one uses a pivotal statistic instead, a "pseudo-F" statistic defined as

F = (sum of all canonical eigenvalues / m) / (RSS / (n − m − 1))

where n is the number of objects, m is the number of explanatory variables and RSS is the residual sum of squares, i.e. the sum of the non-canonical eigenvalues (after fitting the explanatory variables).
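For a single response variable this pseudo-F reduces to the classical regression F statistic, since the sum of canonical eigenvalues is then simply the explained sum of squares. A sketch computing it from the fitted and residual sums of squares (simulated data; the function name is ours):

```python
import numpy as np

def pseudo_f(Y, X):
    """Pseudo-F for an RDA of Y on X, from the explained and residual sums of squares."""
    n, m = X.shape
    Yc = Y - Y.mean(axis=0)
    X1 = np.column_stack([np.ones(n), X])
    fitted = X1 @ np.linalg.lstsq(X1, Yc, rcond=None)[0]
    explained = np.sum(fitted ** 2)      # plays the role of the sum of canonical eigenvalues
    rss = np.sum((Yc - fitted) ** 2)     # plays the role of the residual (non-canonical) sum
    return (explained / m) / (rss / (n - m - 1))
```

In a permutation test, this statistic would be recomputed for each permuted data set and the observed value compared to the resulting reference distribution.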

Partial canonical analyses and their axes can also be tested for significance. The F statistic then takes into account the covariables, i.e. the variables of the W matrix that are controlled for in the analysis.

5.3.2 Permutation procedures

The permutation procedures for these tests are not trivial (see Legendre & Legendre 1998, p. 607 sq. for details). The main permutation types are the following:

- without covariables in the analysis

- permutation of raw data; the null hypothesis is that of exchangeability of the rows of Y with respect to the observations in X. This is implemented by permuting the rows of Y (or, alternatively, the rows of X) at random and recomputing the redundancy analysis;

- permutation of residuals; here the residuals of a linear (or other) model are the permutable units. In canonical analysis, the null hypothesis is that of exchangeability of the residuals of the response variables after fitting the explanatory variables. Tests of significance


using permutation of residuals have only asymptotically exact significance levels (i.e. as n becomes large).

- with covariables in the analysis: two methods of permutation of residuals are used to test the significance of the sum of all canonical eigenvalues:

- permutation of residuals under a reduced (or null) model;

- permutation of residuals under a full model.

These methods of permutation of residuals are especially recommended when the matrix W of covariables contains outliers.

5.4 Mantel test: matrix correlation

Described in 1967 by the epidemiologist Nathan Mantel¹, the test of matrix correlation that bears his name was increasingly used by ecologists in the eighties. Presently, however, with the advent of the powerful canonical ordination techniques, the use of the Mantel test should be restricted to cases where the hypotheses and the data themselves are naturally stated in terms of distances or similarities rather than in terms of raw data.

5.4.1 Principle of the test

The Mantel test deals with the linear correlation between similarity or distance matrices. For example, one could use it to compare a matrix Y of Bray-Curtis distances among sites based on species abundances and a matrix X of Euclidean distances among the same sites, built on the basis of standardized physico-chemical measurements. The test will tell whether the species-based distances are significantly, linearly correlated with the environment-based distances. In other words, it will answer a question of the type:

¹ Mantel, N. 1967. The detection of disease clustering and a generalized regression approach. Cancer Res. 27: 209-220.


"Do pairs of sites that are similar in terms of species composition alsotend to be similar in terms of environmental variables?"

If this is the case, then one will conclude (but with caution, since the interpretation has to be made in the "world of distances" and not in the "world of raw data") that the living community reflects its environment as measured by the explanatory variables involved in matrix X.

Formally, the hypotheses of the Mantel test can be stated as follows:

H0: the distances (or similarities) among objects in matrix Y are not (linearly) correlated with the corresponding distances in matrix X. When X contains geographic distances, H0 reads: the variable (or multivariate data) in Y is not structured as a (linear) gradient.

H1: the distances among objects in matrix Y are linearly correlated to the distances in X.

The original Mantel z statistic, i.e. the measure used to evaluate the resemblance between the two matrices, is:

zM = Σ(i=1 to n−1) Σ(j=i+1 to n) xij yij

where i and j are row and column indices of the resemblance matrices.

However, nowadays the Mantel test is generally computed using the standardized Mantel r statistic, whose formula is the same as that of Pearson's r correlation coefficient:

rM = [1 / (d − 1)] Σ(i=1 to n−1) Σ(j=i+1 to n) [(xij − x̄) / sx] [(yij − ȳ) / sy]

where i and j are as above, x̄, ȳ, sx and sy are the means and standard deviations of the distance values of each matrix, and d = n(n−1)/2 is the number of distance or similarity measures in one of the triangular parts of the matrices.


5.4.2. Example:

Let us imagine two similarity matrices between 4 objects:

      "Species" matrix               "Environment" matrix

      2     3     4                    2     3     4
1   0.25  0.43  0.55             1   0.43  0.41  0.47
2         0.17  0.39             2         0.22  0.60
3               0.66             3               0.71

Figure 33 - Two fictitious similarity matrices between 4 objects.

Mantel's z statistic is computed as follows:

z = (0.25 × 0.43) + (0.43 × 0.41) + (0.55 × 0.47) + (0.17 × 0.22) + (0.39 × 0.60) + (0.66 × 0.71) = 1.2823

This value (1.2823) is the "true" (observed) value, which must then be compared to a reference distribution obtained by randomly permuting (99, 999 or 9999 times) the rows and corresponding columns of one of the two similarity matrices. Beware: the values of the similarity matrices cannot be permuted completely at random. The permutation scheme is actually equivalent to permuting the raw data and recomputing the similarities.
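The z statistic and its permutation test for the matrices of Figure 33 can be sketched as follows; note that rows and the corresponding columns are permuted together (`np.ix_` applies the same permutation to both dimensions), which is what makes the scheme equivalent to permuting the raw data:

```python
import numpy as np

def mantel_z(SX, SY):
    """Original Mantel z statistic (sum of cross-products of upper-triangular values)."""
    iu = np.triu_indices_from(SX, k=1)
    return float(SX[iu] @ SY[iu])

def mantel_perm_test(SX, SY, n_perm=999, seed=0):
    """One-tailed Mantel test: permute rows and matching columns of SY."""
    rng = np.random.default_rng(seed)
    z_obs = mantel_z(SX, SY)
    count = 1                              # the observed value belongs to H0's distribution
    for _ in range(n_perm):
        p = rng.permutation(len(SY))
        count += mantel_z(SX, SY[np.ix_(p, p)]) >= z_obs
    return z_obs, count / (n_perm + 1)

# the two similarity matrices of Figure 33 (diagonal set to 1, unused by the statistic)
S_sp  = np.array([[1.00, 0.25, 0.43, 0.55],
                  [0.25, 1.00, 0.17, 0.39],
                  [0.43, 0.17, 1.00, 0.66],
                  [0.55, 0.39, 0.66, 1.00]])
S_env = np.array([[1.00, 0.43, 0.41, 0.47],
                  [0.43, 1.00, 0.22, 0.60],
                  [0.41, 0.22, 1.00, 0.71],
                  [0.47, 0.60, 0.71, 1.00]])
z, p = mantel_perm_test(S_sp, S_env)       # z = 1.2823
```

With only 4 objects there are just 4! = 24 distinct permutations, so the probability cannot be very small; in practice the test is used with many more objects.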

Finally, the observed z value is compared to the reference distribution in the same way as in the Pearson correlation example of Section 5.2, using the one-tailed hypothesis count. Indeed, if one has used two similarity matrices or two distance matrices (not a similarity and a distance matrix!), then the only meaningful alternative hypothesis in


ecology is that the distances or similarities are positively correlated. A negative correlation between distance measures would mean that, for instance, the sites would be more similar as perceived by the living community when they are less similar with respect to the environmental variables. This illustrates the specificities of the interpretation of a Mantel test, which must be based on a reasoning about association measures and not about raw data.

Additional remarks:

- like Pearson's correlation coefficient, the Mantel test also has a partial form, where the matrix correlation rM(AB.C) between two matrices A and B is tested while controlling for the effect of matrix C. rM(AB.C) is computed in the same way as a partial Pearson correlation coefficient;

- the Mantel test can be used to detect linear geographical gradients. The Y matrix is as usual (e.g. Bray-Curtis distance on species data). The X matrix contains Euclidean distances computed from the geographical coordinates of the sites. Note, however, that much more powerful techniques are available to detect spatial structures (see Chapter 6);

- one can use the Mantel test as a goodness-of-fit test to compare a similarity matrix obtained from a given data set to a model derived independently. For instance, a model states that vegetation relevés taken from either crystalline or calcareous soils must be more similar to one another within one substrate than pairs of relevés coming from different substrates. The model (Figure 34) would contain 1s as similarities among pairs of relevés whose members come from the same substrate, and 0s for pairs whose members each come from a different substrate (see below). This matrix is compared to a matrix of similarities based on actual vegetation data.


       2     3     4     5     6
1    0.55  0.63  0.15  0.26  0.28
2          0.77  0.31  0.46  0.09
3                0.26  0.37  0.52
4                      0.78  0.86
5                            0.62

        “Species” matrix

       2     3     4     5     6
1      1     1     0     0     0
2            1     0     0     0
3                  0     0     0
4                        1     1
5                              1

        “Model” matrix

Figure 34 - A matrix of similarity based on species abundances and a model matrix of binary values representing similar (1) and different (0) types of soil.

Words of caution (Mantel test)

The paragraph below is an excerpt from a manuscript by Pierre Legendre². It warns users against misuses of the Mantel test.

"Empiricists who frown upon theoretical justifications should be interested in the fact that the R²M of a Mantel test or a regression on distance matrices is always much lower than the R² of a (multiple) regression or canonical analysis computed on the raw data, when it is possible to do so; this has often been noted by users of the Mantel test. This was one of the results reported by Dutilleul et al. (2000, Table 2)³; it can easily be verified using any data set. Legendre (2000, Table

² Legendre, P. (in prep.) Mantel and partial Mantel tests: practical aspects.
³ Dutilleul, P., J. D. Stockwell, D. Frigon, and P. Legendre. 2000. The Mantel-Pearson paradox: statistical considerations and ecological implications. Journal of Agricultural, Biological, and Environmental Statistics 5: 131-150.


II)⁴ has also shown that the power of a Pearson correlation (i.e., its capacity to reject the null hypothesis when H0 is false) is much higher than the power of a simple Mantel test computed on distance matrices derived from the same data (...). Hence, whenever possible, use statistical procedures based on tables of raw data, such as correlation, regression, or canonical analysis. Save the Mantel test and derived forms to test hypotheses formulated in terms of distances."

Another paper has been published recently by Legendre et al. (2005)⁵, comparing the performances of tests based on raw data and Mantel tests computed on distance matrices derived from the same data. The theoretical developments and simulation results presented in this paper led to the following observations:

(1) The variance of a community composition table is a measure of beta diversity.

(2) The variance of a dissimilarity matrix among sites is not the variance of the community composition table nor a measure of beta diversity; hence, partitioning on distance matrices should not be used to study the variation in community composition among sites.

(3) In all of the simulations, partitioning on distance matrices underestimated the amount of variation in community composition explained by the raw-data approach.

(4) The tests of significance in the distance approach had less power than the tests of canonical ordination. Hence, the proper statistical procedure for partitioning the spatial variation of community composition data among environmental and spatial components, and for testing hypotheses about the origin and maintenance of variation in community composition among sites, is canonical partitioning. The Mantel approach is appropriate for testing other hypotheses, such as the variation in beta diversity among groups of sites.

⁴ Legendre, P. 2000. Comparison of permutation methods for the partial correlation and partial Mantel tests.

Journal of Statistical Computation and Simulation 67: 37-73.
⁵ Legendre, P., D. Borcard and P. R. Peres-Neto. 2005. Analyzing beta diversity: partitioning the spatial variation of community composition data. Ecological Monographs 75: 435-450.


6. Spatial analysis of multivariate ecological data

6.1 Introduction

6.1.1 Conceptual importance

Ecological models have long assumed, for simplicity, that biological organisms and their controlling variables are distributed in nature in a random or uniform way. This assumption is actually quite remote from reality: field biologists know from experience that neither the living beings nor the variables they use to describe the environment are distributed uniformly or at random. The environment can be considered as primarily structured by broad-scale physical processes (geomorphology on land, currents and winds in fluids) that generate gradients and/or patchy structures separated by discontinuities (interfaces). These structures induce similar responses in biological systems. Furthermore, even in zones that appear homogeneous at a given spatial scale, finer-scale contagious abiotic or biotic processes take place, generating more spatial structuring through reproduction and death, predator-prey interactions, food availability, parasitism, and so on.

Thus one can see that spatial heterogeneity is functional in ecosystems, and not the result of some random, noise-generating process. Therefore, it is important to study it for its own sake. Ecosystems without spatial structuring would be unlikely to function. Imagine the consequences: large-scale homogeneity would cut down on diversity of habitats, feeders would not be found close to their food, mates would be located at random throughout the landscape, newborns would be spread around instead of remaining in favorable environments... Unrealistic as it may seem, this view is still present in several of our theories and models describing population and community functioning.

The spatial organization of ecosystems thus has to be incorporated into theories; otherwise these will remain suboptimal. In general terms, more


and more theories admit that the elements of an ecosystem that are close to one another in space or time are more likely to be influenced by the same generating processes. Such is the case, for instance, for the theories of competition, succession, evolution and adaptation (historic autocorrelation), maintenance of species diversity, parasitism, population genetics, population growth, predator-prey interactions, and social behaviour.

6.1.2 Importance in sampling strategy

The very fact that every living community is spatially structured has consequences for sampling strategy. One should be aware that the sampling strategy strongly influences the perception of the spatial structure of the sampled population or community. For instance, in a site where the variables to be sampled are structured in more or less regular patches, systematic sampling could lead to completely altered estimations of the spatial structure if the intersample distance is larger than one half of the inter-patch distance (Figure 35, see below).

This example may seem trivial, especially to botanists... but botanists have a major advantage over zoologists in that they see what they sample. This is not the case in, say, the ecology of aquatic or soil organisms. When one samples a soil community following a systematic pattern, for example, it is quite possible that a part of the sampled species distributions will be estimated correctly, whereas others will be totally misinterpreted. Thus, when one aims to study the spatial distribution of organisms, a random-based sampling strategy seems preferable, in that it allows very different and unrelated inter-sample distances to be sampled. Even when mapping is planned, it is always possible to estimate a regular grid of values on the basis of a random set of measurements.


Figure 35 - The danger of systematic sampling

6.1.3 Importance in statistics

When the variables to be sampled are spatially structured, one of the most fundamental assumptions of classical statistics, that is, the assumption of independence of the observations, is violated. In a situation where no spatial structure is present, the resemblance patterns between all pairs of sample units are independent of the geographical distances between the sample units. In other words, one cannot predict the value of a variable (or the species composition of a community sample unit) on the basis of a few other, neighbouring units. On the contrary, when there is a structure and one knows its


general shape (gradient, patches...), one can predict at least roughly the content of a sample unit on the basis of the other ones. Such a sample set (or its variables) is said to be spatially autocorrelated (Figure 36).

Figure 36 - Three types of spatial structures.


Due to the violation of the assumption of independence of the observations, it is not possible to perform standard statistical tests of hypotheses on spatially autocorrelated data. In most natural cases, the data are positively autocorrelated at short distances, which means that any two close sample units resemble each other more than predicted by a random (uncorrelated) structure. In such cases, for instance, the classical statistical procedures estimate a confidence interval around a Pearson correlation coefficient that is narrower than it is in reality, so that one declares too often that the coefficient is different from zero (Figure 37):

Figure 37 - Underestimation of the confidence interval around a Pearson r correlation coefficient in the presence of autocorrelation.

One can understand this from the point of view of the degrees of freedom: normally one counts one degree of freedom for each independent observation, and this allows one to choose the appropriate statistical distribution for the given test. Now, if the observations are not independent but rather autocorrelated, each new observation does not bring a full degree of freedom, but only a fraction. It follows that the total actual number of degrees of freedom is smaller than estimated by the classical procedures. Thus the consequence: for a given total variance, the smaller the number of d.f., the broader the actual confidence interval.


In limited cases it is now possible to correct for autocorrelation when estimating the number of d.f., but this is by no means an easy task.

6.1.4 Importance in data interpretation and modelling

As said previously, the spatial structuring of living communities and of their controlling environmental factors is functional. Thus, it is important to include this structuring in theories, in data analyses, and in models.

Spatial structure in the data has many consequences for the interpretation of analytical results. For instance, spatial structuring that is shared by response and explanatory variables can induce spurious correlations, leading incorrect causal models to be accepted. Partial analyses may make it possible to avoid this pitfall. On the other hand, proper handling of spatial descriptors makes it possible to explain the data variation in a more detailed way, by discriminating between environmental, spatial, and mixed relationships.

So, by taking the spatial structure into account when analyzing multivariate data sets, it is often possible to elucidate more ecological relationships, avoid misinterpretations, and explain more data variation. This insight allows one to build more realistic models.

Last but not least, spatial structures can be mapped. Mapping is not only a nice tool (or toy!) for illustration, but also a powerful way of exploring data structure and generating new hypotheses, especially when the data contain a significant amount of spatial structure that is not related to the measured environmental variables.

This chapter addresses the topics of some measures of autocorrelation, the components of spatial structure, and some techniques of spatial modeling.


6.2 The measure of spatial autocorrelation

The introduction above makes clear that there are multiple reasons to test in an unambiguous way whether the data are autocorrelated. One can run such tests either to verify that there is no spatial structure, and use parametric tests afterwards, or, on the contrary, to confirm the presence of a spatial structure in order to study it in more detail.

Such tests are built on coefficients of spatial autocorrelation, which have the double advantage of testing for spatial structures and providing a simple description of them. Here we will first present the tests for univariate data (Moran's I and Geary's c), and then the Mantel correlogram for multivariate data.

6.2.1 One single variable: intuitive introduction

6.2.1.1 One spatial dimension

Imagine that you have collected a series of measures of bulk density of a soil along a transect. You can construct a graph associating the position of the sites and the measures of density (Figure 38):

Figure 38 - Substratum density along a transect (fictitious data)


In these data, one can clearly see a periodic structure. In such a case, one can predict at least approximately the value at one site on the basis of the values nearby. The series is thus autocorrelated. An easy way of studying such a series is to correlate it with itself (auto-correlating it!) several times, introducing a shift (Table XII):

Table XII - Correlation of a data series with itself: auto-correlation. In this example, for demonstration purposes only, Pearson's r correlation coefficient is used for simplicity.

Lag 0 : 8 9 7 5 3 1 2 1 5 6 9 9 8 5 3 1 2 1 4 6 8 9 7 5 3 1 2 1 5 6 9 9 8 5 3 1 2 1 4 6 Pearson's r between the 2 series = 1

Lag 1 :(8)9 7 5 3 1 2 1 5 6 9 9 8 5 3 1 2 1 4 6 8 9 7 5 3 1 2 1 5 6 9 9 8 5 3 1 2 1 4(6) r = 0.75

Lag 2 :(8 9)7 5 3 1 2 1 5 6 9 9 8 5 3 1 2 1 4 6 8 9 7 5 3 1 2 1 5 6 9 9 8 5 3 1 2 1(4 6) r = 0.33

Lag 3 :(8 9 7)5 3 1 2 1 5 6 9 9 8 5 3 1 2 1 4 6 8 9 7 5 3 1 2 1 5 6 9 9 8 5 3 1 2(1 4 6) r = – 0.25

Lag 4 :(8 9 7 5)3 1 2 1 5 6 9 9 8 5 3 1 2 1 4 6 8 9 7 5 3 1 2 1 5 6 9 9 8 5 3 1(2 1 4 6) r = – 0.73

Lag 5 :(8 9 7 5 3)1 2 1 5 6 9 9 8 5 3 1 2 1 4 6 8 9 7 5 3 1 2 1 5 6 9 9 8 5 3(1 2 1 4 6) r = – 0.95

Lag 6 :(8 9 7 5 3 1)2 1 5 6 9 9 8 5 3 1 2 1 4 6 8 9 7 5 3 1 2 1 5 6 9 9 8 5(3 1 2 1 4 6) r = – 0.81

Lag 7 :(8 9 7 5 3 1 2)1 5 6 9 9 8 5 3 1 2 1 4 6 8 9 7 5 3 1 2 1 5 6 9 9 8(5 3 1 2 1 4 6) r = – 0.36

Lag 8 :(8 9 7 5 3 1 2 1)5 6 9 9 8 5 3 1 2 1 4 6 8 9 7 5 3 1 2 1 5 6 9 9(8 5 3 1...) r = 0.25

Lag 9 :(8 9 7 5 3 1 2 1 5)6 9 9 8 5 3 1 2 1 4 6 8 9 7 5 3 1 2 1 5 6 9(9 8 5 3...) r = 0.69

Lag 10:(8 9 7 5 3 1 2 1 5 6)9 9 8 5 3 1 2 1 4 6 8 9 7 5 3 1 2 1 5 6(9 9 8 5...) r = 0.99

Lag 11:(8 9 7 5 3 1 2 1 5 6 9)9 8 5 3 1 2 1 4 6 8 9 7 5 3 1 2 1 5(6 9 9 8...) r = 0.82


Lag 12:(8 9 7 5 3 1 2 1 5 6 9 9)8 5 3 1 2 1 4 6 8 9 7 5 3 1 2 1(5 6 9 9...) r = 0.38

Lag 13:(8 9 7 5 3 1 2 1 5 6 9 9 8)5 3 1 2 1 4 6 8 9 7 5 3 1 2(1 5 6 9...) r = – 0.19

Lag 14:(8 9 7 5 3 1 2 1 5 6 9 9 8 5)3 1 2 1 4 6 8 9 7 5 3 1(2 1 5 6...) r = – 0.79

Lag 15:(8 9 7 5 3 1 2 1 5 6 9 9 8 5 3)1 2 1 4 6 8 9 7 5 3(1 2 1 5...) r = – 0.89

... and we stop at lag 15, because there are not enough pairs of values left.
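The computation of Table XII can be reproduced with a short script (an illustrative sketch in plain Python, assuming no statistical library). Each lagged correlation simply drops `lag` values from one end of each copy of the series, exactly as the parentheses above indicate:

```python
def pearson(x, y):
    # Plain Pearson correlation between two equal-length sequences.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# The 40 fictitious density values of Figure 38 (a 20-value pattern, twice).
series = [8, 9, 7, 5, 3, 1, 2, 1, 5, 6, 9, 9, 8, 5, 3, 1, 2, 1, 4, 6] * 2

# r between the series and itself shifted by `lag` positions, for lags 0..15.
correlogram = [1.0 if lag == 0 else pearson(series[:-lag], series[lag:])
               for lag in range(16)]
```

The resulting values swing from strongly negative around lag 5 to strongly positive around lag 10, following the roughly 10-unit period of the data and matching the sign pattern of Table XII.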

The result of such an analysis is generally presented in the form of a correlogram, where the abscissa represents the lag, and the ordinate the correlation values (Figure 39):

[Correlogram plot. Abscissa: lag ("Pas", i.e. shift of the series relative to itself), from 0 to 15; ordinate: correlation ("Corrélation"), from -1 to +1.]

Figure 39 - Correlogram of the fictitious data of Fig. 38. Beware: the lag ("pas") does not represent spatial coordinates, but distances between pairs of sites! For instance, at lag 5, r = -0.95 means that any pair of sites whose members are 5 units apart (for instance sites 3 and 8, or 10 and 15...) probably have very different values of density: one of the sites has a high value, the other a low one.


6.2.1.2 Two spatial dimensions (surface)

The next step of our reasoning is to extend it to a surface, in the case of an isotropic process (the spatial pattern is more or less identical in all directions). In this case, one can no longer organize the data in a linear series. Rather, one computes a matrix of Euclidean (geographical) distances among all pairs of sites, and then the distances are grouped in classes. For instance, all distances shorter than 1 meter belong to class 1, the distances between 1 and 2 meters are put in class 2, and so on. Table XIII gives an example:

Table XIII - Construction of a matrix of classes of distance

       2     3     4     5     6             2  3  4  5  6
1    0.23  2.82  1.65  0.89  1.23        1   1  3  2  1  2
2          1.45  2.44  0.32  1.87   ->   2      2  3  1  2
3                3.56  0.09  2.11        3         4  1  3
4                      2.70  1.15        4            3  2
5                            1.34        5               2

  Matrix of Euclidean distances        Matrix of distance classes

On this basis (but with more than 6 sites!) one could compute autocorrelation indices for the 4 classes of geographical distances. For instance, for class 1 the value would be computed on the basis of pairs 1-2, 1-5, 2-5 and 3-5. And so on. For n objects, one always has n(n-1)/2 distances, which should be grouped into an appropriate number of classes: neither too high (too few values in each class) nor too low (to avoid the analysis being too coarse). Sturges' rule is often used to decide how many classes are appropriate:

Nb of classes = 1 + 3.3 log10(m) (rounded to the nearest integer)

where m is, in this case, the number of distances in the upper triangular matrix of distances (excluding the diagonal).
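The construction of distance classes and the application of Sturges' rule can be sketched as follows (illustrative Python, with equal-width classes between 0 and the largest distance, as in the "equidistant classes" of the example of Section 6.2.2; `math.dist` requires Python 3.8 or later):

```python
import math

def distance_classes(coords):
    """Euclidean distances among all pairs of sites, grouped into
    equal-width classes; the number of classes follows Sturges' rule."""
    n = len(coords)
    d = {(h, i): math.dist(coords[h], coords[i])
         for h in range(n) for i in range(h + 1, n)}
    m = len(d)                               # n(n - 1) / 2 distances
    k = round(1 + 3.3 * math.log10(m))       # Sturges' rule
    width = max(d.values()) / k
    # class 1 holds the shortest distances, class k the longest
    classes = {pair: min(k, int(v // width) + 1) for pair, v in d.items()}
    return d, classes, k
```

For the 8 x 8 grid of the example of Section 6.2.2 (64 sites, hence m = 2016 pairs), Sturges' rule suggests 12 classes; that example uses 6 instead, which is also a legitimate choice.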


6.2.2 Indices of spatial autocorrelation: Moran's I and Geary's c

In practice, the two most widely used indices are Moran's I, which behaves approximately like a Pearson correlation coefficient, and Geary's c, which is a sort of distance.

Moran's I is computed as follows (for distance class d):

$$I(d) = \frac{\dfrac{1}{W}\sum_{h=1}^{n}\sum_{i=1}^{n} w_{hi}\,(y_h - \bar{y})(y_i - \bar{y})}{\dfrac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2} \qquad \text{for } h \neq i$$

and Geary's c:

$$c(d) = \frac{\dfrac{1}{2W}\sum_{h=1}^{n}\sum_{i=1}^{n} w_{hi}\,(y_h - y_i)^2}{\dfrac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})^2} \qquad \text{for } h \neq i$$

yh and yi are the values of the variable in objects h and i; n is the number of objects; the weights whi are equal to 1 when the pair (h,i) belongs to distance class d (the one for which the index is computed) and 0 otherwise. W is the sum of all whi, i.e. the number of pairs of objects belonging to distance class d.

Moran's I generally varies between -1 and +1, although values beyond these limits are not impossible. Positive autocorrelation yields positive values of I, and negative autocorrelation produces negative values. Moran's I expected value in the absence of autocorrelation is equal to E(I) = -1/(n - 1), i.e., close to 0 when n is large.

Geary's c varies from 0 to some unspecified value larger than 1. Positive autocorrelation translates into values from 0 to 1, while negative autocorrelation yields values larger than 1. Geary's c expectation in the absence of autocorrelation is equal to E(c) = 1.
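Both indices can be computed directly from these definitions (an illustrative Python sketch; `w` stands for the binary weight matrix of the distance class under study, built as in Table XIII):

```python
def moran_geary(y, w):
    """Moran's I and Geary's c for one distance class.
    y: list of n values; w: symmetric n x n 0/1 matrix with w[h][i] = 1
    when the pair (h, i) belongs to the class (0 on the diagonal)."""
    n = len(y)
    ybar = sum(y) / n
    pairs = [(h, i) for h in range(n) for i in range(n) if h != i]
    W = sum(w[h][i] for h, i in pairs)          # number of pairs in the class
    ss = sum((v - ybar) ** 2 for v in y)        # sum of squared deviations
    I = (sum(w[h][i] * (y[h] - ybar) * (y[i] - ybar) for h, i in pairs) / W) \
        / (ss / n)
    c = (sum(w[h][i] * (y[h] - y[i]) ** 2 for h, i in pairs) / (2 * W)) \
        / (ss / (n - 1))
    return I, c
```

On a tiny transect y = (1, 2, 3, 4) with class 1 = adjacent sites, this gives I = 1/3 (well above its expectation E(I) = -1/(n - 1) = -1/3) and c = 0.3 (below E(c) = 1): positive autocorrelation at distance 1, as expected for a monotonic gradient.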


Example

The 8 x 8 grid below (Table XIV and Figure 40) is a fictitious data set where the value of a variable z has been measured on a regular sampling design. One can see a gradient in the data.

Table XIV - Fictitious, spatially referenced sample set. 64 sites.

11 10 9 7 7 6 4 2

9 11 7 6 5 3 2 2

10 9 6 10 8 4 5 3

8 9 7 5 4 3 3 2

7 6 5 6 4 4 3 2

5 5 5 4 5 3 1 3

5 4 3 2 3 3 2 2

3 4 2 2 1 3 1 1

Figure 40 - Grid map of the data given in Table XIV.

Each site is characterized by its x - y spatial coordinates and the value of the measured variable, z. In this example, let us state that the horizontal and vertical intersite distance is equal to 1 m. The computation of a correlogram involves 4 steps:


1. Computation of a matrix of Euclidean distances among sites.

2. Transformation of these distances into classes of distances.

3. Computation of Moran's I or Geary's c for all distance classes.

4. Drawing of the correlograms.

The hypotheses of the tests run on each distance class can be worded as follows:

H0: there is no spatial autocorrelation. The values of variable z are spatially independent. Each Moran's I or Geary's c value is close to its expectation.

H1: there is significant spatial autocorrelation in the data. At least one autocorrelation value is significant at the Bonferroni-corrected level of significance (see below).

The result below (with 6 equidistant classes) has been obtained using the R package for multivariate data analysis by Pierre Legendre, Alain Vaudor and Philippe Casgrain. This package, not to be confused with the R language, is available for Macintosh from Pierre Legendre's web site:

http://www.bio.umontreal.ca/legendre/

Classes équidistantes

Classe   Limite sup.   Fréq.
  1        1.64992      210
  2        3.29983      556
  3        4.94975      442
  4        6.59967      560
  5        8.24958      218
  6        9.89950       30


FICHIER DE DONNEES grad.8x8.r
Option du mouvement: Matrice SIMIL
Note: les probabilités sont plus significatives près de zéro;
les probabilités sont à plus ou moins 0.00100

H0: I = 0   I = 0   C = 1   C = 1
H1: I > 0   I < 0   C < 1   C > 1

Dist.   I(Moran)   p(H0)   C(Geary)   p(H0)   Card.*
  1      0.6723    0.000    0.2366    0.000     420
  2      0.3995    0.000    0.4601    0.000    1112
  3      0.0225    0.177    0.8660    0.008     884
  4     -0.3678    0.000    1.4245    0.000    1120
  5     -0.7674    0.000    2.0580    0.000     436
  6     -1.0667    0.000    2.7131    0.000      60
Total                                          4032

* Card. means "cardinality", i.e. the number of pairs of observations in each distance class, in a square distance matrix, diagonal excluded.

The following correlograms can be drawn from the results above:

Figure 41 - Moran's I correlogram, data from Table XIV.


Figure 42 - Geary's c correlogram, data from Table XIV.

In these graphs, the black squares are significant values (at an uncorrected α level of 0.05, but see below!) and the white squares non-significant ones. The output file gives the exact probabilities.

The opposite aspect of the curves illustrates well the fact that Moran's I behaves like a correlation (positive values mean positive autocorrelation) while Geary's c behaves like a distance (lowest value for highest positive autocorrelation).

6.2.3 Correction for multiple testing

One problem remains, related to the α probability level of rejection of the null hypothesis. Such a threshold is defined for one test. In our case, since we computed six autocorrelation values (one for each distance class), six tests have been performed simultaneously. In such


a case, the type I error (rejecting H0 when it is true) is no longer 0.05 (if this was the selected level). For one test, a threshold of 5% means that, among 100 tests run on completely random variables, 5 will (wrongly) be declared significant and 95 will be declared non-significant. Cumulating 6 tests at the 5% level means that the chance of rejecting H0 at least once is equal to 1 - 0.95⁶ = 0.265 instead of 0.05! The most drastic (conservative) remedy to this situation is called the Bonferroni correction, which consists in dividing the global threshold level by the number of simultaneous tests. In our case, for each correlogram, this means that, globally, the correlogram will be declared significant if at least one autocorrelation value is significant at the corrected α level of 0.05/6 = 0.0083. One can verify that, with this correction, the global chance of accepting H0 when it is true is equal to (1 - 0.0083)⁶ ≈ 0.95.

This correction is very conservative, however, and can lead to type II errors (accepting H0 when it is false) when the tests involved in the multiple comparison are not independent, i.e. when they address a series of related questions and data. Several alternatives have been proposed in the literature. An interesting one is the Holm correction. It consists in (1) running all the k tests, (2) ordering the probabilities in increasing order, and (3) correcting each α significance level by dividing it by (1 + the number of remaining tests) in the ordered series. In this way, the first level is divided by k, the second one by k - 1, and so on. The procedure stops when a non-significant value is encountered.
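The Holm procedure just described can be sketched as follows (illustrative Python, not taken from any statistical package):

```python
def holm(p_values, alpha=0.05):
    """Holm's sequential correction: sort the k p-values in increasing
    order and compare the i-th smallest (i = 1..k) to alpha / (k - i + 1),
    stopping at the first non-significant result."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])
    rejected = [False] * k
    for rank, i in enumerate(order):        # rank 0: divide alpha by k
        if p_values[i] <= alpha / (k - rank):
            rejected[i] = True
        else:
            break                           # all remaining H0 are accepted
    return rejected
```

With p-values (0.001, 0.02, 0.03, 0.04) and α = 0.05, only the first H0 is rejected (0.02 > 0.05/3 stops the procedure); with (0.001, 0.01, 0.02, 0.04) all four are rejected, whereas plain Bonferroni (fixed threshold 0.05/4 = 0.0125) would reject only the first two.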

6.2.4 Multidimensional data: the Mantel correlogram

If one wants to explore the autocorrelation structure of a multivariate data set (for instance a matrix of species abundances), one can resort to the Mantel correlogram. This technique is based on the Mantel statistic. Each class of the matrix of distance classes is represented by a binary (0 - 1) matrix where a pair of objects belonging to that class


receives the value 1 and the others 0. The tests are permutational (as in the case of the Mantel test). If the standardized Mantel statistic is used, the values range between -1 and +1 and the interpretation of the Mantel correlogram is similar to that of the Moran correlogram, bearing in mind that the expectation of the Mantel statistic under H0 is strictly 0.
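One point of a Mantel correlogram can be sketched as below (illustrative Python; `D` is an ecological distance matrix and `dist_class` the matrix of geographical distance classes built as in Table XIII). Note that when D holds distances, a negative standardized Mantel r for a short-distance class indicates positive spatial autocorrelation (sites in that class are more similar than average); some programs reverse the sign so that the plot reads like a Moran correlogram:

```python
def standardized_mantel_r(A, B):
    # Pearson r over the corresponding off-diagonal cells of two matrices.
    n = len(A)
    a = [A[i][j] for i in range(n) for j in range(n) if i != j]
    b = [B[i][j] for i in range(n) for j in range(n) if i != j]
    m = len(a)
    ma, mb = sum(a) / m, sum(b) / m
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = (sum((x - ma) ** 2 for x in a) *
           sum((y - mb) ** 2 for y in b)) ** 0.5
    return num / den

def mantel_correlogram_point(D, dist_class, k):
    """Mantel r between D and the binary matrix coding membership
    of the pairs of sites in geographical distance class k."""
    n = len(D)
    G = [[1 if (i != j and dist_class[i][j] == k) else 0
          for j in range(n)] for i in range(n)]
    return standardized_mantel_r(D, G)
```

The significance of each point is then assessed by permutation, exactly as for the simple Mantel test.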

6.2.5 Remarks

1. Many other structure functions are available to describe and model spatial structures. A very important one is the semi-variogram, which uses a measure of variance among values of sites belonging to various distance classes to build a model that can be used either to describe the spatial structure (as in the case of the correlogram) or to model it, especially for mapping and prediction purposes (as in the case of kriging).

2. Unless otherwise specified, these methods consider that the spatial structures are isotropic (same in all directions). But anisotropy can be addressed in several ways, for instance by modifying the matrix of distance classes.

3. When statistical tests are run to identify spatial structures, they require that the condition of second-order stationarity be satisfied. This condition states that the expected value (mean) and spatial covariance (numerator of Moran's I) of the variable are the same all over the study area, and the variance (denominator) is finite. A relaxed form of the stationarity hypothesis, the intrinsic assumption, states that the differences (yh - yi) for any distance must have zero mean and constant and finite variance over the study area.


6.3 Modeling spatial structures

6.3.1 Introduction: the 3 components of spatial structure

For a good understanding of the nature of spatial variation, it is useful to decompose it into three independent components (Figure 43):

1. A major structural component: the overall mean of the variable(s) across the whole sampling area. This mean may vary in a continuous way on one or several axes across the area. In this case it is said to show a trend. The presence of a trend in ecological data is generally interpreted as the action of a factor at a scale larger than the study area.

2. A variation component that is spatially autocorrelated, but at a finer scale than the trend, called a regional scale. This variation can often be interpreted as the result of either biotic processes or the action of environmental forcing on the studied variables.

3. A random, uncorrelated variation, arising from observation or analytical error, or from variation that may be structured (correlated) at a scale too fine to be resolved by the sampling design.

Figure 43 - The three components of spatial variation


A spatial structure may appear in a variable y because the process that has produced the values of y is spatial and has generated autocorrelation in the data; or it may be caused by dependence of y upon one or several causal variables x which are spatially structured; or both.

The following sections present two techniques to model spatial structures in the univariate or multivariate context. The first technique, trend surface analysis, is a crude method mainly adequate to model simple gradients, or to remove them from the data (when detrending is necessary). The second method, PCNM analysis, has been developed recently to model spatial structures at all scales resolved by a given sampling design.

6.3.2 Trend surface analysis

This technique is a particular case of multiple regression, where the explanatory variables are geographical (x-y) coordinates, sometimes completed by higher-order polynomial terms. When applying this method, one generally supposes that the spatial structure of the observed variable is a result of one or two generating processes that spread over the whole studied area, and that the resulting broad-scale structure of the dependent variable can be modelled by means of a polynomial of the spatial coordinates of the samples. A simple example follows:

Imagine a soil arthropod, the density of which (let us call it z) increases from 0 (near a stream) to 100 individuals per square meter (in a nearby meadow). If this density variation is linear, a simple linear regression, with the distance to the stream (x) acting as explanatory variable, is enough to model the arthropod density in the whole meadow (Figure 44):

ẑ = b0 + b1x



Figure 44 - Density of an arthropod species along a gradient and linear model.

Now, if the stream (with its neighbouring meadows) extends from higher mountains to sea level, perhaps the arthropod density varies also with the altitude (y). A second explanatory variable is necessary, i.e. the altitude, or possibly the distance to the source along the stream. If the density variation with respect to the altitude is also linear, one gets a first-order multiple regression equation of the form:

ẑ = b0 + b1x + b2y

The result is thus a regression plane fitted through the z data (densities) by means of the x-y coordinates of the arthropod sampling points (Figure 45).


Figure 45 - Density of an arthropod species along a double gradient and linear model.


If a plane does not explain enough variation, one can try to fit higher-order polynomials, by adding second-, third-... order x-y terms and their products. The following equation is a cubic trend surface equation:

ẑ = b0 + b1x + b2y + b3x² + b4xy + b5y² + b6x³ + b7x²y + b8xy² + b9y³

It is easy to visualize the outcome of the addition of one order to a trend surface model by remembering that each addition of an order allows one more fold to the surface (Figure 46):

Figure 46 - Example of trend surface analysis, equations of order 1, 2, 3 and 5.


Trend surface analysis can model relatively simple structures with a reasonable number of “hills” and “holes” resulting from one or two long-range trends (hence the name) across the sampling area. But this method, although easy to compute, suffers from several conceptual and practical problems, and should be used with great care. Here are some of these problems:

Conceptual problem:

- fitting a trend surface is useful only when the trend has an underlying physical or biological explanation, or if it can help generate biological hypotheses; interpretation of individual terms is often difficult;

Practical problems:

- when data points are few, extreme values can seriously distort the surface;

- the surfaces are extremely susceptible to edge effects. Higher-order polynomials can turn abruptly near area edges, leading to unrealistic values;

- trend surfaces are inexact interpolators. Because they are long-range models, extreme values of distant data points can exert an unduly large influence, resulting in poor local estimates of the studied variable.

Detrending

Despite its problems, trend surface analysis is very useful in one specific case. It was said in Section 6.2.5 that, for testing, the condition of second-order stationarity or, at least, the intrinsic assumption must be satisfied. Removing a trend from the data at least makes the mean constant over the sampling area (although it does not address any problem of heterogeneity of variance). Furthermore, most methods of spatial analysis are devised to model the intermediate-scale component of spatial variation, and are therefore much more powerful on detrended data. For these reasons, trend surface analysis is often used to detrend data: one fits a plane to the data and analyzes the finer-scale structure on the residuals of this regression (this is equivalent to subtracting the fitted values from the raw data and working with what remains).
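
The detrending step can be sketched as follows (illustrative code, not from the book):

```python
import numpy as np

def detrend(z, x, y):
    """Remove a first-order (planar) spatial trend: regress z on the
    x-y coordinates and return the residuals."""
    X = np.column_stack([np.ones_like(x), x, y])
    b, *_ = np.linalg.lstsq(X, z, rcond=None)
    return z - X @ b
```

The residuals have zero mean over the sampling area and are then analyzed for finer-scale structure.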

6.3.3 Principal Coordinates of Neighbour Matrices (PCNM)

As said above, the coarseness of trend-surface analysis presents a problem: fine structures cannot be adequately modelled by this method. Too many parameters would be required to do so, especially in the bidimensional case: the number of terms of the polynomial function grows very quickly, making the third order (with nine terms) the highest one that remains practically usable, despite its coarseness in terms of spatial resolution. Polynomials can be turned into orthogonal polynomials, either by using a Gram-Schmidt orthogonalization procedure, or by carrying out a principal component analysis (PCA) on the matrix of monomials. A new difficulty then arises: each new orthogonal variable is a linear combination of several (in the case of the Gram-Schmidt orthogonalization) or all (in the case of PCA) of the original variables; it no longer represents a single spatial scale or direction.

In recent years, researchers have become more aware of the fact that ecological processes occur at defined scales, and that their perception and modelling depend upon a proper matching of the sampling strategy to the size, grain and extent of the study, and on the statistical tools used to analyze the data. This has generated the need for analytical techniques devised to reveal the spatial structures of a data set at any scale that can be perceived by the sampling design. This is why Borcard & Legendre (2002)1 and Borcard et al. (2004)2 have

1 Borcard, D. & P. Legendre. 2002. All-scale spatial analysis of ecological data by means of principal coordinates of neighbour matrices. Ecological Modelling 153: 51-68.

2 Borcard, D., P. Legendre, C. Avois-Jacquet & H. Tuomisto. 2004. Dissecting the spatial structures of ecological data at all scales. Ecology 85(7): 1826-1832.

proposed a method for detecting and quantifying spatial patterns over a wide range of scales. This method can be applied to any set of sites providing a good coverage of the geographic sampling area. It will be presented below in the unidimensional context, where it has the further advantage of being usable even for short (n > 25) data series. Most of the text below is adapted from Borcard & Legendre (2002).

The analysis begins by coding the spatial information in a form that allows recovering various structures over the whole range of scales encompassed by the sampling design. This technique works on data sampled along linear transects as well as on geographic surfaces or in three-dimensional space. The demonstration below is made on a univariate, unidimensional case for the sake of clarity. Figure 47 displays the steps of a complete spatial analysis using principal coordinates of neighbour matrices (PCNM).

A. Modified (truncated) matrix of Euclidean distances

First, we construct a matrix of Euclidean distances among the sites. Then, we define a threshold below which the Euclidean distances are kept as measured, and above which all distances are considered “large”, the corresponding numbers being replaced by an arbitrarily large value. This “large” value has been empirically set equal to four times the threshold value. Beyond this value, the principal coordinates remain the same to within a multiplicative constant.

For instance, in the case of a linear transect made of sampling points regularly spaced 1 metre apart, we could set the threshold at 1 metre to retain only the closest neighbours, and replace all other distances in the matrix by 1.0 m × 4 = 4.0 m.
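
For the unidimensional case, the truncation step can be sketched as follows (illustrative code):

```python
import numpy as np

def truncated_distances(positions, threshold):
    """Euclidean distance matrix among sites on a transect; distances
    larger than the threshold are replaced by 4 * threshold."""
    d = np.abs(positions[:, None] - positions[None, :])
    d[d > threshold] = 4.0 * threshold
    return d
```

For a transect with 1 m spacing and a 1 m threshold, the matrix keeps the 1 m distances to immediate neighbours and sets every other off-diagonal entry to 4 m.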

Figure 47 - The computational steps of a PCNM analysis.

B. Principal coordinate analysis of the truncated distance matrix

The second step is to compute the principal coordinates of the modified distance matrix. This is necessary because we need our spatial information to be represented in a form compatible with applications of multiple regression or canonical ordination (redundancy analysis, RDA, or canonical correspondence analysis, CCA), i.e., as an object-by-variable matrix. We obtain several positive eigenvalues, one or several null eigenvalues, and several negative eigenvalues: since the truncated distance matrix is not Euclidean, principal coordinate analysis (PCoA) cannot represent it entirely in a space of Euclidean coordinates. When the PCoA is computed in the usual manner, the negative eigenvalues cannot be used as such because the corresponding axes are complex (i.e., the coordinates of the sites along these axes are complex numbers). A modified form of the analysis allows them to be computed, but it will not be detailed here.

The principal coordinates derived from the positive eigenvalues can now be used as explanatory variables in multiple regression, RDA, or CCA, depending on the context.
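
Steps A and B can be sketched together as follows (a minimal illustration; the published method and its implementations handle details such as the negative eigenvalues more carefully):

```python
import numpy as np

def pcnm_variables(d):
    """Principal coordinates of a truncated distance matrix:
    Gower-centre -0.5 * d**2, eigen-decompose, and keep only the axes
    with positive eigenvalues, each scaled by sqrt(eigenvalue)."""
    n = d.shape[0]
    a = -0.5 * d**2
    h = np.eye(n) - np.ones((n, n)) / n         # centring matrix
    g = h @ a @ h                               # Gower-centred matrix
    w, v = np.linalg.eigh(g)                    # ascending eigenvalues
    keep = w > 1e-8 * w.max()                   # positive eigenvalues only
    w, v = w[keep][::-1], v[:, keep][:, ::-1]   # largest (broadest) first
    return v * np.sqrt(w)
```

For 100 equidistant points truncated at one unit, this yields the roughly 2n/3 orthogonal, sine-shaped variables described below.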

When computed from a distance matrix corresponding to n equidistant objects arranged along a straight line, as in Figure 47, truncated with a threshold of one unit (MAX = 1, i.e., only the immediate neighbours are retained), the principal coordinates correspond to a series of sine waves with decreasing periods (Figure 48); the largest period is n+1, and the smallest one is equal to or slightly larger than 3. The number of principal coordinates is a round integer corresponding to two-thirds of the number of objects. If the truncation threshold is larger than 1, fewer principal coordinates are obtained, and several among the last (finer) ones are distorted, showing aliasing of structures with periods too short to be represented adequately by the discrete site coordinates, a behaviour that degrades the performance of the method.

Thus, the PCNM method bears a superficial resemblance to Fourier analysis and harmonic regression, but it is more general, since it can model a wider range of signals and can be used with irregularly spaced data.

Figure 48 - Eight of the 67 principal coordinates obtained by principal coordinate analysis of a matrix of Euclidean distances among 100 objects, truncated after the first neighbours.

Borcard & Legendre (2002) have shown by simulations that PCNM analysis has a correct type I error and sufficient power to detect various types of spatial structures: gradients, single bumps, sine waves, as well as random but spatially autocorrelated signals.

When used on structurally complex data, PCNM analysis also succeeds in recovering spatial structures at various scales. This can be achieved by building subsets of PCNM variables, thereby constructing additive submodels that can be interpreted a posteriori by means of environmental variables, or used to build hypotheses about the processes that have generated the structures. Real-world applications are presented by Borcard et al. (2004) and, for instance, Brind'Amour et al. (2005)3.

C. Example on artificial data

Borcard & Legendre (2002) present an example involving artificial data constructed by combining various kinds of signals usually present in real data, plus two types of noise. This provides a pattern that has the double advantage of being realistic and controlled, thereby permitting a precise assessment of the potential of the method to recover the structured part of the signal and to dissect it into its primary components.

Construction of the artificial data - The data were constructed by adding the following components together (Figure 49) into a transect consisting of 100 equidistant observations:

1) a linear trend (Fig. 49a);

2) a single normal patch in the centre of the transect (Fig. 49b);

3) 4 waves (= a sine wave with a period of 25 units) (Fig. 49c);

4) 17 waves (i.e., a sine wave with a period of approximately 5.9 sampling units) (Fig. 49d);

5) a random autocorrelated variable, with autocorrelation determined by a spherical variogram with nugget value = 0 and range = 5 (Fig. 49e);

6) a noise component drawn from a random normal distribution with mean = 0 and variance = 4 (Fig. 49f).

3 Brind'Amour, A., D. Boisclair, P. Legendre & D. Borcard. 2005. Multiscale spatial distribution of a littoral fish community in relation to environmental variables. Limnology and Oceanography 50: 465-479.

Figure 49 - Construction of the artificial pseudo-ecological data set of known properties. The six components added together are shown, with their contributions to the variance of the final signal.

In the final artificial response variable, the random noise (Fig. 49f) contributed more than half of the total variance. Thus, the spatially structured components of the compound signal (Fig. 49a to 49e) were well hidden in the noise, as is often the case with real ecological data.

Data analysis - The spatial analysis consists in the following steps:

(1) Detrending of the dependent variable (done here because a strong and significant trend was present).

(2) Since this example involves a single dependent variable, multiple linear regression of the detrended dependent variable on the 67 spatial variables built as explained before.

The main question at this step is to decide what kind of model is appropriate: a global one, retaining all the spatial variables and yielding an R² as high as possible, or a more parsimonious model based on the most significant spatial variables? The answer may depend on the problem, but most applications so far have included some sort of thinning of the model. Remember that the number of parameters of the global model is equal to about 67% of the number of objects, a situation which may often lead to an overstated value of R² by chance alone (this can be corrected by the use of an adjusted R², however). A convenient solution consists in testing the significance of all the (partial) regression coefficients and retaining only the principal coordinates that are significant at a predetermined (one-tailed) probability value. All tests can be done using a single series of permutations if the permutable units are the residuals of a full model (Anderson & Legendre, 19994; Legendre & Legendre, 1998), which is the case here. The explanatory variables being orthogonal, no recomputation of the coefficients of the “minimum” model is necessary. Note, however, that a new series of statistical tests based upon the minimum model would give different results, since the denominator (residual mean square) of the F statistic would have changed.
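
As an illustrative shortcut (not the published procedure, which uses permutation tests), the selection step can be sketched with ordinary t statistics:

```python
import numpy as np

def select_pcnm(y, P, tcrit=2.0):
    """Regress y on the centred, orthogonal PCNM variables in P (one
    column per principal coordinate) and keep the columns whose
    coefficient exceeds a rough |t| cut-off. Orthogonality lets each
    slope be estimated independently of the others."""
    n, k = P.shape
    yc = y - y.mean()
    ss = (P**2).sum(axis=0)
    b = P.T @ yc / ss                    # one slope per PCNM variable
    resid = yc - P @ b
    s2 = resid @ resid / (n - k - 1)     # residual variance, full model
    t = b / np.sqrt(s2 / ss)
    return np.flatnonzero(np.abs(t) > tcrit)
```

Because the retained variables stay orthogonal, their coefficients need not be re-estimated in the reduced model.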

Analytical results - The analysis of the (detrended) artificial data yielded a complete model explaining 75.3% of the variance when using the 67 explanatory variables. Reducing the model as described

4 Anderson, M. J. & P. Legendre. 1999. An empirical comparison of permutation methods for tests of partial regression coefficients in a linear model. J. Statist. Comput. Simul. 62: 271-303.

above made it possible to retain 8 spatial variables at p = 0.05, together explaining 43.3% of the variance. This value compares well with the 47% of the variance representing the contributions of the single bump, the two variables with 4 and 17 waves, and the random autocorrelated component of the detrended data. The PCNM variables retained were principal coordinates nos. 2, 6, 8, 14, 28, 33, 35 and 41.

Additive submodels - It often happens that the significant variables are grouped into series of roughly similar periods. In these data, for instance, there is a clear gap between the first four significant PCNM variables and the last four. Thus, a first step may be to draw two submodels, one involving variables 2, 6, 8 and 14 (added together, using their regression coefficients in the minimum model as weights) and the other involving variables 28, 33, 35 and 41. The results are shown in Figures 50a and 50d respectively. The “broad-scale” submodel (Fig. 50a) shows four major bumps, the two central ones being much higher than the two lateral ones. This may indicate that two mechanisms are actually confounded, one producing the four bumps and another process elevating the two central ones. Subdividing this submodel further, by separating variable 2 from variables 6, 8 and 14, indeed made it possible to distinguish a central bump (Fig. 50b) and 4 waves (Fig. 50c). The fine-scale submodel (Fig. 50d) shows 17 waves, with hints of a 4-bump pattern. The spatial model made of the 8 variables is shown in Figure 50e.
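
The construction of an additive submodel can be sketched in one line (illustrative code):

```python
import numpy as np

def submodel(P, b, idx):
    """Additive submodel: weighted sum of the selected PCNM variables
    (columns idx of P), using their regression coefficients b as weights."""
    idx = np.asarray(idx)
    return P[:, idx] @ b[idx]
```

For the broad-scale submodel above, idx would be the positions of PCNM variables 2, 6, 8 and 14 among the retained coordinates.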

The method has successfully revealed the four deterministic components that were built into the data: trend, single central bump, 4 waves and 17 waves, despite the large amount of noise added. The amount of variance explained by the model suggests that most of the spatially structured information present in the random autocorrelated component of the data is also contained in the model (in accordance with the simulation results), but that it could not be separated from the periodic signals because it was “diluted” over several scales.

Figure 50 - Minimum spatial model and its additive submodels obtained by PCNM analysis of the (detrended) artificial data shown in Figure 49.

The successful extraction of the structured information can be further illustrated by comparing (Figure 51):

- the model of the detrended data obtained above (reproduced in Fig. 51b) to the sum of the four components “central bump”, “4 waves”, “17 waves” and “random autocorrelated” (Fig. 51a), and

- the residuals of the spatial model (Fig. 51d) to the real noise built into the data, i.e., the uncorrelated random variate (Fig. 51c).

Figure 51 - Comparison of the structured (a) and random (c) components of the data on the one hand, and of the spatial model (b) and its residuals (d) on the other hand, with correlations between the homologous components.

Ecological interpretation of a PCNM analysis

In the univariate case (as above), the simplest way of interpreting the results of a PCNM analysis is to regress the fitted values of the PCNM model on the environmental variables available in the study. This ensures that only the spatialized fraction of variation of the response variable is interpreted, but it has the inconvenience that all spatial scales are confounded in the model. To unravel the scales at which ecological processes take place, it is generally more fruitful to decompose the PCNM model into submodels as above, and to regress the fitted values of these submodels separately on the environmental variables. Each submodel is likely to be explainable by a different subset of environmental variables and, since the submodels are orthogonal to one another, the results will reveal scale-dependent processes that act independently on the response variable. Examples can be found in Borcard et al. (2004).

Setup and interpretation of a PCNM analysis in the multivariate case

If the research involves a matrix of response variables Y (e.g. a matrix of species abundances), the PCNM analysis can be run on the basis of canonical ordination instead of multiple regression. A subset of significant PCNM base functions can still be selected (for instance by forward selection). If RDA is used, one obtains an R² (called in this case a bimultivariate redundancy statistic) that can be adjusted for the number of objects and explanatory variables in the same way as an ordinary R². After this selection, several paths can be followed to further interpret the results:

Path 1: the RDA is run normally, and the fitted site scores of the most important canonical axes are regressed on the environmental variables as above. This path produces one orthogonal model of spatially structured data for each canonical axis, but since all PCNM base functions are involved in each axis, the spatial scales are confounded.

Path 2: this path consists in grouping the significant PCNM base functions by scale (as in the artificial example above), and running a separate RDA for each group of PCNM base functions. Each RDA will yield a series of canonical axes that are spatially structured at a scale defined by the subset of PCNM variables used in the analysis. The most important axes of each RDA can be explained by regressing them on the environmental variables.

Path 3: a more complex, but potentially very powerful, approach is to combine PCNM analysis with variation partitioning. For instance, one could proceed as follows:

- forward-select the significant PCNM base functions;

- group the significant PCNM variables into k subgroups of different spatial scales (for instance k = 3);

- forward-select the environmental variables;

- run a variation partitioning using the k subgroups of PCNM variables as well as the significant environmental variables (taken as one separate group of explanatory variables).

This path will yield a detailed assessment of the amount of spatial and nonspatial variation explained, with or without the environmental variables, at all scales.

Further remarks and summary notes on PCNM base functions

PCNM variables represent a spectral decomposition of the spatial relationships among the study sites.

They can be computed for regular or irregular sets of points in space or time.

PCNM base functions are orthogonal. If the sampling design is regular, they look like sine waves. This is a property of the eigen-decomposition of the centred form of a distance matrix (Laplacian). If the sampling design is irregular, the PCNM base functions have irregular shapes as well, but they can still be roughly ordered from broad-scale to fine-scale.

The grouping of PCNM variables into submodels of various scales implies arbitrary decisions about the building of the groups.

PCNM base functions can also be computed for circular sampling designs. An example can be found in Brind'Amour et al. (2005).

PCNM analysis can be used for temporal analysis, as well as spatio-temporal analysis. Research is presently underway to allow the analysis of spatio-temporal designs without spatial replication while still testing the interaction.

The concept of PCNM has recently been generalized to that of Distance-Based Eigenvector Maps (DBEM); other ways of computing such vectors are now available (Dray et al., submitted)5.

5 Dray, S., P. Legendre & P. Peres-Neto. Spatial modelling: a comprehensive framework for principal coordinate analysis of neighbour matrices (PCNM). Ecological Modelling (submitted).