TRANSCRIPT
Data analysis and Geostatistics - lecture XI
Clustering and spatial analysis of data
Cluster analysis
Cluster analysis requires substantial user input (selection of the number of clusters, clustering routine, similarity criteria, etc.)
and the results can therefore be ambiguous:
always give detailed information on how your cluster analysis was performed
Group samples into clusters based on similarity
Cluster analysis
[Figure: three clusters with means x̄A, x̄B and x̄C; for each sample the squared deviations (xi − x̄A)², (xi − x̄B)² and (xi − x̄C)² are compared]
whichever deviation between sample and cluster mean is smallest: the sample is assigned to that cluster
Cluster analysis is again controlled by the sum of squares: SSwithin for each cluster, and SSbetween for each pair of clusters (A-B, B-C, A-C)
small variance within: tight clusters; large variance between: good separation
increasing the number of clusters will decrease the within variance, until all samples are their own cluster. That result is, however, meaningless...
Cluster analysis - sample assignment criteria
A wide range of techniques can be used to determine similarity - see the book for details:
‣ Euclidean distance - r or r²
‣ city block or Manhattan distance - useful when the two variables are separate characteristics (fossil length and width; the diagonal is not of interest)
‣ correlation similarity - samples with the same correlation are grouped together: deals with dilution effects
‣ association values - especially useful when you only have presence/absence data - specialized
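These similarity measures are simple to compute directly. A minimal sketch in Python (the function names and fossil numbers are my own, for illustration); the last example shows why correlation similarity deals with dilution:

```python
import math

def euclidean(a, b):
    # straight-line distance: square root of the sum of squared differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # city-block distance: sum of absolute differences, useful when the
    # variables are separate characteristics and the diagonal has no meaning
    return sum(abs(x - y) for x, y in zip(a, b))

def correlation_similarity(a, b):
    # Pearson correlation between the two samples' variable patterns
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = math.sqrt(sum((x - ma) ** 2 for x in a))
    vb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (va * vb)

fossil_a = (4.0, 2.0)   # length, width
fossil_b = (1.0, 6.0)
print(euclidean(fossil_a, fossil_b))   # 5.0
print(manhattan(fossil_a, fossil_b))   # 7.0
# a 50% "diluted" copy of sample a still correlates perfectly with it
diluted = tuple(0.5 * x for x in fossil_a)
print(round(correlation_similarity(fossil_a, diluted), 3))
```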
Cluster analysis - two types
Two varieties of clustering: hierarchical and partitioning methods
hierarchical techniques: represent similarity in a tree or dendrogram
the method:
1. all samples are a separate cluster
2. link the two most similar samples
3. link two other samples to form a new cluster, or add a third sample to the first cluster, depending on similarities
4. continue until only one cluster remains
in this technique all intermediate steps and cluster associations are immediately available - it is up to the user to select an appropriate “pruning” level in the tree
there are many ways to link samples and these do result in different trees (see book for details)
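The four steps can be sketched as a naive single-linkage (nearest-neighbour) routine. This is a toy illustration in Python, not the optimized algorithm a statistics package would use:

```python
def hierarchical(points, dist):
    # step 1: every sample starts as its own cluster
    clusters = [{i} for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # steps 2-3: find the pair of clusters whose closest members are
        # most similar (nearest-neighbour / single linkage)
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        # step 4: merge and repeat until only one cluster remains
        clusters[a] |= clusters[b]
        del clusters[b]
    return merges

# five measurements: two tight pairs and one outlier
merges = hierarchical([0.0, 0.1, 1.0, 1.2, 5.0], lambda x, y: abs(x - y))
for left, right, d in merges:
    print(left, "+", right, "at dissimilarity", round(d, 2))
```

The list of merges is exactly the information a dendrogram draws: each line is one node of the tree, at the dissimilarity where the link is made.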
Hierarchical cluster analysis
An example of hierarchical clustering: the composition of a number of lava samples from Kawah Ijen volcano
[Figure: dendrogram of samples KV01-KV43 plotted against degree of dissimilarity; dissimilarity based on the nearest-neighbour criterion; the groups correspond to basalt, andesite and dacite, with duplicates linking first]
the resulting tree can be “pruned” at any level: up to the user to select (in this example giving 11, 9, 4 or 2 clusters)
should test if the difference between groups is significant (which test?)
Clustering - partitioning techniques
partitioning techniques: assign samples to a known number of clusters based upon similarity criteria
the method:
1. samples are assigned to the cluster they are most similar to in multi-dimensional space
2. each assignment results in a shift in the characteristics of the cluster centre (means + variance, or only variance)
3. samples are re-assigned where necessary and this routine is iterated until the system stabilizes
There are two main approaches: clustering with specified cluster means (i.e. known groups) and clustering where the means are obtained during clustering. Both have their pros and cons:
Partitioning techniques
specified/fixed cluster means
  advantages:
  ‣ you always get the same answer during classification
  ‣ groups can relate to real dividing phenomena
  ‣ unknowns are (generally) easily classified
  disadvantages:
  ‣ boundaries commonly based on consensus (artificial)
  ‣ 2 samples close together can be in different clusters
  ‣ 2 very different samples can be in the same cluster
assigned/sought cluster means
  advantages:
  ‣ data groups not split up over different clusters
  ‣ boundaries always in regions of low data density
  ‣ easy to apply to data sets with many variables
  disadvantages:
  ‣ instability issues: more data will result in a shift in cluster means and sample assignment
  ‣ no fixed boundaries, so unsuitable for classification schemes
Clustering with hard boundaries
[Figure: fixed cluster boundaries - 2 samples close together can be in different clusters, and 2 very different samples can be in the same cluster]
Cluster means assigned during clustering:
[Figure: scatter of samples with the centres of clusters A, B and C marked]
when cluster means are specified: use minimum distance to mean to assign
if not: randomly assign each sample to a cluster and iterate to a stable solution
both cluster means and cluster assignment change during the iteration
the process stops when samples no longer change their assignment
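This iteration can be sketched for one-dimensional data as follows (illustrative only: real routines work in multi-dimensional space and handle empty clusters more carefully; the sample values are made up):

```python
import random

def kmeans(samples, k, seed=0):
    rng = random.Random(seed)
    # means are sought during clustering: start from a random assignment
    assign = [rng.randrange(k) for _ in samples]
    while True:
        # each (re)assignment shifts the cluster means
        means = []
        for c in range(k):
            members = [s for s, a in zip(samples, assign) if a == c]
            # an empty cluster is re-seeded on a randomly chosen sample
            means.append(sum(members) / len(members) if members
                         else rng.choice(samples))
        # re-assign every sample to the nearest cluster mean
        new = [min(range(k), key=lambda c: (s - means[c]) ** 2)
               for s in samples]
        if new == assign:        # stable: no sample changed its assignment
            return assign, means
        assign = new

samples = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]
assign, means = kmeans(samples, 2)
print(assign, [round(m, 1) for m in means])
```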
Cluster analysis - method of assignment
Samples are normally assigned to a cluster in a “hard” way: samples are unambiguously attributed to a specific cluster - 0 or 1 assignment
However, mother nature is rarely so black and white... a “middle age” cluster, for example, depends very much on person/country/continent
[Figure: hard assignment - membership jumps from 0 to 1 (if age is between A and B: middle age) - versus the fuzzy approach, where samples have cluster memberships between 0 and 1 and young, middle age and old overlap]
Fuzzy clustering
fuzzy clustering has a number of distinct benefits:
can deal with intermediate cases - they are not force-assigned
samples can share multiple clusters - extra information: (0.7 young + 0.3 middle age versus 0.5 young + 0.5 middle age)
ensures that single samples do not overly control individual clusters
can have a separate outlier assignment
most flexible and powerful: fuzzy clustering with seeking of cluster means
[Figure: hard clustering - strained assignment due to an outlier and an intermediate value; fuzzy clustering - the outlier is not a problem and the intermediate value is shown]
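For fixed cluster centres, fuzzy memberships can be sketched with the standard fuzzy c-means membership formula (the age centres below are made-up numbers, purely to echo the young/middle/old example):

```python
def fuzzy_memberships(sample, centres, m=2.0):
    # fuzzy c-means membership of one sample in each cluster:
    # u_i = (1/d_i)^(2/(m-1)) / sum_k (1/d_k)^(2/(m-1)); memberships sum to 1
    d = [abs(sample - c) for c in centres]
    if 0.0 in d:                       # sample sits exactly on a centre
        return [1.0 if di == 0.0 else 0.0 for di in d]
    inv = [(1.0 / di) ** (2.0 / (m - 1.0)) for di in d]
    total = sum(inv)
    return [v / total for v in inv]

centres = [20.0, 45.0, 70.0]           # hypothetical "young", "middle", "old"
for age in (20.0, 32.5, 44.0):
    u = fuzzy_memberships(age, centres)
    print(age, [round(v, 2) for v in u])
```

An age halfway between two centres gets an even share of both clusters instead of being force-assigned to one, which is exactly the extra information hard assignment throws away.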
Clustering in NCSS - the eating habits of Europe
can we distinguish the Europeans by their eating habits? the data (missing value = -999):
variables: Real coffee, Nescafe, Tea, Sweetener, Biscuits, Pack. soup, Tinned soup, Frozen fish, Frozen veg., Apples, Tinned fruit, Jam, Garlic, Butter, Margarine, Olive oil, Yoghurt
lots of options available:
use parametric and non-parametric data and even mix these (length + color)
variety of linkage types: nearest neighbour, furthest neighbour, Ward's method
distance: Euclidean or Manhattan city block
see the NCSS hierarchical clustering tutorial for more information
hierarchical clustering of this data set: clear clustering
Clustering in NCSS - the eating habits of Europe
hard and fuzzy clustering of this data set:

country        K-means (hard)  fuzzy-prob 1  fuzzy-prob 2  fuzzy-prob 3  fuzzy-prob 4
Germany        2               0.04          0.02          0.83          0.11
Italy          3               0.01          0.93          0.04          0.02
France         2               0.05          0.12          0.77          0.06
Netherlands    2               0.23          0.07          0.53          0.16
Belgium        2               0.08          0.25          0.56          0.11
Luxembourg     2               0.09          0.06          0.75          0.10
Britain        4               0.92          0.01          0.04          0.03
Portugal       3               0.02          0.92          0.04          0.03
Austria        3               0.03          0.87          0.06          0.05
Switzerland    2               0.05          0.05          0.86          0.05
Sweden         1               0.05          0.04          0.08          0.82
Denmark        1               0.03          0.02          0.07          0.88
Norway         1               0.03          0.07          0.10          0.80
Finland        1               0.06          0.22          0.16          0.57
Spain          3               0.02          0.83          0.11          0.04
Ireland        4               0.88          0.04          0.06          0.03
Clustering - number of clusters
the main difficulty in cluster analysis is choosing the number of clusters
NCSS and other clustering packages will calculate assignments for a range of cluster numbers
the residual variance will decrease with every additional cluster, so this is not a good indicator of the optimal number of clusters
instead:
choose the number of clusters where the variance no longer strongly decreases
use the averaged silhouette value: a comparison between a value's dissimilarity with its own cluster and its dissimilarity with its nearest neighbouring cluster: ranges from 1 to -1: > 0.75: good model; < 0.25: poor model
use the fuzziness of the model (0: completely fuzzy, to 1: hard) via the Fc(U) and Dc(U) parameters: max Fc(U) + min Dc(U) = best model
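The averaged silhouette value can be sketched for one-dimensional data as follows (an illustration of the definition, not any package's implementation; the sample values are invented):

```python
def silhouette(samples, labels):
    # averaged silhouette: s_i = (b_i - a_i) / max(a_i, b_i), where a_i is the
    # mean dissimilarity to the sample's own cluster and b_i the mean
    # dissimilarity to the nearest other cluster
    scores = []
    for i, (s, l) in enumerate(zip(samples, labels)):
        own = [abs(s - t) for j, (t, m) in enumerate(zip(samples, labels))
               if m == l and j != i]
        if not own:
            continue                       # singleton clusters carry no score
        a = sum(own) / len(own)
        b = min(sum(abs(s - t) for t, m in zip(samples, labels) if m == c)
                / labels.count(c)
                for c in set(labels) if c != l)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

samples = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]
print(round(silhouette(samples, [0, 0, 0, 1, 1, 1]), 2))   # 0.95: good model
print(round(silhouette(samples, [0, 1, 0, 1, 0, 1]), 2))   # -0.05: poor model
```

Tight, well-separated clusters score near 1; scrambling the labels drops the average below the 0.25 poor-model threshold.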
Plotting clusters on maps - Massif Central dataset
We will look at an example from the Massif Central in France: a dataset of the chemical composition of stream sediments, collected in an area with a diverse geology and old, now abandoned, mining for Sb, As, Pb, Au, Ba & F
Geology consists of:
[Map: study area along the Cronce and Desges rivers, with localities including Langeac, Lavoute-Chilhac, St Cirgues, Chilhac, Reilhac, Marsanges, Pebrac, Chazelles, Desges, Barlet, Charraix, Chanteuges, Ally, Védrines St Loup, Pinols, Lestival, Chastel and Prades; legend: felsic gneisses, mafic gneisses and schists, (meta-)granite, sediment (incl. coal), “recent” volcanics; scale bar 5 km]
The dataset is best described when split up into six clusters
Clustering - groups in Massif Central dataset
clear link to the bedrock geology, but not one-to-one
Clustering - properties per cluster
when the data have been clustered we can look at the characteristics of each cluster (mean + st. dev.) and the correlations within it
[Figure: per-cluster plots involving log(Li), V, SiO2, K2O, MgO and Li]
Clustering - groups in Massif Central dataset
[Figure: maps of clusters 1, 2, 3 and 4 plotted separately]
cluster separation isn’t perfect
only cluster 4 is distinct in Li: multi-element separation is needed
Can plot clusters individually to look at spatial distribution and contents
Plotting data on maps: bubble plots
Data are plotted at their spatial coordinates with a symbol whose size represents the value of the data point
Can apply exactly the same tools as used on the element map:
adjust contrast, isolate features and perform data transformations
can also overlay these bubbles on another layer, such as a topo map, geol map, stream map etc
Plotting data on maps: bubble plots
Stream sediments as a reflection of the local geology: Beryllium
Be concentrations without processing:
sometimes it just works!
Plotting data on maps: bubble plots
Silver concentrations: working with a non-normal distribution
[Figure: three bubble maps of Ag - linear scale, square-root scale, optimized]
Plotting data on maps: bubble plots
Don't have to plot all the data in the dataset: applying a cut-off at low values will highlight interesting samples, whereas a high cut-off removes outliers
Zn, only data with > 50 ppm
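The scaling and cut-off logic behind such bubble plots can be sketched as follows (the symbol sizes and Zn values are illustrative; a real map would hand these sizes to a plotting package):

```python
import math

def bubble_sizes(values, scale="sqrt", cutoff=None, max_size=30.0):
    # map data values to plotting-symbol sizes; None = point not plotted
    kept = [v for v in values if cutoff is None or v > cutoff]
    if not kept:
        return [None] * len(values)
    top = max(kept)
    sizes = []
    for v in values:
        if cutoff is not None and v <= cutoff:
            sizes.append(None)                  # below cut-off: sample dropped
        elif scale == "sqrt":
            # square-root scaling tames a skewed (non-normal) distribution
            sizes.append(max_size * math.sqrt(v / top))
        else:
            sizes.append(max_size * v / top)    # linear scale
    return sizes

zn = [10.0, 55.0, 120.0, 480.0]                 # hypothetical Zn values, ppm
print(bubble_sizes(zn, scale="linear"))
print(bubble_sizes(zn, scale="sqrt", cutoff=50.0))
```

On the linear scale one high value dwarfs everything else; the square-root scale keeps the smaller bubbles visible, and the 50 ppm cut-off drops the uninteresting sample entirely.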
Plotting data on maps: bubble plots
Looking for element associations by combining bubble plots
[Figure: combined bubble plots of Cd, Zn, Sb and W]
Plotting data on maps
Combining elements by using multi-coloured bubble plots is useful, but it quickly becomes confusing: it can lead you to miss interesting samples
Can also calculate such associations beforehand and plot them directly:
• Sb + Zn
• Sb / Zn
Or you can apply logical rules to the data before plotting:
• plot Sb if S > 200 ppm
• if SiO2 > 60 wt% then plot K / Zr
Note that such properties are calculated much more easily and quickly in programs designed for such calculations, e.g. Excel or Quattro Pro
Not limited to plotting data, but can also plot derived properties such as the mean, median, standard deviation, etc
and not just values, but also other observations: geological code / vegetation / mode in a multi-modal distribution
Plotting data on maps
Plotting data on maps: bubble plots
Plotting processed data - standard deviation: the variability at a sample site
Plotting data on maps: artefacts in the gold map
Spatial data visualization
To be able to calculate contours and surfaces: interpolation
we need to know the concentration at any point in the sampling space to be able to draw smooth contours: interpolate between values
[Figure: interpolation of As content on a grid, comparing the nearest-neighbour approach with the radius technique using 1/r and 1/r² weighting]
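These radius techniques are forms of inverse-distance weighting, which can be sketched as follows (the coordinates and As values are made up):

```python
def idw(x0, y0, samples, power=2, radius=None):
    # inverse-distance weighting: each sample contributes with weight
    # 1/r^power, so power=1 is the 1/r radius technique, power=2 the 1/r^2 one
    num = den = 0.0
    for x, y, z in samples:
        r = ((x - x0) ** 2 + (y - y0) ** 2) ** 0.5
        if radius is not None and r > radius:
            continue                   # sample outside the interpolation radius
        if r == 0.0:
            return z                   # grid node coincides with a sample
        w = 1.0 / r ** power
        num += w * z
        den += w
    return num / den if den else None  # None: nothing within the radius

# hypothetical As samples (x, y, ppm) around a grid node at the origin
samples = [(1.0, 0.0, 10.0), (0.0, 2.0, 20.0), (-3.0, 0.0, 40.0)]
print(round(idw(0.0, 0.0, samples, power=2), 2))              # 14.29
print(round(idw(0.0, 0.0, samples, power=2, radius=2.5), 2))  # 12.0
```

Capping the radius excludes the distant high sample and pulls the estimate down, which is exactly the "maximum radius" question addressed below.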
Spatial data visualization
Results of different interpolation techniques:
[Figure: four interpolated As maps, one per technique]
Spatial data visualization
[Figure: interpolation of As content on a grid - nearest neighbour, radius technique 1/r, radius technique 1/r²]
main issue: which samples should be included in the interpolation: what should the maximum radius be?
Interpolation radius
Spatial data have a very useful property: adjacent samples should be most similar, whereas samples that are far apart can be distinctly different, or:
the variance for a small interpolation radius is small, as the variance between adjacent samples is small
the variance increases as the interpolation radius increases (i.e. as samples further away from the point of interest are included)
at some radius the variance will no longer increase as we have reached the overall variance, which is called the “regional variance”
including values beyond the regional variance radius is pointless as such samples do not contain any information on the value at the point of interest
Interpolation radius
Interpolation radius in a sedimentary core: [Figure: concentration versus depth in the core]
adjacent samples are most similar: as the interpolation radius increases, so does the variance
when you enter another unit the variance increases significantly: such samples should not be included in your interpolation
Interpolation radius
[Figure: variance and semivariance plotted against radius]
Semivariance and semivariograms
This concept is semivariance and is shown in a semivariogram
semivariance: the variance between samples a specified interval or distance apart

γ(h) = Σ (zᵢ − zᵢ₊ₕ)² / (2 (n − h))

with: γ = semivariance for interval h, n = total number of samples, zᵢ = value at position i
as the interval increases, the semivariance will approach the total variance of the data set, so it is a spatially controlled partial variance of the data
as h increases, the relatedness of the samples decreases and the variance will therefore increase:
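The formula translates directly into code for equally spaced samples along a traverse. A toy series with a perfect linear trend shows the semivariance rising continuously with h:

```python
def semivariance(z, h):
    # gamma(h) = sum of (z_i - z_{i+h})^2 over all pairs h apart,
    # divided by 2 (n - h), for n equally spaced samples
    n = len(z)
    return sum((z[i] - z[i + h]) ** 2 for i in range(n - h)) / (2 * (n - h))

z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]               # a perfect linear trend
print([semivariance(z, h) for h in (1, 2, 3)])   # [0.5, 2.0, 4.5]
```

Because the trend never levels off, neither does the semivariance; this is the "continuous variation with distance" case below, and the reason trends must be removed before kriging.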
Semivariance and semivariograms
plotting the semivariance against h: semivariogram
[Figure: three pairs of plots (concentration versus distance, semivariance versus interval) illustrating: no relation with distance - random; gradual changes in concentration; continuous variation with distance - a trend]
Semivariance and semivariograms
[Figure: two semivariograms of semivariance against interval, one levelling off at a sill, one showing drift]
properties of a semivariogram: the range, the sill, and drift
the range is the interval within which there is similarity between the samples
Semivariance and semivariograms
Semivariograms provide our maximum radius criterion: only samples that fall within the range are included in interpolation
before we continue, a few notes:
‣ most semivariograms have an apparent cut-off at zero distance with a semivariance ≠ 0. This is called the nugget effect and is caused by sample heterogeneity (= field duplicate variance)
‣ semivariograms have to be determined for each variable, as each has its own range: interpolation has to be performed separately as well
‣ semivariograms are generally different for different spatial directions (N, SW, etc.). Such anisotropy can point to an underlying geological phenomenon, such as layering or a fault control on concentration. This can be corrected for either manually, by stretching the coordinate system perpendicular to the main axis, or automatically by kriging software
Nugget effect in semivariograms
[Figure: semivariogram with the nugget, range and sill marked]
There is always some uncertainty at a given sample site, which you could quantify by taking field duplicates. This sample site variance is the “nugget” in a semivariogram (in essence the variance at zero distance)
Every element will have such a nugget, but the effect is strongest for elements that are heterogeneously distributed, such as gold present as nuggets in a sediment, because we use the mean + variance
Using semivariogram information: kriging
The interpolation technique that employs the range information obtained from semivariograms is called kriging
in kriging, only samples that are within the range are used to determine the value at a given intermediate position, and the weighting for each sample is derived from its associated semivariance
A(xᵢ,yᵢ) = wt₁ · A(x₁,y₁) + wt₂ · A(x₂,y₂) + wt₃ · A(x₃,y₃) + ...
as an added bonus this also gives us the variance associated with each interpolated value (the uncertainty), so we can immediately see where our interpolations are reliable and where they are not
because weights are based on the semivariance, obvious trends in the data should be removed, as a trend leads to a continuous rise in the semivariance: this can be done by first subtracting a trend surface
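A minimal ordinary-kriging sketch in one dimension, using a spherical semivariogram model (all positions, concentrations and model parameters are made up; production work would use dedicated kriging software). The weights come from solving a small linear system, and the same solution yields the kriging variance:

```python
import numpy as np

def spherical(h, nugget, sill, rng):
    # spherical semivariogram model: 0 at h = 0, rising to the sill at the range
    h = np.asarray(h, dtype=float)
    g = nugget + (sill - nugget) * (1.5 * h / rng - 0.5 * (h / rng) ** 3)
    return np.where(h >= rng, sill, np.where(h == 0.0, 0.0, g))

def ordinary_kriging(x0, xs, zs, model):
    # ordinary-kriging system: sample-to-sample semivariances on the left,
    # sample-to-target semivariances on the right, plus the constraint that
    # the weights sum to 1 (enforced via the Lagrange multiplier mu)
    n = len(xs)
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = model(np.abs(xs[:, None] - xs[None, :]))
    A[n, n] = 0.0
    b = np.ones(n + 1)
    b[:n] = model(np.abs(xs - x0))
    sol = np.linalg.solve(A, b)
    w, mu = sol[:n], sol[n]
    estimate = float(w @ zs)
    variance = float(w @ b[:n] + mu)   # kriging variance: the uncertainty
    return estimate, variance

xs = np.array([0.0, 1.0, 3.0])         # 1-D sample positions
zs = np.array([10.0, 12.0, 30.0])      # made-up concentrations
model = lambda h: spherical(h, nugget=0.1, sill=4.0, rng=5.0)
est, var = ordinary_kriging(2.0, xs, zs, model)
print(round(est, 2), "+/-", round(var ** 0.5, 2))
```

At a sample position the estimate reproduces the data exactly with zero kriging variance; between samples the variance is positive, which is the per-point uncertainty mentioned above.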
Estimate of uncertainty for each interpolated value
source: wikipedia.org
Uncertainty in block kriging of grades
Kriging is commonly applied to estimate the grade of blocks in open pit mining, using a sample grid or the grade of adjacent blocks (or both). In such cases it is invaluable to know the uncertainty on the grade estimate
Flavours of kriging
There are many flavours of kriging and discussing them all would be a course in its own right. A few terms that you will come across commonly:
Simple/Ordinary kriging: no trend in the data, so there is a constant mean in the dataset and the variance is calculated as the difference from this mean. This mean is either known (Simple) or calculated from the data (Ordinary)
Universal kriging: there is a spatial trend in the data, so the mean varies with the spatial coordinates. Instead of using universal kriging, you can also remove the trend in pre-processing of the data
Indicator kriging: rather than estimating a numerical value at a given point, you estimate whether it is higher or lower than a set value, and the probability of this
Co-kriging: a second variable that is correlated with the first is included in the kriging. This should improve estimates of the first and main variable
Good kriging resource: Clark & Harper (2000) Practical Geostatistics ISBN 0970331703, or you can download the 1979 original at http://www.kriging.com/pg1979_download.html
Back to our example
Results of different interpolation techniques:
[Figure: four interpolated maps, one per technique]
And now using kriging as the interpolation method
Results of kriging on this data set:
[Figure: three kriged maps of the same data]
Kriging and trends
The effect of a strong spatial trend in the data
Some data are not suited to interpolation/kriging
There is a strong tendency to directly start with the most complex or fancy technique, such as kriging. However, kriging is not always appropriate !
[Figure: raw concentrations plotted versus the optimized kriging map]
Kriging and sample coverage
Kriging works best when you have a high sample density and a more or less uniform distribution of data over the sample area. If not ➛ you get artefacts
Areas without samples need to be blanked out, not just removed afterwards