TRANSCRIPT
Data analysis and Geostatistics - lecture XI
Clustering and spatial analysis of data
Cluster analysis
Cluster analysis requires substantial user input (selection of the number of clusters, clustering routine, similarity criteria, etc.)
and the results can therefore be ambiguous:
always give detailed information on how your cluster analysis was performed
Group samples into clusters based on similarity
Cluster analysis
[Figure: three clusters with means x̄A, x̄B and x̄C; for each sample the squared deviations (xi − x̄A)², (xi − x̄B)² and (xi − x̄C)² are compared]
whichever deviation between sample and cluster mean is smallest: the sample is assigned to that cluster
Cluster analysis is again controlled by the sum of squares: SSwithin for each cluster, and SSbetween for each pair of clusters (A-B, B-C, A-C)
small variance within: tight clusters; large variance between: good separation
increasing the number of clusters will decrease the within variance, until all samples are their own cluster. That result is, however, meaningless...
Cluster analysis - sample assignment criteria
A wide range of techniques can be used to determine similarity - see the book for details:
‣ Euclidean distance - r or r²
‣ city block or Manhattan distance - useful when the two variables are separate characteristics (fossil length and width; the diagonal is not of interest)
‣ correlation similarity - samples with the same correlation are grouped together: deals with dilution effects
‣ association values - especially useful when you only have presence/absence data - specialized
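These similarity measures are simple to compute directly. A minimal sketch in Python (the function names and fossil numbers are my own, for illustration); the last example shows why correlation similarity deals with dilution:

```python
import math

def euclidean(a, b):
    # straight-line distance: square root of the sum of squared differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # city-block distance: sum of absolute differences, useful when the
    # variables are separate characteristics and the diagonal has no meaning
    return sum(abs(x - y) for x, y in zip(a, b))

def correlation_similarity(a, b):
    # Pearson correlation between the two samples' variable patterns
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = math.sqrt(sum((x - ma) ** 2 for x in a))
    vb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (va * vb)

fossil_a = (4.0, 2.0)   # length, width
fossil_b = (1.0, 6.0)
print(euclidean(fossil_a, fossil_b))   # 5.0
print(manhattan(fossil_a, fossil_b))   # 7.0
# a 50% "diluted" copy of sample a still correlates perfectly with it
diluted = tuple(0.5 * x for x in fossil_a)
print(round(correlation_similarity(fossil_a, diluted), 3))
```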
Cluster analysis - two types
Two varieties of clustering: hierarchical and partitioning methods
hierarchical techniques: represent similarity in a tree or dendrogram
the method:
1. all samples are a separate cluster
2. link the two most similar samples
3. link two other samples to form a new cluster, or add a third sample to the first cluster, depending on similarities
4. continue until only one cluster remains
in this technique all intermediate steps and cluster associations are immediately available - it is up to the user to select an appropriate “pruning” level in the tree
there are many ways to link samples and these do result in different trees (see book for details)
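The four steps can be sketched as a naive single-linkage (nearest-neighbour) routine. This is a toy illustration in Python, not the optimized algorithm a statistics package would use:

```python
def hierarchical(points, dist):
    # step 1: every sample starts as its own cluster
    clusters = [{i} for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # steps 2-3: find the pair of clusters whose closest members are
        # most similar (nearest-neighbour / single linkage)
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        # step 4: merge and repeat until only one cluster remains
        clusters[a] |= clusters[b]
        del clusters[b]
    return merges

# five measurements: two tight pairs and one outlier
merges = hierarchical([0.0, 0.1, 1.0, 1.2, 5.0], lambda x, y: abs(x - y))
for left, right, d in merges:
    print(left, "+", right, "at dissimilarity", round(d, 2))
```

The list of merges is exactly the information a dendrogram draws: each line is one node of the tree, at the dissimilarity where the link is made.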
Hierarchical cluster analysis
An example of hierarchical clustering: the composition of a number of lava samples from Kawah Ijen volcano
[Figure: dendrogram of samples KV01-KV43 plotted against degree of dissimilarity; dissimilarity based on the nearest-neighbour criterion; the groups correspond to basalt, andesite and dacite, with duplicates linking first]
the resulting tree can be “pruned” at any level: up to the user to select (in this example giving 11, 9, 4 or 2 clusters)
should test if the difference between groups is significant (which test?)
Clustering - partitioning techniques
partitioning techniques: assign samples to a known number of clusters based upon similarity criteria
the method:
1. samples are assigned to the cluster they are most similar to in multi-dimensional space
2. each assignment results in a shift in the characteristics of the cluster centre (means + variance, or only variance)
3. samples are re-assigned where necessary and this routine is iterated until the system stabilizes
There are two main approaches: clustering with specified cluster means (i.e. known groups) and clustering where the means are obtained during clustering. Both have their pros and cons:
Partitioning techniques
specified/fixed cluster means
  advantages:
  ‣ you always get the same answer during classification
  ‣ groups can relate to real dividing phenomena
  ‣ unknowns are (generally) easily classified
  disadvantages:
  ‣ boundaries commonly based on consensus (artificial)
  ‣ 2 samples close together can be in different clusters
  ‣ 2 very different samples can be in the same cluster
assigned/sought cluster means
  advantages:
  ‣ data groups not split up over different clusters
  ‣ boundaries always in regions of low data density
  ‣ easy to apply to data sets with many variables
  disadvantages:
  ‣ instability issues: more data will result in a shift in cluster means and sample assignment
  ‣ no fixed boundaries, so unsuitable for classification schemes
Clustering with hard boundaries
[Figure: fixed cluster boundaries - 2 samples close together can be in different clusters, and 2 very different samples can be in the same cluster]
Cluster means assigned during clustering:
[Figure: scatter of samples with the centres of clusters A, B and C marked]
when cluster means are specified: use minimum distance to mean to assign
if not: randomly assign each sample to a cluster and iterate to a stable solution
both cluster means and cluster assignment change during the iteration
the process stops when samples no longer change their assignment
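This iteration can be sketched for one-dimensional data as follows (illustrative only: real routines work in multi-dimensional space and handle empty clusters more carefully; the sample values are made up):

```python
import random

def kmeans(samples, k, seed=0):
    rng = random.Random(seed)
    # means are sought during clustering: start from a random assignment
    assign = [rng.randrange(k) for _ in samples]
    while True:
        # each (re)assignment shifts the cluster means
        means = []
        for c in range(k):
            members = [s for s, a in zip(samples, assign) if a == c]
            # an empty cluster is re-seeded on a randomly chosen sample
            means.append(sum(members) / len(members) if members
                         else rng.choice(samples))
        # re-assign every sample to the nearest cluster mean
        new = [min(range(k), key=lambda c: (s - means[c]) ** 2)
               for s in samples]
        if new == assign:        # stable: no sample changed its assignment
            return assign, means
        assign = new

samples = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]
assign, means = kmeans(samples, 2)
print(assign, [round(m, 1) for m in means])
```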
Cluster analysis - method of assignment
Samples are normally assigned to a cluster in a “hard” way: samples are unambiguously attributed to a specific cluster - 0 or 1 assignment
However, mother nature is rarely so black and white... a “middle age” cluster, for example, depends very much on person/country/continent
[Figure: hard assignment - membership jumps from 0 to 1 (if age is between A and B: middle age) - versus the fuzzy approach, where samples have cluster memberships between 0 and 1 and young, middle age and old overlap]
Fuzzy clustering
fuzzy clustering has a number of distinct benefits:
can deal with intermediate cases - they are not force-assigned
samples can share multiple clusters - extra information: (0.7 young + 0.3 middle age versus 0.5 young + 0.5 middle age)
ensures that single samples do not overly control individual clusters
can have a separate outlier assignment
most flexible and powerful: fuzzy clustering with seeking of cluster means
[Figure: hard clustering - strained assignment due to an outlier and an intermediate value; fuzzy clustering - the outlier is not a problem and the intermediate value is shown]
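For fixed cluster centres, fuzzy memberships can be sketched with the standard fuzzy c-means membership formula (the age centres below are made-up numbers, purely to echo the young/middle/old example):

```python
def fuzzy_memberships(sample, centres, m=2.0):
    # fuzzy c-means membership of one sample in each cluster:
    # u_i = (1/d_i)^(2/(m-1)) / sum_k (1/d_k)^(2/(m-1)); memberships sum to 1
    d = [abs(sample - c) for c in centres]
    if 0.0 in d:                       # sample sits exactly on a centre
        return [1.0 if di == 0.0 else 0.0 for di in d]
    inv = [(1.0 / di) ** (2.0 / (m - 1.0)) for di in d]
    total = sum(inv)
    return [v / total for v in inv]

centres = [20.0, 45.0, 70.0]           # hypothetical "young", "middle", "old"
for age in (20.0, 32.5, 44.0):
    u = fuzzy_memberships(age, centres)
    print(age, [round(v, 2) for v in u])
```

An age halfway between two centres gets an even share of both clusters instead of being force-assigned to one, which is exactly the extra information hard assignment throws away.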
Clustering in NCSS - the eating habits of Europe
can we distinguish the Europeans by their eating habits? the data (missing value = -999):
variables: Real coffee, Nescafe, Tea, Sweetener, Biscuits, Pack. soup, Tinned soup, Frozen fish, Frozen veg., Apples, Tinned fruit, Jam, Garlic, Butter, Margarine, Olive oil, Yoghurt
lots of options available:
use parametric and non-parametric data and even mix these (length + color)
variety of linkage types: nearest neighbour, furthest neighbour, Ward's method
distance: Euclidean or Manhattan city block
see the NCSS hierarchical clustering tutorial for more information
hierarchical clustering of this data set: clear clustering
Clustering in NCSS - the eating habits of Europe
hard and fuzzy clustering of this data set:

country        K-means (hard)  fuzzy-prob 1  fuzzy-prob 2  fuzzy-prob 3  fuzzy-prob 4
Germany        2               0.04          0.02          0.83          0.11
Italy          3               0.01          0.93          0.04          0.02
France         2               0.05          0.12          0.77          0.06
Netherlands    2               0.23          0.07          0.53          0.16
Belgium        2               0.08          0.25          0.56          0.11
Luxembourg     2               0.09          0.06          0.75          0.10
Britain        4               0.92          0.01          0.04          0.03
Portugal       3               0.02          0.92          0.04          0.03
Austria        3               0.03          0.87          0.06          0.05
Switzerland    2               0.05          0.05          0.86          0.05
Sweden         1               0.05          0.04          0.08          0.82
Denmark        1               0.03          0.02          0.07          0.88
Norway         1               0.03          0.07          0.10          0.80
Finland        1               0.06          0.22          0.16          0.57
Spain          3               0.02          0.83          0.11          0.04
Ireland        4               0.88          0.04          0.06          0.03
Clustering - number of clusters
the main difficulty in cluster analysis is choosing the number of clusters
NCSS and other clustering packages will calculate assignments for a range of cluster numbers
the residual variance will decrease with every additional cluster, so this is not a good indicator of the optimal number of clusters
instead:
choose the number of clusters where the variance no longer strongly decreases
use the averaged silhouette value: a comparison between a value's dissimilarity with its own cluster and its dissimilarity with its nearest neighbouring cluster: ranges from 1 to -1: > 0.75: good model; < 0.25: poor model
use the fuzziness of the model (0: completely fuzzy, to 1: hard) via the Fc(U) and Dc(U) parameters: max Fc(U) + min Dc(U) = best model
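The averaged silhouette value can be sketched for one-dimensional data as follows (an illustration of the definition, not any package's implementation; the sample values are invented):

```python
def silhouette(samples, labels):
    # averaged silhouette: s_i = (b_i - a_i) / max(a_i, b_i), where a_i is the
    # mean dissimilarity to the sample's own cluster and b_i the mean
    # dissimilarity to the nearest other cluster
    scores = []
    for i, (s, l) in enumerate(zip(samples, labels)):
        own = [abs(s - t) for j, (t, m) in enumerate(zip(samples, labels))
               if m == l and j != i]
        if not own:
            continue                       # singleton clusters carry no score
        a = sum(own) / len(own)
        b = min(sum(abs(s - t) for t, m in zip(samples, labels) if m == c)
                / labels.count(c)
                for c in set(labels) if c != l)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

samples = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]
print(round(silhouette(samples, [0, 0, 0, 1, 1, 1]), 2))   # 0.95: good model
print(round(silhouette(samples, [0, 1, 0, 1, 0, 1]), 2))   # -0.05: poor model
```

Tight, well-separated clusters score near 1; scrambling the labels drops the average below the 0.25 poor-model threshold.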
Plotting clusters on maps - Massif Central dataset
We will look at an example from the Massif Central in France: a dataset of the chemical composition of stream sediments, collected in an area with a diverse geology and old, now abandoned, mining for Sb, As, Pb, Au, Ba & F
Geology consists of:
[Map: study area along the Cronce and Desges rivers, with localities including Langeac, Lavoute-Chilhac, St Cirgues, Chilhac, Reilhac, Marsanges, Pebrac, Chazelles, Desges, Barlet, Charraix, Chanteuges, Ally, Védrines St Loup, Pinols, Lestival, Chastel and Prades; legend: felsic gneisses, mafic gneisses and schists, (meta-)granite, sediment (incl. coal), “recent” volcanics; scale bar 5 km]
The dataset is best described when split up into six clusters
Clustering - groups in Massif Central dataset
clear link to the bedrock geology, but not one-to-one
Clustering - properties per cluster
when the data have been clustered we can look at the characteristics of each cluster (mean + st. dev.) and the correlations within it
[Figure: per-cluster plots involving log(Li), V, SiO2, K2O, MgO and Li]
Clustering - groups in Massif Central dataset
[Figure: maps of clusters 1, 2, 3 and 4 plotted separately]
cluster separation isn’t perfect
only cluster 4 is distinct in Li: multi-element separation is needed
Can plot clusters individually to look at spatial distribution and contents
Plotting data on maps: bubble plots
Data are plotted at their spatial coordinates with a symbol whose size represents the value of the data point
Can apply exactly the same tools as used on the element map:
adjust contrast, isolate features and perform data transformations
can also overlay these bubbles on another layer, such as a topo map, geol map, stream map etc
Plotting data on maps: bubble plots
Stream sediments as a reflection of the local geology: Beryllium
Be concentrations without processing:
sometimes it just works!
Plotting data on maps: bubble plots
Silver concentrations: working with a non-normal distribution
[Figure: three bubble maps of Ag - linear scale, square-root scale, optimized]
Plotting data on maps: bubble plots
Don't have to plot all the data in the dataset: applying a cut-off at low values will highlight interesting samples, whereas a high cut-off removes outliers
Zn, only data with > 50 ppm
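The scaling and cut-off logic behind such bubble plots can be sketched as follows (the symbol sizes and Zn values are illustrative; a real map would hand these sizes to a plotting package):

```python
import math

def bubble_sizes(values, scale="sqrt", cutoff=None, max_size=30.0):
    # map data values to plotting-symbol sizes; None = point not plotted
    kept = [v for v in values if cutoff is None or v > cutoff]
    if not kept:
        return [None] * len(values)
    top = max(kept)
    sizes = []
    for v in values:
        if cutoff is not None and v <= cutoff:
            sizes.append(None)                  # below cut-off: sample dropped
        elif scale == "sqrt":
            # square-root scaling tames a skewed (non-normal) distribution
            sizes.append(max_size * math.sqrt(v / top))
        else:
            sizes.append(max_size * v / top)    # linear scale
    return sizes

zn = [10.0, 55.0, 120.0, 480.0]                 # hypothetical Zn values, ppm
print(bubble_sizes(zn, scale="linear"))
print(bubble_sizes(zn, scale="sqrt", cutoff=50.0))
```

On the linear scale one high value dwarfs everything else; the square-root scale keeps the smaller bubbles visible, and the 50 ppm cut-off drops the uninteresting sample entirely.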
Plotting data on maps: bubble plots
Looking for element associations by combining bubble plots
[Figure: combined bubble plots of Cd, Zn, Sb and W]
Plotting data on maps
Combining elements by using multi-coloured bubble plots is useful, but it quickly becomes confusing: it can lead you to miss interesting samples
Can also calculate such associations beforehand and plot them directly:
• Sb + Zn
• Sb / Zn
Or you can apply logical rules to the data before plotting:
• plot Sb if S > 200 ppm
• if SiO2 > 60 wt% then plot K / Zr
Note that such properties are calculated much more easily and quickly in programs designed for such calculations, e.g. Excel or Quattro Pro
Not limited to plotting data, but can also plot derived properties such as the mean, median, standard deviation, etc
and not just values, but also other observations: geological code / vegetation / mode in a multi-modal distribution
Plotting data on maps
Plotting data on maps: bubble plots
Plotting processed data - standard deviation: the variability at a sample site
Plotting data on maps: artefacts in the gold map
Spatial data visualization
To be able to calculate contours and surfaces: interpolation
we need to know the concentration at any point in the sampling space to be able to draw smooth contours: interpolate between values
[Figure: interpolation of As content on a grid, comparing the nearest-neighbour approach with the radius technique using 1/r and 1/r² weighting]
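These radius techniques are forms of inverse-distance weighting, which can be sketched as follows (the coordinates and As values are made up):

```python
def idw(x0, y0, samples, power=2, radius=None):
    # inverse-distance weighting: each sample contributes with weight
    # 1/r^power, so power=1 is the 1/r radius technique, power=2 the 1/r^2 one
    num = den = 0.0
    for x, y, z in samples:
        r = ((x - x0) ** 2 + (y - y0) ** 2) ** 0.5
        if radius is not None and r > radius:
            continue                   # sample outside the interpolation radius
        if r == 0.0:
            return z                   # grid node coincides with a sample
        w = 1.0 / r ** power
        num += w * z
        den += w
    return num / den if den else None  # None: nothing within the radius

# hypothetical As samples (x, y, ppm) around a grid node at the origin
samples = [(1.0, 0.0, 10.0), (0.0, 2.0, 20.0), (-3.0, 0.0, 40.0)]
print(round(idw(0.0, 0.0, samples, power=2), 2))              # 14.29
print(round(idw(0.0, 0.0, samples, power=2, radius=2.5), 2))  # 12.0
```

Capping the radius excludes the distant high sample and pulls the estimate down, which is exactly the "maximum radius" question addressed below.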
Spatial data visualization
Results of different interpolation techniques:
[Figure: four interpolated As maps, one per technique]
Spatial data visualization
[Figure: interpolation of As content on a grid - nearest neighbour, radius technique 1/r, radius technique 1/r²]
main issue: which samples should be included in the interpolation: what should the maximum radius be?
Interpolation radius
Spatial data have a very useful property: adjacent samples should be most similar, whereas samples that are far apart can be distinctly different, or:
the variance for a small interpolation radius is small, as the variance between adjacent samples is small
the variance increases as the interpolation radius increases (i.e. as samples further away from the point of interest are included)
at some radius the variance will no longer increase as we have reached the overall variance, which is called the “regional variance”
including values beyond the regional variance radius is pointless as such samples do not contain any information on the value at the point of interest
Interpolation radius
Interpolation radius in a sedimentary core: [Figure: concentration versus depth in the core]
adjacent samples are most similar: as the interpolation radius increases, so does the variance
when you enter another unit the variance increases significantly: such samples should not be included in your interpolation
Interpolation radius
[Figure: variance and semivariance plotted against radius]
Semivariance and semivariograms
This concept is semivariance and is shown in a semivariogram
semivariance: the variance between samples a specified interval or distance apart

γ(h) = Σ (zᵢ − zᵢ₊ₕ)² / (2 (n − h))

with: γ = semivariance for interval h, n = total number of samples, zᵢ = value at position i
as the interval increases, the semivariance will approach the total variance of the data set, so it is a spatially controlled partial variance of the data
as h increases, the relatedness of the samples decreases and the variance will therefore increase:
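The formula translates directly into code for equally spaced samples along a traverse. A toy series with a perfect linear trend shows the semivariance rising continuously with h:

```python
def semivariance(z, h):
    # gamma(h) = sum of (z_i - z_{i+h})^2 over all pairs h apart,
    # divided by 2 (n - h), for n equally spaced samples
    n = len(z)
    return sum((z[i] - z[i + h]) ** 2 for i in range(n - h)) / (2 * (n - h))

z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]               # a perfect linear trend
print([semivariance(z, h) for h in (1, 2, 3)])   # [0.5, 2.0, 4.5]
```

Because the trend never levels off, neither does the semivariance; this is the "continuous variation with distance" case below, and the reason trends must be removed before kriging.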
Semivariance and semivariograms
plotting the semivariance against h: semivariogram
[Figure: three pairs of plots (concentration versus distance, semivariance versus interval) illustrating: no relation with distance - random; gradual changes in concentration; continuous variation with distance - a trend]
Semivariance and semivariograms
[Figure: two semivariograms of semivariance against interval, one levelling off at a sill, one showing drift]
properties of a semivariogram: the range, the sill, and drift
the range is the interval within which there is similarity between the samples
Semivariance and semivariograms
Semivariograms provide our maximum radius criterion: only samples that fall within the range are included in interpolation
before we continue, a few notes:
‣ most semivariograms have an apparent cut-off at zero distance with a semivariance ≠ 0. This is called the nugget effect and is caused by sample heterogeneity (= field duplicate variance)
‣ semivariograms have to be determined for each variable, as each has its own range: interpolation has to be performed separately as well
‣ semivariograms are generally different for different spatial directions (N, SW, etc.). Such anisotropy can point to an underlying geological phenomenon, such as layering or a fault control on concentration. This can be corrected for either manually, by stretching the coordinate system perpendicular to the main axis, or automatically by kriging software
Nugget effect in semivariograms
[Figure: semivariogram with the nugget, range and sill marked]
There is always some uncertainty at a given sample site, which you could quantify by taking field duplicates. This sample site variance is the “nugget” in a semivariogram (in essence the variance at zero distance)
Every element will have such a nugget, but the effect is strongest for elements that are heterogeneously distributed, such as gold present as nuggets in a sediment, because we use the mean + variance
Using semivariogram information: kriging
The interpolation technique that employs the range information obtained from semivariograms is called kriging
in kriging, only samples that are within the range are used to determine the value at a given intermediate position, and the weighting for each sample is derived from its associated semivariance
A(xᵢ,yᵢ) = wt₁ · A(x₁,y₁) + wt₂ · A(x₂,y₂) + wt₃ · A(x₃,y₃) + ...
as an added bonus this also gives us the variance associated with each interpolated value (the uncertainty), so we can immediately see where our interpolations are reliable and where they are not
because weights are based on the semivariance, obvious trends in the data should be removed, as a trend leads to a continuous rise in the semivariance: this can be done by first subtracting a trend surface
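A minimal ordinary-kriging sketch in one dimension, using a spherical semivariogram model (all positions, concentrations and model parameters are made up; production work would use dedicated kriging software). The weights come from solving a small linear system, and the same solution yields the kriging variance:

```python
import numpy as np

def spherical(h, nugget, sill, rng):
    # spherical semivariogram model: 0 at h = 0, rising to the sill at the range
    h = np.asarray(h, dtype=float)
    g = nugget + (sill - nugget) * (1.5 * h / rng - 0.5 * (h / rng) ** 3)
    return np.where(h >= rng, sill, np.where(h == 0.0, 0.0, g))

def ordinary_kriging(x0, xs, zs, model):
    # ordinary-kriging system: sample-to-sample semivariances on the left,
    # sample-to-target semivariances on the right, plus the constraint that
    # the weights sum to 1 (enforced via the Lagrange multiplier mu)
    n = len(xs)
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = model(np.abs(xs[:, None] - xs[None, :]))
    A[n, n] = 0.0
    b = np.ones(n + 1)
    b[:n] = model(np.abs(xs - x0))
    sol = np.linalg.solve(A, b)
    w, mu = sol[:n], sol[n]
    estimate = float(w @ zs)
    variance = float(w @ b[:n] + mu)   # kriging variance: the uncertainty
    return estimate, variance

xs = np.array([0.0, 1.0, 3.0])         # 1-D sample positions
zs = np.array([10.0, 12.0, 30.0])      # made-up concentrations
model = lambda h: spherical(h, nugget=0.1, sill=4.0, rng=5.0)
est, var = ordinary_kriging(2.0, xs, zs, model)
print(round(est, 2), "+/-", round(var ** 0.5, 2))
```

At a sample position the estimate reproduces the data exactly with zero kriging variance; between samples the variance is positive, which is the per-point uncertainty mentioned above.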
Estimate of uncertainty for each interpolated value
source: wikipedia.org
Uncertainty in block kriging of grades
Kriging is commonly applied to estimate the grade of blocks in open pit mining, using a sample grid or the grade of adjacent blocks (or both). In such cases it is invaluable to know the uncertainty on the grade estimate
Flavours of kriging
There are many flavours of kriging and discussing them all would be a course in its own right. A few terms that you will come across commonly:
Simple/Ordinary kriging: no trend in the data, so there is a constant mean in the dataset and the variance is calculated as the difference from this mean. This mean is either known (Simple) or calculated from the data (Ordinary)
Universal kriging: there is a spatial trend in the data, so the mean varies with the spatial coordinates. Instead of using universal kriging, you can also remove the trend in pre-processing of the data
Indicator kriging: rather than estimating a numerical value at a given point, you estimate whether it is higher or lower than a set value, and the probability of this
Co-kriging: a second variable that is correlated with the first is included in the kriging. This should improve estimates of the first and main variable
Good kriging resource: Clark & Harper (2000) Practical Geostatistics ISBN 0970331703, or you can download the 1979 original at http://www.kriging.com/pg1979_download.html
Back to our example
Results of different interpolation techniques:
[Figure: four interpolated maps, one per technique]
And now using kriging as the interpolation method
Results of kriging on this data set:
[Figure: three kriged maps of the same data]
Kriging and trends
The effect of a strong spatial trend in the data
Some data are not suited to interpolation/kriging
There is a strong tendency to directly start with the most complex or fancy technique, such as kriging. However, kriging is not always appropriate !
[Figure: raw concentrations plotted versus the optimized kriging map]
Kriging and sample coverage
Kriging works best when you have a high sample density and a more or less uniform distribution of data over the sample area. If not ➛ you get artefacts
Areas without samples need to be blanked out, not just removed afterwards