A JMP® SCRIPT FOR GEOSTATISTICAL CLUSTER ANALYSIS OF MIXED DATA SETS WITH SPATIAL INFORMATION
Steffen Brammer
OBJECTIVE
Example 1: Find a location for your luxury car sales outlet
• Data set with samples across the city at specific locations
• Establish mean for each suburb to identify affluent clientele
• Location of outlet: no customers. Why?
[Figure labels: Reality; Affluent suburb]
Example 2: How much pesticide do you need to get rid of the bugs?
• Data set with samples at specific locations
• Establish mean for each field
• Strong bug population only in trees along roads -> remove tree lines from your statistics
Example 3: Image processing
[Figure panels: Available pixels; Completed image (interpolation algorithm); Reality]
Example 4: Mine geology
[Figure panels: Sample data; Domains. High-grade gold mineralisation in quartz vein stockwork (pink lines)]
MIXED DATA
• Single data set with two (or more) underlying populations that are independent of each other – need to separate into sub-sets (‘domains’) before any statistical analysis
• Challenge: allocate samples within the range of overlap to the correct domain
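To make the overlap problem concrete, here is a minimal Python sketch. It assumes two normally distributed populations with hypothetical parameters (none of the numbers come from the presentation) and computes, from the value alone, the probability that a sample belongs to the upper population. Inside the overlap range the answer is far from 0 or 1, which is why the value on its own cannot settle the allocation.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def prob_upper(x, lower, upper):
    """Probability that value x belongs to the upper population, judged by value only.

    lower and upper are (mean, sd, n_samples) tuples; all values are hypothetical.
    """
    w_low = lower[2] * normal_pdf(x, lower[0], lower[1])
    w_up = upper[2] * normal_pdf(x, upper[0], upper[1])
    return w_up / (w_low + w_up)

LOWER = (1.0, 0.5, 900)   # assumed background population
UPPER = (3.0, 0.8, 100)   # assumed small outlier population

for value in (1.0, 2.0, 2.5, 4.0):
    print(f"value {value:.1f}: P(upper) = {prob_upper(value, LOWER, UPPER):.2f}")
```

Values near 2.0 to 2.5 fall in the ambiguous range for these assumed parameters; that is where spatial information has to decide.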
DECOMPOSITION OF DATA
• By statistical decomposition
– various methods and algorithms available
• In a spatial framework (that is, samples are not randomly distributed, but a spatial relationship exists between samples), not only the value of a sample must be taken into account, but also its location
– By manual creation of polygons to separate various domains
– By geostatistical decomposition, e.g. geostatistical cluster analysis
• ROMARY, T., RIVOIRARD, J. et al. 2012. Domaining by Clustering Multivariate Geostatistical Data. In: ABRAHAMSEN et al. (eds) Geostatistics Oslo 2012, pp. 455-466, Springer, Dordrecht
– Conventional geostatistical methods struggle or fail when the clusters are intertwined with irregular, discontinuous or complex geometries
– New concept developed and applied using JMP®
• Assumption 1: Distributions of the underlying populations are known – outcome after decomposition must honour the distributions
• Assumption 2: Populations occur in clusters with a certain degree of connectivity between their samples
• Brammer, S. 2015. Domaining of long-tailed bimodal data-sets with statistical methods. In: The Danie Krige Geostatistical Conference. SAIMM, Johannesburg. pp. 281-286
• Brammer, S. 2015. A self-guiding domaining tool for long-tailed bi-modal data sets. In: Proceedings of the 17th annual conference of the International Association for Mathematical Geosciences. Sept 5-13, 2015, Freiberg (Saxony), Germany
CONCEPT & METHODOLOGY
• 1st step: Establish statistical moments of underlying sample populations*
– Mean, spread, number of samples (a)
– Build target histogram of expected outcome (b)
*assuming both populations are approx. normally distributed
[Figure: original sample data with small outlier population; panels (a) and (b)]
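A target histogram of this kind can be sketched in a few lines of Python. This is only an illustration under the stated normality assumption: the bin edges, the mean, spread and sample count are hypothetical, and the function name `target_histogram` is not taken from the JMP script.

```python
import math

def normal_cdf(x, mean, sd):
    """Cumulative probability of a normal distribution at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def target_histogram(mean, sd, n_samples, bin_edges):
    """Expected number of samples per bin for a normally distributed population."""
    counts = []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        share = normal_cdf(hi, mean, sd) - normal_cdf(lo, mean, sd)
        counts.append(round(n_samples * share))
    return counts

# Hypothetical outlier population: mean 3.0, spread 0.8, 100 samples.
edges = [0.5 * i for i in range(13)]          # bin edges 0.0, 0.5, ..., 6.0
target = target_histogram(3.0, 0.8, 100, edges)
print(target)
```

Each bin count acts as a quota: the decomposition may allocate at most that many samples of the corresponding value range to the outlier domain.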
CONCEPT & METHODOLOGY (cont.)
• 2nd step: Build a continuous search path through sample grid
– Pick random seed within upper domain
– Follow progressively adjacent samples as long as they fit into the target histogram
[Figure panels: Sample grid; Sample grid (detail) with seed sample and search path (x); Domains (Reality), red dots = outlier domain]
CONCEPT & METHODOLOGY (cont.)
• Search path stops when no sample in neighbourhood fits into target histogram – outside high-grade zone; lower tail of target histogram is filled up
• Once search is interrupted, repeat search from new random seed
• Repeat procedure until all samples potentially belonging to the upper domain are investigated
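The search procedure described above can be sketched roughly as follows. This is a simplified Python illustration, not the JMP script itself. Several details are assumptions made for the example: "fits into the target histogram" is read as "the bin for this value still has room", seeds are drawn from the upper half of the bins, the neighbourhood is a plain circle, and ties are broken by moving to the highest-valued neighbour.

```python
import math
import random

def bin_index(value, edges):
    """Return the index of the histogram bin holding value, or None if outside."""
    for i in range(len(edges) - 1):
        if edges[i] <= value < edges[i + 1]:
            return i
    return None

def search_paths(samples, edges, target, radius, rng):
    """Allocate samples (x, y, value) to the upper domain along connected paths."""
    filled = [0] * len(target)          # samples allocated so far, per bin
    allocated = set()

    def fits(i):
        b = bin_index(samples[i][2], edges)
        return b is not None and filled[b] < target[b]

    def take(i):
        filled[bin_index(samples[i][2], edges)] += 1
        allocated.add(i)

    # Assumption: candidate seeds are samples falling in the upper half of the bins.
    first_upper_bin = len(target) // 2
    seeds = []
    for i, (_, _, value) in enumerate(samples):
        b = bin_index(value, edges)
        if b is not None and b >= first_upper_bin:
            seeds.append(i)
    rng.shuffle(seeds)

    for seed in seeds:
        if seed in allocated or not fits(seed):
            continue
        take(seed)
        current = seed
        while True:
            cx, cy, _ = samples[current]
            candidates = [
                i for i, (x, y, _) in enumerate(samples)
                if i not in allocated
                and math.hypot(x - cx, y - cy) <= radius
                and fits(i)
            ]
            if not candidates:
                break                   # path interrupted: go on to the next seed
            # Assumed tie-break: step to the highest-valued fitting neighbour.
            current = max(candidates, key=lambda i: samples[i][2])
            take(current)
    return allocated

# Hypothetical 10 x 10 grid: background value 1.0, a diagonal "vein" of value 3.0.
grid = [(x, y, 3.0 if x == y else 1.0) for x in range(10) for y in range(10)]
upper = search_paths(grid, edges=[0.0, 2.0, 4.0], target=[2, 10],
                     radius=1.5, rng=random.Random(1))
vein = {i for i, (x, y, _) in enumerate(grid) if x == y}
print(len(upper), vein <= upper)
```

In this toy grid the path follows the vein and leaks into the background only until the small lower-tail quota of the target histogram is used up.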
SCRIPT 1: ESTIMATION OF STATISTICAL MOMENTS
[Figure: original sample data with small outlier population]
Input dialog 1 – estimated parameters (a)
Input dialog 2 – iteration parameters (b)
1. Calculate statistical moments for various distribution scenarios (a)
2. Fit distribution and assess goodness-of-fit (b)
3. Record critical parameters (c)
4. Iterate through all possible combinations in nested loops (d)
5. Rank output values by goodness-of-fit tests and choose best option as final result (e)
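The iterate-and-rank idea of Script 1 can be sketched in Python as below. The slides do not name the goodness-of-fit tests used, so a Kolmogorov-Smirnov distance stands in here as an assumption. The synthetic data, the parameter grids and the fixed lower-population parameters are likewise hypothetical.

```python
import math
import random

def normal_cdf(x, mean, sd):
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def mixture_cdf(x, lower, m2, s2, frac2):
    """CDF of a two-population mixture; frac2 is the share of the upper population."""
    return (1.0 - frac2) * normal_cdf(x, lower[0], lower[1]) + frac2 * normal_cdf(x, m2, s2)

def ks_distance(sorted_values, lower, m2, s2, frac2):
    """Largest gap between empirical and model CDF (smaller means better fit)."""
    n = len(sorted_values)
    worst = 0.0
    for i, v in enumerate(sorted_values):
        model = mixture_cdf(v, lower, m2, s2, frac2)
        worst = max(worst, abs(model - i / n), abs(model - (i + 1) / n))
    return worst

def rank_scenarios(values, lower, m2_grid, s2_grid, frac_grid):
    """Try every parameter combination in nested loops and rank by fit."""
    data = sorted(values)
    results = []
    for m2 in m2_grid:
        for s2 in s2_grid:
            for frac2 in frac_grid:
                results.append((ks_distance(data, lower, m2, s2, frac2), (m2, s2, frac2)))
    results.sort()                      # best (smallest distance) first
    return results

# Synthetic mixed data set: 900 background samples plus 100 outlier samples.
rng = random.Random(0)
values = [rng.gauss(1.0, 0.5) for _ in range(900)] + [rng.gauss(3.0, 0.8) for _ in range(100)]

ranked = rank_scenarios(values, lower=(1.0, 0.5),
                        m2_grid=[2.0, 2.5, 3.0, 3.5],
                        s2_grid=[0.4, 0.8, 1.2],
                        frac_grid=[0.05, 0.10, 0.20])
best_distance, best_params = ranked[0]
print(best_distance, best_params)
```

The best-ranked scenario supplies the moments (mean, spread, sample share) that feed the target histogram in the next script.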
SCRIPT 2: SEARCH PATH THROUGH SAMPLE GRID
Input dialog 1 – assign columns (a)
Input dialog 2 – statistical moments, as established by Script 1 (b)
Input dialog 3 – search parameters (c)
1. Set up target histogram for outlier population (a)
2. Set up rotation matrix for oriented search (b)
3. Select random seed sample from outlier population (c)
4. Select all samples within specified neighbourhood (d)
5. Select sample within specified neighbourhood that fits into target histogram (e)
6. Increase the number of samples of the respective histogram bin (e)
7. Go to selected sample and continue search at new location (e)
8. Continue search as long as the criteria of the target histogram are satisfied, then choose a new seed sample for the next cluster and repeat the search until the whole grid is investigated (f)
9. Post-processing to clean up results (g)
10. Repeat whole procedure several times to conduct cluster analysis with a variety of different seed samples and different search orientations (as the result of a single run depends on the random sequence of seed samples)
11. Results are given as probabilities for each sample to belong to the outlier population
12. Select specified number of samples with highest probabilities for final result
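The rotation matrix and the neighbourhood selection of Script 2 could look roughly like this in Python. It is a two-dimensional sketch under assumptions: the search angle is measured counter-clockwise from the x-axis, the neighbourhood is an ellipse with hypothetical `major` and `minor` radii, and the function names are illustrative rather than taken from the script.

```python
import math

def rotation_matrix(angle_deg):
    """2 x 2 matrix that rotates offsets so the search direction lies on the x-axis."""
    a = math.radians(angle_deg)
    return ((math.cos(a), math.sin(a)),
            (-math.sin(a), math.cos(a)))

def in_neighbourhood(centre, point, rot, major, minor):
    """True if point lies inside the oriented search ellipse around centre."""
    dx = point[0] - centre[0]
    dy = point[1] - centre[1]
    u = rot[0][0] * dx + rot[0][1] * dy     # offset along the search direction
    v = rot[1][0] * dx + rot[1][1] * dy     # offset across the search direction
    return (u / major) ** 2 + (v / minor) ** 2 <= 1.0

# Hypothetical search: 45 degree orientation, reach 3 along and 1 across.
rot = rotation_matrix(45.0)
points = [(2.0, 2.0), (2.0, -2.0), (0.5, 0.0), (4.0, 4.0)]
neighbours = [p for p in points if in_neighbourhood((0.0, 0.0), p, rot, 3.0, 1.0)]
print(neighbours)
```

Varying the angle between runs gives the different search orientations mentioned in the list above.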
FINAL RESULT
[Figure panels: Reality; Result after 10 runs; Result after 25 runs; Result after 50 runs]
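Results like these come from combining many runs. A minimal Python sketch of that aggregation follows, assuming each run returns the set of sample indices it allocated to the outlier population. The run results shown are made-up toy data, and the tie-break by sample index is an assumption.

```python
from collections import Counter

def membership_probabilities(run_results, n_samples):
    """Share of runs in which each sample was allocated to the outlier population."""
    counts = Counter()
    for allocated in run_results:
        counts.update(set(allocated))
    n_runs = len(run_results)
    return [counts[i] / n_runs for i in range(n_samples)]

def select_top(probabilities, n_select):
    """Indices of the n_select samples with the highest probabilities."""
    order = sorted(range(len(probabilities)), key=lambda i: (-probabilities[i], i))
    return sorted(order[:n_select])

# Toy example: four runs over six samples.
runs = [{0, 1, 2, 5}, {0, 1, 3}, {0, 1, 2}, {0, 2, 4}]
probs = membership_probabilities(runs, 6)
final = select_top(probs, 3)
print(probs, final)
```

Because a single run depends on the random sequence of seeds, more runs smooth the probabilities before the final selection is made.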
How to do this without JMP®?!?
No idea.....!
Thank You!!!