
A JMP® SCRIPT FOR GEOSTATISTICAL CLUSTER ANALYSIS OF MIXED DATA SETS WITH SPATIAL INFORMATION

Steffen Brammer

OBJECTIVE

Example 1: Find a location for your luxury car sales outlet

Data set with samples across the city at specific locations

Establish mean for each suburb to identify affluent clientele

Location of outlet

No customers. Why?

Reality Affluent Suburb

Example 2: How much pesticide do you need to get rid of the bugs?

Data set with samples at specific locations

Establish mean for each field

Strong bug population only in trees along roads -> remove tree lines from your statistics

Example 3: Image processing

[Figure panels: available pixels; interpolation algorithm; completed image; reality]

Example 4: Mine geology

[Figure panels: sample data; domains]

High-grade gold mineralisation in quartz vein stockwork (pink lines)

MIXED DATA

• Single data set with two (or more) underlying populations that are independent of each other – need to separate into sub-sets (‘domains’) before any statistical analysis

Challenge: allocate samples within the range of overlap to the correct domain

DECOMPOSITION OF DATA

• By statistical decomposition
– various methods and algorithms available

• In a spatial framework (that is, samples are not randomly distributed, but a spatial relationship exists between samples), not only the value of a sample must be taken into account, but also its location
– By manual creation of polygons to separate various domains
– By geostatistical decomposition, e.g. geostatistical cluster analysis

• ROMARY, T., RIVOIRARD, J. et al. 2012. Domaining by Clustering Multivariate Geostatistical Data. In: ABRAHAMSEN et al. (eds) Geostatistics Oslo 2012, pp. 455-466, Springer, Dordrecht

– Conventional geostatistical methods struggle or fail when the clusters are intertwined with irregular, discontinuous or complex geometries

– New concept developed and applied using JMP®

• Assumption 1: Distributions of underlying populations are known – outcome after decomposition must honour the distribution

• Assumption 2: Populations occur in clusters with a certain degree of connectivity between their samples

• Brammer, S. 2015. Domaining of long-tailed bimodal data-sets with statistical methods. In: The Danie Krige Geostatistical Conference. SAIMM, Johannesburg. pp. 281-286

• Brammer, S. 2015. A self-guiding domaining tool for long-tailed bi-modal data sets. In: Proceedings of the 17th annual conference of the International Association for Mathematical Geosciences. Sept 5-13, 2015, Freiberg (Saxony), Germany

CONCEPT & METHODOLOGY

• 1st step: Establish statistical moments of underlying sample populations*
– Mean, spread, number of samples (a)
– Build target histogram of expected outcome (b)

*assuming both populations are approx. normally distributed

[Figures (a), (b): original sample data with small outlier population]
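The step from moments to target histogram can be sketched as follows. This is a minimal Python illustration, not the JMP® script itself: the function names, the bin edges and the example values (mean 10, spread 2, 200 samples) are assumptions made for this sketch, and the population is taken to be normal as stated in the footnote.

```python
import math

def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def target_histogram(mean, sd, n_samples, bin_edges):
    """Expected number of samples per bin for a normal population.

    bin_edges is an ascending list; bin i covers [bin_edges[i], bin_edges[i+1]).
    Counts are rounded, so their sum can differ slightly from n_samples.
    """
    counts = []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        p = normal_cdf(hi, mean, sd) - normal_cdf(lo, mean, sd)
        counts.append(round(n_samples * p))
    return counts

# Hypothetical outlier population: mean 10, spread 2, 200 samples.
edges = [2 + i for i in range(17)]          # bins of width 1 from 2 to 18
target = target_histogram(10.0, 2.0, 200, edges)
print(target)
```

The resulting counts act as a capacity per bin: a sample can only be allocated to the outlier population while its bin is not yet full.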

CONCEPT & METHODOLOGY (cont.)

[Figure panels: sample grid with seed sample; sample grid (detail); domains (reality), red dots – outlier domain]

• 2nd step: Build a continuous search path through sample grid
– Pick random seed within upper domain
– Follow progressively adjacent samples as long as they fit into the target histogram

CONCEPT & METHODOLOGY (cont.)

• Search path stops when no sample in neighbourhood fits into target histogram – outside high-grade zone; lower tail of target histogram is filled up

• Once search is interrupted, repeat search from new random seed

• Repeat procedure until all samples potentially belonging to the upper domain are investigated
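The search procedure described above might look roughly like this in Python. It is a simplified sketch under several assumptions that the slides do not specify: a 2-D grid, a circular (isotropic) neighbourhood, and a rule that the path always steps to the nearest fitting neighbour. All function and variable names are hypothetical.

```python
import math
import random

def bin_index(value, edges):
    """Index of the histogram bin containing value, or None if out of range."""
    for i in range(len(edges) - 1):
        if edges[i] <= value < edges[i + 1]:
            return i
    return None

def search_paths(samples, edges, target, radius, rng):
    """Allocate samples to the upper domain by following search paths.

    samples: list of (x, y, value); edges/target: target histogram;
    radius: neighbourhood size (assumed circular in this sketch).
    Returns the set of sample indices accepted into the upper domain.
    """
    filled = [0] * len(target)               # current count per bin
    accepted = set()
    # Candidates: samples whose value lies inside the target histogram range.
    candidates = {i for i, s in enumerate(samples)
                  if bin_index(s[2], edges) is not None}

    def fits(i):
        b = bin_index(samples[i][2], edges)
        return filled[b] < target[b]

    def accept(i):
        filled[bin_index(samples[i][2], edges)] += 1
        accepted.add(i)

    def dist(i, cx, cy):
        return math.hypot(samples[i][0] - cx, samples[i][1] - cy)

    while candidates:
        current = rng.choice(sorted(candidates))   # random seed sample
        candidates.discard(current)
        if not fits(current):
            continue                               # bin already full
        accept(current)
        while True:
            cx, cy, _ = samples[current]
            nbrs = [i for i in sorted(candidates)
                    if dist(i, cx, cy) <= radius and fits(i)]
            if not nbrs:
                break                              # path interrupted
            current = min(nbrs, key=lambda i: dist(i, cx, cy))
            candidates.discard(current)
            accept(current)
    return accepted

# Tiny example: three connected high values and two low background values.
grid = [(0, 0, 10), (1, 0, 10), (2, 0, 10), (5, 5, 1), (6, 5, 1)]
domain = search_paths(grid, edges=[8, 12], target=[3], radius=1.5,
                      rng=random.Random(0))
print(sorted(domain))
```

Because every seed is removed from the candidate set, the loop ends once all samples that could belong to the upper domain have been investigated.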

SCRIPT 1. ESTIMATION OF STATISTICAL MOMENTS

Original sample data with small outlier population

Input dialog 1 – estimated parameters (a)
Input dialog 2 – iteration parameters (b)

1. Calculate statistical moments for various distribution scenarios (a)
2. Fit distribution and assess goodness-of-fit (b)
3. Record critical parameters (c)
4. Iterate through all possible combinations in nested loops (d)
5. Rank output values by goodness-of-fit tests and choose best option as final result (e)
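The loop-and-rank idea of Script 1 can be illustrated with a small Python sketch. It is not the JMP® implementation: it assumes two normal populations, uses the Kolmogorov-Smirnov distance as a stand-in for the goodness-of-fit tests (the slides do not name the tests used), and all names and parameter grids are hypothetical.

```python
import itertools
from statistics import NormalDist

def ks_distance(sorted_values, cdf):
    """Kolmogorov-Smirnov distance between the empirical CDF and cdf."""
    n = len(sorted_values)
    d = 0.0
    for i, v in enumerate(sorted_values):
        f = cdf(v)
        d = max(d, abs(f - i / n), abs((i + 1) / n - f))
    return d

def rank_scenarios(values, means_lo, sds_lo, means_hi, sds_hi, fractions):
    """Score every parameter combination; return them best fit first.

    Each result is (ks, mean_lo, sd_lo, mean_hi, sd_hi, fraction_hi).
    """
    data = sorted(values)
    results = []
    # itertools.product is equivalent to five nested loops.
    for m1, s1, m2, s2, w in itertools.product(
            means_lo, sds_lo, means_hi, sds_hi, fractions):
        lo, hi = NormalDist(m1, s1), NormalDist(m2, s2)
        ks = ks_distance(data, lambda x: (1 - w) * lo.cdf(x) + w * hi.cdf(x))
        results.append((ks, m1, s1, m2, s2, w))
    results.sort()
    return results

# Synthetic, noise-free test data: 900 background + 100 outlier samples,
# placed at evenly spaced quantiles of the two assumed populations.
bg, out = NormalDist(2.0, 0.5), NormalDist(6.0, 1.0)
values = ([bg.inv_cdf((i + 0.5) / 900) for i in range(900)]
          + [out.inv_cdf((i + 0.5) / 100) for i in range(100)])

ranked = rank_scenarios(values,
                        means_lo=[1.5, 2.0, 2.5], sds_lo=[0.5, 1.0],
                        means_hi=[5.0, 6.0, 7.0], sds_hi=[1.0, 2.0],
                        fractions=[0.05, 0.1, 0.2])
best = ranked[0]
print(best)
```

With real, noisy data several scenarios can score almost equally well, which is presumably why the results are ranked and inspected rather than accepted blindly.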

SCRIPT 2. SEARCH PATH THROUGH SAMPLE GRID

Input dialog 1 – assign columns (a)
Input dialog 2 – statistical moments, as established by Script 1 (b)
Input dialog 3 – search parameters (c)

1. Set up target histogram for outlier population (a)
2. Set up rotation matrix for oriented search (b)
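One way to picture the oriented search is a rotated, elongated neighbourhood. The Python sketch below assumes a 2-D grid, an elliptical neighbourhood, and an angle measured counter-clockwise from the +x axis. None of these conventions are given on the slides, and the function names are hypothetical.

```python
import math

def rotation_matrix(angle_deg):
    """2-D rotation that maps the search direction onto the +x axis."""
    a = math.radians(angle_deg)
    return [[math.cos(a), math.sin(a)],
            [-math.sin(a), math.cos(a)]]

def neighbours(points, centre, angle_deg, major, minor):
    """Indices of points inside an ellipse oriented along angle_deg.

    major: half-length along the search direction;
    minor: half-length across it. The centre itself is excluded.
    """
    rot = rotation_matrix(angle_deg)
    cx, cy = centre
    found = []
    for i, (x, y) in enumerate(points):
        dx, dy = x - cx, y - cy
        if dx == 0 and dy == 0:
            continue
        u = rot[0][0] * dx + rot[0][1] * dy   # offset along search direction
        v = rot[1][0] * dx + rot[1][1] * dy   # offset across search direction
        if (u / major) ** 2 + (v / minor) ** 2 <= 1.0:
            found.append(i)
    return found

pts = [(0, 0), (2, 2), (-2, 2), (3, 0), (0.5, -0.5)]
# Search ellipse oriented at 45 degrees, 3 units long and 1 unit wide.
print(neighbours(pts, centre=(0, 0), angle_deg=45, major=3, minor=1))
```

Points along the 45 degree direction are found at larger distances than points across it, which lets the search follow elongated structures such as vein sets.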

3. Select random seed sample from outlier population (c)
4. Select all samples within specified neighbourhood (d)
5. Select sample within specified neighbourhood that fits into target histogram (e)
6. Increase the number of samples of the respective histogram bin (e)

7. Go to selected sample and continue search at new location (e)
8. Continue search as long as the criteria of the target histogram are satisfied, then choose a new seed sample for the next cluster and repeat search until the whole grid is investigated (f)
9. Post-processing to clean up results (g)
10. Repeat whole procedure several times to conduct cluster analysis with a variety of different seed samples and different search orientations (as the result of a single run depends on the random sequence of seed samples)
11. Results are given as probabilities for each sample to belong to the outlier population
12. Select specified number of samples with highest probabilities for final result
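The last two steps can be sketched in Python as below. The sketch assumes that the probability of a sample is simply the fraction of runs in which it was allocated to the outlier population. The slides do not state how the probabilities are derived, so this definition and the function names are assumptions.

```python
from collections import Counter

def membership_probabilities(runs, n_samples):
    """Fraction of runs in which each sample was allocated to the outlier population.

    runs: list of sets of sample indices, one set per run.
    """
    counts = Counter()
    for accepted in runs:
        counts.update(accepted)
    return [counts[i] / len(runs) for i in range(n_samples)]

def select_top(probabilities, n_select):
    """Indices of the n_select samples with the highest probabilities.

    Ties are broken in favour of the lower sample index.
    """
    order = sorted(range(len(probabilities)),
                   key=lambda i: (-probabilities[i], i))
    return sorted(order[:n_select])

# Four hypothetical runs over a data set of four samples.
runs = [{0, 1}, {1, 2}, {1}, {0, 1}]
probs = membership_probabilities(runs, n_samples=4)
print(probs)
print(select_top(probs, n_select=2))
```

The number of samples to select would typically come from the sample count established for the outlier population in Script 1.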

FINAL RESULT

[Figure panels: reality; result after 10 runs; result after 25 runs; result after 50 runs]

How to do this without ?!?

No idea.....!

Thank You!!!
