Grouping Data: Methods of Cluster Analysis
Post on 19-Dec-2015
Goals 1
1. We want to identify groups of similar artifacts, features, sites, graves, etc., that represent cultural, functional, or chronological differences
2. We want to create groups as a measurement technique to see how they vary with external variables
Goals 2
3. We want to cluster artifacts or sites based on their location to identify spatial clusters
Real vs. Created Types
• Differences in goals
– Real types are the aim of Goal 1
– Created types are the aim of Goal 2
• Debate over whether Real types can be discovered with any degree of certainty
• Cluster analysis guarantees groups – you must confirm their utility
Initial Decisions 1
• What variables to use?
– All possible
– Constructed variables (from principal components, correspondence analysis, or multi-dimensional scaling)
– Restricted set of variables that support the goal(s) of creating groups (e.g. functional groups, cultural or stylistic groups)
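A minimal sketch of the "constructed variables" option: use principal component scores in place of the raw measurements. The data here are synthetic stand-ins, not any dataset from the slides.

```r
# Sketch: principal component scores as constructed clustering variables
# (synthetic data stands in for real measurements)
set.seed(42)
dat <- matrix(rnorm(100 * 4), ncol = 4,
              dimnames = list(NULL, c("Length", "Width", "Thickness", "Weight")))
pca <- prcomp(dat, scale. = TRUE)   # scale. = TRUE standardizes the columns
scores <- pca$x[, 1:2]              # first two components become the new variables
km <- kmeans(scores, centers = 3, nstart = 10)
table(km$cluster)                   # cluster sizes
```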
Initial Decisions 2
• How to transform the variables?
– Log transforms
– Conversion to percentages (to weight rows equally)
– Size standardization (dividing by the geometric mean)
– Z-scores (to weight columns equally)
– Conversion of categorical variables
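The transformations listed above can each be done in a line or two of base R. This sketch uses synthetic assemblage counts purely for illustration.

```r
# Sketch of the transformations above, on synthetic assemblage counts
set.seed(1)
counts <- matrix(rpois(20, lambda = 10) + 1, nrow = 4,
                 dimnames = list(paste0("Site", 1:4), paste0("Type", 1:5)))
logged <- log(counts)                            # log transform
pct    <- prop.table(counts, margin = 1) * 100   # row percentages: rows weighted equally
gm     <- exp(rowMeans(log(counts)))             # geometric mean of each row
sized  <- counts / gm                            # size standardization
zs     <- scale(counts)                          # z-scores: columns weighted equally
rowSums(pct)                                     # each row now sums to 100
```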
Initial Decisions 3
• How to measure distance?
– Types of variables
– Goals of the analysis
– If uncertain, try multiple methods
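Trying multiple distance measures is straightforward with `dist()`; the data here are synthetic.

```r
# Trying multiple distance measures on the same (synthetic) data
set.seed(7)
x <- matrix(rnorm(5 * 3), nrow = 5)
d_euc <- dist(x, method = "euclidean")
d_man <- dist(x, method = "manhattan")
# For mixed variable types, cluster::daisy() computes Gower distances
# (assumes the cluster package is available)
```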
Methods of Grouping
• Partitioning Methods – divide the data into groups
• Hierarchical Methods
– Agglomerative – from n clusters to 1 cluster
– Divisive – from 1 cluster to k clusters
Partitioning
• K-Means, K-Medoids, Fuzzy
• Measure of distance – but no need to compute the full distance matrix
• Specify the number of groups in advance
• Minimizes within-group variability
• Finds spherical clusters
Procedure
• Start with centers for k groups (user-supplied or random)
• Repeat up to iter.max times (default 10)
– Allocate rows to their closest center
– Recalculate the center positions
• Stop
• Different criteria for allocation
• Use multiple starts (e.g. 5–15)
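The loop above is exactly what `kmeans()` runs internally; the `nstart` argument repeats it from several random centers and keeps the solution with the lowest within-group sum of squares. A sketch on synthetic two-cluster data:

```r
# kmeans() with multiple random starts (synthetic two-cluster data)
set.seed(5)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 4), ncol = 2))
km1  <- kmeans(x, centers = 2, iter.max = 10, nstart = 1)
km10 <- kmeans(x, centers = 2, iter.max = 10, nstart = 10)
# With more starts the reported tot.withinss is usually no worse,
# since only the best of the repeated runs is kept
km10$tot.withinss
```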
Evaluation 1
• Compute groups for a range of cluster sizes and plot within-group sums of squares to look for sharp increases (which suggest that distinct clusters are being merged)
• Cluster randomized versions of the data and compare the results
• Examine table of statistics by group
Evaluation 2
• Plot groups in two dimensions with PCA, CA, or MDS
• Compare the groups using data or information not included in the analysis
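Plotting the groups in two dimensions can be done by projecting onto the first two principal components and coloring the points by cluster. The data here are synthetic.

```r
# Sketch: project clusters onto the first two principal components
set.seed(3)
x <- rbind(matrix(rnorm(60, mean = 0), ncol = 3),
           matrix(rnorm(60, mean = 3), ncol = 3))
grp <- kmeans(x, centers = 2, nstart = 10)$cluster
pc <- prcomp(x, scale. = TRUE)
plot(pc$x[, 1:2], col = grp, pch = 16,
     xlab = "PC1", ylab = "PC2", main = "Clusters in PCA space")
```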
Partitioning Using R
• Base R includes kmeans() for forming groups by partitioning
• Rcmdr includes KMeans() to iterate kmeans() for the best solution
• The cluster package includes pam(), which uses medoids for more robust grouping, and fanny(), which forms fuzzy clusters
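A short sketch of the two cluster-package alternatives named above, run on synthetic data:

```r
# pam() and fanny() from the cluster package (synthetic data)
library(cluster)
set.seed(9)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 4), ncol = 2))
pm <- pam(x, k = 2)       # medoids: actual rows serve as cluster centers
fz <- fanny(x, k = 2)     # fuzzy: membership weights instead of hard labels
pm$medoids                # the two representative observations
rowSums(fz$membership)    # each row's memberships sum to 1
```

`pam()` is more robust than k-means because its centers are real observations rather than averages, so outliers pull on them less.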
Example
• DarlPoints (not DartPoints) has 4 measurements for 23 Darl points
• Create Z-scores to weight variables equally with Data | Manage variables in active data set | Standardize variables …
• (or could use PCA and PC scores)
Example (cont)
• Use Rcmdr to partition the data into 5, 4, 3, and 2 groups
• Statistics | Dimensional analysis | Cluster analysis | k-means cluster analysis …
• TWSS = 15.42, 19.78, 25.83, 34.24
• Select the group number and have Rcmdr add the group to the data set
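The same workflow can be scripted without the Rcmdr menus: standardize, then run k-means for each group count. Synthetic data of the same shape (23 rows, 4 measurements) stands in for DarlPoints here.

```r
# Scripted version of the Rcmdr workflow (synthetic stand-in for DarlPoints)
set.seed(11)
dat <- matrix(rnorm(23 * 4), ncol = 4)   # 23 points, 4 measurements
z <- scale(dat)                          # z-scores, as in the slides
twss <- sapply(2:5, function(k)
  kmeans(z, centers = k, iter.max = 10, nstart = 10)$tot.withinss)
round(twss, 2)                           # TWSS for k = 2, 3, 4, 5
```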
Evaluation
• Evaluate groups against randomized data
– Randomly permute each variable
– Run k-means
– Compare random and non-random results
• Evaluate groups against external criteria (location, material, age, etc.)
KMPlotWSS <- function(data, ming, maxg) {
  WSS <- sapply(ming:maxg, function(x)
    kmeans(data, centers = x, iter.max = 10, nstart = 10)$tot.withinss)
  plot(ming:maxg, WSS, las = 1, type = "b",
       xlab = "Number of Groups",
       ylab = "Total Within Sum of Squares", pch = 16)
  print(WSS)
}
KMRandWSS <- function(data, samples, min, max) {
  KRand <- function(data, min, max) {
    Rnd <- apply(data, 2, sample)
    sapply(min:max, function(y)
      kmeans(Rnd, y, iter.max = 10, nstart = 5)$tot.withinss)
  }
  Sim <- sapply(1:samples, function(x) KRand(data, min, max))
  t(apply(Sim, 1, quantile,
          c(0, .005, .01, .025, .5, .975, .99, .995, 1)))
}
# Compare data to randomized sets
KMPlotWSS(DarlPoints[, 6:9], 1, 10)
Qtiles <- KMRandWSS(DarlPoints[, 6:9], 2000, 1, 10)
matlines(1:10, Qtiles[, c(1, 5, 9)], lty = c(3, 2, 3), lwd = 2,
         col = "dark gray")
legend("topright", c("Observed", "Median (Random)", "Max/Min Random"),
       col = c("black", "dark gray", "dark gray"),
       lwd = c(1, 2, 2), lty = c(1, 2, 3))
Hierarchical Methods
• Agglomerative – successive merging
• Divisive – successive splitting
– Monothetic – binary data
– Polythetic – interval/ratio
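Polythetic divisive clustering is available as `diana()` in the cluster package (`mona()` is the monothetic version for binary data). A sketch on synthetic data:

```r
# Polythetic divisive clustering with cluster::diana() (synthetic data)
library(cluster)
set.seed(2)
x <- matrix(rnorm(30), ncol = 3)
dv <- diana(x)
dv$dc                             # divisive coefficient; closer to 1 = stronger structure
cutree(as.hclust(dv), k = 2)      # cut the divisive tree into 2 groups
```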
Agglomerative
• At the start all rows are in separate groups (n groups or clusters)
• At each stage two rows are merged, a row and a group are merged, or two groups are merged
• The process stops when all rows are in a single cluster
Agglomeration Methods
• How should clusters be formed?
– Single Linkage – irregularly shaped groups
– Average Linkage – spherical groups
– Complete Linkage – spherical groups
– Ward’s Method – spherical groups
– Median – dendrogram inversions
– Centroid – dendrogram inversions
– McQuitty – similarity by reciprocal pairs
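All of these linkages are selected through the `method` argument of `hclust()`. A sketch on synthetic data; whether the median method actually produces an inversion depends on the data:

```r
# Comparing linkage methods via hclust()'s method argument (synthetic data)
set.seed(4)
x <- matrix(rnorm(40), ncol = 2)
d <- dist(x)
hc_single <- hclust(d, method = "single")
hc_ward   <- hclust(d, method = "ward.D2")   # Ward's method
hc_med    <- hclust(d, method = "median")
any(diff(hc_med$height) < 0)   # checks for dendrogram inversions (decreasing heights)
```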
Agglomerating with R
• Base R includes hclust() for forming groups by agglomeration
• The cluster package includes agnes()
• Rcmdr uses hclust() via Statistics | Dimensional analysis | Cluster analysis | Hierarchical cluster analysis …
HClust
• Rcmdr menus provide
– Cluster analysis and plot
– Summary statistics by group
– Adding cluster to data set
• To get a traditional dendrogram:
– plot(HClust.1, hang=-1, main="Darl Points", xlab="Catalog Number", sub="Method=Ward; Distance=Euclidean")
– rect.hclust(HClust.1, 3)
> summary(as.factor(cutree(HClust.1, k = 3)))   # Cluster Sizes
 1  2  3
11  6  6
> by(model.matrix(~-1 + Z.Length + Z.Thickness + Z.Weight + Z.Width,
    DarlPoints), as.factor(cutree(HClust.1, k = 3)), mean)   # Cluster Centroids
INDICES: 1
  Z.Length Z.Thickness    Z.Weight     Z.Width
-0.1345150  -0.1585615  -0.2523805  -0.1241642
------------------------------------------------------------
INDICES: 2
  Z.Length Z.Thickness    Z.Weight     Z.Width
-1.1085541  -0.9209550  -0.9400026  -0.8200594
------------------------------------------------------------
INDICES: 3
 Z.Length Z.Thickness  Z.Weight   Z.Width
 1.355165    1.211651  1.402700  1.047694
> biplot(princomp(model.matrix(~-1 + Z.Length + Z.Thickness + Z.Weight + Z.Width, DarlPoints)), xlabs = as.character(cutree(HClust.1, k = 3)))
> cbind(HClust.1$merge, HClust.1$height)
      [,1] [,2]       [,3]
 [1,]  -12  -13  0.3983821
 [2,]   -2   -3  0.5112670
 [3,]   -9  -14  0.5247650
 [4,]  -10  -17  0.5572146
 [5,]  -15    3  0.7362171
 [6,]   -1  -11  0.7471874
 [7,]   -6  -18  0.8120594
 [8,]   -7   -8  0.8491895
 [9,]    4    5  0.9841552
[10,]    2    6  1.2150606
[11,]  -19  -21  1.2300507
[12,]    1   10  1.4059158
[13,]  -22   11  1.4963400
[14,]  -16  -20  1.5800167
[15,]   -4    9  1.6195709
[16,]   -5   12  2.1556543
[17,]  -23   13  2.4007863
[18,]    7   14  2.4252670
[19,]    8   17  3.2632812
[20,]   16   18  4.9021149
[21,]   15   20  6.6290417
[22,]   19   21 18.7730146