TRANSCRIPT
Partitional and Hierarchical Based clustering
Lecture 22
Based on slides of Dr. Ikle and Chapter 8 of Tan, Steinbach, Kumar
What is Cluster Analysis?
Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups.
- Intra-cluster distances are minimized
- Inter-cluster distances are maximized

Examples of Clustering
- Biology: kingdom, phylum, class, order, family, genus, and species
- Information Retrieval: search engine query = movie; clusters = reviews, trailers, stars, theaters
- Climate: clusters = regions of similar climate
- Psychology and Medicine: patterns in the spatial or temporal distribution of a disease
- Business: segment customers into groups for marketing activities

Two Reasons for Clustering
- Clustering for understanding (see the examples above)
- Clustering for utility:
  - Summarizing: different algorithms can run faster on a data set summarized by clustering
  - Compression: storing cluster information is more efficient than storing the entire data set (example: quantization)
  - Finding nearest neighbors

Applications of Cluster Analysis
- Understanding: group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations
- Summarization: reduce the size of large data sets (e.g., clustering precipitation in Australia)

CLUSTERING: Introduction
- No class to be predicted
- Groups objects based solely on their attributes
- Objects within a cluster are similar to each other
- Objects in different clusters are dissimilar to each other
- Depends on the similarity measure
- Clustering is unsupervised classification
What constitutes a cluster? The answer is not always clear.
[Figure: grouping of text clusters]
Types of Clustering
Partitional
- Simple division of instances into mutually exclusive (non-overlapping) clusters
- Determine the desired number of clusters
- Iteratively reallocate objects to clusters
Hierarchical
- Clusters are allowed to have (nested) subclusters
- Uses previous clusters to find subclusters
- Types of hierarchical algorithms:
  - Agglomerative (bottom-up): merge smaller clusters
  - Divisive (top-down): divide larger clusters
CLUSTERING: Common Algorithms
- Partitional: K-means, K-medoids
- Density-based: DBSCAN, SNN
K-Means Clustering
- K-means is one of the most common and popular clustering techniques
- Each cluster is associated with a centroid (center point); this is often the mean, and it is the cluster prototype
- Each point is assigned to the cluster with the closest centroid
- The number of clusters, K, must be specified ahead of time
The most common version of k-means minimizes the sum of the squared distances of each point from its cluster center (see page 500 of the text).
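Written out, this objective is SSE = Σ over clusters C_k and points x in C_k of ||x − c_k||². A minimal NumPy sketch of the computation (the function name and arguments are illustrative, not from the slides):

```python
import numpy as np

def kmeans_sse(points, centroids, labels):
    """Sum of squared distances of each point from its assigned cluster center.
    points: (n, d) array; centroids: (k, d) array; labels[i] is point i's cluster."""
    diffs = points - centroids[labels]   # per-point offset from its own center
    return float((diffs ** 2).sum())
```

K-means decreases this quantity (or leaves it unchanged) at every assignment and recentering step, which is why the iteration converges.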
- For a given set of cluster centers, each point should (obviously) be matched to the nearest center
- For a given cluster, the best center is the mean
- The basic algorithm iterates over these two relationships
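This alternation can be sketched directly in NumPy (a minimal illustration, not the textbook's code; names are illustrative):

```python
import numpy as np

def kmeans(points, k, max_iters=100, seed=0):
    """Minimal sketch of the k-means iteration described above.
    points: (n, d) array; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialize: pick k distinct data points as the starting centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 1: match each point to its nearest center
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: the best center for each cluster is the mean of its points
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)])
        # Stop when the centroids no longer move (convergence)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

Note the empty-cluster guard in step 2: if no point is assigned to a centroid, the sketch simply keeps the old centroid; production implementations usually reseed it instead.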
CLUSTERING: K-means Pseudo-code
1. Choose the number of clusters, k
2. Initialize k centroids (randomly, for example)
3. Repeat:
   - Form k clusters by assigning each point to its nearest centroid
   - Recalculate the centroids
   until convergence (centroids move less than some amount)

CLUSTERING: K-means Complexity
- Time: O(n*k*l*d), where n = number of points, k = number of clusters, l = number of iterations, d = number of attributes
- Space: O(k+n)

K-means Disadvantages
- A bad choice for k may yield poor results
- The fixed number of clusters makes it difficult to determine the best value for k
- Dependent upon the choice of initial centroids
- Really only works well for spherical cluster shapes

CLUSTERING: K-means Advantages
- Simple and effective: conceptually, to implement, and to run
- Other algorithms require more parameters to adjust
- Relatively quick

CLUSTERING: K-means Solutions
- One does not always need high-quality clusters
- Use multiple runs to help with the initial centroids and choose the best final result (probability is not your friend)
- Run with different values for k and keep the best result
- Produce a hierarchical clustering: set k=2, then repeat recursively within each cluster

Hierarchical Clustering
- Produces a set of nested clusters organized as a hierarchical tree
- Can be visualized as a dendrogram: a tree-like diagram that records the sequences of merges or splits
Strengths of Hierarchical Clustering
- You do not have to assume any particular number of clusters: any desired number of clusters can be obtained by cutting the dendrogram at the proper level
- The clusters may correspond to meaningful taxonomies (examples in the biological sciences: the animal kingdom, phylogeny reconstruction, etc.)

Hierarchical Clustering
Two main types of hierarchical clustering:
- Agglomerative: start with the points as individual clusters; at each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
- Divisive: start with one, all-inclusive cluster; at each step, split a cluster until each cluster contains a single point (or there are k clusters)
Traditional hierarchical algorithms use a similarity or distance matrix and merge or split one cluster at a time.
Agglomerative Clustering Algorithm
This is the more popular hierarchical clustering technique.
The basic algorithm is straightforward:
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat:
   - Merge the two closest clusters
   - Update the proximity matrix
   until only a single cluster remains

The key operation is the computation of the proximity of two clusters; different approaches to defining the distance between clusters distinguish the different algorithms.

Starting Situation
Start with clusters of individual points and a proximity matrix.
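The basic loop above can be sketched in a few lines of Python. This is an illustrative O(N^3) version, not an optimized implementation; the `linkage` parameter anticipates the inter-cluster proximity definitions (MIN, MAX, group average) discussed on the following slides:

```python
import numpy as np

def agglomerative(points, k, linkage="min"):
    """Minimal sketch of agglomerative clustering: merge the two closest
    clusters until k clusters remain. linkage is "min" (single link),
    "max" (complete link), or "avg" (group average)."""
    clusters = [[i] for i in range(len(points))]   # each point starts as its own cluster
    # Proximity matrix of pairwise point distances
    D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)

    def proximity(a, b):
        pairwise = D[np.ix_(a, b)]                 # distances between the two clusters' points
        return {"min": pairwise.min,
                "max": pairwise.max,
                "avg": pairwise.mean}[linkage]()

    while len(clusters) > k:
        # Find and merge the two closest clusters
        i, j = min(((i, j) for i in range(len(clusters))
                           for j in range(i + 1, len(clusters))),
                   key=lambda ij: proximity(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Because the point-to-point distance matrix `D` is computed once, only the cluster membership lists change as merges proceed, mirroring the "update the proximity matrix" step of the pseudo-code.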
[Figure: proximity matrix over points p1-p5]
Intermediate Situation
After some merging steps, we have some clusters.
[Figure: five clusters C1-C5 and their proximity matrix]
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.
[Figure: clusters C1-C5 and their proximity matrix]
After Merging
The question is: how do we update the proximity matrix?
[Figure: proximity matrix after merging C2 and C5; the row and column for C2 U C5 are unknown, marked "?"]
How to Define Inter-Cluster Similarity
- MIN
- MAX
- Group Average
- Distance Between Centroids
- Other methods driven by an objective function (Ward's Method uses squared error)

Cluster Similarity: MIN or Single Link
Similarity of two clusters is based on the two most similar (closest) points in the different clusters. It is determined by one pair of points, i.e., by one link in the proximity graph.
[Figure: MIN nested clusters and the corresponding dendrogram for points 1-6]
Strength of MIN
[Figure: original points and the two clusters found]
- Can handle non-elliptical shapes

Limitations of MIN
[Figure: original points and the two clusters found]
- Sensitive to noise and outliers

Cluster Similarity: MAX or Complete Linkage
Similarity of two clusters is based on the two least similar (most distant) points in the different clusters. It is determined by all pairs of points in the two clusters.
[Figure: MAX nested clusters and the corresponding dendrogram for points 1-6]

Strength of MAX
[Figure: original points and the two clusters found]
- Less susceptible to noise and outliers

Limitations of MAX
[Figure: original points and the two clusters found]
- Tends to break large clusters
- Biased towards globular clusters

Cluster Similarity: Group Average
Proximity of two clusters is the average of the pairwise proximities between points in the two clusters.
Need to use average connectivity for scalability since total proximity favors large clusters
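In symbols, with m_i points in cluster C_i, the group-average proximity is the total pairwise proximity normalized by the number of pairs:

```latex
\operatorname{proximity}(C_i, C_j)
  = \frac{\sum_{p \in C_i} \sum_{q \in C_j} \operatorname{proximity}(p, q)}{m_i \, m_j}
```

Dividing by m_i m_j is the "average connectivity" the slide refers to: without it, the raw sum grows with cluster size and systematically favors merging large clusters.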
[Figure: group average nested clusters and the corresponding dendrogram for points 1-6]

Hierarchical Clustering: Group Average
A compromise between single and complete link.
Strengths
- Less susceptible to noise and outliers
Limitations
- Biased towards globular clusters

Cluster Similarity: Ward's Method
- Similarity of two clusters is based on the increase in squared error when the two clusters are merged
- Similar to group average if the distance between points is the squared distance
- Less susceptible to noise and outliers
- Biased towards globular clusters
- The hierarchical analogue of K-means; can be used to initialize K-means

Hierarchical Clustering: Comparison
[Figure: MIN, MAX, group average, and Ward's method clusterings of the same six points]

Hierarchical Clustering: Time and Space Requirements
- O(N^2) space, since it uses the proximity matrix (N is the number of points)
- O(N^3) time in many cases: there are N steps, and at each step the N^2-sized proximity matrix must be updated and searched
- Complexity can be reduced to O(N^2 log N) time for some approaches
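Ward's "increase in squared error" has a convenient closed form: merging clusters A and B raises the total SSE by |A||B|/(|A|+|B|) times the squared distance between their means. A small sketch of this identity (illustrative, assuming Euclidean data):

```python
import numpy as np

def ward_merge_cost(A, B):
    """Increase in total within-cluster squared error if clusters A and B
    (each an (n, d) array of points) are merged, via the closed form
    |A||B| / (|A| + |B|) * ||mean(A) - mean(B)||^2."""
    nA, nB = len(A), len(B)
    diff = A.mean(axis=0) - B.mean(axis=0)
    return nA * nB / (nA + nB) * float(diff @ diff)
```

The closed form is why Ward's method is the hierarchical analogue of K-means: both are driven by the same squared-error objective, evaluated on candidate merges rather than reassignments.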
Hierarchical Clustering: Problems and Limitations
- Once a decision is made to combine two clusters, it cannot be undone
- No objective function is directly minimized
- Different schemes have problems with one or more of the following:
  - Sensitivity to noise and outliers
  - Difficulty handling different-sized clusters and convex shapes
  - Breaking large clusters

Example: Discovered Clusters and Industry Groups
1. Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN → Technology1-DOWN
2. Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN → Technology2-DOWN
3. Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN → Financial-DOWN
4. Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP → Oil-UP
Example similarity matrix (items I1-I5):

      I1    I2    I3    I4    I5
I1  1.00  0.90  0.10  0.65  0.20
I2  0.90  1.00  0.70  0.60  0.50
I3  0.10  0.70  1.00  0.40  0.30
I4  0.65  0.60  0.40  1.00  0.80
I5  0.20  0.50  0.30  0.80  1.00