clustering and data mining in r - workshop...
TRANSCRIPT
![Page 1: Clustering and Data Mining in R - Workshop Supplementfaculty.ucr.edu/.../Manuals/Clustering/zzz.pdf · 2011. 5. 14. · Clustering and Data Mining in R Non-Hierarchical Clustering](https://reader033.vdocument.in/reader033/viewer/2022052105/604044fe56f9ac6e0d525ae1/html5/thumbnails/1.jpg)
Clustering and Data Mining in R
Clustering and Data Mining in RWorkshop Supplement
Thomas Girke
October 3, 2010
![Page 2: Clustering and Data Mining in R - Workshop Supplementfaculty.ucr.edu/.../Manuals/Clustering/zzz.pdf · 2011. 5. 14. · Clustering and Data Mining in R Non-Hierarchical Clustering](https://reader033.vdocument.in/reader033/viewer/2022052105/604044fe56f9ac6e0d525ae1/html5/thumbnails/2.jpg)
Clustering and Data Mining in R
Introduction
Data PreprocessingData TransformationsDistance MethodsCluster Linkage
Hierarchical ClusteringApproachesTree Cutting
Non-Hierarchical ClusteringK-MeansPrincipal Component AnalysisMultidimensional ScalingBiclusteringMany Additional Techniques
![Page 3: Clustering and Data Mining in R - Workshop Supplementfaculty.ucr.edu/.../Manuals/Clustering/zzz.pdf · 2011. 5. 14. · Clustering and Data Mining in R Non-Hierarchical Clustering](https://reader033.vdocument.in/reader033/viewer/2022052105/604044fe56f9ac6e0d525ae1/html5/thumbnails/3.jpg)
Clustering and Data Mining in R
Introduction
OutlineIntroduction
Data PreprocessingData TransformationsDistance MethodsCluster Linkage
Hierarchical ClusteringApproachesTree Cutting
Non-Hierarchical ClusteringK-MeansPrincipal Component AnalysisMultidimensional ScalingBiclusteringMany Additional Techniques
![Page 4: Clustering and Data Mining in R - Workshop Supplementfaculty.ucr.edu/.../Manuals/Clustering/zzz.pdf · 2011. 5. 14. · Clustering and Data Mining in R Non-Hierarchical Clustering](https://reader033.vdocument.in/reader033/viewer/2022052105/604044fe56f9ac6e0d525ae1/html5/thumbnails/4.jpg)
Clustering and Data Mining in R
Introduction
What is Clustering?
I Clustering is the classification of data objects into similaritygroups (clusters) according to a defined distance measure.
I It is used in many fields, such as machine learning, datamining, pattern recognition, image analysis, genomics,systems biology, etc.
![Page 5: Clustering and Data Mining in R - Workshop Supplementfaculty.ucr.edu/.../Manuals/Clustering/zzz.pdf · 2011. 5. 14. · Clustering and Data Mining in R Non-Hierarchical Clustering](https://reader033.vdocument.in/reader033/viewer/2022052105/604044fe56f9ac6e0d525ae1/html5/thumbnails/5.jpg)
Clustering and Data Mining in R
Introduction
Why Clustering and Data Mining in R?
I Efficient data structures and functions for clustering.
I Efficient environment for algorithm prototyping andbenchmarking.
I Comprehensive set of clustering and machine learning libraries.
I Standard for data analysis in many areas.
![Page 6: Clustering and Data Mining in R - Workshop Supplementfaculty.ucr.edu/.../Manuals/Clustering/zzz.pdf · 2011. 5. 14. · Clustering and Data Mining in R Non-Hierarchical Clustering](https://reader033.vdocument.in/reader033/viewer/2022052105/604044fe56f9ac6e0d525ae1/html5/thumbnails/6.jpg)
Clustering and Data Mining in R
Data Preprocessing
OutlineIntroduction
Data PreprocessingData TransformationsDistance MethodsCluster Linkage
Hierarchical ClusteringApproachesTree Cutting
Non-Hierarchical ClusteringK-MeansPrincipal Component AnalysisMultidimensional ScalingBiclusteringMany Additional Techniques
![Page 7: Clustering and Data Mining in R - Workshop Supplementfaculty.ucr.edu/.../Manuals/Clustering/zzz.pdf · 2011. 5. 14. · Clustering and Data Mining in R Non-Hierarchical Clustering](https://reader033.vdocument.in/reader033/viewer/2022052105/604044fe56f9ac6e0d525ae1/html5/thumbnails/7.jpg)
Clustering and Data Mining in R
Data Preprocessing
Data Transformations
Data Transformations Choice depends on data set!
I Center & standardize1. Center: subtract from each vector its mean2. Standardize: devide by standard deviation
⇒ Mean = 0 and STDEV = 1I Center & scale with the scale() fuction
1. Center: subtract from each vector its mean2. Scale: divide centered vector by their root mean square (rms)
xrms =
√√√√ 1
n − 1
n∑i=1
xi2
⇒ Mean = 0 and STDEV = 1
I Log transformation
I Rank transformation: replace measured values by ranks
I No transformation
![Page 8: Clustering and Data Mining in R - Workshop Supplementfaculty.ucr.edu/.../Manuals/Clustering/zzz.pdf · 2011. 5. 14. · Clustering and Data Mining in R Non-Hierarchical Clustering](https://reader033.vdocument.in/reader033/viewer/2022052105/604044fe56f9ac6e0d525ae1/html5/thumbnails/8.jpg)
Clustering and Data Mining in R
Data Preprocessing
Distance Methods
Distance Methods List of most common ones!
I Euclidean distance for two profiles X and Y
d(X ,Y ) =
√√√√ n∑i=1
(xi − yi )2
Disadvantages: not scale invariant, not for negative correlations
I Maximum, Manhattan, Canberra, binary, Minowski, ...
I Correlation-based distance: 1− r
I Pearson correlation coefficient (PCC)
r =n
∑ni=1 xiyi −
∑ni=1 xi
∑ni=1 yi√
(∑n
i=1 x2i − (
∑ni=1 xi )2)(
∑ni=1 y2
i − (∑n
i=1 yi )2)
Disadvantage: outlier sensitiveI Spearman correlation coefficient (SCC)
Same calculation as PCC but with ranked values!
![Page 9: Clustering and Data Mining in R - Workshop Supplementfaculty.ucr.edu/.../Manuals/Clustering/zzz.pdf · 2011. 5. 14. · Clustering and Data Mining in R Non-Hierarchical Clustering](https://reader033.vdocument.in/reader033/viewer/2022052105/604044fe56f9ac6e0d525ae1/html5/thumbnails/9.jpg)
Clustering and Data Mining in R
Data Preprocessing
Cluster Linkage
Cluster Linkage
Single Linkage
Complete Linkage
Average Linkage
![Page 10: Clustering and Data Mining in R - Workshop Supplementfaculty.ucr.edu/.../Manuals/Clustering/zzz.pdf · 2011. 5. 14. · Clustering and Data Mining in R Non-Hierarchical Clustering](https://reader033.vdocument.in/reader033/viewer/2022052105/604044fe56f9ac6e0d525ae1/html5/thumbnails/10.jpg)
Clustering and Data Mining in R
Hierarchical Clustering
OutlineIntroduction
Data PreprocessingData TransformationsDistance MethodsCluster Linkage
Hierarchical ClusteringApproachesTree Cutting
Non-Hierarchical ClusteringK-MeansPrincipal Component AnalysisMultidimensional ScalingBiclusteringMany Additional Techniques
![Page 11: Clustering and Data Mining in R - Workshop Supplementfaculty.ucr.edu/.../Manuals/Clustering/zzz.pdf · 2011. 5. 14. · Clustering and Data Mining in R Non-Hierarchical Clustering](https://reader033.vdocument.in/reader033/viewer/2022052105/604044fe56f9ac6e0d525ae1/html5/thumbnails/11.jpg)
Clustering and Data Mining in R
Hierarchical Clustering
Hierarchical Clustering Steps
1. Identify clusters (items) with closest distance
2. Join them to new clusters
3. Compute distance between clusters (items)
4. Return to step 1
![Page 12: Clustering and Data Mining in R - Workshop Supplementfaculty.ucr.edu/.../Manuals/Clustering/zzz.pdf · 2011. 5. 14. · Clustering and Data Mining in R Non-Hierarchical Clustering](https://reader033.vdocument.in/reader033/viewer/2022052105/604044fe56f9ac6e0d525ae1/html5/thumbnails/12.jpg)
Clustering and Data Mining in R
Hierarchical Clustering
Hierarchical Clustering Agglomerative Approach
g1 g2 g3 g4 g50.1
g1 g2 g3 g4 g50.1
0.4
g1 g2 g3 g4 g50.1
0.4
0.6
0.5
(a)
(b)
(c)
![Page 13: Clustering and Data Mining in R - Workshop Supplementfaculty.ucr.edu/.../Manuals/Clustering/zzz.pdf · 2011. 5. 14. · Clustering and Data Mining in R Non-Hierarchical Clustering](https://reader033.vdocument.in/reader033/viewer/2022052105/604044fe56f9ac6e0d525ae1/html5/thumbnails/13.jpg)
Clustering and Data Mining in R
Hierarchical Clustering
Approaches
Hierarchical Clustering Approaches
1. Agglomerative approach (bottom-up)
hclust() and agnes()
2. Divisive approach (top-down)
diana()
![Page 14: Clustering and Data Mining in R - Workshop Supplementfaculty.ucr.edu/.../Manuals/Clustering/zzz.pdf · 2011. 5. 14. · Clustering and Data Mining in R Non-Hierarchical Clustering](https://reader033.vdocument.in/reader033/viewer/2022052105/604044fe56f9ac6e0d525ae1/html5/thumbnails/14.jpg)
Clustering and Data Mining in R
Hierarchical Clustering
Tree Cutting
Tree Cutting to Obtain Discrete Clusters
1. Node height in tree
2. Number of clusters
3. Search tree nodes by distance cutoff
![Page 15: Clustering and Data Mining in R - Workshop Supplementfaculty.ucr.edu/.../Manuals/Clustering/zzz.pdf · 2011. 5. 14. · Clustering and Data Mining in R Non-Hierarchical Clustering](https://reader033.vdocument.in/reader033/viewer/2022052105/604044fe56f9ac6e0d525ae1/html5/thumbnails/15.jpg)
Clustering and Data Mining in R
Non-Hierarchical Clustering
OutlineIntroduction
Data PreprocessingData TransformationsDistance MethodsCluster Linkage
Hierarchical ClusteringApproachesTree Cutting
Non-Hierarchical ClusteringK-MeansPrincipal Component AnalysisMultidimensional ScalingBiclusteringMany Additional Techniques
![Page 16: Clustering and Data Mining in R - Workshop Supplementfaculty.ucr.edu/.../Manuals/Clustering/zzz.pdf · 2011. 5. 14. · Clustering and Data Mining in R Non-Hierarchical Clustering](https://reader033.vdocument.in/reader033/viewer/2022052105/604044fe56f9ac6e0d525ae1/html5/thumbnails/16.jpg)
Clustering and Data Mining in R
Non-Hierarchical Clustering
Non-Hierarchical Clustering
Selected Examples
![Page 17: Clustering and Data Mining in R - Workshop Supplementfaculty.ucr.edu/.../Manuals/Clustering/zzz.pdf · 2011. 5. 14. · Clustering and Data Mining in R Non-Hierarchical Clustering](https://reader033.vdocument.in/reader033/viewer/2022052105/604044fe56f9ac6e0d525ae1/html5/thumbnails/17.jpg)
Clustering and Data Mining in R
Non-Hierarchical Clustering
K-Means
K-Means Clustering
1. Choose the number of k clusters
2. Randomly assign items to the k clusters
3. Calculate new centroid for each of the k clusters
4. Calculate the distance of all items to the k centroids
5. Assign items to closest centroid
6. Repeat until clusters assignments are stable
![Page 18: Clustering and Data Mining in R - Workshop Supplementfaculty.ucr.edu/.../Manuals/Clustering/zzz.pdf · 2011. 5. 14. · Clustering and Data Mining in R Non-Hierarchical Clustering](https://reader033.vdocument.in/reader033/viewer/2022052105/604044fe56f9ac6e0d525ae1/html5/thumbnails/18.jpg)
Clustering and Data Mining in R
Non-Hierarchical Clustering
K-Means
K-Means
X
X
X
XX
X
X
X
X
(a)
(b)
(c)
![Page 19: Clustering and Data Mining in R - Workshop Supplementfaculty.ucr.edu/.../Manuals/Clustering/zzz.pdf · 2011. 5. 14. · Clustering and Data Mining in R Non-Hierarchical Clustering](https://reader033.vdocument.in/reader033/viewer/2022052105/604044fe56f9ac6e0d525ae1/html5/thumbnails/19.jpg)
Clustering and Data Mining in R
Non-Hierarchical Clustering
Principal Component Analysis
Principal Component Analysis (PCA)
Principal components analysis (PCA) is a data reduction techniquethat allows to simplify multidimensional data sets to 2 or 3dimensions for plotting purposes and visual variance analysis.
![Page 20: Clustering and Data Mining in R - Workshop Supplementfaculty.ucr.edu/.../Manuals/Clustering/zzz.pdf · 2011. 5. 14. · Clustering and Data Mining in R Non-Hierarchical Clustering](https://reader033.vdocument.in/reader033/viewer/2022052105/604044fe56f9ac6e0d525ae1/html5/thumbnails/20.jpg)
Clustering and Data Mining in R
Non-Hierarchical Clustering
Principal Component Analysis
Basic PCA Steps
I Center (and standardize) dataI First principal component axis
I Accross centroid of data cloudI Distance of each point to that line is minimized, so that it
crosses the maximum variation of the data cloud
I Second principal component axisI Orthogonal to first principal componentI Along maximum variation in the data
I 1st PCA axis becomes x-axis and 2nd PCA axis y-axis
I Continue process until the necessary number of principalcomponents is obtained
![Page 21: Clustering and Data Mining in R - Workshop Supplementfaculty.ucr.edu/.../Manuals/Clustering/zzz.pdf · 2011. 5. 14. · Clustering and Data Mining in R Non-Hierarchical Clustering](https://reader033.vdocument.in/reader033/viewer/2022052105/604044fe56f9ac6e0d525ae1/html5/thumbnails/21.jpg)
Clustering and Data Mining in R
Non-Hierarchical Clustering
Principal Component Analysis
PCA on Two-Dimensional Data Set
1st
2nd
1st
2nd
![Page 22: Clustering and Data Mining in R - Workshop Supplementfaculty.ucr.edu/.../Manuals/Clustering/zzz.pdf · 2011. 5. 14. · Clustering and Data Mining in R Non-Hierarchical Clustering](https://reader033.vdocument.in/reader033/viewer/2022052105/604044fe56f9ac6e0d525ae1/html5/thumbnails/22.jpg)
Clustering and Data Mining in R
Non-Hierarchical Clustering
Principal Component Analysis
Identifies the Amount of Variability between Components
Example
Principal Component 1st 2nd 3rd OtherProportion of Variance 62% 34% 3% rest
1st and 2nd principal components explain 96% of variance.
![Page 23: Clustering and Data Mining in R - Workshop Supplementfaculty.ucr.edu/.../Manuals/Clustering/zzz.pdf · 2011. 5. 14. · Clustering and Data Mining in R Non-Hierarchical Clustering](https://reader033.vdocument.in/reader033/viewer/2022052105/604044fe56f9ac6e0d525ae1/html5/thumbnails/23.jpg)
Clustering and Data Mining in R
Non-Hierarchical Clustering
Multidimensional Scaling
Multidimensional Scaling (MDS)
I Alternative dimensionality reduction approach
I Represents distances in 2D or 3D space
I Starts from distance matrix (PCA uses data points)
![Page 24: Clustering and Data Mining in R - Workshop Supplementfaculty.ucr.edu/.../Manuals/Clustering/zzz.pdf · 2011. 5. 14. · Clustering and Data Mining in R Non-Hierarchical Clustering](https://reader033.vdocument.in/reader033/viewer/2022052105/604044fe56f9ac6e0d525ae1/html5/thumbnails/24.jpg)
Clustering and Data Mining in R
Non-Hierarchical Clustering
Biclustering
BiclusteringFinds in matrix subgroups of rows and columns which are as similar aspossible to each other and as different as possible to the remaining datapoints.
Unclustered ⇒ Clustered
![Page 25: Clustering and Data Mining in R - Workshop Supplementfaculty.ucr.edu/.../Manuals/Clustering/zzz.pdf · 2011. 5. 14. · Clustering and Data Mining in R Non-Hierarchical Clustering](https://reader033.vdocument.in/reader033/viewer/2022052105/604044fe56f9ac6e0d525ae1/html5/thumbnails/25.jpg)
Clustering and Data Mining in R
Non-Hierarchical Clustering
Many Additional Techniques
Remember: There Are Many Additional Techniques!
Continue with R manual section:”Clustering and Data Mining”