Slide 1: CS 277, Data Mining: Exploratory Data Analysis
Padhraic Smyth, Department of Computer Science, Bren School of Information and Computer Sciences, University of California, Irvine

Slide 2: Outline of Today's Lecture
- Assignment 1: questions?
- Overview of exploratory data analysis
- Analyzing single variables
- Analyzing pairs of variables
- Higher-dimensional visualization techniques
- Dimension reduction methods
- Clustering methods

Slide 3: Exploratory Data Analysis: Single Variables

Slide 4: Summary Statistics
- Mean: center of data
- Mode: location of highest data density
- Variance: spread of data
- Skew: indication of non-symmetry
- Range: max - min
- Median: 50% of values below, 50% above
- Quantiles: e.g., values such that 25%, 50%, 75% of the data are smaller
- Note that some of these statistics can be misleading: e.g., the mean of data with 2 clusters may lie in a region containing no data at all

Slide 5: Histogram of Unimodal Data
1,000 data points simulated from a Normal distribution, mean 10, variance 1; 30 bins

Slide 6: Histograms: Unimodal Data
100 data points from a Normal distribution, mean 10, variance 1, with 5, 10, and 30 bins

Slide 7: Histogram of Multimodal Data
15,000 data points simulated from a mixture of 3 Normal distributions; 300 bins

Slide 8: Histogram of Multimodal Data
(same data and figure as Slide 7)

Slide 9: Skewed Data
5,000 data points simulated from an exponential distribution; 100 bins

Slide 10: Another Skewed Data Set
10,000 data points simulated from a mixture of 2 exponentials; 100 bins

Slide 11: The Same Skewed Data after Taking Logs (base 10)
10,000 data points simulated from a mixture of 2 exponentials; 100 bins

Slide 12: What will the mean or median tell us about this data?

Slide 13: Histogram
- Most common form: split the data range into equal-sized bins
- For each bin, count the number of points from the data set that fall into the bin
- Vertical axis: frequency or counts; horizontal axis: variable values

Slide 14: Issues with Histograms
- For small data sets, histograms can be misleading: small changes in the data or in the bin boundaries can result in very different histograms
- For large data sets, histograms can be quite effective at illustrating general properties of the distribution
- Histograms can be smoothed using a variety of techniques, e.g., kernel density estimation
- Histograms effectively only work with 1 variable at a time: they are difficult to extend to 2 dimensions and not possible beyond that, so histograms tell us nothing about the relationships among variables

Slide 15: US Zipcode Data: Population by Zipcode
(figure: histograms of the same data with K = 50 and K = 500 bins)

Slide 16: Histogram with Outliers
(figure: histogram of X values vs. number of individuals; Pima Indians Diabetes data, from the UC Irvine Machine Learning Repository)
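The bin-sensitivity issue from Slides 5-6 and 14 is easy to reproduce. Below is a minimal sketch (my own illustration, not course code; numpy and matplotlib assumed) that draws the same 100 Normal(10, 1) points with 5, 10, and 30 bins; with a sample this small, the three histograms can look quite different.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(loc=10, scale=1, size=100)  # 100 points: mean 10, variance 1

# Same data under three binnings: with small samples the shapes differ a lot
fig, axes = plt.subplots(1, 3, figsize=(12, 3), sharex=True)
for ax, bins in zip(axes, [5, 10, 30]):
    ax.hist(x, bins=bins)
    ax.set_title(f"{bins} bins")
axes[0].set_ylabel("count")
plt.tight_layout()
plt.show()
```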
Slide 17: Histogram with Outliers
Blood pressure = 0? (figure: histogram of diastolic blood pressure vs. number of individuals; Pima Indians Diabetes data, from the UC Irvine Machine Learning Repository)

Slide 18: Box Plots: Pima Indians Diabetes Data
Two side-by-side box plots of body mass index, one for healthy individuals and one for diabetic individuals, from the Pima Indians Diabetes data set

Slide 19: Box Plots: Pima Indians Diabetes Data
- Box = middle 50% of the data, from Q1 to Q3, with Q2 (the median) marked inside
- Upper and lower whiskers extend up to 1.5 x (Q3 - Q1) beyond the box
- All data points outside the whiskers are plotted individually
(same two side-by-side box plots as Slide 18)

Slide 20: Box Plots: Pima Indians Diabetes Data
(figure: healthy vs. diabetic box plots for diastolic blood pressure, 24-hour serum insulin, plasma glucose concentration, and body mass index)

Slide 21: Exploratory Data Analysis: Tools for Displaying Pairs of Variables

Slide 22: Relationships between Pairs of Variables
- Say we have a variable Y that we want to predict and many variables X that we could use to predict Y
- In exploratory data analysis we may want to find out quickly whether a particular X variable is potentially useful for predicting Y
- Options?
  - Linear correlation: rho(X, Y) = E[(X - mu_X)(Y - mu_Y)] / (sigma_X sigma_Y), which lies between -1 and +1
  - Scatter plot: plot Y versus X

Slides 23-25: Examples of X-Y Plots and Linear Correlation Values
- Linear dependence vs. non-linear dependence
- Lack of linear correlation does not imply lack of dependence

Slide 26: Anscombe, Francis (1973), "Graphs in Statistical Analysis," The American Statistician, pp. 195-199

Slide 27: Guess the Linear Correlation Values for Each Data Set
(Anscombe, 1973)

Slide 28: Actual Correlation Values
Correlation = 0.82 for every one of the four data sets (Anscombe, 1973)

Slide 29: Summary Statistics for Each Data Set
Data sets 1 through 4 all share the same summary statistics: N = 11, mean of X = 9.0, mean of Y = 7.5, intercept = 3, slope = 0.5, correlation = 0.82 (Anscombe, 1973)
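Anscombe's point is worth verifying by hand: the statistics on Slide 29 are easy to compute, and identical values can hide completely different shapes. A small sketch (my own synthetic data standing in for Anscombe's actual quartet; numpy assumed):

```python
import numpy as np

def summary(x, y):
    """The statistics compared on Slide 29: means, fitted line, correlation."""
    slope, intercept = np.polyfit(x, y, deg=1)   # least-squares line
    r = np.corrcoef(x, y)[0, 1]                  # linear correlation, in [-1, +1]
    return dict(n=len(x), mean_x=x.mean(), mean_y=y.mean(),
                slope=slope, intercept=intercept, corr=r)

rng = np.random.default_rng(1)
x = rng.uniform(4, 14, size=11)
y_line = 3 + 0.5 * x + rng.normal(0, 1, size=11)   # noisy straight line

q = (x - x.mean()) ** 2
y_curve = 3 + 0.5 * x - 0.4 * (q - q.mean())       # pure curvature, centered so
                                                   # means and slope stay close

print(summary(x, y_line))
print(summary(x, y_curve))   # similar means and slope; only a plot shows the curvature
```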
Slide 30: Conclusions So Far?
- Summary statistics are useful... up to a point
- Linear correlation measures can be misleading
- There really is no substitute for plotting/visualizing the data

Slide 31: Scatter Plot: No Apparent Relationship

Slide 32: Scatter Plot: Linear Relationship

Slide 33: Scatter Plot: Quadratic Relationship

Slide 34: Constant Variance Relationship
The variation of Y does not depend on X

Slide 35: Increasing Variance
The variation in Y differs depending on the value of X; e.g., Y = annual tax paid, X = income

Slide 36: (figure from US Zip code data: each point = 1 Zip code; units = dollars)

Slide 37: Problems with Scatter Plots of Large Data
- 96,000 bank loan applicants
- Appearance: later applicants are older; reality: a downward slope (more applications, more variance)
- The scatter plot degrades into a black smudge...

Slide 38: Problems with Scatter Plots of Large Data
(same 96,000-applicant figure as Slide 37)

Slide 39: Contour Plots Can Help
- Same 96,000 bank loan applications as before
- Recall the unimodal, skewed histogram: the contour plot shows that the apparent change in the variance of y with x is in fact due to horizontal skew in the density

Slide 40: Summary on Exploration/Visualization
- It is always useful and worthwhile to visualize data:
  - the human visual system is excellent at pattern recognition
  - it gives us a general idea of how the data are distributed, e.g., extreme skew
  - we can detect obvious outliers and errors in the data
  - we gain a general understanding of low-dimensional properties
- Many different visualization techniques exist
- Limitations:
  - generally only useful up to 3 or 4 dimensions
  - massive data: there are only so many pixels on a screen, but subsampling is useful

Slide 41: Exploratory Data Analysis: Tools for Displaying More than 2 Variables

Slide 42: Multivariate Visualization
- Multivariate -> multiple variables
- 2 variables: scatter plots, etc.
- 3 variables: 3-dimensional plots
  - look impressive, but are often not used; can be cognitively challenging to interpret
  - alternative: overlay color-coding (e.g., for categorical data) on a 2-d scatter plot
- 4 variables: 3-d plus color or time; can be effective in certain situations, but tricky
- Higher dimensions: generally difficult
  - scatter plots, icon plots, parallel coordinates: all have weaknesses
  - alternative: map the data to lower dimensions, e.g., PCA or multidimensional scaling
  - main problem: high-dimensional structure may not be apparent in low-dimensional views

Slide 43: Scatter Plot Matrix
For interactive visualization, the concept of linked plots is generally useful (see the sketch below)
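A scatter plot matrix is straightforward to generate with pandas; the sketch below (my own synthetic example; pandas and matplotlib assumed) shows linear, quadratic, and absent pairwise relationships in one figure. For the 96,000-point "black smudge" of Slides 37-39, a common fix is to replace a panel with a 2-d density view such as matplotlib's hexbin.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(2)
n = 500
df = pd.DataFrame({"x1": rng.normal(0, 1, n)})
df["x2"] = 2 * df["x1"] + rng.normal(0, 0.5, n)    # linearly related to x1
df["x3"] = df["x1"] ** 2 + rng.normal(0, 0.3, n)   # quadratically related to x1
df["x4"] = rng.normal(0, 1, n)                     # unrelated to everything

# All pairwise scatter plots at once, with histograms on the diagonal
scatter_matrix(df, figsize=(8, 8), diagonal="hist", alpha=0.3)
plt.show()
```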
Slide 44: Using Icons to Encode Information, e.g., Star Plots
- Each star represents a single observation; star plots are used to examine the relative values of the variables for a single data point
- The star plot consists of a sequence of equi-angular spokes, called radii, with each spoke representing one of the variables
- Useful for small data sets with up to 10 or so variables
- Limitations? Small data sets and small dimensionality only; the ordering of the variables may affect perception
(figure legend: 1 price; 2 mileage (MPG); 3 1978 repair record (1 = worst, 5 = best); 4 1977 repair record (1 = worst, 5 = best); 5 headroom; 6 rear seat room; 7 trunk space; 8 weight; 9 length)

Slide 45: Chernoff Faces
- Each face is described by ten facial characteristic parameters: head eccentricity, eye eccentricity, pupil size, eyebrow slant, nose size, mouth shape, eye spacing, eye size, mouth length, and degree of mouth opening
- Limitations? Only up to 10 or so dimensions, and certain variables are overemphasized because of our perceptual biases

Slide 46: Parallel Coordinates Method
- Interactive brushing is useful for seeing distinctions
- Epileptic seizure data: 1 of the n cases is brushed, drawn with a darker line so that it stands out from the n-1 other cases
(see the code sketch following Slide 56)

Slide 47: A More Elaborate Parallel Coordinates Example (from E. Wegman, 1999)
- 12,000 bank customers with 8 variables
- An additional dependent variable is profit (green for positive, red for negative)

Slide 48: Interactive Grand Tour Techniques
- The Grand Tour idea (Asimov, 1985): cycle continuously through multiple projections of the data
- Cycles through all possible projections (depending on time constraints)
- Projections can typically be 1-, 2-, or 3-d (often 2-d)
- Can be linked with scatter plot matrices (see the following example)
- Example on the following 2 slides: 7-dimensional physics data, color-coded by group, shown with (a) a standard scatter plot matrix and (b) 2 static snapshots of a grand tour

Slides 49-50: (figures)

Slide 51: Exploratory Data Analysis: Visualizing Time-Series Data

Slide 52: Time-Series Data: Example 1
- Historical data on millions of miles flown by UK airline passengers
- Note a number of different systematic effects: a steady growth trend, New Year bumps, summer peaks, and summer double peaks (favoring early or late summer)

Slide 53: Time-Series Data: Example 2
- Experimental study: does more milk lead to better health?
- 20,000 children: 5,000 given raw milk, 5,000 given pasteurized milk, 10,000 controls (no supplement)
- Plot: mean weight vs. mean age for the 10,000-child control group (data from a study of weight measurements over time of children in Scotland)
- We would expect a smooth weight-growth curve; instead the plot shows an unexpected stepped pattern that is not apparent from the raw data table. Why do the children appear to grow in spurts?

Slide 54: Time-Series Data: Example 3 (Google Trends)
Search query = "whiskey"

Slide 55: Time-Series Data: Example 4 (Google Trends)
Search query = "NSA"

Slide 56: Spatial Distribution of the Same Data (Google Trends)
Search query = "whiskey"
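Parallel coordinates (Slide 46) are also available out of the box in pandas. A minimal sketch on synthetic two-group data (my own example; the column and group names are arbitrary):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

rng = np.random.default_rng(3)
a = pd.DataFrame(rng.normal(0, 1, size=(30, 4)), columns=list("wxyz")).assign(group="A")
b = pd.DataFrame(rng.normal(2, 1, size=(30, 4)), columns=list("wxyz")).assign(group="B")
df = pd.concat([a, b], ignore_index=True)

# One polyline per observation; coloring by group makes the clusters stand apart
parallel_coordinates(df, class_column="group", alpha=0.4)
plt.show()
```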
Slide 57: Non-Stationarity in Temporal Data
- Stationarity (loose definition): a probability distribution p(x | t) is stationary with respect to t if p(x | t) = p(x) for all t, where x is the set of variables of interest and t is some other varying quantity (usually t = time, but it could represent spatial information, group information, etc.)
- Examples:
  - p(patient demographics today) = p(patient demographics 10 years ago)?
  - p(weights in Scotland) = p(weights in the US)?
  - p(income of customers in Bank 1) = p(income of customers in Bank 2)?
- Non-stationarity is common in real data sets
- Solutions? Model the non-stationarity (e.g., an increasing trend over time) and extrapolate, or build the model only on the most recent/most similar data

Slide 58: Exploratory Data Analysis: Cluster Analysis

Slide 59: Clustering
- Automated detection of group structure in data
- Typically: partition the N data points into K groups (clusters) such that the points in each group are more similar to each other than to points in other groups
- A descriptive technique (contrast with predictive techniques)
- For real-valued vectors, clusters can be thought of as clouds of points in d-dimensional space

Slide 60: Clustering
Sometimes easy, sometimes impossible, and sometimes in between

Slide 61: Why is Clustering Useful?
- Discovery of new knowledge from data
  - contrast with supervised classification, where the labels are known
  - there is a long history in the sciences of categories, taxonomies, etc.
- Can be very useful for summarizing large data sets (large n and/or high dimensionality)
- Applications of clustering:
  - clustering the documents produced by a search engine
  - segmentation of patients in a medical study
  - discovery of new types of galaxies in astronomical data
  - clustering of genes with similar expression profiles
  - clustering pixels in an image into regions of similar intensity
  - ... and many more

Slide 62: General Issues in Clustering
- Clustering algorithm = representation + score + optimization
- Cluster representation: what shapes of clusters are we looking for? What defines a cluster?
- Score: a clustering is an assignment of the n objects to K clusters; the score is the quantitative criterion used to evaluate different clusterings
- Optimization and search: finding the optimal (minimal/maximal score) clustering is typically NP-hard, so greedy algorithms that optimize the score are widely used

Slide 63: Other Issues in Clustering
- The distance function d[x(i), x(j)] is a critical aspect of clustering, both for distances between individual pairs of objects and for distances of individual objects from clusters (see the sketch below)
- How is K, the number of clusters, selected?
- Different types of data: real-valued versus categorical
- Input data: N vectors, or an N x N distance matrix?
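Since the distance function is so central, here is a small illustration (mine, not from the slides) of two common choices, plus construction of the N x N distance matrix that Slide 63 mentions as a possible input format:

```python
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    return np.sum(np.abs(x - y))

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 3.0])
print(euclidean(x, y), manhattan(x, y))   # 2.236..., 3.0

# Full N x N Euclidean distance matrix via broadcasting
X = np.random.default_rng(4).normal(size=(5, 3))
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
print(D.shape)   # (5, 5), zeros on the diagonal
```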
Slide 64: Different Types of Clustering Algorithms
- Partition-based clustering: represent the points as vectors and partition them into clusters based on distance in d-dimensional space
- Probabilistic model-based clustering: e.g., mixture models
  (both of the above work with measurement data, e.g., feature vectors)
- Hierarchical clustering: builds a tree (dendrogram), starting from an N x N distance matrix
- Graph-based clustering: represent the inter-point distances via a graph and apply graph algorithms

Slide 65: The K-Means Clustering Algorithm
- Input: N vectors x of dimension d, and K = the number of clusters required (K > 1)
- Output: K cluster centers c(1), ..., c(K), each of dimension d, and a list of cluster assignments (values 1 to K) for each of the N input vectors

Slide 66: Squared Errors and Cluster Centers
Squared error (distance) between a data point x and a cluster center c, where the sum runs over the d components/dimensions of the vectors:
  d[x, c] = sum_j (x_j - c_j)^2

Slide 67: Squared Errors and Cluster Centers
Total squared error between a cluster center c(k) and the N_k points assigned to that cluster, where the sum runs over those N_k points:
  S_k = sum_i d[x(i), c(k)]

Slide 68: Squared Errors and Cluster Centers
Total squared error summed across the K clusters, where the sum runs over the K clusters:
  S = sum_k S_k

Slide 69: K-Means Objective Function
- K-means minimizes the total squared error: find the K cluster centers c(k), and the assignments, that minimize
  S = sum_k S_k = sum_k ( sum_i d[x(i), c(k)] )
- To make the total squared error smallest, K-means places the cluster centers strategically so as to "cover" the data
- This is similar to data compression (K-means is in fact used in data compression algorithms)

Slide 70: Example of Running K-Means
(figure)

Slides 71-74: Example
- Initially: MSE cluster 1 = 1.31, MSE cluster 2 = 3.21, overall MSE = 2.57
- After an update: MSE cluster 1 = 1.01, MSE cluster 2 = 1.76, overall MSE = 1.38
- Then: MSE cluster 1 = 0.84, MSE cluster 2 = 1.28, overall MSE = 1.05
- Finally: MSE cluster 1 = 0.84, MSE cluster 2 = 1.28, overall MSE = 1.04

Slide 75: K-Means Algorithm
- Select the initial K centers randomly, e.g., pick K of the N input vectors at random
- Iterate:
  - Assignment step: assign each of the N input vectors to its closest center
  - Update step: compute the updated centers as the average of the vectors assigned to each cluster: new c(k) = (1/N_k) sum_i x(i), where the sum runs over the N_k points assigned to cluster k
  - Convergence check: if every new c(k) equals the old c(k), terminate; otherwise iterate again
(a minimal implementation is sketched below)
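A minimal numpy sketch of this procedure (an illustration of the algorithm as stated on Slide 75, not a reference implementation; the function name and random seeding are my own choices):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """K-means: returns (centers, assignments, total squared error S)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]   # K random input vectors
    for _ in range(n_iter):
        # Assignment step: squared Euclidean distance to every center (N x K)
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        assign = d.argmin(axis=1)
        # Update step: each center becomes the mean of the vectors it owns
        new_centers = np.array([X[assign == k].mean(axis=0) if (assign == k).any()
                                else centers[k] for k in range(K)])
        if np.allclose(new_centers, centers):   # convergence: no center moved
            break
        centers = new_centers
    S = d[np.arange(len(X)), assign].sum()      # S = sum_k S_k, the total squared error
    return centers, assign, S
```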
Slide 76: K-Means
1. Ask the user how many clusters they'd like (e.g., K = 5)
(example courtesy of Andrew Moore, CMU)

Slide 77: K-Means
2. Randomly guess K cluster center locations

Slide 78: K-Means
3. Each data point finds out which center it is closest to
4. Thus each center "owns" a set of data points

Slide 79: K-Means
4. Each center finds the centroid of the points it owns

Slide 80: K-Means
5. New centers => new boundaries
6. Repeat until no change

Slide 81: Properties of the K-Means Algorithm
- Time complexity? O(N K d) per iteration; this is good: linear in each input parameter
- Convergence? Does K-means always converge to the best possible solution? No: it always converges to *some* solution, but not necessarily the best one; the result depends on the starting point

Slides 82-85: Local Search and Local Minima
(figures: the total squared error (TSE) as a function of the cluster configuration; the K-means objective is hard (non-convex), with a single global minimum and multiple local minima)

Slide 86: Issues with K-Means Clustering
- Simple, but useful:
  - tends to select compact, isotropic cluster shapes
  - can be useful for initializing more complex methods
  - many algorithmic variations on the basic theme, e.g., in signal processing/data compression it is similar to vector quantization
- Choice of distance measure: Euclidean distance, weighted Euclidean distance, and many others are possible
- Selection of K: scree diagram, i.e., plot SSE versus K and look for the knee of the curve (see the sketch below); limitation: there may not be any clear K value
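The scree heuristic of Slide 86 can be sketched directly with the kmeans function above; random restarts guard (imperfectly) against the local minima of Slides 82-85. Synthetic three-cluster data, my own example:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(m, 0.5, size=(100, 2)) for m in ([0, 0], [4, 0], [2, 3])])

# For each K, keep the best (lowest) total squared error over 10 random restarts
Ks = range(1, 9)
best_S = [min(kmeans(X, K, seed=s)[2] for s in range(10)) for K in Ks]

plt.plot(list(Ks), best_S, "o-")
plt.xlabel("K")
plt.ylabel("total squared error S")
plt.title("Scree plot: look for the knee (here near K = 3)")
plt.show()
```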
Slide 87: Issues
- Representing non-numeric data? E.g., color = {red, blue, green, ...}; the simplest approach is to represent it as multiple binary variables, one per value
- Standardizing the data: say we have the length of a person's arm in feet and the width of a person's foot in millimeters; in K-means distance calculations the measurements in millimeters will dominate the measurements in feet
- Solution? Place all the variables on a similar scale, e.g., z = (x - mean(x)) / std(x), or y = (x - mean(x)) / interquartile_range(x)

Slide 88: Hierarchical Clustering
- Representation: a tree of nested clusters
- Works from a distance matrix
  - advantage: the x's can be any type of object
  - disadvantage: computation
- Two basic approaches: merge points (agglomerative) or divide superclusters (divisive)
- Both are visualized via dendrograms, which show the nesting structure; merges or splits = tree nodes
- Applications: e.g., clustering of gene expression data; useful for seeing hierarchical structure in relatively small data sets

Slide 89: Simple Example of Hierarchical Clustering
(figure)

Slide 90: Agglomerative Methods: Bottom-Up
- Algorithm, based on the distance between clusters:
  - for i = 1 to n, let C_i = { x(i) }, i.e., start with n singletons
  - while more than one cluster remains: let C_i and C_j be the cluster pair with minimum distance dist[C_i, C_j]; merge them, via C_i = C_i U C_j, and remove C_j
- Time complexity = O(n^2) to O(n^3): there are n iterations (start: n clusters; end: 1 cluster), and the 1st iteration alone takes O(n^2) to find the nearest singleton pair
- Space complexity = O(n^2): the algorithm accesses all distances between the x(i)'s
- Interpreting a dendrogram is difficult for large n anyway (as with large decision trees); one large-n idea: place partition-based clusters at the leaves

Slide 91: Distances Between Clusters
- Single link / nearest neighbor: D(C_i, C_j) = min { d(x, y) | x in C_i, y in C_j }
  - can be outlier/noise sensitive

Slide 92: Distances Between Clusters
- Complete link / furthest neighbor: D(C_i, C_j) = max { d(x, y) | x in C_i, y in C_j }
  - enforces more compact clusters

Slide 93: Distances Between Clusters
- Intermediates between those extremes:
  - average link: D(C_i, C_j) = avg { d(x, y) | x in C_i, y in C_j }
  - centroid: D(C_i, C_j) = d(c_i, c_j), where c_i and c_j are the cluster centroids
  - Ward's SSE measure (for vector data): merge the pair of clusters that minimizes the increase in the within-cluster sum of squared distances
- Note that the centroid and Ward measures require that a centroid (vector mean) can be defined
- Which to choose? Different methods may be used for exploratory purposes; it depends on the goals and the application (see the sketch below)
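In practice agglomerative clustering is rarely coded by hand; scipy provides linkage computation, dendrogram drawing, and tree cutting. A minimal sketch (scipy and matplotlib assumed; the standardization step follows Slide 87):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, size=(15, 2)), rng.normal(5, 1, size=(15, 2))])

# Standardize first so that no variable dominates the distances (Slide 87)
Xz = (X - X.mean(axis=0)) / X.std(axis=0)

Z = linkage(Xz, method="single")   # also try "complete", "average", "ward"
dendrogram(Z)                      # crossbar heights = merge distances
plt.ylabel("merge distance")
plt.show()

labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
```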
Slide 94: Dendrogram Using the Single-Link Method
- Old Faithful eruption duration vs. wait data
- Notice how single link tends to chain
- Dendrogram y-axis: the height of each crossbar is the distance score at which the two clusters were merged

Slide 95: [Speaker note: add pointer to Nature clustering paper for text]

Slide 96: (figure)

Slide 97: Approach
- Hierarchical clustering of genes using the average-linkage method
- Clustered time-course data for 8,600 human genes and 2,467 genes in budding yeast
- This paper was the first to show that clustering of expression data yields significant biological insights into gene function

Slide 98: Heat-Map Representation (human data)

Slide 99: Heat-Map Representation (yeast data)

Slide 100: Evaluation

Slide 101: Clustered display of data from a time course of serum stimulation of primary human fibroblasts. Eisen M. B. et al., PNAS 1998; 95:14863-14868. © 1998 by the National Academy of Sciences

Slide 102: Probabilistic Clustering
- Hypothesize that the data are generated by a mixture of K multivariate probability density functions (e.g., Gaussians); each density function is a cluster
- Data vectors have a probability of belonging to each cluster, rather than a 0-1 membership
- Clustering algorithm:
  - learn the parameters of the K densities (the mean and covariance, for Gaussians)
  - learn the probability memberships for each input vector
  - can be solved with the Expectation-Maximization (EM) algorithm
- Can be thought of as a probabilistic version of K-means (see the sketch below)
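scikit-learn's GaussianMixture fits exactly such a mixture by EM; a minimal sketch on synthetic data (sklearn is my choice of tool here, not something prescribed by the slides):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1.0, size=(200, 2)),
               rng.normal(4, 1.5, size=(100, 2))])

gm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gm.fit(X)                       # parameters estimated via EM
probs = gm.predict_proba(X)     # soft memberships, not 0-1 assignments
print(gm.means_)                # one mean vector per Gaussian component
print(probs[:3].round(3))       # each row sums to 1
```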
Slides 103-109: (figures)

Slide 110: (figure: two clusters, labeled Anemia Group and Control Group)

Slide 111: Summary
- Many different approaches and algorithms exist
- What type of cluster structure are you looking for?
- Computational complexity may be an issue for large n
- Data dimensionality can also be an issue
- Validation/selection of K is often an ill-posed problem

Slide 112: References: Data Sets and Case Studies
- GEO database: Barrett et al., "NCBI GEO: mining tens of millions of expression profiles -- database and tools update," Nucleic Acids Research (2007), 35 (suppl. 1): D760-765, doi:10.1093/nar/gkl887
- Clustering expression data: Eisen, Spellman, Brown, and Botstein, "Cluster analysis and display of genome-wide expression patterns," PNAS, Dec 1998, 95(25): 14863-14868. Link: http://www.pnas.org/content/95/25/14863.full.pdf+html

Slide 113: General References on Clustering
- Cluster Analysis (5th ed.), B. S. Everitt, S. Landau, M. Leese, and D. Stahl, Wiley, 2011 (broad overview of clustering methods and algorithms)
- Algorithms for Clustering Data, A. K. Jain and R. C. Dubes, Prentice Hall, 1988 (a bit outdated, but has many useful ideas and references on clustering)
- "How many clusters? Which clustering method? Answers via model-based cluster analysis," C. Fraley and A. E. Raftery, The Computer Journal, 1998 (good overview article on probabilistic model-based clustering)

Slide 114: Exploratory Data Analysis: Dimension Reduction Methods

Slide 115: Class Projects
- Project proposal due Wednesday, May 2nd; final project report due Monday, May 21st
- Details at http://www.ics.uci.edu/~smyth/courses/bmi/project_guidelines.html
- We will discuss this in more detail next week

Slide 116: Software Packages
- Commercial packages: SAS, Statistica, and many others; see kdnuggets.org
- Free packages: Weka
- Programming environments: R, MATLAB (commercial), Python

Slide 117: Concepts in Multivariate Data Analysis

Slide 118: An Example of a Data Set

  Patient ID   Zipcode   Age   ...   Test Score   Diagnosis
  18261        92697     55    ...    83          1
  42356        92697     19    ...   -99          1
  00219        90001     35    ...    77          0
  83726        24351      6    ...    50          0
  ...
  12837        92697     40    ...    70          1

Notation: the columns may be called measurements, variables, features, attributes, fields, etc.; the rows may be called individuals, entities, objects, samples, etc.

Slide 119: Vectors and Data
- Beyond 3 dimensions we cannot visually inspect our data directly
- Consider a data set of 10 measurements on 100 patients; we can think of each patient's data as a tuple with 10 elements
- If the variable values are numbers, we can represent each patient as a 10-dimensional vector, e.g., x = (x_1, x_2, x_3, ..., x_9, x_10)
- Our data set is then a set of such vectors: we can imagine the data as living in a 10-dimensional space, with each patient represented as a vector at a 10-dimensional location
- Sets of patients can be viewed as clouds of points in this 10-d space

Slide 120: High-Dimensional Data
What is the volume of a hypersphere relative to its enclosing hypercube in d dimensions? (David Scott, Multivariate Density Estimation, Wiley, 1992)

  Dimension:        2     3     4     5     6     7
  Relative volume:  0.79  ?     ?     ?     ?     ?

Slide 121: High-Dimensional Data

  Dimension:        2     3     4     5     6     7
  Relative volume:  0.79  0.53  0.31  0.16  0.08  0.04

Slide 122: The Geometry of Data
- Geometric view: a data set = a set of vectors in a d-dimensional space
- This lets us apply geometric constructs to data analysis, e.g.:
  - distance between data points = distance between vectors
  - centroid = center of mass of a cloud of data points
  - density = relative density of points in a particular region of the space
  - decision boundaries -> partition the space into regions
- (Note that not all types of data can be naturally represented geometrically)

Slide 123: Basic Concepts: Distance
D(x, y) = the distance between 2 vectors x and y. How should we define it? E.g., Euclidean distance, Manhattan distance, Jaccard distance, and more.

Slide 124: Basic Concepts: Center of Mass
- What is the center of a set of data points? The multidimensional mean, defined as the component-wise average (1/N) sum_i x(i)
- What happens if there is a hole in the center of the data?
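The relative volumes on Slides 120-121 follow from the closed-form volume of a d-dimensional ball, V_d(r) = pi^(d/2) r^d / Gamma(d/2 + 1), divided by the volume (2r)^d of the enclosing hypercube. A quick check (matches the slide's table up to rounding):

```python
from math import gamma, pi

def rel_volume(d):
    """Volume of the unit ball divided by the volume of its bounding cube [-1, 1]^d."""
    ball = pi ** (d / 2) / gamma(d / 2 + 1)
    cube = 2 ** d
    return ball / cube

for d in range(2, 8):
    print(d, round(rel_volume(d), 2))
# 2 0.79, 3 0.52, 4 0.31, 5 0.16, 6 0.08, 7 0.04

# The ball occupies a vanishing fraction of the cube as d grows:
# in high dimensions, most of the volume is in the corners.
```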
Slide 125: Geometry of Data Analysis
(figure)

Slide 126: Basic Concepts: Decision Boundaries
- In 2 dimensions we can partition the 2-d space into 2 regions using a line or a curve
- In d dimensions, we can define a (d-1)-dimensional hyperplane or hypersurface to partition the space into 2 pieces

Slide 127: Geometry of Data Analysis
(figure: a good boundary vs. a better boundary)

Slide 128: Example: Assigning Data Points to Two Exemplars
- Say we have 2 exemplars: e.g., 1 data point that is the prototype of healthy patients and 1 data point that is the prototype of non-healthy patients
- Now we want to assign every individual to the closest prototype; let's use Euclidean distance as our measure of distance
- This is equivalent to a decision boundary that lies exactly halfway between the two exemplar points, at right angles to the line joining them (see the next slide for a 2-dimensional example, and the sketch at the end of this section)

Slides 129-130: Geometry of Data Analysis
(figures)

Slide 131: High-Dimensional Data
- Example: 100-dimensional data
- Visualization is of limited value, although it can still be useful to look at variables 1 or 2 at a time
- Curse of dimensionality: the hypercube/hypersphere example (Scott); the number of samples needed for an accurate density estimate in high dimensions
- Question: how would you find outliers (if any) among 1,000 data points in a 100-dimensional data space?

Slide 132: Clustering: Finding Group Structure
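Finally, returning to the two-exemplar example of Slide 128: a minimal sketch of the nearest-exemplar rule (the prototype values here are hypothetical):

```python
import numpy as np

healthy = np.array([2.0, 3.0])    # hypothetical prototype for healthy patients
diseased = np.array([6.0, 1.0])   # hypothetical prototype for non-healthy patients

def assign(x):
    """Nearest-exemplar rule under (squared) Euclidean distance."""
    d_h = np.sum((x - healthy) ** 2)
    d_d = np.sum((x - diseased) ** 2)
    return "healthy" if d_h < d_d else "diseased"

print(assign(np.array([3.0, 3.0])))   # healthy
print(assign(np.array([5.0, 1.0])))   # diseased

# The implied decision boundary is the perpendicular bisector of the segment
# joining the two exemplars: the set of points equidistant from both.
midpoint = (healthy + diseased) / 2   # lies exactly on that boundary
```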