

REVIEW

COMPUTATIONAL BIOLOGY

Avoiding common pitfalls when clustering biological data

Tom Ronan, Zhijie Qi, Kristen M. Naegle*

Department of Biomedical Engineering, Center for Biological Systems Engineering, Washington University in St. Louis, St. Louis, MO 63130, USA.
*Corresponding author. Email: [email protected]


Clustering is an unsupervised learning method, which groups data points based on similarity, and is used to reveal the underlying structure of data. This computational approach is essential to understanding and visualizing the complex data that are acquired in high-throughput multidimensional biological experiments. Clustering enables researchers to make biological inferences for further experiments. Although a powerful technique, inappropriate application can lead biological researchers to waste resources and time in experimental follow-up. We review common pitfalls identified from the published molecular biology literature and present methods to avoid them. Commonly encountered pitfalls relate to the high-dimensional nature of biological data from high-throughput experiments, the failure to consider more than one clustering method for a given problem, and the difficulty in determining whether clustering has produced meaningful results. We present concrete examples of problems and solutions (clustering results) in the form of toy problems and real biological data for these issues. We also discuss ensemble clustering as an easy-to-implement method that enables the exploration of multiple clustering solutions and improves robustness of clustering solutions. Increased awareness of common clustering pitfalls will help researchers avoid overinterpreting or misinterpreting the results and missing valuable insights when clustering biological data.


Introduction

Technological advances in recent decades have resulted in the ability to measure large numbers of molecules, typically across a smaller number of conditions. Systems-level measurements are mined for meaningful relationships between molecules and conditions. Clustering represents a common technique for mining large data sets. Clustering is the unsupervised partitioning of data into groups, such that items in each group are more similar to each other than they are to items in another group. The purpose of clustering analysis of biological data is to gain insight into the underlying structure in the complex data—to find important patterns within the data, to uncover relationships between molecules and conditions, and to use these discoveries to generate hypotheses and decide on further biological experimentation. The basics of clustering have been extensively reviewed (1–3). Clustering has led to various discoveries, including molecular subtypes of cancer (4–7), previously unknown protein interactions (8), similar temporal modules in receptor tyrosine kinase cascades (9), metabolic alterations in cancer (10), and protease substrate specificity (11).

Although clustering is useful, it harbors potential pitfalls when applied to biological data from high-throughput experiments. Many of these pitfalls have been analyzed and addressed in publications in the fields of computation, bioinformatics, and machine learning, yet the solutions to these problems are not commonly implemented in biomedical literature. The pitfalls encountered when clustering biological data derive primarily from (i) the high-dimensional nature of biological data from high-throughput experiments, (ii) the failure to consider the results from more than one clustering method, and (iii) the difficulty in determining whether clustering has produced meaningful results. Biological systems are complex, so there are likely to be many relevant interactions between different aspects of the system, as well as meaningless relationships due to random chance.


Differentiating between a meaningful and a random clustering result can be accomplished by applying cluster validation methods, determining statistical and biological significance, accounting for noise, and evaluating multiple clustering solutions for each data set. The high-dimensional nature of biological data means that the underlying structure is difficult to visualize, that valid but conflicting clustering results may be found in different subsets of the dimensions, and that some common clustering algorithms and distance metrics fail in unexpected and hidden ways. To address these issues, clustering parameters and methods that are compatible with high-dimensional data must be chosen, the results must be validated and tested for statistical significance, and multiple different clustering methods should be applied as part of routine analysis. Our main goal is to dispel the belief that there exists a one-size-fits-all perfect method for clustering biological data.

Some solutions to address these pitfalls require awareness of the issue and the use of appropriate methods, whereas other solutions require substantial computational skills and resources to implement successfully. However, one method—ensemble clustering (that is, clustering data many ways while making some perturbation to the data or clustering parameters)—solves multiple pitfalls and can be implemented without extensive programming or computational resources. We mention the uses of ensemble clustering, as appropriate, and provide an overview of ensemble clustering at the end.

The Effect of High Dimensionality on Clustering Results

Systems-level measurements and high-throughput experiments are diverse in size of the data sets, number of dimensions, and types of experimental noise. Examples include measurements of the transcript abundance from thousands of genes across several conditions (such as in multiple cell lines or in response to drug treatments) (6, 12), measurements of changes in the abundance of hundreds of peptides over time after a stimulation (13), or measurements of changes in the abundance of hundreds of microRNAs across tissue samples (6).


Because the dimensionality of the data in a clustering experiment depends on the objects and features selected during clustering, understanding how to determine dimensionality and its effects on clustering output are prerequisites for approaching a clustering problem. As a rule of thumb, data with more than 10 dimensions should be considered high dimensional and should be given special consideration.

Determining dimensionality

The dimensionality of a clustering problem is defined by the number of features that an object has, rather than by the number of objects clustered. However, the definition of object and feature in a given clustering problem depends on the hypothesis being tested and which part of the data is being clustered. For example, in the measurement of 14,456 genes across 89 cell lines, as found by Lu et al. (6), we can ask two questions with these data. First, what is the relationship between genes based on their changes across the 89 cell lines? This case represents a gene-centric clustering problem with 14,456 observations, with each observation having 89 dimensions (Fig. 1A). Second, what is the relationship between cells based on the changes in gene expression across all of the genes? This case represents a cell line clustering problem, obtained by transposing the original data matrix such that the matrix now has 89 observations, with each observation having 14,456 dimensions (Fig. 1B).
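To make the object-versus-feature distinction concrete, the same matrix can be handed to a clustering routine in either orientation; transposing it swaps which axis is treated as the objects and which as the features. The sketch below uses random stand-in data and arbitrary cluster counts (not the actual Lu et al. matrix) simply to illustrate the two orientations.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical stand-in for an expression matrix: rows = genes, columns = cell lines.
# (Random data here; the real matrix would be ~14,000 genes x 89 cell lines.)
rng = np.random.default_rng(0)
expression = rng.normal(size=(1000, 89))

# Gene-centric clustering: genes are the objects, the 89 cell lines are the features.
gene_labels = AgglomerativeClustering(n_clusters=10).fit_predict(expression)

# Cell line-centric clustering: transpose the matrix so that cell lines are the
# objects and the genes become the features (a much higher-dimensional problem).
cell_line_labels = AgglomerativeClustering(n_clusters=5).fit_predict(expression.T)

print(gene_labels.shape)       # (1000,) -> one cluster label per gene
print(cell_line_labels.shape)  # (89,)   -> one cluster label per cell line
```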

The same data can be clustered in different ways to discover different biological insights, and these different ways of clustering the data can have large differences in dimensionalities. The gene-centric clustering problem represents a good basis for clustering because the observations greatly exceed the dimensions. However, even that problem represents a high-dimensionality situation. The cell line clustering problem is even more challenging because the relatively small number of observations (89) compared with the large dimensionality (>14,000) could be dominated by noise in the expression data. Without careful handling of sparsity and feature masking, clustering will almost certainly deliver poor or misleading results. Reliable and meaningful clustering results are likely achievable only with careful dimensionality reduction or subspace clustering.

Geometry and distance in high-dimensional data

As dimensionality increases beyond two- and three-dimensional spaces, the effects of high dimensionality, referred to as the "curse of dimensionality," come into play. These effects manifest in three key ways: (i) geometry behaves nonintuitively in higher dimensions; (ii) sparsity is common in high-dimensional data sets; and (iii) relevant features tend to become masked by increasing numbers of irrelevant features.

As dimensionality increases, familiar relationships of distance, volume, and probability distributions behave nonintuitively (14). The method of determining distance between points in a clustering problem (the distance metric) influences the clustering results. With high-dimensional biological data, distance metrics fail to behave as expected based on our low-dimensional intuition. For example, as dimensionality increases beyond 16 dimensions, the nearest neighbor to a point can be the same distance away as the farthest neighbor to the point (for many common distance metrics) (15–17). Put another way, because it is likely that any two points are far apart in at least a few dimensions, all points approach being equally far apart (18). This means that for many types of distance functions in high-dimensional space, all points are effectively equidistant from one another (17, 19). As a result, some common distance metrics and the concept of "nearest neighbor" can be potentially meaningless in high-dimensional space (17, 20). Although data with more than 16 dimensions should be considered high dimensional (17), data with as few as 10 dimensions can also exhibit nonintuitive high-dimensional behavior (14).
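The concentration of distances is easy to demonstrate empirically. A minimal sketch (uniform random points and Euclidean distance; the point counts and dimensions are arbitrary) shows the relative gap between the nearest and farthest neighbor shrinking as dimensionality grows.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)

for dim in (2, 16, 100, 1000):
    points = rng.uniform(size=(500, dim))          # 500 random points in `dim` dimensions
    dists = cdist(points[:1], points[1:]).ravel()  # distances from one query point to the rest
    nearest, farthest = dists.min(), dists.max()
    # Relative contrast: how much farther the farthest neighbor is than the nearest one.
    print(f"dim={dim:5d}  (farthest - nearest) / nearest = {(farthest - nearest) / nearest:.3f}")
```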

Some algorithms that were developed for lower-dimensional spaces do not generalize well to high-dimensional spaces. Many centroid and density-based algorithms, such as k-means (centroid) and DBSCAN (density-based), rely on defining a nearest neighbor, which only works well in lower-dimensional spaces (21). Thus, k-means and DBSCAN (22) often fail to return useful results when used on high-dimensional data (15, 17). Furthermore, these algorithms will give no indication that they are not working as expected. Despite these problems, reliable and interpretable results with high-dimensional data can be achieved if ensembles of these clustering algorithms are used (23, 24).

Some algorithms are specifically designed to function with high-dimensional data. Hypergraph-based clustering methods draw on the field of graph theory, a method of representing pairwise relations between objects. Hypergraph-based clustering can be used to transform sparse data in high dimensions to a problem in graph partitioning theory, drawing on unique methods from that field, to produce accurate and informative clustering (15). Grid-based clustering is a density-based approach and can fail to give meaningful results on some high-dimensional data sets. However, OptiGrid is a grid-based clustering approach that specifically addresses the problems of distance and noise that confound other similar algorithms when applied to high-dimensional data sets (21).

Fig. 1. Determining the dimensionality of a clustering problem. (A and B) Representation of the mRNA clustering problem consisting of >14,000 mRNAs measured across 89 cell lines. Data are from Lu et al. (6). When the mRNAs are clustered, the mRNAs are the objects and each cell line represents a feature, resulting in an 89-dimensional problem (A). When attempting to classify normal and tumor cell lines using gene expression, the cell lines are the objects and each mRNA is a feature, resulting in a clustering problem with thousands of dimensions (B). (C) Effect of dimensionality on sparsity. (D) Effect of dimensionality on coverage of the data based on SD from the mean.


Some algorithms, such as NDFS, simultaneously solve problems with noise and find subgroups (25), which can improve the accuracy of clustering of high-dimensional data. Others, like OptiGrid, alternate rounds of grouping and dimensionality reduction to cluster high-dimensional data (26). When working with high-dimensional data, clustering algorithms that are suited for the dimensionality of the clustering problem should be used.

Sparsity in high-dimensional data

Sparsity is a problem with high-dimensional data because clusters are most clearly defined when there are many objects in each cluster, resulting in a robust representation of each cluster. Given a fixed amount of data, increasing dimensionality causes the data to become spread out in space. For example, given a set of data points in one dimension, data may be densely packed (Fig. 1C). As the dimensionality increases to two or three dimensions, each unit volume of the space becomes less and less populated. Extrapolating to an even more sparsely populated high-dimensional data set, it becomes increasingly difficult to ascertain whether a distant data point represents a noisy point far from one cluster or a new cluster that only has a few members and thus is difficult to identify. Effectively, random noise dominates clustering results based on sparse data.

Sparsity also affects our low-dimensional intuition for statistical rules of thumb. We typically use 3 standard deviations (SDs, ±3σ) to determine a reasonable threshold for statistical significance. In a one-dimensional Gaussian distribution (represented by a typical bell curve), 3 SDs cover 99.7% of the data. In two or three dimensions, when independently distributed, this coverage is reduced slightly (to 99.5 and 99.2%, respectively). In 10 dimensions, 3 SDs cover only 97.3%. By 100 dimensions, coverage is reduced to 76.3%; by 1000 dimensions, it is reduced to 6.7% (Fig. 1D) (16). This means that, in high-dimensional space, our rules of thumb for interpreting variance and what threshold should be considered statistically significant need to be reconsidered.
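Under the stated assumption of independently distributed Gaussian dimensions, the coverage of a ±3 SD box is simply the one-dimensional coverage raised to the power of the number of dimensions, so the quoted values can be reproduced in a few lines.

```python
from scipy.stats import norm

# Probability that a single Gaussian dimension falls within +/- 3 SD (~0.9973).
p_one_dim = norm.cdf(3) - norm.cdf(-3)

# With independent dimensions, every dimension must fall within +/- 3 SD at once.
for dim in (1, 2, 3, 10, 100, 1000):
    coverage = p_one_dim ** dim
    print(f"{dim:4d} dimensions: {coverage:.1%} of the data within +/- 3 SD")
# Reproduces the values quoted above: 99.7%, 99.5%, 99.2%, 97.3%, 76.3%, 6.7%.
```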

Masking relationships in high-dimensional data

With high-dimensional data, the signal can easily get lost in the noise. Biological noise and variation contribute to irrelevant features, which can mask important features, and ultimately influence cluster membership in the full-dimensional space (27). In the example of gene expression analysis across 89 cell lines, measuring tens of thousands of transcripts guarantees that transcripts with no relevance to the biology under study will also be measured. However, the noise that is present in those transcripts from biological or technical variation can dominate clustering results because clustering algorithms treat the noise as if it were a true feature in the data. The background noise resulting from cell-cell variation might swamp the most relevant features necessary to identify differences between cell lines, or background noise can even masquerade as significant differences between cell lines when those differences are due to random processes in the cell.

Strong relationships among only a subset of features can mask other relationships. For example, metastatic cells and normal cells of a particular type may have similar gene expression profiles, so cell lines might tend to cluster by cell type (rather than by normal versus metastatic) when the entire gene expression data are used. However, when only expression data from a subset of genes are used, the metastatic cells may exhibit similar expression patterns and thus would be grouped apart from the normal cells.

To exemplify how a signal can be detected in a lower-dimensional data set and lost in a higher-dimensional data set, we turn to the study by Lu and colleagues (6) in which they measured the relative abundances of microRNA and mRNA molecules in 89 cell lines. Eighteen of 20 gastrointestinal cell lines clustered together with the microRNA data, which had 217 dimensions, but this relationship was lost when the mRNA data with ~14,000 dimensions were clustered. Understanding that the higher-dimensionality data can result in loss of the signal in the noise, we agree with the authors when they suggested that the loss of signal might result from "the large amount of noise and unrelated signals that are embedded in the high-dimensional mRNA data." Fortunately, strategies are available for unmasking the hidden signals in biological data.

There are competing schools of thought for addressing the masking of relevant features by irrelevant features. If only a few features (dimensions) of the data are likely to contain the most relevant information, some advocate applying techniques that reduce the dimensionality. Dimensionality reduction involves the selection of only a subset of the features, the selection of which is based on a criterion such as predictive power or variance. When only a few features are expected to contain signal and the rest are expected to be noise, principal component analysis (PCA) is often used. PCA reduces high-dimensional data to fewer dimensions that capture the largest amount of covariance in the data. Another method of reducing dimensionality is to remove features with low values, as is often done in microarray analysis where transcripts below a threshold, or transcripts changing only a very small amount between conditions, are removed from the data set. Alternatively, if multiple independent signals exist in the data, selection of different subsets of features to cluster (28) may reveal different relationships among the data.
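As a rough sketch of the two reduction strategies just described (PCA and low-variance filtering), using scikit-learn on a hypothetical expression matrix; the matrix size and variance threshold are illustrative only, not values from the original study.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
# Hypothetical matrix: 89 cell lines (objects) x 5,000 transcripts (features).
data = rng.normal(size=(89, 5000))

# Strategy 1: project onto the first 10 principal components.
pca_features = PCA(n_components=10).fit_transform(data)

# Strategy 2: drop features whose variance across samples is below a threshold.
filtered = VarianceThreshold(threshold=0.9).fit_transform(data)

print(pca_features.shape)  # (89, 10)
print(filtered.shape)      # (89, n_kept) -- depends on the threshold and the data
```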

These three commonly applied methods for reducing dimensionality will produce clustering results, but each has limitations and, by definition, eliminates some features of the data. Dimensionality reduction techniques can degrade clustering results for high-dimensional data sets (29) when structure that is present only in some subset of the data is eliminated as a result of the dimensionality reduction. To illustrate this problem, we use a toy data example in which three clusters are readily apparent when the data are presented in two dimensions (Fig. 2A). A single projection onto any one dimension (determined by PCA, defined by the dotted line) (Fig. 2B) reduces the separability compared with the two-dimensional representation such that the identity of at least one cluster is lost. This "local feature relevance" suggests that dimensionality reduction techniques, such as PCA, may be inappropriate if different clusters lie in different subspaces, and that one should avoid the application of a single instance of feature selection (27) to avoid missing important structures within the data.

In contrast, subspace clustering is a method that searches different subsets of original features to find all valid clustering results, no matter which subset they are found in. Subspace clustering does not use graph theory or partitioning methods. Because it involves the selection of a subset of features, it also does not rely on the density of the data. In some data, objects will cluster only in subspaces of the full data space. For example, in the toy problem, when three different subspaces are chosen, the three clusters are readily obvious (Fig. 2C). Because this is a simple data set, the ability to see the clusters in two dimensions is also obvious, but note that at least two of the green cluster data points and one of the brown cluster data points touch the dividing line when analyzed in two dimensions (Fig. 2A). In contrast, the three clusters are clearly separated when the appropriate subspaces are selected (Fig. 2C). Subspace clustering must be carefully tailored to the high-dimensional data to produce valid results (30). As a computationally intensive method, subspace clustering can hit limitations due to a combinatorial explosion of potential subspaces in high dimensions (15). However, subspace clustering can reveal the multiple, complex relationships in biological data, because although information is lost in each subspace, multiple subspaces are considered; therefore, subspace clustering results are informed by information from all relevant dimensions.
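Dedicated subspace clustering algorithms search feature subsets far more systematically than this, but a hedged, brute-force illustration of the idea is to score candidate subsets by how well the objects separate when clustered within each subset. The data, subset size, and cluster count below are arbitrary and for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
data = rng.normal(size=(89, 500))          # hypothetical: 89 objects x 500 features
subset_size, n_trials = 10, 200            # illustrative settings

best_score, best_subset = -np.inf, None
for _ in range(n_trials):
    # Pick a random candidate subspace (a subset of the original features).
    subset = rng.choice(data.shape[1], size=subset_size, replace=False)
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(data[:, subset])
    score = silhouette_score(data[:, subset], labels)  # how well objects separate here
    if score > best_score:
        best_score, best_subset = score, subset

print("best subspace:", np.sort(best_subset), "silhouette:", round(best_score, 3))
```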

The problematic effects of dimensionality reduction and the efficacy of subspace clustering can be seen on the expression data from Lu et al. (6).


The clustering results from the original study (clustering 89 cell lines using the mRNA data, ~14,000 dimensions) (Fig. 2D) were compared to the clustering results after using PCA to reduce the dimensionality to the 10 most relevant features (Fig. 2E) and after using subspace clustering to reduce the dimensionality to the 10 most informative features (Fig. 2F). As Lu et al. found, we do not see significant grouping of cell lines of gastrointestinal origin when we cluster the data in the full feature space (Fig. 2D), or if we reduce dimensionality using the first 10 principal components as features from PCA (Fig. 2E). However, a selection of one 10-dimensional subspace shows strong clustering for gastrointestinal cell lines (Fig. 2F)—almost as strong as the results that Lu and colleagues presented for much lower-dimensional microRNA data. This analysis suggests that, although the reduced dimensions of principal component space may have uncovered a structure that we do not understand, PCA was not informative for grouping cells based on their tissue of origin. However, there is a subspace (a subset of features from the original dimensions) for the mRNA data in which we can successfully group cells by their origins. This example illustrates how irrelevant features in the high-dimensional space masked the grouping of cells by origin, despite the data including expression measurements of genes that reflect tissue origin. The key was finding the right approach to cluster the data.

Effects of Clustering Parameters on Clustering Results

In addition to issues related to the high dimensionality of biological data, clustering parameters also affect the clustering result. Given the same data, varying a single parameter of clustering, such as the transformation, the distance metric, or the algorithm, can drastically alter the clustering result. Unfortunately, there is often no clear choice of best metric or best transformation to use on a particular type of biological data. Each choice can mask or reveal a different facet of the organization within the data. Therefore, in addition to applying different methods of clustering and different approaches to addressing dimensionality, it is essential to consider results from multiple parameters when evaluating clustering results for biological data.

Transformations and distance metrics

Data are often transformed as part of analysis and processing. For example, transcriptional microarray data are commonly log2-transformed. This transformation expands the information for genes with low expression variation across samples and simplifies the identification of genes with differential expression. Similarly, in proteomic data sets, data may be centered and scaled by autoscaling or Z-scoring (25) to make relative comparisons between signals for which magnitude cannot be directly compared. Although transformation can improve the ability to draw useful biological insight (31, 32), transformation also generates a new data set with altered relationships that may reveal or mask some underlying biological relationships in the data (Fig. 3, A and B).

Although transformation of data is routine as a part of preclustering analysis, because of the effects on density and distance in the data set, any transformations done on the data at any point should be explicitly considered as a clustering parameter during the clustering process. We found that, compared with using different distance metrics or clustering algorithms (32), transformations often had the greatest impact on a clustering result (Fig. 3B). In other cases, transformation has little impact on clustering results (Fig. 3, C and D).

The choice of a distance metric also greatly affects clustering results because different distance metrics accentuate different relationships within the data (Fig. 3B). Thus, to avoid missing information in the data, different distance metrics and transformations should be applied as a routine part of clustering analysis.
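A short sketch of treating the transformation and the distance metric as explicit clustering parameters, using toy data; the particular transformations, metrics, linkage, and cluster count are illustrative rather than recommendations.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
data = rng.lognormal(mean=2.0, sigma=1.0, size=(60, 8))   # toy positive-valued data

results = {}
for transform_name, transformed in (("raw", data), ("log2", np.log2(data))):
    for metric in ("euclidean", "cityblock", "cosine"):
        # Agglomerative (average linkage) clustering under each transform/metric pair.
        tree = linkage(pdist(transformed, metric=metric), method="average")
        results[(transform_name, metric)] = fcluster(tree, t=4, criterion="maxclust")

# Each entry is a different partition of the same 60 objects; comparing the
# partitions (for example, with an adjusted Rand index) shows how much the choices matter.
for key, labels in results.items():
    print(key, np.bincount(labels))
```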

Fig. 2. Dimensionality reduction methods and effects. Comparison of PCA and subspace clustering. (A) Three clusters are plotted in two dimensions. The dashed red line indicates the one-dimensional line upon which the original two-dimensional data is projected as determined by PCA. (B) The clusters are plotted in the new single dimension after reducing the dimensionality from two to one. (C) Three alternate one-dimensional projections (dashed red lines) onto which the data can be projected, each demonstrating better separability for some clusters than the projection identified using PCA. (D to F) Comparison of the original clustering results of 89 cell lines in ~14,000-dimensional mRNA data (D) to clustering results after PCA (E) and after subspace clustering (F). Blue bars, gastrointestinal cell lines; yellow bars, nongastrointestinal cell lines.



Clustering algorithms

The choice of a clustering algorithm is based on several factors: (i) the underlying structure of the data, (ii) the dimensionality of the data, (iii) the number of relevant features for the biological questions being asked, and (iv) the noise and variance in the data. Each algorithm incorporates different assumptions about the data and can reveal different relationships among the data (Fig. 4). The four primary classes of clustering algorithms are hierarchical, centroid, density, and statistical. The choice of clustering algorithm depends on the predicted structure of the data, and each algorithm class produces clusters with different properties. Hierarchical clustering is useful when data are expected to contain clusters within clusters, or the observations are expected to have a nested relationship. The Ward algorithm (Fig. 4, column 2) can produce different results depending on the threshold for similarity, highlighting hierarchical relationships. For example, at different thresholds, the green cluster (Fig. 4, row 2, column 2, two lower groups) might take on separate cluster identities, indicating that the group is composed of two subgroups. Centroid clustering, such as k-means clustering, assigns membership in a cluster based on the distance from multiple centers, resulting in roughly spherical clusters even when the underlying data are not spherically distributed. The k-means algorithm depends heavily on the selection of the correct value for k and tends to find spherical clusters (Fig. 4, column 1). It performs poorly on irregularly shaped data (Fig. 4, rows 3 to 5, column 1). Density-based algorithms, such as DBSCAN (22), connect groups of densely connected points with regions of lower density separating clusters. Because DBSCAN does not rely on a particular cluster shape, it can capture more complex structures in low-dimensional data (Fig. 4, rows 3 to 5, column 3) and does not tend to find clusters in uniform data (Fig. 4, row 1, column 3). However, variations in density can cause it to find additional clusters not found by other algorithms (Fig. 4, rows 2 and 3, column 3). Statistical methods, such as self-organizing maps (33) and Gaussian mixture models (Fig. 4, column 4), fit statistical distributions to data to identify multiple groups of observations, each belonging to their respective distribution, but have limited success on nonnormally distributed data (Fig. 4, rows 3 to 5).

Whereas the data in Fig. 4 are toy examples, real-world examples are often high-dimensional and difficult to plot. Often, the underlying structure is not known for most biological data, and it is likely that a complex biological data set will have multiple structures—nonspherical distributions, widely varying density scales, and nested relationships—that will only be revealed by applying multiple clustering algorithms.
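The kind of comparison shown in Fig. 4 can be reproduced in outline with scikit-learn. The hedged sketch below runs one representative of each algorithm class on a two half-moons toy structure; the parameter values are illustrative rather than tuned.

```python
from sklearn.cluster import DBSCAN, KMeans, AgglomerativeClustering
from sklearn.datasets import make_moons
from sklearn.mixture import GaussianMixture

# Toy "two half-moons" structure, similar in spirit to Fig. 4, row 4.
data, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

algorithms = {
    "k-means (centroid)": KMeans(n_clusters=2, n_init=10, random_state=0),
    "Ward (hierarchical)": AgglomerativeClustering(n_clusters=2, linkage="ward"),
    "DBSCAN (density)": DBSCAN(eps=0.2, min_samples=5),
    "Gaussian mixture (statistical)": GaussianMixture(n_components=2, random_state=0),
}

for name, algorithm in algorithms.items():
    labels = algorithm.fit_predict(data)   # each algorithm partitions the same data
    print(name, "->", sorted(set(labels)))
```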

Evaluating Clustering Results

How can you tell when the clustering result of biological data is meaningful? Because of the complexity of biological systems, there are likely to be many valid clustering results, each revealing some aspect of underlying biological behavior. Unfortunately, there are likely to be many meaningless relationships simply due to random chance because the data are complex. Most clustering algorithms will find clusters, even if there is no true underlying structure in the data (as exemplified by Fig. 4, top row). Therefore, clusters must be evaluated for biological relevance, stability, and cluster fitness. Understanding and accounting for noise and uncertainty in the data should also be considered when determining whether a clustering result is meaningful.

Cluster validation

Validation metrics are a measure of clustering fitness, assessing whether the result represents a well-defined structure within the data set, using concepts such as cluster compactness, connectedness, or separation, or combinations of these attributes. Much like distance metrics, each validation metric accentuates different aspects of the data to account for the final score (Table 1).

Fig. 3. Effect of transformations and distance metric on clustering results. (A) Demonstration of how transformations affect the relationship of data points in space. A toy data set (reference set, https://github.com/knaegle/clusteringReview) was clustered into four clusters with agglomerative clustering, average linkage, and Euclidean distance. The four reference clusters are shown without transformation (upper panel) and after log2 transformation (lower panel). (B) Transformations and distance metrics change clustering results when compared to the reference clustering result. With no transformation (upper panels), Euclidean and cosine distance do not change cluster identity, but with Manhattan distance, a new cluster A′ is added, and cluster C is merged into cluster B. With the log2 transformation (lower panels), the Euclidean and Manhattan metrics caused cluster C′ to emerge and cluster D to be lost. (C) Dendrogram from the microRNA (miRNA) clustering experiment result from 89 cell lines and 217 microRNAs (6). Gastrointestinal-derived cell lines (blue bars) predominantly cluster together in the full-dimensional space. Note: The data were log2-transformed as part of the preclustering analysis. (D) Same microRNA data as in (C) but without log2 transformation.


The compactness of each cluster can be measured by the root mean square standard deviation (RMSSTD) method (34), the r-squared (RS) method (35), or the Hubert's Γ statistic (36). Connectedness within a cluster can be measured by k-nearest neighbor consistency (37) or Handl's connectivity metric (38). Separateness is measured by the SD validity index (35), which compares average scattering and total separation between clusters. Some metrics combine measures of compactness and separation, such as the pseudo-F statistic (also known as the Calinski-Harabasz index) (39), the Dunn index (40), the Davies-Bouldin index (41), silhouette width (42), or the gap statistic (43). When there are different numbers of clusters in the clustering results, or predominantly different membership in each cluster, some metrics might fail. The Rand index (44) is appropriate for validating these complex clustering differences.

To illustrate the differences in validation metrics that measure compactness, connectedness, and separation, we applied different metrics to the clustering results from Fig. 4 (Table 2). For the toy data, DBSCAN generally performs better on the more complex structures, with some exceptions. DBSCAN correctly identifies no structure (Fig. 4, row 1), the half-moons (Fig. 4, row 4), and nested circles (Fig. 4, row 5). This is also reflected in the validation metric results for connectedness, which is a better measure than compactness for nonspherical data shapes. The lack of structure in row 1 is also suggested by the "zero" result for the RS measure of compactness and the "not available" result for separation from the SD validity index. For the spherical clusters, k-means and mixture models perform the best (Fig. 4, row 2), which is reflected in the best scores for validation metrics that measure both compactness and separation. No algorithm performs well on the parallel lines data (Table 2). Although DBSCAN appears to do better with the center of the clusters (Fig. 4, row 3), the errant results at the far left and right result in a poor score for validation metrics—on par with those algorithms that divide the horizontal parallel lines into vertical clusters.

The validation metric chosen should match the characteristics being tested. One approach is to use a metric that matches the selected algorithm. For example, when an algorithm optimizes connectedness, a metric that evaluates connectedness instead of one that evaluates compactness should be used (Table 1). An alternative approach is to use a panel of validation metrics, each measuring a specific aspect of the data. For example, when creating an ensemble of distance metrics and algorithms, a panel of metrics that measure compactness, connectedness, and separation will eliminate inappropriate combinations of parameters. The worst pitfall is to not use any metric to assess the clustering outcome.
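Several of the metrics in Table 1 have standard implementations. A minimal sketch of scoring one clustering result with a small panel (silhouette width, the Calinski-Harabasz or pseudo-F statistic, and the Davies-Bouldin index, all available in scikit-learn) might look like this; the toy data and cluster count are arbitrary.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

data, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # toy spherical clusters
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(data)

# A small panel of validation metrics, each emphasizing a different property.
print("silhouette width (higher is better):", silhouette_score(data, labels))
print("Calinski-Harabasz / pseudo-F (higher is better):", calinski_harabasz_score(data, labels))
print("Davies-Bouldin (lower is better):", davies_bouldin_score(data, labels))
```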

Clustering stability

Because complex biological data can demonstrate multiple and independent valid clustering results, any single clustering result can be inconclusive.

Whereas validation metrics provide an assessment of the quality of a clustering result with regard to specific aspects of the data, stability of the clustering result defines how robust the clustering result is to perturbation, that is, how many ways of assessing the data produce a similar clustering result. Stability analysis works under the premise that clustering results represent the underlying structure of the data and should be relatively invariant to perturbations in the analysis. Stability analysis can identify robust clustering results using a variety of perturbations, such as accounting for (i) noise (45–47); (ii) the effect of projections of high-dimensional data onto fewer dimensions (30); (iii) differences in algorithms, distance metrics, and data transformations (32); and (iv) the effect of selecting different random starting positions in nondeterministic algorithms (48).
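One simple way to probe stability is to recluster perturbed copies of the data and quantify the agreement between partitions. The sketch below uses additive Gaussian noise as the perturbation and the adjusted Rand index as the agreement measure; the noise scale and number of repeats are arbitrary choices, not a prescription.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

data, _ = make_blobs(n_samples=200, centers=4, random_state=0)
rng = np.random.default_rng(0)

reference = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(data)

# Recluster many noise-perturbed copies and record agreement with the reference result.
agreements = []
for _ in range(50):
    perturbed = data + rng.normal(scale=0.3, size=data.shape)   # arbitrary noise scale
    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(perturbed)
    agreements.append(adjusted_rand_score(reference, labels))

print("mean adjusted Rand index across perturbations:", np.mean(agreements))
```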

Stability analysis assumes that there is a single "true" structure within the data; however, high-dimensional biological data can have multiple true structures that reveal distinct biological insights. Therefore, it may be equally valid to assume that there are multiple structures within the data. Multiple clustering analysis methodology (MCAM) (32) is an approach that tests multiple clustering results. The approach in MCAM is based on the assumption that some perturbations to clustering parameters can uncover clustering results that reveal different biological insights. For example, some transformations uncovered shared binding partners of phosphorylated sites, whereas others highlighted shared biological function within a cell signaling pathway (32, 45).

Fig. 4. The effect of algorithm on clustering results. Four toy data sets (https://github.com/knaegle/clusteringReview) demonstrate the effects of different types of clustering algorithms on various known structures in two-dimensional data. Columns: k-means, Ward, DBSCAN, and mixture models. Rows: no structure, three clusters, two wide clusters, two half-moons, and two nested circles.



Application of multiple clustering approaches can reveal a single stable and robust result or uncover additional structures in the data. Applying only a single clustering approach risks drawing conclusions based on noise or missing interesting and useful structures within the data.

Accounting for noise and measurement uncertainty

Uncertainty in data means uncertainty in clustering results. However, many clustering algorithms do not account for noise or error when determining relationships between observations or when calculating distance. Although the molecular and cellular biology research community has adopted a set of rules that enable meaningful interpretation of differences in molecular measurements between pairs of conditions, this kind of standard has not been adopted for clustering complex, high-dimensional data.


For example, when comparing means of measured data, we tend to require a minimum of triplicate measurements and the use of Student's t test to determine statistically significant differences within biological data (49, 50). Surprisingly, similar standards and rules do not exist for identifying differential patterns or groupings from clustering results despite the data having the same measurements and biology as low-dimensionality data between pairs of conditions.

Several methods account for uncertainty in the data and properly propagate that uncertainty into clustering results. One ensemble approach is exemplified by Kerr and Churchill's work (51) in which the researchers performed repeated clustering on multiple data sets generated by sampling gene expression from a statistical model that incorporates the variance of the measurements. The perturbation was resampling from reasonable, alternate representations of the original data based on noise. From this analysis, the variation in the clustering result due to the variability in the original data is determined, thus producing a range of clustering results representing the variability. An alternative approach is to use model-based clustering algorithms that account for the variance in the replicate measurements during the formation of the clusters (47, 52, 53).

Because model-based approaches are not commonly available to molecular biologists in standard software packages, we explored ways to account for noise with ensembles (45) without necessarily relying on a statistical model of the data. This exploration led us to two counterintuitive conclusions about the effects of noise. First, when data are well separated, robust clusters can still be found even in data with noisy distributions. Thus, clustering robustness cannot be predicted on the basis of the noise in the data. Second, the noise and variance in data are useful and contain information that can be revealed from the ensemble analysis. Specifically, we found that as signals propagated in time from a receptor tyrosine kinase to the mitogen-activated protein kinase (MAPK) pathway, there was high variability in an intermediate signaling state in the MAPK pathway.

Table 1. Validation metrics. Validation metrics used for testing the quality of a clustering result and the measures on which they are based.

| Metric | References | Measures |
| RMSSTD | (34) | Compactness |
| RS | (35) | Compactness |
| Modified Hubert's Γ statistic | (36) | Compactness |
| K-nearest neighbor consistency | (37) | Connectedness |
| Connectivity | (38) | Connectedness |
| Determinant ratio index | (62) | Connectedness |
| SD validity index | (35) | Separation |
| Pseudo-F statistic (Calinski-Harabasz) | (39) | Combination of compactness and separation |
| Dunn index | (40) | Combination of compactness and separation |
| Silhouette width | (42) | Combination of compactness and separation |
| Davies-Bouldin index | (41) | Combination of compactness and separation |
| Gap statistic | (43) | Combination of compactness and separation |
| Rand index | (44) | Similarity between solutions |


Table 2. Validation metrics. Results of validation metrics [RMSSTD (34), RS (35), determinant ratio index (62), and SD validity index (35)] applied to the data from Fig. 4. For each validation metric, the best score is the minimum or maximum value, as indicated in the column heading.

| Algorithm | Structure | RMSSTD (compactness, min) | RS (compactness, max) | Determinant ratio index (connectedness, min) | SD validity index (separation, min) |
| K-means | No structure | 0.38 | 0.23 | 4.04 | 6.67 |
| Ward | No structure | 0.40 | 0.18 | 2.58 | 7.56 |
| DBSCAN | No structure | 0.44 | 0.00 | 1.00 | N/A |
| Mixture models | No structure | 0.38 | 0.23 | 4.09 | 6.67 |
| K-means | Spherical clusters | 0.78 | 0.81 | 249.08 | 0.32 |
| Ward | Spherical clusters | 1.18 | 0.56 | 26.43 | 0.18 |
| DBSCAN | Spherical clusters | 0.58 | 0.86 | 1683.53 | 2.52 |
| Mixture models | Spherical clusters | 0.78 | 0.81 | 249.08 | 0.32 |
| K-means | Long parallel clusters | 0.89 | 0.27 | 2.77 | 0.79 |
| Ward | Long parallel clusters | 0.93 | 0.21 | 2.10 | 0.75 |
| DBSCAN | Long parallel clusters | 0.84 | 0.21 | 22.36 | 4.32 |
| Mixture models | Long parallel clusters | 0.89 | 0.27 | 2.79 | 0.79 |
| K-means | Half-moons | 0.55 | 0.35 | 4.23 | 1.77 |
| Ward | Half-moons | 0.57 | 0.30 | 3.77 | 1.98 |
| DBSCAN | Half-moons | 0.60 | 0.21 | 3.03 | 2.61 |
| Mixture models | Half-moons | 0.56 | 0.33 | 5.21 | 1.93 |
| K-means | Nested circles | 0.52 | 0.17 | 2.71 | 4.04 |
| Ward | Nested circles | 0.53 | 0.13 | 1.97 | 3.30 |
| DBSCAN | Nested circles | 0.57 | 0.00 | 1.00 | 4059.71 |
| Mixture models | Nested circles | 0.52 | 0.17 | 2.71 | 3.99 |


The effect of this noise in the clustering result reflected the relationship of this intermediate signal (a singly phosphorylated kinase) with both the upstream signal (the receptor) and the downstream signal (the doubly phosphorylated form of the kinase) and represented a meaningful biological relationship. Consequently, we recommend against prefiltering data to remove those with high variance; instead, noise should be addressed in the clustering analysis, even in the absence of replicates [as detailed in (45)].

Determining biological and statistical significance of clustering results

A primary challenge of clustering analysis is deriving biological insight. The most successful analyses often result from combining clustering with previous biological understanding of the system. However, unless the process of attaching biological meaning to the clusters is done with statistical analysis, we can overinterpret relationships that "make sense" to us. The two most common pitfalls related to this are (i) using anecdotal observations instead of a statistical test and (ii) failing to account for the increase in false positives when multiple statistical tests are performed.

Ultimately, to avoid the first pitfall, a researcher must determine the likelihood of having made a particular observation by random chance. This "null model" or "background model" can then be used to assess how likely it is that the relationship uncovered by clustering is truly related to the biological information under consideration, as opposed to the likelihood of it occurring by random chance. We demonstrate this process with the microRNA clustering result from the analysis of the 89 cell lines (Fig. 3C) (6). To test the statistical hypothesis, we asked how likely it is that the 15 gastrointestinal cell lines would have occurred in a cluster of 20 cell lines by random chance. We used the theoretical hypergeometric distribution to calculate the probability using a right-tailed test. The resulting P value approaches zero, indicating that this clustering result is unlikely to be due to random chance alone. In contrast, the appearance of any two gastrointestinal cell lines in a cluster of 20 cell lines is likely to occur by random chance alone (P ≤ 0.9).
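The enrichment calculation described here corresponds to a right-tailed hypergeometric test. A sketch with scipy, using the counts quoted above (and taking the total of 20 gastrointestinal lines from the earlier description of the Lu et al. data), is:

```python
from scipy.stats import hypergeom

total_cell_lines = 89        # all cell lines in the study
gastrointestinal = 20        # gastrointestinal cell lines among them
cluster_size = 20            # size of the cluster being tested
observed_in_cluster = 15     # gastrointestinal cell lines found in that cluster

# Right-tailed test: probability of seeing >= 15 gastrointestinal lines in a
# random group of 20 drawn from the 89 cell lines.
p_value = hypergeom.sf(observed_in_cluster - 1, total_cell_lines,
                       gastrointestinal, cluster_size)
print(f"P(X >= {observed_in_cluster}) = {p_value:.2e}")
```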

In many biological data sets, we are not simply testing one label in a single cluster but rather multiple labels across multiple clusters. A common example in cell biology and signaling research is to test for enrichment in any cluster of Gene Ontology terms (54) or biological pathway assignments from other sources. Testing for the enrichment of such biological properties or classifications across the clustering results represents multiple hypothesis testing and thus requires "multiple hypothesis correction." For example, a P value cutoff of 0.01 for a single tested hypothesis yields the prediction that false-positive results due to random chance occur rarely—only 1/100th of the time. However, as is common in biological data set analysis, we often test thousands of hypotheses (for example, asking if any one of 10,000 genes is differentially expressed). In this case, even with a P value cutoff of 0.01, we would expect to find 100 false-positive results due to random chance. If we identify 105 statistically significant items, but we know 100 are false positives, we cannot separate which 5 are likely to be the true positives without multiple hypothesis correction.

When choosing a procedure for multiple hypothesis correction, reducing false positives may simultaneously increase false negatives (the elimination of real positives), which can result in missing biological insight. To reduce the frequency of false negatives, we recommend using the false discovery rate (FDR) correction procedure introduced by Benjamini and Hochberg (55), which is often applied in microarray analysis (56, 57), rather than the Bonferroni correction (58). Regardless of the type of correction chosen, when asking multiple questions, some form of multiple hypothesis correction is necessary to prevent overinterpreting the results.
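A sketch of Benjamini-Hochberg FDR correction applied to a vector of P values, using statsmodels; the P values here are simulated so that roughly 100 of 10,000 tests carry true signal.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
# Simulated P values: most tests are null (uniform), a few carry strong signal.
p_values = np.concatenate([rng.uniform(size=9900), rng.uniform(high=1e-5, size=100)])

# Benjamini-Hochberg FDR correction at a 1% false discovery rate.
rejected, p_adjusted, _, _ = multipletests(p_values, alpha=0.01, method="fdr_bh")

print("raw P < 0.01:", int((p_values < 0.01).sum()))     # includes ~100 false positives
print("significant after FDR correction:", int(rejected.sum()))
```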

Ensemble Clustering: A Solution to Many Pitfalls

Ensemble clustering refers to the act of clustering the data many times while making some perturbation—either to the data matrix or to clustering parameters—and then accounting for all of the clustering results across the ensemble. The goal of ensemble clustering is to improve the quality and robustness of clustering results when compared to any single clustering result.

Table 3. Ensemble perturbations. Major perturbations applied to the data or to the clustering parameters in ensemble clustering and their intended purpose.

| Perturbation | Reason behind perturbation | References |
| K | Identify the optimum number of clusters | (46, 60, 63) |
| Noise | Identify relationships within the data that are not affected by biological or experimental noise | (45, 47, 64) |
| Starting point (nondeterministic algorithms) | Identify those partitions that are independent of starting position or identify set of minima | (23, 48) |
| Projections into lower dimensions | Increase robustness to clustering noise resulting from high dimensionality | (30, 63, 65–67) |
| Subsampling | Identify subsets of the data that cluster consistently | (68–70) |
| Parameters of clustering | Identify unpredicted biological information | (32) |

Table 4. Summary of in-depth review articles. A collection of reviews for more in-depth coverage of each topic.

| In-depth review area | References |
| Specific clustering algorithms, their trade-offs, and how they function | (2, 3, 71–73) |
| Analysis of the effects of different distance metrics on clustering gene expression data | (74, 75) |
| Practical and mathematical implications of high dimensionality on clustering | (15, 16) |
| A thorough review of validation metrics | (35, 38, 62, 76) |
| The most common multiple hypothesis correction procedures, including Bonferroni correction and FDR correction | (55, 58) |
| The effects of specific distances on data clustering for lower-dimensional spaces | (77) |
| The effects of specific distances on data clustering for high-dimensional spaces | (15, 78) |
| Ensembles of some algorithms incompatible with high-dimensional data can be useful on higher-dimensional data, even when a single clustering solution is uninformative | (23, 24) |
| A more in-depth analysis of ensembles, including evaluating the results of multiple clustering runs and determining consensus | (60, 63) |




Why do ensembles improve quality and robustness? In short, it is because uncorrelated "noise" cancels across the clustering results. In ensemble clustering, noise in the clustering results occurs when there are strong biases due to the selected algorithms or because the data contain poorly clustering points. Fortunately, if the errors in each clustering result are not correlated, and the errors pertain to only a subset of the data, then the shared decisions made across the ensemble will dominate, resulting in convergence to the robust clustering result. A combination of diverse clustering results strengthens the underlying signal while filtering out the individual noise from each clustering result, enabling a more robust determination of the data structure than that acquired from a single clustering result obtained through analysis of the data without perturbation.

Ensemble generation, finishing, and visualization

The process of performing ensemble clustering involves selecting an appropriate perturbation (Table 3), collecting the clustering results based on the perturbation, and then either combining the individual clustering results into a single clustering result or exploring the ensemble of clustering results for information about the underlying structure of the data. To illustrate an ensemble of clustering solutions, we used random toy data and created an ensemble of clustering results using k-means clustering with an increasing number of clusters (k) (Fig. 5A).
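To make the k-sweep concrete, the following is a minimal sketch (not the code in file S1) that builds such an ensemble with scikit-learn; the toy data, random seed, and range of k are illustrative assumptions.

```python
# Minimal sketch (not the file S1 code): build an ensemble of k-means
# clustering results by sweeping k over random toy data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))          # toy data: m = 100 observations, 2 dimensions

ensemble = []                          # one label vector per clustering result
for k in range(2, 11):                 # k-sweep, k = 2 ... 10
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    ensemble.append(labels)
```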

Many techniques have been proposed to combine the individual clustering results in the ensemble into one final clustering result. We refer to these methods as finishing techniques. Most finishing techniques use agreement across the ensemble to build a final clustering result. One method is to calculate the co-occurrence (or consensus) matrix. A co-occurrence matrix is an m × m matrix, where each entry, Ci,j, represents the number of times object i clusters with object j across all of the clustering results in the ensemble.
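A minimal sketch of computing the co-occurrence matrix from the ensemble above follows; normalizing to a fraction (rather than reporting a raw count) is a presentation choice, not a requirement.

```python
# Minimal sketch: m x m co-occurrence (consensus) matrix for the ensemble.
# C[i, j] is the fraction of clustering results in which observations i and j
# received the same cluster label.
import numpy as np

def cooccurrence_matrix(ensemble):
    m = len(ensemble[0])
    C = np.zeros((m, m))
    for labels in ensemble:
        labels = np.asarray(labels)
        C += (labels[:, None] == labels[None, :])   # 1 where i and j cocluster
    return C / len(ensemble)

C = cooccurrence_matrix(ensemble)      # 'ensemble' from the k-sweep sketch above
```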

We clustered the co-occurrence matrix using hierarchical clustering with Ward linkage and plotted the result as a heatmap (Fig. 5B). The clusters are formed on the basis of creating maximal in-group co-occurrence frequency and minimum co-occurrence with members outside the group. This representation reveals a wealth of detail about the relationships between data points and highlights data points that cocluster robustly (that is, frequently) with each other across the ensemble.
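One way to reproduce this kind of view, sketched below, is to convert co-occurrence to a dissimilarity (here 1 minus the co-occurrence fraction, an assumed but common choice), apply Ward linkage with SciPy, and draw the reordered heatmap with seaborn.

```python
# Minimal sketch: hierarchically cluster the co-occurrence matrix with Ward
# linkage and plot the reordered matrix as a heatmap.
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

D = 1.0 - C                                        # pairs that rarely cocluster are far apart
Z = linkage(squareform(D, checks=False), method="ward")
sns.clustermap(C, row_linkage=Z, col_linkage=Z, cmap="viridis")
plt.show()
```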

Others have used majority voting, also based on the ideas of robustness (59, 60), as demonstrated in Fig. 5C. We subjected the co-occurrence matrix above to majority voting (Fig. 5C, left) and then plotted the ensemble majority-voting cluster assignments (Fig. 5C, right). Identifying the most robust clustering result is useful for generating hypotheses for further experimental testing because hypotheses can be ranked on the basis of the strength of the co-occurrence (90% of the time compared to 10% of the time).
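A minimal sketch of one way to implement majority voting follows: pairs that cocluster in more than half of the ensemble are linked, and connected groups of linked observations become the final clusters. The 50% threshold matches Fig. 5C; the use of SciPy's connected_components is an illustrative choice.

```python
# Minimal sketch of majority-vote finishing: link observations that cocluster
# in more than 50% of the ensemble, then take connected groups of linked
# observations as the final clusters.
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

votes = C > 0.5                                    # majority-vote adjacency matrix
n_clusters, final_labels = connected_components(csr_matrix(votes), directed=False)
```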

Another finishing technique to identify robust portions of the ensemble is to apply graph theory. For example, if we assumed that the co-occurrence matrix represented edge weights (a numerical value indicating the strength of connection) connecting the data points, we could traverse these weights to find maximally connected subgraphs and provide a different representation of robust clusters (61). With this concept of graph theory representations of ensemble clustering, we discovered that robustly clustered dynamic tyrosine phosphorylation data uncovered molecular-level interactions (8).
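The sketch below illustrates the graph-based idea in simplified form: the co-occurrence matrix is treated as a weighted adjacency matrix, weak edges are dropped, and maximal cliques are read off as candidate robust clusters. The 0.75 edge threshold and the use of networkx are assumptions for illustration; this is not a reimplementation of CLICOM (61).

```python
# Minimal sketch of a graph-based finishing step: treat the co-occurrence
# matrix as edge weights, keep only strongly supported edges, and read off
# maximal cliques (fully connected subgraphs) as candidate robust clusters.
import networkx as nx

G = nx.Graph()
m = C.shape[0]
for i in range(m):
    for j in range(i + 1, m):
        if C[i, j] >= 0.75:                        # illustrative threshold
            G.add_edge(i, j, weight=C[i, j])

robust_clusters = [clique for clique in nx.find_cliques(G) if len(clique) > 2]
```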

The finishing techniques mentioned above uniquely assign each observation to one cluster, thereby creating hard partitions within the data. However, a benefit of ensemble clustering is the ability to identify the probability of relationships (fuzzy partitioning), which can be applied to the entire data set (probability is calculated for membership of any observation to any cluster) or as a mixture of hard partitioning and probability-based assignment. An examination of the portion of the heatmap representation of the co-occurrence matrix containing the blue, gray, and orange clusters demonstrates fuzzy partitioning (Fig. 5D, left). The heatmap indicates that the gray cluster members share partial membership with the blue and orange clusters (Fig. 5D, right). Rather than considering the gray cluster as distinct, one can consider it to have a partial membership in either the orange or blue clusters, with a probability based on the co-occurrence matrix value.
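A minimal sketch of deriving such fuzzy memberships: each observation's membership in a cluster is taken as its mean co-occurrence with that cluster's members, normalized across clusters. Anchoring the memberships to the hard labels from the majority-vote sketch above is an assumption made for illustration.

```python
# Minimal sketch of fuzzy (probabilistic) membership derived from the
# co-occurrence matrix: an observation's membership in a cluster is its mean
# co-occurrence with that cluster's members, normalized across clusters.
import numpy as np

def fuzzy_memberships(C, hard_labels):
    clusters = np.unique(hard_labels)
    P = np.zeros((C.shape[0], clusters.size))
    for idx, c in enumerate(clusters):
        members = np.flatnonzero(hard_labels == c)
        P[:, idx] = C[:, members].mean(axis=1)
    return P / P.sum(axis=1, keepdims=True)        # rows sum to 1

memberships = fuzzy_memberships(C, final_labels)   # 'final_labels' from the majority vote
```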

Fig. 5. Ensemble clustering overview. Finishing techniques were applied to random toy data (see file S1 for analysis details). (A) Set of clustering results obtained using the k-means algorithm with various values of k (a k-sweep; k = 2 to 10). (B) Hierarchically clustered (Ward linkage) co-occurrence matrix for the ensemble of results in (A). The heatmap represents the percentage of times any pair of data points coclusters across the ensemble. (C) A majority vote analysis was applied (left panel) using a threshold of 50% on the co-occurrence matrix in (B). Six clusters (see dendrogram color groupings) result from the majority vote (right panel). (D) Application of fuzzy clustering to the ensemble. The left panel shows the details of the co-occurrence matrix for the blue, gray, and orange clusters, and the right shows the clustering assignments. The gray cluster provides an example of partially fuzzy clustering because it shares membership with the orange and dark blue clusters.



Ensembles for robust clustering results

Ensemble clustering can reveal unexpected results. When an algorithm with limited capabilities is combined into an ensemble, the clustering result of the ensemble can have new capabilities. For example, although a k-means algorithm can only identify spherical clusters, when used as part of an ensemble with majority voting, the ensemble can identify half-moons and spirals (59). This is possible because signals from relatively weak relationships in each clustering result are combined to improve the strength of the pairwise relationship between points in the ensemble clustering results. Ensemble clustering can also assess the impact of perturbations to clustering parameters on the clustering results, revealing when transformations to the data have a larger impact on the clustering result than the algorithm or distance metric does (32).

As an example of ensemble clustering, we describe a subset of the results of an ensemble approach that we used to cluster the dynamics of tyrosine phosphorylation in the epidermal growth factor receptor (EGFR) network measured by Wolf-Yadlin and colleagues (13). From this analysis, we identified previously unknown protein interactions (8). We show a subset of the full analysis to illustrate this process (Fig. 6). PDLIM1, a protein not previously reported as part of the EGFR network, had phosphotyrosine dynamics similar to many other proteins known to directly interact with phosphorylated tyrosine residues of EGFR (Tyr1197 and Tyr1172) (Fig. 6A; blue bar, PDLIM1; orange bars, EGFR interactors). To identify a more robust representation of the clustering behavior of the system, we generated an ensemble of clusters by varying distance metrics and clustering algorithms. Across the ensemble, known interactors had a much higher tendency to cluster together than with noninteractors. A visualization of the ensemble results, a co-occurrence matrix, places the interactors of these EGFR phosphotyrosine sites in one of the clusters (Fig. 6B, upper left and orange bars), along with the potential interactor PDLIM1 (blue bar). On the basis of these results, we experimentally tested and validated PDLIM1 as a protein that interacted with EGFR (8). It is important to note that, in many of the ensemble clustering results, PDLIM1 did not cluster with all known interactors of EGFR phosphorylated at Tyr1197 and Tyr1172. Rather, PDLIM1 tended to cluster with a subset of known interactors. Furthermore, in many ensemble clustering results, the known interactors of EGFR did not all cluster together (Fig. 6C). This demonstrates the value of ensemble clustering because analysis of a single clustering result might have missed this important relationship (8).

Fig. 6. Ensemble clustering on phosphoproteomic data. (A) Single clustering solution showing known interactors with EGFR (orange bars) and PDLIM1 (blue bar) coclustering in the phosphoproteomic data (blue heatmap). (B) Co-occurrence matrix heatmap demonstrating clustering of interactors with EGFR. The known interactors with EGFR (orange bars) and PDLIM1 (blue bar) are found in a single cluster (upper left). (C) Subset of clustering results across multiple distance metrics and clustering algorithms (single, complete, and average linkage; Euclidean, correlation, city-block, and cosine distances). Under the dendrogram, known interactors with EGFR are marked with orange bars and PDLIM1 is marked with a blue bar.


Conclusion

Clustering biological data involves a number of choices, many of which are critical to obtaining meaningful results. Evaluation of a data set should be performed to ensure that it has a sufficient number of observations (data points) and that the dimensionality of those observations informs subsequent clustering choices. For each new data set, dimensionality and any transformations applied should influence the choice of appropriate distance metrics and algorithms for clustering. Data sets of more than 10 dimensions often behave unexpectedly, and clustering can produce meaningless results. Using only a single clustering result from any data set can lead to wasted time and resources resulting from erroneous hypothesis testing. The effect of permutations of clustering parameters on the clustering results should be explored using validation metrics and stability testing. When possible, noise and variance should be accounted for in the clustering method directly rather than simply taking averages at each data point. Once clustering results are obtained, their validity should be evaluated using the appropriate metrics. The statistical significance of clustering results should also be evaluated, and multiple hypothesis correction should be applied when necessary. For robust results, ensemble clustering over a range of distance metrics, transformations, and other clustering parameters is effective. Following these steps will yield robust and reliable clustering results and provide a solid basis for generating testable hypotheses.





SUPPLEMENTARY MATERIALS
www.sciencesignaling.org/cgi/content/full/9/432/re6/DC1
File S1. Output of the iPython Notebook that generates all examples in this review.

REFERENCES AND NOTES
1. A. K. Jain, M. N. Murty, P. J. Flynn, Data clustering: A review. ACM Comput. Surv. 31, 264–323 (1999).
2. R. Xu, D. C. Wunsch II, Clustering algorithms in biomedical research: A review. IEEE Rev. Biomed. Eng. 3, 120–154 (2010).
3. B. Andreopoulos, A. An, X. Wang, M. Schroeder, A roadmap of clustering algorithms: Finding a match for a biomedical application. Brief. Bioinform. 10, 297–314 (2009).
4. T. Sørlie, C. M. Perou, R. Tibshirani, T. Aas, S. Geisler, H. Johnsen, T. Hastie, M. B. Eisen, M. van de Rijn, S. S. Jeffrey, T. Thorsen, H. Quist, J. C. Matese, P. O. Brown, D. Botstein, P. E. Lønning, A. L. Børresen-Dale, Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc. Natl. Acad. Sci. U.S.A. 98, 10869–10874 (2001).
5. J. Lapointe, C. Li, J. P. Higgins, M. van de Rijn, E. Bair, K. Montgomery, M. Ferrari, L. Egevad, W. Rayford, U. Bergerheim, P. Ekman, A. M. DeMarzo, R. Tibshirani, D. Botstein, P. O. Brown, J. D. Brooks, J. R. Pollack, Gene expression profiling identifies clinically relevant subtypes of prostate cancer. Proc. Natl. Acad. Sci. U.S.A. 101, 811–816 (2004).
6. J. Lu, G. Getz, E. A. Miska, E. Alvarez-Saavedra, J. Lamb, D. Peck, A. Sweet-Cordero, B. L. Ebert, R. H. Mak, A. A. Ferrando, J. R. Downing, T. Jacks, H. R. Horvitz, T. R. Golub, MicroRNA expression profiles classify human cancers. Nature 435, 834–838 (2005).
7. R. G. W. Verhaak, K. A. Hoadley, E. Purdom, V. Wang, Y. Qi, M. D. Wilkerson, C. R. Miller, L. Ding, T. Golub, J. P. Mesirov, G. Alexe, M. Lawrence, M. O'Kelly, P. Tamayo, B. A. Weir, S. Gabriel, W. Winckler, S. Gupta, L. Jakkula, H. S. Feiler, J. G. Hodgson, C. D. James, J. N. Sarkaria, C. Brennan, A. Kahn, P. T. Spellman, R. K. Wilson, T. P. Speed, J. W. Gray, M. Meyerson, G. Getz, C. M. Perou, D. N. Hayes; Cancer Genome Atlas Research Network, Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1. Cancer Cell 17, 98–110 (2010).
8. K. M. Naegle, F. M. White, D. A. Lauffenburger, M. B. Yaffe, Robust co-regulation of tyrosine phosphorylation sites on proteins reveals novel protein interactions. Mol. Biosyst. 8, 2771–2782 (2012).
9. A. Wolf-Yadlin, N. Kumar, Y. Zhang, S. Hautaniemi, M. Zaman, H.-D. Kim, V. Grantcharova, D. A. Lauffenburger, F. M. White, Effects of HER2 overexpression on cell signaling networks governing proliferation and migration. Mol. Syst. Biol. 2, 54 (2006).
10. M. Jain, R. Nilsson, S. Sharma, N. Madhusudhan, T. Kitami, A. L. Souza, R. Kafri, M. W. Kirschner, C. B. Clish, V. K. Mootha, Metabolite profiling identifies a key role for glycine in rapid cancer cell proliferation. Science 336, 1040–1044 (2012).
11. M. A. Miller, L. Barkal, K. Jeng, A. Herrlich, M. Moss, L. G. Griffith, D. A. Lauffenburger, Proteolytic Activity Matrix Analysis (PrAMA) for simultaneous determination of multiple protease activities. Integr. Biol. 3, 422–438 (2011).
12. V. K. Mootha, C. M. Lindgren, K.-F. Eriksson, A. Subramanian, S. Sihag, J. Lehar, P. Puigserver, E. Carlsson, M. Ridderstråle, E. Laurila, N. Houstis, M. J. Daly, N. Patterson, J. P. Mesirov, T. R. Golub, P. Tamayo, B. Spiegelman, E. S. Lander, J. N. Hirschhorn, D. Altshuler, L. C. Groop, PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat. Genet. 34, 267–273 (2003).
13. A. Wolf-Yadlin, S. Hautaniemi, D. A. Lauffenburger, F. M. White, Multiple reaction monitoring for robust quantitative proteomic analysis of cellular signaling networks. Proc. Natl. Acad. Sci. U.S.A. 104, 5860–5865 (2007).
14. M. Köppen, The curse of dimensionality. 5th Online World Conf. Soft Comput. Ind. Appl. 1, 4–8 (2000).
15. M. Steinbach, L. Ertöz, V. Kumar, The challenges of clustering high dimensional data, in New Directions in Statistical Physics (Springer, Berlin, 2004), pp. 273–309.
16. A. Zimek, E. Schubert, H.-P. Kriegel, A survey on unsupervised outlier detection in high-dimensional numerical data. Stat. Anal. Data Min. 5, 363–387 (2012).
17. K. Beyer, J. Goldstein, R. Ramakrishnan, U. Shaft, When is "nearest neighbor" meaningful?, in Database Theory—ICDT'99 (Springer, Berlin, 1999), pp. 217–235.
18. C. C. Aggarwal, J. L. Wolf, P. S. Yu, C. Procopiuc, J. S. Park, Fast algorithms for projected clustering. ACM SIGMOD Rec. 28, 61–72 (1999).
19. V. Pestov, On the geometry of similarity search: Dimensionality curse and concentration of measure. Inform. Process. Lett. 73, 47–51 (2000).
20. A. Hinneburg, C. C. Aggarwal, D. A. Keim, What is the nearest neighbor in high dimensional spaces? Proc. 26th VLDB Conf., 506–515 (2000).
21. A. Hinneburg, D. A. Keim, Optimal grid-clustering: Towards breaking the curse of dimensionality in high-dimensional clustering. Int. Conf. Very Large Databases, 506–517 (1999).
22. M. Ester, H.-P. Kriegel, J. Sander, X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise. Second Int. Conf. Knowl. Discov. Data Min., 226–231 (1996).
23. E.-Y. Kim, S.-Y. Kim, D. Ashlock, D. Nam, MULTI-K: Accurate classification of microarray subtypes using ensemble k-means clustering. BMC Bioinformatics 10, 260 (2009).
24. D. Amaratunga, J. Cabrera, Y.-S. Lee, Resampling-based similarity measures for high-dimensional data. J. Comput. Biol. 22, 54–62 (2015).
25. B. Byeon, K. Rasheed, Simultaneously removing noise and selecting relevant features for high dimensional noisy data. Proc. 7th Int. Conf. Mach. Learn. Appl. ICMLA 2008, 147–152 (2008).
26. C. Hou, F. Nie, D. Yi, D. Tao, Discriminative embedded clustering: A framework for grouping high-dimensional data. IEEE Trans. Neural Netw. Learn. Syst. 26, 1287–1299 (2014).
27. H.-P. Kriegel, P. Kröger, A. Zimek, Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Trans. Knowl. Discov. Data 3, 1–58 (2009).
28. J. Tang, S. Alelyani, H. Liu, in Data Classification: Algorithms and Applications, C. C. Aggarwal, Ed. (Chapman and Hall/CRC Press, Boca Raton, FL, 2014), pp. 37–64.
29. K. Y. Yeung, W. L. Ruzzo, Principal component analysis for clustering gene expression data. Bioinformatics 17, 763–774 (2001).
30. L. Parsons, E. Haque, H. Liu, Subspace clustering for high dimensional data: A review. ACM SIGKDD Explor. Newsl. 6, 90–105 (2004).
31. R. A. van den Berg, H. C. J. Hoefsloot, J. A. Westerhuis, A. K. Smilde, M. J. van der Werf, Centering, scaling, and transformations: Improving the biological information content of metabolomics data. BMC Genomics 7, 142 (2006).
32. K. M. Naegle, R. E. Welsch, M. B. Yaffe, F. M. White, D. A. Lauffenburger, MCAM: Multiple clustering analysis methodology for deriving hypotheses and insights from high-throughput proteomic datasets. PLOS Comput. Biol. 7, e1002119 (2011).
33. P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E. S. Lander, T. R. Golub, Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. U.S.A. 96, 2907–2912 (1999).
34. S. Sharma, Applied Multivariate Techniques (John Wiley & Sons Inc., New York, 1996), vol. 39, p. 512.
35. M. Halkidi, Y. Batistakis, M. Vazirgiannis, On clustering validation techniques. J. Intell. Inf. Syst. 17, 107–145 (2001).
36. L. Hubert, P. Arabie, Comparing partitions. J. Classif. 2, 193–218 (1985).
37. C. Ding, X. He, K-nearest-neighbor consistency in data clustering: Incorporating local information into global optimization. Proc. 2004 ACM Symp. Appl. Comput., 584–589 (2004).
38. J. Handl, J. Knowles, D. B. Kell, Computational cluster validation in post-genomic data analysis. Bioinformatics 21, 3201–3212 (2005).
39. T. Caliński, J. Harabasz, A dendrite method for cluster analysis. Commun. Stat. 3, 1–27 (1974).
40. J. C. Dunn, Well-separated clusters and optimal fuzzy partitions. J. Cybernetics 4, 95–104 (1974).
41. D. L. Davies, D. W. Bouldin, A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. 1, 224–227 (1979).
42. P. J. Rousseeuw, Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987).
43. R. Tibshirani, G. Walther, T. Hastie, Estimating the number of clusters in a data set via the gap statistic. J. R. Stat. Soc. B Met. 63, 411–423 (2001).
44. W. M. Rand, Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 66, 846–850 (1971).
45. R. Sloutsky, N. Jimenez, S. J. Swamidass, K. M. Naegle, Accounting for noise when clustering biological data. Brief. Bioinform. 14, 423–436 (2013).
46. A. Ben-Hur, A. Elisseeff, I. Guyon, A stability based method for discovering structure in clustered data. Pac. Symp. Biocomput. 7, 6–17 (2002).
47. K. Y. Yeung, M. Medvedovic, R. E. Bumgarner, Clustering gene-expression data with repeated measurements. Genome Biol. 4, R34 (2003).
48. L. I. Kuncheva, D. P. Vetrov, Evaluation of stability of k-means cluster ensembles with respect to random initialization. IEEE Trans. Pattern Anal. Mach. Intell. 28, 1798–1808 (2006).
49. K. Naegle, N. R. Gough, M. B. Yaffe, Criteria for biological reproducibility: What does "n" mean? Sci. Signal. 8, fs7 (2015).
50. E. F. Ryder, P. Robakiewicz, Statistics for the molecular biologist: Group comparisons. Curr. Protoc. Mol. Biol. 67, A.3I.1–A.3I.22 (2001).
51. M. K. Kerr, G. A. Churchill, Bootstrapping cluster analysis: Assessing the reliability of conclusions from microarray experiments. Proc. Natl. Acad. Sci. U.S.A. 98, 8961–8965 (2001).
52. M. Medvedovic, K. Y. Yeung, R. E. Bumgarner, Bayesian mixture model based clustering of replicated microarray data. Bioinformatics 20, 1222–1232 (2004).
53. E. J. Cooke, R. S. Savage, P. D. W. Kirk, R. Darkins, D. L. Wild, Bayesian hierarchical clustering for microarray time series data with replicates and outlier measurements. BMC Bioinformatics 12, 399 (2011).
54. Gene Ontology Consortium, Gene Ontology Consortium: Going forward. Nucleic Acids Res. 43, D1049–D1056 (2015).
55. Y. Benjamini, Y. Hochberg, Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. B Met. 57, 289–300 (1995).
56. C.-A. Tsai, H.-m. Hsueh, J. J. Chen, Estimation of false discovery rates in multiple testing: Application to gene microarray data. Biometrics 59, 1071–1081 (2003).
57. S. B. Pounds, Estimation and control of multiple testing error rates for microarray studies. Brief. Bioinform. 7, 25–36 (2006).
58. M. Aickin, H. Gensler, Adjusting for multiple testing when reporting research results: The Bonferroni vs Holm methods. Am. J. Public Health 86, 726–728 (1996).
59. A. L. N. Fred, A. K. Jain, Data clustering using evidence accumulation. Proc. 16th Int. Conf. Pattern Recognit. 4, 276–280 (2002).
60. A. Strehl, J. Ghosh, Cluster ensembles—A knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res. 3, 583–617 (2002).
61. S. Mimaroglu, M. Yagci, CLICOM: Cliques for combining multiple clusterings. Expert Syst. Appl. 39, 1889–1901 (2012).
62. B. Desgraupes, Clustering indices. CRAN Packag., 1–10 (2013).
63. S. Monti, P. Tamayo, J. Mesirov, T. Golub, Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data. Mach. Learn. 52, 91–118 (2003).
64. E. R. Dougherty, J. Barrera, M. Brun, S. Kim, R. M. Cesar, Y. Chen, M. Bittner, J. M. Trent, Inference from clustering with application to gene-expression microarrays. J. Comput. Biol. 9, 105–126 (2002).
65. R. Agrawal, J. Gehrke, D. Gunopulos, P. Raghavan, Automatic subspace clustering of high dimensional data for data mining applications. ACM SIGMOD Rec. 27, 94–105 (1998).
66. S. Goil, H. Nagesh, A. Choudhary, "MAFIA: Efficient and scalable subspace clustering for very large data sets" (Technical Report CPDC-TR-9906-010, Center for Parallel and Distributed Computing, Evanston, IL, 1999).
67. A. Zimek, J. Vreeken, The blind men and the elephant: On meeting the problem of multiple truths in data from clustering and pattern mining perspectives. Mach. Learn. 98, 121–155 (2015).
68. E. Levine, E. Domany, Resampling method for unsupervised estimation of cluster validity. Neural Comput. 13, 2573–2593 (2001).
69. P. Bellec, P. Rosa-Neto, O. C. Lyttelton, H. Benali, A. C. Evans, Multi-level bootstrap analysis of stable clusters in resting-state fMRI. Neuroimage 51, 1126–1139 (2010).
70. T. Lange, V. Roth, M. L. Braun, J. M. Buhmann, Stability-based validation of clustering solutions. Neural Comput. 16, 1299–1323 (2004).
71. A. K. Jain, Data clustering: 50 years beyond K-means. Pattern Recogn. Lett. 31, 651–666 (2010).
72. P. Berkhin, "Survey of clustering data mining techniques" (Technical Report, Accrue Software Inc., San Jose, CA, 2002).
73. P. A. Jaskowiak, R. J. G. B. Campello, I. G. Costa, On the selection of appropriate distances for gene expression data clustering. BMC Bioinformatics 15 (Suppl. 2), S2 (2014).
74. R. Giancarlo, G. Lo Bosco, L. Pinello, Learning and Intelligent Optimization 6073, 125–138 (2010).
75. Y. Liu, Z. Li, H. Xiong, X. Gao, J. Wu, Understanding of internal clustering validation measures. 2010 IEEE Int. Conf. Data Min., 911–916 (2010).
76. I. B. Mohamad, D. Usman, Standardization and its effects on K-means clustering algorithm. Res. J. Appl. Sci. Eng. Technol. 6, 3299–3303 (2013).
77. C. C. Aggarwal, A. Hinneburg, D. A. Keim, On the surprising behavior of distance metrics in high dimensional space. Lect. Notes Comput. Sci. 1973, 420–434 (2001).

Acknowledgments: We thank T. Crum for critical reading. Funding: T.R. is funded, in part, by the Center for Biological Systems Engineering, Washington University in St. Louis (St. Louis, MO). Author contributions: T.R. and K.M.N. contributed equally to the conception, writing, research, and final approval of the paper. Z.Q. researched and implemented clustering validation metrics and their dependencies. Competing interests: The authors declare that they have no competing interests. Data and materials availability: All supplemental executable code and file dependencies are available for checkout from the GitHub repository: https://github.com/knaegle/clusteringReview.

Submitted 7 August 2015
Accepted 27 May 2016
Final Publication 14 June 2016
10.1126/scisignal.aad1932
Citation: T. Ronan, Z. Qi, K. M. Naegle, Avoiding common pitfalls when clustering biological data. Sci. Signal. 9, re6 (2016).
