

On Two-Way Bayesian Agglomerative Clustering of Gene Expression Data

Anna Fowler∗ and Nicholas A. Heard

Department of Mathematics, Imperial College London, London, UK

Received 4 November 2011; revised 13 April 2012; accepted 6 August 2012. DOI: 10.1002/sam.11162

Published online in Wiley Online Library (wileyonlinelibrary.com).

Abstract: This article introduces an agglomerative Bayesian model-based clustering algorithm which outputs a nested sequence of two-way cluster configurations for an input matrix of data. Each two-way cluster configuration in the output hierarchy is specified by a row configuration and a column configuration whose Cartesian product partitions the data matrix. Variable selection is incorporated into the algorithm by identifying row clusters which form distinct groups defined by the column clusters, through the use of a mixture model. A primitive similarity measure between the two clusters is the multiplicative change in model posterior probability implied by their merger, and the hierarchy is formed by iteratively merging the cluster pair which maximize some fixed monotonic function of this quantity. A naive implementation of the algorithm would be to choose this function to be the identity function. However, when applying this naive algorithm to gene expression data where the number of genes being studied typically far exceeds the number of experimental samples available, this imbalanced dimensionality of the data results in an algorithmic bias toward merging samples. To counteract this bias, alternative functions of the similarity measure are considered which prevent degenerative behavior of the algorithm. The resulting improvements in the output cluster configurations are demonstrated on simulated data and the method is then applied to real gene expression data. © 2012 Wiley Periodicals, Inc. Statistical Analysis and Data Mining, 2012

Keywords: two-way clustering; variable selection; gene expression

1. INTRODUCTION

Gene expression data are routinely collected using microarrays which enable the simultaneous measurement of a large number of genes for each experimental sample unit. In contrast, obtaining a large number of samples for a particular study is often prohibitively expensive [1]. This leads to data matrices which are typically imbalanced, consisting of measurements for thousands of rows of genes over usually less than a hundred columns of experimental samples. Clustering the rows of these data matrices enables groups of co-regulated genes to be identified and therefore helps to determine the function of genes. Since the motivation for this article is clustering of gene expression data, we will use the words row and gene interchangeably, and similarly the words column and sample; however, the methods presented could equally be applied in a variety of other contexts.

A Bayesian model-based approach to clustering is adopted since this enables us to incorporate any prior beliefs about the data and the possibility of large amounts of noise within the model. Additionally, the posterior probability mass function provides a means of quantitatively comparing models consisting of any number of clusters and therefore provides a natural solution to the recurring issue of how many clusters should be fitted [2].

∗Correspondence to: Anna Fowler ([email protected]). Additional Supporting Information may be found in the online version of this article.

There are two main approaches to Bayesian model-based clustering. The first relies upon Markov chain Monte Carlo (MCMC) methods to obtain a sample from the posterior distribution and uses this to infer an optimal cluster configuration [3–6]. The second deterministically clusters the data agglomeratively by initially placing each observation in a separate cluster and then iteratively merging the cluster pair which provide the largest multiplicative change in posterior probability [7–9]. The number of possible cluster configurations of n objects is given by the nth Bell number [10], which grows very rapidly with n; for example, the 50th Bell number is 1.86 × 10^47. Therefore, when clustering genes, there is a huge number of possible cluster configurations and obtaining a true sample from the posterior distribution using MCMC methods can be difficult since chains are likely to converge to a local maximum or require an unfeasibly long running time. In contrast, agglomerative clustering aims to evaluate a much smaller subset of credible cluster configurations, and so can be applied even to very large data sets.

In addition to genes forming co-regulated clusters, samples may also have an underlying group structure, for example corresponding to different, but related, experimental conditions. Thus it is often of interest to simultaneously cluster genes and samples, identifying groups of genes which are co-regulated within groups of experimental conditions. In this case, clusters of a data matrix are defined as sub-matrices which can be obtained by permuting the row and column vectors [11]. In the literature, this form of cluster is referred to in many ways, such as 'bi-cluster', 'sub-space cluster' and 'two-way cluster'. Here they will be referred to as two-way clusters since the method presented develops a cluster configuration which consists of two elements, a partition of the columns and a partition of the rows.

There are many existing methods of two-way clustering; for an in-depth review, see ref. 12. A popular model-based method is the Plaid model [13–16], which uses an ANOVA model to describe a cluster by a shift in the mean expression away from a background noise level. Under this model, the ijth element of a data matrix Y is expressed as

$$Y_{ij} = \mu_0 + \sum_{k=1}^{K} (\mu_k + \alpha_{ik} + \beta_{jk})\, r_{ik} c_{jk} + \varepsilon_{ij},$$

where K is the number of clusters; r_ik and c_jk are binary variables indicating membership of the kth cluster for row i and column j, respectively; α_ik and β_jk are the effect of the ith row and jth column in the kth cluster; and ε_ij is the residual error [13]. This model allows overlapping clusters and so does not necessarily partition the data. The flexibility of this model is demonstrated by Izenman et al. [17], where it is used to model a combination of discrete and continuous variables. Alternative models include those used in refs 18 and 19 which allow for a change in mean, variance or both and return a column configuration specific to each row cluster.

The cluster structure consisting of a row configuration and a column configuration which partition the data is referred to by Kluger et al. [20] as a 'checker board' pattern. In that paper, a spectral biclustering method identifies the clusters by using singular value decomposition to solve the eigenvalue problems $A^T A r = \lambda r$ and $A A^T c = \lambda c$, where A is a normalized form of the original data matrix and r and c are row and column allocation vectors. The same structure is assumed in the method by Cho et al. [21] which uses iterative algorithms to minimize a sum of square residue and uncover this structure.

Here, we propose extending the Bayesian agglomerative clustering (BAC) method of Heard et al. [9] for gene expression clustering, to enable identification of two-way clusters. At each stage of the hierarchy, either a row or a column merge is performed. The resulting hierarchy contains a nested sequence of two-way cluster configurations. The two-way configuration with the highest posterior probability is returned as the optimal configuration visited. In contrast, MCMC methods are used by Meeds and Roweis [22] to discover this underlying cluster structure, whereas the algorithm presented here is deterministic.

In its naive form, the two-way algorithm iteratively merges either the row clusters or column clusters which cause the largest multiplicative change in posterior probability. However, when using this simplistic approach to cluster gene expression data, the imbalanced dimensionality causes an algorithmic bias toward merging samples. Carefully chosen monotonic functions of the change in posterior probability must be considered to correct this bias.

When clustering gene expression data, particularly in genome-wide microarray studies, the high dimensionality often means that some of the group structure of the samples is only reflected by some of the genes [23,24]. In this case, the genes which do not reflect the sample structure may obscure it, so it is necessary to base sample clusters only on the relevant genes. Friedman and Meulman [25] achieve this by assigning sample cluster-specific weights to each gene, such that genes contribute to sample clusters according to their weighting. Alternatively, genes can be split into those which discriminate between sample clusters and those which do not, allowing sample clusters to be formed based on the discriminating genes. This form of variable selection is applied in refs 24 and 26 through the use of an MCMC sampler which simultaneously selects discriminating genes and uses them to cluster the samples.

Variable selection can be incorporated into the two-way clustering algorithm through the use of a two-component mixture model for each row cluster. The first component represents row clusters which reflect the structure of the column clusters, whereas the second component represents those which do not vary with the column clusters.

The structure of this article is as follows. Section 2 contains the two-way BAC algorithm along with an exploration of the problem of dimensionality which occurs in the naive implementation of the algorithm. Incorporating variable selection into the two-way BAC algorithm and the implications for the problem of dimensionality are discussed in Section 3. Results for simulated and real data are presented in Section 4 and a discussion of the methods is contained in Section 5.

2. TWO-WAY BAC

For an n × p gene expression data matrix, Y, in which each column represents an individual sample and each row corresponds to a single gene, we propose the following modeling strategy for two-way clustering. Let R = {R_1, . . . , R_{|R|}} be a partition of the integers {1, . . . , n}, and C = {C_1, . . . , C_{|C|}} be a partition of the integers {1, . . . , p}, such that R and C denote cluster configurations for the rows and columns of Y, respectively. The Cartesian product of R and C will partition the data matrix into two-way clusters such that the (i, j)th element of Y, y_ij, is assigned to the two-way cluster (l, m) if and only if i ∈ R_l and j ∈ C_m. Independent distributions with a common parametric form of probability density function f are assumed for each two-way cluster, such that the likelihood for the data is defined as

$$L(Y \mid R, C, \theta) = \prod_{l=1}^{|R|} \prod_{m=1}^{|C|} \prod_{\substack{i \in R_l \\ j \in C_m}} f(y_{ij} \mid \theta_{lm}), \qquad (1)$$

where θ is the matrix whose (l, m)th entry, θ_lm, is the vector of parameters associated with the two-way cluster (l, m).

This model assumes that genes in the same cluster are co-regulated across all sample clusters, and similarly samples in the same cluster are co-regulated across all clusters of genes. In targeted genetic studies, the genes examined are those which are likely to respond differently under different experimental conditions and which would therefore fit this assumption.

When clustering the data, it is the cluster configuration (R, C) which is of primary interest; therefore the objective function π(R, C) which we seek to maximize is defined up to proportionality by the marginal posterior probability mass function,

$$\pi(R, C) = \int L(Y \mid R, C, \theta)\, p(\theta \mid R, C)\, p(R, C)\, d\theta, \qquad (2)$$

where p(R, C) is the prior probability mass function for the cluster configuration and p(θ | R, C) is the density of the prior distribution for the cluster parameters.

In principle, any distribution could be used for each of the independent two-way clusters, and the choice of distribution should be based upon the type of data being clustered. We assume the following distributions independently for each two-way cluster (l, m):

$$y_{ij} \overset{\mathrm{iid}}{\sim} N(\mu_{lm}, \sigma^2_{lm}), \quad i \in R_l,\; j \in C_m,$$
$$\mu_{lm} \mid \sigma^2_{lm} \sim N(0, V\sigma^2_{lm}), \qquad (3)$$
$$\sigma^2_{lm} \sim IG(a, b).$$

Gene expression data are typically normalized and therefore the assumption of normality within each cluster is frequently used [3–5,22,24,27]; this form also provides us with a simple setting in which to demonstrate the properties of the algorithm. The choice of conjugate priors on the cluster parameters is recommended since their use ensures π(R, C) is analytically available and repeated calculations of this quantity are computationally feasible.
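To make the conjugate calculation concrete, the following is a minimal sketch in R (R being the language of the implementation described in Section 4) of the closed-form log marginal likelihood of a single two-way cluster under the Normal–Inverse-Gamma model of Eq. (3); the function and argument names are illustrative rather than taken from the supporting code. Up to the configuration prior p(R, C), log π(R, C) is then simply the sum of these terms over the two-way clusters defined by R and C.

```r
# Log marginal likelihood of the observations y in one two-way cluster under
#   y_i ~ N(mu, sigma^2),  mu | sigma^2 ~ N(0, V sigma^2),  sigma^2 ~ IG(a, b),
# obtained from the standard Normal-Inverse-Gamma conjugate update.
log_marginal <- function(y, V = 1, a = 2, b = 1) {
  y <- y[!is.na(y)]                  # missing values simply drop out of the product
  n <- length(y)
  if (n == 0) return(0)              # an empty cluster contributes a factor of 1
  kappa0 <- 1 / V                    # prior precision factor for the cluster mean
  kappan <- kappa0 + n
  ybar   <- mean(y)
  an <- a + n / 2
  bn <- b + 0.5 * sum((y - ybar)^2) + 0.5 * kappa0 * n * ybar^2 / kappan
  -0.5 * n * log(2 * pi) + 0.5 * (log(kappa0) - log(kappan)) +
    lgamma(an) - lgamma(a) + a * log(b) - an * log(bn)
}
```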

When specifying a prior for (R, C), exchangeability is assumed so that a priori no two observations are more likely to belong to the same cluster. The simplest prior under this assumption is a uniform distribution over the space of all possible cluster configurations. Other popular priors include the Dirichlet process [26] or specifying the number of clusters to follow a uniform distribution and their sizes to follow a multinomial-Dirichlet distribution [27].

Missing values are very common in gene expression data for several reasons, such as experiment error or image corruption. An advantage of model-based clustering is that missing data present no extra methodological challenge. For Eq. (1), all non-missing observations assigned to the two-way cluster (l, m) are assumed to independently follow the same distribution with a common parameter θ_lm, and calculation of the objective function again follows Eq. (2).

2.1. Measuring Cluster Similarity

Agglomerative clustering requires the specification of either a distance or similarity measure between any pair of clusters. For a particular cluster configuration (R, C), a similarity measure for two row clusters or two column clusters is now defined as the multiplicative change in posterior probability which results from their merger,

$$s_{kl} = \begin{cases} \dfrac{\pi(R^{kl}, C)}{\pi(R, C)} & \text{if } k \text{ and } l \text{ are row clusters,} \\[2ex] \dfrac{\pi(R, C^{kl})}{\pi(R, C)} & \text{if } k \text{ and } l \text{ are column clusters,} \end{cases} \qquad (4)$$

where $(R^{kl}, C)$ and $(R, C^{kl})$ are the configurations obtained from (R, C) by merging the two row clusters or two column clusters k and l, respectively.

This similarity measure provides a comparison between two models of different parameter dimension which can be performed fairly in the Bayesian framework, since the paradigm carries a natural penalty against over-fitting known as Occam's razor [28]. This states that provided both models are reasonably supported by the data, the simplest model will always be preferred. Since there are fewer parameters in the simpler model, their prior distribution is concentrated in a smaller area than that of those in the more complex model. Therefore the parameters in the simpler model are assigned a higher prior probability. It is this property which allows Bayesian model-based clustering to make inference on the number of clusters.


2.2. Two-Way BAC Algorithm

The agglomerative clustering algorithm is initiated by placing all entries of the data matrix in np individual two-way clusters, such that R = {{1}, . . . , {n}} and C = {{1}, . . . , {p}}. The pair of row or column clusters (k, l) which maximize some monotonic function of the similarity measure, g(s_kl), are then iteratively merged until finally a single two-way cluster remains; that is, until R = {{1, . . . , n}} and C = {{1, . . . , p}}. This iterative algorithm deterministically creates a nested sequence of cluster configurations, each of which partitions the data into increasingly large sub-matrices. The optimal cluster configuration is identified as the configuration in the hierarchy with the largest marginal posterior probability, π(R, C), and this determines the optimal number of clusters. A detailed algorithm is given in Algorithm 1. It is important to note that the cluster configurations explored by Algorithm 1 are nested, and hence any row or column cluster formed at one stage is preserved for the remainder of the hierarchy.

While constructing the hierarchy of cluster configurations, a merge of two row clusters alters the value of the similarity measure for all pairs of column clusters, as well as the value of the similarity measure for the merged row clusters with all other row clusters, and vice versa for columns. Therefore, for an n × p matrix, the computational complexity of the algorithm is O(n(n + p²) + p(p + n²)).
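A minimal sketch of the greedy loop that Algorithm 1 formalizes is given below, assuming a uniform configuration prior so that log π(R, C) is, up to a constant, a sum of the log marginal likelihoods computed by the log_marginal() helper above. All helper names are ours, and the full objective is naively recomputed for every candidate merge for clarity; a practical implementation caches the per-cluster marginal likelihoods so that only the merged cluster needs re-evaluation.

```r
# Sum of per-cluster log marginal likelihoods over the Cartesian product of R and C
log_post <- function(Y, R, C) {
  total <- 0
  for (r in R) for (cm in C) total <- total + log_marginal(Y[r, cm, drop = FALSE])
  total
}

two_way_bac <- function(Y, g = function(log_s, type, R, C) log_s) {  # naive g: identity
  R <- lapply(seq_len(nrow(Y)), identity)      # singleton row clusters
  C <- lapply(seq_len(ncol(Y)), identity)      # singleton column clusters
  lp <- log_post(Y, R, C)
  best <- list(R = R, C = C, log_post = lp)
  while (length(R) > 1 || length(C) > 1) {
    moves <- list()
    if (length(R) > 1)                                   # candidate row merges
      for (k in seq_len(length(R) - 1)) for (l in (k + 1):length(R))
        moves[[length(moves) + 1]] <-
          list(R = c(R[-c(k, l)], list(c(R[[k]], R[[l]]))), C = C, type = "row")
    if (length(C) > 1)                                   # candidate column merges
      for (k in seq_len(length(C) - 1)) for (l in (k + 1):length(C))
        moves[[length(moves) + 1]] <-
          list(R = R, C = c(C[-c(k, l)], list(c(C[[k]], C[[l]]))), type = "col")
    lps    <- sapply(moves, function(m) log_post(Y, m$R, m$C))
    scores <- mapply(function(m, v) g(v - lp, m$type, R, C), moves, lps)  # g of log s_kl
    i  <- which.max(scores)
    R  <- moves[[i]]$R; C <- moves[[i]]$C; lp <- lps[i]
    if (lp > best$log_post) best <- list(R = R, C = C, log_post = lp)
  }
  best        # highest-posterior configuration visited in the hierarchy
}
```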

2.3. Naive Implementation of Two-Way Clustering

The simplest form of the two-way BAC algorithm takes the function g of the similarity measure in Eq. (4) to be the identity function. However, when clustering gene expression data, the imbalanced dimensionality leads this particular choice of g to automatically prefer a column merge to a row merge, resulting in a degenerative procedure whereby columns are merged first regardless of the data. For instance, consider the first merge of the n × p data matrix Y. This reduces the number of parameter vectors in the two-way cluster model by p if rows are merged, but by n if columns are merged. While both propose a simplified model, since n ≫ p the model proposed by merging columns is much simpler and therefore, by the arguments in Section 2.1, more likely to be preferable in terms of π.

This problem is best illustrated by considering an n × p data matrix of zeros. We assume the independent Gaussian model in Eq. (3) with a uniform prior over the set of all possible cluster configurations. If we consider the first merge of this matrix, the similarity measure given in Eq. (4) reduces to

$$s^{(r)} = \left\{ \frac{P(0,0)}{P(0)P(0)} \right\}^{p} \quad \text{for any two row clusters,}$$
$$s^{(c)} = \left\{ \frac{P(0,0)}{P(0)P(0)} \right\}^{n} \quad \text{for any two column clusters,} \qquad (5)$$

where P(0,0) and P(0) are marginal likelihoods for a cluster containing two 0s and a cluster containing a single 0, respectively. Thus, since n ≫ p, to prove that column mergers are favored we require that P(0,0) > P(0)P(0), or equivalently,

$$\frac{P(0,0)}{P(0)P(0)} = \frac{1+V}{\sqrt{1+2V}}\,\frac{\Gamma(a)\Gamma(a+1)}{\Gamma(a+1/2)^2} > 1. \qquad (6)$$

Clearly $(1+V)^2 > 1+2V$, and the inequality follows from the convexity property of the log-gamma function:

$$\log\Gamma(a+1/2) \le \tfrac{1}{2}\log\Gamma(a) + \tfrac{1}{2}\log\Gamma(a+1) \;\Rightarrow\; \Gamma(a+1/2)^2 \le \Gamma(a)\Gamma(a+1).$$

So clearly, without any adjustment to the naive two-way BAC algorithm, column mergers would be preferred.

The problem stems from the fact that a single row merger and a single column merger are not comparable operations, since they do not reduce the number of model parameters by the same degree. Rather, a single column merger should be compared against a combination of n/p row mergers. However, this would lead to a combinatorial increase in model comparisons which would reduce the simplicity and speed of the algorithm. Instead, in the next section we seek to penalize column mergers against row mergers where appropriate so that the two operations become comparable.
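As a quick numerical illustration of inequality (6), the following lines evaluate the ratio for the illustrative values V = 1 and a = 2 (the ratio does not depend on b) and confirm that it agrees with the difference of log marginal likelihoods computed by the log_marginal() sketch from above.

```r
# Left-hand side of Eq. (6): P(0,0) / {P(0) P(0)} as a function of V and a
merge_ratio <- function(V, a)
  (1 + V) / sqrt(1 + 2 * V) * exp(lgamma(a) + lgamma(a + 1) - 2 * lgamma(a + 0.5))

merge_ratio(V = 1, a = 2)                                      # about 1.31, i.e. > 1
# the same quantity, on the log scale, from the conjugate marginal likelihoods
log_marginal(c(0, 0), V = 1, a = 2) - 2 * log_marginal(0, V = 1, a = 2)
```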

2.4. Dimensionality Adjusted Implementation of Two-Way BAC

The degenerative behavior occurring in the naive implementation of Algorithm 1 can prohibit full two-way clustering if row mergers are eventually based on degenerate column cluster configurations. As a result, the algorithm in its naive form does not perform well when used on non-square data matrices.

To redress this imbalance between row and column mergers, we consider an alternative function of the similarity measure which accounts for the difference in dimensionality:

$$g(s_{kl}) = \begin{cases} s_{kl}^{1/|C|} & \text{if } k \text{ and } l \text{ are row clusters,} \\[1ex] s_{kl}^{1/|R|} & \text{if } k \text{ and } l \text{ are column clusters.} \end{cases} \qquad (7)$$

This transformation of the similarity measure is simply the geometric average of the change in objective function across all two-way clusters affected by the merge. Clearly, from inspection of Eq. (5), this scaling would remove any algorithmic preference for merging columns or rows. Note that if the data set contains missing values, then the initial singleton clusters containing these values will be empty and as such have no parameters associated with them, and this would need to be taken into consideration when scaling by dimensionality; the quantities |R| and |C| would be replaced by the number of common non-empty row or column clusters shared in the merge.
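On the log scale, the power transformation in Eq. (7) becomes division of the log similarity by the number of column (or row) clusters affected by the merge; one possible encoding, compatible with the two_way_bac() sketch given earlier, is shown below.

```r
# Dimensionality-scaled g of Eq. (7): s_kl^(1/|C|) for a row merge and s_kl^(1/|R|)
# for a column merge, i.e. division of log s_kl by the relevant cluster count.
g_dim <- function(log_s, type, R, C) {
  if (type == "row") log_s / length(C) else log_s / length(R)
}
# usage (hypothetical): fit <- two_way_bac(Y, g = g_dim)
```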

3. VARIABLE SELECTION WITHIN TWO-WAY BAC

The model described in Eq. (1) assumes that all row clusters are partitioned by the column clusters. However, when the number of genes is very large, for example in genome-wide microarray studies rather than studies on targeted genes, it is likely that not all genes will reflect the group structure of the samples [24]. Therefore, we need to consider altering our model to incorporate the assumption that only some of the gene clusters are partitioned by the sample clusters. This ensures that the true sample cluster configuration is not masked by non-discriminating genes, and also enables us to select the gene clusters which discriminate between sample groups.

3.1. Two Component Mixture Model

A mixture model with two components is fitted to each row cluster. The first component assumes that a row cluster consists of two-way clusters defined by the column clusters, and represents the discriminating genes. The second assumes that all data from the rows in a cluster independently follow a single distribution; this component corresponds to gene clusters which do not distinguish between the samples.

Let D_l be the event that the data Y_l in the row cluster R_l were generated by the first component (and so discriminate between the samples) and D'_l the event that they were generated by the second. The marginal likelihood for the data under this model is

$$\prod_{l=1}^{|R|} \left\{ \omega P(Y_l \mid D_l) + (1-\omega) P(Y_l \mid D'_l) \right\}, \qquad (8)$$

where ω is the prior probability that a gene cluster will be discriminating and

$$P(Y_l \mid D_l) = \prod_{m=1}^{|C|} \int \prod_{i \in R_l} \prod_{j \in C_m} f_0(y_{ij} \mid \theta_{l,m})\, p_0(\theta_{l,m})\, d\theta_{l,m},$$
$$P(Y_l \mid D'_l) = \int \prod_{i \in R_l} \prod_{j=1}^{p} f_1(y_{ij} \mid \theta_l)\, p_1(\theta_l)\, d\theta_l,$$

where f_i, for i = 0, 1, is the density for the first and second components, respectively; θ_{l,m} are the parameters associated with f_0 for the (l, m)th two-way cluster, with prior density p_0; and θ_l are the parameters associated with f_1 for the lth row cluster, with prior density p_1. The posterior probability of a row cluster discriminating between the sample clusters is

$$P(D_l \mid Y_l) = \frac{\omega P(Y_l \mid D_l)}{\omega P(Y_l \mid D_l) + (1-\omega) P(Y_l \mid D'_l)} \qquad (9)$$
$$\propto \omega \prod_{m=1}^{|C|} \int \prod_{j \in C_m} \prod_{i \in R_l} f_0(y_{ij} \mid \theta_{l,m})\, p_0(\theta_{l,m})\, d\theta_{l,m}.$$

If P(D_l | Y_l) > P(D'_l | Y_l) then the lth row cluster is more likely to have been generated by the first component, and we might infer that those genes reflect the sample cluster configuration.

This model can be used directly in Algorithm 1, which will return an optimal cluster configuration for both rows and columns. In order to determine how likely it is that each returned row cluster is generated by the first component, the quantity given in Eq. (9) must then be calculated.
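A sketch of this calculation, assuming the Normal–Inverse-Gamma form of Eq. (3) for both components with separate (illustrative) hyperparameters and reusing the log_marginal() helper; the computation is kept on the log scale to avoid underflow, and the names are ours.

```r
# P(D_l | Y_l) of Eq. (9) for a row cluster Yl (a matrix of its rows by all p
# columns), given a list C of column-cluster index vectors.
lse <- function(x) { m <- max(x); m + log(sum(exp(x - m))) }   # log-sum-exp

p_discriminating <- function(Yl, C, omega = 0.3,
                             prior0 = list(V = 1, a = 2, b = 1),   # discriminating component
                             prior1 = list(V = 1, a = 2, b = 4)) { # non-discriminating component
  # first component: independent two-way clusters defined by the column clusters
  log_D  <- sum(sapply(C, function(cm)
    log_marginal(Yl[, cm, drop = FALSE], prior0$V, prior0$a, prior0$b)))
  # second component: a single distribution for the whole row cluster
  log_Dc <- log_marginal(Yl, prior1$V, prior1$a, prior1$b)
  num <- log(omega) + log_D
  exp(num - lse(c(num, log(1 - omega) + log_Dc)))
}
```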

3.2. Problem of Dimensionality

The parameter ω requires some consideration. It serves as the probability that a row cluster is generated by the first component, but it also acts as a regularization parameter which potentially balances the effect of dimensionality.

If ω = 1, clustering is based entirely on the first component and leads to the problem of dimensionality in the naive implementation of Algorithm 1, discussed in Section 2.3, where column mergers are favored. However, if ω = 0, clustering is based entirely on the second component, in which column mergers have no effect, so row merges are automatically preferred provided they increase the objective function. An approximate balance between these two extremes will occur for some values of ω in [0, 1] and counteract the bias in the naive algorithm.

We illustrate this by again considering the first merger in an n × p data matrix of zeros. The independent Gaussian model given in Eq. (3) is assumed for each two-way cluster for the first component, and for each row cluster for the second component, together with a uniform distribution for the cluster configuration. Under the mixture model Eq. (9), the similarity measure simplifies to

$$s^{(r)} = \frac{\omega P_0(2)^p + (1-\omega)P_1(2p)}{\{\omega P_0(1)^p + (1-\omega)P_1(p)\}^2} \quad \text{for any two row clusters,}$$
$$s^{(c)} = \frac{\{\omega P_0(1)^{p-2}P_0(2) + (1-\omega)P_1(p)\}^n}{\{\omega P_0(1)^p + (1-\omega)P_1(p)\}^n} \quad \text{for any two column clusters,}$$

where $P_j(x)$ for j = 0, 1 is the marginal likelihood for a cluster containing x zeros under the component density $f_j$ and parameter prior $p_j$. In order to show that there exists an ω which makes these two quantities equal, we define the continuous function

$$\Lambda(\omega) = \log(s^{(r)}) - \log(s^{(c)}).$$

By the intermediate value theorem, to show that there exists an ω ∈ (0, 1) such that Λ(ω) = 0, we only need to show that Λ(0) > 0 > Λ(1), as follows. First,

$$\Lambda(1) = (p-n)\log\left\{\frac{P_0(2)}{P_0(1)^2}\right\} < 0$$

since p − n < 0 and log{P_0(2)/P_0(1)²} > 0 from Eq. (6). Second,

$$\Lambda(0) = \log\left\{\frac{P_1(2p)}{P_1(p)^2}\right\} = \log\left\{\frac{1+pV}{\sqrt{1+2pV}}\,\frac{\Gamma(a)\Gamma(a+p)}{\Gamma(a+p/2)^2}\right\} > 0$$

since $(1+pV)^2 > 1+2pV$ and $\log\Gamma(a) + \log\Gamma(a+p) \ge 2\log\Gamma(a+p/2)$ by the convexity property of the log-gamma function.

The regularization property which ω possesses reduces the need for scaling of the similarity measure, provided the chosen value happens to lie near the root of Λ(ω). The precise value of ω for which Λ(ω) = 0 depends on the dimensions of the data and the other parameters of the probability model. A plot of Λ(ω) under this model is shown in Fig. 1 for 0 ≤ ω ≤ 1, using n = 2000 and p = 20 as fairly typical dimensions of a gene expression data set. The prior parameters of model (3) used are V = 1, a = 2 and b = 1 for the first component, and V = 1, a = 2 and b = 4 for the second component. This choice of priors assumes that in clusters generated from the second component, the data have a larger variance than data in clusters generated from the first, reflecting the belief that under the first component, column clusters partition the row clusters into distinct clusters with low variance. In Fig. 1, the point where Λ(ω) = 0 is marked at approximately ω = 0.37. This plot also shows that it is only at the extreme values of ω that either rows or columns are significantly favored. Therefore, provided our prior belief in the value of ω does not lie in the extremes, there is likely to be some regularization occurring.

Fig. 1 Λ(ω) for 0 < ω < 1; the dashed line marks the point where Λ(ω) = 0.
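The calculation behind Fig. 1 can be reproduced with a short sketch: Λ(ω) is the log row-merge similarity minus the log column-merge similarity for the first merge of an n × p matrix of zeros, with the P_j(x) terms computed from the log_marginal() helper; uniroot() then locates the balancing value of ω, which Fig. 1 marks at roughly 0.37 for these settings. The names and structure below are ours.

```r
Lambda <- function(omega, n = 2000, p = 20,
                   prior0 = list(V = 1, a = 2, b = 1),
                   prior1 = list(V = 1, a = 2, b = 4)) {
  lse <- function(x) { m <- max(x); m + log(sum(exp(x - m))) }
  lP0 <- function(x) log_marginal(rep(0, x), prior0$V, prior0$a, prior0$b)  # log P_0(x)
  lP1 <- function(x) log_marginal(rep(0, x), prior1$V, prior1$a, prior1$b)  # log P_1(x)
  l_one <- lse(c(log(omega) + p * lP0(1), log1p(-omega) + lP1(p)))      # one unmerged row cluster
  l_row <- lse(c(log(omega) + p * lP0(2), log1p(-omega) + lP1(2 * p)))  # two row clusters merged
  l_col <- lse(c(log(omega) + (p - 2) * lP0(1) + lP0(2), log1p(-omega) + lP1(p)))  # two columns merged
  (l_row - 2 * l_one) - n * (l_col - l_one)                 # log s(r) - log s(c)
}
# uniroot(Lambda, interval = c(1e-6, 1 - 1e-6))$root
```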

It is possible that ω alone may not be sufficient to correct for the imbalance of dimensionality, and there may still be some algorithmic bias affecting the results. This algorithmic bias is now more complex than when using the model described in Eq. (1) since it is also influenced by ω; the function given in Eq. (7) does not correctly counteract the algorithmic bias under the mixture model, and no corresponding analytic results are available.

Instead, as a less model-specific correction, we aim to compare the maximum row similarity measure relative to all other row similarity measures with its counterpart for columns. Thus, we consider the following function:

$$g(s_{kl}) = \begin{cases} s_{kl}/\bar{s}_R & \text{if } k \text{ and } l \text{ are row clusters,} \\[1ex] s_{kl}/\bar{s}_C & \text{if } k \text{ and } l \text{ are column clusters,} \end{cases} \qquad (10)$$

where $\bar{s}_R$ and $\bar{s}_C$ are the means of the row or column similarity measures across all possible pairs of row or column clusters. Here, rather than attempting to make row mergers and column mergers directly comparable, we look to measure whether there is a particularly outstanding merger within these two classes of operation. Using this function in Algorithm 1 removes any initial algorithmic preferences which may occur when fitting a mixture model. However, as a correction it is less robust than the model-specific correction in Eq. (7), since it depends upon the similarity scores of the other possible cluster pairs at each stage in the agglomerative clustering algorithm.
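One possible encoding of this comparison at a single step of the algorithm, operating on the vectors of candidate log similarities for row pairs and column pairs; since s_kl divided by a mean is invariant to a common multiplicative constant, each type's maximum log similarity can be subtracted before exponentiating to avoid underflow. The helper name is ours.

```r
pick_merge <- function(log_s_row, log_s_col) {
  sr <- exp(log_s_row - max(log_s_row)); g_row <- sr / mean(sr)  # Eq. (10), row candidates
  sc <- exp(log_s_col - max(log_s_col)); g_col <- sc / mean(sc)  # Eq. (10), column candidates
  if (max(g_row) >= max(g_col)) {
    list(type = "row", pair = which.max(g_row))
  } else {
    list(type = "col", pair = which.max(g_col))
  }
}
```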

4. RESULTS

Simulated data are used to compare the naive, dimensionality- and mean-scaled forms of the two-way BAC algorithm. The most appropriate form suggested by these simulation studies is then applied to the yeast galactose gene expression data set of Ideker et al. [29]. In each analysis, we assume an independent normal distribution with conjugate priors as in Eq. (3) for each two-way cluster, and a Dirichlet process prior with concentration parameter α = 0.5 for R and C independently. Prior parameters for model Eq. (3) need to be chosen; here we utilize the empirical distribution of the whole data set. The parameters for the distribution of the cluster means {μ_lm} are chosen such that they permit clusters to form with centers which are close to the extremes of the data, by setting V as the maximum of the absolute values of the data. The parameters for the distribution of the cluster variances {σ²_lm} are chosen such that the median of the distribution is the median of the empirical distribution of row variances.

When fitting a mixture model, it is assumed that the variance of the first component is smaller than that of the second, reflecting the belief that expression levels for genes which are differentiated by the sample clusters are likely to have lower variance within those sample clusters. Alternative quantiles of the row variances are used to specify the median of the distributions for the cluster variances under the two cases; here we use the 0.75 quantile for clusters under the first component, and the 0.25 quantile for clusters under the second.
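A sketch of these empirical hyperparameter choices (the helper and its arguments are ours): V is taken from the range of the data, and b is chosen so that the IG(a, b) prior median matches the required quantile of the empirical row variances, using the fact that if σ² ~ IG(a, b) then 1/σ² ~ Gamma(a, rate = b).

```r
choose_priors <- function(Y, a = 2, var_quantile = 0.5) {
  V <- max(abs(Y), na.rm = TRUE)            # allows cluster centres near the data extremes
  target <- quantile(apply(Y, 1, var, na.rm = TRUE), var_quantile, na.rm = TRUE)
  # median(sigma^2) = b / qgamma(0.5, shape = a), so solve for b
  b <- as.numeric(target) * qgamma(0.5, shape = a)
  list(V = V, a = a, b = b)
}
# e.g. choose_priors(Y, var_quantile = 0.75) and choose_priors(Y, var_quantile = 0.25)
# for the two mixture components, and var_quantile = 0.5 for the single-component model
```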

To evaluate the cluster configurations returned by the algorithms against the true configurations, we use several different scores. First, we consider the difference between the value of the objective function for the returned configuration, π(R*, C*), and the value of the objective function for the true configuration, π(Rtrue, Ctrue). Additionally we calculate the sensitivity, specificity and their harmonic mean, known as the F1 measure [14]. The purity score of Heller and Ghahramani [8] is used to evaluate the identified hierarchy.

We implemented the two-way BAC algorithm using R [30] linked with C; the source code is available in the supporting materials. Using a standard desktop computer with a 2 GHz processor, implementation typically took 2 s to cluster a 100 × 100 matrix and 4 min to cluster a 2000 × 20 matrix, like those used in the simulations in Section 4.1. For a 3935 × 20 matrix, the same dimensions as the real data matrix used in Section 4.2, our implementation of the algorithm took 20 min.

4.1. Simulated Data

One hundred cluster configurations were generated for three different basic structures which determine the size of the data matrix and the form of the clusters. In each simulation the number of clusters and their sizes were generated at random, from a Dirichlet process with concentration parameter α = 0.5. The purpose is to evaluate the effectiveness of the different forms of the two-way BAC algorithm, for any possible cluster configuration, on data sets with basic structures which are known to influence the algorithm. Data matrices were then generated according to these configurations using the model given in Eq. (3) with parameter values V = 10, a = 2 and b = 1 in all cases.

The first basic structure is a square 100 × 100 matrix, and the second is a non-square 2000 × 20 matrix; both consist only of two-way clusters. The third basic structure is a non-square 2000 × 20 matrix which assumes that the row clusters are partitioned into two-way clusters by column clusters with probability 0.3, and otherwise consist of a single column cluster. Examples of cluster configurations generated under these basic structures are shown in Fig. 2.

Two non-Bayesian forms of two-way agglomerative clustering were also applied to the data sets for comparison. Like two-way BAC, these methods form hierarchies of cluster configurations consisting of a row configuration and a column configuration by iteratively merging either rows or columns of a data matrix. We use the Euclidean distance between pairs of row vectors and pairs of column vectors of an n × p data matrix, and therefore the distance measure is defined as

$$d_{kl} = \begin{cases} \sqrt{\sum_{i=1}^{p} (X_{ki} - X_{li})^2} & \text{if } k \text{ and } l \text{ are row vectors,} \\[2ex] \sqrt{\sum_{i=1}^{n} (X_{ik} - X_{il})^2} & \text{if } k \text{ and } l \text{ are column vectors,} \end{cases}$$

although other distance measures could be used. Complete linkage is used to calculate the distance between clusters since this method was found to perform best, producing compact clusters.

The dimensionality of the data matrix has the opposite effect when using this non-Bayesian distance measure to the effect it has on naive two-way BAC. When n ≫ p it is the rows, which consist of shorter vectors and consequently have smaller distances between them, which are clustered first. This contrasts with the naive two-way BAC algorithm, which would initially cluster columns under the same circumstances. Therefore, we also consider a scaled Euclidean distance measure:

$$d_{kl} = \begin{cases} \sqrt{\dfrac{1}{p}\sum_{i=1}^{p} (X_{ki} - X_{li})^2} & \text{if } k \text{ and } l \text{ are row vectors,} \\[2ex] \sqrt{\dfrac{1}{n}\sum_{i=1}^{n} (X_{ik} - X_{il})^2} & \text{if } k \text{ and } l \text{ are column vectors,} \end{cases}$$

to account for the contrasting dimensions of the row and column vectors.

Fig. 2 Examples of data matrices generated under the three different basic structures. (a) Square matrix; (b) non-square matrix; (c) non-square matrix with variable selection.


Fig. 3 Log π(R,C) for the hierarchies created when clustering a square data matrix. (a) Naive algorithm; (b) dimensionality-scaled algorithm; (c) mean-scaled algorithm.

Unlike two-way BAC, this algorithm is unable to automatically determine the number of biclusters, and so this information must be specified.

The Plaid model [14] and spectral biclustering [20], discussed in Section 1, were also applied to the simulated data sets for comparison. The R package 'biclust' [31] was used to implement both of these methods. Neither method creates a cluster hierarchy, so their results do not have a purity score. Unlike the model presented in this paper, the Plaid model does not partition the data and therefore the results for this model have no value for the objective function π(R,C). The biclusters identified by the Plaid model are evaluated by comparing them to the true clusters which are most similar.

Table 1 shows the mean scores from applying each clustering method to the square data matrices. All forms of the two-way BAC algorithm perform well, correctly identifying the true cluster configuration in the majority of simulations, demonstrated by the high purity and F1 scores. In the 100 simulations, a configuration with the highest objective function value was exclusively identified by the naive method 25 times, the dimensionality-scaled method 26 times, and the mean-scaled method 24 times; the remaining 25 were ties, 21 of which were between the naive and dimensionality-scaled methods.

To illustrate the behavior of each form of the two-way BAC algorithm, we examine the hierarchies created when clustering the data set from Fig. 2a generated under the first basic structure. Figure 3 shows log π(R,C) for all configurations in the hierarchies; configurations arising from column merges are denoted by crosses and otherwise changes are attributed to row merges. Since there are an equal number of rows and columns there is no initial algorithmic preference toward either type of merge, and the column merges are spread across the hierarchy when using any form of the algorithm. Note that even when using a scaled form of the algorithm, there is still some bunching of row and column merges. This is caused by a known property of agglomerative clustering, where there is a strong tendency for clusters included in the most recent merge to again be included in the next [32].

When the data matrix being clustered is square, the Euclidean and scaled-Euclidean methods are equivalent and so produce exactly the same cluster configuration. These methods are able to identify some configurations correctly, but the mean purity for the hierarchies created using these clustering methods is lower than the mean purity score for the hierarchies created using any form of two-way BAC.

Both the Plaid model and spectral biclustering are also able to identify some configurations correctly. However, the means for all evaluation measures, except specificity, are lower than those obtained using any form of the two-way BAC algorithm.

Table 1. Results for the square data matrices.

Method                  log[π(R*,C*)/π(Rtrue,Ctrue)]   Purity   Sensitivity   Specificity   F1
Naive                    −498.9                         0.808    0.996         0.731         0.753
Dimensionality-scaled    −328.6                         0.849    1.0           0.804         0.824
Mean-scaled              −416.1                         0.843    0.998         0.792         0.818
Euclidean                −9659.9                        0.682    0.431         0.855         0.468
Scaled-Euclidean         −9659.9                        0.682    0.431         0.855         0.468
Plaid model              —                              —        0.523         0.955         0.619
Spectral biclustering    −11311.1                       —        0.663         0.615         0.536


Table 2. Results for the non-square data matrices.

Method                  log[π(R*,C*)/π(Rtrue,Ctrue)]   Purity   Sensitivity   Specificity   F1
Naive                    −1533.2                        0.907    0.951         0.958         0.956
Dimensionality-scaled    −2.4                           0.996    0.999         0.998         0.998
Mean-scaled              −142.3                         0.915    0.999         0.878         0.903
Euclidean                −53037.24                      0.201    0.626         0.417         0.401
Scaled-Euclidean         −32591.7                       0.618    0.559         0.782         0.561
Plaid model              —                              —        0.708         0.753         0.625
Spectral biclustering    −28276.9                       —        0.567         0.244         0.251

The Plaid model has a lower sensitivity than specificity score, which indicates that although the clusters identified may contain the true clusters, they also contain other elements. The mean value of π(R,C) evaluated at the configurations returned by spectral biclustering is far lower than the mean value achieved by any form of the two-way BAC algorithm.

The non-square data matrices were then clustered using the same methods, and the mean results are contained in Table 2. Changing the dimensions of the data clearly has an effect on the performance of the two-way BAC algorithm, particularly on the objective function values obtained. The values of the objective function for the configurations identified by the naive method are far lower than those for the true configurations, whereas for the other two forms of the algorithm they are much closer. In the 100 simulations, a cluster configuration with the largest objective function is exclusively identified 48 times by the dimensionality-scaled method, 15 times by the mean-scaled method and never by the naive method. Of the remaining 37 simulations which result in ties, the naive method is included in 20 of them. The dimensionality-scaled algorithm also has the highest score across all measures.

Figure 4 shows the value of log π(R,C) for the hierarchies created by each form of the algorithm when clustering the data matrix shown in Fig. 2b, generated under the second basic structure. The hierarchy created by the naive method consists of column merges at the start of the hierarchy which provide initial large increases in the objective function. This is a result of the algorithmic bias discussed in Section 2.3. Column merges are more dispersed in the hierarchies which are formed using scaling, and which ultimately achieve a higher posterior probability.

The benefit of using scaled-Euclidean distance rather than Euclidean distance in hierarchical clustering is illustrated by these results, since the configurations identified using scaled-Euclidean distance have a much higher mean purity and objective function score. All clustering methods used for comparison have lower sensitivity, specificity and F1 scores than all forms of the two-way BAC algorithm. This indicates that the configurations identified are not as similar to the true configuration as those identified by any form of two-way BAC. Of the methods other than two-way BAC, the Plaid model performs the best, with the highest F1 measure.

Finally, the mixture model from Eq. (9) was used to cluster the non-square data matrices in which only some row clusters are partitioned by the column clusters. Table 3 shows the mean results obtained using the naive, dimensionality-scaled and mean-scaled forms of the two-way BAC algorithm.

Fig. 4 Log π(R,C) for the hierarchies created when clustering a non-square data matrix. (a) Naive algorithm; (b) dimensionality-scaled algorithm; (c) mean-scaled algorithm.


Table 3. Results for the non-square data matrices with variable selection.

Method                  log[π(R*,C*)/π(Rtrue,Ctrue)]   Purity   Sensitivity   Specificity   F1
Naive                    −1811.4                        0.917    0.946         0.998         0.964
Dimensionality-scaled    −122.2                         0.989    0.997         0.999         0.998
Mean-scaled              −249.1                         0.991    0.997         0.845         0.872
Euclidean                −53788.9                       0.077    0.625         0.287         0.316
Scaled-Euclidean         −59814.6                       0.215    0.797         0.907         0.805
Plaid model              —                              —        0.649         0.614         0.509
Spectral biclustering    −28212.6                       —        0.692         0.134         0.175
MCMC                     −2546.3                        —        0.466         0.606         0.484

Fig. 5 Log π(R,C) for the hierarchies created when clustering the non-square data matrix. (a) Naive algorithm; (b) dimensionality-scaled algorithm; (c) mean-scaled algorithm.

The naive method has the largest mean difference in objective function values, whilst the other forms of the two-way BAC algorithm have mean score differences which are reasonably close to 0. The dimensionality- and mean-scaled forms of the algorithm both perform very well across all evaluation measures. The dimensionality-scaled form, however, performs slightly better. In the 100 simulations, the configuration with the highest valued objective function was exclusively identified 73 times by the dimensionality-scaled algorithm, 15 times by the mean-scaled algorithm and by both on the remaining 12.

Figure 5 shows the value of log π(R,C) for each configuration in the hierarchies created when clustering the data matrix shown in Fig. 2c, generated under the third basic structure. In the hierarchy created by the naive form, column merges are initially preferred; however, they do not provide as large an increase in objective function as in the previous set of simulations (Fig. 4). This preference suggests that there is still a bias toward merging columns occurring as a result of the dimensionality of the data. The hierarchies created by the dimensionality- and mean-scaled forms of the algorithm suggest that, when scaling is used, there is no preference toward merging columns since the pattern of merges is non-degenerate.

In addition to the four methods applied to the previous simulated data sets, the MCMC method of Tadesse et al. [24], described in Section 1, was also applied to the data, assuming independence between observations. This method is only able to place genes into a discriminating or a non-discriminating cluster and thus has a low mean sensitivity score, and the value of the objective function for the identified cluster configuration is lower when using this method than when using the two-way BAC algorithm. The spectral biclustering, Euclidean and scaled-Euclidean hierarchical two-way clustering methods fully partition the data according to the row and column cluster configurations and as such are not able to identify the true cluster configuration.

Unless the data matrix being clustered is square, these simulations indicate that the naive method will not return a configuration with a higher objective function value than the mean- and dimensionality-scaled algorithms. Therefore, when clustering a real data set, both the mean- and the dimensionality-scaled algorithms should be considered since either could return a configuration with a higher objective function value.

4.2. Real Data

The yeast galactose data set of Ideker et al. [29] concentrates on the effect of the GAL gene family which enables cells to utilize galactose, a type of sugar, as a carbon source. In each sample, a pathway component has been perturbed either by deleting one of the GAL genes or by a wild-type perturbation. Each of these perturbations is performed in the presence and absence of galactose. A subset of this data set is often analyzed in the clustering literature [5,33]; however, the speed of the two-way BAC algorithm enables the analysis of the whole data set consisting of 3935 genes and 20 samples. The full data set can be obtained from idekerlab.ucsd.edu.

The reference condition containing no perturbations is labeled wt + gal and all expression levels are ratios relative to this condition. The data are on the log10 scale and no preprocessing for normalization was performed; however, ref. 20 contains a discussion of normalization processes which may be considered necessary for identifying this type of cluster structure in gene expression data, depending on the method used. Each value is a mean from four replicate experiments, and therefore the data set does not contain any missing values.

Two-way clustering of this data set aims to identify clusters of genes which respond similarly to genetic perturbations as well as clusters of genetic perturbations which produce a similar effect in genes. This can provide insights into the interactions between genes which could improve biological models and therefore aid prediction of cellular behavior.

The two component mixture model was fitted to the data since it is a genome-wide data set, and we therefore want to allow for the possibility that only some gene clusters reflect the structure of the samples. From the simulations, we saw that ω alone was not necessarily enough to counteract the effect of dimensionality, and so we consider the mean- and the dimensionality-scaled versions of the algorithm. Prior parameters were chosen as described above and we set ω = 0.3; however, similar results were obtained using several values of ω ∈ [0.2, 0.8]. For this data set, the mean-scaled form of the two-way BAC algorithm produced a cluster configuration with a higher objective function score than the dimensionality-scaled algorithm, and so the results presented are those generated by the mean-scaled algorithm.

The heat map in Fig. 6 shows the clustered data matrix, with white lines indicating the optimal configuration in the hierarchy and the dendrograms showing all later merges. The column labels indicate the GAL gene which was perturbed, with the suffix .gal indicating the presence of galactose, and the suffix .gal.1 indicating the absence. To ensure there is no algorithmic bias toward merging either genes or samples, we examine the hierarchy created. Figure 7 shows log π(R,C) for each configuration in the hierarchy, denoting cluster configurations resulting from a column merge with a cross, and attributing all other configurations to row merges. We see that the pattern of clustering is non-degenerate, indicating that there is no significant algorithmic bias affecting the results.

Fig. 6 Yeast-galactose data set; white lines indicate the optimal cluster configuration identified and dendrograms show all later merges.

Despite the choice of a reasonably low ω, almost all gene clusters were more likely to reflect the sample structure. This suggests that deleting the pathway components has some effect on most of the genes, indicating that these components are globally important in determining gene expression levels. Inspection of the sample dendrogram shows that the sample in which the gal80 gene is deleted in the absence of galactose remains in a singleton cluster until the final merge.


Fig. 7 Log π(R,C) for the hierarchy created when clustering the yeast-galactose data set.

gal80 is a known regulatory gene associated with galactose which exercises transcriptional control over other genes, and this hierarchy highlights the differences between gal80 and other genes in the GAL family.

Ideker et al. [29] choose to cluster only 997 genes from this data set which contained mRNA expression ratios significantly altered by the perturbations. Of these 997 genes, over half are contained in the second and third clusters which are distinctly colored, whereas very few are contained in the first cluster. The second and third distinctive clusters also contain genes which were not originally highlighted as significant but which this method has clearly identified.

The perturbations of the gal1, gal4, gal3 and gal7 genes in the presence of galactose are all placed in the same cluster as the perturbation of the same genes in the absence of galactose. This may suggest that these genes are not activated by the presence of galactose.

The Plaid model and the scaled Euclidean hierarchical two-way clustering algorithm were also applied to this data set, and plots of the returned clusters are contained in the supplementary files. The clusters identified by the Plaid model are similar to those identified by the two-way BAC algorithm, but do not partition the data and as such have a slightly different interpretation. The main cluster identified by the Plaid model places the predominantly downregulated (green) gene cluster at the bottom of Fig. 6 in a single cluster together with approximately half of the elements of the predominantly downregulated gene cluster in the center of Fig. 6. By partitioning the row clusters by column clusters, two-way BAC is able to identify more genes which potentially belong to this cluster of under expressed genes.

The scaled Euclidean algorithm requires the number of biclusters to be specified, so this was done by running the algorithm using several possible values and choosing the most successful. The clusters identified by this algorithm are not as significantly over or under expressed as the main clusters identified by two-way BAC, and it does not identify the same amount of cluster structure as the two-way BAC algorithm. One large sample cluster and a singleton sample cluster were identified by this algorithm. The singleton sample cluster contains the gal80.gal.1 sample, which was also identified by the two-way BAC algorithm as a singleton cluster and was the last cluster to merge with other column clusters in the hierarchy.

5. DISCUSSION

We have presented an algorithm for Bayesian hierarchical two-way clustering of a matrix of data which returns a hierarchy of two-way cluster configurations and identifies the optimal configuration within this hierarchy. It was shown that the intuitive similarity measure used to determine whether to merge rows or columns at each iteration needed to be scaled to prevent degenerative behavior caused by any imbalance in the dimensions of the data matrix.

The clustering model was extended to identify the row clusters which discriminate between the column clusters, a process which is equivalent to variable selection; the mixture model formulation was also seen to have a beneficial effect of reducing the need for scaling of the similarity measure for a large range of possible mixture weights.

The scaling methods required to employ two-way agglomerative clustering would naturally extend to data with an arbitrary number of dimensions, scaling each dimension in an identical way. Additionally, the model used could be trivially adapted to include further data sources such as replicated data, and so provides a very flexible framework.

REFERENCES

[1] D. J. Hand and N. A. Heard, Finding groups in gene expression data, J Biomed Biotechnol 2 (2005), 215–225.

[2] C. Fraley and A. E. Raftery, How many clusters? Which clustering method? Answers via model-based cluster analysis, Comput J 41 (1998), 578–588.

[3] M. Medvedovic and S. Sivaganesan, Bayesian infinite mixture model based clustering of gene expression profiles, Bioinformatics 18 (2002), 1194–1206.

[4] M. Medvedovic, K. Y. Yeung, and R. E. Bumgarner, Bayesian mixture model based clustering of replicated microarray data, Bioinformatics 20 (2004), 1222–1232.

[5] Z. S. Qin, Clustering microarray gene expression data using weighted Chinese restaurant process, Bioinformatics 22 (2006), 1988–1997.

[6] J. G. Booth, G. Casella, and J. P. Hobert, Clustering using objective functions and stochastic search, J Royal Stat Soc: B (Stat Methodol) 70 (2008), 119–139.

[7] C. Fraley and A. E. Raftery, Model-based clustering, discriminant analysis, and density estimation, J Am Stat Assoc 97 (2002), 611–631.


[8] K. A. Heller and Z. Ghahramani, Bayesian hierarchical clustering, In Proceedings of the 22nd International Conference on Machine Learning, 2005, 297–304.

[9] N. A. Heard, C. C. Holmes, and D. A. Stephens, A quantitative study of gene regulation involved in the immune response of anopheline mosquitoes: An application of Bayesian hierarchical clustering of curves, J Am Stat Assoc 101 (2006), 18–26.

[10] G. C. Rota, The number of partitions of a set, Am Math Mon 71 (1964), 498–504.

[11] J. A. Hartigan, Direct clustering of a data matrix, J Am Stat Assoc 67 (1972), 123–129.

[12] S. C. Madeira and A. L. Oliveira, Biclustering algorithms for biological data analysis: a survey, IEEE Trans Comput Biol Bioinformatics 1 (2004), 24–45.

[13] L. Lazzeroni and A. Owen, Plaid models for gene expression data, Statistica Sinica 12 (2002), 61–86.

[14] H. Turner, T. Bailey, and W. Krzanowski, Improved biclustering of microarray data demonstrated through systematic performance tests, Comput Stat Data Anal 48 (2005), 235–254.

[15] J. Gu and J. Liu, Bayesian biclustering of gene expression data, BMC Genomics 9 (2008), 4–14.

[16] J. Zhang, A Bayesian model for biclustering with applications, J Roy Stat Soc: Series C (Appl Stat) 59 (2010), 635–656.

[17] A. J. Izenman, P. W. Harris, J. Mennis, J. Jupin, and Z. Obradovic, Local spatial biclustering and prediction of urban juvenile delinquency and recidivism, Stat Anal Data Min 4 (2011), 259–275.

[18] P. D. Hoff, Model-based subspace clustering, Bayesian Anal 1 (2010), 321–344.

[19] H. Lian, Sparse Bayesian hierarchical modeling of high-dimensional clustering problems, J Multivariate Anal 101 (2010), 1728–1737.

[20] Y. Kluger, R. Basri, J. T. Chang, and M. Gerstein, Spectral biclustering of microarray data: coclustering genes and conditions, Genome Res 4 (2003), 703–716.

[21] H. Cho, I. S. Dhillon, Y. Guan, and S. Sra, Minimum sum-squared residue co-clustering of gene expression data, In Proceedings of the Fourth SIAM International Conference on Data Mining, 2004, 114–125.

[22] E. Meeds and S. Roweis, Nonparametric Bayesian biclustering, Technical Report, UTML TR, 2007.

[23] E. B. Fowlkes, R. Gnanadesikan, and J. R. Kettenring, Variable selection in clustering, J Classif 5 (1988), 205–228.

[24] M. G. Tadesse, N. Sha, and M. Vannucci, Bayesian variable selection in clustering high-dimensional data, J Am Stat Assoc 100 (2005), 602–618.

[25] J. H. Friedman and J. J. Meulman, Clustering objects on subsets of attributes, J Roy Stat Soc: Series B (Stat Methodol) 66 (2004), 815–849.

[26] S. Kim, M. G. Tadesse, and M. Vannucci, Variable selection in clustering via Dirichlet process mixture models, Biometrika 93 (2006), 877–893.

[27] J. W. Lau and P. J. Green, Variable selection in clustering via Dirichlet process mixture models, J Comput Graph Stat 16 (2007), 526–558.

[28] D. J. C. MacKay, Bayesian interpolation, Neural Comput 4 (1992), 415–447.

[29] T. Ideker, V. Thorsson, J. A. Ranish, R. Christmas, J. Buhler, J. K. Eng, R. Bumgarner, D. R. Goodlett, R. Aebersold, and L. Hood, Integrated genomic and proteomic analyses of a systematically perturbed metabolic network, Science 292 (2001), 929–934.

[30] R Development Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, 2010.

[31] S. Kaiser, R. Santamaria, M. Sill, R. Theron, L. Quintales, and F. Leisch, biclust: BiCluster Algorithms, R package version 0.9.1, 2009.

[32] N. A. Heard, Iterative reclassification in agglomerative clustering, J Comput Graph Stat 20 (2011), 920–936.

[33] K. Y. Yeung, M. Medvedovic, and R. E. Bumgarner, Clustering gene-expression data with repeated measurements, Genome Biol 4 (2003), 34.
