
High Performance Parallel/Distributed Biclustering Using Barycenter Heuristic

Arifa Nisar∗ Waseem Ahmad† Wei-keng Liao‡ Alok Choudhary§

Abstract

Biclustering refers to the simultaneous clustering of objects and their features. Its use is gaining momentum in areas such as text mining, gene expression analysis and collaborative filtering. Because large scale data processing applications such as collaborative filtering in e-commerce systems and genome-wide gene expression analysis in microarray experiments demand high performance, a high performance parallel/distributed solution to the biclustering problem is highly desirable. Recently, Ahmad et al. [1] showed that Bipartite Spectral Partitioning, a popular technique for biclustering, can be reformulated as a graph drawing problem whose objective is to minimize Hall's energy of the bipartite graph representation of the input data. They showed that the optimal solution to this problem is achieved when nodes are placed at the barycenter of their neighbors. In this paper, we provide a parallel biclustering algorithm based on this formulation. We show that parallel energy minimization using the barycenter heuristic is embarrassingly parallel. The challenge is to design a bicluster identification algorithm that is both scalable and accurate. We show that our parallel implementation is not only extremely scalable but also comparable in accuracy to the serial implementation. We have evaluated the proposed parallel biclustering algorithm with large synthetic data sets on up to 256 processors. Experimental evaluation shows large superlinear speedups, scalability and a high level of accuracy.

1 Introduction

Biclustering (subspace simultaneous clustering) is a very useful technique for text mining, collaborative filtering, and gene expression analysis and profiling [2] [3]. Recently, it has gained increasing popularity in the analysis of gene expression data [3].

∗Department of Electrical Engineering and Computer Science, Northwestern University.
†Current affiliation: A9.com. Research was carried out when the author was with the ECE department at the University of Illinois, Chicago.
‡Department of Electrical Engineering and Computer Science, Northwestern University.
§Department of Electrical Engineering and Computer Science, Northwestern University.

Biclustering is a significantly more useful and considerably harder problem than traditional clustering. Whereas a cluster is a set of objects with similar values over the entire set of attributes, a bicluster can be composed of objects that are similar over only a subset of attributes. Many clustering techniques, such as those based on nearest neighbor search, are known to suffer from the "curse of dimensionality" [4]. With a large number of dimensions, similarity functions such as Euclidean distance tend to perform poorly as similarity diffuses over these dimensions. For example, in text mining, the set of keywords describing different documents is very large and yields a sparse data matrix [5]. In this case, clusters based on the entire keyword set may have no meaning for the end users.

Biclustering finds local patterns where a subset of objects might be similar to each other based on only a subset of attributes. The comparison between biclusters and clusters is illustrated in Figure 1: Figure 1(a) demonstrates the concept of clusters and Figure 1(b) demonstrates biclusters in the input data matrix. Note that biclusters can cover just part of the rows or columns and may overlap with each other, as shown in Figure 1(b). The biclustering process can generate a wide variety of object groups that capture all the significant correlation information present in a data set. Biclustering has become very popular for discovering patterns in gene microarray experiment data. Since a bicluster can identify patterns among a subset of genes based on a subset of conditions, it models condition-specific patterns of co-expression. Moreover, biclustering can identify overlapping patterns, thus catering to the possibility that a gene may be a member of multiple pathways.

Unprecedented growth in web-related text data repositories [6] and the evolution of large biological data sets [7] have resulted in the need for high performance biclustering algorithms that can cope with the largeness of the data sets efficiently. Given the memory size limit of single processor machines, a parallel/distributed realization of biclustering algorithms is desirable to enable scalable applications for arbitrarily large data sets.

Distributed solutions to the biclustering problem also become essential when the underlying data is inherently distributed in nature.


[Figure 1: Illustration of Clusters and Biclusters. Panel (a): clusters of objects (Cluster 1, Cluster 2). Panel (b): biclusters of objects (Bicluster 1, Bicluster 2, Bicluster 3).]

Consider the example of a retail store chain such as Wal-Mart, which has numerous stores spread across different parts of the world. Consumer transaction data is stored at these stores such that users differ across stores but items are generally the same. This can be viewed as a row-wise distribution of the customer transaction data. Biclustering of this distributed data can be carried out in two ways. One is to transfer the data to a central location and perform subsequent sequential mining. The other is to perform distributed biclustering of the data. The former approach has serious scalability issues, as it is not feasible to transfer data from thousands of locations to a central location reliably and efficiently. The central approach is also harder to adopt because of security and privacy issues. This leaves us with the task of developing a distributed solution to the biclustering problem. There has already been some work done in the parallel/distributed biclustering area, especially in the context of document-word co-clustering [8], collaborative filtering for online recommendation systems [9] and gene expression analysis [10].

Bipartite spectral partitioning is a powerful technique for achieving biclustering. Normalized cut is the most popular objective function for the spectral partitioning problem [11], [12]. In the absence of a polynomial time exact solution for this problem, the emphasis is on finding efficient heuristics. A popular implementation [12] requires computation of the Fiedler vector of the Laplacian matrix of the input data. Its computational complexity is quadratic in the data size and hence not suitable for large data. Recently, Ahmad et al. [1] proposed a new formulation of the spectral partitioning problem as a graph drawing (GD) problem. They showed that minimization of Hall's energy function [1] corresponds to finding the normalized cut of the bigraph. Moreover, they also showed that an optimal solution to Hall's energy minimization problem is achieved by placing each node at the barycenter of its neighbors. Based on these results, they proved that the barycenter heuristic can provide a scalable linear time solution to the bipartite spectral partitioning problem.

In this paper, we study the effectiveness of this approach for parallel/distributed biclustering of large scale datasets.

Following the above-mentioned retail chain scenario, we assume a horizontal partitioning of the data where each location has the same feature space but different objects. Similar assumptions were made in [8] and [9]. The parallel implementation is divided into three stages. The first stage is a parallel implementation of the barycenter heuristic for crossing minimization. Identification of local biclusters is the second stage. The third stage determines global biclusters at each processor. This is accomplished by comparing local biclusters with bicluster representatives received from other processors. A bicluster representative is merely a vector composed of the means of the bicluster's columns over all rows of the bicluster. Each processor performs a Euclidean distance based similarity comparison with bicluster representatives received from other processors. Similar biclusters are subsequently merged to obtain global biclusters. Clearly, during both stages of communication, the size of each message is bounded by the number of columns in the data matrix.

We show that parallel/distributed implementations of most existing biclustering algorithms incur communication overhead that grows at least linearly with the number of rows. In contrast, the communication cost of the proposed parallel crossing minimization based approach is independent of the number of rows, a significant advantage for huge data sets on a large number of processors. We show that this efficient distributed implementation is comparable in accuracy to the sequential one [1]. We have evaluated the performance and accuracy of the parallel biclustering algorithm on large data sets comprising millions of rows. The evaluation results are discussed in detail in Section 6.

Organization: The rest of the paper is organized as follows. We discuss related work in Section 2; Section 3 briefly describes the sequential biclustering algorithm; Section 4 proposes a new model for parallel biclustering using the barycenter heuristic and provides an efficient algorithm. Section 5 discusses the complexity of the proposed scheme. The experimental framework is given in Section 6, and conclusions are drawn in Section 7.

2 Related Work

Recently, there have been some research efforts aimed at solving the parallel biclustering problem. ParRescue [8] is one such solution, based on the Minimum Sum-Squared Residue (MSSR) clustering algorithm of Dhillon et al. [13].


In our view, the MSSR based biclustering approach inherently suffers from the following problems.

1. The MSSR based biclustering approach is unable to determine overlapping biclusters. Overlapping biclusters are highly desired in emerging application areas such as gene expression analysis and collaborative filtering [14].

2. In this approach, row and column clusters are separately identified. It is unclear how these separately identified row and column clusters can lead to an effective biclustering solution.

3. The parallel solution incurs an O(t(pn(k + l))) communication overhead, where t is the total number of iterations, p is the number of processors, n is the number of columns, and k and l refer to the numbers of row and column clusters respectively. As we show later, this communication overhead is around (k + l) times more than that introduced by our implementation.

4. Finally, we believe that the MSSR based approach suffers from an initialization problem. Note that this algorithm initializes the row and column cluster indicator matrices randomly. This random assignment determines the quality of the overall clustering solution in some cases, which is a drawback. Moreover, the algorithm also requires the user to specify the numbers of row and column clusters. Since these numbers have a strong impact on the final output clusters, the algorithm has an undesirable dependence on these input parameters. The barycenter heuristic based biclustering algorithm does not suffer from these drawbacks.

Dhillon et al. in [15] gave an information theoretic biclustering (co-clustering) algorithm. They treat the (normalized) non-negative contingency table as a joint probability distribution between two discrete random variables that take values over the rows and columns. Their subsequent work [16] provided a Bregman divergence based loss function applicable to all density functions belonging to the exponential family. Bregman divergence is defined as follows [17].

If f is a strictly convex real-valued function, the f-entropy of a discrete measure p(x) ≥ 0 is defined by

$$H_f(p) = -\sum_x f(p(x))$$

and the Bregman divergence $B_f(p;q)$ is given as

$$B_f(p;q) = \sum_x \left[ f(p(x)) - f(q(x)) - \nabla f(q(x))\,(p(x) - q(x)) \right] \tag{2.1}$$

When $f(x) = x \log x$, $H_f$ is the Shannon entropy and $B_f(p;q)$ is the I-divergence; when $f(x) = -\log x$ we get the Burg entropy and the discrete Itakura-Saito distortion $B_f(p;q) = \sum_x \left( \log \frac{q(x)}{p(x)} + \frac{p(x)}{q(x)} - 1 \right)$.

A parallel biclustering (co-clustering) framework was also proposed in [9], aimed at providing a scalable solution to the collaborative filtering problem for online recommendation systems. Their contribution is based on a parallelization of the weighted biclustering (co-clustering) algorithm of [16]. The paper does not describe the parallel algorithm in great detail. In their theoretical analysis, they suggest that the overall computation time of the algorithm is $O\left(\frac{mn + mkl + nkl}{N} + Nkl\right)$, where N is the number of processors, m and n are the numbers of rows and columns respectively, and k and l are the numbers of column clusters and row clusters respectively. They have not fully discussed the communication cost of their approach. It is interesting to note that their graph of execution time with increasing number of processors (Figure 4 of [9]) shows that the execution time for the BookCrossing dataset is around 30 seconds in the single processor case, while with 15 processors it has decreased only by a factor of 3, to 10 seconds. That is, a 15-fold increase in the number of processors resulted in only a 3-fold improvement in execution time. Moreover, the curve essentially flattens (saturates) right after the 8 processor case. Clearly, the algorithm does not have linear speedup characteristics. We believe this is because of the high communication cost incurred during accumulation of global matrix averages. In contrast, our proposed parallel implementation of barycenter heuristic based biclustering is able to attain super-linear speedups for even hundreds of processors.

3 Biclustering Using the Sequential Barycenter Heuristic

3.1 Preliminaries: Throughout the paper, we denote a matrix by capital boldface letters such as G, I, etc. Vectors are denoted by small boldface letters such as p, and matrix elements are represented by small letters such as $w_{ij}$. We denote a graph by G(V,E), where V is the vertex set and E is the edge set of the graph. Each edge, denoted by {i, j}, has a weight $w_{ij}$. The adjacency matrix of G, also denoted G, is defined as

$$G_{ij} = \begin{cases} w_{ij} & \text{if there is an edge } \{i,j\} \\ 0 & \text{otherwise} \end{cases}$$

Definition 3.1. Bipartite Graph: A graph G(V,E) is termed bipartite if $V = V_0 \cup V_1$, where $V_0$ and $V_1$ are disjoint sets of vertices (i.e., $V_0 \cap V_1 = \emptyset$) and each edge in E has one end point in $V_0$ and the other end point in $V_1$.



We consider a weighted bipartite graph $G(V_0, V_1, E, W)$ with $W = (w_{ij})$, where $w_{ij} > 0$ denotes the weight of the edge {i, j} between vertices i and j, and $w_{ij} = 0$ if there is no edge between i and j.

For the above-mentioned retail chain example, users are represented by the vertex set $V_0$ and items by the vertices in $V_1$. The weight matrix W, in this case, represents the quantity of each item bought by each user. In microarray experiment data, genes and conditions are generally represented by the $V_0$ and $V_1$ vertex sets respectively, and the edge weight $w_{ij}$ represents the response of the i'th gene to the j'th condition. Similarly, for document-word data, documents are represented by the $V_0$ vertex set and words by the $V_1$ vertex set; in this case, W is the term frequency matrix for the whole corpus.

In graph theory, a cut is a partition of the vertices of a graph into two sets. Formally, for a partition of the vertex set V into two subsets S and T, the cut is defined as

$$\mathrm{cut}(S,T) = \sum_{i \in S,\, j \in T} w_{ij}$$

This definition extends easily to k vertex subsets:

$$\mathrm{cut}(V_1, V_2, \ldots, V_k) = \sum_{i < j} \mathrm{cut}(V_i, V_j)$$

It was shown in [12] that a partitioning (clustering) of the vertices on one layer of the bipartite graph induces a specific clustering of the vertices on the other layer, which in turn induces a new clustering on the first layer. This recursive process yields the "best" clustering of vertices on both layers when it corresponds to a partitioning of the graph such that the crossing edges between partitions have minimum weight. This essentially implies

$$\mathrm{cut}(V_{01} \cup V_{11}, V_{02} \cup V_{12}, \ldots, V_{0k} \cup V_{1k}) = \min_{V_1, V_2, \ldots, V_k} \mathrm{cut}(V_1, V_2, \ldots, V_k)$$

where $V_1, V_2, \ldots, V_k$ is any partitioning of the overall vertex set $V = V_0 \cup V_1$ into k vertex subsets, and $V_{0k}$ and $V_{1k}$ denote the k'th subsets of the two groups of vertices in the bipartite graph.

A min-cut may result in unbalanced partitions. The solution is to employ the normalized cut instead. The minimum normalized cut for two partitions S and T is defined as

$$\mathrm{Ncut}(S,T)_{\min} = \frac{\mathrm{cut}(S,T)}{\mathrm{vol}(S)} + \frac{\mathrm{cut}(S,T)}{\mathrm{vol}(T)} \tag{3.2}$$

Here vol(S) refers to the total weight of all edges originating from group S. The normalized cut yields more balanced partitioning.
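For illustration, a small self-contained C++ sketch computing cut(S, T) and Ncut(S, T) from a dense adjacency matrix; the representation (a dense matrix plus a boolean membership vector) is chosen for clarity only.

```cpp
#include <cstdio>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// cut(S,T): total weight of edges crossing from S to its complement T.
double cut(const Matrix& w, const std::vector<bool>& inS) {
    double c = 0.0;
    for (size_t i = 0; i < w.size(); ++i)
        for (size_t j = 0; j < w.size(); ++j)
            if (inS[i] && !inS[j]) c += w[i][j];
    return c;
}

// vol: total weight of all edges originating from one side of the partition.
double vol(const Matrix& w, const std::vector<bool>& inS, bool side) {
    double v = 0.0;
    for (size_t i = 0; i < w.size(); ++i)
        if (inS[i] == side)
            for (size_t j = 0; j < w.size(); ++j) v += w[i][j];
    return v;
}

// Ncut(S,T) = cut/vol(S) + cut/vol(T), Equation 3.2.
double ncut(const Matrix& w, const std::vector<bool>& inS) {
    double c = cut(w, inS);
    return c / vol(w, inS, true) + c / vol(w, inS, false);
}

int main() {
    Matrix w = {{0, 3, 1, 0}, {3, 0, 0, 1}, {1, 0, 0, 4}, {0, 1, 4, 0}};
    std::vector<bool> inS = {true, true, false, false};  // S = {0,1}, T = {2,3}
    std::printf("cut = %f, Ncut = %f\n", cut(w, inS), ncut(w, inS));
    return 0;
}
```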

It is well known that the graph partitioning problem is NP-complete [18]. Many heuristic methods exist that can find a local minimum of the problem. Spectral graph partitioning heuristics based on finding the normalized cut of the graph are known to perform well [11] [12].

Definition 3.2. Bipartite Drawing: A bipartite drawing of a bigraph G is the embedding of its vertex sets $V_0$ and $V_1$ onto distinct points on two horizontal lines $y_0$ and $y_1$ respectively, while edges are drawn as straight line segments.

Definition 3.3. Hall's Energy Function for a Bipartite Drawing: Let $h(v_k)$ be the x-coordinate of $v_k \in V$; then the energy of the bipartite drawing is defined as

$$E = \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} \left( h(v_i) - h(v_j) \right)^2$$

Hall's energy function essentially assigns similar coordinates to vertices that are connected by edges with large weights. If the edges indicate similarity between vertices, minimizing Hall's energy amounts to bringing similar vertices together. Intuitively, it can also be thought of as a process where the graph is decomposed into clusters of vertices such that all vertices in a cluster are tightly connected to other vertices in the same cluster while being loosely connected to vertices in other clusters. This essentially corresponds to the normalized cut (Ncut). Below we give a formal proof of this intuition.

Theorem 3.4. If $q = (q_1, q_2, \ldots, q_n)^T$ is the generalized partition vector as defined in [12], and L is the Laplacian matrix corresponding to the given graph, then Hall's energy is

$$E = q^T L q = \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} (q_i - q_j)^2$$

Proof.

$$E = \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} (q_i - q_j)^2 = \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} q_i^2 + \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} q_j^2 - \sum_{i,j=1}^{n} w_{ij} q_i q_j$$


Since W is symmetric, $w_{ij} = w_{ji}$, so

$$E = \sum_{i,j=1}^{n} w_{ij} q_i^2 - \sum_{i,j=1}^{n} w_{ij} q_i q_j = \sum_{i=1}^{n} \left( \sum_{j=1}^{n} w_{ij} \right) q_i^2 - \sum_{i,j=1}^{n} w_{ij} q_i q_j$$

Since $w_{ii} = 0$ for all i,

$$E = \sum_{i=1}^{n} \left( \sum_{j=1}^{n} w_{ij} \right) q_i^2 - \sum_{i=1}^{n} \sum_{j=1, j \neq i}^{n} w_{ij} q_i q_j = \sum_{i=1}^{n} L_{ii} q_i^2 + \sum_{i=1}^{n} \sum_{j=1, j \neq i}^{n} L_{ij} q_i q_j = \sum_{i,j=1}^{n} L_{ij} q_i q_j = q^T L q$$

The above theorem shows that minimization of Hall's energy corresponds to finding the normalized cut of the graph. We now show that assigning x-coordinates to each node based on the barycenter of its neighbors results in the optimal minimum value of Hall's energy function.
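The identity proved above is easy to check numerically; the following sketch evaluates both sides of Theorem 3.4 for a small symmetric weight matrix with zero diagonal (the matrix and vector are arbitrary illustrative values).

```cpp
#include <cstdio>
#include <vector>

int main() {
    // Small symmetric weight matrix with zero diagonal.
    std::vector<std::vector<double>> W = {
        {0, 2, 0, 1},
        {2, 0, 3, 0},
        {0, 3, 0, 4},
        {1, 0, 4, 0}};
    std::vector<double> q = {0.5, -0.5, 1.0, -1.0};
    int n = (int)W.size();

    // Left side: (1/2) sum_ij w_ij (q_i - q_j)^2
    double energy = 0.0;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            energy += 0.5 * W[i][j] * (q[i] - q[j]) * (q[i] - q[j]);

    // Right side: q^T L q with L = D - W, where D_ii = sum_j w_ij.
    double qLq = 0.0;
    for (int i = 0; i < n; ++i) {
        double deg = 0.0;
        for (int j = 0; j < n; ++j) deg += W[i][j];
        for (int j = 0; j < n; ++j) {
            double Lij = (i == j) ? deg : -W[i][j];
            qLq += q[i] * Lij * q[j];
        }
    }
    std::printf("energy = %f, q^T L q = %f\n", energy, qLq);
    return 0;
}
```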

Theorem 3.5. The minimum value of Hall's energy function is achieved when

$$h(v_i) = \frac{\sum_{j=1}^{n} w_{ij} h(v_j)}{\sum_{j=1}^{n} w_{ij}}$$

i.e., each node is assigned a new coordinate value which is the barycenter of its neighbors' coordinates.

Proof. Since Hall's energy function is a convex function, we can determine its minimum value by simply setting $\frac{\partial E}{\partial h(v_i)} = 0$:

$$\frac{\partial E}{\partial h(v_i)} = \frac{\partial}{\partial h(v_i)} \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} (h(v_i) - h(v_j))^2 = \sum_{j=1}^{n} w_{ij} (h(v_i) - h(v_j)) = 0$$

which gives

$$h(v_i) = \frac{\sum_{j=1}^{n} w_{ij} h(v_j)}{\sum_{j=1}^{n} w_{ij}}$$

It is clear from Theorem 3.5 that a bipartite drawing which assigns each node an x-coordinate that is the barycenter of its neighbors provides an optimal solution to the Hall's energy minimization problem for the given bipartite graph. Also, from Theorem 3.4, we know that Hall's energy function is equivalent to the objective function for the normalized cut of the bigraph, which in turn is the objective function for bipartite spectral partitioning.

Spectral clustering solutions are based on finding the Fiedler vector (an eigenvector) of the Laplacian of the given graph. Eigenvalues and their corresponding eigenvectors are computed using techniques such as singular value decomposition (SVD). These techniques typically have quadratic complexity and require random access to complete data sets. This complexity limits their effectiveness for processing very large matrices.

3.2 Bigraph Crossing Minimization Using the Barycenter Heuristic As described above, the optimal solution to the Hall's energy minimization problem is attained when each vertex is placed at the barycenter (mean of ordinal values) of its neighbors. In the graph drawing domain, the barycenter heuristic is a well-known technique which attempts to draw a graph on a plane such that edge crossings are minimized. It is an iterative process: during each iteration, vertex positions on one layer are fixed while the position of each vertex on the other layer is computed as the mean of the positions of its neighbors. Each iteration of the barycenter heuristic requires O(|E| + |V| log |V|) time. For graphs with a constant degree bound, the barycenter heuristic can also be implemented in linear time per iteration.

Let $v_i$ represent the i'th node in the non-static (dynamic) layer and let the set $N_i$ represent the set of neighbors of $v_i$. Also let $r_j$ represent the rank of the j'th member of the set $N_i$. Then the new rank of $v_i$, denoted $r_i$, can be calculated as follows.

1. We first calculate the weighted mean, denoted $\mu_i$, of the ranks of the neighbor nodes using Equation 3.3.

2. These weighted means represent the new ordering of the nodes. Since these means are not necessarily unique, we adjust them so that each node is assigned a unique rank based on its weighted mean.

$$\tilde{s} = \sum_{j \in N_i} w_{ij} \times r_j \tag{3.3a}$$

$$s = \sum_{j \in N_i} w_{ij} \tag{3.3b}$$

$$\mu_i = \frac{\tilde{s}}{s} \tag{3.3c}$$
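A sketch of one such iteration, assuming the bigraph is stored as a dense weight matrix; the stable-sort tie-breaking used to make the ranks unique is one reasonable choice, not necessarily the one used in our implementation.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// One barycenter iteration: given current ranks of the fixed layer,
// recompute ranks of the dynamic layer (Equations 3.3a-3.3c).
// W[i][j] is the weight between dynamic node i and fixed node j.
std::vector<int> barycenterIteration(const std::vector<std::vector<double>>& W,
                                     const std::vector<int>& fixedRanks) {
    size_t n = W.size();
    std::vector<double> mu(n, 0.0);
    for (size_t i = 0; i < n; ++i) {
        double s_tilde = 0.0, s = 0.0;
        for (size_t j = 0; j < W[i].size(); ++j) {
            s_tilde += W[i][j] * fixedRanks[j];   // Eq. 3.3a
            s       += W[i][j];                   // Eq. 3.3b
        }
        mu[i] = (s > 0.0) ? s_tilde / s : 0.0;    // Eq. 3.3c
    }
    // Assign unique ranks by (stably) sorting nodes on their weighted means.
    std::vector<int> order(n);
    std::iota(order.begin(), order.end(), 0);
    std::stable_sort(order.begin(), order.end(),
                     [&](int a, int b) { return mu[a] < mu[b]; });
    std::vector<int> newRanks(n);
    for (size_t pos = 0; pos < n; ++pos) newRanks[order[pos]] = (int)pos;
    return newRanks;
}
```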


3.3 Bicluster Identification After the crossings are minimized, we perform the bicluster identification process on the reordered matrix representation of the bigraph. Bicluster identification starts from the first element of the reordered matrix and keeps adding new columns and rows while calculating and updating a score representing the coherence of the values in the current block of the reordered matrix. This coherence is determined via two kinds of distance functions, namely the Bregman divergence and the Manhattan distance.

The Bregman divergence is used to compare two rows over the same set of columns. As described earlier, we can use different functions in the Bregman divergence equation (Equation 2.1) to emulate the Euclidean distance, KL-divergence and Itakura-Saito distortion [17]. The Manhattan distance is used to compare the values of adjacent columns in the same row. The Manhattan distance between two points $P_1(x_1, y_1)$ and $P_2(x_2, y_2)$ in the XY plane is given by

$$D(P_1, P_2) = |x_1 - x_2| + |y_1 - y_2| \tag{3.4}$$

The Manhattan distance is simple to compute and works well in our case, as we use it to compare distances among columns of the same row, which are assumed to be identically distributed.

The bicluster identification procedure is a local search procedure. We keep pointers StartRow and StartColumn, which are initialized to the first row and column of the reordered matrix respectively. The coherence score for adjacent rows is calculated using the Bregman distance, while the evolution pattern over adjacent columns is determined using the Manhattan distance. Our iterative procedure is row-major: we compare two rows at a time to see if they have matching columns. Starting with StartRow and StartRow + 1, we keep a reference set of matching columns. Columns are added to this set if either of the following two conditions is satisfied.

1. The Bregman distance between the two rows over the same set of columns is less than a given threshold δ.

2. The Manhattan distance between adjacent columns is the same on both rows.

The first condition ensures that we identify constant value biclusters, subject to noise. The second condition guarantees that we determine biclusters which exhibit coherent evolution over the current set of columns. For each subsequent row, we check whether it has the same coherent columns; if it does, we add it to our row set. If any row has more coherent columns than the ones in the reference set, we set the OverlapFlag and store the current value of the row iterator i in StartRow and that of the column iterator j in StartColumn.

If both of the above-mentioned conditions fail, we declare the current submatrix a bicluster and continue with the bicluster identification process. The next iteration of the identification process starts from StartRow and StartColumn if OverlapFlag was set; otherwise it continues with the next value of the row iterator until it reaches the last row. Moreover, when we reach the end of the columns, we add the current row to the current row cluster. If we reach the end of the rows, and the current column and row clusters contain more than the required minimum numbers of columns (Column_min) and rows (Row_min) respectively, then we declare these row and column clusters to be a bicluster and add it to the bicluster set S. A simplified sketch of this search appears below.
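The following is a deliberately simplified sketch of this local search. It grows a block of rows that agree on a reference column set, using an I-divergence for condition 1 and equality of adjacent-column differences for condition 2; the OverlapFlag bookkeeping for overlapping biclusters is omitted, and all identifiers are illustrative.

```cpp
#include <cmath>
#include <vector>

// I-divergence (Bregman divergence with f(x) = x log x) between two rows
// restricted to the columns in 'cols'; assumes strictly positive entries,
// as in the normalized (joint distribution) matrix.
double rowBregman(const std::vector<double>& a, const std::vector<double>& b,
                  const std::vector<int>& cols) {
    double d = 0.0;
    for (int c : cols)
        d += a[c] * std::log(a[c] / b[c]) - (a[c] - b[c]);
    return d;
}

// Simplified local search: build a reference column set from rows startRow
// and startRow+1 (conditions 1 and 2), then extend the row set downward
// while rows stay coherent on that column set. The caller must ensure that
// startRow+1 is a valid row index.
void identifyBlock(const std::vector<std::vector<double>>& M, int startRow,
                   double delta, int rowMin, int colMin,
                   std::vector<int>& rows, std::vector<int>& cols) {
    const std::vector<double>& r0 = M[startRow];
    const std::vector<double>& r1 = M[startRow + 1];
    int m = (int)r0.size();
    cols.clear();
    for (int c = 0; c + 1 < m; ++c) {
        bool nearConstant = std::fabs(r0[c] - r1[c]) < delta;       // condition 1
        bool sameEvolution = std::fabs((r0[c + 1] - r0[c]) -
                                       (r1[c + 1] - r1[c])) < 1e-9; // condition 2
        if (nearConstant || sameEvolution) cols.push_back(c);
    }
    rows = {startRow, startRow + 1};
    for (int r = startRow + 2; r < (int)M.size(); ++r) {
        if (rowBregman(M[r], M[rows.back()], cols) < delta)
            rows.push_back(r);
        else
            break;
    }
    // Keep the block only if it meets the minimum size requirements.
    if ((int)rows.size() < rowMin || (int)cols.size() < colMin) {
        rows.clear();
        cols.clear();
    }
}
```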

4 Parallel/Distributed Biclustering

The complete biclustering process comprises two main tasks: first the matrix is reordered by performing crossing minimization, followed by identifying the biclusters in this reordered matrix. The parallel biclustering algorithm likewise consists of a parallel crossing minimization algorithm and parallel bicluster identification. As shown in Figure 2, the input data matrix is horizontally partitioned among all the processes. After every process acquires its share of data, the data is normalized to construct the joint distribution of rows and columns. In the second step, rows and columns are reordered iteratively until there is no further change. During this phase only column information is exchanged among the processes; rows are reordered locally. The fact that each process has to share only the column rank information is very useful in privacy preserving biclustering applications (for complete details see [14] and [19]). After reordering of the local matrix, each process enters its second phase, bicluster identification. With the reordered matrix, each process performs a local bicluster search. After identifying the biclusters locally, bicluster representatives are shared among all the processes. With this new information acquired by every process, another phase of identifying local biclusters is completed. At the end, updated information is shared with all the processes and the results are returned. The following sections discuss the above-mentioned tasks in detail.

4.1 Model for Distributed Biclustering The data is horizontally partitioned across a set $S : \{s_1, \ldots, s_N\}$ of N servers such that each server $s_i$ has the same set $A : \{a_1, \ldots, a_m\}$ of m attributes and a set $R_i : \{r_{i1}, \ldots, r_{in_i}\}$ of $n_i$ different objects (records), such that the global object set R can be represented as $R = \bigcup_{i \in S} R_i$.


[Figure 2: System Design of Parallel Biclustering Algorithm. Stages: horizontal partitioning and initialization; crossing minimization (global column reordering and local row reordering, repeated until no further change); then, on the reordered matrix, local bicluster search, a one-time global exchange of bicluster representatives, and comparison with local biclusters.]


4.1.1 Parallel/Distributed Barycenter Heuristic Each server can build a bigraph on its local data such that objects are represented by vertices/nodes on one layer (say Layer0) and the attributes/features are represented by vertices on the other layer (Layer1). It then performs barycenter heuristic based crossing minimization on the bigraph. Recall from the previous discussion that the barycenter heuristic requires iterative computation of the global ranks of the object nodes and attribute nodes, which are represented by the two layers of the bigraph. Rank computation for each node on one layer of the bigraph requires knowledge of the ranks of its neighbors on the other layer.

Since we assume horizontal partitioning, each object node in the bigraph built over local data has all the necessary information (i.e., the ranks of its neighbor attribute nodes) to calculate its rank locally using (3.3). This implies that no inter-server communication is required for calculating the ranks of objects at each server.

On the other hand, since attribute nodes are connected to object nodes which might be separated across different servers, we have to engage in inter-server communication for exact computation of the ranks of these attribute nodes. Each server i shares its local value of $\mu_{ij}$ for the j'th attribute with all other servers, and then the global weighted mean is calculated. This global weighted mean $\mu_G^{(j)}$ for an attribute $a_j \in A$ is simply the mean of all $\mu_{ij}$ values over the N servers:

$$\mu_G^{(j)} = \frac{\sum_{1 \le i \le N} \mu_{ij}}{N}$$

The global weighted mean for an attribute node is the same across all the servers and thus results in the assignment of a unique rank to each attribute node. This is an embarrassingly parallel procedure, where the rank of each node can be computed independently of the rank of any other node in the same layer. This is especially useful for hardware based implementations of the biclustering algorithm.
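In an MPI realization (our implementation is in MPI/C++), this exchange maps onto a single all-reduce over the vector of local attribute means; the sketch below is illustrative and assumes each server contributes one local mean per attribute.

```cpp
#include <mpi.h>
#include <cstdio>
#include <vector>

// Global weighted mean of each attribute's rank: the average of the
// per-server local means mu_ij, computed with one all-reduce.
std::vector<double> globalAttributeMeans(const std::vector<double>& localMu) {
    int nServers;
    MPI_Comm_size(MPI_COMM_WORLD, &nServers);
    std::vector<double> mean(localMu.size());
    MPI_Allreduce(localMu.data(), mean.data(),
                  static_cast<int>(localMu.size()),
                  MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    for (double& v : mean) v /= nServers;  // sum -> mean over the N servers
    return mean;
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    std::vector<double> localMu = {1.0 + rank, 2.0 + rank};  // toy local means
    std::vector<double> mu = globalAttributeMeans(localMu);
    if (rank == 0)
        std::printf("global means: %f %f\n", mu[0], mu[1]);
    MPI_Finalize();
    return 0;
}
```

Note that the message size is the number of attributes m, independent of the number of rows, which is the source of the communication advantage discussed in Section 5.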

4.1.2 Parallel/Distributed Bicluster Identification At the end of the above-mentioned distributed barycenter heuristic based crossing minimization algorithm, each server has a reordered representation of its local data. Since the ranks of object nodes are calculated locally, a server does not know the global ranks of its object nodes. The global rank of object nodes is useful only if each server also knows the ranks of the objects on the other servers, because the bicluster identification procedure works on contiguously ranked object nodes. It would have to be implemented in a distributed setting by keeping a hash table so that each server $s_i$ can look up the set of servers $S_{lookup}$ which have object nodes adjacent to its own object nodes. The server $s_i$ would then engage in a distributed bicluster identification procedure in collaboration with the servers belonging to the set $S_{lookup}$.

This would result in a tremendous increase in communication overhead. Firstly, all servers would have to engage in an all-to-all broadcast of the local ranks of each object node so as to assign a unique global rank to each object node at each server; this is also required for building the lookup table. Given the large number of object nodes, the communication cost of this process would be prohibitively high.


Secondly, during the bicluster identification process, each server would have to engage in distributed bicluster identification with the other servers belonging to the set $S_{lookup}$. This would require the bicluster identification procedure to be performed in a synchronized manner at each server. Given that these servers communicate at internet speeds and are distributed geographically, the synchronization requirement would be impractical for most applications.

The solution to the above-mentioned problem is to employ a simple approach whereby object ranks are computed locally at each server and the bicluster identification process is performed locally as well. Once each server has identified its local biclusters, it broadcasts representatives of these biclusters to all other servers. A bicluster representative is a vector consisting of the mean of each attribute in the bicluster. Upon receiving the bicluster representatives from other servers, each server determines through the Euclidean distance whether its local biclusters can be combined with those from other servers. If it finds strong similarity between one of its biclusters and an incoming bicluster representative, it updates the local representative for that bicluster so that the attribute means reflect the global attribute means of the merged bicluster.
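A minimal sketch of this comparison and merge step, assuming a representative is stored as the vector of per-attribute means together with the bicluster's row count; the similarity threshold is an illustrative parameter.

```cpp
#include <cmath>
#include <vector>

// Euclidean distance between two bicluster representatives
// (vectors of per-attribute means over the bicluster's rows).
double repDistance(const std::vector<double>& a, const std::vector<double>& b) {
    double d = 0.0;
    for (size_t j = 0; j < a.size(); ++j)
        d += (a[j] - b[j]) * (a[j] - b[j]);
    return std::sqrt(d);
}

// Merge an incoming representative into a local one if they are similar,
// weighting each side by its row count so the means become global means.
bool tryMerge(std::vector<double>& localRep, long& localRows,
              const std::vector<double>& remoteRep, long remoteRows,
              double threshold) {
    if (repDistance(localRep, remoteRep) >= threshold)
        return false;  // not similar enough; keep the biclusters separate
    for (size_t j = 0; j < localRep.size(); ++j)
        localRep[j] = (localRep[j] * localRows + remoteRep[j] * remoteRows)
                      / (localRows + remoteRows);
    localRows += remoteRows;
    return true;
}
```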

5 Complexity Analysis

Let us assume that $n = |V_0|$ is the total number of rows of the input data matrix, $m = |V_1|$ is the total number of columns, and $R_c$ is the average number of rows per bicluster. Also let $C_c$ be the average number of columns per bicluster, C the total number of biclusters, and k the average number of iterations of the barycenter heuristic.

5.1 Time Complexity of the Sequential Algorithm: Let O(n) be the time to compute the weighted means, O(n log n) the time to perform sorting based on the means, and O(n) the time taken to adjust node positions. Also let $O(C R_c C_c)$ be the time to identify the biclusters. Combining all of these components yields the sequential time complexity, $T_{sequential} = O(k(n + n\log n + n)) + O(C R_c C_c)$. The worst case complexity for identification of a bicluster is O(nm), but the identification process is carried out such that a submatrix which is part of an already identified bicluster is not revisited during identification of subsequent biclusters. In the case where a single bicluster covers the whole matrix, the total number of biclusters is one, thus keeping the complexity bounded by the data size itself.

Here, without loss of generality, we can assume that $O(C R_c C_c) = O(nm)$. This implies that

$$T_{sequential} = O(k(n + n\log n + n)) + O(nm)$$

We have noted that k is fairly small in most cases. This implies that in the expression for $T_{sequential}$, the O(nm) term tends to dominate, which essentially means that the sequential biclustering process has complexity roughly the same as the problem size.

5.2 Parallel Complexity: Assume that we have horizontally partitioned the data among p processors such that each processor now has n/p rows of the original matrix and all m columns. The computational complexity when the data is partitioned among p processors is $T_p = O(k(n/p + (n/p)\log(n/p) + n/p)) + O(nm/p)$. By the same arguments as in the sequential case, O(nm/p) is a rough bound on the parallel computation cost. This essentially means that by having p processors work on the problem, we have reduced the overall time complexity by a factor of p, which theoretically yields a linear speedup curve. The algorithm involves all-to-all communication primitives, first in the crossing minimization phase (an all-to-all reduce over column positions), and then in the cluster combining phase, where an all-to-all gather is performed once per cluster. These all-to-all communication primitives incur $O(p^2 m)$ communication cost. This indicates that the communication cost is independent of the number of rows; it depends only on the number of columns and the number of processors, growing linearly with the number of columns and quadratically with the number of processors.
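For reference, the cost model just described can be summarized as follows (our arrangement of the terms; the crossover condition follows by comparing the dominant computation and communication terms):

```latex
T_p = \underbrace{O\!\left(k\,\tfrac{n}{p}\log\tfrac{n}{p}\right)
      + O\!\left(\tfrac{nm}{p}\right)}_{\text{computation}}
    + \underbrace{O(p^{2}m)}_{\text{communication}},
\qquad
\tfrac{nm}{p} \gg p^{2}m \iff n \gg p^{3}.
```

That is, communication begins to dominate once p grows past roughly $n^{1/3}$, which is broadly consistent with the saturation observed for the smaller data sets in Section 6.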

5.3 Impact of the Memory System on Performance As is clear from the above discussion, the proposed parallel biclustering solution has two components: parallel crossing minimization and bicluster identification. Both depend strongly on the performance of the underlying memory system. The barycenter heuristic requires in-memory sorting of rows and columns after each iteration. We use red-black trees [20] to keep the nodes sorted with respect to their current positions during each iteration. Red-black trees are very useful as they provide O(log n) insertion, deletion and retrieval operations. However, we observe that the cost of keeping the tree balanced increases sharply as the data size grows beyond a threshold that depends on the underlying memory system. This increase in tree balancing cost can be mitigated by increasing the cache size so as to reduce the I/O time spent accessing off-chip memory and disks. This also motivates parallelizing the code to make use of the caches distributed across the different processors of the cluster.
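In C++ this is naturally expressed with an ordered associative container; std::map is typically implemented as a red-black tree, which is the behavior relied on here (the snippet is illustrative):

```cpp
#include <cstdio>
#include <map>

int main() {
    // Ordered container keyed on weighted mean; std::map is typically a
    // red-black tree, so insert/erase/find are O(log n) per operation.
    // The ranks are assumed to have been made unique (Section 3.2).
    std::map<double, int> byMean;  // weighted mean -> node id
    byMean.emplace(0.40, 7);
    byMean.emplace(0.75, 3);

    // After an iteration updates node 7's mean, reposition it in the order.
    byMean.erase(0.40);
    byMean.emplace(0.55, 7);

    for (const auto& [mu, node] : byMean)  // iterate in rank order
        std::printf("node %d at mean %.2f\n", node, mu);
    return 0;
}
```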

The second component of the parallel implementation is the bicluster identification process. This requires reading the matrix in the order determined by the above-mentioned barycenter heuristic. This process depends very strongly on the performance of the underlying memory system, for the following two reasons.



• During this procedure, matrix rows and columns are scanned in the order determined by the barycenter heuristic rather than in their natural layout in memory. This has an adverse impact on cache performance, especially when the data size is very large.

• As the data size increases, due to the limited memory at each processor, there is a high likelihood that required data gets swapped out of memory onto the hard disk. Since the I/O latency of hard disk access is orders of magnitude larger than that of memory access, this effect is very prominent, especially in the case of the sequential algorithm.

Both of the above effects are intrinsically mitigated by the parallel implementation. As more processors become available, partitions of large data sets fit into their local memories more easily. The increased aggregate cache also has a strong positive impact on the performance of the parallel algorithm. In the parallel processing domain, it is generally believed that super-linear speedups are attained when the parallel implementation is able to make effective use of the caches of all machines in the cluster. If we consider the retail chain scenario mentioned in previous sections, this data distribution is natural. As we show in the experiments section, these reasons explain the super-linear speedups observed while comparing the performance of the parallel algorithm against the sequential one.

6 Experimental Results

The proposed biclustering algorithm is implemented in MPI/C++. Our implementation of the parallel biclustering algorithm is evaluated on Mercury, a machine at the National Center for Supercomputing Applications (NCSA). Mercury is an 887-node IBM Linux cluster where each node contains two Intel 1.3/1.5 GHz Itanium II processors sharing 4 GB of memory. Running a SuSE Linux operating system, the compute nodes are interconnected by both Myrinet and Gigabit Ethernet. The implementation has been tested with MPICH version 1.2.6 on the faster CPU machines (1.5 GHz) with 6 MB L3 cache. In all the experiments the execution time was obtained through MPI_Wtime() and is reported in seconds. The execution time does not include the time spent reading data from files, but it does include the time for writing the identified biclusters to files.

[Figure 3: Accuracy Comparison between Parallel and Sequential Algorithms. Match score (0.86–1.02) versus noise level.]

6.1 Accuracy Evaluation The first set of experiments is aimed at judging the accuracy of the proposed algorithm. For this purpose, we evaluate the proposed parallel/distributed implementation of the barycenter heuristic based biclustering algorithm against results from the sequential implementation [1]. For empirical evaluation of the quality of clusters, we use the following function, which was also used in [21] and [22].

Let $M_1, M_2$ be two sets of biclusters. The match score of $M_1$ with respect to $M_2$ is given by

$$S(M_1, M_2) = \frac{1}{|M_1|} \sum_{A(I_1,J_1) \in M_1} \max_{A(I_2,J_2) \in M_2} \frac{|I_1 \cap I_2|\, |J_1 \cap J_2|}{|I_1 \cup I_2|\, |J_1 \cup J_2|}$$

Let $M_{opt}$ denote the set of implanted biclusters and M the set of output biclusters of a biclustering algorithm. $S(M_{opt}, M)$ represents how well each of the true biclusters is discovered by the algorithm.
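A direct transcription of this score, assuming each bicluster is stored as its row index set I and column index set J (the representation and helper names are ours):

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <vector>

struct Bicluster {
    std::set<int> I;  // row indices
    std::set<int> J;  // column indices
};

// |A intersect B| for two index sets.
static int interSize(const std::set<int>& a, const std::set<int>& b) {
    std::vector<int> out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(out));
    return (int)out.size();
}

// Match score S(M1, M2): for each bicluster in M1, the best row-and-column
// Jaccard agreement with any bicluster in M2, averaged over M1.
double matchScore(const std::vector<Bicluster>& M1,
                  const std::vector<Bicluster>& M2) {
    double total = 0.0;
    for (const Bicluster& b1 : M1) {
        double best = 0.0;
        for (const Bicluster& b2 : M2) {
            int iI = interSize(b1.I, b2.I), iJ = interSize(b1.J, b2.J);
            int uI = (int)(b1.I.size() + b2.I.size()) - iI;
            int uJ = (int)(b1.J.size() + b2.J.size()) - iJ;
            best = std::max(best, (double)iI * iJ / ((double)uI * uJ));
        }
        total += best;
    }
    return total / (double)M1.size();
}
```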

Again, we follow the approach used by Liu et al. [22] for synthetic data generation. To account for the missing values in real life data, we add noise by replacing some elements of the matrix with random values. There are three variables, b, c and γ, in the generation of the biclusters: b and c control the size of the implanted bicluster, and γ is the noise level of the bicluster. The matrix with implanted constant biclusters is generated in four steps: (1) generate an n × m matrix A such that all elements of A are 0s, (2) generate √n × √m biclusters whose elements are all 1s, (3) implant the biclusters into A, (4) replace γ(m × n) elements of the matrix with random noise values (0 or 1). A sketch of this generator appears below.
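A minimal sketch of the four-step generator; for brevity it implants a single √n × √m constant bicluster at the top-left corner, and the placement and random number generator are illustrative choices.

```cpp
#include <cmath>
#include <random>
#include <vector>

// Steps (1)-(4): an all-zero n x m matrix, a sqrt(n) x sqrt(m) all-ones
// bicluster implanted at the top-left, then gamma*(n*m) entries replaced
// with random 0/1 noise.
std::vector<std::vector<int>> synthetic(int n, int m, double gamma,
                                        std::mt19937& rng) {
    std::vector<std::vector<int>> A(n, std::vector<int>(m, 0));     // step (1)
    int br = (int)std::sqrt((double)n), bc = (int)std::sqrt((double)m);
    for (int i = 0; i < br; ++i)                                    // steps (2)-(3)
        for (int j = 0; j < bc; ++j)
            A[i][j] = 1;
    std::uniform_int_distribution<int> row(0, n - 1), col(0, m - 1), bit(0, 1);
    long noisy = (long)(gamma * (double)n * m);                     // step (4)
    for (long t = 0; t < noisy; ++t)
        A[row(rng)][col(rng)] = bit(rng);
    return A;
}

int main() {
    std::mt19937 rng(42);
    auto A = synthetic(1000, 64, 0.1, rng);  // e.g., 1000 x 64 at 10% noise
    return (int)A.size() == 1000 ? 0 : 1;
}
```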

In the experiments, the noise level ranges from 0 to 0.25. The parameter settings used for the two methods are listed in Table 1. The resulting graph is shown in Figure 3.


Method     | Parameter Settings
Parallel   | δ = 0.5, Iterations = 5, Function = KL-Divergence, Processors = 4
Sequential | δ = 0.5, Iterations = 5, Function = KL-Divergence, Processors = 1

Table 1: Parameter Settings for the Parallel and Sequential Biclustering Algorithms

The graph illustrates the accuracy comparison between the sequential barycenter heuristic based biclustering algorithm and its parallel/distributed implementation (using four processors). It shows that at zero noise, the parallel scheme is almost as accurate as the sequential scheme [1]. With increasing noise the accuracy degrades slightly, because of the non-exact nature of the algorithm's bicluster comparison and merge operations. Note, however, that the accuracy of the algorithm remains considerably high even though it is a distributed implementation using a simple bicluster merge framework.

6.2 Performance Evaluation of the Parallel Biclustering Algorithm The second set of experiments is aimed at analyzing the performance of the parallel biclustering algorithm. For this purpose the algorithm was run on the synthetic data sets mentioned above. These data sets have sizes ranging from 200,000 to 20,000,000 rows, with 64 columns throughout.

Figure 4 shows the execution time for crossing minimization and bicluster identification with varying problem sizes. In Figure 4(a), execution time for more than 64 processors is not reported. It is evident from these charts that the bicluster identification time dominates the crossing minimization time, so the total execution time follows the behavior of bicluster identification while crossing minimization becomes insignificant. Each curve represents a different data size; execution time increases significantly with the number of rows. For a single processor, the crossing minimization time increases many-fold when the number of rows grows from 200,000 to 20,000,000, but Figure 4(b) shows that bicluster identification exhibits an extreme increase in execution time beyond 4 million rows. The reason for this behavior, as explained in Section 5.3, is poor cache utilization and possible swapping of required data to secondary storage when the data becomes too large to fit in memory. The execution time improves significantly with multiple processors because the data partition at each processor is more likely to fit in memory and less likely to suffer hard disk I/O latencies due to swapping. Moreover, smaller data sets have better cache utilization as there is less cache pollution. Since there is an orders of magnitude difference between cache and hard disk I/O latencies, the end result is a dramatic reduction in execution times for the multiple processor cases. We do not present execution times above 5 hours in this paper, to avoid imbalanced scaling in the charts. For the bigger data sizes, execution time is reported in Figure 4 only for large numbers of processors, because the execution time is considerably high for smaller numbers of processes.

Figure 5 shows the speedup charts, where speedup is defined as $S = \mathrm{Time}(p{=}1)/\mathrm{Time}(p{=}P)$, for different problem sizes from 1 to 256 processors. The first graph shows speedup curves for parallel crossing minimization; all other graphs show the overall speedup achieved for different problem sizes. For 200,000 rows the speedup decreases as the number of processors is increased beyond 64. This is because the inter-processor communication cost, which is $O(p^2)$, starts undermining the performance gains made through data decomposition and the resulting parallel execution of computation tasks. A similar phenomenon would be seen for the other problem sizes as well if the number of processors were increased further; the 1 million row case also appears to saturate after 256 processors. The smaller the problem size, the sooner saturation occurs. Note that for the 4 million, 10 million and 20 million row cases, speedup has been calculated with respect to 8, 32 and 64 processors respectively. It is clear from the graphs that the proposed approach is very scalable and maintains excellent speedup characteristics throughout. As discussed earlier, the superlinear speedups are the result of better cache utilization with smaller per-processor data sizes and avoidance of high latency secondary storage I/O.

7 Conclusions

A parallel/distributed implementation of a biclustering algorithm is highly desired in many application domains, including gene expression analysis in large scale microarray experiments, text mining on large web based text repositories, and collaborative filtering in e-commerce systems. To meet this challenge, a parallel/distributed algorithm for barycenter heuristic based biclustering is proposed. We provide a thorough analysis of the advantages of this scheme over existing solutions. The proposed approach is evaluated on large synthetic data sets on a cluster with hundreds of machines. Both the accuracy and the performance of the algorithm are tested and verified on these data sets. The experiments reveal that the algorithm not only shows super-linear and scalable speedup characteristics but also maintains good accuracy throughout.


[Figure 4: Execution Time for Crossing Minimization and Bicluster Identification for Different Problem Sizes. Panel (a): crossing minimization time (seconds) versus number of processes (1–64) for 200,000 to 20 million rows. Panel (b): bicluster identification time (seconds) versus number of processes (1–256) for the same problem sizes.]

[Figure 5: Crossing Minimization and Overall Speedup for Different Problem Sizes. Speedup versus number of processes (32–256) for 200,000 to 20 million rows; the first panel shows crossing minimization speedup, the remaining panels show overall speedup for each problem size.]



8 Acknowledgements

This work was supported in part by DOE SCIDAC-2: Scientific Data Management Center for Enabling Technologies (CET) grant DE-FC02-07ER25808, DOE SCiDAC award number DE-FC02-01ER25485, NSF HECURA CCF-0621443, NSF SDCI OCI-0724599, NSF CNS-0551639 and IIS-0536994. This research was supported in part by the National Science Foundation through TeraGrid resources provided by NCSA, grant numbers TG-CCR060017T, TG-CCR080019T, and TG-ASC080050N. We gratefully acknowledge Prof. Ashfaq Khokhar of the University of Illinois, Chicago, and the anonymous reviewers for their insightful suggestions and helpful comments.

References

[1] W. Ahmad and A. Khokhar, "cHawk: A highly efficient biclustering algorithm using bigraph crossing minimization," in Second International Workshop on Data Mining and Bioinformatics (VDMB 2007), Vienna, Austria (in conjunction with VLDB 2007), 2007.

[2] Y. Cheng and G. Church, "Biclustering of expression data," in Proceedings of Intelligent Systems for Molecular Biology, 2000.

[3] S. C. Madeira and A. L. Oliveira, "Biclustering algorithms for biological data analysis: A survey," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 1, 2004.

[4] R. Bellman, Dynamic Programming. Princeton, NJ: Princeton University Press, 1957.

[5] I. S. Dhillon and D. S. Modha, "Concept decompositions for large sparse text data using clustering," Machine Learning, vol. 42, no. 1, pp. 143–175, January 2001. [Online]. Available: http://dx.doi.org/10.1023/A:1007612920971

[6] A. Kittur, E. Chi, B. A. Pendleton, B. Suh, and T. Mytkowicz, "Power of the few vs. wisdom of the crowd: Wikipedia and the rise of the bourgeoisie," World Wide Web, vol. 1, no. 2, 2006.

[7] K.-H. Cheung, K. White, J. Hager, M. Gerstein, and P. Miller, "YMD: A microarray database for large-scale gene expression analysis," in Proc. AMIA Symp., 2002, pp. 140–144.

[8] J. Zhou and A. A. Khokhar, "ParRescue: Scalable parallel algorithm and implementation for biclustering over large distributed datasets," in ICDCS, IEEE Computer Society, 2006, p. 21.

[9] T. George and S. Merugu, "A scalable collaborative filtering framework based on co-clustering," in ICDM, IEEE Computer Society, 2005, pp. 625–628.

[10] L. Wei, C. Ling, Q. Hongyu, and Q. Ling, "A parallel biclustering algorithm for gene expression data," IEEE Computer Society Press, 2008.

[11] M. Gu, H. Zha, C. Ding, X. He, and H. Simon, "Spectral relaxation models and structure analysis for k-way graph clustering and bi-clustering," 2001. [Online]. Available: citeseer.ist.psu.edu/article/gu01spectral.html

[12] I. S. Dhillon, "Co-clustering documents and words using bipartite spectral graph partitioning," in Knowledge Discovery and Data Mining, 2001, pp. 269–274. [Online]. Available: citeseer.ist.psu.edu/dhillon01coclustering.html

[13] H. Cho, I. S. Dhillon, Y. Guan, and S. Sra, "Minimum sum-squared residue co-clustering of gene expression data," in Proceedings of the Fourth SIAM International Conference on Data Mining, 2004, pp. 114–125.

[14] W. Ahmad and A. Khokhar, "An architecture for privacy preserving collaborative filtering on web portals," in The Third International Symposium on Information Assurance and Security (IAS), 2007.

[15] I. S. Dhillon, S. Mallela, and D. S. Modha, "Information-theoretic co-clustering," in Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2003, pp. 89–98.

[16] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh, "Clustering with Bregman divergences," Journal of Machine Learning Research, vol. 6, pp. 1705–1749, 2005.

[17] J. Lafferty, S. Della Pietra, and V. Della Pietra, "Statistical learning algorithms based on Bregman distances," in Proceedings of the Canadian Workshop on Information Theory, 1997.

[18] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. New York, NY, USA: W. H. Freeman & Co., 1990.

[19] W. Ahmad and A. A. Khokhar, "Phoenix: Privacy preserving biclustering on horizontally partitioned data," in Privacy, Security, and Trust in KDD, First ACM SIGKDD International Workshop, PinKDD 2007, San Jose, CA, USA, August 12, 2007, Revised Selected Papers, ser. Lecture Notes in Computer Science, vol. 4890, Springer, 2008, pp. 14–32.

[20] T. H. Cormen, C. Stein, R. L. Rivest, and C. E. Leiserson, Introduction to Algorithms. McGraw-Hill Higher Education, 2001.

[21] A. Prelic, S. Bleuler, P. Zimmermann, A. Wille, P. Buhlmann, W. Gruissem, L. Hennig, L. Thiele, and E. Zitzler, "A systematic comparison and evaluation of biclustering methods for gene expression data," Bioinformatics, 2006.

[22] X. Liu and L. Wang, "Computing the maximum similarity bi-clusters of gene expression data," Bioinformatics, vol. 23, no. 1, pp. 50–56, 2007.
