
Pontes et al. Algorithms for Molecular Biology 2013, 8:4
http://www.almob.org/content/8/1/4

RESEARCH Open Access

Configurable pattern-based evolutionary biclustering of gene expression data

Beatriz Pontes1*, Raúl Giráldez2 and Jesús S Aguilar-Ruiz2

Abstract

Background: Biclustering algorithms for microarray data aim at discovering functionally related gene sets under different subsets of experimental conditions. Due to the complexity of the problem and the characteristics of microarray datasets, heuristic searches are usually used instead of exhaustive algorithms. Comparing different techniques also remains a challenge: the obtained results vary in relevant features such as the number of genes or conditions, which makes a fair comparison difficult. Moreover, existing approaches do not allow the user to specify any preferences on these properties.

Results: Here, we present the first biclustering algorithm in which several bicluster features can be tailored in terms of different objectives. This can be done by tuning the specified features in the algorithm or by incorporating new objectives into the search. Furthermore, our approach bases bicluster evaluation on expression patterns, and is able to recognize both shifting and scaling patterns, either simultaneously or separately. Evolutionary computation has been chosen as the search strategy, hence the name of our proposal, Evo-Bexpa (Evolutionary Biclustering based in Expression Patterns).

Conclusions: We have conducted experiments on both synthetic and real datasets demonstrating the ability of Evo-Bexpa to obtain meaningful biclusters. Synthetic experiments have been designed to compare the performance of Evo-Bexpa with other approaches when looking for perfect patterns. Experiments with four different real datasets also confirm the proper performance of our algorithm, whose results have been biologically validated through Gene Ontology.

Keywords: Gene expression data analysis, Shifting and scaling expression patterns, Evolutionary biclustering

Background

DNA microarray technologies are used to analyse the expression level of many genes in a single reaction quickly and efficiently. Different types of microarray chips have been designed for different investigations, expression chips being the most common application. They are used to determine the expression patterns of genes that correspond to different samples, where the samples may vary according to experimental conditions and/or physiological states. They may even be extracted from different individuals, tissues or developmental stages [1]. The applications of this kind of microarray include determining gene functions, finding new genes, studying gene regulation and assessing how genes have evolved over time.

*Correspondence: [email protected] of Computer Languages, University of Seville, Seville, Spain. Full list of author information is available at the end of the article.

The raw data of a microarray experiment is an image, in which the colours and intensities reflect the expression level of each gene in each sample. This image is processed in order to obtain a numerical gene expression matrix, in which rows correspond to the genes under study and columns refer to the different samples. A special characteristic of these expression matrices is that they are very unbalanced, in the sense that the number of genes (usually thousands) is much larger than the number of samples (usually fewer than a hundred) [2]. Therefore, analysing this kind of matrix implies understanding the relationships in a space of many variables (genes) from only a few measured points (experimental conditions).

In order to obtain relevant knowledge from microarray data, similarities among genes and samples need to be analysed in many different ways, depending on the specific application. Due to the complexity of these tasks, together with the huge amount of data, diverse data mining and machine learning approaches have been studied, producing a great variety of software for the analysis of gene expression data from microarrays.

© 2013 Pontes et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Gene expression microarray analysis

Focusing the analysis of expression matrices on the genes, one of the most studied goals is to extract information on how gene expression patterns vary among the different samples, finding groups of co-expressed genes. If two different genes show similar expression patterns across the samples, this suggests a common pattern of regulation, possibly reflecting some kind of interaction or relationship between their functions [3].

Within data mining techniques it is possible to differentiate two main sets of algorithms, depending on whether previous knowledge about the data is used (supervised learning) or not (non-supervised learning). Classification has been extensively studied within gene expression data as a supervised technique [4-8], where labelled data is used to build a model able to assign any new input data to its proper class.

On the other hand, non-supervised learning is used when no previous assignments are available; the goal is to divide the data into clusters of samples and to identify the differences between the genes that characterize such groups. The application of clustering techniques to gene expression data has also been broadly studied in the literature [9-11]. Nevertheless, there exist two main restrictions in the use of clustering algorithms: (1) genes are grouped together according to their expression patterns across the whole set of samples, and (2) each gene must be clustered into exactly one group. This last limitation is two-fold: firstly, it means that a certain gene cannot be present in different groups, thus forbidding overlapping among clusters; secondly, it forces each gene to be a member of some cluster, even if it is not co-regulated with any of the other genes in that cluster.

However, genes might be relevant only for a subset of samples. This is essential for numerous biological problems, such as the analysis of genes contributing to certain diseases, the assignment of biological functions to genes, or cases in which the conditions of a microarray are diverse [12]. Thus, clustering should be performed on the two dimensions (genes and conditions) simultaneously. Also, many genes may be grouped into several clusters (or none of them) depending on their participation in different biological processes within the cell [13]. These characteristics are covered by biclustering techniques, which have also been widely applied to microarray data [14-16]. The groups of genes and samples found by biclustering approaches are called biclusters.

Finding significant biclusters in a microarray is a much more complex problem than clustering [17]. In fact, it has been proven to be an NP-hard problem [18]. Consequently, the majority of the proposed techniques are based on optimization procedures as search heuristics. The development of both a suitable heuristic and a good cost function for guiding the search is essential for discovering interesting biclusters in an expression matrix. Furthermore, a suitable evaluation measure for biclusters is important because it can be used to compare the performance of different biclustering approaches, which remains an unsolved task.

In order to design an effective evaluation measure for biclusters, we have focused our research on the study of the different types of expression patterns described in the literature.

Gene expression patterns in biclusters

Several types of biclusters have been described and categorized in the literature, depending on the pattern exhibited by the genes across the experimental conditions [19]. For some of them it is possible to represent the values in the bicluster using a formal equation. In the following, let $B$ be a bicluster consisting of a set $I$ of $|I|$ genes and a set $J$ of $|J|$ conditions, in which $b_{ij}$ refers to the expression level of gene $i$ under sample $j$.

• Constant values: $b_{ij} = \pi$

• Constant values on rows or columns

– Additive: $b_{ij} = \pi + \beta_i$, $b_{ij} = \pi + \beta_j$

– Multiplicative: $b_{ij} = \pi \times \alpha_i$, $b_{ij} = \pi \times \alpha_j$

• Coherent values on both rows and columns

– Additive: $b_{ij} = \pi + \beta_i + \beta_j$

– Multiplicative: $b_{ij} = \pi \times \alpha_i \times \alpha_j$

where $\pi$ represents any constant value for $B$; $\beta_i$ ($1 \le i \le |I|$) and $\beta_j$ ($1 \le j \le |J|$) refer to constant values used in additive models for each gene $i$ or condition $j$; and $\alpha_i$ ($1 \le i \le |I|$) and $\alpha_j$ ($1 \le j \le |J|$) correspond to constant values used in multiplicative models for each gene $i$ or experimental condition $j$.

Other kinds of biclusters are those whose values exhibit coherent evolutions, showing evidence that the subset of genes is up-regulated or down-regulated across the subset of conditions without taking into account their actual expression values. In this situation, the data in the bicluster cannot be represented by any mathematical model.

The most general situation that can be described using a mathematical formula is when a bicluster has coherent values on both rows and columns, for the additive and multiplicative models at the same time. In this case, the bicluster is said to follow a perfect shifting and scaling pattern, and its values can be represented by this equation:

$$b_{ij} = \pi_i \times \alpha_j + \beta_j \tag{1}$$

where $\pi_i$ ($1 \le i \le |I|$) refers to a constant value for each gene (row) $i$. Since 2000, several quality measures for biclusters have been proposed together with different heuristics. Nevertheless, to the best of our knowledge, none of the previously proposed quality measures is able to recognize a perfect shifting and scaling pattern in a bicluster, even though this is the most general situation and also the most probable one when working with gene expression data. To address this, we have recently developed a standardization-based procedure for assessing bicluster quality. This measure has been named VEt and has been proven to be effective for recognizing both types of patterns simultaneously in biclusters (see section Biclusters evaluation in Evo-Bexpa).
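As an illustration of equation 1, the following minimal sketch builds a small synthetic bicluster that follows a perfect shifting and scaling pattern. The function name and the numeric values are hypothetical and chosen only for this example; they do not come from the experiments reported in the paper.

```python
def shifting_scaling_bicluster(pi, alpha, beta):
    """Return the matrix b[i][j] = pi[i] * alpha[j] + beta[j] (equation 1).

    pi    -- one base value per gene (row)
    alpha -- one scaling factor per condition (column)
    beta  -- one shifting factor per condition (column)
    """
    if len(alpha) != len(beta):
        raise ValueError("alpha and beta need one entry per condition")
    return [[p * a + b for a, b in zip(alpha, beta)] for p in pi]


# Illustrative values only (3 genes, 4 conditions).
pi = [1.0, 2.0, 4.0]
alpha = [1.0, 2.0, 0.5, 3.0]
beta = [0.0, 5.0, -1.0, 2.0]
bicluster = shifting_scaling_bicluster(pi, alpha, beta)

for row in bicluster:
    print(row)
```

Setting every entry of `alpha` to 1 yields a pure shifting pattern, and setting every entry of `beta` to 0 yields a pure scaling pattern, so both appear as special cases of the same construction.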

Biclustering approaches based on evaluation measures

Mean squared residue

Cheng and Church [20] were the first to apply biclustering to gene expression data. They introduced one of the most popular biclustering algorithms, which combines a greedy search heuristic for finding biclusters with a measure for assessing the quality of such biclusters.

The original algorithm of Cheng and Church (henceforth CC) adopts a sequential covering strategy in order to return a list of $n$ biclusters from an expression data matrix. In order to assess the quality of biclusters, the algorithm uses the Mean Squared Residue (MSR). This measure aims at evaluating the coherence of the genes and conditions of a bicluster $B$ consisting of a set $I$ of rows and a set $J$ of columns. MSR is defined as:

$$\mathrm{MSR}(B) = \frac{1}{|I| \cdot |J|} \sum_{i=1}^{|I|} \sum_{j=1}^{|J|} \left( b_{ij} - b_{iJ} - b_{Ij} + b_{IJ} \right)^2 \tag{2}$$

where $b_{ij}$, $b_{iJ}$, $b_{Ij}$ and $b_{IJ}$ represent the element in the $i$th row and $j$th column, the mean of row $i$, the mean of column $j$, and the mean of $B$, respectively. The lower the mean squared residue, the stronger the coherence exhibited by the bicluster, and the better its quality. If a bicluster has a mean squared residue lower than a given value $\delta$, it is called a $\delta$-bicluster. If a bicluster has an MSR equal to zero, its genes fluctuate in exactly the same way under the subset of experimental conditions, and thus it can be considered a perfect bicluster.

Nevertheless, MSR has been proven to be ineffective for finding certain types of biclusters in microarray data, especially when they present strong scaling tendencies [21]. In fact, MSR is only able to capture shifting tendencies within the data [22]. Furthermore, CC also presents some other disadvantages due to its search strategy and the use of a threshold for rejecting solutions, since this threshold depends on each database and has to be computed before applying the algorithm [23].

In spite of these limitations, MSR has been widely used in many proposals. These proposals are based on a diverse range of heuristics: greedy search [24], genetic algorithms (both single and multi-objective) [17,25], simulated annealing [26], and fuzzy biclustering, among others. More recently, MSR has also been incorporated as a cost function in multi-objective heuristics based on Particle Swarm Optimization [27], Artificial Immune Systems [28], and in a variant of the GRASP approach [29].
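The behaviour of MSR on the two basic pattern types can be checked with a short sketch of equation 2. The helper names and the two toy matrices are assumptions made for this illustration; this is not the Cheng and Church implementation.

```python
def mean(values):
    return sum(values) / len(values)


def msr(b):
    """Mean squared residue of a bicluster given as a list of rows (equation 2)."""
    n_rows, n_cols = len(b), len(b[0])
    row_means = [mean(row) for row in b]                              # b_iJ
    col_means = [mean([row[j] for row in b]) for j in range(n_cols)]  # b_Ij
    total_mean = mean(row_means)                                      # b_IJ
    residue = sum(
        (b[i][j] - row_means[i] - col_means[j] + total_mean) ** 2
        for i in range(n_rows)
        for j in range(n_cols)
    )
    return residue / (n_rows * n_cols)


# Toy data: the same base values with per-condition shifts or scale factors.
base = (1.0, 2.0, 4.0)
shifting = [[p + s for s in (0.0, 5.0, -1.0)] for p in base]
scaling = [[p * a for a in (1.0, 2.0, 0.5)] for p in base]

print(msr(shifting))  # numerically zero: shifting patterns are recognized
print(msr(scaling))   # clearly above zero: scaling patterns are penalized
```

For a pure scaling pattern the residue of each cell reduces to the product of the row and column deviations from their means, which explains why MSR grows with the strength of the scaling tendency.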

Scaling mean squared residue

Mukhopadhyay et al. [30] have recently developed an evaluation measure for biclusters which is able to recognize scaling patterns. In their work, they analyse the reasons why MSR is able to recognize shifting patterns in biclusters but not scaling patterns. Using the mathematical formula for scaling patterns, they define a metric which is then proved to identify scaling patterns. This new measure is named SMSR, from Scaling MSR, and is shown in equation 3. Nevertheless, SMSR is not capable of identifying shifting patterns.

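$$\mathrm{SMSR}(B) = \frac{1}{|I| \cdot |J|} \sum_{i=1}^{|I|} \sum_{j=1}^{|J|} \frac{\left( b_{iJ} \times b_{Ij} - b_{ij} \times b_{IJ} \right)^2}{b_{iJ}^2 \times b_{Ij}^2} \tag{3}$$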

SMSR has been incorporated into a greedy search strategy similar to that of Cheng and Church. This methodology therefore shares the same disadvantages as CC, and it is also necessary to establish a limit value for SMSR for each database. In order to also find biclusters with shifting patterns, the authors propose an adapted algorithm in which the CC algorithm is applied twice, the first time using MSR as the evaluation measure and the second time using SMSR. This makes it possible to find biclusters with shifting patterns and also biclusters with scaling patterns, but it does not find biclusters with both kinds of patterns simultaneously.
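A minimal sketch of equation 3 makes the complementary behaviour of SMSR visible. The function names and toy matrices are hypothetical, and the code assumes that no row or column mean is zero, since the denominator of equation 3 is undefined otherwise.

```python
def mean(values):
    return sum(values) / len(values)


def smsr(b):
    """Scaling mean squared residue of a bicluster given as a list of rows (equation 3).

    Assumes every row mean and column mean is non-zero.
    """
    n_rows, n_cols = len(b), len(b[0])
    row_means = [mean(row) for row in b]                              # b_iJ
    col_means = [mean([row[j] for row in b]) for j in range(n_cols)]  # b_Ij
    total_mean = mean(row_means)                                      # b_IJ
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            numerator = (row_means[i] * col_means[j] - b[i][j] * total_mean) ** 2
            denominator = row_means[i] ** 2 * col_means[j] ** 2
            total += numerator / denominator
    return total / (n_rows * n_cols)


# Toy data: the same base values with per-condition scale factors or shifts.
base = (1.0, 2.0, 4.0)
scaling = [[p * a for a in (1.0, 2.0, 0.5)] for p in base]
shifting = [[p + s for s in (0.0, 5.0, -1.0)] for p in base]

print(smsr(scaling))   # numerically zero: scaling patterns are recognized
print(smsr(shifting))  # above zero: shifting patterns are not
```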

HARP Algorithm

Yip et al. [31] presented a biclustering approach named HARP (Hierarchical approach with Automatic Relevant dimension selection for Projected clustering) based on projected clustering. They also introduced an evaluation metric slightly different from MSR, in which the quality of a bicluster is measured as the sum of the relevance indices of the columns. The relevance index $RI_j$ for column $j \in J$ is defined as

$$RI_j = 1 - \frac{\sigma^2_{Ij}}{\sigma^2_{\cdot j}} \tag{4}$$

where $\sigma^2_{Ij}$ (local variance) and $\sigma^2_{\cdot j}$ (global variance) are the variances of the values in column $j$ for the bicluster and for the whole data set, respectively. Note that the relevance index for a column is maximized if its local variance is zero, provided that the global variance is not. Based on this relevance index, the quality of a cluster is measured as the sum of the index values of all the selected conditions.

At the beginning of the algorithm there are as many biclusters as genes. The process consists in iteratively merging biclusters until a certain criterion is met, choosing those experimental conditions that satisfy a specific threshold requirement, taking the relevance indices into account. Optionally, a reassignment procedure is applied, in which biclusters with very few elements are removed and their elements are assigned to the closest bicluster according to a distance measure.

This algorithm presents several drawbacks, the most important one being the kind of biclusters it is capable of finding. Due to the nature of its evaluation measure, the only bicluster patterns that maximize the quality are constant biclusters (either on rows or on columns). Furthermore, the way in which the algorithm works does not allow overlapping elements among biclusters, which is one of the most important differences between clustering and biclustering methodologies.
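The relevance index of equation 4 can be sketched as follows. The function names and the toy matrix are hypothetical, and the use of the population variance is an assumption of this sketch, since the text above does not state which variance estimator is used.

```python
from statistics import pvariance


def relevance_index(data, rows, col):
    """Relevance index RI_j of column `col` for the genes in `rows` (equation 4).

    Uses the population variance (an assumption of this sketch) and requires
    a non-zero global variance for the column.
    """
    local_var = pvariance([data[i][col] for i in rows])  # within the bicluster
    global_var = pvariance([row[col] for row in data])   # whole data set
    return 1.0 - local_var / global_var


def harp_quality(data, rows, cols):
    """Bicluster quality as the sum of the relevance indices of its columns."""
    return sum(relevance_index(data, rows, j) for j in cols)


# Toy expression matrix: 4 genes (rows) by 3 conditions (columns).
data = [
    [1.0, 5.0, 2.0],
    [1.0, 9.0, 4.0],
    [3.0, 1.0, 6.0],
    [7.0, 3.0, 8.0],
]

print(relevance_index(data, rows=[0, 1], col=0))  # constant column in the bicluster
print(harp_quality(data, rows=[0, 1], cols=[0, 1]))
```

Column 0 is constant within the selected genes, so it reaches the maximum index of 1, which illustrates why this measure favours constant biclusters.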

Virtual error

The basic idea behind the Virtual Error (VE) [32] is to measure how closely the genes in a bicluster follow the general tendency within the group. In order to capture the general tendency of the genes across the conditions contained in the bicluster, a new gene (the so-called virtual gene) is computed as the mean of all the genes in the bicluster. This virtual gene thus symbolizes the common tendency of the set of genes for the given bicluster.

VE aims at measuring the extent to which all the genes in the bicluster resemble the virtual gene. In order to carry out a fair comparison in terms of shifting and scaling patterns, a process of gene standardization is performed on all genes, including the virtual one. This way, gene values are scaled to a common range. After that, VE is defined as the average value of all the differences between the standardized expression values of the bicluster and the standardized virtual gene. The more similar the genes are, the lower the value of VE. In fact, VE has been proven to be 0 for those biclusters presenting either shifting or scaling patterns [32]. It has also been proven that when the data in a bicluster resembles a perfect pattern but contains some noise, VE takes a greater value depending on the amount of noise in the bicluster [33]. VE has been used in an evolutionary search strategy, producing satisfactory results and improving on those obtained with other evaluation measures [32]. It is able to recognize both shifting and scaling patterns, though not simultaneously.
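The description of VE can be turned into a small sketch. Several details are assumptions of this illustration rather than statements from the text: differences are taken in absolute value, standardization uses the population standard deviation, and patterns follow the form of equation 1 with positive scaling factors. Function names and data are hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """Scale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def virtual_error(b):
    """VE sketch: mean absolute difference between each standardized gene
    (row) and the standardized virtual gene (per-condition mean of all genes)."""
    n_rows, n_cols = len(b), len(b[0])
    virtual_gene = [mean([row[j] for row in b]) for j in range(n_cols)]
    virtual_hat = standardize(virtual_gene)
    total = 0.0
    for row in b:
        row_hat = standardize(row)
        total += sum(abs(x - v) for x, v in zip(row_hat, virtual_hat))
    return total / (n_rows * n_cols)


# Toy patterns built from the same illustrative factors.
pi, alpha, beta = (1.0, 2.0, 4.0), (1.0, 2.0, 0.5), (0.0, 5.0, -1.0)
shifting = [[p + s for s in beta] for p in pi]
scaling = [[p * a for a in alpha] for p in pi]
combined = [[p * a + s for a, s in zip(alpha, beta)] for p in pi]

print(virtual_error(shifting))  # numerically zero
print(virtual_error(scaling))   # numerically zero
print(virtual_error(combined))  # above zero for the combined pattern
```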

Nonmetric-based biclustering algorithms

Not all existing biclustering approaches base their search for biclusters on evaluation measures. There exists a diverse set of biclustering tools that follow different strategies and algorithmic concepts to guide the search towards meaningful results. Among the most popular algorithms is the Order Preserving Sub Matrix (OPSM) algorithm of Ben-Dor et al. [34], which tries to identify large submatrices in which the expression levels of all genes induce the same linear ordering of the samples. The Iterative Signature Algorithm (ISA) was proposed by Ihmels et al. [35,36] and applies the signature algorithm in order to find transcription modules, which are self-consistent regulatory units consisting of a set of co-regulated genes and the experimental conditions that induce their co-regulation. Murali and Kasif [37] proposed the use of xMOTIFs (conserved gene expression Motifs) for the representation of gene expression data. Their algorithm looks for large xMOTIFs in which genes are expressed in the same state across all the samples in the motif. Finally, Bimax has been presented by Prelic et al. as a fast divide-and-conquer algorithm capable of finding all maximal bicliques in a corresponding graph-based matrix representation [38].

Prelic et al. also developed a Biclustering Analysis Toolbox (BicAT) [39], which includes implementations of Bimax and of the other three algorithms (OPSM, ISA and xMOTIFs), together with CC (see section Biclustering approaches based on evaluation measures). In this work we have compared the results of our approach on both synthetic and real data sets with these five approaches.

More recently, QUBIC has been presented as a qualitative biclustering algorithm, in which the input data matrix is first represented as a matrix of integer values. Afterwards, the algorithm looks for genes with identical integer values across a subset of conditions [40]. Hochreiter et al. [41] have developed a generative multiplicative model for the biclustering problem, assuming realistic non-Gaussian signal distributions with heavy tails. They also assumed gene expression data to be preprocessed and filtered. Hierarchical clustering has also been used by Huang et al. [42], incorporating a sub-dimensional search strategy in an effort to reduce the dimension of the search space, while Sill et al. [43] have incorporated stability selection to improve a sparse singular value decomposition (SSVD) approach. Other works are based on a previous binarization of the data, such as DeBi [44]. After binarizing the data, DeBi consists of three stages for finding, extending and filtering seed biclusters. Although no evaluation measure for biclusters is defined, Fisher's exact test is used in the extending phase.

Methods

This section details our biclustering approach, which consists of a sequential covering method [45] in which the function that obtains each bicluster is an evolutionary algorithm. Our algorithm has been named Evo-Bexpa, from Evolutionary Biclustering based in Expression Patterns. The evolutionary process inside Evo-Bexpa consists of a genetic algorithm guided by a fitness function in which several objectives are taken into account. These objectives are easily configurable, with the possibility of specifying user preferences on some characteristics of the results, specifically the number of genes, the number of conditions, the amount of overlapping and the mean gene variance. This way, if any previous information related to the microarray under study is available, the search can be guided towards the preferred types of biclusters. Furthermore, other objectives can easily be incorporated into the search, and any objective may be ignored by setting its weight to zero.

The problem of finding a single bicluster according to several objectives corresponds to a multi-objective optimization problem, in which two or more conflicting objectives need to be optimized. The strategy of constructing a single Aggregate Objective Function (AOF) has been adopted in order to solve this multi-objective problem [46]. This way, it is possible to specify the relative influence of each objective in the bicluster evaluation, which makes our algorithm configurable.

In the following subsections we first explain the different objectives taken into account in the bicluster evaluation. Afterwards, the evolutionary algorithm behind Evo-Bexpa is described, including the initialization of the population, the generational change and the way in which the different objectives have been combined to form the fitness function.

Biclusters evaluation in Evo-Bexpa

This subsection details the bicluster characteristics taken into account in their evaluation. In our approach we have singled out four different objectives: the extent to which a bicluster follows a perfect correlation pattern, its size, the amount of overlapping among different solutions, and the mean gene variance. These objectives have been chosen because together they cover the whole set of objectives used in different biclustering approaches in the literature, independently of their applications.

VEt for correlated pattern recognition

Transposed Virtual Error (VEt) [33] has been used as the quality measure for biclusters, being one of the most important objectives in the fitness. It is based on the concept of expression patterns and quantifies the degree of correlation among genes in a bicluster. VEt is computed similarly to VE (see section Biclustering approaches based on evaluation measures) but in the transposed way. The first step is the creation of a virtual condition, which is a vector containing the mean of every row in the bicluster, as represented in equation 5. This virtual condition $\rho$ therefore has as many elements as there are genes in the bicluster.

$$\rho_i = \frac{1}{|J|} \sum_{j=1}^{|J|} b_{ij} \tag{5}$$

Afterwards, a process of standardization is carried out on both the bicluster data and the virtual condition, as depicted in equation 6, where $\sigma_{c_j}$ and $\mu_{c_j}$ represent the standard deviation and the arithmetic average of all the expression values for condition $j$, respectively. Likewise, $\mu_\rho$ and $\sigma_\rho$ refer to the average and the standard deviation of the values of the virtual condition, respectively.

$$\hat{b}_{ij} = \frac{b_{ij} - \mu_{c_j}}{\sigma_{c_j}}, \qquad \hat{\rho}_i = \frac{\rho_i - \mu_\rho}{\sigma_\rho} \tag{6}$$

Finally, VEt measures the differences between the standardized values for every experimental condition and the standardized virtual condition, as in equation 7. Therefore, VEt is never negative, and its optimal value is 0.

$$\mathrm{VE}^{t}(B) = \frac{1}{|I| \cdot |J|} \sum_{i=1}^{|I|} \sum_{j=1}^{|J|} \left| \hat{b}_{ij} - \hat{\rho}_i \right| \tag{7}$$

VEt has been proven to be effective at recognizing both shifting and scaling patterns in biclusters, either simultaneously or independently [33]. It has also been proven to increase linearly as the amount of error in a bicluster grows, where the error is measured as the distance from the nearest perfect pattern. When working with real data, it is very unlikely to find biclusters where VEt is equal to zero because, although the genes in a good bicluster share a common behaviour, this behaviour cannot be represented by an exact mathematical equation [47].
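Equations 5 to 7 can be combined into a short sketch. Some details are assumptions of this illustration: the differences in equation 7 are taken in absolute value, standardization uses the population standard deviation, the scaling factors of the test pattern are positive, and no column of the bicluster is constant. Function names and data are hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """Scale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def vet(b):
    """Transposed virtual error of a bicluster given as a list of rows."""
    n_rows, n_cols = len(b), len(b[0])
    # Equation 5: the virtual condition holds the mean of every row.
    rho_hat = standardize([mean(row) for row in b])
    # Equation 6: every condition (column) is standardized on its own.
    cols_hat = [standardize([row[j] for row in b]) for j in range(n_cols)]
    # Equation 7: average absolute difference to the virtual condition.
    total = sum(
        abs(cols_hat[j][i] - rho_hat[i])
        for i in range(n_rows)
        for j in range(n_cols)
    )
    return total / (n_rows * n_cols)


# Perfect shifting and scaling pattern b_ij = pi_i * alpha_j + beta_j.
pi, alpha, beta = (1.0, 2.0, 4.0), (1.0, 2.0, 0.5, 3.0), (0.0, 5.0, -1.0, 2.0)
perfect = [[p * a + s for a, s in zip(alpha, beta)] for p in pi]

# The same bicluster with one perturbed value.
noisy = [row[:] for row in perfect]
noisy[0][0] += 3.0

print(vet(perfect))  # numerically zero
print(vet(noisy))    # above zero
```

Standardizing each condition removes its own shifting and scaling factors, so every standardized column collapses onto the same standardized base values, which is why the combined pattern scores zero here.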

Bicluster volume

Bicluster volume is defined as the product of the number of genes and the number of samples. At this point we have two conflicting objectives to be optimized. On the one hand, VEt has to be minimized, and normally the smaller a bicluster is, the lower its VEt will be. On the other hand, the volume has to be maximized, and the general tendency is that bigger biclusters have bigger values of VEt. In order to design the volume term for the fitness we took into account the following issues:

• Use of a logarithmic scale. Small changes in the number of rows or columns should not have a significant effect, depending on the bicluster size.

• Two separate terms for the number of genes and the number of conditions. This is necessary to avoid excessively unbalanced biclusters, and it is also desirable because it allows the size of each dimension to be configured independently. Note that biclusters in which one dimension is very small are more likely to be close to a perfect pattern, and therefore have low VEt values. For this reason, it is preferable to optimize the size of both dimensions independently, thus avoiding biclusters made up of a great number of genes and only a few samples.

• Fixed range. The range of the values of the functions controlling both dimensions should not depend on any parameter value.

The final design of the term for the volume is shown in equation 8, where $|I|$ and $|J|$ refer to the number of genes and conditions, respectively, while $w_g$ and $w_c$ are the configuration parameters for the two dimensions.

$$\mathrm{Vol}(B) = \left( \frac{-\ln(|I|)}{\ln(|I|) + w_g} \right) + \left( \frac{-\ln(|J|)}{\ln(|J|) + w_c} \right) \tag{8}$$

Terms with a greater constant value ($w_g$ or $w_c$) decrease more slowly. Depending on the value of the constant used, the term will have more or less influence on the fitness function at the beginning of the algorithm, since initial biclusters are small and grow along the evolutionary process. At a certain point, increasing the number of rows or columns of a solution no longer compensates for the loss of quality according to the rest of the objectives. The moment at which the algorithm stops increasing the size of the solutions and focuses on improving their quality depends on the value of the constants used. The smaller these constants are, the sooner the algorithm will stop increasing the size. Figure 1 represents the term for the number of genes in equation 8 for different values of the constant $w_g$. It can be clearly seen that for the smallest value of $w_g$ represented ($w_g = 0.25$), the function starts to decrease slowly from a smaller number of genes than it does for greater values of $w_g$.

Although we have found default values for the constants of both dimensions (rows and columns) that yield good solutions in every expression matrix we have tested, it is very easy to modify the fitness function in order to obtain solutions of different sizes if desired. Increasing the constant associated with rows ($w_g$) will produce biclusters with a greater number of genes, while increasing the constant associated with columns ($w_c$) will produce biclusters with more experimental conditions.
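The effect of the constants in equation 8 can be explored with a minimal sketch. The function names are hypothetical, and the values of $w_g$ used below are illustrative only; apart from 0.25, which is one of the values plotted in Figure 1, they are not taken from the text, and none of them is claimed to be a default of the algorithm.

```python
import math


def size_term(n, w):
    """One dimension of equation 8: -ln(n) / (ln(n) + w), for n >= 1 and w > 0."""
    return -math.log(n) / (math.log(n) + w)


def volume(n_genes, n_conditions, w_g, w_c):
    """Volume term of equation 8 (more negative means a larger bicluster)."""
    return size_term(n_genes, w_g) + size_term(n_conditions, w_c)


def gain_from_doubling(n, w):
    """How much the term improves (decreases) when the size goes from n to 2n."""
    return size_term(n, w) - size_term(2 * n, w)


for w_g in (0.25, 1.0, 4.0):  # illustrative values
    print(w_g, [round(size_term(n, w_g), 3) for n in (1, 10, 100, 1000)],
          round(gain_from_doubling(100, w_g), 4))
```

Each term stays within the fixed range from 0 (a single gene or condition) down towards -1, whatever the constant. With a small constant the term is already close to -1 for moderate sizes, so adding further genes is barely rewarded, whereas a larger constant keeps rewarding growth for longer.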

Overlapping among biclusters

Overlapping among biclusters is usually permitted but controlled in the literature [13]. Overlapping differs from VEt and volume in the sense that it cannot be evaluated on a bicluster by itself. Cheng and Church [20] try to avoid overlapping by replacing the values contained in each found bicluster with random ones in the microarray data. The main drawback of this strategy is that the replacement does not really prevent those values from being included in future biclusters. Therefore, if a bicluster overlaps with a former one, the new bicluster has been found using random values instead of the real ones.

In our work, we have adopted a strategy similar to the one used in [23], where a matrix of weights $W$ of the same size as the microarray is initialized with zero values at the beginning of the algorithm. Every time a bicluster is found, the weight matrix is updated by increasing by one those elements contained in the bicluster. In order to limit the overlap among biclusters, this matrix is used in the corresponding term of the fitness function as in equation 9, where $I$ and $J$ refer to the sets of rows and columns in the bicluster $B$, respectively, and $W(b_{ij})$ corresponds to the weight of $b_{ij}$ in $W$.

$$\mathrm{Overlap}(B) = \frac{\sum_{i \in I,\, j \in J} W(b_{ij})}{|I| \times |J| \times (n_b - 1)} \tag{9}$$

This term counts how many times the elements of $B$ have appeared in former biclusters, and divides this count by the size of $B$ multiplied by the order of the solution ($n_b$) minus one. This way, we are more permissive with the latest solutions, and the overlapping factor is also enclosed in the interval [0,1].
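The weight matrix and the overlap term of equation 9 can be sketched as follows. The function names are hypothetical, and returning 0 for the first bicluster ($n_b = 1$), where the denominator of equation 9 would vanish, is an assumption of this sketch.

```python
def overlap(weights, rows, cols, n_b):
    """Overlap term of equation 9 for the n_b-th bicluster (1-based order)."""
    if n_b <= 1:
        return 0.0  # assumption: no former biclusters, hence no overlap
    covered = sum(weights[i][j] for i in rows for j in cols)
    return covered / (len(rows) * len(cols) * (n_b - 1))


def register(weights, rows, cols):
    """Increase by one the weight of every element of a newly found bicluster."""
    for i in rows:
        for j in cols:
            weights[i][j] += 1


# Weight matrix W for a toy 4 x 4 microarray, initialized with zeros.
W = [[0] * 4 for _ in range(4)]

register(W, rows=[0, 1], cols=[0, 1])                       # first bicluster found
second_overlap = overlap(W, rows=[1, 2], cols=[1, 2], n_b=2)  # candidate second one
print(second_overlap)
```

The candidate shares one of its four elements with the first bicluster, so its overlap term is 0.25; a candidate identical to the first bicluster would reach the upper bound of 1.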

Gene variance

Biclustering was first defined by Hartigan in 1972 [48], although it was not applied to microarray data. The aim was to find a set of sub-matrices having zero variance, that is, with constant values. Therefore, Hartigan used the variance of a bicluster to evaluate its quality. However, when working with gene expression data, it is preferable to obtain biclusters in which gene variances are high. Gene variance is thus used in biclustering of microarray data to avoid obtaining trivial biclusters, favouring those solutions in which genes exhibit highly fluctuating trends. The gene variance of a bicluster is given by the mean of the variances of all the genes in it, as in equation 10, where $\mu_{g_i}$ is the mean expression of gene $i$ over the conditions of the bicluster.

Figure 1 Genes size term in volume evaluation. This figure represents the term $\mathrm{Vol}_g(B) = \frac{-\ln(|I|)}{\ln(|I|) + w_g}$ in equation 8 for different values of $w_g$. For smaller values of $w_g$, $\mathrm{Vol}_g$ starts to decrease slowly from a smaller number of genes than for greater values of $w_g$, implying that greater values of $w_g$ should be selected in order to favour a larger number of genes.

$$\mathrm{GeneVar}(B) = \frac{1}{|I| \cdot |J|} \sum_{i=1}^{|I|} \sum_{j=1}^{|J|} \left( b_{ij} - \mu_{g_i} \right)^2 \tag{10}$$

Existing biclustering approaches deal with gene variance in different ways. For instance, Cheng and Church [20] used a threshold value $\delta$ as an upper limit for their evaluation measure. This way, they search for biclusters with the maximum possible values of MSR below $\delta$, thus rejecting trivial solutions in which there are no expression changes across the samples. Nevertheless, using such a limit presents a clear drawback, since $\delta$ has to be computed for each database before applying the algorithm (see section Biclustering approaches based on evaluation measures).

In our proposal, using VEt as a single objective would produce biclusters in which gene variance is considerably low. However, if VEt is combined with volume constraints favouring bigger solutions and with overlapping control, the obtained results may not have such a low variance. Despite this fact, we have also designed a term for controlling the mean gene variance within the fitness function. This new term consists of the inverse of the mean gene variance in equation 10, since biclusters with higher gene variances are preferred and the fitness is minimized in the algorithm.
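A minimal sketch of equation 10 and of the inverse term described above is given below. The function names and the two toy biclusters are hypothetical. How the actual implementation treats a bicluster with zero gene variance is not described in the text, so this sketch simply does not handle that case.

```python
def gene_variance(b):
    """Mean of the per-gene variances of a bicluster given as a list of rows
    (equation 10)."""
    n_rows, n_cols = len(b), len(b[0])
    total = 0.0
    for row in b:
        mu = sum(row) / n_cols  # mean expression of this gene in the bicluster
        total += sum((value - mu) ** 2 for value in row)
    return total / (n_rows * n_cols)


def gene_variance_term(b):
    """Inverse of the gene variance, so that higher variance gives a lower
    (better) value when the fitness is minimized.

    Raises ZeroDivisionError for constant genes; that case is left open here.
    """
    return 1.0 / gene_variance(b)


# Toy biclusters: nearly flat genes versus strongly fluctuating genes.
flat = [[1.0, 1.1, 0.9], [2.0, 2.1, 1.9]]
lively = [[1.0, 5.0, 9.0], [2.0, 8.0, 14.0]]

print(gene_variance(flat), gene_variance_term(flat))
print(gene_variance(lively), gene_variance_term(lively))
```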

Evolutionary algorithm
Evo-Bexpa follows a sequential covering strategy, obtaining a single bicluster each time the evolutionary algorithm (Bexpa) is executed. Therefore, it has to be run n times if n biclusters are desired, where n is a user-defined parameter.

Genetic algorithms are classified as population-based meta-heuristics for combinatorial optimization, iteratively trying to improve several candidate solutions (population) with regard to a given measure of quality (fitness function) [49]. In contrast to other meta-heuristic descent methodologies, such as simulated annealing [26], tabu search [50] or particle swarm optimization [51], genetic algorithms start with a set of possible solutions instead of a single one. This characteristic allows genetic algorithms to explore a larger subset of the whole solution space, while also helping them to avoid becoming trapped at a local optimum. These reasons make genetic algorithms well suited to the biclustering problem.

The first task when choosing a genetic algorithm for solving any problem is to decide on an appropriate individual or chromosome representation for the possible solutions. We have adopted the same individual representation used in other evolutionary biclustering works [52], where each bicluster is represented by a fixed-size binary string in which a bit is set to one if the corresponding gene or sample is present in the bicluster, and to zero otherwise.

Starting from an initial population, genetic algorithms select some individuals and recombine them to generate a new population of individuals. This process is repeated for a number of generations until the algorithm converges or certain criteria are met. Algorithms 1.a and 1.b in Figure 2 show the pseudo-code of the sequential (Evo-Bexpa) and genetic (Bexpa) strategies, respectively. Evo-Bexpa (algorithm 1.a) consists of iteratively invoking Bexpa as many times as biclusters are desired. Bexpa starts with the initialization of the population, followed by an iterative process for the search of a bicluster. Both stages are detailed in the next subsections.
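The binary chromosome representation described above can be illustrated with a short sketch. The bit layout (gene bits first, then sample bits) and the helper names `encode` and `decode` are assumptions made for illustration only; the text states just that each bicluster is a fixed-size binary string with one bit per gene and per sample.

```python
from typing import List, Set, Tuple


def encode(genes: Set[int], samples: Set[int],
           n_genes: int, n_samples: int) -> List[int]:
    """Build a fixed-size bit string: n_genes gene bits, then n_samples sample bits.

    The ordering of the two segments is an assumption of this sketch.
    """
    gene_bits = [1 if g in genes else 0 for g in range(n_genes)]
    sample_bits = [1 if s in samples else 0 for s in range(n_samples)]
    return gene_bits + sample_bits


def decode(chromosome: List[int], n_genes: int) -> Tuple[Set[int], Set[int]]:
    """Recover the gene and sample index sets from a bit string."""
    genes = {i for i, bit in enumerate(chromosome[:n_genes]) if bit}
    samples = {j for j, bit in enumerate(chromosome[n_genes:]) if bit}
    return genes, samples


# A toy microarray with 6 genes and 4 samples.
chromosome = encode({0, 2, 5}, {1, 3}, n_genes=6, n_samples=4)
print(chromosome)             # [1, 0, 1, 0, 0, 1, 0, 1, 0, 1]
print(decode(chromosome, 6))  # ({0, 2, 5}, {1, 3})
```

With this layout the chromosome length is always the number of genes plus the number of samples, so standard bit-string crossover and mutation operators can be applied without any repair step.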

Initial population
The initial population procedure is essential in every evolutionary algorithm. Depending on the adopted strategy, the algorithm may converge to different solutions. Also, a suitable initial population strategy can even speed up the convergence [53].

Other evolutionary biclustering approaches have adopted a totally random initial population generation [54], where initial solutions are made up of a random number of elements (genes and samples) randomly chosen from the microarray, or random strategies in which the chromosomes are made up of only one element (one gene and one condition) from the microarray [32]. In our experimental tests on synthetic data, we found that this kind of initialization did not give the algorithm an initial solution space good enough to reach the best solution. Nevertheless, our algorithm always converged to the best solution when the initial population contained at least one 3 × 3 bicluster representing a partial solution, that is, a 3 × 3 sub-matrix of the solution. We used this kind of seed since it represents the minimal partial solution in which correlation patterns are unlikely to appear at random. Since 2 × 2 matrices of random values always present shifting and scaling behaviour, it would be impossible to differentiate their qualities.

Therefore, the initial strategy we have adopted consists of randomly generating individuals which represent 3 × 3 sub-matrices, henceforth seeds. The key is to generate many more seeds than the size of the population and then select the best ones. In fact, if the solution is known beforehand, it is quite easy to compute the number of seeds needed to increase the probability that some of them are part of the solution. The probability that a randomly generated seed is part of the solution can be computed as the number of possible seeds in the solution divided by the number of possible seeds in the whole data matrix, as in equation 11, where M and N are the number of rows and columns of the microarray data matrix, and |I| and |J| are the number of rows and columns of the solution.

\[
\frac{\text{Favorable seeds}}{\text{Total seeds}} = \frac{\binom{|I|}{3} \times \binom{|J|}{3}}{\binom{M}{3} \times \binom{N}{3}} \tag{11}
\]

Thus, our algorithm computes the number of seeds needed so that at least one of them is part of the solution. This procedure can only be performed with synthetic data, but it also gives us an idea of the number of seeds to generate in the case of real data.
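As a rough illustration of equation 11, the following sketch computes the probability that a random 3 × 3 seed falls inside the solution and, from it, a number of seeds to generate. The stopping rule in `seeds_needed` (independent draws and a target confidence of 0.95) is an assumption of this sketch, since the text does not state how the number of seeds is derived from the probability; both function names are hypothetical.

```python
import math


def seed_probability(M: int, N: int, I: int, J: int, k: int = 3) -> float:
    """Equation 11: probability that a random k x k seed lies inside
    a solution with I rows and J columns, in an M x N data matrix."""
    favorable = math.comb(I, k) * math.comb(J, k)
    total = math.comb(M, k) * math.comb(N, k)
    return favorable / total


def seeds_needed(p: float, confidence: float = 0.95) -> int:
    """Smallest n with 1 - (1 - p)^n >= confidence.

    Assumes seeds are drawn independently; the confidence level is a
    hypothetical parameter, not one taken from the paper.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    if p == 1.0:
        return 1
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-p))


# Smallest synthetic bicluster (20 x 10) in a yeast-sized matrix (2884 x 17).
p = seed_probability(2884, 17, 20, 10)
print(p, seeds_needed(p))
```

Because the probability shrinks quickly as the solution becomes small relative to the matrix, the required number of seeds grows accordingly, which is consistent with the advice to generate many more seeds than the population size.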

Figure 2 Bexpa and Evo-Bexpa algorithms. Evo-Bexpa (algorithm 1.a) consists of iteratively invoking Bexpa as many times as biclusters are desired (while loop in line 4). Starting with an empty list of biclusters (line 1), lines 6 to 8 represent the transition between two inner iterations, where the bicluster most recently found by Bexpa (invoked in line 5) is stored in the result list and the matrix of overlapping weights is updated. This weight matrix was initialized with 0 values in line 3. The inner genetic algorithm (algorithm 1.b) searches for one bicluster at a time. The weight matrix W and the order of the next bicluster are given to Bexpa for evaluation purposes. Specifically, they are involved in the overlapping control process among different biclusters. Bexpa starts with the initialization of the population in line 1. Lines 2 to 17 correspond to the iterative process for the search of each solution.


Generational change
Generational change is the mechanism that allows the population to improve its individuals, according to the fitness function, while trying to converge to the optimal solution (lines 3 to 17 in Algorithm 1.b). For each generation, the new population is formed by incorporating individuals from the previous one in several ways: replicating themselves, being mutated, being crossed with one or more other individuals, or combining some of these operators.

The next population in Bexpa is created by first adding the best individual of the current population to the next one, as can be seen in line 4 of Algorithm 1.b in Figure 2. This process is called elitism and is usually applied in order to ensure the convergence of the algorithm [55]. Also, a mutated copy of the best individual is incorporated into the next population (line 5). The rest of the individuals are generated by selecting one or two individuals and applying crossover and/or mutation. Selection is based on the use of the fitness function together with a random component. In our approach, we have used tournament selection of size 3 [49]. 80% of the remaining individuals are generated by the crossover of two previously selected chromosomes (lines 6 to 10), while the other 20% correspond to replications (lines 11 to 14). The resulting offspring is mutated with a certain probability in both cases.

Three distinct crossover operators are used in our algorithm with equal probability: one-point crossover, two-point crossover, and uniform crossover. We have also applied two different mutation operators: the simple and the uniform ones. The probability of the uniform mutator is much lower, since every position of the bit string is a candidate to be mutated in the uniform mutation.

The number of generations (iterations) has been set to 1500, although the execution is stopped if there is no significant improvement after 150 consecutive generations. Crossover and replication percentages, as well as mutation probabilities and the number of generations, have been set experimentally, although all of them are input parameters for the evolutionary algorithm and can be modified by the user.
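The generational change described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the mutation probability `p_mut`, the interpretation of the simple mutation as a single bit flip, and all function names are assumptions, and the uniform mutation and the early-stopping rule are omitted for brevity. Fitness is minimized, as stated in the text.

```python
import random


def tournament(population, fitness, size=3):
    """Draw `size` random individuals and return the one with the lowest fitness."""
    return min(random.sample(population, size), key=fitness)


def one_point(a, b):
    cut = random.randint(1, len(a) - 1)
    return a[:cut] + b[cut:]


def two_point(a, b):
    i, j = sorted(random.sample(range(1, len(a)), 2))
    return a[:i] + b[i:j] + a[j:]


def uniform(a, b):
    # Each position is inherited from either parent with equal probability.
    return [random.choice(pair) for pair in zip(a, b)]


def mutate_simple(individual):
    """Flip one randomly chosen bit (assumed meaning of the 'simple' mutation)."""
    child = list(individual)
    child[random.randrange(len(child))] ^= 1
    return child


def next_generation(population, fitness, p_mut=0.1, crossover_share=0.8):
    best = min(population, key=fitness)
    new_pop = [list(best), mutate_simple(best)]  # elitism plus a mutated elite
    remaining = len(population) - len(new_pop)
    n_cross = round(crossover_share * remaining)
    for k in range(remaining):
        if k < n_cross:
            # Crossover: the three operators are chosen with equal probability.
            op = random.choice([one_point, two_point, uniform])
            child = op(tournament(population, fitness),
                       tournament(population, fitness))
        else:
            # Replication of a selected individual.
            child = list(tournament(population, fitness))
        if random.random() < p_mut:
            child = mutate_simple(child)
        new_pop.append(child)
    return new_pop
```

Because the best individual is copied unchanged into the next population, the best fitness value can never get worse from one generation to the next, which is the practical effect of elitism.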

Fitness function
The fitness function used in our algorithm for the evaluation of the potential solutions is presented here. Although we have used the four different objectives described above in our experiments, the fitness function is easily configurable by adding new objectives in the form of a mathematical formula.

In the context of evolutionary algorithms, the fitness function is a particular type of objective function used to summarise, as a single figure of merit, how close a given design solution is to achieving the set aims.

Equation 12 depicts the final fitness function used in our algorithm. Note that the goal is to minimize the value of every term, in order to find big-sized biclusters with a low value of VEt, high gene variance and little overlap.

\[
\mathcal{F}(B) = \frac{VE^{t}(B)}{VE^{t}(M)} + w_s \cdot Vol(B) + w_{ov} \cdot Overlap(B) + w_{var} \cdot \frac{1}{1 + GeneVar(B)} \tag{12}
\]

Every term is weighted, except VEt, which acts as the reference objective. Nevertheless, the value of VEt for the bicluster has been divided by the VEt value of the whole microarray (M refers to the microarray data matrix). This is due to the fact that the range of values of VEt depends on the values in each microarray. Without this normalization, the weights of the other terms of the fitness function would have to be recomputed when using a different microarray.

Modifying the weights associated with the different objectives leads the algorithm towards different kinds of biclusters, according to their sizes, overlapping amount or gene variance. All weights have been designed in the same way: a lower value of a certain weight will result in biclusters with lower values for the corresponding characteristic, and vice versa. For example, a lower value of ws will lead to small-sized biclusters, while bigger values of ws will result in big-sized biclusters. In the results section we provide default values for every weight, which have been obtained experimentally and have produced meaningful results for all the databases under study. Also, we provide the user with guidance on how the modification of the weights affects the different characteristics of the obtained biclusters.

Note that it is quite simple to add new objectives to the fitness. A new mathematical formula should be designed for each new bicluster feature to be taken into account. This formula will be minimized when inserted into the fitness function, and will also have a corresponding weight. In order to better control the effect on the results, it is preferable that the range of values be fixed, not dependent on the specific values of the microarray or bicluster.
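The weighted-sum structure of equation 12, and the way a new objective can be appended to it, can be sketched as below. The values of VEt, Vol, Overlap and GeneVar are assumed to be computed elsewhere (their definitions are given earlier in the paper) and are passed in as a plain dictionary; the dictionary keys, `make_fitness` and `default_terms` are hypothetical names. The weights 5.0, 5.0 and 0.1 are the defaults for ws, wov and wvar listed in Table 1.

```python
from typing import Callable, Dict, List, Tuple

Stats = Dict[str, float]
Term = Callable[[Stats], float]


def make_fitness(weighted_terms: List[Tuple[float, Term]]) -> Callable[[Stats], float]:
    """Return a fitness function: normalised VEt plus a weighted sum of terms.

    Every term is expected to be 'smaller is better', since the fitness
    is minimized.
    """
    def fitness(stats: Stats) -> float:
        reference = stats["ve_b"] / stats["ve_m"]  # VEt(B) / VEt(M), unweighted
        return reference + sum(w * term(stats) for w, term in weighted_terms)
    return fitness


# Terms of equation 12 with the default weights ws, wov and wvar.
default_terms: List[Tuple[float, Term]] = [
    (5.0, lambda s: s["vol"]),                     # ws * Vol(B)
    (5.0, lambda s: s["overlap"]),                 # wov * Overlap(B)
    (0.1, lambda s: 1.0 / (1.0 + s["gene_var"])),  # wvar * 1 / (1 + GeneVar(B))
]

fitness = make_fitness(default_terms)
stats = {"ve_b": 0.02, "ve_m": 0.2, "vol": 0.1, "overlap": 0.0, "gene_var": 9.0}
print(fitness(stats))

# Adding a new objective is one more (weight, term) pair; "my_term" is a
# made-up key standing in for any user-defined feature with a fixed range.
extended = make_fitness(default_terms + [(2.0, lambda s: s["my_term"])])
```

Keeping each term as a separate callable mirrors the recommendation in the text: a new objective only needs a formula to be minimized and a weight of its own.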

Results and discussion
This section presents a wide set of experiments performed to test the validity of Evo-Bexpa, both on synthetic and real data sets. The results have been compared with those obtained using five different approaches: OPSM [34], ISA [35,36], xMotifs [37], CC [20] and Bimax [38] (see section Background for a short description of each approach). All five algorithms have been executed using BicAT (Biclustering Analysis Toolbox) [39].

The next subsection presents an in-depth analysis of the performance of Evo-Bexpa when modifying the different configuration parameters introduced in section Bicluster Evaluation in Evo-Bexpa, as well as a study of the different parameters for the algorithms in BicAT. Later, subsections Synthetic Data Experiments and Experiments on Real DataSets describe the experiments carried out on artificial and real data sets, respectively.

Analysis of parameters
Each biclustering approach needs different parameters to run. Although default parameters are provided which should guide the algorithms towards reasonable results, none of the approaches gives a detailed description of how variations in these parameters affect the obtained biclusters. In this subsection we first describe the input parameters for each of the algorithms in BicAT (OPSM, ISA, xMotifs, CC and Bimax), trying to clarify which characteristics of the resulting biclusters are affected by the modification of the different parameters. After that, we present a study of the parameter sensitivity of Evo-Bexpa.

Analysis of parameters for algorithms in BicAT
Bimax uses an underlying binary data model which assumes two possible expression levels per gene. Therefore, as a preprocessing phase, it is compulsory to discretize the expression values to binary values at a specific threshold and with a specific scheme. All values above the threshold are set to one, and all those below to zero. The discretization scheme defines whether only down-regulated genes, only up-regulated genes, or both are considered.

The algorithm also takes as input parameters the minimum number of genes and samples for the output biclusters. By specifying larger lower bounds, fewer biclusters will be returned, thus reducing the computing time. Default values for both the minimum number of genes and conditions have been set to two.

The CC algorithm takes as input parameters two different thresholds: δ, the upper limit for MSR, which has already been mentioned, and α > 1, a threshold for the multiple node deletion phase. δ presents two main drawbacks: its value depends on the input microarray and has to be computed beforehand (there is no common default value), and the use of δ prevents the algorithm from obtaining meaningful solutions [21,23]. The default value for α has been set to 1.2, and the authors claim that when it is properly selected, the multiple node deletion phase is usually extremely fast. Nevertheless, there is no explanation of how this value affects the results, nor are there criteria for finding an efficient value for α.

CC also receives as an input parameter the number of biclusters to obtain, since it is based on a sequential covering strategy, as is Evo-Bexpa.

The OPSM approach is based on the formulation of a probabilistic model of the expression data. As finding the best model is infeasible for real data, Ben-Dor et al. use partial models and grow them iteratively. The algorithm takes as input parameter the number of partial models passed on at each iteration. According to the authors, increasing this number would improve results, although at the cost of a higher running time. Nevertheless, it is not clear which aspect of the obtained biclusters (size, quality or other) is affected by modifying this parameter in real data. Furthermore, they do not provide any instruction on how to select an appropriate value for it.

The Iterative Signature Algorithm (ISA) receives three different input parameters. Tg and Tc are thresholds for the resolution of the modular decomposition of genes and conditions, respectively. Tc is said to have a minor effect on the results, and was set to 2 in all the analyses. Tg was varied from 1.8 to 4.0 in steps of 0.1, in order to analyse the resulting stringency of co-regulation between the genes. The default value for Tg can be assumed to be 2.0. Although the authors analyse the influence of Tg on the results for a specific dataset [36], it is not straightforward to see what the influence will be for any other dataset.

The third input parameter for ISA is the number of starting points that the algorithm uses for randomly selecting a set of genes and iteratively refining this set until the genes and conditions in it are mutually consistent and match the definition of a transcription module. The authors claim that, using a sufficiently large number of initial sets, it is possible to determine all the modules corresponding to a particular pair of thresholds. The default value for this parameter is set to 100.

xMotifs looks for biclusters in which genes are expressed in the same state across all samples. In order to differentiate biologically interesting states, a maximum p-value parameter is used, considering only those states whose p-value is less than the parameter (1 × 10^−9). Another parameter, α, determines the minimum number of samples for biclusters, given as a fraction of the total number of conditions, with a default value of 0.05. Murali and Kasif also make use of parameters internal to the algorithm, such as the number of seeds (ns), the number of determinants (nd) and the size of the discriminating set (sd), as in [56]. The authors claim that the quality of the results does not change much when those are slightly varied.

Analysis of Evo-Bexpa parameter sensitivity on real datasets
Input parameters for Evo-Bexpa were detailed in section Methods. The number of parameters depends on the number of objectives or bicluster characteristics to optimize. In this approach, we have used 5 different configuration parameters, which control the volume (wg, wc and ws), the amount of overlapping (wov) and the gene variance (wvar).

Default values for Evo-Bexpa have been set experimentally by using a benchmark database and trying to reproduce previous results for this database in the literature. Also, solutions with a low proportion of the total number of genes and a high percentage of the total number of samples have been favoured for the setting.

In order to deduce the parameter influence on each characteristic, we have tested Evo-Bexpa modifying each configuration parameter from −100% to +100% of its value, in intervals of ±25%. Table 1 shows all the values used, where the central row gives the default ones. This means running Evo-Bexpa 8 additional times per parameter, using each value of Table 1 for each weight while maintaining the other weights at their default values. Together with the default run, this represents 41 experiments for each dataset (5 parameters × 8 values + 1). Furthermore, we have chosen four different microarrays to study the parameters' influence in diverse scenarios (see Table 3). All in all, for the purpose of the parameter influence analysis, a total of 164 experiments over real datasets have been carried out, with 100 biclusters obtained in each execution. All weights must be non-negative, with 0.0 being the value for which the corresponding objective exerts no influence on the results. However, they can be set to any positive value, even above +100% of their default values, if more influence of any bicluster characteristic is desired.

In the following, parameter analysis is only presented for the Embryonal tumours of the central nervous system dataset [57], since results for the other datasets are similar and do not contribute anything new to the study.

Figures 3, 4, 5, 6 and 7 represent the variations of the means and deviations for all the different objectives (VEt, number of genes, number of conditions, overlap and gene variance) when modifying each configuration parameter. Each figure is made up of four different graphics which depict the influence of a certain weight over the aforementioned bicluster aspects. The main graphic of each figure shows the variations of the means and deviations for the main aspect affected by the weight modifications. The abscissa axis refers to the specific weight values, according to Table 1, while the ordinate axis depends on the configuration parameter under study. For example, in the first three figures (3, 4 and 5), the vertical axis corresponds to the means of the number of elements in the biclusters (genes or samples). At the right side of the main graphic, the way in which the variations of the parameter affect the other characteristics has also been represented.

Means of the 100 biclusters represent the general tendency of the results. Nevertheless, deviations cannot be disregarded: although it is possible to favour some properties in the solutions, the results provided by Evo-Bexpa are diverse, yielding biclusters whose properties vary in a range around the reported mean.

Although the modification of any configuration parameter does not only affect its corresponding aspect, it can be clearly seen that the greatest variations in any characteristic are obtained by increasing or decreasing its associated weight. Furthermore, some objectives are related in a negative way. Mean gene variance, for instance, decreases if bicluster size is increased or the overlap decreases. Therefore, it would be good practice to slightly correct the gene variance parameter when size or overlap parameters are adjusted, or vice versa. Other characteristics behave differently when adjusting any other weights. The mean of the number of genes and conditions is quite stable when modifying wov in Figure 6, as is the overlap mean when wc is adjusted in Figure 4. In general, VEt increases whenever greater sizes or less

Table 1 Experimental values for configuration parameters

           wg       wc      ws     wov    wvar
−100%      0.0      0.0     0.0    0.0    0.0
−75%       0.0625   0.125   1.25   1.25   0.025
−50%       0.125    0.25    2.5    2.5    0.05
−25%       0.1875   0.375   3.75   3.75   0.075
Default    0.25     0.5     5.0    5.0    0.1
+25%       0.3125   0.625   6.25   6.25   0.125
+50%       0.375    0.75    7.5    7.5    0.15
+75%       0.4375   0.875   8.75   8.75   0.175
+100%      0.5      1.0     10.0   10.0   0.2

This table shows the different configuration parameter values used in our experimentation. Each weight has been modified from −100% to +100% of its default value, in intervals of ±25%. Default values in the central row have been obtained with synthetic data experiments. Variations have been made individually, invoking Evo-Bexpa for each different weight value, while maintaining the other weights at their default values.
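The one-at-a-time sweep behind Table 1 can be reproduced with a few lines of code, which also confirms the experiment counts quoted in the text. The default weights are those of the central row of Table 1; the function name `sweep` is a hypothetical helper used only in this sketch.

```python
defaults = {"wg": 0.25, "wc": 0.5, "ws": 5.0, "wov": 5.0, "wvar": 0.1}

# Relative changes from -100% to +100% in steps of 25%, excluding 0 (the default).
steps = [-1.0, -0.75, -0.5, -0.25, 0.25, 0.5, 0.75, 1.0]


def sweep(defaults, steps):
    """Default configuration plus one configuration per (weight, step) pair,
    changing a single weight at a time."""
    configs = [dict(defaults)]
    for name, value in defaults.items():
        for step in steps:
            config = dict(defaults)
            config[name] = value * (1.0 + step)
            configs.append(config)
    return configs


configs = sweep(defaults, steps)
print(len(configs))      # 41 experiments per dataset (5 weights x 8 values + 1)
print(len(configs) * 4)  # 164 experiments over the four real datasets
```

Each generated configuration differs from the defaults in at most one weight, which is what allows the effect observed in Figures 3 to 7 to be attributed to a single parameter.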


Figure 3 wg influence over the different bicluster features. Main graphic shows the variations of the means and deviations of both the number of genes and conditions, depending on the values of wg on the abscissa axis. At the right side of the main graphic, the way in which the variations of wg affect VEt, overlap and mean gene variance has also been represented.

overlapping is preferred. This was the expected behaviour, since bigger solutions would produce higher values of VEt, unless they were closer to a perfect combined pattern. On the contrary, biclusters with higher mean gene variance would have lower values of VEt, due to the reduction of their sizes when higher variances are required.

Table 2 presents a summary of the influence of the configuration parameters over the different bicluster characteristics. This table has been elaborated using the four real datasets in Table 3 and the aforementioned variations of the weights. This way, Table 2 represents the common behaviour observed in all the datasets under study.

In short, Evo-Bexpa parametrization allows the user to specify preferences on bicluster features by adjusting the corresponding weight(s). The recommended procedure consists of first running the algorithm using the default configuration, and afterwards correcting those weights needed to reach the desired results in terms of the objectives. In order to select an appropriate correction, Figures 3, 4, 5, 6 and 7, together with the information in Table 2, should


Figure 4 wc influence over the different bicluster features. Main graphic shows the variations of the means and deviations of both the number of genes and conditions, depending on the values of wc on the abscissa axis. At the right side of the main graphic, the way in which the variations of wc affect VEt, overlap and mean gene variance has also been represented.

Figure 5 ws influence over the different bicluster features. Main graphic shows the variations of the means and deviations of both the number of genes and conditions, depending on the values of ws on the abscissa axis. At the right side of the main graphic, the way in which the variations of ws affect VEt, overlap and mean gene variance has also been represented.

be used, being aware of the implications that each weightvariation has on the other bicluster aspects.

Synthetic data experiments
In order to test the effectiveness of Evo-Bexpa in finding biclusters following shifting and scaling patterns, we have carried out a set of experiments inspired by the works of A. Mukhopadhyay et al. [30] and D. Bozdag et al. [22], where perfect synthetic biclusters with shifting or scaling tendencies were inserted into artificial data sets. For greater generality, we have used combined patterns (shifting and scaling simultaneously) for bicluster generation. These biclusters have been hidden in several artificial data matrices with uniform random distributions.

We have chosen the size of one of the most tested benchmark microarrays in biclustering, the yeast Saccharomyces cerevisiae cell cycle expression dataset [58], made up of 2884 genes and 17 samples, for the generation of artificial matrices. We have also defined several sizes (genes × conditions) for the inclusion of perfect biclusters: 20 × 10, 60 × 12, 100 × 13, 150 × 15 and 200 × 16. For each of these sizes we have generated a perfect

Figure 6 wov influence over the different bicluster features. Main graphic shows the variations of the means and deviations of the overlap term in equation 9, depending on the values of wov on the abscissa axis. At the right side of the main graphic, the way in which the variations of wov affect the number of genes and conditions, VEt and mean gene variance has also been represented.

Figure 7 wvar influence over the different bicluster features. Main graphic shows the variations of the means and deviations of the mean gene variance term in equation 10, depending on the values of wvar on the abscissa axis. At the right side of the main graphic, the way in which the variations of wvar affect the number of genes and conditions, VEt and the overlap term has also been represented.

bicluster according to a combined shifting and scaling pattern. Each of these 5 different-sized perfect biclusters has been inserted into 5 different random in silico microarrays at random positions. Thus, a total of 25 different case studies constitute the first set of experiments, in which no noise has been introduced.

Furthermore, we have also generated the same number of experiments adding noise to the data, with random values generated from a normal distribution with mean equal to 0 and deviation equal to 0.25. All in all, there are 50 different experiments, 25 in which the biclusters follow a perfect pattern and 25 in which random noise has been included in the data.

For each of the experiments, we have run the following 6 different biclustering algorithms: OPSM, ISA, xMotifs, CC, Bimax and Evo-Bexpa, presented in this work. We have used default parameters to run all of them.

In order to check the extent to which the bicluster obtained by each algorithm adjusts to the solution, we have used match score indexes for both genes and conditions [38] as performance measure. Let $B_1(I_1, J_1)$ and $B_2(I_2, J_2)$ be two biclusters; then the gene match score is defined as $S_I(I_1, I_2) = \frac{|I_1 \cap I_2|}{|I_1 \cup I_2|}$ and the condition match score is defined as $S_J(J_1, J_2) = \frac{|J_1 \cap J_2|}{|J_1 \cup J_2|}$. Both indexes vary from 0, when both sets of genes (or conditions) are disjoint, to 1, when the sets totally match. This way, match score indexes can be used to compute the degree of similarity of the sets of genes and conditions of two biclusters. We have therefore compared each bicluster obtained with the corresponding solution for all the executions of the six algorithms.

Figure 8 displays the gene and condition match scores

of the executions of the six algorithms. The X-axis represents gene match scores and the Y-axis represents condition match scores. Each dot in the graphic refers to the comparison of a bicluster found by each algorithm and the equivalent solution. According to the gene and condition match score definitions, the dots in the top right corner of the graphic correspond to those obtained biclusters

Table 2 Qualitative influence of the configuration parameters over the different objectives

Weight   VEt   #Genes   #Conditions   Overlap   Mean gene variance
wg       �     �        =             ↘↗        �
wc       ↑     ↘=       �             =         �
ws       �     �        �             ↘↗        �
wov      ↑     =        =             �         �
wvar     �     �        ↗↘            ↘↗        �

Each row represents the influence of each configuration parameter in the first column over the characteristics in the first row, where the behaviour of each row has been observed when increasing the corresponding weight from 0.0 to its maximum experimented value (see Table 1). Symbol � represents large increments, ↑ medium increments and � large decrements. Symbol = stands for no significant variations, ↘↗ depicts a decreasing behaviour for low values of the weight, turning to increasing for higher values of the parameter, while ↗↘ depicts the contrary situation.


Table 3 Datasets used in the experimentation

Dataset     Name                                            #Genes   #Conditions   Ref.
Yeast       Yeast Saccharomyces cerevisiae cell cycle       2884     17            [58]
Embryonal   Embryonal tumors of the central nervous syst.   7129     60            [57]
Leukemia    Leukemia                                        7129     72            [4]
Steminal    Steminal Cells                                  26127    30            [59]

This table includes the name, sizes and references to the corresponding publications of the four datasets used in our experimentation. The Yeast dataset is one of the most used datasets for comparison of biclustering techniques, and is considered to be a benchmark dataset.

which have a better match with their equivalent solutions. OPSM, CC and Evo-Bexpa are the algorithms with the best results. In the case of Evo-Bexpa, there are exactly five biclusters which are not correctly found, and whose score indexes are below 0.6 and 0.3 for the condition and gene sets, respectively. We have studied these results and have found that they correspond to the five experiments in which the hidden biclusters are the smallest (20 × 10) and noise has been introduced. Only OPSM finds better solutions than Evo-Bexpa in these experiments, while for the other cases Evo-Bexpa outperforms both OPSM and CC. In fact, we have conducted a statistical test which confirms that Evo-Bexpa outperforms the other five algorithms in finding perfect shifting and scaling behaviours in synthetic data.

The match score can also be used for measuring the degree of similarity of two biclusters, by combining the former gene and condition match score indexes. This way, we have used the bicluster match score index in order to rank the effectiveness of the algorithms. The bicluster match score is defined as $\sqrt{S_I(I_1, I_2) \times S_J(J_1, J_2)}$, and varies from 0, when the biclusters B1 and B2 are disjoint, to 1, when B1 and B2 completely match.

Since our results do not follow a normal distribution, we have applied the Friedman test as a non-parametric test to carry out a comparison which involves six different methods. The Friedman test indicates that the results obtained by the six algorithms are statistically different, with a p-value of 1.16 × 10^−10. Also, the ranking provided by Friedman suggests the following order: Evo-Bexpa, OPSM, CC, ISA, Bimax, xMotifs, which is in concordance with the representation in Figure 8. Furthermore, we have also performed a post-hoc procedure in order to establish a pairwise comparison using our algorithm as the control method. In this comparison, we obtained for each of the other five algorithms a p-value lower than the alpha values returned by five different post-hoc procedures (Holm, Holland, Rom, Finner and Li), which certifies that our proposal Evo-Bexpa outperforms the other five algorithms in this empirical study with a significance level below 0.05. STATService [60] has been used in order to perform these statistical tests.
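The gene, condition and bicluster match scores used in this comparison can be sketched in a few lines. The function names and the convention of returning 0.0 when both sets are empty are assumptions of this sketch; the formulas themselves follow the definitions given in the text.

```python
import math


def match_score(a, b):
    """Size of the intersection divided by the size of the union of two index sets."""
    a, b = set(a), set(b)
    union = a | b
    # Returning 0.0 for two empty sets is a convention chosen for this sketch.
    return len(a & b) / len(union) if union else 0.0


def bicluster_match_score(genes1, conds1, genes2, conds2):
    """Square root of the product of the gene and condition match scores."""
    return math.sqrt(match_score(genes1, genes2) * match_score(conds1, conds2))


# A found bicluster sharing 2 of 4 distinct genes and all conditions
# with the hidden solution.
found_genes, found_conds = {1, 2, 3}, {0, 1}
true_genes, true_conds = {2, 3, 4}, {0, 1}
print(match_score(found_genes, true_genes))   # 0.5
print(match_score(found_conds, true_conds))   # 1.0
print(bicluster_match_score(found_genes, found_conds, true_genes, true_conds))
```

Since the bicluster score is a geometric mean, it drops to 0 as soon as either the gene sets or the condition sets are disjoint, and it reaches 1 only when both match completely.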

Figure 8 Gene and condition match scores for ISA, xMotifs, OPSM, BIMAX, CC and Evo-Bexpa in synthetic experiments. X-axis representsgene match scores and Y-axis represents condition match scores. Each dot in the graphic refers to the comparison of a bicluster found by eachalgorithm and its equivalent solution. Dots in the right top corner of the graphic correspond to those obtained biclusters which have a better matchwith its equivalent solution.


Experiments on real DataSets
Experiments on four different real microarrays have been conducted using Evo-Bexpa and the five algorithms contained in the BicAT toolbox: OPSM, CC, ISA, Bimax and xMotifs. Table 3 specifies the details of the datasets, including their sizes as well as references to their corresponding publications. The Yeast dataset is the smallest, made up of 2884 genes and 17 samples, and is one of the most used datasets for comparison of biclustering techniques. In fact, it is considered a benchmark dataset by many researchers. The Leukemia dataset is the one containing the highest number of samples, while Steminal is the most unbalanced microarray, with the largest number of genes (26127) and only 30 samples.

Table 4 presents the results for each dataset and algorithm using default parameters in all cases. Results are represented by the number of biclusters obtained, means and deviations of their volume (number of genes and experimental conditions), and means and deviations of their VEt values and gene variance (see section Methods). Results have also been grouped by dataset, with the first five rows corresponding to the executions of OPSM, ISA, CC, Bimax and Evo-Bexpa for the Yeast microarray, respectively. Unfortunately, the BicAT implementations of the xMotifs and Bimax approaches did not work properly for every dataset in Table 3. Specifically, xMotifs could not be run for the Yeast and Steminal datasets, due to unexpected runtime errors. Nor could xMotifs be executed for the Leukemia dataset, since it does not support more than 64 samples, according to the BicAT toolbox. In the case of Bimax, we did not obtain results for the Embryonal, Leukemia or Steminal datasets in reasonable time. This fact might be related to the dataset sizes, since the mean of the bicluster sizes for the Yeast dataset using Bimax is greater than 75% of the size of the whole microarray, and the computational cost of generating quality biclusters of similar proportions for the other datasets may be unfeasible. Nevertheless, Bimax has run properly for the Yeast dataset and synthetic data matrices of the same size.

Bimax generated 56 biclusters for the Yeast dataset, all of them very big, containing almost the totality of the elements, both genes and conditions. In fact, although we did not measure overlap for the algorithms in BicAT, it must certainly be high, since biclusters are made up of a mean of almost 2300 genes (out of 2884) and 15, 16 or 17 experimental conditions (out of 17). Studying correlations in this kind of bicluster is almost as difficult as studying the whole dataset. It would even be easier to analyse the genes and/or samples not contained in the biclusters.

Murali and Kasif's xMotifs generates 1000 biclusters for the Embryonal Tumours dataset, all of which have 5 samples and a number of genes that decreases with the bicluster index. The first bicluster is made up of 4593 genes, more

Table 4 Summary of experimental results for the microarrays in Table 3

| Dataset | Algorithm | NumBic | Genes | Conditions | VEt | Mean gene variance |
|---|---|---|---|---|---|---|
| Yeast 2884x17 | OPSM | 14 | 496.1 ± 791.1 | 8.6 ± 4.4 | 0.189 ± 0.051 | 1.39 × 10^5 ± 2.23 × 10^5 |
| Yeast 2884x17 | ISA | 0 | - | - | - | - |
| Yeast 2884x17 | CC | 100 | 34.0 ± 64.2 | 7.6 ± 3.2 | 0.151 ± 0.048 | 3.20 × 10^3 ± 3.16 × 10^3 |
| Yeast 2884x17 | Bimax | 56 | 2297.6 ± 26.1 | 15.3 ± 0.5 | 0.207 ± 0.004 | 1.42 × 10^5 ± 5.41 × 10^3 |
| Yeast 2884x17 | Evo-Bexpa | 100 | 44.0 ± 33.7 | 11.8 ± 3.9 | 0.051 ± 0.027 | 9.81 × 10^2 ± 5.00 × 10^2 |
| ET 7129x60 | OPSM | 12 | 1151.5 ± 1809.1 | 7.8 ± 4.1 | 0.155 ± 0.072 | 2.51 × 10^8 ± 4.22 × 10^8 |
| ET 7129x60 | ISA | 20 | 377.0 ± 191.0 | 2.7 ± 0.8 | 0.145 ± 0.053 | 5.98 × 10^8 ± 3.85 × 10^8 |
| ET 7129x60 | xMotifs | 1000 | 2134.6 ± 830.4 | 5.0 ± 0.0 | 0.135 ± 0.021 | 9.61 × 10^7 ± 3.74 × 10^7 |
| ET 7129x60 | CC | 100 | 53.1 ± 100.4 | 11.9 ± 7.9 | 0.310 ± 0.082 | 6.14 × 10^4 ± 2.06 × 10^5 |
| ET 7129x60 | Evo-Bexpa | 100 | 22.5 ± 6.8 | 50.8 ± 5.7 | 0.011 ± 0.003 | 1.78 × 10^7 ± 1.59 × 10^7 |
| Leukemia 7129x72 | OPSM | 12 | 924.3 ± 1633.3 | 7.8 ± 4.3 | 0.103 ± 0.047 | 1.75 × 10^8 ± 2.89 × 10^8 |
| Leukemia 7129x72 | ISA | 34 | 253.1 ± 172.1 | 3.1 ± 1.1 | 0.147 ± 0.049 | 3.31 × 10^8 ± 2.17 × 10^8 |
| Leukemia 7129x72 | CC | 100 | 53.5 ± 232.4 | 13.9 ± 8.6 | 0.265 ± 0.067 | 4.32 × 10^4 ± 8.53 × 10^4 |
| Leukemia 7129x72 | Evo-Bexpa | 100 | 18.4 ± 2.9 | 63.3 ± 5.9 | 0.008 ± 0.002 | 4.46 × 10^6 ± 2.97 × 10^6 |
| Steminal 26127x30 | OPSM | 27 | 1170.6 ± 3274.1 | 16.2 ± 8.8 | 0.399 ± 0.163 | 1.73 × 10^7 ± 5.55 × 10^7 |
| Steminal 26127x30 | ISA | 0 | - | - | - | - |
| Steminal 26127x30 | CC | 100 | 179.1 ± 813.6 | 13.0 ± 3.1 | 0.219 ± 0.071 | 4.84 × 10^4 ± 1.19 × 10^5 |
| Steminal 26127x30 | Evo-Bexpa | 100 | 33.8 ± 17.9 | 26.3 ± 2.4 | 0.009 ± 0.004 | 4.80 × 10^5 ± 2.69 × 10^5 |

Results are represented by the number of biclusters obtained, means and deviations of their volume (number of genes and experimental conditions), and means and deviations of their VEt values and mean gene variance (see section Methods).


than half of the whole dataset, while bicluster number 999 consists of 576 genes. We consider this number of biclusters to be cumbersome for any post-analysis, even more so if it has to be carried out manually. Also, the number of genes per bicluster may again be too high for any specific study.

The Iterative Signature Algorithm (ISA) only found biclusters for the Embryonal Tumours (20) and Leukemia (12) microarrays. In both cases the biclusters are obtained with a decreasing number of genes and conditions; the number of conditions is very low (2 or 3 samples per bicluster), which we consider almost useless in biclustering analyses. The number of genes varies from 661 to 81 for the Embryonal dataset and from 707 to 83 for Leukemia. For both datasets the biclusters obtained by ISA have the greatest gene variance.

OPSM, together with CC and Evo-Bexpa, produced results for the four datasets. OPSM biclusters are characterized by having the greatest deviation in the number of genes. In fact, OPSM bicluster sizes range from biclusters containing a few genes and a great number of samples to the opposite: almost the whole set of genes and very few samples (2×17 to 2422×2 for Yeast, 2×16 to 5491×2 for ET, 2×17 to 5208×2 for Leukemia and 6×30 to 15332×2 in the Steminal case). From the biological point of view, only a small portion of the obtained biclusters are interesting: those in the intermediate situations.

The CC algorithm allows the user to choose the number of biclusters to obtain, 100 being its default value. It is a sequential process in which random data are inserted into the matrix to mask each bicluster once it has been found. For these reasons, the first biclusters are in general larger than the following ones, and the last 10 biclusters are the smallest. VEt values are quite high, especially for the Embryonal Tumours (VEt = 0.3098) and Leukemia (VEt = 0.2652) datasets, where bicluster sizes are not large enough to account for values in this range. Also, results produced by CC are rather flat, since their gene variance is in the majority of cases the lowest of all the algorithms.

Default parameters for Evo-Bexpa (Table 1) have been adjusted to produce biclusters with a very low proportion of genes but a high proportion of samples, although there is considerable diversity in the results, as shown by the deviations. Evo-Bexpa obtains the biclusters with the lowest gene variance only for the Yeast dataset, while its VEt is always much lower, as preferred. In fact, VEt values for the biclusters found by Evo-Bexpa are smaller than 0.1 for all datasets, whereas no other algorithm finds biclusters with such a low VEt level. This is a very good achievement of our approach, given the importance of VEt as a quality measure for quantifying all kinds of patterns in gene expression data (see section Methods). Furthermore, although VEt values increase for bigger biclusters or those with lower levels of overlapping, it can be seen in Figures 1, 3, 4, 5 and 6 that they are never greater than the VEt of the biclusters found by the other approaches. The order in which biclusters are found with Evo-Bexpa is not relevant, although if the weights associated with overlapping and size are too high, Evo-Bexpa will produce big submatrices with no overlap, thus increasing VEt considerably for the latest solutions.

The great advantage of Evo-Bexpa over the other algorithms is its ability to adjust the characteristics of the results to user-defined parameters. The next subsection presents the biological validation of the biclusters obtained by Evo-Bexpa, using the same parameter configurations introduced in section Analysis of Evo-Bexpa Parameter Sensibility on Real DataSets, which confirms the validity of our approach.
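The per-algorithm summaries reported in Table 4 (mean ± deviation of the number of genes, number of conditions and gene variance) can be approximated with a short script. The following is a minimal sketch, not the code used for the experiments: the function names `mean_gene_variance` and `summarize`, the representation of a bicluster as a pair of row and column index lists, the use of the population variance per gene and of the sample standard deviation across biclusters are all assumptions made for illustration. VEt is left out because its definition belongs to section Methods.

```python
from statistics import mean, pvariance, stdev


def mean_gene_variance(matrix, rows, cols):
    """Average, over the genes (rows) of a bicluster, of each gene's
    variance across the bicluster's conditions (columns).
    Assumption: population variance is used for each gene."""
    return mean([pvariance([matrix[r][c] for c in cols]) for r in rows])


def summarize(matrix, biclusters):
    """Return {feature: (mean, deviation)} over a list of biclusters.

    `biclusters` is a list of (row_indices, col_indices) pairs
    (hypothetical representation chosen for this sketch)."""
    features = {
        "genes": [float(len(rows)) for rows, _ in biclusters],
        "conditions": [float(len(cols)) for _, cols in biclusters],
        "gene_variance": [
            mean_gene_variance(matrix, rows, cols) for rows, cols in biclusters
        ],
    }
    summary = {}
    for name, values in features.items():
        # The sample standard deviation needs at least two biclusters.
        deviation = stdev(values) if len(values) > 1 else 0.0
        summary[name] = (mean(values), deviation)
    return summary
```

Calling `summarize(expression_matrix, biclusters)` for the output of each algorithm would give one row of a Table 4 style report, where `expression_matrix` is a list of per-gene lists of expression values.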

Biological assessment
The Gene Ontology project [61] (GO) is an initiative to unify the representation of gene and gene product attributes across all species. It is a directed acyclic graph whose nodes represent terms dealing with molecular functions, cell components or biological processes, and whose edges depict dependency relationships between nodes. Gene Ontology has been widely used in genome research applications, and also for the validation of results obtained after a microarray analysis process, such as clustering or biclustering.

Term-for-Term analysis represents the standard method of performing statistical analysis for over-representation in GO. Starting from a subset of genes (study group) from a larger population (the whole set of genes in the microarray), we are interested in knowing whether the frequency of an annotation to a Gene Ontology term is relevant for the study group compared to the overall population. Fisher's exact test is the most commonly used test for this purpose, together with the Bonferroni multiple test correction. This correction is advisable since Fisher's test is applied to many terms per study group. After that, a Bonferroni-adjusted p-value is obtained for each GO term in which genes of the study group are involved. In our case, study groups correspond to the sets of genes in each bicluster. Depending on the desired confidence level, which determines the adjusted p-value threshold, a bicluster is said to be significantly enriched if there exists at least one GO term to which genes in the bicluster are significantly annotated.

Among all the existing tools for the analysis of gene expression data using GO [62], we have chosen Ontologizer [63] for assessing Evo-Bexpa biclusters due to its novelty (it has been recently updated) and its suitability for validating a great number of biclusters as a batch process.

Results of bicluster biological validation using GO vary depending on the bicluster sizes. In fact, GO terms are organized in levels of the graph according, among other


Figure 9 Number of significant biclusters for different wg values. The ordinate axis represents the number of significant biclusters obtained for each dataset (Yeast, ET, Leukemia, Steminal) and the abscissa axis the wg values (0 to 0.5), the adjusted p-value being 0.05.

issues, to their specificity [64]. Terms in higher levels (nearer to the root of the graph) are considered to be more generic and have a greater number of genes annotated, while terms in lower levels of the graph are more specific and might have only a few genes annotated. For this reason, big sets of genes are more likely to be enriched for more generic GO terms (higher in the graph structure).

In order to check the influence of Evo-Bexpa configuration parameters on the biological validation of the obtained biclusters, Figures 9, 10, 11, 12 and 13 show the number of significant biclusters (ordinate axis) for each of the experiments detailed in section Analysis of Evo-Bexpa Parameter Sensibility on Real DataSets (see Table 1) and for each dataset, where the abscissa axis refers to the specific weight value, and the adjusted p-value has been set to 0.05.
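The Term-for-Term procedure described in this section (one-sided Fisher's exact test per GO term, followed by a Bonferroni correction) can be sketched as follows. This is an illustrative sketch under stated assumptions and not Ontologizer's implementation: the names `fisher_upper_tail` and `enriched_terms` are hypothetical, annotations are assumed to be given as a plain mapping from term to gene set, and the Bonferroni factor is assumed to be the number of terms annotated to at least one gene of the study group.

```python
from math import comb


def fisher_upper_tail(study_hits, study_size, pop_hits, pop_size):
    """One-sided Fisher's exact test for over-representation:
    probability of seeing at least `study_hits` annotated genes in a
    study group of `study_size` genes drawn from a population of
    `pop_size` genes, `pop_hits` of which carry the annotation."""
    upper = min(study_size, pop_hits)
    favourable = sum(
        comb(pop_hits, k) * comb(pop_size - pop_hits, study_size - k)
        for k in range(study_hits, upper + 1)
    )
    return favourable / comb(pop_size, study_size)


def enriched_terms(study_genes, annotations, population, alpha=0.05):
    """Return {term: Bonferroni-adjusted p-value} for the terms that are
    significant at level `alpha` for one study group (one bicluster)."""
    population = set(population)
    study = set(study_genes) & population
    # Assumption: only terms annotated to at least one study gene are tested.
    tested = {}
    for term, genes in annotations.items():
        genes = set(genes) & population
        if genes & study:
            tested[term] = genes
    significant = {}
    for term, genes in tested.items():
        p = fisher_upper_tail(
            len(genes & study), len(study), len(genes), len(population)
        )
        adjusted = min(1.0, p * len(tested))  # Bonferroni correction
        if adjusted <= alpha:
            significant[term] = adjusted
    return significant
```

Under this sketch, a bicluster counts as significantly enriched when `enriched_terms` returns a non-empty dictionary for its gene set, which is the quantity counted on the ordinate axis of Figures 9 to 13.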

The main conclusion is that there is no common behaviour shared by the four datasets for each configuration parameter. For example, the number of significant biclusters for the Yeast dataset increases significantly whenever the number of genes (wg) or conditions (wc) is increased, as well as the overall size (ws). Nevertheless, when overlap is penalized more heavily (Figure 12), the number of significant biclusters for the Yeast dataset decreases. This is because bicluster sizes are inversely affected by the overlapping weight (the more restrictive the allowed overlap, the fewer elements the biclusters contain). Figure 13 shows that variations in the variance weight do not significantly affect the number of significant biclusters in the Yeast dataset.

The Steminal dataset presents the second-largest variation in the number of significant biclusters when

Figure 10 Number of significant biclusters for different wc values. The ordinate axis represents the number of significant biclusters obtained for each dataset (Yeast, ET, Leukemia, Steminal) and the abscissa axis the wc values (0 to 1), the adjusted p-value being 0.05.


Figure 11 Number of significant biclusters for different ws values. The ordinate axis represents the number of significant biclusters obtained for each dataset (Yeast, Embryonal, Leukemia, Steminal) and the abscissa axis the ws values (0 to 10), the adjusted p-value being 0.05.

modifying the parameter values. It is worth noting that Steminal is the only dataset for which the number of significant biclusters drops markedly when the mean gene variance is increased. This is related to the fact that when higher gene variances in biclusters are required, the size of the obtained biclusters decreases, as explained in section Analysis of Evo-Bexpa Parameter Sensibility on Real DataSets.

For the Embryonal and Leukemia datasets, the number of significant biclusters is considerably lower than for the other two datasets in all cases. In fact, it rarely exceeds 20%. For both of them there are no large variations when modifying the different parameter values. It is worth remarking that no significant biclusters were found for the Embryonal dataset when wg, wc or ws are set to zero (Figures 9, 10 and 11).

One common issue in hierarchical ontologies is deciding the level of specificity to use in the analysis [65]. On the one hand, GO terms that are too general may overlook significantly represented biological markers because many genes in the background genome are also annotated by the general GO terms. On the other hand, GO terms that are too specific can result in the same problem, since too few genes in the microarray are annotated by these GO terms. In order to study the level of specificity of the terms to which Evo-Bexpa biclusters have been annotated, we have carried out three different validations: taking the whole hierarchical graph into account, and limiting the validation to levels 3 to 6 and 4 to 7, both inclusive.

Table 5 presents the validation results for Evo-Bexpa biclusters using the default configuration. For each type of validation, the number of significant biclusters and the

Figure 12 Number of significant biclusters for different wov values. The ordinate axis represents the number of significant biclusters obtained for each dataset (Yeast, Embryonal, Leukemia, Steminal) and the abscissa axis the wov values (0 to 10), the adjusted p-value being 0.05.


Figure 13 Number of significant biclusters for different wvar values. The ordinate axis represents the number of significant biclusters obtained for each dataset (Yeast, Embryonal, Leukemia, Steminal) and the abscissa axis the wvar values (0 to 0.2), the adjusted p-value being 0.05.

mean number of their significant terms are given, for two different adjusted p-values: 0.01 and 0.05. As can be seen in Table 5, the number of significant biclusters varies only slightly from the validation with the whole graph to the limited validations, meaning that the majority of biclusters obtained by Evo-Bexpa contain genes that are not frequently annotated to terms that are too general or too specific. For the Embryonal Tumours dataset, the number of significant biclusters does not even decrease with the limited validations. The mean number of significant terms to which genes in the biclusters of the previous column have been annotated is also smaller for the hierarchically limited validations. This was an expected result, since terms outside the specified levels have not been taken into account in the study. Nevertheless, we consider the reduction in the number of significant biclusters and terms to be minimal, locating Evo-Bexpa biclusters in the central part of the GO graph.

In general, the validation carried out supports the effectiveness of Evo-Bexpa for biclustering microarray data. In fact, significant biclusters have been obtained for each dataset at the 0.01 and 0.05 levels, provided that configuration parameters are not set to zero. This fact also suggests the appropriateness of the different objectives chosen in this work. Although the number of significant biclusters may vary differently for different datasets when modifying the configuration parameters, Evo-Bexpa significant biclusters correspond to significant terms in the central part of the GO graph. This means Evo-Bexpa succeeds at finding biclusters whose significant terms have an intermediate level of specificity.

Table 5 Validation results with GO hierarchy level limitations

| Dataset | p-value | All levels #Bics | All levels #Terms | Levels 3 to 6 #Bics | Levels 3 to 6 #Terms | Levels 4 to 7 #Bics | Levels 4 to 7 #Terms |
|---|---|---|---|---|---|---|---|
| Yeast | 0.01 | 32 | 5.969 | 32 | 5.125 | 31 | 4.387 |
| Yeast | 0.05 | 41 | 6.878 | 40 | 5.625 | 40 | 5.225 |
| ET | 0.01 | 3 | 3.000 | 3 | 3.000 | 3 | 2.666 |
| ET | 0.05 | 9 | 3.778 | 9 | 3.444 | 9 | 3.333 |
| Leukemia | 0.01 | 4 | 1.000 | 3 | 1.000 | 3 | 1.000 |
| Leukemia | 0.05 | 12 | 2.333 | 11 | 1.454 | 11 | 1.818 |
| Steminal | 0.01 | 18 | 4.056 | 16 | 2.750 | 13 | 2.231 |
| Steminal | 0.05 | 27 | 7.111 | 26 | 5.269 | 25 | 4.200 |

Validation with Gene Ontology has been carried out for Evo-Bexpa biclusters of the four datasets in Table 3 using the default configuration. This table shows the number of significant biclusters and significant terms in three different validations: taking the whole hierarchical graph into account, and limiting the validation to levels 3 to 6 and 4 to 7, both inclusive.
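The level-limited validations summarized in Table 5 require assigning a level to every GO term and discarding the terms outside a given range. A minimal sketch of that filtering step is given below. It rests on assumptions that the text does not fix: the level of a term is taken to be the length of its shortest path from the root (a DAG also admits a longest-path definition), the root is given level 0, and the names `term_levels` and `restrict_to_levels` as well as the child-list representation of the graph are hypothetical.

```python
from collections import deque


def term_levels(children, root):
    """Breadth-first traversal of the ontology graph.

    `children` maps each term to the list of its direct child terms.
    Assumption: level = length of the shortest path from the root,
    with the root itself at level 0."""
    levels = {root: 0}
    queue = deque([root])
    while queue:
        term = queue.popleft()
        for child in children.get(term, ()):
            if child not in levels:  # first visit gives the shortest path
                levels[child] = levels[term] + 1
                queue.append(child)
    return levels


def restrict_to_levels(terms, levels, lowest, highest):
    """Keep only the terms whose level lies in [lowest, highest],
    both inclusive; terms unreachable from the root are dropped."""
    return [t for t in terms if t in levels and lowest <= levels[t] <= highest]
```

With this sketch, the two limited validations would correspond to calling `restrict_to_levels(terms, levels, 3, 6)` and `restrict_to_levels(terms, levels, 4, 7)` before running the enrichment test on the remaining terms.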


Conclusion
In this paper we have presented a new evolutionary algorithm for biclustering of gene expression data named Evo-Bexpa. It offers two main advantages over other existing approaches: the use of an evaluation measure able to detect shifting and scaling patterns (VEt), and the possibility of specifying user preferences on some characteristics of the results (number of genes and conditions, overlapping amount, etc.). This way, if any previous information related to the microarray under study is available, the search can be guided towards the preferred types of biclusters. Furthermore, other objectives can easily be incorporated into the search, and any objective may be ignored by setting its weight to zero. Default values for the configuration parameters are given in order to provide the user with quality results. Moreover, an experimental study has been performed on four real datasets in order to study the sensitivity of the parameters and their influence on the different features. This study concludes with a useful guide on how to customize the algorithm depending on the user preferences.

Experimental results on both synthetic and real datasets confirm the validity of our approach, where the results have been compared to those obtained by five well-known biclustering algorithms. Evo-Bexpa has been shown to outperform ISA, xMotifs, OPSM, BIMAX and CC in synthetic experiments, where match score indexes have been used for comparing the obtained results with the solution. Regarding the experiments on real datasets, Evo-Bexpa results have been biologically validated using different levels in the Gene Ontology hierarchy. This validation shows that significant biclusters obtained by Evo-Bexpa correspond to GO terms that are neither too general nor too specific.
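The idea of configurable objectives, where each preference enters the search through a weight and a weight of zero removes it, can be illustrated with a generic weighted sum. The snippet below is only a sketch of that general idea and not the Evo-Bexpa fitness function, which is defined in section Methods: the function name `weighted_fitness`, the objective names in the usage example and the assumption that lower values are better are all hypothetical.

```python
def weighted_fitness(pattern_quality, objective_terms, weights):
    """Combine a pattern quality value with optional weighted objectives.

    `objective_terms` maps an objective name to its (already normalised)
    penalty value; `weights` maps the same names to user-chosen weights.
    An objective whose weight is zero, or missing, does not contribute.
    Assumption: the combined value is to be minimised."""
    penalty = sum(
        weights.get(name, 0.0) * value for name, value in objective_terms.items()
    )
    return pattern_quality + penalty


# Hypothetical usage: the overlap objective is switched off with a zero weight.
example = weighted_fitness(
    0.05,
    {"size": 0.4, "overlap": 0.2, "gene_variance": 0.1},
    {"size": 1.0, "overlap": 0.0, "gene_variance": 2.0},
)
```

Adding a new objective in such a scheme only requires one more entry in each dictionary, which mirrors the extensibility discussed in this section.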

Endnote
a. Biclustering Analysis Toolbox (BicAT) is a software platform for clustering-based data analysis that integrates various biclustering and clustering techniques under a common graphical user interface [39]. It has been used in this work to compare the performance of Evo-Bexpa with the ISA, xMotifs, OPSM, BIMAX and CC algorithms, and is publicly available at http://www.tik.ee.ethz.ch/sop/bicat/.

Competing interests
The authors declare that they have no competing interests.

Authors' contributions
BP, RG and JSAR conceived the approach. Implementation of the method and performing the computational experiments were done by BP. All authors read and approved the final manuscript.

Acknowledgements
This research has been supported by the Spanish Ministry of Economy and Competitiveness under grant TIN2011-28956-C02.

Author details
1 Department of Computer Languages, University of Seville, Seville, Spain. 2 School of Engineering, Pablo de Olavide University, Seville, Spain.

Received: 19 June 2012 Accepted: 3 January 2013 Published: 23 February 2013

References
1. Lesk A: Introduction to Bioinformatics. Oxford: Oxford University Press; 2008.
2. Watson JD: DNA The Secret of Life. New York: Alfred A. Knopf; 2003.
3. Baldi P, Hatfield GW: DNA Microarrays and Gene Expression: From Experiments to Data Analysis and Modeling. Cambridge: Cambridge University Press; 2002.
4. Golub T, Slonim D, Tamayo P, Huard C, Gaasenbeek M, Mesirov J, Coller H, Loh M, Downing J, Caligiuri M, Bloomfield C, Lander E: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999, 286:531–537.
5. Ben-Dor A, Bruhn L, Friedman N, Nachman I, Schummer M, Yakhini Z: Tissue classification with gene expression profiles. J Comput Biol 2000, 7(3–4):559–583.
6. Asyali MH, Colak D, Demirkaya O, Inan MS: Gene expression profile classification: a review. Curr Bioinformatics 2006, 1:55–73.
7. Schachtner R, Lutter D, Knollmüller P, Tomé AM, Theis FJ, Schmitz G, Stetter M, Vilda PG, Lang EW: Knowledge-based gene expression classification via matrix factorization. Bioinformatics 2008, 24:1688–1697.
8. Buness A, Ruschhaupt M, Kuner R, Tresch A: Classification across gene expression microarray studies. BMC Bioinformatics 2009, 10:453.
9. Jiang D, Tang C, Zhang A: Cluster analysis for gene expression data: a survey. IEEE Trans Knowl Data Eng 2004, 16(11):1370–1386.
