An Improved Heuristic for Consistent Biclustering Problems

Artyom Nahapetyan, Stanislav Busygin, and Panos Pardalos

Center for Applied Optimization, University of Florida
[email protected], [email protected], [email protected]

1 Introduction

Let matrix A represent a data set of m features and n samples. Each element of the matrix, $a_{ij}$, corresponds to the expression of the i-th feature in the j-th sample. Biclustering is a simultaneous classification of the samples and the features into k classes; in other words, we need to classify both the columns and the rows of the matrix A. Doing so, let $S_1, S_2, \dots, S_k$ and $F_1, F_2, \dots, F_k$ denote the classes of the samples (columns) and features (rows), respectively. Formally, biclustering can be defined as follows.

Definition 1. A biclustering is a collection of pairs of sample and feature subsets $B = \{(S_1, F_1), (S_2, F_2), \dots, (S_k, F_k)\}$ such that

$$S_1, S_2, \dots, S_k \subseteq \{a_j\}_{j=1,\dots,n}, \quad \bigcup_{r=1}^{k} S_r = \{a_j\}_{j=1,\dots,n}, \quad S_\zeta \cap S_\xi = \emptyset,\ \zeta \neq \xi,$$

$$F_1, F_2, \dots, F_k \subseteq \{a_i\}_{i=1,\dots,m}, \quad \bigcup_{r=1}^{k} F_r = \{a_i\}_{i=1,\dots,m}, \quad F_\zeta \cap F_\xi = \emptyset,\ \zeta \neq \xi,$$

where $\{a_j\}_{j=1,\dots,n}$ and $\{a_i\}_{i=1,\dots,m}$ denote the sets of columns and rows of the matrix A, respectively.


By reordering the columns and rows of the matrix according to their classifications, the corresponding biclustering can be visualized using the Heatmap Builder software [HeatMap], where the color of a pixel is chosen according to the corresponding value of $a_{ij}$. Among all possible classifications, our ultimate goal in a biclustering problem is to find one in which samples from the same class have similar values for the features that characterize the class. The visualization of a reasonable classification should reveal a block-diagonal or "checkerboard" pattern similar to the one in Figure 1.

Fig. 1. An example of biclustering: “checkerboard” pattern.

One of the early algorithms for obtaining an appropriate biclustering was proposed by Hartigan [H72] and is known as Block Clustering. Given a biclustering B, the author employs the variability of the data in the block $(S_r, F_r)$ to measure the quality of the classification; lower variability is preferable. However, to avoid the trivial zero-variability solution in which each class consists of only one sample, the number of classes must be fixed. A more sophisticated approach to biclustering was introduced by Cheng and Church [CC00], where the authors minimize the mean squared residual. They prove that the problem is NP-hard and propose a greedy algorithm to find an approximate solution. A simulated annealing technique for the problem is discussed by Bryan et al. [BCB05].

Dhillon [D01] discusses another biclustering method, aimed at text mining, based on a bipartite graph. In the graph the nodes represent features and samples, and each feature i is connected to each sample j by a link (i, j) with weight $a_{ij}$. The total weight of the links connecting features and samples from different classes measures the quality of a biclustering; a lower value corresponds to a better biclustering. A similar method for microarray data is suggested by Kluger et al. [KBCG03].

Another way to tackle the problem is to treat the input data as a joint probability distribution between two discrete sets of random variables (see Dhillon et al. [DMM03]). The goal of the method is to find disjoint classes for both variables. A Bayesian biclustering technique based on Gibbs sampling can be found in Sheng et al. [SMM03].

Recently, Busygin et al. [BPP05] introduced the concept of consistent biclustering. Formally speaking, a biclustering B is consistent if, for each sample (feature) from any set $S_r$ (set $F_r$), the average expression of the features (samples) that belong to the same class r is greater than the average expression of the features (samples) from the other classes. It has been shown that consistent biclustering implies cone separability of samples and features. The mathematical formulation of the problem is a fractional 0-1 program. To solve the supervised biclustering problem, the authors introduce additional variables to linearize the problem and propose an iterative heuristic procedure in which each iteration requires solving a smaller mixed integer program. In this chapter we discuss an improved heuristic procedure in which each iteration solves a continuous linear program. Numerical experiments on the same data confirm that our algorithm outperforms the previous result in the quality of the obtained solution as well as in computational time.

In Section 2 we provide a brief discussion of consistent biclustering; for details we refer to the paper by Busygin et al. [BPP05]. Section 3 introduces the application of the technique to the supervised biclustering problem. The heuristic algorithm and the numerical experiments are described in Sections 4 and 5, respectively. Finally, Section 6 concludes the chapter.

2 Consistent Biclustering

Given a classification of the samples $S_r$, let $S = (s_{jr})_{n \times k}$ denote a 0-1 matrix where $s_{jr} = 1$ if sample j is classified as a member of class r, i.e., $a_j \in S_r$, and $s_{jr} = 0$ otherwise. Similarly, given a classification of the features $F_r$, let $F = (f_{ir})_{m \times k}$ denote a 0-1 matrix where $f_{ir} = 1$ if feature i belongs to class r, i.e., $a_i \in F_r$, and $f_{ir} = 0$ otherwise. Using these matrices, we construct the corresponding centroids for the samples and features; a sketch of the 0-1 encoding follows.
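For concreteness, S and F can be encoded from class-label vectors in one line of NumPy each; the labels below are hypothetical and the variable names are ours:

```python
import numpy as np

k = 2
sample_labels = np.array([0, 0, 1, 1, 1])   # hypothetical classes of n = 5 samples
feature_labels = np.array([0, 1, 1, 0])     # hypothetical classes of m = 4 features

S = np.eye(k)[sample_labels]   # (s_jr): s_jr = 1 iff sample j is in class r
F = np.eye(k)[feature_labels]  # (f_ir): f_ir = 1 iff feature i is in class r
```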


$$C_S = A S (S^T S)^{-1} = (c^S_{i\xi})_{m \times k}, \qquad (1)$$

$$C_F = A^T F (F^T F)^{-1} = (c^F_{j\xi})_{n \times k}. \qquad (2)$$

The elements $c^S_{i\xi}$ and $c^F_{j\xi}$ represent the average expression of feature i over the samples of class ξ and of sample j over the features of class ξ, respectively. In particular,

$$c^S_{i\xi} = \frac{\sum_{j=1}^{n} a_{ij} s_{j\xi}}{\sum_{j=1}^{n} s_{j\xi}} = \frac{\sum_{j \mid a_j \in S_\xi} a_{ij}}{|S_\xi|},$$

and

$$c^F_{j\xi} = \frac{\sum_{i=1}^{m} a_{ij} f_{i\xi}}{\sum_{i=1}^{m} f_{i\xi}} = \frac{\sum_{i \mid a_i \in F_\xi} a_{ij}}{|F_\xi|}.$$
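Assuming every class is nonempty (so the inverses exist), formulas (1) and (2) translate directly into NumPy; this is a minimal sketch and the function name is ours:

```python
import numpy as np

def compute_centroids(A, S, F):
    """Centroid matrices C_S and C_F from formulas (1) and (2).

    A: (m, n) data matrix (features x samples)
    S: (n, k) 0-1 sample-to-class matrix
    F: (m, k) 0-1 feature-to-class matrix
    """
    # S^T S and F^T F are diagonal, holding the class sizes |S_xi| and
    # |F_xi|; the inverse therefore just divides each column by the size.
    C_S = A @ S @ np.linalg.inv(S.T @ S)    # (m, k): c^S_{i,xi}
    C_F = A.T @ F @ np.linalg.inv(F.T @ F)  # (n, k): c^F_{j,xi}
    return C_S, C_F
```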

Consider the matrix $C_S$. Using its elements, one can assign each feature to the class in which it is most expressed: assign feature i to class r if $c^S_{ir} = \max_\xi \{c^S_{i\xi}\}$, i.e.,

$$a_i \in \hat{F}_r \implies c^S_{ir} > c^S_{i\xi}, \quad \forall \xi,\ \xi \neq r. \qquad (3)$$

Note that the constructed classification of the features, $\hat{F}_r$, need not coincide with the classification $F_r$. Similarly, one can use the elements of the matrix $C_F$ to classify the samples: assign sample j to class r if $c^F_{jr} = \max_\xi \{c^F_{j\xi}\}$, i.e.,

$$a_j \in \hat{S}_r \implies c^F_{jr} > c^F_{j\xi}, \quad \forall \xi,\ \xi \neq r. \qquad (4)$$

As before, the obtained classification $\hat{S}_r$ need not coincide with the classification $S_r$.

Definition 2. We refer to a biclustering B as a consistent biclustering if relations (3) and (4) hold for all elements of the corresponding classes, where the matrices $C_S$ and $C_F$ are defined according to (1) and (2), respectively.
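Definition 2 can be tested directly by recomputing the centroids and checking the strict inequalities (3) and (4) row by row; a minimal sketch, with a function name of our own:

```python
import numpy as np

def is_consistent(A, S, F):
    """Test Definition 2: relations (3) and (4) for the biclustering (S, F)."""
    C_S = A @ S @ np.linalg.inv(S.T @ S)    # c^S_{i,xi}
    C_F = A.T @ F @ np.linalg.inv(F.T @ F)  # c^F_{j,xi}
    feat_class = F.argmax(axis=1)           # r with a_i in F_r
    samp_class = S.argmax(axis=1)           # r with a_j in S_r
    rows_f = np.arange(C_S.shape[0])
    rows_s = np.arange(C_F.shape[0])
    # Strict inequality: the own-class centroid must be the unique maximum.
    rival_f = C_S.copy()
    rival_f[rows_f, feat_class] = -np.inf
    rival_s = C_F.copy()
    rival_s[rows_s, samp_class] = -np.inf
    ok3 = np.all(C_S[rows_f, feat_class] > rival_f.max(axis=1))  # relation (3)
    ok4 = np.all(C_F[rows_s, samp_class] > rival_s.max(axis=1))  # relation (4)
    return ok3 and ok4
```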

Theorem 1. Let B be a consistent biclustering. Then there exist convex cones $P_1, P_2, \dots, P_k \subseteq \mathbb{R}^m$ such that only samples from $S_r$ belong to the corresponding cone $P_r$, $r = 1, \dots, k$. Similarly, there exist convex cones $Q_1, Q_2, \dots, Q_k \subseteq \mathbb{R}^n$ such that only features from class $F_r$ belong to the corresponding cone $Q_r$, $r = 1, \dots, k$.

Proof. For the proof, see [BPP05].

According to the definition, a biclustering is consistent if $\hat{F}_r = F_r$ and $\hat{S}_r = S_r$. Theorem 1 shows that a consistent biclustering implies separability by cones. Despite these nice properties, for a given data set it might be impossible to construct a consistent biclustering. The latter is due to the fact that the data set may include features and/or samples that do not evidently belong to any of the classes. However, one can delete some of the features and/or samples from the data set so that there is a consistent biclustering for the truncated data set. Our ultimate goal is to include in the truncated data as many features and samples as possible.

Another problem of interest is to choose the most representative subset of samples and features. For instance, assume that there is a consistent biclustering for a given data set, and that there is a feature i such that the difference between the two largest values of $c^S_{i\xi}$ is negligible, i.e.,

$$\min_{\xi \neq r} \{c^S_{ir} - c^S_{i\xi}\} \leq \alpha,$$

where α is a small positive number. Although this particular feature is classified as a member of class r, i.e., $a_i \in F_r$, the corresponding relation (3) is easily violated by adding a slightly different sample to the data set. In other words, if α is relatively small, then it is not statistically evident that $a_i \in F_r$, and feature i cannot be reliably used to classify the samples. The problem of choosing the most representative features and samples is important when performing feature tests and collecting a large number of samples are expensive and time consuming. Before we proceed to the formulation of the problem, let us define the notions of additive and multiplicative consistent biclustering, which are stronger than consistent biclustering.

Instead of (3) and (4), consider the relations

$$a_i \in F_r \implies c^S_{ir} > \alpha^S_i + c^S_{i\xi}, \quad \forall \xi,\ \xi \neq r, \qquad (5)$$

and

$$a_j \in S_r \implies c^F_{jr} > \alpha^F_j + c^F_{j\xi}, \quad \forall \xi,\ \xi \neq r, \qquad (6)$$

respectively, where $\alpha^F_j > 0$ and $\alpha^S_i > 0$. Let α denote the vector of all $\alpha^F_j$ and $\alpha^S_i$.

Definition 3. A biclustering B is called an additive consistent biclustering with parameter α, or α-consistent biclustering, if relations (5) and (6) hold for all elements of the corresponding classes, where the matrices $C_S$ and $C_F$ are defined according to (1) and (2), respectively.

Similarly, instead of (3) and (4), consider the relations

$$a_i \in F_r \implies c^S_{ir} > \beta^S_i \, c^S_{i\xi}, \quad \forall \xi,\ \xi \neq r, \qquad (7)$$

and

$$a_j \in S_r \implies c^F_{jr} > \beta^F_j \, c^F_{j\xi}, \quad \forall \xi,\ \xi \neq r, \qquad (8)$$

respectively, where $\beta^F_j > 1$ and $\beta^S_i > 1$. Let β denote the vector of all $\beta^F_j$ and $\beta^S_i$.

Definition 4. A biclustering B is called a multiplicative consistent biclustering with parameter β, or β-consistent biclustering, if relations (7) and (8) hold for all elements of the corresponding classes, where the matrices $C_S$ and $C_F$ are defined according to (1) and (2), respectively.


It is easy to show that an α-consistent biclustering is a consistent biclustering for all values of $c^S_{i\xi}$ and $c^F_{j\xi}$. A β-consistent biclustering is a consistent biclustering if $c^S_{i\xi} \geq 0$ and $c^F_{j\xi} \geq 0$; the latter usually holds in DNA microarray problems.
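The additive and multiplicative conditions change only the inequality being tested. Below is a sketch of the feature-side tests (5) and (7), with names of our own choosing; alpha_S and beta_S may be scalars or per-feature vectors, and the sample-side tests (6) and (8) are symmetric:

```python
import numpy as np

def holds_alpha(C_S, feat_class, alpha_S):
    """Relation (5): c^S_{ir} > alpha^S_i + c^S_{i,xi} for all xi != r."""
    rows = np.arange(C_S.shape[0])
    own = C_S[rows, feat_class]
    rival = C_S.copy()
    rival[rows, feat_class] = -np.inf
    return np.all(own > alpha_S + rival.max(axis=1))

def holds_beta(C_S, feat_class, beta_S):
    """Relation (7): c^S_{ir} > beta^S_i * c^S_{i,xi}; sensible when C_S >= 0."""
    rows = np.arange(C_S.shape[0])
    own = C_S[rows, feat_class]
    rival = C_S.copy()
    rival[rows, feat_class] = -np.inf
    return np.all(own > beta_S * rival.max(axis=1))
```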

Using the above definitions, we can formulate two problems of choosing the most representative subsets of features and samples. In the first, we delete the least number of features and/or samples from a data set so that there exists an α-consistent biclustering for the truncated data set. In the second, we achieve a β-consistent biclustering by deleting the least number of features and/or samples. In these two problems, the vectors α and β play the role of thresholds for choosing features and samples. However, large values of α and β can be very restrictive; as a result, some valuable features and samples might be excluded from the truncated data set. The optimal values of the parameters α and β should be tuned through experiments with the data.

3 Supervised Biclustering

In real-life problems there is usually a set of data for which the classification is known. For instance, if some patients are diagnosed with acute lymphoblastic leukemia (ALL) or acute myeloid leukemia (AML), then their microarray data can be classified as ALL or AML. In supervised biclustering we assume that there is a training data set, i.e., a set of samples for which the classification is known and accurate. Using the training data, one can classify the features as described in Section 2 and formulate the consistent, α-consistent, and β-consistent biclustering problems. Solutions of these problems can then be used to classify additional samples. The values of the vectors α and β can be adjusted to obtain a more compact set of representative features as well as to reduce the number of misclassifications in the data.

Given a set of training data, construct the matrix S and compute the values of $c^S_{i\xi}$ using formula (1). Classify the features according to the following rule: feature i belongs to class r, i.e., $a_i \in F_r$, if $c^S_{ir} > c^S_{i\xi}$, $\forall \xi \neq r$. Finally, construct the matrix F using the obtained classification (a sketch of this step follows). Let $x_i$ denote a binary variable that takes value one if feature i is included in the computations and zero otherwise.
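The rule is one argmax per feature; a minimal sketch under the same naming assumptions as before:

```python
import numpy as np

def classify_features(A, S):
    """Build F by the rule above: a_i in F_r iff c^S_{ir} is maximal."""
    C_S = A @ S @ np.linalg.inv(S.T @ S)   # c^S_{i,xi} from formula (1)
    r = C_S.argmax(axis=1)                 # most-expressed class per feature
    F = np.zeros_like(C_S)
    F[np.arange(A.shape[0]), r] = 1.0
    return F
```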

Consider the following optimization problems.

CB:

$$\max_x \sum_{i=1}^{m} x_i \qquad (9)$$

$$\frac{\sum_{i=1}^{m} a_{ij} f_{ir} x_i}{\sum_{i=1}^{m} f_{ir} x_i} > \frac{\sum_{i=1}^{m} a_{ij} f_{i\xi} x_i}{\sum_{i=1}^{m} f_{i\xi} x_i}, \quad \forall r, \xi \in \{1, \dots, k\},\ r \neq \xi,\ j \in S_r, \qquad (10)$$

$$x_i \in \{0, 1\}, \quad \forall i \in \{1, \dots, m\}. \qquad (11)$$

α-CB:

$$\max_x \sum_{i=1}^{m} x_i$$

$$\frac{\sum_{i=1}^{m} a_{ij} f_{ir} x_i}{\sum_{i=1}^{m} f_{ir} x_i} > \alpha_j + \frac{\sum_{i=1}^{m} a_{ij} f_{i\xi} x_i}{\sum_{i=1}^{m} f_{i\xi} x_i}, \quad \forall r, \xi \in \{1, \dots, k\},\ r \neq \xi,\ j \in S_r,$$

$$x_i \in \{0, 1\}, \quad \forall i \in \{1, \dots, m\}.$$

β-CB:

$$\max_x \sum_{i=1}^{m} x_i$$

$$\frac{\sum_{i=1}^{m} a_{ij} f_{ir} x_i}{\sum_{i=1}^{m} f_{ir} x_i} > \beta_j \, \frac{\sum_{i=1}^{m} a_{ij} f_{i\xi} x_i}{\sum_{i=1}^{m} f_{i\xi} x_i}, \quad \forall r, \xi \in \{1, \dots, k\},\ r \neq \xi,\ j \in S_r,$$

$$x_i \in \{0, 1\}, \quad \forall i \in \{1, \dots, m\}.$$

In the CB problem we look for the largest set of features that can be used to construct a consistent biclustering. The α-CB and β-CB problems are similar to the CB problem; the only difference is that the selected set of features has to allow constructing an α-consistent or β-consistent biclustering, respectively.

4 Heuristic Algorithm

All three optimization problems above are fractional 0-1 programming problems, and finding an optimal solution is difficult. In the paper by Busygin et al. [BPP05] the authors consider the β-CB problem and introduce a linearization of it. However, commercial mixed integer programming (MIP) solvers are not able to solve the linearization due to the excessive number of variables and constraints. The authors instead introduce an iterative heuristic procedure in which each iteration requires solving a linear 0-1 problem of smaller size. In this section we discuss a heuristic procedure that iteratively solves continuous linear problems. Because the same algorithm applies to all three problems, our discussion focuses on the CB problem.

Observe that in these problems the expression $\sum_{i=1}^{m} f_{i\xi} x_i$ is the cardinality of the set of class-ξ features in the truncated data. In particular, if $x_i = 1$, $\forall i \in \{1, \dots, m\}$ such that $f_{i\xi} = 1$, then it equals the cardinality of $F_\xi$. Given a vector x, let $F_\xi(x)$ denote the truncated set of features, i.e., $F_\xi(x) \subseteq F_\xi$ such that a feature is included in $F_\xi(x)$ only if $x_i = 1$.


If the optimal cardinalities of the sets $F_\xi(x)$ were known, they could be fixed at their optimal values, and the problem would reduce to a linear one. The heuristic procedure exploits this property and iteratively solves a series of linear programs, updating the cardinalities according to the currently available solution.

In the first step of the algorithm (see Procedure 1), we set $x^0_i = 1$, $\forall i \in \{1, \dots, m\}$, $F_\xi(x^0) = F_\xi$, $\forall \xi \in \{1, \dots, k\}$, and $p = 0$. In the second step, we solve the following linear program, obtained from the CB problem by fixing the cardinalities of the feature sets at the values $|F_\xi(x^p)|$ and relaxing the integrality of the variables $x_i$:

$$\max_x \sum_{i=1}^{m} x_i \qquad (12)$$

$$\frac{\sum_{i=1}^{m} a_{ij} f_{ir} x_i}{|F_r(x^p)|} \geq \frac{\sum_{i=1}^{m} a_{ij} f_{i\xi} x_i}{|F_\xi(x^p)|}, \quad \forall r, \xi \in \{1, \dots, k\},\ r \neq \xi,\ j \in S_r, \qquad (13)$$

$$x_i \in [0, 1], \quad \forall i \in \{1, \dots, m\}. \qquad (14)$$

Let $p \leftarrow p + 1$ and let $x^p$ denote the solution of the problem. According to $x^p$, construct the sets $F_\xi(x^p)$, where a feature is included in the set only if $x^p_i = 1$, i.e., $F_\xi(x^p) \subseteq F_\xi$ such that $x^p_i = 1$. If $\exists \xi \in \{1, \dots, k\}$ such that $F_\xi(x^p) \neq F_\xi(x^{p-1})$, then go to Step 2 and solve problem (12)-(14) with the updated cardinalities. On the other hand, if $F_\xi(x^p) = F_\xi(x^{p-1})$, $\forall \xi \in \{1, \dots, k\}$, then we check whether $x^*_i = \lfloor x^p_i \rfloor$ is feasible to constraint (13). If it is, we stop and return the vector $x^*$. If it is not, we conclude that the variables $x^p_i$ with fractional values cannot take value one, i.e., the corresponding features cannot be included in the truncated feature set. We then permanently delete those features from the data set and continue the process.

Observe that the solution $x^*$ is feasible to the CB problem. In particular, each $x^*_i$ takes value one or zero. Because by construction the sets $F_\xi(x^p)$ include only the features with $x^*_i = 1$, feasibility with respect to inequality (13) implies feasibility with respect to inequality (10). The strict inequality in (10) holds in practical problems for the following reasons.

Procedure 1:
Step 1: Let $x^0_i = 1$, $\forall i \in \{1, \dots, m\}$, $F_\xi(x^0) = F_\xi$, $\forall \xi \in \{1, \dots, k\}$, and $p = 0$.
Step 2: Solve problem (12)-(14). Let $p \leftarrow p + 1$ and let $x^p$ denote the solution.
Step 3: Construct the sets of features $F_\xi(x^p) \subseteq F_\xi$ such that $x^p_i = 1$.
Step 4: If $F_\xi(x^p) \neq F_\xi(x^{p-1})$ for some ξ, go to Step 2.
Step 5: Set $x^*_i \leftarrow \lfloor x^p_i \rfloor$. If $x^*$ is feasible to constraint (13), stop. Otherwise, permanently delete from further consideration all features with fractional values of $x^p_i$ and go to Step 2.
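For prototyping, Procedure 1 can be driven by any LP solver. The sketch below poses (12)-(14) via scipy.optimize.linprog, assumes each feature class always retains at least one selected feature, and uses a numeric tolerance to decide when $x_i = 1$; all helper names are ours, not the paper's.

```python
import numpy as np
from scipy.optimize import linprog

def cb_heuristic(A, S, F, tol=1e-6, max_iter=100):
    """Sketch of Procedure 1 for the CB problem.

    A: (m, n) data, S: (n, k) 0-1 sample classes,
    F: (m, k) 0-1 feature classes built from the training data.
    Returns a 0-1 vector x* of kept features, or None if no convergence.
    """
    m, n = A.shape
    k = S.shape[1]
    samp_class = S.argmax(axis=1)
    active = np.ones(m, dtype=bool)               # Step 5 deletions
    member = F.astype(bool)                       # F_xi(x^0) = F_xi (Step 1)
    for _ in range(max_iter):
        card = member.sum(axis=0).astype(float)   # |F_xi(x^p)|
        # Constraint (13) in <= 0 form: for each j in S_r and xi != r,
        # -sum_i a_ij (f_ir/|F_r| - f_ixi/|F_xi|) x_i <= 0.
        rows = []
        for j in range(n):
            r = samp_class[j]
            for xi in range(k):
                if xi != r:
                    rows.append(-A[:, j] * (F[:, r] / card[r] - F[:, xi] / card[xi]))
        A_ub = np.array(rows)
        bounds = [(0.0, 1.0) if a else (0.0, 0.0) for a in active]  # deleted -> 0
        res = linprog(-np.ones(m), A_ub=A_ub, b_ub=np.zeros(len(rows)),
                      bounds=bounds, method="highs")
        if not res.success:
            return None
        x = res.x
        ones = x >= 1.0 - tol                     # Step 3: features at value one
        new_member = F.astype(bool) & ones[:, None]
        if not np.array_equal(new_member, member):
            member = new_member                   # Step 4: cardinalities changed
            continue
        x_star = ones.astype(float)               # Step 5: x* = floor(x^p)
        if np.all(A_ub @ x_star <= tol):          # feasible to (13)?
            return x_star
        frac = (x > tol) & ~ones                  # fractional variables
        active &= ~frac                           # delete them permanently
    return None
```

A call such as cb_heuristic(A, S, classify_features(A, S)) then returns the selected features; the α-CB and β-CB variants differ only in the constraint rows.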


Table 1. Computational results on ALL and AML samples: the CB and α-CB problems with different values of α.

                     CB   10-CB  20-CB  30-CB  40-CB  50-CB  60-CB  70-CB  130-CB
Number of Features  7024   7021   7018   7014   7010   6959   6989   6960    4639
Number of Errors       2      2      2      2      1      1      1      1       1
CPU (sec)           1.66   1.91   2.08   2.21   2.84  90.52  32.79  24.35    6.91

Recall that the objective (12) maximizes the number of features included in the truncated data set, so it is to our benefit to have the values of the variables $x_i$ as close to one as possible. However, because of inequality (13), some variables take fractional values at optimality; as a result, some of the constraints (13) are tight at optimality. If $x^* = \lfloor x^p \rfloor$ is feasible to (13), then it is highly unlikely that those constraints remain tight.

5 Numerical Experiments

In the computational experiments we consider a well-known data set consisting of samples from patients diagnosed with acute lymphoblastic leukemia (ALL) or acute myeloid leukemia (AML) (see [GST99]). This data set is used in the computations by Busygin et al. [BPP05] as well as by other researchers (see, e.g., [BBNSY00], [BFY01], [WMCPPV01], [XK01]). As in the numerical experiments in [BPP05], we divide the data set into two groups, where the first group is used as a training data set and the second one, the test data set, is used to verify the quality of the obtained classification. The training data set consists of 38 samples, of which 27 are ALL and 11 are AML samples. The test data set consists of 20 ALL and 14 AML samples. Each sample consists of 7070 features.

We run our heuristic algorithm to solve the CB as well as the α-CB and β-CB problems with different values of the parameters α and β. Although the parameters $\alpha_j$ and $\beta_j$ can take different values for different samples, in our experiments we assume that they are all equal. In all cases, we obtain a "checkerboard" pattern similar to the one in Figure 2.

Table 2. Computational results on ALL and AML samples: the CB and β-CB problems with different values of β.

                     CB  1.05-CB  1.1-CB  1.2-CB  1.5-CB   2-CB   3-CB   5-CB   7-CB
Number of Features  7024    7017    7010    6937    6508   5905   5458   5173   5055
Number of Errors       2       2       1       1       1      1      1      2      3
CPU (sec)           1.66    1.68    1.70   37.55   28.45  17.67   6.39   6.44   4.73


Fig. 2. "Checkerboard" pattern for ALL and AML samples.

Table 1 shows the results for additive consistent biclustering with different values of the parameter α. The first row of the table gives the maximum number of features in the truncated data that allow constructing the corresponding biclustering. Using the obtained set of features, we classify the samples from the second group of data, and the second row gives the number of misclassifications. Finally, the last row reports the CPU time of the algorithm. As the table shows, a higher value of the parameter α classifies the samples better: for values of 40 and higher, only one error is detected in the classification of the test data. In addition, observe that the number of selected features decreases as the parameter increases. These two observations lead to the conclusion that fewer but more representative features can be used to classify the data. The highest value of the parameter α for which we are able to obtain an α-consistent biclustering is 130. As for the CPU time, it varies with the value of the parameter α but remains within reasonable limits.

A similar result can be obtained using β-consistent biclustering (see Table 2). In particular, a higher value of β provides a better classification.


Table 3. Computational results on the HuGE data set: the CB and α-CB problems with different values of α.

Tissue type   #Samples    CB  α = 10  α = 30  α = 50  α = 70
Blood               1    472     472     472     472     472
Brain              11    615     615     615     615     615
Breast              2    903     903     903     903     903
Colon               1    367     366     363     360     355
Cervix              1    155     155     155     155     155
Endometrium         2    226     225     222     218     211
Esophagus           1    281     280     277     274     272
Kidney              6    159     159     159     159     159
Liver               6    440     440     440     440     440
Lung                6    102     102     102     102     102
Muscle              6    533     533     533     533     532
Myometrium          2    162     161     159     156     153
Ovary               2    257     255     251     246     240
Placenta            2    519     519     519     519     519
Prostate            4    281     281     281     281     281
Spleen              1    438     438     438     438     438
Stomach             1    447     447     447     447     447
Testes              1    522     521     520     518     515
Vulva               3    187     187     187     187     187
Total              59   7066    7059    7043    7023    6996

However, values of β greater than or equal to 5 are too restrictive and worsen the quality of the classification. Note that in the paper by Busygin et al. [BPP05] the heuristic algorithm proposed by the authors converges after 15 minutes and is able to select 6681 features for the parameter $\beta_j = 1.1$, $\forall j \in \{1, \dots, n\}$. Using the same parameter, our algorithm outperforms the previous result by selecting 7010 features within 1.68 seconds of CPU time.

Despite the small number of deleted features (in the case of the CB problem, 46 features are deleted), consistent biclustering is crucial to obtaining a good classification of the features. In particular, if one classifies all features using formula (3) and tests the classification on the second group of data, the number of misclassifications is usually larger. In the case of the AML and ALL samples, the number of misclassifications we obtain with this technique is 19 (practically all ALL samples from the test set are classified as AML).

In addition to the ALL and AML samples, we test our algorithm on the Human Gene Expression (HuGE) Index. The samples are collected from healthy tissues of different parts of the human body. The main purpose of the classification is to identify the features that are highly expressed in a particular tissue. Table 3 shows the computational results of the CB and α-CB problems for different values of α. It is interesting to observe that in most of the tissues, e.g., Blood, Brain, and Breast, the number of selected features does not change for different values of α. On the other hand, some tissues, e.g., Ovary, are more "sensitive" to changes of the parameter. Table 4 presents the results for multiplicative consistent biclustering. Although in these problems the set of "sensitive" tissues is larger than in the α-CB problems, some tissues, namely Cervix, Kidney, Placenta, Prostate, Spleen, and Stomach, preserve the same number of selected features. The last column of the table gives the data from the paper by Busygin et al. [BPP05], where the authors solve the multiplicative consistent biclustering problem with parameter $\beta_j = 1.1$, $\forall j \in \{1, \dots, n\}$. Observe that for the same value of the parameter, our algorithm finds 162 more features.

Table 4. Computational results on the HuGE data set: the CB and β-CB problems with different values of β. The last column gives the β-CB results of [BPP05] for β = 1.1.

Tissue type   #Samples    CB  β = 1.1  β = 1.5  β = 2  [BPP05], β = 1.1
Blood               1    472      472      472    467               472
Brain              11    615      615      615    610               614
Breast              2    903      903      903    900               902
Colon               1    367      365      354    348               367
Cervix              1    155      155      155    155               107
Endometrium         2    226      224      212    190               225
Esophagus           1    281      278      269    259               289
Kidney              6    159      159      159    159               159
Liver               6    440      440      440    421               440
Lung                6    102      102      102    101               102
Muscle              6    533      533      533    515               532
Myometrium          2    162      160      153    142               163
Ovary               2    257      253      241    225               272
Placenta            2    519      519      519    519               514
Prostate            4    281      281      281    281               174
Spleen              1    438      438      438    438               417
Stomach             1    447      447      447    447               442
Testes              1    522      520      513    506               512
Vulva               3    187      187      187    182               186
Total              59   7066     7051     6993   6865              6889

6 Concluding Remarks

In this chapter we have discussed the concept of consistent biclustering introduced by Busygin et al. [BPP05]. For the supervised biclustering case, the additive and multiplicative variations of the problem are introduced to further analyze the possibilities of choosing the most representative set of features. The heuristic algorithm presented in this chapter computes the truncated data set. Unlike the algorithm presented in [BPP05], which requires solving a sequence of integer programs, our approach iteratively solves continuous linear problems. Computational results on the same data set confirm that our heuristic algorithm outperforms the previous result in the quality of the solution as well as in computational time. Although the heuristic algorithm converges to a solution for most values of the parameters α and β, for some values it does not; in the latter cases, however, the algorithm converges after a small perturbation of the values of α and β.


References

[BBNSY00] Ben-Dor A., Bruhn L., Nachman I., Schummer M., Yakhini Z.: Tissue Classification with Gene Expression Profiles. Journal of Computational Biology, 7, 559–584 (2000)

[BFY01] Ben-Dor A., Friedman N., Yakhini Z.: Class Discovery in Gene Expression Data. Proceedings of the Fifth Annual International Conference on Computational Molecular Biology (2001)

[BCB05] Bryan K., Cunningham P., Bolshakova N.: Biclustering of Expression Data Using Simulated Annealing. Proceedings of the 18th IEEE Symposium on Computer-Based Medical Systems, 383–388 (2005)

[BPP05] Busygin S., Prokopyev O., Pardalos P.: Feature Selection for Consistent Biclustering via Fractional 0-1 Programming. Journal of Combinatorial Optimization, 10, 7–21 (2005)

[CC00] Cheng Y., Church G. M.: Biclustering of Expression Data. Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, 93–103 (2000)

[D01] Dhillon I. S.: Co-Clustering Documents and Words Using Bipartite Spectral Graph Partitioning. Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), San Francisco, 26–29 (2001)

[DMM03] Dhillon I. S., Mallela S., Modha D. S.: Information-Theoretic Co-clustering. Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 89–98 (2003)

[GST99] Golub T. R., Slonim D. K., Tamayo P., Huard C., Gaasenbeek M., Mesirov J. P., Coller H., Loh M. L., Downing J. R., Caligiuri M. A., Bloomfield C. D., Lander E. S.: Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science, 286, 531–537 (1999)

[H72] Hartigan J. A.: Direct Clustering of a Data Matrix. Journal of the American Statistical Association, 67, 123–129 (1972)

[HeatMap] HeatMap Builder Software, Quertermous Laboratory, Stanford University, http://quertermous.stanford.edu/heatmap.htm

[KBCG03] Kluger Y., Basri R., Chang J. T., Gerstein M.: Spectral Biclustering of Microarray Data: Coclustering Genes and Conditions. Genome Research, 703–716 (2003)

[SMM03] Sheng Q., Moreau Y., De Moor B.: Biclustering Microarray Data by Gibbs Sampling. Bioinformatics, 19, ii196–ii205 (2003)

[WMCPPV01] Weston J., Mukherjee S., Chapelle O., Pontil M., Poggio T., Vapnik V.: Feature Selection for SVMs. NIPS (2001)

[XK01] Xing E. P., Karp R. M.: CLIFF: Clustering of High-Dimensional Microarray Data Via Iterative Feature Filtering Using Normalized Cuts. Bioinformatics Discovery Note, 1, 1–9 (2001)