a comparative study of clustering for gene expression data in bioinformatics

Welcome to my presentation on

A Comparative Study of Clustering for Gene Expression Data in Bioinformatics


Roll: 08054746 Reg: 1484

Department of Statistics, Rajshahi University

Rajshahi-6205

Md. Bipul Hossen, Dept. of Statistics, University of Rajshahi


Outline

1. Why choose a clustering technique?
2. Some Objectives
3. Methods and materials
4. Results and Discussions
5. Conclusion



1. Why Choose a Clustering Technique?

Cluster analysis is routinely run as a first step in microarray data analysis to summarize the data and group genes. Gene expression data are typically very noisy and contain a mixture of expression patterns, with both down-regulated and up-regulated genes. We therefore present a comparative study of four clustering algorithms and two proximity measures applied to the commonly used iris data, simulated data and six real cancer gene expression data sets.


Bioinformatics Lab, Dept. of Statistics, University of Rajshahi

2. Some Objectives

• Find significant clusters according to the similarities, intensities and regulation of their objects.

• Compare several hierarchical clustering (HC) methods with K-means based on two proximity measures.

• Assess the quality and reliability of the clustering using the Calinski-Harabasz (CH) and Davies-Bouldin (DB) indices.



Methods

1. Single Linkage or Nearest Neighbor Method
2. Complete Linkage or Furthest Neighbor Method
3. Average Linkage Method
4. K-means clustering
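As a rough illustration of how the three linkage methods interact with the two proximity measures, the sketch below clusters four made-up expression profiles with SciPy. The data are hypothetical, and the Pearson-based proximity is assumed here to be the correlation distance 1 - r (SciPy's "correlation" metric); the slides do not state the exact form that was used.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical profiles: rows are genes, columns are samples
X = np.array([
    [1.0, 2.0, 3.0, 4.0],      # rising pattern, low intensity
    [10.0, 21.0, 29.0, 40.0],  # rising pattern, high intensity
    [4.0, 3.0, 2.0, 1.0],      # falling pattern, low intensity
    [40.0, 29.0, 21.0, 10.0],  # falling pattern, high intensity
])

results = {}
for metric in ("euclidean", "correlation"):  # "correlation" = 1 - Pearson r (assumed)
    d = pdist(X, metric=metric)              # condensed pairwise distance vector
    for method in ("single", "complete", "average"):
        Z = linkage(d, method=method)        # agglomerative merge tree
        # Cut the tree into two flat clusters
        results[(metric, method)] = fcluster(Z, t=2, criterion="maxclust")

for key, labels in results.items():
    print(key, labels)
```

On this toy input, Euclidean distance groups the profiles by intensity, while the correlation distance groups them by the direction of regulation, which is why the choice of proximity measure matters for expression data.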



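Davies–Bouldin (DB) Index

The Davies–Bouldin index is a metric for evaluating clustering algorithms (Davies and Bouldin, 1979). It is an internal evaluation scheme that measures cluster separation; lower values indicate a better clustering.

$$DB = \frac{1}{k}\sum_{i=1}^{k}\max_{j \neq i} R_{i,j}, \qquad R_{i,j} = \frac{D_i + D_j}{d_{i,j}}, \qquad D_i = \frac{1}{n_i}\sum_{x \in C_i}\lVert x - c_i \rVert$$

where k is the number of clusters, C_i is the i-th cluster with n_i objects and centroid c_i, and d_{i,j} is the distance between the centroids c_i and c_j.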
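A minimal NumPy sketch of the DB index follows. It assumes the usual definition, in which D_i is the mean Euclidean distance from the points of cluster i to its centroid, R_{i,j} = (D_i + D_j) / d_{i,j}, and d_{i,j} is the distance between centroids. The function name and the toy data are illustrative only.

```python
import numpy as np

def davies_bouldin(X, labels):
    """Davies-Bouldin index: mean over clusters of the largest ratio R_ij."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    ids = np.unique(labels)
    if len(ids) < 2:
        raise ValueError("need at least two clusters")
    # Centroid c_i and mean distance-to-centroid D_i for each cluster
    centroids = np.array([X[labels == c].mean(axis=0) for c in ids])
    scatter = np.array([
        np.linalg.norm(X[labels == c] - centroids[i], axis=1).mean()
        for i, c in enumerate(ids)
    ])
    # d_ij: Euclidean distance between centroids
    d = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # R_ii becomes 0, so it never wins the max
    R = (scatter[:, None] + scatter[None, :]) / d
    return R.max(axis=1).mean()

# Toy data (hypothetical): two pairs of points, 10 units apart
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
db = davies_bouldin(X, labels)
print(db)  # each D_i is 1 and d_12 is 10, so DB = (1 + 1) / 10 = 0.2
```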


Calinski-Harabasz (CH) Index

• The Calinski-Harabasz index (Calinski and Harabasz, 1974; Olatz et al., 2012) obtained the best results in the work of Milligan and Cooper (1985). It is a ratio-type index in which cohesion is estimated from the distances between the points in a cluster and its centroid, and separation from the distances between the cluster centroids and the global centroid. The index is used here to estimate the number of clusters from an observations-by-variables matrix.

• The Calinski-Harabasz criterion is sometimes called the variance ratio criterion (VRC). The Calinski-Harabasz index is defined as

$$CH = \frac{SSB/(k-1)}{SSW/(N-k)}$$

where SSB is the overall between-cluster variance, SSW is the overall within-cluster variance, k is the number of clusters, and N is the number of observations.
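The variance ratio can be sketched in a few lines of NumPy. This assumes the standard form CH = (SSB / (k - 1)) / (SSW / (N - k)), with SSB and SSW computed as sums of squared Euclidean distances; the function name and the toy data are illustrative only.

```python
import numpy as np

def calinski_harabasz(X, labels):
    """Variance ratio criterion: (SSB / (k - 1)) / (SSW / (N - k))."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    ids = np.unique(labels)
    N, k = len(X), len(ids)
    if not 1 < k < N:
        raise ValueError("need 1 < k < N")
    overall = X.mean(axis=0)  # global centroid
    ssb = 0.0                 # between-cluster sum of squares
    ssw = 0.0                 # within-cluster sum of squares
    for c in ids:
        members = X[labels == c]
        centroid = members.mean(axis=0)
        ssb += len(members) * np.sum((centroid - overall) ** 2)
        ssw += np.sum((members - centroid) ** 2)
    return (ssb / (k - 1)) / (ssw / (N - k))

# Toy data (hypothetical): two pairs of points, 10 units apart
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
ch = calinski_harabasz(X, labels)
print(ch)  # SSB = 100, SSW = 4, k = 2, N = 4, so CH = (100 / 1) / (4 / 2) = 50
```

Unlike the DB index, a larger CH value indicates a better partition.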



Data sets

| Dataset | Chip | Tissue | n | #C | Dist. Classes | m | d |
|---|---|---|---|---|---|---|---|
| Armstrong-V2 [2] | Affy | Blood | 72 | 3 | 24,20,28 | 12582 | 2194 |
| Bhattacharjee [3] | Affy | Lung | 203 | 5 | 139,17,6,21,20 | 12600 | 1543 |
| Nutt-V1 [6] | Affy | Brain | 50 | 4 | 14,7,14,15 | 12625 | 1377 |
| Alizadeh-V2 [1] | cDNA | Blood | 62 | 3 | 42,9,11 | 4022 | 2093 |
| Garber [4] | cDNA | Lung | 66 | 4 | 17,40,4,5 | 24192 | 4553 |
| Liang [5] | cDNA | Brain | 37 | 3 | 28,6,3 | 24192 | 1411 |



In this example, the objects g1, g2, g3, g4, g5, g6, g7, g8, g9 and g10 have been clustered. The places at the bottom of the tree, where the object names are written, are called leaves. The junctions are called nodes. A hierarchical clustering algorithm can be used to find groups in the data by cutting the tree at a certain height. For instance, the example can be read as two groups, (g2, g3, g1, g8) and (g6, g10, g5, g7, g4, g9); as three groups, (g2, g3, g1, g8), (g6, g10) and (g5, g7, g4, g9); or as ten groups, each containing a single leaf.

Fig: Dendrogram with cuts marked for 2 clusters and 3 clusters
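Cutting a tree into a chosen number of groups can be sketched with SciPy as follows. The ten profiles are simulated purely for illustration (three groups of sizes 4, 2 and 4, with the last two groups closer to each other) and are not the data behind the dendrogram described above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Hypothetical data: 10 genes x 6 samples, group means 0, 6 and 8
group_means = [0.0] * 4 + [6.0] * 2 + [8.0] * 4
X = np.array([rng.normal(m, 0.2, size=6) for m in group_means])

# Build the tree once, then cut it at different levels
Z = linkage(X, method="complete", metric="euclidean")
two_groups = fcluster(Z, t=2, criterion="maxclust")
three_groups = fcluster(Z, t=3, criterion="maxclust")

print(two_groups)    # the two nearby groups share one label
print(three_groups)  # each simulated group gets its own label
```

The same tree serves every cut, so the number of groups can be chosen after the clustering has been run, which is not the case for K-means.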

Hierarchical Clustering of Simulated Data

The green dendrogram shows the best result, and the heat map is drawn using this method; that is, complete-linkage HC with Euclidean distance gives a better result than the other methods.

Fig: Heat map
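One common way to build such a heat map is to reorder the rows of the expression matrix by the leaf order of the dendrogram, so that similar profiles sit next to each other. The sketch below assumes that approach and uses a small simulated matrix; it only computes the row order and leaves the plotting itself out.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

rng = np.random.default_rng(1)
# Hypothetical matrix: 6 genes x 5 samples, low and high profiles interleaved
X = np.array([rng.normal(0.0 if i % 2 == 0 else 4.0, 0.3, size=5)
              for i in range(6)])

Z = linkage(X, method="complete", metric="euclidean")
row_order = leaves_list(Z)   # leaf indices, left to right in the dendrogram
X_ordered = X[row_order]     # rows rearranged for display as a heat map

print(row_order)
```

After reordering, the low-expression rows and the high-expression rows form two contiguous blocks, which is what makes the cluster structure visible in the image.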



K-means of Simulated Data

Table: Davies-Bouldin index

| No. of Clusters | K=2 | K=3 | K=4 | K=5 |
|---|---|---|---|---|
| Cluster Size | 20,40 | 20,20,20 | 12,20,8,20 | 4,4,12,20,20 |
| DB index | 0.897 | 0.321 | 0.797 | 0.825 |

From the table above, the DB index takes its lowest value when the number of clusters is k=3. We may therefore conclude that three clusters are present in this data set.
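The selection rule used here (run K-means for several values of k and keep the k with the lowest DB index) can be sketched in plain NumPy. Everything below is an assumption made for illustration: the simulated data (three groups of 20 points), the simple Lloyd iteration with farthest-point seeding, and the helper names. It is not the procedure that produced the table.

```python
import numpy as np

def kmeans(X, k, n_iter=100):
    """Plain Lloyd's algorithm with deterministic farthest-point seeding."""
    centers = [X[0]]
    for _ in range(1, k):
        # Distance from every point to its nearest already-chosen center
        dist = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = np.argmin(
            np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
        # Recompute centroids; an empty cluster keeps its previous center
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels

def db_index(X, labels):
    """Davies-Bouldin index (lower is better)."""
    ids = np.unique(labels)
    cents = np.array([X[labels == c].mean(axis=0) for c in ids])
    scat = np.array([np.linalg.norm(X[labels == c] - cents[i], axis=1).mean()
                     for i, c in enumerate(ids)])
    d = np.linalg.norm(cents[:, None] - cents[None], axis=2)
    np.fill_diagonal(d, np.inf)
    return ((scat[:, None] + scat[None]) / d).max(axis=1).mean()

rng = np.random.default_rng(0)
# Hypothetical simulated data: three well-separated groups of 20 points
means = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 8.66]])
X = np.vstack([rng.normal(m, 0.5, size=(20, 2)) for m in means])

scores = {k: db_index(X, kmeans(X, k)) for k in range(2, 6)}
best_k = min(scores, key=scores.get)  # k with the lowest DB index
print(scores, best_k)
```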


HC of Armstrong-V2 Data (d)

Class sizes: 24, 20, 28

Several HC of Nutt-V1 Data (c)

Class sizes: 14, 7, 14, 15

K-means of Alizadeh-V2 Data

| No. of Clusters | K=2 | K=3 | K=4 | K=5 |
|---|---|---|---|---|
| Cluster Size | 44,18 | 41,10,11 | 11,20,17,14 | 22,9,3,10,18 |
| DB index | 2.708 | 1.774 | 2.477 | 2.328 |

The table shows that the DB index takes its lowest value when the number of clusters is k=3. The cluster sizes are 41, 10 and 11, while the actual class sizes are 42, 9 and 11. We may therefore conclude that three clusters are present in the Alizadeh-V2 data.


K-means of Liang Data

| No. of Clusters | K=2 | K=3 | K=4 | K=5 |
|---|---|---|---|---|
| Cluster Size | 29,8 | 8,26,3 | 6,9,3,19 | 1,2,19,14,1 |
| DB index | 1.231 | 1.124 | 2.091 | 1.215 |

Table 4.3 shows that the DB index takes its lowest value when the number of clusters is k=3. The cluster sizes are 8, 26 and 3, while the actual class sizes are 6, 28 and 3. We may therefore conclude that three clusters are present in the Liang data.


Several HC of Liang Data (c, d, e, f)

Class sizes: 28, 6, 3

Fig: Heat map of Liang Data



Comparison of HC with K-means for Affymetrix data sets

| Dataset | Distance Method | Cluster Method | Calinski-Harabasz (CH) |
|---|---|---|---|
| Armstrong-V2 | Euclidean | Single | 1.889 |
| | Euclidean | Complete | 11.803 |
| | Euclidean | Average | 6.674 |
| | Pearson | Single | 0.914 |
| | Pearson | Average | 10.393 |
| | | K-means | 11.943 |
| Bhattacharjee | Euclidean | Single | 1.786 |
| | Pearson | Complete | 26.512 |
| | | K-means | 22.924 |
| Nutt-V1 | Euclidean | Single | 3.167 |
| | | K-means | 6.051 |



Comparison of HC with K-means for Affymetrix data sets by visualization technique

Fig: Mean of the CH index for Affy Chip (CH index, 0 to 20, by clustering method: Single, Average, Complete, K-means; series: Pearson, Euclidean)

From the graph above, complete linkage with Euclidean distance achieves a CH index of 18.14, which is larger than the CH index of single linkage, average linkage and K-means with their respective proximity measures. We may therefore conclude that the complete linkage method gives the best result for the Affymetrix data sets.


Comparison of HC with K-means for cDNA data sets

| Dataset | Distance Method | Cluster Method | Calinski-Harabasz (CH) |
|---|---|---|---|
| Alizadeh-V2 | Euclidean | Single | 2.047 |
| | Euclidean | Complete | 11.161 |
| | Euclidean | Average | 11.068 |
| | Pearson | Single | 0.980 |
| | Pearson | Complete | 11.229 |
| | Pearson | Average | 10.319 |
| | | K-means | 13.003 |
| Garber | Euclidean | Single | 2.772 |
| | Euclidean | Average | 5.166 |
| | Pearson | Single | 0.855 |
| | Pearson | Complete | 7.693 |
| | Pearson | Average | 18.912 |
| | | K-means | 9.269 |
| Liang | Euclidean | Single | 9.057 |
| | Euclidean | Complete | 19.665 |
| | Euclidean | Average | 10.279 |
| | Pearson | Single | 19.665 |
| | Pearson | Complete | 19.665 |
| | Pearson | Average | 19.665 |
| | | K-means | 23.781 |



Comparison of HC with K-means for cDNA data sets by visualization technique

Fig: Mean of the CH index for cDNA Chip (CH index, 0 to 18, by clustering method: Single, Complete, Average, K-means; series: Euclidean, Pearson)

From the graph above, K-means achieves a CH index of 17.01, which is larger than the CH index of single, complete and average linkage with their respective proximity measures. We may therefore conclude that the K-means method gives the best result for the cDNA data sets.



Conclusions

Our results show that complete linkage with Euclidean distance performed best for the Affymetrix data sets. For the cDNA data sets, K-means clustering performed best in terms of recovering the true structure of the data. To the best of our knowledge, comparative studies of several HC methods and K-means using the CH and DB validity indices are poorly documented in the literature.


Future Research Interest

1. Compare hierarchical clustering methods with the Self-Organizing Maps method and other recently developed clustering methods.

2. Investigate the performance of the different hierarchical clustering methods against other existing methods using the false discovery rate (FDR), misclassification error rate (MER), receiver operating characteristic (ROC) curve and area under the ROC curve, with resampling techniques.

3. Compare supervised and unsupervised methods for gene expression data.

Thank you

References

[1] Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JC, Sabet H, Tran T, Yu X, Powell JI, Yang L, Marti GE, Moore T, Hudson J, Lu L, Lewis DB, Tibshirani R, Sherlock G, Chan WC, Greiner TC, Weisenburger DD, Armitage JO, Warnke R, Levy R, Wilson W, Grever MR, Byrd JC, Botstein D, Brown PO, Staudt LM (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature. 403:503-511.

[2] Armstrong SA, Staunton JE, Silverman LB, Pieters R, den Boer ML, Minden MD, Sallan SE, Lander ES, Golub TR, Korsmeyer SJ (2002). MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nat Genet. 30:41-47.

[3] Bhattacharjee A, Richards WG, Staunton J, Li C, Monti S, Vasa P, Ladd C, Beheshti J, Bueno R, Gillette M, Loda M, Weber G, Mark EJ, Lander ES, Wong W, Johnson BE, Golub TR, Sugarbaker DJ, Meyerson M (2001). Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc Natl Acad Sci USA. 98(24):13790-13795.

[4] Garber ME, Troyanskaya OG, Schluens K, Petersen S, Thaesler Z, Pacyna-Gengelbach M, Rijn M van de, Rosen GD, Perou CM, Whyte RI, Altman RB, Brown PO, Botstein D, Petersen I (2001). Diversity of gene expression in adenocarcinoma of the lung. Proc Natl Acad Sci USA. 98(24):13784-13789.

[5] Liang Y, Diehn M, Watson N, Bollen AW, Aldape KD, Nicholas MK, Lamborn KR, Berger MS, Botstein D, Brown PO, Israel MA (2005). Gene expression profiling reveals molecularly and clinically distinct subtypes of glioblastoma multiforme. Proc Natl Acad Sci USA. 102(16):5814-5819.

[6] Nutt CL, Mani DR, Betensky RA, Tamayo P, Cairncross JG, Ladd C, Pohl U, Hartmann C, McLaughlin ME, Batchelor TT, Black PM, von Deimling A, Pomeroy SL, Golub TR, Louis DN (2003). Gene expression-based classification of malignant gliomas correlates better with survival than histological classification. Cancer Res. 63(7):1602-1607.
