a comparative study of clustering for gene expression data in bioinformatics


Upload: rajshai-unicersity

Post on 21-May-2015


Category:

Health & Medicine



DESCRIPTION

This presentation is essential for every researcher.

TRANSCRIPT

Page 1: A comparative study of Clustering for Gene expression data in Bioinformatics

Welcome to my presentation on

A Comparative Study of Clustering for Gene Expression Data in Bioinformatics

Roll: 08054746 Reg: 1484

Department of Statistics, Rajshahi University

Rajshahi-6205

Md. Bipul Hossen, Dept. of Statistics, University of Rajshahi

Page 2: A comparative study of Clustering for Gene expression data in Bioinformatics

Outline
1. Why choose a clustering technique?
2. Some objectives
3. Methods and materials
4. Results and discussion
5. Conclusion

Page 3: A comparative study of Clustering for Gene expression data in Bioinformatics

1. Why Choose a Clustering Technique?
Cluster analysis programs are routinely run as a first step of data summary and gene grouping in microarray data analysis. Gene expression data are typically very noisy and contain a mixture of expression patterns, with both down-regulated and up-regulated genes. We therefore present a comparative study of four clustering algorithms and two proximity measures applied to the commonly used iris data, simulated data and six real cancer gene expression data sets.

Page 4: A comparative study of Clustering for Gene expression data in Bioinformatics

Bioinformatics Lab, Dept. of Statistics, University of Rajshahi

2. Some Objectives
• Find significant clusters according to similarities, intensities and regulation among their objects.
• Compare several hierarchical clustering (HC) methods with K-means based on two proximity measures.
• Assess the quality and reliability of clustering using the Calinski-Harabasz (CH) and Davies-Bouldin (DB) indices.
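The two proximity measures are not defined on this slide; the results slides name them as Euclidean and Pearson. The following is a minimal sketch in pure Python, assuming the Pearson measure is used as the distance 1 - r (a common convention that the slides do not confirm). The function names are illustrative only.

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_distance(x, y):
    """1 - Pearson correlation (assumed form of the Pearson proximity)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)  # undefined if a profile is constant

g1 = [1.0, 2.0, 3.0, 4.0]
g2 = [10.0, 20.0, 30.0, 40.0]  # same shape as g1, ten times the intensity

d_euclid = euclidean_distance(g1, g2)
d_pearson = pearson_distance(g1, g2)
```

The two toy profiles are far apart in Euclidean terms but have a Pearson distance of essentially zero, which is why the choice of proximity measure can change which genes are grouped together.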

Page 5: A comparative study of Clustering for Gene expression data in Bioinformatics

Methods
1. Single Linkage or Nearest Neighbor Method
2. Complete Linkage or Furthest Neighbor Method
3. Average Linkage Method
4. K-means clustering
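The three hierarchical methods differ only in how the distance between two clusters is computed from the pairwise distances of their members. The sketch below is a naive pure-Python illustration under that reading, not the implementation used in the study: `agglomerative`, `LINKAGE` and the toy points are hypothetical names and data, and the cubic-time loop is only suitable for small examples.

```python
import math

def euclid(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# How each method turns member-to-member distances into a cluster distance
LINKAGE = {
    "single": min,                             # nearest neighbor
    "complete": max,                           # furthest neighbor
    "average": lambda ds: sum(ds) / len(ds),   # mean pairwise distance
}

def agglomerative(points, k, linkage="complete", dist=euclid):
    """Merge the two closest clusters until k clusters remain.

    Returns a list of clusters, each a sorted list of point indices.
    """
    combine = LINKAGE[linkage]
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ds = [dist(points[i], points[j])
                      for i in clusters[a] for j in clusters[b]]
                d = combine(ds)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters

toy = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (9.0, 0.0)]
result = agglomerative(toy, k=3, linkage="average")
```

On well separated toy data all three linkages return the same grouping; they tend to diverge on noisy data, which is the situation the comparison in this study addresses.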

Page 6: A comparative study of Clustering for Gene expression data in Bioinformatics

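Davies–Bouldin (DB) Index
The Davies–Bouldin index is a metric for evaluating clustering algorithms (Davies and Bouldin, 1979). It is an internal evaluation scheme and a cluster separation measure; lower values indicate more compact and better separated clusters.

$$ DB = \frac{1}{k} \sum_{i=1}^{k} D_i, \qquad D_i = \max_{j \neq i} R_{i,j}, \qquad R_{i,j} = \frac{S_i + S_j}{M_{i,j}} $$

where, in the standard definition of the index, k is the number of clusters, S_i is the average distance from the points of cluster i to its centroid, and M_{i,j} is the distance between the centroids of clusters i and j.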
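A minimal pure-Python sketch of the index, assuming the standard Davies–Bouldin definition with Euclidean distance: S_i is the mean distance from the members of cluster i to their centroid, M_ij is the distance between centroids, R_ij = (S_i + S_j) / M_ij, D_i is the largest R_ij over j ≠ i, and DB is the mean of the D_i. The function names and toy clusters are illustrative.

```python
import math

def centroid(points):
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def euclid(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def davies_bouldin(clusters):
    """clusters: list of clusters, each a list of coordinate tuples."""
    cents = [centroid(c) for c in clusters]
    # S_i: mean distance of the members of cluster i to its centroid
    scatter = [sum(euclid(p, m) for p in c) / len(c)
               for c, m in zip(clusters, cents)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # D_i: worst-case ratio R_ij = (S_i + S_j) / M_ij over all j != i
        total += max((scatter[i] + scatter[j]) / euclid(cents[i], cents[j])
                     for j in range(k) if j != i)
    return total / k

tight = [[(0, 0), (0, 1)], [(10, 0), (10, 1)]]   # compact, far apart
loose = [[(0, 0), (0, 4)], [(3, 0), (3, 4)]]     # spread out, close together

db_tight = davies_bouldin(tight)
db_loose = davies_bouldin(loose)
```

The compact, well separated partition receives the smaller value, matching the "lower is better" reading used when the number of clusters is chosen from DB tables.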

Page 7: A comparative study of Clustering for Gene expression data in Bioinformatics

Calinski-Harabasz (CH) Index
• The Calinski-Harabasz index (Calinski and Harabasz, 1974; Olatz et al., 2012) obtained the best results in the work of Milligan and Cooper (1985). It is a ratio-type index in which cohesion is estimated from the distances between the points in a cluster and its centroid, and separation from the distances between the cluster centroids and the global centroid. Here the index is used to estimate the number of clusters from an observations-by-variables matrix. The method is described as follows.

• The Calinski-Harabasz criterion is sometimes called the variance ratio criterion (VRC). The Calinski-Harabasz index is defined as

$$ CH = \frac{SSB}{SSW} \times \frac{N - k}{k - 1} $$

where SSB is the overall between-cluster variance, SSW is the overall within-cluster variance, k is the number of clusters, and N is the number of observations.
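A minimal pure-Python sketch of the variance ratio criterion, CH = (SSB / SSW) × (N - k) / (k - 1). It assumes the usual convention in which SSB is the size-weighted sum of squared distances from the cluster centroids to the global centroid and SSW is the sum of squared distances from the points to their own cluster centroid. The function names and toy clusters are illustrative.

```python
def centroid(points):
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def calinski_harabasz(clusters):
    """clusters: list of clusters, each a list of coordinate tuples."""
    all_points = [p for c in clusters for p in c]
    n, k = len(all_points), len(clusters)
    g = centroid(all_points)                  # global centroid
    cents = [centroid(c) for c in clusters]   # cluster centroids
    # Separation: centroids versus the global centroid, weighted by cluster size
    ssb = sum(len(c) * sq_dist(m, g) for c, m in zip(clusters, cents))
    # Cohesion: points versus their own cluster centroid
    ssw = sum(sq_dist(p, m) for c, m in zip(clusters, cents) for p in c)
    return (ssb / ssw) * (n - k) / (k - 1)

tight = [[(0, 0), (0, 1)], [(10, 0), (10, 1)]]
loose = [[(0, 0), (0, 4)], [(3, 0), (3, 4)]]

ch_tight = calinski_harabasz(tight)
ch_loose = calinski_harabasz(loose)
```

Unlike the DB index, larger CH values indicate a better partition, so the compact toy clustering scores higher.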

Page 8: A comparative study of Clustering for Gene expression data in Bioinformatics

Data sets

Dataset            Chip  Tissue  n    #C  Dist. Classes   m      d
Armstrong-V2 [2]   Affy  Blood   72   3   24,20,28        12582  2194
Bhattacharjee [3]  Affy  Lung    203  5   139,17,6,21,20  12600  1543
Nutt-V1 [6]        Affy  Brain   50   4   14,7,14,15      12625  1377
Alizadeh-V2 [1]    cDNA  Blood   62   3   42,9,11         4022   2093
Garber [4]         cDNA  Lung    66   4   17,40,4,5       24192  4553
Liang [5]          cDNA  Brain   37   3   28,6,3          24192  1411

Page 9: A comparative study of Clustering for Gene expression data in Bioinformatics

In this example, the objects g1, g2, g3, g4, g5, g6, g7, g8, g9 and g10 have been clustered. The places at the bottom of the tree, where the object names are written, are called leaves. The junctions are called nodes. A hierarchical clustering algorithm can be used to find groups in the data by cutting the tree at a certain height. For instance, the example might be considered to contain two groups, (g2, g3, g1, g8) and (g6, g10, g5, g7, g4, g9); three groups, (g2, g3, g1, g8), (g6, g10) and (g5, g7, g4, g9); or ten groups, each containing only one leaf.

Fig: Dendrogram with cut levels marked "3 Clusters" and "2 Clusters"
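Cutting a tree at a given height amounts to keeping only the merges that happen at or below that height. The sketch below illustrates this with a small union-find structure; the five-leaf merge list is hypothetical (it is not the g1 to g10 tree from the slide), and `cut_tree` is an illustrative name rather than a library function.

```python
def cut_tree(n_leaves, merges, height):
    """Group leaves by applying only the merges at or below `height`.

    merges: list of (merge_height, leaf_a, leaf_b), where leaf_a and leaf_b
    are any leaves taken from the two branches joined at that node.
    """
    parent = list(range(n_leaves))

    def find(i):
        # Follow parent links to the representative leaf of i's group
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for h, a, b in merges:
        if h <= height:
            parent[find(a)] = find(b)

    groups = {}
    for leaf in range(n_leaves):
        groups.setdefault(find(leaf), []).append(leaf)
    return sorted(groups.values())

# Hypothetical tree on leaves 0..4
merges = [(1.0, 0, 1), (1.5, 2, 3), (3.0, 3, 4), (6.0, 1, 4)]

two_groups = cut_tree(5, merges, height=4.0)
three_groups = cut_tree(5, merges, height=2.0)
all_leaves = cut_tree(5, merges, height=0.5)
```

Lowering the cut height from 4.0 to 2.0 splits one branch and gives three groups instead of two, and a cut below the first merge leaves every leaf in its own group.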

Page 10: A comparative study of Clustering for Gene expression data in Bioinformatics

Hierarchical Clustering of Simulated Data

The green dendrogram shows the best result, and we draw a heat map using this method; that is, complete-linkage HC with Euclidean distance gives a better result than the other methods.

Fig: Heat map

Page 11: A comparative study of Clustering for Gene expression data in Bioinformatics

K-means of Simulated Data

Table: Davies-Bouldin index

No. of Clusters  K=2    K=3       K=4         K=5
Cluster Size     20,40  20,20,20  12,20,8,20  4,4,12,20,20
DB index         0.897  0.321     0.797       0.825

From the table above we see that the DB index takes its lowest value when the number of clusters is k=3. We may therefore conclude that three clusters are present in this data set.
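For reference, K-means can be sketched as Lloyd's algorithm: assign each point to its nearest center, move each center to the mean of its members, and repeat until the assignments stop changing. The version below is a minimal pure-Python illustration, not the code used in the study. Initial centers are passed explicitly to keep the run deterministic (an assumption of this sketch; random restarts are common in practice), and the toy points are hypothetical.

```python
def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def kmeans(points, centers, max_iter=100):
    """Lloyd's algorithm; returns (labels, centers)."""
    centers = [list(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every point
        new_labels = [min(range(len(centers)),
                          key=lambda j: sq_dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each center to the mean of its members
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

toy = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(toy, centers=[toy[0], toy[3]])
sizes = [labels.count(j) for j in range(2)]
```

Running the routine for several values of k and scoring each partition with a validity index is the procedure that the cluster-size and DB rows of such a table summarize.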

Page 12: A comparative study of Clustering for Gene expression data in Bioinformatics

Fig: HC of Armstrong-V2 Data (d)

Classes: 24,20,28

Page 13: A comparative study of Clustering for Gene expression data in Bioinformatics

Fig: Several HC, Nutt-V1 Data (c)

Classes: 14,7,14,15

Page 14: A comparative study of Clustering for Gene expression data in Bioinformatics

K-means of Alizadeh-V2 Data

Table: Davies-Bouldin index

No. of Clusters  K=2    K=3       K=4          K=5
Cluster Size     44,18  41,10,11  11,20,17,14  22,9,3,10,18
DB index         2.708  1.774     2.477        2.328

The table shows that the DB index takes its lowest value when the number of clusters is k=3. The cluster sizes are 41, 10 and 11, while the actual class sizes are 42, 9 and 11. We may therefore conclude that three clusters are present in the Alizadeh-V2 data.

Page 15: A comparative study of Clustering for Gene expression data in Bioinformatics

K-means of Liang Data

Table: Davies-Bouldin index

No. of Clusters  K=2    K=3     K=4       K=5
Cluster Size     29,8   8,26,3  6,9,3,19  1,2,19,14,1
DB index         1.231  1.124   2.091     1.215

Table 4.3 shows that the DB index takes its lowest value when the number of clusters is k=3. The cluster sizes are 8, 26 and 3, while the actual class sizes are 6, 28 and 3. We may therefore conclude that three clusters are present in the Liang data.

Page 16: A comparative study of Clustering for Gene expression data in Bioinformatics

Fig: Several HC, Liang Data (c, d, e, f)

Classes: 28,6,3

Page 17: A comparative study of Clustering for Gene expression data in Bioinformatics

Fig: Heat map of Liang Data

Page 18: A comparative study of Clustering for Gene expression data in Bioinformatics

Comparison of HC with K-means for Affymetrix data sets

Dataset        Distance Method  Cluster Method  Calinski-Harabasz (CH)
Armstrong-V2   Euclidean        Single          1.889
               Euclidean        Complete        11.803
               Euclidean        Average         6.674
               Pearson          Single          0.914
               Pearson          Complete        12.559
               Pearson          Average         10.393
               -                K-means         11.943
Bhattacharjee  Euclidean        Single          1.786
               Euclidean        Complete        34.702
               Euclidean        Average         26.850
               Pearson          Single          1.700
               Pearson          Complete        26.512
               Pearson          Average         12.902
               -                K-means         22.924
Nutt-V1        Euclidean        Single          3.167
               Euclidean        Complete        7.938
               Euclidean        Average         5.269
               Pearson          Single          0.941
               Pearson          Complete        4.273
               Pearson          Average         2.987
               -                K-means         6.051

Page 19: A comparative study of Clustering for Gene expression data in Bioinformatics

Comparison of HC with K-means for Affymetrix data sets by visualization technique

Fig: Mean of the CH index for Affy Chip (x-axis: Single, Average, Complete, K-Means; y-axis: CH index, 0 to 20; series: Pearson, Euclidean)

From the graph we see that complete linkage with Euclidean distance achieves a mean CH index of 18.14, which is larger than the values for single linkage, average linkage and K-means under either proximity measure. We may therefore conclude that the complete linkage method gives the best result for the Affymetrix data sets.
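The mean plotted for each method is the average of its CH values over the three Affymetrix data sets. The sketch below redoes that arithmetic from the CH table for the Affymetrix data sets. It assumes that the values printed apart from the main rows (Pearson Complete 12.559, Euclidean Complete 34.702, Euclidean Complete 7.938) belong to Armstrong-V2, Bhattacharjee and Nutt-V1 in that order, which is an inference from the table layout rather than something the slides state.

```python
import math

# CH values per method as (Armstrong-V2, Bhattacharjee, Nutt-V1).
# Assumption: separately printed "Complete" values are assigned in order of appearance.
ch = {
    ("Euclidean", "Single"): (1.889, 1.786, 3.167),
    ("Euclidean", "Complete"): (11.803, 34.702, 7.938),
    ("Euclidean", "Average"): (6.674, 26.850, 5.269),
    ("Pearson", "Single"): (0.914, 1.700, 0.941),
    ("Pearson", "Complete"): (12.559, 26.512, 4.273),
    ("Pearson", "Average"): (10.393, 12.902, 2.987),
    ("-", "K-means"): (11.943, 22.924, 6.051),
}

# Mean CH index of each method over the three data sets
means = {method: sum(values) / len(values) for method, values in ch.items()}

best = max(means, key=means.get)
best_mean_truncated = math.floor(means[best] * 100) / 100
```

Under that assignment the largest mean belongs to complete linkage with Euclidean distance, and truncating it to two decimals gives the 18.14 quoted for the graph.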

Page 20: A comparative study of Clustering for Gene expression data in Bioinformatics

Comparison of HC with K-means for cDNA data sets

Dataset      Distance Method  Cluster Method  Calinski-Harabasz (CH)
Alizadeh-V2  Euclidean        Single          2.047
             Euclidean        Complete        11.161
             Euclidean        Average         11.068
             Pearson          Single          0.980
             Pearson          Complete        11.229
             Pearson          Average         10.319
             -                K-means         13.003
Garber       Euclidean        Single          2.772
             Euclidean        Complete        19.097
             Euclidean        Average         5.166
             Pearson          Single          0.855
             Pearson          Complete        7.693
             Pearson          Average         18.912
             -                K-means         9.269
Liang        Euclidean        Single          9.057
             Euclidean        Complete        19.665
             Euclidean        Average         10.279
             Pearson          Single          19.665
             Pearson          Complete        19.665
             Pearson          Average         19.665
             -                K-means         23.781

Page 21: A comparative study of Clustering for Gene expression data in Bioinformatics

Comparison of HC with K-means for cDNA data sets by visualization technique

Fig: Mean of the CH index for cDNA Chip (x-axis: Single, Complete, Average, K-Means; y-axis: CH index, 0 to 18; series: Euclidean, Pearson, Series3)

From the graph we see that K-means achieves a CH index of 17.01, which is larger than the values for single, complete and average linkage under either proximity measure. We may therefore conclude that the K-means method gives the best result for the cDNA data sets.

Page 22: A comparative study of Clustering for Gene expression data in Bioinformatics

Conclusions
Our results reveal that complete linkage with Euclidean distance exhibited the best performance for the Affymetrix data sets. For the cDNA data sets, K-means clustering exhibited the best performance in terms of recovering the true structure of the data sets. To the best of our knowledge, comparative studies of several HC methods and K-means using the CH and DB validity indices are poorly documented in the literature.

Page 23: A comparative study of Clustering for Gene expression data in Bioinformatics

Future Research Interest
1. Compare hierarchical clustering methods with the Self-Organizing Maps method and other recent clustering methods.

2. Investigate the performance of the different hierarchical clustering methods in comparison with other existing methods using the false discovery rate (FDR), misclassification error rate (MER), receiver operating characteristic (ROC) and area under the ROC curve, with a resampling technique.

3. Compare both supervised and unsupervised methods for gene expression data.

Page 24: A comparative study of Clustering for Gene expression data in Bioinformatics

Thank you

Page 25: A comparative study of Clustering for Gene expression data in Bioinformatics

References

[1] Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JC, Sabet H, Tran T, Yu X, Powell JI, Yang L, Marti GE, Moore T, Hudson J, Lu L, Lewis DB, Tibshirani R, Sherlock G, Chan WC, Greiner TC, Weisenburger DD, Armitage JO, Warnke R, Levy R, Wilson W, Grever MR, Byrd JC, Botstein D, Brown PO, Staudt LM (2000); Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling; Nature. 403:503-511.

[2] Armstrong SA, Staunton JE, Silverman LB, Pieters R, den Boer ML, Minden MD, Sallan SE, Lander ES, Golub TR, Korsmeyer SJ (2002); MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia; Nat Genet. 30:41-47.

[3] Bhattacharjee A, Richards WG, Staunton J, Li C, Monti S, Vasa P, Ladd C, Beheshti J, Bueno R, Gillette M, Loda M, Weber G, Mark EJ, Lander ES, Wong W, Johnson BE, Golub TR, Sugarbaker DJ, Meyerson M (2001); Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses; Proc Natl Acad Sci USA. 98(24):13790-13795.

[4] Garber ME, Troyanskaya OG, Schluens K, Petersen S, Thaesler Z, Pacyna-Gengelbach M, Rijn M van de, Rosen GD, Perou CM, Whyte RI, Altman RB, Brown PO, Botstein D, Petersen I (2001); Diversity of gene expression in adenocarcinoma of the lung; Proc Natl Acad Sci USA. 98(24):13784-13789.

[5] Liang Y, Diehn M, Watson N, Bollen AW, Aldape KD, Nicholas MK, Lamborn KR, Berger MS, Botstein D, Brown PO, Israel MA (2005); Gene expression profiling reveals molecularly and clinically distinct subtypes of glioblastoma multiforme; Proc Natl Acad Sci USA. 102(16):5814-5819.

[6] Nutt CL, Mani DR, Betensky RA, Tamayo P, Cairncross JG, Ladd C, Pohl U, Hartmann C, McLaughlin ME, Batchelor TT, Black PM, von Deimling A, Pomeroy SL, Golub TR, Louis DN (2003); Gene expression-based classification of malignant gliomas correlates better with survival than histological classification; Cancer Res. 63(7):1602-1607.