Clustering by soft-constraint affinity propagation: applications to gene-expression data

Michele Leone, Sumedha and Martin Weigt, Bioinformatics, 2007

Outline

• Introduction
• The Algorithm and Method Analysis
• Experimental results
• Discussion

Introduction

• Affinity Propagation (AP) seeks to identify each cluster by one of its elements, the exemplar.
– Each point in the cluster refers to this exemplar.
– Each exemplar is required to refer to itself as a self-exemplar.
• However, this forces clusters to appear as stars: there is only one central node, and all other nodes are directly connected to it.

Introduction

• Some drawbacks of Affinity Propagation:
– The hard constraint in AP relies strongly on cluster-shape regularity.
– All information about the internal structure and the hierarchical merging/dissociation of clusters is lost.
– AP has robustness limitations.
– AP forces each exemplar to point to itself.

Introduction

• How to improve it?
• The hard constraint: exemplars must be self-exemplars.
• We relax the hard constraint by introducing a finite penalty term for each constraint violation.

The Algorithm and Method Analysis

• The Soft-Constraint Affinity Propagation (SCAP) equations.
• Efficient implementation of the algorithm.
• Extracting cluster signatures.

The SCAP equations

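• We write the constraint attached to a given data point $\mu$ as follows, with $p \in [0, 1]$ and $c_\mu$ denoting the exemplar chosen by data point $\mu$:

$$\chi_\mu^{(p)}[c] = \begin{cases} p & \text{if } \exists\, \nu \neq \mu : c_\nu = \mu \text{ and } c_\mu \neq \mu, \\ 1 & \text{otherwise.} \end{cases}$$

The first case assigns a penalty if data point $\mu$ is chosen as exemplar by some other data point $\nu$, without being a self-exemplar.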

The SCAP equations

• The penalty represents a compromise between the minimization of the cost function and the search for compact clusters.

• We then introduce a positive real-valued parameter $\beta$ weighing the relative importance of the cost minimization with respect to the constraints.

The SCAP equations

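• So, we can define the probability of an arbitrary clustering $c$ as:

$$P[c] \propto \exp\bigl(-\beta E[c]\bigr) \prod_\mu \chi_\mu^{(p)}[c]$$

where $E[c]$ is the cost function, $\beta$ is the positive weighting parameter and $\chi_\mu^{(p)}[c]$ is the soft constraint attached to data point $\mu$, with penalty $p$.

• Original AP is recovered by taking $p = 0$, since any violated constraint then sets $P[c]$ to zero.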
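
As a minimal sketch of this weighting, the unnormalized log-probability of a clustering can be evaluated directly. It assumes the usual AP cost $E[c] = -\sum_\mu S(\mu, c_\mu)$, which is not stated on the slide; the function names `violated` and `log_weight` are illustrative and not taken from the paper.

```python
import math


def violated(c):
    """Data points chosen as exemplar by another point without being
    self-exemplars. c[mu] is the exemplar chosen by data point mu."""
    chosen = {nu for mu, nu in enumerate(c) if nu != mu}
    return sorted(mu for mu in chosen if c[mu] != mu)


def log_weight(S, c, beta, p):
    """Unnormalized log P[c], assuming E[c] = -sum_mu S[mu][c[mu]].

    Each violated constraint contributes a factor p (0 <= p <= 1);
    with p = 0 a single violation makes the clustering impossible.
    """
    gain = sum(S[mu][c[mu]] for mu in range(len(c)))  # equals -E[c]
    n_bad = len(violated(c))
    if n_bad and p == 0.0:
        return -math.inf
    return beta * gain + (n_bad * math.log(p) if n_bad else 0.0)


# Hypothetical 3-point similarity matrix (negative distances).
S = [[-1.0, -2.0, -5.0],
     [-2.0, -1.0, -3.0],
     [-5.0, -3.0, -1.0]]
star = [1, 1, 1]    # everyone points to the self-exemplar 1
chain = [1, 2, 2]   # 0 -> 1 -> 2: point 1 is an exemplar but not a self-exemplar
print(log_weight(S, star, 1.0, 0.5), log_weight(S, chain, 1.0, 0.5))
```

With `p = 0.0` the chain configuration gets weight zero (log-weight `-inf`), which is the hard-constraint behaviour of the original AP.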

The SCAP equations

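• For general $p$, the optimal clustering can be determined by maximizing the marginal probabilities for all data points $\mu$:

$$c_\mu^{*} = \arg\max_{c_\mu} P_\mu(c_\mu), \qquad P_\mu(c_\mu) = \sum_{\{c_\nu :\, \nu \neq \mu\}} P[c]$$

where $P[c]$ is the probability of the full clustering $c$.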

The SCAP equations

• Assume , we find the SCAP equations:

• The exemplar of any data point can be computed as:
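
A hedged reconstruction of these update rules in LaTeX is sketched here. It assumes the limit $\beta \to \infty$ with the penalty rescaled as $p = e^{-\beta \tilde{p}}$, and the usual AP notation (similarity $S$, responsibility $r$, availability $a$). The exact index conventions are assumptions and should be checked against the published paper.

```latex
\begin{align}
r(\mu,\nu) &= S(\mu,\nu) - \max_{\nu' \neq \nu}\bigl[a(\mu,\nu') + S(\mu,\nu')\bigr] \\
a(\mu,\nu) &= \min\Bigl[0,\; \max\bigl(-\tilde{p},\, r(\nu,\nu)\bigr)
              + \sum_{\mu' \neq \mu,\nu} \max\bigl(0,\, r(\mu',\nu)\bigr)\Bigr],
              \qquad \mu \neq \nu \\
a(\nu,\nu) &= \min\Bigl[\tilde{p},\; \sum_{\mu' \neq \nu} \max\bigl(0,\, r(\mu',\nu)\bigr)\Bigr] \\
c_\mu^{*}  &= \arg\max_{\nu}\bigl[a(\mu,\nu) + r(\mu,\nu)\bigr]
\end{align}
```

In this form, sending $\tilde{p} \to \infty$ removes both thresholds and gives back the standard AP updates.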

The SCAP equations

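• Compared to original AP, SCAP amounts to an additional threshold on the self-availabilities $a(\mu, \mu)$ and the self-responsibilities $r(\mu, \mu)$.
• For small enough $\tilde{p}$, $a(\mu, \mu) = \tilde{p}$ in many cases.
• The self-responsibility $r(\mu, \mu)$ is substituted with $\max[-\tilde{p}, r(\mu, \mu)]$.
• For $\tilde{p} \to \infty$ (i.e. $p = 0$), the original AP equations are recovered.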

The SCAP equations

• This means that variables are discouraged from being self-exemplars beyond a given threshold, even if other data points are already pointing at them.

Efficient implementation

• The iterative solution:

Efficient implementation

• Differences from the original AP:
– Step 3 is formulated as a sequential update.
– The original AP used a damped parallel update.
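
To illustrate what a sequential (undamped) update could look like, here is a minimal pure-Python sketch. The update rules written in the comments are an assumed form of the SCAP equations (AP updates with thresholds set by a penalty `p_tilde`), and the schedule (for each data point, refresh its responsibilities, then its availabilities, always using the latest messages) is likewise an assumption. The function name `scap` and its parameters are illustrative.

```python
def scap(S, p_tilde, n_sweeps=50):
    """Sequential message-passing sweeps (sketch, needs at least 2 points).

    S[mu][nu] is the similarity of mu to nu, with the self-similarity on the
    diagonal. Assumed update rules:
      r(mu,nu) = S(mu,nu) - max_{nu' != nu} [a(mu,nu') + S(mu,nu')]
      a(mu,nu) = min(0, max(-p_tilde, r(nu,nu)) + sum_{k != mu,nu} max(0, r(k,nu)))
      a(nu,nu) = min(p_tilde, sum_{k != nu} max(0, r(k,nu)))
    """
    n = len(S)
    r = [[0.0] * n for _ in range(n)]  # responsibilities
    a = [[0.0] * n for _ in range(n)]  # availabilities
    for _ in range(n_sweeps):
        for mu in range(n):
            # Responsibilities sent by mu, computed from the current availabilities.
            t = [a[mu][nu] + S[mu][nu] for nu in range(n)]
            for nu in range(n):
                rival = max(t[k] for k in range(n) if k != nu)
                r[mu][nu] = S[mu][nu] - rival
            # Availabilities for mu, computed from the freshly updated r.
            for nu in range(n):
                support = sum(max(0.0, r[k][nu])
                              for k in range(n) if k != mu and k != nu)
                if nu == mu:
                    a[mu][nu] = min(p_tilde, support)
                else:
                    a[mu][nu] = min(0.0, max(-p_tilde, r[nu][nu]) + support)
    # Each point picks the exemplar maximizing a + r.
    c = [max(range(n), key=lambda nu: a[mu][nu] + r[mu][nu]) for mu in range(n)]
    return c, a, r


# Hypothetical toy data: two groups of 2-D points, Manhattan similarity.
pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2), (5.2, 4.9)]
S = [[-(abs(x1 - x2) + abs(y1 - y2)) for (x2, y2) in pts] for (x1, y1) in pts]
for i in range(len(pts)):
    S[i][i] = -1.0  # shared self-similarity, chosen arbitrarily here
c, a, r = scap(S, p_tilde=0.5, n_sweeps=20)
print(c)
```

No damping factor appears because each message is overwritten in place as soon as it is recomputed. A production version would also stop once the exemplars no longer change, rather than running a fixed number of sweeps.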

Extracting cluster signatures

• Only a few components carry useful information about the cluster structure; they are called cluster signatures.

• We assume the similarity between data points $\mu$ and $\nu$ to be additive in single-gene contributions, with $i$ running over genes:

$$S(\mu, \nu) = \sum_i s_i(\mu, \nu)$$

Extracting cluster signatures

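• Having found a clustering given by the exemplar selection $c$, we can calculate the similarity of a cluster $C$, defined as a connected component of the directed graph with edges $\mu \to c_\mu$:

$$S_C = \sum_{\mu \in C} S(\mu, c_\mu)$$

as a sum over single-gene contributions, where $s_i(\mu, \nu)$ is the contribution of gene $i$ to $S(\mu, \nu)$:

$$S_C = \sum_i S_C^i, \qquad S_C^i = \sum_{\mu \in C} s_i(\mu, c_\mu)$$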
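
A short sketch of how clusters could be read off an exemplar mapping as connected components, ignoring edge direction. The union-find helper and the name `clusters_from_exemplars` are illustrative choices, not part of the original slides.

```python
def clusters_from_exemplars(c):
    """Group data points into connected components of the graph mu -> c[mu].

    c[mu] is the exemplar chosen by data point mu. Edge direction is ignored,
    so chains such as 0 -> 1 -> 2 end up in a single cluster.
    """
    parent = list(range(len(c)))

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for mu, nu in enumerate(c):
        root_mu, root_nu = find(mu), find(nu)
        if root_mu != root_nu:
            parent[root_mu] = root_nu

    groups = {}
    for mu in range(len(c)):
        groups.setdefault(find(mu), []).append(mu)
    return sorted(groups.values())


# Points 0 and 1 form a chain ending at the self-exemplar 2; 3 points to 4.
print(clusters_from_exemplars([1, 2, 2, 4, 4]))
```

Because the soft constraint allows exemplars that are not self-exemplars, a cluster found this way need not be a star around a single point.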

Extracting cluster signatures

• Then, we compare these single-gene contributions to those obtained for random exemplar choices, which are characterized by their mean and variance.

Extracting cluster signatures

• The relevance of a gene can be ranked by a score that measures the distance of its actual contribution from the distribution obtained for random exemplar mappings.
• Genes are ranked according to this score; the highest-ranking genes are considered a cluster signature.

Experimental results

• Iris data
• Brain cancer data
• Other benchmark cancer data
– Lymphoma cancer data
– SRBCT cancer data
– Leukemia

Iris data

• Three clusters: setosa, versicolor, virginica.
• Four features for 150 flowers:
– sepal length
– sepal width
– petal length
– petal width

Iris data

• Experimental results:
– Affinity Propagation: 16 errors.
– SCAP: 9 errors with Manhattan distance measure for the similarity.

• On increasing the value of , the clusters for Versicolor and Virginica merge with each other, reflecting the fact that they are closer to each other than to Setosa.
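
A Manhattan-distance similarity is naturally additive over features, which is easy to see in code. The sketch below assumes the similarity is the negative Manhattan distance; the function names and the three sample rows are illustrative.

```python
def manhattan_contributions(X):
    """Return contrib[i][mu][nu] = -|X[mu][i] - X[nu][i]| for every feature i."""
    n, d = len(X), len(X[0])
    return [[[-abs(X[mu][i] - X[nu][i]) for nu in range(n)] for mu in range(n)]
            for i in range(d)]


def similarity(contrib):
    """Sum the per-feature contributions into the full similarity matrix S."""
    n = len(contrib[0])
    return [[sum(s[mu][nu] for s in contrib) for nu in range(n)]
            for mu in range(n)]


# Three illustrative rows: sepal length, sepal width, petal length, petal width.
X = [[5.1, 3.5, 1.4, 0.2],
     [4.9, 3.0, 1.4, 0.2],
     [7.0, 3.2, 4.7, 1.4]]
contrib = manhattan_contributions(X)
S = similarity(contrib)
print(S[0][1], S[0][2])
```

The diagonal of `S` comes out as zero here; in AP-style algorithms the self-similarity is a separate choice and would have to be set explicitly.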

Brain cancer data

• Five diagnosis types for 42 patients:
– 10 medulloblastoma
– 10 malignant glioma
– 10 atypical teratoid/rhabdoid tumors
– 4 normal cerebella
– 8 primitive neuroectodermal tumors (PNET)

Brain cancer data

• Clustering with AP(for ):

There are three well-distinguishable clusters.

Five clusters give the lowest number of errors.

Brain cancer data

• Clustering with SCAP:

SCAP identifies four clusters with 8 errors.

Brain cancer data

• The eight errors are due to misclassifications of the fifth diagnosis (PNET).

• We use the procedure to extract cluster signatures in the case of four clusters:

• Samples No. 34 to 41 belong to the fifth diagnosis.

Other benchmark cancer data

• Lymphoma cancer data
– Three diagnoses for 62 patients.
• SRBCT cancer data
– Four expression diagnosis patterns for 63 samples.
• Leukemia
– Two diagnoses for 72 samples.

Other benchmark cancer data

• Lymphoma cancer data
– AP: 3 errors with 3 clusters.
– SCAP: 1 error with 3 clusters.
• SRBCT cancer data
– AP: 22 errors with 5 clusters.
– SCAP: 7 errors with 4 clusters.
• Leukemia
– AP: 4 errors with 2 clusters.
– SCAP: 2 errors with 2 clusters.
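
The slides do not say how errors are counted. One common convention, assumed here, is to label each cluster by its majority diagnosis and count every sample that disagrees with it. The helper name `count_errors` is illustrative.

```python
from collections import Counter


def count_errors(clusters, labels):
    """Count samples whose label differs from the majority label of their cluster.

    clusters is a list of non-empty lists of sample indices;
    labels[mu] is the known diagnosis of sample mu.
    """
    errors = 0
    for members in clusters:
        counts = Counter(labels[mu] for mu in members)
        majority_size = counts.most_common(1)[0][1]
        errors += len(members) - majority_size
    return errors


# Hypothetical example: one sample of type "b" lands in an "a"-dominated cluster.
labels = ["a", "a", "b", "b", "b"]
print(count_errors([[0, 1, 2], [3, 4]], labels))
```

Under this convention, splitting the data into more clusters can only lower the count, so error counts are best compared at a similar number of clusters.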

Discussion

• If clusters cannot be well represented by a single cluster exemplar, AP is bound to fail.

• SCAP is more efficient than AP, in particular for noisy, irregularly organized data, and thus in biological applications concerning microarray data.

• The cluster structure can be efficiently probed.
