Clustering by soft-constraint affinity propagation: applications to gene-expression data

Michele Leone, Sumedha and Martin Weigt

Bioinformatics, 2007

Outline

• Introduction
• The Algorithm and Method Analysis
• Experimental results
• Discussion

Introduction

• Affinity Propagation (AP) seeks to identify each cluster by one of its elements, the exemplar.
– Each point in the cluster refers to this exemplar.
– Each exemplar is required to refer to itself as a self-exemplar.
• However, this forces clusters to appear as stars: there is only one central node, and all other nodes are directly connected to it.

Introduction

• Some drawbacks of Affinity Propagation:
– The hard constraint in AP relies strongly on cluster-shape regularity.
– All information about the internal structure and the hierarchical merging/dissociation of clusters is lost.
– AP has robustness limitations.
– AP forces each exemplar to point to itself.

Introduction

• How can it be improved?
• The hard constraint: every exemplar must be a self-exemplar.
• We relax the hard constraint by introducing a finite penalty term for each constraint violation.

The Algorithm and Method Analysis

• The Soft Constraint Affinity Propagation (SCAP) equations.
• Efficient implementation of the algorithm.
• Extracting cluster signatures.

The SCAP equations

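• We write the constraint attached to a given data point $\mu$ as follows, with $p \in [0,1]$ and $c_\mu$ denoting the exemplar chosen by $\mu$:

\[ \chi_\mu^{(p)}[\mathbf{c}] = \begin{cases} p & \text{if } c_\mu \neq \mu \text{ and } \exists\, \nu \neq \mu : c_\nu = \mu \\ 1 & \text{otherwise} \end{cases} \]

The first case assigns a penalty $p$ if data point $\mu$ is chosen as exemplar by some other data point $\nu$ without being a self-exemplar.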
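
The constraint can be illustrated with a short Python sketch. The function name `constraint_factor` and the list encoding `c[mu]` = exemplar chosen by point `mu` are assumptions made for illustration, not notation from the paper.

```python
def constraint_factor(mu, c, p):
    """Soft-constraint factor for data point mu under the assignment c.

    c[i] is the exemplar chosen by point i (illustrative encoding).
    Returns the penalty p if some other point chose mu as its exemplar
    while mu itself points elsewhere, and 1.0 otherwise.
    """
    chosen_by_other = any(c[nu] == mu for nu in range(len(c)) if nu != mu)
    is_self_exemplar = c[mu] == mu
    return p if (chosen_by_other and not is_self_exemplar) else 1.0


# A star: points 0, 1, 2 refer to 0 and points 3, 4 refer to 3.
star = [0, 0, 0, 3, 3]
# A chain: 0 -> 1 -> 2 <- 3 <- 4, where 1 and 3 are exemplars of others
# without being self-exemplars.
chain = [1, 2, 2, 2, 3]

print([constraint_factor(mu, star, 0.5) for mu in range(5)])
print([constraint_factor(mu, chain, 0.5) for mu in range(5)])
```

In the star every factor equals 1, so nothing is penalized. In the chain, points 1 and 3 each contribute a factor of 0.5.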

The SCAP equations

• The penalty $p$ represents a compromise between minimizing the cost function and searching for compact clusters.

• We then introduce a positive real-valued parameter $\beta$ weighing the relative importance of the cost minimization with respect to the constraints.

The SCAP equations

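• So, we can define the probability of an arbitrary clustering $\mathbf{c}$ as:

\[ P[\mathbf{c}] \propto \prod_\mu \exp\big[\beta\, S(\mu, c_\mu)\big]\; \chi_\mu^{(p)}[\mathbf{c}] \]

where $S(\mu, c_\mu)$ is the similarity between data point $\mu$ and its exemplar $c_\mu$, and $\chi_\mu^{(p)}$ is the constraint attached to $\mu$.

• Original AP is recovered by taking $p = 0$, since any violated constraint then sets the probability to zero.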

The SCAP equations

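• For general $p$, the optimal clustering can be determined by maximizing the marginal probabilities for all data points $\mu$:

\[ P_\mu(c_\mu) = \sum_{\{c_\nu :\, \nu \neq \mu\}} P[\mathbf{c}], \qquad c_\mu^\ast = \arg\max_{c_\mu} P_\mu(c_\mu) \]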
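
The following is a brute-force sketch of this marginal maximization for a very small data set. It assumes that the probability of a clustering is the product, over data points, of exp(beta × similarity to the chosen exemplar) and the soft-constraint factor. That product form, the names `weight` and `best_exemplars`, and the toy similarities are assumptions for illustration. Enumerating all N^N assignments is only feasible for tiny N and is not the message-passing scheme used by SCAP.

```python
import itertools
import math


def weight(c, S, beta, p):
    """Unnormalised probability of assignment c (assumed product form)."""
    n = len(c)
    w = 1.0
    for mu in range(n):
        w *= math.exp(beta * S[mu][c[mu]])
        # Penalise mu if another point chose it while it points elsewhere.
        if c[mu] != mu and any(c[nu] == mu for nu in range(n) if nu != mu):
            w *= p
    return w


def best_exemplars(S, beta, p):
    """Maximise each point's marginal by enumerating all assignments."""
    n = len(S)
    marginal = [[0.0] * n for _ in range(n)]
    for c in itertools.product(range(n), repeat=n):
        w = weight(c, S, beta, p)
        for mu in range(n):
            marginal[mu][c[mu]] += w
    return [max(range(n), key=lambda nu: marginal[mu][nu]) for mu in range(n)]


# Three points on a line; similarity = negative distance, with a
# hypothetical self-similarity of -1.5 on the diagonal.
x = [0.0, 1.0, 2.2]
S = [[-abs(a - b) for b in x] for a in x]
for i in range(3):
    S[i][i] = -1.5

print(best_exemplars(S, beta=10.0, p=1.0))  # [1, 0, 1]: no penalty
print(best_exemplars(S, beta=10.0, p=0.0))  # [1, 1, 1]: hard constraint
```

With p = 1 each point independently picks its most similar candidate, so points 0 and 1 refer to each other. With p = 0 the assignment becomes a valid star centred on point 1.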

The SCAP equations

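• Taking the limit $\beta \to \infty$ with the penalty rescaled as $p = e^{-\beta \tilde{p}}$, we find the SCAP equations for the responsibilities $r(\mu,\nu)$ and availabilities $a(\mu,\nu)$:

\[
\begin{aligned}
r(\mu,\nu) &= S(\mu,\nu) - \max_{\lambda \neq \nu}\,\big[a(\mu,\lambda) + S(\mu,\lambda)\big] \\
a(\mu,\nu) &= \min\Big\{0,\ \max\big[-\tilde{p},\, r(\nu,\nu)\big] + \sum_{\lambda \neq \mu,\nu} \max\big[0,\, r(\lambda,\nu)\big]\Big\} \qquad (\mu \neq \nu) \\
a(\mu,\mu) &= \min\Big\{\tilde{p},\ \sum_{\lambda \neq \mu} \max\big[0,\, r(\lambda,\mu)\big]\Big\}
\end{aligned}
\]

• The exemplar of any data point $\mu$ can be computed as:

\[ c_\mu^\ast = \arg\max_\nu\,\big[a(\mu,\nu) + r(\mu,\nu)\big] \]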

The SCAP equations

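• Compared to original AP, SCAP amounts to an additional threshold, set by the rescaled penalty $\tilde{p}$, on the self-availabilities $a(\mu,\mu)$ and the self-responsibilities $r(\nu,\nu)$.
• For small enough $\tilde{p}$, $a(\mu,\mu) = \tilde{p}$ in many cases.
• The self-responsibility $r(\nu,\nu)$ is substituted with $\max[-\tilde{p},\, r(\nu,\nu)]$.
• For $\tilde{p} \to \infty$ (i.e. $p = 0$), the original AP equations are recovered.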

The SCAP equations

• This means that variables are discouraged from being self-exemplars beyond a given threshold, even when another data point is already pointing at them.

Efficient implementation

• The iterative solution:

Efficient implementation

• Differences from the original AP:
– Step 3 is formulated as a sequential update.
– The original AP uses a damped parallel update.
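
A minimal sketch of a sequential message-passing sweep is given below. It assumes AP-style responsibilities `r` and availabilities `a`, with a threshold `p_tilde` that caps the self-availability and bounds the self-responsibility from below. The exact update formulas in the code, the function name `scap`, the random visiting order and the fixed number of sweeps are all assumptions for illustration; the individual steps of the authors' procedure are not reproduced in these slides.

```python
import random


def scap(S, p_tilde, sweeps=50, seed=0):
    """Sequential soft-constraint message passing (illustrative sketch).

    S is an n x n similarity matrix (list of lists) whose diagonal holds
    the self-similarities; p_tilde >= 0 is the soft-constraint threshold.
    Returns, for every point, the index of its chosen exemplar.
    """
    n = len(S)
    r = [[0.0] * n for _ in range(n)]  # responsibilities r(mu, nu)
    a = [[0.0] * n for _ in range(n)]  # availabilities a(mu, nu)
    rng = random.Random(seed)
    order = list(range(n))
    for _ in range(sweeps):
        rng.shuffle(order)
        # Sequential update: each point refreshes its own messages using
        # the most recent messages of all other points (no damping).
        for mu in order:
            for nu in range(n):
                r[mu][nu] = S[mu][nu] - max(
                    a[mu][lam] + S[mu][lam] for lam in range(n) if lam != nu
                )
            for nu in range(n):
                if nu == mu:
                    support = sum(
                        max(0.0, r[lam][mu]) for lam in range(n) if lam != mu
                    )
                    a[mu][mu] = min(p_tilde, support)  # capped self-availability
                else:
                    support = sum(
                        max(0.0, r[lam][nu])
                        for lam in range(n)
                        if lam != mu and lam != nu
                    )
                    a[mu][nu] = min(0.0, max(-p_tilde, r[nu][nu]) + support)
    return [max(range(n), key=lambda nu: a[mu][nu] + r[mu][nu]) for mu in range(n)]


# Two well-separated groups on a line, with a hypothetical
# self-similarity of -3 for every point.
x = [0, 1, 2, 10, 11, 12]
S = [[-abs(u - v) for v in x] for u in x]
for i in range(len(x)):
    S[i][i] = -3
print(scap(S, p_tilde=1.0))
```

A damped parallel update would instead recompute all messages from the previous iteration's values and mix old and new messages with a damping factor. The sequential variant uses each refreshed message immediately.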

Extracting cluster signatures

• Only a few components (genes) carry useful information about the cluster structure; they are called cluster signatures.

• We assume the similarity between data points $\mu$ and $\nu$ to be additive in single-gene contributions, with $i$ running over the genes:

\[ S(\mu,\nu) = \sum_i s_i(\mu,\nu) \]
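
As an illustration of an additive similarity, the sketch below uses the negative Manhattan distance, whose total is a sum of one term per gene. The choice of this measure for expression profiles and the names `gene_contributions` and `similarity` are assumptions; the slides only state that the similarity is additive.

```python
def gene_contributions(x, y):
    """Per-gene similarity terms: negative absolute expression difference."""
    return [-abs(a - b) for a, b in zip(x, y)]


def similarity(x, y):
    """Total similarity as the sum of the single-gene contributions."""
    return sum(gene_contributions(x, y))


# Two hypothetical expression profiles over three genes.
profile_a = [1.0, 4.0, 2.0]
profile_b = [2.0, 4.0, 5.0]
print(gene_contributions(profile_a, profile_b))  # [-1.0, -0.0, -3.0]
print(similarity(profile_a, profile_b))          # -4.0
```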


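• Having found a clustering given by the exemplar selection $\mathbf{c}^\ast$, we can calculate the similarity of a cluster $C$, defined as a connected component of the directed graph with edges $\mu \to c_\mu^\ast$, as a sum over single-gene contributions:

\[ S_C = \sum_i S_i(C), \qquad S_i(C) = \sum_{\mu \in C} s_i(\mu, c_\mu^\ast) \]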
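
A sketch of how clusters could be read off as connected components of the exemplar graph, followed by the per-gene similarity of one cluster. The union-find helper, the function names, and the use of the negative absolute difference as the single-gene contribution are assumptions for illustration.

```python
def clusters_from_exemplars(c):
    """Group points into connected components of the graph mu -> c[mu]."""
    parent = list(range(len(c)))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for mu, exemplar in enumerate(c):
        parent[find(mu)] = find(exemplar)

    groups = {}
    for mu in range(len(c)):
        groups.setdefault(find(mu), []).append(mu)
    return sorted(groups.values())


def cluster_gene_similarity(cluster, c, data):
    """Per-gene sum of contributions between each point and its exemplar."""
    n_genes = len(data[0])
    return [
        sum(-abs(data[mu][i] - data[c[mu]][i]) for mu in cluster)
        for i in range(n_genes)
    ]


# Exemplar choices forming two components: 0 -> 1 -> 2 (self) and
# 5 -> 3 -> 4 (self). Point 1 is an exemplar without being a self-exemplar,
# which the soft constraint permits.
c = [1, 2, 2, 4, 4, 3]
data = [[0, 5], [1, 5], [2, 9], [7, 1], [8, 1], [9, 1]]  # hypothetical values
clusters = clusters_from_exemplars(c)
print(clusters)                                       # [[0, 1, 2], [3, 4, 5]]
print(cluster_gene_similarity(clusters[0], c, data))  # [-2, -4]
```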


• Then, we compare this cluster similarity to random exemplar choices, which are characterized by their mean and variance.


• The relevance of a gene can be ranked by a score which measures the distance of the actual single-gene cluster similarity from the distribution of random exemplar mappings.

• Genes can be ranked according to this score; the highest-ranking genes are considered a cluster signature.
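
One way to make this ranking concrete is a z-score per gene: the actual per-gene cluster similarity minus the mean under random exemplar choices, divided by the corresponding standard deviation. The sketch below assumes that form, assumes that a random mapping lets each cluster member pick its exemplar uniformly and independently among all other data points, and uses the negative absolute difference as the single-gene contribution. None of these choices, nor the name `gene_scores`, is confirmed by the slides.

```python
import math


def gene_scores(cluster, c, data):
    """Z-score of each gene for one cluster (illustrative assumptions)."""
    n, n_genes = len(data), len(data[0])
    scores = []
    for i in range(n_genes):
        actual = mean = var = 0.0
        for mu in cluster:
            # Contribution of gene i between mu and its actual exemplar.
            actual += -abs(data[mu][i] - data[c[mu]][i])
            # Contributions if mu picked any other point as exemplar.
            rand = [-abs(data[mu][i] - data[nu][i]) for nu in range(n) if nu != mu]
            m = sum(rand) / len(rand)
            mean += m
            # Independent choices: the variances of the members add up.
            var += sum((s - m) ** 2 for s in rand) / len(rand)
        scores.append((actual - mean) / math.sqrt(var) if var > 0 else 0.0)
    return scores


# Gene 0 separates the two groups cleanly; gene 1 does not (toy values).
data = [[0, 3], [0, 7], [0, 1], [10, 5], [10, 2], [10, 8]]
c = [0, 0, 0, 3, 3, 3]
scores = gene_scores([0, 1, 2], c, data)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(scores)
print(ranking)  # [0, 1]: gene 0 ranks first
```

In this toy example gene 0 receives the higher score, so it would be the first candidate for the signature of the cluster {0, 1, 2}.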

Experimental results

• Iris data
• Brain cancer data
• Other benchmark cancer data
– Lymphoma cancer data
– SRBCT cancer data
– Leukemia

Iris data

• Three clusters: setosa, versicolor, virginica.
• Four features for 150 flowers:
– sepal length
– sepal width
– petal length
– petal width

Iris data

• Experimental results:
– Affinity Propagation: 16 errors.
– SCAP: 9 errors with the Manhattan distance measure for the similarity.

• On increasing the value of , the clusters for Versicolor and Virginica merge with each other, reflecting the fact that they are closer to each other than to Setosa.
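
The slides do not state how errors are counted. One common convention, assumed here, labels each cluster by its majority class and counts every point whose class differs from that majority. The function name `count_errors` and the toy labels are hypothetical.

```python
from collections import Counter


def count_errors(cluster_ids, labels):
    """Count points that disagree with the majority label of their cluster."""
    by_cluster = {}
    for cid, label in zip(cluster_ids, labels):
        by_cluster.setdefault(cid, []).append(label)
    return sum(
        len(members) - Counter(members).most_common(1)[0][1]
        for members in by_cluster.values()
    )


cluster_ids = [0, 0, 0, 1, 1, 1, 1]
labels = ["a", "a", "b", "b", "b", "b", "a"]
print(count_errors(cluster_ids, labels))  # 2
```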

Brain cancer data

• Five diagnosis types for 42 patients:
– 10 medulloblastoma
– 10 malignant glioma
– 10 atypical teratoid/rhabdoid tumors
– 4 normal cerebella
– 8 primitive neuroectodermal tumors (PNET)

Brain cancer data

• Clustering with AP(for ):

There are three well-distinguishable clusters.

The lowest number of errors is obtained with five clusters.

Brain cancer data

• Clustering with SCAP:

SCAP identifies four clusters with 8 errors.

Brain cancer data

• The eight errors are due to misclassifications of the fifth diagnosis (PNET).

• We use the procedure to extract cluster signatures in the case of four clusters:

• Samples No. 34 to 41 belong to the fifth diagnosis.

Other benchmark cancer data

• Lymphoma cancer data
– Three diagnoses for 62 patients.
• SRBCT cancer data
– Four expression diagnosis patterns for 63 samples.
• Leukemia
– Two diagnoses for 72 samples.

Other benchmark cancer data

• Lymphoma cancer data
– AP: 3 errors with 3 clusters.
– SCAP: 1 error with 3 clusters.
• SRBCT cancer data
– AP: 22 errors with 5 clusters.
– SCAP: 7 errors with 4 clusters.
• Leukemia
– AP: 4 errors with 2 clusters.
– SCAP: 2 errors with 2 clusters.

Discussion

• If clusters cannot be well represented by a single cluster exemplar, AP is bound to fail.

• SCAP is more efficient than AP, in particular for noisy, irregularly organized data, and thus for biological applications involving microarray data.

• The cluster structure can be efficiently probed.
