
Source: mmds.imm.dtu.dk/presentations/buhmann.pdf

Friday, 03 July 2009

Structure validation in clustering by stability analysis

Joachim M. Buhmann

Computer Science Department, ETH Zurich

Slide 2 (EMMDS Workshop, DTU Copenhagen)

Learning discrete structures under uncertainty

Partitions or clusterings: Which model? How many clusters?

Trees or dendrograms: What depth of the tree? How many leaves?

Graphical models: How many edges?

Slide 3

How difficult is image segmentation? Do you recognize the object?

Slide 4

What is information?

Is Shannon information useful as the technical measure of information?

[Figure: two images with file sizes 842.414 Bytes and 620.393 Bytes]

Slide 5

Relevant information in an image: where can we find it? The relevance of information is determined by the task!

Yarbus (1967)

Slide 6

Machine learning & statistical modelling

Approximate optimization, stability and information theory: optimization as coding, statistical models

Medical image processing of tissue data for high-throughput cancer diagnosis; computational pathology: tissue microarrays for clear cell renal carcinoma

[Diagram: sender, problem generator, receiver; the receiver solves R(·, σ_s ∘ X^(2)), s.t. X^(1), X^(2) ~ P(X)]

Slide 7

What is the "right" model for figure-ground segmentation?

Slide 8

Path Based Clustering

Clustering models for figure ground segmentation?

Connectedness criterion: Single Linkage

Compactness criterion: k-means clustering

Pairwise Clustering, AvAssoc

Correlation Clustering

Max-Cut, Average Cut

Normalized Cut

Slide 9

Design problems in clustering: validation

Modeling problem: Does the cluster model "describe" the data? Selection of the costs/hypothesis class!

Model order selection problem: Is the number of clusters and/or features correct?

Slide 10

Concept for robust data analysis

[Diagram: Data (vectors, relations, images, ...) feed into Structure definition (costs, risk, ...), then Structure optimization (multiscale analysis, stochastic approximation), then Structure validation (statistical learning theory), with feedback loops; quantization of the solution space (information/rate distortion theory) regularizes the statistical & computational complexity]

Slide 11

Mathematical formalization of clustering

Given: object space O with objects o ∈ O.

Given: measurement space X; data are relations (o, X) ∈ O × X.

Clusterings partition objects into groups, i.e.,

c : O × X → {1, ..., k}, (o, X) ↦ c(o, X).

Hypothesis class: c ∈ C ≡ {partitions of the data}.
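The formalization can be made concrete in a few lines of code. The sketch below (all names our own; plain k-means stands in for the grouping principle) produces one element c of the hypothesis class, i.e. a label in {1, ..., k} per object:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """A clustering c : O x X -> {1, ..., k}: Lloyd iterations return one
    element of the hypothesis class C, i.e. a group label per object."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each object to its nearest center, then re-estimate centers
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Hypothetical measurements: 40 objects in two well-separated groups.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(4.0, 0.3, (20, 2))])
c = kmeans(X, k=2)
```

Any such label vector is one partition out of the hypothesis class C; the validation question of the talk is which of these partitions are statistically distinguishable.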

Slide 12

Order relation of clusterings

An algorithm α : O × X → C_γ ⊂ C selects "statistically optimal" clusterings.

Remark: α could minimize a cost function, where C_γ is a close approximation set of the minimum.

Model selection problem: Which properties should a good clustering algorithm possess?

Stability! Small changes of the data should yield similar clusterings.
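Stability presupposes a distance between clusterings that is invariant under renaming the groups. A minimal self-contained sketch (function name is our own; brute force over label permutations, so only practical for small k):

```python
import itertools
import numpy as np

def partition_distance(c1, c2, k):
    """Distance between two clusterings: the minimal fraction of objects
    labeled differently, minimized over all label permutations, since
    clusterings are equivalence classes under relabeling."""
    c1, c2 = np.asarray(c1), np.asarray(c2)
    best = 1.0
    for perm in itertools.permutations(range(k)):
        relabeled = np.array([perm[label] for label in c2])
        best = min(best, float(np.mean(c1 != relabeled)))
    return best

# Identical partitions with swapped label names are at distance 0.
d0 = partition_distance([0, 0, 1, 1], [1, 1, 0, 0], k=2)
# Moving one object to the other group costs 1/4.
d1 = partition_distance([0, 0, 1, 1], [0, 1, 1, 1], k=2)
```

A stable algorithm keeps this distance small when the data are slightly perturbed.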

Slide 13

Why risk approximation?

Data often contain noise! Very frequently, data are best modeled as random variables.

An empirically optimal clustering is often statistically indistinguishable from other, equally plausible data partitionings!

Data noise reduces the resolution in data space!

=> This quantization induces a quantization of the hypothesis class for structures.

Slide 14

Empirical Risk Approximation

Learning: sample typical solutions of an approximation set given data X^(1):

C_γ^(1) ≡ C_γ(X^(1)) ≡ { c : d(c(X^(1)), c^⊥(X^(1))) ≤ γ }

Algorithm: Gibbs sampling of clusterings at temperature T explores the approximation set.

Interpretation: T controls the resolution of the hypothesis class, i.e., the minimal similarity of statistically indistinguishable structures.
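A minimal sketch of such a Gibbs sampler under a k-means style cost (our own simplified construction, not the exact sampler of the talk): at low temperature the sampled clusterings stay near the empirical minimizer c^⊥; raising T widens the set of sampled, statistically indistinguishable clusterings.

```python
import numpy as np

def gibbs_sweep(X, labels, k, T, rng):
    """One Gibbs sweep over objects: resample each label with
    p(j) proportional to exp(-cost_j / T), where cost_j is the squared
    distance to the current mean of cluster j (k-means style local cost)."""
    for i in range(len(X)):
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else X[rng.integers(len(X))] for j in range(k)])
        costs = ((X[i] - centers) ** 2).sum(axis=1)
        p = np.exp(-(costs - costs.min()) / T)   # shift for numerical stability
        labels[i] = rng.choice(k, p=p / p.sum())
    return labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.2, (15, 2)), rng.normal(5.0, 0.2, (15, 2))])
labels = np.array([0] * 15 + [1] * 15)           # start at the optimal clustering
labels = gibbs_sweep(X, labels, k=2, T=1e-3, rng=rng)
```

At T = 1e-3 the cost gaps dominate and the sweep leaves the optimal clustering essentially untouched; at large T the labels approach uniform resampling.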

Slide 15

Approximate optimization and information theory

Problem: Data processing in machine learning is often formulated as an optimization question. => Noisy data require robustness, i.e., generalization.

Use approximate optimization results as a code: "communication" is achieved via approximate optimization of instances, since test instances are considered to be perturbed training instances.

Slide 16

Two-instance setting for learning

[Figure: instance/data space containing training instance X^(1) and test instance X^(2); solution spaces (hypothesis class) with 2^n training solutions and 2^n test solutions (2^(n 2) states for graph cut); x marks the ERM solutions; approximation sets C_γ^(1) and C_γ^(2), and the image φ ∘ C_γ^(2) of the test set under the mapping φ(c)]

Slide 17

Figure-Ground Segmentation by Graph Cut

[Figure: a graph cut test problem on nodes 1 to 9 with its adjacency matrix, and 2^M graph cut code problems obtained by permuting the node indices, each shown with its permuted adjacency matrix]

Slide 18

Robust Approximations by Stability

[Diagram: sender and receiver share the risk function R(·, X^(1)); the problem generator PG supplies X^(1)]

Define a set of code problems {σ_1, ..., σ_{2^M}}: each permutation σ yields a problem σ ∘ X^(1) with optimal clustering c^⊥(σ ∘ X^(1)), where c^⊥ = argmin_c R(c, X^(1)).

Slide 19

Robust Approximations by Stability

[Diagram: sender, problem generator, receiver]

Estimate the coding error:

1. The sender sends a permutation index s to the problem generator.
2. The problem generator draws X^(1), X^(2) ~ P(X) and sends a new problem X̃^(2) = σ_s ∘ X^(2) with permuted indices, i.e. the risk R(·, σ_s ∘ X^(2)), to the receiver without revealing s.
3. The receiver identifies the permutation by comparing approximation sets, matching c^⊥(σ_s ∘ X^(1)) against c^⊥(X̃^(2)).

Slide 20

Communication Process

The receiver has to compare the set of clusterings C_γ(X^(1)) of the training instance (code problem) with the approximate clusterings C_γ(X^(2)) of the test data.

Define a mapping φ : C(X^(2)) → C(X^(1)).

Decoding:

σ* = argmax_σ |C_γ(σ ∘ X^(1)) ∩ φ(C_γ(X̃^(2)))|,

accepted if |C_γ(σ* ∘ X^(1)) ∩ φ(C_γ(X̃^(2)))| / |C_γ(σ* ∘ X^(1))| ≥ 1 − ε.
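The decoding rule can be sketched on toy approximation sets represented as label tuples (all names are our own; as a simplifying assumption of this sketch, φ is taken to be the identity on canonical, renaming-invariant label vectors):

```python
import itertools

def canon(labels):
    """Canonical form of a clustering under label renaming:
    relabel groups by order of first appearance."""
    seen = {}
    return tuple(seen.setdefault(l, len(seen)) for l in labels)

def decode(train_sets, test_set, eps=0.05):
    """Receiver's rule: pick the permutation index s whose approximation set
    overlaps most with the (mapped) test approximation set; accept only if
    the overlap fraction reaches 1 - eps, else report a decoding failure."""
    test = {canon(c) for c in test_set}
    best_s, best_tr = max(enumerate(train_sets),
                          key=lambda st: len({canon(c) for c in st[1]} & test))
    tr = {canon(c) for c in best_tr}
    return best_s if len(tr & test) / len(tr) >= 1 - eps else None

train_sets = [
    [(0, 0, 1), (0, 1, 1)],          # approximation set for sigma_1 . X1
    [(0, 1, 0)],                     # approximation set for sigma_2 . X1
]
test_set = [(1, 1, 0), (0, 1, 1)]    # C_gamma(X2~), labels arbitrarily renamed
s_hat = decode(train_sets, test_set)
```

Here the first code problem's approximation set fully overlaps the test set after relabeling, so index 0 is decoded.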

Slide 21

Error Analysis and Approximation Capacity

The approximations of sender and receiver have little in common! => Irreproducibility. This condition determines the approximation precision.

The approximations of the test problem have a large overlap with the approximations of a wrong training problem! => Confusion.

=> Select the model with maximal information capacity, i.e., high precision and high noise robustness.

Slide 24

Approximation Capacity

Condition of vanishing total error:

M log 2 < log |φ ∘ C_γ(σ_s ∘ X^(2)) ∩ C_γ(σ_j ∘ X^(1))| − log |C_γ(σ_s ∘ X^(2))| − H(σ_s)
≡ I(σ_s, C_γ(σ_s ∘ X^(2)))    (mutual information)

Model selection: Maximize the mutual information w.r.t. the topology of the solution space, the metric of the solution space, the cost function, the transfer function and the approximation precision.

Slide 25

Results on Toy Data

Slide 26

Clustering of Microarray Data (dataset from Golub et al., Science, Oct. 1999, pp. 531-537)

Task: Find groups of different leukemia tumour samples (two- and three-class classifications are known).

Problem: The number of groups is unknown a priori.

Result: The 3-means solution recovers 91% of the known sample classifications.

[Figure: 3-means grouping of the Golub et al. data set and estimated instability]

Via stability with k-means: the estimated number of groups is 3.
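This kind of model-order selection can be sketched with a simplified stability protocol (our own construction, not the exact estimator used on the Golub data): perturb the data, recluster, and average the pairwise partition distance; a number of groups with low instability is preferred.

```python
import itertools
import numpy as np

def kmeans(X, k, rng, iters=30, restarts=20):
    """Best-of-restarts Lloyd k-means in plain NumPy."""
    best_labels, best_cost = None, np.inf
    for _ in range(restarts):
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(iters):
            labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
            for j in range(k):
                if np.any(labels == j):
                    centers[j] = X[labels == j].mean(axis=0)
        cost = ((X - centers[labels]) ** 2).sum()
        if cost < best_cost:
            best_labels, best_cost = labels, cost
    return best_labels

def instability(X, k, rng, reps=5, noise=0.3):
    """Mean pairwise partition distance (minimized over label permutations)
    between clusterings of independently perturbed copies of the data."""
    runs = [kmeans(X + rng.normal(0.0, noise, X.shape), k, rng) for _ in range(reps)]
    dists = [min(float(np.mean(a != np.array([p[l] for l in b])))
                 for p in itertools.permutations(range(k)))
             for a, b in itertools.combinations(runs, 2)]
    return float(np.mean(dists))

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.2, (20, 2)) for m in [(0, 0), (4, 0), (2, 3.5)]])
scores = {k: instability(X, k, rng) for k in (2, 3, 4)}
```

On these three well-separated groups the instability at k = 3 is essentially zero. In practice the score is compared across k and normalized against a random baseline, since with a strong optimizer coarser models can also appear stable.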

Slide 27

Scales in Data Analysis and Vision

[Axis: fine to coarse]

Refinement of variable space: increment level of resolution pyramid

Refinement of optimization criterion: increase regularization

Refinement of model order: increase number of segments

Slide 28

Machine learning & statistical modelling

Approximate optimization, stability and information theory: optimization as coding, statistical models

Medical image processing of tissue data for high-throughput cancer diagnosis (MICCAI '08); computational pathology: tissue microarrays for clear cell renal carcinoma

[Diagram: sender, problem generator, receiver; the receiver solves R(·, π ∘ X^(2)), s.t. X^(2) ~ X^(1)]

Slide 29

Renal (Clear) Cell Carcinoma

Renal cell carcinoma (RCC) is one of the ten most frequent malignancies in Western societies.

The prognosis of renal cancer is poor. Many patients already suffer from metastases at first diagnosis.

The identification of biomarkers for the prediction of prognosis (prognostic marker) or response to therapy (predictive marker) is therefore of utmost importance to improve patient prognosis.

Slide 30

Tissue Microarray Preparation

[Figure: kidney specimen; Hematoxylin-Eosin staining; immunohistochemistry with protein-specific staining]

Slide 31

Tissue Microarray Preparation

Slide 32

Tissue Microarray Analysis

Immunohistochemistry identifies a potential biomarker (e.g. MIB-1).

[Figure: stained TMA slides; Kaplan-Meier plot of survival (%) over 0 to 100 months for MIB-1 expressed vs. not expressed, p < 0.0001]

Slide 33

Data Analysis Problem: Variability

[Figure: H&E and MIB-1 stained tissue images]

Slide 34

Tissue Microarray Analysis

Slide 35

[Diagram: network linking the USZ Biobank (images), the USZ database (clinical data), image partitioning, and the ETH / USZ web application]

Slide 36

TMA Annotator: Gold Standard by expert classifications

2 pathologists, ~2500 nuclei, 15% detection mismatch

Slide 37

TMA Classifier: Tumor / Normal Labels

Slide 38

Agreement among pathologists

[Figure: histogram of the number of cell nuclei over the label scale 3, 2, 1, -1, -2, -3, ranging from "cancer" to "no cancer"]

180 cell nuclei were randomly selected out of 9 areas; agreement was reached for 105 cell nuclei.

[Figures: self-confusion matrix of pathologists; consistency among pathologists]

Slide 39

Workflow of computational pathology

1. training examples
2. feature extraction
3. Random Forest learning
4. cell nuclei detection
5. estimate staining
6. application to patient cohort
7. estimated marker distribution (pathologist vs. algorithm)
8. survival analysis

Slide 40

Prediction of survival

[Figure: number of stainings vs. degree of staining (relative units), pathologist PW vs. learning algorithm, n = 133 patients]

The learning algorithm estimates the number of stained cancer cell nuclei more reliably than a trained pathologist for patients with good prognosis.

[Figure: Kaplan-Meier curves for 33% low-risk, 33% intermediate-risk and 33% high-risk patients]

The reliability of diagnosis will be increased!

Slide 41

Improving the quality of training data

Problem: Labeling information by pathologists is often inconsistent! There is no ground truth.

Solution: Filter out samples which are hard to classify, i.e., denoising.

Strategy:
1. Compute how similar the samples appear to the trees of the random forest.
2. Cluster the samples into groups of high similarity.
3. Analyze the label inconsistency in these groups.
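Steps 1 and 3 of the strategy can be sketched as follows (names and toy shapes are our own; the similarity is the usual leaf co-occurrence proximity of a tree ensemble, as delivered by a forest's apply-style method):

```python
import numpy as np

def proximity(leaf_ids):
    """Similarity of samples as seen by a tree ensemble: the fraction of
    trees in which two samples fall into the same leaf. leaf_ids has shape
    (n_samples, n_trees)."""
    return (leaf_ids[:, None, :] == leaf_ids[None, :, :]).mean(axis=2)

def group_inconsistency(groups, labels):
    """Per cluster, the fraction of samples carrying the minority label;
    groups with high values are candidates for filtering out."""
    out = {}
    for g in np.unique(groups):
        counts = np.bincount(labels[groups == g])
        out[int(g)] = 1.0 - counts.max() / counts.sum()
    return out

leaf_ids = np.array([[0, 4], [0, 4], [1, 5]])   # samples 0 and 1 always co-occur
P = proximity(leaf_ids)
inc = group_inconsistency(np.array([0, 0, 0, 1, 1]),
                          np.array([1, 1, 0, 0, 0]))
```

Step 2 (clustering the proximity matrix) is where the Wishart-Dirichlet cluster process of the next slide comes in.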

Slide 42

Wishart-Dirichlet Cluster Process ...

... is a sequence of inner product matrices of growing size, and a random partition B of objects into k blocks.

Similarity matrices are Wishart distributed:

S | B ~ W_d(Σ_B) with Σ_B = I_n ⊗ Σ_0 + B ⊗ Σ_1.

Dirichlet-Multinomial prior over partitions:

P_n(B | λ, k) = k!/(k − k_B)! · Γ(λ) ∏_{b∈B} Γ(n_b + λ/k) / (Γ(n + λ) [Γ(λ/k)]^{k_B}).
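The partition prior is best evaluated in log-space via the log-gamma function. A short sketch (our own) that also checks the normalization over all partitions of three objects into at most k = 2 blocks:

```python
from math import lgamma, log, exp

def log_partition_prior(block_sizes, lam, k):
    """log P_n(B | lambda, k) of the Dirichlet-multinomial prior over
    partitions, as on the slide:
    P = k!/(k - k_B)! * Gamma(lam) * prod_b Gamma(n_b + lam/k)
        / (Gamma(n + lam) * Gamma(lam/k)**k_B)."""
    k_B, n = len(block_sizes), sum(block_sizes)
    lp = sum(log(k - i) for i in range(k_B))     # log of k! / (k - k_B)!
    lp += lgamma(lam) - lgamma(n + lam)
    lp += sum(lgamma(nb + lam / k) - lgamma(lam / k) for nb in block_sizes)
    return lp

# Partitions of 3 objects with at most 2 blocks: one {1,2,3} partition
# and three of shape {2 objects} + {1 object}; their probabilities sum to 1.
total = exp(log_partition_prior([3], 1.0, 2)) \
      + 3 * exp(log_partition_prior([2, 1], 1.0, 2))
```

Working in log-space keeps the evaluation stable even for large block sizes, where the Gamma factors overflow.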

Slide 43

Application: Automated TMA Analysis

Dataset: 500 cancerous and 500 normal nuclei from TMA spots of renal clear cell carcinoma patients.

Malignant/benign classification by Random Forest yields 36% error.

Enhancement: search for subgroups of highly discriminative nuclei. The similarity matrix is defined by the proximity of nuclei in the tree ensemble.

Some negative eigenvalues of the similarity matrix require a shift-invariant method.

[Figure: co-membership probabilities vs. original similarities]

Slide 44

Classification Results

A Random Forest trained on the subset of nuclei from clusters 2 and 3 achieves a 19.4% test error. => Importance of quality assessment prior to classification!

This strategy overcomes a severe problem in the design of computational TMA analysis tools.

Slide 45

Web Application (http://comp-path.inf.ethz.ch)

[Architecture: a client-side JS application with image viewer (HTML, AJAX) talks to an ASP.Net / C# web application on IIS (user management, clinical data), which connects to an R (D)COM server (ranking, survival statistics, plotting) and an SQL database]


Future challenges in Pattern Recognition

Machine learning has matured into an essential method of computer science, used to generate complex statistical models for computational biology and visual computing, but also for other core areas of informatics.

Statistical modeling and algorithmics, as core problems of computer science, can be studied exceptionally well in the area of pattern recognition.

We will provide society with the intelligent technology of the 21st century!