clustering and classification with applications to...
TRANSCRIPT
![Page 1: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/1.jpg)
1
Clustering and classification with applicationsto microarrays and cellular phenotypes
Gregoire Pau, EMBL [email protected]
European Molecular Biology Laboratory
![Page 2: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/2.jpg)
2
Microarray data
• Clustering– Are there patient groups with similar
expression profiles ?
– Are there groups of genes behaving similarly ?
• Classification– Given known cancer type profiles– Which cancer type has a patient given his
expression profile ?
patientsge
nes
![Page 3: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/3.jpg)
3
Microarray data
• Clustering– Are there patient groups with similar
expression profiles ?
– Are there groups of genes behaving similarly ?
• Classification– Given known cancer type profiles– Which cancer type has a patient given his
expression profile ?
patientsge
nes
![Page 4: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/4.jpg)
4
Microarray data
• Clustering– Are there patient groups with similar
expression profiles ?
– Are there groups of genes behaving similarly ?
• Classification– Given known cancer type profiles– Which cancer type has a patient given his
expression profile ?
patientsge
nes
![Page 5: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/5.jpg)
5
Microarray data
• Clustering– Are there patient groups with similar
expression profiles ?
– Are there groups of genes behaving similarly ?
• Classification– Given known cancer type profiles– Which cancer type has a patient given his
expression profile ?
patientsge
nes
Type A Type B ?
![Page 6: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/6.jpg)
6
Cell phenotypes
• Clustering– Are there similar cell groups ?
• Classification– Given known cell types– What is this cell type ?
cells
features
![Page 7: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/7.jpg)
7
Cell phenotypes
• Clustering– Are there similar cell groups ?
• Classification– Given known cell types– What is this cell type ?
cells
features
![Page 8: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/8.jpg)
8
cells
features
Cell phenotypes
• Clustering– Are there similar cell groups ?
• Classification– Given known cell types– What is this cell type ?
n
nn
a
nn
n
a
a
a
a
?
normal
actinfiber
?
![Page 9: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/9.jpg)
9
Clustering versus classification
• Clustering– Unknown class labels
– Given a measure of similarity between objects– Identification of similar subgroups– Ill-defined problem
• Classification– Known class labels– Prediction/classification/regression of class labels– Well-defined
![Page 10: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/10.jpg)
10
Clustering
![Page 11: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/11.jpg)
11
Clustering
• Identification of similar subgroups within data• Using a similarity measure between objects• Ill-defined problem ⇒ many algorithms exist
• Non-parametric clustering– Agglomerative: Hierarchical clustering
– Partitive: K-means– Partitive: Partitioning Around Medioids (PAM)– Other: Self-organising maps
• Parametric clustering– Gaussian mixture estimation
![Page 12: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/12.jpg)
12
Similarity
• n objects (here, patients)• p parameters (here, genes)
patientsge
nes
gene 1
gene
2
![Page 13: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/13.jpg)
13
Similarity
• n objects (here, patients)• p parameters (here, genes)
patientsge
nes
gene 1
gene
2
![Page 14: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/14.jpg)
14
Many similarity measures
![Page 15: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/15.jpg)
15
Dissimilarity measures
• L2 distance (Euclidean distance) • L1 distance (Manhattan distance)
![Page 16: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/16.jpg)
16
Dissimilarity measures
• Weighted L2 distance • 1 - Pearson correlation
![Page 17: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/17.jpg)
17
Dissimilarity measures
• Lp family– L1 type: d(x, y) = Σi |xi - yi|– L2 type: d(x, y) = sqrt( Σi (xi - yi)2 )
– More generally, Lp type: d(x, y) = ( Σi |xi - yi|p )1/p
– Metrics: positive, symmetric, triangle inequality
• Transformations– Transformation of covariates with f and computation of d(f(x), f(y))
– f could be a log, a normalization method, a weighting function– Example, weighted Euclidean: d(x, y) = sqrt( Σi wi(xi - yi)2 )– Example, Mahalanobis distance: d(x, y)2 = (x-y)tA(x-y), with A = Σ-1
![Page 18: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/18.jpg)
18
Which dissimilarity measure ?
• No universal solutions: it all depends on the objects• L1 distance is less sensitive to outliers• L2 distance is more sensitive• If the object parameters have similar distributions
– Ex: gene expression after normalization– Correlation distance is a popular choice
• If not, object parameters have to be transformed– Ex: heterogeneous parameters (cellular phenotypes)
– Cell A: (size=120, ecc=0.3, x.position=134)– Cell B: (size=90, ecc=0.5, x.position=76)– Cell C: (size= 140, ecc=0.4, x.position=344)
– Ex: un-normalized gene expression sets
![Page 19: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/19.jpg)
19
Clustering
• " Identification of similar subgroups within data "• Ill-defined problem ⇒ many algorithms exist
– Tradeoffs between agglomerative properties, sensitivity, robustness, speed
![Page 20: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/20.jpg)
20
Clustering
• " Identification of similar subgroups within data "• Ill-defined problem ⇒ many algorithms exist
– Tradeoffs between agglomerative properties, sensitivity, robustness, speed
![Page 21: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/21.jpg)
21
Clustering
• " Identification of similar subgroups within data "• Ill-defined problem ⇒ many algorithms exist
– Tradeoffs between agglomerative properties, sensitivity, robustness, speed
![Page 22: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/22.jpg)
22
Hierarchical clustering
• Iterative agglomerative method• Initialisation: each data point is
assigned to an unique cluster• At each step: join most similar
clusters, using between clusterdissimilarity measure
• Iterate until there is only onecluster
• Several linkage variants• In R, function hclust
i = 0
![Page 23: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/23.jpg)
23
Hierarchical clustering
• Iterative agglomerative method• Initialisation: each data point is
assigned to an unique cluster• At each step: join most similar
clusters, using between clusterdissimilarity measure
• Iterate until there is only onecluster
• Several linkage variants• In R, function hclust
i = 300
![Page 24: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/24.jpg)
24
Hierarchical clustering
• Iterative agglomerative method• Initialisation: each data point is
assigned to an unique cluster• At each step: join most similar
clusters, using between clusterdissimilarity measure
• Iterate until there is only onecluster
• Several linkage variants• In R, function hclust
i = 50
![Page 25: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/25.jpg)
25
Hierarchical clustering
• Iterative agglomerative method• Initialisation: each data point is
assigned to an unique cluster• At each step: join most similar
clusters, using between clusterdissimilarity measure
• Iterate until there is only onecluster
• Several linkage variants• In R, function hclust
i = 10
![Page 26: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/26.jpg)
26
Hierarchical clustering
• Iterative agglomerative method• Initialisation: each data point is
assigned to an unique cluster• At each step: join most similar
clusters, using between clusterdissimilarity measure
• Iterate until there is only onecluster
• Several linkage variants• In R, function hclust
i = 3
![Page 27: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/27.jpg)
27
Hierarchical clustering
• Clustering data dendrogram
i = 3
![Page 28: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/28.jpg)
28
Examples
• Gene expression time series
![Page 29: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/29.jpg)
29
Examples
• Gene expression time series
![Page 30: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/30.jpg)
30
Examples
• Gene expression time series
![Page 31: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/31.jpg)
31
Golub et al. leukemia dataset
• Gene expression data of– 25 acute myeloid leukemia
– 47 acute lymphoblastic leukemia– Using 400 most differentiated genes
• Perfect separation
![Page 32: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/32.jpg)
32
k-means
• Iterative partitioning method• Initialisation: k random clusters• Assignment: each point is assigned
to its closest cluster center• Update: cluster centers are
updated with the new members• Iterate through convergence
• In R, function kmeans• Known number of clusters
![Page 33: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/33.jpg)
33
Gaussian mixture estimation
• Well-defined estimation problem– Data is believed to come from a
mixture of k Gaussian distributions– X ~ ωN(µ1, Σ1) ⊕ (1-ω)N(µ2, Σ2)
• EM algorithm– Expectation: Given parameter
estimates, compute classmembership probability
– Maximization: Given classmembership, estimate parametersby maximum likelihood
– Iterate through convergence
• Works well if n >> p– Package mclust
![Page 34: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/34.jpg)
34
Clustering
• Ill-defined problem ⇒ many algorithms exist
• Most important: a relevant dissimilarity measure• Requires cautious interpretation• Still useful tool for data exploration• Prior knowledge (model, dissimilarity measure) should be used, if available
![Page 35: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/35.jpg)
35
Clustering phenotype populations bygenome-wide RNAi and multiparametric imaging
Gregoire Pau, Oleg Sklyar, Wolfgang Huber
EMBL, Heidelberg
Florian Fuchs, Dominique Kranz, Christoph Budjan,
Thomas Horn, Sandra Steinbrink, Angelika Pedal, Michael Boutros
DKFZ, Heidelberg
Molecular Systems Biology, 2010
![Page 36: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/36.jpg)
36
Experimental setup
• Human cervix carcinoma HeLa cells• Genome-wide RNAi screen, testing 22839 genes• Cells are incubated for 48 h and fixed• Staining using DNA (DAPI), Tubulin (Alexa), Actin (TRITC)
• Readout: microscopy images
CD3EAP
![Page 37: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/37.jpg)
37
Phenotypic profile
• Phenotype expressed by a population of cells• Phenotypic profile, vector of p = 13 parameters
– Number of cells– Statistics on cell features (size, eccentricity, …)
– Cell types distribution (normal, metaphase, condensed, protruded…)
CD3EAP
nexteccNextNinta2iNext2AF %BC %C %M %LA %P %
28934.331180.4729342857.356485.27100.8288760.0986470.0495940.0817460.1588170.1793390.0092490.219697
x =
![Page 38: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/38.jpg)
38
Transformation of phenotypic descriptors
• Let x be a phenotypic profile in Rp
• Transformation into a phenoprint, vector of [0,1] scores– For each descriptor k, f(xk) = 1 / (1 + exp( -αk(x - βk))
• Phenotypic distance = L1 distance between phenoprints• 20 parameters (α, β)k to be determined
!
f
![Page 39: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/39.jpg)
39
Distance metric learning• Perturbation of related genes lead more likely to similar phenotypes
than random ones
• EMBL STRING database– About 6,000,000 related protein pairs– Physical interaction, regulation, literature co-citation– Rich but noisy
• We design our distance to be lower in average on related gene pairsthan random ones– Parameters (α, β)k are fitted by minimization of a criterion
– Similar to PAM matrices to compute protein alignment scores
![Page 40: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/40.jpg)
40
Phenotypic hits
• 1820 perturbations show non-null phenoprints
![Page 41: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/41.jpg)
41
Phenotypic map
![Page 42: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/42.jpg)
42
Phenotypic map
![Page 43: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/43.jpg)
43
Elongated phenotype
ELMO2
SH2B2
![Page 44: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/44.jpg)
44
NUF2
CEP164
Mitotic phenotype
![Page 45: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/45.jpg)
45
Giant cells with large nucleus
RRM1
CLSPN
![Page 46: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/46.jpg)
46
Validation
• 22839 siRNA perturbations• 1820 non-null phenoprints• 604 perturbations were retested
– 310 reproduced the phenotypes with an independent siRNA library– Among them, 280 reproduced the phenotypes on U2OS cells
![Page 47: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/47.jpg)
47
Functional inference
• "Giant cells with large nucleus" phenotypic cluster– 50 genes
– RRM1, CLSPN, PRIM2 and SETD8– Mediators of the DNA damage response
• Secondary assays– Cell cycle progression upon depletion
– Protein subcellular localization– Monitoring γH2AX foci formation upon depletion
– Monitoring pChk1 response after gamma irradiation
![Page 48: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/48.jpg)
48
Subcellular localization
• DONSON localizes to the centrosomes• Centrosomes are linked to DNA damage repair
![Page 49: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/49.jpg)
49
Monitoring DDR response
• Depletion of DONSON, SON, CD3EAP and CADM1• Induction of γH2AX foci formation, an early DDR marker
• Inhibition of CHEK1 phosphorylation response upon gamma irradiation
![Page 50: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/50.jpg)
50
Conclusion
• Automated phenotyping method from microscopy images• Prediction of gene function by loss-of-function phenotype similarity• Association of DONSON, SON, CD3EAP and CADM1 to DDR• Data available at http://www.cellmorph.org• Bioconductor/R package: EBImage, imageHTS
![Page 51: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/51.jpg)
51
Classification
![Page 52: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/52.jpg)
52
Cancer prediction
• Known patient gene expression profiles
Acute myeloidleukemia
Healthy
Healthy
?
![Page 53: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/53.jpg)
53
Automatic cell annotation
Training set
Inter Pro Prometa Meta Ana1 Ana2 Telo
![Page 54: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/54.jpg)
54
Prediction of mitotic state
• Based on nucleus size
• None of the two features is a good predictor of mitotic state• Combining them ?
Prophase MetaphaseProphase Metaphase
• Based on nucleus intensity
![Page 55: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/55.jpg)
55
Classification
• Given objects with known labels, predict the label of an unknown object• Well-defined but hard problem• Optimal answers
– Denote Y the outcome and X the data– Bayes formalism: P(Y|X) = P(X|Y) * p(Y) / p(X)– But P(X|Y) is unknown and has to be estimated or modelled– Regression problem: Argminf ||Y - f(X)||2 ⇒ f(x) = E(Y|X=x)
– But E(Y|X=x) has to be estimated
• Algorithms– Linear regression– k-nearest neighbors– Support vector machines
– Kernel methods
• Validation, cross-validation and overfitting
![Page 56: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/56.jpg)
56
Linear classifier
• Denote by X the matrix of features: n samples, p features• Denote by Y the vector of outcomes• Find β that minimize ||Y - Xβ||2
• In R, using lm
nucleus intensity
nucl
eus
size
prometaphase
metaphase
![Page 57: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/57.jpg)
57
k-nearest neighbors
• Each point is assigned to the dominant label among its k-nearest neighbors– f(x) = Avg k in neighb(x)(yk) approximates E(Y|X=x)
• In R, using knn
prometaphase
metaphase
nucleus intensity
nucl
eus
size
![Page 58: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/58.jpg)
58
Which decision boundary ?
Which decision boundary hasthe lowest
prediction error?
High biasLow variance
Low biasHigh variance
high model complexity(needs hundreds of parameter to describe the decision boundary)
low model complexity(needs 2 parameters todescribe the decision boundary)
![Page 59: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/59.jpg)
59
Bias-variance dilemma
![Page 60: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/60.jpg)
60
Cross-validation
• Simple method to estimate the prediction error
• Method– Split the data in K approximately equally sized subsets
– Train the classifier on (K-1) subsets– Test the classifier on the remaining subset. The prediction error is estimated by
comparing the predicted class label with the true class labels.– Repeat the last two steps K times
• Take the classifier that have the lowest prediction error
Training Test
![Page 61: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/61.jpg)
61
• Find the hyperplane that best separates two sets of points• Well-defined minimization problem, tolerant for misclassifications
– Find β that minimize ||w||2 + C|
• In R, using svm from the package e1071
Support vector machine
![Page 62: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/62.jpg)
62
Non-linear case
![Page 63: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/63.jpg)
63
Basis expansion & the kernel trick
• Increase the dimension space if data cannot be linearly separated– Use X2, X3… e. g. || Y - [X;X2;X3]β ||2
– Use splines or model-based separation curves
• Kernel trick– Scalar product xty can be generalized by kernel functions K(x, y)– Kernel functions: K(x, y) = xty ; K(x, y) = exp(-||x - y||/γ)
– SVM ⇒ kernel SVM ; LDA ⇒ kernel LDA ; PCA ⇒ kernel PCA– Complex separation in low-dimension ⇒ linear separation in high-dimension
![Page 64: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/64.jpg)
64
SVM + radial kernel
![Page 65: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/65.jpg)
65
Influence of the kernel parameter
!
" = 0.001
!
" = 0.03
!
" = 0.005
!
" = 0.1
!
" =1
!
" = 2
!
" = 20
!
" = 200
![Page 66: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/66.jpg)
66
Curse of dimensionality
• Low number of parameters– Low complexity
– Low variance– High bias
• High number of parameters– High complexity
– High variance– Low bias– Space is too sparse; estimation is not reliable
• Trade-off must be found by prediction error estimation
![Page 67: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/67.jpg)
67
Conclusion
• Clustering– Ill-defined problem ⇒ many algorithms around
– Most important: a relevant dissimilarity measure– Requires cautious interpretation– Still useful tool for data exploration
• Classification– Well-defined problem– Kernel SVM is a fast and versatile algorithm suitable to many problems
– Most important: prediction error estimation using cross-validation
• Feature selection– Supervised feature selection– Regular penalized methods (e.g. Lasso) are key techniques
![Page 68: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/68.jpg)
68
Going further
The Elements of Statistical LearningHastie, Tibshirani and Friedman
• Statistical learning• Machine learning• Features selection• Classification• Unsupervised clustering• Kernel methods• Neural networks• Boosting
![Page 69: Clustering and classification with applications to ...master.bioconductor.org/help/course-materials/2010/... · Microarray data • Clustering – Are there patient groups with similar](https://reader034.vdocument.in/reader034/viewer/2022051812/602bf6fbc8ccd24caf4f5fef/html5/thumbnails/69.jpg)
69
Acknowledgments
• Thanks to Bernd Fisher, Richard Bourgon, Joerg Rahnenfuehrer !