unsupervised learning with random forest predictors: applied to tissue microarray data
DESCRIPTION
Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data. Steve Horvath Biostatistics and Human Genetics University of California, LA. Contents. Tissue Microarray Data Random forest (RF) predictors Understanding RF clustering - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/1.jpg)
Unsupervised Learning with Random Forest Predictors:
Applied to Tissue Microarray Data
Steve HorvathBiostatistics and Human Genetics
University of California, LA
![Page 2: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/2.jpg)
Contents
• Tissue Microarray Data• Random forest (RF) predictors• Understanding RF clustering
– Shi, T. and Horvath, S. (2006) “Unsupervised learning using random forest predictors” J. Comp. Graph. Stat.
• Applications to Tissue Microarray Data:• Shi et al (2004) “Tumor Profiling of Renal Cell
Carcinoma Tissue Microarray Data” Modern Pathology
• Seligson DB et al (2005) Global histone modification patterns predict risk of prostate cancer recurrence. Nature
![Page 3: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/3.jpg)
Acknowledgements• Former students &
Postdocs for TMA– Tao Shi, PhD– Tuyen Hoang, PhD – Yunda Huang, PhD– Xueli Liu, PhD
UCLA Tissue Microarray
Core– David Seligson, MD– Aarno Palotie, MD – Arie Belldegrun, MD– Robert Figlin, MD– Lee Goodglick, MD– David Chia, MD– Siavash Kurdistani,
MD
![Page 4: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/4.jpg)
Tissue Microarray Data
![Page 5: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/5.jpg)
Tissue Microarray
DNA Microarray
![Page 6: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/6.jpg)
Tissue Array Section
~700 TissueSamples
0.6 mm 0.2mm
![Page 7: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/7.jpg)
Ki-67 Expression in Kidney Cancer
High Grade Low Grade
Message: brown staining related to tumor grade
![Page 8: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/8.jpg)
Multiple measurements per patient:Several spots per tumor sample and several “scores” per spot
• Maximum intensity = Max
• Percent of cells staining = Pos
• Spots have a spot grade: NL,1,2,.
• Each patients (tumor sample) is usually represented by multiple spots
– 3 tumor spots –1 matched normal spot
![Page 9: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/9.jpg)
Properties of TMA Data
• Highly skewed, non-normal, semi-continuous.– Often a good idea to model as
ordinal variables with many levels.
• Staining scores of the same markers are highly correlated
![Page 10: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/10.jpg)
0 20 40 60 80 100
05
01
00
15
02
00
25
0
0 20 40 60 80 100
05
01
00
15
02
00
25
0
0 20 40 60 80 100
05
01
00
15
02
00
0 0.5 1 1.5 2 2.5 3
05
01
00
15
0
0 0.5 1 1.5 2 2.5 3
05
01
00
15
02
00
0 0.5 1 1.5 2 2.5 3
05
01
00
15
0
P53 CA9 EpCamPercent of Cells Staining(POS)
Maximum Intensity (MAX)
Histogram of tumor marker expression scores: POS and MAX
![Page 11: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/11.jpg)
Thresholding methods for tumor marker expressions
• Since clinicians and pathologists prefer thresholding tumor marker expressions, it is natural to use statistical methods that are based on thresholding covariates, e.g. regression trees, survival trees, rpart, forest predictors etc.
• Dichotomized marker expressions are often fitted in a Cox (or alternative) regression model– Danger: Over-fitting due to optimal cut-off selection.– Several thresholding methods and ways for adjusting
for multiple comparisons are reviewed in
• Liu X, Minin V, Huang Y, Seligson DB, Horvath S (2004) Statistical Methods for Analyzing Tissue Microarray Data. J of Biopharmaceutical Statistics. Vol 14(3) 671-685
![Page 12: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/12.jpg)
Tumor class discoveryKeywords: unsupervised learning, clustering
![Page 13: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/13.jpg)
Tumor Class Discovery
• Molecular tumor classes=clusters of patients with similar gene expression profiles
• Main road for tumor class discovery– DNA microarrays– Proteomics etc– unsupervised learning: clustering, multi-dimensional
scaling plots
• Tissue microarrays have been used for tumor marker validation– supervised learning, Cox regression etc
• Challenge: show that tissue microarray data can be used in unsupervised learning to find tumor classes– road less travelled
![Page 14: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/14.jpg)
Tumor Class Discovery using DNA Microarray
Data• Tumor class
discovery entails using a unsupervised learning algorithm (e.g hierarchical, k-means, clustering etc.) to automatically group tumor samples based on their gene expression pattern.
Bullinger et al. N Engl J Med. 2004
![Page 15: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/15.jpg)
Clusters involving TMA data may have unconventional shapes:
Low risk prostate cancer patients are colored in black.
0 20 40 60 80 100
020
4060
8010
0
H3K18AC
H3K
4diM
e
•Scatter plot involving 2 `dependent’ tumor markers. The remaining, less dependent markers are not shown.•Low risk cluster can be described using the following ruleMarker H3K4 > 45% and H3K18 > 70%.•The intuition is quite different from that of Euclidean distance based clusters.
![Page 16: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/16.jpg)
Unconventional shape of a clinically meaningful patient cluster
• 3 dimensional scatter plot along tumor markers
• Low risk patients are colored in black
MARKER 1
MARKER 2
![Page 17: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/17.jpg)
How to cluster patients on the basis of Tissue Microarray
Data?
![Page 18: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/18.jpg)
A dissimilarity measure is an essential input for tumor class
discovery
• Dissimilarities between tumor samples are used in clustering and other unsupervised learning techniques
• Commonly used dissimilarity measures include Euclidean distance, 1 - correlation
![Page 19: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/19.jpg)
Challenge• Conventional dissimilarity measures
that work for DNA microarray data may not be optimal for TMA data.– Dissimilarity measure that are based on
the intuition of multivariate normal distributions (clusters have elliptical shapes) may not be optimal
– For tumor marker data, one may want to use a different intuition: clusters are described using thresholding rules involving dependent markers.
– It may be desirable to have a dissimilarity that is invariant under monotonic transformations of the tumor marker expressions.
![Page 20: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/20.jpg)
We have found that a random forest (Breiman 2001) dissimilarity can work well in the unsupervised analysis of TMA data.Shi et al 2004, Seligson et al 2005.http://www.genetics.ucla.edu/labs/horvath/RFclustering/RFclustering.htm
![Page 21: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/21.jpg)
Kidney cancer:Comparing PAM clusters that result from using the RF dissimilarity vs the Euclidean distance
0 2 4 6 8 10 12
0.0
0.2
0.4
0.6
0.8
1.0
Time to death (years)
Su
rviv
al
RF cluster 1, Euclid cluster 1RF cluster 1, Euclid cluster 2RF cluster 2, Euclid cluster 1RF cluster 2, Euclid cluster 2
Message: In this application,RF clusters are more meaningfulregardingsurvival time
Kaplan Meier plots for groups defined by cross tabulating patients according to their RF and Euclidean distance cluster memberships.
![Page 22: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/22.jpg)
The RF dissimilarity is determined by dependent tumor markers
• The RF dissimilarity focuses on the most dependent markers (1,2).
• In some applications, it is good to focus on markers that are dependent since they may constitute a disease pathway.
• The Euclidean distance focuses on the most varying marker (4)
Marker 1 2 3 4 5 6 7 8
0 10050
Patientssorted by cluster
Tumor markers
![Page 23: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/23.jpg)
The RF cluster can be described using a thresholding rule involving
the most dependent markers• Low risk patient if
marker1>cut1 & marker2> cut2
• This kind of thresholding rule can be used to make predictions on independent data sets.
• Validation on independent data set
0 20 40 60 80 100
02
04
06
08
01
00
Marker1
Mar
ker2
![Page 24: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/24.jpg)
Random Forest PredictorsBreiman L. Random forests. Machine Learning
2001;45(1):5-32http://stat-www.berkeley.edu/users/breiman/RandomFo
rests/
![Page 25: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/25.jpg)
Tree predictors are the basic unit of random forest
predictorsClassification andRegression Trees (CART)by
– Leo Breiman– Jerry Friedman– Charles J. Stone– Richard Olshen
• RPART library in R softwareTherneau TM, et al.
![Page 26: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/26.jpg)
An example of CART
• Goal: For the patients admitted into ER, to predict who is at higher risk of heart attack
• Training data set:– No. of subjects = 215– Outcome variable = High/Low Risk
determined– 19 noninvasive clinical and lab
variables were used as the predictors
![Page 27: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/27.jpg)
High 12%Low 88%
High 17%Low 83%
Is BP > 91?
High 70%Low 30%
High 11%Low 89%
High 50%Low 50%
High 2%Low 98%
High 23%Low 77%
Is age <= 62.5?Classified as high risk!
Classified as low risk!
Classified as high risk! Classified as low risk!
Is ST present?
CART Constructi
onYes No
No
No
Yes
Yes
![Page 28: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/28.jpg)
CART Construction
• Binary-- split parent node into two child nodes
• Recursive -- each child node can be treated as
parent node
• Partitioning-- data set is partitioned into mutually
exclusive subsets in each split
![Page 29: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/29.jpg)
RF Construction
…
![Page 30: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/30.jpg)
Random Forest (RF)
• An RF is a collection of tree predictors such that each tree depends on the values of an independently sampled random vector.
![Page 31: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/31.jpg)
Prediction by plurality voting
• The forest consists of N trees
• Class prediction: – Each tree votes for a class; the
predicted class C for an observation is the plurality, maxC k [fk(x,T) = C]
![Page 32: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/32.jpg)
Random forest predictors give rise to a dissimilarity measure
![Page 33: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/33.jpg)
Intrinsic Similarity Measure
• Terminal tree nodes contain few observations
• If case i and case j both land in the same terminal node, increase the similarity between i and j by 1.
• At the end of the run divide by 2 x no. of trees.
• Dissimilarity = sqrt(1-Similarity)
![Page 34: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/34.jpg)
High 12%Low 88%
High 17%Low 83%
Is BP > 91?
High 70%Low 30%
High 11%Low 89%
High 50%Low 50%
High 2%Low 98%
High 23%Low 77%
Is age <= 62.5?
Is ST present?
Age BP …Patient 1: 50 85 …Patient 2: 45 80 …Patient 3: … … …
Yes No
No
No
Yes
Yes• patients 1 and 2 end up in the same terminal node
• the proximity between them is increased by 1
![Page 35: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/35.jpg)
Unsupervised problem as a Supervised problem (RF implementation)
• Key Idea (Breiman 2003)– Label observed data as class 1– Generate synthetic observations and
label them as class 2– Construct a RF predictor to distinguish
class 1 from class 2– Use the resulting dissimilarity measure
in unsupervised analysis
![Page 36: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/36.jpg)
Two standard ways of generating synthetic
covariatesindependent sampling from each of the univariate distributions of the variables (Addcl1 =independent marginals).
independent sampling from uniforms such that each uniform has range equal to the range of the corresponding variable (Addcl2).
The scatter plot of original (black) and synthetic (red) data based on Addcl2 sampling.
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
x1
x2
![Page 37: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/37.jpg)
RF clustering
• Compute distance matrix from RF– distance matrix = sqrt(1-similarity
matrix)
• Conduct partitioning around medoid (PAM) clustering analysis
– input parameter = no. of clusters k
![Page 38: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/38.jpg)
Understanding RF Clustering
(Theoretical Studies)Shi, T. and Horvath, S. (2005) “Unsupervised learning using random forest predictors” J.
Comp. Graph. Stat
![Page 39: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/39.jpg)
Abstract:Random forest dissimilarity• Intrinsic variable selection focuses on dependent
variables– Depending on the application, this can be attractive
• Resulting clusters can often be described using thresholding rulesattractive for TMA data.
• RF dissimilarity invariant to monotonic transformations of variables
• In some cases, the RF dissimilarity can be approximated using a Euclidean distance of ranked and scaled features.
• RF clustering was originally suggested by L. Breiman (RF manual). Theoretical properties are studied as part of the dissertation work of Tao Shi. Technical report and R code can be found at www.genetics.ucla.edu/labs/horvath/RFclustering/RFclustering.htm www.genetics.ucla.edu/labs/horvath/kidneypaper/RCC.htm
![Page 40: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/40.jpg)
Geometric interpretation of RF clusters
RF cuts along the feature axes that isolate synthetic
from observed observations will lead to clusters.
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
var1
var2
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
var1
var2
1
2
3
3
22
1
1
2
4
43
1
1
1
4
3
4
3
2
3
3
22
2
3
3
2
4
2
3
3
4
22
1
3
4
2
1
4
2
1
4
1
4
1
2
4
33
1
3
3
4 4
2
3
2
4
33
3
1
3
1
3
1
2
4
1
4
4
4
4
2
1
4
3
2
4
4
1
2
1
44
1
2
4
2
3
1
343
4
4
2
1
5555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555
Highly unusual synthetic data lead to 4 clusters
Original data: no cluster structure according to Euclidean distance
![Page 41: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/41.jpg)
Geometric interpretation of RF clusters
RF cuts along the feature axes that isolate synthetic
from observed observations will lead to clusters.
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
var1
va
r2
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
var1
va
r2
1
2
3
3
22
1
1
2
4
43
1
1
1
4
3
4
3
2
3
3
22
2
3
3
2
4
2
3
3
4
22
1
3
4
2
1
4
2
1
4
1
4
1
2
4
33
1
3
3
4 4
2
3
2
4
33
3
1
3
1
3
1
2
4
1
4
4
4
4
2
1
4
3
2
4
4
1
2
1
44
1
2
4
2
3
1
3 43
4
4
2
1
55555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555
![Page 42: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/42.jpg)
RF clustering is not rotationally invariant
Cuts along the axes succeed at separating observed data from Synthetic data.
Cuts along the axes do notseparate observed from synthetic(turquoise) data.
![Page 43: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/43.jpg)
Simulated Example ExRule: contrast RF dissimilarity with Euclidean distance
![Page 44: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/44.jpg)
Scatter plot of 2 signal variables
Cluster can be described by threshold rules.150 observations in each cluster.
Simulated Cluster structure
histogram of other noise variables
X4noiseF
req
ue
ncy
0.0 0.2 0.4 0.6 0.8 1.0
02
04
06
0
Histogram of noise variables
X3noise
X4-X100.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
X1
X2
Histogram of X3
X3
Fre
qu
en
cy
0.0 0.2 0.4 0.6 0.8 1.0
05
01
00
15
0
![Page 45: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/45.jpg)
Example ExRule
-0.1 0.0 0.1
-0.1
50.
000.
100.
20
Addcl1, cMDS
2 4 6 8 10
1020
3040
Addcl1, Variable Imp.
variables
Gin
i Ind
ex M
easu
re
-0.6 -0.2 0.2 0.6
-0.5
0.0
0.5
Euclidean, cMDS
2 4 6 8 10
0.10
0.15
0.20
0.25
Variable Variance
variables
Var
ianc
e
Black if X1>0.8 & X3=0Red if X1>0.8 & X3=1Green if X1<0.8 & X3=0Blue if X1<0.8 & X3=1
Message: RF clusters correspond to variable X1while Euclidean clusters correspond to X3.
![Page 46: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/46.jpg)
The clustering results for example ExRule
• Addcl1 dissimilarity focuses on most dependent variables clusters are determined by cuts along variables X1 and X2.
• Resulting clusters can be described using a simple thresholding rule.
• Euclidean distance focuses on most varying variable X3 PAM clusters and MDS point clouds are driven by X3.
![Page 47: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/47.jpg)
Typical Addcl2 ExampleFew independent covariates contains
cluster info (binary signal), rest are noise
Example:One binary variableRest random uniform
var1
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
1.0
1.2
1.4
1.6
1.8
2.0
0.0
0.2
0.4
0.6
0.8
1.0
var2
var3
0.0
0.2
0.4
0.6
0.8
1.0
1.0 1.2 1.4 1.6 1.8 2.0
0.0
0.2
0.4
0.6
0.8
1.0
0.0 0.2 0.4 0.6 0.8 1.0
var4
Pairwise scatter plot
![Page 48: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/48.jpg)
Nature of Addcl2 RF clustering
Addcl1 completely fails.
Addcl2 clustering works well
![Page 49: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/49.jpg)
RF dissimilarity vs. Euclidean distance (DNA Microarray Data)
Euclidean Distance
RF
Dis
tanc
e
20 40 50 60 70 80 90
0.75
0.80
0.85
0.90
EuclidDist (Standardized Ranks)20 40 50 60
0.75
0.80
0.85
0.90
![Page 50: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/50.jpg)
Theoretical reasons for using an RF dissimilarity
for TMA data• Main reasons
– natural way of weighing tumor marker contributions to the dissimilarity
• The more related a tumor marker is to other tumor markers the more it contributes to the definition of the dissimilarity
– no need to transform the often highly skewed features• based feature ranks
– Chooses cut-off values automatically– resulting clusters can often be described using simple
thresholding rules• Other reasons
– elegant way to deal with missing covariates– intrinsic proximity matrix handles mixed variable types well
• CAVEAT: The choice of the dissimilarity should be determined by the kind of patterns one hopes to find. There will be situations when other dissimilarities are preferrable.
![Page 51: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/51.jpg)
Applications to prostatetissue microarray data
Seligson DB, Horvath S, Shi T, Yu H, Tze S, Grunstein M, Kurdistani SK (2005) Global histone modification patterns
predict risk of prostate recurrence. Nature
![Page 52: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/52.jpg)
![Page 53: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/53.jpg)
Analysis Outline
• Used RF clustering to find distinct patient clusters without regard to outcome
• Relating the clusters to clinical information showed that patient clusters have distinct PSA recurrence profiles
• Constructed a rule for predicting cluster membership
• Applied this rule to an independent validation data set to show that the rule predicts PSA recurrence
![Page 54: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/54.jpg)
Cluster Analysis of Low Gleason Score Prostate Samples
(UCLA data)
![Page 55: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/55.jpg)
1) Construct a tumor marker rule for predicting RF cluster membership.
2) Validate the rule predictions in an independent data set
Threshold Rule Validation
![Page 56: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/56.jpg)
Discussion Prostate TMA Data
• Very weak evidence that individual markers predict PSA recurrence
• None of the markers validated individually
• However, cluster membership was highly predictive, i.e the rule could be validated in an independent data set.
![Page 57: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/57.jpg)
Summary
• We have been motivated by the special features of TMA data and explored the use of RF dissimilarity in clustering analysis.
• We have carried out theoretical studies to gain more insights into RF clustering.
• We have applied RF clustering to different types of genomic data such as TMA, DNA microarray, genomic sequence (Allen et al. 2003) and SAGE (unpublished) data.
![Page 58: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/58.jpg)
Acknowledgements• Former students &
Postdocs for TMA– Tao Shi, PhD– Tuyen Hoang, PhD – Yunda Huang, PhD– Xueli Liu, PhD
• Special Consultant– Panda Bamboo, PhD
UCLA Tissue Microarray Core– David Seligson, MD– Aarno Palotie, MD – Arie Belldegrun, MD– Robert Figlin, MD– Lee Goodglick, MD– David Chia, MD– Siavash Kurdistani, MD– ETC
![Page 59: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/59.jpg)
References RF clustering• Unsupervised learning tasks in TMA data analysis
– Review random forest predictors (introduced by L. Breiman)
• Shi, T. and Horvath, S. (2005) “Unsupervised learning using random forest predictors” Journal of Computational and Graphical Statistics
– www.genetics.ucla.edu/labs/horvath/RFclustering/RFclustering.htm
• Application to Tissue Array Data – Shi, T., Seligson, D., Belldegrun, A. S., Palotie, A., Horvath, S.
(2004) Tumor Profiling of Renal Cell Carcinoma Tissue Microarray Data. Modern Pathology
– Seligson DB, Horvath S, Shi T, Yu H, Tze S, Grunstein M, Kurdistani S (2005) Global histone modification patterns predict risk of prostate cancer recurrence. Nature
![Page 60: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/60.jpg)
Applications to renal cell carcinoma tissue microarray data
Shi T, Seligson D, Belldegrun AS, Palotie A, Horvath S (2005) Tumor Classification by Tissue Microarray Profiling: Random Forest Clustering Applied
to Renal Cell Carcinoma. Mod Pathol. 2005 Apr;18(4):547-57.
![Page 61: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/61.jpg)
TMA Data
• 366 patients with Renal Cell Carcinoma (RCC) admitted to UCLA between 1989 and 2000.
• Immuno-histological measures of 8 tumor markers were obtained from tissue microarrays constructed from the tumor samples of these patients.
![Page 62: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/62.jpg)
MDS Plot of All the RCC Patients
• Colored by their RF cluster and labeled by tumor subtypes.
![Page 63: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/63.jpg)
Interpreting the clusters in terms of survival
Clustering
label
Non clear
Cellpatients
Clear cell
patients
1 20 307
2 30 9
![Page 64: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/64.jpg)
Hierarchical clustering with Euclidean distance leads to less
satisfactory results
11 1
11 1
11
1 11
1 11 1 1
01 1
1 11
1 11
1 11
1 11
1 11
11 1
11
1 11
1 11 1
11 1
11
1 11 1
11
1 11
1 11 1
1 11 1
11 1 1
11 1
1 1 1 11
11 1
11 1 11 1
11 1 1 1
11 1
1 11
1 11 1
1 11 1
11 1
11 1 1
11 1
1 11 1 1 1 1
1 1 1 11
1 1 11
1 1 1 1 1 11 1
11 1 1 1 11 1 1
1 1 11
1 1 11 1
11
1 11
1 1 1 1 1 1 1 11 1
1 11
11
1 11 1
1 11 1
1 1 1 11
1 11
1 11
1 11
1 1 1 11 1
11 1
11 1
1 1 1 1 1 0 11 0
11
1 11
11
1 11 1
11 1
11 1
1 11
11 1 1 1
1 11
1 11 1
1 11 1
11
0 1 11 1
11
1 11
11 1
01 1
11
0 11
11 1
1 01
1 10
01
1 11 1
01
00 0
11 1
11 1
10 0
0 00 0
1 11 0 0
0 00 0
11
1 01
00 1
10 0
0 10
10 1
1 10
00 0 0 0
0 00
0 00 0
11
0 10
0 01
1 1
05
01
00
15
0
Cluster Dendrogram
hclust (*, "average")dist(KidneyRF)
He
igh
t
Cluster-
ing labe
l
NonclearCell
patients
Clearcell
patients
1 9 (20)
286 (307)
2 41(30)
30 (9)
* RF clustering grouping in red
![Page 65: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/65.jpg)
Molecular grouping is superior to pathological
grouping
0 2 4 6 8 10 12
0.0
0.2
0.4
0.6
0.8
1.0
Time to death (years)
Su
rviv
al
327 patients in cluster 139 patients in cluster 2
0 2 4 6 8 10 12
0.0
0.2
0.4
0.6
0.8
1.0
Time to death (years)
Su
rviv
al
316 clear cell patients50 non-clear cell patients
p = 0.0229p = 9.03e-05
Molecular Grouping Pathological Grouping
![Page 66: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/66.jpg)
Identify “irregular” patients
Clustering label
Non clearCell
patients
Clear cell
patients
1 20 307
2 30 9
0 2 4 6 8 10 12
0.0
0.2
0.4
0.6
0.8
1.0
Time to death (years)
Su
rviv
al
p = 0.00522
9 irregular clear cell patients307 regular clear cell patients
50 non-clear cell patients
![Page 67: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/67.jpg)
`Regular’ Clear Cell Patients
![Page 68: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/68.jpg)
`Regular’ Clear Cell Patients (cont.)
![Page 69: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/69.jpg)
Detect novel cancer subtypes
• Group clear cell grade 2 patients into two clusters with significantly different survival.
![Page 70: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/70.jpg)
Results TMA clustering
• Clusters reproduce well known clinical subgroups– Example: global expression differences
between clear cell and non-clear cell patients• RF clustering allows one to identify
“outlying” tumor samples. • Can detect previously unknown sub-groups• Empirical evidence suggests that RF
clustering is better than standard clustering in this setting (prostate data, unpublished)
![Page 71: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/71.jpg)
Acknowledgements• Former students &
Postdocs for TMA– Tao Shi, PhD– Tuyen Hoang, PhD – Yunda Huang, PhD– Xueli Liu, PhD
UCLA Tissue Microarray
Core– David Seligson, MD– Aarno Palotie, MD – Arie Belldegrun, MD– Robert Figlin, MD– Lee Goodglick, MD– David Chia, MD– Siavash Kurdistani,
MD
![Page 72: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/72.jpg)
THE END
![Page 73: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/73.jpg)
Appendix
![Page 74: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/74.jpg)
Casting an unsupervised problem into a supervised
problem
![Page 75: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/75.jpg)
Detect novel cancer subtypes
• Group clear cell grade 2 patients into two clusters with significantly different survival.
0 2 4 6 8 10 12
0.0
0.2
0.4
0.6
0.8
1.0
K-M curves
Time to death (years)
Su
rviv
al
p value= 0.0125
![Page 76: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/76.jpg)
0.50 0.55 0.60 0.65 0.70 0.75 0.80
01
22
45
67
R2 = 0.299p= 2.95e-12
Average correlation
-lo
g10
(Co
x.p
.val
ue)
![Page 77: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/77.jpg)
RF variable importance vs. Average Corr and Cox p
value
Message: The more correlated a gene isWith other genes the moreImportant it is for the Def
The more important a gene is according to RF, the more important it is for survival prediction
![Page 78: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/78.jpg)
Which multi-dimensional scaling method to use?
cmdscale usually works well with Addcl1 but not with Addcl2 because it may lead to spurious clusters. However isoMDS works well with Addcl2!
-0.2 -0.1 0.0 0.1 0.2
-0.2
-0.1
0.0
0.1
0.2
111
1
1
1
1
1
11 1
1
1
1
1
1
1
11
1 1
1
1
1
1
1
11
1
1
11
1
1
1 1
1
11
1
1
11
1
1
1
1
1
1
1
11
1
1
1
1
1
1
1
1
1
1
1
1
1 1
1
1
1
1
1
11
1
111
1
11
1
1
1
1
1
1
1
1
1
1
1 1
11
1
1
1
1
1
1
-0.2 -0.2 -0.1 0.0 0.1 0.2 0.2
-0.2
-0.1
0.1
0.2
0.2
1
1
1
1
1
1
1
1
1
1 11
1
1
1
1
1
11
11
1
1
1
1111
1
1
1 1
1
1
1 1
1
11
1
1
11
1
1
1
1
1
1
1
11
1
1
1
1
1
1
1
1
1
1
1
11 1
1
1
1
1
1
1
1
1
1
11
1
1
1
1
1
1
11
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
-0.1 0.0 0.1 0.2 0.2
-0.2
-0.1
0.0
0.1
0.2
1
1
2
3
1
2
111
2
1
2
1
1
3
2
1 3
31
1
2
1
1
1
2
1 11
3
3
3
1
1
2 3
11
3
3
2
3
1
2
1
11
1
1
2
1
2
2
3111
1
2
23
2
1
32
2
3
2
1
2
1
3
1
2
11
1
2 2 3
1
1
1
2
1
2
11
2
1
22
2
3
2
1
3
2
2
2
-4 -2 0 2 4
-4-2
02
4
1
1
1
1 1
1
11
1
1
1
1
1
11
1
1
11
11
1
1
1
1
1
1 11
1
1
1 1
1
1
1
11
11
1
11
1
11
1
1 1
1
1
1
1
1
111
1
1
1
1
11
1 11
1
1
1
1
1
1
1
1
1
1
1
11
1
1
1
1
1
1
11
1
1
1
11
1
1
1
1
1
1
1 1
Ad
dcl1
Ad
dcl2
cmdscale isoMDS
![Page 79: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/79.jpg)
The random forest dissimilarityL. Breiman: RF manual
Technical Report: Shi and Horvath 2005http://www.genetics.ucla.edu/labs/horvath/RFclustering/RFclustering.htm
![Page 80: Unsupervised Learning with Random Forest Predictors: Applied to Tissue Microarray Data](https://reader035.vdocument.in/reader035/viewer/2022062217/5681503f550346895dbe3e27/html5/thumbnails/80.jpg)
Frequency plot of the same tumor marker in 2 independent data sets
datsteve$H3K18Ac.Pos
Freq
uenc
y
0 20 40 60 80 100
02
46
8
Histogram of dat2$H3K18ACPos.mean
Marker Expression
Fre
qu
en
cy
0 20 40 60 80 100
05
10
15
20
25
30
35
The cut-off corresponds roughly to the 66% percentile.Thresholding this tumor marker allows one to stratify the cancer patients into high risk and low risk patients. Although the distribution looks very differentthe percentile threshold can be validated and is clinically relevant.
DATA SET 1 Validation Data Set 2