molecular classification of cancer
Molecular Classification of Cancer
Class Discovery and Class Prediction by Gene Expression Monitoring
Overview
Motivation
Microarray Background
Our Test Case
Class Prediction
Class Discovery
Motivation
Importance of cancer classification
Cancer classification has historically relied on specific biological insights
We will discuss a systematic and unbiased approach for recognizing tumor subtypes
Microarray Background
Microarrays enable simultaneous measurement of the expression levels of thousands of genes in a sample
Microarray:
– Glass slide with a matrix of thousands of spots printed onto it
– Each spot contains probes which bind to a specific gene
Microarray Background (cont.)
The process:
– DNA samples are taken from the test subjects
– Samples are dyed with fluorescent colors and placed on the microarray
– Hybridization of DNA and cDNA
The result:
– Spots in the array are dyed in shades of red to green
Microarray Background (cont.)
Microarray data is translated into an n x p table (p – number of genes, n – number of samples), shown here with genes as rows:
          Sample 1   Sample 2
Gene 1    1.04       2.08
Gene 2    3.2        10.5
Gene 3    3.34       1.05
Gene 4    1.85       0.09
Demonstration: http://www.bio.davidson.edu/courses/genomics/chip/chip.html
Our Test Case
38 bone marrow samples from acute leukemia patients (27 ALL, 11 AML)
RNA from the samples was hybridized to microarrays containing probes for 6817 human genes
For each gene, an expression level was obtained
Class Prediction
Initial collection of samples belonging to known classes
Goal: create a “class predictor” to classify new samples
– Look for “informative genes”
– Make a prediction based on these genes
– Test the validity of the predictor
Informative genes
Genes whose expression pattern is strongly correlated with the class distinction
[Figure: expression patterns of a strongly correlated gene and a poorly correlated gene]
Neighborhood Analysis
Are the observed correlations stronger than would be expected by chance?
C represents the AML/ALL class distinction
C* is a random permutation of C and represents a random class distinction
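The comparison between C and C* can be sketched as a permutation test. This is a minimal illustration on synthetic data, not the original analysis: the correlation measure (a signal-to-noise score), the threshold, and the names `signal_to_noise` and `neighborhood_analysis` are assumptions made for the example.

```python
import numpy as np

def signal_to_noise(X, y):
    """Per-gene correlation with a binary class vector y (values 0/1).

    X has shape (n_samples, n_genes). The score (mu1 - mu0) / (sd1 + sd0)
    is an assumption; the slides do not name the measure that was used.
    """
    a, b = X[y == 1], X[y == 0]
    return (a.mean(axis=0) - b.mean(axis=0)) / (a.std(axis=0) + b.std(axis=0) + 1e-12)

def neighborhood_analysis(X, y, threshold, n_perm=200, seed=0):
    """Count genes correlated with C and compare with random permutations C*."""
    rng = np.random.default_rng(seed)
    observed = int(np.sum(np.abs(signal_to_noise(X, y)) >= threshold))
    permuted = np.array([
        np.sum(np.abs(signal_to_noise(X, rng.permutation(y))) >= threshold)
        for _ in range(n_perm)
    ])
    # Fraction of random class distinctions with at least as many such genes
    p_value = float(np.mean(permuted >= observed))
    return observed, permuted, p_value

# Synthetic stand-in for the test case: 38 samples, 500 genes,
# of which the first 50 are shifted in the 11 samples labeled 1.
rng = np.random.default_rng(1)
y = np.array([1] * 11 + [0] * 27)
X = rng.normal(size=(38, 500))
X[y == 1, :50] += 2.0
observed, permuted, p_value = neighborhood_analysis(X, y, threshold=0.8)
```

If the real class distinction yields many more highly correlated genes than the permuted ones do, the observed correlations are unlikely to be due to chance.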
Application to the Test Case
Roughly 1100 genes were more highly correlated with the AML-ALL class distinction than would be expected by chance
Make a Prediction
Use a fixed subset of “informative genes” (most correlated with the class distinction)
Make a prediction on the basis of the expression level of these genes in a new sample
Prediction Algorithm
Each gene Gi votes, depending on whether its expression level Xi in the sample is closer to µAML or µALL
The magnitude of the vote is Wi Vi
– Wi reflects how well the gene is correlated with the class distinction
– $V_i = \left| X_i - \frac{\mu_{AML} + \mu_{ALL}}{2} \right|$ reflects the deviation of Xi from the average of µAML and µALL
Prediction Algorithm (cont.)
The votes for each class are summed to obtain total votes VAML and VALL
Prediction Algorithm (cont.)
The prediction strength is calculated:
$PS = \frac{V_{win} - V_{lose}}{V_{win} + V_{lose}}$
The sample is assigned to the winning class provided that the PS exceeds a predetermined threshold (0.3 in the test case)
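The voting scheme and the prediction strength can be put together in a short sketch. It assumes that the weight Wi is a signal-to-noise score (the slides only say it reflects the correlation with the class distinction); `fit_predictor` and `predict` are hypothetical names, and the data are synthetic.

```python
import numpy as np

def fit_predictor(X, y, n_genes=50):
    """Select informative genes; store their weights and class midpoints.

    X: (n_samples, n_genes_total); y: array of "AML"/"ALL" labels.
    Using a signal-to-noise score as the weight W_i is an assumption.
    """
    aml, all_ = X[y == "AML"], X[y == "ALL"]
    mu_aml, mu_all = aml.mean(axis=0), all_.mean(axis=0)
    w = (mu_aml - mu_all) / (aml.std(axis=0) + all_.std(axis=0) + 1e-12)
    genes = np.argsort(-np.abs(w))[:n_genes]   # most correlated genes
    return genes, w[genes], (mu_aml[genes] + mu_all[genes]) / 2

def predict(model, x, threshold=0.3):
    """Return (label or "uncertain", prediction strength) for one sample x."""
    genes, w, midpoint = model
    # Signed vote: positive when X_i lies on the AML side of the midpoint.
    # Its magnitude is |W_i| * |X_i - midpoint|, i.e. W_i V_i.
    votes = w * (x[genes] - midpoint)
    v_aml = votes[votes > 0].sum()
    v_all = -votes[votes < 0].sum()
    v_win, v_lose = max(v_aml, v_all), min(v_aml, v_all)
    total = v_win + v_lose
    ps = (v_win - v_lose) / total if total > 0 else 0.0
    label = "AML" if v_aml > v_all else "ALL"
    return (label if ps > threshold else "uncertain"), float(ps)

# Synthetic training data: 20 genes are higher in the AML samples.
rng = np.random.default_rng(0)
y = np.array(["AML"] * 11 + ["ALL"] * 27)
X = rng.normal(size=(38, 200))
X[y == "AML", :20] += 3.0
model = fit_predictor(X, y, n_genes=20)

aml_like = np.zeros(200)
aml_like[:20] = 3.0          # looks like the AML average
all_like = np.zeros(200)     # looks like the ALL average
result_aml = predict(model, aml_like)
result_all = predict(model, all_like)
```

A sample whose winning margin is small relative to the total vote gets a low PS and is reported as uncertain rather than forced into a class.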
Testing the Validity of Class Predictors
Cross validation:
– Withhold a sample
– Build a predictor based on the remaining samples
– Predict the class of the withheld sample
– Repeat for each sample
Assess accuracy on an independent set of samples
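The cross-validation loop above can be sketched as follows. The compact weighted-vote predictor inside it is a stand-in under the same assumptions as before (signal-to-noise weights, classes coded 0/1); the function names and the synthetic data are illustrative only.

```python
import numpy as np

def fit(X, y, n_genes):
    """Pick the n_genes most class-correlated genes (assumed S2N score)."""
    a, b = X[y == 1], X[y == 0]
    mu_a, mu_b = a.mean(axis=0), b.mean(axis=0)
    w = (mu_a - mu_b) / (a.std(axis=0) + b.std(axis=0) + 1e-12)
    top = np.argsort(-np.abs(w))[:n_genes]
    return top, w[top], (mu_a[top] + mu_b[top]) / 2

def predict(model, x):
    """Weighted vote; returns (predicted class, prediction strength)."""
    top, w, mid = model
    votes = w * (x[top] - mid)
    v1, v0 = votes[votes > 0].sum(), -votes[votes < 0].sum()
    total = v1 + v0
    ps = abs(v1 - v0) / total if total > 0 else 0.0
    return int(v1 > v0), ps

def leave_one_out(X, y, n_genes=50, threshold=0.3):
    """Withhold each sample in turn and predict it from the rest."""
    correct = errors = uncertain = 0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        # Gene selection is repeated inside the loop so the withheld
        # sample never influences which genes are chosen.
        model = fit(X[keep], y[keep], n_genes)
        label, ps = predict(model, X[i])
        if ps <= threshold:
            uncertain += 1
        elif label == y[i]:
            correct += 1
        else:
            errors += 1
    return correct, errors, uncertain

# Synthetic data with a strong class signal in 20 of 200 genes
rng = np.random.default_rng(0)
y = np.array([1] * 11 + [0] * 27)
X = rng.normal(size=(38, 200))
X[y == 1, :20] += 3.0
correct, errors, uncertain = leave_one_out(X, y, n_genes=20)
```

Cross validation reuses the training samples, so it complements but does not replace the test on an independent set.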
Application to the Test Case
50 genes most highly correlated with the AML-ALL distinction were chosen
A class predictor based on these genes was built
Application to the Test Case
Performance in cross validation:
– Out of 38 samples there were 36 predictions and 2 uncertainties (PS < 0.3)
– 100% accuracy
– PS median 0.77
Application to the Test Case (cont.)
Performance on an independent set of samples:
– Out of 34 samples there were 29 predictions and 5 uncertainties (PS < 0.3)
– 100% accuracy
– PS median 0.73
Genes useful for cancer class prediction may also provide insight into cancer pathogenesis and pharmacology
Comments
Why 50 genes?
– Large enough to be robust against noise
– Small enough to be readily applied in a clinical setting
– Predictors based on between 10 and 200 genes all performed well
Comments (cont.)
Creation of a new predictor involves expression analysis of thousands of genes
Application of the predictor then requires monitoring the expression level of only a few informative genes
Class Discovery
Cluster tumors by gene expression
– Apply a clustering technique to produce presumed classes
Evaluation of the classes:
– Are the classes meaningful?
– Do they reflect true structure?
Clustering Technique - SOMs
SOMs – Self Organizing Maps
Well suited for identifying a small number of prominent classes
– Find an optimal set of “centroids”
– Partition the data set according to the centroids
– Each centroid defines a cluster consisting of the data points nearest to it
We won't go into details about the calculation of SOMs
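For readers who want a concrete picture of the centroid idea, here is a minimal sketch of a one-dimensional SOM on synthetic data. It is not the implementation used in the study; the linear decay schedules, the Gaussian neighborhood, and all parameter values are assumptions chosen for illustration.

```python
import numpy as np

def train_som(X, n_nodes=2, n_iter=2000, lr0=0.5, sigma0=1.0, seed=0):
    """Train a 1 x n_nodes self-organizing map; returns the node centroids."""
    rng = np.random.default_rng(seed)
    # Start the centroids at randomly chosen data points
    centroids = X[rng.choice(len(X), size=n_nodes, replace=False)].astype(float)
    grid = np.arange(n_nodes)
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1 - frac)                     # learning rate decays toward 0
        sigma = max(sigma0 * (1 - frac), 1e-3)    # neighborhood shrinks over time
        x = X[rng.integers(len(X))]               # pick a random sample
        winner = np.argmin(np.linalg.norm(centroids - x, axis=1))
        # Nodes close to the winner on the grid are pulled toward x as well
        h = np.exp(-((grid - winner) ** 2) / (2 * sigma ** 2))
        centroids += lr * h[:, None] * (x - centroids)
    return centroids

def assign_clusters(X, centroids):
    """Partition the data: each point goes to its nearest centroid."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

# Two well-separated synthetic groups of 25 and 13 samples
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(25, 20)),
               rng.normal(5, 1, size=(13, 20))])
centroids = train_som(X)
labels = assign_clusters(X, centroids)
```

With only two nodes and a shrinking neighborhood, the map behaves much like an online two-centroid clustering, which is why it suits the search for a small number of prominent classes.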
Application of a two-cluster SOM to the test case
Class A1: 24 ALL, 1 AML
Class A2: 10 AML, 3 ALL
Quite effective at automatically discovering the two types of leukemia
Not perfect
Evaluation of the Classes
How can we evaluate such classes if the “right” answer is not already known?
Hypothesis: class discovery can be tested by class prediction
– If the classes reflect true structure, then a class predictor based on them should perform well
Let’s test this hypothesis...
Validity of Predictors Based on A1 and A2
Predictors based on different numbers of informative genes performed well
For example: a 20-gene predictor
Validity of Predictors Based on A1 and A2 (cont.)
Performance on independent samples:
– PS median 0.61
– Prediction made for 74% of samples
Validity of Predictors Based on A1 and A2 (cont.)
Performance in cross validation:
– 34 accurate predictions with high prediction strength
– One error
– Three uncertain predictions
[Figure: marks the one cross-validation error and 2 of the 3 uncertain cross-validation predictions]
Iterative Procedure
Use a SOM to initially cluster the data
Construct a predictor
Remove samples that are not correctly predicted in cross-validation
Use the remaining samples to generate an improved predictor
Test on an independent data set
Validity of Predictors Based on Random Clusters
Performance:
– Poor accuracy in cross validation
– Low PS on independent samples
Conclusion
The AML-ALL distinction could have been automatically discovered and confirmed without previous biological knowledge
Application of a 4-cluster SOM to the Test Case
Evaluation of the Classes
Complement approach:
– Construct class predictors to distinguish each class from its complement
Pair-wise approach:
– Construct class predictors to distinguish between each pair of classes Ci, Cj
– Perform cross validation only on samples in Ci and Cj
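The two approaches differ only in how the binary problems are built from the cluster labels, which the sketch below makes explicit. The helper names and the toy label list are hypothetical; the class names B1 and B2 are assumed by analogy with B3 and B4.

```python
import numpy as np
from itertools import combinations

def complement_problems(labels):
    """Yield (class, sample indices, 0/1 targets): one class versus all others."""
    labels = np.asarray(labels)
    for c in np.unique(labels):
        # Every sample takes part; target is 1 for members of class c
        yield c, np.arange(len(labels)), (labels == c).astype(int)

def pairwise_problems(labels):
    """Yield ((Ci, Cj), sample indices, 0/1 targets) for each pair of classes."""
    labels = np.asarray(labels)
    for ci, cj in combinations(np.unique(labels), 2):
        # Only samples in Ci or Cj take part in this problem
        idx = np.flatnonzero((labels == ci) | (labels == cj))
        yield (ci, cj), idx, (labels[idx] == ci).astype(int)

# Toy cluster assignment for eight samples in four presumed classes
labels = ["B1", "B2", "B3", "B4", "B3", "B1", "B4", "B2"]
complement = list(complement_problems(labels))
pairwise = list(pairwise_problems(labels))
n_complement = len(complement)   # one problem per class
n_pairwise = len(pairwise)       # one problem per pair of classes
```

Each yielded problem can then be handed to a class predictor and cross validated on the listed sample indices.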
Evaluation of the Classes
Class predictors distinguished the classes from one another, with the exception of B3 versus B4
Conclusion
The results suggest the merging of classes B3 and B4
The distinction corresponding to AML, B-ALL and T-ALL was confirmed
Uses of Class Discovery
Identify fundamental subtypes of any cancer
Search for fundamental mechanisms that cut across distinct types of cancers
Questions?
Thank you for listening