instance-based classification
![Page 1: Instance-based Classification](https://reader035.vdocument.in/reader035/viewer/2022062221/56813c65550346895da5f281/html5/thumbnails/1.jpg)
Instance-based Classification
• Examine the training samples each time a new query instance is given.
• The relationship between the new query instance and training examples will be checked to assign a class label to the query instance.
![Page 2: Instance-based Classification](https://reader035.vdocument.in/reader035/viewer/2022062221/56813c65550346895da5f281/html5/thumbnails/2.jpg)
KNN: k-Nearest Neighbor
• A test sample x can be best predicted by determining the most common class label among k training samples to which x is most similar.
• $x_j$: the jth training sample; $y_j$: the class label of $x_j$; $N_x$: the set of k nearest neighbors of x in the training set. Estimate the probability that x belongs to the ith class:
![Page 3: Instance-based Classification](https://reader035.vdocument.in/reader035/viewer/2022062221/56813c65550346895da5f281/html5/thumbnails/3.jpg)
KNN: k-Nearest Neighbor, cont'd
• Proportion of the K nearest neighbors that belong to the ith class:
$$\hat{p}(i \mid x) = \frac{\left|\{x_j \in N_x \mid y_j = i\}\right|}{K}$$
• The class i that maximizes the proportion above is assigned as the label of x.
• Variants of KNN: filtering out irrelevant genes before applying KNN.
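The KNN rule can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's implementation: Euclidean distance is an assumption (the slides only say "most similar"), and the function names and toy data are hypothetical.

```python
import math
from collections import Counter

def knn_class_proportions(x, train_X, train_y, k):
    """Return {label: proportion} among the k training samples nearest to x."""
    # Assumption: similarity is measured by Euclidean distance.
    order = sorted(range(len(train_X)), key=lambda j: math.dist(x, train_X[j]))
    neighbors = order[:k]                      # indices of the k nearest samples
    counts = Counter(train_y[j] for j in neighbors)
    return {label: n / k for label, n in counts.items()}

def knn_predict(x, train_X, train_y, k):
    """Assign the label with the largest proportion among the k neighbors."""
    props = knn_class_proportions(x, train_X, train_y, k)
    return max(props, key=props.get)

# Toy two-gene expression profiles (made-up values).
train_X = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1), (1.2, 0.8)]
train_y = ["ALL", "ALL", "AML", "AML", "AML"]
query = (0.2, 0.1)

props = knn_class_proportions(query, train_X, train_y, k=3)
label = knn_predict(query, train_X, train_y, k=3)
print(props, label)
```

With k = 3, two of the three nearest neighbors of the query are ALL samples, so the query is labeled ALL.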
![Page 4: Instance-based Classification](https://reader035.vdocument.in/reader035/viewer/2022062221/56813c65550346895da5f281/html5/thumbnails/4.jpg)
Molecular Classification of Cancer
Class Discovery and Class Prediction by Gene Expression Monitoring
![Page 5: Instance-based Classification](https://reader035.vdocument.in/reader035/viewer/2022062221/56813c65550346895da5f281/html5/thumbnails/5.jpg)
Publication Info
"Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring"
Golub, Slonim, Tamayo, Huard, Gaasenbeek, Mesirov, Coller, Lo, Downing, Caligiuri, Bloomfield, Lander
Appears in Science, Volume 286, October 15, 1999
Whitehead Institute/MIT Center for Genome Research
http://www-genome.wi.mit.edu/cancer
...and Dana-Farber (Boston), St. Jude (Memphis), Ohio State
...additional publications by the same group show similar techniques applied to different disease modalities.
![Page 6: Instance-based Classification](https://reader035.vdocument.in/reader035/viewer/2022062221/56813c65550346895da5f281/html5/thumbnails/6.jpg)
Cancer Classification
Class Discovery: defining previously unrecognized tumor subtypes
Class Prediction: assignment of particular tumor samples to already-defined classes
Given bone marrow samples:
Which cancer classes are present among the samples?
How many cancer classes? 2, 4?
Given that the samples are from leukemia patients, what type of leukemia is each sample (AML vs ALL)?
![Page 7: Instance-based Classification](https://reader035.vdocument.in/reader035/viewer/2022062221/56813c65550346895da5f281/html5/thumbnails/7.jpg)
Leukemia: Definitions & Symptoms
Cancer of bone marrow
Myelogenous or lymphocytic, acute or chronic
Acute Myelogenous Leukemia (AML) vs Acute Lymphocytic Leukemia (ALL)
Marrow cannot produce an appropriate amount of red and white blood cells
Anemia -> weakness, minor infections; Platelet deficiency -> easy bruising
AML: 10,000 new adult cases per year
ALL: 3,500/2,400 new adult/child cases per year
AML vs. ALL in adults & children
![Page 8: Instance-based Classification](https://reader035.vdocument.in/reader035/viewer/2022062221/56813c65550346895da5f281/html5/thumbnails/8.jpg)
Leukemia: Treatment & Expected Outcome
Diagnosis via highly specialized laboratory
ALL: 58% survival rate
AML: 14% survival rate
Treatment: chemotherapy, bone marrow transplant
ALL: corticosteroids, vincristine, methotrexate, L-asparaginase
AML: daunorubicin, cytarabine
Correct diagnosis is very important for treatment options and expected outcome!
Microarray could provide a systematic diagnosis option, BUT IT IS ONLY ONE TYPE OF DIAGNOSTIC TOOL!
![Page 9: Instance-based Classification](https://reader035.vdocument.in/reader035/viewer/2022062221/56813c65550346895da5f281/html5/thumbnails/9.jpg)
Leukemia: Data set
38 bone marrow samples (27 ALL, 11 AML)
6817 human gene probes
![Page 10: Instance-based Classification](https://reader035.vdocument.in/reader035/viewer/2022062221/56813c65550346895da5f281/html5/thumbnails/10.jpg)
Cancer Class Prediction
• Learning Task
– Given: Expression profiles of leukemia patients
– Compute: A model distinguishing disease classes (e.g., AML vs. ALL patients) from expression data.
• Classification Task
– Given: Expression profile of a new patient + a learned model (e.g., one computed in a learning task)
– Determine: The disease class of the patient (e.g., whether the patient has AML or ALL)
![Page 11: Instance-based Classification](https://reader035.vdocument.in/reader035/viewer/2022062221/56813c65550346895da5f281/html5/thumbnails/11.jpg)
Cancer Class Prediction
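• n genes measured in m patients
$$\begin{array}{cccl}
g_{1,1} & \cdots & g_{1,n} & \rightarrow \text{class}_1 \\
g_{2,1} & \cdots & g_{2,n} & \rightarrow \text{class}_2 \\
\vdots & & \vdots & \\
g_{m,1} & \cdots & g_{m,n} & \rightarrow \text{class}_m
\end{array}$$
Each row is the expression vector for one patient.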
![Page 12: Instance-based Classification](https://reader035.vdocument.in/reader035/viewer/2022062221/56813c65550346895da5f281/html5/thumbnails/12.jpg)
Cancer Class Prediction Approach
• Rank genes by their correlation with class variable (AML/ALL)
• Select subset of “informative” genes
• Have these genes do a weighted vote to classify a previously unclassified patient.
• Test validity of predictors.
![Page 13: Instance-based Classification](https://reader035.vdocument.in/reader035/viewer/2022062221/56813c65550346895da5f281/html5/thumbnails/13.jpg)
Ranking Genes
• Rank genes by how predictive they are (individually) of the class…
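$$\begin{array}{cccl}
g_{1,1} & \cdots & g_{1,n} & \rightarrow \text{class}_1 \\
g_{2,1} & \cdots & g_{2,n} & \rightarrow \text{class}_2 \\
\vdots & & \vdots & \\
g_{m,1} & \cdots & g_{m,n} & \rightarrow \text{class}_m
\end{array}$$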
![Page 14: Instance-based Classification](https://reader035.vdocument.in/reader035/viewer/2022062221/56813c65550346895da5f281/html5/thumbnails/14.jpg)
Ranking Genes
• Split the expression values for a given gene g into two pools, one for each class (AML vs. ALL)
• Determine the mean $\mu$ and standard deviation $\sigma$ of each pool
• Rank genes by correlation metric (separation):
$$P(g, \text{class}) = \frac{\mu_{ALL} - \mu_{AML}}{\sigma_{ALL} + \sigma_{AML}}$$
The mean difference between the classes relative to the SD within the classes.
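As a rough sketch, the ranking step (difference of the class means divided by the sum of the class standard deviations) might look like this in Python. The function names and the toy expression values are hypothetical, and using the sample standard deviation (`statistics.stdev`) rather than the population version is an assumption.

```python
from statistics import mean, stdev

def separation(values, labels):
    """Score one gene: (mean_ALL - mean_AML) / (sd_ALL + sd_AML)."""
    all_pool = [v for v, c in zip(values, labels) if c == "ALL"]
    aml_pool = [v for v, c in zip(values, labels) if c == "AML"]
    # Assumption: sample standard deviation; a gene that is constant in
    # both pools would need special handling (division by zero).
    return (mean(all_pool) - mean(aml_pool)) / (stdev(all_pool) + stdev(aml_pool))

def rank_genes(expr, labels):
    """expr maps gene name -> expression values, one per sample."""
    scores = {gene: separation(values, labels) for gene, values in expr.items()}
    # Most ALL-like genes first, most AML-like genes last.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
expr = {
    "gene_a": [9.0, 10.0, 11.0, 1.0, 2.0, 3.0],   # high in ALL
    "gene_b": [5.0, 6.0, 4.0, 5.0, 4.0, 6.0],     # no class difference
    "gene_c": [1.0, 2.0, 3.0, 9.0, 10.0, 11.0],   # high in AML
}
ranking = rank_genes(expr, labels)
for gene, score in ranking:
    print(gene, round(score, 2))
```

Genes with a large positive score are highly expressed in ALL, genes with a large negative score are highly expressed in AML, and scores near zero carry little class information.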
![Page 15: Instance-based Classification](https://reader035.vdocument.in/reader035/viewer/2022062221/56813c65550346895da5f281/html5/thumbnails/15.jpg)
Neighborhood Analysis
• Each gene g: V(g) = (e1, e2, …, en), where ei is the expression level of gene g in the ith sample.
• Idealized pattern: c = (c1, c2, …, cn), where ci is 1 or 0 (sample i belongs to class 1 or class 2).
• c*: an idealized random pattern.
• Count the number of genes having various levels of correlation with c, and compare with the corresponding distribution obtained for random patterns c*.
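One way to picture neighborhood analysis is as a permutation test: count the genes whose score against the true class pattern exceeds a threshold, then repeat the count for randomly shuffled patterns. The sketch below is an illustration under assumptions, not the paper's code: it uses the mean/SD separation score as the correlation measure, a single threshold instead of a range of levels, and synthetic Gaussian data; all names are hypothetical.

```python
import random
from statistics import mean, stdev

def separation(values, pattern):
    """(mean of class 1 - mean of class 0) / (sd of class 1 + sd of class 0)."""
    ones = [v for v, c in zip(values, pattern) if c == 1]
    zeros = [v for v, c in zip(values, pattern) if c == 0]
    return (mean(ones) - mean(zeros)) / (stdev(ones) + stdev(zeros))

def count_above(expr, pattern, threshold):
    """Number of genes whose score against `pattern` exceeds the threshold."""
    return sum(separation(values, pattern) > threshold for values in expr)

def neighborhood_analysis(expr, pattern, threshold, n_perm=200, seed=0):
    rng = random.Random(seed)
    observed = count_above(expr, pattern, threshold)
    null_counts = []
    for _ in range(n_perm):
        shuffled = pattern[:]
        rng.shuffle(shuffled)                  # a random idealized pattern c*
        null_counts.append(count_above(expr, shuffled, threshold))
    # Fraction of random patterns whose neighborhood is at least as dense.
    p_value = sum(n >= observed for n in null_counts) / n_perm
    return observed, null_counts, p_value

# Synthetic data: 30 genes, 12 samples; the first 10 genes track the class.
rng = random.Random(1)
pattern = [1] * 6 + [0] * 6
expr = []
for g in range(30):
    shift = 10.0 if g < 10 else 0.0
    expr.append([rng.gauss(0, 1) + shift * c for c in pattern])

observed, null_counts, p_value = neighborhood_analysis(expr, pattern, threshold=1.0)
print(observed, max(null_counts), p_value)
```

If the observed count is far larger than the counts for shuffled patterns, the class distinction is supported by more genes than chance alone would explain.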
![Page 16: Instance-based Classification](https://reader035.vdocument.in/reader035/viewer/2022062221/56813c65550346895da5f281/html5/thumbnails/16.jpg)
Selecting Informative Genes
• Select the $k_{ALL}$ top-ranked genes (highly expressed in ALL) and the $k_{AML}$ bottom-ranked genes (highly expressed in AML)
$$P(g, \text{class}) = \frac{\mu_{ALL} - \mu_{AML}}{\sigma_{ALL} + \sigma_{AML}}$$
In Golub's paper, the 25 most positively correlated and the 25 most negatively correlated genes are selected.
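Given per-gene scores, the selection step reduces to taking both ends of the ranking. A minimal sketch, assuming scores are already computed; the function name, gene names, and values are made up for illustration.

```python
def select_informative(scores, k_all, k_aml):
    """Pick the k_all highest-scoring and k_aml lowest-scoring genes."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    top_all = ranked[:k_all]                          # highly expressed in ALL
    top_aml = ranked[-k_aml:] if k_aml > 0 else []    # highly expressed in AML
    return top_all, top_aml

# Hypothetical separation scores for six genes.
scores = {"g1": 2.1, "g2": -1.8, "g3": 0.1, "g4": 1.4, "g5": -0.2, "g6": -2.5}
all_genes, aml_genes = select_informative(scores, k_all=2, k_aml=2)
print(all_genes, aml_genes)
```

Setting both counts to 25 would correspond to the 50-gene predictor described on the slide.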
![Page 17: Instance-based Classification](https://reader035.vdocument.in/reader035/viewer/2022062221/56813c65550346895da5f281/html5/thumbnails/17.jpg)
Determine significant genes
A 1% significance level means that 1% of random neighborhoods contain as many points as the observed neighborhood.
P(g,c) > 0.30 holds for 709 genes (intersects the 1% level)
The median is ~150 genes (if totally random)
![Page 18: Instance-based Classification](https://reader035.vdocument.in/reader035/viewer/2022062221/56813c65550346895da5f281/html5/thumbnails/18.jpg)
Weighted Voting
• Given a new patient to classify, each of the selected genes casts a weighted vote for only one class.
• The class that gets the most votes is the prediction.
![Page 19: Instance-based Classification](https://reader035.vdocument.in/reader035/viewer/2022062221/56813c65550346895da5f281/html5/thumbnails/19.jpg)
Weighted Voting
• Suppose that x is the expression level measured for gene g in the patient
$$V = P(g, \text{class}) \times \left| x - \frac{\mu_{ALL} + \mu_{AML}}{2} \right|$$
• $P(g, \text{class})$: weight for gene g, a weighting factor reflecting how well the gene is correlated with the class distinction
• $\left| x - (\mu_{ALL} + \mu_{AML})/2 \right|$: distance from the measurement to the class boundary, reflecting the deviation of the expression level in the sample from the average of AML and ALL
![Page 20: Instance-based Classification](https://reader035.vdocument.in/reader035/viewer/2022062221/56813c65550346895da5f281/html5/thumbnails/20.jpg)
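Prediction
Weighted vote: $V_{AML} = \sum_i v_i w_i$, summed over the genes i whose vote is for AML, where $v_i = \left| x_i - (\mu_{AML} + \mu_{ALL})/2 \right|$ and $w_i$ is the weight of gene i.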
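The voting scheme can be sketched as follows. The slides give only the magnitude of each vote, so the rule for deciding which class a gene votes for is an assumption here: a gene votes ALL when its weight times the signed deviation from the class boundary is positive, and AML otherwise. The function name and all numbers are hypothetical.

```python
def weighted_vote(sample, genes):
    """Sum the per-gene votes for each class.

    sample: gene name -> expression level x in the new patient
    genes:  gene name -> (weight P(g, class), mean in ALL, mean in AML)
    """
    v_all = v_aml = 0.0
    for name, (weight, mu_all, mu_aml) in genes.items():
        boundary = (mu_all + mu_aml) / 2          # midpoint of the class means
        signed = weight * (sample[name] - boundary)
        # Assumption: positive signed vote -> ALL, negative -> AML.
        if signed > 0:
            v_all += signed
        else:
            v_aml += -signed
    return v_all, v_aml

genes = {
    "g1": (2.0, 10.0, 2.0),    # ALL marker
    "g2": (-1.5, 3.0, 9.0),    # AML marker
    "g3": (0.5, 6.0, 4.0),     # weak ALL marker
}
sample = {"g1": 9.0, "g2": 4.0, "g3": 4.5}
v_all, v_aml = weighted_vote(sample, genes)
print(v_all, v_aml)
```

Here g1 and g2 both point toward ALL (votes of 6.0 and 3.0), while g3 casts a small vote of 0.25 for AML.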
![Page 21: Instance-based Classification](https://reader035.vdocument.in/reader035/viewer/2022062221/56813c65550346895da5f281/html5/thumbnails/21.jpg)
Prediction Strength
• Can assess the “strength” of a prediction as follows:
$$PS = \frac{V_{winner} - V_{loser}}{V_{winner} + V_{loser}}$$
where $V_{winner}$ is the summed vote (absolute value) for the winning class, and $V_{loser}$ is the summed vote (absolute value) for the losing class
![Page 22: Instance-based Classification](https://reader035.vdocument.in/reader035/viewer/2022062221/56813c65550346895da5f281/html5/thumbnails/22.jpg)
Prediction Strength
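• When classifying new cases, the algorithm ignores those cases where the strength of the prediction is below a threshold $\theta$.
• Prediction =
$$\begin{cases}
\text{ALL}, & \text{if } V_{ALL} > V_{AML} \wedge PS > \theta \\
\text{AML}, & \text{if } V_{AML} > V_{ALL} \wedge PS > \theta \\
\text{No-call}, & \text{otherwise}
\end{cases}$$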
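The prediction-strength rule translates directly into code. In this sketch the threshold value 0.3 is an illustrative choice, not something stated on the slides, and the function name is hypothetical.

```python
def predict_with_strength(v_all, v_aml, threshold):
    """Return (label, PS); label is 'No-call' when PS does not exceed the threshold."""
    winner, loser = max(v_all, v_aml), min(v_all, v_aml)
    total = winner + loser
    ps = (winner - loser) / total if total > 0 else 0.0
    if v_all == v_aml or ps <= threshold:
        return "No-call", ps
    return ("ALL" if v_all > v_aml else "AML"), ps

# Illustrative threshold (an assumption for this example).
THRESHOLD = 0.3

print(predict_with_strength(9.0, 0.25, THRESHOLD))   # lopsided vote for ALL
print(predict_with_strength(5.0, 4.0, THRESHOLD))    # close vote
print(predict_with_strength(1.0, 4.0, THRESHOLD))    # clear vote for AML
```

A close vote such as 5.0 versus 4.0 gives PS = 1/9, which falls below the threshold and yields a no-call rather than a weak prediction.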
![Page 23: Instance-based Classification](https://reader035.vdocument.in/reader035/viewer/2022062221/56813c65550346895da5f281/html5/thumbnails/23.jpg)
Experiments
• Cross validation with the original set of patients
– For i = 1 to 38
  • Hold the ith sample aside
  • Use the other 37 samples to determine weights
  • With this set of weights, make a prediction on the ith sample
• Testing with another set of 34 patients…
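The leave-one-out loop above can be sketched generically. To keep the example self-contained, the classifier plugged in here is a toy nearest-class-mean rule on a single made-up gene, standing in for the weighted-voting predictor; `fit_means`, `predict_nearest`, and the data are hypothetical.

```python
from statistics import mean

def leave_one_out(samples, labels, fit, predict):
    """Hold each sample out in turn, train on the rest, predict the held-out one."""
    results = []
    for i in range(len(samples)):
        train_X = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_X, train_y)              # parameters from the other n-1 samples
        results.append((predict(model, samples[i]), labels[i]))
    return results

# Toy stand-in classifier: one gene, nearest class mean.
def fit_means(X, y):
    return {c: mean([x for x, label in zip(X, y) if label == c]) for c in set(y)}

def predict_nearest(model, x):
    return min(model, key=lambda c: abs(x - model[c]))

samples = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8]
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
results = leave_one_out(samples, labels, fit_means, predict_nearest)
accuracy = sum(pred == true for pred, true in results) / len(results)
print(accuracy)
```

The key point is that the held-out sample never influences the model used to classify it, so any gene selection or weighting would also belong inside `fit`.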
![Page 24: Instance-based Classification](https://reader035.vdocument.in/reader035/viewer/2022062221/56813c65550346895da5f281/html5/thumbnails/24.jpg)
Prediction: Results
"Training set" results (cross-validation: 37 train, 1 test): 36/38 samples predicted with 100% accuracy, 2 unknown
Independent "test set" consisted of 34 samples: 24 bone marrow samples, 10 peripheral blood samples
NOTE: the "training set" was ONLY bone marrow samples; the "test set" contained childhood AML samples and samples from different laboratories
Strong predictions (PS=0.77) for 29/34 samples with 100% accuracy
Low prediction strength from a questionable laboratory
Selection of 8-200 genes gives roughly the same prediction quality.
![Page 25: Instance-based Classification](https://reader035.vdocument.in/reader035/viewer/2022062221/56813c65550346895da5f281/html5/thumbnails/25.jpg)
Cancer Class Discovery
• Given
– Expression profiles of leukemia patients
• Do
– Cluster the profiles, leading to discovery of the subclasses of leukemia represented by the set of patients
![Page 26: Instance-based Classification](https://reader035.vdocument.in/reader035/viewer/2022062221/56813c65550346895da5f281/html5/thumbnails/26.jpg)
Cancer Class Discovery Experiment
• Cluster the expression profiles of 38 patients in the training set
– Using self-organizing maps with a predefined number of clusters (say, k)
• Run with k = 2
– Cluster 1 contained 1 AML, 24 ALL
– Cluster 2 contained 10 AML, 3 ALL
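A quick arithmetic check on the k = 2 result: if each cluster is named after its majority class, the counts on the slide give the fraction of samples that land in the "right" cluster. This is only a tally of the reported numbers, not a rerun of the self-organizing map.

```python
# Cluster contents as reported on the slide.
clusters = {
    "cluster 1": {"AML": 1, "ALL": 24},
    "cluster 2": {"AML": 10, "ALL": 3},
}

total = sum(sum(counts.values()) for counts in clusters.values())
majority = sum(max(counts.values()) for counts in clusters.values())
all_total = sum(counts["ALL"] for counts in clusters.values())
aml_total = sum(counts["AML"] for counts in clusters.values())
agreement = majority / total

print(total, all_total, aml_total, majority, round(agreement, 3))
```

The totals come to 38 samples (27 ALL, 11 AML), of which 34 fall in the cluster dominated by their own class.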
![Page 27: Instance-based Classification](https://reader035.vdocument.in/reader035/viewer/2022062221/56813c65550346895da5f281/html5/thumbnails/27.jpg)
Cancer Class Discovery Experiment
• Run with k = 4
– Cluster 1 contained mostly AML
– Cluster 2 contained mostly T-cell ALL
– Cluster 3 contained mostly B-cell ALL
– Cluster 4 contained mostly B-cell ALL
• Notably, the clustering algorithm was able to discover the distinction between T-cell and B-cell ALL cases