instance-based classification

Instance-based Classification

• Examine the training samples each time a new query instance is given.

• The relationship between the new query instance and training examples will be checked to assign a class label to the query instance.

KNN: k-Nearest Neighbor

• A test sample x can be best predicted by determining the most common class label among k training samples to which x is most similar.

• Notation: xj is the jth training sample, yj is the class label of xj, and Nx is the set of k nearest neighbors of x in the training set. Estimate the probability that x belongs to the ith class:

KNN: k-Nearest Neighbor, cont'd

• Proportion of K nearest neighbors that belong to ith class:

• The ith class which maximizes the proportion above will be assigned as the label of x.

• Variants of KNN: filtering out irrelevant genes before applying KNN.

$$\hat{p}(i \mid x) = \frac{|\{x_j \in N_x : y_j = i\}|}{K}$$
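The KNN rule above can be sketched in a few lines. This is a minimal illustration, not the method of any particular paper: the Euclidean distance, the toy two-gene samples, and the name `knn_predict` are all assumptions, since the slides do not fix a similarity measure.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Return (predicted label, class proportions among the k nearest neighbors)."""
    # Euclidean distance is an assumption; any similarity measure could be used.
    order = sorted(range(len(train_x)), key=lambda j: math.dist(train_x[j], query))
    neighbors = order[:k]
    counts = Counter(train_y[j] for j in neighbors)
    # Proportion of the k neighbors that belong to each class.
    proportions = {label: n / k for label, n in counts.items()}
    # The class with the largest proportion becomes the label of the query.
    best = max(proportions, key=proportions.get)
    return best, proportions

# Hypothetical toy data: two expression values per sample.
train_x = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9)]
train_y = ["ALL", "ALL", "ALL", "AML", "AML"]
label, props = knn_predict(train_x, train_y, (1.1, 1.0), k=3)
print(label, props)
```

Filtering out irrelevant genes before calling such a function would simply mean dropping columns from `train_x` and `query`.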

Molecular Classification of Cancer

Class Discovery and Class Prediction by Gene Expression Monitoring

"Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring"

Golub, Slonim, Tamayo, Huard, Gaasenbeek, Mesirov, Coller, Lo, Downing, Caligiuri, Bloomfield, Lander

Appears in Science Volume 286, October 15, 1999

Whitehead Institute/MIT Center for Genome Research

http://www-genome.wi.mit.edu/cancer

...and Dana-Farber (Boston), St. Jude (Memphis), Ohio State

...additional publications by the same group show similar techniques applied to different disease modalities.

Publication Info

Cancer Classification

Class Discovery: defining previously unrecognized tumor subtypes

Class Prediction: assignment of particular tumor samples to already-defined classes

Given bone marrow samples:

• Which cancer classes are present among the samples?

• How many cancer classes? 2, 4?

• Given that the samples are from leukemia patients, what type of leukemia is each sample (AML vs ALL)?

Cancer of bone marrow

Myelogenous or lymphocytic, acute or chronic

Acute Myelogenous Leukemia (AML) vs Acute Lymphocytic Leukemia (ALL)

Marrow cannot produce appropriate amount of red and white blood cells

Anemia -> weakness, minor infections; Platelet deficiency -> easy bruising

AML: 10,000 new adult cases per year

ALL: 3,500/2,400 new adult/child cases per year

AML vs. ALL in adults & children

Leukemia: Definitions & Symptoms

Leukemia: Treatment & expected outcome

Diagnosis via highly specialized laboratory

ALL: 58% survival rate

AML: 14% survival rate

Treatment: chemotherapy, bone marrow transplant

ALL: corticosteroids, vincristine, methotrexate, L-asparaginase

AML: daunorubicin, cytarabine

Correct diagnosis is very important for treatment options and expected outcome!!!

Microarray could provide a systematic diagnosis option

BUT ONLY ONE TYPE OF DIAGNOSTIC TOOL!!!

38 bone marrow samples (27 ALL, 11 AML)

6817 human gene probes

Leukemia: Data set

Cancer Class Prediction

• Learning Task

– Given: Expression profiles of leukemia patients

– Compute: A model distinguishing disease classes (e.g., AML vs. ALL patients) from expression data.

• Classification Task

– Given: Expression profile of a new patient + a learned model (e.g., one computed in a learning task)

– Determine: The disease class of the patient (e.g., whether the patient has AML or ALL)

Cancer Class Prediction

• n genes measured in m patients

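$$
\begin{array}{ccc|c}
g_{1,1} & \cdots & g_{1,n} & \text{class}_1 \\
g_{2,1} & \cdots & g_{2,n} & \text{class}_2 \\
\vdots & & \vdots & \vdots \\
g_{m,1} & \cdots & g_{m,n} & \text{class}_m
\end{array}
$$

Each row is the expression vector for one patient, paired with that patient's class label.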

Cancer Class Prediction Approach

• Rank genes by their correlation with class variable (AML/ALL)

• Select subset of “informative” genes

• Have these genes do a weighted vote to classify a previously unclassified patient.

• Test validity of predictors.

Ranking Genes

• Rank genes by how predictive they are (individually) of the class…

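$$
\begin{array}{ccc|c}
g_{1,1} & \cdots & g_{1,n} & \text{class}_1 \\
g_{2,1} & \cdots & g_{2,n} & \text{class}_2 \\
\vdots & & \vdots & \vdots \\
g_{m,1} & \cdots & g_{m,n} & \text{class}_m
\end{array}
$$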

Ranking Genes

• Split the expression values for a given gene g into two pools, one for each class (AML vs. ALL)

• Determine the mean μ and standard deviation σ of each pool

• Rank genes by correlation metric (separation)

$$P(g, \text{class}) = \frac{\mu_{ALL} - \mu_{AML}}{\sigma_{ALL} + \sigma_{AML}}$$

The mean difference between the classes

relative to the SD within the classes.
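The ranking step can be sketched as follows. This is a minimal illustration assuming the metric is the difference of class means divided by the sum of class standard deviations; the gene names and expression values are hypothetical, and using the sample standard deviation (`statistics.stdev`) is an assumption, since the slides do not say which estimator is meant.

```python
from statistics import mean, stdev

def correlation_metric(values, labels):
    """P(g, class) = (mean_ALL - mean_AML) / (sd_ALL + sd_AML) for one gene."""
    all_pool = [v for v, c in zip(values, labels) if c == "ALL"]
    aml_pool = [v for v, c in zip(values, labels) if c == "AML"]
    # Sample standard deviation is an assumption about the estimator.
    return (mean(all_pool) - mean(aml_pool)) / (stdev(all_pool) + stdev(aml_pool))

# Hypothetical expression values: one list per gene, one entry per patient.
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
expression = {
    "gene_a": [10, 12, 14, 2, 4, 6],   # high in ALL
    "gene_b": [2, 4, 6, 10, 12, 14],   # high in AML
    "gene_c": [5, 6, 7, 5, 6, 7],      # uninformative
}

scores = {g: correlation_metric(v, labels) for g, v in expression.items()}
# Most ALL-correlated genes first, most AML-correlated genes last.
ranked = sorted(scores, key=scores.get, reverse=True)
print(scores, ranked)
```

A large positive score marks a gene expressed more highly in ALL, a large negative score one expressed more highly in AML, and a score near zero an uninformative gene.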

Neighborhood Analysis

• Each gene g: V(g) = (e1, e2, …, en), where ei is the expression level of gene g in the ith sample.

• Idealized pattern: c = (c1, c2, …, cn), where ci is 1 or 0 (sample i belongs to class 1 or class 2).

• c*: an idealized random pattern.

• Count the number of genes having various levels of correlation with c, and compare with the corresponding distribution obtained for random patterns c*.
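One way to read the comparison against random patterns is as a permutation test: count the genes whose score exceeds a threshold under the true labels, then repeat the count under shuffled labels. The sketch below makes that reading concrete. The function names, the toy data, the threshold, and the number of permutations are all assumptions chosen for illustration.

```python
import random
from statistics import mean, stdev

def score(values, labels):
    """Mean difference between classes relative to the summed within-class SD."""
    all_pool = [v for v, c in zip(values, labels) if c == "ALL"]
    aml_pool = [v for v, c in zip(values, labels) if c == "AML"]
    spread = stdev(all_pool) + stdev(aml_pool)
    return (mean(all_pool) - mean(aml_pool)) / spread if spread else 0.0

def count_above(genes, labels, threshold):
    """Number of genes whose score exceeds the threshold for this labeling."""
    return sum(1 for values in genes if score(values, labels) > threshold)

def neighborhood_analysis(genes, labels, threshold, n_perm=200, seed=0):
    """Return (observed count, fraction of random patterns with at least that count)."""
    rng = random.Random(seed)
    observed = count_above(genes, labels, threshold)
    null_counts = []
    for _ in range(n_perm):
        shuffled = labels[:]          # a random pattern c* with the same class sizes
        rng.shuffle(shuffled)
        null_counts.append(count_above(genes, shuffled, threshold))
    frac = sum(1 for c in null_counts if c >= observed) / n_perm
    return observed, frac

labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
genes = [[10, 12, 14, 2, 4, 6], [2, 4, 6, 10, 12, 14], [5, 6, 7, 5, 6, 7]]
observed, frac = neighborhood_analysis(genes, labels, threshold=1.0)
print(observed, frac)
```

A small fraction suggests that more genes correlate with the class distinction than would be expected by chance.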

Selecting Informative Genes

• Select the kALL top-ranked genes (highly expressed in ALL) and the kAML bottom-ranked genes (highly expressed in AML)

$$P(g, \text{class}) = \frac{\mu_{ALL} - \mu_{AML}}{\sigma_{ALL} + \sigma_{AML}}$$

In Golub's paper, the 25 most positively correlated and the 25 most negatively correlated genes were selected.
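Selecting the top-ranked and bottom-ranked genes is a simple sort over the per-gene scores. The sketch below assumes the scores have already been computed; the gene names, score values, and the function name `select_informative` are hypothetical.

```python
def select_informative(scores, k_all=25, k_aml=25):
    """Return (genes most correlated with ALL, genes most correlated with AML)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    top_all = ranked[:k_all]
    # Slice from an explicit index so that k_aml = 0 yields an empty list.
    bottom_aml = ranked[len(ranked) - k_aml:]
    return top_all, bottom_aml

# Hypothetical scores P(g, class) for five genes.
scores = {"g1": 2.0, "g2": -2.0, "g3": 0.1, "g4": -0.5, "g5": 1.2}
all_genes, aml_genes = select_informative(scores, k_all=2, k_aml=2)
print(all_genes, aml_genes)
```

With only a handful of genes the two lists can overlap if `k_all + k_aml` exceeds the number of genes, so a real pipeline would want to check for that.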

Determine significant genes

A 1% significance level means that 1% of random neighborhoods contain as many points as the observed neighborhood.

P(g,c) > 0.30: 709 genes (intersects the 1% significance level)

Median: ~150 genes (if totally random)

Weighted Voting

• Given a new patient to classify, each of the selected genes casts a weighted vote for only one class.

• The class that gets the most votes is the prediction.

Weighted Voting

• Suppose that x is the expression level measured for gene g in the patient

$$V = P(g, \text{class}) \times \left| x - \frac{\mu_{ALL} + \mu_{AML}}{2} \right|$$

• P(g, class): the weight for gene g, a weighting factor reflecting how well the gene is correlated with the class distinction

• |x − (μALL + μAML)/2|: the distance from the measurement to the class boundary, reflecting the deviation of the expression level in the sample from the average of the AML and ALL means

Prediction

Weighted vote: $V_{AML} = \sum_i v_i w_i$, summed over the genes i whose vote is for AML, where $v_i = |x_i - (\mu_{AML} + \mu_{ALL})/2|$ and $w_i$ is the weight of gene i.
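The voting step can be sketched as below. One assumption needs flagging: the code uses a signed vote, weight times (x minus boundary), and lets the sign decide which class the gene votes for, then accumulates the absolute values per class. That is one reading of the absolute-value notation on the slides, not a confirmed implementation. The gene names and numbers are hypothetical.

```python
def weighted_vote(sample, genes):
    """Sum the per-gene votes for each class.

    sample: dict gene -> expression level x in the new patient
    genes:  dict gene -> (weight P(g, class), mean in ALL, mean in AML)
    """
    v_all = 0.0
    v_aml = 0.0
    for g, (weight, mu_all, mu_aml) in genes.items():
        boundary = (mu_all + mu_aml) / 2      # midpoint between the class means
        vote = weight * (sample[g] - boundary)
        # Assumption: a positive signed vote favors ALL, a negative one AML.
        if vote > 0:
            v_all += vote
        else:
            v_aml += -vote
    return v_all, v_aml

# Hypothetical informative genes and one new patient.
genes = {
    "gA": (2.0, 12.0, 4.0),    # high in ALL
    "gB": (-2.0, 4.0, 12.0),   # high in AML
    "gC": (0.5, 7.0, 5.0),     # weakly high in ALL
}
sample = {"gA": 11.0, "gB": 5.0, "gC": 5.5}
v_all, v_aml = weighted_vote(sample, genes)
print(v_all, v_aml)
```

Here genes gA and gB both vote for ALL with strength 6, while gC casts a small vote of 0.25 for AML, so the patient would be called ALL.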

Prediction Strength

• Can assess the “strength” of a prediction as follows:

PS = (Vwinner – Vloser)/(Vwinner+ Vloser)

where Vwinner is the summed vote (absolute value) from the winning class, and Vloser is the summed vote (absolute value) for the losing class

Prediction Strength

• When classifying new cases, the algorithm ignores those cases where the strength of the prediction is below a threshold…

• Prediction =

– ALL, if VALL > VAML and PS > θ

– AML, if VAML > VALL and PS > θ

– No-call, otherwise

where θ is the prediction-strength threshold.
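The decision rule with a no-call option can be sketched as follows. The threshold value used in the example calls is illustrative only, and the function name is hypothetical.

```python
def predict_with_strength(v_all, v_aml, threshold):
    """Return (prediction, prediction strength) from the two summed votes."""
    total = v_all + v_aml
    if total == 0 or v_all == v_aml:
        return "No-call", 0.0
    winner, loser = max(v_all, v_aml), min(v_all, v_aml)
    # PS = (V_winner - V_loser) / (V_winner + V_loser), between 0 and 1.
    ps = (winner - loser) / total
    if ps > threshold:
        return ("ALL" if v_all > v_aml else "AML"), ps
    return "No-call", ps

# The threshold 0.3 is an illustrative choice for this sketch.
print(predict_with_strength(12.0, 0.25, 0.3))  # clear margin for ALL
print(predict_with_strength(5.0, 4.0, 0.3))    # votes too close to call
print(predict_with_strength(2.0, 8.0, 0.3))    # clear margin for AML
```

Raising the threshold trades coverage for confidence: fewer samples receive a call, but the calls that are made rest on a wider margin between the two vote totals.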

Experiments

• Cross validation with the original set of patients

– For i = 1 to 38

• Hold the ith sample aside

• Use the other 37 samples to determine weights

• With this set of weights, make a prediction on the ith sample

• Testing with another set of 34 patients…
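The hold-one-out loop can be sketched generically. To keep the example short, a one-gene nearest-class-mean rule stands in for the weighted-voting predictor; that stand-in, the function names, and the toy data are assumptions for illustration only.

```python
from statistics import mean

def leave_one_out(samples, labels, train, predict):
    """Hold each sample aside in turn, train on the rest, predict the held-out one."""
    predictions = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)
        predictions.append(predict(model, samples[i]))
    return predictions

# Stand-in classifier (an assumption, not the weighted-voting scheme):
# store the mean expression of one gene per class, predict the nearest mean.
def train_means(xs, ys):
    return {c: mean([x for x, y in zip(xs, ys) if y == c]) for c in set(ys)}

def predict_nearest_mean(model, x):
    return min(model, key=lambda c: abs(model[c] - x))

# Hypothetical one-gene data set with six patients.
samples = [10.0, 12.0, 14.0, 2.0, 4.0, 6.0]
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]

preds = leave_one_out(samples, labels, train_means, predict_nearest_mean)
accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
print(preds, accuracy)
```

The important detail is that everything learned from data, including gene selection and weights, is recomputed inside the loop from the remaining samples only, so the held-out sample never influences its own prediction.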

"Training set" results were 36/38 with 100% accuracy, 2 unknown via cross-validation (37 train, 1 test)

Independent "test set" consisted of 34 samples

24 bone marrow samples, 10 peripheral blood samples

NOTE: "training set" was ONLY bone marrow samples

"test set" contained childhood AML samples, different laboratories

Strong predictions (median PS = 0.77) for 29/34 samples with 100% accuracy

Low prediction strength from questionable laboratory

Prediction: Results

Selection of 8-200 genes gives roughly the same prediction quality.

Cancer Class Discovery

• Given

– Expression profiles of leukemia patients

• Do

– Cluster the profiles, leading to discovery of the subclasses of leukemia represented by the set of patients

Cancer Class Discovery Experiment

• Cluster the expression profiles of 38 patients in the training set

– Using self-organizing maps with a predefined number of clusters (say, k)

• Run with k = 2

– Cluster 1 contained 1 AML, 24 ALL

– Cluster 2 contained 10 AML, 3 ALL
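A quick way to summarize how well the k = 2 clusters line up with the known AML/ALL labels is to count, for each cluster, the samples in its majority class. The sketch below does that arithmetic on the counts reported above. The measure, often called cluster purity, is introduced here for illustration and is not named in the slides.

```python
# Cluster composition for k = 2, taken from the counts reported above.
clusters = {
    "cluster 1": {"AML": 1, "ALL": 24},
    "cluster 2": {"AML": 10, "ALL": 3},
}

def purity(clusters):
    """Fraction of samples that fall in the majority class of their cluster."""
    total = sum(sum(counts.values()) for counts in clusters.values())
    majority = sum(max(counts.values()) for counts in clusters.values())
    return majority / total

total_samples = sum(sum(counts.values()) for counts in clusters.values())
print(total_samples, purity(clusters))
```

The counts add up to the 38 training samples, and 34 of them sit in the cluster dominated by their own class.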

Cancer Class Discovery Experiment

• Run with k = 4

– Cluster 1 contained mostly AML

– Cluster 2 contained mostly T-cell ALL

– Cluster 3 contained mostly B-cell ALL

– Cluster 4 contained mostly B-cell ALL

• It is unlikely that the clustering algorithm was able to discover the distinction between T-cell and B-cell ALL cases
