
Selective Gaussian Naive Bayes Model for Diffuse Large-B-Cell Lymphoma Classification: Some Improvements in Preprocessing and Variable Elimination

Barcelona, July 2005. Andrés Cano, F. Javier García, Andrés Masegosa and Serafín Moral

Dept. Computer Science and Artificial Intelligence

University of Granada


Introduction: Gene Expression Data

MicroArray: a biochip that measures the expression level of thousands of genes in only one experiment.

It is a micromatrix.

Each row contains genetic material of a given gene.

Each tumoral pattern is put in each column and hybridized (each cell is colored).

The micromatrix is scanned and the data are obtained.

[Figure: hybridization of the Lymphochip.]


Diffuse Large-B-Cell Lymphoma Classification

60% of patients with Diffuse Large-B-Cell Lymphoma (DLBCL) succumb to this disease.

Alizadeh et al. (2000) discovered, using the Lymphochip, that DLBCL comprises two different diseases: GCB, with a high survival index, and ABC, with a low survival index.

They provide a data set with 42 cases, 21 of GCB and 21 of ABC, each one with the measured expression level of 4096 genes.

Using this sort of data set, the problem is to:

Build an automatic classifier for the prediction of the subtype of a DLBCL pattern.

Find a minimum subset of genes that makes this classification possible.


Bayesian Classification of gene expression

Data domain: the classifier can work either on the continuous expression levels or on discretized data, with a conditional probability table per gene. Example of a discretized table for one gene:

  p(X1|C)   C1    C2
  x1        0.3   0.8
  x2        0.7   0.2

Data dependences:

[Figure: Naive Bayes structure: the class C is the only parent of the genes X1, X2, X3.]

[Figure: TAN structure: the class C plus a tree of dependences among the genes X1, X2, X3.]
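The slide contrasts the discretized, table-based model with one for continuous data; the Gaussian variant in the title keeps the expression levels continuous and gives every gene a class-conditional normal density. A minimal sketch of that idea in Python (not the authors' code; the toy data at the end is invented for illustration):

import numpy as np

class GaussianNaiveBayes:
    """Naive Bayes with one Gaussian per (class, gene): p(x_i | c) = N(mu_ci, var_ci)."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.priors_ = np.array([np.mean(y == c) for c in self.classes_])
        self.mu_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        self.var_ = np.array([X[y == c].var(axis=0) + 1e-9 for c in self.classes_])
        return self

    def predict(self, X):
        # Posterior (up to a constant): log p(c) + sum_i log N(x_i | mu_ci, var_ci).
        log_post = np.log(self.priors_) + np.stack(
            [-0.5 * (np.log(2 * np.pi * v) + (X - m) ** 2 / v).sum(axis=1)
             for m, v in zip(self.mu_, self.var_)], axis=1)
        return self.classes_[np.argmax(log_post, axis=1)]

# Toy example: 6 samples x 3 genes, two subtypes.
X = np.array([[0.1, 1.2, -0.3], [0.2, 1.1, -0.1], [0.0, 1.3, -0.2],
              [1.1, -0.2, 0.9], [1.3, -0.1, 1.1], [0.9, -0.3, 1.0]])
y = np.array(["GCB", "GCB", "GCB", "ABC", "ABC", "ABC"])
print(GaussianNaiveBayes().fit(X, y).predict(X))   # expect the training labels back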

Feature Selection with Gene Expression Data

These data sets have:

High dimensionality: between 4000 and 20000 genes.

Low number of cases: between 40 and 200 cases.

FSS problems:

High risk of overfitting.

Low reliability of the results.

Solutions:

Filter methods: select the best features using a reasonable criterion that is independent of the classifier. Advantage: very efficient. Problem: the criterion is not associated to the problem.

Wrapper methods: for each subset of features, try to solve the problem and keep the subset that is best under this final criterion. Advantage: very powerful. Problem: very time consuming.

Filter method + wrapper method: combine both approaches.
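To make the filter/wrapper distinction concrete: a filter scores each gene with a criterion that is independent of the classifier, while a wrapper scores a whole candidate subset by the estimated accuracy of the classifier itself. A hedged sketch using scikit-learn (the ANOVA F statistic and leave-one-out accuracy are illustrative choices, not necessarily the criteria used in this work); the "filter + wrapper" combination then keeps only the top filter-ranked genes and runs the wrapper search on that reduced pool:

import numpy as np
from sklearn.feature_selection import f_classif            # filter criterion (ANOVA F)
from sklearn.naive_bayes import GaussianNB                  # classifier wrapped by the wrapper
from sklearn.model_selection import cross_val_score, LeaveOneOut

def filter_scores(X, y):
    """Filter view: one cheap, classifier-independent score per gene."""
    f_stat, _ = f_classif(X, y)
    return f_stat

def wrapper_score(X, y, subset):
    """Wrapper view: estimated accuracy of the classifier on a candidate gene subset."""
    return cross_val_score(GaussianNB(), X[:, list(subset)], y, cv=LeaveOneOut()).mean()

def filter_reduce(X, y, n_keep=50):
    """Filter + wrapper: keep the n_keep best-ranked genes before the wrapper search."""
    return np.argsort(filter_scores(X, y))[::-1][:n_keep]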

Preordering the features

FSS search (greedy wrapper forward selection over the non-selected features):

Step 1: search in the non-selected features; each candidate is added to the current naive Bayes model and evaluated (e.g. X1: accuracy 83%, X2: 89%, X8: 84%).

Step 2: select the best node (X3 joins the model, accuracy 91%).

Step 3: follow the search over the remaining features (e.g. X1: 88%, X2: 91%), until the stop condition (final model with X3, X7 and X5, accuracy 93%).

Changes proposed:

Introduction of a preorder over the features: a filter preorder or an accuracy preorder.

Limit of the search space to the N first features of the preorder.
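The two preorders named on the slide can be read as two rankings of the individual genes: a filter preorder ranks them by an independent score, and an accuracy preorder ranks them by how well each single-gene classifier already predicts the class. The exact criteria are not spelled out in the slides, so the following sketch is an assumption (ANOVA F for the filter preorder, leave-one-out accuracy of a one-gene Gaussian naive Bayes for the accuracy preorder):

import numpy as np
from sklearn.feature_selection import f_classif
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score, LeaveOneOut

def filter_preorder(X, y):
    """Filter preorder: genes sorted by a classifier-independent score, best first."""
    return np.argsort(f_classif(X, y)[0])[::-1]

def accuracy_preorder(X, y):
    """Accuracy preorder: genes sorted by the leave-one-out accuracy of the
    single-gene Gaussian naive Bayes classifier, best first."""
    acc = [cross_val_score(GaussianNB(), X[:, [g]], y, cv=LeaveOneOut()).mean()
           for g in range(X.shape[1])]
    return np.argsort(acc)[::-1]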

Preordering the features

Limited FSS search:

The features are preordered (X3, X5, X7, X1, X2, X6, ...) and only the N first non-selected features of the preorder form the limited search space.

Step 1: search in the non-selected features of the limited search space (e.g. X3: accuracy 88%, X5: 88%, X7: 84%).

Step 2: select the best node (X3 joins the model, accuracy 88%).

Step 3: follow the search; the window moves along the preorder (e.g. X5: 89%, X7: 87%), until the stop condition (final model with X3, X1 and X5, accuracy 95%).
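A possible sketch of the limited forward selection itself: at every step only the first few not-yet-selected genes of the preorder are evaluated, and the search stops when no candidate improves the current accuracy (the window size and the stop condition are assumptions, not taken from the slides):

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score, LeaveOneOut

def limited_forward_selection(X, y, preorder, window=10):
    """Greedy wrapper forward selection restricted to a moving window of the preorder."""
    def loo_acc(subset):
        return cross_val_score(GaussianNB(), X[:, subset], y, cv=LeaveOneOut()).mean()

    remaining = list(preorder)
    selected, best_acc = [], 0.0
    while remaining:
        candidates = remaining[:window]                      # limited search space
        scores = [loo_acc(selected + [g]) for g in candidates]
        best = int(np.argmax(scores))
        if scores[best] <= best_acc:                         # assumed stop condition
            break
        best_acc = scores[best]
        selected.append(candidates[best])                    # best node joins the model
        remaining.remove(candidates[best])
    return selected, best_acc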

Irrelevant Variable Elimination

Heuristic for irrelevant features:

[Figure: on the train set, the cases right-classified and non-right-classified by the classifier built with the selected features X = (X1, X2, X3) are compared with those of the single-variable classifiers built on candidates Y and Z; Y is irrelevant with respect to X, while Z is not irrelevant with respect to X.]
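The slide only shows the heuristic as a picture, so the concrete rule below is an assumption consistent with it: a candidate gene is taken as irrelevant with respect to the selected subset when its single-gene classifier correctly classifies no train case that the selected-subset classifier misses.

import numpy as np
from sklearn.naive_bayes import GaussianNB

def right_classified(X, y, subset):
    """Mask of train cases correctly classified by a Gaussian NB built on `subset`."""
    model = GaussianNB().fit(X[:, list(subset)], y)
    return model.predict(X[:, list(subset)]) == y

def is_irrelevant(X, y, candidate, selected):
    """Assumed reading of the heuristic: `candidate` adds nothing if every case its
    single-gene classifier gets right is already right-classified by `selected`."""
    missed_by_selected = ~right_classified(X, y, selected)
    return not np.any(right_classified(X, y, [candidate]) & missed_by_selected)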

Limited Search with Variable Elimination

Wrapper search with variable elimination:

Step 1: search in the non-selected features of the limited search space (X3, X5, X7, ...).

Step 2: select the best node (X3).

Step 3: eliminate the features of the search space that are irrelevant with respect to X3; they leave the limited search space.

Step 4: follow the search; the window is refilled with the next features of the preorder (X7, X2, X6, X8, X4, X9, ...), until the stop condition.

Final state: the final subset selected (X3, X6, X4), the irrelevant features eliminated along the way (X5, X1, X8, X7), and the remaining not-selected features (X9, X10, ...).
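Putting the pieces together, LFSS-VE (the limited search with variable elimination evaluated in the next slides) can be sketched as the limited forward selection above with one extra step: after a gene is added, candidates that are irrelevant with respect to the current subset are dropped from the pool. The window size, stop condition and irrelevance test are assumptions; this is an illustration, not the authors' implementation:

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score, LeaveOneOut

def lfss_ve(X, y, preorder, window=10):
    """Limited forward selection with variable elimination (illustrative sketch)."""
    def loo_acc(subset):
        return cross_val_score(GaussianNB(), X[:, subset], y, cv=LeaveOneOut()).mean()

    def right_mask(subset):
        model = GaussianNB().fit(X[:, subset], y)
        return model.predict(X[:, subset]) == y

    remaining = list(preorder)
    selected, best_acc = [], 0.0
    while remaining:
        candidates = remaining[:window]                        # Step 1: limited search space
        scores = [loo_acc(selected + [g]) for g in candidates]
        best = int(np.argmax(scores))
        if scores[best] <= best_acc:                           # assumed stop condition
            break
        best_acc = scores[best]
        selected.append(candidates[best])                      # Step 2: best node joins
        remaining.remove(candidates[best])
        missed = ~right_mask(selected)                         # cases the model still misses
        # Step 3: variable elimination -- drop candidates irrelevant w.r.t. the subset
        remaining = [g for g in remaining if np.any(right_mask([g]) & missed)]
    return selected, best_acc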

Classifying Diffuse Large-B-Cell Lymphoma

Data Base I: taken from the work of Alizadeh et al. (2000).

42 samples (21 GCB + 21 ABC).

348 genes.

Validation scheme: leave-one-out validation.

Data Base II: taken from the work of Wright et al. (2004).

217 samples (134 GCB + 83 ABC).

8503 genes.

Validation scheme:

10 train and test sets of equal size.

Each train set is reduced by a filter method.

Number of filtered genes: 78.7 ± 4.4.

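A small sketch of the two validation schemes described above (the fit_and_predict callable is a placeholder for whatever selective classifier is being validated; the 50/50 split size and the stratification are assumptions):

import numpy as np
from sklearn.model_selection import LeaveOneOut, train_test_split

def loo_accuracy(X, y, fit_and_predict):
    """Data Base I scheme: leave-one-out over the samples."""
    hits = 0
    for train_idx, test_idx in LeaveOneOut().split(X):
        pred = fit_and_predict(X[train_idx], y[train_idx], X[test_idx])
        hits += int(pred[0] == y[test_idx][0])
    return hits / len(y)

def ten_equal_splits(y, seed=0):
    """Data Base II scheme: 10 random train/test partitions of equal size.
    (The filter that reduces each train set to ~79 genes is applied per split.)"""
    rng = np.random.RandomState(seed)
    return [train_test_split(np.arange(len(y)), test_size=0.5, stratify=y,
                             random_state=int(rng.randint(10**6)))
            for _ in range(10)]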

Experimental Results I

Feature preorder:

Data Base  Data       Random Preorder  Filter Preorder  Accuracy Preorder
DB1        Accuracy   80.9 ± 4.9       81.0 ± 4.9       92.8 ± 2.1
DB1        No. Genes  4.3 ± 0.5        3.2 ± 0.1        3.8 ± 0.5
DB2        Accuracy   88.9 ± 0.6       91.0 ± 0.4       89.1 ± 0.5
DB2        No. Genes  8.0 ± 3.2        9.0 ± 5.1        7.6 ± 4.0

Preorder limit:

Data Base  Data       LFSS
DB1        Accuracy   92.8 ± 2.1
DB1        No. Genes  3.8 ± 0.3
DB2        Accuracy   91.8 ± 0.4
DB2        No. Genes  7.8 ± 3.0

Experimental Results II

Elimination of irrelevant features:

Data Base  Data       LFSS-VE     LFSS        FSS
DB1        Accuracy   95.2 ± 1.4  92.8 ± 2.1  80.9 ± 4.9
DB1        No. Genes  5.4 ± 0.1   3.8 ± 0.3   4.3 ± 0.5
DB1        No. Eval   1882        2840        74900
DB2        Accuracy   93.0 ± 0.4  91.8 ± 0.4  88.9 ± 0.6
DB2        No. Genes  8.1 ± 5.6   7.8 ± 3.0   8.0 ± 3.2
DB2        No. Eval   1018        1080        8002

Filter preorder vs accuracy preorder:

Data Base  Data       LFSS-VE (Filter Preorder)  LFSS-VE (Accuracy Preorder)
DB1        Accuracy   88.1 ± 3.3                 95.2 ± 1.4
DB1        No. Genes  3.9 ± 0.1                  5.4 ± 0.1
DB2        Accuracy   90.7 ± 0.5                 93.0 ± 0.4
DB2        No. Genes  7.6 ± 2.7                  8.1 ± 5.6

Experimental Results III

Results comparison on the test data sets:

Wright et al. classifier (validated in one partition, 27 genes selected):

True class   Predicted ABC   Predicted GCB   Unclass.
ABC          38              1               2
GCB          2               57              8

Wrapper + Abduction (validated in 10 partitions, 7.0 genes selected on average):

True class   Predicted ABC   Predicted GCB   Unclass.
ABC          32.7            3.5             4.8
GCB          3.2             58.8            5.0

LFSS-VE (validated in 10 partitions, 8.1 genes selected on average):

True class   Predicted ABC   Predicted GCB   Unclass.
ABC          32.7            1.3             7.0
GCB          1.7             57.4            7.9

Conclusions and Future Work

The wrapper technique is a powerful method for supervised classification tasks.

Its main disadvantage is its high computational cost, in particular on gene expression data bases, due to their high dimensionality.

LFSS-VE overcomes these disadvantages in the DLBCL classification by using a preordering of the features and a limited search space.

The elimination of irrelevant features is a good way to enhance the performance of a wrapper method.

The future line of work is the validation of our model with other data sets: breast cancer, colon cancer, leukemia, ...
