TRANSCRIPT
A hybrid method for gene selection in microarray datasets
Yungho Leu, Chien-Pan Lee and Ai-Chen Chang
National Taiwan University of Science and Technology
2014/10/22
Outline
Microarray Datasets & Research Objective
Related work & Background
Research method
Experimental result
Conclusion
Microarray datasets
Microarray technology can be used to measure the expression levels of thousands of genes at the same time.
A microarray dataset records the gene expressions of different samples in a table.
Mobile Computing & Data Mining Lab.
Microarray datasets
N : number of samples (40~200); M : number of genes (2,000~30,000); gi,j : expression level of gene j at sample i
Class label : the class label of the sample
(M >> N)

The Prostate cancer dataset (simplified), with N samples (rows), M gene columns, and a class label:

Sample  Gene1   Gene2   Class
S1       0.022  -0.721  0
S2      -1.034   0.331  0
...        ...     ...  ...
Sj-1    -0.212   0.123  1
Sj       0.542   0.431  1

Class label: 0 = Absent, 1 = Present
Research objective
M >> N poses challenges in diagnosis (classification).
The goal is to select a minimal subset of genes with a high classification accuracy rate: a gene selection problem.
Outline
Microarray Datasets & Research Objective
Related work & Background
Research method
Experimental result
Conclusion
Related work
Ding, C., & Peng, H. used the Pearson correlation coefficient to eliminate redundant genes from microarray datasets.
  Minimum redundancy feature selection from microarray gene expression data. (2003 & 2005)
Yang et al. proposed using information gain and genetic algorithms for gene selection.
  IG-GA: A Hybrid Filter/Wrapper Method for Feature Selection of Microarray Data. (2010)
Related work
Luo et al. clustered genes into groups and treated genes in the same group as redundant genes.
  Improving the Computational Efficiency of Recursive Cluster Elimination. (2011)
Background knowledge
Information Gain: proposed by Quinlan as the basis for attribute selection in decision trees.
Attributes with larger information gains are better for classification (differentiating between the class labels of the data samples).
Ecological Correlation (Robinson)
Divide the dataset into groups, then use the means of the different groups to calculate the Pearson correlation coefficients.
This reduces the in-group variance and increases the correlation coefficient between attributes.
Example
The Leukemia1 dataset grouped by class labels (0, 1, 2):

gene1    gene2    class
-0.9058  -0.9298  0
 0.8371  -1.3022  0
 1.0694  -0.7826  1
-1.5851  -0.8680  1
-0.1908  -0.6507  2
-1.0578   0.8268  2

Class means:
         μ0       μ1       μ2
gene1  -0.0344  -0.2578  -0.6243
gene2  -1.1160  -0.8253   0.0881

Cor(gene1{μ0, μ1, μ2}, gene2{μ0, μ1, μ2}) = -0.9886
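The class-mean correlation above can be reproduced with a short sketch (NumPy assumed; the values are the toy numbers from the slide):

```python
import numpy as np

# Toy Leukemia1 values from the slide, grouped by class labels (0, 1, 2)
gene1 = np.array([-0.9058, 0.8371, 1.0694, -1.5851, -0.1908, -1.0578])
gene2 = np.array([-0.9298, -1.3022, -0.7826, -0.8680, -0.6507, 0.8268])
labels = np.array([0, 0, 1, 1, 2, 2])

def class_means(x, labels):
    """Replace each class's samples by their mean (ecological aggregation)."""
    return np.array([x[labels == c].mean() for c in np.unique(labels)])

m1, m2 = class_means(gene1, labels), class_means(gene2, labels)
r = np.corrcoef(m1, m2)[0, 1]   # ecological correlation of the class means
```

Correlating the three class means instead of the six raw samples removes the in-group variance, which is what pushes the coefficient to -0.9886 here.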
Support Vector Machine
A classification method by Cortes & Vapnik (1995): find a good hyper-plane to separate samples with different class labels.
[Figure: two candidate hyper-planes a and b with their margin boundaries (a1, a2 and b1, b2) and the support vectors on the margins; since the margin |a1 - a| > |b1 - b|, hyper-plane a is better than hyper-plane b.]
Outline
Microarray Datasets & Research Objective
Related work & Background
Research method
Experimental result
Conclusion
Research method
Data preprocessing
Step I : Gene filtering using IG
Step II : Redundant gene elimination using clustering
Step III : Subset refinement using genetic algorithm
Data preprocessing - Normalization
Normalize the dataset using the Z-score. The Z-score of gene expression Xij is

  Zij = (Xij - x̄j) / Sj

where
- Xij : the expression of gene j on sample i,
- x̄j : the mean of gene j's expression over the different samples,
- Sj : the standard deviation of gene j's expression over the different samples.
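As a sketch of this step (NumPy assumed; per-gene normalization as defined above):

```python
import numpy as np

def z_score_normalize(X):
    """Z-score each gene (column) of an N-by-M expression matrix:
    Z_ij = (X_ij - mean of gene j) / (std. dev. of gene j)."""
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)   # sample standard deviation over the N samples
    return (X - mean) / std

# Two samples of two genes (values from the simplified Prostate table)
X = np.array([[0.022, -0.721],
              [-1.034, 0.331]])
Z = z_score_normalize(X)          # each column now has mean 0
```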
Gene filtering by information gain
Gene filtering
Most of the genes have their IG values equal to 0.
Select the genes with IG greater than 0 as candidate genes.
For example, the Leukemia1 dataset has 5,327 genes; only 263 genes remain after gene filtering with IG.
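A minimal sketch of the filter, assuming the expression values have already been discretized (the slide does not specify the discretization, and the gene values below are hypothetical):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    """Gain(A) = Info(D) - Info_A(D) for one discretized gene A."""
    n = len(labels)
    split = {}
    for v, y in zip(values, labels):
        split.setdefault(v, []).append(y)
    info_a = sum(len(ys) / n * entropy(ys) for ys in split.values())
    return entropy(labels) - info_a

# Keep only genes with IG > 0 (hypothetical discretized expressions)
genes = {"g1": [0, 0, 1, 1], "g2": [0, 1, 0, 1]}
labels = [0, 0, 1, 1]
candidates = [g for g, v in genes.items() if info_gain(v, labels) > 0]
```

Here g1 separates the two classes perfectly (IG = 1) while g2 carries no class information (IG = 0), so only g1 survives the filter.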
Grouping of genes
Grouping of genes
Build the list of candidate genes and set the correlation threshold to 0.8 (strongly positively correlated).
Grouping method: with the first gene on the list as the basis, group each of the remaining genes with the basis gene if their correlation coefficient is greater than 0.8.
Build a gene list and calculate the correlation coefficients:

Gene list        Correlations
Gene 1           Gene 1,2  0.83
Gene 2           Gene 1,3  0.53
Gene 3           Gene 1,4  0.32
Gene 4           Gene 1,5  0.13
Gene 5           ...       ...
...
Eliminate genes from the existing group
Eliminate the genes in the group from the list; repeat the same procedure on the remaining genes until no gene is left on the list.
Cluster 1: {Gene1, Gene2}   (Gene 1,2 correlation 0.83 > 0.8; Genes 1,3 / 1,4 / 1,5 below the threshold)

Remaining list:
Gene 3
Gene 4
Gene 5
Select one gene from each group
Select the gene with the highest IG from each group.
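The grouping procedure above can be sketched as follows (NumPy assumed; the gene IDs and expression values are hypothetical):

```python
import numpy as np

def group_genes(expr, gene_list, threshold=0.8):
    """Greedy grouping: take the first gene on the list as the basis,
    put every remaining gene whose Pearson correlation with the basis
    exceeds the threshold into its group, drop the group from the list,
    and repeat until the list is empty."""
    remaining = list(gene_list)
    groups = []
    while remaining:
        basis = remaining.pop(0)
        group, keep = [basis], []
        for g in remaining:
            r = np.corrcoef(expr[basis], expr[g])[0, 1]
            (group if r > threshold else keep).append(g)
        groups.append(group)
        remaining = keep
    return groups

# Hypothetical expressions: gene 2 is a scaled copy of gene 1 (correlation 1.0)
expr = {1: [1, 2, 3, 4], 2: [2, 4, 6, 8], 3: [4, 1, 3, 2]}
groups = group_genes(expr, [1, 2, 3])
```

From each resulting group, the gene with the highest IG from Step I is then kept as the group's representative.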
Eliminate genes with no classification capability
ANOVA: for datasets with three or more class labels, use ANOVA to test whether the class means are all equal.
• Hypotheses:
  H0: μ1 = μ2 = μ3
  H1: not all μi are equal
Genes whose means do not differ over the different class labels are eliminated.
Eliminate genes
T-test: for datasets with two class labels, the t-test is used to test whether the class means of a gene are different.
Genes with no difference in class means are eliminated.
The significance level α is set to 0.05.
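A sketch of the two-class filter using the pooled two-sample t statistic (plain Python; the slide sets α = 0.05, and the critical value below is the two-tailed value for df = 4, used here only for illustration):

```python
import math

def t_statistic(a, b):
    """Pooled (equal-variance) two-sample t statistic."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    sp = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / (sp * math.sqrt(1 / na + 1 / nb))

CRITICAL = 2.776   # two-tailed t critical value, df = 4, alpha = 0.05

# A gene with well-separated class means is kept; a flat gene is eliminated
keep = abs(t_statistic([1, 2, 3], [7, 8, 9])) > CRITICAL
drop = abs(t_statistic([1.0, 2.0, 3.0], [1.1, 2.1, 2.9])) > CRITICAL
```

In practice one would compute the p-value of the statistic and compare it against α directly rather than hard-coding a critical value.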
Subset refinement using GA
Subset refinement
Encoding: binary encoding
• "0" --- the gene is not selected; "1" --- the gene is selected.
• Example: 011001 --- select the 2nd, 3rd, and 6th genes from the subset.
Chromosome length: the size of the candidate gene subset from Step II.
Population size = 5
Number of iterations = 1,000
Subset refinement
Fitness function: the SVM classification accuracy rate of the gene subset encoded by the chromosome.
Selection method: roulette wheel
• The selection probability is proportional to the fitness value of the chromosome.
Single-point crossover and mutation:
• Crossover rate = 0.7
• Mutation rate = 0.3
Termination condition
Termination condition (any of the following):
• Accuracy rate = 100%
• Number of iterations = 1,000
• The number of iterations exceeds 100 and the accuracy rates of the last 20 iterations are all the same.
Final solution: the chromosome with the largest fitness value in the last iteration.
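A compact sketch of Step III under the parameters above. The SVM-accuracy fitness is replaced by a toy stand-in so the sketch stays self-contained; the target mask and random seed are hypothetical:

```python
import random

def evolve(fitness, n_genes, pop_size=5, iters=1000,
           cx_rate=0.7, mut_rate=0.3, seed=0):
    """Binary-encoded GA: 1 = gene selected. Roulette-wheel selection,
    single-point crossover, bit-flip mutation; stops early at 100% fitness."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    for _ in range(iters):
        fits = [fitness(c) for c in pop]
        if max(fits) >= 1.0:                  # accuracy rate = 100%
            break
        def select():
            r = rng.uniform(0, sum(fits))     # probability proportional to fitness
            acc = 0.0
            for c, f in zip(pop, fits):
                acc += f
                if acc >= r:
                    return c
            return pop[-1]
        nxt = []
        while len(nxt) < pop_size:
            a, b = select()[:], select()[:]
            if rng.random() < cx_rate and n_genes > 1:   # single-point crossover
                p = rng.randrange(1, n_genes)
                a, b = a[:p] + b[p:], b[:p] + a[p:]
            for c in (a, b):
                if rng.random() < mut_rate:              # bit-flip mutation
                    i = rng.randrange(n_genes)
                    c[i] ^= 1
                nxt.append(c)
        pop = nxt[:pop_size]
    fits = [fitness(c) for c in pop]
    return pop[fits.index(max(fits))]

# Toy stand-in for the SVM-accuracy fitness: reward matching a target mask
target = [1, 0, 1, 1, 0, 0]
best = evolve(lambda c: sum(x == t for x, t in zip(c, target)) / len(target),
              n_genes=len(target))
```

In the paper's method the fitness call would train and evaluate an SVM on the genes flagged by the chromosome.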
Outline
Microarray Datasets & Research Objective
Related work & Background
Research method
Experimental result
Conclusion
The datasets
Data set name # of samples # of class labels # of genes
9_Tumors 60 9 5,726
Brain_Tumor1 90 5 5,920
Brain_Tumor2 50 4 10,367
Leukemia1 72 3 5,327
Leukemia2 72 3 11,225
Lung Cancer 203 5 12,600
SRBCT 83 4 2,308
11_Tumors 174 11 12,533
Prostate Tumor 102 2 10,509
DLBCL 77 2 5,469
GEMS : http://www.gems-system.org/
Genes selected in 3 steps
Data Set          # of original genes   IG      Grouping   GA
9_Tumors 5,726 103 25 13
Brain_Tumor1 5,920 185 19 10
Brain_Tumor2 10,367 3,099 19 4
Leukemia1 5,327 263 7 4
Leukemia2 11,225 3,097 6 3
Lung_Cancer 12,600 3,183 36 18
SRBCT 2,308 351 14 7
11_Tumors 12,533 3,483 510 255
Prostate_Tumor 10,509 671 235 119
DLBCL 5,469 315 169 84
Comparison with other methods
Comparison of our method (Hybrid) with GEPUBLIC, PAM, and IG-GA; each entry is the accuracy rate (%) with the number of selected genes in parentheses.
Data Set GEPUBLIC PAM IG-GA Hybrid
9_Tumors 66.67(19) 43.33 (47) 85.00 (52) 71.67(13)
Brain_Tumor1 84.44(30) 85.56 (42) 93.33 (244) 91.12(10)
Brain_Tumor2 80.00(15) 66.00 (25) 88.00 (489) 92.00(4)
Leukemia1 97.22(11) 93.06 (11) 100.00 (82) 97.23(4)
Leukemia2 91.67(31) 91.67 (52) 98.61 (782) 100.00(3)
Lung_Cancer 94.58(29) 93.60 (75) 95.57 (2101) 97.05(18)
SRBCT 98.80(26) 98.80 (41) 100.00 (56) 100.00(7)
11_Tumors 86.21(87) 81.61 (203) 92.53 (479) 91.95(255)
Prostate_Tumor 95.10(4) 93.14 (13) 96.08 (343) 94.12(119)
DLBCL 97.40(13) 80.52 (70) 100.00 (107) 97.40(84)
Outline
Microarray Datasets & Research Objective
Related work & Background
Research method
Experimental result
Conclusion
Conclusion
Each step of our method effectively removes noisy genes left over from the previous step.
The hybrid method selects fewer genes with a higher classification accuracy rate.
The hybrid method still needs improvement on 2-class microarray datasets.
Q & A
Thank you for listening.
Information Gain
For a dataset D with m different class labels, Info(D) measures how evenly the classes of D are distributed:

  Info(D) = - Σ_{i=1}^{m} p_i log2(p_i)

where p_i is the probability that a sample in D belongs to class i.

Info_A(D): the equivalent Info (weighted sum) of the subsets of D when D is split using attribute A. If A has v different values {a1, a2, ..., av}, then D is split into {D1, D2, ..., Dv}, where Dj contains the samples with A equal to aj:

  Info_A(D) = Σ_{j=1}^{v} (|Dj| / |D|) · Info(Dj)

Gain(A) = Info(D) - Info_A(D)
Data Mining: Concepts and Techniques
Attribute Selection: Information Gain
• Class P: buys_computer = "yes" : 9 samples
• Class N: buys_computer = "no" : 5 samples

Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

Splitting on age:

age     P  N
<=30    2  3
31…40   4  0
>40     3  2

Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694
Gain(age) = Info(D) - Info_age(D) = 0.246

Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048.

The training data:

age     income  student  credit     Buy
<=30    high    no       fair       no
<=30    high    no       excellent  no
31…40   high    no       fair       yes
>40     medium  no       fair       yes
>40     low     yes      fair       yes
>40     low     yes      excellent  no
31…40   low     yes      excellent  yes
<=30    medium  no       fair       no
<=30    low     yes      fair       yes
>40     medium  yes      fair       yes
<=30    medium  yes      excellent  yes
31…40   medium  no       excellent  yes
31…40   high    yes      fair       yes
>40     medium  no       excellent  no
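The buys_computer numbers above can be checked with a few lines of plain Python:

```python
import math

def info(counts):
    """Info of a class distribution: -sum p_i * log2(p_i), skipping empty classes."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

info_d = info([9, 5])                       # Info(D) = I(9,5)
splits = [[2, 3], [4, 0], [3, 2]]           # age: <=30, 31...40, >40
info_age = sum(sum(g) / 14 * info(g) for g in splits)
gain_age = info_d - info_age                # Gain(age)
```

Since Gain(age) = 0.246 is the largest of the four gains, the decision tree splits on age first.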