A hybrid method for gene selection in microarray datasets

Yungho Leu, Chien-Pan Lee and Ai-Chen Chang

National Taiwan University of Science and Technology

2014/10/22



Outline

Microarray Datasets & Research Objective

Related work & Background

Research method

Experimental result

Conclusion


Microarray datasets

Microarray technology can be used to measure the expression levels of thousands of genes at the same time.

A microarray dataset records the gene expressions of different samples in a table.

Mobile Computing & Data Mining Lab.


Microarray datasets

N : number of samples (40~200)
M : number of genes (2,000~30,000), with M >> N
g_{i,j} : expression level of gene j in sample i
Class label : the class label of the sample

The table has one row per sample (N rows) and one column per gene (M columns), followed by the class label.

The Prostate cancer dataset (simplified):

Sample   Gene1    Gene2    Class
S1        0.022   -0.721   0
S2       -1.034    0.331   0
…         …        …       …
Sj-1     -0.212    0.123   1
Sj        0.542    0.431   1

Class: 0 = absent, 1 = present


Research objective

M >> N poses challenges for diagnosis (or classification).

Objective: to select a minimal subset of genes that achieves a high classification accuracy rate.

This is a gene selection problem.


Related work

Ding, C., & Peng, H. used the Pearson correlation coefficient to eliminate redundant genes from microarray datasets.
Minimum redundancy feature selection from microarray gene expression data. (2003 & 2005)

Yang, et al. proposed using information gain and genetic algorithms for gene selection.
IG-GA: A Hybrid Filter/Wrapper Method for Feature Selection of Microarray Data. (2010)

Luo, et al. clustered genes into groups and treated genes in the same group as redundant genes.
Improving the Computational Efficiency of Recursive Cluster Elimination. (2011)

Background knowledge

Information gain: proposed by Quinlan as the basis for attribute selection in decision trees.

Attributes with larger information gains are better for classification (or differentiating between different class labels of data samples).


Ecological correlation (Robinson)

Divide the dataset into groups and use the group means to calculate the Pearson correlation coefficient.

Averaging within groups removes the in-group variance, which increases the magnitude of the correlation coefficient between attributes.

Example

Leukemia1 dataset grouped by class labels (0, 1, 2)

gene1     gene2     class
-0.9058   -0.9298   0
 0.8371   -1.3022   0
 1.0694   -0.7826   1
-1.5851   -0.8680   1
-0.1908   -0.6507   2
-1.0578    0.8268   2

Class means:

         μ0        μ1        μ2
gene1   -0.0344   -0.2578   -0.6243
gene2   -1.1160   -0.8253    0.0881

Cor(gene1{μ0, μ1, μ2}, gene2{μ0, μ1, μ2}) = -0.9886
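The example can be checked with a few lines of Python. This is only a sketch using the six rows shown on the slide; the helper names `pearson` and `class_means` are illustrative and are not taken from the authors' implementation.

```python
from math import sqrt
from statistics import mean

# Rows from the slide: (gene1, gene2, class label).
rows = [
    (-0.9058, -0.9298, 0),
    (0.8371, -1.3022, 0),
    (1.0694, -0.7826, 1),
    (-1.5851, -0.8680, 1),
    (-0.1908, -0.6507, 2),
    (-1.0578, 0.8268, 2),
]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def class_means(values, labels):
    """Mean of `values` within each class, ordered by class label."""
    classes = sorted(set(labels))
    return [mean(v for v, c in zip(values, labels) if c == k) for k in classes]

gene1 = [r[0] for r in rows]
gene2 = [r[1] for r in rows]
labels = [r[2] for r in rows]

g1_means = class_means(gene1, labels)   # mu0, mu1, mu2 for gene1
g2_means = class_means(gene2, labels)   # mu0, mu1, mu2 for gene2

ecological_cor = pearson(g1_means, g2_means)  # correlation of the class means
sample_cor = pearson(gene1, gene2)            # ordinary sample-level correlation

print(round(ecological_cor, 4))
print(round(sample_cor, 4))
```

On these six rows the correlation of the class means is much stronger in magnitude than the sample-level correlation, which is the effect the ecological correlation relies on.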


Support Vector Machine

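A classification method proposed by Cortes & Vapnik (1995).

It finds a good hyper-plane to separate samples with different class labels.

[Figure: two candidate hyper-planes a and b with their margin boundaries (a1, a2) and (b1, b2); the samples lying on the boundaries are the support vectors.]

|a1 - a| > |b1 - b|: hyper-plane a has the larger margin, so hyper-plane a is better than hyper-plane b.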
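The margin comparison can be made concrete with a small sketch. It does not train an SVM; it only measures the margin of two given hyper-planes. The toy points and the two planes are hypothetical and chosen purely for illustration.

```python
from math import sqrt

def margin(w, b, points):
    """Smallest distance from any point to the hyper-plane w.x + b = 0."""
    norm = sqrt(sum(wi * wi for wi in w))
    return min(abs(sum(wi * xi for wi, xi in zip(w, p)) + b) / norm for p in points)

def separates(w, b, points, labels):
    """True if every +1 sample lies on the positive side and every -1 on the negative side."""
    return all((sum(wi * xi for wi, xi in zip(w, p)) + b) * y > 0
               for p, y in zip(points, labels))

# Hypothetical 2-D toy samples (not from the slides).
points = [(1.0, 1.0), (2.0, 0.0), (4.0, 4.0), (5.0, 3.0)]
labels = [-1, -1, +1, +1]

plane_a = ((1.0, 1.0), -5.0)   # x + y = 5
plane_b = ((1.0, 0.0), -2.5)   # x = 2.5

margin_a = margin(*plane_a, points)
margin_b = margin(*plane_b, points)

# Both planes separate the classes, but the one with the larger margin is preferred.
better = "a" if margin_a > margin_b else "b"
print(better, round(margin_a, 3), round(margin_b, 3))
```

An actual SVM searches over all separating hyper-planes for the one that maximizes this margin, rather than comparing two fixed candidates.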


Research method

Data preprocessing

Step I : Gene filtering using information gain (IG)

Step II : Redundant gene elimination using clustering

Step III : Subset refinement using genetic algorithm (GA)

Data preprocessing - Normalization

Normalize the dataset using the Z-score.

Z-score of gene expression $X_{ij}$:

$$Z_{ij} = \frac{X_{ij} - \bar{x}_i}{S_i}$$

where

‐ $X_{ij}$ : the expression level of gene i in sample j.

‐ $\bar{x}_i$ : mean of gene i’s expression over the samples.

‐ $S_i$ : standard deviation of gene i’s expression over the samples.
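A minimal sketch of this normalization in plain Python follows. It assumes that S_i is the sample standard deviation (n - 1 denominator), which the slides do not specify; the function name `zscore_genes` and the raw values are hypothetical.

```python
from statistics import mean, stdev

def zscore_genes(expr):
    """Z-score each gene (row) across samples: Z_ij = (X_ij - mean_i) / S_i.

    `expr[i][j]` is the expression of gene i in sample j.
    Assumption: S_i is the sample standard deviation (n - 1 denominator).
    """
    normalized = []
    for row in expr:
        m, s = mean(row), stdev(row)
        # A constant gene has S_i = 0; map it to zeros instead of dividing by zero.
        normalized.append([(x - m) / s if s > 0 else 0.0 for x in row])
    return normalized

# Hypothetical raw expression values: 2 genes x 4 samples.
raw = [
    [120.0, 80.0, 100.0, 140.0],
    [5.0, 5.0, 5.0, 5.0],
]
z = zscore_genes(raw)
print(z[0])
```

After normalization every non-constant gene has mean 0 and standard deviation 1 across the samples, so genes measured on different scales become comparable.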


Gene filtering by information gain


Gene filtering

Most genes have an IG value equal to 0.

Genes with an IG greater than 0 are selected as candidate genes.

For example, the Leukemia1 dataset has 5,327 genes; only 263 genes are left after gene filtering with IG.
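The filtering rule can be sketched as follows. The slides do not say how the continuous expression values were discretized before computing IG, so this sketch makes an explicit assumption: each Z-scored gene is split into two bins at 0. The function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Info(D): entropy of the class-label distribution."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels, threshold=0.0):
    """Information gain of the binary split `value <= threshold`.

    Assumption (not from the slides): each Z-scored gene is split at 0.
    """
    low = [y for v, y in zip(values, labels) if v <= threshold]
    high = [y for v, y in zip(values, labels) if v > threshold]
    n = len(labels)
    info_split = sum(len(part) / n * entropy(part) for part in (low, high) if part)
    return entropy(labels) - info_split

def filter_genes(expr, labels, eps=1e-12):
    """Return the indices of genes whose IG is greater than 0 (up to rounding)."""
    return [i for i, row in enumerate(expr) if info_gain(row, labels) > eps]

# Hypothetical Z-scored data: 3 genes x 4 samples, two classes.
labels = [0, 0, 1, 1]
expr = [
    [-1.2, -0.8, 0.9, 1.1],   # separates the two classes
    [-1.0, 1.0, -1.0, 1.0],   # independent of the class label
    [0.5, 0.5, 0.5, 0.5],     # constant: carries no information
]
candidates = filter_genes(expr, labels)
print(candidates)
```

With a different discretization the IG values change, so the number of surviving genes reported on the slide should not be expected from this particular split rule.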


Grouping of genes

Gene list and correlation threshold:
• Build the list of candidate genes.
• Set threshold = 0.8 (strongly positively correlated).

Grouping method: with the first gene on the list as the basis, group each remaining gene with the basis gene if their correlation coefficient is greater than 0.8.

Build a gene list:

Gene 1, Gene 2, Gene 3, Gene 4, Gene 5, ...

Calculate correlation coefficients:

Gene ID     Cor.
Gene 1,2    0.83
Gene 1,3    0.53
Gene 1,4    0.32
Gene 1,5    0.13
...         ...

Eliminate genes from the existing group

Eliminate the genes in the group from the list; repeat the same procedure on the remaining genes until no gene is left on the list.

Gene ID     Cor.
Gene 1,2    0.83
Gene 1,3    0.53
Gene 1,4    0.32
Gene 1,5    0.13

Only Gene 2 exceeds the threshold (0.83 > 0.8), so Cluster 1 = {Gene 1, Gene 2}.

Remaining gene list: Gene 3, Gene 4, Gene 5

Select one gene from each group

Select the gene with the highest IG from each group.
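The grouping and selection steps can be sketched as a short greedy procedure. The profile vectors and IG values are hypothetical; in particular, whether the correlation is computed on class means (ecological correlation) or on raw samples is an assumption left to the caller, who supplies the vectors.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def group_genes(profiles, threshold=0.8):
    """Greedy grouping: the first gene on the list is the basis; every later
    gene whose correlation with the basis exceeds `threshold` joins its group.
    Grouped genes leave the list and the procedure repeats on the rest."""
    remaining = list(profiles)           # gene names, in list order
    groups = []
    while remaining:
        basis, rest = remaining[0], remaining[1:]
        group = [basis] + [g for g in rest
                           if pearson(profiles[basis], profiles[g]) > threshold]
        groups.append(group)
        remaining = [g for g in rest if g not in group]
    return groups

def pick_representatives(groups, ig):
    """Keep the gene with the highest information gain from each group."""
    return [max(group, key=lambda g: ig[g]) for group in groups]

# Hypothetical profiles (e.g. class means) and IG values.
profiles = {
    "gene1": [1.0, 2.0, 3.0],
    "gene2": [1.1, 2.1, 2.9],
    "gene3": [3.0, 1.0, 2.0],
    "gene4": [2.9, 1.2, 2.1],
    "gene5": [1.0, 3.0, 1.0],
}
ig = {"gene1": 0.4, "gene2": 0.9, "gene3": 0.7, "gene4": 0.2, "gene5": 0.1}

groups = group_genes(profiles)
selected = pick_representatives(groups, ig)
print(groups)
print(selected)
```

Because each group is built around the first gene still on the list, the result depends on the order of the gene list; the slides do not state how the list is ordered.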

Eliminate genes with no classification capability

ANOVA :

For datasets with more than two class labels, use ANOVA to test whether the class means are all equal.

• Hypothesis:

$$H_0: \mu_1 = \mu_2 = \mu_3 \qquad H_1: \mu_i \text{ are not all equal}$$

Genes whose means do not differ across the class labels are eliminated.

Eliminate genes

T-test

The t-test is used to test whether the class means of a gene are different.

Genes whose class means do not differ are eliminated.

The significance level α is set to 0.05.
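As a rough illustration of the ANOVA test, the sketch below computes the one-way F statistic for one gene from its expression values grouped by class. It stops at the statistic: turning it into a p-value for comparison with α = 0.05 needs the F distribution (for instance `scipy.stats.f.sf`, which is not used here). The data and the function name `anova_f` are hypothetical.

```python
from statistics import mean

def anova_f(groups):
    """One-way ANOVA F statistic for a list of per-class value lists.

    Assumes at least two classes and some within-class variation.
    """
    k = len(groups)                              # number of classes
    n = sum(len(g) for g in groups)              # total number of samples
    grand = mean(x for g in groups for x in g)   # grand mean
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical expression values of one gene, grouped by class label.
expression_by_class = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat = anova_f(expression_by_class)
print(round(f_stat, 3))
```

A large F value means the class means differ more than the within-class spread would explain. With exactly two classes, this F statistic equals the square of the pooled-variance t statistic, so the two tests agree in that case.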


Subset refinement using GA


Subset refinement

Encoding : binary encoding

• "0": gene not selected; "1": gene selected.

• Example: 011001 selects the 2nd, 3rd, and 6th genes from the subset.

Chromosome length: the number of genes in the candidate subset from Step II.

Population size = 5

Number of iterations = 1,000
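The encoding can be illustrated with a small decoding helper. The gene names are hypothetical placeholders, and `decode` is an illustrative name rather than part of the authors' code.

```python
def decode(chromosome, genes):
    """Return the genes whose position holds a '1' in the binary chromosome."""
    if len(chromosome) != len(genes):
        raise ValueError("chromosome length must equal the number of candidate genes")
    return [g for bit, g in zip(chromosome, genes) if bit == "1"]

candidate_genes = ["g1", "g2", "g3", "g4", "g5", "g6"]  # hypothetical names
selected = decode("011001", candidate_genes)
print(selected)
```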


Subset refinement

Fitness function : the SVM classification accuracy rate obtained with the genes selected by the chromosome.

Selection method: roulette wheel
• Selection probability is proportional to the fitness value of the chromosome.

Single-point crossover and mutation :
• Crossover rate = 0.7
• Mutation rate = 0.3

Termination condition

Termination condition (any of the following):
• Accuracy rate = 100%
• Number of iterations = 1,000
• The number of iterations is greater than 100 and the accuracy rates of the last 20 iterations are all the same.

Final solution : the chromosome with the largest fitness value in the last iteration.
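The genetic algorithm described above can be sketched in plain Python. Several details are assumptions rather than facts from the slides: the fitness here is a toy stand-in (`toy_fitness`) instead of SVM accuracy, the mutation rate is applied per chromosome as a single bit flip, and "accuracy of an iteration" is taken to mean the best fitness in the population.

```python
import random

def roulette_select(population, fitnesses, rng):
    """Pick one chromosome with probability proportional to its fitness."""
    if sum(fitnesses) == 0:
        return rng.choice(population)
    return rng.choices(population, weights=fitnesses, k=1)[0]

def run_ga(n_genes, fitness, pop_size=5, max_iter=1000,
           crossover_rate=0.7, mutation_rate=0.3, seed=0):
    """Return (best chromosome, its fitness, iterations used)."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    history = []
    for iteration in range(1, max_iter + 1):
        fitnesses = [fitness(c) for c in population]
        best = max(fitnesses)
        history.append(best)
        # Stop on 100% accuracy, on the iteration limit, or when more than
        # 100 iterations have run and the last 20 best values are identical.
        stalled = iteration > 100 and len(set(history[-20:])) == 1
        if best >= 1.0 or iteration == max_iter or stalled:
            break
        next_pop = []
        while len(next_pop) < pop_size:
            p1 = roulette_select(population, fitnesses, rng)
            p2 = roulette_select(population, fitnesses, rng)
            c1, c2 = p1[:], p2[:]
            if n_genes > 1 and rng.random() < crossover_rate:
                cut = rng.randint(1, n_genes - 1)      # single-point crossover
                c1, c2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            for child in (c1, c2):
                if rng.random() < mutation_rate:       # assumed: one bit flip
                    pos = rng.randrange(n_genes)
                    child[pos] = 1 - child[pos]
            next_pop.extend([c1, c2])
        population = next_pop[:pop_size]
    # Final solution: fittest chromosome of the last evaluated population.
    return population[fitnesses.index(best)], best, iteration

# Hypothetical stand-in for SVM accuracy: fraction of bits matching a target.
TARGET = [0, 1, 1, 0, 0, 1]

def toy_fitness(chromosome):
    return sum(a == b for a, b in zip(chromosome, TARGET)) / len(TARGET)

winner, score, iterations = run_ga(len(TARGET), toy_fitness)
print(winner, score, iterations)
```

To follow the slides more closely, `toy_fitness` would be replaced by a function that trains an SVM on the genes marked "1" and returns its accuracy rate.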


The datasets

Data set name     # of samples   # of class labels   # of genes
9_Tumors          60             9                   5,726
Brain_Tumor1      90             5                   5,920
Brain_Tumor2      50             4                   10,367
Leukemia1         72             3                   5,327
Leukemia2         72             3                   11,225
Lung_Cancer       203            5                   12,600
SRBCT             83             4                   2,308
11_Tumors         174            11                  12,533
Prostate_Tumor    102            2                   10,509
DLBCL             77             2                   5,469

GEMS : http://www.gems-system.org/

Genes selected in 3 steps

Number of genes remaining after each step:

Data Set          # of original genes   IG      Grouping   GA
9_Tumors          5,726                 103     25         13
Brain_Tumor1      5,920                 185     19         10
Brain_Tumor2      10,367                3,099   19         4
Leukemia1         5,327                 263     7          4
Leukemia2         11,225                3,097   6          3
Lung_Cancer       12,600                3,183   36         18
SRBCT             2,308                 351     14         7
11_Tumors         12,533                3,483   510        255
Prostate_Tumor    10,509                671     235        119
DLBCL             5,469                 315     169        84

Comparison with other methods

Comparison of our method (Hybrid) with GEPUBLIC, PAM, and IG-GA. Each entry is the classification accuracy rate in %, with the number of selected genes in parentheses.

Data Set          GEPUBLIC     PAM           IG-GA           Hybrid
9_Tumors          66.67 (19)   43.33 (47)    85.00 (52)      71.67 (13)
Brain_Tumor1      84.44 (30)   85.56 (42)    93.33 (244)     91.12 (10)
Brain_Tumor2      80.00 (15)   66.00 (25)    88.00 (489)     92.00 (4)
Leukemia1         97.22 (11)   93.06 (11)    100.00 (82)     97.23 (4)
Leukemia2         91.67 (31)   91.67 (52)    98.61 (782)     100.00 (3)
Lung_Cancer       94.58 (29)   93.60 (75)    95.57 (2101)    97.05 (18)
SRBCT             98.80 (26)   98.80 (41)    100.00 (56)     100.00 (7)
11_Tumors         86.21 (87)   81.61 (203)   92.53 (479)     91.95 (255)
Prostate_Tumor    95.10 (4)    93.14 (13)    96.08 (343)     94.12 (119)
DLBCL             97.40 (13)   80.52 (70)    100.00 (107)    97.40 (84)


Conclusion

Each step of our method effectively removes noisy genes remaining from the previous step.

The hybrid method selects fewer genes with a higher classification accuracy rate.

The hybrid method needs further improvement on 2-class microarray datasets.

Q & A

Thank you for listening.

Information Gain

For a dataset D with m different class labels, Info(D) measures how evenly the classes of D are distributed:

$$Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$

where $p_i$ is the probability that a sample in D belongs to class i.

$Info_A(D)$ : the equivalent Info (weighted sum) of the subsets of D, where D is split into subsets using attribute A:

$$Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times I(D_j)$$

where A : {a1, a2, …, av} (attribute A has v different values), D is split into {D1, D2, …, Dv}, and Dj contains the samples with A equal to aj.

Gain(A) :

$$Gain(A) = Info(D) - Info_A(D)$$

Attribute Selection: Information Gain

(Example from Data Mining: Concepts and Techniques)

• Class P: buys_computer = “yes” : 9
• Class N: buys_computer = “no” : 5

age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31…40    high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31…40    low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31…40    medium   no        excellent       yes
31…40    high     yes       fair            yes
>40      medium   no        excellent       no

age      P   N
<=30     2   3
31…40    4   0
>40      3   2

$$Info(D) = I(9,5) = -\frac{9}{14}\log_2\left(\frac{9}{14}\right) - \frac{5}{14}\log_2\left(\frac{5}{14}\right) = 0.940$$

$$Info_{age}(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694$$

$$Gain(age) = Info(D) - Info_{age}(D) = 0.246$$

$$Gain(income) = 0.029, \quad Gain(student) = 0.151, \quad Gain(credit\_rating) = 0.048$$
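The worked example can be recomputed directly from the 14 rows of the table. This is a sketch with illustrative names (`info`, `gain`); the data are the rows shown on the slide.

```python
from collections import Counter, defaultdict
from math import log2

# (age, income, student, credit_rating, buys_computer), as on the slide.
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]
ATTRIBUTES = ["age", "income", "student", "credit_rating"]

def info(labels):
    """Info(D) = -sum(p_i * log2(p_i)) over the class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(rows, attr_index):
    """Gain(A) = Info(D) - Info_A(D) for the attribute in column `attr_index`."""
    labels = [r[-1] for r in rows]
    partitions = defaultdict(list)          # attribute value -> class labels
    for r in rows:
        partitions[r[attr_index]].append(r[-1])
    info_a = sum(len(p) / len(rows) * info(p) for p in partitions.values())
    return info(labels) - info_a

info_d = info([r[-1] for r in ROWS])
gains = {name: gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
best_attribute = max(gains, key=gains.get)

for name, g in gains.items():
    print(f"Gain({name}) = {g:.4f}")
```

The full-precision values agree with the slide to within about 0.001; the small differences come from the slide subtracting already rounded intermediate values. Age has the largest gain, so it would be chosen as the splitting attribute.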