TRANSCRIPT
A hybrid method for gene selection in microarray datasets
Yungho Leu, Chien-Pan Lee and Ai-Chen Chang
National Taiwan University of Science and Technology
2014/10/22
Outline
Microarray Datasets & Research Objective
Related work & Background
Research method
Experimental result
Conclusion
Microarray datasets
Microarray technology can be used to measure the expression levels of thousands of genes at the same time.
A microarray dataset records the gene expressions of different samples in a table.
Mobile Computing & Data Mining Lab.
Microarray datasets
N : number of samples (40~200); M : number of genes (2,000~30,000); gi,j : expression level of gene j at sample i
Class label : the class label of the sample
(M >> N)

The Prostate cancer dataset (simplified), with N samples (rows), M gene columns, and a class label:

Sample  Gene1   Gene2   Class
S1       0.022  -0.721  0
S2      -1.034   0.331  0
...        ...     ...  ...
Sj-1    -0.212   0.123  1
Sj       0.542   0.431  1

Class label: 0 = Absent, 1 = Present
Research objective
M >> N poses challenges in diagnosis (classification).
The goal is to select a minimal subset of genes with a high classification accuracy rate: a gene selection problem.
Outline
Microarray Datasets & Research Objective
Related work & Background
Research method
Experimental result
Conclusion
Related work
Ding, C., & Peng, H. used the Pearson correlation coefficient to eliminate redundant genes from microarray datasets.
  Minimum redundancy feature selection from microarray gene expression data. (2003 & 2005)
Yang et al. proposed using information gain and genetic algorithms for gene selection.
  IG-GA: A Hybrid Filter/Wrapper Method for Feature Selection of Microarray Data. (2010)
Related work
Luo et al. clustered genes into groups and treated genes in the same group as redundant genes.
  Improving the Computational Efficiency of Recursive Cluster Elimination. (2011)
Background knowledge
Information Gain: proposed by Quinlan as the basis for attribute selection in decision trees.
Attributes with larger information gains are better for classification (differentiating between the class labels of the data samples).
Ecological Correlation (Robinson)
Divide the dataset into groups, then use the means of the different groups to calculate the Pearson correlation coefficients.
This reduces the in-group variance and increases the correlation coefficient between attributes.
Example
The Leukemia1 dataset grouped by class labels (0, 1, 2):

gene1    gene2    class
-0.9058  -0.9298  0
 0.8371  -1.3022  0
 1.0694  -0.7826  1
-1.5851  -0.8680  1
-0.1908  -0.6507  2
-1.0578   0.8268  2

Class means:
         μ0       μ1       μ2
gene1  -0.0344  -0.2578  -0.6243
gene2  -1.1160  -0.8253   0.0881

Cor(gene1{μ0, μ1, μ2}, gene2{μ0, μ1, μ2}) = -0.9886
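The class-mean correlation above can be reproduced with a short sketch (NumPy assumed; the values are the toy numbers from the slide):

```python
import numpy as np

# Toy Leukemia1 values from the slide, grouped by class labels (0, 1, 2)
gene1 = np.array([-0.9058, 0.8371, 1.0694, -1.5851, -0.1908, -1.0578])
gene2 = np.array([-0.9298, -1.3022, -0.7826, -0.8680, -0.6507, 0.8268])
labels = np.array([0, 0, 1, 1, 2, 2])

def class_means(x, labels):
    """Replace each class's samples by their mean (ecological aggregation)."""
    return np.array([x[labels == c].mean() for c in np.unique(labels)])

m1, m2 = class_means(gene1, labels), class_means(gene2, labels)
r = np.corrcoef(m1, m2)[0, 1]   # ecological correlation of the class means
```

Correlating the three class means instead of the six raw samples removes the in-group variance, which is what pushes the coefficient to -0.9886 here.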
Support Vector Machine
A classification method by Cortes & Vapnik (1995): find a good hyper-plane to separate samples with different class labels.
[Figure: two candidate hyper-planes a and b with their margin boundaries (a1, a2 and b1, b2) and the support vectors on the margins; since the margin |a1 - a| > |b1 - b|, hyper-plane a is better than hyper-plane b.]
Outline
Microarray Datasets & Research Objective
Related work & Background
Research method
Experimental result
Conclusion
Research method
Data preprocessing
Step I : Gene filtering using IG
Step II : Redundant gene elimination using clustering
Step III : Subset refinement using genetic algorithm
Data preprocessing - Normalization
Normalize the dataset using the Z-score. The Z-score of gene expression Xij is

  Zij = (Xij - x̄j) / Sj

where
- Xij : the expression of gene j on sample i,
- x̄j : the mean of gene j's expression over the different samples,
- Sj : the standard deviation of gene j's expression over the different samples.
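As a sketch of this step (NumPy assumed; per-gene normalization as defined above):

```python
import numpy as np

def z_score_normalize(X):
    """Z-score each gene (column) of an N-by-M expression matrix:
    Z_ij = (X_ij - mean of gene j) / (std. dev. of gene j)."""
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)   # sample standard deviation over the N samples
    return (X - mean) / std

# Two samples of two genes (values from the simplified Prostate table)
X = np.array([[0.022, -0.721],
              [-1.034, 0.331]])
Z = z_score_normalize(X)          # each column now has mean 0
```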
Gene filtering by information gain
Gene filtering
Most of the genes have their IG values equal to 0.
Select the genes with IG greater than 0 as candidate genes.
For example, the Leukemia1 dataset has 5,327 genes; only 263 genes remain after gene filtering with IG.
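A minimal sketch of the filter, assuming the expression values have already been discretized (the slide does not specify the discretization, and the gene values below are hypothetical):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    """Gain(A) = Info(D) - Info_A(D) for one discretized gene A."""
    n = len(labels)
    split = {}
    for v, y in zip(values, labels):
        split.setdefault(v, []).append(y)
    info_a = sum(len(ys) / n * entropy(ys) for ys in split.values())
    return entropy(labels) - info_a

# Keep only genes with IG > 0 (hypothetical discretized expressions)
genes = {"g1": [0, 0, 1, 1], "g2": [0, 1, 0, 1]}
labels = [0, 0, 1, 1]
candidates = [g for g, v in genes.items() if info_gain(v, labels) > 0]
```

Here g1 separates the two classes perfectly (IG = 1) while g2 carries no class information (IG = 0), so only g1 survives the filter.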
Grouping of genes
Grouping of genes
Build the list of candidate genes and set the correlation threshold to 0.8 (strongly positively correlated).
Grouping method: with the first gene on the list as the basis, group each of the remaining genes with the basis gene if their correlation coefficient is greater than 0.8.
Build a gene list and calculate the correlation coefficients:

Gene list        Correlations
Gene 1           Gene 1,2  0.83
Gene 2           Gene 1,3  0.53
Gene 3           Gene 1,4  0.32
Gene 4           Gene 1,5  0.13
Gene 5           ...       ...
...
Eliminate genes from the existing group
Eliminate the genes in the group from the list; repeat the same procedure on the remaining genes until no gene is left on the list.
Cluster 1: {Gene1, Gene2}   (Gene 1,2 correlation 0.83 > 0.8; Genes 1,3 / 1,4 / 1,5 below the threshold)

Remaining list:
Gene 3
Gene 4
Gene 5
Select one gene from each group
Select the gene with the highest IG from each group.
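The grouping procedure above can be sketched as follows (NumPy assumed; the gene IDs and expression values are hypothetical):

```python
import numpy as np

def group_genes(expr, gene_list, threshold=0.8):
    """Greedy grouping: take the first gene on the list as the basis,
    put every remaining gene whose Pearson correlation with the basis
    exceeds the threshold into its group, drop the group from the list,
    and repeat until the list is empty."""
    remaining = list(gene_list)
    groups = []
    while remaining:
        basis = remaining.pop(0)
        group, keep = [basis], []
        for g in remaining:
            r = np.corrcoef(expr[basis], expr[g])[0, 1]
            (group if r > threshold else keep).append(g)
        groups.append(group)
        remaining = keep
    return groups

# Hypothetical expressions: gene 2 is a scaled copy of gene 1 (correlation 1.0)
expr = {1: [1, 2, 3, 4], 2: [2, 4, 6, 8], 3: [4, 1, 3, 2]}
groups = group_genes(expr, [1, 2, 3])
```

From each resulting group, the gene with the highest IG from Step I is then kept as the group's representative.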
Eliminate genes with no classification capability
ANOVA: for datasets with three or more class labels, use ANOVA to test whether the class means are all equal.
• Hypotheses:
  H0: μ1 = μ2 = μ3
  H1: not all μi are equal
Genes whose means do not differ over the different class labels are eliminated.
Eliminate genes
T-test: for datasets with two class labels, the t-test is used to test whether the class means of a gene are different.
Genes with no difference in class means are eliminated.
The significance level α is set to 0.05.
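A sketch of the two-class filter using the pooled two-sample t statistic (plain Python; the slide sets α = 0.05, and the critical value below is the two-tailed value for df = 4, used here only for illustration):

```python
import math

def t_statistic(a, b):
    """Pooled (equal-variance) two-sample t statistic."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    sp = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / (sp * math.sqrt(1 / na + 1 / nb))

CRITICAL = 2.776   # two-tailed t critical value, df = 4, alpha = 0.05

# A gene with well-separated class means is kept; a flat gene is eliminated
keep = abs(t_statistic([1, 2, 3], [7, 8, 9])) > CRITICAL
drop = abs(t_statistic([1.0, 2.0, 3.0], [1.1, 2.1, 2.9])) > CRITICAL
```

In practice one would compute the p-value of the statistic and compare it against α directly rather than hard-coding a critical value.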
Subset refinement using GA
Subset refinement
Encoding: binary encoding
• "0" --- the gene is not selected; "1" --- the gene is selected.
• Example: 011001 --- select the 2nd, 3rd, and 6th genes from the subset.
Chromosome length: the size of the candidate gene subset from Step II.
Population size = 5
Number of iterations = 1,000
Subset refinement
Fitness function: the SVM classification accuracy rate of the gene subset encoded by the chromosome.
Selection method: roulette wheel
• The selection probability is proportional to the fitness value of the chromosome.
Single-point crossover and mutation:
• Crossover rate = 0.7
• Mutation rate = 0.3
Termination condition
Termination condition (any of the following):
• Accuracy rate = 100%
• Number of iterations = 1,000
• The number of iterations exceeds 100 and the accuracy rates of the last 20 iterations are all the same.
Final solution: the chromosome with the largest fitness value in the last iteration.
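A compact sketch of Step III under the parameters above. The SVM-accuracy fitness is replaced by a toy stand-in so the sketch stays self-contained; the target mask and random seed are hypothetical:

```python
import random

def evolve(fitness, n_genes, pop_size=5, iters=1000,
           cx_rate=0.7, mut_rate=0.3, seed=0):
    """Binary-encoded GA: 1 = gene selected. Roulette-wheel selection,
    single-point crossover, bit-flip mutation; stops early at 100% fitness."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    for _ in range(iters):
        fits = [fitness(c) for c in pop]
        if max(fits) >= 1.0:                  # accuracy rate = 100%
            break
        def select():
            r = rng.uniform(0, sum(fits))     # probability proportional to fitness
            acc = 0.0
            for c, f in zip(pop, fits):
                acc += f
                if acc >= r:
                    return c
            return pop[-1]
        nxt = []
        while len(nxt) < pop_size:
            a, b = select()[:], select()[:]
            if rng.random() < cx_rate and n_genes > 1:   # single-point crossover
                p = rng.randrange(1, n_genes)
                a, b = a[:p] + b[p:], b[:p] + a[p:]
            for c in (a, b):
                if rng.random() < mut_rate:              # bit-flip mutation
                    i = rng.randrange(n_genes)
                    c[i] ^= 1
                nxt.append(c)
        pop = nxt[:pop_size]
    fits = [fitness(c) for c in pop]
    return pop[fits.index(max(fits))]

# Toy stand-in for the SVM-accuracy fitness: reward matching a target mask
target = [1, 0, 1, 1, 0, 0]
best = evolve(lambda c: sum(x == t for x, t in zip(c, target)) / len(target),
              n_genes=len(target))
```

In the paper's method the fitness call would train and evaluate an SVM on the genes flagged by the chromosome.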
Outline
Microarray Datasets & Research Objective
Related work & Background
Research method
Experimental result
Conclusion
The datasets
Data set name # of samples # of class labels # of genes
9_Tumors 60 9 5,726
Brain_Tumor1 90 5 5,920
Brain_Tumor2 50 4 10,367
Leukemia1 72 3 5,327
Leukemia2 72 3 11,225
Lung Cancer 203 5 12,600
SRBCT 83 4 2,308
11_Tumors 174 11 12,533
Prostate Tumor 102 2 10,509
DLBCL 77 2 5,469
GEMS : http://www.gems-system.org/
Genes selected in 3 steps
Data Set          # of original genes   IG      Grouping   GA
9_Tumors 5,726 103 25 13
Brain_Tumor1 5,920 185 19 10
Brain_Tumor2 10,367 3,099 19 4
Leukemia1 5,327 263 7 4
Leukemia2 11,225 3,097 6 3
Lung_Cancer 12,600 3,183 36 18
SRBCT 2,308 351 14 7
11_Tumors 12,533 3,483 510 255
Prostate_Tumor 10,509 671 235 119
DLBCL 5,469 315 169 84
Comparison with other methods
Comparison of our method (Hybrid) with GEPUBLIC, PAM, and IG-GA; each entry is the accuracy rate (%) with the number of selected genes in parentheses.
Data Set GEPUBLIC PAM IG-GA Hybrid
9_Tumors 66.67(19) 43.33 (47) 85.00 (52) 71.67(13)
Brain_Tumor1 84.44(30) 85.56 (42) 93.33 (244) 91.12(10)
Brain_Tumor2 80.00(15) 66.00 (25) 88.00 (489) 92.00(4)
Leukemia1 97.22(11) 93.06 (11) 100.00 (82) 97.23(4)
Leukemia2 91.67(31) 91.67 (52) 98.61 (782) 100.00(3)
Lung_Cancer 94.58(29) 93.60 (75) 95.57 (2101) 97.05(18)
SRBCT 98.80(26) 98.80 (41) 100.00 (56) 100.00(7)
11_Tumors 86.21(87) 81.61 (203) 92.53 (479) 91.95(255)
Prostate_Tumor 95.10(4) 93.14 (13) 96.08 (343) 94.12(119)
DLBCL 97.40(13) 80.52 (70) 100.00 (107) 97.40(84)
Outline
Microarray Datasets & Research Objective
Related work & Background
Research method
Experimental result
Conclusion
Conclusion
Each step of our method effectively removes noisy genes left over from the previous step.
The hybrid method selects fewer genes with a higher classification accuracy rate.
The hybrid method still needs improvement on 2-class microarray datasets.
Q & A
Thank you for listening.
Information Gain
For a dataset D with m different class labels, Info(D) measures how evenly the classes of D are distributed:

  Info(D) = - Σ_{i=1}^{m} p_i log2(p_i)

where p_i is the probability that a sample in D belongs to class i.

Info_A(D): the equivalent Info (weighted sum) of the subsets of D when D is split using attribute A. If A has v different values {a1, a2, ..., av}, then D is split into {D1, D2, ..., Dv}, where Dj contains the samples with A equal to aj:

  Info_A(D) = Σ_{j=1}^{v} (|Dj| / |D|) · Info(Dj)

Gain(A) = Info(D) - Info_A(D)
Data Mining: Concepts and Techniques
Attribute Selection: Information Gain
• Class P: buys_computer = "yes" : 9 samples
• Class N: buys_computer = "no" : 5 samples

Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

Splitting on age:

age     P  N
<=30    2  3
31…40   4  0
>40     3  2

Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694
Gain(age) = Info(D) - Info_age(D) = 0.246

Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048.

The training data:

age     income  student  credit     Buy
<=30    high    no       fair       no
<=30    high    no       excellent  no
31…40   high    no       fair       yes
>40     medium  no       fair       yes
>40     low     yes      fair       yes
>40     low     yes      excellent  no
31…40   low     yes      excellent  yes
<=30    medium  no       fair       no
<=30    low     yes      fair       yes
>40     medium  yes      fair       yes
<=30    medium  yes      excellent  yes
31…40   medium  no       excellent  yes
31…40   high    yes      fair       yes
>40     medium  no       excellent  no
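The buys_computer numbers above can be checked with a few lines of plain Python:

```python
import math

def info(counts):
    """Info of a class distribution: -sum p_i * log2(p_i), skipping empty classes."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

info_d = info([9, 5])                       # Info(D) = I(9,5)
splits = [[2, 3], [4, 0], [3, 2]]           # age: <=30, 31...40, >40
info_age = sum(sum(g) / 14 * info(g) for g in splits)
gain_age = info_d - info_age                # Gain(age)
```

Since Gain(age) = 0.246 is the largest of the four gains, the decision tree splits on age first.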