Feature Selection and Its Application in Genomic Data Analysis
March 9, 2004
Lei Yu, Arizona State University
Outline

- Introduction to feature selection: motivation, problem statement, key research issues
- Application in genomic data analysis: overview of data mining for microarray data, gene selection, a case study
- Current research directions
Motivation
An active field in:
- Pattern recognition
- Machine learning
- Data mining
- Statistics

Input: a data matrix of M instances I1, ..., IM described by N features F1, ..., FN, each instance labeled with a class:

      F1   F2   ...  FN  | C
I1    f11  f12  ...  f1N | c1
I2    f21  f22  ...  f2N | c2
...   ...  ...  ...  ... | ...
IM    fM1  fM2  ...  fMN | cM

Goodness of feature selection:
- Reducing dimensionality
- Improving learning efficiency
- Increasing predictive accuracy
- Reducing complexity of learned results
Problem Statement
- A process of selecting a minimum subset of features that is sufficient to construct a hypothesis consistent with the training examples (Almuallim and Dietterich, 1991)
- Selecting a minimum subset G such that P(C|G) is equal, or as close as possible, to P(C|F) (Koller and Sahami, 1996)
An Example for the Problem
Data set:
- Five Boolean features
- C = F1 ∨ F2
- F3 = ¬F2, F5 = ¬F4
Optimal subset: {F1, F2} or {F1, F3}
Combinatorial nature of searching for an optimal subset
F1  F2  F3  F4  F5 | C
 0   0   1   0   1 | 0
 0   1   0   0   1 | 1
 1   0   1   0   1 | 1
 1   1   0   0   1 | 1
 0   0   1   1   0 | 0
 0   1   0   1   0 | 1
 1   0   1   1   0 | 1
 1   1   0   1   0 | 1
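The combinatorial search can be tried directly on this data set. A minimal sketch in Python of the consistency-based definition above (in the spirit of Almuallim and Dietterich's FOCUS; all function names here are illustrative):

```python
from itertools import combinations

# The eight training examples from the slide: columns F1..F5, then class C.
DATA = [
    (0, 0, 1, 0, 1, 0), (0, 1, 0, 0, 1, 1), (1, 0, 1, 0, 1, 1), (1, 1, 0, 0, 1, 1),
    (0, 0, 1, 1, 0, 0), (0, 1, 0, 1, 0, 1), (1, 0, 1, 1, 0, 1), (1, 1, 0, 1, 0, 1),
]

def consistent(subset):
    """A subset is consistent if no two examples agree on every
    selected feature yet disagree on the class label."""
    seen = {}
    for row in DATA:
        key = tuple(row[i] for i in subset)
        if seen.setdefault(key, row[-1]) != row[-1]:
            return False
    return True

def minimal_consistent_subsets():
    """Search subset sizes from small to large; return all smallest
    consistent subsets (0-indexed feature positions)."""
    for size in range(6):
        found = [s for s in combinations(range(5), size) if consistent(s)]
        if found:
            return found
    return []

print(minimal_consistent_subsets())  # -> [(0, 1), (0, 2)], i.e. {F1, F2} and {F1, F3}
```

With only five features the exhaustive search is instant; the point of the slide is that the number of subsets grows as 2^N, so this strategy does not scale.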
Subset Search
An example of search space (Kohavi and John, 1997)
Evaluation Measures
Wrapper model:
- Relying on a predetermined classification algorithm
- Using predictive accuracy as the goodness measure
- High accuracy, but computationally expensive

Filter model:
- Separating feature selection from classifier learning
- Relying on general characteristics of the data (distance, correlation, consistency)
- No bias toward any learning algorithm; fast
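The two models can be contrasted on the toy data from the earlier example. A sketch, with leave-one-out 1-nearest-neighbour accuracy standing in for the wrapper's classifier and a consistency measure standing in for the filter's data characteristic (both concrete choices are illustrative):

```python
# The eight examples from the earlier slide: (F1..F5, C).
DATA = [
    (0, 0, 1, 0, 1, 0), (0, 1, 0, 0, 1, 1), (1, 0, 1, 0, 1, 1), (1, 1, 0, 0, 1, 1),
    (0, 0, 1, 1, 0, 0), (0, 1, 0, 1, 0, 1), (1, 0, 1, 1, 0, 1), (1, 1, 0, 1, 0, 1),
]

def wrapper_score(subset):
    """Wrapper model: goodness = predictive accuracy of a chosen
    classifier, here leave-one-out 1-nearest-neighbour using Hamming
    distance on the selected features only."""
    correct = 0
    for i, row in enumerate(DATA):
        others = [r for j, r in enumerate(DATA) if j != i]
        nearest = min(others, key=lambda r: sum(r[f] != row[f] for f in subset))
        correct += nearest[-1] == row[-1]
    return correct / len(DATA)

def filter_score(subset):
    """Filter model: goodness from a general characteristic of the data
    alone, here consistency: the fraction of examples whose projection
    onto the subset is not shared by an example of another class."""
    groups = {}
    for row in DATA:
        groups.setdefault(tuple(row[f] for f in subset), []).append(row[-1])
    clean = sum(len(ls) for ls in groups.values() if len(set(ls)) == 1)
    return clean / len(DATA)

print(wrapper_score((0, 1)), filter_score((0, 1)))  # {F1, F2} scores perfectly under both
```

The wrapper score requires training and testing a classifier for every candidate subset, which is where its computational cost comes from; the filter score only scans the data once per subset.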
A Framework for Algorithms
[Flowchart: Original Set → Subset Generation → Candidate Subset → Subset Evaluation → Current Best Subset → Stopping Criterion; if the criterion is not met (No), generate the next candidate; if it is met (Yes), output the Selected Subset]
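The framework can be sketched as a loop. Random subset generation, a fixed round budget as the stopping criterion, and all names below are illustrative choices; generation could equally be complete or heuristic search:

```python
import random

def select_features(n_features, evaluate, max_rounds=200, seed=0):
    """Generic framework: generate a candidate subset, evaluate it,
    keep the current best, repeat until the stopping criterion fires,
    then return the selected subset.  `evaluate` is any goodness
    measure (filter- or wrapper-style)."""
    rng = random.Random(seed)
    best, best_score = frozenset(), float("-inf")
    for _ in range(max_rounds):                      # stopping criterion: round budget
        size = rng.randint(1, n_features)
        candidate = frozenset(rng.sample(range(n_features), size))  # subset generation
        score = evaluate(candidate)                  # subset evaluation
        if score > best_score or (score == best_score and len(candidate) < len(best)):
            best, best_score = candidate, score      # current best subset
    return best                                      # selected subset
```

The tie-break toward smaller subsets reflects the "minimum subset" flavour of the problem statement on slide 4.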
Feature Ranking
- Weighting and ranking individual features
- Selecting the top-ranked ones as the feature subset

Advantages:
- Efficient: O(N) in terms of dimensionality N
- Easy to implement

Disadvantages:
- Hard to determine the threshold
- Unable to consider correlation between features
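Feature ranking fits in a few lines. The agreement-based goodness measure below is a hypothetical choice for Boolean data, applied to the earlier five-feature example; note how it exposes the second disadvantage:

```python
def rank_features(X, y, score):
    """Feature ranking: score each feature independently (O(N) scores
    for N features) and sort.  Correlation between features is ignored,
    which is exactly the weakness noted above."""
    n = len(X[0])
    scored = [(score([row[i] for row in X], y), i) for i in range(n)]
    return [i for s, i in sorted(scored, reverse=True)]

def accuracy_score(col, y):
    """A hypothetical goodness measure for Boolean data: how often the
    feature (or its negation) agrees with the class label."""
    agree = sum(v == c for v, c in zip(col, y))
    return max(agree, len(y) - agree) / len(y)

# The slide-5 data: F3 = not F2 and F5 = not F4.
X = [(0, 0, 1, 0, 1), (0, 1, 0, 0, 1), (1, 0, 1, 0, 1), (1, 1, 0, 0, 1),
     (0, 0, 1, 1, 0), (0, 1, 0, 1, 0), (1, 0, 1, 1, 0), (1, 1, 0, 1, 0)]
y = [0, 1, 1, 1, 0, 1, 1, 1]
print(rank_features(X, y, accuracy_score))  # redundant F3 ranks as high as F2
```

Because each feature is scored in isolation, the redundant F3 cannot be told apart from F2, and the irrelevant-looking F4/F5 pair sinks to the bottom only because neither correlates with the class on its own.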
Applications of Feature Selection
Text categorization: Yang and Pedersen, 1997 (CMU); Forman, 2003 (HP Labs)
Image retrieval: Swets and Weng, 1995 (MSU); Dy et al., 2003 (Purdue University)
Gene expression microarray data analysis: Xing et al., 2001 (UC Berkeley); Lee et al., 2003 (Texas A&M)
Customer relationship management: Ng and Liu, 2000 (NUS)
Intrusion detection: Lee et al., 2000 (Columbia University)
Microarray Technology
- Enables measuring the expression levels of thousands or tens of thousands of genes simultaneously in a single experiment
- Provides new opportunities and challenges for data mining
[Table: Gene / Value pairs for probes M23197_at, U66497_at, M92287_at, ... with their raw expression values]
Two Ways to View Microarray Data

[Table: the same expression matrix viewed two ways. Gene view: rows are genes (M23197_at, U66497_at, M92287_at, ...), columns are samples. Sample view: rows are samples (Sample 1, Sample 2, Sample 3, ...), columns are genes, plus a Class column with labels (ALL, ALL, AML, ...)]
Data Mining Tasks
Data points are genes or samples:
- Classification (data points: samples): building a classifier to predict the classes of new samples
- Clustering (data points: samples): grouping similar samples together to find classes or subclasses
- Clustering (data points: genes): grouping similar genes together to find co-regulated genes
Gene Selection
Data characteristics in sample classification:
- High dimensionality (thousands of genes)
- Small sample size (often fewer than 100 samples)

Problems:
- Curse of dimensionality
- Overfitting the training data

Traditional gene selection methods:
- Within the filter model
- Gene ranking
A Case Study (Golub et al., 1999)
Leukemia data:
- 7129 genes, 72 samples
- Training set: 38 samples (27 ALL, 11 AML)
- Test set: 34 samples (20 ALL, 14 AML)

Normalization:
- Mean: 0
- Standard deviation: 1

Correlation measure between each gene and the ALL/AML class distinction
[Figure: heatmap of normalized expression, scale -3 to 3, for the top-ranked genes across ALL and AML samples]
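The correlation measure in Golub et al. is the signal-to-noise statistic P(g, c) = (mu_ALL - mu_AML) / (sigma_ALL + sigma_AML). A minimal sketch of the normalization and the ranking statistic (the use of the population standard deviation is an assumption here):

```python
import statistics

def normalize(values):
    """Per-gene normalization from the slide: mean 0, standard deviation 1."""
    mu, sigma = statistics.mean(values), statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

def signal_to_noise(values, labels):
    """Golub et al.'s correlation between a gene g and the class c:
    P(g, c) = (mu_ALL - mu_AML) / (sigma_ALL + sigma_AML).
    A large |P| marks a gene whose expression separates the classes."""
    all_vals = [v for v, lab in zip(values, labels) if lab == "ALL"]
    aml_vals = [v for v, lab in zip(values, labels) if lab == "AML"]
    return ((statistics.mean(all_vals) - statistics.mean(aml_vals))
            / (statistics.pstdev(all_vals) + statistics.pstdev(aml_vals)))
```

Ranking all 7129 genes by |P(g, c)| and keeping the top genes (Golub et al. used 50, split between the two classes) is exactly the gene-ranking approach whose limitations the next slide discusses.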
Case Study (continued)
Performance of selected genes:
- Accuracy on training set: 36 out of 38 (94.74%) correctly classified
- Accuracy on test set: 29 out of 34 (85.29%) correctly classified

Limitations:
- Domain knowledge required to determine the number of genes selected
- Unable to remove redundant genes
Feature/Gene Redundancy
Examining redundant genes:
- Two heads are not necessarily better than one
- Effects of redundant genes

How to handle redundancy:
- A challenge
- Some recent work:
  - MRMR (Maximum Relevance Minimum Redundancy) (Ding and Peng, CSB-2003)
  - FCBF (Fast Correlation-Based Filter) (Yu and Liu, ICML-2003)
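FCBF's core idea (keep a feature only when no already-selected feature is more strongly correlated with it than it is with the class) can be sketched with Pearson correlation standing in for the symmetrical uncertainty measure FCBF actually uses; the threshold and names below are illustrative:

```python
def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def fcbf_like(X, y, threshold=0.1):
    """Simplified sketch of FCBF: rank features by relevance |corr(F, C)|,
    drop those below the threshold, then remove any feature more strongly
    correlated with an already-kept feature than with the class
    (an approximate redundancy check)."""
    cols = [list(c) for c in zip(*X)]
    relevance = [(abs(pearson(c, y)), i) for i, c in enumerate(cols)]
    ranked = [i for r, i in sorted(relevance, reverse=True) if r >= threshold]
    kept = []
    for i in ranked:
        if all(abs(pearson(cols[i], cols[j])) < abs(pearson(cols[i], y))
               for j in kept):
            kept.append(i)
    return kept

# The five-feature example from slide 5: F3 = not F2, F5 = not F4.
X = [(0, 0, 1, 0, 1), (0, 1, 0, 0, 1), (1, 0, 1, 0, 1), (1, 1, 0, 0, 1),
     (0, 0, 1, 1, 0), (0, 1, 0, 1, 0), (1, 0, 1, 1, 0), (1, 1, 0, 1, 0)]
y = [0, 1, 1, 1, 0, 1, 1, 1]
print(fcbf_like(X, y))  # keeps F1 plus one of the redundant pair {F2, F3}
```

Unlike plain ranking, this keeps only one of each redundant pair, recovering an optimal subset on the toy example.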
Research Directions
Feature selection for unlabeled data:
- Commonalities with the labeled case
- Differences

Dealing with different data types:
- Nominal, discrete, continuous
- Discretization

Dealing with large-size data

Comparative study and intelligent selection of feature selection methods
References
- G. John, R. Kohavi, and K. Pfleger. Irrelevant features and the subset selection problem. ICML-1994.
- L. Yu and H. Liu. Feature selection for high-dimensional data: a fast correlation-based filter solution. ICML-2003.
- T. R. Golub et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science-1999.
- C. Ding and H. Peng. Minimum redundancy feature selection from microarray gene expression data. CSB-2003.
- J. Shavlik and D. Page. Machine learning and genetic microarrays. ICML-2003 tutorial. http://www.cs.wisc.edu/~dpage/ICML-2003-Tutorial-Shavlik-Page.ppt