Feature Selection and Its Application in Genomic Data Analysis
March 9, 2004
Lei Yu, Arizona State University
Outline

- Introduction to feature selection: motivation, problem statement, key research issues
- Application in genomic data analysis: overview of data mining for microarray data, gene selection, a case study
- Current research directions
Motivation
An active field in:
- Pattern recognition
- Machine learning
- Data mining
- Statistics

Input: a data matrix of M instances I1, ..., IM described by N features F1, ..., FN, each instance labeled with a class:

      F1   F2   ...  FN  | C
I1    f11  f12  ...  f1N | c1
I2    f21  f22  ...  f2N | c2
...   ...  ...  ...  ... | ...
IM    fM1  fM2  ...  fMN | cM

Goodness of feature selection:
- Reducing dimensionality
- Improving learning efficiency
- Increasing predictive accuracy
- Reducing complexity of learned results
Problem Statement
- A process of selecting a minimum subset of features that is sufficient to construct a hypothesis consistent with the training examples (Almuallim and Dietterich, 1991)
- Selecting a minimum subset G such that P(C|G) is equal, or as close as possible, to P(C|F) (Koller and Sahami, 1996)
An Example for the Problem
Data set:
- Five Boolean features
- C = F1 ∨ F2
- F3 = ¬F2, F5 = ¬F4
Optimal subset: {F1, F2} or {F1, F3}
Combinatorial nature of searching for an optimal subset
F1  F2  F3  F4  F5 | C
 0   0   1   0   1 | 0
 0   1   0   0   1 | 1
 1   0   1   0   1 | 1
 1   1   0   0   1 | 1
 0   0   1   1   0 | 0
 0   1   0   1   0 | 1
 1   0   1   1   0 | 1
 1   1   0   1   0 | 1
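The combinatorial search can be tried directly on this data set. A minimal sketch in Python of the consistency-based definition above (in the spirit of Almuallim and Dietterich's FOCUS; all function names here are illustrative):

```python
from itertools import combinations

# The eight training examples from the slide: columns F1..F5, then class C.
DATA = [
    (0, 0, 1, 0, 1, 0), (0, 1, 0, 0, 1, 1), (1, 0, 1, 0, 1, 1), (1, 1, 0, 0, 1, 1),
    (0, 0, 1, 1, 0, 0), (0, 1, 0, 1, 0, 1), (1, 0, 1, 1, 0, 1), (1, 1, 0, 1, 0, 1),
]

def consistent(subset):
    """A subset is consistent if no two examples agree on every
    selected feature yet disagree on the class label."""
    seen = {}
    for row in DATA:
        key = tuple(row[i] for i in subset)
        if seen.setdefault(key, row[-1]) != row[-1]:
            return False
    return True

def minimal_consistent_subsets():
    """Search subset sizes from small to large; return all smallest
    consistent subsets (0-indexed feature positions)."""
    for size in range(6):
        found = [s for s in combinations(range(5), size) if consistent(s)]
        if found:
            return found
    return []

print(minimal_consistent_subsets())  # -> [(0, 1), (0, 2)], i.e. {F1, F2} and {F1, F3}
```

With only five features the exhaustive search is instant; the point of the slide is that the number of subsets grows as 2^N, so this strategy does not scale.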
Subset Search
An example of search space (Kohavi and John, 1997)
Evaluation Measures
Wrapper model:
- Relying on a predetermined classification algorithm
- Using predictive accuracy as the goodness measure
- High accuracy, but computationally expensive

Filter model:
- Separating feature selection from classifier learning
- Relying on general characteristics of the data (distance, correlation, consistency)
- No bias toward any learning algorithm; fast
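The two models can be contrasted on the toy data from the earlier example. A sketch, with leave-one-out 1-nearest-neighbour accuracy standing in for the wrapper's classifier and a consistency measure standing in for the filter's data characteristic (both concrete choices are illustrative):

```python
# The eight examples from the earlier slide: (F1..F5, C).
DATA = [
    (0, 0, 1, 0, 1, 0), (0, 1, 0, 0, 1, 1), (1, 0, 1, 0, 1, 1), (1, 1, 0, 0, 1, 1),
    (0, 0, 1, 1, 0, 0), (0, 1, 0, 1, 0, 1), (1, 0, 1, 1, 0, 1), (1, 1, 0, 1, 0, 1),
]

def wrapper_score(subset):
    """Wrapper model: goodness = predictive accuracy of a chosen
    classifier, here leave-one-out 1-nearest-neighbour using Hamming
    distance on the selected features only."""
    correct = 0
    for i, row in enumerate(DATA):
        others = [r for j, r in enumerate(DATA) if j != i]
        nearest = min(others, key=lambda r: sum(r[f] != row[f] for f in subset))
        correct += nearest[-1] == row[-1]
    return correct / len(DATA)

def filter_score(subset):
    """Filter model: goodness from a general characteristic of the data
    alone, here consistency: the fraction of examples whose projection
    onto the subset is not shared by an example of another class."""
    groups = {}
    for row in DATA:
        groups.setdefault(tuple(row[f] for f in subset), []).append(row[-1])
    clean = sum(len(ls) for ls in groups.values() if len(set(ls)) == 1)
    return clean / len(DATA)

print(wrapper_score((0, 1)), filter_score((0, 1)))  # {F1, F2} scores perfectly under both
```

The wrapper score requires training and testing a classifier for every candidate subset, which is where its computational cost comes from; the filter score only scans the data once per subset.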
A Framework for Algorithms
[Flowchart: Original Set → Subset Generation → Candidate Subset → Subset Evaluation → Current Best Subset → Stopping Criterion; if the criterion is not met (No), generate the next candidate; if it is met (Yes), output the Selected Subset]
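The framework can be sketched as a loop. Random subset generation, a fixed round budget as the stopping criterion, and all names below are illustrative choices; generation could equally be complete or heuristic search:

```python
import random

def select_features(n_features, evaluate, max_rounds=200, seed=0):
    """Generic framework: generate a candidate subset, evaluate it,
    keep the current best, repeat until the stopping criterion fires,
    then return the selected subset.  `evaluate` is any goodness
    measure (filter- or wrapper-style)."""
    rng = random.Random(seed)
    best, best_score = frozenset(), float("-inf")
    for _ in range(max_rounds):                      # stopping criterion: round budget
        size = rng.randint(1, n_features)
        candidate = frozenset(rng.sample(range(n_features), size))  # subset generation
        score = evaluate(candidate)                  # subset evaluation
        if score > best_score or (score == best_score and len(candidate) < len(best)):
            best, best_score = candidate, score      # current best subset
    return best                                      # selected subset
```

The tie-break toward smaller subsets reflects the "minimum subset" flavour of the problem statement on slide 4.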
Feature Ranking
- Weighting and ranking individual features
- Selecting the top-ranked ones as the feature subset

Advantages:
- Efficient: O(N) in terms of dimensionality N
- Easy to implement

Disadvantages:
- Hard to determine the threshold
- Unable to consider correlation between features
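Feature ranking fits in a few lines. The agreement-based goodness measure below is a hypothetical choice for Boolean data, applied to the earlier five-feature example; note how it exposes the second disadvantage:

```python
def rank_features(X, y, score):
    """Feature ranking: score each feature independently (O(N) scores
    for N features) and sort.  Correlation between features is ignored,
    which is exactly the weakness noted above."""
    n = len(X[0])
    scored = [(score([row[i] for row in X], y), i) for i in range(n)]
    return [i for s, i in sorted(scored, reverse=True)]

def accuracy_score(col, y):
    """A hypothetical goodness measure for Boolean data: how often the
    feature (or its negation) agrees with the class label."""
    agree = sum(v == c for v, c in zip(col, y))
    return max(agree, len(y) - agree) / len(y)

# The slide-5 data: F3 = not F2 and F5 = not F4.
X = [(0, 0, 1, 0, 1), (0, 1, 0, 0, 1), (1, 0, 1, 0, 1), (1, 1, 0, 0, 1),
     (0, 0, 1, 1, 0), (0, 1, 0, 1, 0), (1, 0, 1, 1, 0), (1, 1, 0, 1, 0)]
y = [0, 1, 1, 1, 0, 1, 1, 1]
print(rank_features(X, y, accuracy_score))  # redundant F3 ranks as high as F2
```

Because each feature is scored in isolation, the redundant F3 cannot be told apart from F2, and the irrelevant-looking F4/F5 pair sinks to the bottom only because neither correlates with the class on its own.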
Applications of Feature Selection
Text categorization: Yang and Pedersen, 1997 (CMU); Forman, 2003 (HP Labs)
Image retrieval: Swets and Weng, 1995 (MSU); Dy et al., 2003 (Purdue University)
Gene expression microarray data analysis: Xing et al., 2001 (UC Berkeley); Lee et al., 2003 (Texas A&M)
Customer relationship management: Ng and Liu, 2000 (NUS)
Intrusion detection: Lee et al., 2000 (Columbia University)
Microarray Technology
- Enables measuring the expression levels of thousands or tens of thousands of genes simultaneously in a single experiment
- Provides new opportunities and challenges for data mining
[Table: Gene / Value pairs for probes M23197_at, U66497_at, M92287_at, ... with their raw expression values]
Two Ways to View Microarray Data

[Table: the same expression matrix viewed two ways. Gene view: rows are genes (M23197_at, U66497_at, M92287_at, ...), columns are samples. Sample view: rows are samples (Sample 1, Sample 2, Sample 3, ...), columns are genes, plus a Class column with labels (ALL, ALL, AML, ...)]
Data Mining Tasks
Data points are genes or samples:
- Classification (data points: samples): building a classifier to predict the classes of new samples
- Clustering (data points: samples): grouping similar samples together to find classes or subclasses
- Clustering (data points: genes): grouping similar genes together to find co-regulated genes
Gene Selection
Data characteristics in sample classification:
- High dimensionality (thousands of genes)
- Small sample size (often fewer than 100 samples)

Problems:
- Curse of dimensionality
- Overfitting the training data

Traditional gene selection methods:
- Within the filter model
- Gene ranking
A Case Study (Golub et al., 1999)
Leukemia data:
- 7129 genes, 72 samples
- Training set: 38 samples (27 ALL, 11 AML)
- Test set: 34 samples (20 ALL, 14 AML)

Normalization:
- Mean: 0
- Standard deviation: 1

Correlation measure between each gene and the ALL/AML class distinction
[Figure: heatmap of normalized expression, scale -3 to 3, for the top-ranked genes across ALL and AML samples]
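The correlation measure in Golub et al. is the signal-to-noise statistic P(g, c) = (mu_ALL - mu_AML) / (sigma_ALL + sigma_AML). A minimal sketch of the normalization and the ranking statistic (the use of the population standard deviation is an assumption here):

```python
import statistics

def normalize(values):
    """Per-gene normalization from the slide: mean 0, standard deviation 1."""
    mu, sigma = statistics.mean(values), statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

def signal_to_noise(values, labels):
    """Golub et al.'s correlation between a gene g and the class c:
    P(g, c) = (mu_ALL - mu_AML) / (sigma_ALL + sigma_AML).
    A large |P| marks a gene whose expression separates the classes."""
    all_vals = [v for v, lab in zip(values, labels) if lab == "ALL"]
    aml_vals = [v for v, lab in zip(values, labels) if lab == "AML"]
    return ((statistics.mean(all_vals) - statistics.mean(aml_vals))
            / (statistics.pstdev(all_vals) + statistics.pstdev(aml_vals)))
```

Ranking all 7129 genes by |P(g, c)| and keeping the top genes (Golub et al. used 50, split between the two classes) is exactly the gene-ranking approach whose limitations the next slide discusses.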
Case Study (continued)
Performance of selected genes:
- Accuracy on training set: 36 out of 38 (94.74%) correctly classified
- Accuracy on test set: 29 out of 34 (85.29%) correctly classified

Limitations:
- Domain knowledge required to determine the number of genes selected
- Unable to remove redundant genes
Feature/Gene Redundancy
Examining redundant genes:
- Two heads are not necessarily better than one
- Effects of redundant genes

How to handle redundancy:
- A challenge
- Some recent work:
  - MRMR (Maximum Relevance Minimum Redundancy) (Ding and Peng, CSB-2003)
  - FCBF (Fast Correlation-Based Filter) (Yu and Liu, ICML-2003)
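FCBF's core idea (keep a feature only when no already-selected feature is more strongly correlated with it than it is with the class) can be sketched with Pearson correlation standing in for the symmetrical uncertainty measure FCBF actually uses; the threshold and names below are illustrative:

```python
def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def fcbf_like(X, y, threshold=0.1):
    """Simplified sketch of FCBF: rank features by relevance |corr(F, C)|,
    drop those below the threshold, then remove any feature more strongly
    correlated with an already-kept feature than with the class
    (an approximate redundancy check)."""
    cols = [list(c) for c in zip(*X)]
    relevance = [(abs(pearson(c, y)), i) for i, c in enumerate(cols)]
    ranked = [i for r, i in sorted(relevance, reverse=True) if r >= threshold]
    kept = []
    for i in ranked:
        if all(abs(pearson(cols[i], cols[j])) < abs(pearson(cols[i], y))
               for j in kept):
            kept.append(i)
    return kept

# The five-feature example from slide 5: F3 = not F2, F5 = not F4.
X = [(0, 0, 1, 0, 1), (0, 1, 0, 0, 1), (1, 0, 1, 0, 1), (1, 1, 0, 0, 1),
     (0, 0, 1, 1, 0), (0, 1, 0, 1, 0), (1, 0, 1, 1, 0), (1, 1, 0, 1, 0)]
y = [0, 1, 1, 1, 0, 1, 1, 1]
print(fcbf_like(X, y))  # keeps F1 plus one of the redundant pair {F2, F3}
```

Unlike plain ranking, this keeps only one of each redundant pair, recovering an optimal subset on the toy example.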
Research Directions
Feature selection for unlabeled data:
- Commonalities with the labeled case
- Differences

Dealing with different data types:
- Nominal, discrete, continuous
- Discretization

Dealing with large-size data

Comparative study and intelligent selection of feature selection methods
References
- G. John, R. Kohavi, and K. Pfleger. Irrelevant features and the subset selection problem. ICML-1994.
- L. Yu and H. Liu. Feature selection for high-dimensional data: a fast correlation-based filter solution. ICML-2003.
- T. R. Golub et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science-1999.
- C. Ding and H. Peng. Minimum redundancy feature selection from microarray gene expression data. CSB-2003.
- J. Shavlik and D. Page. Machine learning and genetic microarrays. ICML-2003 tutorial. http://www.cs.wisc.edu/~dpage/ICML-2003-Tutorial-Shavlik-Page.ppt