Page 1: Applications of Data Mining in Microarray Data Analysis

Applications of Data Mining in Microarray Data Analysis

Yen-Jen Oyang

Dept. of Computer Science and Information Engineering

Page 2: Applications of Data Mining in Microarray Data Analysis

Observations and Challenges in the Information Age

• A huge volume of information has been and is being digitized and stored in the computer.

• Due to the volume of digitized information, effective exploitation of information is beyond the capability of human beings without the aid of intelligent computer software.

Page 3: Applications of Data Mining in Microarray Data Analysis

An Example of Data Mining

• Given the data set shown on the next slide, can we figure out a set of rules that predict the classes of objects?

Page 4: Applications of Data Mining in Microarray Data Analysis

Data Set

Data       Class    Data       Class    Data       Class
(15,33)    O        (18,28)    ×        (16,31)    O
( 9,23)    ×        (15,35)    O        ( 9,32)    ×
( 8,15)    ×        (17,34)    O        (11,38)    ×
(11,31)    O        (18,39)    ×        (13,34)    O
(13,37)    ×        (14,32)    O        (19,36)    ×
(18,32)    O        (25,18)    ×        (10,34)    ×
(16,38)    ×        (23,33)    ×        (15,30)    O
(12,33)    O        (21,28)    ×        (13,22)    ×

Page 5: Applications of Data Mining in Microarray Data Analysis

Distribution of the Data Set

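[Figure: scatter plot of the data set, with the “O” objects grouped together and the “×” objects around them; axis ticks at 10, 15, 20 (horizontal) and 30 (vertical).]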

Page 6: Applications of Data Mining in Microarray Data Analysis

Rule Based on Observation

$$\text{If } (x-15)^2 + (y-30)^2 \le 25 \text{ and } y - 30 \ge 0, \text{ then class} = \text{O}; \text{ else class} = \times.$$
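A hand-made rule of this kind can be checked directly against the data set. The sketch below assumes the rule reads "(x − 15)² + (y − 30)² ≤ 25 and y ≥ 30 means class O"; the comparison operators are an assumed reading of the slide, and the names `DATA` and `rule` are illustrative only.

```python
# Data set from the "Data Set" slide: ((x, y), class); "X" stands for the "×" mark.
DATA = [
    ((15, 33), "O"), ((18, 28), "X"), ((16, 31), "O"),
    ((9, 23), "X"), ((15, 35), "O"), ((9, 32), "X"),
    ((8, 15), "X"), ((17, 34), "O"), ((11, 38), "X"),
    ((11, 31), "O"), ((18, 39), "X"), ((13, 34), "O"),
    ((13, 37), "X"), ((14, 32), "O"), ((19, 36), "X"),
    ((18, 32), "O"), ((25, 18), "X"), ((10, 34), "X"),
    ((16, 38), "X"), ((23, 33), "X"), ((15, 30), "O"),
    ((12, 33), "O"), ((21, 28), "X"), ((13, 22), "X"),
]


def rule(x, y):
    """Assumed reading of the rule: inside the circle of radius 5 centered
    at (15, 30) and not below the line y = 30 means class O."""
    if (x - 15) ** 2 + (y - 30) ** 2 <= 25 and y >= 30:
        return "O"
    return "X"


# Collect every object whose predicted class differs from its label.
errors = [(point, label) for point, label in DATA if rule(*point) != label]
print(len(DATA), "objects,", len(errors), "misclassified")
```

Under this reading the rule reproduces the class of all 24 objects; the condition on y is what keeps (18,28) out of class O.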

Page 7: Applications of Data Mining in Microarray Data Analysis

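Rule Generated by an RBF (Radial Basis Function) Network Based Learning Algorithm

Let

$$f_o(v) = \sum_{i=1}^{10} \frac{1}{2\pi\sigma_{oi}^{2}}\, e^{-\frac{\lVert v - c_{oi}\rVert^{2}}{2\sigma_{oi}^{2}}}$$

and

$$f_x(v) = \sum_{j=1}^{14} \frac{1}{2\pi\sigma_{xj}^{2}}\, e^{-\frac{\lVert v - c_{xj}\rVert^{2}}{2\sigma_{xj}^{2}}}.$$

If $f_o(v) \ge f_x(v)$, then prediction = “O”.

Otherwise prediction = “X”.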

Page 8: Applications of Data Mining in Microarray Data Analysis

Parameters of the 10 terms of f_o:

c_oi       σ_oi
(15,33)    1.723
(11,31)    2.745
(18,32)    2.327
(12,33)    1.794
(15,35)    1.973
(17,34)    2.045
(14,32)    1.794
(16,31)    1.794
(13,34)    1.794
(15,30)    2.027

Parameters of the 14 terms of f_x:

c_xj       σ_xj
( 9,23)    6.458
( 8,15)    10.08
(13,37)    2.939
(16,38)    2.745
(18,28)    5.451
(18,39)    3.287
(25,18)    10.86
(23,33)    5.322
(21,28)    5.070
( 9,32)    4.562
(11,38)    3.463
(19,36)    3.587
(10,34)    3.232
(13,22)    6.260
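The rule on the previous slide can be evaluated numerically with the centers and widths listed here. The following is a minimal sketch, not the authors' implementation. It assumes that each function is a sum of two-dimensional spherical Gaussians with normalization 1/(2πσ²), that ties go to “O”, and that the centers and widths are paired in the order in which they are listed. The names `O_TERMS`, `X_TERMS`, `f`, and `predict` are illustrative.

```python
import math

# (center, sigma) pairs, paired in listing order (an assumption).
O_TERMS = [
    ((15, 33), 1.723), ((11, 31), 2.745), ((18, 32), 2.327), ((12, 33), 1.794),
    ((15, 35), 1.973), ((17, 34), 2.045), ((14, 32), 1.794), ((16, 31), 1.794),
    ((13, 34), 1.794), ((15, 30), 2.027),
]
X_TERMS = [
    ((9, 23), 6.458), ((8, 15), 10.08), ((13, 37), 2.939), ((16, 38), 2.745),
    ((18, 28), 5.451), ((18, 39), 3.287), ((25, 18), 10.86), ((23, 33), 5.322),
    ((21, 28), 5.070), ((9, 32), 4.562), ((11, 38), 3.463), ((19, 36), 3.587),
    ((10, 34), 3.232), ((13, 22), 6.260),
]


def f(v, terms):
    """Sum of 2-D spherical Gaussians, one per (center, sigma) pair."""
    total = 0.0
    for center, sigma in terms:
        d2 = (v[0] - center[0]) ** 2 + (v[1] - center[1]) ** 2
        total += math.exp(-d2 / (2 * sigma ** 2)) / (2 * math.pi * sigma ** 2)
    return total


def predict(v):
    """Predict "O" when the O-class function is at least the X-class one."""
    return "O" if f(v, O_TERMS) >= f(v, X_TERMS) else "X"


print(predict((15, 33)))  # a training sample of class O
print(predict((8, 15)))   # a training sample of class X
```

Isolated samples such as (8,15) and (25,18) carry large widths, so their Gaussians are flat and cover the sparse regions, while the tightly packed “O” samples have narrow, tall Gaussians.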

Page 9: Applications of Data Mining in Microarray Data Analysis

Identifying Boundary of Different Classes of Objects

Page 10: Applications of Data Mining in Microarray Data Analysis

Boundary Identified

Page 11: Applications of Data Mining in Microarray Data Analysis

Data Mining /Knowledge Discovery

• The main theme of data mining is to discover unknown and implicit knowledge in a large dataset.

• There are three main categories of data mining algorithms:
    • Classification;
    • Clustering;
    • Mining association rule/correlation analysis.

Page 12: Applications of Data Mining in Microarray Data Analysis

Data Classification

• In a data classification problem, each object is described by a set of attribute values and each object belongs to one of the predefined classes.

• The goal is to derive a set of rules that predicts which class a new object should belong to, based on a given set of training samples. Data classification is also called supervised learning.

Page 13: Applications of Data Mining in Microarray Data Analysis

Instance-Based Learning

• In instance-based learning, we take the k nearest training samples of a new instance (v1, v2, …, vm) and assign the new instance to the class that has the most instances among those k nearest training samples.

• Classifiers that adopt instance-based learning are commonly called KNN classifiers.
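The voting scheme described above can be sketched in a few lines. This is a minimal illustration that assumes Euclidean distance and reuses the two-class data set from the earlier slide; the names `TRAIN` and `knn_predict` are illustrative, and ties between classes (possible for even k) are resolved by whichever class was encountered first.

```python
import math
from collections import Counter

# Training samples from the "Data Set" slide; "X" stands for the "×" mark.
TRAIN = [
    ((15, 33), "O"), ((18, 28), "X"), ((16, 31), "O"),
    ((9, 23), "X"), ((15, 35), "O"), ((9, 32), "X"),
    ((8, 15), "X"), ((17, 34), "O"), ((11, 38), "X"),
    ((11, 31), "O"), ((18, 39), "X"), ((13, 34), "O"),
    ((13, 37), "X"), ((14, 32), "O"), ((19, 36), "X"),
    ((18, 32), "O"), ((25, 18), "X"), ((10, 34), "X"),
    ((16, 38), "X"), ((23, 33), "X"), ((15, 30), "O"),
    ((12, 33), "O"), ((21, 28), "X"), ((13, 22), "X"),
]


def knn_predict(query, train, k):
    """Return the majority class among the k training samples nearest to query."""
    neighbors = sorted(train, key=lambda sample: math.dist(query, sample[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


print(knn_predict((15, 32), TRAIN, k=3))  # query inside the "O" group
print(knn_predict((20, 20), TRAIN, k=1))  # query far from the "O" group
```

Sorting all training samples costs O(n log n) per query, which is fine for a sketch; larger data sets usually call for a spatial index.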

Page 14: Applications of Data Mining in Microarray Data Analysis

Example of the KNN

• If a 1NN classifier is employed, then the prediction for the test instance is “X”.

• If a 3NN classifier is employed, then the prediction for the test instance is “O”.

Page 15: Applications of Data Mining in Microarray Data Analysis

Applications of Data Classification in Bioinformatics

• In microarray data analysis, data classification is employed to predict the class of a new sample based on the existing samples with known class.

Page 16: Applications of Data Mining in Microarray Data Analysis

• For example, in the Leukemia data set, there are 72 samples and 7129 genes:
    • 25 Acute Myeloid Leukemia (AML) samples;
    • 38 B-cell Acute Lymphoblastic Leukemia samples;
    • 9 T-cell Acute Lymphoblastic Leukemia samples.

Page 17: Applications of Data Mining in Microarray Data Analysis

Model of Microarray Data Sets

            Gene 1    Gene 2    ‧‧‧    Gene n
Sample 1
Sample 2
   ‧‧‧
Sample m

Each entry of the table is a real number: $M(i,j) \in \mathbb{R}$.

Page 18: Applications of Data Mining in Microarray Data Analysis

Alternative Data Classification Algorithms

• Decision tree (C4.5 and C5.0);
• Instance-based learning (KNN);
• Naïve Bayesian classifier;
• Support vector machine (SVM);
• Novel approaches including the RBF network based classifier that we have recently proposed.

Page 19: Applications of Data Mining in Microarray Data Analysis

Accuracy of Different Classification Algorithms

Classification accuracy (%) by classification algorithm

Data set (training samples, test samples)    RBF      SVM      1NN      3NN
Satimage (4435, 2000)                        92.30    91.30    89.35    90.6
Letter (15000, 5000)                         97.12    97.98    95.26    95.46
Shuttle (43500, 14500)                       99.94    99.92    99.91    99.92
Average                                      96.45    96.40    94.84    95.33

Page 20: Applications of Data Mining in Microarray Data Analysis

Comparison of Execution Time (in seconds)

                                RBF without       RBF with
                                data reduction    data reduction    SVM
Cross validation    Satimage    670               265               64622
                    Letter      2825              1724              386814
                    Shuttle     96795             59.9              467825
Make classifier     Satimage    5.91              0.85              21.66
                    Letter      17.05             6.48              282.05
                    Shuttle     1745              0.69              129.84
Test                Satimage    21.3              7.4               11.53
                    Letter      128.6             51.74             94.91
                    Shuttle     996.1             5.85              2.13

Page 21: Applications of Data Mining in Microarray Data Analysis

More Insights

                                                           Satimage    Letter    Shuttle
# of training samples in the original data set             4435        15000     43500
# of training samples after data reduction is applied      1815        7794      627
% of training samples remaining                            40.92%      51.96%    1.44%
Classification accuracy after data reduction is applied    92.15       96.18     99.32
# of support vectors identified by LIBSVM                  1689        8931      287

Page 22: Applications of Data Mining in Microarray Data Analysis

Data Clustering

• Data clustering concerns how to group a set of objects based on their similarity of attributes and/or their proximity in the vector space. Data clustering is also called unsupervised learning.

Page 23: Applications of Data Mining in Microarray Data Analysis

The Agglomerative Hierarchical Clustering Algorithms

• The agglomerative hierarchical clustering algorithms operate by maintaining a sorted list of inter-cluster distances.

• Initially, each data instance forms a cluster.

• The clustering algorithm repeatedly merges the two clusters with the minimum inter-cluster distance.

Page 24: Applications of Data Mining in Microarray Data Analysis

• Upon merging two clusters, the clustering algorithm computes the distances between the newly-formed cluster and the remaining clusters and maintains the sorted list of inter-cluster distances accordingly.

• There are a number of ways to define the inter-cluster distance:
    • minimum distance (single-link);
    • maximum distance (complete-link);
    • average distance;
    • mean distance.
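The merge loop and three of the inter-cluster distance definitions can be sketched as follows. This is a deliberately naive version: it recomputes every inter-cluster distance in each round instead of maintaining a sorted list, so it only illustrates the logic. The function names and the one-dimensional example points are illustrative assumptions.

```python
import math
from itertools import combinations


def cluster_distance(a, b, points, linkage):
    """Distance between clusters a and b (lists of point indices)."""
    dists = [math.dist(points[i], points[j]) for i in a for j in b]
    if linkage == "single":      # minimum distance
        return min(dists)
    if linkage == "complete":    # maximum distance
        return max(dists)
    if linkage == "average":     # average over all cross-cluster pairs
        return sum(dists) / len(dists)
    raise ValueError(f"unknown linkage: {linkage}")


def agglomerative(points, k, linkage="single"):
    """Merge the two closest clusters until only k clusters remain."""
    clusters = [[i] for i in range(len(points))]  # each instance starts alone
    while len(clusters) > k:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: cluster_distance(
                clusters[ab[0]], clusters[ab[1]], points, linkage
            ),
        )
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b is safe
        del clusters[b]
    return clusters


# Hypothetical points on a line with steadily growing gaps.
points = [(0.0,), (1.0,), (2.2,), (3.6,), (5.2,)]
print(agglomerative(points, 2, "single"))    # [[0, 1, 2, 3], [4]]
print(agglomerative(points, 2, "complete"))  # [[0, 1], [2, 3, 4]]
```

On this example single-link keeps absorbing the next neighbor into one growing cluster, whereas complete-link prefers two more compact groups.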

Page 25: Applications of Data Mining in Microarray Data Analysis

An Example of the Agglomerative Hierarchical Clustering Algorithm

• For the following data set, we will get different clustering results with the single-link and complete-link algorithms.

[Figure: a data set of six points, labeled 1 to 6.]

Page 26: Applications of Data Mining in Microarray Data Analysis

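Result of the Single-Link algorithm

[Figure: clustering of points 1 to 6, with the points listed in the order 1, 3, 4, 5, 2, 6.]

Result of the Complete-Link algorithm

[Figure: clustering of points 1 to 6, with the points listed in the order 1, 3, 2, 4, 5, 6.]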

Page 27: Applications of Data Mining in Microarray Data Analysis

Remarks

• Single-link and complete-link are the two most commonly used alternatives.

• Single-link suffers from the so-called chaining effect.

• On the other hand, complete-link also fails in some cases, because it is biased towards spherical clusters.

Page 28: Applications of Data Mining in Microarray Data Analysis

Example of the Chaining Effect

Single-link (10 clusters)

Complete-link (2 clusters)

Page 29: Applications of Data Mining in Microarray Data Analysis

Effect of Bias towards Spherical Clusters

Single-link (2 clusters) Complete-link (2 clusters)

Page 30: Applications of Data Mining in Microarray Data Analysis

K-Means: A Partitional Data Clustering Algorithm

• The k-means algorithm is probably the most commonly used partitional clustering algorithm.

• The k-means algorithm begins with selecting k data instances as the means or centers of k clusters.

Page 31: Applications of Data Mining in Microarray Data Analysis

• The k-means algorithm then executes the following loop until the convergence criterion is met:

    repeat {
        assign every data instance to the closest cluster, based on the distance between the data instance and the center of the cluster;
        compute the new centers of the k clusters;
    } until (the convergence criterion is met);

Page 32: Applications of Data Mining in Microarray Data Analysis

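• A commonly-used convergence criterion is based on the squared error

$$E = \sum_{C_i} \sum_{p \in C_i} \lVert p - m_i \rVert^{2},$$

where $m_i$ is the center of cluster $C_i$. The loop typically stops once $E$ no longer decreases.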
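The loop and the squared-error criterion E can be put together in a short sketch. It assumes Euclidean distance, caller-supplied initial centers, and the stopping rule "E did not decrease"; the function name `kmeans`, the `max_iter` cap, and the example points are illustrative assumptions.

```python
import math


def kmeans(points, centers, max_iter=100):
    """Return (centers, labels, E), where E is the sum of squared distances
    from every point to the center of its cluster."""
    centers = [tuple(c) for c in centers]
    prev_error = math.inf
    for _ in range(max_iter):
        # Assignment step: index of the closest center for every point.
        labels = [
            min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            for p in points
        ]
        # Update step: each center becomes the mean of its members.
        for i in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # an empty cluster keeps its old center
                centers[i] = tuple(sum(xs) / len(members) for xs in zip(*members))
        error = sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))
        if error >= prev_error:  # E stopped decreasing: converged
            break
        prev_error = error
    return centers, labels, error


# Hypothetical data: the corners of a wide rectangle.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]
good = kmeans(points, [(0, 0), (4, 0)])  # initial centers on opposite short sides
bad = kmeans(points, [(0, 0), (0, 1)])   # initial centers on the same short side
print(good[2], bad[2])  # 1.0 16.0
```

Both runs converge, but the second one settles on a top/bottom split with E = 16 instead of the left/right split with E = 1, which shows how strongly the result depends on the initial selection.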

Page 33: Applications of Data Mining in Microarray Data Analysis

Illustration of the K-Means Algorithm---(I)

[Figure: the data set with the three initial centers marked.]

Page 34: Applications of Data Mining in Microarray Data Analysis

Illustration of the K-Means Algorithm---(II)

[Figure: the three new centers after the 1st iteration, each marked “x”.]

Page 35: Applications of Data Mining in Microarray Data Analysis

Illustration of the K-Means Algorithm---(III)

[Figure: the three new centers after the 2nd iteration.]

Page 36: Applications of Data Mining in Microarray Data Analysis

A Case in which the K-Means Algorithm Fails

• The k-means algorithm may converge to a locally optimal state, as the following example demonstrates:

[Figure: the initial selection of centers and the resulting clusters.]

Page 37: Applications of Data Mining in Microarray Data Analysis

Remarks

• As the examples demonstrate, no clustering algorithm is definitely superior to other clustering algorithms with respect to clustering quality.

Page 38: Applications of Data Mining in Microarray Data Analysis

Applications of Data Clustering in Microarray Data Analysis

• Data clustering has been employed in microarray data analysis for
    • identifying the genes with similar expressions;
    • identifying the subtypes of samples.

Page 39: Applications of Data Mining in Microarray Data Analysis

Feature Selection in Microarray Data Analysis

• In microarray data analysis, it is highly desirable to identify those genes that are correlated to the classes of samples.

• For example, in the Leukemia data set, there are 7129 genes. We want to identify those genes that lead to different disease types.

Page 40: Applications of Data Mining in Microarray Data Analysis

• Furthermore, inclusion of features that are not correlated to the classification decision may result in lower classification accuracy or poor clustering quality.

• For example, in the data set shown on the following page, inclusion of the feature corresponding to the y-axis causes an incorrect prediction for the marked test instance if a 3NN classifier is employed.

Page 41: Applications of Data Mining in Microarray Data Analysis

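• It is apparent that the “o”s and “x”s are separated by the line x = 10. If only the attribute corresponding to the x-axis were selected, then the 3NN classifier would predict the class of the test instance correctly.

[Figure: scatter plot of the “o” and “x” objects with the line x = 10; axes x and y.]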
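The effect of an uninformative feature on a 3NN classifier can be reproduced with a small experiment. The data below are hypothetical and are not the points from the slide: the classes are separated by x = 10, while y varies widely and carries no class information. The `features` parameter, which selects the coordinates used in the distance, is an illustrative design choice.

```python
import math
from collections import Counter

# Hypothetical data (not from the slide): class is decided by x alone.
TRAIN = [
    ((8.0, 0.0), "o"), ((9.0, 20.0), "o"), ((9.5, 40.0), "o"),
    ((11.0, 10.0), "x"), ((12.0, 11.0), "x"), ((11.5, 9.0), "x"),
]
TEST = (9.0, 10.0)  # lies on the "o" side of x = 10


def knn_predict(query, train, k, features):
    """kNN vote using only the coordinates whose indices are in `features`."""
    def dist(point):
        return math.sqrt(sum((query[f] - point[f]) ** 2 for f in features))

    neighbors = sorted(train, key=lambda sample: dist(sample[0]))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]


print(knn_predict(TEST, TRAIN, 3, features=[0, 1]))  # x and y: predicts "x"
print(knn_predict(TEST, TRAIN, 3, features=[0]))     # x only:  predicts "o"
```

With both features, the large spread along y pushes the true “o” neighbors far away, so the three nearest samples are all “x”; dropping y restores the correct prediction.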

Page 42: Applications of Data Mining in Microarray Data Analysis

Univariate Analysis in Feature Selection

• In the univariate analysis, the importance of each feature is determined by how objects of different classes are distributed along this particular axis.

• Let $v_1, v_2, \ldots, v_m$ and $v'_1, v'_2, \ldots, v'_n$ denote the feature values of class-1 and class-2 objects, respectively.

• Assume that the feature values of both classes of objects follow the normal distribution.

Page 43: Applications of Data Mining in Microarray Data Analysis

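• Then,

$$T = \frac{\bar{v} - \bar{v}'}{\sqrt{\dfrac{(m-1)\,s^{2} + (n-1)\,s'^{2}}{m+n-2}\left(\dfrac{1}{m} + \dfrac{1}{n}\right)}}$$

follows a t-distribution with $m+n-2$ degrees of freedom, where, for the $m$ class-1 values $v_i$ and the $n$ class-2 values $v'_i$,

$$\bar{v} = \frac{1}{m}\sum_{i=1}^{m} v_i \quad\text{and}\quad \bar{v}' = \frac{1}{n}\sum_{i=1}^{n} v'_i;$$

$$s^{2} = \frac{1}{m-1}\sum_{i=1}^{m}\left(v_i - \bar{v}\right)^{2} \quad\text{and}\quad s'^{2} = \frac{1}{n-1}\sum_{i=1}^{n}\left(v'_i - \bar{v}'\right)^{2}.$$

If the magnitude $|T|$ of a feature's t statistic is lower than a threshold, then the feature is deleted.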
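The t-statistic filter can be sketched with the standard library alone. The code assumes the pooled-variance two-sample form of the statistic and compares its absolute value with the threshold; the function names, the toy samples, and the threshold value 2.0 are illustrative assumptions, not values from the slides.

```python
import math
from statistics import mean, variance


def t_statistic(class1, class2):
    """Pooled-variance two-sample t statistic of one feature."""
    m, n = len(class1), len(class2)
    # statistics.variance is the sample variance (divides by size - 1).
    pooled = ((m - 1) * variance(class1) + (n - 1) * variance(class2)) / (m + n - 2)
    return (mean(class1) - mean(class2)) / math.sqrt(pooled * (1 / m + 1 / n))


def select_features(samples1, samples2, threshold):
    """Keep the indices of features whose |t| reaches the threshold.

    samples1 and samples2 hold the class-1 and class-2 samples, one list of
    feature values per sample."""
    kept = []
    for j in range(len(samples1[0])):
        t = t_statistic([s[j] for s in samples1], [s[j] for s in samples2])
        if abs(t) >= threshold:
            kept.append(j)
    return kept


# Hypothetical samples with two features: feature 0 separates the classes,
# feature 1 has the same mean in both.
class1 = [[1, 5], [2, 7], [3, 6]]
class2 = [[4, 6], [5, 5], [6, 7]]
print(select_features(class1, class2, threshold=2.0))  # [0]
```

In practice the threshold would be chosen from the t-distribution with m + n − 2 degrees of freedom at a desired significance level.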

Page 44: Applications of Data Mining in Microarray Data Analysis

Multivariate Analysis

• The univariate analysis is not able to identify crucial features in the following example.

Page 45: Applications of Data Mining in Microarray Data Analysis

• Therefore, multivariate analysis has been developed. However, most multivariate analysis algorithms that have been proposed suffer from high time complexity and may not be applicable to real-world problems.

Page 46: Applications of Data Mining in Microarray Data Analysis

Summary

• Data clustering and data classification have been widely used in microarray data analysis.

• Feature selection is the most challenging issue as of today.