Robust Prediction of Cancer Disease Using Pattern Classification of Microarray Gene-Expression Data


Upload: md-rahman

Post on 15-Apr-2017


Page 1: Robust Prediction of Cancer Disease Using Pattern Classification of Microarray Gene-Expression

Robust Prediction of Cancer Disease Using Pattern Classification of Microarray Gene-Expression Data

Presented by:

Md. Mushfiqur Rahman

Researcher

Bioinformatics Lab.

Dept. of Statistics, R.U.

E-mail: [email protected]

Md. Matiur Rahaman1,2, Md. Mushfiqur Rahman2, Md. Nurul Haque Mollah2 and Ming Chen1

1. Department of Bioinformatics, College of Life Sciences, Zhejiang University, Zijingang Campus, Hangzhou 310058, China.

2. Bioinformatics Lab, Department of Statistics, University of Rajshahi, Rajshahi, Bangladesh.

International Conference on Applied Statistics (ICAS)

The Institute of Statistical Research and Training (ISRT)

University of Dhaka, Dhaka, 27-29 December 2014

Supported by HEQEP (CP-3603.R3.W2) and Bioinformatics Lab., Dept. of Statistics, RU.

Welcome to the presentation.

Page 2: Robust Prediction of Cancer Disease Using Pattern Classification of Microarray Gene-Expression

Outline

1. Introduction to Gene-Expression Data.

2. Robust Classifier.

3. Performance Investigation of Robust Classifiers using Simulated Data.

4. Performance Investigation using Simulated Gene-Expression Profile for Prediction of Cancer Disease.

5. Performance Investigation using Real Gene-Expression Profile for Prediction of Cancer Disease.

6. Conclusion.

Page 3: Robust Prediction of Cancer Disease Using Pattern Classification of Microarray Gene-Expression

Introduction to Gene-Expression Data

• The expression levels of the genes of an individual, as measured with a microarray, are called gene-expression data. Each data point produced by a DNA microarray hybridization experiment represents the ratio of the expression levels of a particular gene under two different experimental conditions.

[Figure: Gene Expression]

Page 4: Robust Prediction of Cancer Disease Using Pattern Classification of Microarray Gene-Expression

Microarray Technology and Gene Expression Data


Page 5: Robust Prediction of Cancer Disease Using Pattern Classification of Microarray Gene-Expression

Example of Gene-Expression Data

Rows are genes, columns are mRNA samples:

Gene   sample1   sample2   sample3   sample4   sample5   …
1        0.46      0.30      0.80      1.51      0.90    ...
2       -0.10      0.49      0.24      0.06      0.46    ...
3        0.15      0.74      0.04      0.10      0.20    ...
4       -0.45     -1.03     -0.79     -0.56     -0.32    ...
5       -0.06      1.06      1.35      1.09     -1.09    ...

Gene expression level of gene i in mRNA sample j = log(Red intensity / Green intensity)
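The log-ratio above can be sketched in a few lines of Python. This is only an illustration: the slide does not state the base of the logarithm, so base 2 is an assumption here, and the function name `log_ratio` and the intensity values are hypothetical.

```python
import numpy as np

def log_ratio(red, green, base=2.0):
    """Element-wise log(red / green) for a genes x samples intensity matrix.

    The base is an assumption (base 2); the slide only writes "Log".
    """
    red = np.asarray(red, dtype=float)
    green = np.asarray(green, dtype=float)
    if np.any(red <= 0) or np.any(green <= 0):
        raise ValueError("intensities must be strictly positive")
    return np.log(red / green) / np.log(base)

# Hypothetical intensities: 2 genes (rows) x 3 mRNA samples (columns).
red = np.array([[800.0, 400.0, 200.0],
                [100.0, 100.0, 50.0]])
green = np.array([[400.0, 400.0, 400.0],
                  [100.0, 200.0, 200.0]])

expr = log_ratio(red, green)  # positive: red channel higher, negative: green channel higher
print(expr)
```

A value of 0 means the gene is expressed equally under the two conditions, which is why the example table contains both positive and negative entries.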

Page 6: Robust Prediction of Cancer Disease Using Pattern Classification of Microarray Gene-Expression

A Complete workflow for Gene-Expression data analysis

[Figure: workflow for classification of real microarray gene-expression data]

Page 7: Robust Prediction of Cancer Disease Using Pattern Classification of Microarray Gene-Expression

Different Classification Methods

Supervised classification

1. Bayes classifier
2. Maximum likelihood classifier
3. FLDA (Fisher Linear Discriminant Analysis)
4. SVM (Support Vector Machines)
5. Decision Trees
6. K-NN (K-Nearest Neighbors)
7. AdaBoost
8. Robust Classifier (Proposed)
9. And so on

Unsupervised classification (Clustering)

Hierarchical Clustering

   Divisive methods (Top-Down)

   Agglomerative methods (Bottom-Up)

   1. Single Linkage Clustering / Nearest Neighbor Technique
   2. Complete Linkage Clustering
   3. Average Linkage Clustering
   4. Ward's Hierarchical Clustering
   5. Centroid Method
   6. Median Method
   7. And so on

Partition-Based Clustering

   1. K-Means Clustering
   2. Fuzzy Clustering
   3. Model Based Clustering
   4. And so on

Page 8: Robust Prediction of Cancer Disease Using Pattern Classification of Microarray Gene-Expression

Bayes Classifier

Bayes classifier: assigns an object to the class with the highest posterior probability.

Foundation: based on Bayes' theorem.

A short note on the Bayes classifier under normal populations

Let $\pi_1, \ldots, \pi_m$ be $m$ normal populations.

Let $\{x_i^{(k)} \sim N_p(\mu^{(k)}, V^{(k)});\ i = 1, 2, \ldots, N_k;\ k = 1, 2, \ldots, m\}$ be the training data set.

The objective is to classify a new data vector (or test data vector) $x$ into one of the $m$ populations $\pi_1, \ldots, \pi_m$.

Let the prior probability of $\pi_k$ be $q_k$, which is known.

Then the posterior probability of $\pi_k$ given $x$ is defined by

$$P(\pi_k \mid x) = \frac{q_k f_k(x)}{\sum_{j=1}^{m} q_j f_j(x)},$$

where $f_k(x)$ is the pdf of $N_p(\mu^{(k)}, V^{(k)})$, that is, the pdf of $\pi_k$.

Page 9: Robust Prediction of Cancer Disease Using Pattern Classification of Microarray Gene-Expression

Bayes Classifier (Cont…)

Then the classification region $R_k$, in which $x$ is assigned to the population $\pi_k$, consists of those $x$ for which the posterior probability of $\pi_k$ is the largest among the $m$ populations. The quantity compared across the populations is the discriminant function.

This rule is known as the Bayes classifier for assigning an object $x$ to the population $\pi_k$.
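A minimal sketch of this rule for normal populations follows. The exact form of the discriminant function on the original slide is not in the transcript, so the sketch assumes the standard choice $\log q_k - \frac{1}{2}\log|V^{(k)}| - \frac{1}{2}(x-\mu^{(k)})^T (V^{(k)})^{-1}(x-\mu^{(k)})$, which equals $\log(q_k f_k(x))$ up to a constant shared by all populations. The function names and the example parameters are hypothetical.

```python
import numpy as np

def discriminant(x, mu, V, q):
    """log(q_k * f_k(x)) without the constant -(p/2) log(2*pi) common to all classes."""
    diff = x - mu
    _, logdet = np.linalg.slogdet(V)          # log-determinant of the covariance matrix
    maha = diff @ np.linalg.solve(V, diff)    # squared Mahalanobis distance
    return np.log(q) - 0.5 * logdet - 0.5 * maha

def bayes_classify(x, mus, Vs, qs):
    """Return the index of the population with the largest discriminant and the posteriors."""
    scores = np.array([discriminant(x, m, V, q) for m, V, q in zip(mus, Vs, qs)])
    post = np.exp(scores - scores.max())      # subtract the maximum for numerical stability
    post /= post.sum()                        # posterior probabilities P(pi_k | x)
    return int(np.argmax(scores)), post

# Hypothetical two-population example in p = 2 dimensions.
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Vs = [np.eye(2), np.eye(2)]
qs = [0.5, 0.5]

label_a, post_a = bayes_classify(np.array([0.5, 0.2]), mus, Vs, qs)
label_b, post_b = bayes_classify(np.array([2.8, 3.1]), mus, Vs, qs)
print(label_a, post_a, label_b, post_b)
```

In practice the parameters are replaced by estimates from the training data, which is where outliers can distort the classical version of the rule.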

Page 10: Robust Prediction of Cancer Disease Using Pattern Classification of Microarray Gene-Expression


Bayes Classifier (Cont…)

• The traditional Bayes procedure may produce misleading results in the presence of outliers in the training dataset, the test dataset, or both.

• To improve the results, one can replace the MLEs by robust estimators such as the MVE (Rousseeuw et al., 1985), MCD (Rousseeuw et al., 1985) and OGK (Maronna and Zamar, 2002) estimators.

• However, these robust procedures do not perform well on high-dimensional datasets. Also, these estimators may not control the influence of a contaminated test vector (x).

• To overcome this problem, we attempt to robustify the Bayes procedure by the minimum β-divergence method (Mollah et al., 2007, 2010), which is our proposed method.

Page 11: Robust Prediction of Cancer Disease Using Pattern Classification of Microarray Gene-Expression

Robust Bayes classifier

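• The minimum β-divergence estimators $\hat{\mu}_\beta^{(k)}$ and $\hat{V}_\beta^{(k)}$ for the mean vector $\mu^{(k)}$ and the covariance matrix $V^{(k)}$, respectively, are obtained iteratively as follows:

$$\mu_{r+1}^{(k)} = \frac{\sum_{i=1}^{n_k} \phi_\beta\left(x_i^{(k)}; \mu_r^{(k)}, V_r^{(k)}\right) x_i^{(k)}}{\sum_{i=1}^{n_k} \phi_\beta\left(x_i^{(k)}; \mu_r^{(k)}, V_r^{(k)}\right)}$$

and

$$V_{r+1}^{(k)} = \frac{\sum_{i=1}^{n_k} \phi_\beta\left(x_i^{(k)}; \mu_r^{(k)}, V_r^{(k)}\right) \psi\left(x_i^{(k)}; \mu_r^{(k)}\right)}{(\beta+1)^{-1} \sum_{i=1}^{n_k} \phi_\beta\left(x_i^{(k)}; \mu_r^{(k)}, V_r^{(k)}\right)}$$

where

• $\phi_\beta\left(x_i^{(k)}; \mu_r^{(k)}, V_r^{(k)}\right) = \exp\left\{-\frac{\beta}{2}\left(x_i^{(k)} - \mu_r^{(k)}\right)^T \left(V_r^{(k)}\right)^{-1} \left(x_i^{(k)} - \mu_r^{(k)}\right)\right\}$ is the β-weight function, and $\psi\left(x_i^{(k)}; \mu_r^{(k)}\right) = \left(x_i^{(k)} - \mu_r^{(k)}\right)\left(x_i^{(k)} - \mu_r^{(k)}\right)^T$.

• For β = 0, these estimators reduce to the classical non-iterative estimates.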
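The iteration above can be sketched as follows for a single population. This is a sketch under stated assumptions rather than the authors' implementation: the starting values (coordinate-wise median and a MAD-based diagonal matrix), the stopping rule, the default β = 0.2 and the names `beta_weight` and `min_beta_estimates` are all hypothetical choices that the slides do not specify.

```python
import numpy as np

def beta_weight(X, mu, V, beta):
    """phi_beta(x; mu, V) = exp(-(beta/2) (x-mu)^T V^{-1} (x-mu)) for each row x of X."""
    diff = X - mu
    maha = np.einsum("ij,ij->i", diff, np.linalg.solve(V, diff.T).T)
    return np.exp(-0.5 * beta * maha)

def min_beta_estimates(X, beta=0.2, max_iter=200, tol=1e-8):
    """Iterate the two update equations until the estimates stop changing."""
    X = np.asarray(X, dtype=float)
    # Assumed starting values (not given on the slide): median and MAD-based scale.
    mu = np.median(X, axis=0)
    mad = np.median(np.abs(X - mu), axis=0)
    V = np.diag((1.4826 * mad) ** 2 + 1e-12)
    for _ in range(max_iter):
        w = beta_weight(X, mu, V, beta)       # beta-weights at the current (mu_r, V_r)
        diff = X - mu                         # psi is evaluated at the current mean mu_r
        mu_new = w @ X / w.sum()
        V_new = (1.0 + beta) * (diff.T * w) @ diff / w.sum()
        change = np.linalg.norm(mu_new - mu) + np.linalg.norm(V_new - V)
        mu, V = mu_new, V_new
        if change < tol:
            break
    return mu, V

# Hypothetical data: 200 clean points around the origin plus 20 gross outliers.
rng = np.random.default_rng(0)
clean = rng.normal(size=(200, 2))
outliers = np.full((20, 2), 10.0)
X = np.vstack([clean, outliers])

mu_classical = X.mean(axis=0)                       # pulled towards the outliers
mu_robust, V_robust = min_beta_estimates(X, beta=0.2)
mu_b0, V_b0 = min_beta_estimates(X, beta=0.0)       # beta = 0: classical estimates
print(mu_classical, mu_robust)
```

Observations far from the current mean receive β-weights close to zero, so they contribute almost nothing to either update. With β = 0 every weight equals one and the updates return the sample mean and the maximum likelihood covariance estimate.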

Page 12: Robust Prediction of Cancer Disease Using Pattern Classification of Microarray Gene-Expression

Robust Bayes Classifier (Cont…)

• The β-weight function plays a significant role in the robustification of the Bayes classifier, as discussed below.

• Step 1: First, we calculate the β-weight of the test vector (x) using the β-weight function, and then we construct a criterion to test whether the data vector is contaminated or not. [The criterion is shown as an equation on the original slide and is not recoverable from this transcript.]

Page 13: Robust Prediction of Cancer Disease Using Pattern Classification of Microarray Gene-Expression

Robust Bayes Classifier (Cont…)


Page 14: Robust Prediction of Cancer Disease Using Pattern Classification of Microarray Gene-Expression

Robust Bayes Classifier (Cont…)

Step 2: If the unclassified data vector $x$ is contaminated by outliers, we calculate the absolute difference between the contaminated vector and each mean vector, component by component, as

$$d_{ki} = \left| x_i - \hat{\mu}_{i,\beta}^{(k)} \right|, \quad i = 1, 2, \ldots, p.$$

Compute the sum of the smallest $r$ components of $d_k$ as

$$S_k = d_{k(1)} + d_{k(2)} + \cdots + d_{k(r)},$$

where $r = \mathrm{round}(p/2)$ and $d_{k(1)} \le d_{k(2)} \le \cdots \le d_{k(p)}$ are the ordered components of $d_k$. Then find the tentative class or population for the unclassified data vector $x$ as

$$\hat{k} = \arg\min_k S_k.$$

Then some or all components of the unclassified contaminated data vector $x$ corresponding to $d_{\hat{k}(r+1)}, d_{\hat{k}(r+2)}, \ldots, d_{\hat{k}(p)}$ are assumed to be corrupted by outliers.
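Step 2 can be sketched directly from these definitions. The function name `tentative_class` and the example vectors are hypothetical, and the robust mean vectors are taken as given here. Note that Python's built-in `round` rounds exact halves to the nearest even integer, which matters only when p is odd.

```python
import numpy as np

def tentative_class(x, means):
    """Return (k_hat, S, suspect) for a test vector x and an (m, p) array of mean vectors."""
    x = np.asarray(x, dtype=float)
    means = np.asarray(means, dtype=float)
    p = x.size
    r = round(p / 2)                              # number of components treated as clean
    d = np.abs(x - means)                         # d[k, i] = |x_i - mu_i^(k)|
    S = np.sort(d, axis=1)[:, :r].sum(axis=1)     # sum of the r smallest components per class
    k_hat = int(np.argmin(S))                     # tentative class
    order = np.argsort(d[k_hat], kind="stable")   # components ordered by distance
    suspect = order[r:]                           # indices assumed to be possibly corrupted
    return k_hat, S, suspect

# Hypothetical example: two classes, p = 6, last two components grossly contaminated.
means = np.array([[0.0] * 6,
                  [5.0] * 6])
x = np.array([0.1, -0.2, 0.0, 0.3, 100.0, -80.0])

k_hat, S, suspect = tentative_class(x, means)
print(k_hat, S, sorted(suspect.tolist()))
```

Because only the r closest components enter $S_k$, a few wildly wrong components of x cannot change which class looks nearest.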


Page 15: Robust Prediction of Cancer Disease Using Pattern Classification of Microarray Gene-Expression

Performance Investigation of Robust Classifiers using Simulated Data

[Figure panels: "No contamination" and "Both contamination"]

Page 16: Robust Prediction of Cancer Disease Using Pattern Classification of Microarray Gene-Expression

Application of the Proposed Method for Gene Expression Data Analysis

Gene Expression Data Generating Model

Nowak and Tibshirani (2008), Biostatistics, 9(3), 467-483.

Page 17: Robust Prediction of Cancer Disease Using Pattern Classification of Microarray Gene-Expression

Performance Investigation using Simulated Gene-Expression Profile for Prediction of Cancer Disease

Two Class Gene Classification (Absence of Outliers)


Page 18: Robust Prediction of Cancer Disease Using Pattern Classification of Microarray Gene-Expression

Performance Investigation using Simulated Gene-Expression Profile for Prediction of Cancer Disease

Two Class Gene Classification (Presence of Outliers)


Page 19: Robust Prediction of Cancer Disease Using Pattern Classification of Microarray Gene-Expression

Performance Investigation using Simulated Gene-Expression Profile for Prediction of Cancer Disease

Three Class Gene Classification (Absence of Outliers)


Page 20: Robust Prediction of Cancer Disease Using Pattern Classification of Microarray Gene-Expression

Performance Investigation using Simulated Gene-Expression Profile for Prediction of Cancer Disease

Three Class Gene Classification (Presence of Outliers)


Page 21: Robust Prediction of Cancer Disease Using Pattern Classification of Microarray Gene-Expression

Performance Investigation using Simulated Gene-Expression Profile for Prediction of Cancer Disease

(No Contamination)

Box Plot For Cancer Individuals Classification


Page 22: Robust Prediction of Cancer Disease Using Pattern Classification of Microarray Gene-Expression

Performance Investigation using Simulated Gene-Expression Profile for Prediction of Cancer Disease

(Train Data Contamination)

Box Plot For Cancer Individuals Classification


Page 23: Robust Prediction of Cancer Disease Using Pattern Classification of Microarray Gene-Expression

Performance Investigation using Simulated Gene-Expression Profile for Prediction of Cancer Disease

(Test Data Contamination)

Box Plot For Cancer Individuals Classification


Page 24: Robust Prediction of Cancer Disease Using Pattern Classification of Microarray Gene-Expression

Performance Investigation using Simulated Gene-Expression Profile for Prediction of Cancer Disease

(Both Data Contamination)

Box Plot For Cancer Individuals Classification


Page 25: Robust Prediction of Cancer Disease Using Pattern Classification of Microarray Gene-Expression

Real Gene Expression Data Analysis

Head and Neck Cancer Data (Kuriakose et al., 2004)

12,625 genes, 22 normal patients, 22 cancer patients

594 differentially expressed (DE) genes out of 12,625 genes, identified by the EBarrays method

Training gene set: ½ of the DE genes

Page 26: Robust Prediction of Cancer Disease Using Pattern Classification of Microarray Gene-Expression

Performance Investigation using Real Gene-Expression Profile for Prediction of Cancer Disease

(In absence of outlier)


Page 27: Robust Prediction of Cancer Disease Using Pattern Classification of Microarray Gene-Expression

Performance Investigation using Real Gene-Expression Profile for Prediction of Cancer Disease

(In Presence of outlier)


Page 28: Robust Prediction of Cancer Disease Using Pattern Classification of Microarray Gene-Expression

Conclusion

The Bayes procedure is a popular tool for classification. However, the traditional Bayes procedure is very sensitive to outliers. We therefore discussed a robustification of the Bayes procedure by the minimum β-divergence method (Mollah et al., 2007, 2010).

We compared our proposed method with some popular classification methods that are used for microarray gene-expression data analysis (SVM, KNN, AdaBoost) on simulated datasets, and we observed that the proposed method performed better than all of these methods.

We checked the performance of the proposed method on both simulated and real gene-expression data. The simulation and real-data results show that the proposed method significantly improves on the traditional Bayes method in the presence of outliers; otherwise, it performs equally well.

Page 29: Robust Prediction of Cancer Disease Using Pattern Classification of Microarray Gene-Expression

References

Anderson, T.W. (2003): An Introduction to Multivariate Statistical Analysis. Wiley Interscience.

Johnson, R.A. and Wichern, D.W. (2007): Applied Multivariate Statistical Analysis, Sixth edition. Prentice-Hall.

Mollah, M.N.H., Minami, M. and Eguchi, S. (2007): Robust prewhitening for ICA by minimizing beta-divergence and its application to FastICA. Neural Processing Letters, 25(2), pp. 91-110.

Mollah, M.N.H., Sultana, N., Minami, M. and Eguchi, S. (2010): Robust extraction of local structures by the minimum β-divergence method. Neural Networks, 23, pp. 226-238.

Wang, S., Gui, J. and Li, X. (2008): Factor analysis for cross-platform tumor classification based on gene expression profiles. Journal of Circuits, Systems, and Computers, 19, pp. 243-258.

Wuju, L. and Momiao, X. (2002): Tumor classification system based on gene expression profile. Bioinformatics, 18(2), pp. 325-326.

Wright, G., Tan, B., Rosenwald, A., Hurt, E., Wiestner, A. and Staudt, L. (2003): A gene expression-based method to diagnose clinically distinct subgroups of diffuse large B cell lymphoma. Proc Natl Acad Sci USA, 100, pp. 9991-9996.

Nowak, G. and Tibshirani, R. (2008): Complementary hierarchical clustering. Biostatistics, 9(3), pp. 467-483.

Veer, L.J. et al. (2002): Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415, pp. 530-536.

Page 30: Robust Prediction of Cancer Disease Using Pattern Classification of Microarray Gene-Expression

Thank you for listening.
