myths and statistical principles in dna microarray research richard simon, d.sc. chief, biometric...

50
Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics National Cancer Institute

Post on 19-Dec-2015

219 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics

Myths and Statistical Principles in DNA Microarray Research

Richard Simon, D.Sc.

Chief, Biometric Research Branch

Head, Molecular Statistics & Bioinformatics

National Cancer Institute

Page 2: Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics

• All cells of a multi-cellular organism contain essentially the same DNA

• Cells differ in function based on the spectra of which genes are expressed and the level of expression

• Proteins do the work of cells and gene expression determines the intra-cellular concentration of proteins

• mRNA is an intermediate product of gene expression; a gene is transcribed into a mRNA molecule which is then translated into a protein molecule

Page 3: Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics
Page 4: Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics
Page 5: Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics

Types of DNA Microarrays

• mRna transcript quantification

• Genomic DNA sequence determination– SNP identification– Genotyping

• Detecting gene deletions or gene duplications

Page 6: Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics

Types of Microarrays

• DNA microarrays

• Tissue microarrays

• Protein microarrays

Page 7: Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics
Page 8: Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics
Page 9: Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics

[Affymetrix] Hybridization Array

Page 10: Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics

Biology in Transition

• Biotechnology – Restriction enzymes

– Ligases

– Polymerases

– PCR

• Instruments, Tools, Reagents and Information Resources of Major Impact– DNA sequencing

– Functional whole genomic assays

Page 11: Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics

How to Deal With the Plethora of Data

• Development of software tools

• Training of biologists to use tools

• Collaboration with mathematical & computational scientists

• Training of mathematical & computational scientists

Page 12: Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics

Bioinformatics

• An ambiguous term that helps further confuse people who are sometimes already confused

• Refers to a range of activities all of which involve multi-disciplinary collaboration among biological, mathematical, computational scientists and software engineers

• Organizations searching for structures that will support quality inter-disciplinary research in bioinformatics

Page 13: Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics

Organizing for Bioinformatics

• Collaborative, not service oriented

• Enable extensive interaction and education

• Enable scientists to be stimulated by important problems and to accomplish organizational and personal goals in solving them

Page 14: Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics

Molecular Statistics & Bioinformatics Section

• Utilize mathematical and computational sciences in conjunction with data from genomics & high thruput technologies to elucidate the biological basis of cancer – translating this to effective means of

eradicating cancer

• Train statisticians, mathematicians, physical and biological scientists in cancer computational biology

Page 15: Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics

Microarray Research

• Collaborative data analysis

• Methodology development

• Software development

Page 16: Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics

Microarray Myths• That the greatest challenge is managing the mass of

micro-array data• That pattern-recognition or data mining are the most

appropriate paradigm for the analysis of micro-array data

• That pre-packaged analysis tools are a substitute for collaboration with statistical scientists in complex problems

• That statistical collaboration can be a service function• That statisticians can be effective collaborators

without substantial knowledge of biology and microarray technology

Page 17: Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics

Applications of DNA Microarrays to Cancer Research

• Identify genes and pathways involved in oncogenesis– Transgenic mouse models– Profiling pre-cancerous lesions

• Identifying molecular targets for– therapeutics– early detection

Page 18: Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics

Applications of DNA Microarrays to Cancer Research

• Diagnostic classification– For identifying disease subsets with distinctive

pathogenesis– For selecting therapy

• Large cell lymphoma

• Stage I breast cancer

Page 19: Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics

DNA Microarray Analytics

• Design issues– Arrays

– Specimens• Labeling

• Replication

• Image analysis– Pixels to feature

• Feature analysis– Background

adjustment

– Normalization

– Features to genes

– Normalization

• Analysis of biological objectives

Page 20: Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics

Method of Analysis Should Be Tailored to Objectives

• Class discovery– Identifying expression profiles characteristic of

non-predefined subsets of tumors

• Class/phenotype prediction– Identifying expression profiles that distinguish

predefined subsets of tumors

Page 21: Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics

Components of Class Prediction

• Establish that expression “profiles” differ to a statistically significant degree and that differences observed are not due to examination of thousands of genes

• Identify genes that account for the differences between classes

• Develop multi-gene classifier to predict the class for a new sample and estimate the mis-classification rates

Page 22: Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics

Do Expression Profiles Differ for Two Defined Classes of Arrays?

• Not a clustering problem– Global similarity measures generally used for

clustering arrays may not distinguish classes

• Supervised vs unsupervised methods

• Requires multiple biological samples from each class

Page 23: Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics

Do Expression Profiles Differ for Two Defined Classes of Arrays?

• Global test– Number of genes significantly differentially expressed

among classes at specified nominal significance level

– Cross-validated mis-classification rate

• Multiple comparison adjustment for finding differentially expressed genes– Experiment-wise error

– Univariate screening with p<0.001 threshold

– False discovery rate

Page 24: Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics

training set

test set

spec

imen

s

log-expression ratios

spec

imen

s

log-expression ratios

full data set

Non-cross-validated Prediction

Cross-validated Prediction (Leave-one-out method)

1. Prediction rule is built using full data set.2. Rule is applied to each specimen for class

prediction.

1. Full data set is divided into training and test sets (test set contains 1 specimen).

2. Prediction rule is built using the training set.

3. Rule is applied to the specimen in the test set for class prediction.

4. Process is repeated until each specimen has appeared once in the test set.

Page 25: Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics

Prediction on Simulated Null Data

Generation of Gene Expression Profiles

• 14 specimens (Pi is the expression profile for specimen i)

• Log-ratio measurements on 6000 genes

• Pi ~ MVN(0, I6000)

• Can we distinguish between the first 7 specimens (Class 1) and the last 7 (Class 2)?

Prediction Method

• Compound covariate prediction (discussed later)

• Compound covariate built from the log-ratios of the 10 most differentially expressed genes.

Page 26: Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics

Percentage of simulated data setswith m or fewer misclassifications

mNon-cross-validated

class predictionCross-validatedclass prediction

0 99.85 0.601 100.00 2.702 100.00 6.203 100.00 11.204 100.00 16.905 100.00 24.256 100.00 34.007 100.00 42.558 100.00 53.859 100.00 63.60

10 100.00 74.5511 100.00 83.5012 100.00 91.1513 100.00 96.8514 100.00 100.00

Page 27: Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics

Exact Permutation Test

Premise: Under the null hypothesis of no systematic difference in expression profiles between the two classes, it can be assumed that assignment of class labels to expression profiles is purely coincidental.

Performing the test

1. Consider every possible permutation of the class labels among the gene expression profiles.

2. Determine the proportion of the permutations that result in a misclassification error rate less than or equal to the observed error rate.

3. This proportion is the achieved significance level in a test of thenull hypothesis.

Page 28: Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics

Examining all permutations is computationally burdensome.Instead, a Monte Carlo method is used…

• nperm permutations of the labels are randomly generated.

• The proportion of these permutations that have m or fewer misclassifications is an estimate of the achieved significance level in a test of the null hypothesis.

• nperm is chosen such that the variability in the estimate is less than an acceptable level.

• If the true proportion of permutations with m 2 is 0.05, nperm= 2000 ensures the coefficient of variation of the estimate of the achieved significance level is less than 0.1.

Monte Carlo Permutation Test

Page 29: Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics

Gene-Expression Profiles in Hereditary Breast Cancer

• Breast tumors studied:7 BRCA1+ tumors8 BRCA2+ tumors7 sporadic tumors

• Log-ratios measurements of 3226 genes for each tumor after initial data filtering

cDNA MicroarraysParallel Gene Expression Analysis

RESEARCH QUESTIONCan we distinguish BRCA1+ from BRCA1– cancers and BRCA2+ from BRCA2– cancers based solely on their gene expression profiles?

Page 30: Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics

The Compound Covariate Predictor (CCP)• We consider only genes that are differentially expressed between

the two groups (using a two-sample t-test with small ).

• The CCP– Motivated by J. Tukey, Controlled Clinical Trials, 1993

– Simple approach that may serve better than complex multivariate analysis

– A compound covariate is built from the basic covariates (log-ratios)

tj is the two-sample t-statistic for gene j.

xij is the log-ratio measure of sample i for gene j.

Sum is over all differentially expressed genes.

• Threshold of classification: midpoint of the CCP means for the two classes.

j

ijji xtCCP

Page 31: Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics

Classification of hereditary breast cancers with the compound covariate predictor

Class labels

Number ofdifferentially

expressed genesm = number of

misclassifications

Proportion of randompermutations with m orfewer misclassifications

BRCA1+ vs. BRCA1

9 1 (0 BRCA1+, 1 BRCA1) 0.004

BRCA2+ vs. BRCA2

11 4 (3 BRCA2+, 1 BRCA2) 0.043

Page 32: Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics

Accuracy of class prediction as selection stringency increases

BRCA1Classification

BRCA2Classification

α ngene m ngene m10-2 182 3 212 410-3 53 2 49 310-4 9 1 11 4

Page 33: Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics

Advantages of Compound Covariate Classifier

• Good feature selection

• Does not over-fit data– Incorporates influence of multiple predictive

variables without attempting to select the best small subset of variables

– Does not attempt to model the multivariate interactions among the predictors and outcome

Page 34: Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics

Extensions

• Adjustment for covariates

• Paired samples

• Survival data

• Other classification methods

• More than 2 classes

Page 35: Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics

1-2

82

1-2

64

2-3

2-1

8

2-5

1-2

3

1-6

P

2-1

1

2-1

4

1-6

2

2-5

6

2-8

2

2-1

102-5

12-6

7

2-7

5 2-6

12-9

4

2-7

7

2-1

85

1-6

9

2-8

9

1-9

5

2-1

13

1-8

61-8

1

1-8

4

1-9

P

1-9

6

1-9

9P

1-1

04P

1-1

2P 1-4

48

2-1

75

2-2

65

2-2

0

1-1

3

2-3

2

1-7

6U

1-7

8

1-8

8

1-9

1

2-2

2

2-1

0

2-1

14

1-2

592-1

42

2-1

17

2-1

03

2-1

25

2-1

00

2-1

29

2-3

0

2-9

3

2-1

16

2-1

30

1-1

27P

2-9

0

1-1

2P

N

1-1

3N

1-1

4N

1-6

PN

1-1

1N

1-9

PN

1-1

2P

N

1-8

1N

0.2

0.4

0.6

0.8

1.0

1-c

en

tere

d c

orr

ela

tion

4922x58 stage-4 tumor arrays and normals

Page 36: Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics

282

264

3

18

5

23

6P

11

1462

56 82

110

51

67

75

61

94

77

185

69

89

95 11

3

86

81

84

9P

96

99P

104P12P

448 1

75

265 20

13

3276U

78

88

91

22

10

114

259

14211

7

103

125

100

12930

9311

6

130

127P

90

0.0

0.2

0.4

0.6

0.8

1.0

1.2

1-c

en

tere

d c

orr

ela

tion

Genes are significantly associated with survival

Page 37: Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics

surv

iva

l

0 500 1000 1500

0.0

0.2

0.4

0.6

0.8

1.0

Survival Analysis by Clustering Patients With Genes Significantly Associated with Survival

Genes are chosen at .001 significance level

good prognosis, N=34poor prgnosis, N=24

Page 38: Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics

surv

iva

l

0 500 1000 1500

0.0

0.2

0.4

0.6

0.8

1.0

Cross-validation Survival Analysis by Clustering Patients with Genes Significantly Associated with SurvivalWithheld observation is classified by the K nearest neighbor rule (K=5)

Genes are chosen at .001 significance level

good prognosis, N=32poor prognosis, N=26

Page 39: Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics

Class Discovery

• For determining whether a set of tumors is homogeneous with regard to expression profile

Page 40: Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics

Class Discovery Methods

• Cluster analysis

• Multi-dimensional Scaling

Page 41: Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics

0.2

0.3

0.4

0.5

0.6

0.7

0.8

11

2

9

12

6

25

10

4

13

1 3

28

29

26

21

7

23

30

22

24

8

31

18

27

5

17

19

20

14

15

16

1 -

corr

elat

ion

Melanoma Gene Expression Data

19 tumor cluster of interest

Q: Can gene expression profiles of melanoma be used to distinguish sub-classes of disease? (M. Bittner et al., Nature Genetics Aug 2000)

Page 42: Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics
Page 43: Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics

Validation of Clusters

• Clustering algorithms find clusters, even when they are spurious

• Clusters found may change with re-assaying tumors or selection of new tumors

Page 44: Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics

Clustering Arrays

• Cluster significance

• Cluster reproducibility

Page 45: Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics

Cluster reproducibility

• Add perturbation noise to original data• Re-cluster perturbed data to assess stability of

original clusters• D: Proportion of pairs of samples in a specified

cluster of the original data that are in separate clusters after perturbation

• R: Average number of specimens lost or gained in a specified cluster || CP(C) - CP(C) ||

Page 46: Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics

Melanoma Data:mn-error Method - Individual Clusters

kCluster

Membership m n7 5-24 0.00 0.00

8 5 0.00 5.13

8 6-24 0.00 0.27

0.2

0.3

0.4

0.5

0.6

0.7

0.8

11

2

9

12

6

25

10

4

13

1 3

28

29

26

21

7

23

30

22

24

8

31

18

27

5

17

19

20

14

15

16

Page 47: Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics

Test of Cluster Significance

• Multivariate Gaussian null hypothesis• Project to subspace determined by first three principal

components• Compute EDF of nearest neighbor Euclidean distances

between samples• Compare the NN EDF observed to that expected under

the null distribution using a squared difference discrepancy metric

• Estimate null distribution by sampling from 3D Gaussian distribution with mean and covariance matrix corresponding to observed data

Page 48: Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics

BRB ArrayTools:An integrated package for the

analysis of DNA microarray data

http://linus.nci.nih.gov/BRB-ArrayTools.html

Page 49: Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics

BRB ArrayToolsDesign Objectives

• Easy user interface– Excel front-end

• Ease of data loading– integrated

• Drill-down linkage to genomic databases

• Educating biologists in microarray data analysis

• Powerful analytic & visualization tools

• Easily extensible– R backend

• Portable– Non-proprietary

• Ease of development– R back-end

Page 50: Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics

Collaborators

• Molecular Statistics & Bioinformatics– Kevin Dobbin– Lisa McShane– Amy Peng– Michael Radmacher– Joanna Shih– George Wright– Yingdong Zhao