Summer Inst. of Epidemiology and Biostatistics, 2009:

Gene Expression Data Analysis

8:30am-12:00pm in Room W2017

Carlo Colantuoni – ccolantu@jhsph.edu

http://www.biostat.jhsph.edu/GenomeCAFE/GeneExpressionAnalysis/GEA2009.htm

Class Outline

• Basic Biology & Gene Expression Analysis Technology

• Data Preprocessing, Normalization, & QC

• Measures of Differential Expression

• Multiple Comparison Problem

• Clustering and Classification

• The R Statistical Language and Bioconductor

• GRADES – independent project with Affymetrix data.

http://www.biostat.jhsph.edu/GenomeCAFE/GeneExpressionAnalysis/GEA2009.htm

Class Outline - Detailed

• Basic Biology & Gene Expression Analysis Technology
– The Biology of Our Genome & Transcriptome
– Genome and Transcriptome Structure & Databases
– Gene Expression & Microarray Technology

• Data Preprocessing, Normalization, & QC
– Intensity Comparison & Ratio vs. Intensity Plots (log transformation)
– Background correction (PM-MM, RMA, GCRMA)
– Global Mean Normalization
– Loess Normalization
– Quantile Normalization (RMA & GCRMA)
– Quality Control: Batches, plates, pins, hybs, washes, and other artifacts
– Quality Control: PCA and MDS for dimension reduction

• Measures of Differential Expression
– Basic Statistical Concepts
– T-tests and Associated Problems
– Significance analysis in microarrays (SAM) [& Empirical Bayes]
– Complex ANOVAs (limma package in R)

• Multiple Comparison Problem
– Bonferroni
– False Discovery Rate Analysis (FDR)

• Differential Expression of Functional Gene Groups
– Functional Annotation of the Genome
– Hypergeometric test?, χ², KS, pDens, Wilcoxon Rank Sum
– Gene Set Enrichment Analysis (GSEA)
– Parametric Analysis of Gene Set Enrichment (PAGE)
– geneSetTest
– Notes on Experimental Design

• Clustering and Classification
– Hierarchical clustering
– K-means
– Classification: LDA (PAM), kNN, Random Forests; Cross-Validation

• Additional Topics
– The R Statistical Language
– Bioconductor
– Affymetrix data processing example!

DAY #3:

Measures of Differential Expression:
• Review of basic statistical concepts
• T-tests and associated problems
• Significance analysis in microarrays (SAM) (Empirical Bayes)
• Complex ANOVAs (“limma” package in R)

Multiple Comparison Problem:
• Bonferroni
• FDR

Differential Expression of Functional Gene Groups

Notes on Experimental Design

Slides from Rob Scharpf

Fold-Change? T-Statistics?

Some genes are more variable than others

X1-X2 is normally distributed if X1 and X2 are normally distributed – is this the case in microarray data?

Problem 1: T-statistic not t-distributed.

Implication: p-values/inference incorrect

P-values by permutation

• It is common that the assumptions used to derive the statistics do not hold well enough to yield useful p-values (e.g., when t-statistics are not t-distributed).

• An alternative is to use permutations.

P-values by permutation

We focus on one gene only. For the bth iteration, b = 1, …, B:

• Permute the n data points for the gene (x). The first n1 are referred to as “treatments”, the second n2 as “controls”.

• Calculate the corresponding two-sample t-statistic, tb, for the permuted data.

After all the B permutations are done:

• p = # { b: |tb| ≥ |tobserved| } / B

• This does not yet address the issue of multiple tests!
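The permutation recipe above can be sketched for a single gene as follows. This is a minimal illustration in Python rather than the course's R code. The Welch form of the t-statistic, the function names, and the example values are assumptions made here, not taken from the slides.

```python
import random
import statistics


def t_statistic(x, y):
    """Welch two-sample t-statistic (unequal variances; an assumed choice)."""
    se = (statistics.variance(x) / len(x) + statistics.variance(y) / len(y)) ** 0.5
    return (statistics.mean(x) - statistics.mean(y)) / se


def permutation_p_value(treatment, control, n_perm=2000, seed=0):
    """p = #{b : |t_b| >= |t_observed|} / B for one gene."""
    rng = random.Random(seed)
    t_observed = abs(t_statistic(treatment, control))
    pooled = list(treatment) + list(control)
    n1 = len(treatment)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # permute the n data points for the gene
        t_b = t_statistic(pooled[:n1], pooled[n1:])  # first n1 act as "treatments"
        if abs(t_b) >= t_observed:
            hits += 1
    return hits / n_perm


# Hypothetical log-expression values for one gene.
treatment = [5.1, 5.4, 5.0, 5.6, 5.3, 5.2]
control = [3.9, 4.2, 4.0, 4.1, 3.8, 4.3]
print(permutation_p_value(treatment, control))
```

With very few samples per group the number of distinct permutations is small, which limits how small the p-value can get.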

Another problem with t-tests

The volcano plot shows, for a particular test, negative log p-value against the effect size (M).

Remember this?

Problem 2: t-statistic bigger for genes with smaller standard error estimates.

Implication: Ranking might not be optimal

Problem 2

• With small sample sizes (low N), SD estimates are unstable

• Solutions:

– Significance Analysis in Microarrays (SAM)

– Empirical Bayes methods and Stein estimators

Significance analysis in microarrays (SAM)

• A clever adaptation of the t-ratio to borrow information across genes

• Implemented in Bioconductor in the siggenes package

Significance analysis of microarrays applied to the ionizing radiation response, Tusher et al., PNAS 2001

SAM d-statistic

• For gene i:

$$d_i = \frac{\bar{y}_i - \bar{x}_i}{s_i + s_0}$$

where
– $\bar{y}_i$ = mean of sample 1
– $\bar{x}_i$ = mean of sample 2
– $s_i$ = standard deviation of repeated measurements for gene i
– $s_0$ = exchangeability factor estimated using all genes
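As a rough illustration of how the exchangeability factor tempers the statistic, the sketch below computes d for one gene in Python. It assumes s_i is the pooled standard error of the difference in means, and it takes s0 as a supplied number instead of estimating it from all genes as SAM does. The function names and data are hypothetical.

```python
import statistics


def pooled_standard_error(y, x):
    """Pooled standard error of the difference in means (assumed form of s_i)."""
    n1, n2 = len(y), len(x)
    mean_y, mean_x = statistics.mean(y), statistics.mean(x)
    sum_sq = sum((v - mean_y) ** 2 for v in y) + sum((v - mean_x) ** 2 for v in x)
    return ((1 / n1 + 1 / n2) * sum_sq / (n1 + n2 - 2)) ** 0.5


def sam_d(y, x, s0):
    """d_i = (mean(y) - mean(x)) / (s_i + s0); s0 is supplied, not estimated."""
    return (statistics.mean(y) - statistics.mean(x)) / (pooled_standard_error(y, x) + s0)


# A gene with a small difference but an even smaller SD estimate (made-up values).
quiet_y = [1.00, 1.01, 0.99]
quiet_x = [0.90, 0.91, 0.89]
print(sam_d(quiet_y, quiet_x, s0=0.0))  # s0 = 0 gives the ordinary pooled t-ratio
print(sam_d(quiet_y, quiet_x, s0=0.1))  # a positive s0 pulls the statistic toward zero
```

The point of the example is only the direction of the effect: genes whose large t-ratio comes from a tiny denominator lose their extreme rank once s0 is added.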

Choosing s0: minimize the average coefficient of variation (CV) of d across all genes.

A fix for this problem:

Scatter plots of relative difference (d) vs standard deviation (s) of repeated expression measurements

[Figure panels: (1) random fluctuations in the data, measured by balanced permutations (for cell lines 1 and 2); (2) relative difference for a permutation of the data that was balanced between cell lines 1 and 2.]

SAM produces a modified T-statistic (d), and has an approach to the multiple comparison problem.

Selected genes: Beyond expected distribution

eBayes: Borrowing Strength

• An advantage of having tens of thousands of genes is that we can try to learn about typical standard deviations by looking at all genes

• Empirical Bayes gives us a formal way of doing this

• “Shrinkage” of variance estimates toward a “prior”: moderated t-statistics – eliminates extreme stats due to small variances.

• Implemented in the limma package in R. In addition, limma provides methods for more complex experimental designs beyond simple, two-sample designs.
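The shrinkage idea can be sketched as a weighted average of each gene's own variance and a prior variance, weighted by their degrees of freedom. The Python sketch below is not limma itself: limma estimates the prior variance and prior degrees of freedom from all genes, whereas here they are hypothetical inputs, and the function names are made up for illustration.

```python
import statistics


def moderated_variance(sample_var, df, prior_var, prior_df):
    """Shrink a gene's variance toward the prior: (d0*s0^2 + d*s^2) / (d0 + d)."""
    return (prior_df * prior_var + df * sample_var) / (prior_df + df)


def moderated_t(y, x, prior_var, prior_df):
    """Two-sample t-statistic using the shrunken variance in the denominator."""
    n1, n2 = len(y), len(x)
    df = n1 + n2 - 2
    mean_y, mean_x = statistics.mean(y), statistics.mean(x)
    pooled_var = (sum((v - mean_y) ** 2 for v in y)
                  + sum((v - mean_x) ** 2 for v in x)) / df
    shrunk_var = moderated_variance(pooled_var, df, prior_var, prior_df)
    return (mean_y - mean_x) / (shrunk_var * (1 / n1 + 1 / n2)) ** 0.5


y = [1.00, 1.01, 0.99]
x = [0.90, 0.91, 0.89]
print(moderated_t(y, x, prior_var=0.04, prior_df=0))  # prior_df = 0: ordinary t
print(moderated_t(y, x, prior_var=0.04, prior_df=4))  # shrunk toward the prior
```

A gene whose variance estimate happens to be very small no longer produces an extreme statistic, because its denominator is pulled toward the typical variance.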

The Multiple Comparison Problem

(some slides courtesy of John Storey)

Hypothesis Testing

• Test for each gene:

Null Hypothesis: no differential expression.

• Two types of errors can be committed

– Type I error or false positive (say that a gene is differentially expressed when it is not, i.e., reject a true null hypothesis).

– Type II error or false negative (fail to identify a truly differentially expressed gene, i.e., fail to reject a false null hypothesis)

Hypothesis testing

• Once you have a given score for each gene, how do you decide on a cut-off?

• p-values are most common.

• How do we decide on a cut-off when we are looking at many 1000’s of “tests”?

• Are 0.05 and 0.01 appropriate? How many false positives would we get if we applied these cut-offs to long lists of genes?
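To get a feel for the last question: if no gene is differentially expressed, each p-value is uniform on [0, 1], so a cut-off of α applied to m genes yields about m × α false positives. The short simulation below illustrates this in Python. The gene count of 10,000 and the function name are assumptions chosen for the example.

```python
import random


def count_false_positives(n_genes, alpha, seed=0):
    """Count null p-values below alpha when every gene is truly null."""
    rng = random.Random(seed)
    # Under the null hypothesis a valid p-value is uniform on [0, 1].
    p_values = [rng.random() for _ in range(n_genes)]
    return sum(p < alpha for p in p_values)


n_genes = 10000
for alpha in (0.05, 0.01):
    expected = n_genes * alpha
    simulated = count_false_positives(n_genes, alpha)
    print(f"alpha={alpha}: expected about {expected:.0f}, simulated {simulated}")
```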

Multiple Comparison Problem

• Even if we have good approximations of our p-values, we still face the multiple comparison problem.

• When performing many independent tests, p-values no longer have the same interpretation.

Bonferroni Procedure

α = 0.05

# Tests = 1000

Adjusted α = 0.05 / 1000 = 0.00005

or

Adjusted p = p × 1000
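Both forms of the Bonferroni correction can be written in a few lines of Python. This is a minimal sketch with made-up function names. Capping the adjusted p-value at 1 is a common convention added here, not something stated on the slide.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold: alpha divided by the number of tests."""
    return alpha / n_tests


def bonferroni_adjust(p_values):
    """Adjusted p-values: p times the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


print(bonferroni_threshold(0.05, 1000))
# Hypothetical p-values: one small value among 999 unremarkable ones.
adjusted = bonferroni_adjust([0.00001] + [0.5] * 999)
print(adjusted[0], adjusted[1])
```

Comparing raw p-values against the reduced threshold and comparing adjusted p-values against the original α select the same genes.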

Bonferroni Procedure

Too conservative.

How else can we interpret many 1000’s of observed statistics?

Instead of evaluating each statistic individually, can we assess a list of statistics? FDR (Benjamini & Hochberg 1995)

FDR

• Given a cut-off statistic, FDR gives us an estimate of the proportion of hits in our list of differentially expressed genes that are false.

Null = Equivalent Expression; Alternative = Differential Expression

False Discovery Rate

• The “false discovery rate” measures the proportion of false positives among all genes called significant:

$$\frac{\#\,\text{false positives}}{\#\,\text{called significant}}$$

• This is usually appropriate because one wants to find as many truly differentially expressed genes as possible with relatively few false positives

• The false discovery rate gives an estimate of the rate at which further biological verification will result in dead-ends

[Figure: Distribution of Statistics. Density of observed (black) and permuted (red+blue) correlations (r), N=90; x-axis: Correlation (r), y-axis: Density.]

[Figure: Distribution of Statistics, zoomed in on the left tail (r from -0.45 to -0.30). Density of observed (black) and permuted (red+blue) correlations (r); x-axis: Correlation (r), y-axis: Density.]

FDR = False Pos. / Total Pos. = Permuted / Observed
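The Permuted / Observed ratio can be sketched as follows: for a chosen cut-off, count how many observed statistics exceed it, and compare with the average count from permuted data. This is an illustrative Python sketch. The function name, the two-sided cut-off on the absolute value, returning 0 when nothing is called, and the example numbers are all assumptions.

```python
def permutation_fdr(observed, permuted_sets, cutoff):
    """Estimate FDR at a cut-off as (mean permuted count) / (observed count)."""
    n_called = sum(abs(s) >= cutoff for s in observed)
    if n_called == 0:
        return 0.0  # nothing called significant; convention chosen for this sketch
    null_counts = [sum(abs(s) >= cutoff for s in perm) for perm in permuted_sets]
    mean_null = sum(null_counts) / len(permuted_sets)
    return min(1.0, mean_null / n_called)


# Made-up correlations for six genes, plus two permuted versions of the data.
observed = [0.50, -0.45, 0.10, 0.05, -0.02, 0.42]
permuted_sets = [
    [0.41, 0.10, -0.10, 0.00, 0.05, 0.20],
    [0.10, -0.20, 0.00, 0.30, 0.10, -0.05],
]
print(permutation_fdr(observed, permuted_sets, cutoff=0.4))
```

In practice many more permutations and genes would be used, so that the null count in the tail is estimated stably.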

[Figure: Distribution of p-values from observed (black) and permuted data, N=90; x-axis: p-value, y-axis: Density.]

FDR = False Positives / Total Positive Calls

This FDR analysis requires enough samples in each condition to estimate a statistic for each gene: observed statistic distribution.

And enough samples in each condition to permute many times and recalculate this statistic: null statistic distribution.

What if we don’t have this?

FDR = 0.05: Beyond ±0.9

False Positive Rate versus False Discovery Rate

• False positive rate is the rate at which truly null genes are called significant

$$\text{FPR} = \frac{\#\,\text{false positives}}{\#\,\text{truly null}}$$

• False discovery rate is the rate at which significant genes are truly null

$$\text{FDR} = \frac{\#\,\text{false positives}}{\#\,\text{called significant}}$$
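A small worked example shows how different the two rates can be for the same list. The counts below are invented for illustration (10,000 genes, 9,000 of them truly null), and the function names are hypothetical.

```python
def false_positive_rate(false_positives, truly_null):
    """Fraction of truly null genes that were called significant."""
    return false_positives / truly_null


def false_discovery_rate(false_positives, called_significant):
    """Fraction of genes called significant that are truly null."""
    return false_positives / called_significant


# Hypothetical experiment: 9000 truly null genes, 450 of which pass the cut-off,
# and 800 truly differentially expressed genes that also pass it.
false_pos, true_pos, truly_null = 450, 800, 9000
called = false_pos + true_pos
print("FPR:", false_positive_rate(false_pos, truly_null))
print("FDR:", false_discovery_rate(false_pos, called))
```

Here a 5% false positive rate corresponds to a list in which more than a third of the calls are false, because the denominators differ.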

False Positive Rate and P-values

• The p-value is a measure of significance in terms of the false positive rate (aka Type I error rate)

• P-value is defined to be the minimum false positive rate at which the statistic can be called significant

• Can be described as the probability a truly null statistic is “as or more extreme” than the observed one

False Discovery Rate and Q-values

• The q-value is a measure of significance in terms of the false discovery rate

• Q-value is defined to be the minimum false discovery rate at which the statistic can be called significant

• Can be described as the probability a statistic “as or more extreme” is truly null
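One simple way to attach an FDR-based value to each p-value is the Benjamini-Hochberg adjustment, sketched below in Python. This is a stand-in for illustration, not Storey's q-value procedure: the q-value additionally estimates the proportion of truly null genes, and the BH-adjusted value corresponds to assuming that proportion is 1. The function name and example p-values are made up.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


print(bh_adjust([0.01, 0.04, 0.03, 0.005]))
```

Each adjusted value is the smallest FDR level at which that gene would still be called significant under the BH procedure.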

Power and Sample Size Calculations are Hard

• Need to specify:
– α (Type I error rate, false positives) or FDR
– σ (stdev: will be sample- and gene-specific)
– Effect size (how do we estimate?)
– Power (1-β, β = Type II error rate)
– Sample Size

• Some papers:
– Mueller, Parmigiani et al. JASA (2004)
– Rich Simon’s group Biostatistics (2005)
– Tibshirani. A simple method for assessing sample sizes in microarray experiments. BMC Bioinformatics. 2006 Mar 2;7:106.
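To see how the listed quantities interact for a single gene, here is the textbook normal-approximation formula for a two-group comparison, n per group = 2 (z_{1-α/2} + z_{1-β})² σ² / δ², as a Python sketch. It is not the method of any of the papers above, and it ignores the gene-specific variances and multiple testing that make the real problem hard. The function name and inputs are assumptions.

```python
import math
from statistics import NormalDist


def samples_per_group(alpha, power, sigma, delta):
    """Approximate n per group for a two-sided, two-sample comparison of means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for Type I error
    z_beta = NormalDist().inv_cdf(power)           # quantile for power = 1 - beta
    return math.ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)


# One gene tested alone at alpha = 0.05, detecting a difference of one SD.
print(samples_per_group(alpha=0.05, power=0.8, sigma=1.0, delta=1.0))
# The same gene with a Bonferroni-style alpha for 10,000 tests.
print(samples_per_group(alpha=0.05 / 10000, power=0.8, sigma=1.0, delta=1.0))
```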

Beyond Individual Genes: Functional Gene Groups

• Borrow statistical power across entire dataset

• Beyond threshold enrichment

• Integrate preexisting biological knowledge

[Figure: Correlation of Age with Gene Expression. Distribution of observed (black) and permuted (red+blue) correlations (r); x-axis: Correlation (r), y-axis: Density.]

Functional Annotation of Lists of Genes

• KEGG
• PFAM
• SWISS-PROT
• GO
• DRAGON
• DAVID/EASE
• MatchMiner
• BioConductor (R)

Gene Cross-Referencing and Gene Annotation Tools In BioConductor (in the R statistical language)

• annotate package
• Microarray-specific “metadata” packages
• DB-specific “metadata” packages
• AnnBuilder package

Annotation Tools In BioConductor: annotate package

Functions for accessing data in metadata packages.

Functions for accessing NCBI databases.

Functions for assembling HTML tables.

Annotation Tools In BioConductor: Annotation for Commercial Microarrays

Array-specific metadata packages

Annotation Tools In BioConductor: Functional Annotation with other DB’s

GO metadata package

Annotation Tools In BioConductor: Functional Annotation with other DB’s

KEGG metadata package

Threshold Enrichment: One Way of Assessing Differential Expression of Functional Gene Groups

Is there enrichment in our list of differentially expressed genes for a particular functional gene group or pathway?

The lower.tail argument (as in R's phyper function for the hypergeometric test) indicates whether you are looking for over- or under-representation of differentially expressed genes within a particular functional group (use lower.tail=FALSE for over-representation).
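The over-representation p-value behind threshold enrichment is an upper tail of the hypergeometric distribution: the chance of seeing at least k group members in a list of a given size drawn from all genes. Below is a minimal Python sketch. The function and argument names are hypothetical and the toy numbers are made up.

```python
from math import comb


def hypergeom_upper_tail(k, group_size, population_size, list_size):
    """P(X >= k): at least k group members among list_size genes drawn at random."""
    upper = min(group_size, list_size)
    total = comb(population_size, list_size)
    favourable = sum(
        comb(group_size, i) * comb(population_size - group_size, list_size - i)
        for i in range(k, upper + 1)
    )
    return favourable / total


# Toy example: 10 genes in total, 4 in the functional group,
# 3 genes called differentially expressed, all 3 of them in the group.
print(hypergeom_upper_tail(k=3, group_size=4, population_size=10, list_size=3))
```

In R the same quantity would be phyper(k - 1, group_size, population_size - group_size, list_size, lower.tail = FALSE), since lower.tail = FALSE returns P(X > q), not P(X >= q).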

Can we use more of our data than Threshold Enrichment (that only uses the top of our gene list)?

Functional Gene Subgroups within An Experiment

[Figure labels: EXP#1; Swiss-Prot; PFAM; KEGG]

Statistics for Analysis of Differential Expression of Gene Subgroups

Is THIS …

… Different from THIS?

Over-Expression of a Group of Functionally Related Genes

[Figure: distribution of T statistics; p<7.42e-08]

Statistical Tests:

• χ²
• Kolmogorov-Smirnov
• Product of Probabilities
• GSEA
• PAGE
• geneSetTest (Wilcoxon rank sum)

Is THIS … Different from THIS?

Conceptually Distinct from Threshold Enrichment and the Hypergeometric test!

χ²

χ² is the sum of D values over the histogram bins, where O is the observed count and E the expected count in each bin:

$$D = \frac{(O - E)^2}{E}$$

[Figure: histogram bins comparing All Genes with the Subset of Interest.]
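The sum over bins can be sketched in Python as below. The sketch assumes that expected counts come from the all-genes histogram rescaled to the size of the subset, which is one natural reading of the slide but not stated on it. Function names and counts are illustrative.

```python
def expected_counts(all_gene_counts, subset_size):
    """Expected bin counts for the subset if it followed the all-genes histogram."""
    total = sum(all_gene_counts)
    return [subset_size * count / total for count in all_gene_counts]


def chi_square_statistic(observed, expected):
    """Sum over bins of D = (O - E)^2 / E."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same number of bins")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Made-up histograms: statistics for all genes and for a subset of 40 genes.
all_genes = [100, 200, 100]
subset = [5, 15, 20]
expected = expected_counts(all_genes, subset_size=sum(subset))
print(expected, chi_square_statistic(subset, expected))
```

Bins with very small expected counts make the statistic unstable, so bin widths matter when applying this to real data.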

Kolmogorov-Smirnov

[Figure: All Genes vs. Subset of Interest]
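The Kolmogorov-Smirnov statistic is the largest vertical gap between the two empirical cumulative distribution functions. The Python sketch below computes only the statistic, not its p-value. The function name and data are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(subset, all_values):
    """Maximum absolute difference between the two empirical CDFs."""
    a = sorted(subset)
    b = sorted(all_values)
    largest_gap = 0.0
    # Both empirical CDFs are step functions, so checking the data points suffices.
    for value in a + b:
        cdf_a = bisect_right(a, value) / len(a)
        cdf_b = bisect_right(b, value) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Made-up t-statistics: a gene subset shifted upward relative to all genes.
all_stats = [-2.1, -1.3, -0.4, 0.0, 0.2, 0.9, 1.5, 2.2]
subset_stats = [0.9, 1.5, 2.2]
print(ks_statistic(subset_stats, all_stats))
```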

Product of Individual Probabilities

[Figure: All Genes vs. Subset of Interest]

What shape/type of distributions would each of these tests be sensitive to?

[Figure: All statistics vs. Statistics from gene subgroup]

Gene Set Enrichment Analysis (GSEA)

Subramanian et al, 2005 PNAS

Parametric Analysis of Gene Set Enrichment (PAGE)

Kim et al, 2005 BMC Bioinformatics

$$Z = \frac{S_m - \mu}{\sigma / m^{0.5}}$$

where $S_m$ is the mean of the statistics for the m genes in the gene set, and $\mu$ and $\sigma$ are the mean and standard deviation of the statistics across all genes.
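The Z score for one gene set can be sketched in Python as follows. Using the population standard deviation over all genes and a two-sided normal p-value are assumptions made for this illustration. The function name and numbers are hypothetical.

```python
from statistics import NormalDist, mean, pstdev


def page_z(set_stats, all_stats):
    """Z = (S_m - mu) / (sigma / sqrt(m)) with a two-sided normal p-value."""
    m = len(set_stats)
    mu = mean(all_stats)
    sigma = pstdev(all_stats)  # population SD over all genes (assumed choice)
    z = (mean(set_stats) - mu) / (sigma / m ** 0.5)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Made-up fold changes for five genes, two of which belong to the gene set.
all_fold_changes = [-2, -1, 0, 1, 2]
gene_set = [1, 2]
print(page_z(gene_set, all_fold_changes))
```

The normal approximation relies on the central limit theorem, so it is more trustworthy for larger gene sets than for the tiny one used here.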

geneSetTest (limma)

A simple method in Bioconductor

Test whether a set of genes is enriched for differential expression.

Usage: geneSetTest(selected,statistics,alternative="mixed",type="auto",ranks.only=TRUE,nsim=10000)

The test statistic used for the gene-set-test is the mean of the statistics in the set. If ranks.only is TRUE then only the ranks of the statistics are used. In this case the p-value is obtained from a Wilcoxon test. If ranks.only is FALSE, then the p-value is obtained by simulation using nsim randomly selected sets of genes.

Argument: alternative = “mixed” or “either”: fundamentally different questions.

Wilcoxon test
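The rank-based comparison of a gene set against the remaining genes can be sketched with a Wilcoxon rank-sum statistic and its normal approximation. This Python sketch is not limma's implementation: it omits the tie and continuity corrections, and the function names and data are hypothetical.

```python
from statistics import NormalDist


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # average of ranks pos+1 .. end+1
        for j in range(pos, end + 1):
            ranks[order[j]] = shared
        pos = end + 1
    return ranks


def rank_sum_z(set_stats, other_stats):
    """Normal-approximation z and two-sided p for the rank sum of the gene set."""
    n1, n2 = len(set_stats), len(other_stats)
    ranks = average_ranks(list(set_stats) + list(other_stats))
    rank_sum = sum(ranks[:n1])
    expected = n1 * (n1 + n2 + 1) / 2
    variance = n1 * n2 * (n1 + n2 + 1) / 12  # no tie correction in this sketch
    z = (rank_sum - expected) / variance ** 0.5
    return z, 2 * (1 - NormalDist().cdf(abs(z)))


# Made-up statistics: the gene set holds the three largest values.
print(rank_sum_z([5, 6, 7], [1, 2, 3, 4]))
```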

Analysis of Gene Networks

[Figure sequence: a Large Protein Interaction Network, with the Network Regulated in Sample #1, Sample #2, and Sample #3 added in turn, followed by the Network of Interest.]

Additional Notes on Experimental Design

Old-School Experimental Design: Randomization

Replicates in a mouse model:

• Dissection of tissue
• RNA Isolation
• Amplification
• Probe labelling
• Hybridization

[Figure labels: Biological Replicates; Technical Replicates]

Common question in experimental design

• Should I pool mRNA samples across subjects in an effort to reduce the effect of biological variability (or cost)?

Two simple designs

• The following two designs have roughly the same cost:
– 3 individuals, 3 arrays
– Pool of three individuals, 3 technical replicates

• To a statistician the second design seems obviously worse. But I found it hard to convince many biologists of this.
– 3 pools of 3 animals on individual arrays?

Cons of Pooling Everything

• You cannot measure within-class variation

• Therefore, no population inference possible

• Mathematical averaging is an alternative way of reducing variance.*

• Pooling may have non-linear effects

• You cannot take the log before you average: E[log(X+Y)] ≠ E[log(X)] + E[log(Y)]

• You cannot detect outliers

*If the measurements are independent and identically distributed
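The point about logs can be checked numerically: the log of an average is not the average of the logs. The Python sketch below assumes that a physical pool behaves roughly like an average on the raw intensity scale, which is a simplification. The intensities and function names are made up.

```python
import math


def mean_of_logs(intensities):
    """Average computed after the log transform (separate arrays, then average)."""
    return sum(math.log2(v) for v in intensities) / len(intensities)


def log_of_mean(intensities):
    """Log of the raw-scale average (roughly what one pooled array would measure)."""
    return math.log2(sum(intensities) / len(intensities))


# Hypothetical raw intensities for one gene in three individuals.
intensities = [100.0, 400.0, 1600.0]
print("mean of logs:", mean_of_logs(intensities))
print("log of mean: ", log_of_mean(intensities))
```

The log of the mean is never smaller than the mean of the logs, and the gap grows with the spread of the individual values, so the distortion differs from gene to gene.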

Cons specific to microarrays

• Different genes have dramatically different biological variances.

• Not measuring this variance will result in genes with larger biological variance having a better chance of being considered more important

Higher variance: larger fold change

We compute fold change for each gene (Y axis).

From 12 individuals we estimate gene-specific variance (X axis).

If we pool we never see this variance.

Remember this?

Useful Books:

• “Statistical analysis of gene expression microarray data” – Speed

• “Analysis of gene expression data” – Parmigiani

• “Bioinformatics and computational biology solutions using R” – Irizarry