RNASeq: Experimental Design & Statistics for Differential Expression

(and a tiny bit of ChipSeq)

Blythe Durbin-Johnson, Ph.D.

Outline

• Hypothesis testing, p-values, and power

• Multiple testing

• Experimental design and replication

• Statistical models for RNAseq data

• Visualization techniques

• Statistics behind IDR

HYPOTHESIS TESTING, P-VALUES, AND POWER

Hypothesis Testing

• Test “null hypothesis” of no effect against “alternative hypothesis”

• Calculate test statistic, reject null if test statistic large relative to what one would expect under null distribution

P-Values

• P-value = probability of seeing a test statistic as large or larger than your test statistic when the null hypothesis is true

• Typically reject null if P < 0.05

–This is purely a historical convention

–Nothing magic happens at the P = 0.05 threshold

A P-Value is NOT

• …the probability that the null hypothesis is true

• …the probability that an experiment will not be replicated

• …a direct measure of the size or importance of an effect

• …a measure of biological/clinical significance

Power

• Power = probability of rejecting null hypothesis for a given effect size

• Depends on:

–Effect size (difference between groups)

–Sample size

–Amount of variability in data

–Hypothesis test being used

–How “significance” is defined

Power and P-Values

• Under the null hypothesis, p-values uniformly distributed between 0 and 1

–Expect 5% to be less than 0.05, on average

• Under alternatives, higher probability of smaller p-values (higher power), but still can theoretically get any p-value between 0 and 1

Power Example

• Simulate two groups of normally distributed data with means 0, 0.5, 1, and 2 standard deviations apart

• Conduct two-sample t-test

• Repeat 5000 times, look at distribution of p-values

• Repeat for various sample sizes
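The simulation described above can be sketched in plain Python. This is a minimal sketch, not the code behind the original slides: the function names (two_sided_p, pooled_t_test, simulate_power), the seeds, and the use of Simpson's rule to integrate the t density are all assumptions made for illustration. The estimated power is the fraction of simulated p-values below alpha.

```python
import math
import random
import statistics


def t_density(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t_stat, df):
    """Two-sided p-value, integrating the t density over [0, |t|] (Simpson)."""
    a = abs(t_stat)
    if a == 0.0:
        return 1.0
    steps = 2 * max(50, math.ceil(10 * a))  # even number of intervals
    h = a / steps
    total = t_density(0.0, df) + t_density(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    central_mass = 2.0 * total * h / 3.0  # P(-|t| < T < |t|)
    return max(0.0, 1.0 - central_mass)


def pooled_t_test(x, y):
    """Equal-variance two-sample t-test; returns the two-sided p-value."""
    nx, ny = len(x), len(y)
    df = nx + ny - 2
    sp2 = ((nx - 1) * statistics.variance(x)
           + (ny - 1) * statistics.variance(y)) / df
    t_stat = (statistics.fmean(x) - statistics.fmean(y)) / math.sqrt(
        sp2 * (1 / nx + 1 / ny))
    return two_sided_p(t_stat, df)


def simulate_power(effect_sd, n_per_group, n_sim=5000, alpha=0.05, seed=1):
    """Fraction of simulated experiments with p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        b = [rng.gauss(effect_sd, 1.0) for _ in range(n_per_group)]
        if pooled_t_test(a, b) < alpha:
            hits += 1
    return hits / n_sim


if __name__ == "__main__":
    # Fewer repetitions than the 5000 on the slide, to keep the demo quick.
    for n in (3, 10, 30):
        for effect in (0.0, 0.5, 1.0, 2.0):
            print(n, effect, simulate_power(effect, n, n_sim=1000))
```

With an effect of 0 the rejection rate should sit near alpha; it rises with both effect size and sample size.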

MULTIPLE TESTING

Multiple Testing Example

• Patient samples treated with different radiation doses and observed over time

• Illumina microarray experiment, 16,801 genes used in analysis

• Four replicates per patient/time/dose

• All samples used in this example were replicates from same patient, untreated

• T-tests gene by gene comparing replicates 1 and 3 to replicates 2 and 4

• 196 genes with P < 0.05

Multiple Testing Example

• Entered list of genes with P < 0.05 into DAVID’s functional annotation tool

– http://david.abcc.ncifcrf.gov

• Overrepresented terms (P < 0.05) included disease mutation, mutagenesis site, and 79 others

• If you were doing radiation research, would you be excited about this?

Multiple Testing Example

• We know there is no difference between the “groups”

• What is going on?

Multiple Testing Example

• Expect P < 0.05 about 5% of the time under null hypothesis

• (We see 196/16801 ≈ 1.2% of genes with P < 0.05, but our data aren’t perfectly normal and our p-values are correlated)

• When conducting multiple tests, need to make adjustments to avoid spurious results

Multiple Testing

Familywise Error Rate

                         declared          declared
                         non-significant   significant   total
true null hypotheses     U                 V             m0
false null hypotheses    T                 S             m - m0
total                    m - R             R             m

FWER = P(V ≥ 1)

FWER = Probability of ANY false positives

Multiple Testing

One way of controlling FWER: set α’ = α/n, where n is the number of tests (Bonferroni correction)

Problems:

1. Very conservative, even for FWER control

2. Is the FWER really what we want to control?
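As a worked illustration of the Bonferroni correction, the sketch below computes the per-test threshold for the 16,801 genes in the microarray example, and the familywise error rate with and without the correction. The function names are hypothetical, and the FWER formula assumes independent tests with every null hypothesis true. That assumption is only for illustration (the p-values in the example are correlated); Bonferroni itself controls the FWER without it.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level alpha' = alpha / n."""
    return alpha / n_tests


def fwer_if_independent(per_test_alpha, n_tests):
    """P(at least one false positive), assuming independent tests
    and that every null hypothesis is true."""
    return 1.0 - (1.0 - per_test_alpha) ** n_tests


if __name__ == "__main__":
    n_genes = 16801
    alpha = 0.05
    print("Bonferroni threshold:", bonferroni_threshold(alpha, n_genes))
    print("FWER, uncorrected:", fwer_if_independent(alpha, n_genes))
    print("FWER, Bonferroni:",
          fwer_if_independent(bonferroni_threshold(alpha, n_genes), n_genes))
```

Uncorrected, a false positive somewhere among the genes is essentially certain; with the correction the FWER stays just under α.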

Multiple Testing

False Discovery Rate (FDR)

FDR = E[V/R]

                         declared          declared
                         non-significant   significant   total
true null hypotheses     U                 V             m0
false null hypotheses    T                 S             m - m0
total                    m - R             R             m

(Benjamini and Hochberg, 1995)

Multiple Testing

False Discovery Rate (FDR)

FDR = E[V/R] (control this)

FWER = P(V ≥ 1) (not this)

                         declared          declared
                         non-significant   significant   total
true null hypotheses     U                 V             m0
false null hypotheses    T                 S             m - m0
total                    m - R             R             m

(Benjamini and Hochberg, 1995)

Multiple Testing

• False Discovery Rate-controlling procedure: (Benjamini and Hochberg, 1995)

1. Sort p-values from smallest to largest (1 to m), let k be the rank

2. Select a desired FDR α

3. Find the largest rank k’ where P(k) ≤ (k/m)*α

4. Null hypotheses 1 through k’ are rejected
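The four steps above can be sketched as a short function. This is a minimal illustration with a hypothetical function name and made-up p-values, not a replacement for a tested library routine.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected."""
    m = len(p_values)
    # Step 1: sort p-values from smallest to largest, remembering positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step 3: find the largest rank k' with P(k) <= (k / m) * alpha.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank
    # Step 4: reject the hypotheses ranked 1 through k'.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    example = [0.02, 0.03, 0.035, 0.04]  # made-up p-values
    print(benjamini_hochberg(example, alpha=0.05))
```

In the made-up example the smallest p-value (0.02) misses its own threshold of 0.0125, yet all four hypotheses are rejected because the largest passing rank is 4. The procedure searches for the largest qualifying rank rather than stopping at the first failure.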

Multiple Testing

• Note that the gene with the smallest p-value is still tested using α/m (like Bonferroni)

• The number of genes/transcripts included in the analysis still matters

• Filtering can help (but don’t filter based on treatment/group membership)

Multiple Testing Example (Revisited)

• Recall example of testing differential expression between 2 pairs of replicates in a microarray experiment

• No genes are differentially expressed at FDR-level 0.1

EXPERIMENTAL DESIGN AND REPLICATION

Replicated and Unreplicated Designs

Why Replicate?

[Figure: mut and wt groups, showing biological heterogeneity]

Replicated and Unreplicated Designs

Unreplicated Design

[Figure: mut and wt groups] Here, the groups differ, but the single replicates from each group are very similar

Replicated and Unreplicated Designs

Unreplicated Design

[Figure: mut and wt groups] Here, the groups are similar, but an outlying observation from the group on the right makes it look like there’s a big difference in the unreplicated experiment

Why Replicate?

• Single biological replicate may not be representative of a whole group

• Power to test for differential expression is limited

• Cannot estimate within-group variability directly

– Have to assume that most genes aren’t differentially expressed, use between-group variability as surrogate for within-group variability

• Unreplicated experiments not recommended

How Many Replicates?

• Depends on how big of a fold change you want to detect

• Theoretically, for any sample size there is some difference detectable with 80% power

• (This difference might be unrealistically huge)

• BUT very small sample sizes cause other problems besides lack of power

How Many Replicates?

• Too few replicates → lack of generalizability

• Rely heavily on other genes to estimate variability

• With only 2 replicates, false discovery rate inflated (Soneson and Delorenzi, 2013)

• Increasing number of replicates even from 2 to 3 helps with FDR inflation

How Many Replicates?

• Resources of course limit numbers of replicates

• An undersized experiment that misleads may be worse than no experiment

– This is particularly true of n = 1

Technical Replicates vs. Biological Replicates

• Biological variability >> technical variability

• Technical replicates are not a substitute for biological replicates

Technical Replicates vs. Biological Replicates

• Treating technical replicates as biological replicates underestimates variability, inflates Type I error rate

–Do not treat technical replicates like biological replicates

–Do not treat repeated measures on the same experimental unit like independent observations

STATISTICAL MODELS FOR RNASEQ DATA

Models for Count Data

• RNAseq data typically consist of counts for each gene/transcript in each sample

• Generally use special models for count data (or transform data in ways that address variance structure)

Poisson Models

• Count data are often modeled with a Poisson model

• For comparing two groups A and B:

μ = mean count

log(μ) = intercept + β*I(Group = B)

β is log fold-change B/A

• Poisson model assumes variance = mean

Variance(count) = μ

• This is a strong assumption!
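For the two-group Poisson model above (with no offsets for library size), the maximum likelihood estimates are simply the logs of the group means, so β is the log of the ratio of mean counts. The sketch below uses hypothetical function names and made-up counts; it also computes the variance-to-mean ratio, which should be near 1 if the Poisson assumption holds.

```python
import math
import statistics


def poisson_two_group_fit(counts_a, counts_b):
    """MLEs for log(mu) = intercept + beta * I(Group = B), natural log scale."""
    mean_a = statistics.fmean(counts_a)
    mean_b = statistics.fmean(counts_b)
    intercept = math.log(mean_a)      # log mean count in group A
    beta = math.log(mean_b / mean_a)  # log fold-change B/A
    return intercept, beta


def variance_to_mean_ratio(counts):
    """About 1 under a Poisson model; well above 1 indicates overdispersion."""
    return statistics.variance(counts) / statistics.fmean(counts)


if __name__ == "__main__":
    group_a = [10, 12, 8]   # made-up counts for one gene
    group_b = [20, 22, 18]
    intercept, beta = poisson_two_group_fit(group_a, group_b)
    print("fold change B/A:", math.exp(beta))
    print("variance/mean in A:", variance_to_mean_ratio(group_a))
```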

Negative Binomial Model

• Negative binomial distribution does not assume variance is equal to mean

Mean(count) = μ

Var(count) = μ + φμ²

• φ is “dispersion parameter”

• If φ = 0 we have a Poisson model
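A few worked numbers show how quickly the quadratic term dominates. The dispersion value 0.1 below is an arbitrary illustration, not an estimate from any data set, and the function name is hypothetical.

```python
def nb_variance(mu, phi):
    """Negative binomial variance: mu + phi * mu^2 (phi = 0 gives Poisson)."""
    return mu + phi * mu ** 2


if __name__ == "__main__":
    for mu in (10, 100, 1000):
        print(mu, "Poisson:", nb_variance(mu, 0.0),
              "NB (phi = 0.1):", nb_variance(mu, 0.1))
```

At a mean of 10 the two variances differ by a factor of 2; at a mean of 1000 they differ by a factor of about 100.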

Negative Binomial Model

• Negative binomial model can be derived as a mixture of different Poisson distributions

• RNAseq Data:

–Each Poisson distribution in mixture represents “shot noise” or within-sample variability

–Using mixture of Poisson distributions allows model for biological variability
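The mixture construction can be checked by simulation. In this sketch each sample gets its own Poisson rate drawn from a gamma distribution with mean μ and variance φμ² (standing in for biological variability), and the count is then Poisson given that rate (shot noise). The resulting counts should have mean close to μ and variance close to μ + φμ². The function names, parameter values, and seed are assumptions for illustration.

```python
import math
import random
import statistics


def poisson_draw(lam, rng):
    """Knuth's multiplication method; adequate for moderate rates."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def gamma_poisson_counts(mu, phi, n, seed=0):
    """Counts from a Poisson whose rate is gamma distributed across samples."""
    rng = random.Random(seed)
    shape, scale = 1.0 / phi, phi * mu  # gamma mean mu, variance phi * mu^2
    return [poisson_draw(rng.gammavariate(shape, scale), rng)
            for _ in range(n)]


if __name__ == "__main__":
    mu, phi = 20, 0.25
    counts = gamma_poisson_counts(mu, phi, 20000)
    print("sample mean:", statistics.fmean(counts), "expected:", mu)
    print("sample variance:", statistics.variance(counts),
          "expected:", mu + phi * mu ** 2)
```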

Negative Binomial Model

• Is this the true data-generating model?

– Unlikely

– Requires variance >= mean, easy to imagine situation where this doesn’t hold

• “All models are wrong, but some models are useful”

--George Box

• Usefulness of NB model is open question

Estimating the Dispersion Parameter

• Negative binomial modeling requires the variability to be estimated separately from the mean

• RNAseq data rarely have enough replicates to do this gene by gene

• Borrow information from other genes to estimate dispersion

Estimating the Dispersion Parameter

• Methods of estimating the dispersion parameter

– Use empirical Bayes methods to “squeeze” local estimate of dispersion towards overall dispersion

– Model dispersion parameter as a function of the mean

– Some combination of these

Normal Models for Count Data

• edgeR, DESeq, Cuffdiff2 model data directly using negative binomial distribution

• Data may not be negative binomial

• Even for data that are truly negative binomial, testing is based on asymptotic (n → ∞) theory

• Asymptotic theory may not work for small sample sizes

Normal Models for Count Data

• limma-voom calculates variance weights for log2(CPM) so they can be modelled like continuous, normally-distributed data

• Performs well against negative binomial models in comparison papers

• Normal approximation works best for larger counts
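The log2(CPM) transform itself is simple to sketch. The version below handles one sample and uses offsets of 0.5 on the count and 1 on the library size, which are the offsets described for voom's log-CPM. The slides do not state them, so treat them as an assumption here. The function name is hypothetical, and the variance weights that voom estimates from the mean-variance trend are not reproduced.

```python
import math


def log2_cpm(counts, lib_size=None):
    """log2 counts-per-million for one sample's gene counts.

    Offsets (0.5 on the count, 1 on the library size) are assumed; they
    keep zero counts finite.
    """
    if lib_size is None:
        lib_size = sum(counts)
    return [math.log2((c + 0.5) / (lib_size + 1.0) * 1e6) for c in counts]


if __name__ == "__main__":
    sample = [0, 10, 90]  # made-up counts for three genes
    print(log2_cpm(sample))
```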

Caveats

• Take results with a grain of salt

– Follow up with PCR on different samples

• Different methods can produce very different lists of DE genes

• Most methods produce large numbers of false discoveries*

• No existing method is perfect

*D.M. Rocke, personal communication

More Complicated Models

• Negative binomial modeling is not limited to comparison of two groups

• Can fit models that are analogous to regression or ANOVA for ordinary linear models

More Complicated Models

• Allows modeling of RNAseq data from multifactorial experimental designs

–Can be more powerful than looking at one experimental condition at a time

–Can look at interaction between multiple experimental conditions

–Can look at continuous changes in expression as function of e.g. age

Multifactorial Model Example

• RNAseq data

• Two genotypes of plant (CM and SP)

• Two experimental conditions (N and G)

• 3 replicates

Multifactorial Model Example

• Model fitted that includes “main effects” for genotype and condition plus “interaction effects”

Multifactorial Model Example: Interpretation of Parameters

• Genotype CM, condition G

• Genotype CM, condition N

• Genotype SP, condition G

• Genotype SP, condition N
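One common way to parameterize the four groups is a design matrix with an intercept, two main effects, and an interaction. The sketch below assumes genotype CM and condition N are the reference levels. That choice, the function names, and the coefficient values are hypothetical; the parameterization shown on the original slide is not recoverable from this transcript.

```python
def design_row(genotype, condition):
    """Columns: intercept, genotype SP, condition G, SP:G interaction.

    Reference levels (CM, N) are an assumption made for this sketch.
    """
    sp = 1 if genotype == "SP" else 0
    g = 1 if condition == "G" else 0
    return [1, sp, g, sp * g]


def log_mean(row, beta):
    """Linear predictor: sum of the coefficients switched on in this row."""
    return sum(x * b for x, b in zip(row, beta))


if __name__ == "__main__":
    beta = [1.0, 0.5, 0.25, -0.125]  # made-up coefficients
    for genotype in ("CM", "SP"):
        for condition in ("N", "G"):
            row = design_row(genotype, condition)
            print(genotype, condition, row, log_mean(row, beta))
```

Under this coding the interaction coefficient is the difference between the condition effect in SP and the condition effect in CM.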

VISUALIZATION TECHNIQUES

Distances

• Many unsupervised clustering/visualization techniques are based on distance matrices

• Distance between two points can be defined many ways

– Euclidean distance

– 1 – abs(correlation)

– Mahalanobis (covariance-scaled) distance

– Maximum distance

• Euclidean distance most common

Euclidean Distance

• For Euclidean distance to be meaningful, data must be scaled so that each dimension (gene) has the same variance

[Figure: two panels of the same points on differently scaled axes. In one panel, diamond and X are closest; in the other, circle and diamond are closest.]
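The effect of scaling on Euclidean distance can be reproduced with a few made-up points. In this sketch the second dimension has a much larger spread than the first, so it dominates the unscaled distances; after dividing each dimension by its standard deviation, a different pair is closest. The coordinates, the extra "square" point, and the function names are all hypothetical.

```python
import itertools
import math
import statistics


def closest_pair(points):
    """Names of the two points with the smallest Euclidean distance."""
    return min(itertools.combinations(points, 2),
               key=lambda pair: math.dist(points[pair[0]], points[pair[1]]))


def scale_columns(points):
    """Divide each dimension (gene) by its standard deviation."""
    columns = list(zip(*points.values()))
    sds = [statistics.stdev(col) for col in columns]
    return {name: tuple(v / s for v, s in zip(values, sds))
            for name, values in points.items()}


if __name__ == "__main__":
    raw = {
        "circle": (0.0, 100.0),
        "diamond": (0.1, 130.0),
        "X": (3.0, 120.0),
        "square": (1.5, 400.0),
    }
    print("unscaled:", closest_pair(raw))
    print("scaled:  ", closest_pair(scale_columns(raw)))
```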

Multidimensional Scaling Plots

• MDS takes a distance matrix and recreates the data in two dimensions while preserving the distances as closely as possible

http://statlab.bio5.org/foswiki/pub/Main/RBioconductorWorkshop2012/Day6_demo.pdf
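One standard way to do this is classical (metric) MDS: square the distances, double-centre the matrix, and use the leading eigenvectors as coordinates. The sketch below finds the eigenvectors by power iteration so that it needs only the standard library. That choice, the function name, and the iteration count are assumptions for illustration; a numerical library's eigendecomposition would normally be used, and it is not known which MDS variant produced the original plots.

```python
import math
import random


def classical_mds(dist, n_dims=2, n_iter=500, seed=0):
    """Coordinates in n_dims dimensions from a symmetric distance matrix."""
    n = len(dist)
    d2 = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in d2]
    grand_mean = sum(row_mean) / n
    # Double centring: B = -1/2 * J * D^2 * J
    b = [[-0.5 * (d2[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    rng = random.Random(seed)
    coords = [[0.0] * n_dims for _ in range(n)]
    for k in range(n_dims):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for _ in range(n_iter):  # power iteration for the top eigenvector
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigval = sum(x * y for x, y in zip(v, bv))
        scale = math.sqrt(max(eigval, 0.0))
        for i in range(n):
            coords[i][k] = scale * v[i]
        # Deflate so the next pass finds the next eigenvector.
        b = [[b[i][j] - eigval * v[i] * v[j] for j in range(n)]
             for i in range(n)]
    return coords


if __name__ == "__main__":
    pts = [(0, 0), (3, 0), (0, 4), (3, 4)]  # made-up points
    dist = [[math.dist(p, q) for q in pts] for p in pts]
    for row in classical_mds(dist):
        print([round(x, 3) for x in row])
```

The recovered coordinates may be rotated or reflected relative to the originals; only the pairwise distances are preserved.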

Hierarchical Clustering

• Hierarchical clustering starts by treating each sample as its own cluster

• The “closest” clusters are merged successively until only one cluster remains

• Produces tree with series of nested clusterings rather than one set of clusters

• Plots of these trees are called “dendrograms”
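The merging procedure can be sketched in a few lines. This version uses average linkage (the distance between two clusters is the mean of all pairwise distances between their members). That is one of several possible definitions of "closest", and the slides do not say which was used. The function name and the one-dimensional example points are hypothetical.

```python
import math


def average_linkage(points, dist=math.dist):
    """Agglomerative clustering; returns the merges as (left, right, distance)."""
    clusters = [[name] for name in points]  # each sample starts alone
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                total = sum(dist(points[a], points[b])
                            for a in clusters[i] for b in clusters[j])
                d = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


if __name__ == "__main__":
    samples = {"a": (0.0,), "b": (1.0,), "c": (10.0,), "d": (11.0,),
               "e": (50.0,)}
    for left, right, height in average_linkage(samples):
        print(left, "+", right, "at height", height)
```

The list of merges and their heights is exactly the information a dendrogram draws.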

Heat Maps

• Data are plotted with color corresponding to numeric value

• Dendrograms of rows (genes) and columns (samples) displayed on sides

• Rows/columns are reordered by their means; this tends to create blocks of color

Statistics Behind IDR

• IDR = Irreproducible Discovery Rate

– Li, Brown, Huang, and Bickel 2011

• Way of assessing reproducibility of top ranked signals between replicate experiments

• Used to assess reproducibility of ChIP-seq

• Applicable to any high-throughput method that outputs a ranked list

Statistics Behind IDR

• IDR = P(signal not reproducible)

• Model assumes data are mixture of real and spurious signals

• Ranks of real signals will be correlated between replicate experiments

• Ranks of spurious signals will be uncorrelated
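The intuition behind the mixture assumption can be illustrated with simulated scores. The sketch below is not the copula mixture model of Li et al.; it only generates "real" signals that share an underlying strength across two replicates and "spurious" signals that are independent noise, then compares the rank (Spearman) correlation within each component. All names, distributions, and parameter values are assumptions.

```python
import math
import random
import statistics


def ranks(values):
    """Rank from 1 (smallest) to n; ties are not expected with continuous data."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def pearson(x, y):
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def simulate_replicates(n_real, n_spurious, seed=0):
    """Return ((rep1, rep2) for real signals, (rep1, rep2) for spurious)."""
    rng = random.Random(seed)
    strength = [rng.gauss(3.0, 1.0) for _ in range(n_real)]
    real = ([s + rng.gauss(0.0, 0.3) for s in strength],
            [s + rng.gauss(0.0, 0.3) for s in strength])
    spurious = ([rng.gauss(0.0, 1.0) for _ in range(n_spurious)],
                [rng.gauss(0.0, 1.0) for _ in range(n_spurious)])
    return real, spurious


if __name__ == "__main__":
    real, spurious = simulate_replicates(500, 500)
    print("rank correlation, real signals:    ", spearman(*real))
    print("rank correlation, spurious signals:", spearman(*spurious))
```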

The irreproducible discovery rate (IDR) framework for assessing reproducibility of ChIP-seq data sets.

Landt S G et al. Genome Res. 2012;22:1813-1831

© 2012, Published by Cold Spring Harbor Laboratory Press

Conclusions

• While data from unreplicated RNA-Seq experiments can be analyzed, this is not recommended

– 3 or more biological replicates recommended

– 10 for non-inbred organisms

– For complex experimental designs, consider degrees of freedom

• Multiple testing increases the risk of significant p-values when no difference exists; adjust by FDR or other method

Conclusions (Continued)

• A list of DE genes is a first step, not absolute truth

– Follow up experiments

• Thoughtful experimental design and use of statistics is as important for genomics data as for any other kind of data