![Page 1: Using Simulated Data to Optimise Experimental Design and Analysis for RNA Sequencing (Conrad Burden)](https://reader033.vdocument.in/reader033/viewer/2022052619/55504d74b4c90580748b52b9/html5/thumbnails/1.jpg)
Using Simulated Data to Optimise Experimental Design
and Analysis for RNA-Sequencing.
Conrad Burden, Mathematical Sciences Institute, Australian National University, Canberra
![Page 2: Using Simulated Data to Optimise Experimental Design and Analysis for RNA Sequencing (Conrad Burden)](https://reader033.vdocument.in/reader033/viewer/2022052619/55504d74b4c90580748b52b9/html5/thumbnails/2.jpg)
RNA-Seq: Using high-throughput sequencing technology to sequence cDNA that has been reverse-transcribed from RNA, to get information about a sample’s RNA content. If the sample is mRNA from a cell, it detects which genes are expressed.
Useful for:
1. Expression profiling
2. Detecting differential expression
![Page 3: Using Simulated Data to Optimise Experimental Design and Analysis for RNA Sequencing (Conrad Burden)](https://reader033.vdocument.in/reader033/viewer/2022052619/55504d74b4c90580748b52b9/html5/thumbnails/3.jpg)
Extract RNA → Library prep → Sequencing
RNA → cDNA
• Extract mRNA from total RNA
• Randomly fragment
• Reverse transcribe to cDNA
• Ligate sequencing adaptor
• Size select to ~ 200 bases
• Amplify with PCR
Sequence and map to reference genome to get a digital count of fragments sampled from each gene
![Page 4: Using Simulated Data to Optimise Experimental Design and Analysis for RNA Sequencing (Conrad Burden)](https://reader033.vdocument.in/reader033/viewer/2022052619/55504d74b4c90580748b52b9/html5/thumbnails/4.jpg)
Extract RNA Library prep Sequencing
RNA cDNA
Biological variation; technical variation
Poisson noise
Overdispersion
Final count for each gene is overdispersed Poisson
![Page 5: Using Simulated Data to Optimise Experimental Design and Analysis for RNA Sequencing (Conrad Burden)](https://reader033.vdocument.in/reader033/viewer/2022052619/55504d74b4c90580748b52b9/html5/thumbnails/5.jpg)
[Diagram: Extract RNA (transcript abundance ~ q) → Library prep (cDNA, conc = R) → Sequencing (count = K)]
1. For a given gene of interest, let R = molar concentration of cDNA in the ‘library’, with E(R) = q, Var(R) = v.
2. Consider q as a proxy for the ‘transcript abundance’ of this gene.
3. The sequencer count K for this gene, given R, is Poisson: K|R ~ Pois(λR).
1, 2 and 3 imply
E(K) = μ, Var(K) = μ(1 + φμ), where μ = λq, φ = v/q².
φ is called the overdispersion.
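The step from points 1 to 3 to the mean and variance of K is not spelled out on the slide. A short derivation, using only the symbols already defined and the standard laws of total expectation and total variance, might read:

```latex
\begin{align*}
E(K) &= E\big[E(K \mid R)\big] = E(\lambda R) = \lambda q = \mu, \\
\mathrm{Var}(K) &= E\big[\mathrm{Var}(K \mid R)\big] + \mathrm{Var}\big(E(K \mid R)\big) \\
  &= E(\lambda R) + \mathrm{Var}(\lambda R)
   = \lambda q + \lambda^{2} v \\
  &= \mu + \frac{v}{q^{2}}\,(\lambda q)^{2}
   = \mu + \varphi\mu^{2}
   = \mu(1 + \varphi\mu).
\end{align*}
```

The first term is the Poisson (shot) noise and the second is the extra variance inherited from R, which is why φ = 0 recovers a pure Poisson count.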
![Page 6: Using Simulated Data to Optimise Experimental Design and Analysis for RNA Sequencing (Conrad Burden)](https://reader033.vdocument.in/reader033/viewer/2022052619/55504d74b4c90580748b52b9/html5/thumbnails/6.jpg)
Moreover, if
λR ~ Gamma(mean = μ, variance = φμ²)
then
K ~ NegBin(mean = μ, variance = μ(1 + φμ)).
If λ, μ and φ can be estimated from the data, q = μ/λ gives an estimate of the abundance of this transcript.
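As a rough numerical check of the Gamma-Poisson construction, the sketch below draws λR from a Gamma distribution with mean μ and variance φμ², then draws K as Poisson given that rate, and compares the sample mean and variance with μ and μ(1 + φμ). The function names, the seed and the values μ = 20, φ = 0.25 are illustrative assumptions, not taken from the talk; only the Python standard library is used.

```python
import math
import random


def poisson(rate, rng, chunk=50.0):
    """Poisson draw via Knuth's method, applied in chunks so exp(-rate) never underflows."""
    count = 0
    while rate > 0:
        step = min(rate, chunk)
        limit, prod = math.exp(-step), rng.random()
        while prod > limit:
            count += 1
            prod *= rng.random()
        rate -= step
    return count


def neg_binomial(mu, phi, rng):
    """Draw lambda*R ~ Gamma(mean=mu, variance=phi*mu^2), then K | R ~ Poisson(lambda*R)."""
    shape = 1.0 / phi   # Gamma shape = mean^2 / variance
    scale = phi * mu    # Gamma scale = variance / mean
    return poisson(rng.gammavariate(shape, scale), rng)


rng = random.Random(1)      # fixed seed, chosen arbitrarily
mu, phi = 20.0, 0.25        # illustrative values only
draws = [neg_binomial(mu, phi, rng) for _ in range(50_000)]
mean = sum(draws) / len(draws)
var = sum((k - mean) ** 2 for k in draws) / (len(draws) - 1)
print(f"sample mean {mean:.2f}, theory {mu:.2f}")
print(f"sample variance {var:.2f}, theory {mu * (1 + phi * mu):.2f}")
```

With φ = 0.25 and μ = 20 the theoretical variance is 120, six times the Poisson variance of 20, which gives a feel for how much overdispersion matters at moderate counts.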
![Page 7: Using Simulated Data to Optimise Experimental Design and Analysis for RNA Sequencing (Conrad Burden)](https://reader033.vdocument.in/reader033/viewer/2022052619/55504d74b4c90580748b52b9/html5/thumbnails/7.jpg)
(Data: human lymphoblastoid cell lines from J.K. Pickrell et al., Nature 464 768–772.)
[Figure: four panels: synthetic Poisson vs. Poisson; same cDNA library, different sequencers; same biol. source, different cDNA libraries; different biol. reps.]
![Page 8: Using Simulated Data to Optimise Experimental Design and Analysis for RNA Sequencing (Conrad Burden)](https://reader033.vdocument.in/reader033/viewer/2022052619/55504d74b4c90580748b52b9/html5/thumbnails/8.jpg)
Data from an RNA-Seq experiment to detect differential expression typically looks like this:

| Gene | Condition A, Rep 1 | Condition A, Rep 2 | ...etc | Condition B, Rep 1 | Condition B, Rep 2 | ...etc |
| --- | --- | --- | --- | --- | --- | --- |
| ENSG00000209432 | 4 | 6 | ... | 35 | 45 | ... |
| ENSG00000209432 | 0 | 0 | ... | 2 | 1 | ... |
| ENSG00000209432 | 110 | 96 | ... | 177 | 203 | ... |
| ENSG00000209432 | 1268 | 1089 | ... | 9246 | 9873 | ... |
| ENSG00000212678 | 148 | 201 | ... | 112 | 93 | ... |
| ... etc. | | | | | | |

Rows: typically > 10,000 genes or transcript isoforms.
Columns: n reps per condition, for different conditions or biol. samples.
![Page 9: Using Simulated Data to Optimise Experimental Design and Analysis for RNA Sequencing (Conrad Burden)](https://reader033.vdocument.in/reader033/viewer/2022052619/55504d74b4c90580748b52b9/html5/thumbnails/9.jpg)
Which genes are differentially expressed?
![Page 10: Using Simulated Data to Optimise Experimental Design and Analysis for RNA Sequencing (Conrad Burden)](https://reader033.vdocument.in/reader033/viewer/2022052619/55504d74b4c90580748b52b9/html5/thumbnails/10.jpg)
R packages for assessing differential expression based on the negative binomial distribution:
• DESeq: S. Anders and W. Huber, Genome Biol. 11:R106 (2010)
• edgeR: M. Robinson, D. McCarthy and G. Smyth, Bioinformatics 26:139 (2010)
• (also NBPseq: Y. Di et al., SAGMB 10:24 (2011) and TSPM: P. Auer and R. Doerge, SAGMB 10:26 (2011))
![Page 11: Using Simulated Data to Optimise Experimental Design and Analysis for RNA Sequencing (Conrad Burden)](https://reader033.vdocument.in/reader033/viewer/2022052619/55504d74b4c90580748b52b9/html5/thumbnails/11.jpg)
They differ in how they estimate the overdispersion (φ) for each gene from a limited number of replicates:
• DESeq: dispersion φ estimated for each gene as the greater of a per-gene maximum likelihood estimate and a parametric fit to
φ = a + b/μ
• edgeR: dispersion φ estimated per gene from a likelihood function conditioned on the sum across conditions, then squeezed towards a common-to-all-genes dispersion using empirical Bayes
![Page 12: Using Simulated Data to Optimise Experimental Design and Analysis for RNA Sequencing (Conrad Burden)](https://reader033.vdocument.in/reader033/viewer/2022052619/55504d74b4c90580748b52b9/html5/thumbnails/12.jpg)
p-values under the null hypothesis
(μ/λ)condition A = (μ/λ)condition B
are calculated under the approximation that the total count in each condition is NB, and conditioned on the sum of counts.
[Figure: the point (a, b) in the plane with axes KA = counts (cond. A) and KB = counts (cond. B).]
![Page 13: Using Simulated Data to Optimise Experimental Design and Analysis for RNA Sequencing (Conrad Burden)](https://reader033.vdocument.in/reader033/viewer/2022052619/55504d74b4c90580748b52b9/html5/thumbnails/13.jpg)
p-values under the null hypothesis
[Figure: Prob(KA = a | KA + KB = a + b) plotted against a, with kA marked.]
The (1-sided) p-value is the sum of these probabilities.
![Page 14: Using Simulated Data to Optimise Experimental Design and Analysis for RNA Sequencing (Conrad Burden)](https://reader033.vdocument.in/reader033/viewer/2022052619/55504d74b4c90580748b52b9/html5/thumbnails/14.jpg)
p-values under the null hypothesis
[Figure: Prob(KA = a | KA + KB = a + b) plotted against a, with kA marked.]
The (2-sided) p-value is the sum of these probabilities.
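To make the conditional p-value concrete, here is a minimal sketch of the calculation the slides describe: under the null, both condition totals are treated as NB with the same mean and dispersion, and the probability of each split of the observed sum is computed. This is not the DESeq or edgeR source code. The names `mu_total` and `phi_total` are hypothetical, equal library sizes are assumed, and the two-sided rule used here (sum over all outcomes no more probable than the observed one) is an assumption, since the figure itself is not recoverable.

```python
import math


def nb_log_pmf(k, mu, phi):
    """Log of the negative binomial pmf with mean mu and variance mu*(1 + phi*mu)."""
    r = 1.0 / phi
    return (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
            + r * math.log(r / (r + mu)) + k * math.log(mu / (r + mu)))


def conditional_p_values(k_a, k_b, mu_total, phi_total):
    """Return (one-sided, two-sided) p-values for KA = k_a given KA + KB = k_a + k_b."""
    total = k_a + k_b
    log_w = [nb_log_pmf(a, mu_total, phi_total) + nb_log_pmf(total - a, mu_total, phi_total)
             for a in range(total + 1)]
    top = max(log_w)
    weights = [math.exp(x - top) for x in log_w]   # rescaled to avoid underflow
    norm = sum(weights)
    probs = [w / norm for w in weights]            # Prob(KA = a | KA + KB = total)
    one_sided = sum(probs[k_a:])                   # upper tail: Prob(KA >= k_a | sum)
    cutoff = probs[k_a] * (1 + 1e-9)               # small tolerance for floating-point ties
    two_sided = sum(p for p in probs if p <= cutoff)
    return one_sided, two_sided


one_sided, two_sided = conditional_p_values(k_a=30, k_b=10, mu_total=20.0, phi_total=0.1)
balanced = conditional_p_values(k_a=20, k_b=20, mu_total=20.0, phi_total=0.1)
print(f"30 vs 10: one-sided {one_sided:.4f}, two-sided {two_sided:.4f}")
print(f"20 vs 20: two-sided {balanced[1]:.4f}")
```

For n replicates per condition with a common μ and φ, the condition total is itself NB with mean nμ and dispersion φ/n, which is what `mu_total` and `phi_total` would stand for in that case.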
![Page 15: Using Simulated Data to Optimise Experimental Design and Analysis for RNA Sequencing (Conrad Burden)](https://reader033.vdocument.in/reader033/viewer/2022052619/55504d74b4c90580748b52b9/html5/thumbnails/15.jpg)
Robles et al., BMC Genomics (2012) 13:484
![Page 16: Using Simulated Data to Optimise Experimental Design and Analysis for RNA Sequencing (Conrad Burden)](https://reader033.vdocument.in/reader033/viewer/2022052619/55504d74b4c90580748b52b9/html5/thumbnails/16.jpg)
Test DESeq and edgeR using simulated data
Testing the null hypothesis:
1. Start with the Pickrell et al. dataset of 69 sequenced cDNA libraries from the HapMap project (i.e. a table of RNA-Seq counts for 69 biological replicates of ~60,000 transcript isoforms).
2. Use max. likelihood to produce from this a set of NB parameters (μi, φi) for i = 1, ..., ~60,000 representing a ‘typical’ range of means and overdispersions for our synthetic transcriptome.
3. Construct a synthetic dataset of counts:
• n reps of ‘control’ counts Kij(control) ~ NB(μi, φi), j = 1, ..., n
• n reps of ‘treatment’ counts Kij(treatment) ~ NB(μi, φi)
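Step 3 above can be sketched in a few lines. The fitted Pickrell parameters are not available here, so the (μi, φi) pairs below are hypothetical stand-ins drawn from arbitrary ranges; the function names and the choice of 1,000 transcripts are likewise assumptions made only to keep the example small.

```python
import math
import random


def poisson(rate, rng, chunk=50.0):
    """Poisson draw via Knuth's method, applied in chunks so exp(-rate) never underflows."""
    count = 0
    while rate > 0:
        step = min(rate, chunk)
        limit, prod = math.exp(-step), rng.random()
        while prod > limit:
            count += 1
            prod *= rng.random()
        rate -= step
    return count


def draw_nb(mu, phi, rng):
    """One NB(mean=mu, overdispersion=phi) count, drawn as a Gamma-Poisson mixture."""
    return poisson(rng.gammavariate(1.0 / phi, phi * mu), rng)


def synthetic_null_counts(params, n_reps, rng):
    """Return (control, treatment) count tables; both use the same (mu_i, phi_i), so the null holds."""
    control = [[draw_nb(mu, phi, rng) for _ in range(n_reps)] for mu, phi in params]
    treatment = [[draw_nb(mu, phi, rng) for _ in range(n_reps)] for mu, phi in params]
    return control, treatment


rng = random.Random(0)
# Hypothetical stand-ins for the maximum likelihood estimates (mu_i, phi_i).
params = [(math.exp(rng.uniform(0.0, 5.0)), rng.uniform(0.05, 0.5)) for _ in range(1000)]
control, treatment = synthetic_null_counts(params, n_reps=3, rng=rng)
print("transcript 0:", control[0], "vs", treatment[0])
```

Because control and treatment are drawn from identical distributions, any transcript a package reports as differentially expressed on such a table is a false positive by construction.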
![Page 17: Using Simulated Data to Optimise Experimental Design and Analysis for RNA Sequencing (Conrad Burden)](https://reader033.vdocument.in/reader033/viewer/2022052619/55504d74b4c90580748b52b9/html5/thumbnails/17.jpg)
Null hypothesis: (no up- or down-regulation) n = 3 reps vs. 3 reps; expect flat p-value distribution.
[Figure: Synthetic data, 3 rep vs. 3 rep. Histograms of p-values (0.0 to 1.0) against percentage of total (0 to 10) for NBP, DESeq and edgeR. Panels for each package: all transcripts, high (>100 counts), low (<100 counts).]
![Page 18: Using Simulated Data to Optimise Experimental Design and Analysis for RNA Sequencing (Conrad Burden)](https://reader033.vdocument.in/reader033/viewer/2022052619/55504d74b4c90580748b52b9/html5/thumbnails/18.jpg)
[Figure: DESeq null p-values: synthetic data 3 vs. 3. Density (0 to 7) against p-value (0.0 to 1.0).]
![Page 19: Using Simulated Data to Optimise Experimental Design and Analysis for RNA Sequencing (Conrad Burden)](https://reader033.vdocument.in/reader033/viewer/2022052619/55504d74b4c90580748b52b9/html5/thumbnails/19.jpg)
Right-hand spike is an artifact of calculating p-values from a discrete distribution. It could be ‘fixed’ by replacing the discrete distribution by a continuous distribution.
[Figure: discrete distribution Prob(KA = a | KA + KB = kA + kB) plotted against a, with kA marked. The 2-sided p-value is the sum of these probs.]
[Figure: continuous version of the same distribution. Choose a point randomly in the interval (kA − ½, kA + ½); the 2-sided p-value is the shaded area.]
![Page 20: Using Simulated Data to Optimise Experimental Design and Analysis for RNA Sequencing (Conrad Burden)](https://reader033.vdocument.in/reader033/viewer/2022052619/55504d74b4c90580748b52b9/html5/thumbnails/20.jpg)
[Figure: DESeq null p-values: synthetic data 3 vs. 3. Density (0 to 7) against p-value (0.0 to 1.0). Curves: original spectrum; spike removed.]
![Page 21: Using Simulated Data to Optimise Experimental Design and Analysis for RNA Sequencing (Conrad Burden)](https://reader033.vdocument.in/reader033/viewer/2022052619/55504d74b4c90580748b52b9/html5/thumbnails/21.jpg)
Remaining deviation from a uniform distribution is from having to estimate the parameters μ and φ for each transcript.
[Figure: DESeq null p-values: synthetic data 3 vs. 3. Density (0 to 7) against p-value (0.0 to 1.0). Curves: original spectrum; spike removed; parameters not estimated.]
![Page 22: Using Simulated Data to Optimise Experimental Design and Analysis for RNA Sequencing (Conrad Burden)](https://reader033.vdocument.in/reader033/viewer/2022052619/55504d74b4c90580748b52b9/html5/thumbnails/22.jpg)
Null hypothesis: α = 0 (no up- or down-regulation) n = 3 reps vs. 3 reps; expect flat p-value distribution.
[Figure: Synthetic data, 3 rep vs. 3 rep. Histograms of p-values (0.0 to 1.0) against percentage of total (0 to 10) for NBP, DESeq and edgeR. Panels for each package: all transcripts, high (>100 counts), low (<100 counts). Annotations: artifact of p-value calculation for discrete data; underestimate of dispersion; overestimate of dispersion.]
![Page 23: Using Simulated Data to Optimise Experimental Design and Analysis for RNA Sequencing (Conrad Burden)](https://reader033.vdocument.in/reader033/viewer/2022052619/55504d74b4c90580748b52b9/html5/thumbnails/23.jpg)
[Figure: bar chart of FPR (axis from 0.00 to 3.00) for edgeR, DESeq and PoissonSeq, each at n vs. n replicates from n = 2 to n = 12.]
FPR = percentage of transcripts reported as differentially expressed under the null hypothesis for n reps vs. n reps at α = 1% significance
(Li et al., Biostatistics (2012) 13:523)
![Page 24: Using Simulated Data to Optimise Experimental Design and Analysis for RNA Sequencing (Conrad Burden)](https://reader033.vdocument.in/reader033/viewer/2022052619/55504d74b4c90580748b52b9/html5/thumbnails/24.jpg)
[Figure: bar chart of FPR (axis from 0.00 to 3.00) for edgeR, DESeq and PoissonSeq, each at n vs. n replicates from n = 2 to n = 12. Annotations: overdispersion underestimated, underconservative; overdispersion overestimated, overconservative.]
FPR = percentage of transcripts reported as differentially expressed under the null hypothesis for n reps vs. n reps at α = 1% significance
(Li et al., Biostatistics (2012) 13:523)
![Page 25: Using Simulated Data to Optimise Experimental Design and Analysis for RNA Sequencing (Conrad Burden)](https://reader033.vdocument.in/reader033/viewer/2022052619/55504d74b4c90580748b52b9/html5/thumbnails/25.jpg)
Testing the power to detect differential expression
• How many replicates are appropriate? (biological reps, or library prep reps if reps are from the same biological source)
• What sequencing depth?
• Is multiplexing (via barcodes) worthwhile?
![Page 26: Using Simulated Data to Optimise Experimental Design and Analysis for RNA Sequencing (Conrad Burden)](https://reader033.vdocument.in/reader033/viewer/2022052619/55504d74b4c90580748b52b9/html5/thumbnails/26.jpg)
• Synthetic dataset to test the power of DESeq and edgeR to detect differential expression
1. Use max. likelihood estimates of (μi, φi) from Pickrell data again
2. Construct a synthetic dataset of counts:
• n reps of ‘control’ counts Kij(control) ~ NB(μi, φi), j = 1, ..., n
• n reps of ‘treatment’ counts Kij(treatment) ~ NB(μi θi, φi) where
θi = (1 + Xi) for 7.5% of the transcripts (up-regulated)
θi = (1 + Xi)⁻¹ for a further 7.5% (down-regulated)
θi = 1 for the remainder,
with Xi ~ i.i.d. exponential random variables, parameter 1.
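The fold-change assignment in step 2 can be sketched as below. The function name, the seed, the 10,000 transcripts and the constant control mean of 50 are all illustrative assumptions; in the talk the means come from the Pickrell fit, and the treatment counts would then be drawn from NB(μi θi, φi).

```python
import random


def assign_fold_changes(n_transcripts, rng, frac_up=0.075, frac_down=0.075):
    """Return theta_i per transcript: 1 + X if up-regulated, 1/(1 + X) if down-regulated, else 1."""
    n_up = round(frac_up * n_transcripts)
    n_down = round(frac_down * n_transcripts)
    indices = list(range(n_transcripts))
    rng.shuffle(indices)                      # pick the regulated transcripts at random
    thetas = [1.0] * n_transcripts
    for i in indices[:n_up]:
        thetas[i] = 1.0 + rng.expovariate(1.0)            # X ~ exponential, parameter 1
    for i in indices[n_up:n_up + n_down]:
        thetas[i] = 1.0 / (1.0 + rng.expovariate(1.0))
    return thetas


rng = random.Random(42)
thetas = assign_fold_changes(10_000, rng)
mu_control = [50.0] * len(thetas)             # hypothetical control means
mu_treatment = [mu * theta for mu, theta in zip(mu_control, thetas)]
print("up:", sum(t > 1 for t in thetas),
      "down:", sum(t < 1 for t in thetas),
      "unchanged:", sum(t == 1 for t in thetas))
```

Choosing exactly 7.5% of the indices, rather than flipping a coin per transcript, keeps the regulated fractions fixed from one synthetic dataset to the next.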
![Page 27: Using Simulated Data to Optimise Experimental Design and Analysis for RNA Sequencing (Conrad Burden)](https://reader033.vdocument.in/reader033/viewer/2022052619/55504d74b4c90580748b52b9/html5/thumbnails/27.jpg)
Define a gene to be ‘effectively differentially expressed’ if
θi < 1/1.2 or θi > 1.2
[Figure: distribution of θi divided into regions labelled "Effectively DE" and "Effectively non-DE", with 85% unchanged.]
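The threshold implies a specific split of the transcriptome, which the slide does not state. Under the stated model (Xi exponential with parameter 1, 7.5% up and 7.5% down), a short calculation gives the expected fractions; the numerical values are rounded.

```latex
\begin{align*}
P(\theta_i > 1.2 \mid \text{up-regulated}) &= P(X_i > 0.2) = e^{-0.2} \approx 0.819, \\
P(\theta_i < 1/1.2 \mid \text{down-regulated}) &= P\big((1 + X_i)^{-1} < 1/1.2\big) = P(X_i > 0.2) = e^{-0.2}, \\
P(\text{effectively DE}) &= (0.075 + 0.075)\, e^{-0.2} \approx 0.123, \\
P(\text{effectively non-DE}) &= 0.85 + 0.15\,\big(1 - e^{-0.2}\big) \approx 0.877.
\end{align*}
```

So roughly 12% of transcripts count as effectively DE, and the effectively non-DE group contains the 85% unchanged transcripts plus about 2.7% whose fold change is smaller than 1.2.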
![Page 28: Using Simulated Data to Optimise Experimental Design and Analysis for RNA Sequencing (Conrad Burden)](https://reader033.vdocument.in/reader033/viewer/2022052619/55504d74b4c90580748b52b9/html5/thumbnails/28.jpg)
Control for false discovery rate FDR = FP/(TP + FP) using the Benjamini-Hochberg adjusted p-value padj < α.
Finally, measure a false positive rate
FPR = (# of effectively non-DE transcripts with padj < α) / (total # of effectively non-DE transcripts)
and a true positive rate
TPR = (# of effectively DE transcripts with padj < α) / (total # of effectively DE transcripts)
Do this for a range of coverage depths and # replicates.
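The two rates can be computed from a list of raw p-values and a truth vector as sketched below. The Benjamini-Hochberg step-up adjustment is the standard one; the function names and the five example p-values are made up for illustration and do not come from the talk.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of the adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def fpr_tpr(p_adj, effectively_de, alpha):
    """FPR over effectively non-DE transcripts and TPR over effectively DE transcripts."""
    called = [p < alpha for p in p_adj]
    tp = sum(1 for c, d in zip(called, effectively_de) if c and d)
    fp = sum(1 for c, d in zip(called, effectively_de) if c and not d)
    n_de = sum(effectively_de)
    n_non_de = len(effectively_de) - n_de
    return fp / n_non_de, tp / n_de


p_values = [0.001, 0.008, 0.039, 0.041, 0.6]           # made-up example values
effectively_de = [True, False, True, False, False]     # made-up truth labels
p_adj = benjamini_hochberg(p_values)
fpr, tpr = fpr_tpr(p_adj, effectively_de, alpha=0.05)
print(p_adj)
print(f"FPR = {fpr:.3f}, TPR = {tpr:.3f}")
```

In this toy example the two smallest adjusted p-values fall below 0.05, one of them a true DE transcript and one not, giving FPR = 1/3 and TPR = 1/2.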
![Page 29: Using Simulated Data to Optimise Experimental Design and Analysis for RNA Sequencing (Conrad Burden)](https://reader033.vdocument.in/reader033/viewer/2022052619/55504d74b4c90580748b52b9/html5/thumbnails/29.jpg)
With 7.5% up-regulated and 7.5% down-regulated: DESeq
TPR = TP/(TP + FN) (x 100%) = ‘sensitivity’
using Benjamini-Hochberg adjusted p-value padj ≤ 0.01 as a significance criterion
100% coverage ≈ 10⁷ reads
![Page 30: Using Simulated Data to Optimise Experimental Design and Analysis for RNA Sequencing (Conrad Burden)](https://reader033.vdocument.in/reader033/viewer/2022052619/55504d74b4c90580748b52b9/html5/thumbnails/30.jpg)
With 7.5% up-regulated and 7.5% down-regulated: edgeR
TPR = TP/(TP + FN) (x 100%) = ‘sensitivity’
using Benjamini-Hochberg adjusted p-value padj ≤ 0.01 as a significance criterion
100% coverage ≈ 10⁷ reads
![Page 31: Using Simulated Data to Optimise Experimental Design and Analysis for RNA Sequencing (Conrad Burden)](https://reader033.vdocument.in/reader033/viewer/2022052619/55504d74b4c90580748b52b9/html5/thumbnails/31.jpg)
With 7.5% up-regulated and 7.5% down-regulated: edgeR
TPR = TP/(TP + FN) (x 100%) = ‘sensitivity’
1. TPR increases with number of reps n
2. TPR decreases with decreasing coverage depth
3. Multiplexing (more reps, less coverage, keeping n times depth constant) improves TPR (grey curve)
4. edgeR has slightly better sensitivity than DESeq
![Page 32: Using Simulated Data to Optimise Experimental Design and Analysis for RNA Sequencing (Conrad Burden)](https://reader033.vdocument.in/reader033/viewer/2022052619/55504d74b4c90580748b52b9/html5/thumbnails/32.jpg)
With 7.5% up-regulated and 7.5% down-regulated: DESeq
FPR = FP/(TN + FP) (x 100%) = 1 – ‘specificity’
using Benjamini-Hochberg adjusted p-value padj ≤ 0.01 as a significance criterion
[Figure: FPR curves labelled n = 2, n = 4 and n = 12.]
![Page 33: Using Simulated Data to Optimise Experimental Design and Analysis for RNA Sequencing (Conrad Burden)](https://reader033.vdocument.in/reader033/viewer/2022052619/55504d74b4c90580748b52b9/html5/thumbnails/33.jpg)
With 7.5% up-regulated and 7.5% down-regulated: edgeR
FPR = FP/(TN + FP) (x 100%) = 1 – ‘specificity’
using Benjamini-Hochberg adjusted p-value padj ≤ 0.01 as a significance criterion
[Figure: FPR curves labelled n = 2, n = 4 and n = 12.]
1. Multiplexing (more reps, less coverage, keeping n times depth constant) improves specificity (grey curve)
2. DESeq has slightly better specificity than edgeR
![Page 34: Using Simulated Data to Optimise Experimental Design and Analysis for RNA Sequencing (Conrad Burden)](https://reader033.vdocument.in/reader033/viewer/2022052619/55504d74b4c90580748b52b9/html5/thumbnails/34.jpg)
With 7.5% up-regulated and 7.5% down-regulated: edgeR
FPR = FP/(TN + FP) (x 100%) = 1 – ‘specificity’
using Fold change > 2 as a criterion for detecting differential expression (not recommended)
[Figure: FPR curves labelled n = 2, n = 4 and n = 12.]
FPR increases with decreasing coverage depth because more transcripts have very low counts, and Poisson shot noise can easily induce a spurious doubling of counts.
![Page 35: Using Simulated Data to Optimise Experimental Design and Analysis for RNA Sequencing (Conrad Burden)](https://reader033.vdocument.in/reader033/viewer/2022052619/55504d74b4c90580748b52b9/html5/thumbnails/35.jpg)
To summarise
• Have tested the performance of Negative Binomial based R packages for detecting differential expression using synthetic data.
• Under the null hypothesis, DESeq’s performance is consistently more conservative than edgeR across # of replicates, and closer to the expected significance level for small numbers of reps. edgeR is closer for high numbers of reps.
• With 15% of transcripts differentially expressed, for both edgeR and DESeq:
– sensitivity (= TPR) improves with number of replicates, as expected
– sensitivity declines with decreased sequencing depth, as expected
– sensitivity better for edgeR than DESeq
– but multiplexing (decreasing sequencing depth while increasing # of replicates with same total amount of ‘read estate’) increases sensitivity markedly
![Page 36: Using Simulated Data to Optimise Experimental Design and Analysis for RNA Sequencing (Conrad Burden)](https://reader033.vdocument.in/reader033/viewer/2022052619/55504d74b4c90580748b52b9/html5/thumbnails/36.jpg)
To summarise
Recommend
• The more (independent!) replicates the better
• It’s OK to sacrifice sequencing read depth by multiplexing
![Page 37: Using Simulated Data to Optimise Experimental Design and Analysis for RNA Sequencing (Conrad Burden)](https://reader033.vdocument.in/reader033/viewer/2022052619/55504d74b4c90580748b52b9/html5/thumbnails/37.jpg)
Acknowledgements
Sue Wilson, Australian National University and University of New South Wales
Jen Taylor, Division of Plant Industry, CSIRO
Sumaira Qureshi, Mathematical Sciences Institute, Australian National University
Jose Robles, Division of Plant Industry, CSIRO
Stuart Stephen, Division of Plant Industry, CSIRO