scrna-seq - differential expression analyses
TRANSCRIPT
scRNA-seq: Differential expression analyses
Olga Dethlefsen (olga.dethlefsen@nbis.se)
NBIS National Bioinformatics Infrastructure Sweden
February 2019
Olga (NBIS) scRNA-seq DE February 2019 1 46
Outline
- Introduction: what is so special about scRNA-seq DE?
- Common methods: what is out there?
- Performance: how do we know what is best?
- Practicalities: what to do in real life
- Summary: what to remember from this hour
Introduction
What does differential expression mean to you?
https://www.menti.com
Figure: Simplified scRNA-seq workflow [adapted from Wikipedia]
Differential expression means taking read count data and performing statistical analysis to discover quantitative changes in expression levels between experimental groups, i.e. deciding whether, for a given gene, an observed difference in read counts is significant (greater than what would be expected just due to natural random variation).

Differential expression is an old problem, known from bulk RNA-seq and microarray studies. In fact, it builds on one of the most common statistical problems, i.e. comparing groups for statistical differences.

(adapted from Wu et al. 2017)
Differential expression is an old problem
So what is all the commotion about?
https://www.menti.com & 25 06 78

scRNA-seq special characteristics
- high noise levels (technical and biological factors)
- low library sizes
- low amount of available mRNAs results in amplification biases and dropout events
- 3' bias, partial coverage and uneven depth (technical)
- stochastic nature of transcription (biological)
- multimodality in gene expression: presence of multiple possible cell states within a cell population (biological)
Figure: density plots of expression values (x-axis: value; y-axis: density) for 20 example genes: Rbm17, Rragc, Slc1a3, Slc22a20, Smarcd1, Mybpc1, Nars, Ndufa3, Nono, Pgam2, Crispld2, Fbxw13, Hbxip, Katna1, Lcorl, 1300018J18Rik, Arid2, Bend3, Ccdc104, Ccnt1. Based on tutorial data.
Common methods
Figure: Simplified scRNA-seq workflow [adapted from http://hemberg-lab.github.io]
Generic non-parametric methods
- e.g. Wilcoxon rank-sum test, Kruskal-Wallis, Kolmogorov-Smirnov test
- non-parametric tests generally convert observed expression values to ranks and test whether the distribution of ranks for one group is significantly different from the distribution of ranks for the other group
- some non-parametric methods fail in the presence of a large number of tied values, as is the case for dropouts (zeros) in single-cell RNA-seq expression data
- if the conditions for a parametric test hold, then it will typically be more powerful than a non-parametric test
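To make the rank-based idea concrete, here is a minimal Python sketch of a two-sided Wilcoxon rank-sum test that uses the normal approximation with a tie correction. The function name `rank_sum_test` and the toy expression values are assumptions made for this illustration; no continuity correction is applied, so p-values may differ slightly from those reported by a statistics package.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation, tie-corrected).

    Returns the U statistic of the first group and an approximate p-value.
    """
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n1, n2 = len(x), len(y)
    n = n1 + n2

    ranks = [0.0] * n
    tie_term = 0  # sum of t^3 - t over groups of t tied values
    i = 0
    while i < n:
        j = i
        while j < n and combined[j][0] == combined[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = average_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j

    rank_sum_x = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u <= 0:
        # every value is tied (e.g. all zeros): the test carries no information
        return u_x, 1.0
    z = (u_x - mean_u) / math.sqrt(var_u)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u_x, p_value


# Toy example: one gene, two groups of cells, many zeros (dropouts)
group_a = [0, 0, 0, 0, 1, 2, 0, 0]
group_b = [5, 6, 0, 7, 8, 9, 4, 6]
u_stat, p_value = rank_sum_test(group_a, group_b)
print(f"U = {u_stat}, p = {p_value:.4f}")
```

Note how the all-tied case returns p = 1: with many dropouts the ranks carry little information, which is the weakness of rank-based tests mentioned above.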
Developed for bulk RNA-seq
- e.g. edgeR, DESeq2
- compare estimates of mean expression (sample size)
- based on the negative binomial distribution
- can be assessed on datasets where RNA-seq data has been validated by RT-qPCR
Developed for scRNA-seq
- e.g. MAST, SCDE, Monocle, Pagoda, D3E etc.
- large number of samples (i.e. cells) for each group we are comparing in single-cell experiments; thus we can take advantage of the whole distribution of expression values in each group to identify differences between groups
- we usually do not have a defined set of experimental conditions; instead, we try to identify the cell groups by using an unsupervised clustering approach
Miao and Zhang 2016
Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle.

Short name | Method | Software version | Input | Available from | Reference
BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11]
D3E | D3E | D3E 1.0 | raw counts | GitHub | [12]
DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13]
DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14]
edgeRLRT | edgeR/LRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17]
edgeRLRTcensus | edgeR/LRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17]
edgeRLRTdeconv | edgeR/LRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18]
edgeRLRTrobust | edgeR/LRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19]
edgeRQLF | edgeR/QLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
edgeRQLFDetRate | edgeR/QLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22]
MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24]
monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25]
monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26]
monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25]
NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27]
ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29]
ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29]
ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29]
SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30]
scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31]
SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32]
SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33]
ttest | t-test | stats (R v3.3) | TMM-normalized TPM | CRAN | [16, 35]
voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22]
Wilcoxon | Wilcoxon test | stats (R v3.3) | TMM-normalized TPM | CRAN | [16, 36]

Source: Soneson and Robinson 2018, Nature Methods, doi:10.1038/nmeth.4612
Common methods: More detailed examples
MAST
- uses a generalized linear hurdle model
- designed to account for stochastic dropouts and bimodal expression distributions in which expression is either strongly non-zero or non-detectable
- the rate of expression Z and the level of expression Y are modeled for each gene g, with $Z_{ig}$ indicating whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
- a logistic regression model for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):
  $$\mathrm{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$$
  $$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$$
  where $X_i$ is a design matrix
- model parameters are fitted using an empirical Bayesian framework
- allows for a joint estimate of nuisance and treatment effects
- DE is determined using the likelihood ratio test
Let's stop for a minute
The key
$$\mathrm{Outcome}_i = (\mathrm{Model}_i) + \mathrm{error}_i$$
- we collect data on a sample from a much larger population
- statistics lets us make inferences about the population from which the sample was derived
- we try to predict the outcome given a model fitted to the data
The key
$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$
Figure: histogram of height [cm] (x-axis: height [cm], 165 to 180; y-axis: frequency).
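The two-sample t statistic on this slide can be computed in a few lines. Below is a small sketch using only the Python standard library, where x̄1 and x̄2 are the group means, n1 and n2 the group sizes, and s_p is taken to be the pooled standard deviation (the usual equal-variance form). The function name `pooled_t` and the height values are made up for illustration.

```python
import math
from statistics import mean, variance


def pooled_t(x1, x2):
    """Two-sample t statistic with a pooled standard deviation."""
    n1, n2 = len(x1), len(x2)
    # pooled variance: weighted average of the two sample variances
    pooled_var = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    s_p = math.sqrt(pooled_var)
    return (mean(x1) - mean(x2)) / (s_p * math.sqrt(1 / n1 + 1 / n2))


# Hypothetical heights [cm] in two groups
heights_1 = [170, 172, 174]
heights_2 = [176, 178, 180]
t_stat = pooled_t(heights_1, heights_2)
print(f"t = {t_stat:.3f}")
```

The statistic is then compared with a t distribution with n1 + n2 - 2 degrees of freedom to obtain a p-value.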
Generic recipe
- model data, e.g. gene expression with random error
- fit model to the data and/or data to the model; estimate model parameters
- use model for prediction and/or inference

Important implication
- the better the model fits the data, the better the statistics
Common distributions: Negative Binomial
Figure: three example negative binomial distributions (x-axis: read counts; y-axis: frequency).
$$\mu = \mathrm{mu}, \qquad \sigma^2 = \mathrm{mu} + \frac{\mathrm{mu}^2}{\mathrm{size}}$$
mu: mean expression; size: the dispersion, which is inversely related to the variance. NB fits bulk RNA-seq data very well and it is used for most statistical methods designed for such data. In addition, it has been shown to fit the distribution of molecule counts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grün et al. 2014, Islam et al. 2011).
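The mean-variance relationship of the negative binomial can be checked by simulation. The sketch below draws NB counts as a gamma-Poisson mixture using only the Python standard library; the helper names, the parameter values (mu = 5, size = 2) and the simple Poisson sampler are choices made for this illustration, not part of any DE package.

```python
import math
import random


def nb_variance(mu, size):
    """Variance of a negative binomial with mean mu and dispersion size."""
    return mu + mu ** 2 / size


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) value (Knuth's method, adequate for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def sample_nb(mu, size, rng):
    """Draw one NB count: Poisson rate drawn from Gamma(shape=size, scale=mu/size)."""
    lam = rng.gammavariate(size, mu / size)
    return sample_poisson(lam, rng)


rng = random.Random(1)
mu, size = 5.0, 2.0
draws = [sample_nb(mu, size, rng) for _ in range(20000)]
sample_mean = sum(draws) / len(draws)
sample_var = sum((x - sample_mean) ** 2 for x in draws) / len(draws)
print(f"simulated mean {sample_mean:.2f}, variance {sample_var:.2f}")
print(f"theoretical mean {mu:.2f}, variance {nb_variance(mu, size):.2f}")
```

A smaller size gives a larger variance for the same mean, which is why size is described as inversely related to the variance.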
Common distributions: Zero-inflated NB
Figure: two example zero-inflated negative binomial distributions (x-axis: read counts; y-axis: frequency).
$$\mu = \mathrm{mu} \cdot (1 - d)$$
$$\sigma^2 = \mathrm{mu} \cdot (1 - d) \cdot \left(1 + d \cdot \mathrm{mu} + \frac{\mathrm{mu}}{\mathrm{size}}\right)$$
mu, size: mean and dispersion of the NB component; d: dropout rate. The dropout rate of a gene is strongly correlated with the mean expression of the gene. Different zero-inflated negative binomial models use different relationships between mu and d, and some may fit mu and d to the expression of each gene independently. Implemented in MAST, SCDE.
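As a sanity check on the zero-inflated NB moments, the following sketch sums the probability mass function directly and compares the result with the closed forms mean = mu * (1 - d) and variance = mu * (1 - d) * (1 + d * mu + mu / size), where mu and size are the parameters of the NB component and d is the dropout rate. The function names and parameter values are assumptions made for this example.

```python
import math


def nb_logpmf(k, mu, size):
    """Log-probability of count k under NB with mean mu and dispersion size."""
    p = size / (size + mu)
    return (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
            + size * math.log(p) + k * math.log(1 - p))


def zinb_pmf(k, mu, size, d):
    """Zero-inflated NB: with probability d the count is a dropout (zero)."""
    nb = math.exp(nb_logpmf(k, mu, size))
    return (d if k == 0 else 0.0) + (1 - d) * nb


def zinb_moments(mu, size, d):
    """Closed-form mean and variance of the zero-inflated NB."""
    mean = mu * (1 - d)
    var = mu * (1 - d) * (1 + d * mu + mu / size)
    return mean, var


mu, size, d = 5.0, 2.0, 0.3
support = range(500)  # the tail beyond 500 is negligible for these parameters
probs = [zinb_pmf(k, mu, size, d) for k in support]
numeric_mean = sum(k * p for k, p in zip(support, probs))
numeric_var = sum(k * k * p for k, p in zip(support, probs)) - numeric_mean ** 2
print(numeric_mean, numeric_var, zinb_moments(mu, size, d))
```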
Common distributions: Poisson-Beta
Figure: two example Poisson-Beta distributions (x-axis: read counts; y-axis: frequency).
$$\mu = g \cdot \frac{a}{a + b}$$
$$\sigma^2 = g^2 \cdot \frac{a \cdot b}{(a + b + 1) \cdot (a + b)^2}$$
a: the rate of activation of transcription; b: the rate of inhibition of transcription; g: the rate of transcript production while transcription is active at the locus. The variance given here is that of the underlying Beta-distributed rate; the Poisson sampling step adds a further $\mu$ to the variance of the observed counts. Differential expression methods may test each of the parameters for differences across groups or only one (often g). Implemented in BPSC.

May be further expanded to explicitly account for other sources of gene expression differences, such as batch effect or library depth, depending on the particular DE algorithm.
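A Poisson-Beta count can be simulated in two steps: draw the fraction of time the locus is active from Beta(a, b), then draw the count from a Poisson with rate g times that fraction. The sketch below does this with the Python standard library and compares the simulated moments with the formulas. All names and parameter values (a = 2, b = 3, g = 20) are assumptions for illustration; the variance of the simulated counts is compared with the rate variance plus the mean, which follows from the law of total variance.

```python
import math
import random


def poisson_beta_moments(a, b, g):
    """Mean, variance of the Beta-distributed rate, and variance of the counts."""
    mean = g * a / (a + b)
    rate_var = g ** 2 * a * b / ((a + b + 1) * (a + b) ** 2)
    count_var = mean + rate_var  # Poisson sampling adds the mean to the variance
    return mean, rate_var, count_var


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) value (Knuth's method, adequate for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def sample_poisson_beta(a, b, g, rng):
    """Draw one count: active fraction from Beta(a, b), then Poisson(g * fraction)."""
    active_fraction = rng.betavariate(a, b)
    return sample_poisson(g * active_fraction, rng)


rng = random.Random(7)
a, b, g = 2, 3, 20
draws = [sample_poisson_beta(a, b, g, rng) for _ in range(20000)]
sim_mean = sum(draws) / len(draws)
sim_var = sum((x - sim_mean) ** 2 for x in draws) / len(draws)
print(f"simulated mean {sim_mean:.2f}, variance {sim_var:.2f}")
print("theoretical (mean, rate variance, count variance):", poisson_beta_moments(a, b, g))
```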
MAST (revisited)
- uses a generalized linear hurdle model
- designed to account for stochastic dropouts and bimodal expression distributions in which expression is either strongly non-zero or non-detectable
- the rate of expression Z and the level of expression Y are modeled for each gene g, with $Z_{ig}$ indicating whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
- a logistic regression model for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):
  $$\mathrm{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$$
  $$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$$
  where $X_i$ is a design matrix
- model parameters are fitted using an empirical Bayesian framework
- allows for a joint estimate of nuisance and treatment effects
- DE is determined using the likelihood ratio test
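The two-part structure of a hurdle model can be illustrated for the simplest design: one gene and two groups of cells. The sketch below is not the MAST implementation; it omits the empirical Bayes step and any covariates, and all function names and data are invented for this example. With a single group indicator as the only predictor, the logistic part reduces to comparing the fractions of expressing cells and the Gaussian part to comparing the means of the non-zero values, so both likelihood ratio statistics have closed forms. Their sum is compared with a chi-square distribution with 2 degrees of freedom, whose survival function is exp(-x / 2).

```python
import math


def xlogy(x, y):
    """Return x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def max_binom_loglik(k, n):
    """Bernoulli log-likelihood of k expressing cells out of n at the MLE k / n."""
    return xlogy(k, k / n) + xlogy(n - k, (n - k) / n)


def discrete_lrt(a, b):
    """LRT statistic for a difference in the fraction of expressing cells (Z part)."""
    ka, kb = sum(v > 0 for v in a), sum(v > 0 for v in b)
    full = max_binom_loglik(ka, len(a)) + max_binom_loglik(kb, len(b))
    null = max_binom_loglik(ka + kb, len(a) + len(b))
    return max(0.0, 2 * (full - null))


def continuous_lrt(a, b):
    """LRT statistic for a difference in mean among expressing cells (Y | Z = 1 part)."""
    pos_a = [v for v in a if v > 0]
    pos_b = [v for v in b if v > 0]
    if not pos_a or not pos_b:
        return 0.0  # no information on the continuous part in this sketch
    n = len(pos_a) + len(pos_b)
    mean_a = sum(pos_a) / len(pos_a)
    mean_b = sum(pos_b) / len(pos_b)
    mean_all = (sum(pos_a) + sum(pos_b)) / n
    rss_full = sum((v - mean_a) ** 2 for v in pos_a) + sum((v - mean_b) ** 2 for v in pos_b)
    rss_null = sum((v - mean_all) ** 2 for v in pos_a + pos_b)
    if rss_full <= 0:
        return 0.0  # degenerate case (no within-group variation), skipped here
    return max(0.0, n * math.log(rss_null / rss_full))


def hurdle_lrt(a, b):
    """Combined statistic and p-value, assuming both parts contribute (2 df)."""
    stat = discrete_lrt(a, b) + continuous_lrt(a, b)
    p_value = math.exp(-stat / 2)  # chi-square survival function for 2 df
    return stat, p_value


# Hypothetical log-expression values per cell; zero means not detected
group_a = [0, 0, 0, 0, 0, 0, 1.2, 0.8, 1.0, 0]
group_b = [3.1, 2.9, 0, 3.3, 3.0, 2.8, 3.2, 0, 3.1, 2.7]
stat, p_value = hurdle_lrt(group_a, group_b)
print(f"LRT statistic = {stat:.2f}, p = {p_value:.2e}")
```

In this toy example both the detection rate and the expression level among detected cells differ between groups, so both parts add to the statistic.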
SCDE
- models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution
- the NB distribution models the transcripts that are amplified and detected
- the Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)
- a subset of robust genes is used to fit, via the EM algorithm, the parameters of the mixture of models
- for DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach
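The mixture idea can be illustrated with a single Bayes-rule step: given fixed mixture parameters, how likely is it that an observed count came from the low-level Poisson (dropout) component rather than from the NB (amplified) component? This is the kind of quantity an EM algorithm evaluates in its E-step. The sketch below is not the scde package API; the function names and all parameter values (mixing weight, background rate, NB mean and dispersion) are hypothetical.

```python
import math


def poisson_logpmf(k, lam):
    """Log-probability of count k under Poisson(lam)."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def nb_logpmf(k, mu, size):
    """Log-probability of count k under NB with mean mu and dispersion size."""
    p = size / (size + mu)
    return (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
            + size * math.log(p) + k * math.log(1 - p))


def dropout_posterior(k, pi_drop, lam0, mu, size):
    """Posterior probability that count k comes from the dropout component."""
    drop = pi_drop * math.exp(poisson_logpmf(k, lam0))
    amplified = (1 - pi_drop) * math.exp(nb_logpmf(k, mu, size))
    return drop / (drop + amplified)


# Hypothetical parameters: 30% dropouts, background rate 0.1, NB mean 20
params = dict(pi_drop=0.3, lam0=0.1, mu=20.0, size=2.0)
for count in (0, 1, 5, 20):
    print(count, round(dropout_posterior(count, **params), 4))
```

A zero count is attributed almost entirely to the dropout component, while larger counts are attributed to the amplified component.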
Monocle
- originally designed for ordering cells by progress through differentiation stages (pseudo-time)
- the mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as
  $$g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$$
  where Y is a specific gene expression level, $x_i$ are predictor variables, g is a link function (typically the log function) and $f_i$ are non-parametric functions (e.g. cubic splines)
- the observable expression level Y is then modeled using a GAM:
  $$E(Y) = s(\varphi_t(b_x, s_i)) + \epsilon$$
  where $\varphi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and s is a cubic smoothing function with three degrees of freedom; the error term $\epsilon$ is normally distributed with a mean of zero
- the DE test is performed using an approximate $\chi^2$ likelihood ratio test
Performance
No ground truth, i.e. no independently validated truth is available for testing

Known data
- using data we know something about to get positive controls

Simulated data
- null data sets by re-sampling
- modeling data sets based on various distributions

Comparing between methods and scenarios
- comparing numbers of DEs, incl. as a function of group size

Investigating results
- what do the expression and distributions of detected DEs look like?
- False positives (type I error) vs false negatives (type II error)
- Sensitivity and specificity
- Precision and recall

Figure: adapted from Wikipedia

Dal Molin, Baruzzo and Di Camillo 2017: 2 conditions of 100 cells each, simulated with 10 000 genes, out of which 2 000 set to DEs (based on NB and bimodal distributions)
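When the true DE genes are known, as in simulated data, these metrics follow directly from the overlap between the called genes and the truth. The sketch below mirrors the simulation layout above (10 000 genes, 2 000 true DEs), but the set of called genes is invented for illustration and does not reproduce any result from the cited study; the function names are likewise assumptions.

```python
def confusion_counts(called, truth, all_genes):
    """Count true/false positives and negatives for a set of called DE genes."""
    called, truth = set(called), set(truth)
    tp = len(called & truth)   # called and truly DE
    fp = len(called - truth)   # called but not DE (type I errors)
    fn = len(truth - called)   # DE but missed (type II errors)
    tn = len(set(all_genes)) - tp - fp - fn
    return tp, fp, tn, fn


def performance(tp, fp, tn, fn):
    """Standard performance metrics from the confusion counts."""
    return {
        "sensitivity": tp / (tp + fn),   # also called recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "false_positive_rate": fp / (fp + tn),
    }


all_genes = [f"gene{i}" for i in range(10000)]
true_de = all_genes[:2000]
called_de = all_genes[500:2400]  # hypothetical output of a DE method

tp, fp, tn, fn = confusion_counts(called_de, true_de, all_genes)
metrics = performance(tp, fp, tn, fn)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```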
Consistency
Miao et al. 2017
And so much more
Soneson and Robinson 2018: Bias, robustness and scalability in single-cell differential expression analysis
- 36 statistical approaches for DE analysis to compare the expression levels in the two groups of cells
- based on 9 data sets with 11 - 21 separate instances (sample size effect)
- extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods etc.
- conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets
Practicalities
Getting to know your data
- Example data: 46078 genes x 96 cells
- 22229 genes with no expression at all

Figure: histogram of read counts (x-axis: read counts; y-axis: frequency) and histogram of zero counts (x-axis: 0 counts; y-axis: frequency).
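A first look at the data often starts with counting genes that are never detected and checking how many zeros each gene has. The sketch below does this for a tiny made-up count table held in a plain dictionary (gene name mapped to counts per cell); the table, the gene names and the function name are assumptions for illustration only.

```python
def split_by_expression(counts):
    """Separate genes with at least one non-zero count from genes that are all zero."""
    expressed = {gene: row for gene, row in counts.items() if any(v > 0 for v in row)}
    silent = [gene for gene in counts if gene not in expressed]
    return expressed, silent


# Hypothetical count table: 4 genes x 4 cells
counts = {
    "gene1": [0, 0, 0, 0],
    "gene2": [5, 0, 2, 0],
    "gene3": [0, 0, 0, 0],
    "gene4": [1, 1, 0, 3],
}

expressed, silent = split_by_expression(counts)
zeros_per_gene = {gene: row.count(0) for gene, row in counts.items()}

print(f"{len(silent)} of {len(counts)} genes have no expression at all: {silent}")
print("zero counts per gene:", zeros_per_gene)
```

Genes with no expression in any cell carry no information for DE testing and are usually removed before running a method.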
Choosing DE methods
Soneson and Robinson 2018
Remembering the bigger picture
Stegle, Teichmann and Marioni 2015
- QC filtering
- Cell-cycle phase
- Normalization of cell-specific biases
- Confounding factors, incl. batch effects
- Detection rate, i.e. the fraction of detected genes per cell
- Imputation strategies for dropout values
- What is pragmatic: programming language, platform, speed, collaborative workflows etc.
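The detection rate mentioned above is simple to compute: for each cell, the fraction of genes with a non-zero count. A minimal sketch follows, assuming a count matrix stored as a list of rows (genes) with one column per cell; the matrix and the function name are made up for illustration.

```python
def detection_rate(matrix):
    """Fraction of detected (non-zero) genes for each cell (column)."""
    n_genes = len(matrix)
    n_cells = len(matrix[0])
    return [
        sum(row[cell] > 0 for row in matrix) / n_genes
        for cell in range(n_cells)
    ]


# Hypothetical counts: 4 genes (rows) x 3 cells (columns)
matrix = [
    [3, 0, 1],
    [0, 0, 2],
    [7, 1, 0],
    [2, 0, 4],
]
rates = detection_rate(matrix)
print(rates)
```

The resulting per-cell value can then be supplied as a covariate to methods that accept one, as in the DetRate variants evaluated by Soneson and Robinson 2018.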
Staying critical
Summary
What to remember from this hour?
https://www.menti.com & 25 06 78
Growing field
Angerer et al. 2017
https://www.scrna-tools.org/tools
Zappia, Phipson and Oshlack 2018
Summary
- scRNA-seq is a rapidly growing field
- DE is a common task, so many newer and better methods will be developed
- understanding basic statistical concepts enables one to think more like a statistician, to choose and evaluate methods given a data set
- staying critical, staying updated, staying connected
Bibliography
Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1–9. ISSN: 1367-4803. doi:10.1093/bioinformatics/bty329.
Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243–260. ISSN: 2095-4697. doi:10.1007/s40484-016-0089-7.
Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255–261. ISSN: 1548-7105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612.
Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN: 1664-8021. doi:10.3389/fgene.2017.00062.
Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data," no. May: 1–2. ISSN: 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549.
Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133–145.
Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85–91. ISSN: 2452-3100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004.
Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv: 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573.
- Outline
- Introduction
- Common methods
-
- More detailed examples
-
- Performance
- Practicalities
- Summary
- Bibliography
-
Outline
Outline
Introduction what is so special about scRNA-seq DECommon methods what is out therePerformance how do we know what is bestPracticalities what to do in real lifeSummary what to remember from this hour
Olga (NBIS) scRNA-seq DE February 2019 2 46
Outline
OutlineIntroduction what is so special about scRNA-seq DE
Common methods what is out therePerformance how do we know what is bestPracticalities what to do in real lifeSummary what to remember from this hour
Olga (NBIS) scRNA-seq DE February 2019 2 46
Outline
OutlineIntroduction what is so special about scRNA-seq DECommon methods what is out there
Performance how do we know what is bestPracticalities what to do in real lifeSummary what to remember from this hour
Olga (NBIS) scRNA-seq DE February 2019 2 46
Outline
OutlineIntroduction what is so special about scRNA-seq DECommon methods what is out therePerformance how do we know what is best
Practicalities what to do in real lifeSummary what to remember from this hour
Olga (NBIS) scRNA-seq DE February 2019 2 46
Outline
OutlineIntroduction what is so special about scRNA-seq DECommon methods what is out therePerformance how do we know what is bestPracticalities what to do in real life
Summary what to remember from this hour
Olga (NBIS) scRNA-seq DE February 2019 2 46
Outline
OutlineIntroduction what is so special about scRNA-seq DECommon methods what is out therePerformance how do we know what is bestPracticalities what to do in real lifeSummary what to remember from this hour
Olga (NBIS) scRNA-seq DE February 2019 2 46
Introduction
Introduction
Olga (NBIS) scRNA-seq DE February 2019 3 46
Introduction
What does differential expression mean to you
httpswwwmenticom
Olga (NBIS) scRNA-seq DE February 2019 4 46
Introduction
Figure Simplified scRNA-seq workflow [adapted from Wikipedia]
Olga (NBIS) scRNA-seq DE February 2019 5 46
Introduction
Figure Simplified scRNA-seq workflow [adapted from Wikipedia]
Olga (NBIS) scRNA-seq DE February 2019 6 46
Introduction
adapted from Wu et al 2017
Differential expression meanstaking read count data amp
performing statistical analysis to discoverquantitative changes in expression levels betweenexperimental groups
ie to decide whether for a given gene anobserved difference in read counts is significant(greater than what would be expected just due tonatural random variation)
Differential expression is an old problemknown from bulk RNA-seq and microarray studies
in fact building on one of the most commonstatistical problems ie comparing groups forstatistical differences
Olga (NBIS) scRNA-seq DE February 2019 7 46
Introduction
adapted from Wu et al 2017
Differential expression meanstaking read count data amp
performing statistical analysis to discoverquantitative changes in expression levels betweenexperimental groups
ie to decide whether for a given gene anobserved difference in read counts is significant(greater than what would be expected just due tonatural random variation)
Differential expression is an old problemknown from bulk RNA-seq and microarray studies
in fact building on one of the most commonstatistical problems ie comparing groups forstatistical differences
Olga (NBIS) scRNA-seq DE February 2019 7 46
Introduction
Differential expression is an old problem
So what is all the commotion about
httpswwwmenticom amp 25 06 78
scRNA-seq special characteristicshigh noise levels (technical and biological factors)low library sizeslow amount of available mRNAs results in amplification biases anddropout events3rsquo bias partial coverage and uneven depth (technical)stochastic nature of transcription (biological)multimodality in gene expression presence of multiple possiblecell states within a cell population (biological)
Olga (NBIS) scRNA-seq DE February 2019 8 46
Introduction
Differential expression is an old problem
So what is all the commotion about
httpswwwmenticom amp 25 06 78
scRNA-seq special characteristicshigh noise levels (technical and biological factors)low library sizeslow amount of available mRNAs results in amplification biases anddropout events3rsquo bias partial coverage and uneven depth (technical)stochastic nature of transcription (biological)multimodality in gene expression presence of multiple possiblecell states within a cell population (biological)
Olga (NBIS) scRNA-seq DE February 2019 8 46
Introduction
Rbm17 Rragc Slc1a3 Slc22a20 Smarcd1
Mybpc1 Nars Ndufa3 Nono Pgam2
Crispld2 Fbxw13 Hbxip Katna1 Lcorl
1300018J18Rik Arid2 Bend3 Ccdc104 Ccnt1
00 25 50 75 100 00 25 50 75 100 00 25 50 75 100 00 25 50 75 100 00 25 50 75 100
00
01
02
03
0
1
2
3
00
01
02
03
00
01
02
03
00
01
02
03
000
005
010
015
020
025
00
02
04
06
00
01
02
03
04
05
00
01
02
03
00
01
02
03
00
02
04
00
05
10
15
00
01
02
03
00
05
10
15
20
00
02
04
000
005
010
015
0
1
2
3
4
00
05
10
15
00
01
02
03
04
05
00
01
02
03
04
value
dens
ity
Based on tutorial data
Olga (NBIS) scRNA-seq DE February 2019 9 46
Common methods
Common methods
Olga (NBIS) scRNA-seq DE February 2019 10 46
Common methods
Simplified scRNA-seq workflow [adopted from httphemberg-labgithubio
Olga (NBIS) scRNA-seq DE February 2019 11 46
Generic non-parametric methods
- e.g. Wilcoxon rank-sum test, Kruskal-Wallis test, Kolmogorov-Smirnov test
- non-parametric tests generally convert observed expression values to ranks and test whether the distribution of ranks for one group is significantly different from the distribution of ranks for the other group
- some non-parametric methods fail in the presence of a large number of tied values, as is the case for dropouts (zeros) in single-cell RNA-seq expression data
- if the conditions for a parametric test hold, then it will typically be more powerful than a non-parametric test
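To make the point about ranks and ties concrete, here is a minimal sketch of a two-sided Wilcoxon rank-sum test in plain Python. It assumes the normal approximation with the usual tie correction and no continuity correction; the function names `midranks` and `rank_sum_test` are hypothetical and not taken from any package mentioned in the slides.

```python
import math

def midranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def rank_sum_test(x, y):
    """Two-sided rank-sum test: returns (U statistic of x, approximate p-value)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = list(x) + list(y)
    ranks = midranks(pooled)
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    # Tie correction: sum of t^3 - t over every group of t tied values.
    tie_counts = {}
    for v in pooled:
        tie_counts[v] = tie_counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in tie_counts.values())
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0  # every value is tied, so the ranks carry no information
    z = (u - n1 * n2 / 2) / math.sqrt(var_u)
    return u, math.erfc(abs(z) / math.sqrt(2))

# Hypothetical gene: detected in a single cell of group x, dropouts everywhere else.
u, p = rank_sum_test([0, 0, 0, 0, 5], [0, 0, 0, 0, 0])
print(u, round(p, 4))
```

With nine of ten values tied at zero, the variance of U shrinks and the test has very little to work with, which is the failure mode described above.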
Developed for bulk RNA-seq
- e.g. edgeR, DESeq2
- compare estimates of mean expression (sample size)
- based on the negative binomial distribution
- can be assessed with datasets where RNA-seq data has been validated by RT-qPCR
Developed for scRNA-seq
- e.g. MAST, SCDE, Monocle, Pagoda, D3E etc.
- large number of samples (i.e. cells) for each group we are comparing in single-cell experiments. Thus we can take advantage of the whole distribution of expression values in each group to identify differences between groups
- we usually do not have a defined set of experimental conditions; instead we try to identify the cell groups by using an unsupervised clustering approach
Miao and Zhang 2016
Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle.

Short name | Method | Software version | Input | Available from | Reference
BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11]
D3E | D3E | D3E 1.0 | raw counts | GitHub | [12]
DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13]
DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14]
edgeRLRT | edgeRLRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17]
edgeRLRTcensus | edgeRLRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17]
edgeRLRTdeconv | edgeRLRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18]
edgeRLRTrobust | edgeRLRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19]
edgeRQLF | edgeRQLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
edgeRQLFDetRate | edgeRQLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22]
MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24]
monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25]
monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26]
monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25]
NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27]
ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29]
ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29]
ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29]
SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30]
scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31]
SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32]
SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33]
ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35]
voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22]
Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36]

Soneson and Robinson 2018, Nature Methods, doi:10.1038/nmeth.4612
More detailed examples
MAST
- uses a generalized linear hurdle model
- designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable
- the rate of expression Z and the level of expression Y are modeled for each gene g, with Z_ig indicating whether gene g is expressed in cell i (i.e. Z_ig = 0 if y_ig = 0 and Z_ig = 1 if y_ig > 0)
- a logistic regression model for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):
$$\mathrm{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$$
$$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$$
where X_i is a design matrix
- model parameters are fitted using an empirical Bayesian framework
- allows for a joint estimate of nuisance and treatment effects
- DE is determined using the likelihood ratio test
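The two-part idea behind a hurdle model can be sketched for a single gene and two groups of cells. This is a deliberately simplified illustration, not MAST's implementation: it assumes no covariates other than group, a shared Gaussian variance, plain maximum likelihood instead of the empirical Bayes step, and input values on a log scale where 0 means "not detected". The function name `hurdle_lrt` and its helpers are hypothetical.

```python
import math

def _xlogy(x, y):
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)

def _bernoulli_loglik(k, n):
    """Maximised Bernoulli log-likelihood for k detections among n cells."""
    p = k / n
    return _xlogy(k, p) + _xlogy(n - k, 1 - p)

def _rss(values, mean):
    return sum((v - mean) ** 2 for v in values)

def hurdle_lrt(group_a, group_b):
    """Return (statistic, degrees of freedom, p-value) for one gene."""
    pos_a = [v for v in group_a if v > 0]
    pos_b = [v for v in group_b if v > 0]
    n_a, n_b = len(group_a), len(group_b)

    # Discrete part: does the detection rate Pr(Z = 1) differ between groups?
    ll_alt = _bernoulli_loglik(len(pos_a), n_a) + _bernoulli_loglik(len(pos_b), n_b)
    ll_null = _bernoulli_loglik(len(pos_a) + len(pos_b), n_a + n_b)
    stat_d = max(0.0, 2 * (ll_alt - ll_null))

    # Continuous part: does the mean of Y | Z = 1 differ between groups?
    stat_c, df = 0.0, 1
    if pos_a and pos_b:
        pooled = pos_a + pos_b
        rss_null = _rss(pooled, sum(pooled) / len(pooled))
        rss_alt = (_rss(pos_a, sum(pos_a) / len(pos_a))
                   + _rss(pos_b, sum(pos_b) / len(pos_b)))
        if rss_alt > 0:
            stat_c = max(0.0, len(pooled) * math.log(rss_null / rss_alt))
            df = 2

    stat = stat_d + stat_c
    # The chi-square survival function has a closed form for 1 and 2 df.
    p = math.exp(-stat / 2) if df == 2 else math.erfc(math.sqrt(stat / 2))
    return stat, df, p

# Hypothetical log-expression values for one gene in two groups of ten cells.
low = [0, 0, 0, 0, 0, 0, 0, 0, 1.0, 1.2]
high = [3.0, 3.5, 4.0, 3.2, 3.8, 0, 4.1, 3.9, 3.6, 3.3]
print(hurdle_lrt(low, high))
```

Summing the two likelihood ratio statistics lets a gene be called DE because its detection rate changes, because its expression level among expressing cells changes, or both.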
Let's stop for a minute
The key
$$\text{Outcome}_i = (\text{Model}_i) + \text{error}_i$$
- we collect data on a sample from a much larger population
- statistics lets us make inferences about the population from which the sample was derived
- we try to predict the outcome given a model fitted to the data
$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$
Figure: histogram of height [cm] (x-axis) against Frequency (y-axis)
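As a worked instance of the two-sample t statistic with pooled standard deviation s_p, the sketch below computes t and its degrees of freedom in plain Python. The height values are made up for illustration and the function name `pooled_t` is hypothetical.

```python
import math

def pooled_t(x, y):
    """Two-sample t statistic with pooled standard deviation; returns (t, df)."""
    n1, n2 = len(x), len(y)
    m1, m2 = sum(x) / n1, sum(y) / n2
    ss1 = sum((v - m1) ** 2 for v in x)
    ss2 = sum((v - m2) ** 2 for v in y)
    # Pooled standard deviation: both groups are assumed to share one variance.
    sp = math.sqrt((ss1 + ss2) / (n1 + n2 - 2))
    t = (m1 - m2) / (sp * math.sqrt(1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

# Made-up heights in cm for two small groups.
group_1 = [168.0, 171.5, 173.0, 175.5, 170.0]
group_2 = [172.0, 176.5, 178.0, 174.5, 179.0]
print(pooled_t(group_1, group_2))
```

The same "difference divided by its standard error" structure sits underneath most of the DE tests discussed here; what changes is the model used to estimate the mean and its uncertainty.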
Generic recipe
- model data, e.g. gene expression with random error
- fit the model to the data and/or the data to the model; estimate model parameters
- use the model for prediction and/or inference
Important implication
- the better the model fits the data, the better the statistics
Common distributions
Negative Binomial
Figure: three histograms of negative binomial read counts (x-axis: Read Counts, y-axis: Frequency)
$$\mu = \mathit{mu}$$
$$\sigma^2 = \mathit{mu} + \frac{\mathit{mu}^2}{\mathit{size}}$$
mu is the mean expression and size is the dispersion parameter, which is inversely related to the variance. NB fits bulk RNA-seq data very well and it is used by most statistical methods designed for such data. In addition, it has been shown to fit the distribution of molecule counts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grun et al 2014, Islam et al 2011).
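The mean and variance of the negative binomial in the (mu, size) parameterisation can be checked numerically by summing over its probability mass function. This is a small sketch using only the standard library; `nb_pmf` and `nb_moments` are hypothetical helper names, and the truncation point `k_max` is an assumption that is adequate for the small means used here.

```python
import math

def nb_pmf(k, mu, size):
    """P(X = k) for a negative binomial with mean mu > 0 and dispersion parameter size."""
    p = size / (size + mu)
    log_pmf = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
               + size * math.log(p) + k * math.log(1 - p))
    return math.exp(log_pmf)

def nb_moments(mu, size, k_max=2000):
    """Mean and variance obtained by summing the pmf over 0..k_max."""
    probs = [nb_pmf(k, mu, size) for k in range(k_max + 1)]
    mean = sum(k * pk for k, pk in enumerate(probs))
    var = sum((k - mean) ** 2 * pk for k, pk in enumerate(probs))
    return mean, var

mu, size = 5.0, 2.0
mean, var = nb_moments(mu, size)
print(mean, var, mu + mu ** 2 / size)
```

A smaller size gives a larger variance for the same mean, which is why size is described as inversely related to the variance.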
Zero-inflated NB
Figure: two histograms of zero-inflated negative binomial read counts (x-axis: Read Counts, y-axis: Frequency)
$$\mu = \mathit{mu} \cdot (1 - d)$$
$$\sigma^2 = \mathit{mu} \cdot (1 - d) \cdot \left(1 + d \cdot \mathit{mu} + \frac{\mathit{mu}}{\mathit{size}}\right)$$
d is the dropout rate, and mu and size are the mean and dispersion parameter of the underlying negative binomial. The dropout rate of a gene is strongly correlated with the mean expression of the gene. Different zero-inflated negative binomial models use different relationships between mu and d, and some may fit mu and d to the expression of each gene independently. Implemented in MAST, SCDE.
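The zero-inflated mean and variance can be derived in a few lines. The sketch below assumes a count X = B N, where B is a Bernoulli indicator with Pr(B = 1) = 1 - d (the cell did not drop out) and N is an independent negative binomial count with mean mu and dispersion parameter size; the symbols X, B and N are notation introduced here for the derivation, not taken from the slides.

```latex
\begin{aligned}
E[X] &= E[B]\,E[N] = (1-d)\,\mathit{mu} \\
E[X^2] &= E[B^2]\,E[N^2]
        = (1-d)\left(\mathit{mu} + \frac{\mathit{mu}^2}{\mathit{size}} + \mathit{mu}^2\right) \\
\mathrm{Var}(X) &= E[X^2] - E[X]^2 \\
  &= (1-d)\,\mathit{mu}\left(1 + \frac{\mathit{mu}}{\mathit{size}} + \mathit{mu} - (1-d)\,\mathit{mu}\right) \\
  &= \mathit{mu}\,(1-d)\left(1 + d\,\mathit{mu} + \frac{\mathit{mu}}{\mathit{size}}\right)
\end{aligned}
```

The second line uses B^2 = B and the negative binomial second moment E[N^2] = Var(N) + E[N]^2. Setting d = 0 recovers the plain negative binomial variance mu + mu^2/size.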
Poisson-Beta
Figure: two histograms of Poisson-Beta read counts (x-axis: Read Counts, y-axis: Frequency)
$$\mu = g \cdot \frac{a}{a + b}$$
$$\sigma^2 = g^2 \cdot \frac{a \cdot b}{(a + b + 1) \cdot (a + b)^2}$$
a is the rate of activation of transcription, b is the rate of inhibition of transcription, and g is the rate of transcript production while transcription is active at the locus. Note that this σ² is the variance of the Beta-distributed transcription rate scaled by g; Poisson sampling adds a further µ to the variance of the observed counts. Differential expression methods may test each of the parameters for differences across groups, or only one (often g). Implemented in BPSC.
May be further expanded to explicitly account for other sources of gene expression differences, such as batch effect or library depth, depending on the particular DE algorithm.
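A quick simulation shows how the Poisson-Beta parameters shape the counts. The sketch assumes the usual construction (draw a rate from Beta(a, b), scale it by g, then draw a Poisson count with that mean) and uses Knuth's simple Poisson sampler because the Python standard library has none; all function names and parameter values are hypothetical. By the law of total variance, the simulated counts should have variance g·a/(a+b) + g²·a·b/((a+b+1)·(a+b)²), i.e. the Beta term plus the Poisson mean.

```python
import math
import random

def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for small lam."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def sample_poisson_beta(a, b, g, n, seed=0):
    """Draw n counts: rate ~ g * Beta(a, b), count ~ Poisson(rate)."""
    rng = random.Random(seed)
    return [sample_poisson(g * rng.betavariate(a, b), rng) for _ in range(n)]

def mean_var(xs):
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

a, b, g = 1.0, 3.0, 20.0
counts = sample_poisson_beta(a, b, g, n=20000)
mean, var = mean_var(counts)
expected_mean = g * a / (a + b)
expected_var = expected_mean + g ** 2 * a * b / ((a + b + 1) * (a + b) ** 2)
print(mean, expected_mean)
print(var, expected_var)
```

With a small a and larger b the locus is mostly inactive, so the simulated counts pile up near zero with a long right tail, much like the bursty expression seen in single cells.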
SCDE
- models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution
- the NB distribution models the transcripts that are amplified and detected
- the Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)
- a subset of robust genes is used to fit, via the EM algorithm, the parameters of the mixture of models
- for DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach
Monocle
- originally designed for ordering cells by progress through differentiation stages (pseudo-time)
- the mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as
$$g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$$
where Y is a specific gene expression level, x_i are predictor variables, g is a link function (typically the log function), and f_i are non-parametric functions (e.g. cubic splines)
- the observable expression level Y is then modeled using a GAM
$$E(Y) = s(\varphi_t(b_x, s_i)) + \epsilon$$
where \varphi_t(b_x, s_i) is the assigned pseudo-time of a cell and s is a cubic smoothing function with three degrees of freedom. The error term \epsilon is normally distributed with a mean of zero
- the DE test is performed using an approximate \chi^2 likelihood ratio test
Performance
No ground truth, i.e. no independently validated truth is available for testing
Known data
- using data we know something about to get positive controls
Simulated data
- null data sets by re-sampling
- modeling data sets based on various distributions
Comparing between methods and scenarios
- comparing numbers of DE genes, incl. as a function of group size
Investigating results
- what do the expression levels and distributions of the detected DE genes look like?
- False positives (type I error) vs false negatives (type II error)
- Sensitivity and specificity
- Precision and recall
Figure: adapted from Wikipedia
Figure: Dal Molin, Baruzzo and Di Camillo 2017: 2 conditions of 100 cells each, simulated with 10 000 genes, out of which 2 000 were set to be DE (based on NB and bimodal distributions)
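The performance measures listed above all come from the same four counts. The sketch below uses the simulation design quoted from Dal Molin, Baruzzo and Di Camillo (10 000 genes, 2 000 truly DE) together with a made-up result for a hypothetical method that calls 1 800 genes, 1 500 of them correctly; the function name `classification_metrics` is likewise hypothetical.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard performance measures from a 2 x 2 confusion table."""
    return {
        "sensitivity": tp / (tp + fn),          # also called recall or true positive rate
        "specificity": tn / (tn + fp),          # true negative rate
        "precision": tp / (tp + fp),            # fraction of calls that are truly DE
        "false_positive_rate": fp / (fp + tn),  # type I error rate among non-DE genes
        "false_negative_rate": fn / (fn + tp),  # type II error rate among DE genes
    }

n_genes, n_true_de = 10_000, 2_000   # simulation design from the slide
n_called, n_correct = 1_800, 1_500   # made-up result for illustration

tp = n_correct
fp = n_called - n_correct
fn = n_true_de - n_correct
tn = n_genes - tp - fp - fn

for name, value in classification_metrics(tp, fp, tn, fn).items():
    print(f"{name}: {value:.3f}")
```

Because non-DE genes usually far outnumber DE genes, specificity can look excellent while precision is mediocre, so it helps to report both.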
Consistency
Miao et al 2017
And so much more
Soneson and Robinson 2018: Bias, robustness and scalability in single-cell differential expression analysis
- 36 statistical approaches for DE analysis to compare the expression levels in the two groups of cells
- based on 9 data sets with 11 - 21 separate instances (sample size effect)
- extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods etc.
- conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets
Practicalities
Getting to know your data
Example data: 46078 genes x 96 cells; 22229 genes with no expression at all
Figure: two histograms, one of read counts (x-axis: Read Counts, y-axis: Frequency) and one of zero counts (x-axis: 0 counts, y-axis: Frequency)
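Summaries like "genes with no expression at all" are simple to compute before any DE testing. The sketch below assumes a genes x cells count matrix stored as a plain list of rows; the tiny matrix is made up and the function name `zero_summary` is hypothetical.

```python
def zero_summary(counts):
    """counts: one row per gene, each row a list of per-cell read counts."""
    n_cells = len(counts[0])
    # Genes whose count is zero in every cell.
    genes_all_zero = sum(1 for row in counts if not any(row))
    # Number of undetected genes in each cell.
    zeros_per_cell = [sum(1 for row in counts if row[j] == 0) for j in range(n_cells)]
    # Total reads per cell, i.e. the library size.
    library_sizes = [sum(row[j] for row in counts) for j in range(n_cells)]
    return genes_all_zero, zeros_per_cell, library_sizes

# Made-up matrix: 4 genes x 3 cells.
toy_counts = [
    [0, 0, 0],
    [5, 0, 12],
    [0, 3, 0],
    [7, 1, 0],
]
all_zero, zeros_per_cell, library_sizes = zero_summary(toy_counts)
print(all_zero, zeros_per_cell, library_sizes)
```

Genes that are zero everywhere carry no information for DE and are usually filtered out first, while the zeros per cell and library sizes feed into the QC and detection-rate considerations.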
Choosing DE methods
Soneson and Robinson 2018
Remembering the bigger picture
Stegle, Teichmann and Marioni 2015
- QC filtering
- Cell-cycle phase
- Normalization of cell-specific biases
- Confounding factors, incl. batch effects
- Detection rate, i.e. the fraction of detected genes per cell
- Imputation strategies for dropout values
- What is pragmatic: programming language, platform, speed, collaborative workflows etc.
Staying critical
Summary
What to remember from this hour
https://www.menti.com & 25 06 78
Growing field
Angerer et al 2017
https://www.scrna-tools.org/tools
Zappia, Phipson and Oshlack 2018
- scRNA-seq is a rapidly growing field
- DE is a common task, so many newer and better methods will be developed
- understanding basic statistical concepts enables one to think more like a statistician when choosing and evaluating methods for a given data set
- staying critical, staying updated, staying connected
Bibliography
Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1–9. ISSN 1367-4803. doi:10.1093/bioinformatics/bty329
Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243–260. ISSN 2095-4697. doi:10.1007/s40484-016-0089-7
Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255–261. ISSN 1548-7105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612
Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN 1664-8021. doi:10.3389/fgene.2017.00062
Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data," no. May: 1–2. ISSN 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549
Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133–145
Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85–91. ISSN 2452-3100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004
Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv: 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573
- Outline
- Introduction
- Common methods
-
- More detailed examples
-
- Performance
- Practicalities
- Summary
- Bibliography
-
Outline
OutlineIntroduction what is so special about scRNA-seq DE
Common methods what is out therePerformance how do we know what is bestPracticalities what to do in real lifeSummary what to remember from this hour
Olga (NBIS) scRNA-seq DE February 2019 2 46
Outline
OutlineIntroduction what is so special about scRNA-seq DECommon methods what is out there
Performance how do we know what is bestPracticalities what to do in real lifeSummary what to remember from this hour
Olga (NBIS) scRNA-seq DE February 2019 2 46
Outline
OutlineIntroduction what is so special about scRNA-seq DECommon methods what is out therePerformance how do we know what is best
Practicalities what to do in real lifeSummary what to remember from this hour
Olga (NBIS) scRNA-seq DE February 2019 2 46
Outline
OutlineIntroduction what is so special about scRNA-seq DECommon methods what is out therePerformance how do we know what is bestPracticalities what to do in real life
Summary what to remember from this hour
Olga (NBIS) scRNA-seq DE February 2019 2 46
Outline
OutlineIntroduction what is so special about scRNA-seq DECommon methods what is out therePerformance how do we know what is bestPracticalities what to do in real lifeSummary what to remember from this hour
Olga (NBIS) scRNA-seq DE February 2019 2 46
Introduction
Introduction
Olga (NBIS) scRNA-seq DE February 2019 3 46
Introduction
What does differential expression mean to you
httpswwwmenticom
Olga (NBIS) scRNA-seq DE February 2019 4 46
Introduction
Figure Simplified scRNA-seq workflow [adapted from Wikipedia]
Olga (NBIS) scRNA-seq DE February 2019 5 46
Introduction
Figure Simplified scRNA-seq workflow [adapted from Wikipedia]
Olga (NBIS) scRNA-seq DE February 2019 6 46
Introduction
adapted from Wu et al 2017
Differential expression meanstaking read count data amp
performing statistical analysis to discoverquantitative changes in expression levels betweenexperimental groups
ie to decide whether for a given gene anobserved difference in read counts is significant(greater than what would be expected just due tonatural random variation)
Differential expression is an old problemknown from bulk RNA-seq and microarray studies
in fact building on one of the most commonstatistical problems ie comparing groups forstatistical differences
Olga (NBIS) scRNA-seq DE February 2019 7 46
Introduction
adapted from Wu et al 2017
Differential expression meanstaking read count data amp
performing statistical analysis to discoverquantitative changes in expression levels betweenexperimental groups
ie to decide whether for a given gene anobserved difference in read counts is significant(greater than what would be expected just due tonatural random variation)
Differential expression is an old problemknown from bulk RNA-seq and microarray studies
in fact building on one of the most commonstatistical problems ie comparing groups forstatistical differences
Olga (NBIS) scRNA-seq DE February 2019 7 46
Introduction
Differential expression is an old problem
So what is all the commotion about
httpswwwmenticom amp 25 06 78
scRNA-seq special characteristicshigh noise levels (technical and biological factors)low library sizeslow amount of available mRNAs results in amplification biases anddropout events3rsquo bias partial coverage and uneven depth (technical)stochastic nature of transcription (biological)multimodality in gene expression presence of multiple possiblecell states within a cell population (biological)
Olga (NBIS) scRNA-seq DE February 2019 8 46
Introduction
Differential expression is an old problem
So what is all the commotion about
httpswwwmenticom amp 25 06 78
scRNA-seq special characteristicshigh noise levels (technical and biological factors)low library sizeslow amount of available mRNAs results in amplification biases anddropout events3rsquo bias partial coverage and uneven depth (technical)stochastic nature of transcription (biological)multimodality in gene expression presence of multiple possiblecell states within a cell population (biological)
Olga (NBIS) scRNA-seq DE February 2019 8 46
Introduction
Rbm17 Rragc Slc1a3 Slc22a20 Smarcd1
Mybpc1 Nars Ndufa3 Nono Pgam2
Crispld2 Fbxw13 Hbxip Katna1 Lcorl
1300018J18Rik Arid2 Bend3 Ccdc104 Ccnt1
00 25 50 75 100 00 25 50 75 100 00 25 50 75 100 00 25 50 75 100 00 25 50 75 100
00
01
02
03
0
1
2
3
00
01
02
03
00
01
02
03
00
01
02
03
000
005
010
015
020
025
00
02
04
06
00
01
02
03
04
05
00
01
02
03
00
01
02
03
00
02
04
00
05
10
15
00
01
02
03
00
05
10
15
20
00
02
04
000
005
010
015
0
1
2
3
4
00
05
10
15
00
01
02
03
04
05
00
01
02
03
04
value
dens
ity
Based on tutorial data
Olga (NBIS) scRNA-seq DE February 2019 9 46
Common methods
Common methods
Olga (NBIS) scRNA-seq DE February 2019 10 46
Common methods
Simplified scRNA-seq workflow [adopted from httphemberg-labgithubio
Olga (NBIS) scRNA-seq DE February 2019 11 46
Common methods
Generic non-parametric methodseg Wilcoxon rank-sum test Kruskal-Wallis Kolmogorov-Smirnov testnon-parametric tests generally convert observed expression values toranks amp test whether the distribution of ranks for one group aresignficantly different from the distribution of ranks for the other groupsome non-parametric methods fail in the presence of a large number oftied values such as the case for dropouts (zeros) in single-cell RNA-seqexpression dataif the conditions for a parametric test hold then it will typically be morepowerful than a non-parametric test
Olga (NBIS) scRNA-seq DE February 2019 12 46
Common methods
developed for bulk RNA-seqeg edgeR DE-seq2compare estimates of mean-expression (sample size)based on negative binomial distributioncan be assessed by datasets where RNA-seq data has beeen validatedby RT-qPCR
Olga (NBIS) scRNA-seq DE February 2019 13 46
Common methods
developed for scRNA-seq
eg MAST SCDE Monocle Pagoda D3E etclarge number of samples (ie cells) for each group we are comparing insingle-cell experiments Thus we can take advantage of the wholedistribution of expression values in each group to identify differencesbetween groupswe usually do not have a defined set of experimental conditions insteadwe try to identify the cell groups by using an unsupervised clusteringapproach
Olga (NBIS) scRNA-seq DE February 2019 14 46
Common methods
Miao and Zhang 2016
Olga (NBIS) scRNA-seq DE February 2019 15 46
Common methodsSupplementary Table 2 Evaluated dicrarrerential expression methods together with package versions and thetype of input values provided to each of them Note that ldquoraw countsrdquo here refers to length-scaled TPMswhich are on the scale of the raw counts but are unacrarrected by dicrarrerential isoform usage [10] CPM valuesare calculated with edgeR and Census counts with monocle
Short name Method Software version InputAvailablefrom
Reference
BPSC BPSC BPSC 09901 CPM GitHub [11]
D3E D3E D3E 10 raw counts GitHub [12]
DESeq2 DESeq2 DESeq2 1141 raw counts Bioconductor [13]
DESeq2betapFALSE DESeq2 without beta prior DESeq2 1141 raw counts Bioconductor [13]
DESeq2census DESeq2 DESeq2 1141 Census counts Bioconductor [13]
DESeq2nofiltDESeq2 without the built-in in-dependent filtering
DESeq2 1141 raw counts Bioconductor [13]
DEsingle DEsingle DEsingle 010 raw counts GitHub [14]
edgeRLRT edgeRLRT edgeR 3191 raw counts Bioconductor [15ndash17]
edgeRLRTcensus edgeRLRT edgeR 3191 Census counts Bioconductor [15ndash17]
edgeRLRTdeconvedgeRLRT with deconvolutionnormalization
edgeR 3191scran 120
raw counts Bioconductor [15 17 18]
edgeRLRTrobustedgeRLRT with robust disper-sion estimation
edgeR 3191 raw counts Bioconductor [15ndash17 19]
edgeRQLF edgeRQLF edgeR 3191 raw counts Bioconductor [15 16 20]
edgeRQLFDetRateedgeRQLF with cellular detec-tion rate as covariate
edgeR 3191 raw counts Bioconductor [15 16 20]
limmatrend limma-trend limma 33013 log2(CPM) Bioconductor [21 22]
MASTcpm MAST MAST 105 log2(CPM+1) Bioconductor [23]
MASTcpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(CPM+1) Bioconductor [23]
MASTtpm MAST MAST 105 log2(TPM+1) Bioconductor [23]
MASTtpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(TPM+1) Bioconductor [23]
metagenomeSeq metagenomeSeqmetagenomeSeq1160
raw counts Bioconductor [24]
monocle monocle (tobit) monocle 220 TPM Bioconductor [25]
monoclecensus monocle (Negative Binomial) monocle 220 Census counts Bioconductor [25 26]
monoclecount monocle (Negative Binomial) monocle 220 raw counts Bioconductor [25]
NODES NODESNODES0009010
raw countsAuthor-providedlink
[27]
ROTScpm ROTS ROTS 120 CPM Bioconductor [28 29]
ROTStpm ROTS ROTS 120 TPM Bioconductor [28 29]
ROTSvoom ROTS ROTS 120voom-transformedraw counts
Bioconductor [28 29]
SAMseq SAMseq samr 20 raw counts CRAN [30]
scDD scDD scDD 100 raw counts Bioconductor [31]
SCDE SCDE scde 220 raw counts Bioconductor [32]
SeuratBimod Seurat (bimod test) Seurat 1407 raw counts GitHub [33 34]
SeuratBimodnofiltSeurat (bimod test) without theinternal filtering
Seurat 1407 raw counts GitHub [33 34]
SeuratBimodIsExpr2Seurat (bimod test) with internalexpression threshold set to 2
Seurat 1407 raw counts GitHub [33 34]
SeuratTobit Seurat (tobit test) Seurat 1407 TPM GitHub [25 33]
ttest t-test stats (R v 33)TMM-normalizedTPM
CRAN [16 35]
voomlimma voom-limma limma 33013 raw counts Bioconductor [21 22]
Wilcoxon Wilcoxon test stats (R v 33)TMM-normalizedTPM
CRAN [16 36]
3
Nature Methods doi101038nmeth4612Soneson and Robinson 2018
Olga (NBIS) scRNA-seq DE February 2019 16 46
Common methods More detailed examples
More detailed examples
Olga (NBIS) scRNA-seq DE February 2019 17 46
Common methods More detailed examples
MAST
uses generalized linear hurdle model
designed to account for stochastic dropouts and bimodal expression distribution in whichexpression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene gindicating whether gene g is expressed in cell i (ie Zig = 0 if yig = 0 and zig = 1 ifyig gt 0)
A logistic regression model for the discrete variable Z and a Gaussian linear model for thecontinuous variable (Y|Z=1)
logit(Pr (Zig = 1)) = XiβDg
Pr (Yig = Y |Zig = 1) = N(XiβCg σ
2g) where Xi is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
Olga (NBIS) scRNA-seq DE February 2019 18 46
Common methods More detailed examples
Letrsquos stop for a minute
Olga (NBIS) scRNA-seq DE February 2019 19 46
Common methods More detailed examples
The key
Outcomei = (Modeli) + errori
we collect data on a sample from a much larger population
statistics lets us to make inferences about the population from which sample wasderived
we try to predict the outcome given a model fitted to the data
Olga (NBIS) scRNA-seq DE February 2019 20 46
Common methods More detailed examples
The key
t = x1minusx2
sp
radic1
n1+ 1
n2
height [cm]
Fre
quen
cy
165 170 175 180
010
3050
Generic recipe
model the data, e.g. gene expression with random error
fit the model to the data and/or the data to the model
estimate the model parameters
use the model for prediction and/or inference
Important implication
the better the model fits the data, the better the statistics
Common distributions
Negative Binomial
[Figure: three histograms of negative binomial read counts; x-axis: read counts, y-axis: frequency]
$\mu = \mathrm{mu}$
$\sigma^2 = \mathrm{mu} + \dfrac{\mathrm{mu}^2}{\mathrm{size}}$
mu: mean expression; size: the dispersion parameter, which is inversely related to the variance. The NB fits bulk RNA-seq data very well and is used by most statistical methods designed for such data. In addition, it has been shown to fit the distribution of molecule counts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grün et al. 2014, Islam et al. 2011).
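As a quick numerical check of the mean and variance above, the sketch below evaluates the negative binomial probability mass function in the (mu, size) parametrization and sums it over a truncated range of counts. The helper names `nb_pmf` and `moments` and the parameter values are illustrative assumptions.

```python
import math


def nb_pmf(k, mu, size):
    """Negative binomial pmf with mean mu > 0 and dispersion parameter size > 0."""
    p = size / (size + mu)
    log_pmf = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
               + size * math.log(p) + k * math.log(1 - p))
    return math.exp(log_pmf)


def moments(pmf, max_count=500):
    """Mean and variance of a count distribution, truncated at max_count."""
    mean = sum(k * pmf(k) for k in range(max_count))
    second = sum(k * k * pmf(k) for k in range(max_count))
    return mean, second - mean ** 2


if __name__ == "__main__":
    mu, size = 5.0, 2.0
    mean, var = moments(lambda k: nb_pmf(k, mu, size))
    print(mean, var)                 # numerically summed moments
    print(mu, mu + mu ** 2 / size)   # formulas from the slide
```

A smaller size gives a larger variance for the same mean, which is what "inversely related to the variance" refers to.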
Common distributions
Zero-inflated NB
[Figure: two histograms of zero-inflated negative binomial read counts; x-axis: read counts, y-axis: frequency]
$\mu = \mathrm{mu} \cdot (1 - d)$
$\sigma^2 = \mathrm{mu} \cdot (1 - d) \cdot \left(1 + d \cdot \mathrm{mu} + \dfrac{\mathrm{mu}}{\mathrm{size}}\right)$
d: dropout rate. The dropout rate of a gene is strongly correlated with the mean expression of the gene. Different zero-inflated negative binomial models use different relationships between mu and d, and some may fit mu and d to the expression of each gene independently. Implemented in MAST, SCDE.
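The zero-inflated moments can be checked the same way. This sketch assumes the usual mixture form, in which a count is an excess zero with probability d and an NB draw otherwise; the function names and parameter values are hypothetical, and mu and size refer to the underlying NB component.

```python
import math


def nb_pmf(k, mu, size):
    """Negative binomial pmf with mean mu > 0 and dispersion parameter size > 0."""
    p = size / (size + mu)
    return math.exp(math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
                    + size * math.log(p) + k * math.log(1 - p))


def zinb_pmf(k, mu, size, d):
    """Zero-inflated NB: extra mass d at zero, NB scaled by (1 - d) elsewhere."""
    return (d if k == 0 else 0.0) + (1 - d) * nb_pmf(k, mu, size)


def zinb_moments(mu, size, d, max_count=500):
    """Numerically summed mean and variance of the zero-inflated NB."""
    mean = sum(k * zinb_pmf(k, mu, size, d) for k in range(max_count))
    second = sum(k * k * zinb_pmf(k, mu, size, d) for k in range(max_count))
    return mean, second - mean ** 2


if __name__ == "__main__":
    mu, size, d = 5.0, 2.0, 0.3
    print(zinb_moments(mu, size, d))
    # Closed forms: mean = mu (1 - d), variance = mu (1 - d) (1 + d mu + mu / size)
    print(mu * (1 - d), mu * (1 - d) * (1 + d * mu + mu / size))
```

Note that adding dropouts lowers the mean while adding a d-dependent term to the variance, so zero inflation changes the mean-variance relationship rather than just rescaling it.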
Common distributions
Poisson-Beta
[Figure: two histograms of Poisson-Beta read counts; x-axis: read counts, y-axis: frequency]
$\mu = g \cdot \dfrac{a}{a + b}$
$\sigma^2 = g^2 \cdot \dfrac{a \cdot b}{(a + b + 1) \cdot (a + b)^2}$
a: the rate of activation of transcription; b: the rate of inhibition of transcription; g: the rate of transcript production while transcription is active at the locus. Differential expression methods may test each of the parameters for differences across groups, or only one (often g). Implemented in BPSC.
The models may be further expanded to explicitly account for other sources of gene expression differences, such as batch effect or library depth, depending on the particular DE algorithm.
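A small simulation can illustrate the Poisson-Beta model. The sketch assumes counts are drawn as Poisson(g · p) with p ~ Beta(a, b); the function names, parameter values and seed are arbitrary choices. One caveat worth stating: the variance expression on the slide corresponds to the variance of the latent rate g · p, and by the law of total variance the simulated counts should additionally carry a Poisson term equal to the mean.

```python
import math
import random
from statistics import mean, variance


def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for the modest rates used here."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def simulate_poisson_beta(a, b, g, n, seed=1):
    """Draw n counts: p ~ Beta(a, b), then count ~ Poisson(g * p)."""
    rng = random.Random(seed)
    return [sample_poisson(g * rng.betavariate(a, b), rng) for _ in range(n)]


if __name__ == "__main__":
    a, b, g = 1.0, 2.0, 30.0
    counts = simulate_poisson_beta(a, b, g, n=20000)
    expected_mean = g * a / (a + b)
    latent_var = g ** 2 * a * b / ((a + b + 1) * (a + b) ** 2)
    print("simulated mean / variance:", mean(counts), variance(counts))
    print("expected mean:", expected_mean)
    print("latent-rate variance, plus Poisson term:", latent_var, latent_var + expected_mean)
```

With small a and b the Beta draws pile up near 0 and 1, which is how this model produces the bimodal, bursty count distributions seen in single-cell data.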
MAST (revisited)
uses a generalized linear hurdle model
designed to account for stochastic dropouts and a bimodal expression distribution, in which expression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene g, with $Z_{ig}$ indicating whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
A logistic regression model for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):
$\mathrm{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$
$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
SCDE
models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution
the NB distribution models the transcripts that are amplified and detected
the Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)
a subset of robust genes is used to fit, via the EM algorithm, the parameters of the mixture of models
for DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach
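The mixture idea can be illustrated with a single Bayes-rule step, similar in spirit to the E-step of an EM fit: given an observed count, how likely is it to come from the low-rate Poisson (dropout or background) component rather than the NB (amplified) component? This is only a sketch and not the SCDE implementation; the mixing weight `pi_drop`, the background rate `lam_bg` and the NB parameters are made-up values.

```python
import math


def nb_pmf(k, mu, size):
    """Negative binomial pmf with mean mu > 0 and dispersion parameter size > 0."""
    p = size / (size + mu)
    return math.exp(math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
                    + size * math.log(p) + k * math.log(1 - p))


def poisson_pmf(k, lam):
    """Poisson pmf with rate lam > 0."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def dropout_posterior(k, pi_drop, lam_bg, mu, size):
    """Posterior probability that count k came from the Poisson dropout component."""
    drop = pi_drop * poisson_pmf(k, lam_bg)
    amplified = (1 - pi_drop) * nb_pmf(k, mu, size)
    return drop / (drop + amplified)


if __name__ == "__main__":
    # Hypothetical parameters: 30% dropout weight, background rate 0.1,
    # amplified component with mean 20 and size 2.
    for count in (0, 1, 5, 20):
        print(count, dropout_posterior(count, 0.3, 0.1, 20.0, 2.0))
```

Zeros are attributed almost entirely to the dropout component, while larger counts are attributed to the amplified component, so they carry most of the information about expression magnitude.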
Monocle
Originally designed for ordering cells by progress through differentiation stages (pseudo-time)
The mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as
$g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$, where Y is a specific gene expression level, $x_i$ are predictor variables, g is a link function (typically the log function) and $f_i$ are non-parametric functions (e.g. cubic splines)
The observable expression level Y is then modeled using a GAM:
$E(Y) = s(\varphi_t(b_x, s_i)) + \varepsilon$, where $\varphi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and s is a cubic smoothing function with three degrees of freedom. The error term $\varepsilon$ is normally distributed with a mean of zero
The DE test is performed using an approximate $\chi^2$ likelihood ratio test
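The test logic can be sketched without a GAM library by swapping in a simpler model: an ordinary cubic polynomial in pseudo-time stands in for the cubic smoothing spline, and it is compared to an intercept-only model with a Gaussian likelihood ratio test on 3 degrees of freedom. This is an assumption made for illustration, not Monocle's code; the function names and the toy data are hypothetical.

```python
import math


def fit_rss(design, y):
    """Least-squares fit via the normal equations; returns the residual sum of squares."""
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(p)]
    aug = [xtx[i] + [xty[i]] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(p):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    beta = [aug[i][p] / aug[i][i] for i in range(p)]
    return sum((yi - sum(b * x for b, x in zip(beta, row))) ** 2
               for row, yi in zip(design, y))


def chi2_sf_3df(x):
    """Survival function of the chi-square distribution with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


def pseudotime_lrt(pseudotime, expression):
    """LRT of a cubic polynomial in pseudo-time against an intercept-only model."""
    n = len(expression)
    rss_null = fit_rss([[1.0] for _ in pseudotime], expression)
    rss_alt = fit_rss([[1.0, t, t ** 2, t ** 3] for t in pseudotime], expression)
    statistic = max(0.0, n * math.log(rss_null / rss_alt))
    return statistic, chi2_sf_3df(statistic)


if __name__ == "__main__":
    # Toy gene whose expression rises along pseudo-time, with deterministic wiggle.
    t = [i / 19 for i in range(20)]
    y = [1 + 2 * ti + 0.1 * math.sin(7 * i) for i, ti in enumerate(t)]
    print(pseudotime_lrt(t, y))
```

A real smoothing spline would adapt its flexibility through a penalty, which is why the reference distribution is described as approximate.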
Performance
No ground truth, i.e. no independently validated truth, is available for testing
Known data
using data we know something about to get positive controls
Simulated data
null data sets by re-sampling; modeling data sets based on various distributions
Comparing between methods and scenarios
comparing numbers of DE genes, incl. as a function of group size
Investigating results
what do the expression and distributions of the detected DE genes look like?
False positives (type I error) vs. false negatives (type II error)
Sensitivity and specificity
Precision and recall
[Figure: adapted from Wikipedia]
Dal Molin, Baruzzo and Di Camillo 2017: 2 conditions of 100 cells each, simulated with 10,000 genes, of which 2,000 were set as DE (based on NB and bimodal distributions)
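These metrics follow directly from the four cells of a confusion matrix. The sketch below uses the simulation set-up quoted above (10,000 genes, 2,000 truly DE) together with an invented outcome for a hypothetical method that calls 1,800 genes, 1,500 of them correctly; those call numbers are assumptions for illustration, not results from the paper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard evaluation metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),          # also called recall or true positive rate
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "false_positive_rate": fp / (fp + tn),  # type I error rate among non-DE genes
        "false_negative_rate": fn / (fn + tp),  # type II error rate among DE genes
    }


if __name__ == "__main__":
    n_genes, n_true_de = 10_000, 2_000
    n_called, n_called_correct = 1_800, 1_500   # hypothetical method output

    tp = n_called_correct
    fp = n_called - n_called_correct
    fn = n_true_de - tp
    tn = n_genes - n_true_de - fp

    for name, value in classification_metrics(tp, fp, fn, tn).items():
        print(f"{name}: {value:.4f}")
```

Because non-DE genes usually outnumber DE genes, specificity can look high even when a sizeable share of the called genes are false positives, which is why precision is reported alongside it.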
Consistency
Miao et al. 2017
And so much more
Soneson and Robinson 2018, "Bias, robustness and scalability in single-cell differential expression analysis"
36 statistical approaches for DE analysis, comparing the expression levels in two groups of cells
based on 9 data sets with 11-21 separate instances (sample size effect)
extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.
conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets
Practicalities
Getting to know your data
Example data: 46,078 genes x 96 cells
22,229 genes with no expression at all
[Figure: two histograms; frequency against read counts, and frequency against number of 0 counts]
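A first look at a count matrix often starts with exactly these numbers. The sketch below counts genes with no expression in any cell and computes the per-cell detection rate, i.e. the fraction of genes detected in each cell. The tiny genes-by-cells matrix and the function name `summarize_counts` are made up for illustration.

```python
def summarize_counts(counts):
    """Summaries for a genes x cells count matrix given as a list of rows.

    Returns (number of genes with zero counts in every cell,
             list of per-cell detection rates).
    """
    n_genes = len(counts)
    n_cells = len(counts[0])
    silent_genes = sum(1 for gene in counts if all(c == 0 for c in gene))
    detection_rate = [
        sum(1 for gene in counts if gene[cell] > 0) / n_genes
        for cell in range(n_cells)
    ]
    return silent_genes, detection_rate


if __name__ == "__main__":
    toy_counts = [
        [0, 0, 0, 0],   # never detected
        [5, 0, 2, 0],
        [0, 0, 0, 0],   # never detected
        [1, 3, 0, 7],
        [0, 0, 4, 0],
    ]
    silent, rates = summarize_counts(toy_counts)
    print(silent)   # 2
    print(rates)    # [0.4, 0.2, 0.4, 0.2]
```

Genes that are never detected carry no information for DE testing and are typically filtered out before any method is run.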
Choosing DE methods
Soneson and Robinson 2018
Remembering the bigger picture
Stegle, Teichmann and Marioni 2015
QC filtering
Cell-cycle phase
Normalization of cell-specific biases
Confounding factors, incl. batch effects
Detection rate, i.e. the fraction of detected genes per cell
Imputation strategies for dropout values
What is pragmatic: programming language, platform, speed, collaborative workflows, etc.
Staying critical
Summary
What to remember from this hour
https://www.menti.com & 25 06 78
Growing field
Angerer et al. 2017
https://www.scrna-tools.org/tools
Zappia, Phipson and Oshlack 2018
Summary
scRNA-seq is a rapidly growing field
DE is a common task, so many newer and better methods will be developed
understanding basic statistical concepts enables one to think more like a statistician, and to choose and evaluate methods for a given data set
staying critical, staying updated, staying connected
Bibliography
Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1–9. ISSN 1367-4803. doi:10.1093/bioinformatics/bty329.
Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243–260. ISSN 2095-4697. doi:10.1007/s40484-016-0089-7.
Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255–261. ISSN 1548-7105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612.
Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN 1664-8021. doi:10.3389/fgene.2017.00062.
Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data," no. May: 1–2. ISSN 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549.
Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133–145.
Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85–91. ISSN 2452-3100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004.
Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv: 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573.
Common methods More detailed examples
More detailed examples
Olga (NBIS) scRNA-seq DE February 2019 17 46
Common methods More detailed examples
MAST
uses generalized linear hurdle model
designed to account for stochastic dropouts and bimodal expression distribution in whichexpression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene gindicating whether gene g is expressed in cell i (ie Zig = 0 if yig = 0 and zig = 1 ifyig gt 0)
A logistic regression model for the discrete variable Z and a Gaussian linear model for thecontinuous variable (Y|Z=1)
logit(Pr (Zig = 1)) = XiβDg
Pr (Yig = Y |Zig = 1) = N(XiβCg σ
2g) where Xi is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
Olga (NBIS) scRNA-seq DE February 2019 18 46
Common methods More detailed examples
Letrsquos stop for a minute
Olga (NBIS) scRNA-seq DE February 2019 19 46
Common methods More detailed examples
The key
Outcomei = (Modeli) + errori
we collect data on a sample from a much larger population
statistics lets us to make inferences about the population from which sample wasderived
we try to predict the outcome given a model fitted to the data
Olga (NBIS) scRNA-seq DE February 2019 20 46
Common methods More detailed examples
The key
t = x1minusx2
sp
radic1
n1+ 1
n2
height [cm]
Fre
quen
cy
165 170 175 180
010
3050
Olga (NBIS) scRNA-seq DE February 2019 21 46
Common methods More detailed examples
Generic recipemodel data eg gene expressionfit model to the data andor data to the modelestimate model parametersuse model for prediction andor inference
Olga (NBIS) scRNA-seq DE February 2019 22 46
Common methods More detailed examples
Generic recipemodel eg gene expression with random errorfit model to the data andor data to the model estimate modelparametersuse model for prediction andor inference
Important implicationthe better model fits to the data the better statistics
Olga (NBIS) scRNA-seq DE February 2019 23 46
Common methods More detailed examples
Common distributions
Negative Binomial
Read Counts
Fre
quen
cy
0 5 10 15 20
010
020
030
040
050
0
Negative Binomial
Read Counts
Fre
quen
cy
0 2 4 6 8 10
050
100
150
Negative Binomial
Read Counts
Fre
quen
cy
0 2 4 6 8 10 12
050
100
150
200
micro = mu
δ2 = mu + mu2
size
mu mean expression size and the dispersion which is inversely related to the variance NB fits bulk RNA-seq data very welland it is used for most statistical methods designed for such data In addition it has been show to fit the distribution of moleculecounts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grun et al 2014 Islam et al 2011)
Olga (NBIS) scRNA-seq DE February 2019 24 46
Common methods More detailed examples
Common distributions
Zerominusinflated NB
Read Counts
Fre
quen
cy
0 5 10 15 20
020
040
0
Zerominusinflated NB
Read Counts
Fre
quen
cy
0 10 20 30 40
010
020
0micro = mu lowast (1 minus d)
δ2 = micro lowast (1 minus d) lowast (1 + d lowast micro + microsize)
d dropout rate The dropout rate of a gene is strongly correlated with the mean expression of the gene Different zero-inflatednegative binomial models use different relationships between mu and d and some may fit mu and d to the expression of eachgene independently Implemented in MAST SCDE
Olga (NBIS) scRNA-seq DE February 2019 25 46
Common methods More detailed examples
Common distributions
PoissonminusBeta
Read Counts
Fre
quen
cy
0 20 60 100
020
040
0
PoissonminusBeta
Read Counts
Fre
quen
cy
0 20 60 100 140
050
150
micro = g lowast a(a + b)
δ2 = g2 lowast a lowast b((a + b + 1) lowast (a + b)2)
a the rate of activation of transcription b the rate of inhibition of transcription and g the rate of transcript production whiletranscription is active at the locus Differential expression methods may test each of the parameters for differences across groupsor only one (often g) Implemented in BPSCMay be further expanded to explicitly account for other sources of gene expression differences such as batch-effect or librarydepth depending on the particular DE algorithm
Olga (NBIS) scRNA-seq DE February 2019 26 46
Common methods More detailed examples
MAST (revisited)
uses generalized linear hurdle model
designed to account for stochastic dropouts and bimodal expression distribution in whichexpression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene gindicating whether gene g is expressed in cell i (ie Zig = 0 if yig = 0 and zig = 1 ifyig gt 0)
A logistic regression model for the discrete variable Z and a Gaussian linear model for thecontinuous variable (Y|Z=1)
logit(Pr (Zig = 1)) = XiβDg
Pr (Yig = Y |Zig = 1) = N(XiβCg σ
2g) where Xi is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
Olga (NBIS) scRNA-seq DE February 2019 27 46
Common methods More detailed examples
SCDE
models the read counts for each gene using a mixture of a NB negative binomial and aPoisson distribution
NB distribution models the transcripts that are amplified and detected
Poisson distribution models the unobserved or background-level signal of transcripts thatare not amplified (eg dropout events)
subset of robust genes is used to fit via EM algorithm the parameters to the mixture ofmodels
For DE the posterior probability that the gene shows a fold expression difference betweentwo conditions is computed using a Bayesian approach
Olga (NBIS) scRNA-seq DE February 2019 28 46
Common methods More detailed examples
Monocole
Originally designed for ordering cells by progress through differentiation stages(pseudo-time)
The mean expression level of each gene is modeled with a GAM generalized additivemodel which relates one or more predictor variables to a response variable as
g(E(Y )) = β0 + f1(x1) + f2(x2) + + fm(xm) where Y is a specific gene expression level xi arepredictor variables g is a link function typically log function and fi are non-parametric functions
(eg cubic splines)
The observable expression level Y is then modeled using GAM
E(Y ) = s(ϕt (bx si )) + ε where ϕt (bx si ) is the assigned pseudo-time of a cell and s is a cubicsmoothing function with three degrees of freedom The error term ε is normally distributed with amean of zero
The DE test is performed using an approx χ2 likelihood ratio test
Olga (NBIS) scRNA-seq DE February 2019 29 46
Performance
Performance
Olga (NBIS) scRNA-seq DE February 2019 30 46
Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR, and Census counts with monocle.

Short name | Method | Software version | Input | Available from | Reference
BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11]
D3E | D3E | D3E 1.0 | raw counts | GitHub | [12]
DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13]
DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14]
edgeRLRT | edgeR/LRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17]
edgeRLRTcensus | edgeR/LRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17]
edgeRLRTdeconv | edgeR/LRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18]
edgeRLRTrobust | edgeR/LRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19]
edgeRQLF | edgeR/QLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
edgeRQLFDetRate | edgeR/QLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22]
MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24]
monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25]
monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26]
monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25]
NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27]
ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29]
ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29]
ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29]
SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30]
scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31]
SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32]
SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33]
ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35]
voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22]
Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36]

Source: Nature Methods, doi:10.1038/nmeth.4612
No ground truth, i.e. no independently validated truth, is available for testing.
Known data: using data we know something about to get positive controls
Simulated data: null data sets created by re-sampling; modeling data sets based on various distributions
Comparing between methods and scenarios: comparing numbers of DE genes, incl. as a function of group size
Investigating results: what do the expression levels and distributions of the detected DE genes look like?
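The re-sampling idea for null data sets can be sketched as follows. This is a minimal illustration, assuming all cells come from one homogeneous group; the function name `make_null_split` and the cell identifiers are hypothetical.

```python
import random


def make_null_split(cells, seed=0):
    """Randomly split cells from ONE homogeneous group into two pseudo-groups.

    No true differential expression exists between the two halves, so every
    gene that a DE method reports on such a split is a false positive.
    """
    rng = random.Random(seed)
    shuffled = list(cells)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical identifiers for 96 cells of the same type
cells = [f"cell_{i:02d}" for i in range(96)]
group_1, group_2 = make_null_split(cells, seed=42)
print(len(group_1), len(group_2))
```

Repeating the split with different seeds and group sizes gives an empirical picture of how many false positives a method produces when nothing is truly different.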
False positives (type I error) vs. false negatives (type II error)
Sensitivity and specificity
Precision and recall
[Figure adapted from Wikipedia]
Dal Molin, Baruzzo and Di Camillo 2017: 2 conditions of 100 cells each, simulated with 10,000 genes, out of which 2,000 were set to be DE (based on NB and bimodal distributions)
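These metrics all derive from the four cells of a confusion matrix. The sketch below uses the simulation design mentioned above (10,000 genes, 2,000 truly DE); the numbers of genes called by the method are invented purely for illustration and are not results from the paper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard performance metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),   # also called recall or true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "precision": tp / (tp + fp),     # fraction of calls that are correct
        "false_positive_rate": fp / (fp + tn),  # type I error rate
    }


# 2,000 truly DE genes and 8,000 non-DE genes (design from the slide).
# Hypothetical method output: 1,900 genes called DE, 1,500 of them correctly.
metrics = classification_metrics(tp=1500, fp=400, fn=500, tn=7600)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that precision depends on the proportion of truly DE genes in the data set, whereas sensitivity and specificity do not.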
Consistency
Miao et al. 2017
And so much more
Soneson and Robinson 2018: Bias, robustness and scalability in single-cell differential expression analysis
36 statistical approaches for DE analysis, comparing the expression levels in two groups of cells
based on 9 data sets with 11 to 21 separate instances (sample size effect)
extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.
conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets
Practicalities
Getting to know your data
Example data: 46,078 genes x 96 cells; 22,229 genes with no expression at all
[Figure: two histograms, one of read counts (x: read counts, y: frequency) and one of zero counts (x: 0 counts, y: frequency)]
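Summaries like the ones on this slide can be computed directly from a count matrix. The sketch below is a minimal pure-Python version on a made-up toy matrix (genes in rows, cells in columns); the function name `summarise_counts` is hypothetical.

```python
def summarise_counts(counts):
    """counts: list of rows (genes), each a list of per-cell read counts."""
    n_genes = len(counts)
    n_cells = len(counts[0])
    # Genes with no expression in any cell
    silent = sum(1 for row in counts if all(c == 0 for c in row))
    # Detection rate: fraction of genes detected (count > 0) in each cell
    detection_rate = [
        sum(1 for row in counts if row[j] > 0) / n_genes for j in range(n_cells)
    ]
    # Number of cells with a zero count, per gene
    zeros_per_gene = [row.count(0) for row in counts]
    return silent, detection_rate, zeros_per_gene


# Toy matrix: 4 genes x 4 cells (invented values)
toy = [
    [0, 0, 0, 0],
    [5, 0, 2, 0],
    [0, 0, 0, 0],
    [9, 7, 0, 3],
]
silent_genes, detection_rate, zeros_per_gene = summarise_counts(toy)
print(silent_genes, detection_rate, zeros_per_gene)
```

A histogram of `zeros_per_gene` corresponds to the kind of zero-count plot shown on the slide.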
Choosing DE methods
Soneson and Robinson 2018
Remembering the bigger picture
Stegle, Teichmann and Marioni 2015
QC filtering
Cell-cycle phase
Normalization of cell-specific biases
Confounding factors, incl. batch effects
Detection rate, i.e. the fraction of detected genes per cell
Imputation strategies for dropout values
What is pragmatic: programming language, platform, speed, collaborative workflows, etc.
Staying critical
Summary
What to remember from this hour?
https://www.menti.com & 25 06 78
Growing field
Angerer et al. 2017
https://www.scrna-tools.org/tools
Zappia, Phipson and Oshlack 2018
Summary
scRNA-seq is a rapidly growing field
DE is a common task, so many newer and better methods will be developed
understanding basic statistical concepts enables one to think more like a statistician, and to choose and evaluate methods for a given data set
staying critical, staying updated, staying connected
Bibliography
Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1–9. ISSN 1367-4803. doi:10.1093/bioinformatics/bty329.
Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243–260. ISSN 2095-4697. doi:10.1007/s40484-016-0089-7.
Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255–261. ISSN 1548-7105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612.
Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN 1664-8021. doi:10.3389/fgene.2017.00062.
Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data," no. May: 1–2. ISSN 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549.
Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133–145.
Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85–91. ISSN 2452-3100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004.
Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573.
adapted from Wu et al. 2017
Differential expression means taking read count data and performing statistical analysis to discover quantitative changes in expression levels between experimental groups,
i.e. deciding whether, for a given gene, an observed difference in read counts is significant (greater than what would be expected just due to natural random variation).
Differential expression is an old problem, known from bulk RNA-seq and microarray studies.
In fact, it builds on one of the most common statistical problems, i.e. comparing groups for statistical differences.
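One way to make "greater than what would be expected due to random variation" concrete is a permutation test on a single gene. The sketch below is a generic illustration, not a method from the lecture; the function name `permutation_test` and the count values are made up.

```python
import random


def permutation_test(group_a, group_b, n_perm=2000, seed=1):
    """Two-sided permutation p-value for a difference in mean counts."""
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_perm):
        # Shuffling the labels mimics "no difference between groups"
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a
                   - sum(pooled[n_a:]) / (len(pooled) - n_a))
        if diff >= observed:
            hits += 1
    # Add-one correction so the p-value is never exactly zero
    return (hits + 1) / (n_perm + 1)


# Invented read counts for one gene in two groups of 8 cells
counts_a = [0, 3, 5, 2, 4, 6, 3, 5]
counts_b = [12, 15, 9, 14, 11, 0, 13, 16]
p_value = permutation_test(counts_a, counts_b)
print(p_value)
```

A small p-value says that label shuffling rarely produces a difference as large as the observed one.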
Differential expression is an old problem
So what is all the commotion about?
https://www.menti.com & 25 06 78
scRNA-seq special characteristics:
high noise levels (technical and biological factors)
low library sizes
low amount of available mRNA results in amplification biases and dropout events
3' bias, partial coverage and uneven depth (technical)
stochastic nature of transcription (biological)
multimodality in gene expression: presence of multiple possible cell states within a cell population (biological)
[Figure: density plots (x: value, y: density) for 20 genes: 1300018J18Rik, Arid2, Bend3, Ccdc104, Ccnt1, Crispld2, Fbxw13, Hbxip, Katna1, Lcorl, Mybpc1, Nars, Ndufa3, Nono, Pgam2, Rbm17, Rragc, Slc1a3, Slc22a20, Smarcd1. Based on tutorial data.]
Common methods
Simplified scRNA-seq workflow [adapted from http://hemberg-lab.github.io]
Generic non-parametric methods
e.g. Wilcoxon rank-sum test, Kruskal-Wallis test, Kolmogorov-Smirnov test
non-parametric tests generally convert observed expression values to ranks and test whether the distribution of ranks for one group is significantly different from the distribution of ranks for the other group
some non-parametric methods fail in the presence of a large number of tied values, as is the case for dropouts (zeros) in single-cell RNA-seq expression data
if the conditions for a parametric test hold, then it will typically be more powerful than a non-parametric test
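To show how ranks and ties interact, here is a minimal sketch of a Wilcoxon rank-sum (Mann-Whitney) test using mid-ranks and the usual tie-corrected normal approximation, without continuity correction. It is an illustration written for this text, not the implementation of any package; the expression values are invented.

```python
import math
from collections import Counter


def midranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_test(group_a, group_b):
    """Return (U statistic for group_a, two-sided normal-approximation p-value)."""
    n_a, n_b = len(group_a), len(group_b)
    n = n_a + n_b
    pooled = list(group_a) + list(group_b)
    ranks = midranks(pooled)
    u = sum(ranks[:n_a]) - n_a * (n_a + 1) / 2
    # Ties (e.g. many zeros) shrink the variance of U
    tie_term = sum(t ** 3 - t for t in Counter(pooled).values())
    var_u = n_a * n_b / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0  # all values tied: the ranks carry no information
    z = (u - n_a * n_b / 2) / math.sqrt(var_u)
    return u, math.erfc(abs(z) / math.sqrt(2))


# Invented values with many dropouts (zeros)
expr_a = [0, 0, 0, 0, 0, 0, 1, 2]
expr_b = [0, 0, 3, 4, 5, 6, 7, 8]
u_stat, p_value = rank_sum_test(expr_a, expr_b)
print(u_stat, p_value)
```

In the extreme case where every value is zero, all ranks are identical and the test cannot distinguish the groups at all, which is the failure mode described above.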
developed for bulk RNA-seq
e.g. edgeR, DESeq2
compare estimates of mean expression (sample size)
based on the negative binomial distribution
can be assessed on data sets where RNA-seq data has been validated by RT-qPCR
developed for scRNA-seq
e.g. MAST, SCDE, Monocle, Pagoda, D3E, etc.
large number of samples (i.e. cells) for each group we are comparing in single-cell experiments; thus we can take advantage of the whole distribution of expression values in each group to identify differences between groups
we usually do not have a defined set of experimental conditions; instead, we try to identify the cell groups by using an unsupervised clustering approach
Miao and Zhang 2016
Soneson and Robinson 2018, Supplementary Table 2: evaluated differential expression methods (Nature Methods, doi:10.1038/nmeth.4612)
More detailed examples
MAST
uses a generalized linear hurdle model
designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene g, where Z indicates whether gene g is expressed in cell i (i.e. $z_{ig} = 0$ if $y_{ig} = 0$ and $z_{ig} = 1$ if $y_{ig} > 0$)
A logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):
$\operatorname{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$
$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
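The two-part structure of a hurdle model can be illustrated with a stripped-down likelihood ratio test for one gene and two groups. This is a sketch under strong simplifying assumptions (group label as the only covariate, no empirical Bayes shrinkage, closed-form maximum likelihood estimates); it is not the MAST package, and the function name `hurdle_lrt` and the data are invented.

```python
import math


def _max_bernoulli_loglik(k, n):
    """Bernoulli log-likelihood at its maximum for k expressing cells out of n."""
    if n == 0 or k == 0 or k == n:
        return 0.0
    p = k / n
    return k * math.log(p) + (n - k) * math.log(1 - p)


def _rss(values, center):
    return sum((v - center) ** 2 for v in values)


def hurdle_lrt(group_a, group_b):
    """Hurdle LRT for one gene; inputs are log-scale expression values per cell."""
    pos_a = [y for y in group_a if y > 0]
    pos_b = [y for y in group_b if y > 0]

    # Discrete part (Z): does the fraction of expressing cells differ?
    ll_alt = (_max_bernoulli_loglik(len(pos_a), len(group_a))
              + _max_bernoulli_loglik(len(pos_b), len(group_b)))
    ll_null = _max_bernoulli_loglik(len(pos_a) + len(pos_b),
                                    len(group_a) + len(group_b))
    stat_discrete = max(0.0, 2 * (ll_alt - ll_null))

    # Continuous part (Y | Z = 1): does the mean among expressing cells differ?
    stat_continuous = 0.0
    if len(pos_a) >= 2 and len(pos_b) >= 2:
        pos = pos_a + pos_b
        rss_null = _rss(pos, sum(pos) / len(pos))
        rss_alt = (_rss(pos_a, sum(pos_a) / len(pos_a))
                   + _rss(pos_b, sum(pos_b) / len(pos_b)))
        if rss_alt > 0:
            stat_continuous = max(0.0, len(pos) * math.log(rss_null / rss_alt))

    # The two statistics are summed and compared with a chi-square distribution
    # with 2 degrees of freedom, whose survival function is exp(-x / 2).
    stat = stat_discrete + stat_continuous
    return stat_discrete, stat_continuous, math.exp(-stat / 2)


# Invented log-expression values for one gene
cells_a = [0, 0, 0, 0, 1.0, 1.2]
cells_b = [2.0, 2.1, 1.9, 2.2, 2.0, 0]
stat_d, stat_c, p_value = hurdle_lrt(cells_a, cells_b)
print(stat_d, stat_c, p_value)
```

When the continuous part cannot be fitted (fewer than two expressing cells in a group) this sketch still uses 2 degrees of freedom, which makes the p-value conservative.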
Let's stop for a minute
The key
$\text{Outcome}_i = (\text{Model}_i) + \text{error}_i$
we collect data on a sample from a much larger population
statistics lets us make inferences about the population from which the sample was derived
we try to predict the outcome given a model fitted to the data
The key
$t = \dfrac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$
[Figure: histogram of height [cm] (x) against frequency (y)]
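The t statistic above can be computed in a few lines. The sketch assumes $s_p$ is the usual pooled standard deviation of the two samples; the height values are invented.

```python
import math
from statistics import mean, variance


def pooled_t(x1, x2):
    """Two-sample t statistic with pooled standard deviation s_p."""
    n1, n2 = len(x1), len(x2)
    # Pooled variance: weighted average of the two sample variances
    sp = math.sqrt(((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2))
                   / (n1 + n2 - 2))
    return (mean(x1) - mean(x2)) / (sp * math.sqrt(1 / n1 + 1 / n2))


# Invented heights [cm] for two groups
heights_a = [170.0, 172.0, 174.0, 176.0, 178.0]
heights_b = [165.0, 167.0, 169.0, 171.0, 173.0]
t_stat = pooled_t(heights_a, heights_b)
print(t_stat)
```

Here both groups have a sample variance of 10 and the means differ by 5 cm, so the statistic evaluates to 5 / (sqrt(10) * sqrt(2/5)) = 2.5.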
Generic recipe
model the data, e.g. gene expression with random error
fit the model to the data and/or the data to the model; estimate the model parameters
use the model for prediction and/or inference
Important implication
the better the model fits the data, the better the statistics
Common distributions
Negative Binomial
[Figure: three histograms of negative binomial read counts (x: read counts, y: frequency)]
$\mu = mu$
$\sigma^2 = mu + \dfrac{mu^2}{size}$
mu: the mean expression; size: the dispersion, which is inversely related to the variance. The NB fits bulk RNA-seq data very well and it is used by most statistical methods designed for such data. In addition, it has been shown to fit the distribution of molecule counts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grun et al. 2014, Islam et al. 2011).
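The mean and variance formulas can be checked numerically from the probability mass function in the (mu, size) parameterisation. The parameter values below are arbitrary choices for illustration.

```python
import math


def nb_pmf(k, mu, size):
    """Negative binomial probability of count k, parameterised by mean and size."""
    log_p = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
             + size * math.log(size / (size + mu))
             + k * math.log(mu / (size + mu)))
    return math.exp(log_p)


mu, size = 5.0, 2.0  # arbitrary illustrative values
ks = range(500)      # the tail beyond 500 is negligible for these parameters
probs = [nb_pmf(k, mu, size) for k in ks]
mean_num = sum(k * p for k, p in zip(ks, probs))
var_num = sum((k - mean_num) ** 2 * p for k, p in zip(ks, probs))

print(mean_num, mu)                   # numerical mean vs. formula
print(var_num, mu + mu ** 2 / size)   # numerical variance vs. formula
```

A smaller `size` gives a larger variance for the same mean, which is why `size` is described as inversely related to the variance.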
Zero-inflated NB
[Figure: two histograms of zero-inflated negative binomial read counts (x: read counts, y: frequency)]
$\mu = mu \cdot (1 - d)$
$\sigma^2 = mu \cdot (1 - d) \cdot \left(1 + d \cdot mu + \dfrac{mu}{size}\right)$
d: the dropout rate. The dropout rate of a gene is strongly correlated with the mean expression of the gene. Different zero-inflated negative binomial models use different relationships between mu and d, and some may fit mu and d to the expression of each gene independently. Implemented in MAST, SCDE.
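The zero-inflated formulas can be verified the same way: mix a point mass at zero (probability d) with a negative binomial and compute the moments numerically. The parameter values are arbitrary choices for illustration.

```python
import math


def nb_pmf(k, mu, size):
    """Negative binomial probability of count k, parameterised by mean and size."""
    log_p = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
             + size * math.log(size / (size + mu))
             + k * math.log(mu / (size + mu)))
    return math.exp(log_p)


def zinb_pmf(k, mu, size, d):
    """With probability d the count is a dropout (zero), otherwise it is NB."""
    p = (1 - d) * nb_pmf(k, mu, size)
    return p + d if k == 0 else p


mu, size, d = 5.0, 2.0, 0.3  # arbitrary illustrative values
ks = range(500)
probs = [zinb_pmf(k, mu, size, d) for k in ks]
mean_num = sum(k * p for k, p in zip(ks, probs))
var_num = sum((k - mean_num) ** 2 * p for k, p in zip(ks, probs))

mean_formula = mu * (1 - d)
var_formula = mu * (1 - d) * (1 + d * mu + mu / size)
print(mean_num, mean_formula)
print(var_num, var_formula)
```

With these values both the formula and the numerical sum give a mean of 3.5 and a variance of 17.5.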
Poisson-Beta
[Figure: two histograms of Poisson-Beta read counts (x: read counts, y: frequency)]
$\mu = g \cdot \dfrac{a}{a + b}$
$\sigma^2 = g^2 \cdot \dfrac{a \cdot b}{(a + b + 1) \cdot (a + b)^2}$
a: the rate of activation of transcription; b: the rate of inhibition of transcription; g: the rate of transcript production while transcription is active at the locus. Differential expression methods may test each of the parameters for differences across groups, or only one (often g). Implemented in BPSC.
May be further expanded to explicitly account for other sources of gene expression differences, such as batch effect or library depth, depending on the particular DE algorithm.
SCDE
models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution
the NB distribution models the transcripts that are amplified and detected
the Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)
a subset of robust genes is used to fit, via the EM algorithm, the parameters of the mixture of models
For DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach
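The mixture idea can be illustrated with Bayes' rule: given fixed mixture parameters, how likely is it that an observed count came from the background (Poisson) component rather than the amplified (NB) component? This is a simplified sketch, not the scde package: all parameter values are invented, whereas SCDE estimates its error models from the data.

```python
import math


def poisson_pmf(k, lam):
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def nb_pmf(k, mu, size):
    """Negative binomial probability of count k, parameterised by mean and size."""
    log_p = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
             + size * math.log(size / (size + mu))
             + k * math.log(mu / (size + mu)))
    return math.exp(log_p)


def dropout_posterior(k, pi_dropout, lam_bg, mu, size):
    """Posterior probability that count k comes from the dropout component."""
    drop = pi_dropout * poisson_pmf(k, lam_bg)       # background signal
    amp = (1 - pi_dropout) * nb_pmf(k, mu, size)     # amplified transcripts
    return drop / (drop + amp)


# Hypothetical parameters: 30% dropout weight, background rate 0.1,
# amplified component with mean 20 and size 2
params = dict(pi_dropout=0.3, lam_bg=0.1, mu=20.0, size=2.0)
post_zero = dropout_posterior(0, **params)
post_high = dropout_posterior(20, **params)
print(post_zero, post_high)
```

A zero count is attributed almost entirely to the dropout component, while a count of 20 is attributed almost entirely to the amplified component.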
Olga (NBIS) scRNA-seq DE February 2019 28 46
Common methods More detailed examples
Monocole
Originally designed for ordering cells by progress through differentiation stages(pseudo-time)
The mean expression level of each gene is modeled with a GAM generalized additivemodel which relates one or more predictor variables to a response variable as
g(E(Y )) = β0 + f1(x1) + f2(x2) + + fm(xm) where Y is a specific gene expression level xi arepredictor variables g is a link function typically log function and fi are non-parametric functions
(eg cubic splines)
The observable expression level Y is then modeled using GAM
E(Y ) = s(ϕt (bx si )) + ε where ϕt (bx si ) is the assigned pseudo-time of a cell and s is a cubicsmoothing function with three degrees of freedom The error term ε is normally distributed with amean of zero
The DE test is performed using an approx χ2 likelihood ratio test
Olga (NBIS) scRNA-seq DE February 2019 29 46
Performance
Performance
Olga (NBIS) scRNA-seq DE February 2019 30 46
Supplementary Table 2 Evaluated dicrarrerential expression methods together with package versions and thetype of input values provided to each of them Note that ldquoraw countsrdquo here refers to length-scaled TPMswhich are on the scale of the raw counts but are unacrarrected by dicrarrerential isoform usage [10] CPM valuesare calculated with edgeR and Census counts with monocle
Short name Method Software version InputAvailablefrom
Reference
BPSC BPSC BPSC 09901 CPM GitHub [11]
D3E D3E D3E 10 raw counts GitHub [12]
DESeq2 DESeq2 DESeq2 1141 raw counts Bioconductor [13]
DESeq2betapFALSE DESeq2 without beta prior DESeq2 1141 raw counts Bioconductor [13]
DESeq2census DESeq2 DESeq2 1141 Census counts Bioconductor [13]
DESeq2nofiltDESeq2 without the built-in in-dependent filtering
DESeq2 1141 raw counts Bioconductor [13]
DEsingle DEsingle DEsingle 010 raw counts GitHub [14]
edgeRLRT edgeRLRT edgeR 3191 raw counts Bioconductor [15ndash17]
edgeRLRTcensus edgeRLRT edgeR 3191 Census counts Bioconductor [15ndash17]
edgeRLRTdeconvedgeRLRT with deconvolutionnormalization
edgeR 3191scran 120
raw counts Bioconductor [15 17 18]
edgeRLRTrobustedgeRLRT with robust disper-sion estimation
edgeR 3191 raw counts Bioconductor [15ndash17 19]
edgeRQLF edgeRQLF edgeR 3191 raw counts Bioconductor [15 16 20]
edgeRQLFDetRateedgeRQLF with cellular detec-tion rate as covariate
edgeR 3191 raw counts Bioconductor [15 16 20]
limmatrend limma-trend limma 33013 log2(CPM) Bioconductor [21 22]
MASTcpm MAST MAST 105 log2(CPM+1) Bioconductor [23]
MASTcpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(CPM+1) Bioconductor [23]
MASTtpm MAST MAST 105 log2(TPM+1) Bioconductor [23]
MASTtpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(TPM+1) Bioconductor [23]
metagenomeSeq metagenomeSeqmetagenomeSeq1160
raw counts Bioconductor [24]
monocle monocle (tobit) monocle 220 TPM Bioconductor [25]
monoclecensus monocle (Negative Binomial) monocle 220 Census counts Bioconductor [25 26]
monoclecount monocle (Negative Binomial) monocle 220 raw counts Bioconductor [25]
NODES NODESNODES0009010
raw countsAuthor-providedlink
[27]
ROTScpm ROTS ROTS 120 CPM Bioconductor [28 29]
ROTStpm ROTS ROTS 120 TPM Bioconductor [28 29]
ROTSvoom ROTS ROTS 120voom-transformedraw counts
Bioconductor [28 29]
SAMseq SAMseq samr 20 raw counts CRAN [30]
scDD scDD scDD 100 raw counts Bioconductor [31]
SCDE SCDE scde 220 raw counts Bioconductor [32]
SeuratBimod Seurat (bimod test) Seurat 1407 raw counts GitHub [33 34]
SeuratBimodnofiltSeurat (bimod test) without theinternal filtering
Seurat 1407 raw counts GitHub [33 34]
SeuratBimodIsExpr2Seurat (bimod test) with internalexpression threshold set to 2
Seurat 1407 raw counts GitHub [33 34]
SeuratTobit Seurat (tobit test) Seurat 1407 TPM GitHub [25 33]
ttest t-test stats (R v 33)TMM-normalizedTPM
CRAN [16 35]
voomlimma voom-limma limma 33013 raw counts Bioconductor [21 22]
Wilcoxon Wilcoxon test stats (R v 33)TMM-normalizedTPM
CRAN [16 36]
3
Nature Methods doi101038nmeth4612
Performance
Olga (NBIS) scRNA-seq DE February 2019 31 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
False positives (type I error) vs false negatives (type II error)Sensitivity and specificityPrecision and recall
adapted from Wikipedia
Olga (NBIS) scRNA-seq DE February 2019 33 46
Performance
False positives (type I error) vs false negatives (type II error)Sensitivity and specificityPrecision and recall
Dal Molin Baruzzo and Di Camillo 2017 2 conditions of 100 cells each simulated with 10 000genes out of which 2 000 set to DEs (based on NB and bimodal distributions)
Olga (NBIS) scRNA-seq DE February 2019 34 46
Performance
Consistency
Miao et al 2017
Olga (NBIS) scRNA-seq DE February 2019 35 46
Performance
And so much more
Soneson and Robinson 2018
Bias robustness and scalability in single-celldifferential expression analysis
36 statistical approaches for DE analysisto compare the expression levels in thetwo groups of cells
based on 9 data sets with 11 - 21separate instances (sample size effect)
extensive evaluation metrics incl numberof genes found characteristics of the falsepositive detections robustness ofmethods similarities between methodsetc
conquer a collection of consistentlyprocessed analysis-ready publicscRNA-seq data sets
Olga (NBIS) scRNA-seq DE February 2019 36 46
Practicalities
Practicalities
Olga (NBIS) scRNA-seq DE February 2019 37 46
Practicalities
Getting to know your dataExample data 46078 genes x 96 cells22229 genes with no expression at all
Read Counts
Fre
quen
cy
0 500 1000 1500
050
0015
000
0 counts
Fre
quen
cy
0 20 40 60 800
2000
4000
6000
Olga (NBIS) scRNA-seq DE February 2019 38 46
Practicalities
Choosing DE methods
Soneson and Robinson 2018
Olga (NBIS) scRNA-seq DE February 2019 39 46
Remembering the bigger picture
Stegle, Teichmann and Marioni 2015
- QC filtering
- Cell-cycle phase
- Normalization of cell-specific biases
- Confounding factors, incl. batch effects
- Detection rate, i.e. the fraction of detected genes per cell
- Imputation strategies for dropout values
- What is pragmatic: programming language, platform, speed, collaborative workflows, etc.
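As an illustration of the detection rate item above, a minimal sketch that computes the fraction of detected genes per cell from a genes x cells count matrix. The matrix layout (one row per gene), the function name and the toy counts are assumptions made for this example:

```python
def detection_rate(counts):
    """Fraction of genes with at least one read in each cell.

    counts: list of rows, one row per gene, one column per cell.
    """
    n_genes = len(counts)
    n_cells = len(counts[0])
    return [
        sum(1 for gene in counts if gene[cell] > 0) / n_genes
        for cell in range(n_cells)
    ]


# Toy matrix: 4 genes (rows) x 3 cells (columns).
toy_counts = [
    [0, 3, 0],
    [5, 1, 0],
    [0, 0, 0],
    [2, 7, 1],
]
rates = detection_rate(toy_counts)
print(rates)
```

The per-cell value can then be passed as a covariate to DE methods that accept one.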
Staying critical
Summary
What to remember from this hour?
https://www.menti.com & 25 06 78
Growing field
Angerer et al. 2017
https://www.scrna-tools.org/tools
Zappia, Phipson and Oshlack 2018
- scRNA-seq is a rapidly growing field
- DE is a common task, so many newer and better methods will be developed
- understanding basic statistical concepts enables one to think more like a statistician when choosing and evaluating methods for a given data set
- staying critical, staying updated, staying connected
Bibliography
Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1–9. ISSN 1367-4803. doi:10.1093/bioinformatics/bty329
Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243–260. ISSN 2095-4697. doi:10.1007/s40484-016-0089-7
Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255–261. ISSN 1548-7105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612
Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN 1664-8021. doi:10.3389/fgene.2017.00062
Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data." no. May: 1–2. ISSN 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549
Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133–145
Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85–91. ISSN 2452-3100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004
Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573
- Outline
- Introduction
- Common methods
- More detailed examples
- Performance
- Practicalities
- Summary
- Bibliography
Outline
- Introduction: what is so special about scRNA-seq DE?
- Common methods: what is out there?
- Performance: how do we know what is best?
- Practicalities: what to do in real life?
- Summary: what to remember from this hour?
Introduction
What does differential expression mean to you?
https://www.menti.com
Figure: Simplified scRNA-seq workflow [adapted from Wikipedia]
Differential expression means taking read count data and performing statistical analysis to discover quantitative changes in expression levels between experimental groups, i.e. deciding whether, for a given gene, an observed difference in read counts is significant (greater than what would be expected just due to natural random variation).
Differential expression is an old problem, known from bulk RNA-seq and microarray studies. In fact, it builds on one of the most common statistical problems, i.e. comparing groups for statistical differences.
(adapted from Wu et al. 2017)
Differential expression is an old problem
So what is all the commotion about?
https://www.menti.com & 25 06 78
scRNA-seq special characteristics:
- high noise levels (technical and biological factors)
- low library sizes
- low amount of available mRNAs results in amplification biases and dropout events
- 3' bias, partial coverage and uneven depth (technical)
- stochastic nature of transcription (biological)
- multimodality in gene expression: presence of multiple possible cell states within a cell population (biological)
Figure: Density plots of expression values (x-axis: value, y-axis: density) for 20 genes: Rbm17, Rragc, Slc1a3, Slc22a20, Smarcd1, Mybpc1, Nars, Ndufa3, Nono, Pgam2, Crispld2, Fbxw13, Hbxip, Katna1, Lcorl, 1300018J18Rik, Arid2, Bend3, Ccdc104, Ccnt1 [based on tutorial data]
Common methods
Figure: Simplified scRNA-seq workflow [adapted from http://hemberg-lab.github.io]
Generic non-parametric methods
- e.g. Wilcoxon rank-sum test, Kruskal-Wallis, Kolmogorov-Smirnov test
- non-parametric tests generally convert observed expression values to ranks and test whether the distribution of ranks for one group is significantly different from the distribution of ranks for the other group
- some non-parametric methods fail in the presence of a large number of tied values, such as the dropouts (zeros) in single-cell RNA-seq expression data
- if the conditions for a parametric test hold, then it will typically be more powerful than a non-parametric test
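To make the rank conversion and the problem with ties concrete, here is a minimal sketch of a Wilcoxon rank-sum test that assigns mid-ranks to tied values and uses the tie-corrected normal approximation without continuity correction. The function names and the toy expression values are hypothetical; statistical packages may apply further corrections:

```python
import math
from collections import Counter


def midranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def rank_sum_test(group_a, group_b):
    """Return the U statistic of group_a and a two-sided p-value."""
    n1, n2 = len(group_a), len(group_b)
    n = n1 + n2
    pooled = list(group_a) + list(group_b)
    ranks = midranks(pooled)
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    # Ties (e.g. many zeros) shrink the variance of U.
    tie_term = sum(t ** 3 - t for t in Counter(pooled).values())
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var <= 0:  # every value is identical: no evidence of a difference
        return u, 1.0
    z = (u - n1 * n2 / 2) / math.sqrt(var)
    return u, math.erfc(abs(z) / math.sqrt(2))


u_stat, p_value = rank_sum_test([0, 0, 0, 1, 2], [3, 4, 5, 6, 7])
print(u_stat, p_value)
```

When every cell in both groups is a dropout, all ranks are tied and the test has nothing to work with, which is the failure mode mentioned in the list above.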
Methods developed for bulk RNA-seq
- e.g. edgeR, DESeq2
- compare estimates of mean expression (sample size)
- based on the negative binomial distribution
- can be assessed using data sets where RNA-seq data have been validated by RT-qPCR
Methods developed for scRNA-seq
- e.g. MAST, SCDE, Monocle, Pagoda, D3E, etc.
- single-cell experiments provide a large number of samples (i.e. cells) for each group we are comparing, so we can take advantage of the whole distribution of expression values in each group to identify differences between groups
- we usually do not have a defined set of experimental conditions; instead, we try to identify the cell groups by using an unsupervised clustering approach
Miao and Zhang 2016
Supplementary Table 2: Evaluated differential expression methods together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle.
Short name | Method | Software version | Input | Available from | Reference
BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11]
D3E | D3E | D3E 1.0 | raw counts | GitHub | [12]
DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13]
DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14]
edgeRLRT | edgeRLRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17]
edgeRLRTcensus | edgeRLRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17]
edgeRLRTdeconv | edgeRLRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18]
edgeRLRTrobust | edgeRLRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19]
edgeRQLF | edgeRQLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
edgeRQLFDetRate | edgeRQLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22]
MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24]
monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25]
monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26]
monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25]
NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27]
ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29]
ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29]
ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29]
SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30]
scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31]
SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32]
SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33]
ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35]
voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22]
Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36]
Nature Methods doi:10.1038/nmeth.4612, Soneson and Robinson 2018
More detailed examples
MAST
- uses a generalized linear hurdle model
- designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable
- the rate of expression Z and the level of expression Y are modeled for each gene g, with Z indicating whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
- a logistic regression model for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):
  $\mathrm{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$
  $\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix
- model parameters are fitted using an empirical Bayesian framework
- allows for a joint estimate of nuisance and treatment effects
- DE is determined using the likelihood ratio test
Let's stop for a minute
The key
$\mathrm{Outcome}_i = (\mathrm{Model}_i) + \mathrm{error}_i$
- we collect data on a sample from a much larger population
- statistics lets us make inferences about the population from which the sample was derived
- we try to predict the outcome given a model fitted to the data
The key
$t = \dfrac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$
Figure: histogram of height [cm] (x-axis) against frequency (y-axis)
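The t statistic above can be worked through in a few lines. A minimal sketch, assuming two independent groups with equal variance (which is what the pooled standard deviation s_p implies); the function name and the height values are made up for illustration:

```python
import math
from statistics import mean, variance


def pooled_t_statistic(x1, x2):
    """Two-sample t statistic with pooled standard deviation s_p."""
    n1, n2 = len(x1), len(x2)
    df = n1 + n2 - 2
    # s_p combines the two sample variances, weighted by their degrees of freedom.
    s_p = math.sqrt(((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / df)
    t = (mean(x1) - mean(x2)) / (s_p * math.sqrt(1 / n1 + 1 / n2))
    return t, df


# Hypothetical heights in cm for two small groups.
t_value, degrees_of_freedom = pooled_t_statistic([170, 172, 174], [176, 178, 180])
print(t_value, degrees_of_freedom)
```

The same pattern, a difference between groups divided by an estimate of its variability, underlies most of the DE tests discussed here.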
Generic recipe
- model the data, e.g. gene expression with random error
- fit the model to the data and/or the data to the model, and estimate the model parameters
- use the model for prediction and/or inference
Important implication: the better the model fits the data, the better the statistics
Common distributions
Negative Binomial
Figure: three histograms of negative binomial read counts (x-axis: read counts, y-axis: frequency)
$\mu = \mathit{mu}$
$\sigma^2 = \mathit{mu} + \dfrac{\mathit{mu}^2}{\mathit{size}}$
mu is the mean expression and size is the dispersion parameter, which is inversely related to the variance. NB fits bulk RNA-seq data very well and is used by most statistical methods designed for such data. In addition, it has been shown to fit the distribution of molecule counts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grun et al. 2014, Islam et al. 2011).
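To connect the mean-variance relationship with the recipe of estimating model parameters, here is a minimal sketch that evaluates the NB variance and recovers mu and size from counts by the method of moments. This is a simple illustration under the stated parameterisation, not the estimation procedure of any particular package; the function names and toy counts are hypothetical:

```python
import math
from statistics import mean, variance


def nb_variance(mu, size):
    """Variance of a negative binomial with mean mu and dispersion parameter size."""
    return mu + mu ** 2 / size


def fit_nb_moments(counts):
    """Method-of-moments estimates of (mu, size) from observed counts."""
    mu = mean(counts)
    var = variance(counts)  # sample variance
    if var <= mu:
        # No overdispersion: the NB approaches a Poisson as size grows.
        return mu, math.inf
    return mu, mu ** 2 / (var - mu)


toy_counts = [0, 2, 5, 1, 12, 0, 3, 9]
mu_hat, size_hat = fit_nb_moments(toy_counts)
print(mu_hat, size_hat, nb_variance(mu_hat, size_hat))
```

A small size means strong overdispersion; as size grows, the variance approaches the mean.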
Common distributions
Zero-inflated NB
Figure: two histograms of zero-inflated negative binomial read counts (x-axis: read counts, y-axis: frequency)
$\mu = \mathit{mu} \cdot (1 - d)$
$\sigma^2 = \mathit{mu} \cdot (1 - d) \cdot \left(1 + d \cdot \mathit{mu} + \dfrac{\mathit{mu}}{\mathit{size}}\right)$
d is the dropout rate. The dropout rate of a gene is strongly correlated with the mean expression of the gene. Different zero-inflated negative binomial models use different relationships between mu and d, and some may fit mu and d to the expression of each gene independently. Implemented in MAST, SCDE.
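The variance expression above follows from treating a zero-inflated count as the product of an expressed/dropout indicator and an NB count. A short derivation, assuming the indicator B and the NB count N are independent (the symbols B and N are introduced here for illustration and are not part of the slide notation):

```latex
\begin{align*}
Y &= B \cdot N, \qquad B \sim \mathrm{Bernoulli}(1 - d), \qquad
     N \sim \mathrm{NB}(\mathit{mu}, \mathit{size}) \\
E[Y] &= (1 - d)\,\mathit{mu} \\
E[Y^2] &= (1 - d)\left(\mathit{mu} + \frac{\mathit{mu}^2}{\mathit{size}} + \mathit{mu}^2\right) \\
\mathrm{Var}(Y) &= E[Y^2] - E[Y]^2
   = (1 - d)\,\mathit{mu}\left(1 + \frac{\mathit{mu}}{\mathit{size}} + \mathit{mu} - (1 - d)\,\mathit{mu}\right) \\
  &= \mathit{mu}\,(1 - d)\left(1 + d \cdot \mathit{mu} + \frac{\mathit{mu}}{\mathit{size}}\right)
\end{align*}
```

Setting d = 0 recovers the plain NB variance, mu + mu^2/size.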
Common distributions
Poisson-Beta
Figure: two histograms of Poisson-Beta read counts (x-axis: read counts, y-axis: frequency)
$\mu = g \cdot \dfrac{a}{a + b}$
$\sigma^2 = \mu + g^2 \cdot \dfrac{a \cdot b}{(a + b + 1)(a + b)^2}$
a is the rate of activation of transcription, b is the rate of inhibition of transcription, and g is the rate of transcript production while transcription is active at the locus. Differential expression methods may test each of the parameters for differences across groups, or only one (often g). Implemented in BPSC.
May be further expanded to explicitly account for other sources of gene expression differences, such as batch effect or library depth, depending on the particular DE algorithm.
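A small simulation can make the Poisson-Beta model tangible. The sketch below assumes the usual two-stage formulation (draw an activity fraction p from Beta(a, b), then a count from Poisson(g * p)), for which the law of total variance gives mean g*a/(a+b) and variance equal to that mean plus g^2*a*b/((a+b+1)(a+b)^2). The function names, the parameter values and the simple Poisson sampler are all choices made for this example:

```python
import math
import random
from statistics import mean, variance


def poisson_sample(lam, rng):
    """Knuth's multiplication method; adequate for moderate lam."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def poisson_beta_sample(a, b, g, rng):
    """One count: activity fraction from Beta(a, b), then Poisson(g * fraction)."""
    return poisson_sample(g * rng.betavariate(a, b), rng)


def poisson_beta_moments(a, b, g):
    """Theoretical mean and variance of the two-stage model."""
    mu = g * a / (a + b)
    var = mu + g * g * a * b / ((a + b + 1) * (a + b) ** 2)
    return mu, var


rng = random.Random(1)
draws = [poisson_beta_sample(2, 3, 20, rng) for _ in range(20000)]
print(poisson_beta_moments(2, 3, 20), mean(draws), variance(draws))
```

With a and b both below 1 the Beta distribution piles up near 0 and 1, which produces the bimodal on/off expression pattern.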
MAST (revisited)
- uses a generalized linear hurdle model
- designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable
- the rate of expression Z and the level of expression Y are modeled for each gene g, with Z indicating whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
- a logistic regression model for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):
  $\mathrm{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$
  $\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix
- model parameters are fitted using an empirical Bayesian framework
- allows for a joint estimate of nuisance and treatment effects
- DE is determined using the likelihood ratio test
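The two-part idea behind a hurdle model can be sketched for a single gene and a two-group comparison. This is a simplified illustration and not MAST itself: it has no covariates and no empirical Bayes shrinkage, and it assumes log-scale input in which 0 means "not detected". The discrete and continuous likelihood ratio statistics are added and compared with a chi-square distribution on 2 degrees of freedom, whose upper tail is exp(-x/2). All function names and data are hypothetical:

```python
import math


def bernoulli_loglik(k, n):
    """Maximised log-likelihood of k detections among n cells."""
    if k == 0 or k == n:
        return 0.0
    p = k / n
    return k * math.log(p) + (n - k) * math.log(1 - p)


def residual_ss(values, centre):
    return sum((v - centre) ** 2 for v in values)


def hurdle_lrt(group_a, group_b):
    """Two-part likelihood ratio test for one gene.

    Assumes each group has at least two detected cells that are not all equal.
    """
    pos_a = [v for v in group_a if v > 0]
    pos_b = [v for v in group_b if v > 0]

    # Discrete part: is the detection rate the same in both groups?
    ll_alt = (bernoulli_loglik(len(pos_a), len(group_a))
              + bernoulli_loglik(len(pos_b), len(group_b)))
    ll_null = bernoulli_loglik(len(pos_a) + len(pos_b),
                               len(group_a) + len(group_b))
    stat_discrete = max(0.0, 2 * (ll_alt - ll_null))

    # Continuous part: Gaussian model with shared variance on detected cells.
    pooled = pos_a + pos_b
    rss_null = residual_ss(pooled, sum(pooled) / len(pooled))
    rss_alt = (residual_ss(pos_a, sum(pos_a) / len(pos_a))
               + residual_ss(pos_b, sum(pos_b) / len(pos_b)))
    stat_continuous = max(0.0, len(pooled) * math.log(rss_null / rss_alt))

    stat = stat_discrete + stat_continuous
    return stat, math.exp(-stat / 2)  # chi-square upper tail, 2 df


group_a = [0, 0, 0, 0, 1.0, 1.2, 0.9, 1.1]
group_b = [0, 3.0, 3.2, 2.9, 3.1, 3.3, 2.8, 3.0]
statistic, p_value = hurdle_lrt(group_a, group_b)
print(statistic, p_value)
```

A gene can be flagged because its detection rate differs, because its expression level among detected cells differs, or both.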
SCDE
- models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution
- the NB distribution models the transcripts that are amplified and detected
- the Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)
- a subset of robust genes is used to fit, via the EM algorithm, the parameters of the mixture of models
- for DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach
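The mixture idea can be illustrated with the calculation an EM algorithm repeats in its E-step: given current parameter values, how likely is it that an observed count came from the dropout (Poisson) component rather than the amplified (NB) component? This is a sketch of the general technique, not SCDE's implementation; the parameter values and function names are hypothetical:

```python
import math


def nb_pmf(k, mu, size):
    """Negative binomial probability of count k (mean mu, dispersion size)."""
    log_p = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
             + size * math.log(size / (size + mu))
             + k * math.log(mu / (size + mu)))
    return math.exp(log_p)


def poisson_pmf(k, lam):
    """Poisson probability of count k (rate lam > 0)."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def dropout_posterior(k, dropout_weight, lam, mu, size):
    """Posterior probability that count k belongs to the dropout component."""
    dropout = dropout_weight * poisson_pmf(k, lam)
    amplified = (1 - dropout_weight) * nb_pmf(k, mu, size)
    return dropout / (dropout + amplified)


# Hypothetical parameters: 30% dropouts with background rate 0.1,
# amplified signal with mean 50 and dispersion parameter 2.
for count in (0, 1, 5, 50):
    print(count, dropout_posterior(count, 0.3, 0.1, 50, 2))
```

In a full EM fit these posterior weights would be used to re-estimate the mixture parameters, and the two steps would alternate until convergence.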
Monocle
- originally designed for ordering cells by progress through differentiation stages (pseudo-time)
- the mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as
  $g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$
  where Y is a specific gene expression level, $x_i$ are predictor variables, g is a link function (typically the log function), and $f_i$ are non-parametric functions (e.g. cubic splines)
- the observable expression level Y is then modeled using the GAM
  $E(Y) = s(\varphi_t(b_x, s_i)) + \varepsilon$
  where $\varphi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and s is a cubic smoothing function with three degrees of freedom. The error term $\varepsilon$ is normally distributed with a mean of zero
- the DE test is performed using an approximate $\chi^2$ likelihood ratio test
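The final step, an approximate chi-square likelihood ratio test, can be sketched independently of how the two models were fitted. The example below assumes the log-likelihoods of a full and a reduced model are already available and that the number of extra parameters (df) is a positive integer; the log-likelihood values and the choice of df = 3 are placeholders, not output from Monocle:

```python
import math


def chi2_sf(x, df):
    """Upper tail probability of a chi-square distribution with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2
    if df % 2 == 0:
        # Closed form for even df: exp(-x/2) * sum_{r < df/2} (x/2)^r / r!
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= half / r
            total += term
        return math.exp(-half) * total
    # Odd df: normal tail plus a finite correction sum.
    total = 0.0
    term = math.sqrt(x)
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return (math.erfc(math.sqrt(half))
            + math.sqrt(2 / math.pi) * math.exp(-half) * total)


def likelihood_ratio_test(loglik_full, loglik_reduced, df):
    """Return the LR statistic and its approximate chi-square p-value."""
    statistic = 2 * (loglik_full - loglik_reduced)
    return statistic, chi2_sf(statistic, df)


# Placeholder log-likelihoods for a model with and without the smooth term.
lr_statistic, lr_p_value = likelihood_ratio_test(-120.0, -126.0, df=3)
print(lr_statistic, lr_p_value)
```

The chi-square reference distribution is an approximation that improves with the number of cells.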
Performance
No ground truth, i.e. no independently validated truth is available for testing
Known data
- using data we know something about to get positive controls
Simulated data
- null data sets by re-sampling
- modeling data sets based on various distributions
Comparing between methods and scenarios
- comparing numbers of DE genes, incl. as a function of group size
Investigating results
- what do the expression levels and distributions of the detected DE genes look like?
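One way to build null data by re-sampling is to shuffle the group labels: after shuffling, any apparent difference between the groups is due to chance alone. A minimal sketch of this idea as a permutation test on the difference in means for one gene; the function name, the number of permutations and the toy values are assumptions made for this example:

```python
import random
from statistics import mean


def permutation_pvalue(group_a, group_b, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # re-assign cells to groups at random
        null_diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if null_diff >= observed:
            hits += 1
    # Add one to numerator and denominator so the p-value is never exactly 0.
    return (hits + 1) / (n_perm + 1)


p_different = permutation_pvalue([0, 0, 1, 0, 2, 0, 1, 0], [5, 7, 6, 9, 4, 8, 6, 7])
p_same = permutation_pvalue([1, 2, 3, 4], [1, 2, 3, 4])
print(p_different, p_same)
```

Running a DE method on such shuffled data gives an estimate of how many genes it calls when there is nothing to find.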
- Outline
- Introduction
- Common methods
-
- More detailed examples
-
- Performance
- Practicalities
- Summary
- Bibliography
-
Outline
OutlineIntroduction what is so special about scRNA-seq DECommon methods what is out therePerformance how do we know what is bestPracticalities what to do in real lifeSummary what to remember from this hour
Olga (NBIS) scRNA-seq DE February 2019 2 46
Introduction
Introduction
Olga (NBIS) scRNA-seq DE February 2019 3 46
Introduction
What does differential expression mean to you
httpswwwmenticom
Olga (NBIS) scRNA-seq DE February 2019 4 46
Introduction
Figure Simplified scRNA-seq workflow [adapted from Wikipedia]
Olga (NBIS) scRNA-seq DE February 2019 5 46
Introduction
Figure Simplified scRNA-seq workflow [adapted from Wikipedia]
Olga (NBIS) scRNA-seq DE February 2019 6 46
Introduction
adapted from Wu et al 2017
Differential expression meanstaking read count data amp
performing statistical analysis to discoverquantitative changes in expression levels betweenexperimental groups
ie to decide whether for a given gene anobserved difference in read counts is significant(greater than what would be expected just due tonatural random variation)
Differential expression is an old problemknown from bulk RNA-seq and microarray studies
in fact building on one of the most commonstatistical problems ie comparing groups forstatistical differences
Olga (NBIS) scRNA-seq DE February 2019 7 46
Introduction
adapted from Wu et al 2017
Differential expression meanstaking read count data amp
performing statistical analysis to discoverquantitative changes in expression levels betweenexperimental groups
ie to decide whether for a given gene anobserved difference in read counts is significant(greater than what would be expected just due tonatural random variation)
Differential expression is an old problemknown from bulk RNA-seq and microarray studies
in fact building on one of the most commonstatistical problems ie comparing groups forstatistical differences
Olga (NBIS) scRNA-seq DE February 2019 7 46
Introduction
Differential expression is an old problem
So what is all the commotion about
httpswwwmenticom amp 25 06 78
scRNA-seq special characteristicshigh noise levels (technical and biological factors)low library sizeslow amount of available mRNAs results in amplification biases anddropout events3rsquo bias partial coverage and uneven depth (technical)stochastic nature of transcription (biological)multimodality in gene expression presence of multiple possiblecell states within a cell population (biological)
Olga (NBIS) scRNA-seq DE February 2019 8 46
Introduction
Differential expression is an old problem
So what is all the commotion about
httpswwwmenticom amp 25 06 78
scRNA-seq special characteristicshigh noise levels (technical and biological factors)low library sizeslow amount of available mRNAs results in amplification biases anddropout events3rsquo bias partial coverage and uneven depth (technical)stochastic nature of transcription (biological)multimodality in gene expression presence of multiple possiblecell states within a cell population (biological)
Olga (NBIS) scRNA-seq DE February 2019 8 46
Introduction
Rbm17 Rragc Slc1a3 Slc22a20 Smarcd1
Mybpc1 Nars Ndufa3 Nono Pgam2
Crispld2 Fbxw13 Hbxip Katna1 Lcorl
1300018J18Rik Arid2 Bend3 Ccdc104 Ccnt1
00 25 50 75 100 00 25 50 75 100 00 25 50 75 100 00 25 50 75 100 00 25 50 75 100
00
01
02
03
0
1
2
3
00
01
02
03
00
01
02
03
00
01
02
03
000
005
010
015
020
025
00
02
04
06
00
01
02
03
04
05
00
01
02
03
00
01
02
03
00
02
04
00
05
10
15
00
01
02
03
00
05
10
15
20
00
02
04
000
005
010
015
0
1
2
3
4
00
05
10
15
00
01
02
03
04
05
00
01
02
03
04
value
dens
ity
Based on tutorial data
Olga (NBIS) scRNA-seq DE February 2019 9 46
Common methods
Common methods
Simplified scRNA-seq workflow [adapted from http://hemberg-lab.github.io]
Generic non-parametric methods
- e.g. Wilcoxon rank-sum test, Kruskal-Wallis test, Kolmogorov-Smirnov test
- non-parametric tests generally convert observed expression values to ranks and test whether the distribution of ranks for one group is significantly different from the distribution of ranks for the other group
- some non-parametric methods fail in the presence of a large number of tied values, as is the case for dropouts (zeros) in single-cell RNA-seq expression data
- if the conditions for a parametric test hold, then it will typically be more powerful than a non-parametric test
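To make the rank idea and the tie problem concrete, here is a minimal Python sketch of a two-sided Wilcoxon rank-sum test. It assumes the normal approximation with the usual tie correction and no continuity correction; the function names and the toy counts are illustrative and not taken from any package mentioned in these slides.

```python
import math


def midranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_sizes = []
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        tie_sizes.append(j - i + 1)
        i = j + 1
    return ranks, tie_sizes


def rank_sum_test(group_a, group_b):
    """Return (U statistic for group_a, two-sided p-value)."""
    n1, n2 = len(group_a), len(group_b)
    n = n1 + n2
    ranks, tie_sizes = midranks(list(group_a) + list(group_b))
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    # Ties (e.g. many zeros) shrink the variance of U.
    tie_term = sum(t ** 3 - t for t in tie_sizes) / (n * (n - 1))
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term)
    if var_u == 0:  # every value is tied: the ranks carry no information
        return u, 1.0
    z = (u - mean_u) / math.sqrt(var_u)
    return u, math.erfc(abs(z) / math.sqrt(2))


# Toy expression values for one gene in two groups of cells (made up).
u_stat, p_value = rank_sum_test([0, 0, 0, 0, 3, 5, 0, 2], [0, 4, 6, 7, 0, 5, 8, 3])
```

When every value is a dropout, all ranks are tied and the test cannot say anything, which is the extreme case of the tie problem noted above.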
Methods developed for bulk RNA-seq
- e.g. edgeR, DESeq2
- compare estimates of mean expression (sample size)
- based on the negative binomial distribution
- can be assessed on datasets where RNA-seq data has been validated by RT-qPCR
Methods developed for scRNA-seq
- e.g. MAST, SCDE, Monocle, Pagoda, D3E, etc.
- single-cell experiments have a large number of samples (i.e. cells) for each group we are comparing, so we can take advantage of the whole distribution of expression values in each group to identify differences between groups
- we usually do not have a defined set of experimental conditions; instead we try to identify the cell groups by using an unsupervised clustering approach
Miao and Zhang 2016
Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR, and Census counts with monocle.
Short name | Method | Software version | Input | Available from | Reference
BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11]
D3E | D3E | D3E 1.0 | raw counts | GitHub | [12]
DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13]
DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14]
edgeRLRT | edgeRLRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17]
edgeRLRTcensus | edgeRLRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17]
edgeRLRTdeconv | edgeRLRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18]
edgeRLRTrobust | edgeRLRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19]
edgeRQLF | edgeRQLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
edgeRQLFDetRate | edgeRQLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22]
MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24]
monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25]
monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26]
monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25]
NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27]
ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29]
ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29]
ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29]
SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30]
scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31]
SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32]
SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33]
ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35]
voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22]
Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36]
Nature Methods, doi:10.1038/nmeth.4612 (Soneson and Robinson 2018)
More detailed examples
MAST
- uses a generalized linear hurdle model
- designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable
- the rate of expression Z and the level of expression Y are modeled for each gene g, where Z indicates whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
- a logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):
  $\operatorname{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$
  $\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix
- model parameters are fitted using an empirical Bayesian framework
- allows for a joint estimate of nuisance and treatment effects
- DE is determined using the likelihood ratio test
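The two-part structure of the hurdle model can be sketched for the simplest possible design, a single two-group indicator. This is not the MAST implementation: it assumes log-scale expression values, leaves out the empirical Bayes shrinkage and any further covariates, and the function names are made up for illustration. With only a group indicator, the logistic part reduces to comparing detection rates and the Gaussian part to comparing group means, so both likelihood ratio statistics have closed forms.

```python
import math


def binom_loglik(k, n):
    """Maximised Bernoulli log-likelihood for k expressed cells out of n."""
    if 0 < k < n:
        p = k / n
        return k * math.log(p) + (n - k) * math.log(1 - p)
    return 0.0  # k == 0 or k == n: the fitted likelihood equals 1


def residual_ss(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def hurdle_lrt(group_a, group_b):
    """Likelihood ratio test of a two-group hurdle model for one gene.

    Discrete part: Z = 1 if expression > 0, one detection rate per group.
    Continuous part: Y | Z = 1 is Gaussian with one mean per group.
    Returns (statistic, p_value); the statistic is referred to chi-square(2).
    """
    pos_a = [y for y in group_a if y > 0]
    pos_b = [y for y in group_b if y > 0]

    ll_alt = binom_loglik(len(pos_a), len(group_a)) + binom_loglik(len(pos_b), len(group_b))
    ll_null = binom_loglik(len(pos_a) + len(pos_b), len(group_a) + len(group_b))
    stat_discrete = 2 * (ll_alt - ll_null)

    stat_continuous = 0.0
    if pos_a and pos_b:
        rss_null = residual_ss(pos_a + pos_b)
        rss_alt = residual_ss(pos_a) + residual_ss(pos_b)
        if rss_alt > 0:
            # Gaussian LRT with a common variance: n * log(RSS_null / RSS_alt)
            stat_continuous = (len(pos_a) + len(pos_b)) * math.log(rss_null / rss_alt)

    stat = max(0.0, stat_discrete + stat_continuous)
    return stat, math.exp(-stat / 2)  # chi-square survival function for 2 df
```

A gene can come out as DE through either part: a change in how often it is detected, a change in its level when detected, or both.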
Let's stop for a minute
The key
$\text{Outcome}_i = (\text{Model}_i) + \text{error}_i$
- we collect data on a sample from a much larger population
- statistics lets us make inferences about the population from which the sample was derived
- we try to predict the outcome given a model fitted to the data
The key
$t = \dfrac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$
Figure: Histogram of height [cm] (y-axis: frequency).
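As a small worked example of the two-sample t statistic with pooled standard deviation $s_p$, the following Python sketch computes it directly from two samples. The height values are invented for illustration only.

```python
import math


def pooled_t_statistic(x1, x2):
    """Two-sample t statistic with a pooled standard deviation s_p."""
    n1, n2 = len(x1), len(x2)
    mean1 = sum(x1) / n1
    mean2 = sum(x2) / n2
    ss1 = sum((x - mean1) ** 2 for x in x1)
    ss2 = sum((x - mean2) ** 2 for x in x2)
    s_p = math.sqrt((ss1 + ss2) / (n1 + n2 - 2))
    return (mean1 - mean2) / (s_p * math.sqrt(1 / n1 + 1 / n2))


# Hypothetical heights in cm for two groups.
group_1 = [168.0, 171.5, 173.0, 175.5, 170.0]
group_2 = [172.0, 176.5, 178.0, 174.5, 179.0]
t_value = pooled_t_statistic(group_1, group_2)
```

The statistic is compared with a t distribution with n1 + n2 - 2 degrees of freedom; the same "difference divided by its standard error" logic underlies many of the DE tests discussed here.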
Generic recipe
- model the data, e.g. gene expression with random error
- fit the model to the data and/or the data to the model
- estimate the model parameters
- use the model for prediction and/or inference
Important implication: the better the model fits the data, the better the statistics.
Common distributions
Negative Binomial
Figure: Three example histograms of negative binomial read counts (x-axis: read counts, y-axis: frequency).
$\mu = \mathrm{mu}$
$\sigma^2 = \mathrm{mu} + \dfrac{\mathrm{mu}^2}{\mathrm{size}}$
mu is the mean expression and size is the dispersion parameter, which is inversely related to the variance. NB fits bulk RNA-seq data very well and is used by most statistical methods designed for such data. In addition, it has been shown to fit the distribution of molecule counts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grün et al. 2014; Islam et al. 2011).
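The mean and variance formulas for the negative binomial can be checked by simulation. The Python sketch below draws NB counts as a gamma-Poisson mixture, a standard construction; the helper names, the Knuth-style Poisson sampler and the parameter values are choices made for this example rather than anything prescribed by the slides.

```python
import math
import random


def nb_moments(mu, size):
    """Mean and variance of a negative binomial with mean mu and dispersion size."""
    return mu, mu + mu ** 2 / size


def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for the small rates used here."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def sample_nb(mu, size, n, seed=0):
    """Draw n NB counts: a Poisson whose rate is Gamma(shape=size, scale=mu/size)."""
    rng = random.Random(seed)
    return [sample_poisson(rng.gammavariate(size, mu / size), rng) for _ in range(n)]


counts = sample_nb(mu=4, size=2, n=5000, seed=42)
sample_mean = sum(counts) / len(counts)
sample_var = sum((c - sample_mean) ** 2 for c in counts) / (len(counts) - 1)
expected_mean, expected_var = nb_moments(4, 2)  # (4, 12.0)
```

A smaller size gives a larger variance for the same mean, which is what "inversely related to the variance" means in practice.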
Common distributions
Zero-inflated NB
Figure: Two example histograms of zero-inflated negative binomial read counts (x-axis: read counts, y-axis: frequency).
$\mu = \mathrm{mu} \cdot (1 - d)$
$\sigma^2 = \mathrm{mu} \cdot (1 - d) \cdot \left(1 + d \cdot \mathrm{mu} + \dfrac{\mathrm{mu}}{\mathrm{size}}\right)$
Here mu and size are the parameters of the negative binomial component and d is the dropout rate. The dropout rate of a gene is strongly correlated with the mean expression of the gene. Different zero-inflated negative binomial models use different relationships between mu and d, and some may fit mu and d to the expression of each gene independently. Implemented in MAST, SCDE.
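The zero-inflated NB moments can be cross-checked against a direct calculation. The sketch below reads mu and size as the parameters of the negative binomial component and d as the probability of a dropout, and compares the closed-form variance with the one obtained from the second moment of the mixture. Function names are illustrative.

```python
def zinb_moments(mu, size, d):
    """Closed-form mean and variance of a zero-inflated negative binomial."""
    mean = mu * (1 - d)
    variance = mu * (1 - d) * (1 + d * mu + mu / size)
    return mean, variance


def zinb_variance_from_mixture(mu, size, d):
    """Same variance, derived as E[X^2] - E[X]^2 for X = 0 w.p. d, NB otherwise."""
    nb_variance = mu + mu ** 2 / size
    second_moment = (1 - d) * (nb_variance + mu ** 2)
    return second_moment - (mu * (1 - d)) ** 2


mean, variance = zinb_moments(mu=4, size=2, d=0.5)  # (2.0, 10.0)
```

With d = 0 the expressions reduce to the plain negative binomial moments.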
Common distributions
Poisson-Beta
Figure: Two example histograms of Poisson-Beta read counts (x-axis: read counts, y-axis: frequency).
$\mu = g \cdot \dfrac{a}{a + b}$
$\sigma^2 = g \cdot \dfrac{a}{a + b} + g^2 \cdot \dfrac{a \cdot b}{(a + b + 1) \cdot (a + b)^2}$
The first variance term is the Poisson sampling variance; the second is the variance of the Beta-distributed transcription rate.
a is the rate of activation of transcription, b is the rate of inhibition of transcription, and g is the rate of transcript production while transcription is active at the locus. Differential expression methods may test each of the parameters for differences across groups, or only one (often g). Implemented in BPSC.
The models above may be further expanded to explicitly account for other sources of gene expression differences, such as batch effect or library depth, depending on the particular DE algorithm.
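A short simulation can illustrate the Poisson-Beta model: a Beta(a, b) draw gives the fraction of time the gene is active, and the observed count is Poisson with rate g times that fraction. The sketch below is written under that assumption. It reports the variance of the latent rate separately from the variance of the counts, which by the law of total variance also includes the Poisson term. The names and parameter values are illustrative.

```python
import math
import random


def poisson_beta_moments(a, b, g):
    """Return (mean, variance of the latent rate, variance of the counts)."""
    mean = g * a / (a + b)
    rate_variance = g ** 2 * a * b / ((a + b + 1) * (a + b) ** 2)
    return mean, rate_variance, mean + rate_variance


def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for the moderate rates used here."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def sample_poisson_beta(a, b, g, n, seed=0):
    """Draw n counts: Poisson with rate g * p, where p ~ Beta(a, b)."""
    rng = random.Random(seed)
    return [sample_poisson(g * rng.betavariate(a, b), rng) for _ in range(n)]


counts = sample_poisson_beta(a=1, b=1, g=10, n=5000, seed=7)
```

Small a and b push the Beta draws towards 0 and 1, which produces the bimodal "on/off" count distributions typical of bursty transcription.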
MAST (revisited)
- uses a generalized linear hurdle model
- designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable
- the rate of expression Z and the level of expression Y are modeled for each gene g, where Z indicates whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
- a logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):
  $\operatorname{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$
  $\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix
- model parameters are fitted using an empirical Bayesian framework
- allows for a joint estimate of nuisance and treatment effects
- DE is determined using the likelihood ratio test
SCDE
- models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution
- the NB distribution models the transcripts that are amplified and detected
- the Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)
- a subset of robust genes is used to fit the parameters of the mixture of models via the EM algorithm
- for DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach
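The mixture idea can be illustrated with the calculation that sits at the heart of an EM fit: given current parameter values, how likely is it that an observed count came from the background (Poisson) component rather than the amplified (NB) component? The sketch below is not the SCDE code. The fixed mixing weight, the background rate of 0.1 and the function names are all assumptions made for this example.

```python
import math


def nb_logpmf(k, mu, size):
    """Log-probability of count k under a negative binomial with mean mu."""
    return (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
            + size * math.log(size / (size + mu))
            + k * math.log(mu / (size + mu)))


def poisson_logpmf(k, lam):
    """Log-probability of count k under a Poisson with rate lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def dropout_posterior(k, mu, size, lam, dropout_weight):
    """Posterior probability (Bayes' rule) that count k is a dropout event."""
    dropout = dropout_weight * math.exp(poisson_logpmf(k, lam))
    amplified = (1 - dropout_weight) * math.exp(nb_logpmf(k, mu, size))
    return dropout / (dropout + amplified)


# A zero is most likely a dropout, a count of 20 almost certainly is not.
p_zero = dropout_posterior(0, mu=10, size=2, lam=0.1, dropout_weight=0.3)
p_high = dropout_posterior(20, mu=10, size=2, lam=0.1, dropout_weight=0.3)
```

In an EM fit these posterior probabilities (the E-step) are used as weights when the component parameters are re-estimated (the M-step).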
Monocle
- originally designed for ordering cells by progress through differentiation stages (pseudo-time)
- the mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as
  $g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$,
  where Y is a specific gene expression level, $x_i$ are predictor variables, g is a link function (typically the log function), and $f_i$ are non-parametric functions (e.g. cubic splines)
- the observable expression level Y is then modeled using the GAM
  $E(Y) = s(\varphi_t(b_x, s_i)) + \epsilon$,
  where $\varphi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and s is a cubic smoothing function with three degrees of freedom; the error term $\epsilon$ is normally distributed with a mean of zero
- the DE test is performed using an approximate $\chi^2$ likelihood ratio test
Performance
No ground truth, i.e. no independently validated truth is available for testing.
- Known data: using data we know something about to get positive controls
- Simulated data: null data sets by re-sampling; modeling data sets based on various distributions
- Comparing between methods and scenarios: comparing numbers of DE genes, incl. as a function of group size
- Investigating results: what do the expression and the distributions of the detected DE genes look like?
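One way to build a null data set by re-sampling is to take cells from a single homogeneous population and split them at random into two pseudo-groups. Any gene a method then calls DE is a false positive by construction. A minimal Python sketch, with hypothetical cell identifiers and a made-up function name:

```python
import random


def null_split(cell_ids, group_size, seed=0):
    """Randomly pick two disjoint pseudo-groups from one population of cells."""
    rng = random.Random(seed)
    chosen = rng.sample(list(cell_ids), 2 * group_size)
    return chosen[:group_size], chosen[group_size:]


# 96 hypothetical cells from the same condition, split into 2 x 24.
cells = [f"cell_{i:02d}" for i in range(96)]
pseudo_a, pseudo_b = null_split(cells, group_size=24, seed=1)
```

Repeating the split with different seeds and group sizes gives several null instances, so the number of false positive calls can be studied as a function of group size.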
- False positives (type I error) vs false negatives (type II error)
- Sensitivity and specificity
- Precision and recall
Figure adapted from Wikipedia
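These quantities all derive from the same four counts of a confusion table. The Python sketch below computes them for a hypothetical method evaluated on a simulation with 10,000 genes, of which 2,000 are truly DE; the counts of calls are invented for illustration.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard performance measures from true/false positive/negative counts."""
    return {
        "sensitivity": tp / (tp + fn),          # also called recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "false_positive_rate": fp / (fp + tn),  # type I error rate
        "false_negative_rate": fn / (fn + tp),  # type II error rate
    }


# Hypothetical result: 1,500 of the 2,000 true DE genes found, 400 false calls.
metrics = classification_metrics(tp=1500, fp=400, tn=7600, fn=500)
```

Sensitivity and specificity describe the method relative to the truth, whereas precision describes how trustworthy the reported list is, which is usually what matters when selecting genes for follow-up.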
Dal Molin, Baruzzo and Di Camillo 2017: 2 conditions of 100 cells each, simulated with 10,000 genes, of which 2,000 were set as DE (based on NB and bimodal distributions)
Consistency
Miao et al 2017
And so much more
Soneson and Robinson 2018, "Bias, robustness and scalability in single-cell differential expression analysis":
- 36 statistical approaches for DE analysis to compare the expression levels in the two groups of cells
- based on 9 data sets with 11-21 separate instances (sample size effect)
- extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.
- conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets
Practicalities
Getting to know your data
- Example data: 46078 genes x 96 cells
- 22229 genes with no expression at all
Figure: Histogram of read counts (x-axis: read counts, y-axis: frequency) and histogram of the number of zero counts (x-axis: 0 counts, y-axis: frequency).
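Summaries like "genes with no expression at all" are quick to compute from a genes x cells count matrix. The Python sketch below does this for a tiny invented matrix and also returns the per-cell detection rate (the fraction of genes detected in each cell). The function name and the data are illustrative.

```python
def summarise_counts(counts):
    """counts: one row per gene, each row a list of per-cell read counts.

    Returns (number of genes with zero counts in every cell,
             per-cell detection rate).
    """
    n_genes = len(counts)
    n_cells = len(counts[0]) if counts else 0
    all_zero_genes = sum(1 for row in counts if not any(row))
    detection_rate = [
        sum(1 for row in counts if row[j] > 0) / n_genes for j in range(n_cells)
    ]
    return all_zero_genes, detection_rate


# Toy matrix: 4 genes x 3 cells.
toy = [
    [0, 0, 0],
    [1, 0, 2],
    [0, 0, 5],
    [3, 0, 0],
]
n_silent, rates = summarise_counts(toy)  # 1 silent gene; rates [0.5, 0.0, 0.5]
```

Genes that are zero everywhere carry no information for DE testing and are usually filtered out first; cells with a very low detection rate are candidates for QC filtering.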
Choosing DE methods
Soneson and Robinson 2018
Remembering the bigger picture
Stegle, Teichmann and Marioni 2015
- QC filtering
- Cell-cycle phase
- Normalization of cell-specific biases
- Confounding factors, incl. batch effects
- Detection rate, i.e. the fraction of detected genes per cell
- Imputation strategies for dropout values
- What is pragmatic: programming language, platform, speed, collaborative workflows, etc.
Staying critical
Summary
What to remember from this hour?
https://www.menti.com & 25 06 78
Growing field
Angerer et al 2017
Growing field
https://www.scrna-tools.org/tools
Zappia Phipson and Oshlack 2018
Summary
- scRNA-seq is a rapidly growing field
- DE is a common task, so many newer and better methods will be developed
- understanding basic statistical concepts enables one to think more like a statistician when choosing and evaluating methods for a given data set
- staying critical, staying updated, staying connected
Bibliography
Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1–9. ISSN 1367-4803. doi:10.1093/bioinformatics/bty329.
Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243–260. ISSN 2095-4697. doi:10.1007/s40484-016-0089-7.
Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255–261. ISSN 1548-7105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612.
Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN 1664-8021. doi:10.3389/fgene.2017.00062.
Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data." no. May: 1–2. ISSN 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549.
Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133–145.
Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85–91. ISSN 2452-3100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004.
Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv: 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573.
Introduction
Differential expression is an old problem
So what is all the commotion about
httpswwwmenticom amp 25 06 78
scRNA-seq special characteristicshigh noise levels (technical and biological factors)low library sizeslow amount of available mRNAs results in amplification biases anddropout events3rsquo bias partial coverage and uneven depth (technical)stochastic nature of transcription (biological)multimodality in gene expression presence of multiple possiblecell states within a cell population (biological)
Olga (NBIS) scRNA-seq DE February 2019 8 46
Introduction
Differential expression is an old problem
So what is all the commotion about
httpswwwmenticom amp 25 06 78
scRNA-seq special characteristicshigh noise levels (technical and biological factors)low library sizeslow amount of available mRNAs results in amplification biases anddropout events3rsquo bias partial coverage and uneven depth (technical)stochastic nature of transcription (biological)multimodality in gene expression presence of multiple possiblecell states within a cell population (biological)
Olga (NBIS) scRNA-seq DE February 2019 8 46
Introduction
Rbm17 Rragc Slc1a3 Slc22a20 Smarcd1
Mybpc1 Nars Ndufa3 Nono Pgam2
Crispld2 Fbxw13 Hbxip Katna1 Lcorl
1300018J18Rik Arid2 Bend3 Ccdc104 Ccnt1
00 25 50 75 100 00 25 50 75 100 00 25 50 75 100 00 25 50 75 100 00 25 50 75 100
00
01
02
03
0
1
2
3
00
01
02
03
00
01
02
03
00
01
02
03
000
005
010
015
020
025
00
02
04
06
00
01
02
03
04
05
00
01
02
03
00
01
02
03
00
02
04
00
05
10
15
00
01
02
03
00
05
10
15
20
00
02
04
000
005
010
015
0
1
2
3
4
00
05
10
15
00
01
02
03
04
05
00
01
02
03
04
value
dens
ity
Based on tutorial data
Olga (NBIS) scRNA-seq DE February 2019 9 46
Common methods
Common methods
Olga (NBIS) scRNA-seq DE February 2019 10 46
Common methods
Simplified scRNA-seq workflow [adopted from httphemberg-labgithubio
Olga (NBIS) scRNA-seq DE February 2019 11 46
Common methods
Generic non-parametric methodseg Wilcoxon rank-sum test Kruskal-Wallis Kolmogorov-Smirnov testnon-parametric tests generally convert observed expression values toranks amp test whether the distribution of ranks for one group aresignficantly different from the distribution of ranks for the other groupsome non-parametric methods fail in the presence of a large number oftied values such as the case for dropouts (zeros) in single-cell RNA-seqexpression dataif the conditions for a parametric test hold then it will typically be morepowerful than a non-parametric test
Olga (NBIS) scRNA-seq DE February 2019 12 46
Common methods
developed for bulk RNA-seqeg edgeR DE-seq2compare estimates of mean-expression (sample size)based on negative binomial distributioncan be assessed by datasets where RNA-seq data has beeen validatedby RT-qPCR
Olga (NBIS) scRNA-seq DE February 2019 13 46
Common methods
developed for scRNA-seq
eg MAST SCDE Monocle Pagoda D3E etclarge number of samples (ie cells) for each group we are comparing insingle-cell experiments Thus we can take advantage of the wholedistribution of expression values in each group to identify differencesbetween groupswe usually do not have a defined set of experimental conditions insteadwe try to identify the cell groups by using an unsupervised clusteringapproach
Olga (NBIS) scRNA-seq DE February 2019 14 46
Common methods
Miao and Zhang 2016
Olga (NBIS) scRNA-seq DE February 2019 15 46
Common methodsSupplementary Table 2 Evaluated dicrarrerential expression methods together with package versions and thetype of input values provided to each of them Note that ldquoraw countsrdquo here refers to length-scaled TPMswhich are on the scale of the raw counts but are unacrarrected by dicrarrerential isoform usage [10] CPM valuesare calculated with edgeR and Census counts with monocle
Short name Method Software version InputAvailablefrom
Reference
BPSC BPSC BPSC 09901 CPM GitHub [11]
D3E D3E D3E 10 raw counts GitHub [12]
DESeq2 DESeq2 DESeq2 1141 raw counts Bioconductor [13]
DESeq2betapFALSE DESeq2 without beta prior DESeq2 1141 raw counts Bioconductor [13]
DESeq2census DESeq2 DESeq2 1141 Census counts Bioconductor [13]
DESeq2nofiltDESeq2 without the built-in in-dependent filtering
DESeq2 1141 raw counts Bioconductor [13]
DEsingle DEsingle DEsingle 010 raw counts GitHub [14]
edgeRLRT edgeRLRT edgeR 3191 raw counts Bioconductor [15ndash17]
edgeRLRTcensus edgeRLRT edgeR 3191 Census counts Bioconductor [15ndash17]
edgeRLRTdeconvedgeRLRT with deconvolutionnormalization
edgeR 3191scran 120
raw counts Bioconductor [15 17 18]
edgeRLRTrobustedgeRLRT with robust disper-sion estimation
edgeR 3191 raw counts Bioconductor [15ndash17 19]
edgeRQLF edgeRQLF edgeR 3191 raw counts Bioconductor [15 16 20]
edgeRQLFDetRateedgeRQLF with cellular detec-tion rate as covariate
edgeR 3191 raw counts Bioconductor [15 16 20]
limmatrend limma-trend limma 33013 log2(CPM) Bioconductor [21 22]
MASTcpm MAST MAST 105 log2(CPM+1) Bioconductor [23]
MASTcpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(CPM+1) Bioconductor [23]
MASTtpm MAST MAST 105 log2(TPM+1) Bioconductor [23]
MASTtpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(TPM+1) Bioconductor [23]
metagenomeSeq metagenomeSeqmetagenomeSeq1160
raw counts Bioconductor [24]
monocle monocle (tobit) monocle 220 TPM Bioconductor [25]
monoclecensus monocle (Negative Binomial) monocle 220 Census counts Bioconductor [25 26]
monoclecount monocle (Negative Binomial) monocle 220 raw counts Bioconductor [25]
NODES NODESNODES0009010
raw countsAuthor-providedlink
[27]
ROTScpm ROTS ROTS 120 CPM Bioconductor [28 29]
ROTStpm ROTS ROTS 120 TPM Bioconductor [28 29]
ROTSvoom ROTS ROTS 120voom-transformedraw counts
Bioconductor [28 29]
SAMseq SAMseq samr 20 raw counts CRAN [30]
scDD scDD scDD 100 raw counts Bioconductor [31]
SCDE SCDE scde 220 raw counts Bioconductor [32]
SeuratBimod Seurat (bimod test) Seurat 1407 raw counts GitHub [33 34]
SeuratBimodnofiltSeurat (bimod test) without theinternal filtering
Seurat 1407 raw counts GitHub [33 34]
SeuratBimodIsExpr2Seurat (bimod test) with internalexpression threshold set to 2
Seurat 1407 raw counts GitHub [33 34]
SeuratTobit Seurat (tobit test) Seurat 1407 TPM GitHub [25 33]
ttest t-test stats (R v 33)TMM-normalizedTPM
CRAN [16 35]
voomlimma voom-limma limma 33013 raw counts Bioconductor [21 22]
Wilcoxon Wilcoxon test stats (R v 33)TMM-normalizedTPM
CRAN [16 36]
3
Nature Methods doi101038nmeth4612Soneson and Robinson 2018
Olga (NBIS) scRNA-seq DE February 2019 16 46
Common methods More detailed examples
More detailed examples
Olga (NBIS) scRNA-seq DE February 2019 17 46
Common methods More detailed examples
MAST
uses generalized linear hurdle model
designed to account for stochastic dropouts and bimodal expression distribution in whichexpression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene gindicating whether gene g is expressed in cell i (ie Zig = 0 if yig = 0 and zig = 1 ifyig gt 0)
A logistic regression model for the discrete variable Z and a Gaussian linear model for thecontinuous variable (Y|Z=1)
logit(Pr (Zig = 1)) = XiβDg
Pr (Yig = Y |Zig = 1) = N(XiβCg σ
2g) where Xi is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
Olga (NBIS) scRNA-seq DE February 2019 18 46
Common methods More detailed examples
Letrsquos stop for a minute
Olga (NBIS) scRNA-seq DE February 2019 19 46
Common methods More detailed examples
The key
Outcomei = (Modeli) + errori
we collect data on a sample from a much larger population
statistics lets us to make inferences about the population from which sample wasderived
we try to predict the outcome given a model fitted to the data
Olga (NBIS) scRNA-seq DE February 2019 20 46
Common methods More detailed examples
The key
t = x1minusx2
sp
radic1
n1+ 1
n2
height [cm]
Fre
quen
cy
165 170 175 180
010
3050
Olga (NBIS) scRNA-seq DE February 2019 21 46
Common methods More detailed examples
Generic recipemodel data eg gene expressionfit model to the data andor data to the modelestimate model parametersuse model for prediction andor inference
Olga (NBIS) scRNA-seq DE February 2019 22 46
Common methods More detailed examples
Generic recipemodel eg gene expression with random errorfit model to the data andor data to the model estimate modelparametersuse model for prediction andor inference
Important implicationthe better model fits to the data the better statistics
Olga (NBIS) scRNA-seq DE February 2019 23 46
Common methods More detailed examples
Common distributions
Negative Binomial
[Figure: three example negative binomial distributions; x-axis: read counts, y-axis: frequency]
$\mu = mu$
$\sigma^2 = mu + \dfrac{mu^2}{size}$
mu is the mean expression and size is the dispersion parameter, which is inversely related to the variance. NB fits bulk RNA-seq data very well and is used by most statistical methods designed for such data. In addition, it has been shown to fit the distribution of molecule counts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grun et al 2014, Islam et al 2011)
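The mean-variance relationship of the negative binomial can be checked numerically. The sketch below is an illustration with made-up function names: it evaluates the NB probability mass function in the (mu, size) parameterisation and sums it to recover the mean and variance.

```python
import math

def nb_variance(mu, size):
    """Variance implied by the (mu, size) parameterisation."""
    return mu + mu ** 2 / size

def nb_pmf(k, mu, size):
    """Negative binomial probability of observing count k."""
    log_p = (math.lgamma(k + size) - math.lgamma(k + 1) - math.lgamma(size)
             + size * math.log(size / (size + mu))
             + k * math.log(mu / (size + mu)))
    return math.exp(log_p)

def nb_moments_from_pmf(mu, size, k_max=2000):
    """Mean and variance obtained by summing the pmf up to k_max."""
    probs = [nb_pmf(k, mu, size) for k in range(k_max)]
    mean = sum(k * p for k, p in enumerate(probs))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return mean, var

mean, var = nb_moments_from_pmf(mu=5, size=2)
```

With mu = 5 and size = 2 the formula gives a variance of 17.5, well above the Poisson value of 5; increasing size shrinks the extra term and moves the distribution towards a Poisson.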
Zero-inflated NB
[Figure: two example zero-inflated negative binomial distributions; x-axis: read counts, y-axis: frequency]
$\mu = mu \cdot (1 - d)$
$\sigma^2 = mu \cdot (1 - d) \cdot \left(1 + d \cdot mu + \dfrac{mu}{size}\right)$
d is the dropout rate. The dropout rate of a gene is strongly correlated with the mean expression of the gene. Different zero-inflated negative binomial models use different relationships between mu and d, and some may fit mu and d to the expression of each gene independently. Implemented in MAST, SCDE
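The zero-inflated moments follow from treating each count as either a dropout (probability d, count 0) or a draw from the underlying NB. A small sketch, with illustrative function names, that computes the closed-form moments and cross-checks the variance against the mixture definition:

```python
def zinb_moments(mu, size, d):
    """Closed-form mean and variance of a zero-inflated NB with dropout rate d."""
    mean = mu * (1 - d)
    variance = mu * (1 - d) * (1 + d * mu + mu / size)
    return mean, variance

def zinb_variance_from_mixture(mu, size, d):
    """Same variance via E[X^2] - E[X]^2 of the dropout/NB mixture."""
    nb_var = mu + mu ** 2 / size
    second_moment = (1 - d) * (nb_var + mu ** 2)  # dropouts contribute 0
    return second_moment - ((1 - d) * mu) ** 2

mean, variance = zinb_moments(mu=5, size=2, d=0.3)
```

For mu = 5, size = 2 and d = 0.3 both routes give a mean of 3.5 and a variance of 17.5.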
Poisson-Beta
[Figure: two example Poisson-Beta distributions; x-axis: read counts, y-axis: frequency]
$\mu = g \cdot \dfrac{a}{a + b}$
$\sigma^2 = g^2 \cdot \dfrac{a \cdot b}{(a + b + 1) \cdot (a + b)^2}$
a is the rate of activation of transcription, b is the rate of inhibition of transcription, and g is the rate of transcript production while transcription is active at the locus. Differential expression methods may test each of the parameters for differences across groups, or only one (often g). Implemented in BPSC
These models may be further expanded to explicitly account for other sources of gene expression differences, such as batch effect or library depth, depending on the particular DE algorithm
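A short simulation can make the Poisson-Beta model concrete. The sketch below is an illustration with invented function names and parameter values: each cell draws an activity level from Beta(a, b), scales it by g, and then draws a count from a Poisson with that rate. Note one assumption spelled out in the code: the variance formula on the slide describes the Beta-distributed rate, and by the law of total variance the simulated counts carry an additional Poisson term equal to the mean.

```python
import math
import random

def poisson_beta_moments(a, b, g):
    """Mean and rate variance as given on the slide."""
    mean = g * a / (a + b)
    rate_variance = g ** 2 * a * b / ((a + b + 1) * (a + b) ** 2)
    return mean, rate_variance

def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for the moderate rates used here."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def simulate_counts(a, b, g, n_cells, seed=1):
    """Draw one count per cell: Beta activity, scaled by g, then Poisson sampling."""
    rng = random.Random(seed)
    return [sample_poisson(g * rng.betavariate(a, b), rng) for _ in range(n_cells)]

mean, rate_variance = poisson_beta_moments(a=2, b=3, g=50)
# Variance expected for the counts themselves: rate variance plus the Poisson term.
count_variance = rate_variance + mean
counts = simulate_counts(a=2, b=3, g=50, n_cells=20000)
```

Small values of a and b push the Beta draws towards 0 and 1, which is what produces the bimodal, bursty count distributions shown in the figure.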
MAST (revisited)
uses a generalized linear hurdle model
designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene g, where Z indicates whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
A logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):
$\operatorname{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$
$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
SCDE
- models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution
- the NB distribution models the transcripts that are amplified and detected
- the Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)
- a subset of robust genes is used to fit, via the EM algorithm, the parameters of the mixture of models
- for DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach
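The mixture idea can be illustrated with the E-step of an EM fit: given current parameter values, compute the posterior probability that an observed count came from the dropout (Poisson) component rather than the amplified (NB) component. This is a sketch only and not the SCDE implementation; the function names, the mixing weight, and the background rate of 0.1 are assumptions chosen for the example.

```python
import math

def poisson_logpmf(k, lam):
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def nb_logpmf(k, mu, size):
    return (math.lgamma(k + size) - math.lgamma(k + 1) - math.lgamma(size)
            + size * math.log(size / (size + mu))
            + k * math.log(mu / (size + mu)))

def dropout_posterior(count, mu, size, dropout_weight, background_rate=0.1):
    """Posterior probability that `count` was generated by the dropout component."""
    log_drop = math.log(dropout_weight) + poisson_logpmf(count, background_rate)
    log_amp = math.log(1 - dropout_weight) + nb_logpmf(count, mu, size)
    # Subtract the maximum before exponentiating to avoid underflow.
    top = max(log_drop, log_amp)
    drop, amp = math.exp(log_drop - top), math.exp(log_amp - top)
    return drop / (drop + amp)

p_zero = dropout_posterior(0, mu=50, size=2, dropout_weight=0.3)
p_high = dropout_posterior(40, mu=50, size=2, dropout_weight=0.3)
```

A zero observed for a gene expected at around 50 counts is attributed almost entirely to dropout, whereas a count of 40 is attributed to the amplified component. In an EM fit these posteriors would be used as weights when re-estimating the component parameters.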
Monocle
Originally designed for ordering cells by progress through differentiation stages (pseudo-time)
The mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as
$g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$, where Y is a specific gene expression level, $x_i$ are predictor variables, g is a link function (typically the log function), and $f_i$ are non-parametric functions (e.g. cubic splines)
The observable expression level Y is then modeled using a GAM:
$E(Y) = s(\varphi_t(b_x, s_i)) + \varepsilon$, where $\varphi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and s is a cubic smoothing function with three degrees of freedom. The error term $\varepsilon$ is normally distributed with a mean of zero
The DE test is performed using an approximate $\chi^2$ likelihood ratio test
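The likelihood ratio test used here (and in MAST) can be written out in generic notation. In this sketch, $\ell$ denotes the maximized log-likelihood, and the reduced model is assumed to be the full model without the term being tested (for example, without the pseudo-time smooth); both symbols are introduced for illustration.

```latex
D = 2\left[\ell(\hat{\theta}_{\text{full}}) - \ell(\hat{\theta}_{\text{reduced}})\right]
\;\overset{\text{approx.}}{\sim}\; \chi^2_{k},
\qquad k = \dim(\theta_{\text{full}}) - \dim(\theta_{\text{reduced}})
```

A large D means the extra terms improve the fit by more than would be expected by chance, and the p-value is the upper tail of the $\chi^2_k$ distribution at D.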
Performance
Supplementary Table 2: Evaluated differential expression methods together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle. Source: Soneson and Robinson 2018, Nature Methods, doi:10.1038/nmeth.4612
Short name | Method | Software version | Input | Available from | Reference
BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11]
D3E | D3E | D3E 1.0 | raw counts | GitHub | [12]
DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13]
DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14]
edgeRLRT | edgeR/LRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17]
edgeRLRTcensus | edgeR/LRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17]
edgeRLRTdeconv | edgeR/LRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18]
edgeRLRTrobust | edgeR/LRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19]
edgeRQLF | edgeR/QLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
edgeRQLFDetRate | edgeR/QLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22]
MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24]
monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25]
monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26]
monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25]
NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27]
ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29]
ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29]
ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29]
SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30]
scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31]
SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32]
SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33]
ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35]
voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22]
Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36]
No ground truth, i.e. no independently validated truth is available for testing
Known data
- using data we know something about to get positive controls
Simulated data
- null data sets by re-sampling
- modeling data sets based on various distributions
Comparing between methods and scenarios
- comparing numbers of DEs, incl. as a function of group size
Investigating results
- what do the expression levels and distributions of the detected DEs look like?
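A null data set by re-sampling can be built by taking cells from a single homogeneous group and splitting them at random into two mock groups. Any gene a method calls DE between the mock groups is then a false positive by construction. A minimal sketch, with a hypothetical function name and made-up cell identifiers:

```python
import random

def make_null_split(cell_ids, group_size, seed=0):
    """Randomly pick two disjoint mock groups from one homogeneous set of cells."""
    if 2 * group_size > len(cell_ids):
        raise ValueError("not enough cells for two groups of this size")
    rng = random.Random(seed)
    chosen = rng.sample(list(cell_ids), 2 * group_size)
    return chosen[:group_size], chosen[group_size:]

cells = [f"cell_{i}" for i in range(96)]  # hypothetical cell identifiers
mock_a, mock_b = make_null_split(cells, group_size=24, seed=1)
```

Repeating the split with different seeds and group sizes gives a picture of how the number of false positive calls depends on sample size.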
- False positives (type I error) vs false negatives (type II error)
- Sensitivity and specificity
- Precision and recall
[Figure adapted from Wikipedia]
Dal Molin, Baruzzo and Di Camillo 2017: 2 conditions of 100 cells each, simulated with 10 000 genes, out of which 2 000 were set to be DE (based on NB and bimodal distributions)
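With simulated data the true DE genes are known, so the metrics above can be computed directly. The sketch below uses the simulation dimensions quoted on the slide (10 000 genes, 2 000 truly DE); the set of genes "called" by a method is invented for illustration, as is the function name.

```python
def performance_metrics(truth, called, n_genes):
    """Confusion-matrix summaries for a set of called genes against the known truth."""
    tp = len(truth & called)   # true positives
    fp = len(called - truth)   # false positives (type I errors)
    fn = len(truth - called)   # false negatives (type II errors)
    tn = n_genes - tp - fp - fn
    return {
        "sensitivity": tp / (tp + fn),  # also called recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }

truth = set(range(2000))                                  # genes simulated as DE
called = set(range(500, 2000)) | set(range(2000, 2300))   # hypothetical method output
metrics = performance_metrics(truth, called, n_genes=10000)
```

In this made-up case the method recovers 1 500 of the 2 000 DE genes (sensitivity 0.75) while 300 of its 1 800 calls are false (precision about 0.83).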
Consistency
Miao et al 2017
And so much more
Soneson and Robinson 2018: Bias, robustness and scalability in single-cell differential expression analysis
- 36 statistical approaches for DE analysis to compare the expression levels in the two groups of cells
- based on 9 data sets with 11 - 21 separate instances (sample size effect)
- extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.
- conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets
Practicalities
Getting to know your data
Example data: 46078 genes x 96 cells
22229 genes with no expression at all
[Figure: two histograms; left: read counts (x-axis) against frequency; right: 0 counts (x-axis) against frequency]
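Summaries like the ones on this slide are easy to compute from a genes x cells count matrix. A minimal sketch using plain lists, with an invented function name and toy data: it counts genes with no expression at all, the number of zero counts per gene, and the detection rate (fraction of detected genes) per cell.

```python
def summarize_counts(counts):
    """counts: one row per gene, each row a list of per-cell read counts."""
    n_genes = len(counts)
    n_cells = len(counts[0])
    # Genes with zero counts in every cell.
    silent_genes = sum(1 for row in counts if not any(row))
    # Number of cells in which each gene has a zero count.
    zeros_per_gene = [sum(1 for c in row if c == 0) for row in counts]
    # Fraction of genes detected (count > 0) in each cell.
    detection_rate = [sum(1 for row in counts if row[j] > 0) / n_genes
                      for j in range(n_cells)]
    return silent_genes, zeros_per_gene, detection_rate

toy_counts = [
    [0, 0, 0],
    [5, 0, 2],
    [1, 1, 0],
    [0, 0, 0],
]
silent, zeros, rates = summarize_counts(toy_counts)
```

On the toy matrix two of the four genes are silent, and the first cell detects half of the genes. Genes with no expression at all are usually filtered out before DE testing.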
Choosing DE methods
Soneson and Robinson 2018
Remembering the bigger picture
Stegle, Teichmann and Marioni 2015
- QC filtering
- Cell-cycle phase
- Normalization of cell-specific biases
- Confounding factors, incl. batch effects
- Detection rate, i.e. the fraction of detected genes per cell
- Imputation strategies for dropout values
- What is pragmatic: programming language, platform, speed, collaborative workflows, etc.
Staying critical
Summary
What to remember from this hour
https://www.menti.com & 25 06 78
Growing field
Angerer et al 2017
https://www.scrna-tools.org/tools
Zappia, Phipson and Oshlack 2018
Summary
- scRNA-seq is a rapidly growing field
- DE is a common task, so many newer and better methods will be developed
- understanding basic statistical concepts enables one to think more like a statistician, and to choose and evaluate methods for a given data set
- staying critical, staying updated, staying connected
Bibliography
Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1–9. ISSN 1367-4803. doi:10.1093/bioinformatics/bty329
Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243–260. ISSN 2095-4697. doi:10.1007/s40484-016-0089-7
Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255–261. ISSN 1548-7105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612
Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN 1664-8021. doi:10.3389/fgene.2017.00062
Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data." no. May: 1–2. ISSN 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549
Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133–145
Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85–91. ISSN 2452-3100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004
Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573
Introduction
Differential expression means taking read count data and performing statistical analysis to discover quantitative changes in expression levels between experimental groups, i.e. deciding whether, for a given gene, an observed difference in read counts is significant (greater than what would be expected just due to natural random variation)
Differential expression is an old problem, known from bulk RNA-seq and microarray studies; in fact, it builds on one of the most common statistical problems, i.e. comparing groups for statistical differences
[Figure adapted from Wu et al 2017]
Differential expression is an old problem
So what is all the commotion about?
https://www.menti.com & 25 06 78
scRNA-seq special characteristics
- high noise levels (technical and biological factors)
- low library sizes
- low amount of available mRNAs results in amplification biases and dropout events
- 3' bias, partial coverage and uneven depth (technical)
- stochastic nature of transcription (biological)
- multimodality in gene expression: presence of multiple possible cell states within a cell population (biological)
[Figure: density plots of expression values (x-axis: value, y-axis: density) for 20 example genes: 1300018J18Rik, Arid2, Bend3, Ccdc104, Ccnt1, Crispld2, Fbxw13, Hbxip, Katna1, Lcorl, Mybpc1, Nars, Ndufa3, Nono, Pgam2, Rbm17, Rragc, Slc1a3, Slc22a20, Smarcd1. Based on tutorial data]
Common methods
Figure: Simplified scRNA-seq workflow [adapted from http://hemberg-lab.github.io]
Generic non-parametric methods
- e.g. Wilcoxon rank-sum test, Kruskal-Wallis, Kolmogorov-Smirnov test
- non-parametric tests generally convert observed expression values to ranks and test whether the distribution of ranks for one group is significantly different from the distribution of ranks for the other group
- some non-parametric methods fail in the presence of a large number of tied values, such as the case for dropouts (zeros) in single-cell RNA-seq expression data
- if the conditions for a parametric test hold, then it will typically be more powerful than a non-parametric test
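The rank conversion and the effect of ties can be seen in a small implementation of the Wilcoxon rank-sum (Mann-Whitney) test. This is a sketch under stated assumptions: tied values receive mid-ranks, the p-value uses the normal approximation with the standard tie correction and no continuity correction, and the function names are invented for this example.

```python
import math

def midranks(values):
    """Ranks with ties replaced by their average rank; also returns tie group sizes."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_sizes = []
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        tie_sizes.append(j - i + 1)
        i = j + 1
    return ranks, tie_sizes

def rank_sum_test(x, y):
    """Mann-Whitney U for group x and a two-sided normal-approximation p-value."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    ranks, tie_sizes = midranks(list(x) + list(y))
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    tie_term = sum(t ** 3 - t for t in tie_sizes)
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0  # every value is tied, so the ranks carry no information
    z = (u - n1 * n2 / 2) / math.sqrt(var_u)
    return u, math.erfc(abs(z) / math.sqrt(2))

u, p = rank_sum_test([0, 0, 0, 1, 2], [0, 3, 4, 5, 6])
```

The four zeros in the example all share rank 2.5, which shrinks the variance of the statistic. When most cells have a zero count for a gene, as is common with dropouts, very little rank information is left to separate the groups.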
Methods developed for bulk RNA-seq
- e.g. edgeR, DESeq2
- compare estimates of mean expression (sample size)
- based on the negative binomial distribution
- can be assessed using datasets where RNA-seq data has been validated by RT-qPCR
Methods developed for scRNA-seq
- e.g. MAST, SCDE, Monocle, Pagoda, D3E, etc.
- single-cell experiments provide a large number of samples (i.e. cells) for each group we are comparing; thus we can take advantage of the whole distribution of expression values in each group to identify differences between groups
- we usually do not have a defined set of experimental conditions; instead, we try to identify the cell groups by using an unsupervised clustering approach
Miao and Zhang 2016
Supplementary Table 2 (evaluated DE methods): Soneson and Robinson 2018, Nature Methods, doi:10.1038/nmeth.4612
No ground truth, i.e. no independently validated truth is available for testing
- Known data: using data we know something about to get positive controls
- Simulated data: null data sets by re-sampling; modeling data sets based on various distributions
- Comparing between methods and scenarios: comparing numbers of DEs, incl. as a function of group size
- Investigating results: what do the expression and distributions of detected DEs look like?
- False positives (type I error) vs false negatives (type II error)
- Sensitivity and specificity
- Precision and recall
adapted from Wikipedia
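When the true DE genes are known (for example in simulated data), the quantities named above can be computed directly from a confusion matrix. The sketch below is only an illustration: the function name `de_metrics` and the toy gene labels are hypothetical, and it assumes at least one gene falls in each margin so that no denominator is zero.

```python
def de_metrics(called, truth, n_genes):
    """Confusion-matrix metrics for a set of DE calls against a known truth.

    called:  iterable of gene ids reported as DE by a method
    truth:   iterable of gene ids that are truly DE
    n_genes: total number of genes tested
    """
    called, truth = set(called), set(truth)
    tp = len(called & truth)          # true positives
    fp = len(called - truth)          # false positives (type I errors)
    fn = len(truth - called)          # false negatives (type II errors)
    tn = n_genes - tp - fp - fn       # true negatives
    return {
        "sensitivity": tp / (tp + fn),          # also called recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "false_positive_rate": fp / (fp + tn),
    }


# Hypothetical toy example: 10 genes tested, 3 truly DE, 4 called DE.
metrics = de_metrics(
    called={"g1", "g2", "g3", "g4"},
    truth={"g1", "g2", "g5"},
    n_genes=10,
)
```

Sensitivity and recall are the same quantity; precision differs from specificity in that it depends on how many genes are called, which is why both pairs are usually reported.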
Dal Molin, Baruzzo and Di Camillo 2017: 2 conditions of 100 cells each, simulated with 10,000 genes, of which 2,000 were set to be DE (based on NB and bimodal distributions)
Consistency
Miao et al. 2017
And so much more
Soneson and Robinson 2018: "Bias, robustness and scalability in single-cell differential expression analysis"
- 36 statistical approaches for DE analysis to compare the expression levels in the two groups of cells
- based on 9 data sets with 11-21 separate instances (sample size effect)
- extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.
- conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets
Practicalities
Getting to know your data
Example data: 46078 genes x 96 cells
22229 genes with no expression at all
[Figure: two histograms, one of read counts (x-axis "Read Counts", y-axis "Frequency") and one of the number of zero counts (x-axis "0 counts", y-axis "Frequency")]
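Summaries like the ones on this slide are easy to compute from a genes x cells count matrix. The following is a minimal sketch on a made-up 4 x 4 matrix; the function name `zero_summary` and the toy data are hypothetical, and a real analysis would normally use a matrix library rather than nested lists.

```python
def zero_summary(counts):
    """Summarise zeros in a genes x cells matrix given as a list of rows.

    Returns the number of genes with no expression in any cell, the number
    of zero counts per cell, and the detection rate per cell (fraction of
    genes with a non-zero count).
    """
    n_genes = len(counts)
    n_cells = len(counts[0])
    all_zero_genes = sum(1 for row in counts if not any(row))
    zeros_per_cell = [
        sum(1 for row in counts if row[j] == 0) for j in range(n_cells)
    ]
    detection_rate = [1 - z / n_genes for z in zeros_per_cell]
    return all_zero_genes, zeros_per_cell, detection_rate


# Hypothetical toy data: 4 genes (rows) x 4 cells (columns).
toy_counts = [
    [0, 0, 0, 0],
    [5, 0, 2, 0],
    [0, 0, 0, 0],
    [1, 3, 0, 7],
]
all_zero_genes, zeros_per_cell, detection_rate = zero_summary(toy_counts)
```

Genes that are zero in every cell carry no information for DE testing and are usually filtered out before any method is run.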
Choosing DE methods
Soneson and Robinson 2018
Remembering the bigger picture
Stegle, Teichmann and Marioni 2015
- QC filtering
- Cell-cycle phase
- Normalization of cell-specific biases
- Confounding factors, incl. batch effects
- Detection rate, i.e. the fraction of detected genes per cell
- Imputation strategies for dropout values
- What is pragmatic: programming language, platform, speed, collaborative workflows, etc.
Staying critical
Summary
What to remember from this hour
https://www.menti.com & 25 06 78
Growing field
Angerer et al. 2017
https://www.scrna-tools.org/tools
Zappia, Phipson and Oshlack 2018
- scRNA-seq is a rapidly growing field
- DE is a common task, so many newer and better methods will be developed
- understanding basic statistical concepts enables one to think more like a statistician, to choose and evaluate methods given a data set
- staying critical, staying updated, staying connected
Bibliography
Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1–9. ISSN 1367-4803. doi:10.1093/bioinformatics/bty329.
Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243–260. ISSN 2095-4697. doi:10.1007/s40484-016-0089-7.
Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255–261. ISSN 1548-7105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612.
Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN 1664-8021. doi:10.3389/fgene.2017.00062.
Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data," no. May: 1–2. ISSN 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549.
Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133–145.
Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85–91. ISSN 2452-3100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004.
Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv, 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573.
- Outline
- Introduction
- Common methods
  - More detailed examples
- Performance
- Practicalities
- Summary
- Bibliography
Introduction
Figure: Simplified scRNA-seq workflow [adapted from Wikipedia]
adapted from Wu et al. 2017
Differential expression means
- taking read count data and performing statistical analysis to discover quantitative changes in expression levels between experimental groups
- i.e. to decide whether, for a given gene, an observed difference in read counts is significant (greater than what would be expected just due to natural random variation)
Differential expression is an old problem
- known from bulk RNA-seq and microarray studies
- in fact building on one of the most common statistical problems, i.e. comparing groups for statistical differences
Differential expression is an old problem
So what is all the commotion about?
https://www.menti.com & 25 06 78
scRNA-seq special characteristics
- high noise levels (technical and biological factors)
- low library sizes
- low amount of available mRNAs results in amplification biases and dropout events
- 3' bias, partial coverage and uneven depth (technical)
- stochastic nature of transcription (biological)
- multimodality in gene expression; presence of multiple possible cell states within a cell population (biological)
[Figure: density plots of expression values (x-axis "value", y-axis "density") for 20 example genes: Rbm17, Rragc, Slc1a3, Slc22a20, Smarcd1, Mybpc1, Nars, Ndufa3, Nono, Pgam2, Crispld2, Fbxw13, Hbxip, Katna1, Lcorl, 1300018J18Rik, Arid2, Bend3, Ccdc104, Ccnt1. Based on tutorial data.]
Common methods
Figure: Simplified scRNA-seq workflow, adapted from http://hemberg-lab.github.io
Generic non-parametric methods
- e.g. Wilcoxon rank-sum test, Kruskal-Wallis test, Kolmogorov-Smirnov test
- non-parametric tests generally convert observed expression values to ranks and test whether the distribution of ranks for one group is significantly different from the distribution of ranks for the other group
- some non-parametric methods fail in the presence of a large number of tied values, as is the case for dropouts (zeros) in single-cell RNA-seq expression data
- if the conditions for a parametric test hold, then it will typically be more powerful than a non-parametric test
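To make the rank conversion and the effect of ties concrete, here is a small self-contained sketch of a two-sided Wilcoxon rank-sum (Mann-Whitney U) test. It is an illustration rather than a reference implementation: it uses the normal approximation with a tie correction and no continuity correction, the function name `rank_sum_test` is hypothetical, and the toy values are made up.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test for two groups of expression values.

    Uses midranks for tied values and the tie-corrected normal
    approximation. Returns (U statistic of group x, p-value).
    """
    pooled = sorted((v, g) for g, grp in enumerate((x, y)) for v in grp)
    n1, n2 = len(x), len(y)
    n = n1 + n2
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        # Positions i+1..j (1-based) share one value: assign the midrank.
        midrank = (i + 1 + j) / 2.0
        for k in range(i, j):
            ranks[k] = midrank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    rank_sum_x = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        # Every value is tied (e.g. all zeros): the ranks carry no information.
        return u_x, 1.0
    z = (u_x - mean_u) / math.sqrt(var_u)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u_x, p_value


# Hypothetical expression values with many zeros (dropouts) in group a.
group_a = [0, 0, 0, 1, 2]
group_b = [0, 3, 4, 5, 6]
u_stat, p_val = rank_sum_test(group_a, group_b)
u_all_zero, p_all_zero = rank_sum_test([0, 0], [0, 0])
```

The all-zero case shows why ties matter: when every value is tied, the variance of the statistic collapses to zero and the test cannot distinguish the groups at all.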
developed for bulk RNA-seq
- e.g. edgeR, DESeq2
- compare estimates of mean expression (sample size)
- based on the negative binomial distribution
- can be assessed by datasets where RNA-seq data has been validated by RT-qPCR
developed for scRNA-seq
- e.g. MAST, SCDE, Monocle, Pagoda, D3E, etc.
- large number of samples (i.e. cells) for each group we are comparing in single-cell experiments. Thus we can take advantage of the whole distribution of expression values in each group to identify differences between groups
- we usually do not have a defined set of experimental conditions; instead we try to identify the cell groups by using an unsupervised clustering approach
Miao and Zhang 2016
Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR, and Census counts with monocle.
Short name | Method | Software version | Input | Available from | Reference
BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11]
D3E | D3E | D3E 1.0 | raw counts | GitHub | [12]
DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13]
DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14]
edgeRLRT | edgeR/LRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17]
edgeRLRTcensus | edgeR/LRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17]
edgeRLRTdeconv | edgeR/LRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18]
edgeRLRTrobust | edgeR/LRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19]
edgeRQLF | edgeR/QLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
edgeRQLFDetRate | edgeR/QLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22]
MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24]
monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25]
monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26]
monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25]
NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27]
ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29]
ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29]
ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29]
SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30]
scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31]
SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32]
SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33]
ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35]
voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22]
Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36]
Nature Methods, doi:10.1038/nmeth.4612 (Soneson and Robinson 2018)
More detailed examples
MAST
- uses a generalized linear hurdle model
- designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable
- The rate of expression $Z$ and the level of expression $Y$ are modeled for each gene $g$, with $Z_{ig}$ indicating whether gene $g$ is expressed in cell $i$ (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
- A logistic regression model is used for the discrete variable $Z$ and a Gaussian linear model for the continuous variable $(Y \mid Z = 1)$:
  $\operatorname{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$
  $\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix
- Model parameters are fitted using an empirical Bayesian framework
- Allows for a joint estimate of nuisance and treatment effects
- DE is determined using the likelihood ratio test
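The two-part structure of the hurdle model can be illustrated for the simplest design, a single gene compared between two groups of cells. The sketch below is not the MAST implementation: it leaves out the empirical Bayes regularisation and any covariates, uses plain maximum likelihood, and assumes each group has at least two expressing cells with non-identical values. The function names and toy values (on a log2(TPM+1)-like scale) are hypothetical. With a group indicator as the only covariate, the logistic part reduces to comparing detection proportions and the Gaussian part to comparing group means with a shared variance; the two likelihood ratio statistics are summed and compared with a chi-square distribution with 2 degrees of freedom, whose survival function is exp(-x/2).

```python
import math


def _bernoulli_loglik(k, n):
    """Maximised Bernoulli log-likelihood for k expressing cells out of n."""
    if k == 0 or k == n:
        return 0.0
    p = k / n
    return k * math.log(p) + (n - k) * math.log(1 - p)


def _gaussian_loglik(groups):
    """Maximised Gaussian log-likelihood, one mean per group, shared variance."""
    n = sum(len(g) for g in groups)
    rss = 0.0
    for g in groups:
        mean = sum(g) / len(g)
        rss += sum((v - mean) ** 2 for v in g)
    if rss == 0:
        raise ValueError("zero variance among expressed values")
    sigma2 = rss / n
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)


def hurdle_lrt(group_a, group_b):
    """Two-group hurdle likelihood ratio test for one gene.

    Returns (combined LRT statistic, p-value from chi-square with 2 df).
    """
    pos_a = [v for v in group_a if v > 0]
    pos_b = [v for v in group_b if v > 0]
    if len(pos_a) < 2 or len(pos_b) < 2:
        raise ValueError("need at least two expressing cells per group")
    # Discrete part: is the gene detected (Z = 1) at different rates?
    alt_d = (_bernoulli_loglik(len(pos_a), len(group_a))
             + _bernoulli_loglik(len(pos_b), len(group_b)))
    null_d = _bernoulli_loglik(len(pos_a) + len(pos_b),
                               len(group_a) + len(group_b))
    # Continuous part: do the expressed values (Y | Z = 1) differ in mean?
    alt_c = _gaussian_loglik([pos_a, pos_b])
    null_c = _gaussian_loglik([pos_a + pos_b])
    stat = max(2 * ((alt_d - null_d) + (alt_c - null_c)), 0.0)
    p_value = math.exp(-stat / 2)  # chi-square survival function, 2 df
    return stat, p_value


# Hypothetical expression values for one gene in two groups of six cells.
cells_a = [0, 0, 0, 0, 1.0, 1.2]
cells_b = [2.0, 2.2, 1.8, 2.1, 0, 1.9]
lrt_stat, lrt_p = hurdle_lrt(cells_a, cells_b)
same_stat, same_p = hurdle_lrt(cells_a, cells_a)
```

Summing the two statistics is what lets the test pick up a change in detection rate, a change in expression level among expressing cells, or both.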
Let's stop for a minute
The key
$\text{Outcome}_i = (\text{Model}_i) + \text{error}_i$
- we collect data on a sample from a much larger population
- statistics lets us make inferences about the population from which the sample was derived
- we try to predict the outcome given a model fitted to the data
$t = \dfrac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$
[Figure: histogram of height [cm] (y-axis "Frequency")]
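As a worked example of the statistic on this slide, the sketch below computes the two-sample t statistic for two small groups of heights. It assumes that $s_p$ denotes the pooled standard deviation (the slide does not define it), and the function name `pooled_t` and the height values are made up for illustration.

```python
import math


def pooled_t(x1, x2):
    """Two-sample t statistic with a pooled standard deviation s_p."""
    n1, n2 = len(x1), len(x2)
    mean1, mean2 = sum(x1) / n1, sum(x2) / n2
    ss1 = sum((v - mean1) ** 2 for v in x1)
    ss2 = sum((v - mean2) ** 2 for v in x2)
    # Pooled standard deviation over both groups (n1 + n2 - 2 degrees of freedom).
    s_p = math.sqrt((ss1 + ss2) / (n1 + n2 - 2))
    return (mean1 - mean2) / (s_p * math.sqrt(1 / n1 + 1 / n2))


# Hypothetical heights in cm for two groups of three people.
heights_1 = [170, 172, 174]
heights_2 = [176, 178, 180]
t_stat = pooled_t(heights_1, heights_2)
```

Here the group means are 172 and 178, the pooled standard deviation is 2, and the statistic is the difference in means measured in units of its standard error.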
Generic recipe
- model the data, e.g. gene expression with random error
- fit the model to the data and/or the data to the model
- estimate the model parameters
- use the model for prediction and/or inference
Important implication
- the better the model fits the data, the better the statistics
Common distributions
Negative Binomial
[Figure: three histograms of negative binomial read counts (x-axis "Read Counts", y-axis "Frequency")]
$\mu = \mathrm{mu}$
$\sigma^2 = \mathrm{mu} + \mathrm{mu}^2 / \mathrm{size}$
mu: the mean expression; size: the dispersion parameter, which is inversely related to the variance. NB fits bulk RNA-seq data very well and it is used for most statistical methods designed for such data. In addition, it has been shown to fit the distribution of molecule counts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grun et al. 2014, Islam et al. 2011).
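The mean and variance relationship can be checked by simulation. The sketch below draws negative binomial counts as a gamma-Poisson mixture, which is one standard way to obtain the mu/size parameterisation, and compares the sample moments with mu and mu + mu^2/size. The helper names `rpois` and `rnbinom` mirror the R function names but are hypothetical helpers written for this example, and the parameter values are arbitrary.

```python
import math
import random


def rpois(lam, rng):
    """Poisson draw by Knuth's multiplication method (fine for small lam)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def rnbinom(mu, size, rng):
    """NB draw as a gamma-Poisson mixture with mean mu and dispersion size."""
    # The Poisson rate is Gamma(shape=size, scale=mu/size), so its mean is mu.
    rate = rng.gammavariate(size, mu / size)
    return rpois(rate, rng)


rng = random.Random(1)
mu, size = 5.0, 2.0
draws = [rnbinom(mu, size, rng) for _ in range(50000)]
sample_mean = sum(draws) / len(draws)
sample_var = sum((d - sample_mean) ** 2 for d in draws) / (len(draws) - 1)
expected_var = mu + mu ** 2 / size  # variance formula from the slide
```

A smaller size gives a larger mu^2/size term, i.e. more overdispersion relative to a Poisson distribution, whose variance would equal mu.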
Zero-inflated NB
[Figure: two histograms of zero-inflated NB read counts (x-axis "Read Counts", y-axis "Frequency")]
$\mu = \mathrm{mu} \cdot (1 - d)$
$\sigma^2 = \mathrm{mu} \cdot (1 - d) \cdot (1 + d \cdot \mathrm{mu} + \mathrm{mu} / \mathrm{size})$
d: the dropout rate. The dropout rate of a gene is strongly correlated with the mean expression of the gene. Different zero-inflated negative binomial models use different relationships between mu and d, and some may fit mu and d to the expression of each gene independently. Implemented in MAST, SCDE.
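The zero-inflated moments follow from the negative binomial ones in a few lines. The derivation below is a sketch under the assumption that a count $X$ is zero with probability $d$ and otherwise drawn from an NB distribution with mean $\mathrm{mu}$ and variance $\mathrm{mu} + \mathrm{mu}^2/\mathrm{size}$.

```latex
\begin{align*}
E[X] &= (1-d)\,\mathrm{mu} \\
E[X^2] &= (1-d)\left(\mathrm{mu} + \frac{\mathrm{mu}^2}{\mathrm{size}} + \mathrm{mu}^2\right) \\
\operatorname{Var}(X) &= E[X^2] - E[X]^2 \\
  &= (1-d)\,\mathrm{mu}\left(1 + \frac{\mathrm{mu}}{\mathrm{size}} + \mathrm{mu} - (1-d)\,\mathrm{mu}\right) \\
  &= \mathrm{mu}\,(1-d)\left(1 + d\,\mathrm{mu} + \frac{\mathrm{mu}}{\mathrm{size}}\right)
\end{align*}
```

Setting $d = 0$ recovers the plain NB variance $\mathrm{mu} + \mathrm{mu}^2/\mathrm{size}$.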
Poisson-Beta
[Figure: two histograms of Poisson-Beta read counts (x-axis "Read Counts", y-axis "Frequency")]
$\mu = g \cdot a / (a + b)$
$\sigma^2 = g^2 \cdot a \cdot b / ((a + b + 1) \cdot (a + b)^2)$
a: the rate of activation of transcription; b: the rate of inhibition of transcription; g: the rate of transcript production while transcription is active at the locus. The variance shown is that of the underlying Beta-distributed rate; for the observed counts, Poisson sampling adds $\mu$ on top. Differential expression methods may test each of the parameters for differences across groups, or only one (often g). Implemented in BPSC.
May be further expanded to explicitly account for other sources of gene expression differences, such as batch effect or library depth, depending on the particular DE algorithm.
SCDE
- models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution
- the NB distribution models the transcripts that are amplified and detected
- the Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)
- a subset of robust genes is used to fit, via the EM algorithm, the parameters of the mixture of models
- For DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach
Monocle
- Originally designed for ordering cells by progress through differentiation stages (pseudo-time)
- The mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as
  $g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$
  where $Y$ is a specific gene expression level, $x_i$ are predictor variables, $g$ is a link function (typically the log function), and $f_i$ are non-parametric functions (e.g. cubic splines)
- The observable expression level $Y$ is then modeled using the GAM
  $E(Y) = s(\varphi_t(b_x, s_i)) + \epsilon$
  where $\varphi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and $s$ is a cubic smoothing function with three degrees of freedom. The error term $\epsilon$ is normally distributed with a mean of zero
- The DE test is performed using an approximate $\chi^2$ likelihood ratio test
Olga (NBIS) scRNA-seq DE February 2019 29 46
Performance
Performance
Olga (NBIS) scRNA-seq DE February 2019 30 46
Supplementary Table 2 Evaluated dicrarrerential expression methods together with package versions and thetype of input values provided to each of them Note that ldquoraw countsrdquo here refers to length-scaled TPMswhich are on the scale of the raw counts but are unacrarrected by dicrarrerential isoform usage [10] CPM valuesare calculated with edgeR and Census counts with monocle
Short name Method Software version InputAvailablefrom
Reference
BPSC BPSC BPSC 09901 CPM GitHub [11]
D3E D3E D3E 10 raw counts GitHub [12]
DESeq2 DESeq2 DESeq2 1141 raw counts Bioconductor [13]
DESeq2betapFALSE DESeq2 without beta prior DESeq2 1141 raw counts Bioconductor [13]
DESeq2census DESeq2 DESeq2 1141 Census counts Bioconductor [13]
DESeq2nofiltDESeq2 without the built-in in-dependent filtering
DESeq2 1141 raw counts Bioconductor [13]
DEsingle DEsingle DEsingle 010 raw counts GitHub [14]
edgeRLRT edgeRLRT edgeR 3191 raw counts Bioconductor [15ndash17]
edgeRLRTcensus edgeRLRT edgeR 3191 Census counts Bioconductor [15ndash17]
edgeRLRTdeconvedgeRLRT with deconvolutionnormalization
edgeR 3191scran 120
raw counts Bioconductor [15 17 18]
edgeRLRTrobustedgeRLRT with robust disper-sion estimation
edgeR 3191 raw counts Bioconductor [15ndash17 19]
edgeRQLF edgeRQLF edgeR 3191 raw counts Bioconductor [15 16 20]
edgeRQLFDetRateedgeRQLF with cellular detec-tion rate as covariate
edgeR 3191 raw counts Bioconductor [15 16 20]
limmatrend limma-trend limma 33013 log2(CPM) Bioconductor [21 22]
MASTcpm MAST MAST 105 log2(CPM+1) Bioconductor [23]
MASTcpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(CPM+1) Bioconductor [23]
MASTtpm MAST MAST 105 log2(TPM+1) Bioconductor [23]
MASTtpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(TPM+1) Bioconductor [23]
metagenomeSeq metagenomeSeqmetagenomeSeq1160
raw counts Bioconductor [24]
monocle monocle (tobit) monocle 220 TPM Bioconductor [25]
monoclecensus monocle (Negative Binomial) monocle 220 Census counts Bioconductor [25 26]
monoclecount monocle (Negative Binomial) monocle 220 raw counts Bioconductor [25]
NODES NODESNODES0009010
raw countsAuthor-providedlink
[27]
ROTScpm ROTS ROTS 120 CPM Bioconductor [28 29]
ROTStpm ROTS ROTS 120 TPM Bioconductor [28 29]
ROTSvoom ROTS ROTS 120voom-transformedraw counts
Bioconductor [28 29]
SAMseq SAMseq samr 20 raw counts CRAN [30]
scDD scDD scDD 100 raw counts Bioconductor [31]
SCDE SCDE scde 220 raw counts Bioconductor [32]
SeuratBimod Seurat (bimod test) Seurat 1407 raw counts GitHub [33 34]
SeuratBimodnofiltSeurat (bimod test) without theinternal filtering
Seurat 1407 raw counts GitHub [33 34]
SeuratBimodIsExpr2Seurat (bimod test) with internalexpression threshold set to 2
Seurat 1407 raw counts GitHub [33 34]
SeuratTobit Seurat (tobit test) Seurat 1407 TPM GitHub [25 33]
ttest t-test stats (R v 33)TMM-normalizedTPM
CRAN [16 35]
voomlimma voom-limma limma 33013 raw counts Bioconductor [21 22]
Wilcoxon Wilcoxon test stats (R v 33)TMM-normalizedTPM
CRAN [16 36]
3
Nature Methods doi101038nmeth4612
Performance
Olga (NBIS) scRNA-seq DE February 2019 31 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
False positives (type I error) vs false negatives (type II error)Sensitivity and specificityPrecision and recall
adapted from Wikipedia
Olga (NBIS) scRNA-seq DE February 2019 33 46
Performance
False positives (type I error) vs false negatives (type II error)Sensitivity and specificityPrecision and recall
Dal Molin Baruzzo and Di Camillo 2017 2 conditions of 100 cells each simulated with 10 000genes out of which 2 000 set to DEs (based on NB and bimodal distributions)
Olga (NBIS) scRNA-seq DE February 2019 34 46
Performance
Consistency
Miao et al 2017
Olga (NBIS) scRNA-seq DE February 2019 35 46
Performance
And so much more
Soneson and Robinson 2018
Bias robustness and scalability in single-celldifferential expression analysis
36 statistical approaches for DE analysisto compare the expression levels in thetwo groups of cells
based on 9 data sets with 11 - 21separate instances (sample size effect)
extensive evaluation metrics incl numberof genes found characteristics of the falsepositive detections robustness ofmethods similarities between methodsetc
conquer a collection of consistentlyprocessed analysis-ready publicscRNA-seq data sets
Olga (NBIS) scRNA-seq DE February 2019 36 46
Practicalities
Practicalities
Olga (NBIS) scRNA-seq DE February 2019 37 46
Practicalities
Getting to know your dataExample data 46078 genes x 96 cells22229 genes with no expression at all
Read Counts
Fre
quen
cy
0 500 1000 1500
050
0015
000
0 counts
Fre
quen
cy
0 20 40 60 800
2000
4000
6000
Olga (NBIS) scRNA-seq DE February 2019 38 46
Practicalities
Choosing DE methods
Soneson and Robinson 2018
Olga (NBIS) scRNA-seq DE February 2019 39 46
Practicalities
Rembering the bigger picture
Stegle Teichmann and Marioni 2015
QC filtering
Cell-cycle phase
Normalization of cell-specific biases
Confounding factors incl batcheffects
Detection rate ie the fraction ofdetected genes per cell
Imputations strategies for dropoutvalues
What is pragmatic programminglanguage platform speedcollaborative workflows etc
Olga (NBIS) scRNA-seq DE February 2019 40 46
Practicalities
Staying critical
Olga (NBIS) scRNA-seq DE February 2019 41 46
Summary
What to remember from this hour
httpswwwmenticom amp 25 06 78
Olga (NBIS) scRNA-seq DE February 2019 42 46
Summary
Growing field
Angerer et al 2017
Olga (NBIS) scRNA-seq DE February 2019 43 46
Summary
Growing field
httpswwwscrna-toolsorgtools
Zappia Phipson and Oshlack 2018
Olga (NBIS) scRNA-seq DE February 2019 44 46
Summary
SummaryscRNA-seq is a rapidly growing field
DE is a common task so many newer and better methods will be developed
understanding basic statistical concepts enables one to think more like a statistician tochoose and evaluate methods given data set
staying critical staying updated staying connected
Olga (NBIS) scRNA-seq DE February 2019 45 46
Bibliography
Wu Zhijin et al 2017 ldquoTwo-phase differential expression analysis for single cell RNA-seqrdquoBioinformatics 00 (00) 1ndash9 ISSN 1367-4803 doi101093bioinformaticsbty329
Miao Zhun and Xuegong Zhang 2016 ldquoDifferential expression analyses for single-cellRNA-Seq old questions on new datardquo Quantitative Biology 4 (4) 243ndash260 ISSN 20954697doi101007s40484-016-0089-7
Soneson Charlotte and Mark D Robinson 2018 ldquoBias robustness and scalability in single-celldifferential expression analysisrdquo Nature Methods 15 (4) 255ndash261 ISSN 15487105doi101038nmeth4612 httpdxdoiorg101038nmeth4612
Dal Molin Alessandra Giacomo Baruzzo and Barbara Di Camillo 2017 ldquoSingle-cellRNA-sequencing Assessment of differential expression analysis methodsrdquo Frontiers inGenetics 8 (MAY) ISSN 16648021 doi103389fgene201700062
Miao Zhun et al 2017 ldquoDEsingle for detecting three types of differential expression in single-cellRNA-seq datardquo no May 1ndash2 ISSN 1367-4803 doi101093bioinformaticsbty332arXiv 103549
Stegle Oliver Sarah A Teichmann and John C Marioni 2015 ldquoComputational and analyticalchallenges in single-cell transcriptomicsrdquo Nature reviews Genetics 16 (January 2014)133ndash145
Angerer Philipp et al 2017 ldquoSingle cells make big data New challenges and opportunities intranscriptomicsrdquo Current Opinion in Systems Biology 485ndash91 ISSN 24523100doi101016jcoisb201707004httpdxdoiorg101016jcoisb201707004
Zappia Luke Belinda Phipson and Alicia Oshlack 2018 ldquoExploring the single-cell RNA-seqanalysis landscape with the scRNA-tools databaserdquo bioRxiv 206573 doi101101206573httpswwwbiorxivorgcontentearly20180323206573
- Outline
- Introduction
- Common methods
  - More detailed examples
- Performance
- Practicalities
- Summary
- Bibliography
Introduction
Figure Simplified scRNA-seq workflow [adapted from Wikipedia]
adapted from Wu et al. 2017
Differential expression means
- taking read count data and performing statistical analysis to discover quantitative changes in expression levels between experimental groups
- i.e. deciding whether, for a given gene, an observed difference in read counts is significant (greater than what would be expected just due to natural random variation)
Differential expression is an old problem
- known from bulk RNA-seq and microarray studies
- in fact building on one of the most common statistical problems, i.e. comparing groups for statistical differences
Differential expression is an old problem
So what is all the commotion about?
https://www.menti.com & 25 06 78
scRNA-seq special characteristics
- high noise levels (technical and biological factors)
- low library sizes
- low amount of available mRNAs results in amplification biases and dropout events
- 3' bias, partial coverage and uneven depth (technical)
- stochastic nature of transcription (biological)
- multimodality in gene expression; presence of multiple possible cell states within a cell population (biological)
Figure: Density plots of expression values (x-axis: value, y-axis: density) for 20 example genes: Rbm17, Rragc, Slc1a3, Slc22a20, Smarcd1, Mybpc1, Nars, Ndufa3, Nono, Pgam2, Crispld2, Fbxw13, Hbxip, Katna1, Lcorl, 1300018J18Rik, Arid2, Bend3, Ccdc104, Ccnt1. Based on tutorial data.
Common methods
Simplified scRNA-seq workflow [adapted from http://hemberg-lab.github.io]
Generic non-parametric methods
- e.g. Wilcoxon rank-sum test, Kruskal-Wallis, Kolmogorov-Smirnov test
- non-parametric tests generally convert observed expression values to ranks and test whether the distribution of ranks for one group is significantly different from the distribution of ranks for the other group
- some non-parametric methods fail in the presence of a large number of tied values, as is the case for dropouts (zeros) in single-cell RNA-seq expression data
- if the conditions for a parametric test hold, then it will typically be more powerful than a non-parametric test
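The rank-based idea and the tie problem can be sketched in a few lines. This is a minimal illustration in plain Python, not the implementation used by any package mentioned in these slides: the function names `midranks` and `rank_sum_test` and the toy expression values are assumptions made for the example. Tied values (such as dropout zeros) receive the average of their ranks, and the normal approximation of the rank-sum statistic uses a tie-corrected variance.

```python
import math

def midranks(values):
    """Rank values 1..n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_sizes = []
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        tie_sizes.append(j - i + 1)
        i = j + 1
    return ranks, tie_sizes

def rank_sum_test(x, y):
    """Two-sided rank-sum test (normal approximation, tie-corrected variance)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    ranks, tie_sizes = midranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    tie_term = sum(t ** 3 - t for t in tie_sizes)
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u1, 1.0  # every value is tied: the test carries no information
    z = (u1 - n1 * n2 / 2) / math.sqrt(var_u)
    return u1, math.erfc(abs(z) / math.sqrt(2))

# Hypothetical expression values for one gene in two groups of cells.
group_a = [0, 0, 0, 0, 1, 0, 0, 1]
group_b = [5, 6, 0, 7, 8, 9, 4, 6]
u_stat, p_value = rank_sum_test(group_a, group_b)
print(f"U = {u_stat}, p = {p_value:.4f}")
```

The more zeros the two groups share, the larger the tie term and the smaller the variance of the statistic; when every value is tied the test cannot say anything, which is the failure mode mentioned above.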
Methods developed for bulk RNA-seq
- e.g. edgeR, DESeq2
- compare estimates of mean expression (sample size)
- based on the negative binomial distribution
- can be assessed with datasets where RNA-seq data has been validated by RT-qPCR
Methods developed for scRNA-seq
- e.g. MAST, SCDE, Monocle, Pagoda, D3E etc.
- single-cell experiments have a large number of samples (i.e. cells) for each group we are comparing; thus we can take advantage of the whole distribution of expression values in each group to identify differences between groups
- we usually do not have a defined set of experimental conditions; instead we try to identify the cell groups by using an unsupervised clustering approach
Miao and Zhang 2016
Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle.
| Short name | Method | Software version | Input | Available from | Reference |
|---|---|---|---|---|---|
| BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11] |
| D3E | D3E | D3E 1.0 | raw counts | GitHub | [12] |
| DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13] |
| DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14] |
| edgeRLRT | edgeR/LRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17] |
| edgeRLRTcensus | edgeR/LRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17] |
| edgeRLRTdeconv | edgeR/LRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18] |
| edgeRLRTrobust | edgeR/LRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19] |
| edgeRQLF | edgeR/QLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20] |
| edgeRQLFDetRate | edgeR/QLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20] |
| limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22] |
| MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23] |
| MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23] |
| MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23] |
| MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23] |
| metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24] |
| monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25] |
| monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26] |
| monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25] |
| NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27] |
| ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29] |
| ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29] |
| ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29] |
| SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30] |
| scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31] |
| SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32] |
| SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33] |
| ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35] |
| voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22] |
| Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36] |
Source: Soneson and Robinson 2018, Nature Methods, doi:10.1038/nmeth.4612
More detailed examples
MAST
- uses a generalized linear hurdle model
- designed to account for stochastic dropouts and bimodal expression distributions in which expression is either strongly non-zero or non-detectable
- the rate of expression Z and the level of expression Y are modeled for each gene g, with Z indicating whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
- a logistic regression model for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):
  $\operatorname{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$
  $\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix
- model parameters are fitted using an empirical Bayesian framework
- allows for a joint estimate of nuisance and treatment effects
- DE is determined using the likelihood ratio test
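To make the two-part structure concrete, here is a deliberately reduced sketch of a hurdle-type likelihood ratio test for one gene and two groups of cells. It is not MAST's implementation: the only covariate is group membership, there is no empirical Bayes step, and the names `hurdle_lrt`, `bernoulli_loglik` and `rss` as well as the toy values are assumptions made for this illustration. The discrete part compares detection rates (Z), the continuous part compares the means of the non-zero values (Y | Z = 1) under a common Gaussian variance, and the two likelihood ratio statistics are added.

```python
import math

def bernoulli_loglik(k, n):
    """Maximised Bernoulli log-likelihood for k detected cells out of n."""
    if k == 0 or k == n:
        return 0.0
    p = k / n
    return k * math.log(p) + (n - k) * math.log(1 - p)

def rss(values):
    """Residual sum of squares around the mean of the values."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def hurdle_lrt(group_a, group_b):
    """Return (statistic, degrees of freedom, p-value) for a two-group hurdle LRT.

    Inputs are per-cell expression values on a log scale; zeros mean "not detected".
    """
    pos_a = [v for v in group_a if v > 0]
    pos_b = [v for v in group_b if v > 0]

    # Discrete part: one detection rate per group versus one shared rate.
    ll_alt = (bernoulli_loglik(len(pos_a), len(group_a))
              + bernoulli_loglik(len(pos_b), len(group_b)))
    ll_null = bernoulli_loglik(len(pos_a) + len(pos_b),
                               len(group_a) + len(group_b))
    stat = max(0.0, 2 * (ll_alt - ll_null))
    df = 1

    # Continuous part: group-specific means versus one shared mean.
    # Skipped in this sketch when a group has fewer than two non-zero cells.
    if len(pos_a) >= 2 and len(pos_b) >= 2:
        rss_alt = rss(pos_a) + rss(pos_b)
        rss_null = rss(pos_a + pos_b)
        if rss_alt > 0:
            n_pos = len(pos_a) + len(pos_b)
            stat += max(0.0, n_pos * math.log(rss_null / rss_alt))
            df = 2

    # Chi-square survival function in closed form for 1 and 2 degrees of freedom.
    if df == 2:
        p_value = math.exp(-stat / 2)
    else:
        p_value = math.erfc(math.sqrt(stat / 2))
    return stat, df, p_value

# Hypothetical log-scale expression values for one gene.
cells_a = [0.0] * 8 + [1.0, 1.2]
cells_b = [5.0, 5.5, 6.0, 5.2, 4.8, 5.1, 0.0, 5.3, 5.6, 4.9]
stat, df, p_value = hurdle_lrt(cells_a, cells_b)
print(f"LRT statistic {stat:.2f} on {df} df, p = {p_value:.2e}")
```

A gene can come out as DE because it is detected in a different fraction of cells, because its level differs among the cells that express it, or both; that is the practical consequence of modelling Z and Y separately.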
Let's stop for a minute
The key
$\text{Outcome}_i = (\text{Model}_i) + \text{error}_i$
- we collect data on a sample from a much larger population
- statistics lets us make inferences about the population from which the sample was derived
- we try to predict the outcome given a model fitted to the data
The key
$t = \dfrac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$
Figure: Histogram of height (x-axis: height [cm], y-axis: Frequency).
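As a worked instance of the statistic above, the following sketch computes the pooled two-sample t statistic by hand. The function name `pooled_t` and the height values are made up for illustration; they are not the data behind the histogram on the slide.

```python
import math

def pooled_t(x, y):
    """Two-sample t statistic with pooled standard deviation s_p."""
    n1, n2 = len(x), len(y)
    mean1, mean2 = sum(x) / n1, sum(y) / n2
    ss1 = sum((v - mean1) ** 2 for v in x)
    ss2 = sum((v - mean2) ** 2 for v in y)
    s_p = math.sqrt((ss1 + ss2) / (n1 + n2 - 2))  # pooled standard deviation
    t = (mean1 - mean2) / (s_p * math.sqrt(1 / n1 + 1 / n2))
    return t, n1 + n2 - 2  # statistic and degrees of freedom

# Hypothetical heights [cm] in two groups.
heights_1 = [168.0, 171.5, 173.0, 175.5, 170.0]
heights_2 = [172.0, 176.5, 178.0, 174.5, 179.0]
t_stat, dof = pooled_t(heights_1, heights_2)
print(f"t = {t_stat:.3f} on {dof} degrees of freedom")
```

The same template (difference between groups divided by its estimated uncertainty) underlies most of the DE methods in this lecture; they differ mainly in the model used to estimate that uncertainty.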
Generic recipe
- model data, e.g. gene expression
- fit model to the data and/or data to the model
- estimate model parameters
- use model for prediction and/or inference
Generic recipe
- model, e.g. gene expression with random error
- fit model to the data and/or data to the model, estimate model parameters
- use model for prediction and/or inference
Important implication
- the better the model fits the data, the better the statistics
Common distributions: Negative Binomial
Figure: Three histograms of negative binomial read counts (x-axis: Read Counts, y-axis: Frequency).
$\mu = mu$
$\sigma^2 = mu + \dfrac{mu^2}{size}$
mu is the mean expression and size is the dispersion, which is inversely related to the variance. NB fits bulk RNA-seq data very well and it is used for most statistical methods designed for such data. In addition, it has been shown to fit the distribution of molecule counts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grun et al. 2014, Islam et al. 2011).
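The mean and variance formulas can be checked by simulation. The sketch below relies on a standard fact that is not stated on the slide: a negative binomial count can be generated as a Poisson count whose rate is itself gamma distributed with shape `size` and scale `mu / size`. All function names and parameter values are assumptions chosen for the illustration.

```python
import math
import random

def nb_variance(mu, size):
    """Variance of a negative binomial with mean mu and dispersion size."""
    return mu + mu ** 2 / size

def sample_poisson(rate, rng):
    """Draw one Poisson count by multiplying uniforms (adequate for small rates)."""
    limit = math.exp(-rate)
    count = 0
    product = rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

def sample_nb(mu, size, rng):
    """Draw one NB count as a Poisson with a gamma-distributed rate."""
    rate = rng.gammavariate(size, mu / size)  # shape = size, scale = mu / size
    return sample_poisson(rate, rng)

def sample_moments(values):
    """Sample mean and unbiased sample variance."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, variance

rng = random.Random(1)
counts = [sample_nb(mu=5.0, size=2.0, rng=rng) for _ in range(20000)]
sim_mean, sim_var = sample_moments(counts)
print(f"simulated: mean {sim_mean:.2f}, variance {sim_var:.2f}")
print(f"formula:   mean {5.0:.2f}, variance {nb_variance(5.0, 2.0):.2f}")
```

A Poisson model would force the variance to equal the mean; the extra `mu**2 / size` term is what lets the NB absorb the overdispersion seen in count data, and it shrinks as `size` grows.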
Common distributions: Zero-inflated NB
Figure: Two histograms of zero-inflated negative binomial read counts (x-axis: Read Counts, y-axis: Frequency).
$\mu = mu \cdot (1 - d)$
$\sigma^2 = mu \cdot (1 - d) \cdot \left(1 + d \cdot mu + \dfrac{mu}{size}\right)$
d is the dropout rate, and mu and size are the mean and dispersion of the underlying NB. The dropout rate of a gene is strongly correlated with the mean expression of the gene. Different zero-inflated negative binomial models use different relationships between mu and d, and some may fit mu and d to the expression of each gene independently. Implemented in MAST, SCDE.
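The zero-inflated moments can be verified numerically from the probability mass function. In this sketch the zero-inflated NB is taken to be a mixture that yields a structural zero with probability d and an ordinary NB(mu, size) count otherwise; the function names and the parameter values are assumptions made for the example.

```python
import math

def nb_pmf(k, mu, size):
    """Negative binomial probability of count k (mean mu, dispersion size)."""
    log_p = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
             + size * math.log(size / (size + mu))
             + k * math.log(mu / (size + mu)))
    return math.exp(log_p)

def zinb_pmf(k, mu, size, d):
    """Zero-inflated NB: extra mass d at zero, NB scaled by (1 - d) elsewhere."""
    p = (1 - d) * nb_pmf(k, mu, size)
    return p + d if k == 0 else p

def zinb_moments(mu, size, d, k_max=3000):
    """Mean and variance obtained by summing over the pmf up to k_max."""
    probs = [zinb_pmf(k, mu, size, d) for k in range(k_max + 1)]
    mean = sum(k * p for k, p in enumerate(probs))
    variance = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return mean, variance

mu, size, d = 10.0, 2.0, 0.3
mean, variance = zinb_moments(mu, size, d)
print(f"numerical: mean {mean:.4f}, variance {variance:.4f}")
print(f"formula:   mean {mu * (1 - d):.4f}, "
      f"variance {mu * (1 - d) * (1 + d * mu + mu / size):.4f}")
```

Setting d = 0 recovers the plain NB variance mu + mu^2 / size, which is a quick consistency check between the two distributions.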
Common distributions: Poisson-Beta
Figure: Two histograms of Poisson-Beta read counts (x-axis: Read Counts, y-axis: Frequency).
$\mu = g \cdot \dfrac{a}{a + b}$
$\sigma^2 = g^2 \cdot \dfrac{a \cdot b}{(a + b + 1) \cdot (a + b)^2}$
a is the rate of activation of transcription, b is the rate of inhibition of transcription, and g is the rate of transcript production while transcription is active at the locus. The variance above is that of the underlying expression rate g · Beta(a, b); the Poisson sampling step adds μ to the variance of the observed counts. Differential expression methods may test each of the parameters for differences across groups, or only one (often g). Implemented in BPSC.
May be further expanded to explicitly account for other sources of gene expression differences, such as batch effect or library depth, depending on the particular DE algorithm.
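A short simulation helps to see how a, b and g shape the counts. The sketch assumes the usual construction of the Poisson-Beta model: draw the fraction of time the gene is active from Beta(a, b), multiply by g to get a rate, then draw a Poisson count with that rate. By the law of total variance the counts then have variance equal to the mean plus the variance of the rate. Function names and parameter values are hypothetical.

```python
import math
import random

def poisson_beta_moments(a, b, g):
    """Mean of the counts, variance of the rate g*Beta(a, b), variance of the counts."""
    mean = g * a / (a + b)
    rate_variance = g ** 2 * a * b / ((a + b + 1) * (a + b) ** 2)
    return mean, rate_variance, mean + rate_variance

def sample_poisson(rate, rng):
    """Draw one Poisson count by multiplying uniforms (adequate for small rates)."""
    limit = math.exp(-rate)
    count = 0
    product = rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

def sample_poisson_beta(a, b, g, rng):
    """Draw one count: Beta-distributed activity, then Poisson sampling."""
    active_fraction = rng.betavariate(a, b)
    return sample_poisson(g * active_fraction, rng)

rng = random.Random(7)
a, b, g = 2.0, 3.0, 20.0
counts = [sample_poisson_beta(a, b, g, rng) for _ in range(20000)]
n = len(counts)
sim_mean = sum(counts) / n
sim_var = sum((c - sim_mean) ** 2 for c in counts) / (n - 1)

mean, rate_var, count_var = poisson_beta_moments(a, b, g)
print(f"simulated counts: mean {sim_mean:.2f}, variance {sim_var:.2f}")
print(f"theory: mean {mean:.2f}, rate variance {rate_var:.2f}, "
      f"count variance {count_var:.2f}")
```

Small values of a and b push the Beta distribution towards 0 and 1, which produces the bimodal (on/off) count distributions characteristic of bursty transcription.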
MAST (revisited)
- uses a generalized linear hurdle model
- designed to account for stochastic dropouts and bimodal expression distributions in which expression is either strongly non-zero or non-detectable
- the rate of expression Z and the level of expression Y are modeled for each gene g, with Z indicating whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
- a logistic regression model for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):
  $\operatorname{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$
  $\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix
- model parameters are fitted using an empirical Bayesian framework
- allows for a joint estimate of nuisance and treatment effects
- DE is determined using the likelihood ratio test
SCDE
- models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution
- the NB distribution models the transcripts that are amplified and detected
- the Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)
- a subset of robust genes is used to fit, via the EM algorithm, the parameters of the mixture of models
- for DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach
Monocle
- originally designed for ordering cells by progress through differentiation stages (pseudo-time)
- the mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as
  $g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$
  where Y is a specific gene expression level, $x_i$ are predictor variables, g is a link function (typically the log function) and $f_i$ are non-parametric functions (e.g. cubic splines)
- the observable expression level Y is then modeled using a GAM:
  $E(Y) = s(\varphi_t(b_x, s_i)) + \epsilon$
  where $\varphi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and s is a cubic smoothing function with three degrees of freedom. The error term $\epsilon$ is normally distributed with a mean of zero
- the DE test is performed using an approximate $\chi^2$ likelihood ratio test
Performance
No ground truth, i.e. no independently validated truth is available for testing
Known data
- using data we know something about to get positive controls
Simulated data
- null data sets by re-sampling
- modeling data sets based on various distributions
Comparing between methods and scenarios
- comparing numbers of DEs, incl. as a function of group size
Investigating results
- what do the expression and distributions of detected DEs look like?
- False positives (type I error) vs false negatives (type II error)
- Sensitivity and specificity
- Precision and recall
adapted from Wikipedia
Dal Molin, Baruzzo and Di Camillo 2017: 2 conditions of 100 cells each, simulated with 10,000 genes, out of which 2,000 were set to be DE (based on NB and bimodal distributions)
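With simulated data the truth is known, so the metrics listed above can be computed directly. The sketch below uses the simulation dimensions quoted from Dal Molin et al. (10,000 genes, 2,000 truly DE), but the set of genes "called" by the method is entirely hypothetical and does not reproduce any result from that paper; the function names are likewise assumptions.

```python
def confusion_counts(called, truth, n_genes):
    """Count true/false positives and negatives from sets of gene identifiers."""
    tp = len(called & truth)
    fp = len(called - truth)
    fn = len(truth - called)
    tn = n_genes - tp - fp - fn
    return tp, fp, fn, tn

def ratio(numerator, denominator):
    """Safe division: undefined ratios are reported as NaN."""
    return numerator / denominator if denominator else float("nan")

def performance(tp, fp, fn, tn):
    """Sensitivity (recall), specificity, precision and false discovery rate."""
    precision = ratio(tp, tp + fp)
    return {
        "sensitivity": ratio(tp, tp + fn),  # same quantity as recall
        "specificity": ratio(tn, tn + fp),
        "precision": precision,
        "fdr": 1 - precision,
    }

n_genes = 10_000
truly_de = set(range(2_000))                         # genes 0..1999 simulated as DE
called_de = set(range(1_500)) | set(range(2_000, 2_300))  # hypothetical method output

tp, fp, fn, tn = confusion_counts(called_de, truly_de, n_genes)
for name, value in performance(tp, fp, fn, tn).items():
    print(f"{name}: {value:.4f}")
```

Because non-DE genes vastly outnumber DE genes, specificity can look excellent while a sizeable share of the reported genes are false positives, which is why precision (or FDR) is usually reported alongside it.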
Consistency
Miao et al 2017
And so much more
Soneson and Robinson 2018: Bias, robustness and scalability in single-cell differential expression analysis
- 36 statistical approaches for DE analysis to compare the expression levels in the two groups of cells
- based on 9 data sets with 11 - 21 separate instances (sample size effect)
- extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods etc.
- conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets
Practicalities
Getting to know your data
- Example data: 46078 genes x 96 cells
- 22229 genes with no expression at all
Figure: Two histograms (y-axis: Frequency); x-axes: Read Counts and 0 counts.
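The kind of first look described here can be sketched on a toy count matrix. The matrix below is invented (4 genes x 3 cells, rows are genes); it is not the 46078 x 96 example data set, and the function name `summarise_counts` is an assumption. The summary reports how many genes are never detected, how many zeros each gene has, and the detection rate of each cell (the fraction of genes with a non-zero count).

```python
def summarise_counts(counts):
    """Summarise a genes x cells count matrix given as a list of rows."""
    n_genes = len(counts)
    n_cells = len(counts[0])
    # Genes with a zero count in every cell.
    genes_all_zero = sum(1 for row in counts if all(c == 0 for c in row))
    # Number of cells with a zero count, per gene.
    zeros_per_gene = [sum(1 for c in row if c == 0) for row in counts]
    # Fraction of genes detected (count > 0), per cell.
    detection_rate = [
        sum(1 for row in counts if row[j] > 0) / n_genes
        for j in range(n_cells)
    ]
    return genes_all_zero, zeros_per_gene, detection_rate

# Hypothetical toy data: rows are genes, columns are cells.
toy_counts = [
    [0, 0, 0],
    [5, 0, 12],
    [0, 3, 0],
    [7, 9, 1],
]
genes_all_zero, zeros_per_gene, detection_rate = summarise_counts(toy_counts)
print("genes with no expression at all:", genes_all_zero)
print("zero counts per gene:", zeros_per_gene)
print("detection rate per cell:", detection_rate)
```

Genes that are zero everywhere carry no information for DE testing and are typically filtered out first, while the per-cell detection rate is a candidate covariate, as in the DetRate variants listed in Supplementary Table 2.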
Choosing DE methods
Soneson and Robinson 2018
Remembering the bigger picture
Stegle, Teichmann and Marioni 2015
- QC filtering
- Cell-cycle phase
- Normalization of cell-specific biases
- Confounding factors, incl. batch effects
- Detection rate, i.e. the fraction of detected genes per cell
- Imputation strategies for dropout values
- What is pragmatic: programming language, platform, speed, collaborative workflows etc.
Staying critical
Summary
What to remember from this hour?
https://www.menti.com & 25 06 78
Growing field
Angerer et al 2017
Growing field
https://www.scrna-tools.org/tools
Zappia, Phipson and Oshlack 2018
Introduction
adapted from Wu et al 2017
Differential expression meanstaking read count data amp
performing statistical analysis to discoverquantitative changes in expression levels betweenexperimental groups
ie to decide whether for a given gene anobserved difference in read counts is significant(greater than what would be expected just due tonatural random variation)
Differential expression is an old problemknown from bulk RNA-seq and microarray studies
in fact building on one of the most commonstatistical problems ie comparing groups forstatistical differences
Olga (NBIS) scRNA-seq DE February 2019 7 46
Introduction
adapted from Wu et al 2017
Differential expression meanstaking read count data amp
performing statistical analysis to discoverquantitative changes in expression levels betweenexperimental groups
ie to decide whether for a given gene anobserved difference in read counts is significant(greater than what would be expected just due tonatural random variation)
Differential expression is an old problemknown from bulk RNA-seq and microarray studies
in fact building on one of the most commonstatistical problems ie comparing groups forstatistical differences
Olga (NBIS) scRNA-seq DE February 2019 7 46
Introduction
Differential expression is an old problem
So what is all the commotion about
httpswwwmenticom amp 25 06 78
scRNA-seq special characteristicshigh noise levels (technical and biological factors)low library sizeslow amount of available mRNAs results in amplification biases anddropout events3rsquo bias partial coverage and uneven depth (technical)stochastic nature of transcription (biological)multimodality in gene expression presence of multiple possiblecell states within a cell population (biological)
Olga (NBIS) scRNA-seq DE February 2019 8 46
Introduction
Differential expression is an old problem
So what is all the commotion about
httpswwwmenticom amp 25 06 78
scRNA-seq special characteristicshigh noise levels (technical and biological factors)low library sizeslow amount of available mRNAs results in amplification biases anddropout events3rsquo bias partial coverage and uneven depth (technical)stochastic nature of transcription (biological)multimodality in gene expression presence of multiple possiblecell states within a cell population (biological)
Olga (NBIS) scRNA-seq DE February 2019 8 46
Introduction
Rbm17 Rragc Slc1a3 Slc22a20 Smarcd1
Mybpc1 Nars Ndufa3 Nono Pgam2
Crispld2 Fbxw13 Hbxip Katna1 Lcorl
1300018J18Rik Arid2 Bend3 Ccdc104 Ccnt1
00 25 50 75 100 00 25 50 75 100 00 25 50 75 100 00 25 50 75 100 00 25 50 75 100
00
01
02
03
0
1
2
3
00
01
02
03
00
01
02
03
00
01
02
03
000
005
010
015
020
025
00
02
04
06
00
01
02
03
04
05
00
01
02
03
00
01
02
03
00
02
04
00
05
10
15
00
01
02
03
00
05
10
15
20
00
02
04
000
005
010
015
0
1
2
3
4
00
05
10
15
00
01
02
03
04
05
00
01
02
03
04
value
dens
ity
Based on tutorial data
Olga (NBIS) scRNA-seq DE February 2019 9 46
Common methods
Common methods
Olga (NBIS) scRNA-seq DE February 2019 10 46
Common methods
Simplified scRNA-seq workflow [adopted from httphemberg-labgithubio
Olga (NBIS) scRNA-seq DE February 2019 11 46
Common methods
Generic non-parametric methodseg Wilcoxon rank-sum test Kruskal-Wallis Kolmogorov-Smirnov testnon-parametric tests generally convert observed expression values toranks amp test whether the distribution of ranks for one group aresignficantly different from the distribution of ranks for the other groupsome non-parametric methods fail in the presence of a large number oftied values such as the case for dropouts (zeros) in single-cell RNA-seqexpression dataif the conditions for a parametric test hold then it will typically be morepowerful than a non-parametric test
Olga (NBIS) scRNA-seq DE February 2019 12 46
Common methods
developed for bulk RNA-seqeg edgeR DE-seq2compare estimates of mean-expression (sample size)based on negative binomial distributioncan be assessed by datasets where RNA-seq data has beeen validatedby RT-qPCR
Olga (NBIS) scRNA-seq DE February 2019 13 46
Common methods
developed for scRNA-seq
eg MAST SCDE Monocle Pagoda D3E etclarge number of samples (ie cells) for each group we are comparing insingle-cell experiments Thus we can take advantage of the wholedistribution of expression values in each group to identify differencesbetween groupswe usually do not have a defined set of experimental conditions insteadwe try to identify the cell groups by using an unsupervised clusteringapproach
Olga (NBIS) scRNA-seq DE February 2019 14 46
Common methods
Miao and Zhang 2016
Olga (NBIS) scRNA-seq DE February 2019 15 46
Common methodsSupplementary Table 2 Evaluated dicrarrerential expression methods together with package versions and thetype of input values provided to each of them Note that ldquoraw countsrdquo here refers to length-scaled TPMswhich are on the scale of the raw counts but are unacrarrected by dicrarrerential isoform usage [10] CPM valuesare calculated with edgeR and Census counts with monocle
Short name Method Software version InputAvailablefrom
Reference
BPSC BPSC BPSC 09901 CPM GitHub [11]
D3E D3E D3E 10 raw counts GitHub [12]
DESeq2 DESeq2 DESeq2 1141 raw counts Bioconductor [13]
DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13]
DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14]
edgeRLRT | edgeR/LRT | edgeR 3.19.1 | raw counts | Bioconductor | [15-17]
edgeRLRTcensus | edgeR/LRT | edgeR 3.19.1 | Census counts | Bioconductor | [15-17]
edgeRLRTdeconv | edgeR/LRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18]
edgeRLRTrobust | edgeR/LRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15-17, 19]
edgeRQLF | edgeR/QLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
edgeRQLFDetRate | edgeR/QLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22]
MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24]
monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25]
monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26]
monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25]
NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27]
ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29]
ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29]
ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29]
SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30]
scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31]
SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32]
SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33]
ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35]
voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22]
Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36]
Source: Soneson and Robinson 2018, Nature Methods, doi:10.1038/nmeth.4612
More detailed examples
MAST
- uses a generalized linear hurdle model
- designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable
- the rate of expression Z and the level of expression Y are modeled for each gene g, where $Z_{ig}$ indicates whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
- a logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):
  $\operatorname{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$
  $\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix
- model parameters are fitted using an empirical Bayesian framework
- allows for a joint estimate of nuisance and treatment effects
- DE is determined using the likelihood ratio test
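The two-part idea behind the hurdle model can be sketched for the simplest case: one gene, two groups of cells, no covariates. This is a minimal illustration and not MAST's implementation: it leaves out the empirical Bayes shrinkage and the design matrix, and the function names (`bernoulli_loglik`, `hurdle_lrt`) are hypothetical. It assumes log-scale expression values where 0 means "not detected", and it adds the likelihood ratio statistics of the discrete and continuous parts, comparing the sum against a chi-square distribution with 2 degrees of freedom.

```python
import math


def bernoulli_loglik(detected, total):
    """Maximised Bernoulli log-likelihood for `detected` expressing cells out of `total`."""
    if detected == 0 or detected == total:
        return 0.0  # the fitted rate is 0 or 1, so the likelihood equals 1
    rate = detected / total
    return detected * math.log(rate) + (total - detected) * math.log(1 - rate)


def sum_of_squares(values, centre):
    return sum((v - centre) ** 2 for v in values)


def hurdle_lrt(group_a, group_b):
    """Return (statistic, p_value) of a two-group hurdle likelihood ratio test.

    Inputs are log-scale expression values of one gene; 0 means "not detected".
    """
    pos_a = [v for v in group_a if v > 0]
    pos_b = [v for v in group_b if v > 0]

    # Discrete part (Z): does the detection rate differ between the groups?
    separate = (bernoulli_loglik(len(pos_a), len(group_a))
                + bernoulli_loglik(len(pos_b), len(group_b)))
    pooled = bernoulli_loglik(len(pos_a) + len(pos_b),
                              len(group_a) + len(group_b))
    discrete_stat = 2 * (separate - pooled)

    # Continuous part (Y | Z = 1): Gaussian with a common variance,
    # one shared mean under the null versus one mean per group.
    continuous_stat = 0.0
    if pos_a and pos_b:
        all_pos = pos_a + pos_b
        ss_null = sum_of_squares(all_pos, sum(all_pos) / len(all_pos))
        ss_alt = (sum_of_squares(pos_a, sum(pos_a) / len(pos_a))
                  + sum_of_squares(pos_b, sum(pos_b) / len(pos_b)))
        if ss_alt > 0:
            continuous_stat = len(all_pos) * math.log(ss_null / ss_alt)

    statistic = max(0.0, discrete_stat + continuous_stat)
    # Survival function of a chi-square distribution with 2 degrees of freedom.
    p_value = math.exp(-statistic / 2)
    return statistic, p_value
```

Calling `hurdle_lrt(cells_a, cells_b)` on two lists of values for the same gene returns a small p-value when either the fraction of expressing cells or the mean of the expressing cells differs between the groups.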
Let's stop for a minute
The key
$\text{outcome}_i = (\text{model}_i) + \text{error}_i$
- we collect data on a sample from a much larger population
- statistics lets us make inferences about the population from which the sample was derived
- we try to predict the outcome given a model fitted to the data
The key
$t = \dfrac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$
[Figure: histogram of height [cm] (x-axis) against frequency (y-axis)]
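As a worked instance of the t statistic on this slide, the sketch below computes it for two small samples. It assumes that s_p denotes the pooled standard deviation of the classical equal-variance two-sample t-test; the function name `pooled_t` and the height values are made up for illustration.

```python
import math
from statistics import mean, variance


def pooled_t(x1, x2):
    """Two-sample t statistic with a pooled standard deviation s_p."""
    n1, n2 = len(x1), len(x2)
    # Pooled variance: sample variances weighted by their degrees of freedom.
    pooled_var = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    s_p = math.sqrt(pooled_var)
    return (mean(x1) - mean(x2)) / (s_p * math.sqrt(1 / n1 + 1 / n2))


# Hypothetical heights [cm] in two groups.
t = pooled_t([170, 172, 174], [176, 178, 180])
```

Here both groups have a sample variance of 4, so s_p = 2 and t = -6 / (2 * sqrt(2/3)), which is about -3.67.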
Generic recipe
- model the data, e.g. gene expression with random error
- fit the model to the data and/or the data to the model
- estimate the model parameters
- use the model for prediction and/or inference
Important implication
- the better the model fits the data, the better the statistics
Common distributions
Negative Binomial
[Figure: three example histograms of negative binomial read counts (x-axis: read counts, y-axis: frequency)]
$\mu = \mathit{mu}$
$\sigma^2 = \mathit{mu} + \dfrac{\mathit{mu}^2}{\mathit{size}}$
Here mu is the mean expression and size is the dispersion parameter, which is inversely related to the variance. The NB fits bulk RNA-seq data very well and is used by most statistical methods designed for such data. In addition, it has been shown to fit the distribution of molecule counts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grün et al. 2014; Islam et al. 2011).
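The mean and variance formulas for the negative binomial can be checked numerically. The sketch below writes the probability mass function in the mu/size parameterisation and sums it to recover both moments; the function names and the example values (mu = 5, size = 2) are chosen only for illustration.

```python
import math


def nb_pmf(k, mu, size):
    """Negative binomial probability of count k, given mean `mu` and dispersion `size`."""
    log_p = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
             + size * math.log(size / (size + mu))
             + k * math.log(mu / (size + mu)))
    return math.exp(log_p)


def nb_moments(mu, size, upper=1000):
    """Mean and variance obtained by summing the pmf over counts 0..upper-1."""
    probs = [nb_pmf(k, mu, size) for k in range(upper)]
    mean = sum(k * p for k, p in enumerate(probs))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return mean, var


mean, var = nb_moments(mu=5.0, size=2.0)
# Expected from the formulas: mean = 5 and variance = 5 + 25 / 2 = 17.5
```

A smaller size gives a larger variance for the same mean, which is the sense in which size is inversely related to the variance.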
Common distributions
Zero-inflated NB
[Figure: two example histograms of zero-inflated negative binomial read counts (x-axis: read counts, y-axis: frequency)]
$\mu = \mathit{mu} \cdot (1 - d)$
$\sigma^2 = \mathit{mu} \cdot (1 - d) \cdot \left(1 + d \cdot \mathit{mu} + \dfrac{\mathit{mu}}{\mathit{size}}\right)$
Here d is the dropout rate, and mu and size are the mean and dispersion of the underlying NB. The dropout rate of a gene is strongly correlated with the mean expression of the gene. Different zero-inflated negative binomial models use different relationships between mu and d, and some may fit mu and d to the expression of each gene independently. Implemented in MAST and SCDE.
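The zero-inflated NB moments can be verified in the same numerical way. The sketch below assumes the usual construction, in which a count is a structural zero with probability d and otherwise follows an NB with mean mu and dispersion size; the function names and example values are illustrative only.

```python
import math


def nb_pmf(k, mu, size):
    """Negative binomial probability of count k (mean `mu`, dispersion `size`)."""
    log_p = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
             + size * math.log(size / (size + mu))
             + k * math.log(mu / (size + mu)))
    return math.exp(log_p)


def zinb_pmf(k, mu, size, d):
    """Zero-inflated NB: extra mass d at zero, NB scaled by (1 - d) elsewhere."""
    extra_zero = d if k == 0 else 0.0
    return extra_zero + (1 - d) * nb_pmf(k, mu, size)


def zinb_moments(mu, size, d, upper=1000):
    """Mean and variance obtained by summing the pmf over counts 0..upper-1."""
    probs = [zinb_pmf(k, mu, size, d) for k in range(upper)]
    mean = sum(k * p for k, p in enumerate(probs))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return mean, var


mean, var = zinb_moments(mu=5.0, size=2.0, d=0.3)
# Formulas: mean = 5 * 0.7 = 3.5, variance = 5 * 0.7 * (1 + 0.3 * 5 + 5 / 2) = 17.5
```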
Common distributions
Poisson-Beta
[Figure: two example histograms of Poisson-Beta read counts (x-axis: read counts, y-axis: frequency)]
$\mu = g \cdot \dfrac{a}{a + b}$
$\sigma^2 = g^2 \cdot \dfrac{a \cdot b}{(a + b + 1) \cdot (a + b)^2}$
Here a is the rate of activation of transcription, b is the rate of inhibition of transcription, and g is the rate of transcript production while transcription is active at the locus. Differential expression methods may test each of the parameters for differences across groups, or only one (often g). Implemented in BPSC.
May be further expanded to explicitly account for other sources of gene expression differences, such as batch effect or library depth, depending on the particular DE algorithm.
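A small simulation can make the Poisson-Beta parameters concrete. The sketch below assumes the standard construction: for each cell, an activity fraction is drawn from Beta(a, b) and the count from a Poisson distribution with rate g times that fraction. The function names, the seed and the parameter values are illustrative. Note that the variance formula above describes the Beta-distributed rate; by the law of total variance, sampled counts carry additional Poisson noise, so their variance should be about the rate variance plus the mean.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson variate with Knuth's multiplication method (fine for moderate lam)."""
    limit = math.exp(-lam)
    count = 0
    product = rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def simulate_poisson_beta(a, b, g, n_cells, seed=0):
    """Simulate counts of one gene across n_cells under a Poisson-Beta model."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_cells):
        active_fraction = rng.betavariate(a, b)  # share of time the locus is active
        counts.append(sample_poisson(g * active_fraction, rng))
    return counts


counts = simulate_poisson_beta(a=2.0, b=3.0, g=50.0, n_cells=20000, seed=1)
sample_mean = sum(counts) / len(counts)
sample_var = sum((c - sample_mean) ** 2 for c in counts) / (len(counts) - 1)
# Formula values: mean = 50 * 2 / 5 = 20, rate variance = 2500 * 6 / (6 * 25) = 100,
# so the variance of the counts is expected to be close to 100 + 20 = 120.
```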
MAST (revisited)
- uses a generalized linear hurdle model
- designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable
- the rate of expression Z and the level of expression Y are modeled for each gene g, where $Z_{ig}$ indicates whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
- a logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):
  $\operatorname{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$
  $\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix
- model parameters are fitted using an empirical Bayesian framework
- allows for a joint estimate of nuisance and treatment effects
- DE is determined using the likelihood ratio test
SCDE
- models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution
- the NB distribution models the transcripts that are amplified and detected
- the Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)
- a subset of robust genes is used to fit, via the EM algorithm, the parameters of the mixture of models
- for DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach
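The mixture idea can be illustrated with Bayes' rule for a single observed count. The sketch below is not SCDE's implementation: the mixing weight, the background rate and the NB parameters are fixed, made-up values, whereas SCDE estimates such quantities from the data. It computes the posterior probability that a count came from the Poisson (dropout/background) component rather than the NB (amplified) component.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability of count k with rate lam (lam > 0)."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def nb_pmf(k, mu, size):
    """Negative binomial probability of count k (mean `mu`, dispersion `size`)."""
    log_p = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
             + size * math.log(size / (size + mu))
             + k * math.log(mu / (size + mu)))
    return math.exp(log_p)


def dropout_posterior(k, dropout_weight, background_rate, mu, size):
    """Posterior probability that count k was produced by the dropout component."""
    dropout = dropout_weight * poisson_pmf(k, background_rate)
    amplified = (1 - dropout_weight) * nb_pmf(k, mu, size)
    return dropout / (dropout + amplified)


# Hypothetical parameters: 30% dropout weight, background rate 0.1, NB mean 20, size 2.
p_zero = dropout_posterior(0, 0.3, 0.1, 20.0, 2.0)
p_high = dropout_posterior(20, 0.3, 0.1, 20.0, 2.0)
```

With these values a zero count is attributed almost entirely to the dropout component, while a count of 20 is attributed to the amplified component. In an EM fit, this posterior corresponds to the E-step.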
Monocle
- originally designed for ordering cells by progress through differentiation stages (pseudo-time)
- the mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as
  $g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$
  where Y is a specific gene expression level, $x_i$ are predictor variables, g is a link function (typically the log function), and $f_i$ are non-parametric functions (e.g. cubic splines)
- the observable expression level Y is then modeled using a GAM:
  $E(Y) = s(\varphi_t(b_x, s_i)) + \epsilon$
  where $\varphi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and s is a cubic smoothing function with three degrees of freedom; the error term $\epsilon$ is normally distributed with a mean of zero
- the DE test is performed using an approximate $\chi^2$ likelihood ratio test
Performance
Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle.
Short name | Method | Software version | Input | Available from | Reference
BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11]
D3E | D3E | D3E 1.0 | raw counts | GitHub | [12]
DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13]
DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14]
edgeRLRT | edgeR/LRT | edgeR 3.19.1 | raw counts | Bioconductor | [15-17]
edgeRLRTcensus | edgeR/LRT | edgeR 3.19.1 | Census counts | Bioconductor | [15-17]
edgeRLRTdeconv | edgeR/LRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18]
edgeRLRTrobust | edgeR/LRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15-17, 19]
edgeRQLF | edgeR/QLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
edgeRQLFDetRate | edgeR/QLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22]
MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24]
monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25]
monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26]
monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25]
NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27]
ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29]
ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29]
ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29]
SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30]
scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31]
SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32]
SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33]
ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35]
voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22]
Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36]
Source: Nature Methods, doi:10.1038/nmeth.4612
No ground truth, i.e. no independently validated truth is available for testing.
Known data
- using data we know something about to get positive controls
Simulated data
- null data sets by re-sampling
- modeling data sets based on various distributions
Comparing between methods and scenarios
- comparing numbers of DE genes, incl. as a function of group size
Investigating results
- what do the expression levels and distributions of the detected DE genes look like?
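The "null data sets by re-sampling" idea can be sketched as follows: cells from a single homogeneous population are split at random into two pseudo-groups, so any gene that a method then reports as DE is a false positive. The function name `null_split` and the cell labels are hypothetical; repeating the split for several group sizes gives the "as a function of group size" comparison.

```python
import random


def null_split(cell_ids, group_size, seed=0):
    """Randomly draw two disjoint pseudo-groups of `group_size` cells each.

    All cells come from the same population, so no true DE exists between the groups.
    """
    if 2 * group_size > len(cell_ids):
        raise ValueError("not enough cells for two groups of this size")
    rng = random.Random(seed)
    chosen = rng.sample(list(cell_ids), 2 * group_size)
    return chosen[:group_size], chosen[group_size:]


# Hypothetical population of 96 cells, split for increasing group sizes.
cells = [f"cell_{i}" for i in range(96)]
null_instances = {size: null_split(cells, size, seed=size) for size in (6, 12, 24, 48)}
```

Each pair of pseudo-groups would then be passed to the DE method under evaluation, and the fraction of genes called significant recorded as its false positive rate.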
- false positives (type I error) vs false negatives (type II error)
- sensitivity and specificity
- precision and recall
[Figure: adapted from Wikipedia]
Dal Molin, Baruzzo and Di Camillo 2017: 2 conditions of 100 cells each, simulated with 10,000 genes, of which 2,000 were set to be DE (based on NB and bimodal distributions).
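These metrics follow directly from comparing the set of genes called DE with the set of truly DE genes. The sketch below uses the simulation dimensions quoted above (10,000 genes, 2,000 of them truly DE); the set of called genes is invented purely to show the arithmetic.

```python
def de_metrics(called, truth, n_genes):
    """Sensitivity (recall), specificity and precision of a set of DE calls."""
    tp = len(called & truth)       # truly DE and called
    fp = len(called - truth)       # called but not DE (type I errors)
    fn = len(truth - called)       # DE but missed (type II errors)
    tn = n_genes - tp - fp - fn    # not DE and not called
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }


truth = set(range(2000))           # genes 0..1999 are the simulated DE genes
called = set(range(500, 2500))     # hypothetical output of a DE method
metrics = de_metrics(called, truth, n_genes=10000)
# TP = 1500, FP = 500, FN = 500, TN = 7500
```

For this made-up call set the sensitivity and precision are both 0.75 and the specificity is 0.9375.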
Consistency
[Figure: Miao et al. 2017]
And so much more
Soneson and Robinson 2018: "Bias, robustness and scalability in single-cell differential expression analysis"
- 36 statistical approaches for DE analysis to compare the expression levels in the two groups of cells
- based on 9 data sets with 11-21 separate instances (sample size effect)
- extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.
- conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets
Practicalities
Getting to know your data
Example data: 46,078 genes x 96 cells; 22,229 genes with no expression at all.
[Figure: two histograms, one of read counts (x-axis: read counts, y-axis: frequency) and one of 0 counts (x-axis: 0 counts, y-axis: frequency)]
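A first look at a count matrix usually includes the two summaries hinted at here: how many genes have no expression at all, and what fraction of genes is detected in each cell. The sketch below assumes a genes x cells matrix stored as a list of rows; the tiny matrix and the function names are made up for illustration.

```python
def count_all_zero_genes(counts):
    """Number of genes (rows) with zero counts in every cell."""
    return sum(1 for gene_row in counts if all(value == 0 for value in gene_row))


def detection_rate_per_cell(counts):
    """Fraction of genes with a non-zero count, computed for each cell (column)."""
    n_genes = len(counts)
    n_cells = len(counts[0])
    return [sum(1 for gene_row in counts if gene_row[cell] > 0) / n_genes
            for cell in range(n_cells)]


# Hypothetical 4 genes x 3 cells count matrix.
counts = [
    [0, 0, 0],
    [5, 0, 2],
    [0, 0, 0],
    [1, 3, 0],
]
n_silent = count_all_zero_genes(counts)
rates = detection_rate_per_cell(counts)
```

In this toy matrix two of the four genes are never detected, and the detection rates of the three cells are 0.5, 0.25 and 0.25.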
Choosing DE methods
[Figure: Soneson and Robinson 2018]
Remembering the bigger picture
[Figure: Stegle, Teichmann and Marioni 2015]
- QC filtering
- cell-cycle phase
- normalization of cell-specific biases
- confounding factors, incl. batch effects
- detection rate, i.e. the fraction of detected genes per cell
- imputation strategies for dropout values
- what is pragmatic: programming language, platform, speed, collaborative workflows, etc.
Staying critical
Summary
What to remember from this hour?
https://www.menti.com & 25 06 78
Growing field
[Figure: Angerer et al. 2017]
[Figure: https://www.scrna-tools.org/tools, Zappia, Phipson and Oshlack 2018]
Summary
- scRNA-seq is a rapidly growing field
- DE is a common task, so many newer and better methods will be developed
- understanding basic statistical concepts enables one to think more like a statistician, to choose and evaluate methods given a data set
- staying critical, staying updated, staying connected
Bibliography
Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1-9. ISSN 1367-4803. doi:10.1093/bioinformatics/bty329.
Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243-260. ISSN 20954697. doi:10.1007/s40484-016-0089-7.
Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255-261. ISSN 15487105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612.
Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN 16648021. doi:10.3389/fgene.2017.00062.
Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data." No. May: 1-2. ISSN 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549.
Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133-145.
Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85-91. ISSN 24523100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004.
Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573.
Introduction
[Figure: adapted from Wu et al. 2017]
Differential expression means
- taking read count data and
- performing statistical analysis to discover quantitative changes in expression levels between experimental groups
- i.e. to decide whether, for a given gene, an observed difference in read counts is significant (greater than what would be expected just due to natural random variation)
Differential expression is an old problem
- known from bulk RNA-seq and microarray studies
- in fact building on one of the most common statistical problems, i.e. comparing groups for statistical differences
Differential expression is an old problem
So what is all the commotion about?
https://www.menti.com & 25 06 78
scRNA-seq special characteristics
- high noise levels (technical and biological factors)
- low library sizes
- low amount of available mRNAs results in amplification biases and dropout events
- 3' bias, partial coverage and uneven depth (technical)
- stochastic nature of transcription (biological)
- multimodality in gene expression: presence of multiple possible cell states within a cell population (biological)
[Figure: density plots of expression values (x-axis: value, y-axis: density) for 20 example genes: Rbm17, Rragc, Slc1a3, Slc22a20, Smarcd1, Mybpc1, Nars, Ndufa3, Nono, Pgam2, Crispld2, Fbxw13, Hbxip, Katna1, Lcorl, 1300018J18Rik, Arid2, Bend3, Ccdc104, Ccnt1. Based on tutorial data.]
Common methods
[Figure: Simplified scRNA-seq workflow, adapted from http://hemberg-lab.github.io]
Generic non-parametric methods
- e.g. Wilcoxon rank-sum test, Kruskal-Wallis test, Kolmogorov-Smirnov test
- non-parametric tests generally convert observed expression values to ranks and test whether the distribution of ranks for one group is significantly different from the distribution of ranks for the other group
- some non-parametric methods fail in the presence of a large number of tied values, such as the case for dropouts (zeros) in single-cell RNA-seq expression data
- if the conditions for a parametric test hold, then it will typically be more powerful than a non-parametric test
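The rank conversion and the effect of ties can be sketched with a small Wilcoxon rank-sum implementation. This is an illustration only: it uses midranks for tied values and the tie-corrected normal approximation without a continuity correction, and the function names are hypothetical. When every value is a zero, all ranks are tied, the variance of the rank sum collapses to zero, and the test carries no information.

```python
import math


def midranks(values):
    """Ranks starting at 1, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = average_rank
        start = end + 1
    return ranks


def rank_sum_test(x, y):
    """Return (z, two-sided p) of the rank-sum test, normal approximation with ties."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = list(x) + list(y)
    rank_sum = sum(midranks(pooled)[:n1])
    expected = n1 * (n + 1) / 2
    tie_sizes = {}
    for value in pooled:
        tie_sizes[value] = tie_sizes.get(value, 0) + 1
    tie_term = sum(t ** 3 - t for t in tie_sizes.values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return 0.0, 1.0  # every value is tied: the ranks cannot separate the groups
    z = (rank_sum - expected) / math.sqrt(variance)
    return z, math.erfc(abs(z) / math.sqrt(2))
```

For example, `rank_sum_test([0, 0, 0, 0], [0, 0, 0, 0])` returns `(0.0, 1.0)`, which is the degenerate all-dropout case.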
Developed for bulk RNA-seq
- e.g. edgeR, DESeq2
- compare estimates of mean expression (sample size)
- based on the negative binomial distribution
- can be assessed using data sets where RNA-seq data has been validated by RT-qPCR
Developed for scRNA-seq
- e.g. MAST, SCDE, Monocle, Pagoda, D3E, etc.
- large number of samples (i.e. cells) for each group we are comparing in single-cell experiments; thus we can take advantage of the whole distribution of expression values in each group to identify differences between groups
- we usually do not have a defined set of experimental conditions; instead we try to identify the cell groups by using an unsupervised clustering approach
[Figure: Miao and Zhang 2016]
Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR, and Census counts with monocle.

Short name | Method | Software version | Input | Available from | Reference
BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11]
D3E | D3E | D3E 1.0 | raw counts | GitHub | [12]
DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13]
DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14]
edgeRLRT | edgeR/LRT | edgeR 3.19.1 | raw counts | Bioconductor | [15-17]
edgeRLRTcensus | edgeR/LRT | edgeR 3.19.1 | Census counts | Bioconductor | [15-17]
edgeRLRTdeconv | edgeR/LRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18]
edgeRLRTrobust | edgeR/LRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15-17, 19]
edgeRQLF | edgeR/QLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
edgeRQLFDetRate | edgeR/QLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22]
MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24]
monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25]
monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26]
monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25]
NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27]
ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29]
ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29]
ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29]
SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30]
scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31]
SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32]
SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33]
ttest | t-test | stats (R v3.3) | TMM-normalized TPM | CRAN | [16, 35]
voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22]
Wilcoxon | Wilcoxon test | stats (R v3.3) | TMM-normalized TPM | CRAN | [16, 36]

Source: Nature Methods, doi:10.1038/nmeth.4612
No ground truth, i.e. no independently validated truth is available for testing
Known data
- using data we know something about to get positive controls
Simulated data
- null data sets by re-sampling
- modeling data sets based on various distributions
Comparing between methods and scenarios
- comparing numbers of DE genes, incl. as a function of group size
Investigating results
- what do the expression levels and distributions of detected DE genes look like?
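The re-sampling idea for null data sets can be sketched as follows: cells from one homogeneous group are split at random into two mock groups, so every gene that is called DE is by construction a false positive. This is only a minimal illustration under stated assumptions: the function names, the toy counts and the simple permutation test on mean differences are hypothetical stand-ins and are not one of the evaluated DE methods.

```python
import random

def permutation_pvalue(a, b, n_perm=200, rng=None):
    """Two-sided permutation p-value for a difference in group means."""
    rng = rng or random.Random(0)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        if abs(sum(pa) / len(pa) - sum(pb) / len(pb)) >= observed:
            hits += 1
    # add-one correction keeps the p-value strictly positive
    return (hits + 1) / (n_perm + 1)

def null_split(n_cells, rng):
    """Randomly split cell indices into two mock groups of equal size."""
    idx = list(range(n_cells))
    rng.shuffle(idx)
    half = n_cells // 2
    return idx[:half], idx[half:2 * half]

def false_positive_rate(counts, alpha=0.05, seed=1):
    """Fraction of genes called DE between two mock groups (all calls are false)."""
    rng = random.Random(seed)
    g1, g2 = null_split(len(counts[0]), rng)
    calls = 0
    for gene in counts:
        p = permutation_pvalue([gene[i] for i in g1], [gene[i] for i in g2], rng=rng)
        if p <= alpha:
            calls += 1
    return calls / len(counts)

# Toy data (hypothetical): 50 genes x 40 cells, all drawn from the same distribution
data_rng = random.Random(42)
counts = [[data_rng.choice([0, 0, 0, 1, 2, 5, 10]) for _ in range(40)] for _ in range(50)]
fpr = false_positive_rate(counts)
print(f"fraction of genes called DE on null data: {fpr:.2f}")
```

A well-calibrated test should give a fraction close to the chosen alpha on such null data; a much higher value indicates poor type I error control.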
False positives (type I error) vs false negatives (type II error)
- Sensitivity and specificity
- Precision and recall
Figure [adapted from Wikipedia]
Figure [Dal Molin, Baruzzo and Di Camillo 2017]: 2 conditions of 100 cells each, simulated with 10,000 genes, out of which 2,000 were set to be DE (based on NB and bimodal distributions)
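The metrics named above follow directly from the four cells of a confusion matrix. The sketch below uses the simulation size quoted from Dal Molin et al. (10,000 genes, 2,000 truly DE); the counts of true and false calls are made up for illustration and are not results from that study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard performance metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),          # also called recall or true positive rate
        "specificity": tn / (tn + fp),          # true negative rate
        "precision": tp / (tp + fp),            # fraction of DE calls that are correct
        "false_positive_rate": fp / (fp + tn),  # type I error rate, 1 - specificity
    }

# Hypothetical outcome: 2,000 truly DE genes and 8,000 non-DE genes
metrics = classification_metrics(tp=1500, fp=400, tn=7600, fn=500)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts, sensitivity is 0.75 and specificity is 0.95, while precision is only about 0.79, which illustrates why precision and recall are often reported alongside sensitivity and specificity when non-DE genes dominate.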
Consistency
Figure [Miao et al. 2017]

And so much more
Soneson and Robinson 2018, "Bias, robustness and scalability in single-cell differential expression analysis"
- 36 statistical approaches for DE analysis, comparing the expression levels in two groups of cells
- based on 9 data sets with 11-21 separate instances (sample size effect)
- extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.
- conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets
Practicalities
Getting to know your data
Example data: 46078 genes x 96 cells; 22229 genes with no expression at all
Figure: two histograms, one with x-axis "Read Counts" and one with x-axis "0 counts" (y-axis: Frequency)
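A first look at the data, as in the example above, can be as simple as counting genes without any expression and computing the fraction of detected genes per cell. The sketch below assumes a genes x cells matrix stored as a list of lists; the function names and the toy matrix are hypothetical.

```python
def genes_without_expression(counts):
    """Number of genes whose count is zero in every cell."""
    return sum(1 for gene in counts if all(c == 0 for c in gene))

def detection_rate_per_cell(counts):
    """Fraction of genes with a non-zero count, computed separately for each cell."""
    n_genes = len(counts)
    n_cells = len(counts[0])
    return [sum(1 for gene in counts if gene[j] > 0) / n_genes for j in range(n_cells)]

# Toy matrix (hypothetical): 5 genes (rows) x 4 cells (columns)
toy = [
    [0, 0, 0, 0],
    [5, 0, 2, 0],
    [0, 0, 0, 0],
    [1, 3, 0, 0],
    [9, 8, 7, 0],
]
print(genes_without_expression(toy))   # 2
print(detection_rate_per_cell(toy))    # [0.6, 0.4, 0.4, 0.0]
```

Genes that are zero everywhere carry no information for DE testing, and a cell with a detection rate of zero, like the last one here, would normally be removed during QC.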
Choosing DE methods
Figure [Soneson and Robinson 2018]

Remembering the bigger picture
Figure [Stegle, Teichmann and Marioni 2015]
- QC filtering
- Cell-cycle phase
- Normalization of cell-specific biases
- Confounding factors, incl. batch effects
- Detection rate, i.e. the fraction of detected genes per cell
- Imputation strategies for dropout values
- What is pragmatic: programming language, platform, speed, collaborative workflows, etc.

Staying critical
Summary
What to remember from this hour?
https://www.menti.com & 25 06 78

Growing field
Figure [Angerer et al. 2017]
https://www.scrna-tools.org/tools
Figure [Zappia, Phipson and Oshlack 2018]

Summary
- scRNA-seq is a rapidly growing field
- DE is a common task, so many newer and better methods will be developed
- understanding basic statistical concepts enables one to think more like a statistician, to choose and evaluate methods given a data set
- staying critical, staying updated, staying connected
Bibliography
Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq". Bioinformatics 00 (00): 1-9. ISSN: 1367-4803. doi:10.1093/bioinformatics/bty329
Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data". Quantitative Biology 4 (4): 243-260. ISSN: 2095-4697. doi:10.1007/s40484-016-0089-7
Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis". Nature Methods 15 (4): 255-261. ISSN: 1548-7105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612
Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods". Frontiers in Genetics 8 (MAY). ISSN: 1664-8021. doi:10.3389/fgene.2017.00062
Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data", no. May: 1-2. ISSN: 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549
Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics". Nature Reviews Genetics 16 (January 2014): 133-145
Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics". Current Opinion in Systems Biology 4: 85-91. ISSN: 2452-3100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004
Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database". bioRxiv: 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573
Introduction
Differential expression is an old problem
So what is all the commotion about?
https://www.menti.com & 25 06 78
scRNA-seq special characteristics
- high noise levels (technical and biological factors)
- low library sizes
- low amount of available mRNAs results in amplification biases and dropout events
- 3' bias, partial coverage and uneven depth (technical)
- stochastic nature of transcription (biological)
- multimodality in gene expression: presence of multiple possible cell states within a cell population (biological)
Figure: density plots of expression values (x-axis: value, y-axis: density) for 20 example genes: Rbm17, Rragc, Slc1a3, Slc22a20, Smarcd1, Mybpc1, Nars, Ndufa3, Nono, Pgam2, Crispld2, Fbxw13, Hbxip, Katna1, Lcorl, 1300018J18Rik, Arid2, Bend3, Ccdc104, Ccnt1. Based on tutorial data
Common methods
Figure Simplified scRNA-seq workflow [adapted from http://hemberg-lab.github.io]
Generic non-parametric methods
- e.g. Wilcoxon rank-sum test, Kruskal-Wallis test, Kolmogorov-Smirnov test
- non-parametric tests generally convert observed expression values to ranks and test whether the distribution of ranks for one group is significantly different from the distribution of ranks for the other group
- some non-parametric methods fail in the presence of a large number of tied values, such as the case for dropouts (zeros) in single-cell RNA-seq expression data
- if the conditions for a parametric test hold, then it will typically be more powerful than a non-parametric test
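To make the rank conversion and the effect of ties concrete, here is a minimal sketch of a Wilcoxon rank-sum test using midranks and the normal approximation with the usual tie correction (no continuity correction). The function names and the toy expression values are illustrative assumptions; in practice one would rely on an established statistics library.

```python
import math

def midranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def rank_sum_test(x, y):
    """Wilcoxon rank-sum test: returns (U statistic, two-sided p-value)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = list(x) + list(y)
    ranks = midranks(pooled)
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    # Tie correction: every group of t tied values reduces the variance of U
    tie_sizes = {}
    for v in pooled:
        tie_sizes[v] = tie_sizes.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in tie_sizes.values())
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0  # every value is tied: the ranks carry no information
    z = (u - mean_u) / math.sqrt(var_u)
    return u, math.erfc(abs(z) / math.sqrt(2))

# Toy expression values (hypothetical): group a is dominated by dropouts (zeros)
group_a = [0, 0, 0, 0, 1, 0, 0, 0]
group_b = [5, 6, 7, 8, 9, 4, 3, 10]
u_stat, p_value = rank_sum_test(group_a, group_b)
print(u_stat, p_value)
```

The `var_u == 0` branch shows the extreme case of the tie problem: a gene that is zero in all cells gives identical ranks everywhere, so the test cannot say anything about it.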
Developed for bulk RNA-seq
- e.g. edgeR, DESeq2
- compare estimates of mean expression (sample size)
- based on the negative binomial distribution
- can be assessed with datasets where RNA-seq data has been validated by RT-qPCR

Developed for scRNA-seq
- e.g. MAST, SCDE, Monocle, Pagoda, D3E, etc.
- single-cell experiments have a large number of samples (i.e. cells) in each group we are comparing; thus we can take advantage of the whole distribution of expression values in each group to identify differences between groups
- we usually do not have a defined set of experimental conditions; instead we try to identify the cell groups by using an unsupervised clustering approach
Figure [Miao and Zhang 2016]
Evaluated differential expression methods: Supplementary Table 2 [Soneson and Robinson 2018, Nature Methods, doi:10.1038/nmeth.4612]
More detailed examples
MAST
- uses a generalized linear hurdle model
- designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable
- The rate of expression $Z$ and the level of expression $Y$ are modeled for each gene $g$, with $Z_{ig}$ indicating whether gene $g$ is expressed in cell $i$ (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
- A logistic regression model is used for the discrete variable $Z$ and a Gaussian linear model for the continuous variable $(Y \mid Z = 1)$:
  $\mathrm{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$
  $\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix
- Model parameters are fitted using an empirical Bayesian framework
- Allows for a joint estimate of nuisance and treatment effects
- DE is determined using the likelihood ratio test
Let's stop for a minute

The key
$\mathrm{Outcome}_i = (\mathrm{Model}_i) + \mathrm{error}_i$
- we collect data on a sample from a much larger population
- statistics lets us make inferences about the population from which the sample was derived
- we try to predict the outcome given a model fitted to the data
The key
$t = \dfrac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$
Figure: histogram of height [cm] (y-axis: Frequency)
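As a worked instance of the t statistic, the sketch below computes it for two small groups of heights. It assumes that $s_p$ denotes the usual pooled standard deviation, $s_p^2 = ((n_1 - 1) s_1^2 + (n_2 - 1) s_2^2) / (n_1 + n_2 - 2)$, which the slide does not spell out; the function name and the height values are made up.

```python
import math
from statistics import mean, variance

def pooled_t(x1, x2):
    """Two-sample t statistic with pooled standard deviation, and its degrees of freedom."""
    n1, n2 = len(x1), len(x2)
    # variance() is the sample variance (n - 1 in the denominator)
    sp = math.sqrt(((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2))
    t = (mean(x1) - mean(x2)) / (sp * math.sqrt(1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

# Hypothetical heights in cm for two groups
group_1 = [170, 172, 174]
group_2 = [176, 178, 180]
t_stat, df = pooled_t(group_1, group_2)
print(f"t = {t_stat:.3f} on {df} degrees of freedom")  # t = -3.674 on 4 degrees of freedom
```

Here the model is "each group has its own mean", the error is the within-group spread, and t measures the difference in means relative to that spread.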
Generic recipe
- model data, e.g. gene expression with random error
- fit model to the data and/or data to the model, estimate model parameters
- use model for prediction and/or inference
Important implication
- the better the model fits the data, the better the statistics
Common distributions
Negative Binomial
Figure: three histograms of negative binomial read counts (x-axis: Read Counts, y-axis: Frequency)
$\mu = mu$
$\sigma^2 = mu + \dfrac{mu^2}{size}$
mu is the mean expression and size is the dispersion, which is inversely related to the variance. NB fits bulk RNA-seq data very well and it is used for most statistical methods designed for such data. In addition, it has been shown to fit the distribution of molecule counts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grun et al. 2014, Islam et al. 2011)
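The mean-variance relationship of the negative binomial can be checked numerically. The sketch below draws NB counts as a gamma-Poisson mixture and compares the empirical moments with mu and mu + mu^2/size. The sampler, the parameter values and all function names are illustrative assumptions, not part of any DE package.

```python
import math
import random
from statistics import mean, variance

def nb_variance(mu, size):
    """Variance of a negative binomial with mean mu and dispersion parameter size."""
    return mu + mu ** 2 / size

def sample_poisson(lam, rng):
    """Poisson draw by Knuth's multiplication method; adequate for small rates."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def sample_nb(mu, size, rng):
    """Negative binomial draw as a Poisson whose rate is gamma distributed."""
    rate = rng.gammavariate(size, mu / size)  # shape = size, scale = mu / size
    return sample_poisson(rate, rng)

rng = random.Random(0)
draws = [sample_nb(5.0, 2.0, rng) for _ in range(20000)]
empirical_mean, empirical_var = mean(draws), variance(draws)
print(f"mean: {empirical_mean:.2f} (expected 5.00)")
print(f"variance: {empirical_var:.2f} (expected {nb_variance(5.0, 2.0):.2f})")
```

A small size gives a variance far above the mean (overdispersion), while a very large size makes the variance approach the mean, as for a Poisson distribution.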
Common distributions
Zero-inflated NB
Figure: two histograms of zero-inflated negative binomial read counts (x-axis: Read Counts, y-axis: Frequency)
$\mu = mu \cdot (1 - d)$
$\sigma^2 = mu \cdot (1 - d) \cdot (1 + d \cdot mu + mu / size)$
d is the dropout rate. The dropout rate of a gene is strongly correlated with the mean expression of the gene. Different zero-inflated negative binomial models use different relationships between mu and d, and some may fit mu and d to the expression of each gene independently. Implemented in MAST, SCDE
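The zero-inflated moments can be verified against the moments of the underlying mixture. The sketch below assumes that mu and size are the mean and dispersion of the negative binomial component and that a count is replaced by zero with probability d; the function names and parameter values are illustrative.

```python
def zinb_mean(mu, d):
    """Mean of a zero-inflated NB: the NB mean scaled by the non-dropout fraction."""
    return mu * (1 - d)

def zinb_variance(mu, size, d):
    """Variance of a zero-inflated NB with NB mean mu, dispersion size, dropout rate d."""
    return mu * (1 - d) * (1 + d * mu + mu / size)

def zinb_variance_from_mixture(mu, size, d):
    """Same quantity derived as E[X^2] - E[X]^2 of the zero / NB mixture."""
    nb_var = mu + mu ** 2 / size
    second_moment = (1 - d) * (nb_var + mu ** 2)  # the zero component contributes nothing
    return second_moment - zinb_mean(mu, d) ** 2

mu, size, d = 10.0, 2.0, 0.3
print(zinb_mean(mu, d))                         # close to 7
print(zinb_variance(mu, size, d))               # close to 63
print(zinb_variance_from_mixture(mu, size, d))  # close to 63
```

Setting d to zero recovers the plain negative binomial variance mu + mu^2/size, so the extra term d * mu is the variance added by dropouts.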
Common distributions
Poisson-Beta
Figure: two histograms of Poisson-Beta read counts (x-axis: Read Counts, y-axis: Frequency)
$\mu = g \cdot a / (a + b)$
$\sigma^2 = g^2 \cdot a \cdot b / ((a + b + 1) \cdot (a + b)^2)$
a is the rate of activation of transcription, b is the rate of inhibition of transcription, and g is the rate of transcript production while transcription is active at the locus. Differential expression methods may test each of the parameters for differences across groups, or only one (often g). Implemented in BPSC.
May be further expanded to explicitly account for other sources of gene expression differences, such as batch effect or library depth, depending on the particular DE algorithm
MAST (revisited)
- uses a generalized linear hurdle model
- designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable
- The rate of expression $Z$ and the level of expression $Y$ are modeled for each gene $g$, with $Z_{ig}$ indicating whether gene $g$ is expressed in cell $i$ (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
- A logistic regression model is used for the discrete variable $Z$ and a Gaussian linear model for the continuous variable $(Y \mid Z = 1)$:
  $\mathrm{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$
  $\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix
- Model parameters are fitted using an empirical Bayesian framework
- Allows for a joint estimate of nuisance and treatment effects
- DE is determined using the likelihood ratio test
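The two-part structure of a hurdle model can be illustrated for the simplest design, a single two-group comparison. The sketch below is a strong simplification and not the MAST implementation: it has no covariates and no empirical Bayes shrinkage, it uses plain maximum likelihood, and it assumes that the likelihood ratio statistics of the discrete and continuous parts are added and compared with a chi-square distribution with 2 degrees of freedom. All function names and the toy values are hypothetical.

```python
import math

def _xlogy(x, y):
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)

def _binom_loglik(k, n):
    """Maximised Bernoulli log-likelihood for k expressing cells out of n."""
    p = k / n
    return _xlogy(k, p) + _xlogy(n - k, 1 - p)

def _rss(values):
    """Residual sum of squares around the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def hurdle_lrt(group_a, group_b):
    """Two-group hurdle likelihood ratio test on log-scale expression values."""
    pos_a = [v for v in group_a if v > 0]
    pos_b = [v for v in group_b if v > 0]
    # Discrete part (Z): does the fraction of expressing cells differ?
    ll_alt = _binom_loglik(len(pos_a), len(group_a)) + _binom_loglik(len(pos_b), len(group_b))
    ll_null = _binom_loglik(len(pos_a) + len(pos_b), len(group_a) + len(group_b))
    stat_d = 2 * (ll_alt - ll_null)
    # Continuous part (Y | Z = 1): Gaussian model on the expressing cells only
    stat_c = 0.0
    if len(pos_a) >= 2 and len(pos_b) >= 2:
        rss_null = _rss(pos_a + pos_b)
        rss_alt = _rss(pos_a) + _rss(pos_b)
        if rss_alt > 0:
            stat_c = (len(pos_a) + len(pos_b)) * math.log(rss_null / rss_alt)
    stat = max(stat_d, 0.0) + max(stat_c, 0.0)
    # The chi-square survival function with 2 degrees of freedom is exp(-x / 2)
    return stat, math.exp(-stat / 2)

# Toy log-expression values (hypothetical); zeros are non-detected cells
group_a = [0, 0, 0, 0, 0, 0, 1.0, 1.2]
group_b = [3.0, 3.5, 4.0, 3.2, 3.8, 0, 4.1, 3.6]
stat, p_value = hurdle_lrt(group_a, group_b)
print(f"LRT statistic: {stat:.2f}, p-value: {p_value:.2g}")
```

In this toy example both parts contribute: group b has more expressing cells (discrete part) and higher expression among the expressing cells (continuous part).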
SCDE
- models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution
- the NB distribution models the transcripts that are amplified and detected
- the Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)
- a subset of robust genes is used to fit, via the EM algorithm, the parameters of the mixture of models
- For DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach
- Outline
- Introduction
- Common methods
  - More detailed examples
- Performance
- Practicalities
- Summary
- Bibliography
Introduction
Differential expression is an old problem
So what is all the commotion about?
https://www.menti.com & 25 06 78
scRNA-seq special characteristics
- high noise levels (technical and biological factors)
- low library sizes
- low amount of available mRNAs results in amplification biases and dropout events
- 3' bias, partial coverage and uneven depth (technical)
- stochastic nature of transcription (biological)
- multimodality in gene expression: presence of multiple possible cell states within a cell population (biological)
Figure: density plots (x-axis: value, y-axis: density), one panel per gene: Rbm17, Rragc, Slc1a3, Slc22a20, Smarcd1, Mybpc1, Nars, Ndufa3, Nono, Pgam2, Crispld2, Fbxw13, Hbxip, Katna1, Lcorl, 1300018J18Rik, Arid2, Bend3, Ccdc104, Ccnt1. Based on tutorial data.
Common methods
Figure: Simplified scRNA-seq workflow [adapted from http://hemberg-lab.github.io]
Generic non-parametric methods
- e.g. Wilcoxon rank-sum test, Kruskal-Wallis, Kolmogorov-Smirnov test
- non-parametric tests generally convert observed expression values to ranks and test whether the distribution of ranks for one group is significantly different from the distribution of ranks for the other group
- some non-parametric methods fail in the presence of a large number of tied values, as is the case for dropouts (zeros) in single-cell RNA-seq expression data
- if the conditions for a parametric test hold, then it will typically be more powerful than a non-parametric test
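To make the point about ranks and ties concrete, here is a minimal sketch of a two-sided Wilcoxon rank-sum test in plain Python. It is an illustration under stated assumptions, not the implementation used by any of the packages named in these slides: tied values receive mid-ranks, the p-value comes from a normal approximation with the usual tie correction and no continuity correction, and the two expression vectors are made-up toy data with many zeros.

```python
import math


def midranks(values):
    """Assign ranks 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Two-sided rank-sum test: normal approximation with tie correction."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = list(x) + list(y)
    ranks = midranks(pooled)
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2  # Mann-Whitney U for group x
    # Every block of t tied values shrinks the variance of U.
    tie_term = sum(pooled.count(v) ** 3 - pooled.count(v) for v in set(pooled))
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0  # every value is tied: the ranks carry no information
    z = (u - n1 * n2 / 2) / math.sqrt(var_u)
    return u, math.erfc(abs(z) / math.sqrt(2))


# Hypothetical toy expression values for one gene in two groups of cells.
group_a = [0, 0, 0, 0, 3, 5, 0, 2]
group_b = [0, 4, 6, 0, 7, 5, 9, 0]
u, p = rank_sum_test(group_a, group_b)
print(f"U = {u:.1f}, p = {p:.3f}")
```

The more zeros the two groups share, the larger the tie term and the smaller the variance left for the test to work with; in the extreme case where all values are tied the test cannot say anything.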
developed for bulk RNA-seq
- e.g. edgeR, DESeq2
- compare estimates of mean expression (sample size)
- based on the negative binomial distribution
- can be assessed on datasets where RNA-seq data has been validated by RT-qPCR
developed for scRNA-seq
- e.g. MAST, SCDE, Monocle, Pagoda, D3E etc.
- large number of samples (i.e. cells) for each group we are comparing in single-cell experiments. Thus we can take advantage of the whole distribution of expression values in each group to identify differences between groups
- we usually do not have a defined set of experimental conditions; instead we try to identify the cell groups by using an unsupervised clustering approach
(figure: Miao and Zhang 2016)
Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle.

Short name | Method | Software version | Input | Available from | Reference
---------- | ------ | ---------------- | ----- | -------------- | ---------
BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11]
D3E | D3E | D3E 1.0 | raw counts | GitHub | [12]
DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13]
DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14]
edgeRLRT | edgeR/LRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17]
edgeRLRTcensus | edgeR/LRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17]
edgeRLRTdeconv | edgeR/LRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18]
edgeRLRTrobust | edgeR/LRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19]
edgeRQLF | edgeR/QLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
edgeRQLFDetRate | edgeR/QLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22]
MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24]
monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25]
monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26]
monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25]
NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27]
ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29]
ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29]
ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29]
SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30]
scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31]
SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32]
SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33]
ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35]
voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22]
Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36]

Source: Soneson and Robinson 2018, Nature Methods, doi:10.1038/nmeth.4612
More detailed examples
MAST
- uses a generalized linear hurdle model
- designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable
- The rate of expression Z and the level of expression Y are modeled for each gene g, with Z indicating whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
- A logistic regression model for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):
  $\mathrm{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$
  $\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix
- Model parameters are fitted using an empirical Bayesian framework
- Allows for a joint estimate of nuisance and treatment effects
- DE is determined using the likelihood ratio test
Let's stop for a minute
The key
$\mathrm{Outcome}_i = (\mathrm{Model}_i) + \mathrm{error}_i$
- we collect data on a sample from a much larger population
- statistics lets us make inferences about the population from which the sample was derived
- we try to predict the outcome given a model fitted to the data
The key
$t = \dfrac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$
Figure: histogram of height [cm] (x-axis) against frequency (y-axis)
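The t statistic on this slide can be computed in a few lines. The sketch below assumes that s_p is the usual pooled standard deviation of the two samples (the slide does not define it), and the height values are invented for illustration only.

```python
import math
import statistics


def pooled_t(x1, x2):
    """Two-sample t statistic with a pooled standard deviation (equal variances assumed)."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = statistics.mean(x1), statistics.mean(x2)
    v1, v2 = statistics.variance(x1), statistics.variance(x2)  # sample variances
    # Assumption: s_p is the pooled standard deviation.
    sp = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    t = (m1 - m2) / (sp * math.sqrt(1 / n1 + 1 / n2))
    return t, n1 + n2 - 2  # statistic and degrees of freedom


# Hypothetical heights [cm] for two groups.
heights_1 = [168.0, 171.5, 173.0, 169.5, 175.0]
heights_2 = [172.0, 176.5, 178.0, 174.5, 179.0]
t, df = pooled_t(heights_1, heights_2)
print(f"t = {t:.2f} on {df} degrees of freedom")
```

The statistic is the "model" part of Outcome = Model + error in miniature: the difference in means is the signal, and the pooled spread scaled by the group sizes is the noise it is measured against.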
Generic recipe
- model the data, e.g. gene expression with random error
- fit the model to the data and/or the data to the model, estimate model parameters
- use the model for prediction and/or inference
Important implication
- the better the model fits the data, the better the statistics
Common distributions
Negative Binomial
Figure: three histograms titled "Negative Binomial" (x-axis: read counts, y-axis: frequency)
$\mu = \mathrm{mu}$
$\sigma^2 = \mathrm{mu} + \mathrm{mu}^2 / \mathrm{size}$
mu: mean expression; size: the dispersion, which is inversely related to the variance. NB fits bulk RNA-seq data very well and it is used for most statistical methods designed for such data. In addition, it has been shown to fit the distribution of molecule counts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grun et al. 2014; Islam et al. 2011).
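A quick way to get a feel for the mu and size parameters is to simulate counts and compare the sample moments with the formulas. The sketch below draws negative binomial counts as a gamma-Poisson mixture, which is a standard construction but an assumption here rather than something the slides prescribe; the parameter values and the helper names are illustrative.

```python
import math
import random
import statistics


def nb_variance(mu, size):
    """Variance of a negative binomial with mean mu and dispersion parameter size."""
    return mu + mu ** 2 / size


def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for the small rates used here."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def sample_nb(mu, size, rng):
    """Draw one NB count: a Poisson whose rate is gamma distributed."""
    lam = rng.gammavariate(size, mu / size)  # shape = size, scale = mu / size
    return sample_poisson(lam, rng)


rng = random.Random(1)
counts = [sample_nb(mu=4.0, size=2.0, rng=rng) for _ in range(20000)]
print("theoretical variance:", nb_variance(4.0, 2.0))
print("sample mean:", round(statistics.mean(counts), 2))
print("sample variance:", round(statistics.variance(counts), 2))
```

Increasing size while keeping mu fixed pulls the variance down towards mu, i.e. towards a plain Poisson.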
Zero-inflated NB
Figure: two histograms titled "Zero-inflated NB" (x-axis: read counts, y-axis: frequency)
$\mu = \mathrm{mu} \cdot (1 - d)$
$\sigma^2 = \mathrm{mu} \cdot (1 - d) \cdot (1 + d \cdot \mathrm{mu} + \mathrm{mu} / \mathrm{size})$
d: dropout rate; mu and size are the mean and dispersion of the negative binomial component. The dropout rate of a gene is strongly correlated with the mean expression of the gene. Different zero-inflated negative binomial models use different relationships between mu and d, and some may fit mu and d to the expression of each gene independently. Implemented in MAST, SCDE.
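The zero-inflated formulas can be checked with a few lines of arithmetic. The sketch below reads mu and size as the parameters of the negative binomial component and d as the dropout probability, which is an interpretation of the slide rather than something it states explicitly; the expected zero fraction uses the standard NB probability of a zero count.

```python
def zinb_mean(mu, d):
    """Mean of a zero-inflated NB: the NB mean shrunk by the dropout rate d."""
    return mu * (1 - d)


def zinb_variance(mu, size, d):
    """Variance of a zero-inflated NB with NB mean mu, dispersion size, dropout rate d."""
    return mu * (1 - d) * (1 + d * mu + mu / size)


def zinb_zero_fraction(mu, size, d):
    """Expected fraction of zeros: dropouts plus zeros produced by the NB itself."""
    nb_zero = (size / (size + mu)) ** size
    return d + (1 - d) * nb_zero


# Illustrative parameter values only.
mu, size = 4.0, 2.0
for d in (0.0, 0.5):
    print(
        f"d={d}: mean={zinb_mean(mu, d)}, "
        f"variance={zinb_variance(mu, size, d)}, "
        f"zeros={zinb_zero_fraction(mu, size, d):.3f}"
    )
```

With d = 0 the expressions fall back to the plain negative binomial, which is a useful sanity check when reading a method's documentation.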
Poisson-Beta
Figure: two histograms titled "Poisson-Beta" (x-axis: read counts, y-axis: frequency)
$\mu = g \cdot a / (a + b)$
$\sigma^2 = g^2 \cdot a \cdot b / ((a + b + 1) \cdot (a + b)^2)$
a: the rate of activation of transcription; b: the rate of inhibition of transcription; g: the rate of transcript production while transcription is active at the locus. Differential expression methods may test each of the parameters for differences across groups or only one (often g). Implemented in BPSC.
May be further expanded to explicitly account for other sources of gene expression differences, such as batch effect or library depth, depending on the particular DE algorithm.
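The Poisson-Beta parameters can be explored by simulation. The sketch below assumes the usual hierarchical reading of the model (a Beta(a, b) draw p for the fraction of time the gene is active, then a Poisson count with rate g * p); the parameter values and function names are illustrative. One point worth checking against the documentation of whichever tool is used: the σ² expression on the slide matches the variance of the rate g * p, and by the law of total variance the counts themselves carry an additional Poisson term equal to the mean.

```python
import math
import random
import statistics


def pb_mean(g, a, b):
    return g * a / (a + b)


def pb_rate_variance(g, a, b):
    """The sigma^2 expression from the slide: variance of the rate g * p, p ~ Beta(a, b)."""
    return g ** 2 * a * b / ((a + b + 1) * (a + b) ** 2)


def pb_count_variance(g, a, b):
    """Law of total variance: Poisson noise (equal to the mean) plus the rate variance."""
    return pb_mean(g, a, b) + pb_rate_variance(g, a, b)


def sample_poisson(lam, rng):
    """Knuth's multiplication method; fine for rates up to a few hundred."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def sample_pb(g, a, b, rng):
    p = rng.betavariate(a, b)  # fraction of time the locus is active
    return sample_poisson(g * p, rng)


rng = random.Random(2)
g, a, b = 100.0, 1.0, 3.0
counts = [sample_pb(g, a, b, rng) for _ in range(20000)]
print("mean:", pb_mean(g, a, b), "sample:", round(statistics.mean(counts), 2))
print("rate variance (slide formula):", pb_rate_variance(g, a, b))
print("count variance:", pb_count_variance(g, a, b),
      "sample:", round(statistics.variance(counts), 1))
```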
MAST (revisited)
- uses a generalized linear hurdle model
- designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable
- The rate of expression Z and the level of expression Y are modeled for each gene g, with Z indicating whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
- A logistic regression model for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):
  $\mathrm{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$
  $\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix
- Model parameters are fitted using an empirical Bayesian framework
- Allows for a joint estimate of nuisance and treatment effects
- DE is determined using the likelihood ratio test
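The two-part idea behind a hurdle model can be sketched for the simplest possible design, a single two-group comparison for one gene. This is a deliberately stripped-down illustration and not the MAST implementation: there are no extra covariates and no empirical Bayes regularisation, the discrete part reduces to comparing detection proportions, the continuous part to comparing means of the positive values with a shared variance, and the two likelihood ratio statistics are summed and referred to a chi-square distribution with 2 degrees of freedom. All names and data are hypothetical.

```python
import math


def binom_loglik(k, n):
    """Maximised Bernoulli log-likelihood for k detections among n cells."""
    if 0 < k < n:
        p = k / n
        return k * math.log(p) + (n - k) * math.log(1 - p)
    return 0.0  # p-hat is 0 or 1, so the likelihood is 1


def rss(values):
    """Residual sum of squares around the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def hurdle_lrt(group_a, group_b):
    """Combined likelihood ratio test for one gene: detection rate plus expression level."""
    pos_a = [v for v in group_a if v > 0]
    pos_b = [v for v in group_b if v > 0]
    # Discrete part: group-specific detection rates against one shared rate.
    lr_d = 2 * (
        binom_loglik(len(pos_a), len(group_a))
        + binom_loglik(len(pos_b), len(group_b))
        - binom_loglik(len(pos_a) + len(pos_b), len(group_a) + len(group_b))
    )
    # Continuous part: group-specific means of positive values against one shared mean.
    lr_c = 0.0
    if len(pos_a) >= 2 and len(pos_b) >= 2:
        rss_alt = rss(pos_a) + rss(pos_b)
        rss_null = rss(pos_a + pos_b)
        if rss_alt > 0:
            lr_c = (len(pos_a) + len(pos_b)) * math.log(rss_null / rss_alt)
    stat = max(0.0, lr_d) + max(0.0, lr_c)
    # Chi-square survival function with 2 df; conservative if the continuous
    # part had to be skipped because too few cells express the gene.
    return stat, math.exp(-stat / 2)


# Hypothetical log-scale expression values for one gene.
group_a = [0, 0, 0, 0, 0, 1.0, 1.2]
group_b = [3.0, 3.5, 4.0, 0, 3.2, 3.8, 4.1]
stat, p = hurdle_lrt(group_a, group_b)
print(f"LR statistic = {stat:.2f}, p = {p:.2g}")
```

A gene can therefore be called DE because it is detected in more cells, because it is expressed at a higher level where detected, or both.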
SCDE
- models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution
- the NB distribution models the transcripts that are amplified and detected
- the Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)
- a subset of robust genes is used to fit, via the EM algorithm, the parameters of the mixture of models
- for DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach
Monocle
- Originally designed for ordering cells by progress through differentiation stages (pseudo-time)
- The mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as
  $g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$
  where Y is a specific gene expression level, $x_i$ are predictor variables, g is a link function (typically the log function) and $f_i$ are non-parametric functions (e.g. cubic splines)
- The observable expression level Y is then modeled using the GAM
  $E(Y) = s(\varphi_t(b_x, s_i)) + \epsilon$
  where $\varphi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and s is a cubic smoothing function with three degrees of freedom. The error term $\epsilon$ is normally distributed with a mean of zero
- The DE test is performed using an approximate $\chi^2$ likelihood ratio test
Performance
No ground truth, i.e. no independently validated truth is available for testing
- Known data: using data we know something about to get positive controls
- Simulated data: null data sets by re-sampling; modeling data sets based on various distributions
- Comparing between methods and scenarios: comparing numbers of DEs, incl. as a function of group size
- Investigating results: what do the expression levels and distributions of the detected DEs look like?
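One way to build a null data set by re-sampling is to take cells from a single homogeneous group and split them at random into two mock groups. Since both mock groups come from the same population, any gene a method calls DE between them is a false positive. The sketch below shows only the splitting step; the cell identifiers and the function name are hypothetical, and the DE method itself is left out.

```python
import random


def make_null_split(cell_ids, group_size, seed=0):
    """Randomly draw two non-overlapping mock groups from one homogeneous set of cells."""
    rng = random.Random(seed)
    chosen = rng.sample(cell_ids, 2 * group_size)  # sampling without replacement
    return chosen[:group_size], chosen[group_size:]


# Hypothetical identifiers for 96 cells of the same type.
cells = [f"cell_{i:02d}" for i in range(96)]
for seed in range(3):
    mock_a, mock_b = make_null_split(cells, group_size=24, seed=seed)
    print(seed, mock_a[:3], mock_b[:3])
```

Repeating the split with different seeds and group sizes gives separate null instances, so the number of false calls can be tracked as a function of group size.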
- False positives (type I error) vs false negatives (type II error)
- Sensitivity and specificity
- Precision and recall
(figure: adapted from Wikipedia)
(figure: Dal Molin, Baruzzo and Di Camillo 2017; 2 conditions of 100 cells each, simulated with 10 000 genes, out of which 2 000 set to DEs, based on NB and bimodal distributions)
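When the truth is known, for example in simulated data, these metrics follow directly from comparing the set of called genes with the set of true DE genes. A minimal sketch, with invented gene names and an illustrative function name; it assumes at least one gene is called and that both DE and non-DE genes exist, otherwise a division by zero would occur.

```python
def evaluate_calls(called, truth, all_genes):
    """Confusion-matrix metrics for a set of DE calls against a known truth."""
    called, truth, all_genes = set(called), set(truth), set(all_genes)
    tp = len(called & truth)              # true DE genes that were called
    fp = len(called - truth)              # type I errors
    fn = len(truth - called)              # type II errors
    tn = len(all_genes - called - truth)  # correctly left uncalled
    return {
        "sensitivity": tp / (tp + fn),    # also known as recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }


# Hypothetical toy example: 10 genes, 4 of them truly DE, 3 called by a method.
all_genes = [f"g{i}" for i in range(10)]
truth = ["g0", "g1", "g2", "g3"]
called = ["g0", "g1", "g4"]
metrics = evaluate_calls(called, truth, all_genes)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```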
Consistency
(figure: Miao et al. 2017)
And so much more
Soneson and Robinson 2018, "Bias, robustness and scalability in single-cell differential expression analysis"
- 36 statistical approaches for DE analysis to compare the expression levels in the two groups of cells
- based on 9 data sets with 11 - 21 separate instances (sample size effect)
- extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods etc.
- conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets
Practicalities
Getting to know your data
Example data: 46078 genes x 96 cells; 22229 genes with no expression at all
Figure: two histograms, one with read counts on the x-axis and one with 0 counts on the x-axis (y-axis: frequency)
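Numbers such as "genes with no expression at all" are cheap to compute before any DE method is run. The sketch below uses a tiny made-up count matrix stored as a list of rows (genes) by columns (cells); in practice the matrix would come from whatever container the analysis uses.

```python
def summarise_zeros(counts):
    """Return the number of all-zero genes and the number of zero counts per gene."""
    n_unexpressed = sum(1 for gene in counts if all(c == 0 for c in gene))
    zeros_per_gene = [sum(1 for c in gene if c == 0) for gene in counts]
    return n_unexpressed, zeros_per_gene


# Hypothetical toy matrix: 4 genes (rows) x 4 cells (columns).
counts = [
    [0, 0, 0, 0],
    [5, 0, 2, 0],
    [0, 0, 0, 1],
    [3, 4, 6, 2],
]
n_unexpressed, zeros_per_gene = summarise_zeros(counts)
print("genes with no expression at all:", n_unexpressed)
print("zero counts per gene:", zeros_per_gene)
```

A histogram of the per-gene zero counts gives a direct view of how strongly dropouts dominate the data.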
Choosing DE methods
(figure: Soneson and Robinson 2018)
Remembering the bigger picture
(figure: Stegle, Teichmann and Marioni 2015)
- QC filtering
- Cell-cycle phase
- Normalization of cell-specific biases
- Confounding factors, incl. batch effects
- Detection rate, i.e. the fraction of detected genes per cell
- Imputation strategies for dropout values
- What is pragmatic: programming language, platform, speed, collaborative workflows etc.
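The cellular detection rate mentioned in the list above is simple to compute and is often included as a covariate in the DE model. A minimal sketch on a made-up genes-by-cells count matrix, treating any count above zero as "detected" (the threshold is an assumption):

```python
def detection_rate(counts):
    """Fraction of genes with a non-zero count, computed separately for every cell."""
    n_genes = len(counts)
    n_cells = len(counts[0])
    return [
        sum(1 for g in range(n_genes) if counts[g][c] > 0) / n_genes
        for c in range(n_cells)
    ]


# Hypothetical toy matrix: 4 genes (rows) x 4 cells (columns).
toy_counts = [
    [0, 0, 0, 0],
    [5, 0, 2, 0],
    [0, 0, 0, 1],
    [3, 4, 6, 2],
]
rates = detection_rate(toy_counts)
print("detection rate per cell:", rates)
```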
Staying critical
Summary
What to remember from this hour?
https://www.menti.com & 25 06 78
Introduction
Rbm17 Rragc Slc1a3 Slc22a20 Smarcd1
Mybpc1 Nars Ndufa3 Nono Pgam2
Crispld2 Fbxw13 Hbxip Katna1 Lcorl
1300018J18Rik Arid2 Bend3 Ccdc104 Ccnt1
00 25 50 75 100 00 25 50 75 100 00 25 50 75 100 00 25 50 75 100 00 25 50 75 100
00
01
02
03
0
1
2
3
00
01
02
03
00
01
02
03
00
01
02
03
000
005
010
015
020
025
00
02
04
06
00
01
02
03
04
05
00
01
02
03
00
01
02
03
00
02
04
00
05
10
15
00
01
02
03
00
05
10
15
20
00
02
04
000
005
010
015
0
1
2
3
4
00
05
10
15
00
01
02
03
04
05
00
01
02
03
04
value
dens
ity
Based on tutorial data
Olga (NBIS) scRNA-seq DE February 2019 9 46
Common methods
Common methods
Olga (NBIS) scRNA-seq DE February 2019 10 46
Common methods
Simplified scRNA-seq workflow [adopted from httphemberg-labgithubio
Olga (NBIS) scRNA-seq DE February 2019 11 46
Common methods
Generic non-parametric methodseg Wilcoxon rank-sum test Kruskal-Wallis Kolmogorov-Smirnov testnon-parametric tests generally convert observed expression values toranks amp test whether the distribution of ranks for one group aresignficantly different from the distribution of ranks for the other groupsome non-parametric methods fail in the presence of a large number oftied values such as the case for dropouts (zeros) in single-cell RNA-seqexpression dataif the conditions for a parametric test hold then it will typically be morepowerful than a non-parametric test
Olga (NBIS) scRNA-seq DE February 2019 12 46
Common methods
developed for bulk RNA-seqeg edgeR DE-seq2compare estimates of mean-expression (sample size)based on negative binomial distributioncan be assessed by datasets where RNA-seq data has beeen validatedby RT-qPCR
Olga (NBIS) scRNA-seq DE February 2019 13 46
Common methods
developed for scRNA-seq
eg MAST SCDE Monocle Pagoda D3E etclarge number of samples (ie cells) for each group we are comparing insingle-cell experiments Thus we can take advantage of the wholedistribution of expression values in each group to identify differencesbetween groupswe usually do not have a defined set of experimental conditions insteadwe try to identify the cell groups by using an unsupervised clusteringapproach
Olga (NBIS) scRNA-seq DE February 2019 14 46
Common methods
Miao and Zhang 2016
Olga (NBIS) scRNA-seq DE February 2019 15 46
Common methodsSupplementary Table 2 Evaluated dicrarrerential expression methods together with package versions and thetype of input values provided to each of them Note that ldquoraw countsrdquo here refers to length-scaled TPMswhich are on the scale of the raw counts but are unacrarrected by dicrarrerential isoform usage [10] CPM valuesare calculated with edgeR and Census counts with monocle
Short name Method Software version InputAvailablefrom
Reference
BPSC BPSC BPSC 09901 CPM GitHub [11]
D3E D3E D3E 10 raw counts GitHub [12]
DESeq2 DESeq2 DESeq2 1141 raw counts Bioconductor [13]
DESeq2betapFALSE DESeq2 without beta prior DESeq2 1141 raw counts Bioconductor [13]
DESeq2census DESeq2 DESeq2 1141 Census counts Bioconductor [13]
DESeq2nofiltDESeq2 without the built-in in-dependent filtering
DESeq2 1141 raw counts Bioconductor [13]
DEsingle DEsingle DEsingle 010 raw counts GitHub [14]
edgeRLRT edgeRLRT edgeR 3191 raw counts Bioconductor [15ndash17]
edgeRLRTcensus edgeRLRT edgeR 3191 Census counts Bioconductor [15ndash17]
edgeRLRTdeconvedgeRLRT with deconvolutionnormalization
edgeR 3191scran 120
raw counts Bioconductor [15 17 18]
edgeRLRTrobustedgeRLRT with robust disper-sion estimation
edgeR 3191 raw counts Bioconductor [15ndash17 19]
edgeRQLF edgeRQLF edgeR 3191 raw counts Bioconductor [15 16 20]
edgeRQLFDetRateedgeRQLF with cellular detec-tion rate as covariate
edgeR 3191 raw counts Bioconductor [15 16 20]
limmatrend limma-trend limma 33013 log2(CPM) Bioconductor [21 22]
MASTcpm MAST MAST 105 log2(CPM+1) Bioconductor [23]
MASTcpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(CPM+1) Bioconductor [23]
MASTtpm MAST MAST 105 log2(TPM+1) Bioconductor [23]
MASTtpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(TPM+1) Bioconductor [23]
metagenomeSeq metagenomeSeqmetagenomeSeq1160
raw counts Bioconductor [24]
monocle monocle (tobit) monocle 220 TPM Bioconductor [25]
monoclecensus monocle (Negative Binomial) monocle 220 Census counts Bioconductor [25 26]
monoclecount monocle (Negative Binomial) monocle 220 raw counts Bioconductor [25]
NODES NODESNODES0009010
raw countsAuthor-providedlink
[27]
ROTScpm ROTS ROTS 120 CPM Bioconductor [28 29]
ROTStpm ROTS ROTS 120 TPM Bioconductor [28 29]
ROTSvoom ROTS ROTS 120voom-transformedraw counts
Bioconductor [28 29]
SAMseq SAMseq samr 20 raw counts CRAN [30]
scDD scDD scDD 100 raw counts Bioconductor [31]
SCDE SCDE scde 220 raw counts Bioconductor [32]
SeuratBimod Seurat (bimod test) Seurat 1407 raw counts GitHub [33 34]
SeuratBimodnofiltSeurat (bimod test) without theinternal filtering
Seurat 1407 raw counts GitHub [33 34]
SeuratBimodIsExpr2Seurat (bimod test) with internalexpression threshold set to 2
Seurat 1407 raw counts GitHub [33 34]
SeuratTobit Seurat (tobit test) Seurat 1407 TPM GitHub [25 33]
ttest t-test stats (R v 33)TMM-normalizedTPM
CRAN [16 35]
voomlimma voom-limma limma 33013 raw counts Bioconductor [21 22]
Wilcoxon Wilcoxon test stats (R v 33)TMM-normalizedTPM
CRAN [16 36]
3
Nature Methods doi101038nmeth4612Soneson and Robinson 2018
Olga (NBIS) scRNA-seq DE February 2019 16 46
Common methods More detailed examples
More detailed examples
Olga (NBIS) scRNA-seq DE February 2019 17 46
Common methods More detailed examples
MAST
uses generalized linear hurdle model
designed to account for stochastic dropouts and bimodal expression distribution in whichexpression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene gindicating whether gene g is expressed in cell i (ie Zig = 0 if yig = 0 and zig = 1 ifyig gt 0)
A logistic regression model for the discrete variable Z and a Gaussian linear model for thecontinuous variable (Y|Z=1)
logit(Pr (Zig = 1)) = XiβDg
Pr (Yig = Y |Zig = 1) = N(XiβCg σ
2g) where Xi is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
Olga (NBIS) scRNA-seq DE February 2019 18 46
Common methods More detailed examples
Letrsquos stop for a minute
Olga (NBIS) scRNA-seq DE February 2019 19 46
Common methods More detailed examples
The key
Outcomei = (Modeli) + errori
we collect data on a sample from a much larger population
statistics lets us to make inferences about the population from which sample wasderived
we try to predict the outcome given a model fitted to the data
Olga (NBIS) scRNA-seq DE February 2019 20 46
Common methods More detailed examples
The key
t = x1minusx2
sp
radic1
n1+ 1
n2
height [cm]
Fre
quen
cy
165 170 175 180
010
3050
Olga (NBIS) scRNA-seq DE February 2019 21 46
Common methods More detailed examples
Generic recipemodel data eg gene expressionfit model to the data andor data to the modelestimate model parametersuse model for prediction andor inference
Olga (NBIS) scRNA-seq DE February 2019 22 46
Common methods More detailed examples
Generic recipemodel eg gene expression with random errorfit model to the data andor data to the model estimate modelparametersuse model for prediction andor inference
Important implicationthe better model fits to the data the better statistics
Olga (NBIS) scRNA-seq DE February 2019 23 46
Common methods More detailed examples
Common distributions
Negative Binomial
Read Counts
Fre
quen
cy
0 5 10 15 20
010
020
030
040
050
0
Negative Binomial
Read Counts
Fre
quen
cy
0 2 4 6 8 10
050
100
150
Negative Binomial
Read Counts
Fre
quen
cy
0 2 4 6 8 10 12
050
100
150
200
micro = mu
δ2 = mu + mu2
size
mu mean expression size and the dispersion which is inversely related to the variance NB fits bulk RNA-seq data very welland it is used for most statistical methods designed for such data In addition it has been show to fit the distribution of moleculecounts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grun et al 2014 Islam et al 2011)
Olga (NBIS) scRNA-seq DE February 2019 24 46
Common methods More detailed examples
Common distributions
Zerominusinflated NB
Read Counts
Fre
quen
cy
0 5 10 15 20
020
040
0
Zerominusinflated NB
Read Counts
Fre
quen
cy
0 10 20 30 40
010
020
0micro = mu lowast (1 minus d)
δ2 = micro lowast (1 minus d) lowast (1 + d lowast micro + microsize)
d dropout rate The dropout rate of a gene is strongly correlated with the mean expression of the gene Different zero-inflatednegative binomial models use different relationships between mu and d and some may fit mu and d to the expression of eachgene independently Implemented in MAST SCDE
Olga (NBIS) scRNA-seq DE February 2019 25 46
Common distributions: Poisson-Beta
[Figure: two histograms of Poisson-Beta read counts (x-axis: read counts, y-axis: frequency)]
$\mu = g \cdot \frac{a}{a + b}$
$\sigma^2 = g^2 \cdot \frac{a \cdot b}{(a + b + 1) \cdot (a + b)^2}$
a is the rate of activation of transcription, b is the rate of inhibition of transcription, and g is the rate of transcript production while transcription is active at the locus. Differential expression methods may test each of the parameters for differences across groups, or only one (often g). Implemented in BPSC.
May be further expanded to explicitly account for other sources of gene expression differences, such as batch effect or library depth, depending on the particular DE algorithm.
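The Poisson-Beta model can be sketched as a two-step simulation: a Beta(a, b) variable scaled by g gives the transcription rate, and the count is a Poisson draw at that rate. The code below is a rough illustration with hypothetical names and parameter values. It compares the σ² formula with the variance of the Beta-distributed rate; by the law of total variance, the Poisson sampling step adds a further term equal to μ to the variance of the counts themselves.

```python
import math
import random
import statistics

def pb_mean(a, b, g):
    """Mean of the Poisson-Beta model: g * a / (a + b)."""
    return g * a / (a + b)

def pb_rate_variance(a, b, g):
    """Variance of the Beta-distributed rate: g^2 * a * b / ((a + b + 1) * (a + b)^2)."""
    return g ** 2 * a * b / ((a + b + 1) * (a + b) ** 2)

def poisson(rng, lam):
    """Draw one Poisson variate (Knuth's multiplication method, fine for small lam)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

rng = random.Random(3)
a, b, g = 0.5, 1.5, 40.0        # illustrative values
rates = [g * rng.betavariate(a, b) for _ in range(50_000)]
counts = [poisson(rng, rate) for rate in rates]

print("mean of counts:    ", statistics.fmean(counts), "expected", pb_mean(a, b, g))
print("variance of rates: ", statistics.pvariance(rates), "expected", pb_rate_variance(a, b, g))
print("variance of counts:", statistics.pvariance(counts),
      "expected", pb_mean(a, b, g) + pb_rate_variance(a, b, g))
```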
MAST (revisited)
- uses a generalized linear hurdle model
- designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable
- the rate of expression Z and the level of expression Y are modelled for each gene g, with $Z_{ig}$ indicating whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
- a logistic regression model for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):
  $\operatorname{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$
  $\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix
- model parameters are fitted using an empirical Bayesian framework
- allows for a joint estimate of nuisance and treatment effects
- DE is determined using the likelihood ratio test
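The two-part idea can be sketched for the simplest design, one gene and two groups of cells. This is a deliberately reduced illustration, not MAST's implementation: it assumes log-scale expression values, uses plain maximum likelihood instead of the empirical Bayes fitting, and all function names and data are hypothetical. With a single two-level covariate, the logistic part reduces to comparing detection proportions and the Gaussian part to comparing means of the detected values, so the two likelihood ratio statistics add up to a χ² statistic with 2 degrees of freedom.

```python
import math

def bernoulli_loglik(k, n):
    """Maximised log-likelihood of k detections among n cells."""
    if n == 0 or k == 0 or k == n:
        return 0.0
    p = k / n
    return k * math.log(p) + (n - k) * math.log(1 - p)

def rss(values):
    """Residual sum of squares around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def hurdle_lrt(group_a, group_b):
    """Likelihood ratio test combining a detection part and a level part."""
    pos_a = [y for y in group_a if y > 0]
    pos_b = [y for y in group_b if y > 0]
    # Discrete part: is the gene detected (Z = 1) at different rates?
    ll_alt = (bernoulli_loglik(len(pos_a), len(group_a))
              + bernoulli_loglik(len(pos_b), len(group_b)))
    ll_null = bernoulli_loglik(len(pos_a) + len(pos_b),
                               len(group_a) + len(group_b))
    stat_discrete = 2 * (ll_alt - ll_null)
    # Continuous part: Gaussian model for Y given Z = 1, common variance.
    n_pos = len(pos_a) + len(pos_b)
    rss_null, rss_alt = rss(pos_a + pos_b), rss(pos_a) + rss(pos_b)
    stat_continuous = n_pos * math.log(rss_null / rss_alt) if rss_alt > 0 else 0.0
    stat = max(0.0, stat_discrete + stat_continuous)
    # Survival function of a chi-square with 2 degrees of freedom is exp(-x / 2).
    return stat, math.exp(-stat / 2)

# Toy log-expression values for one gene in two groups of cells.
group_a = [0, 0, 0, 2.1, 2.5, 0, 1.9, 0]
group_b = [3.0, 3.4, 0, 2.9, 3.8, 3.1, 3.5, 2.7]
stat, p_value = hurdle_lrt(group_a, group_b)
print(f"LRT statistic = {stat:.2f}, p-value = {p_value:.2g}")
```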
SCDE
- models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution
- the NB distribution models the transcripts that are amplified and detected
- the Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)
- a subset of robust genes is used to fit, via the EM algorithm, the parameters of the mixture of models
- for DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach
Monocle
- originally designed for ordering cells by progress through differentiation stages (pseudo-time)
- the mean expression level of each gene is modelled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as
  $g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$
  where Y is a specific gene expression level, $x_i$ are predictor variables, g is a link function (typically the log function), and $f_i$ are non-parametric functions (e.g. cubic splines)
- the observable expression level Y is then modelled using a GAM:
  $E(Y) = s(\varphi_t(b_x, s_i)) + \varepsilon$
  where $\varphi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and s is a cubic smoothing function with three degrees of freedom. The error term $\varepsilon$ is normally distributed with a mean of zero
- the DE test is performed using an approximate $\chi^2$ likelihood ratio test
Performance
Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle.
Short name | Method | Software version | Input | Available from | Reference
BPSC | BPSC | BPSC 09901 | CPM | GitHub | [11]
D3E | D3E | D3E 10 | raw counts | GitHub | [12]
DESeq2 | DESeq2 | DESeq2 1141 | raw counts | Bioconductor | [13]
DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1141 | raw counts | Bioconductor | [13]
DESeq2census | DESeq2 | DESeq2 1141 | Census counts | Bioconductor | [13]
DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1141 | raw counts | Bioconductor | [13]
DEsingle | DEsingle | DEsingle 010 | raw counts | GitHub | [14]
edgeRLRT | edgeRLRT | edgeR 3191 | raw counts | Bioconductor | [15–17]
edgeRLRTcensus | edgeRLRT | edgeR 3191 | Census counts | Bioconductor | [15–17]
edgeRLRTdeconv | edgeRLRT with deconvolution normalization | edgeR 3191, scran 120 | raw counts | Bioconductor | [15, 17, 18]
edgeRLRTrobust | edgeRLRT with robust dispersion estimation | edgeR 3191 | raw counts | Bioconductor | [15–17, 19]
edgeRQLF | edgeRQLF | edgeR 3191 | raw counts | Bioconductor | [15, 16, 20]
edgeRQLFDetRate | edgeRQLF with cellular detection rate as covariate | edgeR 3191 | raw counts | Bioconductor | [15, 16, 20]
limmatrend | limma-trend | limma 33013 | log2(CPM) | Bioconductor | [21, 22]
MASTcpm | MAST | MAST 105 | log2(CPM+1) | Bioconductor | [23]
MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 105 | log2(CPM+1) | Bioconductor | [23]
MASTtpm | MAST | MAST 105 | log2(TPM+1) | Bioconductor | [23]
MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 105 | log2(TPM+1) | Bioconductor | [23]
metagenomeSeq | metagenomeSeq | metagenomeSeq 1160 | raw counts | Bioconductor | [24]
monocle | monocle (tobit) | monocle 220 | TPM | Bioconductor | [25]
monoclecensus | monocle (Negative Binomial) | monocle 220 | Census counts | Bioconductor | [25, 26]
monoclecount | monocle (Negative Binomial) | monocle 220 | raw counts | Bioconductor | [25]
NODES | NODES | NODES 0009010 | raw counts | Author-provided link | [27]
ROTScpm | ROTS | ROTS 120 | CPM | Bioconductor | [28, 29]
ROTStpm | ROTS | ROTS 120 | TPM | Bioconductor | [28, 29]
ROTSvoom | ROTS | ROTS 120 | voom-transformed raw counts | Bioconductor | [28, 29]
SAMseq | SAMseq | samr 20 | raw counts | CRAN | [30]
scDD | scDD | scDD 100 | raw counts | Bioconductor | [31]
SCDE | SCDE | scde 220 | raw counts | Bioconductor | [32]
SeuratBimod | Seurat (bimod test) | Seurat 1407 | raw counts | GitHub | [33, 34]
SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1407 | raw counts | GitHub | [33, 34]
SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1407 | raw counts | GitHub | [33, 34]
SeuratTobit | Seurat (tobit test) | Seurat 1407 | TPM | GitHub | [25, 33]
ttest | t-test | stats (R v 33) | TMM-normalized TPM | CRAN | [16, 35]
voomlimma | voom-limma | limma 33013 | raw counts | Bioconductor | [21, 22]
Wilcoxon | Wilcoxon test | stats (R v 33) | TMM-normalized TPM | CRAN | [16, 36]
Source: Nature Methods, doi:10.1038/nmeth.4612
No ground truth, i.e. no independently validated truth is available for testing
- Known data: using data we know something about to get positive controls
- Simulated data: null data sets by re-sampling; modelling data sets based on various distributions
- Comparing between methods and scenarios: comparing numbers of DE genes, incl. as a function of group size
- Investigating results: what do the expression and distributions of the detected DE genes look like?
- False positives (type I error) vs. false negatives (type II error)
- Sensitivity and specificity
- Precision and recall
[Figure: adapted from Wikipedia]
[Figure: Dal Molin, Baruzzo and Di Camillo 2017; 2 conditions of 100 cells each, simulated with 10 000 genes, out of which 2 000 were set to be DE (based on NB and bimodal distributions)]
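These quantities all follow from the four cells of a confusion table. The sketch below uses invented counts for a simulation of the same size as the one above (10 000 genes, 2 000 of them truly DE); the numbers of true and false calls are hypothetical and only serve to show the arithmetic.

```python
def classification_metrics(tp, fp, fn, tn):
    """Summarise DE calls against a known truth (e.g. from simulated data)."""
    return {
        "sensitivity": tp / (tp + fn),           # also called recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "false_positive_rate": fp / (fp + tn),   # type I error rate
        "false_negative_rate": fn / (fn + tp),   # type II error rate
    }

# Hypothetical outcome: 2 000 truly DE genes, 8 000 non-DE genes.
metrics = classification_metrics(tp=1500, fp=400, fn=500, tn=7600)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this made-up case a method with 95% specificity still has a precision below 80%, because non-DE genes outnumber DE genes four to one.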
Consistency
[Figure: Miao et al. 2017]
And so much more
Soneson and Robinson 2018, "Bias, robustness and scalability in single-cell differential expression analysis"
- 36 statistical approaches for DE analysis to compare the expression levels in the two groups of cells
- based on 9 data sets with 11 to 21 separate instances (sample size effect)
- extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.
- conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets
Practicalities
Getting to know your data
Example data: 46078 genes x 96 cells; 22229 genes with no expression at all
[Figure: two histograms with frequency on the y-axis; panel "Read Counts" and panel "0 counts"]
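A first look at a count matrix can be as simple as counting zeros. The sketch below uses a tiny invented genes-by-cells matrix (the real example has 46078 genes and 96 cells) and reports the number of zero counts per gene, the number of genes with no expression at all, and the detection rate, i.e. the fraction of detected genes per cell. All variable names are illustrative.

```python
# Toy count matrix: rows are genes, columns are cells (values are invented).
counts = [
    [0, 0, 0, 0],
    [5, 0, 2, 0],
    [0, 0, 1, 0],
    [12, 7, 9, 3],
]
n_genes, n_cells = len(counts), len(counts[0])

# Number of cells with a zero count, for each gene.
zeros_per_gene = [row.count(0) for row in counts]

# Genes that are not expressed in any cell.
never_expressed = sum(1 for z in zeros_per_gene if z == n_cells)

# Detection rate: fraction of genes with a non-zero count, for each cell.
detection_rate = [
    sum(1 for row in counts if row[cell] > 0) / n_genes
    for cell in range(n_cells)
]

print("zeros per gene:", zeros_per_gene)
print("genes with no expression at all:", never_expressed)
print("detection rate per cell:", detection_rate)
```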
Choosing DE methods
[Figure: Soneson and Robinson 2018]
Remembering the bigger picture
[Figure: Stegle, Teichmann and Marioni 2015]
- QC filtering
- cell-cycle phase
- normalization of cell-specific biases
- confounding factors, incl. batch effects
- detection rate, i.e. the fraction of detected genes per cell
- imputation strategies for dropout values
- what is pragmatic: programming language, platform, speed, collaborative workflows, etc.
Staying critical
Summary
What to remember from this hour?
https://www.menti.com & 25 06 78
Growing field
[Figure: Angerer et al. 2017]
[Figure: https://www.scrna-tools.org/tools, Zappia, Phipson and Oshlack 2018]
Summary
- scRNA-seq is a rapidly growing field
- DE is a common task, so many newer and better methods will be developed
- understanding basic statistical concepts enables one to think more like a statistician when choosing and evaluating methods for a given data set
- staying critical, staying updated, staying connected
Bibliography
Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1–9. ISSN 1367-4803. doi:10.1093/bioinformatics/bty329.
Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243–260. ISSN 2095-4697. doi:10.1007/s40484-016-0089-7.
Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255–261. ISSN 1548-7105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612.
Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN 1664-8021. doi:10.3389/fgene.2017.00062.
Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data." no. May: 1–2. ISSN 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549.
Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133–145.
Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85–91. ISSN 2452-3100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004.
Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573.
Common methods
[Figure: simplified scRNA-seq workflow, adapted from http://hemberg-lab.github.io]
Generic non-parametric methods
- e.g. Wilcoxon rank-sum test, Kruskal-Wallis test, Kolmogorov-Smirnov test
- non-parametric tests generally convert observed expression values to ranks and test whether the distribution of ranks for one group is significantly different from the distribution of ranks for the other group
- some non-parametric methods fail in the presence of a large number of tied values, such as the dropouts (zeros) in single-cell RNA-seq expression data
- if the conditions for a parametric test hold, then it will typically be more powerful than a non-parametric test
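The effect of ties can be seen in a small rank-sum calculation. The sketch below is one possible way to write a two-sided Wilcoxon rank-sum test, assuming a normal approximation with the usual tie correction and no continuity correction; the function names and the toy expression values are made up. Tied values, here the zeros, share the average of the ranks they occupy.

```python
import math
from collections import Counter

def midranks(values):
    """Ranks 1..n, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1          # average rank of the tied block
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks

def rank_sum_test(group_a, group_b):
    """Two-sided rank-sum test: returns the U statistic of group_a and a p-value."""
    n_a, n_b = len(group_a), len(group_b)
    n = n_a + n_b
    pooled = list(group_a) + list(group_b)
    ranks = midranks(pooled)
    u = sum(ranks[:n_a]) - n_a * (n_a + 1) / 2
    # Tie correction: every block of t tied values reduces the variance of U.
    tie_term = sum(t ** 3 - t for t in Counter(pooled).values())
    variance = n_a * n_b / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance <= 1e-12:                        # all values tied: no information
        return u, 1.0
    z = (u - n_a * n_b / 2) / math.sqrt(variance)
    return u, math.erfc(abs(z) / math.sqrt(2))   # two-sided normal p-value

group_a = [0, 0, 0, 0, 1, 2]     # toy expression values with many dropouts
group_b = [0, 0, 3, 4, 5, 6]
u, p_value = rank_sum_test(group_a, group_b)
print(f"U = {u}, p-value = {p_value:.3f}")
```

When every value is zero, all ranks are equal and the test cannot distinguish the groups, which the sketch reports as a p-value of 1.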
Developed for bulk RNA-seq
- e.g. edgeR, DESeq2
- compare estimates of mean expression (sample size)
- based on the negative binomial distribution
- can be assessed using data sets where RNA-seq data have been validated by RT-qPCR
Developed for scRNA-seq
- e.g. MAST, SCDE, Monocle, Pagoda, D3E, etc.
- there is a large number of samples (i.e. cells) for each group we are comparing in single-cell experiments; thus we can take advantage of the whole distribution of expression values in each group to identify differences between groups
- we usually do not have a defined set of experimental conditions; instead we try to identify the cell groups by using an unsupervised clustering approach
[Figure: Miao and Zhang 2016]
Evaluated differential expression methods: Supplementary Table 2 of Soneson and Robinson 2018 (Nature Methods, doi:10.1038/nmeth.4612)
More detailed examples
Let's stop for a minute
The key
$\text{Outcome}_i = (\text{Model}_i) + \text{error}_i$
- we collect data on a sample from a much larger population
- statistics lets us make inferences about the population from which the sample was derived
- we try to predict the outcome given a model fitted to the data
The key
$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$
[Figure: histogram of height [cm] (x-axis) against frequency (y-axis)]
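The two-sample t statistic can be computed in a few lines. The sketch below assumes that s_p is the pooled standard deviation of the two groups, the usual reading of the formula; the function name and the height values are invented for illustration.

```python
import math
import statistics

def pooled_t(x1, x2):
    """Two-sample t statistic with a pooled standard deviation s_p."""
    n1, n2 = len(x1), len(x2)
    mean1, mean2 = statistics.fmean(x1), statistics.fmean(x2)
    var1, var2 = statistics.variance(x1), statistics.variance(x2)
    # Pooled standard deviation: weighted average of the two sample variances.
    s_p = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (mean1 - mean2) / (s_p * math.sqrt(1 / n1 + 1 / n2))

# Toy heights in cm for two groups.
heights_a = [170, 172, 174, 176, 178]
heights_b = [166, 168, 170, 172, 174]
t = pooled_t(heights_a, heights_b)
print(f"t = {t:.2f} with {len(heights_a) + len(heights_b) - 2} degrees of freedom")
```

Here the group means differ by 4 cm and the pooled standard error is 2 cm, so the statistic is the ratio of the two.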
Olga (NBIS) scRNA-seq DE February 2019 21 46
Common methods More detailed examples
Generic recipemodel data eg gene expressionfit model to the data andor data to the modelestimate model parametersuse model for prediction andor inference
Olga (NBIS) scRNA-seq DE February 2019 22 46
Common methods More detailed examples
Generic recipemodel eg gene expression with random errorfit model to the data andor data to the model estimate modelparametersuse model for prediction andor inference
Important implicationthe better model fits to the data the better statistics
Olga (NBIS) scRNA-seq DE February 2019 23 46
Common methods More detailed examples
Common distributions
Negative Binomial
Read Counts
Fre
quen
cy
0 5 10 15 20
010
020
030
040
050
0
Negative Binomial
Read Counts
Fre
quen
cy
0 2 4 6 8 10
050
100
150
Negative Binomial
Read Counts
Fre
quen
cy
0 2 4 6 8 10 12
050
100
150
200
micro = mu
δ2 = mu + mu2
size
mu mean expression size and the dispersion which is inversely related to the variance NB fits bulk RNA-seq data very welland it is used for most statistical methods designed for such data In addition it has been show to fit the distribution of moleculecounts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grun et al 2014 Islam et al 2011)
Olga (NBIS) scRNA-seq DE February 2019 24 46
Common methods More detailed examples
Common distributions
Zerominusinflated NB
Read Counts
Fre
quen
cy
0 5 10 15 20
020
040
0
Zerominusinflated NB
Read Counts
Fre
quen
cy
0 10 20 30 40
010
020
0micro = mu lowast (1 minus d)
δ2 = micro lowast (1 minus d) lowast (1 + d lowast micro + microsize)
d dropout rate The dropout rate of a gene is strongly correlated with the mean expression of the gene Different zero-inflatednegative binomial models use different relationships between mu and d and some may fit mu and d to the expression of eachgene independently Implemented in MAST SCDE
Olga (NBIS) scRNA-seq DE February 2019 25 46
Common methods More detailed examples
Common distributions
PoissonminusBeta
Read Counts
Fre
quen
cy
0 20 60 100
020
040
0
PoissonminusBeta
Read Counts
Fre
quen
cy
0 20 60 100 140
050
150
micro = g lowast a(a + b)
δ2 = g2 lowast a lowast b((a + b + 1) lowast (a + b)2)
a the rate of activation of transcription b the rate of inhibition of transcription and g the rate of transcript production whiletranscription is active at the locus Differential expression methods may test each of the parameters for differences across groupsor only one (often g) Implemented in BPSCMay be further expanded to explicitly account for other sources of gene expression differences such as batch-effect or librarydepth depending on the particular DE algorithm
Olga (NBIS) scRNA-seq DE February 2019 26 46
Common methods More detailed examples
MAST (revisited)
uses generalized linear hurdle model
designed to account for stochastic dropouts and bimodal expression distribution in whichexpression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene gindicating whether gene g is expressed in cell i (ie Zig = 0 if yig = 0 and zig = 1 ifyig gt 0)
A logistic regression model for the discrete variable Z and a Gaussian linear model for thecontinuous variable (Y|Z=1)
logit(Pr (Zig = 1)) = XiβDg
Pr (Yig = Y |Zig = 1) = N(XiβCg σ
2g) where Xi is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
Olga (NBIS) scRNA-seq DE February 2019 27 46
Common methods More detailed examples
SCDE
models the read counts for each gene using a mixture of a NB negative binomial and aPoisson distribution
NB distribution models the transcripts that are amplified and detected
Poisson distribution models the unobserved or background-level signal of transcripts thatare not amplified (eg dropout events)
subset of robust genes is used to fit via EM algorithm the parameters to the mixture ofmodels
For DE the posterior probability that the gene shows a fold expression difference betweentwo conditions is computed using a Bayesian approach
Olga (NBIS) scRNA-seq DE February 2019 28 46
Common methods More detailed examples
Monocole
Originally designed for ordering cells by progress through differentiation stages(pseudo-time)
The mean expression level of each gene is modeled with a GAM generalized additivemodel which relates one or more predictor variables to a response variable as
g(E(Y )) = β0 + f1(x1) + f2(x2) + + fm(xm) where Y is a specific gene expression level xi arepredictor variables g is a link function typically log function and fi are non-parametric functions
(eg cubic splines)
The observable expression level Y is then modeled using GAM
E(Y ) = s(ϕt (bx si )) + ε where ϕt (bx si ) is the assigned pseudo-time of a cell and s is a cubicsmoothing function with three degrees of freedom The error term ε is normally distributed with amean of zero
The DE test is performed using an approx χ2 likelihood ratio test
Olga (NBIS) scRNA-seq DE February 2019 29 46
Performance
Performance
Olga (NBIS) scRNA-seq DE February 2019 30 46
Supplementary Table 2 Evaluated dicrarrerential expression methods together with package versions and thetype of input values provided to each of them Note that ldquoraw countsrdquo here refers to length-scaled TPMswhich are on the scale of the raw counts but are unacrarrected by dicrarrerential isoform usage [10] CPM valuesare calculated with edgeR and Census counts with monocle
Short name Method Software version InputAvailablefrom
Reference
BPSC BPSC BPSC 09901 CPM GitHub [11]
D3E D3E D3E 10 raw counts GitHub [12]
DESeq2 DESeq2 DESeq2 1141 raw counts Bioconductor [13]
DESeq2betapFALSE DESeq2 without beta prior DESeq2 1141 raw counts Bioconductor [13]
DESeq2census DESeq2 DESeq2 1141 Census counts Bioconductor [13]
DESeq2nofiltDESeq2 without the built-in in-dependent filtering
DESeq2 1141 raw counts Bioconductor [13]
DEsingle DEsingle DEsingle 010 raw counts GitHub [14]
edgeRLRT edgeRLRT edgeR 3191 raw counts Bioconductor [15ndash17]
edgeRLRTcensus edgeRLRT edgeR 3191 Census counts Bioconductor [15ndash17]
edgeRLRTdeconvedgeRLRT with deconvolutionnormalization
edgeR 3191scran 120
raw counts Bioconductor [15 17 18]
edgeRLRTrobustedgeRLRT with robust disper-sion estimation
edgeR 3191 raw counts Bioconductor [15ndash17 19]
edgeRQLF edgeRQLF edgeR 3191 raw counts Bioconductor [15 16 20]
edgeRQLFDetRateedgeRQLF with cellular detec-tion rate as covariate
edgeR 3191 raw counts Bioconductor [15 16 20]
limmatrend limma-trend limma 33013 log2(CPM) Bioconductor [21 22]
MASTcpm MAST MAST 105 log2(CPM+1) Bioconductor [23]
MASTcpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(CPM+1) Bioconductor [23]
MASTtpm MAST MAST 105 log2(TPM+1) Bioconductor [23]
MASTtpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(TPM+1) Bioconductor [23]
metagenomeSeq metagenomeSeqmetagenomeSeq1160
raw counts Bioconductor [24]
monocle monocle (tobit) monocle 220 TPM Bioconductor [25]
monoclecensus monocle (Negative Binomial) monocle 220 Census counts Bioconductor [25 26]
monoclecount monocle (Negative Binomial) monocle 220 raw counts Bioconductor [25]
NODES NODESNODES0009010
raw countsAuthor-providedlink
[27]
ROTScpm ROTS ROTS 120 CPM Bioconductor [28 29]
ROTStpm ROTS ROTS 120 TPM Bioconductor [28 29]
ROTSvoom ROTS ROTS 120voom-transformedraw counts
Bioconductor [28 29]
SAMseq SAMseq samr 20 raw counts CRAN [30]
scDD scDD scDD 100 raw counts Bioconductor [31]
SCDE SCDE scde 220 raw counts Bioconductor [32]
SeuratBimod Seurat (bimod test) Seurat 1407 raw counts GitHub [33 34]
SeuratBimodnofiltSeurat (bimod test) without theinternal filtering
Seurat 1407 raw counts GitHub [33 34]
SeuratBimodIsExpr2Seurat (bimod test) with internalexpression threshold set to 2
Seurat 1407 raw counts GitHub [33 34]
SeuratTobit Seurat (tobit test) Seurat 1407 TPM GitHub [25 33]
ttest t-test stats (R v 33)TMM-normalizedTPM
CRAN [16 35]
voomlimma voom-limma limma 33013 raw counts Bioconductor [21 22]
Wilcoxon Wilcoxon test stats (R v 33)TMM-normalizedTPM
CRAN [16 36]
3
Nature Methods doi101038nmeth4612
Performance
Olga (NBIS) scRNA-seq DE February 2019 31 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
False positives (type I error) vs false negatives (type II error)Sensitivity and specificityPrecision and recall
adapted from Wikipedia
Olga (NBIS) scRNA-seq DE February 2019 33 46
Performance
False positives (type I error) vs false negatives (type II error)Sensitivity and specificityPrecision and recall
Dal Molin Baruzzo and Di Camillo 2017 2 conditions of 100 cells each simulated with 10 000genes out of which 2 000 set to DEs (based on NB and bimodal distributions)
Olga (NBIS) scRNA-seq DE February 2019 34 46
Performance
Consistency
Miao et al 2017
Olga (NBIS) scRNA-seq DE February 2019 35 46
Performance
And so much more
Soneson and Robinson 2018
Bias robustness and scalability in single-celldifferential expression analysis
36 statistical approaches for DE analysisto compare the expression levels in thetwo groups of cells
based on 9 data sets with 11 - 21separate instances (sample size effect)
extensive evaluation metrics incl numberof genes found characteristics of the falsepositive detections robustness ofmethods similarities between methodsetc
conquer a collection of consistentlyprocessed analysis-ready publicscRNA-seq data sets
Olga (NBIS) scRNA-seq DE February 2019 36 46
Practicalities
Practicalities
Olga (NBIS) scRNA-seq DE February 2019 37 46
Practicalities
Getting to know your dataExample data 46078 genes x 96 cells22229 genes with no expression at all
Read Counts
Fre
quen
cy
0 500 1000 1500
050
0015
000
0 counts
Fre
quen
cy
0 20 40 60 800
2000
4000
6000
Olga (NBIS) scRNA-seq DE February 2019 38 46
Choosing DE methods
Soneson and Robinson 2018
Remembering the bigger picture
Stegle, Teichmann and Marioni 2015
QC filtering
Cell-cycle phase
Normalization of cell-specific biases
Confounding factors, incl. batch effects
Detection rate, i.e. the fraction of detected genes per cell
Imputation strategies for dropout values
What is pragmatic: programming language, platform, speed, collaborative workflows, etc.
Staying critical
Summary
What to remember from this hour
https://www.menti.com & 25 06 78
Growing field
Angerer et al. 2017
https://www.scrna-tools.org/tools
Zappia, Phipson and Oshlack 2018
scRNA-seq is a rapidly growing field
DE is a common task, so many newer and better methods will be developed
understanding basic statistical concepts enables one to think more like a statistician, and to choose and evaluate methods for a given data set
staying critical, staying updated, staying connected
Bibliography
Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1–9. ISSN 1367-4803. doi:10.1093/bioinformatics/bty329.
Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243–260. ISSN 20954697. doi:10.1007/s40484-016-0089-7.
Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255–261. ISSN 15487105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612.
Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN 16648021. doi:10.3389/fgene.2017.00062.
Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data." no. May: 1–2. ISSN 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549.
Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133–145.
Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85–91. ISSN 24523100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004.
Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573.
Common methods
Simplified scRNA-seq workflow [adapted from http://hemberg-lab.github.io]
Generic non-parametric methods
e.g. Wilcoxon rank-sum test, Kruskal-Wallis, Kolmogorov-Smirnov test
non-parametric tests generally convert observed expression values to ranks and test whether the distribution of ranks for one group is significantly different from the distribution of ranks for the other group
some non-parametric methods fail in the presence of a large number of tied values, such as the case for dropouts (zeros) in single-cell RNA-seq expression data
if the conditions for a parametric test hold, then it will typically be more powerful than a non-parametric test
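The conversion to ranks, and what many tied zeros do to it, can be sketched as follows. This is a minimal hand-written Wilcoxon rank-sum test using midranks and a tie-corrected normal approximation, written only to show the mechanics. The expression values are hypothetical, and in practice an established implementation (for example `wilcox.test` in R) would be used instead.

```python
import math
import numpy as np

def midranks(values):
    """Rank values 1..n; tied values share the average of the ranks they span."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    sorted_vals = values[order]
    ranks = np.empty(len(values), dtype=float)
    start = 0
    while start < len(values):
        end = start
        while end + 1 < len(values) and sorted_vals[end + 1] == sorted_vals[start]:
            end += 1
        # Positions start..end (0-based) hold one tied block; ranks are 1-based.
        ranks[order[start:end + 1]] = (start + end) / 2.0 + 1.0
        start = end + 1
    return ranks

def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation, tie-corrected)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = np.concatenate([x, y])
    ranks = midranks(pooled)
    u = ranks[:n1].sum() - n1 * (n1 + 1) / 2.0
    _, tie_sizes = np.unique(pooled, return_counts=True)
    tie_term = float((tie_sizes ** 3 - tie_sizes).sum()) / (n * (n - 1))
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term)
    if var_u == 0:  # every value tied: the ranks carry no information
        return u, 1.0
    z = (u - n1 * n2 / 2.0) / math.sqrt(var_u)
    return u, math.erfc(abs(z) / math.sqrt(2.0))

# Hypothetical expression values for one gene in two groups of 30 cells.
group_a = np.array([0] * 20 + [1, 2, 3, 4, 5] * 2)
group_b = np.array([0] * 5 + [3, 4, 5, 6, 7] * 5)

u_stat, p_value = rank_sum_test(group_a, group_b)
zero_fraction = float((np.concatenate([group_a, group_b]) == 0).mean())
print(f"U = {u_stat:.1f}, p = {p_value:.2e}, tied zeros = {zero_fraction:.0%}")
```

All zeros receive the same midrank, so the more dropouts a gene has, the less the ranks can distinguish the groups. The tie correction shrinks the variance accordingly.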
developed for bulk RNA-seq
e.g. edgeR, DESeq2
compare estimates of mean expression (sample size)
based on the negative binomial distribution
can be assessed by datasets where RNA-seq data has been validated by RT-qPCR
developed for scRNA-seq
e.g. MAST, SCDE, Monocle, Pagoda, D3E, etc.
large number of samples (i.e. cells) for each group we are comparing in single-cell experiments. Thus we can take advantage of the whole distribution of expression values in each group to identify differences between groups
we usually do not have a defined set of experimental conditions; instead, we try to identify the cell groups by using an unsupervised clustering approach
Miao and Zhang 2016
Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle.
Short name | Method | Software version | Input | Available from | Reference
BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11]
D3E | D3E | D3E 1.0 | raw counts | GitHub | [12]
DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13]
DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14]
edgeRLRT | edgeR/LRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17]
edgeRLRTcensus | edgeR/LRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17]
edgeRLRTdeconv | edgeR/LRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18]
edgeRLRTrobust | edgeR/LRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19]
edgeRQLF | edgeR/QLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
edgeRQLFDetRate | edgeR/QLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22]
MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24]
monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25]
monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26]
monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25]
NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27]
ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29]
ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29]
ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29]
SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30]
scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31]
SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32]
SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33]
ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35]
voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22]
Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36]
Source: Soneson and Robinson 2018, Nature Methods, doi:10.1038/nmeth.4612
More detailed examples
MAST
uses a generalized linear hurdle model
designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene g, with $Z_{ig}$ indicating whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
A logistic regression model for the discrete variable Z and a Gaussian linear model for the continuous variable (Y|Z=1):
$\mathrm{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$
$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
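The two-part structure of the hurdle model can be illustrated with a stripped-down two-group version. This is a sketch under strong simplifying assumptions and not the MAST implementation (MAST is an R package): the design matrix contains only a group indicator, there is no empirical Bayes regularization and no covariates, and the data are simulated. The function names `hurdle_lrt` and `simulate` are hypothetical.

```python
import math
import numpy as np

def bernoulli_loglik(k, n):
    """Maximized Bernoulli log-likelihood for k expressing cells out of n."""
    if k == 0 or k == n:
        return 0.0
    p = k / n
    return k * math.log(p) + (n - k) * math.log(1 - p)

def hurdle_lrt(y_a, y_b):
    """Two-group hurdle likelihood ratio test; returns (statistic, p-value)."""
    y_a, y_b = np.asarray(y_a, dtype=float), np.asarray(y_b, dtype=float)
    pooled = np.concatenate([y_a, y_b])

    # Discrete part (Z): is the detection rate the same in both groups?
    k_a, k_b = int((y_a > 0).sum()), int((y_b > 0).sum())
    ll_alt = bernoulli_loglik(k_a, len(y_a)) + bernoulli_loglik(k_b, len(y_b))
    ll_null = bernoulli_loglik(k_a + k_b, len(pooled))
    lr_discrete = 2.0 * (ll_alt - ll_null)

    # Continuous part (Y | Z = 1): Gaussian with group-specific vs common mean.
    pos_a, pos_b, pos = y_a[y_a > 0], y_b[y_b > 0], pooled[pooled > 0]
    lr_continuous = 0.0
    if len(pos_a) > 0 and len(pos_b) > 0:
        rss_null = float(((pos - pos.mean()) ** 2).sum())
        rss_alt = float(((pos_a - pos_a.mean()) ** 2).sum()
                        + ((pos_b - pos_b.mean()) ** 2).sum())
        if rss_alt > 0:
            lr_continuous = len(pos) * math.log(rss_null / rss_alt)

    lr = lr_discrete + lr_continuous
    # Survival function of a chi-square with 2 degrees of freedom: exp(-x / 2).
    return lr, math.exp(-max(lr, 0.0) / 2.0)

rng = np.random.default_rng(1)

def simulate(n_cells, detection_prob, mean_level):
    """Hypothetical log-expression: zero when not detected, Gaussian otherwise."""
    detected = rng.random(n_cells) < detection_prob
    level = np.clip(rng.normal(mean_level, 0.5, n_cells), 0.1, None)
    return np.where(detected, level, 0.0)

group_a = simulate(100, 0.3, 2.0)
group_b = simulate(100, 0.7, 3.0)
statistic, p_value = hurdle_lrt(group_a, group_b)
print(f"LR statistic = {statistic:.1f}, p = {p_value:.2e}")
```

The two likelihood ratio statistics are added because the discrete and continuous parts are modeled separately, so a gene can be called DE because of a change in detection rate, in expression level, or in both.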
Let's stop for a minute
The key
$Outcome_i = (Model_i) + error_i$
we collect data on a sample from a much larger population
statistics lets us make inferences about the population from which the sample was derived
we try to predict the outcome given a model fitted to the data
$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$
[Figure: histogram of height [cm] against frequency]
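As a worked instance of the two-sample t statistic with pooled standard deviation $s_p$, the sketch below computes it by hand for two small groups. The height values are made up for illustration, and `pooled_t` is a hypothetical helper name.

```python
import math
from statistics import mean, variance

def pooled_t(x1, x2):
    """Two-sample t statistic with pooled standard deviation; returns (t, df)."""
    n1, n2 = len(x1), len(x2)
    # Pooled standard deviation: weighted average of the two sample variances.
    sp = math.sqrt(((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2))
    t = (mean(x1) - mean(x2)) / (sp * math.sqrt(1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

# Hypothetical heights [cm] in two groups.
group1 = [172.0, 175.0, 169.0, 178.0, 174.0, 171.0]
group2 = [168.0, 170.0, 166.0, 172.0, 169.0, 167.0]

t_stat, df = pooled_t(group1, group2)
print(f"t = {t_stat:.3f} on {df} degrees of freedom")
```

The statistic is the difference in means (the model) scaled by its standard error (the error), which is the Outcome = Model + error idea in its simplest form.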
Generic recipe
model the data, e.g. gene expression with random error
fit the model to the data and/or the data to the model; estimate model parameters
use the model for prediction and/or inference
Important implication
the better the model fits the data, the better the statistics
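The recipe and its implication can be tried out on simulated counts. The sketch below fits a Poisson and a negative binomial model to the same hypothetical overdispersed data and compares their log-likelihoods. The parameter values are arbitrary, and the NB size is estimated with a simple method-of-moments step rather than the maximum-likelihood fitting that DE packages use.

```python
import math
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical overdispersed counts for one gene across 2000 cells.
true_mu, true_size = 5.0, 0.5
counts = rng.negative_binomial(
    n=true_size, p=true_size / (true_size + true_mu), size=2000
).tolist()

def poisson_loglik(k, mu):
    """Log-likelihood of counts k under a Poisson model with mean mu."""
    return sum(x * math.log(mu) - mu - math.lgamma(x + 1) for x in k)

def nb_loglik(k, mu, size):
    """Log-likelihood of counts k under a negative binomial (mean mu, dispersion size)."""
    p = size / (size + mu)
    return sum(
        math.lgamma(x + size) - math.lgamma(size) - math.lgamma(x + 1)
        + size * math.log(p) + x * math.log(1 - p)
        for x in k
    )

# Estimate parameters: sample mean, and size from variance = mu + mu^2 / size.
mu_hat = float(np.mean(counts))
var_hat = float(np.var(counts, ddof=1))
size_hat = mu_hat ** 2 / (var_hat - mu_hat)

ll_poisson = poisson_loglik(counts, mu_hat)
ll_nb = nb_loglik(counts, mu_hat, size_hat)
print(f"mean = {mu_hat:.2f}, variance = {var_hat:.2f}, size = {size_hat:.2f}")
print(f"log-likelihood: Poisson = {ll_poisson:.0f}, NB = {ll_nb:.0f}")
```

A higher log-likelihood means the model describes the data better. Here the Poisson model cannot accommodate a variance far above the mean, so tests built on it would be unreliable for these counts.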
Common distributions
Negative Binomial
[Figure: three histograms of negative binomial read counts; x-axis "Read Counts", y-axis "Frequency"]
$\mu = mu$
$\sigma^2 = mu + \frac{mu^2}{size}$
mu: mean expression; size: the dispersion, which is inversely related to the variance. NB fits bulk RNA-seq data very well and it is used for most statistical methods designed for such data. In addition, it has been shown to fit the distribution of molecule counts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grun et al. 2014, Islam et al. 2011).
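The mean and variance formulas for the negative binomial can be checked by simulation. The sketch below assumes NumPy's parameterization, in which `negative_binomial(n, p)` has mean n(1 - p)/p, so setting n = size and p = size / (size + mu) gives the (mu, size) form used above. The values mu = 10 and size = 2 are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(7)

mu, size = 10.0, 2.0
p = size / (size + mu)  # converts (mu, size) to NumPy's (n, p) parameterization
samples = rng.negative_binomial(n=size, p=p, size=200_000)

expected_var = mu + mu ** 2 / size  # 10 + 100 / 2 = 60
print(f"sample mean     = {samples.mean():.2f} (expected {mu:.2f})")
print(f"sample variance = {samples.var():.2f} (expected {expected_var:.2f})")
```

A Poisson distribution with the same mean would have variance 10. The extra mu^2 / size term is the overdispersion, and it grows as size gets smaller.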
Zero-inflated NB
[Figure: two histograms of zero-inflated negative binomial read counts; x-axis "Read Counts", y-axis "Frequency"]
$\mu = mu \cdot (1 - d)$
$\sigma^2 = \mu \cdot (1 - d) \cdot (1 + d \cdot \mu + \frac{\mu}{size})$
d: dropout rate. The dropout rate of a gene is strongly correlated with the mean expression of the gene. Different zero-inflated negative binomial models use different relationships between mu and d, and some may fit mu and d to the expression of each gene independently. Implemented in MAST, SCDE.
Poisson-Beta
[Figure: two histograms of Poisson-Beta read counts; x-axis "Read Counts", y-axis "Frequency"]
$\mu = g \cdot \frac{a}{a + b}$
$\sigma^2 = g^2 \cdot \frac{a \cdot b}{(a + b + 1) \cdot (a + b)^2}$
a: the rate of activation of transcription; b: the rate of inhibition of transcription; g: the rate of transcript production while transcription is active at the locus. Differential expression methods may test each of the parameters for differences across groups, or only one (often g). Implemented in BPSC.
May be further expanded to explicitly account for other sources of gene expression differences, such as batch effect or library depth, depending on the particular DE algorithm.
MAST (revisited)
uses a generalized linear hurdle model
designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene g, with $Z_{ig}$ indicating whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
A logistic regression model for the discrete variable Z and a Gaussian linear model for the continuous variable (Y|Z=1):
$\mathrm{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$
$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
SCDE
models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution
the NB distribution models the transcripts that are amplified and detected
the Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)
a subset of robust genes is used to fit, via the EM algorithm, the parameters of the mixture of models
For DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach
Monocle
Originally designed for ordering cells by progress through differentiation stages (pseudo-time)
The mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as
$g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$, where Y is a specific gene expression level, $x_i$ are predictor variables, g is a link function (typically the log function), and $f_i$ are non-parametric functions (e.g. cubic splines)
The observable expression level Y is then modeled using a GAM:
$E(Y) = s(\varphi_t(b_x, s_i)) + \epsilon$, where $\varphi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and s is a cubic smoothing function with three degrees of freedom. The error term $\epsilon$ is normally distributed with a mean of zero
The DE test is performed using an approximate $\chi^2$ likelihood ratio test
Common methods
Generic non-parametric methodseg Wilcoxon rank-sum test Kruskal-Wallis Kolmogorov-Smirnov testnon-parametric tests generally convert observed expression values toranks amp test whether the distribution of ranks for one group aresignficantly different from the distribution of ranks for the other groupsome non-parametric methods fail in the presence of a large number oftied values such as the case for dropouts (zeros) in single-cell RNA-seqexpression dataif the conditions for a parametric test hold then it will typically be morepowerful than a non-parametric test
Olga (NBIS) scRNA-seq DE February 2019 12 46
Common methods
developed for bulk RNA-seqeg edgeR DE-seq2compare estimates of mean-expression (sample size)based on negative binomial distributioncan be assessed by datasets where RNA-seq data has beeen validatedby RT-qPCR
Olga (NBIS) scRNA-seq DE February 2019 13 46
Common methods
developed for scRNA-seq
eg MAST SCDE Monocle Pagoda D3E etclarge number of samples (ie cells) for each group we are comparing insingle-cell experiments Thus we can take advantage of the wholedistribution of expression values in each group to identify differencesbetween groupswe usually do not have a defined set of experimental conditions insteadwe try to identify the cell groups by using an unsupervised clusteringapproach
Olga (NBIS) scRNA-seq DE February 2019 14 46
Common methods
Miao and Zhang 2016
Olga (NBIS) scRNA-seq DE February 2019 15 46
Common methodsSupplementary Table 2 Evaluated dicrarrerential expression methods together with package versions and thetype of input values provided to each of them Note that ldquoraw countsrdquo here refers to length-scaled TPMswhich are on the scale of the raw counts but are unacrarrected by dicrarrerential isoform usage [10] CPM valuesare calculated with edgeR and Census counts with monocle
Short name Method Software version InputAvailablefrom
Reference
BPSC BPSC BPSC 09901 CPM GitHub [11]
D3E D3E D3E 10 raw counts GitHub [12]
DESeq2 DESeq2 DESeq2 1141 raw counts Bioconductor [13]
DESeq2betapFALSE DESeq2 without beta prior DESeq2 1141 raw counts Bioconductor [13]
DESeq2census DESeq2 DESeq2 1141 Census counts Bioconductor [13]
DESeq2nofiltDESeq2 without the built-in in-dependent filtering
DESeq2 1141 raw counts Bioconductor [13]
DEsingle DEsingle DEsingle 010 raw counts GitHub [14]
edgeRLRT edgeRLRT edgeR 3191 raw counts Bioconductor [15ndash17]
edgeRLRTcensus edgeRLRT edgeR 3191 Census counts Bioconductor [15ndash17]
edgeRLRTdeconvedgeRLRT with deconvolutionnormalization
edgeR 3191scran 120
raw counts Bioconductor [15 17 18]
edgeRLRTrobustedgeRLRT with robust disper-sion estimation
edgeR 3191 raw counts Bioconductor [15ndash17 19]
edgeRQLF edgeRQLF edgeR 3191 raw counts Bioconductor [15 16 20]
edgeRQLFDetRateedgeRQLF with cellular detec-tion rate as covariate
edgeR 3191 raw counts Bioconductor [15 16 20]
limmatrend limma-trend limma 33013 log2(CPM) Bioconductor [21 22]
MASTcpm MAST MAST 105 log2(CPM+1) Bioconductor [23]
MASTcpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(CPM+1) Bioconductor [23]
MASTtpm MAST MAST 105 log2(TPM+1) Bioconductor [23]
MASTtpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(TPM+1) Bioconductor [23]
metagenomeSeq metagenomeSeqmetagenomeSeq1160
raw counts Bioconductor [24]
monocle monocle (tobit) monocle 220 TPM Bioconductor [25]
monoclecensus monocle (Negative Binomial) monocle 220 Census counts Bioconductor [25 26]
monoclecount monocle (Negative Binomial) monocle 220 raw counts Bioconductor [25]
NODES NODESNODES0009010
raw countsAuthor-providedlink
[27]
ROTScpm ROTS ROTS 120 CPM Bioconductor [28 29]
ROTStpm ROTS ROTS 120 TPM Bioconductor [28 29]
ROTSvoom ROTS ROTS 120voom-transformedraw counts
Bioconductor [28 29]
SAMseq SAMseq samr 20 raw counts CRAN [30]
scDD scDD scDD 100 raw counts Bioconductor [31]
SCDE SCDE scde 220 raw counts Bioconductor [32]
SeuratBimod Seurat (bimod test) Seurat 1407 raw counts GitHub [33 34]
SeuratBimodnofiltSeurat (bimod test) without theinternal filtering
Seurat 1407 raw counts GitHub [33 34]
SeuratBimodIsExpr2Seurat (bimod test) with internalexpression threshold set to 2
Seurat 1407 raw counts GitHub [33 34]
SeuratTobit Seurat (tobit test) Seurat 1407 TPM GitHub [25 33]
ttest t-test stats (R v 33)TMM-normalizedTPM
CRAN [16 35]
voomlimma voom-limma limma 33013 raw counts Bioconductor [21 22]
Wilcoxon Wilcoxon test stats (R v 33)TMM-normalizedTPM
CRAN [16 36]
3
Nature Methods doi101038nmeth4612Soneson and Robinson 2018
Olga (NBIS) scRNA-seq DE February 2019 16 46
Common methods More detailed examples
More detailed examples
Olga (NBIS) scRNA-seq DE February 2019 17 46
Common methods More detailed examples
MAST
uses generalized linear hurdle model
designed to account for stochastic dropouts and bimodal expression distribution in whichexpression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene gindicating whether gene g is expressed in cell i (ie Zig = 0 if yig = 0 and zig = 1 ifyig gt 0)
A logistic regression model for the discrete variable Z and a Gaussian linear model for thecontinuous variable (Y|Z=1)
logit(Pr (Zig = 1)) = XiβDg
Pr (Yig = Y |Zig = 1) = N(XiβCg σ
2g) where Xi is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
Olga (NBIS) scRNA-seq DE February 2019 18 46
Common methods More detailed examples
Letrsquos stop for a minute
Olga (NBIS) scRNA-seq DE February 2019 19 46
Common methods More detailed examples
The key
Outcomei = (Modeli) + errori
we collect data on a sample from a much larger population
statistics lets us to make inferences about the population from which sample wasderived
we try to predict the outcome given a model fitted to the data
Olga (NBIS) scRNA-seq DE February 2019 20 46
Common methods More detailed examples
The key
t = x1minusx2
sp
radic1
n1+ 1
n2
height [cm]
Fre
quen
cy
165 170 175 180
010
3050
Olga (NBIS) scRNA-seq DE February 2019 21 46
Common methods More detailed examples
Generic recipemodel data eg gene expressionfit model to the data andor data to the modelestimate model parametersuse model for prediction andor inference
Olga (NBIS) scRNA-seq DE February 2019 22 46
Common methods More detailed examples
Generic recipemodel eg gene expression with random errorfit model to the data andor data to the model estimate modelparametersuse model for prediction andor inference
Important implicationthe better model fits to the data the better statistics
Olga (NBIS) scRNA-seq DE February 2019 23 46
Common methods More detailed examples
Common distributions
Negative Binomial
Read Counts
Fre
quen
cy
0 5 10 15 20
010
020
030
040
050
0
Negative Binomial
Read Counts
Fre
quen
cy
0 2 4 6 8 10
050
100
150
Negative Binomial
Read Counts
Fre
quen
cy
0 2 4 6 8 10 12
050
100
150
200
micro = mu
δ2 = mu + mu2
size
mu mean expression size and the dispersion which is inversely related to the variance NB fits bulk RNA-seq data very welland it is used for most statistical methods designed for such data In addition it has been show to fit the distribution of moleculecounts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grun et al 2014 Islam et al 2011)
Olga (NBIS) scRNA-seq DE February 2019 24 46
Common methods More detailed examples
Common distributions
Zero-inflated NB
[Figure: two histograms of zero-inflated negative binomial read counts (x-axis: read counts, y-axis: frequency)]
$\mu = \mathrm{mu} \cdot (1 - d)$
$\sigma^2 = \mathrm{mu} \cdot (1 - d) \cdot \left(1 + d \cdot \mathrm{mu} + \dfrac{\mathrm{mu}}{\mathrm{size}}\right)$
d is the dropout rate; mu and size are the mean and dispersion parameter of the NB component. The dropout rate of a gene is strongly correlated with the mean expression of the gene. Different zero-inflated negative binomial models use different relationships between mu and d, and some may fit mu and d to the expression of each gene independently. Implemented in MAST, SCDE.
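The effect of the dropout rate d can be illustrated by simulation. This is a sketch under stated assumptions: a count is a dropout (zero) with probability d and an NB(mu, size) draw otherwise. The function names and parameter values are hypothetical.

```python
import math
import random
import statistics


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) value with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def sample_zinb(mu, size, d, rng):
    """Zero with probability d (dropout), otherwise an NB(mu, size) count."""
    if rng.random() < d:
        return 0
    rate = rng.gammavariate(size, mu / size)  # Gamma-Poisson mixture = NB
    return sample_poisson(rate, rng)


def zinb_moments(mu, size, d):
    """Theoretical mean and variance of the zero-inflated NB."""
    mean = mu * (1 - d)
    variance = mu * (1 - d) * (1 + d * mu + mu / size)
    return mean, variance


rng = random.Random(2)
mu, size, d = 10.0, 2.0, 0.3  # hypothetical parameters for one gene
counts = [sample_zinb(mu, size, d, rng) for _ in range(20000)]
expected_mean, expected_variance = zinb_moments(mu, size, d)

zinb_mean = statistics.fmean(counts)
zinb_variance = statistics.variance(counts)
print("mean:", round(zinb_mean, 2), "expected:", expected_mean)
print("variance:", round(zinb_variance, 2), "expected:", expected_variance)
```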
Common distributions
Poisson-Beta
[Figure: two histograms of Poisson-Beta read counts (x-axis: read counts, y-axis: frequency)]
$\mu = g \cdot \dfrac{a}{a + b}$
$\sigma^2 = g \cdot \dfrac{a}{a + b} + g^2 \cdot \dfrac{a \cdot b}{(a + b + 1) \cdot (a + b)^2}$ (Poisson sampling variance plus the variance of the Beta-distributed rate)
a is the rate of activation of transcription, b is the rate of inhibition of transcription and g is the rate of transcript production while transcription is active at the locus. Differential expression methods may test each of the parameters for differences across groups or only one (often g). Implemented in BPSC.
May be further expanded to explicitly account for other sources of gene expression differences, such as batch effect or library depth, depending on the particular DE algorithm.
MAST (revisited)
uses a generalized linear hurdle model
designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene g, with Z indicating whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
A logistic regression model for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):
$\mathrm{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$
$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
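The two-part idea can be sketched for the simplest design, two groups of cells and no other covariates. This is not MAST itself: it leaves out the empirical Bayes step, assumes at least two detected cells per group, and sums the two likelihood ratio statistics into one chi-square test with 2 degrees of freedom. All names and values are hypothetical.

```python
import math


def bernoulli_loglik(k, n):
    """Maximised log-likelihood of k detections among n cells."""
    if k == 0 or k == n:
        return 0.0
    p = k / n
    return k * math.log(p) + (n - k) * math.log(1 - p)


def rss(values):
    """Residual sum of squares around the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def hurdle_lrt(group_a, group_b):
    """Likelihood ratio test for one gene between two groups of cells.

    Inputs are log-scale expression values; 0 means 'not detected'.
    """
    pos_a = [v for v in group_a if v > 0]
    pos_b = [v for v in group_b if v > 0]

    # Discrete part (Z): one detection rate per group vs. one shared rate.
    ll_alt = (bernoulli_loglik(len(pos_a), len(group_a))
              + bernoulli_loglik(len(pos_b), len(group_b)))
    ll_null = bernoulli_loglik(len(pos_a) + len(pos_b),
                               len(group_a) + len(group_b))
    stat_discrete = 2 * (ll_alt - ll_null)

    # Continuous part (Y | Z = 1): group means vs. one shared mean,
    # Gaussian errors with a common variance.
    n_pos = len(pos_a) + len(pos_b)
    stat_continuous = n_pos * math.log(rss(pos_a + pos_b) / (rss(pos_a) + rss(pos_b)))

    stat = stat_discrete + stat_continuous
    p_value = math.exp(-stat / 2)  # chi-square survival function for 2 df
    return stat, p_value


# Hypothetical log-expression values of one gene in two groups of 8 cells.
cells_a = [0, 0, 2.1, 2.5, 1.9, 0, 2.3, 2.2]
cells_b = [3.9, 4.2, 0, 4.1, 3.8, 4.4, 4.0, 3.7]
stat, p_value = hurdle_lrt(cells_a, cells_b)
print(f"LRT statistic = {stat:.2f}, p-value = {p_value:.2e}")
```

In this toy gene the detection rates are similar (5/8 vs 7/8), so most of the signal comes from the continuous part.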
SCDE
models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution
the NB distribution models the transcripts that are amplified and detected
the Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)
a subset of robust genes is used to fit, via the EM algorithm, the parameters of the mixture of models
For DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach
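The mixture idea can be illustrated with the quantity an EM algorithm updates in its E-step: the posterior probability that an observed count came from the Poisson (dropout) component rather than the NB (amplified) component. This is a sketch, not SCDE's implementation; the mixing weight, the NB parameters and the background rate of 0.1 are assumed values.

```python
import math


def nb_pmf(x, mu, size):
    """Negative binomial probability of count x (mean mu, dispersion parameter size)."""
    log_p = (math.lgamma(x + size) - math.lgamma(size) - math.lgamma(x + 1)
             + size * math.log(size / (size + mu))
             + x * math.log(mu / (size + mu)))
    return math.exp(log_p)


def poisson_pmf(x, lam):
    """Poisson probability of count x with rate lam."""
    return math.exp(x * math.log(lam) - lam - math.lgamma(x + 1))


def dropout_posterior(x, dropout_weight, mu, size, background_rate=0.1):
    """Posterior probability that count x came from the Poisson (dropout) component."""
    dropout_part = dropout_weight * poisson_pmf(x, background_rate)
    amplified_part = (1 - dropout_weight) * nb_pmf(x, mu, size)
    return dropout_part / (dropout_part + amplified_part)


# Hypothetical gene: 30% dropouts, amplified counts ~ NB(mu=50, size=2).
for count in (0, 1, 5, 40):
    posterior = dropout_posterior(count, dropout_weight=0.3, mu=50.0, size=2.0)
    print(f"count {count:>2}: P(dropout | count) = {posterior:.4f}")
```

Zero and near-zero counts are attributed almost entirely to the dropout component, while larger counts are attributed to the NB component.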
Monocle
Originally designed for ordering cells by progress through differentiation stages (pseudo-time)
The mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as
$g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$, where Y is a specific gene expression level, $x_i$ are predictor variables, g is a link function (typically the log function) and $f_i$ are non-parametric functions (e.g. cubic splines)
The observable expression level Y is then modeled using a GAM:
$E(Y) = s(\varphi_t(b_x, s_i)) + \epsilon$, where $\varphi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and s is a cubic smoothing function with three degrees of freedom. The error term $\epsilon$ is normally distributed with a mean of zero
The DE test is performed using an approximate $\chi^2$ likelihood ratio test
Performance
Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that “raw counts” here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle.

| Short name | Method | Software version | Input | Available from | Reference |
|---|---|---|---|---|---|
| BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11] |
| D3E | D3E | D3E 1.0 | raw counts | GitHub | [12] |
| DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13] |
| DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14] |
| edgeRLRT | edgeRLRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17] |
| edgeRLRTcensus | edgeRLRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17] |
| edgeRLRTdeconv | edgeRLRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18] |
| edgeRLRTrobust | edgeRLRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19] |
| edgeRQLF | edgeRQLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20] |
| edgeRQLFDetRate | edgeRQLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20] |
| limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22] |
| MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23] |
| MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23] |
| MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23] |
| MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23] |
| metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24] |
| monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25] |
| monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26] |
| monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25] |
| NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27] |
| ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29] |
| ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29] |
| ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29] |
| SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30] |
| scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31] |
| SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32] |
| SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33] |
| ttest | t-test | stats (R v3.3) | TMM-normalized TPM | CRAN | [16, 35] |
| voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22] |
| Wilcoxon | Wilcoxon test | stats (R v3.3) | TMM-normalized TPM | CRAN | [16, 36] |

Source: Nature Methods, doi:10.1038/nmeth.4612
No ground truth, i.e. no independently validated truth is available for testing
Known data: using data we know something about to get positive controls
Simulated data: null data sets by re-sampling; modeling data sets based on various distributions
Comparing between methods and scenarios: comparing numbers of DE genes, incl. as a function of group size
Investigating results: what do the expression levels and distributions of the detected DE genes look like?
False positives (type I error) vs false negatives (type II error)
Sensitivity and specificity
Precision and recall
[Figure: adapted from Wikipedia]
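When the truly DE genes are known, e.g. in simulated data, these measures follow directly from the four cells of the confusion matrix. This is a minimal sketch with made-up gene names; it assumes that both gene lists are subsets of `all_genes` and that no denominator is zero.

```python
def classification_metrics(called_de, truly_de, all_genes):
    """Compare the genes called DE by a method against the known truth."""
    called, truth = set(called_de), set(truly_de)
    tp = len(called & truth)   # true positives: DE and called DE
    fp = len(called - truth)   # false positives (type I errors)
    fn = len(truth - called)   # false negatives (type II errors)
    tn = len(set(all_genes)) - tp - fp - fn
    return {
        "sensitivity (recall)": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "false positive rate": fp / (fp + tn),
    }


# Hypothetical example: 10 genes, 4 truly DE, 4 called DE by some method.
all_genes = [f"g{i}" for i in range(10)]
truly_de = ["g0", "g1", "g2", "g3"]
called_de = ["g0", "g1", "g2", "g7"]
metrics = classification_metrics(called_de, truly_de, all_genes)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```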
Dal Molin, Baruzzo and Di Camillo 2017: 2 conditions of 100 cells each, simulated with 10 000 genes, out of which 2 000 were set to be DE (based on NB and bimodal distributions)
Consistency
Miao et al 2017
And so much more
Soneson and Robinson 2018
Bias, robustness and scalability in single-cell differential expression analysis
36 statistical approaches for DE analysis to compare the expression levels in the two groups of cells
based on 9 data sets with 11 - 21 separate instances (sample size effect)
extensive evaluation metrics incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods etc.
conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets
Practicalities
Getting to know your data
Example data: 46078 genes x 96 cells
22229 genes with no expression at all
[Figure: two histograms; left: read counts (x-axis) against frequency (y-axis); right: number of 0 counts (x-axis) against frequency (y-axis)]
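A first look at the zeros in a count matrix could go along these lines. The matrix here is a simulated toy example (1000 genes x 96 cells), not the data set from the slide, and all variable names are hypothetical.

```python
import random

rng = random.Random(0)
n_genes, n_cells = 1000, 96  # toy size, not the 46078 x 96 example

# Toy count matrix (genes x cells): about half of the genes are never detected.
counts = []
for _ in range(n_genes):
    if rng.random() < 0.5:
        counts.append([0] * n_cells)
    else:
        counts.append([rng.choice([0, 0, 1, 3, 10]) for _ in range(n_cells)])

# Genes with no expression at all, i.e. zero counts in every cell.
genes_all_zero = sum(1 for row in counts if not any(row))

# Number of cells with a zero count, for each gene (input for a histogram).
zeros_per_gene = [row.count(0) for row in counts]

print(f"{genes_all_zero} of {n_genes} genes have no expression at all")
print("largest number of zero cells for a gene:", max(zeros_per_gene))
```

Genes without any counts carry no information for DE testing and are commonly filtered out before the analysis.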
Choosing DE methods
Soneson and Robinson 2018
Remembering the bigger picture
Stegle Teichmann and Marioni 2015
QC filtering
Cell-cycle phase
Normalization of cell-specific biases
Confounding factors incl. batch effects
Detection rate, i.e. the fraction of detected genes per cell
Imputation strategies for dropout values
What is pragmatic: programming language, platform, speed, collaborative workflows etc.
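The detection rate from the list above is easy to compute per cell and can then be included as a covariate in a DE model. This is a minimal sketch; the cell names and counts are made up.

```python
def detection_rate(cell_counts):
    """Fraction of genes with at least one count in a single cell."""
    return sum(1 for c in cell_counts if c > 0) / len(cell_counts)


# Hypothetical counts: one list of gene counts per cell.
cells = {
    "cell_1": [0, 3, 0, 12, 1, 0, 0, 7],
    "cell_2": [5, 0, 0, 0, 0, 0, 2, 0],
}
rates = {name: detection_rate(row) for name, row in cells.items()}
for name, rate in rates.items():
    print(f"{name}: detection rate = {rate:.2f}")
```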
Staying critical
Summary
What to remember from this hour
https://www.menti.com & 25 06 78
Growing field
Angerer et al 2017
https://www.scrna-tools.org/tools
Zappia Phipson and Oshlack 2018
Summary
scRNA-seq is a rapidly growing field
DE is a common task, so many newer and better methods will be developed
understanding basic statistical concepts enables one to think more like a statistician, and to choose and evaluate methods for a given data set
staying critical, staying updated, staying connected
Bibliography
Wu, Zhijin, et al. 2017. “Two-phase differential expression analysis for single cell RNA-seq.” Bioinformatics 00 (00): 1–9. ISSN 1367-4803. doi:10.1093/bioinformatics/bty329
Miao, Zhun, and Xuegong Zhang. 2016. “Differential expression analyses for single-cell RNA-Seq: old questions on new data.” Quantitative Biology 4 (4): 243–260. ISSN 2095-4697. doi:10.1007/s40484-016-0089-7
Soneson, Charlotte, and Mark D. Robinson. 2018. “Bias, robustness and scalability in single-cell differential expression analysis.” Nature Methods 15 (4): 255–261. ISSN 1548-7105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612
Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. “Single-cell RNA-sequencing: Assessment of differential expression analysis methods.” Frontiers in Genetics 8 (MAY). ISSN 1664-8021. doi:10.3389/fgene.2017.00062
Miao, Zhun, et al. 2017. “DEsingle for detecting three types of differential expression in single-cell RNA-seq data.” no. May: 1–2. ISSN 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549
Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. “Computational and analytical challenges in single-cell transcriptomics.” Nature Reviews Genetics 16 (January 2014): 133–145
Angerer, Philipp, et al. 2017. “Single cells make big data: New challenges and opportunities in transcriptomics.” Current Opinion in Systems Biology 4: 85–91. ISSN 2452-3100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004
Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. “Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database.” bioRxiv: 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573
Common methods
developed for bulk RNA-seq
e.g. edgeR, DESeq2
compare estimates of mean expression (sample size)
based on the negative binomial distribution
can be assessed by datasets where RNA-seq data has been validated by RT-qPCR
developed for scRNA-seq
e.g. MAST, SCDE, Monocle, Pagoda, D3E etc.
large number of samples (i.e. cells) for each group we are comparing in single-cell experiments. Thus we can take advantage of the whole distribution of expression values in each group to identify differences between groups
we usually do not have a defined set of experimental conditions; instead we try to identify the cell groups by using an unsupervised clustering approach
Miao and Zhang 2016
Olga (NBIS) scRNA-seq DE February 2019 18 46
Common methods More detailed examples
Letrsquos stop for a minute
Olga (NBIS) scRNA-seq DE February 2019 19 46
Common methods More detailed examples
The key
Outcomei = (Modeli) + errori
we collect data on a sample from a much larger population
statistics lets us to make inferences about the population from which sample wasderived
we try to predict the outcome given a model fitted to the data
Olga (NBIS) scRNA-seq DE February 2019 20 46
Common methods More detailed examples
The key
t = x1minusx2
sp
radic1
n1+ 1
n2
height [cm]
Fre
quen
cy
165 170 175 180
010
3050
Olga (NBIS) scRNA-seq DE February 2019 21 46
Common methods More detailed examples
Generic recipemodel data eg gene expressionfit model to the data andor data to the modelestimate model parametersuse model for prediction andor inference
Olga (NBIS) scRNA-seq DE February 2019 22 46
Common methods More detailed examples
Generic recipemodel eg gene expression with random errorfit model to the data andor data to the model estimate modelparametersuse model for prediction andor inference
Important implicationthe better model fits to the data the better statistics
Olga (NBIS) scRNA-seq DE February 2019 23 46
Common methods More detailed examples
Common distributions
Negative Binomial
Read Counts
Fre
quen
cy
0 5 10 15 20
010
020
030
040
050
0
Negative Binomial
Read Counts
Fre
quen
cy
0 2 4 6 8 10
050
100
150
Negative Binomial
Read Counts
Fre
quen
cy
0 2 4 6 8 10 12
050
100
150
200
micro = mu
δ2 = mu + mu2
size
mu mean expression size and the dispersion which is inversely related to the variance NB fits bulk RNA-seq data very welland it is used for most statistical methods designed for such data In addition it has been show to fit the distribution of moleculecounts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grun et al 2014 Islam et al 2011)
Olga (NBIS) scRNA-seq DE February 2019 24 46
Common methods More detailed examples
Common distributions
Zerominusinflated NB
Read Counts
Fre
quen
cy
0 5 10 15 20
020
040
0
Zerominusinflated NB
Read Counts
Fre
quen
cy
0 10 20 30 40
010
020
0micro = mu lowast (1 minus d)
δ2 = micro lowast (1 minus d) lowast (1 + d lowast micro + microsize)
d dropout rate The dropout rate of a gene is strongly correlated with the mean expression of the gene Different zero-inflatednegative binomial models use different relationships between mu and d and some may fit mu and d to the expression of eachgene independently Implemented in MAST SCDE
Olga (NBIS) scRNA-seq DE February 2019 25 46
Common methods More detailed examples
Common distributions
PoissonminusBeta
Read Counts
Fre
quen
cy
0 20 60 100
020
040
0
PoissonminusBeta
Read Counts
Fre
quen
cy
0 20 60 100 140
050
150
micro = g lowast a(a + b)
δ2 = g2 lowast a lowast b((a + b + 1) lowast (a + b)2)
a the rate of activation of transcription b the rate of inhibition of transcription and g the rate of transcript production whiletranscription is active at the locus Differential expression methods may test each of the parameters for differences across groupsor only one (often g) Implemented in BPSCMay be further expanded to explicitly account for other sources of gene expression differences such as batch-effect or librarydepth depending on the particular DE algorithm
Olga (NBIS) scRNA-seq DE February 2019 26 46
Common methods More detailed examples
MAST (revisited)
uses generalized linear hurdle model
designed to account for stochastic dropouts and bimodal expression distribution in whichexpression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene gindicating whether gene g is expressed in cell i (ie Zig = 0 if yig = 0 and zig = 1 ifyig gt 0)
A logistic regression model for the discrete variable Z and a Gaussian linear model for thecontinuous variable (Y|Z=1)
logit(Pr (Zig = 1)) = XiβDg
Pr (Yig = Y |Zig = 1) = N(XiβCg σ
2g) where Xi is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
Olga (NBIS) scRNA-seq DE February 2019 27 46
Common methods More detailed examples
SCDE
models the read counts for each gene using a mixture of a NB negative binomial and aPoisson distribution
NB distribution models the transcripts that are amplified and detected
Poisson distribution models the unobserved or background-level signal of transcripts thatare not amplified (eg dropout events)
subset of robust genes is used to fit via EM algorithm the parameters to the mixture ofmodels
For DE the posterior probability that the gene shows a fold expression difference betweentwo conditions is computed using a Bayesian approach
Olga (NBIS) scRNA-seq DE February 2019 28 46
Common methods More detailed examples
Monocole
Originally designed for ordering cells by progress through differentiation stages(pseudo-time)
The mean expression level of each gene is modeled with a GAM generalized additivemodel which relates one or more predictor variables to a response variable as
g(E(Y )) = β0 + f1(x1) + f2(x2) + + fm(xm) where Y is a specific gene expression level xi arepredictor variables g is a link function typically log function and fi are non-parametric functions
(eg cubic splines)
The observable expression level Y is then modeled using GAM
E(Y ) = s(ϕt (bx si )) + ε where ϕt (bx si ) is the assigned pseudo-time of a cell and s is a cubicsmoothing function with three degrees of freedom The error term ε is normally distributed with amean of zero
The DE test is performed using an approx χ2 likelihood ratio test
Olga (NBIS) scRNA-seq DE February 2019 29 46
Performance
Performance
Olga (NBIS) scRNA-seq DE February 2019 30 46
Supplementary Table 2 Evaluated dicrarrerential expression methods together with package versions and thetype of input values provided to each of them Note that ldquoraw countsrdquo here refers to length-scaled TPMswhich are on the scale of the raw counts but are unacrarrected by dicrarrerential isoform usage [10] CPM valuesare calculated with edgeR and Census counts with monocle
Short name Method Software version InputAvailablefrom
Reference
BPSC BPSC BPSC 09901 CPM GitHub [11]
D3E D3E D3E 10 raw counts GitHub [12]
DESeq2 DESeq2 DESeq2 1141 raw counts Bioconductor [13]
DESeq2betapFALSE DESeq2 without beta prior DESeq2 1141 raw counts Bioconductor [13]
DESeq2census DESeq2 DESeq2 1141 Census counts Bioconductor [13]
DESeq2nofiltDESeq2 without the built-in in-dependent filtering
DESeq2 1141 raw counts Bioconductor [13]
DEsingle DEsingle DEsingle 010 raw counts GitHub [14]
edgeRLRT edgeRLRT edgeR 3191 raw counts Bioconductor [15ndash17]
edgeRLRTcensus edgeRLRT edgeR 3191 Census counts Bioconductor [15ndash17]
edgeRLRTdeconvedgeRLRT with deconvolutionnormalization
edgeR 3191scran 120
raw counts Bioconductor [15 17 18]
edgeRLRTrobustedgeRLRT with robust disper-sion estimation
edgeR 3191 raw counts Bioconductor [15ndash17 19]
edgeRQLF edgeRQLF edgeR 3191 raw counts Bioconductor [15 16 20]
edgeRQLFDetRateedgeRQLF with cellular detec-tion rate as covariate
edgeR 3191 raw counts Bioconductor [15 16 20]
limmatrend limma-trend limma 33013 log2(CPM) Bioconductor [21 22]
MASTcpm MAST MAST 105 log2(CPM+1) Bioconductor [23]
MASTcpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(CPM+1) Bioconductor [23]
MASTtpm MAST MAST 105 log2(TPM+1) Bioconductor [23]
MASTtpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(TPM+1) Bioconductor [23]
metagenomeSeq metagenomeSeqmetagenomeSeq1160
raw counts Bioconductor [24]
monocle monocle (tobit) monocle 220 TPM Bioconductor [25]
monoclecensus monocle (Negative Binomial) monocle 220 Census counts Bioconductor [25 26]
monoclecount monocle (Negative Binomial) monocle 220 raw counts Bioconductor [25]
NODES NODESNODES0009010
raw countsAuthor-providedlink
[27]
ROTScpm ROTS ROTS 120 CPM Bioconductor [28 29]
ROTStpm ROTS ROTS 120 TPM Bioconductor [28 29]
ROTSvoom ROTS ROTS 120voom-transformedraw counts
Bioconductor [28 29]
SAMseq SAMseq samr 20 raw counts CRAN [30]
scDD scDD scDD 100 raw counts Bioconductor [31]
SCDE SCDE scde 220 raw counts Bioconductor [32]
SeuratBimod Seurat (bimod test) Seurat 1407 raw counts GitHub [33 34]
SeuratBimodnofiltSeurat (bimod test) without theinternal filtering
Seurat 1407 raw counts GitHub [33 34]
SeuratBimodIsExpr2Seurat (bimod test) with internalexpression threshold set to 2
Seurat 1407 raw counts GitHub [33 34]
SeuratTobit Seurat (tobit test) Seurat 1407 TPM GitHub [25 33]
ttest t-test stats (R v 33)TMM-normalizedTPM
CRAN [16 35]
voomlimma voom-limma limma 33013 raw counts Bioconductor [21 22]
Wilcoxon Wilcoxon test stats (R v 33)TMM-normalizedTPM
CRAN [16 36]
3
Nature Methods doi101038nmeth4612
Performance
Olga (NBIS) scRNA-seq DE February 2019 31 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
False positives (type I error) vs false negatives (type II error)Sensitivity and specificityPrecision and recall
adapted from Wikipedia
Olga (NBIS) scRNA-seq DE February 2019 33 46
Performance
False positives (type I error) vs false negatives (type II error)Sensitivity and specificityPrecision and recall
Dal Molin Baruzzo and Di Camillo 2017 2 conditions of 100 cells each simulated with 10 000genes out of which 2 000 set to DEs (based on NB and bimodal distributions)
Olga (NBIS) scRNA-seq DE February 2019 34 46
Performance
Consistency
Miao et al 2017
Olga (NBIS) scRNA-seq DE February 2019 35 46
Performance
And so much more
Soneson and Robinson 2018
Bias robustness and scalability in single-celldifferential expression analysis
36 statistical approaches for DE analysisto compare the expression levels in thetwo groups of cells
based on 9 data sets with 11 - 21separate instances (sample size effect)
extensive evaluation metrics incl numberof genes found characteristics of the falsepositive detections robustness ofmethods similarities between methodsetc
conquer a collection of consistentlyprocessed analysis-ready publicscRNA-seq data sets
Olga (NBIS) scRNA-seq DE February 2019 36 46
Practicalities
Practicalities
Olga (NBIS) scRNA-seq DE February 2019 37 46
Practicalities
Getting to know your dataExample data 46078 genes x 96 cells22229 genes with no expression at all
Read Counts
Fre
quen
cy
0 500 1000 1500
050
0015
000
0 counts
Fre
quen
cy
0 20 40 60 800
2000
4000
6000
Olga (NBIS) scRNA-seq DE February 2019 38 46
Practicalities
Choosing DE methods
Soneson and Robinson 2018
Olga (NBIS) scRNA-seq DE February 2019 39 46
Practicalities
Remembering the bigger picture
Stegle, Teichmann and Marioni 2015
QC filtering
Cell-cycle phase
Normalization of cell-specific biases
Confounding factors, incl. batch effects
Detection rate, i.e. the fraction of detected genes per cell
Imputation strategies for dropout values
What is pragmatic: programming language, platform, speed, collaborative workflows, etc.
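Of the items above, the detection rate is the simplest to compute. The sketch below assumes a genes x cells count matrix stored as a list of rows, and treats a gene as detected when its count exceeds a threshold (0 by default); both choices are illustrative assumptions.

```python
def detection_rate(counts, cell_index, threshold=0):
    """Fraction of genes with a count above `threshold` in one cell."""
    n_genes = len(counts)
    detected = sum(1 for gene_row in counts if gene_row[cell_index] > threshold)
    return detected / n_genes


# Toy matrix: 4 genes x 3 cells (hypothetical values).
toy = [
    [3, 0, 1],
    [0, 0, 2],
    [7, 1, 0],
    [0, 0, 4],
]
rates = [detection_rate(toy, j) for j in range(3)]
```

The resulting per-cell value can then be supplied as a covariate to methods that accept one.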
Staying critical
Summary
What to remember from this hour
https://www.menti.com & 25 06 78
Growing field
Angerer et al 2017
https://www.scrna-tools.org/tools
Zappia Phipson and Oshlack 2018
scRNA-seq is a rapidly growing field
DE is a common task, so many newer and better methods will be developed
understanding basic statistical concepts enables one to think more like a statistician, to choose and evaluate methods given a data set
staying critical, staying updated, staying connected
Bibliography
Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1–9. ISSN 1367-4803. doi:10.1093/bioinformatics/bty329.
Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243–260. ISSN 2095-4697. doi:10.1007/s40484-016-0089-7.
Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255–261. ISSN 1548-7105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612.
Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN 1664-8021. doi:10.3389/fgene.2017.00062.
Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data," no. May: 1–2. ISSN 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549.
Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133–145.
Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85–91. ISSN 2452-3100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004.
Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573.
Common methods
developed for scRNA-seq
e.g. MAST, SCDE, Monocle, Pagoda, D3E, etc.
there is a large number of samples (i.e. cells) in each group we are comparing in single-cell experiments; thus we can take advantage of the whole distribution of expression values in each group to identify differences between groups
we usually do not have a defined set of experimental conditions; instead, we try to identify the cell groups by using an unsupervised clustering approach
Miao and Zhang 2016
Supplementary Table 2. Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle.

Short name | Method | Software version | Input | Available from | Reference
BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11]
D3E | D3E | D3E 1.0 | raw counts | GitHub | [12]
DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13]
DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14]
edgeRLRT | edgeRLRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17]
edgeRLRTcensus | edgeRLRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17]
edgeRLRTdeconv | edgeRLRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18]
edgeRLRTrobust | edgeRLRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19]
edgeRQLF | edgeRQLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
edgeRQLFDetRate | edgeRQLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22]
MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24]
monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25]
monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26]
monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25]
NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27]
ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29]
ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29]
ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29]
SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30]
scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31]
SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32]
SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33]
ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35]
voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22]
Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36]

Source: Soneson and Robinson 2018, Nature Methods, doi:10.1038/nmeth.4612
More detailed examples
MAST
uses a generalized linear hurdle model
designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene g, with Z indicating whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
A logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):
$\operatorname{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$
$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
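To make the two-part idea concrete, here is a minimal sketch of a hurdle likelihood ratio test for one gene and two groups of cells. It is not the MAST implementation: it assumes no covariates other than group, uses plain maximum likelihood without the empirical Bayes step, and always refers the summed statistic to a chi-square distribution with 2 degrees of freedom. The function names are hypothetical.

```python
import math


def _binom_loglik(k, n):
    """Maximised Bernoulli log-likelihood for k expressing cells out of n."""
    if k == 0 or k == n:
        return 0.0
    p = k / n
    return k * math.log(p) + (n - k) * math.log(1 - p)


def _rss(values):
    """Residual sum of squares around the mean."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def hurdle_lrt(group_a, group_b):
    """Two-group hurdle LRT; inputs are log-scale expression, 0 = not detected."""
    pos_a = [y for y in group_a if y > 0]
    pos_b = [y for y in group_b if y > 0]

    # Discrete part (Z): does the fraction of expressing cells differ?
    ll_alt = (_binom_loglik(len(pos_a), len(group_a))
              + _binom_loglik(len(pos_b), len(group_b)))
    ll_null = _binom_loglik(len(pos_a) + len(pos_b),
                            len(group_a) + len(group_b))
    stat_discrete = 2 * (ll_alt - ll_null)

    # Continuous part (Y | Z = 1): Gaussian with a common variance.
    rss_null = _rss(pos_a + pos_b)
    rss_alt = _rss(pos_a) + _rss(pos_b)
    if pos_a and pos_b and rss_alt > 0:
        n_pos = len(pos_a) + len(pos_b)
        stat_continuous = n_pos * math.log(rss_null / rss_alt)
    else:
        stat_continuous = 0.0  # continuous part not estimable in this sketch

    stat = max(stat_discrete + stat_continuous, 0.0)
    p_value = math.exp(-stat / 2)  # chi-square survival function for 2 df
    return stat, p_value


same_stat, same_p = hurdle_lrt([0, 0, 1.0, 2.0], [0, 0, 1.0, 2.0])
diff_stat, diff_p = hurdle_lrt([0, 0, 0, 0, 0, 1.0], [3.0, 3.5, 4.0, 3.2, 0, 3.8])
```

Summing the two statistics lets a gene be called DE because the fraction of expressing cells changes, because the expression level among expressing cells changes, or both.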
Let's stop for a minute
The key
$\text{Outcome}_i = (\text{Model}_i) + \text{error}_i$
we collect data on a sample from a much larger population
statistics lets us make inferences about the population from which the sample was derived
we try to predict the outcome given a model fitted to the data
$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$
Figure: histogram of height [cm] (y-axis: frequency)
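The statistic above is the pooled two-sample t statistic, with $s_p$ the pooled standard deviation. A small sketch, using made-up height measurements in cm for two groups:

```python
import math


def pooled_t_statistic(x1, x2):
    """Two-sample t statistic with a pooled standard deviation."""
    n1, n2 = len(x1), len(x2)
    mean1, mean2 = sum(x1) / n1, sum(x2) / n2
    ss1 = sum((v - mean1) ** 2 for v in x1)
    ss2 = sum((v - mean2) ** 2 for v in x2)
    sp = math.sqrt((ss1 + ss2) / (n1 + n2 - 2))  # pooled standard deviation
    return (mean1 - mean2) / (sp * math.sqrt(1 / n1 + 1 / n2))


# Hypothetical heights [cm] for two groups of four people.
heights_a = [168.0, 172.0, 170.0, 174.0]
heights_b = [171.0, 175.0, 173.0, 177.0]
t_stat = pooled_t_statistic(heights_a, heights_b)
```

The model behind this test is that both groups are normally distributed with a common variance, which is exactly the kind of assumption that may or may not hold for scRNA-seq expression values.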
Generic recipe
model the data, e.g. gene expression with random error
fit the model to the data and/or the data to the model, estimate model parameters
use the model for prediction and/or inference
Important implication: the better the model fits the data, the better the statistics
Common distributions
Negative Binomial
Figure: three histograms of negative binomial read counts (x-axis: read counts; y-axis: frequency)
$\mu = \mathrm{mu}$
$\sigma^2 = \mathrm{mu} + \dfrac{\mathrm{mu}^2}{\mathrm{size}}$
mu is the mean expression and size is the dispersion parameter, which is inversely related to the variance. NB fits bulk RNA-seq data very well and is used by most statistical methods designed for such data. In addition, it has been shown to fit the distribution of molecule counts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grün et al. 2014, Islam et al. 2011).
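The mean-variance relationship of the negative binomial can be turned around to estimate size from data. The sketch below is a simple method-of-moments estimate, given only as an illustration; real DE packages use more careful dispersion estimators, and the function names here are hypothetical.

```python
def nb_variance(mu, size):
    """Variance of a negative binomial with mean mu and dispersion parameter size."""
    return mu + mu ** 2 / size


def estimate_size(counts):
    """Method-of-moments estimate of size from observed counts for one gene."""
    n = len(counts)
    mean = sum(counts) / n
    var = sum((c - mean) ** 2 for c in counts) / (n - 1)
    if var <= mean:
        return float("inf")  # no overdispersion: Poisson limit
    return mean ** 2 / (var - mean)


variance_example = nb_variance(mu=10, size=5)
size_example = estimate_size([0, 2, 4, 10, 4])
```

A small size means the variance is much larger than the mean; as size grows, the distribution approaches the Poisson.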
Zero-inflated NB
Figure: two histograms of zero-inflated negative binomial read counts (x-axis: read counts; y-axis: frequency)
$\mu = \mathrm{mu} \cdot (1 - d)$
$\sigma^2 = \mathrm{mu} \cdot (1 - d) \cdot \left(1 + d \cdot \mathrm{mu} + \dfrac{\mathrm{mu}}{\mathrm{size}}\right)$
d is the dropout rate. The dropout rate of a gene is strongly correlated with the mean expression of the gene. Different zero-inflated negative binomial models use different relationships between mu and d, and some may fit mu and d to the expression of each gene independently. Implemented in MAST, SCDE.
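The zero-inflated NB moments can be checked by simulation. The sketch below assumes the usual construction: with probability d the count is a dropout (zero), otherwise it is drawn from a negative binomial, generated here as a gamma-Poisson mixture. Parameter values and function names are illustrative only.

```python
import math
import random


def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for the small rates used here."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def sample_zinb(mu, size, d, rng):
    """One zero-inflated NB draw: dropout with probability d, else NB(mu, size)."""
    if rng.random() < d:
        return 0
    rate = rng.gammavariate(size, mu / size)  # gamma with shape size, mean mu
    return sample_poisson(rate, rng)


def zinb_moments(mu, size, d):
    """Theoretical mean and variance of the zero-inflated NB."""
    mean = mu * (1 - d)
    variance = mu * (1 - d) * (1 + d * mu + mu / size)
    return mean, variance


rng = random.Random(1)
draws = [sample_zinb(mu=5.0, size=2.0, d=0.3, rng=rng) for _ in range(20000)]
empirical_mean = sum(draws) / len(draws)
empirical_var = sum((x - empirical_mean) ** 2 for x in draws) / (len(draws) - 1)
expected_mean, expected_var = zinb_moments(5.0, 2.0, 0.3)
```

Comparing `empirical_mean` and `empirical_var` with `expected_mean` and `expected_var` shows how dropouts lower the mean while inflating the variance relative to it.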
Poisson-Beta
Figure: two histograms of Poisson-Beta read counts (x-axis: read counts; y-axis: frequency)
$\mu = g \cdot \dfrac{a}{a + b}$
$\sigma^2 = g^2 \cdot \dfrac{a \cdot b}{(a + b + 1) \cdot (a + b)^2}$
a is the rate of activation of transcription, b is the rate of inhibition of transcription, and g is the rate of transcript production while transcription is active at the locus. Differential expression methods may test each of the parameters for differences across groups, or only one (often g). Implemented in BPSC.
May be further expanded to explicitly account for other sources of gene expression differences, such as batch effect or library depth, depending on the particular DE algorithm.
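A short sketch of these moments follows. One assumption is made explicit here: if counts are Poisson given a rate g·p with p drawn from Beta(a, b), then the variance formula above describes the rate g·p itself, and by the law of total variance the observed counts carry an additional Poisson term equal to the mean. The function name is hypothetical.

```python
def poisson_beta_moments(a, b, g):
    """Mean, rate variance and count variance of a Poisson-Beta model."""
    mean = g * a / (a + b)
    rate_variance = g ** 2 * a * b / ((a + b + 1) * (a + b) ** 2)
    count_variance = mean + rate_variance  # Poisson noise on top of the rate
    return mean, rate_variance, count_variance


pb_mean, pb_rate_var, pb_count_var = poisson_beta_moments(a=1.0, b=1.0, g=10.0)
```

With small a and b the Beta distribution piles up near 0 and 1, which is what produces the bimodal "on/off" count distributions shown for this model.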
SCDE
models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution
NB distribution models the transcripts that are amplified and detected
Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)
a subset of robust genes is used to fit, via the EM algorithm, the parameters of the mixture of models
For DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach
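The mixture idea can be illustrated with the E-step of an EM algorithm: given mixture parameters, compute the posterior probability that an observed count came from the background (dropout) component. This is only a sketch of the general technique, not the SCDE fitting procedure; all parameter names and values are hypothetical.

```python
import math


def poisson_pmf(x, lam):
    """Poisson probability of count x with rate lam > 0."""
    return math.exp(x * math.log(lam) - lam - math.lgamma(x + 1))


def nb_pmf(x, mu, size):
    """Negative binomial probability of count x with mean mu and dispersion size."""
    log_p = (math.lgamma(x + size) - math.lgamma(size) - math.lgamma(x + 1)
             + size * math.log(size / (size + mu))
             + x * math.log(mu / (size + mu)))
    return math.exp(log_p)


def dropout_responsibility(x, dropout_weight, background_rate, mu, size):
    """Posterior probability that count x belongs to the Poisson (dropout) component."""
    drop = dropout_weight * poisson_pmf(x, background_rate)
    amplified = (1 - dropout_weight) * nb_pmf(x, mu, size)
    return drop / (drop + amplified)


params = dict(dropout_weight=0.3, background_rate=0.1, mu=50.0, size=2.0)
resp_zero = dropout_responsibility(0, **params)
resp_high = dropout_responsibility(50, **params)
```

A zero count is attributed almost entirely to the dropout component, whereas a count near the NB mean is attributed to the amplified component.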
Monocle
Originally designed for ordering cells by progress through differentiation stages (pseudo-time)
The mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as
$g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$, where Y is a specific gene expression level, $x_i$ are predictor variables, g is a link function (typically the log function) and $f_i$ are non-parametric functions (e.g. cubic splines)
The observable expression level Y is then modeled using a GAM:
$E(Y) = s(\phi_t(b_x, s_i)) + \epsilon$, where $\phi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and s is a cubic smoothing function with three degrees of freedom. The error term $\epsilon$ is normally distributed with a mean of zero
The DE test is performed using an approximate $\chi^2$ likelihood ratio test
Performance
No ground truth, i.e. no independently validated truth is available for testing
Known data
using data we know something about to get positive controls
Simulated data
null data sets by re-sampling
modeling data sets based on various distributions
Comparing between methods and scenarios
Comparing numbers of DEs, incl. as a function of group size
Investigating results
How do the expression levels and distributions of the detected DEs look?
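One way to build a null data set by re-sampling is to split the cells of a single, homogeneous group at random into two pseudo-groups. No gene is truly DE between them, so any gene a method calls is a false positive. The sketch below shows only the splitting step; the cell identifiers are hypothetical.

```python
import random


def make_null_split(cell_ids, rng):
    """Randomly split cells from one homogeneous group into two pseudo-groups."""
    shuffled = list(cell_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


rng = random.Random(0)
cells = [f"cell_{i}" for i in range(10)]
pseudo_a, pseudo_b = make_null_split(cells, rng)
```

Repeating the split with different seeds and group sizes gives a picture of how a method's false positive rate depends on the number of cells.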
- Outline
- Introduction
- Common methods
-
- More detailed examples
-
- Performance
- Practicalities
- Summary
- Bibliography
-
Common methods
Miao and Zhang 2016
Olga (NBIS) scRNA-seq DE February 2019 15 46
Common methodsSupplementary Table 2 Evaluated dicrarrerential expression methods together with package versions and thetype of input values provided to each of them Note that ldquoraw countsrdquo here refers to length-scaled TPMswhich are on the scale of the raw counts but are unacrarrected by dicrarrerential isoform usage [10] CPM valuesare calculated with edgeR and Census counts with monocle
Short name Method Software version InputAvailablefrom
Reference
BPSC BPSC BPSC 09901 CPM GitHub [11]
D3E D3E D3E 10 raw counts GitHub [12]
DESeq2 DESeq2 DESeq2 1141 raw counts Bioconductor [13]
DESeq2betapFALSE DESeq2 without beta prior DESeq2 1141 raw counts Bioconductor [13]
DESeq2census DESeq2 DESeq2 1141 Census counts Bioconductor [13]
DESeq2nofiltDESeq2 without the built-in in-dependent filtering
DESeq2 1141 raw counts Bioconductor [13]
DEsingle DEsingle DEsingle 010 raw counts GitHub [14]
edgeRLRT edgeRLRT edgeR 3191 raw counts Bioconductor [15ndash17]
edgeRLRTcensus edgeRLRT edgeR 3191 Census counts Bioconductor [15ndash17]
edgeRLRTdeconvedgeRLRT with deconvolutionnormalization
edgeR 3191scran 120
raw counts Bioconductor [15 17 18]
edgeRLRTrobustedgeRLRT with robust disper-sion estimation
edgeR 3191 raw counts Bioconductor [15ndash17 19]
edgeRQLF edgeRQLF edgeR 3191 raw counts Bioconductor [15 16 20]
edgeRQLFDetRateedgeRQLF with cellular detec-tion rate as covariate
edgeR 3191 raw counts Bioconductor [15 16 20]
limmatrend limma-trend limma 33013 log2(CPM) Bioconductor [21 22]
MASTcpm MAST MAST 105 log2(CPM+1) Bioconductor [23]
MASTcpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(CPM+1) Bioconductor [23]
MASTtpm MAST MAST 105 log2(TPM+1) Bioconductor [23]
MASTtpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(TPM+1) Bioconductor [23]
metagenomeSeq metagenomeSeqmetagenomeSeq1160
raw counts Bioconductor [24]
monocle monocle (tobit) monocle 220 TPM Bioconductor [25]
monoclecensus monocle (Negative Binomial) monocle 220 Census counts Bioconductor [25 26]
monoclecount monocle (Negative Binomial) monocle 220 raw counts Bioconductor [25]
NODES NODESNODES0009010
raw countsAuthor-providedlink
[27]
ROTScpm ROTS ROTS 120 CPM Bioconductor [28 29]
ROTStpm ROTS ROTS 120 TPM Bioconductor [28 29]
ROTSvoom ROTS ROTS 120voom-transformedraw counts
Bioconductor [28 29]
SAMseq SAMseq samr 20 raw counts CRAN [30]
scDD scDD scDD 100 raw counts Bioconductor [31]
SCDE SCDE scde 220 raw counts Bioconductor [32]
SeuratBimod Seurat (bimod test) Seurat 1407 raw counts GitHub [33 34]
SeuratBimodnofiltSeurat (bimod test) without theinternal filtering
Seurat 1407 raw counts GitHub [33 34]
SeuratBimodIsExpr2Seurat (bimod test) with internalexpression threshold set to 2
Seurat 1407 raw counts GitHub [33 34]
SeuratTobit Seurat (tobit test) Seurat 1407 TPM GitHub [25 33]
ttest t-test stats (R v 33)TMM-normalizedTPM
CRAN [16 35]
voomlimma voom-limma limma 33013 raw counts Bioconductor [21 22]
Wilcoxon Wilcoxon test stats (R v 33)TMM-normalizedTPM
CRAN [16 36]
3
Nature Methods doi101038nmeth4612Soneson and Robinson 2018
Olga (NBIS) scRNA-seq DE February 2019 16 46
Common methods More detailed examples
More detailed examples
Olga (NBIS) scRNA-seq DE February 2019 17 46
Common methods More detailed examples
MAST
uses generalized linear hurdle model
designed to account for stochastic dropouts and bimodal expression distribution in whichexpression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene gindicating whether gene g is expressed in cell i (ie Zig = 0 if yig = 0 and zig = 1 ifyig gt 0)
A logistic regression model for the discrete variable Z and a Gaussian linear model for thecontinuous variable (Y|Z=1)
logit(Pr (Zig = 1)) = XiβDg
Pr (Yig = Y |Zig = 1) = N(XiβCg σ
2g) where Xi is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
Olga (NBIS) scRNA-seq DE February 2019 18 46
Common methods More detailed examples
Letrsquos stop for a minute
Olga (NBIS) scRNA-seq DE February 2019 19 46
Common methods More detailed examples
The key
Outcomei = (Modeli) + errori
we collect data on a sample from a much larger population
statistics lets us to make inferences about the population from which sample wasderived
we try to predict the outcome given a model fitted to the data
Olga (NBIS) scRNA-seq DE February 2019 20 46
Common methods More detailed examples
The key
t = x1minusx2
sp
radic1
n1+ 1
n2
height [cm]
Fre
quen
cy
165 170 175 180
010
3050
Olga (NBIS) scRNA-seq DE February 2019 21 46
Common methods More detailed examples
Generic recipemodel data eg gene expressionfit model to the data andor data to the modelestimate model parametersuse model for prediction andor inference
Olga (NBIS) scRNA-seq DE February 2019 22 46
Common methods More detailed examples
Generic recipemodel eg gene expression with random errorfit model to the data andor data to the model estimate modelparametersuse model for prediction andor inference
Important implicationthe better model fits to the data the better statistics
Olga (NBIS) scRNA-seq DE February 2019 23 46
Common methods More detailed examples
Common distributions
Negative Binomial
Read Counts
Fre
quen
cy
0 5 10 15 20
010
020
030
040
050
0
Negative Binomial
Read Counts
Fre
quen
cy
0 2 4 6 8 10
050
100
150
Negative Binomial
Read Counts
Fre
quen
cy
0 2 4 6 8 10 12
050
100
150
200
micro = mu
δ2 = mu + mu2
size
mu mean expression size and the dispersion which is inversely related to the variance NB fits bulk RNA-seq data very welland it is used for most statistical methods designed for such data In addition it has been show to fit the distribution of moleculecounts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grun et al 2014 Islam et al 2011)
Olga (NBIS) scRNA-seq DE February 2019 24 46
Common methods More detailed examples
Common distributions
Zerominusinflated NB
Read Counts
Fre
quen
cy
0 5 10 15 20
020
040
0
Zerominusinflated NB
Read Counts
Fre
quen
cy
0 10 20 30 40
010
020
0micro = mu lowast (1 minus d)
δ2 = micro lowast (1 minus d) lowast (1 + d lowast micro + microsize)
d dropout rate The dropout rate of a gene is strongly correlated with the mean expression of the gene Different zero-inflatednegative binomial models use different relationships between mu and d and some may fit mu and d to the expression of eachgene independently Implemented in MAST SCDE
Olga (NBIS) scRNA-seq DE February 2019 25 46
Common methods More detailed examples
Common distributions
PoissonminusBeta
Read Counts
Fre
quen
cy
0 20 60 100
020
040
0
PoissonminusBeta
Read Counts
Fre
quen
cy
0 20 60 100 140
050
150
micro = g lowast a(a + b)
δ2 = g2 lowast a lowast b((a + b + 1) lowast (a + b)2)
a the rate of activation of transcription b the rate of inhibition of transcription and g the rate of transcript production whiletranscription is active at the locus Differential expression methods may test each of the parameters for differences across groupsor only one (often g) Implemented in BPSCMay be further expanded to explicitly account for other sources of gene expression differences such as batch-effect or librarydepth depending on the particular DE algorithm
Olga (NBIS) scRNA-seq DE February 2019 26 46
Common methods More detailed examples
MAST (revisited)
uses generalized linear hurdle model
designed to account for stochastic dropouts and bimodal expression distribution in whichexpression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene gindicating whether gene g is expressed in cell i (ie Zig = 0 if yig = 0 and zig = 1 ifyig gt 0)
A logistic regression model for the discrete variable Z and a Gaussian linear model for thecontinuous variable (Y|Z=1)
logit(Pr (Zig = 1)) = XiβDg
Pr (Yig = Y |Zig = 1) = N(XiβCg σ
2g) where Xi is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
Olga (NBIS) scRNA-seq DE February 2019 27 46
SCDE
models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution
The NB distribution models the transcripts that are amplified and detected
The Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)
A subset of robust genes is used to fit, via the EM algorithm, the parameters of the mixture of models
For DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach
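To make the mixture idea concrete, the sketch below computes, for one observed count, the posterior probability that it came from the Poisson (dropout) component rather than the NB (amplified) component. This corresponds to a single E-step of a generic EM fit; it is not the scde package's implementation, and all parameter names and values (`pi_drop`, `lam`, `mu`, `size`) are assumptions made for illustration.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability of count k with rate lam (lam > 0)."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def nb_pmf(k, mu, size):
    """Negative binomial probability of count k, mean mu, dispersion size."""
    log_p = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
             + size * math.log(size / (size + mu))
             + k * math.log(mu / (size + mu)))
    return math.exp(log_p)


def dropout_posterior(k, pi_drop, lam, mu, size):
    """Posterior probability that count k belongs to the dropout component."""
    drop = pi_drop * poisson_pmf(k, lam)
    amplified = (1 - pi_drop) * nb_pmf(k, mu, size)
    return drop / (drop + amplified)


# Hypothetical mixture: 30% dropouts with background rate 0.1,
# amplified signal with mean 50 and dispersion 2.
p_zero = dropout_posterior(0, pi_drop=0.3, lam=0.1, mu=50.0, size=2.0)
p_high = dropout_posterior(60, pi_drop=0.3, lam=0.1, mu=50.0, size=2.0)
print(p_zero, p_high)
```

Under these assumed parameters a zero count is attributed almost entirely to dropout, while a count of 60 is attributed almost entirely to the amplified component.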
Monocle
Originally designed for ordering cells by progress through differentiation stages (pseudo-time)
The mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as
$g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$
where Y is a specific gene expression level, $x_i$ are predictor variables, g is a link function (typically the log function) and $f_i$ are non-parametric functions (e.g. cubic splines)
The observable expression level Y is then modeled using a GAM:
$E(Y) = s(\varphi_t(b_x, s_i)) + \varepsilon$
where $\varphi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and s is a cubic smoothing function with three degrees of freedom. The error term $\varepsilon$ is normally distributed with a mean of zero
The DE test is performed using an approximate $\chi^2$ likelihood ratio test
Performance
Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle.

| Short name | Method | Software version | Input | Available from | Reference |
|---|---|---|---|---|---|
| BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11] |
| D3E | D3E | D3E 1.0 | raw counts | GitHub | [12] |
| DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13] |
| DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14] |
| edgeRLRT | edgeR/LRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17] |
| edgeRLRTcensus | edgeR/LRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17] |
| edgeRLRTdeconv | edgeR/LRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18] |
| edgeRLRTrobust | edgeR/LRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19] |
| edgeRQLF | edgeR/QLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20] |
| edgeRQLFDetRate | edgeR/QLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20] |
| limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22] |
| MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23] |
| MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23] |
| MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23] |
| MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23] |
| metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24] |
| monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25] |
| monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26] |
| monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25] |
| NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27] |
| ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29] |
| ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29] |
| ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29] |
| SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30] |
| scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31] |
| SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32] |
| SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33] |
| ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35] |
| voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22] |
| Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36] |

Source: Nature Methods, doi:10.1038/nmeth.4612
No ground truth, i.e. no independently validated truth is available for testing
Known data: using data we know something about, to get positive controls
Simulated data: null data sets by re-sampling; modeling data sets based on various distributions
Comparing between methods and scenarios: comparing numbers of DEs, incl. as a function of group size
Investigating results: what do the expression and distributions of detected DEs look like?
False positives (type I error) vs false negatives (type II error)
Sensitivity and specificity
Precision and recall
[Figure adapted from Wikipedia]
[Figure: Dal Molin, Baruzzo and Di Camillo 2017. 2 conditions of 100 cells each, simulated with 10 000 genes, out of which 2 000 were set to be DE (based on NB and bimodal distributions)]
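These quantities all follow from the four cells of a confusion matrix. The sketch below uses the simulation size mentioned above (10 000 genes, 2 000 truly DE); the counts attributed to the method are invented purely to show the arithmetic and do not come from any of the cited benchmarks.

```python
def classification_metrics(tp, fp, tn, fn):
    """Summarise DE calls against a known truth.

    tp/fp: genes called DE that truly are / are not DE;
    tn/fn: genes not called DE that truly are not / are DE.
    """
    return {
        "sensitivity": tp / (tp + fn),          # same quantity as recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "false_positive_rate": fp / (fp + tn),  # type I error rate
        "false_negative_rate": fn / (fn + tp),  # type II error rate
    }


# Hypothetical outcome: 2 000 true DE genes and 8 000 non-DE genes,
# of which a method calls 1 500 and 400 as DE, respectively.
metrics = classification_metrics(tp=1500, fp=400, tn=7600, fn=500)
print(metrics)
```

A method can score well on specificity while precision stays modest, because non-DE genes outnumber DE genes four to one in this setup.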
Consistency (Miao et al. 2017)
And so much more
Soneson and Robinson 2018: "Bias, robustness and scalability in single-cell differential expression analysis"
36 statistical approaches for DE analysis, comparing the expression levels in two groups of cells
based on 9 data sets with 11-21 separate instances (sample size effect)
extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.
conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets
Practicalities
Getting to know your data
Example data: 46078 genes x 96 cells; 22229 genes with no expression at all
[Figure: two histograms, one of read counts and one labelled "0 counts", each against frequency]
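A first look at the zeros can be done with a few lines of code. The sketch below assumes the counts are held as a list of rows, one per gene, with one entry per cell; the tiny matrix and the function name `zero_summary` are made up for illustration.

```python
def zero_summary(counts):
    """Summarise zeros in a genes x cells count matrix.

    Returns the number of genes with no expression in any cell and, for each
    gene, the number of cells in which it has a zero count.
    """
    n_all_zero = sum(1 for gene in counts if all(c == 0 for c in gene))
    zeros_per_gene = [sum(1 for c in gene if c == 0) for gene in counts]
    return n_all_zero, zeros_per_gene


# Hypothetical toy matrix: 4 genes (rows) x 3 cells (columns).
toy_counts = [
    [0, 0, 0],
    [5, 0, 12],
    [0, 0, 3],
    [7, 9, 4],
]
n_all_zero, zeros_per_gene = zero_summary(toy_counts)
print(n_all_zero, zeros_per_gene)
```

Genes that are zero in every cell carry no information for DE testing and are usually filtered out before fitting any model.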
Choosing DE methods (Soneson and Robinson 2018)
Remembering the bigger picture (Stegle, Teichmann and Marioni 2015)
QC filtering
Cell-cycle phase
Normalization of cell-specific biases
Confounding factors, incl. batch effects
Detection rate, i.e. the fraction of detected genes per cell
Imputation strategies for dropout values
What is pragmatic: programming language, platform, speed, collaborative workflows, etc.
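Of the items above, the detection rate is the most direct to compute. The sketch below assumes a genes x cells matrix stored as a list of rows and treats any count above zero as "detected"; the function name and the toy data are hypothetical.

```python
def detection_rate(counts):
    """Fraction of genes with a non-zero count, computed per cell.

    counts: list of rows, one per gene, each holding one count per cell.
    """
    n_genes = len(counts)
    n_cells = len(counts[0])
    return [
        sum(1 for gene in counts if gene[j] > 0) / n_genes
        for j in range(n_cells)
    ]


# Hypothetical toy matrix: 4 genes (rows) x 3 cells (columns).
toy_counts = [
    [0, 0, 0],
    [5, 0, 12],
    [0, 0, 3],
    [7, 9, 4],
]
rates = detection_rate(toy_counts)
print(rates)
```

The resulting per-cell value is what methods with a "DetRate" variant include as a covariate in the model.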
Staying critical
Summary
What to remember from this hour?
https://www.menti.com & 25 06 78
Growing field (Angerer et al. 2017)
Growing field: https://www.scrna-tools.org/tools (Zappia, Phipson and Oshlack 2018)
scRNA-seq is a rapidly growing field
DE is a common task, so many newer and better methods will be developed
understanding basic statistical concepts enables one to think more like a statistician when choosing and evaluating methods for a given data set
staying critical, staying updated, staying connected
Bibliography
Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1–9. ISSN 1367-4803. doi:10.1093/bioinformatics/bty329.
Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243–260. ISSN 2095-4697. doi:10.1007/s40484-016-0089-7.
Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255–261. ISSN 1548-7105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612.
Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN 1664-8021. doi:10.3389/fgene.2017.00062.
Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data," no. May: 1–2. ISSN 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549.
Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133–145.
Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85–91. ISSN 2452-3100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004.
Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573.
- Outline
- Introduction
- Common methods
  - More detailed examples
- Performance
- Practicalities
- Summary
- Bibliography
Common methods
Supplementary Table 2 (Soneson and Robinson 2018): Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle.
Source: Soneson and Robinson 2018, Nature Methods, doi:10.1038/nmeth.4612
More detailed examples
MAST
uses a generalized linear hurdle model
designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene g, with Z indicating whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
A logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):
$\mathrm{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$
$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
Let's stop for a minute
The key
$\mathrm{Outcome}_i = (\mathrm{Model}_i) + \mathrm{error}_i$
we collect data on a sample from a much larger population
statistics lets us make inferences about the population from which the sample was derived
we try to predict the outcome given a model fitted to the data
The key
$t = \dfrac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$
[Figure: histogram of height [cm] (x-axis) against frequency (y-axis)]
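The t statistic above can be computed with the Python standard library. The sketch assumes that $s_p$ denotes the pooled standard deviation of the two groups (the usual equal-variance two-sample t-test); the slide does not define it explicitly, and the data values are invented.

```python
import math
from statistics import mean, variance


def pooled_t(x1, x2):
    """Two-sample t statistic with a pooled standard deviation s_p."""
    n1, n2 = len(x1), len(x2)
    # Pooled variance: sample variances weighted by their degrees of freedom.
    sp2 = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    sp = math.sqrt(sp2)
    return (mean(x1) - mean(x2)) / (sp * math.sqrt(1 / n1 + 1 / n2))


# Hypothetical measurements for two small groups.
t = pooled_t([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(t)
```

The numerator is the model (a difference in group means) and the denominator is the error expected from sampling variation alone.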
Generic recipe
model, e.g. gene expression with random error
fit model to the data and/or data to the model, estimate model parameters
use model for prediction and/or inference
Important implication: the better the model fits the data, the better the statistics
Common distributions: Negative Binomial
[Figure: three histograms of read counts (x-axis) against frequency (y-axis) for Negative Binomial distributions]
$\mu = mu$
$\sigma^2 = mu + mu^2 / size$
mu is the mean expression and size is the dispersion, which is inversely related to the variance. The NB fits bulk RNA-seq data very well and is used by most statistical methods designed for such data. In addition, it has been shown to fit the distribution of molecule counts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grun et al. 2014; Islam et al. 2011).
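The inverse relation between size and the variance is easy to check numerically. A minimal sketch, with a hypothetical function name and example values:

```python
def nb_variance(mu, size):
    """Variance of a negative binomial with mean mu and dispersion size."""
    if mu < 0 or size <= 0:
        raise ValueError("mu must be non-negative and size positive")
    return mu + mu ** 2 / size


# Same mean, increasingly large size: the variance shrinks towards the mean,
# which is the Poisson limit.
for size in (0.5, 2.0, 100.0):
    print(size, nb_variance(10.0, size))
```

A small size therefore means strong overdispersion relative to a Poisson model with the same mean.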
Common distributions: Zero-inflated NB
[Figure: two histograms of read counts (x-axis) against frequency (y-axis) for zero-inflated NB distributions]
$\mu = mu \cdot (1 - d)$
$\sigma^2 = mu \cdot (1 - d) \cdot (1 + d \cdot mu + mu / size)$
d is the dropout rate. The dropout rate of a gene is strongly correlated with the mean expression of the gene. Different zero-inflated negative binomial models use different relationships between mu and d, and some may fit mu and d to the expression of each gene independently. Implemented in MAST, SCDE.
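A quick numerical check of the zero-inflated formulas is sketched below. It assumes that mu and size in the variance formula are the parameters of the underlying NB component, so that setting d = 0 recovers the plain NB; the function name and values are hypothetical.

```python
def zinb_moments(mu, size, d):
    """Return (mean, variance) of a zero-inflated negative binomial.

    mu, size: mean and dispersion of the NB component; d: dropout rate.
    """
    if not 0 <= d < 1:
        raise ValueError("d must lie in [0, 1)")
    mean = mu * (1 - d)
    variance = mu * (1 - d) * (1 + d * mu + mu / size)
    return mean, variance


# Without dropouts the result equals the NB mean and variance.
print(zinb_moments(10.0, 2.0, 0.0))
# Half of the observations replaced by zeros.
print(zinb_moments(10.0, 2.0, 0.5))
```

Dropouts lower the mean, and the term d * mu adds variance on top of the NB overdispersion.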
Common methods More detailed examples
MAST (revisited)
uses generalized linear hurdle model
designed to account for stochastic dropouts and bimodal expression distribution in whichexpression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene gindicating whether gene g is expressed in cell i (ie Zig = 0 if yig = 0 and zig = 1 ifyig gt 0)
A logistic regression model for the discrete variable Z and a Gaussian linear model for thecontinuous variable (Y|Z=1)
logit(Pr (Zig = 1)) = XiβDg
Pr (Yig = Y |Zig = 1) = N(XiβCg σ
2g) where Xi is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
Olga (NBIS) scRNA-seq DE February 2019 27 46
Common methods More detailed examples
SCDE
models the read counts for each gene using a mixture of a NB negative binomial and aPoisson distribution
NB distribution models the transcripts that are amplified and detected
Poisson distribution models the unobserved or background-level signal of transcripts thatare not amplified (eg dropout events)
subset of robust genes is used to fit via EM algorithm the parameters to the mixture ofmodels
For DE the posterior probability that the gene shows a fold expression difference betweentwo conditions is computed using a Bayesian approach
Olga (NBIS) scRNA-seq DE February 2019 28 46
Common methods More detailed examples
Monocole
Originally designed for ordering cells by progress through differentiation stages(pseudo-time)
The mean expression level of each gene is modeled with a GAM generalized additivemodel which relates one or more predictor variables to a response variable as
g(E(Y )) = β0 + f1(x1) + f2(x2) + + fm(xm) where Y is a specific gene expression level xi arepredictor variables g is a link function typically log function and fi are non-parametric functions
(eg cubic splines)
The observable expression level Y is then modeled using GAM
E(Y ) = s(ϕt (bx si )) + ε where ϕt (bx si ) is the assigned pseudo-time of a cell and s is a cubicsmoothing function with three degrees of freedom The error term ε is normally distributed with amean of zero
The DE test is performed using an approx χ2 likelihood ratio test
Olga (NBIS) scRNA-seq DE February 2019 29 46
Performance
Performance
Olga (NBIS) scRNA-seq DE February 2019 30 46
Supplementary Table 2 Evaluated dicrarrerential expression methods together with package versions and thetype of input values provided to each of them Note that ldquoraw countsrdquo here refers to length-scaled TPMswhich are on the scale of the raw counts but are unacrarrected by dicrarrerential isoform usage [10] CPM valuesare calculated with edgeR and Census counts with monocle
Short name Method Software version InputAvailablefrom
Reference
BPSC BPSC BPSC 09901 CPM GitHub [11]
D3E D3E D3E 10 raw counts GitHub [12]
DESeq2 DESeq2 DESeq2 1141 raw counts Bioconductor [13]
DESeq2betapFALSE DESeq2 without beta prior DESeq2 1141 raw counts Bioconductor [13]
DESeq2census DESeq2 DESeq2 1141 Census counts Bioconductor [13]
DESeq2nofiltDESeq2 without the built-in in-dependent filtering
DESeq2 1141 raw counts Bioconductor [13]
DEsingle DEsingle DEsingle 010 raw counts GitHub [14]
edgeRLRT edgeRLRT edgeR 3191 raw counts Bioconductor [15ndash17]
edgeRLRTcensus edgeRLRT edgeR 3191 Census counts Bioconductor [15ndash17]
edgeRLRTdeconvedgeRLRT with deconvolutionnormalization
edgeR 3191scran 120
raw counts Bioconductor [15 17 18]
edgeRLRTrobustedgeRLRT with robust disper-sion estimation
edgeR 3191 raw counts Bioconductor [15ndash17 19]
edgeRQLF edgeRQLF edgeR 3191 raw counts Bioconductor [15 16 20]
edgeRQLFDetRateedgeRQLF with cellular detec-tion rate as covariate
edgeR 3191 raw counts Bioconductor [15 16 20]
limmatrend limma-trend limma 33013 log2(CPM) Bioconductor [21 22]
MASTcpm MAST MAST 105 log2(CPM+1) Bioconductor [23]
MASTcpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(CPM+1) Bioconductor [23]
MASTtpm MAST MAST 105 log2(TPM+1) Bioconductor [23]
MASTtpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(TPM+1) Bioconductor [23]
metagenomeSeq metagenomeSeqmetagenomeSeq1160
raw counts Bioconductor [24]
monocle monocle (tobit) monocle 220 TPM Bioconductor [25]
monoclecensus monocle (Negative Binomial) monocle 220 Census counts Bioconductor [25 26]
monoclecount monocle (Negative Binomial) monocle 220 raw counts Bioconductor [25]
NODES NODESNODES0009010
raw countsAuthor-providedlink
[27]
ROTScpm ROTS ROTS 120 CPM Bioconductor [28 29]
ROTStpm ROTS ROTS 120 TPM Bioconductor [28 29]
ROTSvoom ROTS ROTS 120voom-transformedraw counts
Bioconductor [28 29]
SAMseq SAMseq samr 20 raw counts CRAN [30]
scDD scDD scDD 100 raw counts Bioconductor [31]
SCDE SCDE scde 220 raw counts Bioconductor [32]
SeuratBimod Seurat (bimod test) Seurat 1407 raw counts GitHub [33 34]
SeuratBimodnofiltSeurat (bimod test) without theinternal filtering
Seurat 1407 raw counts GitHub [33 34]
SeuratBimodIsExpr2Seurat (bimod test) with internalexpression threshold set to 2
Seurat 1407 raw counts GitHub [33 34]
SeuratTobit Seurat (tobit test) Seurat 1407 TPM GitHub [25 33]
ttest t-test stats (R v 33)TMM-normalizedTPM
CRAN [16 35]
voomlimma voom-limma limma 33013 raw counts Bioconductor [21 22]
Wilcoxon Wilcoxon test stats (R v 33)TMM-normalizedTPM
CRAN [16 36]
3
Nature Methods doi101038nmeth4612
Performance
Olga (NBIS) scRNA-seq DE February 2019 31 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
False positives (type I error) vs false negatives (type II error)Sensitivity and specificityPrecision and recall
adapted from Wikipedia
Olga (NBIS) scRNA-seq DE February 2019 33 46
Performance
False positives (type I error) vs false negatives (type II error)Sensitivity and specificityPrecision and recall
Dal Molin Baruzzo and Di Camillo 2017 2 conditions of 100 cells each simulated with 10 000genes out of which 2 000 set to DEs (based on NB and bimodal distributions)
Olga (NBIS) scRNA-seq DE February 2019 34 46
Performance
Consistency
Miao et al 2017
Olga (NBIS) scRNA-seq DE February 2019 35 46
Performance
And so much more
Soneson and Robinson 2018
Bias robustness and scalability in single-celldifferential expression analysis
36 statistical approaches for DE analysisto compare the expression levels in thetwo groups of cells
based on 9 data sets with 11 - 21separate instances (sample size effect)
extensive evaluation metrics incl numberof genes found characteristics of the falsepositive detections robustness ofmethods similarities between methodsetc
conquer a collection of consistentlyprocessed analysis-ready publicscRNA-seq data sets
Olga (NBIS) scRNA-seq DE February 2019 36 46
Practicalities
Practicalities
Olga (NBIS) scRNA-seq DE February 2019 37 46
Practicalities
Getting to know your dataExample data 46078 genes x 96 cells22229 genes with no expression at all
Read Counts
Fre
quen
cy
0 500 1000 1500
050
0015
000
0 counts
Fre
quen
cy
0 20 40 60 800
2000
4000
6000
Olga (NBIS) scRNA-seq DE February 2019 38 46
Practicalities
Choosing DE methods
Soneson and Robinson 2018
Olga (NBIS) scRNA-seq DE February 2019 39 46
Practicalities
Rembering the bigger picture
Stegle Teichmann and Marioni 2015
QC filtering
Cell-cycle phase
Normalization of cell-specific biases
Confounding factors incl batcheffects
Detection rate ie the fraction ofdetected genes per cell
Imputations strategies for dropoutvalues
What is pragmatic programminglanguage platform speedcollaborative workflows etc
Olga (NBIS) scRNA-seq DE February 2019 40 46
Practicalities
Staying critical
Olga (NBIS) scRNA-seq DE February 2019 41 46
Summary
What to remember from this hour
httpswwwmenticom amp 25 06 78
Olga (NBIS) scRNA-seq DE February 2019 42 46
Summary
Growing field
Angerer et al 2017
Olga (NBIS) scRNA-seq DE February 2019 43 46
Summary
Growing field
httpswwwscrna-toolsorgtools
Zappia Phipson and Oshlack 2018
Olga (NBIS) scRNA-seq DE February 2019 44 46
Summary
SummaryscRNA-seq is a rapidly growing field
DE is a common task so many newer and better methods will be developed
understanding basic statistical concepts enables one to think more like a statistician tochoose and evaluate methods given data set
staying critical staying updated staying connected
Olga (NBIS) scRNA-seq DE February 2019 45 46
Bibliography
Wu Zhijin et al 2017 ldquoTwo-phase differential expression analysis for single cell RNA-seqrdquoBioinformatics 00 (00) 1ndash9 ISSN 1367-4803 doi101093bioinformaticsbty329
Miao Zhun and Xuegong Zhang 2016 ldquoDifferential expression analyses for single-cellRNA-Seq old questions on new datardquo Quantitative Biology 4 (4) 243ndash260 ISSN 20954697doi101007s40484-016-0089-7
Soneson Charlotte and Mark D Robinson 2018 ldquoBias robustness and scalability in single-celldifferential expression analysisrdquo Nature Methods 15 (4) 255ndash261 ISSN 15487105doi101038nmeth4612 httpdxdoiorg101038nmeth4612
Dal Molin Alessandra Giacomo Baruzzo and Barbara Di Camillo 2017 ldquoSingle-cellRNA-sequencing Assessment of differential expression analysis methodsrdquo Frontiers inGenetics 8 (MAY) ISSN 16648021 doi103389fgene201700062
Miao Zhun et al 2017 ldquoDEsingle for detecting three types of differential expression in single-cellRNA-seq datardquo no May 1ndash2 ISSN 1367-4803 doi101093bioinformaticsbty332arXiv 103549
Stegle Oliver Sarah A Teichmann and John C Marioni 2015 ldquoComputational and analyticalchallenges in single-cell transcriptomicsrdquo Nature reviews Genetics 16 (January 2014)133ndash145
Angerer Philipp et al 2017 ldquoSingle cells make big data New challenges and opportunities intranscriptomicsrdquo Current Opinion in Systems Biology 485ndash91 ISSN 24523100doi101016jcoisb201707004httpdxdoiorg101016jcoisb201707004
Zappia Luke Belinda Phipson and Alicia Oshlack 2018 ldquoExploring the single-cell RNA-seqanalysis landscape with the scRNA-tools databaserdquo bioRxiv 206573 doi101101206573httpswwwbiorxivorgcontentearly20180323206573
Olga (NBIS) scRNA-seq DE February 2019 46 46
- Outline
- Introduction
- Common methods
-
- More detailed examples
-
- Performance
- Practicalities
- Summary
- Bibliography
-
Common methods More detailed examples
More detailed examples
Olga (NBIS) scRNA-seq DE February 2019 17 46
Common methods More detailed examples
MAST
uses a generalized linear hurdle model
designed to account for stochastic dropouts and a bimodal expression distribution, in which expression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene g, where Z indicates whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
A logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):
$\mathrm{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$
$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
Let's stop for a minute
The key
$\mathrm{Outcome}_i = (\mathrm{Model}_i) + \mathrm{error}_i$
we collect data on a sample from a much larger population
statistics lets us make inferences about the population from which the sample was derived
we try to predict the outcome given a model fitted to the data
$t = \dfrac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$, where $s_p$ is the pooled standard deviation and $n_1$, $n_2$ are the group sizes
[Figure: histogram of height [cm] (x-axis) against frequency (y-axis)]
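The t statistic above can be computed by hand in a few lines. This is a minimal sketch using only the Python standard library; the height values and the function name `pooled_t_statistic` are made up for illustration.

```python
import math
from statistics import mean, variance

def pooled_t_statistic(x1, x2):
    """Two-sample t statistic with pooled standard deviation s_p."""
    n1, n2 = len(x1), len(x2)
    # Pooled variance: the two sample variances weighted by their degrees of freedom.
    sp2 = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    sp = math.sqrt(sp2)
    return (mean(x1) - mean(x2)) / (sp * math.sqrt(1 / n1 + 1 / n2))

# Hypothetical heights [cm] for two groups
heights_a = [170.0, 172.0, 174.0, 176.0]
heights_b = [166.0, 168.0, 170.0, 172.0]
t = pooled_t_statistic(heights_a, heights_b)
print(round(t, 3))
```

The model here is "each group has its own mean, both share one variance"; the statistic measures how far the two fitted means are apart relative to the error.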
Generic recipe
model the data, e.g. gene expression with random error
fit the model to the data and/or the data to the model; estimate model parameters
use the model for prediction and/or inference
Important implication
the better the model fits the data, the better the statistics
Common distributions
Negative Binomial
[Figure: three histograms of negative binomial read counts (x-axis: read counts, y-axis: frequency)]
$\mu = \mathrm{mu}$
$\sigma^2 = \mathrm{mu} + \dfrac{\mathrm{mu}^2}{\mathrm{size}}$
mu: mean expression; size: the dispersion parameter, which is inversely related to the variance. The NB distribution fits bulk RNA-seq data very well and is used by most statistical methods designed for such data. In addition, it has been shown to fit the distribution of molecule counts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grün et al. 2014; Islam et al. 2011).
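To make the mean-variance relationship concrete, the sketch below evaluates the NB variance formula and inverts it to get a method-of-moments estimate of size. The counts and the helper names are hypothetical, and real DE packages estimate dispersion in more elaborate ways.

```python
from statistics import mean, variance

def nb_variance(mu, size):
    """Variance of a negative binomial with mean mu and dispersion parameter size."""
    return mu + mu ** 2 / size

def estimate_size(counts):
    """Method-of-moments estimate of size; needs sample variance > sample mean."""
    m, v = mean(counts), variance(counts)
    if v <= m:
        raise ValueError("no overdispersion: sample variance <= mean")
    return m ** 2 / (v - m)

counts = [0, 1, 1, 2, 3, 5, 8, 12]  # hypothetical read counts for one gene
size_hat = estimate_size(counts)
print(mean(counts), variance(counts), size_hat)
```

A small size means strong overdispersion; as size grows, the variance approaches the mean and the NB approaches a Poisson.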
Zero-inflated NB
[Figure: two histograms of zero-inflated negative binomial read counts (x-axis: read counts, y-axis: frequency)]
$\mu = \mathrm{mu} \cdot (1 - d)$
$\sigma^2 = \mathrm{mu} \cdot (1 - d) \cdot (1 + d \cdot \mathrm{mu} + \mathrm{mu} / \mathrm{size})$
d: dropout rate. The dropout rate of a gene is strongly correlated with the mean expression of the gene. Different zero-inflated negative binomial models use different relationships between mu and d, and some may fit mu and d to the expression of each gene independently. Implemented in MAST and SCDE.
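The zero-inflated NB moments can be checked numerically by summing over the probability mass function. This is a sketch under the assumption that a count is a dropout zero with probability d and NB(mu, size) otherwise; the parameter values are arbitrary.

```python
import math

def nb_pmf(k, mu, size):
    """Negative binomial pmf parameterised by mean mu and dispersion size."""
    log_p = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
             + size * math.log(size / (size + mu))
             + k * math.log(mu / (size + mu)))
    return math.exp(log_p)

def zinb_pmf(k, mu, size, d):
    """With probability d the count is a dropout zero, otherwise it is NB."""
    return d * (k == 0) + (1 - d) * nb_pmf(k, mu, size)

def zinb_moments(mu, size, d):
    """Closed-form mean and variance of the zero-inflated NB."""
    mean = mu * (1 - d)
    var = mu * (1 - d) * (1 + d * mu + mu / size)
    return mean, var

mu, size, d = 5.0, 2.0, 0.3   # hypothetical parameter values
ks = range(2000)              # truncation point, ample for these values
num_mean = sum(k * zinb_pmf(k, mu, size, d) for k in ks)
num_var = sum(k * k * zinb_pmf(k, mu, size, d) for k in ks) - num_mean ** 2
print(zinb_moments(mu, size, d), num_mean, num_var)
```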
Poisson-Beta
[Figure: two histograms of Poisson-Beta read counts (x-axis: read counts, y-axis: frequency)]
$\mu = g \cdot \dfrac{a}{a + b}$
$\sigma^2 = g^2 \cdot \dfrac{a \cdot b}{(a + b + 1) \cdot (a + b)^2}$
a: the rate of activation of transcription; b: the rate of inhibition of transcription; g: the rate of transcript production while transcription is active at the locus. Differential expression methods may test each of the parameters for differences across groups, or only one (often g). Implemented in BPSC.
May be further expanded to explicitly account for other sources of gene expression differences, such as batch effect or library depth, depending on the particular DE algorithm.
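The Poisson-Beta expressions can be evaluated directly. In the sketch below, the second helper rests on an assumption that is not stated on the slide: if the counts are Poisson given a Beta-distributed rate, the law of total variance adds a Poisson term equal to the mean on top of the rate variance. Function names and parameter values are hypothetical.

```python
def poisson_beta_moments(a, b, g):
    """Mean and variance as given by the Poisson-Beta formulas above."""
    mean = g * a / (a + b)
    var = g ** 2 * a * b / ((a + b + 1) * (a + b) ** 2)
    return mean, var

def count_variance(a, b, g):
    """Assumption: counts are Poisson given the rate, so the Poisson
    term (equal to the mean) is added to the variance of the rate."""
    mean, rate_var = poisson_beta_moments(a, b, g)
    return mean + rate_var

a, b, g = 2.0, 3.0, 100.0   # hypothetical parameter values
mean, rate_var = poisson_beta_moments(a, b, g)
print(mean, rate_var, count_variance(a, b, g))
```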
MAST (revisited)
uses a generalized linear hurdle model
designed to account for stochastic dropouts and a bimodal expression distribution, in which expression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene g, where Z indicates whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
A logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):
$\mathrm{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$
$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
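The two-part idea behind the hurdle model can be sketched for the simplest design, a single two-group indicator with no other covariates. In that case both parts have closed-form maximum likelihood fits, so no model-fitting library is needed. This is not the MAST implementation: it omits the empirical Bayes regularisation and any nuisance covariates, and it assumes each group has at least one detected value. All names and values are hypothetical.

```python
import math

def bernoulli_loglik(k, n):
    """Maximised Bernoulli log-likelihood for k detections among n cells."""
    ll = 0.0
    if k > 0:
        ll += k * math.log(k / n)
    if k < n:
        ll += (n - k) * math.log(1 - k / n)
    return ll

def rss(values, centre):
    """Residual sum of squares around a given centre."""
    return sum((v - centre) ** 2 for v in values)

def hurdle_lrt(group_a, group_b):
    """Two-group hurdle likelihood ratio test for one gene.

    Inputs are log-scale expression values, 0 meaning 'not detected'.
    Returns (statistic, p_value), using a chi-square reference with 2 df.
    """
    pos_a = [y for y in group_a if y > 0]
    pos_b = [y for y in group_b if y > 0]
    # Discrete part (Z): with only a group indicator, the logistic fit
    # reduces to the per-group detection rates.
    ll_alt = (bernoulli_loglik(len(pos_a), len(group_a))
              + bernoulli_loglik(len(pos_b), len(group_b)))
    ll_null = bernoulli_loglik(len(pos_a) + len(pos_b),
                               len(group_a) + len(group_b))
    lr_discrete = 2 * (ll_alt - ll_null)
    # Continuous part (Y | Z = 1): Gaussian linear model on detected values.
    pos = pos_a + pos_b
    n = len(pos)
    rss_null = rss(pos, sum(pos) / n)
    rss_alt = (rss(pos_a, sum(pos_a) / len(pos_a))
               + rss(pos_b, sum(pos_b) / len(pos_b)))
    lr_continuous = n * math.log(rss_null / rss_alt)
    stat = lr_discrete + lr_continuous
    # Chi-square survival function with 2 degrees of freedom is exp(-x / 2).
    return stat, math.exp(-stat / 2)

# Hypothetical log-expression values for one gene in two groups of cells
group_a = [0, 0, 0, 1.0, 1.2, 0.8, 0, 1.1]
group_b = [2.0, 2.3, 0, 1.9, 2.1, 2.2, 1.8, 2.0]
stat, p = hurdle_lrt(group_a, group_b)
print(stat, p)
```

The statistic sums the evidence from both parts, so a gene can be called DE because it is detected more often, because it is expressed at a higher level when detected, or both.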
SCDE
models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution
the NB distribution models the transcripts that are amplified and detected
the Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)
a subset of robust genes is used to fit, via the EM algorithm, the parameters of the mixture of models
For DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach
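The mixture idea can be illustrated with the quantity an EM algorithm computes in its E-step: the posterior probability that an observed count came from the Poisson (dropout) component rather than the NB (amplified) component. This is a simplified sketch, not the SCDE code; it uses a fixed mixing weight `pi_dropout`, and all parameter values are hypothetical.

```python
import math

def poisson_pmf(k, lam):
    """Poisson pmf evaluated on the log scale for numerical stability."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def nb_pmf(k, mu, size):
    """Negative binomial pmf parameterised by mean mu and dispersion size."""
    log_p = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
             + size * math.log(size / (size + mu))
             + k * math.log(mu / (size + mu)))
    return math.exp(log_p)

def dropout_posterior(k, pi_dropout, lam, mu, size):
    """Posterior probability that count k comes from the Poisson component."""
    p_drop = pi_dropout * poisson_pmf(k, lam)
    p_amp = (1 - pi_dropout) * nb_pmf(k, mu, size)
    return p_drop / (p_drop + p_amp)

# Hypothetical parameters: 30% dropouts, background rate 0.1, NB mean 50
params = dict(pi_dropout=0.3, lam=0.1, mu=50.0, size=2.0)
post_zero = dropout_posterior(0, **params)
post_high = dropout_posterior(40, **params)
print(post_zero, post_high)
```

A zero count is attributed almost entirely to the dropout component, while a count of 40 is attributed to the amplified component.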
Monocle
Originally designed for ordering cells by progress through differentiation stages (pseudo-time)
The mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as
$g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$, where Y is a specific gene expression level, $x_i$ are predictor variables, g is a link function (typically the log function), and $f_i$ are non-parametric functions (e.g. cubic splines)
The observable expression level Y is then modeled using a GAM:
$E(Y) = s(\varphi_t(b_x, s_i)) + \epsilon$, where $\varphi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and s is a cubic smoothing function with three degrees of freedom. The error term $\epsilon$ is normally distributed with a mean of zero
The DE test is performed using an approximate $\chi^2$ likelihood ratio test
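The likelihood ratio test along pseudo-time can be sketched with a deliberately simplified model. Instead of a cubic smoothing function, the example below compares a constant mean against a straight-line trend in pseudo-time with Gaussian errors, which gives a chi-square reference with 1 degree of freedom. It only illustrates the test; it is not the GAM that Monocle fits, and the data are made up.

```python
import math

def gaussian_lrt_trend(pseudotime, expression):
    """LRT of a constant-mean model against a straight-line trend in pseudo-time."""
    n = len(expression)
    t_bar = sum(pseudotime) / n
    y_bar = sum(expression) / n
    sxx = sum((t - t_bar) ** 2 for t in pseudotime)
    sxy = sum((t - t_bar) * (y - y_bar) for t, y in zip(pseudotime, expression))
    slope = sxy / sxx
    # Residual sums of squares under the null (constant) and alternative (line).
    rss_null = sum((y - y_bar) ** 2 for y in expression)
    rss_alt = rss_null - slope * sxy
    stat = n * math.log(rss_null / rss_alt)
    # Chi-square survival function with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical pseudo-time values and log-expression of one gene
pseudotime = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
expression = [1.0, 1.3, 1.1, 1.6, 1.8, 1.7, 2.2, 2.1, 2.6, 2.5]
stat, p = gaussian_lrt_trend(pseudotime, expression)
print(stat, p)
```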
Performance
Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that “raw counts” here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle.
Short name | Method | Software version | Input | Available from | Reference
BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11]
D3E | D3E | D3E 1.0 | raw counts | GitHub | [12]
DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13]
DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14]
edgeRLRT | edgeRLRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17]
edgeRLRTcensus | edgeRLRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17]
edgeRLRTdeconv | edgeRLRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18]
edgeRLRTrobust | edgeRLRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19]
edgeRQLF | edgeRQLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
edgeRQLFDetRate | edgeRQLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22]
MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24]
monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25]
monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26]
monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25]
NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27]
ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29]
ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29]
ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29]
SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30]
scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31]
SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32]
SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33]
ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35]
voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22]
Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36]
Source: Nature Methods, doi:10.1038/nmeth.4612
No ground truth, i.e. no independently validated truth is available for testing
Known data: using data we know something about to get positive controls
Simulated data: null data sets by re-sampling; modeling data sets based on various distributions
Comparing between methods and scenarios: comparing numbers of DEs, incl. as a function of group size
Investigating results: what do the expression and distributions of detected DEs look like?
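A null data set by re-sampling can be sketched as follows: draw two disjoint groups at random from one homogeneous set of cells, so that any gene a method calls DE between them is a false positive. The cell identifiers, the group size and the function name are hypothetical.

```python
import random

def make_null_split(cells, group_size, seed=0):
    """Randomly draw two disjoint groups from one homogeneous set of cells."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    picked = rng.sample(cells, 2 * group_size)
    return picked[:group_size], picked[group_size:]

cells = [f"cell_{i}" for i in range(96)]   # hypothetical cell identifiers
group_a, group_b = make_null_split(cells, group_size=24)
print(len(group_a), len(group_b))
```

Repeating the split with different seeds and group sizes gives a picture of how the number of false detections depends on sample size.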
False positives (type I error) vs false negatives (type II error)
Sensitivity and specificity
Precision and recall
[Figure: adapted from Wikipedia]
[Figure: Dal Molin, Baruzzo and Di Camillo 2017: 2 conditions of 100 cells each, simulated with 10,000 genes, out of which 2,000 were set to be DE (based on NB and bimodal distributions)]
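When the truth is known, as in a simulation with 10,000 genes of which 2,000 are DE, these metrics follow directly from the confusion counts. A minimal sketch; the list of called genes is invented for illustration.

```python
def classification_metrics(called_de, truly_de, all_genes):
    """Sensitivity, specificity, precision and recall from gene sets."""
    called, truth, genes = set(called_de), set(truly_de), set(all_genes)
    tp = len(called & truth)          # true positives
    fp = len(called - truth)          # false positives (type I errors)
    fn = len(truth - called)          # false negatives (type II errors)
    tn = len(genes - called - truth)  # true negatives
    return {
        "sensitivity": tp / (tp + fn),  # identical to recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

all_genes = range(10_000)
truly_de = range(2_000)
# Hypothetical method output: 1,500 true DE genes plus 100 false calls
called_de = list(range(1_500)) + list(range(2_000, 2_100))
m = classification_metrics(called_de, truly_de, all_genes)
print(m)
```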
Consistency
[Figure: Miao et al. 2017]
And so much more
Soneson and Robinson 2018, “Bias, robustness and scalability in single-cell differential expression analysis”
36 statistical approaches for DE analysis to compare the expression levels in the two groups of cells
based on 9 data sets with 11–21 separate instances (sample size effect)
extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.
conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets
Practicalities
Getting to know your data
Example data: 46078 genes x 96 cells
22229 genes with no expression at all
[Figure: two histograms: read counts against frequency, and “0 counts” against frequency]
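A first look at a count matrix can be as simple as counting the genes with no expression at all and the fraction of detected genes per cell. The sketch below uses a tiny made-up matrix stored as a list of rows (one row per gene, one column per cell); a real analysis would use a matrix library.

```python
def summarise_counts(counts):
    """counts: list of rows, one row per gene, one column per cell."""
    n_genes, n_cells = len(counts), len(counts[0])
    # Genes with no expression at all: every cell has a zero count.
    all_zero_genes = sum(1 for row in counts if not any(row))
    # Detection rate: fraction of genes with a non-zero count in each cell.
    detection_rate = [
        sum(1 for row in counts if row[j] > 0) / n_genes
        for j in range(n_cells)
    ]
    return all_zero_genes, detection_rate

# Hypothetical 4 genes x 3 cells count matrix
toy_counts = [
    [0, 0, 0],
    [5, 0, 2],
    [0, 0, 0],
    [1, 3, 0],
]
n_zero, det = summarise_counts(toy_counts)
print(n_zero, det)
```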
Choosing DE methods
[Figure: Soneson and Robinson 2018]
Remembering the bigger picture
[Figure: Stegle, Teichmann and Marioni 2015]
QC filtering
Cell-cycle phase
Normalization of cell-specific biases
Confounding factors, incl. batch effects
Detection rate, i.e. the fraction of detected genes per cell
Imputation strategies for dropout values
What is pragmatic: programming language, platform, speed, collaborative workflows, etc.
Staying critical
Summary
What to remember from this hour
https://www.menti.com & 25 06 78
Growing field
[Figure: Angerer et al. 2017]
[Figure: https://www.scrna-tools.org/tools, Zappia, Phipson and Oshlack 2018]
Summary
scRNA-seq is a rapidly growing field
DE is a common task, so many newer and better methods will be developed
understanding basic statistical concepts enables one to think more like a statistician, to choose and evaluate methods given a data set
staying critical, staying updated, staying connected
Bibliography
Wu, Zhijin, et al. 2017. “Two-phase differential expression analysis for single cell RNA-seq.” Bioinformatics 00 (00): 1–9. ISSN 1367-4803. doi:10.1093/bioinformatics/bty329.
Miao, Zhun, and Xuegong Zhang. 2016. “Differential expression analyses for single-cell RNA-Seq: old questions on new data.” Quantitative Biology 4 (4): 243–260. ISSN 2095-4697. doi:10.1007/s40484-016-0089-7.
Soneson, Charlotte, and Mark D. Robinson. 2018. “Bias, robustness and scalability in single-cell differential expression analysis.” Nature Methods 15 (4): 255–261. ISSN 1548-7105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612.
Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. “Single-cell RNA-sequencing: Assessment of differential expression analysis methods.” Frontiers in Genetics 8 (MAY). ISSN 1664-8021. doi:10.3389/fgene.2017.00062.
Miao, Zhun, et al. 2017. “DEsingle for detecting three types of differential expression in single-cell RNA-seq data,” no. May: 1–2. ISSN 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549.
Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. “Computational and analytical challenges in single-cell transcriptomics.” Nature Reviews Genetics 16 (January 2014): 133–145.
Angerer, Philipp, et al. 2017. “Single cells make big data: New challenges and opportunities in transcriptomics.” Current Opinion in Systems Biology 4: 85–91. ISSN 2452-3100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004.
Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. “Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database.” bioRxiv 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573.
- Outline
- Introduction
- Common methods
-
- More detailed examples
-
- Performance
- Practicalities
- Summary
- Bibliography
-
Common methods More detailed examples
MAST
uses generalized linear hurdle model
designed to account for stochastic dropouts and bimodal expression distribution in whichexpression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene gindicating whether gene g is expressed in cell i (ie Zig = 0 if yig = 0 and zig = 1 ifyig gt 0)
A logistic regression model for the discrete variable Z and a Gaussian linear model for thecontinuous variable (Y|Z=1)
logit(Pr (Zig = 1)) = XiβDg
Pr (Yig = Y |Zig = 1) = N(XiβCg σ
2g) where Xi is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
Olga (NBIS) scRNA-seq DE February 2019 18 46
Common methods More detailed examples
Letrsquos stop for a minute
Olga (NBIS) scRNA-seq DE February 2019 19 46
Common methods More detailed examples
The key
Outcomei = (Modeli) + errori
we collect data on a sample from a much larger population
statistics lets us to make inferences about the population from which sample wasderived
we try to predict the outcome given a model fitted to the data
Olga (NBIS) scRNA-seq DE February 2019 20 46
Common methods More detailed examples
The key
t = x1minusx2
sp
radic1
n1+ 1
n2
height [cm]
Fre
quen
cy
165 170 175 180
010
3050
Olga (NBIS) scRNA-seq DE February 2019 21 46
Common methods More detailed examples
Generic recipemodel data eg gene expressionfit model to the data andor data to the modelestimate model parametersuse model for prediction andor inference
Olga (NBIS) scRNA-seq DE February 2019 22 46
Common methods More detailed examples
Generic recipemodel eg gene expression with random errorfit model to the data andor data to the model estimate modelparametersuse model for prediction andor inference
Important implicationthe better model fits to the data the better statistics
Olga (NBIS) scRNA-seq DE February 2019 23 46
Common methods More detailed examples
Common distributions
Negative Binomial
Read Counts
Fre
quen
cy
0 5 10 15 20
010
020
030
040
050
0
Negative Binomial
Read Counts
Fre
quen
cy
0 2 4 6 8 10
050
100
150
Negative Binomial
Read Counts
Fre
quen
cy
0 2 4 6 8 10 12
050
100
150
200
micro = mu
δ2 = mu + mu2
size
mu mean expression size and the dispersion which is inversely related to the variance NB fits bulk RNA-seq data very welland it is used for most statistical methods designed for such data In addition it has been show to fit the distribution of moleculecounts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grun et al 2014 Islam et al 2011)
Olga (NBIS) scRNA-seq DE February 2019 24 46
Common methods More detailed examples
Common distributions
Zerominusinflated NB
Read Counts
Fre
quen
cy
0 5 10 15 20
020
040
0
Zerominusinflated NB
Read Counts
Fre
quen
cy
0 10 20 30 40
010
020
0micro = mu lowast (1 minus d)
δ2 = micro lowast (1 minus d) lowast (1 + d lowast micro + microsize)
d dropout rate The dropout rate of a gene is strongly correlated with the mean expression of the gene Different zero-inflatednegative binomial models use different relationships between mu and d and some may fit mu and d to the expression of eachgene independently Implemented in MAST SCDE
Olga (NBIS) scRNA-seq DE February 2019 25 46
Common methods More detailed examples
Common distributions
PoissonminusBeta
Read Counts
Fre
quen
cy
0 20 60 100
020
040
0
PoissonminusBeta
Read Counts
Fre
quen
cy
0 20 60 100 140
050
150
micro = g lowast a(a + b)
δ2 = g2 lowast a lowast b((a + b + 1) lowast (a + b)2)
a the rate of activation of transcription b the rate of inhibition of transcription and g the rate of transcript production whiletranscription is active at the locus Differential expression methods may test each of the parameters for differences across groupsor only one (often g) Implemented in BPSCMay be further expanded to explicitly account for other sources of gene expression differences such as batch-effect or librarydepth depending on the particular DE algorithm
Olga (NBIS) scRNA-seq DE February 2019 26 46
Common methods More detailed examples
MAST (revisited)
uses generalized linear hurdle model
designed to account for stochastic dropouts and bimodal expression distribution in whichexpression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene gindicating whether gene g is expressed in cell i (ie Zig = 0 if yig = 0 and zig = 1 ifyig gt 0)
A logistic regression model for the discrete variable Z and a Gaussian linear model for thecontinuous variable (Y|Z=1)
logit(Pr (Zig = 1)) = XiβDg
Pr (Yig = Y |Zig = 1) = N(XiβCg σ
2g) where Xi is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
Olga (NBIS) scRNA-seq DE February 2019 27 46
Common methods More detailed examples
SCDE
models the read counts for each gene using a mixture of a NB negative binomial and aPoisson distribution
NB distribution models the transcripts that are amplified and detected
Poisson distribution models the unobserved or background-level signal of transcripts thatare not amplified (eg dropout events)
subset of robust genes is used to fit via EM algorithm the parameters to the mixture ofmodels
For DE the posterior probability that the gene shows a fold expression difference betweentwo conditions is computed using a Bayesian approach
Olga (NBIS) scRNA-seq DE February 2019 28 46
Common methods More detailed examples
Monocole
Originally designed for ordering cells by progress through differentiation stages(pseudo-time)
The mean expression level of each gene is modeled with a GAM generalized additivemodel which relates one or more predictor variables to a response variable as
g(E(Y )) = β0 + f1(x1) + f2(x2) + + fm(xm) where Y is a specific gene expression level xi arepredictor variables g is a link function typically log function and fi are non-parametric functions
(eg cubic splines)
The observable expression level Y is then modeled using GAM
E(Y ) = s(ϕt (bx si )) + ε where ϕt (bx si ) is the assigned pseudo-time of a cell and s is a cubicsmoothing function with three degrees of freedom The error term ε is normally distributed with amean of zero
The DE test is performed using an approx χ2 likelihood ratio test
Olga (NBIS) scRNA-seq DE February 2019 29 46
Performance
Performance
Olga (NBIS) scRNA-seq DE February 2019 30 46
Supplementary Table 2 Evaluated dicrarrerential expression methods together with package versions and thetype of input values provided to each of them Note that ldquoraw countsrdquo here refers to length-scaled TPMswhich are on the scale of the raw counts but are unacrarrected by dicrarrerential isoform usage [10] CPM valuesare calculated with edgeR and Census counts with monocle
Short name Method Software version InputAvailablefrom
Reference
BPSC BPSC BPSC 09901 CPM GitHub [11]
D3E D3E D3E 10 raw counts GitHub [12]
DESeq2 DESeq2 DESeq2 1141 raw counts Bioconductor [13]
DESeq2betapFALSE DESeq2 without beta prior DESeq2 1141 raw counts Bioconductor [13]
DESeq2census DESeq2 DESeq2 1141 Census counts Bioconductor [13]
DESeq2nofiltDESeq2 without the built-in in-dependent filtering
DESeq2 1141 raw counts Bioconductor [13]
DEsingle DEsingle DEsingle 010 raw counts GitHub [14]
edgeRLRT edgeRLRT edgeR 3191 raw counts Bioconductor [15ndash17]
edgeRLRTcensus edgeRLRT edgeR 3191 Census counts Bioconductor [15ndash17]
edgeRLRTdeconvedgeRLRT with deconvolutionnormalization
edgeR 3191scran 120
raw counts Bioconductor [15 17 18]
edgeRLRTrobustedgeRLRT with robust disper-sion estimation
edgeR 3191 raw counts Bioconductor [15ndash17 19]
edgeRQLF edgeRQLF edgeR 3191 raw counts Bioconductor [15 16 20]
edgeRQLFDetRateedgeRQLF with cellular detec-tion rate as covariate
edgeR 3191 raw counts Bioconductor [15 16 20]
limmatrend limma-trend limma 33013 log2(CPM) Bioconductor [21 22]
MASTcpm MAST MAST 105 log2(CPM+1) Bioconductor [23]
MASTcpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(CPM+1) Bioconductor [23]
MASTtpm MAST MAST 105 log2(TPM+1) Bioconductor [23]
MASTtpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(TPM+1) Bioconductor [23]
metagenomeSeq metagenomeSeqmetagenomeSeq1160
raw counts Bioconductor [24]
monocle monocle (tobit) monocle 220 TPM Bioconductor [25]
monoclecensus monocle (Negative Binomial) monocle 220 Census counts Bioconductor [25 26]
monoclecount monocle (Negative Binomial) monocle 220 raw counts Bioconductor [25]
NODES NODESNODES0009010
raw countsAuthor-providedlink
[27]
ROTScpm ROTS ROTS 120 CPM Bioconductor [28 29]
ROTStpm ROTS ROTS 120 TPM Bioconductor [28 29]
ROTSvoom ROTS ROTS 120voom-transformedraw counts
Bioconductor [28 29]
SAMseq SAMseq samr 20 raw counts CRAN [30]
scDD scDD scDD 100 raw counts Bioconductor [31]
SCDE SCDE scde 220 raw counts Bioconductor [32]
SeuratBimod Seurat (bimod test) Seurat 1407 raw counts GitHub [33 34]
SeuratBimodnofiltSeurat (bimod test) without theinternal filtering
Seurat 1407 raw counts GitHub [33 34]
SeuratBimodIsExpr2Seurat (bimod test) with internalexpression threshold set to 2
Seurat 1407 raw counts GitHub [33 34]
SeuratTobit Seurat (tobit test) Seurat 1407 TPM GitHub [25 33]
ttest t-test stats (R v 33)TMM-normalizedTPM
CRAN [16 35]
voomlimma voom-limma limma 33013 raw counts Bioconductor [21 22]
Wilcoxon Wilcoxon test stats (R v 33)TMM-normalizedTPM
CRAN [16 36]
3
Nature Methods doi101038nmeth4612
Performance
Olga (NBIS) scRNA-seq DE February 2019 31 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
False positives (type I error) vs false negatives (type II error)Sensitivity and specificityPrecision and recall
adapted from Wikipedia
Olga (NBIS) scRNA-seq DE February 2019 33 46
Performance
False positives (type I error) vs false negatives (type II error)Sensitivity and specificityPrecision and recall
Dal Molin Baruzzo and Di Camillo 2017 2 conditions of 100 cells each simulated with 10 000genes out of which 2 000 set to DEs (based on NB and bimodal distributions)
Olga (NBIS) scRNA-seq DE February 2019 34 46
Performance
Consistency
Miao et al 2017
Olga (NBIS) scRNA-seq DE February 2019 35 46
Performance
And so much more
Soneson and Robinson 2018
Bias robustness and scalability in single-celldifferential expression analysis
36 statistical approaches for DE analysisto compare the expression levels in thetwo groups of cells
based on 9 data sets with 11 - 21separate instances (sample size effect)
extensive evaluation metrics incl numberof genes found characteristics of the falsepositive detections robustness ofmethods similarities between methodsetc
conquer a collection of consistentlyprocessed analysis-ready publicscRNA-seq data sets
Olga (NBIS) scRNA-seq DE February 2019 36 46
Practicalities
Practicalities
Olga (NBIS) scRNA-seq DE February 2019 37 46
Practicalities
Getting to know your dataExample data 46078 genes x 96 cells22229 genes with no expression at all
Read Counts
Fre
quen
cy
0 500 1000 1500
050
0015
000
0 counts
Fre
quen
cy
0 20 40 60 800
2000
4000
6000
Olga (NBIS) scRNA-seq DE February 2019 38 46
Practicalities
Choosing DE methods
Soneson and Robinson 2018
Olga (NBIS) scRNA-seq DE February 2019 39 46
Practicalities
Rembering the bigger picture
Stegle Teichmann and Marioni 2015
QC filtering
Cell-cycle phase
Normalization of cell-specific biases
Confounding factors incl batcheffects
Detection rate ie the fraction ofdetected genes per cell
Imputations strategies for dropoutvalues
What is pragmatic programminglanguage platform speedcollaborative workflows etc
Olga (NBIS) scRNA-seq DE February 2019 40 46
Practicalities
Staying critical
Olga (NBIS) scRNA-seq DE February 2019 41 46
Summary
What to remember from this hour
httpswwwmenticom amp 25 06 78
Olga (NBIS) scRNA-seq DE February 2019 42 46
Summary
Growing field
Angerer et al 2017
Olga (NBIS) scRNA-seq DE February 2019 43 46
Summary
Growing field
httpswwwscrna-toolsorgtools
Zappia Phipson and Oshlack 2018
Olga (NBIS) scRNA-seq DE February 2019 44 46
Summary
SummaryscRNA-seq is a rapidly growing field
DE is a common task so many newer and better methods will be developed
understanding basic statistical concepts enables one to think more like a statistician tochoose and evaluate methods given data set
staying critical staying updated staying connected
Olga (NBIS) scRNA-seq DE February 2019 45 46
Bibliography
Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1–9. ISSN 1367-4803. doi:10.1093/bioinformatics/bty329
Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243–260. ISSN 20954697. doi:10.1007/s40484-016-0089-7
Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255–261. ISSN 15487105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612
Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN 16648021. doi:10.3389/fgene.2017.00062
Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data." no. May: 1–2. ISSN 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549
Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133–145
Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85–91. ISSN 24523100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004
Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573
- Outline
- Introduction
- Common methods
  - More detailed examples
- Performance
- Practicalities
- Summary
- Bibliography
Common methods: More detailed examples
Let's stop for a minute
The key
$$\text{Outcome}_i = (\text{Model}_i) + \text{error}_i$$
- we collect data on a sample from a much larger population
- statistics lets us make inferences about the population from which the sample was derived
- we try to predict the outcome given a model fitted to the data
The key
$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$
[Figure: histogram; x-axis: height [cm], y-axis: Frequency]
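The t statistic above can be computed by hand in a few lines. The following is a minimal sketch, assuming the equal-variance (pooled) form of the two-sample t-test; the function name `pooled_t` and the height values are made up for illustration.

```python
import math
from statistics import mean, variance


def pooled_t(x1, x2):
    """Two-sample t statistic using the pooled standard deviation s_p."""
    n1, n2 = len(x1), len(x2)
    # Pooled variance: sample variances weighted by their degrees of freedom
    sp2 = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    sp = math.sqrt(sp2)
    return (mean(x1) - mean(x2)) / (sp * math.sqrt(1 / n1 + 1 / n2))


# Hypothetical heights in cm for two groups (illustration only)
group_1 = [170.0, 172.0, 174.0, 176.0]
group_2 = [166.0, 168.0, 170.0, 172.0]
t = pooled_t(group_1, group_2)
print(round(t, 4))  # 2.1909
```

The statistic would then be compared against a t distribution with n1 + n2 - 2 degrees of freedom.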
Generic recipe
- model the data, e.g. gene expression with random error
- fit the model to the data and/or the data to the model; estimate the model parameters
- use the model for prediction and/or inference
Important implication: the better the model fits the data, the better the statistics
Common distributions
Negative Binomial
[Figure: three histograms titled "Negative Binomial"; x-axis: Read Counts, y-axis: Frequency]
$$\mu = \text{mu}$$
$$\sigma^2 = \text{mu} + \frac{\text{mu}^2}{\text{size}}$$
mu is the mean expression and size is the dispersion, which is inversely related to the variance. NB fits bulk RNA-seq data very well and is used by most statistical methods designed for such data. In addition, it has been shown to fit the distribution of molecule counts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grün et al. 2014; Islam et al. 2011).
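To get a feel for the mean and variance relationship of the negative binomial, one can simulate counts and compare the sample moments with mu and mu + mu^2/size. This is a rough sketch using only the Python standard library; drawing NB counts as a gamma-Poisson mixture and the helper names `sample_poisson`, `sample_nb` and `nb_moments` are choices made here, and the parameter values are hypothetical.

```python
import math
import random
from statistics import mean, pvariance


def sample_poisson(lam, rng):
    """Poisson draw by Knuth's multiplication method (fine for small rates)."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def sample_nb(mu, size, rng):
    """Negative binomial draw as a gamma-Poisson mixture with mean mu."""
    rate = rng.gammavariate(size, mu / size)  # shape = size, scale = mu / size
    return sample_poisson(rate, rng)


def nb_moments(mu, size):
    """Theoretical mean and variance: mu and mu + mu^2 / size."""
    return mu, mu + mu ** 2 / size


rng = random.Random(1)
mu, size = 5.0, 2.0  # hypothetical mean expression and dispersion
counts = [sample_nb(mu, size, rng) for _ in range(20000)]
expected_mean, expected_var = nb_moments(mu, size)
print(expected_mean, expected_var)      # 5.0 17.5
print(mean(counts), pvariance(counts))  # sample moments, close to the line above
```

A smaller size gives a larger variance for the same mean, which is the overdispersion relative to a Poisson model.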
Zero-inflated NB
[Figure: two histograms titled "Zero-inflated NB"; x-axis: Read Counts, y-axis: Frequency]
$$\mu = \text{mu} \cdot (1 - d)$$
$$\sigma^2 = \text{mu} \cdot (1 - d) \cdot \left(1 + d \cdot \text{mu} + \frac{\text{mu}}{\text{size}}\right)$$
d is the dropout rate. The dropout rate of a gene is strongly correlated with the mean expression of the gene. Different zero-inflated negative binomial models use different relationships between mu and d, and some may fit mu and d to the expression of each gene independently. Implemented in MAST, SCDE.
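The zero-inflated moments can be checked numerically by summing over the probability mass function. The sketch below assumes the usual parameterisation in which a count is a dropout (zero) with probability d and otherwise follows an NB with mean mu and dispersion size, so the mean is mu(1 - d) and the variance is mu(1 - d)(1 + d mu + mu/size). The function names and parameter values are illustrative only.

```python
import math


def nb_pmf(x, mu, size):
    """Negative binomial probability of count x (mean mu, dispersion size)."""
    log_p = (math.lgamma(x + size) - math.lgamma(size) - math.lgamma(x + 1)
             + size * math.log(size / (size + mu))
             + x * math.log(mu / (size + mu)))
    return math.exp(log_p)


def zinb_pmf(x, mu, size, d):
    """With probability d the count is a dropout (0), otherwise it is NB."""
    return (d if x == 0 else 0.0) + (1 - d) * nb_pmf(x, mu, size)


def zinb_moments(mu, size, d):
    """Closed-form mean and variance of the zero-inflated NB."""
    return mu * (1 - d), mu * (1 - d) * (1 + d * mu + mu / size)


mu, size, d = 10.0, 2.0, 0.3  # hypothetical values
support = range(2000)          # truncated support; the tail mass is negligible
probs = [zinb_pmf(x, mu, size, d) for x in support]
num_mean = sum(x * p for x, p in zip(support, probs))
num_var = sum((x - num_mean) ** 2 * p for x, p in zip(support, probs))
closed_mean, closed_var = zinb_moments(mu, size, d)
print(closed_mean, closed_var)  # 7.0 63.0
```

The numerically summed moments `num_mean` and `num_var` agree with the closed-form values up to floating-point error.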
Poisson-Beta
[Figure: two histograms titled "Poisson-Beta"; x-axis: Read Counts, y-axis: Frequency]
$$\mu = g \cdot \frac{a}{a + b}$$
$$\sigma^2 = \mu + g^2 \cdot \frac{a \cdot b}{(a + b + 1) \cdot (a + b)^2}$$
a is the rate of activation of transcription, b is the rate of inhibition of transcription, and g is the rate of transcript production while transcription is active at the locus. Differential expression methods may test each of the parameters for differences across groups or only one (often g). Implemented in BPSC.
May be further expanded to explicitly account for other sources of gene expression differences, such as batch effect or library depth, depending on the particular DE algorithm.
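A quick way to see where the Poisson-Beta moments come from is to simulate the underlying transcription rate g * p with p drawn from Beta(a, b). The rate has mean g * a/(a + b) and variance g^2 * a * b/((a + b + 1)(a + b)^2), and by the law of total variance the Poisson sampling step contributes an additional mu to the variance of the counts. This is a sketch under those assumptions; `poisson_beta_moments` and the parameter values are hypothetical.

```python
import random
from statistics import mean, pvariance


def poisson_beta_moments(a, b, g):
    """Mean and variance of counts X ~ Poisson(g * p) with p ~ Beta(a, b)."""
    mu = g * a / (a + b)
    rate_var = g ** 2 * a * b / ((a + b + 1) * (a + b) ** 2)
    # Law of total variance: Var(X) = E[rate] + Var(rate)
    return mu, mu + rate_var


rng = random.Random(7)
a, b, g = 2.0, 3.0, 50.0  # hypothetical activation, inhibition, production rates
rates = [g * rng.betavariate(a, b) for _ in range(50000)]

mu, count_var = poisson_beta_moments(a, b, g)
print(mu, count_var)                  # 20.0 120.0
print(mean(rates), pvariance(rates))  # close to 20 and 100 (moments of the rate)
```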
MAST (revisited)
- uses a generalized linear hurdle model
- designed to account for stochastic dropouts and a bimodal expression distribution in which expression is either strongly non-zero or non-detectable
- the rate of expression Z and the level of expression Y are modeled for each gene g, with Z indicating whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
- a logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):
$$\operatorname{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$$
$$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$$
where $X_i$ is a design matrix
- model parameters are fitted using an empirical Bayesian framework
- allows for a joint estimate of nuisance and treatment effects
- DE is determined using the likelihood ratio test
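The hurdle idea can be illustrated for the simplest design, a single gene compared between two groups with no other covariates. The sketch below is not MAST itself: it leaves out the empirical Bayes regularisation and the general design matrix, and `hurdle_lrt` and the toy values are invented for illustration. It adds the likelihood ratio statistic of the discrete part (expressed or not) to that of the continuous part (Gaussian on the non-zero values) and compares the sum with a chi-square distribution with 2 degrees of freedom.

```python
import math


def _xlogy(x, y):
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def _binom_loglik(k, n):
    """Maximised Bernoulli log-likelihood for k expressing cells out of n."""
    if n == 0:
        return 0.0
    p = k / n
    return _xlogy(k, p) + _xlogy(n - k, 1 - p)


def _rss(values):
    """Residual sum of squares around the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def hurdle_lrt(group_a, group_b):
    """Two-group hurdle likelihood ratio test for one gene.

    Inputs are log-scale expression values; 0 means not detected.
    Returns (statistic, p_value) using a chi-square with 2 df.
    """
    pos_a = [y for y in group_a if y > 0]
    pos_b = [y for y in group_b if y > 0]

    # Discrete part: does the fraction of expressing cells differ?
    ll_alt = (_binom_loglik(len(pos_a), len(group_a))
              + _binom_loglik(len(pos_b), len(group_b)))
    ll_null = _binom_loglik(len(pos_a) + len(pos_b),
                            len(group_a) + len(group_b))
    stat_discrete = 2 * (ll_alt - ll_null)

    # Continuous part: does the mean of the non-zero values differ?
    stat_continuous = 0.0
    if len(pos_a) > 1 and len(pos_b) > 1:
        rss_alt = _rss(pos_a) + _rss(pos_b)
        rss_null = _rss(pos_a + pos_b)
        if rss_alt > 0:
            n_pos = len(pos_a) + len(pos_b)
            stat_continuous = n_pos * math.log(rss_null / rss_alt)

    stat = max(0.0, stat_discrete + stat_continuous)
    p_value = math.exp(-stat / 2)  # chi-square survival function for 2 df
    return stat, p_value


# Toy log-expression values for one gene in two groups of cells
group_a = [0, 0, 0, 0, 1.0, 1.2]
group_b = [3.0, 3.1, 2.9, 3.2, 3.0, 0]
stat, p_value = hurdle_lrt(group_a, group_b)
```

When too few cells express the gene, the continuous part is skipped here and the 2 df reference distribution becomes conservative; a real implementation would handle this case explicitly.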
SCDE
- models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution
- the NB distribution models the transcripts that are amplified and detected
- the Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)
- a subset of robust genes is used to fit, via the EM algorithm, the parameters of the mixture of models
- for DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach
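The mixture idea can be made concrete by asking how likely it is that an observed count came from the dropout component rather than the amplified one. The sketch below only illustrates a generic two-component mixture (the kind of quantity an EM step would compute); it is not the SCDE implementation, and the mixing weight, rates and function names are assumptions chosen for illustration.

```python
import math


def poisson_pmf(x, lam):
    """Poisson probability of count x with rate lam."""
    return math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))


def nb_pmf(x, mu, size):
    """Negative binomial probability of count x (mean mu, dispersion size)."""
    log_p = (math.lgamma(x + size) - math.lgamma(size) - math.lgamma(x + 1)
             + size * math.log(size / (size + mu))
             + x * math.log(mu / (size + mu)))
    return math.exp(log_p)


def dropout_posterior(x, weight, lam, mu, size):
    """Posterior probability that count x came from the Poisson (dropout) part.

    weight is the prior probability of the dropout component.
    """
    dropped = weight * poisson_pmf(x, lam)
    amplified = (1 - weight) * nb_pmf(x, mu, size)
    return dropped / (dropped + amplified)


# Hypothetical parameters: 30% dropout weight, background rate 0.1,
# amplified component with mean 20 and dispersion 2
post_zero = dropout_posterior(0, 0.3, 0.1, 20.0, 2.0)
post_high = dropout_posterior(50, 0.3, 0.1, 20.0, 2.0)
```

A zero count is attributed almost entirely to the dropout component, while a count of 50 is attributed to the amplified component.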
Monocle
- originally designed for ordering cells by progress through differentiation stages (pseudo-time)
- the mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as
$$g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$$
where Y is a specific gene expression level, $x_i$ are predictor variables, g is a link function (typically the log function), and $f_i$ are non-parametric functions (e.g. cubic splines)
- the observable expression level Y is then modeled using a GAM
$$E(Y) = s(\varphi_t(b_x, s_i)) + \varepsilon$$
where $\varphi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and s is a cubic smoothing function with three degrees of freedom. The error term $\varepsilon$ is normally distributed with a mean of zero
- the DE test is performed using an approximate $\chi^2$ likelihood ratio test
Performance
Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle.

| Short name | Method | Software version | Input | Available from | Reference |
| --- | --- | --- | --- | --- | --- |
| BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11] |
| D3E | D3E | D3E 1.0 | raw counts | GitHub | [12] |
| DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13] |
| DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14] |
| edgeRLRT | edgeRLRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17] |
| edgeRLRTcensus | edgeRLRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17] |
| edgeRLRTdeconv | edgeRLRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18] |
| edgeRLRTrobust | edgeRLRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19] |
| edgeRQLF | edgeRQLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20] |
| edgeRQLFDetRate | edgeRQLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20] |
| limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22] |
| MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23] |
| MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23] |
| MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23] |
| MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23] |
| metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24] |
| monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25] |
| monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26] |
| monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25] |
| NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27] |
| ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29] |
| ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29] |
| ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29] |
| SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30] |
| scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31] |
| SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32] |
| SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33] |
| ttest | t-test | stats (R v3.3) | TMM-normalized TPM | CRAN | [16, 35] |
| voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22] |
| Wilcoxon | Wilcoxon test | stats (R v3.3) | TMM-normalized TPM | CRAN | [16, 36] |

Nature Methods, doi:10.1038/nmeth.4612
No ground truth, i.e. no independently validated truth is available for testing
- Known data: using data we know something about to get positive controls
- Simulated data: null data sets by re-sampling; modeling data sets based on various distributions
- Comparing between methods and scenarios: comparing numbers of DEs, incl. as a function of group size
- Investigating results: what do the expression and distributions of detected DEs look like?
- False positives (type I error) vs. false negatives (type II error)
- Sensitivity and specificity
- Precision and recall
[Figure adapted from Wikipedia]
Dal Molin, Baruzzo and Di Camillo 2017: 2 conditions of 100 cells each, simulated with 10 000 genes, out of which 2 000 set to DEs (based on NB and bimodal distributions)
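When the truth is known, as in a simulation, these metrics follow directly from the overlap between called and true DE genes. Below is a small sketch: the gene counts mirror the simulated setting above (10 000 genes, 2 000 true DEs), while the set of called genes and the function name `de_metrics` are made up for illustration.

```python
def de_metrics(called, truth, n_genes):
    """Performance metrics from sets of called and truly DE gene ids."""
    tp = len(called & truth)       # true positives
    fp = len(called - truth)       # false positives (type I errors)
    fn = len(truth - called)       # false negatives (type II errors)
    tn = n_genes - tp - fp - fn    # true negatives
    return {
        "sensitivity": tp / (tp + fn),  # identical to recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "false_positive_rate": fp / (fp + tn),
    }


truth = set(range(2000))         # genes 0..1999 are truly DE
called = set(range(500, 2300))   # hypothetical calls: 1500 correct, 300 wrong
metrics = de_metrics(called, truth, 10000)
print(metrics["sensitivity"], metrics["specificity"])  # 0.75 0.9625
```

With many more non-DE than DE genes, specificity can look high even when precision is modest, which is why both pairs of metrics are usually reported.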
Consistency
Miao et al. 2017
And so much more
Soneson and Robinson 2018: Bias, robustness and scalability in single-cell differential expression analysis
- 36 statistical approaches for DE analysis to compare the expression levels in the two groups of cells
- based on 9 data sets with 11 - 21 separate instances (sample size effect)
- extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.
- conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets
Practicalities
Getting to know your data
Example data: 46078 genes x 96 cells; 22229 genes with no expression at all
[Figure: two histograms; x-axes: Read Counts and 0 counts, y-axis: Frequency]
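A first look at the data usually starts with counting zeros. The following sketch uses a tiny made-up count matrix (rows are genes, columns are cells) to show how one might count genes with no expression at all and the number of zero counts per gene; the variable names are illustrative only.

```python
# Tiny made-up count matrix: rows = genes, columns = cells
counts = [
    [0, 0, 0, 0],
    [5, 0, 2, 0],
    [0, 0, 0, 0],
    [10, 3, 8, 7],
]

n_cells = len(counts[0])

# Number of cells in which each gene has a zero count
zeros_per_gene = [sum(1 for c in row if c == 0) for row in counts]

# Genes with no expression in any cell carry no information for DE testing
all_zero_genes = sum(1 for z in zeros_per_gene if z == n_cells)

# Keep only genes detected in at least one cell
expressed = [row for row, z in zip(counts, zeros_per_gene) if z < n_cells]

print(zeros_per_gene)   # [4, 2, 4, 0]
print(all_zero_genes)   # 2
print(len(expressed))   # 2
```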
Choosing DE methods
Soneson and Robinson 2018
Remembering the bigger picture
Stegle, Teichmann and Marioni 2015
- QC filtering
- Cell-cycle phase
- Normalization of cell-specific biases
- Confounding factors, incl. batch effects
- Detection rate, i.e. the fraction of detected genes per cell
- Imputation strategies for dropout values
- What is pragmatic: programming language, platform, speed, collaborative workflows, etc.
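The detection rate mentioned in the list above is simple to compute and is often included as a covariate in DE models. A minimal sketch, assuming a count matrix with genes in rows and cells in columns; the function name `detection_rate` and the toy data are hypothetical.

```python
def detection_rate(counts):
    """Fraction of genes with a non-zero count in each cell.

    counts: list of rows (genes), each a list of per-cell counts.
    """
    n_genes = len(counts)
    n_cells = len(counts[0])
    return [
        sum(1 for row in counts if row[j] > 0) / n_genes
        for j in range(n_cells)
    ]


# Toy matrix: 4 genes x 4 cells
toy_counts = [
    [0, 0, 0, 0],
    [5, 0, 2, 0],
    [0, 0, 0, 0],
    [10, 3, 8, 7],
]
rates = detection_rate(toy_counts)
print(rates)  # [0.5, 0.25, 0.5, 0.25]
```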
Common methods More detailed examples
The key
Outcomei = (Modeli) + errori
we collect data on a sample from a much larger population
statistics lets us to make inferences about the population from which sample wasderived
we try to predict the outcome given a model fitted to the data
Olga (NBIS) scRNA-seq DE February 2019 20 46
Common methods More detailed examples
The key
t = x1minusx2
sp
radic1
n1+ 1
n2
height [cm]
Fre
quen
cy
165 170 175 180
010
3050
Olga (NBIS) scRNA-seq DE February 2019 21 46
Common methods More detailed examples
Generic recipemodel data eg gene expressionfit model to the data andor data to the modelestimate model parametersuse model for prediction andor inference
Olga (NBIS) scRNA-seq DE February 2019 22 46
Common methods More detailed examples
Generic recipemodel eg gene expression with random errorfit model to the data andor data to the model estimate modelparametersuse model for prediction andor inference
Important implicationthe better model fits to the data the better statistics
Olga (NBIS) scRNA-seq DE February 2019 23 46
Common methods More detailed examples
Common distributions
Negative Binomial
Read Counts
Fre
quen
cy
0 5 10 15 20
010
020
030
040
050
0
Negative Binomial
Read Counts
Fre
quen
cy
0 2 4 6 8 10
050
100
150
Negative Binomial
Read Counts
Fre
quen
cy
0 2 4 6 8 10 12
050
100
150
200
micro = mu
δ2 = mu + mu2
size
mu mean expression size and the dispersion which is inversely related to the variance NB fits bulk RNA-seq data very welland it is used for most statistical methods designed for such data In addition it has been show to fit the distribution of moleculecounts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grun et al 2014 Islam et al 2011)
Olga (NBIS) scRNA-seq DE February 2019 24 46
Common methods More detailed examples
Common distributions
Zerominusinflated NB
Read Counts
Fre
quen
cy
0 5 10 15 20
020
040
0
Zerominusinflated NB
Read Counts
Fre
quen
cy
0 10 20 30 40
010
020
0micro = mu lowast (1 minus d)
δ2 = micro lowast (1 minus d) lowast (1 + d lowast micro + microsize)
d dropout rate The dropout rate of a gene is strongly correlated with the mean expression of the gene Different zero-inflatednegative binomial models use different relationships between mu and d and some may fit mu and d to the expression of eachgene independently Implemented in MAST SCDE
Olga (NBIS) scRNA-seq DE February 2019 25 46
Common methods More detailed examples
Common distributions
PoissonminusBeta
Read Counts
Fre
quen
cy
0 20 60 100
020
040
0
PoissonminusBeta
Read Counts
Fre
quen
cy
0 20 60 100 140
050
150
micro = g lowast a(a + b)
δ2 = g2 lowast a lowast b((a + b + 1) lowast (a + b)2)
a the rate of activation of transcription b the rate of inhibition of transcription and g the rate of transcript production whiletranscription is active at the locus Differential expression methods may test each of the parameters for differences across groupsor only one (often g) Implemented in BPSCMay be further expanded to explicitly account for other sources of gene expression differences such as batch-effect or librarydepth depending on the particular DE algorithm
Olga (NBIS) scRNA-seq DE February 2019 26 46
Common methods More detailed examples
MAST (revisited)
uses generalized linear hurdle model
designed to account for stochastic dropouts and bimodal expression distribution in whichexpression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene gindicating whether gene g is expressed in cell i (ie Zig = 0 if yig = 0 and zig = 1 ifyig gt 0)
A logistic regression model for the discrete variable Z and a Gaussian linear model for thecontinuous variable (Y|Z=1)
logit(Pr (Zig = 1)) = XiβDg
Pr (Yig = Y |Zig = 1) = N(XiβCg σ
2g) where Xi is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
Olga (NBIS) scRNA-seq DE February 2019 27 46
Common methods More detailed examples
SCDE
models the read counts for each gene using a mixture of a NB negative binomial and aPoisson distribution
NB distribution models the transcripts that are amplified and detected
Poisson distribution models the unobserved or background-level signal of transcripts thatare not amplified (eg dropout events)
subset of robust genes is used to fit via EM algorithm the parameters to the mixture ofmodels
For DE the posterior probability that the gene shows a fold expression difference betweentwo conditions is computed using a Bayesian approach
Olga (NBIS) scRNA-seq DE February 2019 28 46
Common methods More detailed examples
Monocole
Originally designed for ordering cells by progress through differentiation stages(pseudo-time)
The mean expression level of each gene is modeled with a GAM generalized additivemodel which relates one or more predictor variables to a response variable as
g(E(Y )) = β0 + f1(x1) + f2(x2) + + fm(xm) where Y is a specific gene expression level xi arepredictor variables g is a link function typically log function and fi are non-parametric functions
(eg cubic splines)
The observable expression level Y is then modeled using GAM
E(Y ) = s(ϕt (bx si )) + ε where ϕt (bx si ) is the assigned pseudo-time of a cell and s is a cubicsmoothing function with three degrees of freedom The error term ε is normally distributed with amean of zero
The DE test is performed using an approx χ2 likelihood ratio test
Olga (NBIS) scRNA-seq DE February 2019 29 46
Performance
Performance
Olga (NBIS) scRNA-seq DE February 2019 30 46
Supplementary Table 2 Evaluated dicrarrerential expression methods together with package versions and thetype of input values provided to each of them Note that ldquoraw countsrdquo here refers to length-scaled TPMswhich are on the scale of the raw counts but are unacrarrected by dicrarrerential isoform usage [10] CPM valuesare calculated with edgeR and Census counts with monocle
Short name Method Software version InputAvailablefrom
Reference
BPSC BPSC BPSC 09901 CPM GitHub [11]
D3E D3E D3E 10 raw counts GitHub [12]
DESeq2 DESeq2 DESeq2 1141 raw counts Bioconductor [13]
DESeq2betapFALSE DESeq2 without beta prior DESeq2 1141 raw counts Bioconductor [13]
DESeq2census DESeq2 DESeq2 1141 Census counts Bioconductor [13]
DESeq2nofiltDESeq2 without the built-in in-dependent filtering
DESeq2 1141 raw counts Bioconductor [13]
DEsingle DEsingle DEsingle 010 raw counts GitHub [14]
edgeRLRT edgeRLRT edgeR 3191 raw counts Bioconductor [15ndash17]
edgeRLRTcensus edgeRLRT edgeR 3191 Census counts Bioconductor [15ndash17]
edgeRLRTdeconvedgeRLRT with deconvolutionnormalization
edgeR 3191scran 120
raw counts Bioconductor [15 17 18]
edgeRLRTrobustedgeRLRT with robust disper-sion estimation
edgeR 3191 raw counts Bioconductor [15ndash17 19]
edgeRQLF edgeRQLF edgeR 3191 raw counts Bioconductor [15 16 20]
edgeRQLFDetRateedgeRQLF with cellular detec-tion rate as covariate
edgeR 3191 raw counts Bioconductor [15 16 20]
limmatrend limma-trend limma 33013 log2(CPM) Bioconductor [21 22]
MASTcpm MAST MAST 105 log2(CPM+1) Bioconductor [23]
MASTcpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(CPM+1) Bioconductor [23]
MASTtpm MAST MAST 105 log2(TPM+1) Bioconductor [23]
MASTtpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(TPM+1) Bioconductor [23]
metagenomeSeq metagenomeSeqmetagenomeSeq1160
raw counts Bioconductor [24]
monocle monocle (tobit) monocle 220 TPM Bioconductor [25]
monoclecensus monocle (Negative Binomial) monocle 220 Census counts Bioconductor [25 26]
monoclecount monocle (Negative Binomial) monocle 220 raw counts Bioconductor [25]
NODES NODESNODES0009010
raw countsAuthor-providedlink
[27]
ROTScpm ROTS ROTS 120 CPM Bioconductor [28 29]
ROTStpm ROTS ROTS 120 TPM Bioconductor [28 29]
ROTSvoom ROTS ROTS 120voom-transformedraw counts
Bioconductor [28 29]
SAMseq SAMseq samr 20 raw counts CRAN [30]
scDD scDD scDD 100 raw counts Bioconductor [31]
SCDE SCDE scde 220 raw counts Bioconductor [32]
SeuratBimod Seurat (bimod test) Seurat 1407 raw counts GitHub [33 34]
SeuratBimodnofiltSeurat (bimod test) without theinternal filtering
Seurat 1407 raw counts GitHub [33 34]
SeuratBimodIsExpr2Seurat (bimod test) with internalexpression threshold set to 2
Seurat 1407 raw counts GitHub [33 34]
SeuratTobit Seurat (tobit test) Seurat 1407 TPM GitHub [25 33]
ttest t-test stats (R v 33)TMM-normalizedTPM
CRAN [16 35]
voomlimma voom-limma limma 33013 raw counts Bioconductor [21 22]
Wilcoxon Wilcoxon test stats (R v 33)TMM-normalizedTPM
CRAN [16 36]
3
Nature Methods doi101038nmeth4612
Performance
Olga (NBIS) scRNA-seq DE February 2019 31 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
False positives (type I error) vs false negatives (type II error)Sensitivity and specificityPrecision and recall
adapted from Wikipedia
Olga (NBIS) scRNA-seq DE February 2019 33 46
Performance
False positives (type I error) vs false negatives (type II error)Sensitivity and specificityPrecision and recall
Dal Molin Baruzzo and Di Camillo 2017 2 conditions of 100 cells each simulated with 10 000genes out of which 2 000 set to DEs (based on NB and bimodal distributions)
Olga (NBIS) scRNA-seq DE February 2019 34 46
Performance
Consistency
Miao et al 2017
Olga (NBIS) scRNA-seq DE February 2019 35 46
Performance
And so much more
Soneson and Robinson 2018
Bias robustness and scalability in single-celldifferential expression analysis
36 statistical approaches for DE analysisto compare the expression levels in thetwo groups of cells
based on 9 data sets with 11 - 21separate instances (sample size effect)
extensive evaluation metrics incl numberof genes found characteristics of the falsepositive detections robustness ofmethods similarities between methodsetc
conquer a collection of consistentlyprocessed analysis-ready publicscRNA-seq data sets
Olga (NBIS) scRNA-seq DE February 2019 36 46
Practicalities
Practicalities
Olga (NBIS) scRNA-seq DE February 2019 37 46
Practicalities
Getting to know your dataExample data 46078 genes x 96 cells22229 genes with no expression at all
Read Counts
Fre
quen
cy
0 500 1000 1500
050
0015
000
0 counts
Fre
quen
cy
0 20 40 60 800
2000
4000
6000
Olga (NBIS) scRNA-seq DE February 2019 38 46
Practicalities
Choosing DE methods
Soneson and Robinson 2018
Olga (NBIS) scRNA-seq DE February 2019 39 46
Practicalities
Rembering the bigger picture
Stegle Teichmann and Marioni 2015
QC filtering
Cell-cycle phase
Normalization of cell-specific biases
Confounding factors incl batcheffects
Detection rate ie the fraction ofdetected genes per cell
Imputations strategies for dropoutvalues
What is pragmatic programminglanguage platform speedcollaborative workflows etc
Olga (NBIS) scRNA-seq DE February 2019 40 46
Practicalities
Staying critical
Olga (NBIS) scRNA-seq DE February 2019 41 46
Summary
What to remember from this hour
httpswwwmenticom amp 25 06 78
Olga (NBIS) scRNA-seq DE February 2019 42 46
Summary
Growing field
Angerer et al 2017
Olga (NBIS) scRNA-seq DE February 2019 43 46
Summary
Growing field
httpswwwscrna-toolsorgtools
Zappia Phipson and Oshlack 2018
Olga (NBIS) scRNA-seq DE February 2019 44 46
Summary
SummaryscRNA-seq is a rapidly growing field
DE is a common task so many newer and better methods will be developed
understanding basic statistical concepts enables one to think more like a statistician tochoose and evaluate methods given data set
staying critical staying updated staying connected
Olga (NBIS) scRNA-seq DE February 2019 45 46
Bibliography
Wu Zhijin et al 2017 ldquoTwo-phase differential expression analysis for single cell RNA-seqrdquoBioinformatics 00 (00) 1ndash9 ISSN 1367-4803 doi101093bioinformaticsbty329
Miao Zhun and Xuegong Zhang 2016 ldquoDifferential expression analyses for single-cellRNA-Seq old questions on new datardquo Quantitative Biology 4 (4) 243ndash260 ISSN 20954697doi101007s40484-016-0089-7
Soneson Charlotte and Mark D Robinson 2018 ldquoBias robustness and scalability in single-celldifferential expression analysisrdquo Nature Methods 15 (4) 255ndash261 ISSN 15487105doi101038nmeth4612 httpdxdoiorg101038nmeth4612
Dal Molin Alessandra Giacomo Baruzzo and Barbara Di Camillo 2017 ldquoSingle-cellRNA-sequencing Assessment of differential expression analysis methodsrdquo Frontiers inGenetics 8 (MAY) ISSN 16648021 doi103389fgene201700062
Miao Zhun et al 2017 ldquoDEsingle for detecting three types of differential expression in single-cellRNA-seq datardquo no May 1ndash2 ISSN 1367-4803 doi101093bioinformaticsbty332arXiv 103549
Stegle Oliver Sarah A Teichmann and John C Marioni 2015 ldquoComputational and analyticalchallenges in single-cell transcriptomicsrdquo Nature reviews Genetics 16 (January 2014)133ndash145
Angerer Philipp et al 2017 ldquoSingle cells make big data New challenges and opportunities intranscriptomicsrdquo Current Opinion in Systems Biology 485ndash91 ISSN 24523100doi101016jcoisb201707004httpdxdoiorg101016jcoisb201707004
Zappia Luke Belinda Phipson and Alicia Oshlack 2018 ldquoExploring the single-cell RNA-seqanalysis landscape with the scRNA-tools databaserdquo bioRxiv 206573 doi101101206573httpswwwbiorxivorgcontentearly20180323206573
Common methods More detailed examples
The key
$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$
Figure: histogram of height [cm] (x-axis) against frequency (y-axis)
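The t statistic on this slide can be computed directly. The following is a minimal sketch, assuming that s_p denotes the pooled standard deviation of the two groups; the function name `pooled_t` and the height values are hypothetical and only serve as an illustration.

```python
import math
from statistics import mean, variance

def pooled_t(x1, x2):
    """Two-sample t statistic using the pooled standard deviation s_p."""
    n1, n2 = len(x1), len(x2)
    # Pooled variance: the two sample variances weighted by their degrees of freedom.
    sp2 = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    return (mean(x1) - mean(x2)) / (math.sqrt(sp2) * math.sqrt(1 / n1 + 1 / n2))

# Hypothetical heights in cm for two groups (illustrative values only).
group1 = [170.0, 172.0, 174.0, 176.0, 178.0]
group2 = [165.0, 167.0, 169.0, 171.0, 173.0]
t = pooled_t(group1, group2)
```

The larger the difference between the group means relative to the pooled spread, the larger the absolute value of t.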
Generic recipe
model the data, e.g. gene expression with random error
fit the model to the data and/or the data to the model
estimate the model parameters
use the model for prediction and/or inference
Important implication: the better the model fits the data, the better the statistics
Common distributions
Negative Binomial
Figure: three histograms of negative binomial read counts (x-axis: read counts, y-axis: frequency)
$$\mu = \mathrm{mu}$$
$$\sigma^2 = \mathrm{mu} + \frac{\mathrm{mu}^2}{\mathrm{size}}$$
mu is the mean expression and size is the dispersion parameter, which is inversely related to the variance. NB fits bulk RNA-seq data very well and is used by most statistical methods designed for such data. In addition, it has been shown to fit the distribution of molecule counts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grün et al. 2014; Islam et al. 2011).
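The mean and variance of the negative binomial can be checked numerically. The sketch below assumes the mean/size parameterisation described on the slide (variance = mu + mu²/size); the helper names `nb_pmf`, `moments` and `size_from_moments` are hypothetical. It also shows the "estimate model parameters" step of the generic recipe in its simplest form, a method-of-moments estimate of size.

```python
import math

def nb_pmf(k, mu, size):
    """Negative binomial probability of count k, parameterised by mean (mu) and size."""
    log_p = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
             + size * math.log(size / (size + mu))
             + k * math.log(mu / (size + mu)))
    return math.exp(log_p)

def moments(pmf, upper=2000):
    """Mean and variance of a count distribution by direct summation up to `upper`."""
    probs = [pmf(k) for k in range(upper + 1)]
    mean = sum(k * p for k, p in enumerate(probs))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return mean, var

def size_from_moments(mean, var):
    """Method-of-moments estimate of size; only defined when var > mean."""
    return mean ** 2 / (var - mean)

mu, size = 5.0, 2.0
nb_mean, nb_var = moments(lambda k: nb_pmf(k, mu, size))
formula_var = mu + mu ** 2 / size
size_hat = size_from_moments(nb_mean, nb_var)
```

A small size gives a variance far above the mean (strong overdispersion); as size grows, the variance approaches the mean and the distribution approaches a Poisson.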
Common distributions
Zero-inflated NB
Figure: two histograms of zero-inflated negative binomial read counts (x-axis: read counts, y-axis: frequency)
$$\mu = \mathrm{mu} \cdot (1 - d)$$
$$\sigma^2 = \mathrm{mu} \cdot (1 - d) \cdot \left(1 + d \cdot \mathrm{mu} + \frac{\mathrm{mu}}{\mathrm{size}}\right)$$
d is the dropout rate. The dropout rate of a gene is strongly correlated with the mean expression of the gene. Different zero-inflated negative binomial models use different relationships between mu and d, and some may fit mu and d to the expression of each gene independently. Implemented in MAST and SCDE.
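The effect of the dropout rate on the moments can be verified by direct summation over the probability mass function. This is a sketch under the assumption that mu and size are the parameters of the underlying negative binomial and that a dropout forces the count to zero with probability d; the helper names are hypothetical.

```python
import math

def nb_pmf(k, mu, size):
    """Negative binomial probability of count k, parameterised by mean (mu) and size."""
    log_p = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
             + size * math.log(size / (size + mu))
             + k * math.log(mu / (size + mu)))
    return math.exp(log_p)

def zinb_pmf(k, mu, size, d):
    """Zero-inflated NB: with probability d the count is a dropout (forced to 0)."""
    return (d if k == 0 else 0.0) + (1 - d) * nb_pmf(k, mu, size)

mu, size, d = 10.0, 2.0, 0.3
probs = [zinb_pmf(k, mu, size, d) for k in range(5001)]

# Moments obtained by summing over the probability mass function.
zinb_mean = sum(k * p for k, p in enumerate(probs))
zinb_var = sum((k - zinb_mean) ** 2 * p for k, p in enumerate(probs))

# Closed-form moments, written in terms of the NB mean mu.
formula_mean = mu * (1 - d)
formula_var = mu * (1 - d) * (1 + d * mu + mu / size)
```

Dropouts lower the mean but inflate the variance relative to it, which is why a plain NB can fit zero-heavy genes poorly.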
Common distributions
Poisson-Beta
Figure: two histograms of Poisson-Beta read counts (x-axis: read counts, y-axis: frequency)
$$\mu = g \cdot \frac{a}{a + b}$$
$$\sigma^2 = g^2 \cdot \frac{a \cdot b}{(a + b + 1) \cdot (a + b)^2}$$
a is the rate of activation of transcription, b is the rate of inhibition of transcription, and g is the rate of transcript production while transcription is active at the locus. Differential expression methods may test each of the parameters for differences across groups, or only one (often g). Implemented in BPSC.
May be further expanded to explicitly account for other sources of gene expression differences, such as batch effect or library depth, depending on the particular DE algorithm.
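The Poisson-Beta model can be illustrated by simulation. The sketch below assumes the usual two-stage reading of the model: a Beta(a, b) draw p per cell, followed by a Poisson count with rate g·p. Under that assumption the variance formula on the slide is the variance of the rate g·p; by the law of total variance, the simulated counts carry an additional Poisson term equal to the mean. The function names, the seed and the parameter values are hypothetical.

```python
import math
import random

def poisson_sample(lam, rng):
    """Draw one Poisson(lam) value with Knuth's multiplication method (fine for small lam)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def simulate_poisson_beta(a, b, g, n_cells, seed=1):
    """Per cell: draw p ~ Beta(a, b), then a count ~ Poisson(g * p)."""
    rng = random.Random(seed)
    return [poisson_sample(g * rng.betavariate(a, b), rng) for _ in range(n_cells)]

a, b, g = 2.0, 3.0, 20.0
counts = simulate_poisson_beta(a, b, g, n_cells=20000)

sample_mean = sum(counts) / len(counts)
sample_var = sum((c - sample_mean) ** 2 for c in counts) / (len(counts) - 1)

expected_mean = g * a / (a + b)
rate_var = g ** 2 * a * b / ((a + b + 1) * (a + b) ** 2)  # variance of the rate g * p
expected_count_var = expected_mean + rate_var             # plus Poisson noise on the counts
```

Small a and b push p towards 0 or 1 and produce the bimodal, on/off expression pattern that motivates this model.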
MAST (revisited)
uses a generalized linear hurdle model
designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene g, with Z indicating whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
A logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):
$$\mathrm{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$$
$$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$$
where $X_i$ is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
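The two-part structure of a hurdle model can be sketched for a single gene and two groups of cells. This is not the MAST implementation: it has no empirical Bayes shrinkage and no covariates beyond group membership, and it assumes that the likelihood ratio statistics of the discrete and continuous parts can be added, with their degrees of freedom. The function `hurdle_lrt` and the example values are hypothetical.

```python
import math

def hurdle_lrt(group_a, group_b):
    """Likelihood ratio test for a group effect in a two-part (hurdle) model.

    Inputs are per-cell log-expression values of one gene; 0 means not detected.
    """
    pos_a = [y for y in group_a if y > 0]
    pos_b = [y for y in group_b if y > 0]

    def binom_loglik(k, n, p):
        # Log-likelihood of k detections in n cells, skipping terms with zero count.
        ll = 0.0
        if k > 0:
            ll += k * math.log(p)
        if n - k > 0:
            ll += (n - k) * math.log(1 - p)
        return ll

    # Discrete part: is the detection rate Pr(Z = 1) the same in both groups?
    n_a, n_b, k_a, k_b = len(group_a), len(group_b), len(pos_a), len(pos_b)
    p_pooled = (k_a + k_b) / (n_a + n_b)
    ll_alt = binom_loglik(k_a, n_a, k_a / n_a) + binom_loglik(k_b, n_b, k_b / n_b)
    ll_null = binom_loglik(k_a, n_a, p_pooled) + binom_loglik(k_b, n_b, p_pooled)
    stat, df = 2 * (ll_alt - ll_null), 1

    # Continuous part: Gaussian model for Y | Z = 1, one mean per group vs a shared mean.
    if len(pos_a) > 1 and len(pos_b) > 1:
        def rss(values, centre):
            return sum((v - centre) ** 2 for v in values)
        pooled = pos_a + pos_b
        rss_null = rss(pooled, sum(pooled) / len(pooled))
        rss_alt = rss(pos_a, sum(pos_a) / len(pos_a)) + rss(pos_b, sum(pos_b) / len(pos_b))
        if rss_alt > 0:
            stat += len(pooled) * math.log(rss_null / rss_alt)
            df += 1

    stat = max(stat, 0.0)  # guard against tiny negative values from rounding
    # Chi-square tail probability, available in closed form for 1 or 2 degrees of freedom.
    p_value = math.erfc(math.sqrt(stat / 2)) if df == 1 else math.exp(-stat / 2)
    return stat, df, p_value

# Hypothetical log-expression values for one gene in two groups of six cells.
low = [0, 0, 0, 0, 1.0, 1.2]
high = [2.0, 2.5, 3.0, 3.5, 0, 4.0]
stat, df, p_value = hurdle_lrt(low, high)
```

A gene can be called DE because it is detected in more cells, because it is expressed at a higher level when detected, or both; the combined statistic captures either kind of difference.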
SCDE
models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution
the NB distribution models the transcripts that are amplified and detected
the Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)
a subset of robust genes is used to fit, via the EM algorithm, the parameters of the mixture of models
For DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach
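The mixture idea can be illustrated with a single posterior calculation. The sketch below assumes that the mixture parameters are already known (SCDE estimates them with EM, which is not shown) and asks how likely an observed count is to come from the Poisson dropout component rather than the NB component. The function names and all parameter values, including the background rate of 0.1 and the mixing weight of 0.3, are hypothetical.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability of count k with rate lam."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def nb_pmf(k, mu, size):
    """Negative binomial probability of count k, parameterised by mean (mu) and size."""
    log_p = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
             + size * math.log(size / (size + mu))
             + k * math.log(mu / (size + mu)))
    return math.exp(log_p)

def dropout_posterior(k, weight, lam, mu, size):
    """Posterior probability that count k was generated by the Poisson (dropout) component."""
    drop = weight * poisson_pmf(k, lam)
    amplified = (1 - weight) * nb_pmf(k, mu, size)
    return drop / (drop + amplified)

# A zero count is almost certainly a dropout; a count of 15 almost certainly is not.
p_zero = dropout_posterior(0, weight=0.3, lam=0.1, mu=20.0, size=2.0)
p_fifteen = dropout_posterior(15, weight=0.3, lam=0.1, mu=20.0, size=2.0)
```

In an EM fit, posteriors of this kind form the expectation step, and the component parameters are then re-estimated with each observation weighted by them.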
Monocle
Originally designed for ordering cells by progress through differentiation stages (pseudo-time)
The mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as
$$g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$$
where Y is a specific gene expression level, $x_i$ are predictor variables, g is a link function (typically the log function), and $f_i$ are non-parametric functions (e.g. cubic splines)
The observable expression level Y is then modeled using a GAM:
$$E(Y) = s(\varphi_t(b_x, s_i)) + \epsilon$$
where $\varphi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and s is a cubic smoothing function with three degrees of freedom. The error term $\epsilon$ is normally distributed with a mean of zero
The DE test is performed using an approximate $\chi^2$ likelihood ratio test
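The likelihood ratio test named on this slide can be written out as a short sketch. The symbols $\ell_{\text{full}}$, $\ell_{\text{reduced}}$ and $k$ are notation introduced here for illustration and do not come from the slides; the reduced model is assumed to drop the smooth pseudo-time term and keep only an intercept.

$$\Lambda = 2\,\bigl(\ell_{\text{full}} - \ell_{\text{reduced}}\bigr), \qquad \Lambda \;\overset{\text{approx.}}{\sim}\; \chi^2_k$$

Here $\ell_{\text{full}}$ is the maximized log-likelihood of the model with the smooth term $s(\cdot)$, $\ell_{\text{reduced}}$ is that of the model without it, and $k$ is the difference in degrees of freedom between the two fits. A gene is called DE along pseudo-time when $\Lambda$ is large relative to the $\chi^2_k$ distribution.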
Performance
Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that “raw counts” here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle. Source: Nature Methods, doi:10.1038/nmeth.4612.

| Short name | Method | Software version | Input | Available from | Reference |
|---|---|---|---|---|---|
| BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11] |
| D3E | D3E | D3E 1.0 | raw counts | GitHub | [12] |
| DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13] |
| DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14] |
| edgeRLRT | edgeRLRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17] |
| edgeRLRTcensus | edgeRLRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17] |
| edgeRLRTdeconv | edgeRLRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18] |
| edgeRLRTrobust | edgeRLRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19] |
| edgeRQLF | edgeRQLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20] |
| edgeRQLFDetRate | edgeRQLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20] |
| limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22] |
| MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23] |
| MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23] |
| MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23] |
| MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23] |
| metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24] |
| monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25] |
| monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26] |
| monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25] |
| NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27] |
| ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29] |
| ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29] |
| ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29] |
| SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30] |
| scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31] |
| SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32] |
| SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33] |
| ttest | t-test | stats (R v3.3) | TMM-normalized TPM | CRAN | [16, 35] |
| voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22] |
| Wilcoxon | Wilcoxon test | stats (R v3.3) | TMM-normalized TPM | CRAN | [16, 36] |
No ground truth, i.e. no independently validated truth is available for testing
Known data
using data we know something about to get positive controls
Simulated data
null data sets by re-sampling
modeling data sets based on various distributions
Comparing between methods and scenarios
Comparing numbers of DEs, incl. as a function of group size
Investigating results
What do the expression and distributions of the detected DEs look like?
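A null data set by re-sampling can be sketched as follows: cells from one homogeneous population are split at random into two mock groups, so that any gene called DE between them counts as a false positive. The function `make_null_split`, the cell identifiers and the group size are hypothetical.

```python
import random

def make_null_split(cell_ids, group_size, seed=0):
    """Randomly assign cells from one population to two mock groups of equal size."""
    rng = random.Random(seed)
    chosen = rng.sample(cell_ids, 2 * group_size)  # sampling without replacement
    return chosen[:group_size], chosen[group_size:]

# 96 hypothetical cells from a single condition, split into two groups of 24.
cells = [f"cell_{i}" for i in range(96)]
mock_a, mock_b = make_null_split(cells, group_size=24)
```

Repeating the split with different seeds and group sizes gives several null instances, which makes it possible to see how the number of false detections of a method varies with sample size.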
False positives (type I error) vs false negatives (type II error)
Sensitivity and specificity
Precision and recall
Figure: adapted from Wikipedia
Figure: Dal Molin, Baruzzo and Di Camillo 2017; 2 conditions of 100 cells each, simulated with 10 000 genes, out of which 2 000 were set to be DE (based on NB and bimodal distributions)
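The performance measures listed above all derive from the four cells of a confusion matrix. The sketch below uses hypothetical counts for a simulation with 10 000 genes, of which 2 000 are truly DE; the function name and the numbers are illustrative only and are not results from any of the cited studies.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard performance measures from true/false positive and negative counts."""
    return {
        "sensitivity": tp / (tp + fn),          # also called recall or true positive rate
        "specificity": tn / (tn + fp),          # true negative rate
        "precision": tp / (tp + fp),            # fraction of called DE genes that are truly DE
        "false_positive_rate": fp / (fp + tn),  # type I error rate, 1 - specificity
        "false_negative_rate": fn / (fn + tp),  # type II error rate, 1 - sensitivity
    }

# Hypothetical outcome: 2 000 truly DE genes and 8 000 non-DE genes.
metrics = classification_metrics(tp=1500, fp=400, fn=500, tn=7600)
```

Because non-DE genes usually far outnumber DE genes, a method can have high specificity and still low precision, so it helps to look at both pairs of measures.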
Consistency
Figure: Miao et al. 2017
And so much more
Soneson and Robinson 2018, “Bias, robustness and scalability in single-cell differential expression analysis”
36 statistical approaches for DE analysis to compare the expression levels in the two groups of cells
based on 9 data sets with 11 - 21 separate instances (sample size effect)
extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.
conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets
Practicalities
Getting to know your data
Example data: 46078 genes x 96 cells
22229 genes with no expression at all
Figure: two histograms; read counts against frequency, and number of 0 counts against frequency
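Counting zeros is a simple first look at a count matrix. The sketch below assumes the matrix is stored as a list of rows, one per gene, each holding the counts per cell; the function name `zero_summary` and the toy matrix are hypothetical.

```python
def zero_summary(counts):
    """Return the number of zero counts per gene and the number of genes that are zero in every cell."""
    n_cells = len(counts[0])
    zeros_per_gene = [sum(1 for c in row if c == 0) for row in counts]
    n_unexpressed = sum(1 for z in zeros_per_gene if z == n_cells)
    return zeros_per_gene, n_unexpressed

# Toy matrix: 4 genes (rows) x 3 cells (columns).
toy_counts = [
    [0, 0, 0],    # never detected
    [5, 0, 12],
    [0, 0, 3],
    [7, 9, 11],   # detected in every cell
]
zeros_per_gene, n_unexpressed = zero_summary(toy_counts)
```

Genes with no expression in any cell carry no information for DE testing and are typically filtered out before fitting any model.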
Choosing DE methods
Figure: Soneson and Robinson 2018
Remembering the bigger picture
Stegle, Teichmann and Marioni 2015
QC filtering
Cell-cycle phase
Normalization of cell-specific biases
Confounding factors, incl. batch effects
Detection rate, i.e. the fraction of detected genes per cell
Imputation strategies for dropout values
What is pragmatic: programming language, platform, speed, collaborative workflows, etc.
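The detection rate mentioned in the list above can be computed per cell in a few lines. This is a minimal sketch, assuming a gene counts as detected in a cell when its count is greater than zero and that the matrix is stored as genes in rows and cells in columns; the function name and the toy matrix are hypothetical.

```python
def detection_rate(counts):
    """Fraction of genes with a non-zero count, computed separately for each cell (column)."""
    n_genes = len(counts)
    n_cells = len(counts[0])
    return [
        sum(1 for gene in range(n_genes) if counts[gene][cell] > 0) / n_genes
        for cell in range(n_cells)
    ]

# Toy matrix: 4 genes (rows) x 3 cells (columns).
toy_counts = [
    [0, 0, 0],
    [5, 0, 12],
    [0, 0, 3],
    [7, 9, 11],
]
rates = detection_rate(toy_counts)
```

The resulting per-cell values can be supplied as a covariate to methods that accept one, as in the DetRate variants of edgeRQLF and MAST listed in Supplementary Table 2.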
Staying critical
Summary
What to remember from this hour
https://www.menti.com & 25 06 78
Growing field
Angerer et al. 2017
https://www.scrna-tools.org/tools
Zappia, Phipson and Oshlack 2018
Common methods More detailed examples
Generic recipemodel eg gene expression with random errorfit model to the data andor data to the model estimate modelparametersuse model for prediction andor inference
Important implicationthe better model fits to the data the better statistics
Olga (NBIS) scRNA-seq DE February 2019 23 46
Common methods More detailed examples
Common distributions
Negative Binomial
Read Counts
Fre
quen
cy
0 5 10 15 20
010
020
030
040
050
0
Negative Binomial
Read Counts
Fre
quen
cy
0 2 4 6 8 10
050
100
150
Negative Binomial
Read Counts
Fre
quen
cy
0 2 4 6 8 10 12
050
100
150
200
micro = mu
δ2 = mu + mu2
size
mu mean expression size and the dispersion which is inversely related to the variance NB fits bulk RNA-seq data very welland it is used for most statistical methods designed for such data In addition it has been show to fit the distribution of moleculecounts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grun et al 2014 Islam et al 2011)
Olga (NBIS) scRNA-seq DE February 2019 24 46
Common distributions: Zero-inflated NB
[Figure: two histograms of zero-inflated negative binomial read counts; x-axis: Read Counts, y-axis: Frequency]
$\mu = \mathrm{mu} \cdot (1 - d)$
$\sigma^2 = \mathrm{mu} \cdot (1 - d) \cdot (1 + d \cdot \mathrm{mu} + \mathrm{mu} / \mathrm{size})$
d is the dropout rate. The dropout rate of a gene is strongly correlated with the mean expression of the gene. Different zero-inflated negative binomial models use different relationships between mu and d, and some may fit mu and d to the expression of each gene independently. Implemented in MAST and SCDE.
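The zero-inflated formulas follow from treating the counts as a mixture: zero with probability d, negative binomial otherwise. The sketch below compares the closed form with a derivation from the mixture moments; function names and example values are assumptions made for this illustration.

```python
def zinb_moments(mu, size, d):
    """Closed-form mean and variance of a zero-inflated NB with dropout rate d."""
    mean = mu * (1 - d)
    var = mu * (1 - d) * (1 + d * mu + mu / size)
    return mean, var

def zinb_moments_from_mixture(mu, size, d):
    """Same quantities, derived from the moments of the two mixture components."""
    nb_var = mu + mu ** 2 / size
    second_moment = (1 - d) * (nb_var + mu ** 2)  # the zero component contributes nothing
    mean = (1 - d) * mu
    return mean, second_moment - mean ** 2

closed_form = zinb_moments(mu=4.0, size=2.0, d=0.25)
from_mixture = zinb_moments_from_mixture(mu=4.0, size=2.0, d=0.25)
```

Setting d = 0 recovers the plain negative binomial mean and variance.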
Common distributions: Poisson-Beta
[Figure: two histograms of Poisson-Beta read counts; x-axis: Read Counts, y-axis: Frequency]
$\mu = g \cdot a / (a + b)$
$\sigma^2 = g^2 \cdot a \cdot b / ((a + b + 1) \cdot (a + b)^2)$
a is the rate of activation of transcription, b is the rate of inhibition of transcription, and g is the rate of transcript production while transcription is active at the locus. Note that this $\sigma^2$ is the variance of the Beta-distributed transcription rate; the variance of the Poisson-distributed counts is larger by $\mu$. Differential expression methods may test each of the parameters for differences across groups, or only one (often g). Implemented in BPSC.
These distributions may be further expanded to explicitly account for other sources of gene expression differences, such as batch effect or library depth, depending on the particular DE algorithm.
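A small simulation makes the Poisson-Beta model concrete: each cell draws an "active fraction" from Beta(a, b) and then a Poisson count with rate g times that fraction. This is a sketch under that assumption; the function names, the seed and the parameter values (a = b = 1, g = 12) are illustrative, and the Poisson sampler is a simple textbook method rather than anything used by BPSC.

```python
import math
import random

def poisson_beta_moments(a, b, g):
    """Mean, variance of the Beta-distributed rate, and variance of the counts."""
    mean = g * a / (a + b)
    rate_var = g ** 2 * a * b / ((a + b + 1) * (a + b) ** 2)
    count_var = mean + rate_var  # law of total variance adds the Poisson term
    return mean, rate_var, count_var

def sample_poisson(lam, rng):
    """Draw one Poisson(lam) value with Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def sample_poisson_beta(a, b, g, rng):
    """One count: Beta-distributed active fraction, then a Poisson count with rate g * fraction."""
    return sample_poisson(g * rng.betavariate(a, b), rng)

rng = random.Random(1)
counts = [sample_poisson_beta(1.0, 1.0, 12.0, rng) for _ in range(20000)]
sample_mean = sum(counts) / len(counts)
sample_var = sum((c - sample_mean) ** 2 for c in counts) / (len(counts) - 1)
mean, rate_var, count_var = poisson_beta_moments(1.0, 1.0, 12.0)
```

With a and b both below 1 the Beta distribution piles up near 0 and 1, which produces the bimodal "on/off" count distributions typical of bursty genes.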
MAST (revisited)
uses a generalized linear hurdle model
designed to account for stochastic dropouts and a bimodal expression distribution, in which expression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene g, with Z indicating whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
A logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):
$\mathrm{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$
$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
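The two-part structure of a hurdle model can be sketched for the simplest case: one gene, two groups of cells, and group as the only covariate. In that case both maximum likelihood fits have closed forms, so no regression library is needed. This is not MAST itself: it assumes no empirical Bayes regularization and no extra covariates, the function names are made up here, and the chi-square approximation with 2 degrees of freedom is rough for few cells.

```python
import math

def binom_loglik(k, n):
    """Maximised Bernoulli log-likelihood for k detections in n cells."""
    if k == 0 or k == n:
        return 0.0
    p = k / n
    return k * math.log(p) + (n - k) * math.log(1 - p)

def rss(values):
    """Residual sum of squares around the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def hurdle_lrt(group_a, group_b):
    """Two-part likelihood ratio test; inputs are log-expression values, 0 = not detected."""
    # Discrete part: Z = 1 if the gene is detected in the cell
    ka, kb = sum(v > 0 for v in group_a), sum(v > 0 for v in group_b)
    na, nb = len(group_a), len(group_b)
    stat_d = 2 * (binom_loglik(ka, na) + binom_loglik(kb, nb)
                  - binom_loglik(ka + kb, na + nb))
    # Continuous part: Gaussian model for Y given Z = 1
    pos_a = [v for v in group_a if v > 0]
    pos_b = [v for v in group_b if v > 0]
    stat_c = 0.0
    if pos_a and pos_b:
        rss_alt = rss(pos_a) + rss(pos_b)
        rss_null = rss(pos_a + pos_b)
        if rss_alt > 0:
            stat_c = max(0.0, (len(pos_a) + len(pos_b)) * math.log(rss_null / rss_alt))
    stat = max(0.0, stat_d) + stat_c
    p_value = math.exp(-stat / 2)  # chi-square survival function for 2 degrees of freedom
    return stat, p_value

stat, p_value = hurdle_lrt([0, 0, 0, 0, 0, 1.0], [3.0, 3.5, 4.0, 4.5, 5.0, 0])
```

The gene above differs both in how often it is detected and in its level when detected, so both parts contribute to the combined statistic.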
SCDE
models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution
the NB distribution models the transcripts that are amplified and detected
the Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)
a subset of robust genes is used to fit, via the EM algorithm, the parameters of the mixture of models
for DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach
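The mixture idea can be illustrated by asking, for one observed count, how likely it is to come from the dropout component rather than the amplified component. This is a sketch of Bayes' rule on a two-component mixture, not SCDE's fitting procedure: the parameters are given rather than estimated by EM, and the background rate of 0.1 and all names are assumptions for this example.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with rate lam."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def nb_pmf(k, mu, size):
    """P(X = k) for a negative binomial with mean mu and dispersion parameter size."""
    log_p = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
             + size * math.log(size / (size + mu))
             + k * math.log(mu / (size + mu)))
    return math.exp(log_p)

def dropout_posterior(k, mu, size, dropout_prior, background_rate=0.1):
    """Posterior probability that count k came from the dropout (Poisson) component."""
    drop = dropout_prior * poisson_pmf(k, background_rate)
    amplified = (1 - dropout_prior) * nb_pmf(k, mu, size)
    return drop / (drop + amplified)

# A zero for a gene that is normally highly expressed is almost surely a dropout
p_zero = dropout_posterior(0, mu=50.0, size=2.0, dropout_prior=0.3)
# A count of 40 is essentially impossible under the background component
p_forty = dropout_posterior(40, mu=50.0, size=2.0, dropout_prior=0.3)
```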
Monocle
Originally designed for ordering cells by progress through differentiation stages (pseudo-time)
The mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as
$g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$, where Y is a specific gene expression level, $x_i$ are predictor variables, g is a link function (typically the log function) and $f_i$ are non-parametric functions (e.g. cubic splines)
The observable expression level Y is then modeled using a GAM:
$E(Y) = s(\varphi_t(b_x, s_i)) + \epsilon$, where $\varphi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and s is a cubic smoothing function with three degrees of freedom. The error term $\epsilon$ is normally distributed with a mean of zero
The DE test is performed using an approximate $\chi^2$ likelihood ratio test
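As a reminder of what a likelihood ratio test compares: a full model (here, expression as a smooth function of pseudo-time) against a reduced model without that term. The symbols $\ell_{\text{full}}$, $\ell_{\text{reduced}}$ and $\Delta df$ are notation introduced for this sketch, not taken from the slide.

```latex
\Lambda = 2\,\bigl(\ell_{\text{full}} - \ell_{\text{reduced}}\bigr),
\qquad
\Lambda \;\overset{\text{approx.}}{\sim}\; \chi^2_{\Delta df}
```

Here $\ell$ denotes the maximised log-likelihood of each model and $\Delta df$ is the difference in the number of fitted parameters; a large $\Lambda$ means the pseudo-time term explains the gene's expression much better than the reduced model does.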
Performance
Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle.
Short name | Method | Software version | Input | Available from | Reference
BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11]
D3E | D3E | D3E 1.0 | raw counts | GitHub | [12]
DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13]
DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14]
edgeRLRT | edgeR/LRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17]
edgeRLRTcensus | edgeR/LRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17]
edgeRLRTdeconv | edgeR/LRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18]
edgeRLRTrobust | edgeR/LRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19]
edgeRQLF | edgeR/QLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
edgeRQLFDetRate | edgeR/QLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22]
MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24]
monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25]
monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26]
monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25]
NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27]
ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29]
ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29]
ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29]
SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30]
scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31]
SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32]
SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33]
ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35]
voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22]
Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36]
Nature Methods, doi:10.1038/nmeth.4612
No ground truth, i.e. no independently validated truth is available for testing
Known data
using data we know something about to get positive controls
Simulated data
null data sets by re-sampling
modeling data sets based on various distributions
Comparing between methods and scenarios
Comparing numbers of DEs, incl. as a function of group size
Investigating results
What do the expression and distributions of detected DEs look like?
False positives (type I error) vs false negatives (type II error)
Sensitivity and specificity
Precision and recall
Figure [adapted from Wikipedia]
Figure [Dal Molin, Baruzzo and Di Camillo 2017]: 2 conditions of 100 cells each, simulated with 10 000 genes, out of which 2 000 were set to be DE (based on NB and bimodal distributions)
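These metrics all derive from the same four counts. The sketch below uses the simulation set-up quoted above (10 000 genes, 2 000 truly DE); the numbers of true and false positives are hypothetical values chosen for illustration, not results from the cited study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from the counts of a 2 x 2 confusion table."""
    return {
        "sensitivity": tp / (tp + fn),          # same as recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "false_positive_rate": fp / (fp + tn),  # type I error rate
        "false_negative_rate": fn / (fn + tp),  # type II error rate
    }

# Hypothetical method output: 1 900 genes called DE, 1 500 of them truly DE
metrics = classification_metrics(tp=1500, fp=400, tn=7600, fn=500)
```

Because only 20% of the genes are truly DE here, a method can have high specificity and still a noticeable share of false calls among its detections, which is why precision is reported alongside specificity.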
Consistency
Miao et al. 2017
And so much more
Soneson and Robinson 2018, "Bias, robustness and scalability in single-cell differential expression analysis"
36 statistical approaches for DE analysis to compare the expression levels in two groups of cells
based on 9 data sets with 11–21 separate instances (sample size effect)
extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.
conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets
Practicalities
Getting to know your data
Example data: 46078 genes x 96 cells; 22229 genes with no expression at all
[Figure: two histograms with x-axes "Read Counts" and "0 counts"; y-axis: Frequency]
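A first look at a count matrix usually starts with the zeros. The sketch below uses a toy matrix (rows are genes, columns are cells), not the example data from the slide; the function names are made up for this illustration.

```python
# Toy count matrix: rows = genes, columns = cells
counts = [
    [0, 0, 0, 0],
    [5, 0, 2, 0],
    [0, 0, 0, 0],
    [1, 3, 0, 7],
]

def zeros_per_gene(matrix):
    """Number of cells with a zero count, for each gene."""
    return [sum(1 for v in row if v == 0) for row in matrix]

def silent_genes(matrix):
    """Indices of genes with no expression in any cell."""
    return [i for i, row in enumerate(matrix) if all(v == 0 for v in row)]

n_zero = zeros_per_gene(counts)
silent = silent_genes(counts)
```

Genes that are zero in every cell carry no information for DE testing and are typically filtered out before any model is fitted.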
Choosing DE methods
Soneson and Robinson 2018
Remembering the bigger picture
Stegle, Teichmann and Marioni 2015
QC filtering
Cell-cycle phase
Normalization of cell-specific biases
Confounding factors, incl. batch effects
Detection rate, i.e. the fraction of detected genes per cell
Imputation strategies for dropout values
What is pragmatic: programming language, platform, speed, collaborative workflows, etc.
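The detection rate mentioned above is simple to compute and can then be passed to a DE method as a covariate. A minimal sketch on a toy matrix (rows are genes, columns are cells); the matrix and the function name are assumptions for this example.

```python
# Toy count matrix: rows = genes, columns = cells
counts = [
    [0, 0, 0, 0],
    [5, 0, 2, 0],
    [0, 0, 0, 0],
    [1, 3, 0, 7],
]

def detection_rate(matrix):
    """Fraction of genes with a non-zero count, for each cell (column)."""
    n_genes = len(matrix)
    n_cells = len(matrix[0])
    return [
        sum(1 for g in range(n_genes) if matrix[g][c] > 0) / n_genes
        for c in range(n_cells)
    ]

rates = detection_rate(counts)
```

Large differences in detection rate between groups of cells can masquerade as differential expression, which is the motivation for treating it as a nuisance covariate.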
Staying critical
Summary
What to remember from this hour
https://www.menti.com & 25 06 78
Growing field
Angerer et al. 2017
https://www.scrna-tools.org/tools
Zappia, Phipson and Oshlack 2018
scRNA-seq is a rapidly growing field
DE is a common task, so many newer and better methods will be developed
understanding basic statistical concepts enables one to think more like a statistician, to choose and evaluate methods for a given data set
staying critical, staying updated, staying connected
Bibliography
Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1–9. ISSN: 1367-4803. doi:10.1093/bioinformatics/bty329.
Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243–260. ISSN: 20954697. doi:10.1007/s40484-016-0089-7.
Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255–261. ISSN: 15487105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612.
Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN: 16648021. doi:10.3389/fgene.2017.00062.
Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data," no. May: 1–2. ISSN: 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549.
Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133–145.
Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85–91. ISSN: 24523100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004.
Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv: 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573.
- Outline
- Introduction
- Common methods
-
- More detailed examples
-
- Performance
- Practicalities
- Summary
- Bibliography
-
Common methods More detailed examples
Common distributions
Negative Binomial
Read Counts
Fre
quen
cy
0 5 10 15 20
010
020
030
040
050
0
Negative Binomial
Read Counts
Fre
quen
cy
0 2 4 6 8 10
050
100
150
Negative Binomial
Read Counts
Fre
quen
cy
0 2 4 6 8 10 12
050
100
150
200
micro = mu
δ2 = mu + mu2
size
mu mean expression size and the dispersion which is inversely related to the variance NB fits bulk RNA-seq data very welland it is used for most statistical methods designed for such data In addition it has been show to fit the distribution of moleculecounts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grun et al 2014 Islam et al 2011)
Olga (NBIS) scRNA-seq DE February 2019 24 46
Common methods More detailed examples
Common distributions
Zerominusinflated NB
Read Counts
Fre
quen
cy
0 5 10 15 20
020
040
0
Zerominusinflated NB
Read Counts
Fre
quen
cy
0 10 20 30 40
010
020
0micro = mu lowast (1 minus d)
δ2 = micro lowast (1 minus d) lowast (1 + d lowast micro + microsize)
d dropout rate The dropout rate of a gene is strongly correlated with the mean expression of the gene Different zero-inflatednegative binomial models use different relationships between mu and d and some may fit mu and d to the expression of eachgene independently Implemented in MAST SCDE
Olga (NBIS) scRNA-seq DE February 2019 25 46
Common methods More detailed examples
Common distributions
PoissonminusBeta
Read Counts
Fre
quen
cy
0 20 60 100
020
040
0
PoissonminusBeta
Read Counts
Fre
quen
cy
0 20 60 100 140
050
150
micro = g lowast a(a + b)
δ2 = g2 lowast a lowast b((a + b + 1) lowast (a + b)2)
a the rate of activation of transcription b the rate of inhibition of transcription and g the rate of transcript production whiletranscription is active at the locus Differential expression methods may test each of the parameters for differences across groupsor only one (often g) Implemented in BPSCMay be further expanded to explicitly account for other sources of gene expression differences such as batch-effect or librarydepth depending on the particular DE algorithm
Olga (NBIS) scRNA-seq DE February 2019 26 46
Common methods More detailed examples
MAST (revisited)
uses generalized linear hurdle model
designed to account for stochastic dropouts and bimodal expression distribution in whichexpression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene gindicating whether gene g is expressed in cell i (ie Zig = 0 if yig = 0 and zig = 1 ifyig gt 0)
A logistic regression model for the discrete variable Z and a Gaussian linear model for thecontinuous variable (Y|Z=1)
logit(Pr (Zig = 1)) = XiβDg
Pr (Yig = Y |Zig = 1) = N(XiβCg σ
2g) where Xi is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
Olga (NBIS) scRNA-seq DE February 2019 27 46
Common methods More detailed examples
SCDE
models the read counts for each gene using a mixture of a NB negative binomial and aPoisson distribution
NB distribution models the transcripts that are amplified and detected
Poisson distribution models the unobserved or background-level signal of transcripts thatare not amplified (eg dropout events)
subset of robust genes is used to fit via EM algorithm the parameters to the mixture ofmodels
For DE the posterior probability that the gene shows a fold expression difference betweentwo conditions is computed using a Bayesian approach
Olga (NBIS) scRNA-seq DE February 2019 28 46
Common methods More detailed examples
Monocole
Originally designed for ordering cells by progress through differentiation stages(pseudo-time)
The mean expression level of each gene is modeled with a GAM generalized additivemodel which relates one or more predictor variables to a response variable as
g(E(Y )) = β0 + f1(x1) + f2(x2) + + fm(xm) where Y is a specific gene expression level xi arepredictor variables g is a link function typically log function and fi are non-parametric functions
(eg cubic splines)
The observable expression level Y is then modeled using GAM
E(Y ) = s(ϕt (bx si )) + ε where ϕt (bx si ) is the assigned pseudo-time of a cell and s is a cubicsmoothing function with three degrees of freedom The error term ε is normally distributed with amean of zero
The DE test is performed using an approx χ2 likelihood ratio test
Olga (NBIS) scRNA-seq DE February 2019 29 46
Performance
Performance
Olga (NBIS) scRNA-seq DE February 2019 30 46
Supplementary Table 2 Evaluated dicrarrerential expression methods together with package versions and thetype of input values provided to each of them Note that ldquoraw countsrdquo here refers to length-scaled TPMswhich are on the scale of the raw counts but are unacrarrected by dicrarrerential isoform usage [10] CPM valuesare calculated with edgeR and Census counts with monocle
Short name Method Software version InputAvailablefrom
Reference
BPSC BPSC BPSC 09901 CPM GitHub [11]
D3E D3E D3E 10 raw counts GitHub [12]
DESeq2 DESeq2 DESeq2 1141 raw counts Bioconductor [13]
DESeq2betapFALSE DESeq2 without beta prior DESeq2 1141 raw counts Bioconductor [13]
DESeq2census DESeq2 DESeq2 1141 Census counts Bioconductor [13]
DESeq2nofiltDESeq2 without the built-in in-dependent filtering
DESeq2 1141 raw counts Bioconductor [13]
DEsingle DEsingle DEsingle 010 raw counts GitHub [14]
edgeRLRT edgeRLRT edgeR 3191 raw counts Bioconductor [15ndash17]
edgeRLRTcensus edgeRLRT edgeR 3191 Census counts Bioconductor [15ndash17]
edgeRLRTdeconvedgeRLRT with deconvolutionnormalization
edgeR 3191scran 120
raw counts Bioconductor [15 17 18]
edgeRLRTrobustedgeRLRT with robust disper-sion estimation
edgeR 3191 raw counts Bioconductor [15ndash17 19]
edgeRQLF edgeRQLF edgeR 3191 raw counts Bioconductor [15 16 20]
edgeRQLFDetRateedgeRQLF with cellular detec-tion rate as covariate
edgeR 3191 raw counts Bioconductor [15 16 20]
limmatrend limma-trend limma 33013 log2(CPM) Bioconductor [21 22]
MASTcpm MAST MAST 105 log2(CPM+1) Bioconductor [23]
MASTcpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(CPM+1) Bioconductor [23]
MASTtpm MAST MAST 105 log2(TPM+1) Bioconductor [23]
MASTtpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(TPM+1) Bioconductor [23]
metagenomeSeq metagenomeSeqmetagenomeSeq1160
raw counts Bioconductor [24]
monocle monocle (tobit) monocle 220 TPM Bioconductor [25]
monoclecensus monocle (Negative Binomial) monocle 220 Census counts Bioconductor [25 26]
monoclecount monocle (Negative Binomial) monocle 220 raw counts Bioconductor [25]
NODES NODESNODES0009010
raw countsAuthor-providedlink
[27]
ROTScpm ROTS ROTS 120 CPM Bioconductor [28 29]
ROTStpm ROTS ROTS 120 TPM Bioconductor [28 29]
ROTSvoom ROTS ROTS 120voom-transformedraw counts
Bioconductor [28 29]
SAMseq SAMseq samr 20 raw counts CRAN [30]
scDD scDD scDD 100 raw counts Bioconductor [31]
SCDE SCDE scde 220 raw counts Bioconductor [32]
SeuratBimod Seurat (bimod test) Seurat 1407 raw counts GitHub [33 34]
SeuratBimodnofiltSeurat (bimod test) without theinternal filtering
Seurat 1407 raw counts GitHub [33 34]
SeuratBimodIsExpr2Seurat (bimod test) with internalexpression threshold set to 2
Seurat 1407 raw counts GitHub [33 34]
SeuratTobit Seurat (tobit test) Seurat 1407 TPM GitHub [25 33]
ttest t-test stats (R v 33)TMM-normalizedTPM
CRAN [16 35]
voomlimma voom-limma limma 33013 raw counts Bioconductor [21 22]
Wilcoxon Wilcoxon test stats (R v 33)TMM-normalizedTPM
CRAN [16 36]
3
Nature Methods doi101038nmeth4612
Performance
Olga (NBIS) scRNA-seq DE February 2019 31 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
False positives (type I error) vs false negatives (type II error)Sensitivity and specificityPrecision and recall
adapted from Wikipedia
Olga (NBIS) scRNA-seq DE February 2019 33 46
Performance
False positives (type I error) vs false negatives (type II error)Sensitivity and specificityPrecision and recall
Dal Molin Baruzzo and Di Camillo 2017 2 conditions of 100 cells each simulated with 10 000genes out of which 2 000 set to DEs (based on NB and bimodal distributions)
Olga (NBIS) scRNA-seq DE February 2019 34 46
Performance
Consistency
Miao et al 2017
Olga (NBIS) scRNA-seq DE February 2019 35 46
Performance
And so much more
Soneson and Robinson 2018
Bias robustness and scalability in single-celldifferential expression analysis
36 statistical approaches for DE analysisto compare the expression levels in thetwo groups of cells
based on 9 data sets with 11 - 21separate instances (sample size effect)
extensive evaluation metrics incl numberof genes found characteristics of the falsepositive detections robustness ofmethods similarities between methodsetc
conquer a collection of consistentlyprocessed analysis-ready publicscRNA-seq data sets
Olga (NBIS) scRNA-seq DE February 2019 36 46
Practicalities
Practicalities
Olga (NBIS) scRNA-seq DE February 2019 37 46
Practicalities
Getting to know your dataExample data 46078 genes x 96 cells22229 genes with no expression at all
Read Counts
Fre
quen
cy
0 500 1000 1500
050
0015
000
0 counts
Fre
quen
cy
0 20 40 60 800
2000
4000
6000
Olga (NBIS) scRNA-seq DE February 2019 38 46
Practicalities
Choosing DE methods
Soneson and Robinson 2018
Olga (NBIS) scRNA-seq DE February 2019 39 46
Practicalities
Rembering the bigger picture
Stegle Teichmann and Marioni 2015
QC filtering
Cell-cycle phase
Normalization of cell-specific biases
Confounding factors incl batcheffects
Detection rate ie the fraction ofdetected genes per cell
Imputations strategies for dropoutvalues
What is pragmatic programminglanguage platform speedcollaborative workflows etc
Olga (NBIS) scRNA-seq DE February 2019 40 46
Practicalities
Staying critical
Olga (NBIS) scRNA-seq DE February 2019 41 46
Summary
What to remember from this hour
httpswwwmenticom amp 25 06 78
Olga (NBIS) scRNA-seq DE February 2019 42 46
Summary
Growing field
Angerer et al 2017
Olga (NBIS) scRNA-seq DE February 2019 43 46
Summary
Growing field
httpswwwscrna-toolsorgtools
Zappia Phipson and Oshlack 2018
Olga (NBIS) scRNA-seq DE February 2019 44 46
Summary
SummaryscRNA-seq is a rapidly growing field
DE is a common task so many newer and better methods will be developed
understanding basic statistical concepts enables one to think more like a statistician tochoose and evaluate methods given data set
staying critical staying updated staying connected
Olga (NBIS) scRNA-seq DE February 2019 45 46
Bibliography
Wu Zhijin et al 2017 ldquoTwo-phase differential expression analysis for single cell RNA-seqrdquoBioinformatics 00 (00) 1ndash9 ISSN 1367-4803 doi101093bioinformaticsbty329
Miao Zhun and Xuegong Zhang 2016 ldquoDifferential expression analyses for single-cellRNA-Seq old questions on new datardquo Quantitative Biology 4 (4) 243ndash260 ISSN 20954697doi101007s40484-016-0089-7
Soneson Charlotte and Mark D Robinson 2018 ldquoBias robustness and scalability in single-celldifferential expression analysisrdquo Nature Methods 15 (4) 255ndash261 ISSN 15487105doi101038nmeth4612 httpdxdoiorg101038nmeth4612
Dal Molin Alessandra Giacomo Baruzzo and Barbara Di Camillo 2017 ldquoSingle-cellRNA-sequencing Assessment of differential expression analysis methodsrdquo Frontiers inGenetics 8 (MAY) ISSN 16648021 doi103389fgene201700062
Miao Zhun et al 2017 ldquoDEsingle for detecting three types of differential expression in single-cellRNA-seq datardquo no May 1ndash2 ISSN 1367-4803 doi101093bioinformaticsbty332arXiv 103549
Stegle Oliver Sarah A Teichmann and John C Marioni 2015 ldquoComputational and analyticalchallenges in single-cell transcriptomicsrdquo Nature reviews Genetics 16 (January 2014)133ndash145
Angerer Philipp et al 2017 ldquoSingle cells make big data New challenges and opportunities intranscriptomicsrdquo Current Opinion in Systems Biology 485ndash91 ISSN 24523100doi101016jcoisb201707004httpdxdoiorg101016jcoisb201707004
Zappia Luke Belinda Phipson and Alicia Oshlack 2018 ldquoExploring the single-cell RNA-seqanalysis landscape with the scRNA-tools databaserdquo bioRxiv 206573 doi101101206573httpswwwbiorxivorgcontentearly20180323206573
Olga (NBIS) scRNA-seq DE February 2019 46 46
- Outline
- Introduction
- Common methods
-
- More detailed examples
-
- Performance
- Practicalities
- Summary
- Bibliography
-
Common methods More detailed examples
Common distributions
Zerominusinflated NB
Read Counts
Fre
quen
cy
0 5 10 15 20
020
040
0
Zerominusinflated NB
Read Counts
Fre
quen
cy
0 10 20 30 40
010
020
0micro = mu lowast (1 minus d)
δ2 = micro lowast (1 minus d) lowast (1 + d lowast micro + microsize)
d dropout rate The dropout rate of a gene is strongly correlated with the mean expression of the gene Different zero-inflatednegative binomial models use different relationships between mu and d and some may fit mu and d to the expression of eachgene independently Implemented in MAST SCDE
Olga (NBIS) scRNA-seq DE February 2019 25 46
Common methods More detailed examples
Common distributions
PoissonminusBeta
Read Counts
Fre
quen
cy
0 20 60 100
020
040
0
PoissonminusBeta
Read Counts
Fre
quen
cy
0 20 60 100 140
050
150
micro = g lowast a(a + b)
δ2 = g2 lowast a lowast b((a + b + 1) lowast (a + b)2)
a the rate of activation of transcription b the rate of inhibition of transcription and g the rate of transcript production whiletranscription is active at the locus Differential expression methods may test each of the parameters for differences across groupsor only one (often g) Implemented in BPSCMay be further expanded to explicitly account for other sources of gene expression differences such as batch-effect or librarydepth depending on the particular DE algorithm
Olga (NBIS) scRNA-seq DE February 2019 26 46
Common methods More detailed examples
MAST (revisited)
uses generalized linear hurdle model
designed to account for stochastic dropouts and bimodal expression distribution in whichexpression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene gindicating whether gene g is expressed in cell i (ie Zig = 0 if yig = 0 and zig = 1 ifyig gt 0)
A logistic regression model for the discrete variable Z and a Gaussian linear model for thecontinuous variable (Y|Z=1)
logit(Pr (Zig = 1)) = XiβDg
Pr (Yig = Y |Zig = 1) = N(XiβCg σ
2g) where Xi is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
Olga (NBIS) scRNA-seq DE February 2019 27 46
Common methods More detailed examples
SCDE
models the read counts for each gene using a mixture of a NB negative binomial and aPoisson distribution
NB distribution models the transcripts that are amplified and detected
Poisson distribution models the unobserved or background-level signal of transcripts thatare not amplified (eg dropout events)
subset of robust genes is used to fit via EM algorithm the parameters to the mixture ofmodels
For DE the posterior probability that the gene shows a fold expression difference betweentwo conditions is computed using a Bayesian approach
Olga (NBIS) scRNA-seq DE February 2019 28 46
Common methods More detailed examples
Monocle
Originally designed for ordering cells by progress through differentiation stages (pseudo-time)
The mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as
$g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \cdots + f_m(x_m)$
where Y is a specific gene expression level, $x_i$ are predictor variables, g is a link function (typically the log function) and $f_i$ are non-parametric functions (e.g. cubic splines)
The observable expression level Y is then modeled using a GAM:
$E(Y) = s(\varphi_t(b_x, s_i)) + \epsilon$
where $\varphi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and s is a cubic smoothing function with three degrees of freedom. The error term $\epsilon$ is normally distributed with a mean of zero
The DE test is performed using an approximate $\chi^2$ likelihood ratio test
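The logic of the test can be sketched by comparing a smooth fit over pseudo-time against a constant-only fit. As a stand-in for the cubic smoothing function, this sketch uses an ordinary cubic polynomial fitted by least squares with Gaussian errors, so it is not Monocle's implementation; the function names and the toy data are hypothetical. The full model has three more parameters than the null model, so the statistic is compared to a chi-square distribution with 3 degrees of freedom.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def rss_polynomial(t, y, degree):
    """Residual sum of squares of a least-squares polynomial fit of y on t."""
    basis = [[ti ** d for d in range(degree + 1)] for ti in t]
    k = degree + 1
    xtx = [[sum(row[i] * row[j] for row in basis) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(basis, y)) for i in range(k)]
    coef = solve(xtx, xty)
    fitted = [sum(c * b for c, b in zip(coef, row)) for row in basis]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

def chi2_sf_3df(x):
    """Survival function of the chi-square distribution with 3 df."""
    return (math.erfc(math.sqrt(x / 2))
            + math.sqrt(2 * x / math.pi) * math.exp(-x / 2))

def pseudotime_lrt(pseudotime, expression):
    """Likelihood ratio test: cubic trend over pseudo-time vs constant mean.
    Assumes the cubic fit is not exact (residual sum of squares > 0)."""
    n = len(expression)
    rss_null = rss_polynomial(pseudotime, expression, 0)
    rss_full = rss_polynomial(pseudotime, expression, 3)
    lr = max(n * math.log(rss_null / rss_full), 0.0)
    return lr, chi2_sf_3df(lr)

# Toy gene whose expression rises along pseudo-time
t = [i / 19 for i in range(20)]
y = [2 + 3 * ti + 0.1 * (-1) ** i for i, ti in enumerate(t)]
lr, p = pseudotime_lrt(t, y)
```

A gene with a flat profile along pseudo-time gives a small statistic and a large p-value, while a gene with a clear trend gives the opposite.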
Performance
Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle.

| Short name | Method | Software version | Input | Available from | Reference |
|---|---|---|---|---|---|
| BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11] |
| D3E | D3E | D3E 1.0 | raw counts | GitHub | [12] |
| DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13] |
| DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14] |
| edgeRLRT | edgeRLRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17] |
| edgeRLRTcensus | edgeRLRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17] |
| edgeRLRTdeconv | edgeRLRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18] |
| edgeRLRTrobust | edgeRLRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19] |
| edgeRQLF | edgeRQLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20] |
| edgeRQLFDetRate | edgeRQLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20] |
| limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22] |
| MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23] |
| MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23] |
| MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23] |
| MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23] |
| metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24] |
| monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25] |
| monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26] |
| monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25] |
| NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27] |
| ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29] |
| ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29] |
| ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29] |
| SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30] |
| scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31] |
| SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32] |
| SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33] |
| ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35] |
| voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22] |
| Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36] |

Source: Nature Methods, doi:10.1038/nmeth.4612
No ground truth, i.e. no independently validated truth, is available for testing
Known data
using data we know something about to get positive controls
Simulated data
null data sets by re-sampling; modeling data sets based on various distributions
Comparing between methods and scenarios
comparing numbers of DE genes, incl. as a function of group size
Investigating results
What do the expression levels and distributions of the detected DE genes look like?
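One way to build null data sets by re-sampling is to draw two random, non-overlapping groups of cells from a single homogeneous population: no gene is truly DE between such groups, so every gene a method reports is a false positive. A minimal sketch, assuming cells are identified by simple IDs; the function name `null_splits` and its parameters are hypothetical.

```python
import random

def null_splits(cell_ids, group_size, n_splits, seed=0):
    """Draw n_splits pairs of disjoint random groups from one population.

    Each pair can be fed to a DE method; with no true difference between
    the groups, the number of reported genes estimates its false positives.
    """
    rng = random.Random(seed)  # fixed seed so the splits are reproducible
    splits = []
    for _ in range(n_splits):
        chosen = rng.sample(cell_ids, 2 * group_size)
        splits.append((chosen[:group_size], chosen[group_size:]))
    return splits

# 5 null comparisons of 12 vs 12 cells drawn from 50 cells of one condition
splits = null_splits(list(range(50)), group_size=12, n_splits=5)
```

Repeating this for several group sizes gives the "function of group size" view: how the number of false calls changes as more cells per group are used.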
False positives (type I error) vs false negatives (type II error)
Sensitivity and specificity
Precision and recall
[Figure adapted from Wikipedia]
Dal Molin, Baruzzo and Di Camillo 2017: 2 conditions of 100 cells each, simulated with 10 000 genes, of which 2 000 were set to be DE (based on NB and bimodal distributions)
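With simulated data the truth is known, so these measures follow directly from the four cells of the confusion matrix. A small sketch using the simulation size above (2 000 true DE genes out of 10 000); the counts of true and false calls are made up for illustration and are not results from the paper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Performance measures from confusion-matrix counts.

    tp: DE genes called DE          fp: non-DE genes called DE (type I error)
    fn: DE genes missed (type II)   tn: non-DE genes not called
    """
    return {
        "sensitivity": tp / (tp + fn),          # same as recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "false_positive_rate": fp / (fp + tn),  # 1 - specificity
    }

# Hypothetical outcome: 1 700 genes called, 1 500 of them truly DE
metrics = classification_metrics(tp=1500, fp=200, fn=500, tn=7800)
```

Because only 20% of the genes are DE here, specificity stays high even with a few hundred false calls, which is why precision is often the more telling number for a gene list.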
Consistency
Miao et al. 2017
And so much more
Soneson and Robinson 2018
Bias, robustness and scalability in single-cell differential expression analysis
36 statistical approaches for DE analysis, comparing the expression levels in two groups of cells
based on 9 data sets with 11-21 separate instances (sample size effect)
extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.
conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets
Practicalities
Getting to know your data
Example data: 46078 genes x 96 cells
22229 genes with no expression at all
[Figure: two histograms. Left: x-axis Read Counts, y-axis Frequency. Right: x-axis 0 counts, y-axis Frequency]
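The numbers behind such a first look are simple to compute from a genes x cells count matrix. A minimal sketch on a toy matrix stored as a list of rows; the function name `summarize_counts` and the toy values are hypothetical, and real data would normally be held in a dedicated matrix structure rather than nested lists.

```python
def summarize_counts(counts):
    """Basic summaries of a count matrix (rows = genes, columns = cells)."""
    n_genes = len(counts)
    n_cells = len(counts[0])
    # Genes with no expression in any cell
    silent_genes = sum(1 for row in counts if all(v == 0 for v in row))
    # Number of cells with a zero count, per gene (basis of a zero histogram)
    zeros_per_gene = [sum(1 for v in row if v == 0) for row in counts]
    # Fraction of genes detected in each cell (cellular detection rate)
    detection_rate = [
        sum(1 for row in counts if row[j] > 0) / n_genes
        for j in range(n_cells)
    ]
    return {
        "n_genes": n_genes,
        "n_cells": n_cells,
        "silent_genes": silent_genes,
        "zeros_per_gene": zeros_per_gene,
        "detection_rate": detection_rate,
    }

# Toy matrix: 4 genes x 3 cells
toy = [
    [0, 0, 0],
    [5, 0, 2],
    [0, 1, 0],
    [3, 4, 9],
]
summary = summarize_counts(toy)
```

Genes that are silent in every cell carry no information for DE testing and are usually filtered out before any method is run.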
Choosing DE methods
Soneson and Robinson 2018
Remembering the bigger picture
Stegle, Teichmann and Marioni 2015
QC filtering
Cell-cycle phase
Normalization of cell-specific biases
Confounding factors, incl. batch effects
Detection rate, i.e. the fraction of detected genes per cell
Imputation strategies for dropout values
What is pragmatic: programming language, platform, speed, collaborative workflows, etc.
Staying critical
Summary
What to remember from this hour
https://www.menti.com & 25 06 78
Growing field
Angerer et al. 2017
https://www.scrna-tools.org/tools
Zappia, Phipson and Oshlack 2018
Summary
scRNA-seq is a rapidly growing field
DE is a common task, so many newer and better methods will be developed
understanding basic statistical concepts enables one to think more like a statistician, and to choose and evaluate methods for a given data set
staying critical, staying updated, staying connected
Bibliography
Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1–9. ISSN: 1367-4803. doi:10.1093/bioinformatics/bty329.
Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243–260. ISSN: 20954697. doi:10.1007/s40484-016-0089-7.
Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255–261. ISSN: 15487105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612.
Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN: 16648021. doi:10.3389/fgene.2017.00062.
Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data," no. May: 1–2. ISSN: 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549.
Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature reviews. Genetics 16 (January 2014): 133–145.
Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85–91. ISSN: 24523100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004.
Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv: 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573.
- Outline
- Introduction
- Common methods
- More detailed examples
- Performance
- Practicalities
- Summary
- Bibliography
Common methods: More detailed examples
Common distributions
Poisson-Beta
[Figure: two example histograms of Poisson-Beta distributed data; x-axis Read Counts, y-axis Frequency]
$\mu = g \cdot a / (a + b)$
$\delta^2 = g^2 \cdot a \cdot b / ((a + b + 1) \cdot (a + b)^2)$
a: the rate of activation of transcription; b: the rate of inhibition of transcription; g: the rate of transcript production while transcription is active at the locus
Differential expression methods may test each of the parameters for differences across groups, or only one (often g). Implemented in BPSC
May be further expanded to explicitly account for other sources of gene expression differences, such as batch effect or library depth, depending on the particular DE algorithm
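The two formulas can be checked numerically. The sketch below evaluates them for one made-up parameter set and compares the result with simulated draws of g times a Beta(a, b) variable, which is the quantity whose mean and variance the formulas as written describe. Function names and parameter values are hypothetical.

```python
import random

def rate_moments(a, b, g):
    """Mean and variance from the two formulas above."""
    mean = g * a / (a + b)
    variance = g ** 2 * a * b / ((a + b + 1) * (a + b) ** 2)
    return mean, variance

def simulate_rates(a, b, g, n, seed=0):
    """Draw n values of g * Beta(a, b) with a fixed seed."""
    rng = random.Random(seed)
    return [g * rng.betavariate(a, b) for _ in range(n)]

# Hypothetical parameters: activation 2, inhibition 3, production rate 100
mean, variance = rate_moments(a=2, b=3, g=100)

draws = simulate_rates(a=2, b=3, g=100, n=20000)
sample_mean = sum(draws) / len(draws)
sample_var = sum((d - sample_mean) ** 2 for d in draws) / len(draws)
```

Changing only g rescales the whole distribution, while changing a or b alters its shape, which is one way to see why a method may choose to test a single parameter or all three.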
Common methods More detailed examples
MAST (revisited)
uses generalized linear hurdle model
designed to account for stochastic dropouts and bimodal expression distribution in whichexpression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene gindicating whether gene g is expressed in cell i (ie Zig = 0 if yig = 0 and zig = 1 ifyig gt 0)
A logistic regression model for the discrete variable Z and a Gaussian linear model for thecontinuous variable (Y|Z=1)
logit(Pr (Zig = 1)) = XiβDg
Pr (Yig = Y |Zig = 1) = N(XiβCg σ
2g) where Xi is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
Olga (NBIS) scRNA-seq DE February 2019 27 46
Common methods More detailed examples
SCDE
models the read counts for each gene using a mixture of a NB negative binomial and aPoisson distribution
NB distribution models the transcripts that are amplified and detected
Poisson distribution models the unobserved or background-level signal of transcripts thatare not amplified (eg dropout events)
subset of robust genes is used to fit via EM algorithm the parameters to the mixture ofmodels
For DE the posterior probability that the gene shows a fold expression difference betweentwo conditions is computed using a Bayesian approach
Olga (NBIS) scRNA-seq DE February 2019 28 46
Common methods More detailed examples
Monocole
Originally designed for ordering cells by progress through differentiation stages(pseudo-time)
The mean expression level of each gene is modeled with a GAM generalized additivemodel which relates one or more predictor variables to a response variable as
g(E(Y )) = β0 + f1(x1) + f2(x2) + + fm(xm) where Y is a specific gene expression level xi arepredictor variables g is a link function typically log function and fi are non-parametric functions
(eg cubic splines)
The observable expression level Y is then modeled using GAM
E(Y ) = s(ϕt (bx si )) + ε where ϕt (bx si ) is the assigned pseudo-time of a cell and s is a cubicsmoothing function with three degrees of freedom The error term ε is normally distributed with amean of zero
The DE test is performed using an approx χ2 likelihood ratio test
Olga (NBIS) scRNA-seq DE February 2019 29 46
Performance
Performance
Olga (NBIS) scRNA-seq DE February 2019 30 46
Supplementary Table 2 Evaluated dicrarrerential expression methods together with package versions and thetype of input values provided to each of them Note that ldquoraw countsrdquo here refers to length-scaled TPMswhich are on the scale of the raw counts but are unacrarrected by dicrarrerential isoform usage [10] CPM valuesare calculated with edgeR and Census counts with monocle
Short name Method Software version InputAvailablefrom
Reference
BPSC BPSC BPSC 09901 CPM GitHub [11]
D3E D3E D3E 10 raw counts GitHub [12]
DESeq2 DESeq2 DESeq2 1141 raw counts Bioconductor [13]
DESeq2betapFALSE DESeq2 without beta prior DESeq2 1141 raw counts Bioconductor [13]
DESeq2census DESeq2 DESeq2 1141 Census counts Bioconductor [13]
DESeq2nofiltDESeq2 without the built-in in-dependent filtering
DESeq2 1141 raw counts Bioconductor [13]
DEsingle DEsingle DEsingle 010 raw counts GitHub [14]
edgeRLRT edgeRLRT edgeR 3191 raw counts Bioconductor [15ndash17]
edgeRLRTcensus edgeRLRT edgeR 3191 Census counts Bioconductor [15ndash17]
edgeRLRTdeconvedgeRLRT with deconvolutionnormalization
edgeR 3191scran 120
raw counts Bioconductor [15 17 18]
edgeRLRTrobustedgeRLRT with robust disper-sion estimation
edgeR 3191 raw counts Bioconductor [15ndash17 19]
edgeRQLF edgeRQLF edgeR 3191 raw counts Bioconductor [15 16 20]
edgeRQLFDetRateedgeRQLF with cellular detec-tion rate as covariate
edgeR 3191 raw counts Bioconductor [15 16 20]
limmatrend limma-trend limma 33013 log2(CPM) Bioconductor [21 22]
MASTcpm MAST MAST 105 log2(CPM+1) Bioconductor [23]
MASTcpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(CPM+1) Bioconductor [23]
MASTtpm MAST MAST 105 log2(TPM+1) Bioconductor [23]
MASTtpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(TPM+1) Bioconductor [23]
metagenomeSeq metagenomeSeqmetagenomeSeq1160
raw counts Bioconductor [24]
monocle monocle (tobit) monocle 220 TPM Bioconductor [25]
monoclecensus monocle (Negative Binomial) monocle 220 Census counts Bioconductor [25 26]
monoclecount monocle (Negative Binomial) monocle 220 raw counts Bioconductor [25]
NODES NODESNODES0009010
raw countsAuthor-providedlink
[27]
ROTScpm ROTS ROTS 120 CPM Bioconductor [28 29]
ROTStpm ROTS ROTS 120 TPM Bioconductor [28 29]
ROTSvoom ROTS ROTS 120voom-transformedraw counts
Bioconductor [28 29]
SAMseq SAMseq samr 20 raw counts CRAN [30]
scDD scDD scDD 100 raw counts Bioconductor [31]
SCDE SCDE scde 220 raw counts Bioconductor [32]
SeuratBimod Seurat (bimod test) Seurat 1407 raw counts GitHub [33 34]
SeuratBimodnofiltSeurat (bimod test) without theinternal filtering
Seurat 1407 raw counts GitHub [33 34]
SeuratBimodIsExpr2Seurat (bimod test) with internalexpression threshold set to 2
Seurat 1407 raw counts GitHub [33 34]
SeuratTobit Seurat (tobit test) Seurat 1407 TPM GitHub [25 33]
ttest t-test stats (R v 33)TMM-normalizedTPM
CRAN [16 35]
voomlimma voom-limma limma 33013 raw counts Bioconductor [21 22]
Wilcoxon Wilcoxon test stats (R v 33)TMM-normalizedTPM
CRAN [16 36]
3
Nature Methods doi101038nmeth4612
Performance
Olga (NBIS) scRNA-seq DE February 2019 31 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
False positives (type I error) vs false negatives (type II error)Sensitivity and specificityPrecision and recall
adapted from Wikipedia
Olga (NBIS) scRNA-seq DE February 2019 33 46
Performance
False positives (type I error) vs false negatives (type II error)Sensitivity and specificityPrecision and recall
Dal Molin Baruzzo and Di Camillo 2017 2 conditions of 100 cells each simulated with 10 000genes out of which 2 000 set to DEs (based on NB and bimodal distributions)
Olga (NBIS) scRNA-seq DE February 2019 34 46
Performance
Consistency
Miao et al 2017
Olga (NBIS) scRNA-seq DE February 2019 35 46
Performance
And so much more
Soneson and Robinson 2018
Bias robustness and scalability in single-celldifferential expression analysis
36 statistical approaches for DE analysisto compare the expression levels in thetwo groups of cells
based on 9 data sets with 11 - 21separate instances (sample size effect)
extensive evaluation metrics incl numberof genes found characteristics of the falsepositive detections robustness ofmethods similarities between methodsetc
conquer a collection of consistentlyprocessed analysis-ready publicscRNA-seq data sets
Olga (NBIS) scRNA-seq DE February 2019 36 46
Practicalities
Practicalities
Olga (NBIS) scRNA-seq DE February 2019 37 46
Practicalities
Getting to know your dataExample data 46078 genes x 96 cells22229 genes with no expression at all
Read Counts
Fre
quen
cy
0 500 1000 1500
050
0015
000
0 counts
Fre
quen
cy
0 20 40 60 800
2000
4000
6000
Olga (NBIS) scRNA-seq DE February 2019 38 46
Practicalities
Choosing DE methods
Soneson and Robinson 2018
Olga (NBIS) scRNA-seq DE February 2019 39 46
Practicalities
Rembering the bigger picture
Stegle Teichmann and Marioni 2015
QC filtering
Cell-cycle phase
Normalization of cell-specific biases
Confounding factors incl batcheffects
Detection rate ie the fraction ofdetected genes per cell
Imputations strategies for dropoutvalues
What is pragmatic programminglanguage platform speedcollaborative workflows etc
Olga (NBIS) scRNA-seq DE February 2019 40 46
Practicalities
Staying critical
Olga (NBIS) scRNA-seq DE February 2019 41 46
Summary
What to remember from this hour
httpswwwmenticom amp 25 06 78
Olga (NBIS) scRNA-seq DE February 2019 42 46
Summary
Growing field
Angerer et al 2017
Olga (NBIS) scRNA-seq DE February 2019 43 46
Summary
Growing field
httpswwwscrna-toolsorgtools
Zappia Phipson and Oshlack 2018
Olga (NBIS) scRNA-seq DE February 2019 44 46
Summary
SummaryscRNA-seq is a rapidly growing field
DE is a common task so many newer and better methods will be developed
understanding basic statistical concepts enables one to think more like a statistician tochoose and evaluate methods given data set
staying critical staying updated staying connected
Olga (NBIS) scRNA-seq DE February 2019 45 46
Bibliography
Wu Zhijin et al 2017 ldquoTwo-phase differential expression analysis for single cell RNA-seqrdquoBioinformatics 00 (00) 1ndash9 ISSN 1367-4803 doi101093bioinformaticsbty329
Miao Zhun and Xuegong Zhang 2016 ldquoDifferential expression analyses for single-cellRNA-Seq old questions on new datardquo Quantitative Biology 4 (4) 243ndash260 ISSN 20954697doi101007s40484-016-0089-7
Soneson Charlotte and Mark D Robinson 2018 ldquoBias robustness and scalability in single-celldifferential expression analysisrdquo Nature Methods 15 (4) 255ndash261 ISSN 15487105doi101038nmeth4612 httpdxdoiorg101038nmeth4612
Dal Molin Alessandra Giacomo Baruzzo and Barbara Di Camillo 2017 ldquoSingle-cellRNA-sequencing Assessment of differential expression analysis methodsrdquo Frontiers inGenetics 8 (MAY) ISSN 16648021 doi103389fgene201700062
Miao Zhun et al 2017 ldquoDEsingle for detecting three types of differential expression in single-cellRNA-seq datardquo no May 1ndash2 ISSN 1367-4803 doi101093bioinformaticsbty332arXiv 103549
Stegle Oliver Sarah A Teichmann and John C Marioni 2015 ldquoComputational and analyticalchallenges in single-cell transcriptomicsrdquo Nature reviews Genetics 16 (January 2014)133ndash145
Angerer Philipp et al 2017 ldquoSingle cells make big data New challenges and opportunities intranscriptomicsrdquo Current Opinion in Systems Biology 485ndash91 ISSN 24523100doi101016jcoisb201707004httpdxdoiorg101016jcoisb201707004
Zappia Luke Belinda Phipson and Alicia Oshlack 2018 ldquoExploring the single-cell RNA-seqanalysis landscape with the scRNA-tools databaserdquo bioRxiv 206573 doi101101206573httpswwwbiorxivorgcontentearly20180323206573
Olga (NBIS) scRNA-seq DE February 2019 46 46
- Outline
- Introduction
- Common methods
-
- More detailed examples
-
- Performance
- Practicalities
- Summary
- Bibliography
-
Common methods More detailed examples
MAST (revisited)
uses generalized linear hurdle model
designed to account for stochastic dropouts and bimodal expression distribution in whichexpression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene gindicating whether gene g is expressed in cell i (ie Zig = 0 if yig = 0 and zig = 1 ifyig gt 0)
A logistic regression model for the discrete variable Z and a Gaussian linear model for thecontinuous variable (Y|Z=1)
logit(Pr (Zig = 1)) = XiβDg
Pr (Yig = Y |Zig = 1) = N(XiβCg σ
2g) where Xi is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
Olga (NBIS) scRNA-seq DE February 2019 27 46
Common methods More detailed examples
SCDE
models the read counts for each gene using a mixture of a NB negative binomial and aPoisson distribution
NB distribution models the transcripts that are amplified and detected
Poisson distribution models the unobserved or background-level signal of transcripts thatare not amplified (eg dropout events)
subset of robust genes is used to fit via EM algorithm the parameters to the mixture ofmodels
For DE the posterior probability that the gene shows a fold expression difference betweentwo conditions is computed using a Bayesian approach
Olga (NBIS) scRNA-seq DE February 2019 28 46
Common methods More detailed examples
Monocole
Originally designed for ordering cells by progress through differentiation stages(pseudo-time)
The mean expression level of each gene is modeled with a GAM generalized additivemodel which relates one or more predictor variables to a response variable as
g(E(Y )) = β0 + f1(x1) + f2(x2) + + fm(xm) where Y is a specific gene expression level xi arepredictor variables g is a link function typically log function and fi are non-parametric functions
(eg cubic splines)
The observable expression level Y is then modeled using GAM
E(Y ) = s(ϕt (bx si )) + ε where ϕt (bx si ) is the assigned pseudo-time of a cell and s is a cubicsmoothing function with three degrees of freedom The error term ε is normally distributed with amean of zero
The DE test is performed using an approx χ2 likelihood ratio test
Olga (NBIS) scRNA-seq DE February 2019 29 46
Performance
Performance
Olga (NBIS) scRNA-seq DE February 2019 30 46
Supplementary Table 2 Evaluated dicrarrerential expression methods together with package versions and thetype of input values provided to each of them Note that ldquoraw countsrdquo here refers to length-scaled TPMswhich are on the scale of the raw counts but are unacrarrected by dicrarrerential isoform usage [10] CPM valuesare calculated with edgeR and Census counts with monocle
Short name Method Software version InputAvailablefrom
Reference
BPSC BPSC BPSC 09901 CPM GitHub [11]
D3E D3E D3E 10 raw counts GitHub [12]
DESeq2 DESeq2 DESeq2 1141 raw counts Bioconductor [13]
DESeq2betapFALSE DESeq2 without beta prior DESeq2 1141 raw counts Bioconductor [13]
DESeq2census DESeq2 DESeq2 1141 Census counts Bioconductor [13]
DESeq2nofiltDESeq2 without the built-in in-dependent filtering
DESeq2 1141 raw counts Bioconductor [13]
DEsingle DEsingle DEsingle 010 raw counts GitHub [14]
edgeRLRT edgeRLRT edgeR 3191 raw counts Bioconductor [15ndash17]
edgeRLRTcensus edgeRLRT edgeR 3191 Census counts Bioconductor [15ndash17]
edgeRLRTdeconvedgeRLRT with deconvolutionnormalization
edgeR 3191scran 120
raw counts Bioconductor [15 17 18]
edgeRLRTrobustedgeRLRT with robust disper-sion estimation
edgeR 3191 raw counts Bioconductor [15ndash17 19]
edgeRQLF edgeRQLF edgeR 3191 raw counts Bioconductor [15 16 20]
edgeRQLFDetRateedgeRQLF with cellular detec-tion rate as covariate
edgeR 3191 raw counts Bioconductor [15 16 20]
limmatrend limma-trend limma 33013 log2(CPM) Bioconductor [21 22]
MASTcpm MAST MAST 105 log2(CPM+1) Bioconductor [23]
MASTcpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(CPM+1) Bioconductor [23]
MASTtpm MAST MAST 105 log2(TPM+1) Bioconductor [23]
MASTtpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(TPM+1) Bioconductor [23]
metagenomeSeq metagenomeSeqmetagenomeSeq1160
raw counts Bioconductor [24]
monocle monocle (tobit) monocle 220 TPM Bioconductor [25]
monoclecensus monocle (Negative Binomial) monocle 220 Census counts Bioconductor [25 26]
monoclecount monocle (Negative Binomial) monocle 220 raw counts Bioconductor [25]
NODES NODESNODES0009010
raw countsAuthor-providedlink
[27]
ROTScpm ROTS ROTS 120 CPM Bioconductor [28 29]
ROTStpm ROTS ROTS 120 TPM Bioconductor [28 29]
ROTSvoom ROTS ROTS 120voom-transformedraw counts
Bioconductor [28 29]
SAMseq SAMseq samr 20 raw counts CRAN [30]
scDD scDD scDD 100 raw counts Bioconductor [31]
SCDE SCDE scde 220 raw counts Bioconductor [32]
SeuratBimod Seurat (bimod test) Seurat 1407 raw counts GitHub [33 34]
SeuratBimodnofiltSeurat (bimod test) without theinternal filtering
Seurat 1407 raw counts GitHub [33 34]
SeuratBimodIsExpr2Seurat (bimod test) with internalexpression threshold set to 2
Seurat 1407 raw counts GitHub [33 34]
SeuratTobit Seurat (tobit test) Seurat 1407 TPM GitHub [25 33]
ttest t-test stats (R v 33)TMM-normalizedTPM
CRAN [16 35]
voomlimma voom-limma limma 33013 raw counts Bioconductor [21 22]
Wilcoxon Wilcoxon test stats (R v 33)TMM-normalizedTPM
CRAN [16 36]
3
Nature Methods doi101038nmeth4612
Performance
Olga (NBIS) scRNA-seq DE February 2019 31 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
False positives (type I error) vs false negatives (type II error)
Sensitivity and specificity
Precision and recall
(figure adapted from Wikipedia)
Dal Molin, Baruzzo and Di Camillo 2017: 2 conditions of 100 cells each, simulated with 10 000 genes, of which 2 000 were set to be DE (based on NB and bimodal distributions)
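The metrics named on these slides all follow from the four cells of a confusion matrix. The sketch below is a minimal illustration; the counts in the usage lines are hypothetical and chosen only to match a design with 10 000 genes of which 2 000 are truly DE. They are not results from the cited study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the standard metrics from confusion-matrix counts.

    Assumes every denominator is non-zero (at least one true DE gene,
    one true non-DE gene, and one gene called DE).
    """
    return {
        "sensitivity": tp / (tp + fn),  # also called recall or true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "precision": tp / (tp + fp),    # fraction of calls that are truly DE
        "false_positive_rate": fp / (fp + tn),  # type I error rate
        "false_negative_rate": fn / (fn + tp),  # type II error rate
    }


# Hypothetical outcome: 2 000 true DE genes, 8 000 true non-DE genes.
metrics = classification_metrics(tp=1500, fp=400, tn=7600, fn=500)
```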
Consistency
Miao et al. 2017
And so much more
Soneson and Robinson 2018
"Bias, robustness and scalability in single-cell differential expression analysis"
36 statistical approaches for DE analysis to compare the expression levels in the two groups of cells
based on 9 data sets with 11-21 separate instances (sample size effect)
extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.
conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets
Practicalities
Getting to know your data
Example data: 46078 genes x 96 cells; 22229 genes with no expression at all
[Figure: two histograms; x-axes: Read Counts and 0 counts; y-axes: Frequency]
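A first look of this kind can be sketched in a few lines. The example assumes the count matrix is held as a list of gene rows, each a list of per-cell read counts; the function name and the toy matrix are hypothetical.

```python
def summarize_zeros(counts):
    """Summarize zeros in a genes x cells count matrix (list of gene rows)."""
    n_genes = len(counts)
    n_cells = len(counts[0]) if counts else 0
    # Genes with no expression at all: every cell has a zero count.
    n_all_zero_genes = sum(1 for row in counts if not any(row))
    # Number of zero counts per gene, i.e. the input for a histogram of zeros.
    zeros_per_gene = [sum(1 for c in row if c == 0) for row in counts]
    return n_genes, n_cells, n_all_zero_genes, zeros_per_gene


toy_counts = [
    [0, 0, 0],  # never detected
    [5, 0, 2],
    [0, 0, 0],  # never detected
    [1, 1, 1],
]
summary = summarize_zeros(toy_counts)
```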
Choosing DE methods
Soneson and Robinson 2018
Remembering the bigger picture
Stegle, Teichmann and Marioni 2015
QC filtering
Cell-cycle phase
Normalization of cell-specific biases
Confounding factors, incl. batch effects
Detection rate, i.e. the fraction of detected genes per cell
Imputation strategies for dropout values
What is pragmatic: programming language, platform, speed, collaborative workflows, etc.
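The detection rate from the list above is simple to compute per cell. The sketch assumes a genes x cells matrix stored as a list of gene rows and counts a gene as detected when its count is greater than zero; both the threshold and the function name are assumptions made for illustration.

```python
def detection_rate(counts, threshold=0):
    """Fraction of genes with a count above `threshold`, for each cell.

    counts: list of gene rows, each a list of per-cell counts.
    """
    n_genes = len(counts)
    n_cells = len(counts[0])
    return [
        sum(1 for gene_row in counts if gene_row[cell] > threshold) / n_genes
        for cell in range(n_cells)
    ]


rates = detection_rate([[0, 0, 0], [5, 0, 2], [0, 0, 0], [1, 1, 1]])
```

A vector like `rates` is what would be passed as a per-cell covariate to a method that accepts one.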
Staying critical
Summary
What to remember from this hour
https://www.menti.com & 25 06 78
Growing field
Angerer et al. 2017
https://www.scrna-tools.org/tools
Zappia, Phipson and Oshlack 2018
Summary
scRNA-seq is a rapidly growing field
DE is a common task, so many newer and better methods will be developed
understanding basic statistical concepts enables one to think more like a statistician, to choose and evaluate methods given a data set
staying critical, staying updated, staying connected
Bibliography
Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1–9. ISSN 1367-4803. doi:10.1093/bioinformatics/bty329
Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243–260. ISSN 2095-4697. doi:10.1007/s40484-016-0089-7
Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255–261. ISSN 1548-7105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612
Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN 1664-8021. doi:10.3389/fgene.2017.00062
Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data." no. May: 1–2. ISSN 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549
Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133–145
Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85–91. ISSN 2452-3100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004
Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573
Common methods: More detailed examples
SCDE
models the read counts for each gene using a mixture of a NB (negative binomial) and a Poisson distribution
NB distribution models the transcripts that are amplified and detected
Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)
a subset of robust genes is used to fit, via the EM algorithm, the parameters of the mixture of models
For DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach
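The two-component mixture can be made concrete with a small sketch. This is not the scde package's implementation: it assumes a fixed dropout probability and a fixed background Poisson rate of 0.1, whereas a real fit estimates such quantities from the data. All function and parameter names are hypothetical.

```python
import math


def nb_pmf(k, mean, size):
    """Negative binomial probability of count k, parameterized by mean and size."""
    log_p = (
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(size / (size + mean))
        + k * math.log(mean / (size + mean))
    )
    return math.exp(log_p)


def poisson_pmf(k, rate):
    """Poisson probability of count k."""
    return math.exp(k * math.log(rate) - rate - math.lgamma(k + 1))


def mixture_pmf(k, p_dropout, mean, size, background_rate=0.1):
    """Mixture: amplified signal (NB) with prob. 1 - p_dropout, background (Poisson) otherwise."""
    return (1 - p_dropout) * nb_pmf(k, mean, size) + p_dropout * poisson_pmf(k, background_rate)


def dropout_posterior(k, p_dropout, mean, size, background_rate=0.1):
    """Posterior probability that an observed count k came from the dropout component.

    This is the responsibility computed in the E-step of an EM fit.
    """
    dropout_part = p_dropout * poisson_pmf(k, background_rate)
    return dropout_part / mixture_pmf(k, p_dropout, mean, size, background_rate)
```

With these toy parameters a zero count is attributed almost entirely to the dropout component, while a count of 20 is attributed to the NB component.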
Monocle
Originally designed for ordering cells by progress through differentiation stages (pseudo-time)
The mean expression level of each gene is modeled with a GAM (generalized additive model), which relates one or more predictor variables to a response variable as
$g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$
where $Y$ is a specific gene expression level, $x_i$ are predictor variables, $g$ is a link function (typically the log function) and $f_i$ are non-parametric functions (e.g. cubic splines)
The observable expression level $Y$ is then modeled using a GAM
$E(Y) = s(\phi_t(b_x, s_i)) + \epsilon$
where $\phi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and $s$ is a cubic smoothing function with three degrees of freedom. The error term $\epsilon$ is normally distributed with a mean of zero
The DE test is performed using an approx. $\chi^2$ likelihood ratio test
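For reference, the general form of a likelihood ratio test statistic is written out below. The slide does not spell out the reduced model; taking it to be the model without the pseudo-time smooth term is an assumption made here for illustration. The block uses `\overset` and `\text`, which assume the amsmath package.

```latex
% \ell_0: maximized log-likelihood of the reduced model (assumed: no smooth term)
% \ell_1: maximized log-likelihood of the full model (with the smooth term)
% k: difference in the number of free parameters between the two models
\[
  D = -2\,(\ell_0 - \ell_1) = 2\,(\ell_1 - \ell_0),
  \qquad
  D \;\overset{\text{approx.}}{\sim}\; \chi^2_{k}
  \quad \text{under } H_0 .
\]
```

A large value of $D$ relative to the $\chi^2_k$ distribution indicates that the extra term improves the fit more than expected by chance.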
Performance
Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle.
Short name | Method | Software version | Input | Available from | Reference
BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11]
D3E | D3E | D3E 1.0 | raw counts | GitHub | [12]
DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13]
DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14]
edgeRLRT | edgeRLRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17]
edgeRLRTcensus | edgeRLRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17]
edgeRLRTdeconv | edgeRLRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18]
edgeRLRTrobust | edgeRLRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19]
edgeRQLF | edgeRQLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
edgeRQLFDetRate | edgeRQLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22]
MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24]
monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25]
monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26]
monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25]
NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27]
ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29]
ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29]
ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29]
SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30]
scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31]
SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32]
SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33]
ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35]
voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22]
Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36]
Source: Nature Methods, doi:10.1038/nmeth.4612
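Several input types in the table are simple transformations of counts. The sketch below shows CPM and log2(CPM+1) for one cell using plain library-size scaling. This is an assumption made for illustration: it does not reproduce edgeR's calculation, which can also apply normalization factors. The function names are hypothetical.

```python
import math


def cpm(cell_counts):
    """Counts per million for one cell: scale each count by the cell's total."""
    total = sum(cell_counts)
    if total == 0:
        raise ValueError("cell has no counts; CPM is undefined")
    return [count * 1e6 / total for count in cell_counts]


def log2_cpm_plus_one(cell_counts):
    """log2(CPM + 1); the pseudo-count of 1 keeps zero counts at exactly 0."""
    return [math.log2(value + 1) for value in cpm(cell_counts)]


example_cpm = cpm([1, 3, 0, 6])
example_log = log2_cpm_plus_one([1, 3, 0, 6])
```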
- Outline
- Introduction
- Common methods
-
- More detailed examples
-
- Performance
- Practicalities
- Summary
- Bibliography
-
Performance
Performance
Olga (NBIS) scRNA-seq DE February 2019 30 46
Supplementary Table 2 Evaluated dicrarrerential expression methods together with package versions and thetype of input values provided to each of them Note that ldquoraw countsrdquo here refers to length-scaled TPMswhich are on the scale of the raw counts but are unacrarrected by dicrarrerential isoform usage [10] CPM valuesare calculated with edgeR and Census counts with monocle
Short name Method Software version InputAvailablefrom
Reference
BPSC BPSC BPSC 09901 CPM GitHub [11]
D3E D3E D3E 10 raw counts GitHub [12]
DESeq2 DESeq2 DESeq2 1141 raw counts Bioconductor [13]
DESeq2betapFALSE DESeq2 without beta prior DESeq2 1141 raw counts Bioconductor [13]
DESeq2census DESeq2 DESeq2 1141 Census counts Bioconductor [13]
DESeq2nofiltDESeq2 without the built-in in-dependent filtering
DESeq2 1141 raw counts Bioconductor [13]
DEsingle DEsingle DEsingle 010 raw counts GitHub [14]
edgeRLRT edgeRLRT edgeR 3191 raw counts Bioconductor [15ndash17]
edgeRLRTcensus edgeRLRT edgeR 3191 Census counts Bioconductor [15ndash17]
edgeRLRTdeconvedgeRLRT with deconvolutionnormalization
edgeR 3191scran 120
raw counts Bioconductor [15 17 18]
edgeRLRTrobustedgeRLRT with robust disper-sion estimation
edgeR 3191 raw counts Bioconductor [15ndash17 19]
edgeRQLF edgeRQLF edgeR 3191 raw counts Bioconductor [15 16 20]
edgeRQLFDetRateedgeRQLF with cellular detec-tion rate as covariate
edgeR 3191 raw counts Bioconductor [15 16 20]
limmatrend limma-trend limma 33013 log2(CPM) Bioconductor [21 22]
MASTcpm MAST MAST 105 log2(CPM+1) Bioconductor [23]
MASTcpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(CPM+1) Bioconductor [23]
MASTtpm MAST MAST 105 log2(TPM+1) Bioconductor [23]
MASTtpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(TPM+1) Bioconductor [23]
metagenomeSeq metagenomeSeqmetagenomeSeq1160
raw counts Bioconductor [24]
monocle monocle (tobit) monocle 220 TPM Bioconductor [25]
monoclecensus monocle (Negative Binomial) monocle 220 Census counts Bioconductor [25 26]
monoclecount monocle (Negative Binomial) monocle 220 raw counts Bioconductor [25]
NODES NODESNODES0009010
raw countsAuthor-providedlink
[27]
ROTScpm ROTS ROTS 120 CPM Bioconductor [28 29]
ROTStpm ROTS ROTS 120 TPM Bioconductor [28 29]
ROTSvoom ROTS ROTS 120voom-transformedraw counts
Bioconductor [28 29]
SAMseq SAMseq samr 20 raw counts CRAN [30]
scDD scDD scDD 100 raw counts Bioconductor [31]
SCDE SCDE scde 220 raw counts Bioconductor [32]
SeuratBimod Seurat (bimod test) Seurat 1407 raw counts GitHub [33 34]
SeuratBimodnofiltSeurat (bimod test) without theinternal filtering
Seurat 1407 raw counts GitHub [33 34]
SeuratBimodIsExpr2Seurat (bimod test) with internalexpression threshold set to 2
Seurat 1407 raw counts GitHub [33 34]
SeuratTobit Seurat (tobit test) Seurat 1407 TPM GitHub [25 33]
ttest t-test stats (R v 33)TMM-normalizedTPM
CRAN [16 35]
voomlimma voom-limma limma 33013 raw counts Bioconductor [21 22]
Wilcoxon Wilcoxon test stats (R v 33)TMM-normalizedTPM
CRAN [16 36]
3
Nature Methods doi101038nmeth4612
Performance
Olga (NBIS) scRNA-seq DE February 2019 31 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
False positives (type I error) vs false negatives (type II error)
Sensitivity and specificity
Precision and recall
[Figure adapted from Wikipedia]
Dal Molin, Baruzzo and Di Camillo 2017: 2 conditions of 100 cells each, simulated with 10 000 genes, of which 2 000 were set to be DE (based on NB and bimodal distributions)
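To make the metrics concrete, here is a small worked example using the simulation size quoted above (10 000 genes, 2 000 truly DE). The numbers of true and false positive calls are invented for a hypothetical method and are not results from Dal Molin et al.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return common performance metrics from confusion-matrix counts."""
    return {
        "sensitivity (recall)": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "false positive rate": fp / (fp + tn),
    }

n_genes, n_true_de = 10_000, 2_000   # simulation size quoted on the slide
tp, fp = 1_500, 400                  # invented calls of a hypothetical method
fn = n_true_de - tp                  # truly DE genes that were missed
tn = (n_genes - n_true_de) - fp      # non-DE genes correctly left uncalled

metrics = classification_metrics(tp, fp, fn, tn)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts the sensitivity is 0.750, the specificity 0.950 and the precision about 0.789. Note that precision depends on the proportion of truly DE genes, whereas sensitivity and specificity do not.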
Consistency
Miao et al. 2017
And so much more
Soneson and Robinson 2018: "Bias, robustness and scalability in single-cell differential expression analysis"
36 statistical approaches for DE analysis, comparing the expression levels in two groups of cells
based on 9 data sets with 11 to 21 separate instances (sample size effect)
extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.
conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets
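One simple way to quantify similarity between methods is the overlap of their DE gene lists. The sketch below uses the Jaccard index; this is an illustrative choice and not necessarily the measure used by Soneson and Robinson, and the method names and gene sets are invented.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard index: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Invented DE calls from three hypothetical methods
de_calls = {
    "method_A": {"g1", "g2", "g3", "g4"},
    "method_B": {"g2", "g3", "g4", "g5"},
    "method_C": {"g1", "g6"},
}

similarity = {
    (m1, m2): jaccard(de_calls[m1], de_calls[m2])
    for m1, m2 in combinations(sorted(de_calls), 2)
}
for (m1, m2), value in similarity.items():
    print(f"{m1} vs {m2}: {value:.2f}")
```

A value of 1 means two methods call exactly the same genes, 0 means no gene in common. The same function can be applied to one method run on different subsamples to look at consistency.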
Practicalities
Getting to know your data
Example data: 46078 genes x 96 cells
22229 genes with no expression at all
[Figure: two histograms, one of read counts and one of 0 counts, each with frequency on the y-axis]
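The kind of summaries shown on this slide can be computed directly from a count matrix. A minimal sketch on an invented 5 genes x 4 cells matrix (the real example data has 46078 genes x 96 cells); all variable names are illustrative.

```python
# Toy genes x cells count matrix (invented values)
counts = [
    [0, 0, 0, 0],    # gene with no expression at all
    [5, 0, 12, 3],
    [0, 0, 0, 0],    # gene with no expression at all
    [1, 0, 0, 40],
    [7, 2, 0, 9],
]
n_genes, n_cells = len(counts), len(counts[0])

# Genes never detected in any cell
silent_genes = sum(1 for row in counts if sum(row) == 0)

# Per-cell summaries: zip(*counts) iterates over columns, i.e. cells
reads_per_cell = [sum(cell) for cell in zip(*counts)]
zeros_per_cell = [sum(1 for v in cell if v == 0) for cell in zip(*counts)]

print(f"{n_genes} genes x {n_cells} cells")
print(f"{silent_genes} genes with no expression at all")
print("read counts per cell:", reads_per_cell)
print("zero counts per cell:", zeros_per_cell)
```

Histograms of `reads_per_cell` and `zeros_per_cell` correspond to the type of plots one would look at before choosing filtering thresholds.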
Choosing DE methods
Soneson and Robinson 2018
Remembering the bigger picture
Stegle, Teichmann and Marioni 2015
QC filtering
Cell-cycle phase
Normalization of cell-specific biases
Confounding factors, incl. batch effects
Detection rate, i.e. the fraction of detected genes per cell
Imputation strategies for dropout values
What is pragmatic: programming language, platform, speed, collaborative workflows, etc.
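Two of the points above are easy to illustrate on a toy matrix: the detection rate per cell, and a simple library-size normalization to counts per million (CPM). The values are invented, and CPM is only one of several possible normalizations; in practice one would use the functions of a dedicated package.

```python
# Toy genes x cells count matrix (invented values)
counts = [
    [0, 0, 0, 0],
    [5, 0, 12, 3],
    [1, 0, 0, 40],
    [7, 2, 0, 9],
]
cells = list(zip(*counts))  # one tuple of gene counts per cell

# Detection rate: fraction of genes with at least one count in a cell
detection_rate = [sum(v > 0 for v in cell) / len(cell) for cell in cells]

# Library size: total counts per cell, used to scale to counts per million
library_size = [sum(cell) for cell in cells]
cpm = [[v / lib * 1e6 for v, lib in zip(row, library_size)]
       for row in counts]

print("detection rate per cell:", detection_rate)
print("library size per cell:", library_size)
```

The detection rate can then be included as a covariate in the DE model, as done in the DetRate variants of edgeRQLF and MAST listed in the methods table.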
Staying critical
Summary
What to remember from this hour
https://www.menti.com & 25 06 78
Growing field
Angerer et al. 2017
https://www.scrna-tools.org/tools
Zappia, Phipson and Oshlack 2018
scRNA-seq is a rapidly growing field
DE is a common task, so many newer and better methods will be developed
understanding basic statistical concepts enables one to think more like a statistician, to choose and evaluate methods given a data set
staying critical, staying updated, staying connected
Bibliography
Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1-9. ISSN 1367-4803. doi:10.1093/bioinformatics/bty329
Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243-260. ISSN 2095-4697. doi:10.1007/s40484-016-0089-7
Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255-261. ISSN 1548-7105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612
Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN 1664-8021. doi:10.3389/fgene.2017.00062
Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data." No. May: 1-2. ISSN 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549
Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133-145
Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85-91. ISSN 2452-3100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004
Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573
Performance
Olga (NBIS) scRNA-seq DE February 2019 31 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
False positives (type I error) vs false negatives (type II error)Sensitivity and specificityPrecision and recall
adapted from Wikipedia
Olga (NBIS) scRNA-seq DE February 2019 33 46
Performance
False positives (type I error) vs false negatives (type II error)Sensitivity and specificityPrecision and recall
Dal Molin Baruzzo and Di Camillo 2017 2 conditions of 100 cells each simulated with 10 000genes out of which 2 000 set to DEs (based on NB and bimodal distributions)
Olga (NBIS) scRNA-seq DE February 2019 34 46
Performance
Consistency
Miao et al 2017
Olga (NBIS) scRNA-seq DE February 2019 35 46
Performance
And so much more
Soneson and Robinson 2018
Bias robustness and scalability in single-celldifferential expression analysis
36 statistical approaches for DE analysisto compare the expression levels in thetwo groups of cells
based on 9 data sets with 11 - 21separate instances (sample size effect)
extensive evaluation metrics incl numberof genes found characteristics of the falsepositive detections robustness ofmethods similarities between methodsetc
conquer a collection of consistentlyprocessed analysis-ready publicscRNA-seq data sets
Olga (NBIS) scRNA-seq DE February 2019 36 46
Practicalities
Practicalities
Olga (NBIS) scRNA-seq DE February 2019 37 46
Practicalities
Getting to know your dataExample data 46078 genes x 96 cells22229 genes with no expression at all
Read Counts
Fre
quen
cy
0 500 1000 1500
050
0015
000
0 counts
Fre
quen
cy
0 20 40 60 800
2000
4000
6000
Olga (NBIS) scRNA-seq DE February 2019 38 46
Practicalities
Choosing DE methods
Soneson and Robinson 2018
Olga (NBIS) scRNA-seq DE February 2019 39 46
Practicalities
Rembering the bigger picture
Stegle Teichmann and Marioni 2015
QC filtering
Cell-cycle phase
Normalization of cell-specific biases
Confounding factors incl batcheffects
Detection rate ie the fraction ofdetected genes per cell
Imputations strategies for dropoutvalues
What is pragmatic programminglanguage platform speedcollaborative workflows etc
Olga (NBIS) scRNA-seq DE February 2019 40 46
Practicalities
Staying critical
Olga (NBIS) scRNA-seq DE February 2019 41 46
Summary
What to remember from this hour
httpswwwmenticom amp 25 06 78
Olga (NBIS) scRNA-seq DE February 2019 42 46
Summary
Growing field
Angerer et al 2017
Olga (NBIS) scRNA-seq DE February 2019 43 46
Summary
Growing field
httpswwwscrna-toolsorgtools
Zappia Phipson and Oshlack 2018
Olga (NBIS) scRNA-seq DE February 2019 44 46
Summary
SummaryscRNA-seq is a rapidly growing field
DE is a common task so many newer and better methods will be developed
understanding basic statistical concepts enables one to think more like a statistician tochoose and evaluate methods given data set
staying critical staying updated staying connected
Olga (NBIS) scRNA-seq DE February 2019 45 46
Bibliography
Wu Zhijin et al 2017 ldquoTwo-phase differential expression analysis for single cell RNA-seqrdquoBioinformatics 00 (00) 1ndash9 ISSN 1367-4803 doi101093bioinformaticsbty329
Miao Zhun and Xuegong Zhang 2016 ldquoDifferential expression analyses for single-cellRNA-Seq old questions on new datardquo Quantitative Biology 4 (4) 243ndash260 ISSN 20954697doi101007s40484-016-0089-7
Soneson Charlotte and Mark D Robinson 2018 ldquoBias robustness and scalability in single-celldifferential expression analysisrdquo Nature Methods 15 (4) 255ndash261 ISSN 15487105doi101038nmeth4612 httpdxdoiorg101038nmeth4612
Dal Molin Alessandra Giacomo Baruzzo and Barbara Di Camillo 2017 ldquoSingle-cellRNA-sequencing Assessment of differential expression analysis methodsrdquo Frontiers inGenetics 8 (MAY) ISSN 16648021 doi103389fgene201700062
Miao Zhun et al 2017 ldquoDEsingle for detecting three types of differential expression in single-cellRNA-seq datardquo no May 1ndash2 ISSN 1367-4803 doi101093bioinformaticsbty332arXiv 103549
Stegle Oliver Sarah A Teichmann and John C Marioni 2015 ldquoComputational and analyticalchallenges in single-cell transcriptomicsrdquo Nature reviews Genetics 16 (January 2014)133ndash145
Angerer Philipp et al 2017 ldquoSingle cells make big data New challenges and opportunities intranscriptomicsrdquo Current Opinion in Systems Biology 485ndash91 ISSN 24523100doi101016jcoisb201707004httpdxdoiorg101016jcoisb201707004
Zappia Luke Belinda Phipson and Alicia Oshlack 2018 ldquoExploring the single-cell RNA-seqanalysis landscape with the scRNA-tools databaserdquo bioRxiv 206573 doi101101206573httpswwwbiorxivorgcontentearly20180323206573
Olga (NBIS) scRNA-seq DE February 2019 46 46
- Outline
- Introduction
- Common methods
-
- More detailed examples
-
- Performance
- Practicalities
- Summary
- Bibliography
-
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
False positives (type I error) vs false negatives (type II error)Sensitivity and specificityPrecision and recall
adapted from Wikipedia
Olga (NBIS) scRNA-seq DE February 2019 33 46
Performance
False positives (type I error) vs false negatives (type II error)Sensitivity and specificityPrecision and recall
Dal Molin Baruzzo and Di Camillo 2017 2 conditions of 100 cells each simulated with 10 000genes out of which 2 000 set to DEs (based on NB and bimodal distributions)
Olga (NBIS) scRNA-seq DE February 2019 34 46
Performance
Consistency
Miao et al 2017
Olga (NBIS) scRNA-seq DE February 2019 35 46
Performance
And so much more
Soneson and Robinson 2018
Bias robustness and scalability in single-celldifferential expression analysis
36 statistical approaches for DE analysisto compare the expression levels in thetwo groups of cells
based on 9 data sets with 11 - 21separate instances (sample size effect)
extensive evaluation metrics incl numberof genes found characteristics of the falsepositive detections robustness ofmethods similarities between methodsetc
conquer a collection of consistentlyprocessed analysis-ready publicscRNA-seq data sets
Olga (NBIS) scRNA-seq DE February 2019 36 46
Practicalities
Practicalities
Olga (NBIS) scRNA-seq DE February 2019 37 46
Practicalities
Getting to know your dataExample data 46078 genes x 96 cells22229 genes with no expression at all
Read Counts
Fre
quen
cy
0 500 1000 1500
050
0015
000
0 counts
Fre
quen
cy
0 20 40 60 800
2000
4000
6000
Olga (NBIS) scRNA-seq DE February 2019 38 46
Practicalities
Choosing DE methods
Soneson and Robinson 2018
Olga (NBIS) scRNA-seq DE February 2019 39 46
Practicalities
Rembering the bigger picture
Stegle Teichmann and Marioni 2015
QC filtering
Cell-cycle phase
Normalization of cell-specific biases
Confounding factors incl batcheffects
Detection rate ie the fraction ofdetected genes per cell
Imputations strategies for dropoutvalues
What is pragmatic programminglanguage platform speedcollaborative workflows etc
Olga (NBIS) scRNA-seq DE February 2019 40 46
Practicalities
Staying critical
Olga (NBIS) scRNA-seq DE February 2019 41 46
Summary
What to remember from this hour
httpswwwmenticom amp 25 06 78
Olga (NBIS) scRNA-seq DE February 2019 42 46
Summary
Growing field
Angerer et al 2017
Olga (NBIS) scRNA-seq DE February 2019 43 46
Summary
Growing field
httpswwwscrna-toolsorgtools
Zappia Phipson and Oshlack 2018
Olga (NBIS) scRNA-seq DE February 2019 44 46
Summary
SummaryscRNA-seq is a rapidly growing field
DE is a common task so many newer and better methods will be developed
understanding basic statistical concepts enables one to think more like a statistician tochoose and evaluate methods given data set
staying critical staying updated staying connected
Olga (NBIS) scRNA-seq DE February 2019 45 46
Bibliography
Wu Zhijin et al 2017 ldquoTwo-phase differential expression analysis for single cell RNA-seqrdquoBioinformatics 00 (00) 1ndash9 ISSN 1367-4803 doi101093bioinformaticsbty329
Miao Zhun and Xuegong Zhang 2016 ldquoDifferential expression analyses for single-cellRNA-Seq old questions on new datardquo Quantitative Biology 4 (4) 243ndash260 ISSN 20954697doi101007s40484-016-0089-7
Soneson Charlotte and Mark D Robinson 2018 ldquoBias robustness and scalability in single-celldifferential expression analysisrdquo Nature Methods 15 (4) 255ndash261 ISSN 15487105doi101038nmeth4612 httpdxdoiorg101038nmeth4612
Dal Molin Alessandra Giacomo Baruzzo and Barbara Di Camillo 2017 ldquoSingle-cellRNA-sequencing Assessment of differential expression analysis methodsrdquo Frontiers inGenetics 8 (MAY) ISSN 16648021 doi103389fgene201700062
Miao Zhun et al 2017 ldquoDEsingle for detecting three types of differential expression in single-cellRNA-seq datardquo no May 1ndash2 ISSN 1367-4803 doi101093bioinformaticsbty332arXiv 103549
Stegle Oliver Sarah A Teichmann and John C Marioni 2015 ldquoComputational and analyticalchallenges in single-cell transcriptomicsrdquo Nature reviews Genetics 16 (January 2014)133ndash145
Angerer Philipp et al 2017 ldquoSingle cells make big data New challenges and opportunities intranscriptomicsrdquo Current Opinion in Systems Biology 485ndash91 ISSN 24523100doi101016jcoisb201707004httpdxdoiorg101016jcoisb201707004
Zappia Luke Belinda Phipson and Alicia Oshlack 2018 ldquoExploring the single-cell RNA-seqanalysis landscape with the scRNA-tools databaserdquo bioRxiv 206573 doi101101206573httpswwwbiorxivorgcontentearly20180323206573
Olga (NBIS) scRNA-seq DE February 2019 46 46
- Outline
- Introduction
- Common methods
-
- More detailed examples
-
- Performance
- Practicalities
- Summary
- Bibliography
-
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
False positives (type I error) vs false negatives (type II error)Sensitivity and specificityPrecision and recall
adapted from Wikipedia
Olga (NBIS) scRNA-seq DE February 2019 33 46
Performance
False positives (type I error) vs false negatives (type II error)Sensitivity and specificityPrecision and recall
Dal Molin Baruzzo and Di Camillo 2017 2 conditions of 100 cells each simulated with 10 000genes out of which 2 000 set to DEs (based on NB and bimodal distributions)
Olga (NBIS) scRNA-seq DE February 2019 34 46
Performance
Consistency
Miao et al 2017
Olga (NBIS) scRNA-seq DE February 2019 35 46
Performance
And so much more
Soneson and Robinson 2018
Bias robustness and scalability in single-celldifferential expression analysis
36 statistical approaches for DE analysisto compare the expression levels in thetwo groups of cells
based on 9 data sets with 11 - 21separate instances (sample size effect)
extensive evaluation metrics incl numberof genes found characteristics of the falsepositive detections robustness ofmethods similarities between methodsetc
conquer a collection of consistentlyprocessed analysis-ready publicscRNA-seq data sets
Olga (NBIS) scRNA-seq DE February 2019 36 46
Practicalities
Practicalities
Olga (NBIS) scRNA-seq DE February 2019 37 46
Practicalities
Getting to know your dataExample data 46078 genes x 96 cells22229 genes with no expression at all
Read Counts
Fre
quen
cy
0 500 1000 1500
050
0015
000
0 counts
Fre
quen
cy
0 20 40 60 800
2000
4000
6000
Olga (NBIS) scRNA-seq DE February 2019 38 46
Practicalities
Choosing DE methods
Soneson and Robinson 2018
Olga (NBIS) scRNA-seq DE February 2019 39 46
Practicalities
Rembering the bigger picture
Stegle Teichmann and Marioni 2015
QC filtering
Cell-cycle phase
Normalization of cell-specific biases
Confounding factors incl batcheffects
Detection rate ie the fraction ofdetected genes per cell
Imputations strategies for dropoutvalues
What is pragmatic programminglanguage platform speedcollaborative workflows etc
Olga (NBIS) scRNA-seq DE February 2019 40 46
Practicalities
Staying critical
Olga (NBIS) scRNA-seq DE February 2019 41 46
Summary
What to remember from this hour
httpswwwmenticom amp 25 06 78
Olga (NBIS) scRNA-seq DE February 2019 42 46
Summary
Growing field
Angerer et al 2017
Olga (NBIS) scRNA-seq DE February 2019 43 46
Summary
Growing field
httpswwwscrna-toolsorgtools
Zappia Phipson and Oshlack 2018
Olga (NBIS) scRNA-seq DE February 2019 44 46
Summary
SummaryscRNA-seq is a rapidly growing field
DE is a common task so many newer and better methods will be developed
understanding basic statistical concepts enables one to think more like a statistician tochoose and evaluate methods given data set
staying critical staying updated staying connected
Olga (NBIS) scRNA-seq DE February 2019 45 46
Bibliography
Wu Zhijin et al 2017 ldquoTwo-phase differential expression analysis for single cell RNA-seqrdquoBioinformatics 00 (00) 1ndash9 ISSN 1367-4803 doi101093bioinformaticsbty329
Miao Zhun and Xuegong Zhang 2016 ldquoDifferential expression analyses for single-cellRNA-Seq old questions on new datardquo Quantitative Biology 4 (4) 243ndash260 ISSN 20954697doi101007s40484-016-0089-7
Soneson Charlotte and Mark D Robinson 2018 ldquoBias robustness and scalability in single-celldifferential expression analysisrdquo Nature Methods 15 (4) 255ndash261 ISSN 15487105doi101038nmeth4612 httpdxdoiorg101038nmeth4612
Dal Molin Alessandra Giacomo Baruzzo and Barbara Di Camillo 2017 ldquoSingle-cellRNA-sequencing Assessment of differential expression analysis methodsrdquo Frontiers inGenetics 8 (MAY) ISSN 16648021 doi103389fgene201700062
Miao Zhun et al 2017 ldquoDEsingle for detecting three types of differential expression in single-cellRNA-seq datardquo no May 1ndash2 ISSN 1367-4803 doi101093bioinformaticsbty332arXiv 103549
Stegle Oliver Sarah A Teichmann and John C Marioni 2015 ldquoComputational and analyticalchallenges in single-cell transcriptomicsrdquo Nature reviews Genetics 16 (January 2014)133ndash145
Angerer Philipp et al 2017 ldquoSingle cells make big data New challenges and opportunities intranscriptomicsrdquo Current Opinion in Systems Biology 485ndash91 ISSN 24523100doi101016jcoisb201707004httpdxdoiorg101016jcoisb201707004
Zappia Luke Belinda Phipson and Alicia Oshlack 2018 ldquoExploring the single-cell RNA-seqanalysis landscape with the scRNA-tools databaserdquo bioRxiv 206573 doi101101206573httpswwwbiorxivorgcontentearly20180323206573
Olga (NBIS) scRNA-seq DE February 2019 46 46
- Outline
- Introduction
- Common methods
-
- More detailed examples
-
- Performance
- Practicalities
- Summary
- Bibliography
-
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE February 2019 32 46
Performance
False positives (type I error) vs false negatives (type II error)Sensitivity and specificityPrecision and recall
adapted from Wikipedia
Olga (NBIS) scRNA-seq DE February 2019 33 46
Performance
False positives (type I error) vs false negatives (type II error)Sensitivity and specificityPrecision and recall
Dal Molin Baruzzo and Di Camillo 2017 2 conditions of 100 cells each simulated with 10 000genes out of which 2 000 set to DEs (based on NB and bimodal distributions)
Olga (NBIS) scRNA-seq DE February 2019 34 46
Consistency
Miao et al. 2017
And so much more
Soneson and Robinson 2018: Bias, robustness and scalability in single-cell differential expression analysis
36 statistical approaches for DE analysis to compare the expression levels in the two groups of cells
based on 9 data sets with 11-21 separate instances (sample size effect)
extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.
conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets
Practicalities
Getting to know your data
Example data: 46078 genes x 96 cells; 22229 genes with no expression at all
Figure: histograms of read counts and of 0 counts (y-axis: frequency)
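A first look at a count matrix along these lines can be sketched as below. The tiny genes x cells matrix is made up for illustration (the example data from the slide is not reproduced here), and the variable names are hypothetical.

```python
# Toy count matrix: rows are genes, columns are cells (values invented).
counts = [
    [0, 0, 0, 0],
    [5, 0, 2, 0],
    [0, 0, 0, 0],
    [10, 3, 8, 1],
]

n_genes = len(counts)
n_cells = len(counts[0])

# Genes with no expression at all, i.e. zero counts in every cell.
n_unexpressed = sum(1 for row in counts if not any(row))

# Number of cells with a zero count, per gene (input for a "0 counts" histogram).
zeros_per_gene = [row.count(0) for row in counts]

# Total read count per gene (input for a "read counts" histogram).
total_per_gene = [sum(row) for row in counts]
```

Genes that are zero in every cell carry no information for DE testing and are typically filtered out before any method is applied.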
Choosing DE methods
Soneson and Robinson 2018
Remembering the bigger picture
Stegle, Teichmann and Marioni 2015
QC filtering
Cell-cycle phase
Normalization of cell-specific biases
Confounding factors, incl. batch effects
Detection rate, i.e. the fraction of detected genes per cell
Imputation strategies for dropout values
What is pragmatic: programming language, platform, speed, collaborative workflows, etc.
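The detection rate named in the list above can be computed per cell as the share of genes with a non-zero count. The sketch assumes a genes x cells matrix of raw counts and treats any count above zero as "detected"; the function name `detection_rate` and the toy values are hypothetical.

```python
def detection_rate(counts):
    """Fraction of genes with a count above zero, for each cell (column)."""
    n_genes = len(counts)
    # zip(*counts) iterates over columns, i.e. over cells.
    return [sum(value > 0 for value in cell) / n_genes for cell in zip(*counts)]


# Toy count matrix: rows are genes, columns are cells (values invented).
toy_counts = [
    [0, 0, 0, 0],
    [5, 0, 2, 0],
    [0, 0, 0, 0],
    [10, 3, 8, 1],
]

rates = detection_rate(toy_counts)
```

Because the detection rate can differ between groups of cells for technical reasons, some DE methods allow it to be included as a covariate in the model.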
Staying critical
Summary
What to remember from this hour
https://www.menti.com & 25 06 78
Growing field
Angerer et al. 2017
https://www.scrna-tools.org/tools
Zappia, Phipson and Oshlack 2018
scRNA-seq is a rapidly growing field
DE is a common task, so many newer and better methods will be developed
understanding basic statistical concepts enables one to think more like a statistician, to choose and evaluate methods for a given data set
staying critical, staying updated, staying connected
Bibliography
Wu, Zhijin, et al. 2017. “Two-phase differential expression analysis for single cell RNA-seq.” Bioinformatics 00 (00): 1–9. ISSN: 1367-4803. doi:10.1093/bioinformatics/bty329.
Miao, Zhun, and Xuegong Zhang. 2016. “Differential expression analyses for single-cell RNA-Seq: old questions on new data.” Quantitative Biology 4 (4): 243–260. ISSN: 20954697. doi:10.1007/s40484-016-0089-7.
Soneson, Charlotte, and Mark D. Robinson. 2018. “Bias, robustness and scalability in single-cell differential expression analysis.” Nature Methods 15 (4): 255–261. ISSN: 15487105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612.
Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. “Single-cell RNA-sequencing: Assessment of differential expression analysis methods.” Frontiers in Genetics 8 (MAY). ISSN: 16648021. doi:10.3389/fgene.2017.00062.
Miao, Zhun, et al. 2017. “DEsingle for detecting three types of differential expression in single-cell RNA-seq data,” no. May: 1–2. ISSN: 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549.
Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. “Computational and analytical challenges in single-cell transcriptomics.” Nature reviews Genetics 16 (January 2014): 133–145.
Angerer, Philipp, et al. 2017. “Single cells make big data: New challenges and opportunities in transcriptomics.” Current Opinion in Systems Biology 4: 85–91. ISSN: 24523100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004.
Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. “Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database.” bioRxiv: 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573.
Practicalities
Practicalities
Olga (NBIS) scRNA-seq DE February 2019 37 46
Practicalities
Getting to know your dataExample data 46078 genes x 96 cells22229 genes with no expression at all
Read Counts
Fre
quen
cy
0 500 1000 1500
050
0015
000
0 counts
Fre
quen
cy
0 20 40 60 800
2000
4000
6000
Olga (NBIS) scRNA-seq DE February 2019 38 46
Practicalities
Choosing DE methods
Soneson and Robinson 2018
Remembering the bigger picture
Stegle, Teichmann and Marioni 2015
QC filtering
Cell-cycle phase
Normalization of cell-specific biases
Confounding factors, incl. batch effects
Detection rate, i.e. the fraction of detected genes per cell
Imputation strategies for dropout values
What is pragmatic: programming language, platform, speed, collaborative workflows, etc.
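Two of the items above can be made concrete with a few lines of code. The sketch below computes the detection rate per cell and applies a simple library-size scaling as one possible normalization of cell-specific biases. The function names, the toy matrix and the scaling target are assumptions for illustration only; real analyses would typically rely on an established normalization method rather than this plain rescaling.

```python
def detection_rate(counts):
    """Fraction of genes with at least one read, per cell (column)."""
    n_genes = len(counts)
    n_cells = len(counts[0])
    return [
        sum(1 for row in counts if row[j] > 0) / n_genes
        for j in range(n_cells)
    ]


def library_size_normalize(counts, scale=10_000):
    """Rescale each cell so that its counts sum to `scale` (an assumed, simple choice)."""
    n_cells = len(counts[0])
    totals = [sum(row[j] for row in counts) for j in range(n_cells)]
    return [
        [row[j] * scale / totals[j] if totals[j] > 0 else 0.0
         for j in range(n_cells)]
        for row in counts
    ]


# Toy matrix: 4 genes (rows) x 3 cells (columns).
toy = [
    [10, 0, 3],
    [0, 0, 5],
    [5, 2, 0],
    [5, 0, 2],
]

rates = detection_rate(toy)
norm = library_size_normalize(toy, scale=100)

print("detection rate per cell:", rates)
print("normalized counts, first gene:", norm[0])
```

Cells with very different detection rates, like the second cell here, are a hint that the detection rate may need to be accounted for as a covariate in the DE model.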
Staying critical
Summary
What to remember from this hour
https://www.menti.com & 25 06 78
Growing field
Angerer et al. 2017
Growing field
https://www.scrna-tools.org/tools
Zappia, Phipson and Oshlack 2018
Summary
scRNA-seq is a rapidly growing field
DE is a common task, so many newer and better methods will be developed
understanding basic statistical concepts enables one to think more like a statistician, to choose and evaluate methods for a given data set
staying critical, staying updated, staying connected
Bibliography
Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1–9. ISSN: 1367-4803. doi:10.1093/bioinformatics/bty329.
Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243–260. ISSN: 2095-4697. doi:10.1007/s40484-016-0089-7.
Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255–261. ISSN: 1548-7105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612.
Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN: 1664-8021. doi:10.3389/fgene.2017.00062.
Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data," no. May: 1–2. ISSN: 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549.
Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133–145.
Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85–91. ISSN: 2452-3100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004.
Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv, 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573.