scrna-seq - differential expression analyses2018/05/21 · known from bulk rna-seq and microarray...
TRANSCRIPT
scRNA-seq: Differential expression analyses
Olga Dethlefsen, olga.dethlefsen@nbis.se
NBIS National Bioinformatics Infrastructure Sweden
May 2018
Olga (NBIS) scRNA-seq DE May 2018 1 43
Outline
- Introduction: what is so special about scRNA-seq DE
- Common methods: what is out there
- Performance: how do we know what is best
- Practicalities: what to do in real life
- Summary: what to remember from this hour
Let's get to know each other
https://www.menti.com
Introduction
Figure: Simplified scRNA-seq workflow [adapted from Wikipedia]
[Figure adapted from Wu et al. 2017]
Differential expression means
- taking read count data and
- performing statistical analysis to discover quantitative changes in expression levels between experimental groups
- i.e. deciding whether, for a given gene, an observed difference in read counts is significant (greater than what would be expected just due to natural random variation)

Differential expression is an old problem
- known from bulk RNA-seq and microarray studies
- in fact, building on one of the most common statistical problems, i.e. comparing groups for statistical differences
Differential expression is an old problem
So what is all the commotion about?
https://www.menti.com & 70 52 87
scRNA-seq special characteristics
- high noise levels (technical and biological factors)
- low library sizes
- low amount of available mRNAs results in amplification biases and dropout events
- 3' bias, partial coverage and uneven depth (technical)
- stochastic nature of transcription (biological)
- multimodality in gene expression: presence of multiple possible cell states within a cell population (biological)
[Figure: density plots of expression values (x-axis: value, y-axis: density) for 20 example genes: Rbm17, Rragc, Slc1a3, Slc22a20, Smarcd1, Mybpc1, Nars, Ndufa3, Nono, Pgam2, Crispld2, Fbxw13, Hbxip, Katna1, Lcorl, 1300018J18Rik, Arid2, Bend3, Ccdc104, Ccnt1. Based on tutorial data.]
Common methods
Figure: Simplified scRNA-seq workflow [adapted from http://hemberg-lab.github.io]
Generic
- parametric tests, e.g. t-test
- non-parametric tests, e.g. Kruskal-Wallis

RNA-seq based
- edgeR
- limma
- DESeq2

scRNA-seq specific
- MAST, SCDE, Monocle
- D3E, Pagoda
Miao and Zhang 2016
Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle.

Short name | Method | Software version | Input | Available from | Reference
BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11]
D3E | D3E | D3E 1.0 | raw counts | GitHub | [12]
DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13]
DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14]
edgeRLRT | edgeRLRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17]
edgeRLRTcensus | edgeRLRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17]
edgeRLRTdeconv | edgeRLRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18]
edgeRLRTrobust | edgeRLRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19]
edgeRQLF | edgeRQLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
edgeRQLFDetRate | edgeRQLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22]
MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24]
monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25]
monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26]
monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25]
NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27]
ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29]
ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29]
ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29]
SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30]
scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31]
SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32]
SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33]
ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35]
voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22]
Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36]

Source: Soneson and Robinson 2018, Nature Methods, doi:10.1038/nmeth.4612
More detailed examples
MAST
- uses a generalized linear hurdle model
- designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable
- The rate of expression $Z$ and the level of expression $Y$ are modeled for each gene $g$, with $Z_{ig}$ indicating whether gene $g$ is expressed in cell $i$ (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
- A logistic regression model is used for the discrete variable $Z$ and a Gaussian linear model for the continuous variable $(Y \mid Z = 1)$:
  $\operatorname{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$
  $\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix
- Model parameters are fitted using an empirical Bayesian framework
- Allows for a joint estimate of nuisance and treatment effects
- DE is determined using the likelihood ratio test
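To make the two-part idea concrete, here is a minimal sketch of a hurdle-style likelihood ratio test for a single gene and two groups of cells. It is not the MAST implementation: it assumes no covariates beyond the group label, uses plain maximum likelihood instead of the empirical Bayes fit, and the function names (`hurdle_lrt` and its helpers) are made up for this illustration.

```python
import math


def _bernoulli_loglik(k, n):
    """Log-likelihood of k detections in n cells at the MLE p = k / n."""
    ll = 0.0
    if k > 0:
        ll += k * math.log(k / n)
    if n - k > 0:
        ll += (n - k) * math.log((n - k) / n)
    return ll


def _rss(values):
    """Residual sum of squares around the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def hurdle_lrt(group_a, group_b):
    """Two-part LRT for one gene; inputs are log-expression values, 0 = not detected."""
    pos_a = [v for v in group_a if v > 0]
    pos_b = [v for v in group_b if v > 0]
    na, nb, ka, kb = len(group_a), len(group_b), len(pos_a), len(pos_b)

    # Discrete part (Z): does the detection rate differ between groups?
    ll_alt = _bernoulli_loglik(ka, na) + _bernoulli_loglik(kb, nb)
    ll_null = _bernoulli_loglik(ka + kb, na + nb)
    stat_discrete = 2.0 * (ll_alt - ll_null)

    # Continuous part (Y | Z = 1): does the mean of detected values differ?
    stat_continuous = 0.0
    if ka >= 2 and kb >= 2:
        rss_alt = _rss(pos_a) + _rss(pos_b)
        rss_null = _rss(pos_a + pos_b)
        if rss_alt > 0:
            stat_continuous = (ka + kb) * math.log(rss_null / rss_alt)

    # The two statistics are summed and compared to a chi-square with 2 df,
    # whose survival function is exp(-x / 2). Using 2 df even when the
    # continuous part is skipped is a simplification of this sketch.
    stat = max(stat_discrete + stat_continuous, 0.0)
    return stat, math.exp(-stat / 2.0)
```

A gene detected in none of the cells of one group and in all cells of the other gets a very small p-value from the discrete part alone, while two identical groups give a statistic of zero.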
SCDE
- models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution
- the NB distribution models the transcripts that are amplified and detected
- the Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)
- a subset of robust genes is used to fit, via the EM algorithm, the parameters of the mixture of models
- for DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach
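The mixture idea can be illustrated with a short sketch that computes, for one observed count, the posterior probability that it came from the Poisson (dropout) component rather than the NB (amplified) component. This is only an illustration under assumed, fixed parameters: the mixing weight `pi_drop`, the background rate `lam`, and the NB mean `mu` and size `size` are hypothetical values here, whereas SCDE estimates its model parameters from the data.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability of observing count k with rate lam > 0."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def nb_pmf(k, mu, size):
    """Negative binomial probability with mean mu and variance mu + mu^2 / size."""
    log_p = (
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(size / (size + mu))
        + k * math.log(mu / (size + mu))
    )
    return math.exp(log_p)


def dropout_posterior(k, pi_drop, lam, mu, size):
    """Posterior probability that count k belongs to the dropout component."""
    dropout = pi_drop * poisson_pmf(k, lam)
    amplified = (1.0 - pi_drop) * nb_pmf(k, mu, size)
    return dropout / (dropout + amplified)
```

With a low background rate and a high NB mean, a count of zero is attributed almost entirely to the dropout component, and a large count almost entirely to the amplified component.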
Monocle
- Originally designed for ordering cells by progress through differentiation stages (pseudo-time)
- The mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as
  $g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$
  where $Y$ is a specific gene expression level, $x_i$ are predictor variables, $g$ is a link function (typically the log function), and $f_i$ are non-parametric functions (e.g. cubic splines)
- The observable expression level $Y$ is then modeled using a GAM:
  $E(Y) = s(\varphi_t(b_x, s_i)) + \varepsilon$
  where $\varphi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and $s$ is a cubic smoothing function with three degrees of freedom. The error term $\varepsilon$ is normally distributed with a mean of zero
- The DE test is performed using an approximate $\chi^2$ likelihood ratio test
Let's stop for a minute

The key
$\text{outcome}_i = (\text{model}_i) + \text{error}_i$
- we collect data on a sample from a much larger population
- statistics lets us make inferences about the population from which the sample was derived
- we try to predict the outcome given a model fitted to the data
$t = \dfrac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$
[Figure: histogram of height [cm] (x-axis) against frequency (y-axis)]
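As a small worked example of the statistic above, the sketch below computes the pooled two-sample t statistic with the Python standard library. The function name `pooled_t` is chosen for this illustration; the pooled standard deviation $s_p$ is assumed to be the usual equal-variance estimate.

```python
import math
from statistics import mean, variance


def pooled_t(x1, x2):
    """Two-sample t statistic with pooled standard deviation s_p."""
    n1, n2 = len(x1), len(x2)
    # Pooled variance: weighted average of the two sample variances.
    sp2 = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    t = (mean(x1) - mean(x2)) / (math.sqrt(sp2) * math.sqrt(1 / n1 + 1 / n2))
    degrees_of_freedom = n1 + n2 - 2
    return t, degrees_of_freedom


t, df = pooled_t([1, 2, 3], [2, 3, 4])
print(round(t, 4), df)
```

The resulting statistic would then be compared with a t distribution with `n1 + n2 - 2` degrees of freedom to obtain a p-value.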
Generic recipe
- model data, e.g. gene expression
- fit model to the data and/or data to the model
- estimate model parameters
- use model for prediction and/or inference
MAST (revisited)
- uses a generalized linear hurdle model
- designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable
- The rate of expression $Z$ and the level of expression $Y$ are modeled for each gene $g$, with $Z_{ig}$ indicating whether gene $g$ is expressed in cell $i$ (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
- A logistic regression model is used for the discrete variable $Z$ and a Gaussian linear model for the continuous variable $(Y \mid Z = 1)$:
  $\operatorname{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$
  $\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix
- Model parameters are fitted using an empirical Bayesian framework
- Allows for a joint estimate of nuisance and treatment effects
- DE is determined using the likelihood ratio test
Generic recipe
- model, e.g. gene expression with random error
- fit model to the data and/or data to the model, estimate model parameters
- use model for prediction and/or inference

Important implication: the better the model fits the data, the better the statistics
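The "estimate model parameters" step of the recipe can be sketched for one common choice of model, the negative binomial. The example below uses a method-of-moments estimate, which is an assumption made for brevity; the packages discussed in this talk use more refined estimators. The function name `fit_nb_moments` is hypothetical.

```python
from statistics import mean, variance


def fit_nb_moments(counts):
    """Method-of-moments NB fit: returns (mu, size), with variance = mu + mu^2 / size."""
    mu = mean(counts)
    var = variance(counts)  # sample variance
    if var <= mu:
        # No overdispersion relative to Poisson: size tends to infinity.
        return mu, float("inf")
    return mu, mu * mu / (var - mu)


mu, size = fit_nb_moments([0, 2, 4, 6, 8])
print(mu, round(size, 3))
```

A small `size` indicates strong overdispersion. If the fitted model describes the counts poorly, any test built on top of it inherits that problem, which is the implication stated above.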
Common distributions
[Figure: histograms of read counts (x-axis) against frequency (y-axis) for three distributions: Negative Binomial, Zero-inflated NB, Poisson-Beta]
[Figure: three further Negative Binomial histograms of read counts (x-axis) against frequency (y-axis)]
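To get a feeling for the three distributions named above, one can draw samples from each and plot histograms. The sketch below uses only the Python standard library and assumes one common parameterization per distribution (NB as a gamma-Poisson mixture, zero-inflation as an extra point mass at zero, Poisson-Beta as a Poisson whose rate is a scaled Beta variable). All function and parameter names are illustrative.

```python
import math
import random


def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for the small rates used here."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def sample_nb(mu, size, rng):
    """Negative binomial as a Poisson with a gamma-distributed rate (mean mu)."""
    return sample_poisson(rng.gammavariate(size, mu / size), rng)


def sample_zinb(mu, size, pi_zero, rng):
    """Zero-inflated NB: an extra zero with probability pi_zero, else an NB draw."""
    if rng.random() < pi_zero:
        return 0
    return sample_nb(mu, size, rng)


def sample_poisson_beta(scale, a, b, rng):
    """Poisson-Beta: Poisson rate is scale times a Beta(a, b) variable."""
    return sample_poisson(scale * rng.betavariate(a, b), rng)


rng = random.Random(1)
nb_draws = [sample_nb(5.0, 2.0, rng) for _ in range(1000)]
zinb_draws = [sample_zinb(5.0, 2.0, 0.5, rng) for _ in range(1000)]
pb_draws = [sample_poisson_beta(20.0, 0.5, 0.5, rng) for _ in range(1000)]
```

With shape parameters below one, the Poisson-Beta draws tend to pile up near zero and near the scale value, which is one way to mimic bimodal expression.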
Performance
No ground truth, i.e. no independently validated truth is available for testing

Known data
- using data we know something about to get positive controls

Simulated data
- null data sets by re-sampling
- modeling data sets based on various distributions

Comparing between methods and scenarios
- comparing numbers of DE genes, incl. as a function of group size

Investigating results
- What do the expression levels and distributions of the detected DE genes look like?
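One simple way to create null data by re-sampling is to shuffle the group labels, since any difference seen after shuffling can only be due to chance. The sketch below uses this idea as a permutation test on the difference in means for one gene. It is an illustration of the principle, not the procedure used in any of the cited benchmark papers, and the name `permutation_p_value` is made up for this example.

```python
import random
from statistics import mean


def permutation_p_value(x1, x2, n_perm=999, seed=0):
    """P-value for the absolute difference in means under shuffled group labels."""
    rng = random.Random(seed)
    observed = abs(mean(x1) - mean(x2))
    pooled = list(x1) + list(x2)
    n1 = len(x1)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # one re-sampled null data set
        if abs(mean(pooled[:n1]) - mean(pooled[n1:])) >= observed:
            extreme += 1
    # Add one to numerator and denominator so the p-value is never exactly zero.
    return (extreme + 1) / (n_perm + 1)
```

Running a DE method on many such shuffled data sets and counting the genes it calls gives a direct view of its false positive behaviour.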
- False positives (type I error) vs false negatives (type II error)
- Sensitivity and specificity
- Precision and recall

[Figure adapted from Wikipedia]
[Figure: Dal Molin, Baruzzo and Di Camillo 2017: 2 conditions of 100 cells each, simulated with 10 000 genes, of which 2 000 were set to be DE (based on NB and bimodal distributions)]
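When the true DE genes are known, for example in a simulation, the metrics listed above follow directly from the confusion matrix. A minimal sketch, with hypothetical gene identifiers and a made-up function name `de_metrics`:

```python
def de_metrics(called, truth, all_genes):
    """Sensitivity, specificity, precision and recall for a set of called DE genes."""
    called, truth, all_genes = set(called), set(truth), set(all_genes)
    tp = len(called & truth)              # true positives
    fp = len(called - truth)              # false positives (type I errors)
    fn = len(truth - called)              # false negatives (type II errors)
    tn = len(all_genes - called - truth)  # true negatives

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),  # same quantity as sensitivity
    }


genes = [f"g{i}" for i in range(1, 11)]
print(de_metrics(called={"g2", "g3", "g4"}, truth={"g1", "g2", "g3"}, all_genes=genes))
```

Recall and sensitivity are the same quantity under two names; precision and specificity differ because precision depends on how many genes are called in total.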
Consistency
Miao et al. 2017
And so much more
Soneson and Robinson 2018: Bias, robustness and scalability in single-cell differential expression analysis
- 36 statistical approaches for DE analysis to compare the expression levels in the two groups of cells
- based on 9 data sets with 11-21 separate instances (sample size effect)
- extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.
- conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets
Practicalities
Getting to know your data
Example data: 46078 genes x 96 cells
22229 genes with no expression at all
[Figure: histogram of read counts (x-axis) against frequency (y-axis)]
[Figure: histogram of the number of zero counts, labelled "0 counts" (x-axis), against frequency (y-axis)]
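A first look at a count matrix can be as simple as counting zeros. The sketch below assumes the matrix is a plain list of rows, with one row per gene and one column per cell, and reports the number of genes with no expression at all, the number of zero counts per gene, and the detection rate per cell (the fraction of genes detected in that cell). The function name `summarize_counts` and the toy matrix are made up for illustration.

```python
def summarize_counts(counts):
    """counts: list of rows, one row per gene and one column per cell."""
    n_genes, n_cells = len(counts), len(counts[0])
    # Genes with zero counts in every cell.
    silent_genes = sum(1 for row in counts if all(c == 0 for c in row))
    # Number of cells with a zero count, per gene.
    zeros_per_gene = [sum(1 for c in row if c == 0) for row in counts]
    # Fraction of genes detected (count > 0), per cell.
    detection_rate = [
        sum(1 for row in counts if row[j] > 0) / n_genes for j in range(n_cells)
    ]
    return silent_genes, zeros_per_gene, detection_rate


toy = [
    [0, 0, 0],
    [1, 0, 2],
    [0, 5, 0],
    [3, 3, 3],
]
print(summarize_counts(toy))
```

Genes with no expression at all carry no information for DE testing and are usually filtered out before any method is run.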
Choosing DE methods
Soneson and Robinson 2018
Remembering the bigger picture
Stegle, Teichmann and Marioni 2015
- QC filtering
- Cell-cycle phase
- Normalization of cell-specific biases
- Confounding factors, incl. batch effects
- Detection rate, i.e. the fraction of detected genes per cell
- Imputation strategies for dropout values
- What is pragmatic: programming language, platform, speed, collaborative workflows, etc.
Staying critical
Summary
What to remember from this hour?
https://www.menti.com & 70 52 87
Growing field
Angerer et al. 2017
https://www.scrna-tools.org/tools
Zappia, Phipson and Oshlack 2018
Summary
- scRNA-seq is a rapidly growing field
- DE is a common task, so many newer and better methods will be developed
- understanding basic statistical concepts enables one to think more like a statistician, and to choose and evaluate methods for a given data set
- staying critical, staying updated, staying connected
Bibliography
Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1–9. ISSN 1367-4803. doi:10.1093/bioinformatics/bty329.
Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243–260. ISSN 2095-4697. doi:10.1007/s40484-016-0089-7.
Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255–261. ISSN 1548-7105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612.
Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN 1664-8021. doi:10.3389/fgene.2017.00062.
Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data." no. May: 1–2. ISSN 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549.
Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133–145.
Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85–91. ISSN 2452-3100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004.
Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv: 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573.
Olga (NBIS) scRNA-seq DE May 2018 43 43
- Outline
- Introduction
- Common methods
-
- More detailed examples
-
- Performance
- Practicalities
- Summary
- Bibliography
-
Outline
Outline
Introduction what is so special about scRNA-seq DECommon methods what is out therePerformance how do we know what is bestPracticalities what to do in real lifeSummary what to remember from this hour
Olga (NBIS) scRNA-seq DE May 2018 2 43
Outline
OutlineIntroduction what is so special about scRNA-seq DE
Common methods what is out therePerformance how do we know what is bestPracticalities what to do in real lifeSummary what to remember from this hour
Olga (NBIS) scRNA-seq DE May 2018 2 43
Outline
OutlineIntroduction what is so special about scRNA-seq DECommon methods what is out there
Performance how do we know what is bestPracticalities what to do in real lifeSummary what to remember from this hour
Olga (NBIS) scRNA-seq DE May 2018 2 43
Outline
OutlineIntroduction what is so special about scRNA-seq DECommon methods what is out therePerformance how do we know what is best
Practicalities what to do in real lifeSummary what to remember from this hour
Olga (NBIS) scRNA-seq DE May 2018 2 43
Outline
OutlineIntroduction what is so special about scRNA-seq DECommon methods what is out therePerformance how do we know what is bestPracticalities what to do in real life
Summary what to remember from this hour
Olga (NBIS) scRNA-seq DE May 2018 2 43
Outline
OutlineIntroduction what is so special about scRNA-seq DECommon methods what is out therePerformance how do we know what is bestPracticalities what to do in real lifeSummary what to remember from this hour
Olga (NBIS) scRNA-seq DE May 2018 2 43
Outline
Letrsquos get to know each other
httpswwwmenticom
Olga (NBIS) scRNA-seq DE May 2018 3 43
Introduction
Introduction
Olga (NBIS) scRNA-seq DE May 2018 4 43
Introduction
Figure Simplified scRNA-seq workflow [adapted from Wikipedia]
Olga (NBIS) scRNA-seq DE May 2018 5 43
Introduction
Figure Simplified scRNA-seq workflow [adapted from Wikipedia]
Olga (NBIS) scRNA-seq DE May 2018 6 43
Introduction
adapted from Wu et al 2017
Differential expression meanstaking read count data amp
performing statistical analysis to discoverquantitative changes in expression levels betweenexperimental groups
ie to decide whether for a given gene anobserved difference in read counts is significant(greater than what would be expected just due tonatural random variation)
Differential expression is an old problemknown from bulk RNA-seq and microarray studies
in fact building on one of the most commonstatistical problems ie comparing groups forstatistical differences
Olga (NBIS) scRNA-seq DE May 2018 7 43
Introduction
adapted from Wu et al 2017
Differential expression meanstaking read count data amp
performing statistical analysis to discoverquantitative changes in expression levels betweenexperimental groups
ie to decide whether for a given gene anobserved difference in read counts is significant(greater than what would be expected just due tonatural random variation)
Differential expression is an old problemknown from bulk RNA-seq and microarray studies
in fact building on one of the most commonstatistical problems ie comparing groups forstatistical differences
Olga (NBIS) scRNA-seq DE May 2018 7 43
Introduction
Differential expression is an old problem
So what is all the commotion about
httpswwwmenticom amp 70 52 87
scRNA-seq special characteristicshigh noise levels (technical and biological factors)low library sizeslow amount of available mRNAs results in amplification biases anddropout events3rsquo bias partial coverage and uneven depth (technical)stochastic nature of transcription (biological)multimodality in gene expression presence of multiple possiblecell states within a cell population (biological)
Olga (NBIS) scRNA-seq DE May 2018 8 43
Introduction
Differential expression is an old problem
So what is all the commotion about
httpswwwmenticom amp 70 52 87
scRNA-seq special characteristicshigh noise levels (technical and biological factors)low library sizeslow amount of available mRNAs results in amplification biases anddropout events3rsquo bias partial coverage and uneven depth (technical)stochastic nature of transcription (biological)multimodality in gene expression presence of multiple possiblecell states within a cell population (biological)
Olga (NBIS) scRNA-seq DE May 2018 8 43
Introduction
Rbm17 Rragc Slc1a3 Slc22a20 Smarcd1
Mybpc1 Nars Ndufa3 Nono Pgam2
Crispld2 Fbxw13 Hbxip Katna1 Lcorl
1300018J18Rik Arid2 Bend3 Ccdc104 Ccnt1
00 25 50 75 100 00 25 50 75 100 00 25 50 75 100 00 25 50 75 100 00 25 50 75 100
00
01
02
03
0
1
2
3
00
01
02
03
00
01
02
03
00
01
02
03
000
005
010
015
020
025
00
02
04
06
00
01
02
03
04
05
00
01
02
03
00
01
02
03
00
02
04
00
05
10
15
00
01
02
03
00
05
10
15
20
00
02
04
000
005
010
015
0
1
2
3
4
00
05
10
15
00
01
02
03
04
05
00
01
02
03
04
value
dens
ity
Based on tutorial data
Olga (NBIS) scRNA-seq DE May 2018 9 43
Common methods
Common methods
Olga (NBIS) scRNA-seq DE May 2018 10 43
Common methods
Simplified scRNA-seq workflow [adopted from httphemberg-labgithubio
Olga (NBIS) scRNA-seq DE May 2018 11 43
Common methods
Genericparametric tests eg t-testnon-parametric tests eg Kruskal-Wallis
RNA-seq basededgeRlimmaDEseq2
scRNA-seq specificMAST SCDE MonocleD3E Pagoda
Olga (NBIS) scRNA-seq DE May 2018 12 43
Common methods
Miao and Zhang 2016
Olga (NBIS) scRNA-seq DE May 2018 13 43
Common methodsSupplementary Table 2 Evaluated dicrarrerential expression methods together with package versions and thetype of input values provided to each of them Note that ldquoraw countsrdquo here refers to length-scaled TPMswhich are on the scale of the raw counts but are unacrarrected by dicrarrerential isoform usage [10] CPM valuesare calculated with edgeR and Census counts with monocle
Short name Method Software version InputAvailablefrom
Reference
BPSC BPSC BPSC 09901 CPM GitHub [11]
D3E D3E D3E 10 raw counts GitHub [12]
DESeq2 DESeq2 DESeq2 1141 raw counts Bioconductor [13]
DESeq2betapFALSE DESeq2 without beta prior DESeq2 1141 raw counts Bioconductor [13]
DESeq2census DESeq2 DESeq2 1141 Census counts Bioconductor [13]
DESeq2nofiltDESeq2 without the built-in in-dependent filtering
DESeq2 1141 raw counts Bioconductor [13]
DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14]
edgeRLRT | edgeRLRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17]
edgeRLRTcensus | edgeRLRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17]
edgeRLRTdeconv | edgeRLRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18]
edgeRLRTrobust | edgeRLRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19]
edgeRQLF | edgeRQLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
edgeRQLFDetRate | edgeRQLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22]
MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24]
monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25]
monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26]
monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25]
NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27]
ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29]
ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29]
ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29]
SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30]
scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31]
SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32]
SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33]
ttest | t-test | stats (R v3.3) | TMM-normalized TPM | CRAN | [16, 35]
voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22]
Wilcoxon | Wilcoxon test | stats (R v3.3) | TMM-normalized TPM | CRAN | [16, 36]
Soneson and Robinson 2018, Nature Methods, doi:10.1038/nmeth.4612
More detailed examples
MAST
uses a generalized linear hurdle model
designed to account for stochastic dropouts and a bimodal expression distribution, in which expression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene g, with $Z_{ig}$ indicating whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
A logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):
$\operatorname{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$
$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
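The two-part structure described above can be illustrated for a single gene and a plain two-group comparison. This is only a simplified sketch, not the MAST implementation: it leaves out the empirical Bayes step and any covariates, uses closed-form maximum likelihood estimates, and the helper names (`binom_loglik`, `rss`, `hurdle_lrt`) and toy values are hypothetical.

```python
import math

def binom_loglik(k, n):
    """Maximised Bernoulli log-likelihood for k detected cells out of n."""
    if k == 0 or k == n:
        return 0.0
    p = k / n
    return k * math.log(p) + (n - k) * math.log(1 - p)

def rss(values):
    """Residual sum of squares around the mean of `values`."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def hurdle_lrt(group_a, group_b):
    """Two-group likelihood ratio test for one gene (illustrative helper).

    Inputs are log-scale expression values, with 0 meaning "not detected".
    Returns (statistic, p_value); the statistic is compared with a
    chi-square distribution with 2 degrees of freedom, one per component.
    """
    pos_a = [v for v in group_a if v > 0]
    pos_b = [v for v in group_b if v > 0]

    # Discrete part (Z): separate detection rates versus one shared rate.
    ll_full = (binom_loglik(len(pos_a), len(group_a))
               + binom_loglik(len(pos_b), len(group_b)))
    ll_null = binom_loglik(len(pos_a) + len(pos_b),
                           len(group_a) + len(group_b))
    stat_d = max(0.0, 2 * (ll_full - ll_null))

    # Continuous part (Y | Z = 1): separate means versus one shared mean,
    # with a common variance. For a Gaussian model the statistic reduces
    # to n * log(RSS_null / RSS_full).
    stat_c = 0.0
    if pos_a and pos_b:
        rss_full = rss(pos_a) + rss(pos_b)
        rss_null = rss(pos_a + pos_b)
        if rss_full > 0:
            n_pos = len(pos_a) + len(pos_b)
            stat_c = max(0.0, n_pos * math.log(rss_null / rss_full))

    stat = stat_d + stat_c
    p_value = math.exp(-stat / 2)  # chi-square survival function for 2 df
    return stat, p_value

# Toy log-expression values for one gene in two groups of 8 cells.
control = [0, 0, 2.1, 2.5, 0, 3.0, 2.8, 0]
treated = [4.0, 4.4, 0, 5.1, 4.7, 4.9, 5.3, 4.2]
stat, p_value = hurdle_lrt(control, treated)
print(f"LRT statistic = {stat:.2f}, p = {p_value:.2g}")
```

Summing the two component statistics lets a gene be flagged either because it is detected in a different fraction of cells or because its level differs among the cells that express it.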
SCDE
models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution
the NB distribution models the transcripts that are amplified and detected
the Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)
a subset of robust genes is used to fit the parameters of the mixture model via the EM algorithm
For DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach
Monocle
Originally designed for ordering cells by progress through differentiation stages (pseudo-time)
The mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as
$g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$, where Y is a specific gene expression level, $x_i$ are predictor variables, g is a link function (typically the log function), and $f_i$ are non-parametric functions (e.g. cubic splines)
The observable expression level Y is then modeled using the GAM
$E(Y) = s(\varphi_t(b_x, s_i)) + \epsilon$, where $\varphi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and s is a cubic smoothing function with three degrees of freedom. The error term $\epsilon$ is normally distributed with a mean of zero
The DE test is performed using an approximate $\chi^2$ likelihood ratio test
Let's stop for a minute
The key
$\text{Outcome}_i = (\text{Model}_i) + \text{error}_i$
we collect data on a sample from a much larger population
statistics lets us make inferences about the population from which the sample was derived
we try to predict the outcome given a model fitted to the data
The key
$t = \dfrac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$
Figure: histogram of height [cm] (x-axis) against frequency (y-axis)
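As a worked illustration of the two-sample t statistic above, here is a minimal sketch in Python. It assumes that $s_p$ denotes the pooled standard deviation of the two groups (the slide does not define it), and the height values are made up for the example.

```python
import math
import statistics

def pooled_t(x1, x2):
    """Two-sample t statistic with a pooled standard deviation."""
    n1, n2 = len(x1), len(x2)
    mean1, mean2 = statistics.mean(x1), statistics.mean(x2)
    var1, var2 = statistics.variance(x1), statistics.variance(x2)
    # Pooled standard deviation: variances weighted by degrees of freedom.
    s_p = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (mean1 - mean2) / (s_p * math.sqrt(1 / n1 + 1 / n2))

# Hypothetical heights [cm] in two groups.
heights_a = [170, 172, 168, 175, 171]
heights_b = [176, 178, 174, 180, 177]
t_stat = pooled_t(heights_a, heights_b)
print(f"t = {t_stat:.2f} on {len(heights_a) + len(heights_b) - 2} degrees of freedom")
```

The statistic is a signal-to-noise ratio: the difference in group means divided by the uncertainty of that difference. This is the same "outcome = model + error" idea that the scRNA-seq methods build on with more elaborate models.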
Generic recipe
model data, e.g. gene expression
fit model to the data and/or data to the model
estimate model parameters
use model for prediction and/or inference
MAST (revisited)
uses a generalized linear hurdle model
designed to account for stochastic dropouts and a bimodal expression distribution, in which expression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene g, with $Z_{ig}$ indicating whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
A logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):
$\operatorname{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$
$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
Generic recipe
model, e.g. gene expression with random error
fit model to the data and/or data to the model, estimate model parameters
use model for prediction and/or inference
Important implication
the better the model fits the data, the better the statistics
Common distributions
Figure: three histograms of read counts (x-axis) against frequency (y-axis), with panels titled Negative Binomial, Zero-inflated NB and Poisson-Beta
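To get a feel for the three distributions named above, one can simulate read counts from each of them. The sketch below uses only the Python standard library; the sampler names and all parameter values are arbitrary choices for illustration and are not taken from the slides.

```python
import math
import random

def rpois(lam, rng):
    """Poisson sample via Knuth's multiplication method (small rates only)."""
    if lam <= 0:
        return 0
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def rnbinom(mean, size, rng):
    """Negative binomial as a gamma-Poisson mixture (variance = mean + mean^2 / size)."""
    return rpois(rng.gammavariate(size, mean / size), rng)

def rzinb(mean, size, pi0, rng):
    """Zero-inflated NB: an extra zero with probability pi0, otherwise NB."""
    return 0 if rng.random() < pi0 else rnbinom(mean, size, rng)

def rpoisbeta(alpha, beta, scale, rng):
    """Poisson-Beta: Poisson rate drawn as scale * Beta(alpha, beta)."""
    return rpois(scale * rng.betavariate(alpha, beta), rng)

rng = random.Random(0)
n = 2000
nb = [rnbinom(10, 2, rng) for _ in range(n)]
zinb = [rzinb(10, 2, 0.5, rng) for _ in range(n)]
pb = [rpoisbeta(0.5, 2.0, 100, rng) for _ in range(n)]

for name, sample in [("NB", nb), ("ZINB", zinb), ("Poisson-Beta", pb)]:
    zero_frac = sum(1 for x in sample if x == 0) / n
    print(f"{name:13s} mean = {sum(sample) / n:6.2f}  zeros = {zero_frac:.2f}")
```

Comparing the fraction of zeros and the spread across the three samples shows why the choice of distribution matters when a model is fitted to sparse single-cell counts.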
Common distributions
Figure: three histograms of read counts (x-axis) against frequency (y-axis), each panel titled Negative Binomial
Performance
No ground truth, i.e. no independently validated truth is available for testing
Known data: using data we know something about to get positive controls
Simulated data: null data sets by re-sampling; modeling data sets based on various distributions
Comparing between methods and scenarios: comparing numbers of DE genes, incl. as a function of group size
Investigating results: what do the expression levels and distributions of the detected DE genes look like?
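The idea of a null data set obtained by re-sampling can be sketched as follows: cells from a single homogeneous group are split at random into two mock groups, so every gene called DE is by construction a false positive. Everything in this sketch is an assumption made for illustration: the simulated Gaussian log-expression values, the helper `welch_p`, and the use of a normal approximation for the p-value.

```python
import math
import random
import statistics

def welch_p(x, y):
    """Two-sided p-value for a Welch t statistic, using a normal
    approximation (reasonable only for moderately large groups)."""
    se = math.sqrt(statistics.variance(x) / len(x)
                   + statistics.variance(y) / len(y))
    if se == 0:
        return 1.0
    z = (statistics.mean(x) - statistics.mean(y)) / se
    return math.erfc(abs(z) / math.sqrt(2))

rng = random.Random(1)
n_genes, n_cells = 200, 60

# One homogeneous population: no gene is truly differentially expressed.
expression = [[rng.gauss(5.0, 1.0) for _ in range(n_cells)]
              for _ in range(n_genes)]

# Null data set: shuffle the cell indices and split them into two mock groups.
cells = list(range(n_cells))
rng.shuffle(cells)
mock_a, mock_b = cells[: n_cells // 2], cells[n_cells // 2:]

false_positives = sum(
    1 for gene in expression
    if welch_p([gene[i] for i in mock_a], [gene[i] for i in mock_b]) < 0.05
)
fpr = false_positives / n_genes
print(f"{false_positives} of {n_genes} genes called DE (rate {fpr:.3f})")
```

A well-calibrated method should call roughly 5% of genes at a 0.05 threshold on such data; a much higher rate points to a mismatch between the model and the data.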
False positives (type I error) vs false negatives (type II error)
Sensitivity and specificity
Precision and recall
adapted from Wikipedia
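When the truth is known, for example in simulated data, the quantities listed above follow directly from comparing the set of genes called DE with the set of truly DE genes. A minimal sketch, with a hypothetical function name and made-up gene identifiers:

```python
def classification_metrics(called, truth, all_genes):
    """Compare genes called DE against genes known to be DE."""
    tp = len(called & truth)              # true positives
    fp = len(called - truth)              # false positives (type I errors)
    fn = len(truth - called)              # false negatives (type II errors)
    tn = len(all_genes - called - truth)  # true negatives
    return {
        "sensitivity": tp / (tp + fn),    # also called recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

all_genes = {f"g{i}" for i in range(1, 11)}   # 10 genes in total
truth = {"g1", "g2", "g3", "g4"}              # truly DE
called = {"g1", "g2", "g3", "g7"}             # reported by a method
metrics = classification_metrics(called, truth, all_genes)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Sensitivity and specificity describe the method relative to the truth, whereas precision describes how trustworthy the reported list is, which is usually what matters when following up on candidate genes.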
Dal Molin, Baruzzo and Di Camillo 2017: 2 conditions of 100 cells each, simulated with 10 000 genes, of which 2 000 were set as DE (based on NB and bimodal distributions)
Consistency
Miao et al. 2017
And so much more
Soneson and Robinson 2018, "Bias, robustness and scalability in single-cell differential expression analysis"
36 statistical approaches for DE analysis, comparing the expression levels in two groups of cells
based on 9 data sets with 11-21 separate instances (sample size effect)
extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.
conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets
Practicalities
Getting to know your data
Example data: 46078 genes x 96 cells
22229 genes with no expression at all
Figure: two histograms with frequency on the y-axis, one of read counts and one titled "0 counts"
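A first look at a count matrix, in the spirit of the numbers above, can be done with a few lines of code. The sketch below uses a tiny made-up genes x cells matrix stored as a dictionary; in practice the counts would come from the actual data set, and the variable names are hypothetical.

```python
# Toy count matrix: gene -> read counts across 6 cells (made-up values).
counts = {
    "geneA": [0, 0, 0, 0, 0, 0],
    "geneB": [5, 0, 3, 0, 0, 1],
    "geneC": [0, 0, 0, 0, 0, 0],
    "geneD": [12, 7, 0, 9, 4, 3],
    "geneE": [0, 1, 0, 0, 0, 0],
}
n_genes = len(counts)
n_cells = len(next(iter(counts.values())))

# Genes with no expression at all (zero counts in every cell).
silent_genes = [gene for gene, row in counts.items() if sum(row) == 0]

# Number of cells with a zero count, per gene.
zeros_per_gene = {gene: row.count(0) for gene, row in counts.items()}

# Detection rate per cell: fraction of genes with a non-zero count.
detection_rate = [
    sum(1 for row in counts.values() if row[cell] > 0) / n_genes
    for cell in range(n_cells)
]

print(f"{n_genes} genes x {n_cells} cells")
print(f"{len(silent_genes)} genes with no expression at all: {silent_genes}")
print("zeros per gene:", zeros_per_gene)
print("detection rate per cell:", detection_rate)
```

Genes without any expression carry no information for DE testing and are typically filtered out first, while the per-cell detection rate can later be used as a covariate.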
Choosing DE methods
Soneson and Robinson 2018
Remembering the bigger picture
Stegle, Teichmann and Marioni 2015
QC filtering
Cell-cycle phase
Normalization of cell-specific biases
Confounding factors, incl. batch effects
Detection rate, i.e. the fraction of detected genes per cell
Imputation strategies for dropout values
What is pragmatic: programming language, platform, speed, collaborative workflows, etc.
Staying critical
Summary
What to remember from this hour
https://www.menti.com & 70 52 87
Growing field
Angerer et al. 2017
https://www.scrna-tools.org/tools
Zappia, Phipson and Oshlack 2018
Summary
scRNA-seq is a rapidly growing field
DE is a common task, so many newer and better methods will be developed
understanding basic statistical concepts enables one to think more like a statistician and to choose and evaluate methods for a given data set
staying critical, staying updated, staying connected
Bibliography
Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1–9. ISSN 1367-4803. doi:10.1093/bioinformatics/bty329.
Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243–260. ISSN 2095-4697. doi:10.1007/s40484-016-0089-7.
Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255–261. ISSN 1548-7105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612.
Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN 1664-8021. doi:10.3389/fgene.2017.00062.
Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data." no. May: 1–2. ISSN 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549.
Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133–145.
Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85–91. ISSN 2452-3100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004.
Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573.
Common methodsSupplementary Table 2 Evaluated dicrarrerential expression methods together with package versions and thetype of input values provided to each of them Note that ldquoraw countsrdquo here refers to length-scaled TPMswhich are on the scale of the raw counts but are unacrarrected by dicrarrerential isoform usage [10] CPM valuesare calculated with edgeR and Census counts with monocle
Short name Method Software version InputAvailablefrom
Reference
BPSC BPSC BPSC 09901 CPM GitHub [11]
D3E D3E D3E 10 raw counts GitHub [12]
DESeq2 DESeq2 DESeq2 1141 raw counts Bioconductor [13]
DESeq2betapFALSE DESeq2 without beta prior DESeq2 1141 raw counts Bioconductor [13]
DESeq2census DESeq2 DESeq2 1141 Census counts Bioconductor [13]
DESeq2nofiltDESeq2 without the built-in in-dependent filtering
DESeq2 1141 raw counts Bioconductor [13]
DEsingle DEsingle DEsingle 010 raw counts GitHub [14]
edgeRLRT edgeRLRT edgeR 3191 raw counts Bioconductor [15ndash17]
edgeRLRTcensus edgeRLRT edgeR 3191 Census counts Bioconductor [15ndash17]
edgeRLRTdeconvedgeRLRT with deconvolutionnormalization
edgeR 3191scran 120
raw counts Bioconductor [15 17 18]
edgeRLRTrobustedgeRLRT with robust disper-sion estimation
edgeR 3191 raw counts Bioconductor [15ndash17 19]
edgeRQLF edgeRQLF edgeR 3191 raw counts Bioconductor [15 16 20]
edgeRQLFDetRateedgeRQLF with cellular detec-tion rate as covariate
edgeR 3191 raw counts Bioconductor [15 16 20]
limmatrend limma-trend limma 33013 log2(CPM) Bioconductor [21 22]
MASTcpm MAST MAST 105 log2(CPM+1) Bioconductor [23]
MASTcpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(CPM+1) Bioconductor [23]
MASTtpm MAST MAST 105 log2(TPM+1) Bioconductor [23]
MASTtpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(TPM+1) Bioconductor [23]
metagenomeSeq metagenomeSeqmetagenomeSeq1160
raw counts Bioconductor [24]
monocle monocle (tobit) monocle 220 TPM Bioconductor [25]
monoclecensus monocle (Negative Binomial) monocle 220 Census counts Bioconductor [25 26]
monoclecount monocle (Negative Binomial) monocle 220 raw counts Bioconductor [25]
NODES NODESNODES0009010
raw countsAuthor-providedlink
[27]
ROTScpm ROTS ROTS 120 CPM Bioconductor [28 29]
ROTStpm ROTS ROTS 120 TPM Bioconductor [28 29]
ROTSvoom ROTS ROTS 120voom-transformedraw counts
Bioconductor [28 29]
SAMseq SAMseq samr 20 raw counts CRAN [30]
scDD scDD scDD 100 raw counts Bioconductor [31]
SCDE SCDE scde 220 raw counts Bioconductor [32]
SeuratBimod Seurat (bimod test) Seurat 1407 raw counts GitHub [33 34]
SeuratBimodnofiltSeurat (bimod test) without theinternal filtering
Seurat 1407 raw counts GitHub [33 34]
SeuratBimodIsExpr2Seurat (bimod test) with internalexpression threshold set to 2
Seurat 1407 raw counts GitHub [33 34]
SeuratTobit Seurat (tobit test) Seurat 1407 TPM GitHub [25 33]
ttest t-test stats (R v 33)TMM-normalizedTPM
CRAN [16 35]
voomlimma voom-limma limma 33013 raw counts Bioconductor [21 22]
Wilcoxon Wilcoxon test stats (R v 33)TMM-normalizedTPM
CRAN [16 36]
3
Nature Methods doi101038nmeth4612Soneson and Robinson 2018
Olga (NBIS) scRNA-seq DE May 2018 14 43
Common methods More detailed examples
More detailed examples
Olga (NBIS) scRNA-seq DE May 2018 15 43
Common methods More detailed examples
MAST
uses generalized linear hurdle model
designed to account for stochastic dropouts and bimodal expression distribution in whichexpression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene gindicating whether gene g is expressed in cell i (ie Zig = 0 if yig = 0 and zig = 1 ifyig gt 0)
A logistic regression model for the discrete variable Z and a Gaussian linear model for thecontinuous variable (Y|Z=1)
logit(Pr (Zig = 1)) = XiβDg
Pr (Yig = Y |Zig = 1) = N(XiβCg σ
2g) where Xi is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
Olga (NBIS) scRNA-seq DE May 2018 16 43
Common methods More detailed examples
SCDE
models the read counts for each gene using a mixture of a NB negative binomial and aPoisson distribution
NB distribution models the transcripts that are amplified and detected
Poisson distribution models the unobserved or background-level signal of transcripts thatare not amplified (eg dropout events)
subset of robust genes is used to fit via EM algorithm the parameters to the mixture ofmodels
For DE the posterior probability that the gene shows a fold expression difference betweentwo conditions is computed using a Bayesian approach
Olga (NBIS) scRNA-seq DE May 2018 17 43
Common methods More detailed examples
Monocole
Originally designed for ordering cells by progress through differentiation stages(pseudo-time)
The mean expression level of each gene is modeled with a GAM generalized additivemodel which relates one or more predictor variables to a response variable as
g(E(Y )) = β0 + f1(x1) + f2(x2) + + fm(xm) where Y is a specific gene expression level xi arepredictor variables g is a link function typically log function and fi are non-parametric functions
(eg cubic splines)
The observable expression level Y is then modeled using GAM
E(Y ) = s(ϕt (bx si )) + ε where ϕt (bx si ) is the assigned pseudo-time of a cell and s is a cubicsmoothing function with three degrees of freedom The error term ε is normally distributed with amean of zero
The DE test is performed using an approx χ2 likelihood ratio test
Olga (NBIS) scRNA-seq DE May 2018 18 43
Common methods More detailed examples
Letrsquos stop for a minute
Olga (NBIS) scRNA-seq DE May 2018 19 43
Common methods More detailed examples
The key
Outcomei = (Modeli) + errori
we collect data on a sample from a much larger population
statistics lets us to make inferences about the population from which sample wasderived
we try to predict the outcome given a model fitted to the data
Olga (NBIS) scRNA-seq DE May 2018 20 43
Common methods More detailed examples
The key
t = x1minusx2
sp
radic1
n1+ 1
n2
height [cm]
Fre
quen
cy
165 170 175 180
010
3050
Olga (NBIS) scRNA-seq DE May 2018 21 43
Common methods More detailed examples
Generic recipemodel data eg gene expressionfit model to the data andor data to the modelestimate model parametersuse model for prediction andor inference
Olga (NBIS) scRNA-seq DE May 2018 22 43
Common methods More detailed examples
MAST (revisited)
uses generalized linear hurdle model
designed to account for stochastic dropouts and bimodal expression distribution in whichexpression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene gindicating whether gene g is expressed in cell i (ie Zig = 0 if yig = 0 and zig = 1 ifyig gt 0)
A logistic regression model for the discrete variable Z and a Gaussian linear model for thecontinuous variable (Y|Z=1)
logit(Pr (Zig = 1)) = XiβDg
Pr (Yig = Y |Zig = 1) = N(XiβCg σ
2g) where Xi is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
Olga (NBIS) scRNA-seq DE May 2018 23 43
Common methods More detailed examples
Generic recipemodel eg gene expression with random errorfit model to the data andor data to the model estimate modelparametersuse model for prediction andor inference
Important implicationthe better model fits to the data the better statistics
Olga (NBIS) scRNA-seq DE May 2018 24 43
Common methods More detailed examples
Common distributions
Negative Binomial
Read Counts
Fre
quen
cy
5 10 15 20
020
4060
8010
012
0
Zerominusinflated NB
Read Counts
Fre
quen
cy
0 5 10 15 20
010
020
030
040
050
0
PoissonminusBeta
Read Counts
Fre
quen
cy
0 20 40 60 80 120
010
020
030
0
Olga (NBIS) scRNA-seq DE May 2018 25 43
Common methods More detailed examples
Common distributions
Negative Binomial
Read Counts
Fre
quen
cy
0 5 10 15 20
010
020
030
040
050
0
Negative Binomial
Read Counts
Fre
quen
cy
0 2 4 6 8 10
050
100
150
Negative Binomial
Read Counts
Fre
quen
cy
0 2 4 6 8 10 12
050
100
150
200
Olga (NBIS) scRNA-seq DE May 2018 26 43
Performance
Performance
Olga (NBIS) scRNA-seq DE May 2018 27 43
Supplementary Table 2 Evaluated dicrarrerential expression methods together with package versions and thetype of input values provided to each of them Note that ldquoraw countsrdquo here refers to length-scaled TPMswhich are on the scale of the raw counts but are unacrarrected by dicrarrerential isoform usage [10] CPM valuesare calculated with edgeR and Census counts with monocle
Short name Method Software version InputAvailablefrom
Reference
BPSC BPSC BPSC 09901 CPM GitHub [11]
D3E D3E D3E 10 raw counts GitHub [12]
DESeq2 DESeq2 DESeq2 1141 raw counts Bioconductor [13]
DESeq2betapFALSE DESeq2 without beta prior DESeq2 1141 raw counts Bioconductor [13]
DESeq2census DESeq2 DESeq2 1141 Census counts Bioconductor [13]
DESeq2nofiltDESeq2 without the built-in in-dependent filtering
DESeq2 1141 raw counts Bioconductor [13]
DEsingle DEsingle DEsingle 010 raw counts GitHub [14]
edgeRLRT edgeRLRT edgeR 3191 raw counts Bioconductor [15ndash17]
edgeRLRTcensus edgeRLRT edgeR 3191 Census counts Bioconductor [15ndash17]
edgeRLRTdeconvedgeRLRT with deconvolutionnormalization
edgeR 3191scran 120
raw counts Bioconductor [15 17 18]
edgeRLRTrobustedgeRLRT with robust disper-sion estimation
edgeR 3191 raw counts Bioconductor [15ndash17 19]
edgeRQLF edgeRQLF edgeR 3191 raw counts Bioconductor [15 16 20]
edgeRQLFDetRateedgeRQLF with cellular detec-tion rate as covariate
edgeR 3191 raw counts Bioconductor [15 16 20]
limmatrend limma-trend limma 33013 log2(CPM) Bioconductor [21 22]
MASTcpm MAST MAST 105 log2(CPM+1) Bioconductor [23]
MASTcpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(CPM+1) Bioconductor [23]
MASTtpm MAST MAST 105 log2(TPM+1) Bioconductor [23]
MASTtpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(TPM+1) Bioconductor [23]
metagenomeSeq metagenomeSeqmetagenomeSeq1160
raw counts Bioconductor [24]
monocle monocle (tobit) monocle 220 TPM Bioconductor [25]
monoclecensus monocle (Negative Binomial) monocle 220 Census counts Bioconductor [25 26]
monoclecount monocle (Negative Binomial) monocle 220 raw counts Bioconductor [25]
NODES NODESNODES0009010
raw countsAuthor-providedlink
[27]
ROTScpm ROTS ROTS 120 CPM Bioconductor [28 29]
ROTStpm ROTS ROTS 120 TPM Bioconductor [28 29]
ROTSvoom ROTS ROTS 120voom-transformedraw counts
Bioconductor [28 29]
SAMseq SAMseq samr 20 raw counts CRAN [30]
scDD scDD scDD 100 raw counts Bioconductor [31]
SCDE SCDE scde 220 raw counts Bioconductor [32]
SeuratBimod Seurat (bimod test) Seurat 1407 raw counts GitHub [33 34]
SeuratBimodnofiltSeurat (bimod test) without theinternal filtering
Seurat 1407 raw counts GitHub [33 34]
SeuratBimodIsExpr2Seurat (bimod test) with internalexpression threshold set to 2
Seurat 1407 raw counts GitHub [33 34]
SeuratTobit Seurat (tobit test) Seurat 1407 TPM GitHub [25 33]
ttest t-test stats (R v 33)TMM-normalizedTPM
CRAN [16 35]
voomlimma voom-limma limma 33013 raw counts Bioconductor [21 22]
Wilcoxon Wilcoxon test stats (R v 33)TMM-normalizedTPM
CRAN [16 36]
3
Nature Methods doi101038nmeth4612
Performance
Olga (NBIS) scRNA-seq DE May 2018 28 43
Performance
No ground truth, i.e. no independently validated truth is available for testing.
- Known data: using data we know something about to get positive controls
- Simulated data: null data sets by re-sampling; modeling data sets based on various distributions
- Comparing between methods and scenarios: comparing numbers of DEs, incl. as a function of group size
- Investigating results: what do the expression and distributions of detected DEs look like?
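One way to picture a null data set obtained by re-sampling is to split cells from a single condition at random into two mock groups. The sketch below assumes exactly that; the function name `null_split` and the cell identifiers are hypothetical and are not taken from any of the cited studies.

```python
import random

def null_split(cell_ids, rng):
    """Randomly split cells from a single group into two equally sized mock groups."""
    shuffled = list(cell_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:2 * half]

# Hypothetical cell identifiers, all from the same biological condition.
cells = [f"cell_{i}" for i in range(96)]
mock_a, mock_b = null_split(cells, random.Random(42))
```

Since both mock groups come from the same condition, any gene that a DE method reports between `mock_a` and `mock_b` can be treated as a false positive.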
- False positives (type I error) vs. false negatives (type II error)
- Sensitivity and specificity
- Precision and recall
[Figure adapted from Wikipedia]
[Figure from Dal Molin, Baruzzo and Di Camillo 2017: 2 conditions of 100 cells each, simulated with 10 000 genes, out of which 2 000 were set to be DE (based on NB and bimodal distributions)]
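These quantities follow directly from the four cells of a confusion matrix. The sketch below uses hypothetical counts for a simulation with 10 000 genes of which 2 000 are truly DE; the numbers of calls are invented for illustration and are not results from any cited study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard performance measures from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),          # also called recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "false_positive_rate": fp / (fp + tn),  # equals 1 - specificity
    }

# Hypothetical outcome: 2 000 true DE genes, 8 000 non-DE genes.
metrics = classification_metrics(tp=1500, fp=400, tn=7600, fn=500)
```

Here sensitivity is 0.75 and specificity is 0.95, while precision is lower (about 0.79) because non-DE genes outnumber DE genes.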
Consistency
[Figure from Miao et al. 2017]
And so much more
Soneson and Robinson 2018, "Bias, robustness and scalability in single-cell differential expression analysis":
- 36 statistical approaches for DE analysis to compare the expression levels in the two groups of cells
- based on 9 data sets with 11-21 separate instances (sample size effect)
- extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.
- conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets
Practicalities
Getting to know your data
Example data: 46078 genes x 96 cells; 22229 genes with no expression at all
[Figure: two histograms, frequency against read counts and frequency against 0 counts]
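A first look at the zeros in a count matrix can be sketched as follows. The toy matrix and the function name `summarise_zeros` are hypothetical; the layout (one row per gene, one column per cell) is an assumption.

```python
def summarise_zeros(counts):
    """counts: list of per-gene rows, each a list of per-cell read counts."""
    # Genes with no expression in any cell.
    n_all_zero = sum(1 for row in counts if not any(row))
    # Number of cells with a zero count, per gene.
    zeros_per_gene = [sum(1 for c in row if c == 0) for row in counts]
    return n_all_zero, zeros_per_gene

# Hypothetical toy data: 4 genes x 5 cells.
toy_counts = [
    [0, 0, 0, 0, 0],
    [3, 0, 12, 0, 1],
    [0, 0, 0, 7, 0],
    [25, 40, 18, 0, 31],
]
n_all_zero, zeros_per_gene = summarise_zeros(toy_counts)
```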
Choosing DE methods
[Figure from Soneson and Robinson 2018]
Remembering the bigger picture
[Figure from Stegle, Teichmann and Marioni 2015]
- QC filtering
- Cell-cycle phase
- Normalization of cell-specific biases
- Confounding factors, incl. batch effects
- Detection rate, i.e. the fraction of detected genes per cell
- Imputation strategies for dropout values
- What is pragmatic: programming language, platform, speed, collaborative workflows, etc.
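The detection rate defined above can be computed per cell as in the sketch below. The toy matrix is hypothetical, and treating any non-zero count as "detected" is an assumption; a real analysis may use a different threshold.

```python
def detection_rate(counts):
    """Fraction of genes with a non-zero count, for each cell (column)."""
    n_genes = len(counts)
    n_cells = len(counts[0])
    return [
        sum(1 for g in range(n_genes) if counts[g][c] > 0) / n_genes
        for c in range(n_cells)
    ]

# Hypothetical toy data: 4 genes (rows) x 5 cells (columns).
toy_counts = [
    [0, 0, 0, 0, 0],
    [3, 0, 12, 0, 1],
    [0, 0, 0, 7, 0],
    [25, 40, 18, 0, 31],
]
rates = detection_rate(toy_counts)
```

A per-cell vector like `rates` is the kind of quantity that the "DetRate" method variants in the table of evaluated methods include as a covariate.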
Staying critical
Summary
What to remember from this hour
https://www.menti.com & 70 52 87
Growing field
[Figure from Angerer et al. 2017]
[Figure: https://www.scrna-tools.org/tools, from Zappia, Phipson and Oshlack 2018]
- scRNA-seq is a rapidly growing field
- DE is a common task, so many newer and better methods will be developed
- understanding basic statistical concepts enables one to think more like a statistician, to choose and evaluate methods given a data set
- staying critical, staying updated, staying connected
Bibliography
Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1–9. ISSN 1367-4803. doi:10.1093/bioinformatics/bty329.
Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243–260. ISSN 20954697. doi:10.1007/s40484-016-0089-7.
Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255–261. ISSN 15487105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612.
Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN 16648021. doi:10.3389/fgene.2017.00062.
Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data." no. May: 1–2. ISSN 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549.
Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133–145.
Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85–91. ISSN 24523100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004.
Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573.
Introduction
[Figure adapted from Wu et al. 2017]
Differential expression means taking read count data and performing statistical analysis to discover quantitative changes in expression levels between experimental groups, i.e. to decide whether, for a given gene, an observed difference in read counts is significant (greater than what would be expected just due to natural random variation).
Differential expression is an old problem, known from bulk RNA-seq and microarray studies. In fact, it builds on one of the most common statistical problems, i.e. comparing groups for statistical differences.
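The idea of asking whether an observed difference exceeds what random variation alone would produce can be illustrated with a permutation test. This is a generic sketch, not one of the DE methods covered in this lecture; the read counts and the function name `permutation_pvalue` are hypothetical.

```python
import random
from statistics import mean

def permutation_pvalue(group_a, group_b, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        # Shuffling the labels mimics "no difference between groups".
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)

# Hypothetical read counts for one gene in two groups of cells.
group_a = [0, 3, 5, 2, 8, 4, 6, 1]
group_b = [12, 9, 15, 0, 11, 14, 10, 13]
p_shifted = permutation_pvalue(group_a, group_b)
p_same = permutation_pvalue(group_a, group_a)
```

A small `p_shifted` says that label shuffling rarely reproduces a difference as large as the observed one, whereas comparing a group with itself gives a p-value of 1.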
Differential expression is an old problem
So what is all the commotion about?
https://www.menti.com & 70 52 87
scRNA-seq special characteristics:
- high noise levels (technical and biological factors)
- low library sizes
- low amount of available mRNAs results in amplification biases and dropout events
- 3' bias, partial coverage and uneven depth (technical)
- stochastic nature of transcription (biological)
- multimodality in gene expression; presence of multiple possible cell states within a cell population (biological)
[Figure: density plots (x-axis: value, y-axis: density) of expression for 20 example genes: 1300018J18Rik, Arid2, Bend3, Ccdc104, Ccnt1, Crispld2, Fbxw13, Hbxip, Katna1, Lcorl, Mybpc1, Nars, Ndufa3, Nono, Pgam2, Rbm17, Rragc, Slc1a3, Slc22a20, Smarcd1. Based on tutorial data.]
Common methods
Simplified scRNA-seq workflow [adapted from http://hemberg-lab.github.io]
Generic:
- parametric tests, e.g. t-test
- non-parametric tests, e.g. Kruskal-Wallis
RNA-seq based:
- edgeR
- limma
- DESeq2
scRNA-seq specific:
- MAST, SCDE, Monocle
- D3E, Pagoda
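As an illustration of a generic non-parametric test, the sketch below computes the Mann-Whitney U statistic (the two-group Wilcoxon rank-sum test, which is the two-group analogue of Kruskal-Wallis). The p-value uses the large-sample normal approximation without a tie correction, so for small or heavily tied samples it is only a rough guide. The expression values are hypothetical.

```python
import math

def mann_whitney_u(x, y):
    """U statistic for x versus y; a tie counts as one half."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u

def normal_approx_pvalue(u, n_x, n_y):
    """Two-sided p-value from the normal approximation (no tie correction)."""
    mu = n_x * n_y / 2.0
    sigma = math.sqrt(n_x * n_y * (n_x + n_y + 1) / 12.0)
    z = (u - mu) / sigma
    return math.erfc(abs(z) / math.sqrt(2.0))

# Hypothetical counts for one gene in two groups of 8 cells.
x = [0, 0, 1, 2, 0, 3, 1, 0]
y = [4, 6, 0, 5, 7, 9, 3, 8]
u_xy = mann_whitney_u(x, y)
p_value = normal_approx_pvalue(u_xy, len(x), len(y))
```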
[Figure from Miao and Zhang 2016]
Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle.
Short name | Method | Software version | Input | Available from | Reference
BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11]
D3E | D3E | D3E 1.0 | raw counts | GitHub | [12]
DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13]
DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14]
edgeRLRT | edgeR/LRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17]
edgeRLRTcensus | edgeR/LRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17]
edgeRLRTdeconv | edgeR/LRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18]
edgeRLRTrobust | edgeR/LRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19]
edgeRQLF | edgeR/QLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
edgeRQLFDetRate | edgeR/QLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22]
MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24]
monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25]
monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26]
monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25]
NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27]
ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29]
ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29]
ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29]
SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30]
scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31]
SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32]
SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33]
ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35]
voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22]
Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36]
Source: Soneson and Robinson 2018, Nature Methods, doi:10.1038/nmeth.4612
More detailed examples
MAST
- uses a generalized linear hurdle model
- designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable
- the rate of expression Z and the level of expression Y are modeled for each gene g, where Z indicates whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
- a logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):
  $\operatorname{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^{D}$
  $\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^{C}, \sigma_g^{2})$, where $X_i$ is a design matrix
- model parameters are fitted using an empirical Bayesian framework
- allows for a joint estimate of nuisance and treatment effects
- DE is determined using the likelihood ratio test
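The two-part structure of a hurdle model can be sketched for the simplest possible design: two groups and no further covariates. This is a deliberately reduced illustration and not the MAST implementation. It omits the empirical Bayes regularisation and any nuisance covariates, and all function names and expression values are hypothetical. With a single group indicator, the logistic part reduces to comparing two detection proportions and the Gaussian part to comparing two means with a common variance. The two likelihood ratio statistics are summed and compared with a chi-square distribution with 2 degrees of freedom, whose survival function is exp(-x/2).

```python
import math

def bernoulli_loglik(k, n):
    """Maximised Bernoulli log-likelihood for k detected cells out of n."""
    if k == 0 or k == n:
        return 0.0
    p = k / n
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

def rss(values):
    """Residual sum of squares around the mean of the values."""
    centre = sum(values) / len(values)
    return sum((v - centre) ** 2 for v in values)

def hurdle_lrt(expr_a, expr_b):
    """Combined LRT for one gene; inputs are log-scale values, 0 = not detected."""
    pos_a = [v for v in expr_a if v > 0]
    pos_b = [v for v in expr_b if v > 0]
    if len(pos_a) < 2 or len(pos_b) < 2:
        raise ValueError("need at least two detected cells per group")
    # Discrete part: does the detection rate differ between groups?
    stat_discrete = 2.0 * (
        bernoulli_loglik(len(pos_a), len(expr_a))
        + bernoulli_loglik(len(pos_b), len(expr_b))
        - bernoulli_loglik(len(pos_a) + len(pos_b), len(expr_a) + len(expr_b))
    )
    # Continuous part: Gaussian model on detected cells only.
    rss_full = rss(pos_a) + rss(pos_b)
    if rss_full == 0.0:
        raise ValueError("no within-group variation among detected cells")
    rss_null = rss(pos_a + pos_b)
    stat_continuous = (len(pos_a) + len(pos_b)) * math.log(rss_null / rss_full)
    stat = max(stat_discrete + stat_continuous, 0.0)
    # Chi-square survival function for 2 degrees of freedom.
    return stat, math.exp(-stat / 2.0)

# Hypothetical log-scale expression of one gene in two groups of 10 cells.
expr_a = [0, 0, 0, 1.2, 0, 0.8, 0, 0, 1.0, 0]
expr_b = [2.1, 2.5, 0, 1.9, 2.8, 2.2, 0, 2.4, 2.0, 2.6]
stat, p_value = hurdle_lrt(expr_a, expr_b)
```

In this example both parts contribute: group B has more detected cells and higher expression among the detected cells.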
SCDE
- models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution
- the NB distribution models the transcripts that are amplified and detected
- the Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)
- a subset of robust genes is used to fit, via the EM algorithm, the parameters of the mixture of models
- for DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach
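The role of the two mixture components can be illustrated by asking how likely it is that an observed count came from the background (dropout) component. The sketch below assumes fixed, hypothetical parameter values; in SCDE itself the parameters are fitted per cell by EM, which is not shown here, and the function names are not part of the scde package.

```python
import math

def poisson_pmf(y, lam):
    """Poisson probability mass, computed on the log scale."""
    return math.exp(y * math.log(lam) - lam - math.lgamma(y + 1))

def negbin_pmf(y, mu, size):
    """Negative binomial probability mass with mean mu and dispersion parameter size."""
    log_p = (
        math.lgamma(y + size) - math.lgamma(size) - math.lgamma(y + 1)
        + size * math.log(size / (size + mu))
        + y * math.log(mu / (size + mu))
    )
    return math.exp(log_p)

def dropout_posterior(y, pi_drop, lam_bg, mu, size):
    """Posterior probability that count y came from the background Poisson component."""
    drop = pi_drop * poisson_pmf(y, lam_bg)
    amplified = (1.0 - pi_drop) * negbin_pmf(y, mu, size)
    return drop / (drop + amplified)

# Hypothetical parameters: 30% dropout, background rate 0.1, NB mean 50, size 2.
post_zero = dropout_posterior(0, pi_drop=0.3, lam_bg=0.1, mu=50.0, size=2.0)
post_high = dropout_posterior(40, pi_drop=0.3, lam_bg=0.1, mu=50.0, size=2.0)
```

With these values a zero count is attributed almost entirely to the dropout component, while a count of 40 is attributed to the amplified NB component.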
Monocle
- originally designed for ordering cells by progress through differentiation stages (pseudo-time)
- the mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as
  $g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$,
  where $Y$ is a specific gene expression level, $x_i$ are predictor variables, $g$ is a link function (typically the log function) and $f_i$ are non-parametric functions (e.g. cubic splines)
- the observable expression level $Y$ is then modeled using the GAM
  $Y = s(\varphi_t(b_x, s_i)) + \epsilon$,
  where $\varphi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and $s$ is a cubic smoothing function with three degrees of freedom; the error term $\epsilon$ is normally distributed with a mean of zero
- the DE test is performed using an approximate $\chi^2$ likelihood ratio test
Let's stop for a minute
The key
$\text{outcome}_i = (\text{model}_i) + \text{error}_i$
- we collect data on a sample from a much larger population
- statistics lets us make inferences about the population from which the sample was derived
- we try to predict the outcome given a model fitted to the data
The key
$t = \dfrac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$
[Figure: histogram of height [cm] (x-axis) against frequency (y-axis)]
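The t statistic above can be worked through on a small example. The sketch assumes the standard pooled standard deviation $s_p = \sqrt{((n_1 - 1) s_1^2 + (n_2 - 1) s_2^2) / (n_1 + n_2 - 2)}$, which is not spelled out on the slide; the height values are hypothetical.

```python
import math
from statistics import mean, variance

def pooled_t(x1, x2):
    """Two-sample t statistic with pooled standard deviation, and its degrees of freedom."""
    n1, n2 = len(x1), len(x2)
    df = n1 + n2 - 2
    # Pooled standard deviation from the two sample variances.
    sp = math.sqrt(((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / df)
    t = (mean(x1) - mean(x2)) / (sp * math.sqrt(1.0 / n1 + 1.0 / n2))
    return t, df

# Hypothetical heights [cm] in two groups.
heights_1 = [170, 172, 168, 175, 171]
heights_2 = [178, 176, 180, 174, 177]
t_stat, df = pooled_t(heights_1, heights_2)
```

Here the group means are 171.2 and 177.0 and the pooled variance is 5.85, which gives a t statistic of about -3.79 on 8 degrees of freedom.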
Generic recipe
- model data, e.g. gene expression
- fit model to the data and/or data to the model
- estimate model parameters
- use model for prediction and/or inference
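The recipe can be followed end to end on a toy example: choose a negative binomial model for the counts of one gene, estimate its parameters, and use the fitted model to predict the fraction of zeros. The sketch uses method-of-moments estimates (with variance = mu + mu^2 / size) as a simple stand-in for the likelihood-based fitting used by real DE packages. The counts and function names are hypothetical.

```python
from statistics import mean, variance

def fit_negbin_moments(counts):
    """Method-of-moments estimates (mu, size) for a negative binomial model."""
    mu = mean(counts)
    var = variance(counts)
    if var <= mu:
        raise ValueError("no overdispersion: a Poisson model may be adequate")
    size = mu * mu / (var - mu)
    return mu, size

def negbin_prob_zero(mu, size):
    """Probability of a zero count under the fitted negative binomial."""
    return (size / (size + mu)) ** size

# Hypothetical read counts of one gene across 12 cells.
counts = [0, 0, 1, 3, 0, 7, 2, 0, 12, 5, 0, 4]
mu_hat, size_hat = fit_negbin_moments(counts)
predicted_zero_fraction = negbin_prob_zero(mu_hat, size_hat)
observed_zero_fraction = counts.count(0) / len(counts)
```

Comparing `predicted_zero_fraction` with `observed_zero_fraction` is one simple check of how well the chosen model fits the data.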
MAST (revisited)
- uses a generalized linear hurdle model
- designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable
- the rate of expression Z and the level of expression Y are modeled for each gene g, where Z indicates whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
- a logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):
  $\operatorname{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^{D}$
  $\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^{C}, \sigma_g^{2})$, where $X_i$ is a design matrix
- model parameters are fitted using an empirical Bayesian framework
- allows for a joint estimate of nuisance and treatment effects
- DE is determined using the likelihood ratio test
Generic recipe
- model, e.g. gene expression with random error
- fit model to the data and/or data to the model, estimate model parameters
- use model for prediction and/or inference
Important implication: the better the model fits the data, the better the statistics
Common distributions
[Figure: histograms of frequency against read counts for three distributions: Negative Binomial, Zero-inflated NB and Poisson-Beta]
[Figure: histograms of frequency against read counts for three Negative Binomial distributions]
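To get a feel for these three distributions, one can simulate counts from each. The sketch below uses only the Python standard library and assumes common textbook constructions: the NB as a gamma-Poisson mixture, the zero-inflated NB as an NB with extra zeros added with probability `pi_zero`, and the Poisson-Beta as a Poisson whose rate is a scaled Beta variable. All parameter values are hypothetical.

```python
import random

def rpoisson(lam, rng):
    """Poisson draw by counting unit-rate exponential arrivals in [0, lam]."""
    count, t = 0, rng.expovariate(1.0)
    while t <= lam:
        count += 1
        t += rng.expovariate(1.0)
    return count

def rnegbin(mu, size, rng):
    """NB as a gamma-Poisson mixture: mean mu, variance mu + mu^2 / size."""
    return rpoisson(rng.gammavariate(size, mu / size), rng)

def rzinb(mu, size, pi_zero, rng):
    """Zero-inflated NB: an extra zero with probability pi_zero, else an NB draw."""
    return 0 if rng.random() < pi_zero else rnegbin(mu, size, rng)

def rpoisson_beta(alpha, beta, scale, rng):
    """Poisson draw whose rate is scale * Beta(alpha, beta)."""
    return rpoisson(scale * rng.betavariate(alpha, beta), rng)

rng = random.Random(1)
n_draws = 5000
nb = [rnegbin(5.0, 2.0, rng) for _ in range(n_draws)]
zinb = [rzinb(5.0, 2.0, 0.5, rng) for _ in range(n_draws)]
pb = [rpoisson_beta(1.0, 3.0, 40.0, rng) for _ in range(n_draws)]

zero_fraction_nb = nb.count(0) / n_draws
zero_fraction_zinb = zinb.count(0) / n_draws
```

The zero-inflated sample has far more zeros than the plain NB sample with the same `mu` and `size`, which is the feature that motivates zero-inflated models for scRNA-seq counts.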
Outline
OutlineIntroduction what is so special about scRNA-seq DECommon methods what is out therePerformance how do we know what is best
Practicalities what to do in real lifeSummary what to remember from this hour
Olga (NBIS) scRNA-seq DE May 2018 2 43
Outline
OutlineIntroduction what is so special about scRNA-seq DECommon methods what is out therePerformance how do we know what is bestPracticalities what to do in real life
Summary what to remember from this hour
Olga (NBIS) scRNA-seq DE May 2018 2 43
Outline
OutlineIntroduction what is so special about scRNA-seq DECommon methods what is out therePerformance how do we know what is bestPracticalities what to do in real lifeSummary what to remember from this hour
Olga (NBIS) scRNA-seq DE May 2018 2 43
Outline
Letrsquos get to know each other
httpswwwmenticom
Olga (NBIS) scRNA-seq DE May 2018 3 43
Introduction
Introduction
Olga (NBIS) scRNA-seq DE May 2018 4 43
Introduction
Figure Simplified scRNA-seq workflow [adapted from Wikipedia]
Olga (NBIS) scRNA-seq DE May 2018 5 43
Introduction
Figure Simplified scRNA-seq workflow [adapted from Wikipedia]
Olga (NBIS) scRNA-seq DE May 2018 6 43
Introduction
adapted from Wu et al 2017
Differential expression meanstaking read count data amp
performing statistical analysis to discoverquantitative changes in expression levels betweenexperimental groups
ie to decide whether for a given gene anobserved difference in read counts is significant(greater than what would be expected just due tonatural random variation)
Differential expression is an old problemknown from bulk RNA-seq and microarray studies
in fact building on one of the most commonstatistical problems ie comparing groups forstatistical differences
Olga (NBIS) scRNA-seq DE May 2018 7 43
Differential expression is an old problem
So what is all the commotion about?
https://www.menti.com & 70 52 87
scRNA-seq special characteristics
- high noise levels (technical and biological factors)
- low library sizes
- low amount of available mRNAs results in amplification biases and dropout events
- 3' bias, partial coverage and uneven depth (technical)
- stochastic nature of transcription (biological)
- multimodality in gene expression; presence of multiple possible cell states within a cell population (biological)
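The effect of a low amount of captured mRNA on dropouts can be illustrated with a toy simulation. The sketch below assumes that every molecule is detected independently with a fixed probability (binomial thinning); this is a simplification chosen for illustration, and the capture probability and expression range are hypothetical.

```python
import random

def capture(true_counts, capture_prob, rng):
    """Binomial thinning: each mRNA molecule is detected with probability capture_prob."""
    return [sum(rng.random() < capture_prob for _ in range(n)) for n in true_counts]

def zero_fraction(counts):
    """Fraction of cells in which the gene has zero counts."""
    return sum(c == 0 for c in counts) / len(counts)

rng = random.Random(42)
# Hypothetical 'true' molecule numbers of a lowly expressed gene in 500 cells.
true_counts = [rng.randint(1, 10) for _ in range(500)]
observed = capture(true_counts, capture_prob=0.1, rng=rng)

true_zero = zero_fraction(true_counts)
obs_zero = zero_fraction(observed)
print(f"zeros before capture: {true_zero:.2f}, after capture: {obs_zero:.2f}")
```

Although the gene is present in every simulated cell, a large share of the observed counts are zero.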
Figure: Density plots of expression values (x-axis: value, y-axis: density) for 20 example genes: Rbm17, Rragc, Slc1a3, Slc22a20, Smarcd1, Mybpc1, Nars, Ndufa3, Nono, Pgam2, Crispld2, Fbxw13, Hbxip, Katna1, Lcorl, 1300018J18Rik, Arid2, Bend3, Ccdc104, Ccnt1. Based on tutorial data.
Common methods
Figure: Simplified scRNA-seq workflow [adapted from http://hemberg-lab.github.io]
Generic
- parametric tests, e.g. t-test
- non-parametric tests, e.g. Kruskal-Wallis
RNA-seq based
- edgeR
- limma
- DESeq2
scRNA-seq specific
- MAST, SCDE, Monocle
- D3E, Pagoda
Miao and Zhang 2016
Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle.

Short name | Method | Software version | Input | Available from | Reference
BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11]
D3E | D3E | D3E 1.0 | raw counts | GitHub | [12]
DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13]
DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14]
edgeRLRT | edgeRLRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17]
edgeRLRTcensus | edgeRLRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17]
edgeRLRTdeconv | edgeRLRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18]
edgeRLRTrobust | edgeRLRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19]
edgeRQLF | edgeRQLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
edgeRQLFDetRate | edgeRQLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22]
MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24]
monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25]
monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26]
monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25]
NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27]
ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29]
ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29]
ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29]
SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30]
scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31]
SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32]
SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33]
ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35]
voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22]
Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36]

Source: Soneson and Robinson 2018, Nature Methods, doi:10.1038/nmeth.4612
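Several methods in the table take CPM or log2(CPM+1) values as input. As a rough sketch of what these transformations do, the example below scales the counts of one cell by its library size. It uses plain library-size scaling, which is an assumption for illustration and not a reproduction of the edgeR calculation used in the paper; the gene names and counts are hypothetical.

```python
import math

def cpm(counts_by_gene):
    """Counts per million for one cell: scale each count by the cell's library size."""
    library_size = sum(counts_by_gene.values())
    return {gene: count / library_size * 1e6 for gene, count in counts_by_gene.items()}

def log2_cpm_plus_one(counts_by_gene):
    """log2(CPM + 1); the pseudo-count of 1 keeps zero counts at zero."""
    return {gene: math.log2(value + 1) for gene, value in cpm(counts_by_gene).items()}

# Hypothetical raw counts for three genes in a single cell.
cell = {"geneA": 0, "geneB": 30, "geneC": 70}
cell_cpm = cpm(cell)
cell_log = log2_cpm_plus_one(cell)
print(cell_cpm)
print(cell_log)
```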
More detailed examples
MAST
- uses a generalized linear hurdle model
- designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable
- The rate of expression Z and the level of expression Y are modeled for each gene g, with Z indicating whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
- A logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):
  $\mathrm{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$
  $\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix
- Model parameters are fitted using an empirical Bayesian framework
- Allows for a joint estimate of nuisance and treatment effects
- DE is determined using the likelihood ratio test
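The two-part idea can be sketched for the simplest design, a comparison of two groups with no covariates. This is a strongly simplified illustration and not the MAST implementation: it uses plain maximum likelihood without the empirical Bayes step, and it assumes that both groups contain expressed cells with non-identical values. The function names and expression values are hypothetical.

```python
import math

def bernoulli_loglik(k, n):
    """Maximised log-likelihood of k detections among n cells (0 * log 0 taken as 0)."""
    ll = 0.0
    if k > 0:
        ll += k * math.log(k / n)
    if k < n:
        ll += (n - k) * math.log(1 - k / n)
    return ll

def rss(values, centre):
    """Residual sum of squares around a given centre."""
    return sum((v - centre) ** 2 for v in values)

def hurdle_lrt(group_a, group_b):
    """Simplified two-group hurdle likelihood ratio test; returns (statistic, p-value)."""
    pos_a = [y for y in group_a if y > 0]
    pos_b = [y for y in group_b if y > 0]
    # Discrete part (Z): does the detection rate differ between the groups?
    ll_full = (bernoulli_loglik(len(pos_a), len(group_a))
               + bernoulli_loglik(len(pos_b), len(group_b)))
    ll_null = bernoulli_loglik(len(pos_a) + len(pos_b), len(group_a) + len(group_b))
    stat_discrete = 2 * (ll_full - ll_null)
    # Continuous part (Y | Z = 1): Gaussian group means with a common variance.
    pos = pos_a + pos_b
    rss_null = rss(pos, sum(pos) / len(pos))
    rss_full = rss(pos_a, sum(pos_a) / len(pos_a)) + rss(pos_b, sum(pos_b) / len(pos_b))
    stat_continuous = len(pos) * math.log(rss_null / rss_full)
    stat = stat_discrete + stat_continuous
    # The chi-square survival function with 2 degrees of freedom is exp(-x / 2).
    return stat, math.exp(-stat / 2)

# Hypothetical log-scale expression values; 0 means the gene was not detected.
group_a = [0, 0, 0, 2.1, 0, 1.8, 0, 2.5, 0, 0]
group_b = [3.9, 4.2, 0, 4.8, 5.1, 3.7, 4.4, 0, 4.0, 4.6]
stat, p_value = hurdle_lrt(group_a, group_b)
print(f"LRT statistic: {stat:.2f}, p-value: {p_value:.2e}")
```

Each part contributes one degree of freedom, so the summed statistic is compared against a chi-square distribution with two degrees of freedom.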
SCDE
- models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution
- the NB distribution models the transcripts that are amplified and detected
- the Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)
- a subset of robust genes is used to fit, via the EM algorithm, the parameters of the mixture of models
- For DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach
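To see how such a mixture separates dropouts from amplified signal, the sketch below computes, for a single observed count, the posterior probability that it came from the Poisson (dropout) component. It assumes the mixture parameters are already known; the EM fit is not shown, and all parameter values and function names are hypothetical rather than taken from the scde package.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability of observing k counts with rate lam."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def nb_pmf(k, mean, size):
    """Negative binomial pmf with the given mean and size (variance = mean + mean^2 / size)."""
    log_p = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
             + size * math.log(size / (size + mean))
             + k * math.log(mean / (size + mean)))
    return math.exp(log_p)

def dropout_posterior(k, dropout_weight, background_rate, nb_mean, nb_size):
    """Posterior probability that count k was generated by the dropout component."""
    dropout = dropout_weight * poisson_pmf(k, background_rate)
    amplified = (1 - dropout_weight) * nb_pmf(k, nb_mean, nb_size)
    return dropout / (dropout + amplified)

# Hypothetical mixture parameters for one cell.
params = dict(dropout_weight=0.3, background_rate=0.1, nb_mean=50, nb_size=2)
post_zero = dropout_posterior(0, **params)
post_high = dropout_posterior(40, **params)
print(f"P(dropout | count = 0)  = {post_zero:.3f}")
print(f"P(dropout | count = 40) = {post_high:.3g}")
```

Under these assumed parameters a zero count is almost certainly a dropout, while a count of 40 is attributed to the amplified component.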
Monocle
- Originally designed for ordering cells by progress through differentiation stages (pseudo-time)
- The mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as
  $g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$
  where $Y$ is a specific gene expression level, $x_i$ are predictor variables, $g$ is a link function (typically the log function) and $f_i$ are non-parametric functions (e.g. cubic splines)
- The observable expression level Y is then modeled using a GAM:
  $E(Y) = s(\varphi_t(b_x, s_i)) + \epsilon$
  where $\varphi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and $s$ is a cubic smoothing function with three degrees of freedom. The error term $\epsilon$ is normally distributed with a mean of zero
- The DE test is performed using an approximate $\chi^2$ likelihood ratio test
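The likelihood ratio idea can be sketched without a GAM library. The example below swaps the cubic smoothing spline for an ordinary cubic polynomial in pseudo-time, fitted by least squares, and compares it with a constant-mean model. This substitution is an assumption made to keep the example self-contained, so it is not what Monocle does internally; the data are simulated and all names are hypothetical.

```python
import math
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def cubic_rss(t, y):
    """Residual sum of squares of a least-squares cubic polynomial in t."""
    basis = [[ti ** p for p in range(4)] for ti in t]
    xtx = [[sum(row[i] * row[j] for row in basis) for j in range(4)] for i in range(4)]
    xty = [sum(row[i] * yi for row, yi in zip(basis, y)) for i in range(4)]
    beta = solve(xtx, xty)
    return sum((yi - sum(bj * v for bj, v in zip(beta, row))) ** 2
               for row, yi in zip(basis, y))

def chi2_sf_3df(x):
    """Survival function of the chi-square distribution with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

def pseudotime_lrt(t, y):
    """LRT of a cubic trend in pseudo-time against a constant mean (Gaussian errors)."""
    mean_y = sum(y) / len(y)
    rss_null = sum((yi - mean_y) ** 2 for yi in y)
    stat = len(y) * math.log(rss_null / cubic_rss(t, y))
    return stat, chi2_sf_3df(max(stat, 0.0))

# Simulated expression of one gene that increases along pseudo-time.
rng = random.Random(0)
t = [i / 49 for i in range(50)]
y = [2 + 3 * ti + rng.gauss(0, 0.5) for ti in t]
trend_stat, trend_p = pseudotime_lrt(t, y)
print(f"LRT statistic: {trend_stat:.1f}, p-value: {trend_p:.2e}")
```

The cubic model has three more parameters than the constant model, which is why the statistic is compared against a chi-square distribution with three degrees of freedom.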
Let's stop for a minute
The key
$\text{Outcome}_i = (\text{Model}_i) + \text{error}_i$
- we collect data on a sample from a much larger population
- statistics lets us make inferences about the population from which the sample was derived
- we try to predict the outcome given a model fitted to the data
$t = \dfrac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$
Figure: Histogram of height [cm] (x-axis: height, y-axis: frequency)
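The t statistic above can be computed in a few lines. In this sketch $s_p$ is taken to be the pooled standard deviation of the two groups (the usual equal-variance form), and the height values are made up for illustration.

```python
import math
from statistics import mean, variance

def pooled_t(x1, x2):
    """Two-sample t statistic with a pooled standard deviation s_p."""
    n1, n2 = len(x1), len(x2)
    # Pooled variance: sample variances weighted by their degrees of freedom.
    sp = math.sqrt(((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2))
    return (mean(x1) - mean(x2)) / (sp * math.sqrt(1 / n1 + 1 / n2))

# Hypothetical heights [cm] in two groups.
heights_a = [168, 172, 171, 175, 169]
heights_b = [174, 178, 176, 180, 177]
t_stat = pooled_t(heights_a, heights_b)
print(f"t = {t_stat:.3f} with {len(heights_a) + len(heights_b) - 2} degrees of freedom")
```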
Generic recipe
- model the data, e.g. gene expression
- fit the model to the data and/or the data to the model
- estimate the model parameters
- use the model for prediction and/or inference
MAST (revisited)
- uses a generalized linear hurdle model
- designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable
- The rate of expression Z and the level of expression Y are modeled for each gene g, with Z indicating whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
- A logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):
  $\mathrm{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$
  $\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix
- Model parameters are fitted using an empirical Bayesian framework
- Allows for a joint estimate of nuisance and treatment effects
- DE is determined using the likelihood ratio test
Generic recipe
- model, e.g. gene expression with random error
- fit the model to the data and/or the data to the model; estimate the model parameters
- use the model for prediction and/or inference
Important implication
- the better the model fits the data, the better the statistics
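One way to see why model fit matters is to compare two candidate models on the same overdispersed counts. The sketch below fits a Poisson and a negative binomial model and compares their log-likelihoods. The counts are hypothetical, and the NB size parameter is estimated by the method of moments, a simple choice made for illustration that only works when the variance exceeds the mean.

```python
import math
from statistics import mean, variance

def poisson_loglik(counts, lam):
    """Log-likelihood of the counts under a Poisson model with rate lam."""
    return sum(k * math.log(lam) - lam - math.lgamma(k + 1) for k in counts)

def nb_loglik(counts, mu, size):
    """Log-likelihood under a negative binomial with mean mu and size parameter."""
    return sum(math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
               + size * math.log(size / (size + mu))
               + k * math.log(mu / (size + mu))
               for k in counts)

# Hypothetical read counts of one gene across 16 cells (many zeros, a few high values).
counts = [0, 0, 1, 0, 12, 3, 0, 25, 0, 7, 0, 2, 40, 0, 5, 1]
mu = mean(counts)
var = variance(counts)
size = mu ** 2 / (var - mu)  # method of moments; requires var > mu

ll_poisson = poisson_loglik(counts, mu)
ll_nb = nb_loglik(counts, mu, size)
print(f"mean = {mu:.2f}, variance = {var:.2f}")
print(f"log-likelihood: Poisson = {ll_poisson:.1f}, NB = {ll_nb:.1f}")
```

The NB model reaches a much higher log-likelihood here because it can absorb the extra variance that the Poisson model cannot.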
Common distributions
Figure: Histograms of read counts (x-axis: read counts, y-axis: frequency) for three distributions: Negative Binomial, Zero-inflated NB and Poisson-Beta
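Two of these distributions are easy to simulate, which helps build intuition for how they differ. The sketch below draws NB counts as a gamma-Poisson mixture and adds extra zeros with a fixed probability to obtain a zero-inflated NB. The parameter values are hypothetical, and the Poisson sampler is a simple textbook method that is only suitable for small rates.

```python
import math
import random
from statistics import mean

def rpois(lam, rng):
    """Poisson draw by Knuth's multiplication method (adequate for small lam)."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def rnbinom(mu, size, rng):
    """NB draw as a gamma-Poisson mixture with mean mu and size parameter."""
    # gammavariate(shape, scale) has mean shape * scale = mu.
    return rpois(rng.gammavariate(size, mu / size), rng)

def rzinb(mu, size, zero_prob, rng):
    """Zero-inflated NB: an extra zero with probability zero_prob, otherwise an NB draw."""
    return 0 if rng.random() < zero_prob else rnbinom(mu, size, rng)

rng = random.Random(7)
nb_counts = [rnbinom(5, 2, rng) for _ in range(5000)]
zinb_counts = [rzinb(5, 2, 0.4, rng) for _ in range(5000)]

nb_zero = sum(c == 0 for c in nb_counts) / len(nb_counts)
zinb_zero = sum(c == 0 for c in zinb_counts) / len(zinb_counts)
print(f"NB:   mean = {mean(nb_counts):.2f}, zero fraction = {nb_zero:.2f}")
print(f"ZINB: mean = {mean(zinb_counts):.2f}, zero fraction = {zinb_zero:.2f}")
```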
Figure: Histograms of read counts (x-axis: read counts, y-axis: frequency) for three Negative Binomial distributions
Performance
No ground truth, i.e. no independently validated truth is available for testing
Known data
- using data we know something about to get positive controls
Simulated data
- null data sets by re-sampling
- modeling data sets based on various distributions
Comparing between methods and scenarios
- comparing numbers of DE genes, incl. as a function of group size
Investigating results
- What do the expression levels and distributions of the detected DE genes look like?
- False positives (type I error) vs. false negatives (type II error)
- Sensitivity and specificity
- Precision and recall
Figure adapted from Wikipedia
Figure from Dal Molin, Baruzzo and Di Camillo 2017: 2 conditions of 100 cells each, simulated with 10 000 genes, out of which 2 000 were set to be DE (based on NB and bimodal distributions)
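When the truly DE genes are known, as in simulated data, these metrics follow directly from the four cells of the confusion matrix. The sketch below shows the arithmetic on a toy example; the gene identifiers and the function name are hypothetical.

```python
def classification_metrics(called, truth, all_genes):
    """Performance metrics for a set of called DE genes against a known truth."""
    called, truth = set(called), set(truth)
    tp = len(called & truth)                      # true positives
    fp = len(called - truth)                      # false positives (type I errors)
    fn = len(truth - called)                      # false negatives (type II errors)
    tn = len(set(all_genes) - called - truth)     # true negatives
    return {
        "sensitivity": tp / (tp + fn),            # also called recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "false_positive_rate": fp / (fp + tn),
    }

# Toy example: 10 genes, 4 truly DE, 4 called DE by some method.
genes = [f"g{i}" for i in range(10)]
truly_de = ["g0", "g1", "g2", "g3"]
called_de = ["g0", "g1", "g2", "g7"]
metrics = classification_metrics(called_de, truly_de, genes)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```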
Consistency
Miao et al. 2017
And so much more
Soneson and Robinson 2018, "Bias, robustness and scalability in single-cell differential expression analysis"
- 36 statistical approaches for DE analysis to compare the expression levels in the two groups of cells
- based on 9 data sets with 11-21 separate instances (sample size effect)
- extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.
- conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets
Practicalities
Getting to know your data
- Example data: 46078 genes x 96 cells
- 22229 genes with no expression at all
Figure: Histograms of read counts and of the number of zero counts (x-axis: read counts / 0 counts, y-axis: frequency)
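A first look at the data often starts with counting zeros. The sketch below finds genes with no expression in any cell and counts the zeros per gene. It uses a tiny made-up matrix (genes as rows, cells as columns), not the example data from the slide.

```python
# Hypothetical count matrix: one list of per-cell counts for each gene.
counts = {
    "geneA": [0, 0, 0, 0],
    "geneB": [5, 0, 2, 0],
    "geneC": [0, 0, 0, 0],
    "geneD": [10, 7, 0, 3],
}

# Genes with no expression at all carry no information for DE testing.
silent_genes = [gene for gene, row in counts.items() if not any(row)]

# Number of cells with a zero count, per gene.
zeros_per_gene = {gene: row.count(0) for gene, row in counts.items()}

print(f"{len(silent_genes)} of {len(counts)} genes have no expression: {silent_genes}")
print(zeros_per_gene)
```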
Choosing DE methods
Soneson and Robinson 2018
Remembering the bigger picture
Stegle, Teichmann and Marioni 2015
- QC filtering
- Cell-cycle phase
- Normalization of cell-specific biases
- Confounding factors, incl. batch effects
- Detection rate, i.e. the fraction of detected genes per cell
- Imputation strategies for dropout values
- What is pragmatic: programming language, platform, speed, collaborative workflows, etc.
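The detection rate mentioned above is simple to compute and can then be used, for example, as a covariate in the DE model. The sketch below treats a gene as detected in a cell when it has at least one read, which is one possible threshold; the cell names and counts are hypothetical.

```python
def detection_rate(counts_by_cell):
    """Fraction of genes with at least one read, computed separately for each cell."""
    return {cell: sum(c > 0 for c in counts) / len(counts)
            for cell, counts in counts_by_cell.items()}

# Hypothetical counts for four genes in three cells.
cells = {
    "cell1": [5, 0, 0, 10],
    "cell2": [0, 0, 0, 7],
    "cell3": [2, 1, 0, 3],
}
rates = detection_rate(cells)
print(rates)
```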
Staying critical
Summary
What to remember from this hour?
https://www.menti.com & 70 52 87
Growing field
Angerer et al. 2017
https://www.scrna-tools.org/tools
Zappia, Phipson and Oshlack 2018
Summary
- scRNA-seq is a rapidly growing field
- DE is a common task, so many newer and better methods will be developed
- understanding basic statistical concepts enables one to think more like a statistician, and to choose and evaluate methods for a given data set
- staying critical, staying updated, staying connected
Bibliography
Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1–9. ISSN 1367-4803. doi:10.1093/bioinformatics/bty329
Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243–260. ISSN 2095-4697. doi:10.1007/s40484-016-0089-7
Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255–261. ISSN 1548-7105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612
Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN 1664-8021. doi:10.3389/fgene.2017.00062
Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data." no. May: 1–2. ISSN 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549
Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133–145
Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85–91. ISSN 2452-3100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004
Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573
Outline
OutlineIntroduction what is so special about scRNA-seq DECommon methods what is out therePerformance how do we know what is bestPracticalities what to do in real life
Summary what to remember from this hour
Olga (NBIS) scRNA-seq DE May 2018 2 43
Outline
OutlineIntroduction what is so special about scRNA-seq DECommon methods what is out therePerformance how do we know what is bestPracticalities what to do in real lifeSummary what to remember from this hour
Olga (NBIS) scRNA-seq DE May 2018 2 43
Outline
Letrsquos get to know each other
httpswwwmenticom
Olga (NBIS) scRNA-seq DE May 2018 3 43
Introduction
Introduction
Olga (NBIS) scRNA-seq DE May 2018 4 43
Introduction
Figure Simplified scRNA-seq workflow [adapted from Wikipedia]
Olga (NBIS) scRNA-seq DE May 2018 5 43
Introduction
Figure Simplified scRNA-seq workflow [adapted from Wikipedia]
Olga (NBIS) scRNA-seq DE May 2018 6 43
Introduction
adapted from Wu et al 2017
Differential expression meanstaking read count data amp
performing statistical analysis to discoverquantitative changes in expression levels betweenexperimental groups
ie to decide whether for a given gene anobserved difference in read counts is significant(greater than what would be expected just due tonatural random variation)
Differential expression is an old problemknown from bulk RNA-seq and microarray studies
in fact building on one of the most commonstatistical problems ie comparing groups forstatistical differences
Olga (NBIS) scRNA-seq DE May 2018 7 43
Introduction
adapted from Wu et al 2017
Differential expression meanstaking read count data amp
performing statistical analysis to discoverquantitative changes in expression levels betweenexperimental groups
ie to decide whether for a given gene anobserved difference in read counts is significant(greater than what would be expected just due tonatural random variation)
Differential expression is an old problemknown from bulk RNA-seq and microarray studies
in fact building on one of the most commonstatistical problems ie comparing groups forstatistical differences
Olga (NBIS) scRNA-seq DE May 2018 7 43
Introduction
Differential expression is an old problem
So what is all the commotion about
httpswwwmenticom amp 70 52 87
scRNA-seq special characteristicshigh noise levels (technical and biological factors)low library sizeslow amount of available mRNAs results in amplification biases anddropout events3rsquo bias partial coverage and uneven depth (technical)stochastic nature of transcription (biological)multimodality in gene expression presence of multiple possiblecell states within a cell population (biological)
Olga (NBIS) scRNA-seq DE May 2018 8 43
Introduction
Differential expression is an old problem
So what is all the commotion about
httpswwwmenticom amp 70 52 87
scRNA-seq special characteristicshigh noise levels (technical and biological factors)low library sizeslow amount of available mRNAs results in amplification biases anddropout events3rsquo bias partial coverage and uneven depth (technical)stochastic nature of transcription (biological)multimodality in gene expression presence of multiple possiblecell states within a cell population (biological)
Olga (NBIS) scRNA-seq DE May 2018 8 43
Introduction
Rbm17 Rragc Slc1a3 Slc22a20 Smarcd1
Mybpc1 Nars Ndufa3 Nono Pgam2
Crispld2 Fbxw13 Hbxip Katna1 Lcorl
1300018J18Rik Arid2 Bend3 Ccdc104 Ccnt1
00 25 50 75 100 00 25 50 75 100 00 25 50 75 100 00 25 50 75 100 00 25 50 75 100
00
01
02
03
0
1
2
3
00
01
02
03
00
01
02
03
00
01
02
03
000
005
010
015
020
025
00
02
04
06
00
01
02
03
04
05
00
01
02
03
00
01
02
03
00
02
04
00
05
10
15
00
01
02
03
00
05
10
15
20
00
02
04
000
005
010
015
0
1
2
3
4
00
05
10
15
00
01
02
03
04
05
00
01
02
03
04
value
dens
ity
Based on tutorial data
Olga (NBIS) scRNA-seq DE May 2018 9 43
Common methods
Common methods
Olga (NBIS) scRNA-seq DE May 2018 10 43
Common methods
Simplified scRNA-seq workflow [adopted from httphemberg-labgithubio
Olga (NBIS) scRNA-seq DE May 2018 11 43
Common methods
Genericparametric tests eg t-testnon-parametric tests eg Kruskal-Wallis
RNA-seq basededgeRlimmaDEseq2
scRNA-seq specificMAST SCDE MonocleD3E Pagoda
Olga (NBIS) scRNA-seq DE May 2018 12 43
Common methods
Miao and Zhang 2016
Olga (NBIS) scRNA-seq DE May 2018 13 43
Common methodsSupplementary Table 2 Evaluated dicrarrerential expression methods together with package versions and thetype of input values provided to each of them Note that ldquoraw countsrdquo here refers to length-scaled TPMswhich are on the scale of the raw counts but are unacrarrected by dicrarrerential isoform usage [10] CPM valuesare calculated with edgeR and Census counts with monocle
Short name Method Software version InputAvailablefrom
Reference
BPSC BPSC BPSC 09901 CPM GitHub [11]
D3E D3E D3E 10 raw counts GitHub [12]
DESeq2 DESeq2 DESeq2 1141 raw counts Bioconductor [13]
DESeq2betapFALSE DESeq2 without beta prior DESeq2 1141 raw counts Bioconductor [13]
DESeq2census DESeq2 DESeq2 1141 Census counts Bioconductor [13]
DESeq2nofiltDESeq2 without the built-in in-dependent filtering
DESeq2 1141 raw counts Bioconductor [13]
DEsingle DEsingle DEsingle 010 raw counts GitHub [14]
edgeRLRT edgeRLRT edgeR 3191 raw counts Bioconductor [15ndash17]
edgeRLRTcensus edgeRLRT edgeR 3191 Census counts Bioconductor [15ndash17]
edgeRLRTdeconvedgeRLRT with deconvolutionnormalization
edgeR 3191scran 120
raw counts Bioconductor [15 17 18]
edgeRLRTrobustedgeRLRT with robust disper-sion estimation
edgeR 3191 raw counts Bioconductor [15ndash17 19]
edgeRQLF edgeRQLF edgeR 3191 raw counts Bioconductor [15 16 20]
edgeRQLFDetRateedgeRQLF with cellular detec-tion rate as covariate
edgeR 3191 raw counts Bioconductor [15 16 20]
limmatrend limma-trend limma 33013 log2(CPM) Bioconductor [21 22]
MASTcpm MAST MAST 105 log2(CPM+1) Bioconductor [23]
MASTcpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(CPM+1) Bioconductor [23]
MASTtpm MAST MAST 105 log2(TPM+1) Bioconductor [23]
MASTtpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(TPM+1) Bioconductor [23]
metagenomeSeq metagenomeSeqmetagenomeSeq1160
raw counts Bioconductor [24]
monocle monocle (tobit) monocle 220 TPM Bioconductor [25]
monoclecensus monocle (Negative Binomial) monocle 220 Census counts Bioconductor [25 26]
monoclecount monocle (Negative Binomial) monocle 220 raw counts Bioconductor [25]
NODES NODESNODES0009010
raw countsAuthor-providedlink
[27]
ROTScpm ROTS ROTS 120 CPM Bioconductor [28 29]
ROTStpm ROTS ROTS 120 TPM Bioconductor [28 29]
ROTSvoom ROTS ROTS 120voom-transformedraw counts
Bioconductor [28 29]
SAMseq SAMseq samr 20 raw counts CRAN [30]
scDD scDD scDD 100 raw counts Bioconductor [31]
SCDE SCDE scde 220 raw counts Bioconductor [32]
SeuratBimod Seurat (bimod test) Seurat 1407 raw counts GitHub [33 34]
SeuratBimodnofiltSeurat (bimod test) without theinternal filtering
Seurat 1407 raw counts GitHub [33 34]
SeuratBimodIsExpr2Seurat (bimod test) with internalexpression threshold set to 2
Seurat 1407 raw counts GitHub [33 34]
SeuratTobit Seurat (tobit test) Seurat 1407 TPM GitHub [25 33]
ttest t-test stats (R v 33)TMM-normalizedTPM
CRAN [16 35]
voomlimma voom-limma limma 33013 raw counts Bioconductor [21 22]
Wilcoxon Wilcoxon test stats (R v 33)TMM-normalizedTPM
CRAN [16 36]
3
Nature Methods doi101038nmeth4612Soneson and Robinson 2018
Olga (NBIS) scRNA-seq DE May 2018 14 43
Common methods More detailed examples
More detailed examples
Olga (NBIS) scRNA-seq DE May 2018 15 43
Common methods More detailed examples
MAST
uses generalized linear hurdle model
designed to account for stochastic dropouts and bimodal expression distribution in whichexpression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene gindicating whether gene g is expressed in cell i (ie Zig = 0 if yig = 0 and zig = 1 ifyig gt 0)
A logistic regression model for the discrete variable Z and a Gaussian linear model for thecontinuous variable (Y|Z=1)
logit(Pr (Zig = 1)) = XiβDg
Pr (Yig = Y |Zig = 1) = N(XiβCg σ
2g) where Xi is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
Olga (NBIS) scRNA-seq DE May 2018 16 43
Common methods More detailed examples
SCDE
models the read counts for each gene using a mixture of a NB negative binomial and aPoisson distribution
NB distribution models the transcripts that are amplified and detected
Poisson distribution models the unobserved or background-level signal of transcripts thatare not amplified (eg dropout events)
subset of robust genes is used to fit via EM algorithm the parameters to the mixture ofmodels
For DE the posterior probability that the gene shows a fold expression difference betweentwo conditions is computed using a Bayesian approach
Olga (NBIS) scRNA-seq DE May 2018 17 43
Common methods More detailed examples
Monocole
Originally designed for ordering cells by progress through differentiation stages(pseudo-time)
The mean expression level of each gene is modeled with a GAM generalized additivemodel which relates one or more predictor variables to a response variable as
g(E(Y )) = β0 + f1(x1) + f2(x2) + + fm(xm) where Y is a specific gene expression level xi arepredictor variables g is a link function typically log function and fi are non-parametric functions
(eg cubic splines)
The observable expression level Y is then modeled using GAM
E(Y ) = s(ϕt (bx si )) + ε where ϕt (bx si ) is the assigned pseudo-time of a cell and s is a cubicsmoothing function with three degrees of freedom The error term ε is normally distributed with amean of zero
The DE test is performed using an approx χ2 likelihood ratio test
Olga (NBIS) scRNA-seq DE May 2018 18 43
Common methods More detailed examples
Letrsquos stop for a minute
Olga (NBIS) scRNA-seq DE May 2018 19 43
Common methods More detailed examples
The key
Outcomei = (Modeli) + errori
we collect data on a sample from a much larger population
statistics lets us to make inferences about the population from which sample wasderived
we try to predict the outcome given a model fitted to the data
Olga (NBIS) scRNA-seq DE May 2018 20 43
Common methods More detailed examples
The key
t = x1minusx2
sp
radic1
n1+ 1
n2
height [cm]
Fre
quen
cy
165 170 175 180
010
3050
Olga (NBIS) scRNA-seq DE May 2018 21 43
Common methods More detailed examples
Generic recipemodel data eg gene expressionfit model to the data andor data to the modelestimate model parametersuse model for prediction andor inference
Olga (NBIS) scRNA-seq DE May 2018 22 43
Common methods More detailed examples
MAST (revisited)
uses a generalized linear hurdle model
designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene g, with $Z_{ig}$ indicating whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
A logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):
$\operatorname{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$
$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
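To make the two-part idea concrete, here is a deliberately simplified sketch of a hurdle-style likelihood ratio test for one gene in two groups of cells. It is not the MAST implementation: there are no covariates and no empirical Bayes regularisation, and the function names are hypothetical. It assumes log-scale expression values with 0 meaning "not detected", and at least one detected cell per group.

```python
import math

def _xlogy(x, y):
    """Return x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)

def _binom_loglik(k, n):
    """Maximised Bernoulli log-likelihood for k expressed cells out of n."""
    p = k / n
    return _xlogy(k, p) + _xlogy(n - k, 1.0 - p)

def _rss(values, centre):
    """Residual sum of squares around a given centre."""
    return sum((v - centre) ** 2 for v in values)

def hurdle_lrt(group_a, group_b):
    """Two-part likelihood ratio test; returns (statistic, p_value)."""
    pos_a = [y for y in group_a if y > 0]
    pos_b = [y for y in group_b if y > 0]
    k_a, k_b = len(pos_a), len(pos_b)
    if k_a == 0 or k_b == 0:
        raise ValueError("each group needs at least one detected cell")

    # Discrete part: is the detection rate Pr(Z = 1) the same in both groups?
    ll_alt = _binom_loglik(k_a, len(group_a)) + _binom_loglik(k_b, len(group_b))
    ll_null = _binom_loglik(k_a + k_b, len(group_a) + len(group_b))
    stat_discrete = 2.0 * (ll_alt - ll_null)

    # Continuous part: Gaussian model for Y | Z = 1 with a common variance.
    pooled = pos_a + pos_b
    rss_null = _rss(pooled, sum(pooled) / len(pooled))
    rss_alt = _rss(pos_a, sum(pos_a) / k_a) + _rss(pos_b, sum(pos_b) / k_b)
    if rss_alt == 0:
        raise ValueError("detected values show no within-group variation")
    stat_continuous = len(pooled) * math.log(rss_null / rss_alt)

    # One parameter is tested in each part, so the sum is referred to chi-square, 2 df.
    stat = max(stat_discrete + stat_continuous, 0.0)
    p_value = math.exp(-stat / 2.0)  # chi-square survival function for 2 df
    return stat, p_value
```

A gene can therefore be called DE because the fraction of expressing cells differs, because the expression level among expressing cells differs, or both.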
Generic recipe
model e.g. gene expression with random error
fit the model to the data and/or the data to the model; estimate the model parameters
use the model for prediction and/or inference
Important implication: the better the model fits the data, the better the statistics
Common distributions
Figure: histograms of read counts (x-axis) against frequency (y-axis) for three distributions: Negative Binomial, Zero-inflated NB and Poisson-Beta
Figure: three Negative Binomial histograms of read counts (x-axis) against frequency (y-axis)
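Two of the distributions named above can be written down directly. The sketch below uses the mean/dispersion parameterisation of the negative binomial and adds a point mass at zero for the zero-inflated variant. The function names and parameter values are assumptions chosen for illustration, not taken from the slides.

```python
import math

def nb_pmf(k, mu, size):
    """Negative binomial probability of k reads, given mean mu and dispersion parameter size."""
    log_p = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
             + size * math.log(size / (size + mu))
             + k * math.log(mu / (size + mu)))
    return math.exp(log_p)

def zinb_pmf(k, mu, size, pi_zero):
    """Zero-inflated NB: an extra point mass pi_zero at zero, mixed with an NB."""
    extra_zero = pi_zero if k == 0 else 0.0
    return extra_zero + (1.0 - pi_zero) * nb_pmf(k, mu, size)

# Hypothetical parameters, chosen only for illustration.
mu, size, pi_zero = 5.0, 2.0, 0.3
nb_probs = [nb_pmf(k, mu, size) for k in range(21)]
zinb_probs = [zinb_pmf(k, mu, size, pi_zero) for k in range(21)]
```

Plotting `nb_probs` and `zinb_probs` side by side shows the extra spike at zero that motivates dropout-aware models for scRNA-seq counts.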
Performance
Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle.

| Short name | Method | Software version | Input | Available from | Reference |
|---|---|---|---|---|---|
| BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11] |
| D3E | D3E | D3E 1.0 | raw counts | GitHub | [12] |
| DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13] |
| DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14] |
| edgeRLRT | edgeR/LRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17] |
| edgeRLRTcensus | edgeR/LRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17] |
| edgeRLRTdeconv | edgeR/LRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18] |
| edgeRLRTrobust | edgeR/LRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19] |
| edgeRQLF | edgeR/QLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20] |
| edgeRQLFDetRate | edgeR/QLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20] |
| limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22] |
| MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23] |
| MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23] |
| MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23] |
| MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23] |
| metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24] |
| monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25] |
| monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26] |
| monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25] |
| NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27] |
| ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29] |
| ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29] |
| ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29] |
| SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30] |
| scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31] |
| SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32] |
| SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33] |
| ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35] |
| voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22] |
| Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36] |

Source: Nature Methods, doi:10.1038/nmeth.4612
No ground truth, i.e. no independently validated truth is available for testing
Known data
using data we know something about to get positive controls
Simulated data
null data sets by re-sampling; modeling data sets based on various distributions
Comparing between methods and scenarios
comparing numbers of DE genes, incl. as a function of group size
Investigating results
What do the expression levels and distributions of the detected DE genes look like?
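The "null data sets by re-sampling" idea can be sketched as follows: cells from a single condition are split at random into two mock groups, so any gene called DE between them is a false positive. The helper names are hypothetical, and the DE test itself is left out; only the split and the bookkeeping are shown.

```python
import random

def null_split(cell_ids, seed=0):
    """Randomly split cells from ONE condition into two equally sized mock groups."""
    rng = random.Random(seed)
    shuffled = list(cell_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:2 * half]

def false_positive_fraction(p_values, alpha=0.05):
    """Fraction of genes called DE on a null data set at threshold alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)

# Ten hypothetical cells from one condition, split into two mock groups.
mock_a, mock_b = null_split(range(10), seed=1)
```

For a well-calibrated method with unadjusted p-values, the false positive fraction on such null data should be close to the chosen alpha; repeating the split with different seeds and group sizes shows how stable that is.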
False positives (type I error) vs false negatives (type II error)
Sensitivity and specificity
Precision and recall
adapted from Wikipedia
Dal Molin, Baruzzo and Di Camillo 2017: 2 conditions of 100 cells each, simulated with 10 000 genes, out of which 2 000 were set to be DE (based on NB and bimodal distributions)
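When the truly DE genes are known, as in a simulation, these metrics follow directly from counting. The sketch below mirrors the simulation size quoted above (10 000 genes, 2 000 truly DE); the set of called genes is invented for illustration, and the function name is hypothetical.

```python
def de_metrics(true_de, called_de, all_genes):
    """Sensitivity (recall), specificity and precision of a set of DE calls."""
    true_de, called_de, all_genes = set(true_de), set(called_de), set(all_genes)
    tp = len(called_de & true_de)              # true positives
    fp = len(called_de - true_de)              # false positives (type I errors)
    fn = len(true_de - called_de)              # false negatives (type II errors)
    tn = len(all_genes - true_de - called_de)  # true negatives
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }

all_genes = range(10_000)
true_de = range(2_000)
called_de = range(500, 2_500)  # hypothetical calls: 1 500 correct, 500 wrong
metrics = de_metrics(true_de, called_de, all_genes)
```

Sensitivity and recall are the same quantity; specificity looks at the non-DE genes, while precision looks only at the genes that were called.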
Consistency
Miao et al 2017
And so much more
Soneson and Robinson 2018, "Bias, robustness and scalability in single-cell differential expression analysis"
36 statistical approaches for DE analysis to compare the expression levels in the two groups of cells
based on 9 data sets with 11-21 separate instances (sample size effect)
extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.
conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets
Practicalities
Getting to know your data
Example data: 46078 genes x 96 cells; 22229 genes with no expression at all
Figure: two histograms, "Read Counts" against frequency and "0 counts" against frequency
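A first look at the zeros in a count matrix takes only a few lines. This sketch assumes the matrix is a plain list of rows, one row per gene and one column per cell. The toy values and the function name are made up for illustration.

```python
def zero_summary(counts):
    """Return the number of zero counts per gene and the number of genes never expressed."""
    n_cells = len(counts[0])
    zeros_per_gene = [sum(1 for c in row if c == 0) for row in counts]
    n_silent = sum(1 for z in zeros_per_gene if z == n_cells)
    return zeros_per_gene, n_silent

# Toy matrix: 3 genes x 4 cells.
toy_counts = [
    [0, 0, 0, 0],
    [3, 0, 12, 1],
    [0, 0, 7, 0],
]
zeros_per_gene, n_silent = zero_summary(toy_counts)
```

A histogram of `zeros_per_gene` corresponds to the "0 counts" panel, and `n_silent` to the number of genes with no expression at all.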
Choosing DE methods
Soneson and Robinson 2018
Remembering the bigger picture
Stegle, Teichmann and Marioni 2015
QC filtering
Cell-cycle phase
Normalization of cell-specific biases
Confounding factors, incl. batch effects
Detection rate, i.e. the fraction of detected genes per cell
Imputation strategies for dropout values
What is pragmatic: programming language, platform, speed, collaborative workflows, etc.
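The detection rate mentioned in the list above is straightforward to compute and is often added as a covariate in the DE model. A minimal sketch, assuming a count matrix stored as a list of rows with genes in rows and cells in columns; the function name and toy values are hypothetical.

```python
def detection_rate(counts):
    """Fraction of genes with a non-zero count in each cell."""
    n_genes = len(counts)
    n_cells = len(counts[0])
    return [sum(1 for row in counts if row[j] > 0) / n_genes for j in range(n_cells)]

# Toy matrix: 4 genes x 4 cells.
toy_counts = [
    [0, 0, 0, 0],
    [3, 0, 12, 1],
    [0, 0, 7, 0],
    [5, 0, 2, 0],
]
rates = detection_rate(toy_counts)
```

Cells with a very low detection rate are candidates for QC filtering, and large differences in detection rate between groups can confound a DE comparison.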
Staying critical
Summary
What to remember from this hour
https://www.menti.com & 70 52 87
Growing field
Angerer et al 2017
https://www.scrna-tools.org/tools
Zappia, Phipson and Oshlack 2018
scRNA-seq is a rapidly growing field
DE is a common task, so many newer and better methods will be developed
understanding basic statistical concepts enables one to think more like a statistician, and to choose and evaluate methods given a data set
staying critical, staying updated, staying connected
Bibliography
Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1–9. ISSN 1367-4803. doi:10.1093/bioinformatics/bty329
Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243–260. ISSN 2095-4697. doi:10.1007/s40484-016-0089-7
Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255–261. ISSN 1548-7105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612
Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN 1664-8021. doi:10.3389/fgene.2017.00062
Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data," no. May: 1–2. ISSN 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549
Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133–145
Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85–91. ISSN 2452-3100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004
Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv: 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573
Outline
OutlineIntroduction what is so special about scRNA-seq DECommon methods what is out therePerformance how do we know what is bestPracticalities what to do in real lifeSummary what to remember from this hour
Olga (NBIS) scRNA-seq DE May 2018 2 43
Outline
Letrsquos get to know each other
httpswwwmenticom
Olga (NBIS) scRNA-seq DE May 2018 3 43
Introduction
Introduction
Olga (NBIS) scRNA-seq DE May 2018 4 43
Introduction
Figure Simplified scRNA-seq workflow [adapted from Wikipedia]
Olga (NBIS) scRNA-seq DE May 2018 5 43
Introduction
Figure Simplified scRNA-seq workflow [adapted from Wikipedia]
Olga (NBIS) scRNA-seq DE May 2018 6 43
Introduction
adapted from Wu et al 2017
Differential expression meanstaking read count data amp
performing statistical analysis to discoverquantitative changes in expression levels betweenexperimental groups
ie to decide whether for a given gene anobserved difference in read counts is significant(greater than what would be expected just due tonatural random variation)
Differential expression is an old problemknown from bulk RNA-seq and microarray studies
in fact building on one of the most commonstatistical problems ie comparing groups forstatistical differences
Olga (NBIS) scRNA-seq DE May 2018 7 43
Introduction
adapted from Wu et al 2017
Differential expression meanstaking read count data amp
performing statistical analysis to discoverquantitative changes in expression levels betweenexperimental groups
ie to decide whether for a given gene anobserved difference in read counts is significant(greater than what would be expected just due tonatural random variation)
Differential expression is an old problemknown from bulk RNA-seq and microarray studies
in fact building on one of the most commonstatistical problems ie comparing groups forstatistical differences
Olga (NBIS) scRNA-seq DE May 2018 7 43
Introduction
Differential expression is an old problem
So what is all the commotion about
httpswwwmenticom amp 70 52 87
scRNA-seq special characteristicshigh noise levels (technical and biological factors)low library sizeslow amount of available mRNAs results in amplification biases anddropout events3rsquo bias partial coverage and uneven depth (technical)stochastic nature of transcription (biological)multimodality in gene expression presence of multiple possiblecell states within a cell population (biological)
Olga (NBIS) scRNA-seq DE May 2018 8 43
Introduction
Differential expression is an old problem
So what is all the commotion about
httpswwwmenticom amp 70 52 87
scRNA-seq special characteristicshigh noise levels (technical and biological factors)low library sizeslow amount of available mRNAs results in amplification biases anddropout events3rsquo bias partial coverage and uneven depth (technical)stochastic nature of transcription (biological)multimodality in gene expression presence of multiple possiblecell states within a cell population (biological)
Olga (NBIS) scRNA-seq DE May 2018 8 43
Introduction
Rbm17 Rragc Slc1a3 Slc22a20 Smarcd1
Mybpc1 Nars Ndufa3 Nono Pgam2
Crispld2 Fbxw13 Hbxip Katna1 Lcorl
1300018J18Rik Arid2 Bend3 Ccdc104 Ccnt1
00 25 50 75 100 00 25 50 75 100 00 25 50 75 100 00 25 50 75 100 00 25 50 75 100
00
01
02
03
0
1
2
3
00
01
02
03
00
01
02
03
00
01
02
03
000
005
010
015
020
025
00
02
04
06
00
01
02
03
04
05
00
01
02
03
00
01
02
03
00
02
04
00
05
10
15
00
01
02
03
00
05
10
15
20
00
02
04
000
005
010
015
0
1
2
3
4
00
05
10
15
00
01
02
03
04
05
00
01
02
03
04
value
dens
ity
Based on tutorial data
Olga (NBIS) scRNA-seq DE May 2018 9 43
Common methods
Common methods
Olga (NBIS) scRNA-seq DE May 2018 10 43
Common methods
Simplified scRNA-seq workflow [adopted from httphemberg-labgithubio
Olga (NBIS) scRNA-seq DE May 2018 11 43
Common methods
Genericparametric tests eg t-testnon-parametric tests eg Kruskal-Wallis
RNA-seq basededgeRlimmaDEseq2
scRNA-seq specificMAST SCDE MonocleD3E Pagoda
Olga (NBIS) scRNA-seq DE May 2018 12 43
Common methods
Miao and Zhang 2016
Olga (NBIS) scRNA-seq DE May 2018 13 43
Common methodsSupplementary Table 2 Evaluated dicrarrerential expression methods together with package versions and thetype of input values provided to each of them Note that ldquoraw countsrdquo here refers to length-scaled TPMswhich are on the scale of the raw counts but are unacrarrected by dicrarrerential isoform usage [10] CPM valuesare calculated with edgeR and Census counts with monocle
Short name Method Software version InputAvailablefrom
Reference
BPSC BPSC BPSC 09901 CPM GitHub [11]
D3E D3E D3E 10 raw counts GitHub [12]
DESeq2 DESeq2 DESeq2 1141 raw counts Bioconductor [13]
DESeq2betapFALSE DESeq2 without beta prior DESeq2 1141 raw counts Bioconductor [13]
DESeq2census DESeq2 DESeq2 1141 Census counts Bioconductor [13]
DESeq2nofiltDESeq2 without the built-in in-dependent filtering
DESeq2 1141 raw counts Bioconductor [13]
DEsingle DEsingle DEsingle 010 raw counts GitHub [14]
edgeRLRT edgeRLRT edgeR 3191 raw counts Bioconductor [15ndash17]
edgeRLRTcensus edgeRLRT edgeR 3191 Census counts Bioconductor [15ndash17]
edgeRLRTdeconvedgeRLRT with deconvolutionnormalization
edgeR 3191scran 120
raw counts Bioconductor [15 17 18]
edgeRLRTrobustedgeRLRT with robust disper-sion estimation
edgeR 3191 raw counts Bioconductor [15ndash17 19]
edgeRQLF edgeRQLF edgeR 3191 raw counts Bioconductor [15 16 20]
edgeRQLFDetRateedgeRQLF with cellular detec-tion rate as covariate
edgeR 3191 raw counts Bioconductor [15 16 20]
limmatrend limma-trend limma 33013 log2(CPM) Bioconductor [21 22]
MASTcpm MAST MAST 105 log2(CPM+1) Bioconductor [23]
MASTcpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(CPM+1) Bioconductor [23]
MASTtpm MAST MAST 105 log2(TPM+1) Bioconductor [23]
MASTtpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(TPM+1) Bioconductor [23]
metagenomeSeq metagenomeSeqmetagenomeSeq1160
raw counts Bioconductor [24]
monocle monocle (tobit) monocle 220 TPM Bioconductor [25]
monoclecensus monocle (Negative Binomial) monocle 220 Census counts Bioconductor [25 26]
monoclecount monocle (Negative Binomial) monocle 220 raw counts Bioconductor [25]
NODES NODESNODES0009010
raw countsAuthor-providedlink
[27]
ROTScpm ROTS ROTS 120 CPM Bioconductor [28 29]
ROTStpm ROTS ROTS 120 TPM Bioconductor [28 29]
ROTSvoom ROTS ROTS 120voom-transformedraw counts
Bioconductor [28 29]
SAMseq SAMseq samr 20 raw counts CRAN [30]
scDD scDD scDD 100 raw counts Bioconductor [31]
SCDE SCDE scde 220 raw counts Bioconductor [32]
SeuratBimod Seurat (bimod test) Seurat 1407 raw counts GitHub [33 34]
SeuratBimodnofiltSeurat (bimod test) without theinternal filtering
Seurat 1407 raw counts GitHub [33 34]
SeuratBimodIsExpr2Seurat (bimod test) with internalexpression threshold set to 2
Seurat 1407 raw counts GitHub [33 34]
SeuratTobit Seurat (tobit test) Seurat 1407 TPM GitHub [25 33]
ttest t-test stats (R v 33)TMM-normalizedTPM
CRAN [16 35]
voomlimma voom-limma limma 33013 raw counts Bioconductor [21 22]
Wilcoxon Wilcoxon test stats (R v 33)TMM-normalizedTPM
CRAN [16 36]
3
Nature Methods doi101038nmeth4612Soneson and Robinson 2018
Olga (NBIS) scRNA-seq DE May 2018 14 43
Common methods More detailed examples
More detailed examples
Olga (NBIS) scRNA-seq DE May 2018 15 43
Common methods More detailed examples
MAST
uses generalized linear hurdle model
designed to account for stochastic dropouts and bimodal expression distribution in whichexpression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene gindicating whether gene g is expressed in cell i (ie Zig = 0 if yig = 0 and zig = 1 ifyig gt 0)
A logistic regression model for the discrete variable Z and a Gaussian linear model for thecontinuous variable (Y|Z=1)
logit(Pr (Zig = 1)) = XiβDg
Pr (Yig = Y |Zig = 1) = N(XiβCg σ
2g) where Xi is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
Olga (NBIS) scRNA-seq DE May 2018 16 43
Common methods More detailed examples
SCDE
models the read counts for each gene using a mixture of a NB negative binomial and aPoisson distribution
NB distribution models the transcripts that are amplified and detected
Poisson distribution models the unobserved or background-level signal of transcripts thatare not amplified (eg dropout events)
subset of robust genes is used to fit via EM algorithm the parameters to the mixture ofmodels
For DE the posterior probability that the gene shows a fold expression difference betweentwo conditions is computed using a Bayesian approach
Olga (NBIS) scRNA-seq DE May 2018 17 43
Common methods More detailed examples
Monocole
Originally designed for ordering cells by progress through differentiation stages(pseudo-time)
The mean expression level of each gene is modeled with a GAM generalized additivemodel which relates one or more predictor variables to a response variable as
g(E(Y )) = β0 + f1(x1) + f2(x2) + + fm(xm) where Y is a specific gene expression level xi arepredictor variables g is a link function typically log function and fi are non-parametric functions
(eg cubic splines)
The observable expression level Y is then modeled using GAM
E(Y ) = s(ϕt (bx si )) + ε where ϕt (bx si ) is the assigned pseudo-time of a cell and s is a cubicsmoothing function with three degrees of freedom The error term ε is normally distributed with amean of zero
The DE test is performed using an approx χ2 likelihood ratio test
Olga (NBIS) scRNA-seq DE May 2018 18 43
Common methods More detailed examples
Letrsquos stop for a minute
Olga (NBIS) scRNA-seq DE May 2018 19 43
Common methods More detailed examples
The key
Outcomei = (Modeli) + errori
we collect data on a sample from a much larger population
statistics lets us to make inferences about the population from which sample wasderived
we try to predict the outcome given a model fitted to the data
Olga (NBIS) scRNA-seq DE May 2018 20 43
Common methods More detailed examples
The key
t = x1minusx2
sp
radic1
n1+ 1
n2
height [cm]
Fre
quen
cy
165 170 175 180
010
3050
Olga (NBIS) scRNA-seq DE May 2018 21 43
Common methods More detailed examples
Generic recipemodel data eg gene expressionfit model to the data andor data to the modelestimate model parametersuse model for prediction andor inference
Olga (NBIS) scRNA-seq DE May 2018 22 43
Common methods More detailed examples
MAST (revisited)
uses generalized linear hurdle model
designed to account for stochastic dropouts and bimodal expression distribution in whichexpression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene gindicating whether gene g is expressed in cell i (ie Zig = 0 if yig = 0 and zig = 1 ifyig gt 0)
A logistic regression model for the discrete variable Z and a Gaussian linear model for thecontinuous variable (Y|Z=1)
logit(Pr (Zig = 1)) = XiβDg
Pr (Yig = Y |Zig = 1) = N(XiβCg σ
2g) where Xi is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
Olga (NBIS) scRNA-seq DE May 2018 23 43
Common methods More detailed examples
Generic recipemodel eg gene expression with random errorfit model to the data andor data to the model estimate modelparametersuse model for prediction andor inference
Important implicationthe better model fits to the data the better statistics
Olga (NBIS) scRNA-seq DE May 2018 24 43
Common methods More detailed examples
Common distributions
Negative Binomial
Read Counts
Fre
quen
cy
5 10 15 20
020
4060
8010
012
0
Zerominusinflated NB
Read Counts
Fre
quen
cy
0 5 10 15 20
010
020
030
040
050
0
PoissonminusBeta
Read Counts
Fre
quen
cy
0 20 40 60 80 120
010
020
030
0
Olga (NBIS) scRNA-seq DE May 2018 25 43
Common methods More detailed examples
Common distributions
Negative Binomial
Read Counts
Fre
quen
cy
0 5 10 15 20
010
020
030
040
050
0
Negative Binomial
Read Counts
Fre
quen
cy
0 2 4 6 8 10
050
100
150
Negative Binomial
Read Counts
Fre
quen
cy
0 2 4 6 8 10 12
050
100
150
200
Olga (NBIS) scRNA-seq DE May 2018 26 43
Performance
Performance
Olga (NBIS) scRNA-seq DE May 2018 27 43
Supplementary Table 2 Evaluated dicrarrerential expression methods together with package versions and thetype of input values provided to each of them Note that ldquoraw countsrdquo here refers to length-scaled TPMswhich are on the scale of the raw counts but are unacrarrected by dicrarrerential isoform usage [10] CPM valuesare calculated with edgeR and Census counts with monocle
Short name Method Software version InputAvailablefrom
Reference
BPSC BPSC BPSC 09901 CPM GitHub [11]
D3E D3E D3E 10 raw counts GitHub [12]
DESeq2 DESeq2 DESeq2 1141 raw counts Bioconductor [13]
DESeq2betapFALSE DESeq2 without beta prior DESeq2 1141 raw counts Bioconductor [13]
DESeq2census DESeq2 DESeq2 1141 Census counts Bioconductor [13]
DESeq2nofiltDESeq2 without the built-in in-dependent filtering
DESeq2 1141 raw counts Bioconductor [13]
DEsingle DEsingle DEsingle 010 raw counts GitHub [14]
edgeRLRT edgeRLRT edgeR 3191 raw counts Bioconductor [15ndash17]
edgeRLRTcensus edgeRLRT edgeR 3191 Census counts Bioconductor [15ndash17]
edgeRLRTdeconvedgeRLRT with deconvolutionnormalization
edgeR 3191scran 120
raw counts Bioconductor [15 17 18]
edgeRLRTrobustedgeRLRT with robust disper-sion estimation
edgeR 3191 raw counts Bioconductor [15ndash17 19]
edgeRQLF edgeRQLF edgeR 3191 raw counts Bioconductor [15 16 20]
edgeRQLFDetRateedgeRQLF with cellular detec-tion rate as covariate
edgeR 3191 raw counts Bioconductor [15 16 20]
limmatrend limma-trend limma 33013 log2(CPM) Bioconductor [21 22]
MASTcpm MAST MAST 105 log2(CPM+1) Bioconductor [23]
MASTcpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(CPM+1) Bioconductor [23]
MASTtpm MAST MAST 105 log2(TPM+1) Bioconductor [23]
MASTtpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(TPM+1) Bioconductor [23]
metagenomeSeq metagenomeSeqmetagenomeSeq1160
raw counts Bioconductor [24]
monocle monocle (tobit) monocle 220 TPM Bioconductor [25]
monoclecensus monocle (Negative Binomial) monocle 220 Census counts Bioconductor [25 26]
monoclecount monocle (Negative Binomial) monocle 220 raw counts Bioconductor [25]
NODES NODESNODES0009010
raw countsAuthor-providedlink
[27]
ROTScpm ROTS ROTS 120 CPM Bioconductor [28 29]
ROTStpm ROTS ROTS 120 TPM Bioconductor [28 29]
ROTSvoom ROTS ROTS 120voom-transformedraw counts
Bioconductor [28 29]
SAMseq SAMseq samr 20 raw counts CRAN [30]
scDD scDD scDD 100 raw counts Bioconductor [31]
SCDE SCDE scde 220 raw counts Bioconductor [32]
SeuratBimod Seurat (bimod test) Seurat 1407 raw counts GitHub [33 34]
SeuratBimodnofiltSeurat (bimod test) without theinternal filtering
Seurat 1407 raw counts GitHub [33 34]
SeuratBimodIsExpr2Seurat (bimod test) with internalexpression threshold set to 2
Seurat 1407 raw counts GitHub [33 34]
SeuratTobit Seurat (tobit test) Seurat 1407 TPM GitHub [25 33]
ttest t-test stats (R v 33)TMM-normalizedTPM
CRAN [16 35]
voomlimma voom-limma limma 33013 raw counts Bioconductor [21 22]
Wilcoxon Wilcoxon test stats (R v 33)TMM-normalizedTPM
CRAN [16 36]
3
Nature Methods doi101038nmeth4612
Performance
Olga (NBIS) scRNA-seq DE May 2018 28 43
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE May 2018 29 43
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE May 2018 29 43
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE May 2018 29 43
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE May 2018 29 43
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE May 2018 29 43
Performance
False positives (type I error) vs false negatives (type II error)Sensitivity and specificityPrecision and recall
adapted from Wikipedia
Olga (NBIS) scRNA-seq DE May 2018 30 43
Performance
False positives (type I error) vs false negatives (type II error)Sensitivity and specificityPrecision and recall
Dal Molin Baruzzo and Di Camillo 2017 2 conditions of 100 cells each simulated with 10 000genes out of which 2 000 set to DEs (based on NB and bimodal distributions)
Olga (NBIS) scRNA-seq DE May 2018 31 43
Performance
Consistency
Miao et al 2017
Olga (NBIS) scRNA-seq DE May 2018 32 43
Performance
And so much more
Soneson and Robinson 2018
Bias robustness and scalability in single-celldifferential expression analysis
36 statistical approaches for DE analysisto compare the expression levels in thetwo groups of cells
based on 9 data sets with 11 - 21separate instances (sample size effect)
extensive evaluation metrics incl numberof genes found characteristics of the falsepositive detections robustness ofmethods similarities between methodsetc
conquer a collection of consistentlyprocessed analysis-ready publicscRNA-seq data sets
Olga (NBIS) scRNA-seq DE May 2018 33 43
Practicalities
Practicalities
Olga (NBIS) scRNA-seq DE May 2018 34 43
Practicalities
Getting to know your dataExample data 46078 genes x 96 cells22229 genes with no expression at all
Read Counts
Fre
quen
cy
0 500 1000 1500
050
0015
000
0 counts
Fre
quen
cy
0 20 40 60 800
2000
4000
6000
Olga (NBIS) scRNA-seq DE May 2018 35 43
Practicalities
Choosing DE methods
Soneson and Robinson 2018
Olga (NBIS) scRNA-seq DE May 2018 36 43
Practicalities
Rembering the bigger picture
Stegle Teichmann and Marioni 2015
QC filtering
Cell-cycle phase
Normalization of cell-specific biases
Confounding factors incl batcheffects
Detection rate ie the fraction ofdetected genes per cell
Imputations strategies for dropoutvalues
What is pragmatic programminglanguage platform speedcollaborative workflows etc
Olga (NBIS) scRNA-seq DE May 2018 37 43
Practicalities
Staying critical
Olga (NBIS) scRNA-seq DE May 2018 38 43
Summary
What to remember from this hour
httpswwwmenticom amp 70 52 87
Olga (NBIS) scRNA-seq DE May 2018 39 43
Summary
Growing field
Angerer et al 2017
Olga (NBIS) scRNA-seq DE May 2018 40 43
Summary
Growing field
httpswwwscrna-toolsorgtools
Zappia Phipson and Oshlack 2018
Olga (NBIS) scRNA-seq DE May 2018 41 43
Summary
SummaryscRNA-seq is a rapidly growing field
DE is a common task so many newer and better methods will be developed
understanding basic statistical concepts enables one to think more like a statistician tochoose and evaluate methods given data set
staying critical staying updated staying connected
Olga (NBIS) scRNA-seq DE May 2018 42 43
Bibliography
Wu, Zhijin, et al. 2017. “Two-phase differential expression analysis for single cell RNA-seq.” Bioinformatics 00 (00): 1–9. ISSN: 1367-4803. doi:10.1093/bioinformatics/bty329.
Miao, Zhun, and Xuegong Zhang. 2016. “Differential expression analyses for single-cell RNA-Seq: old questions on new data.” Quantitative Biology 4 (4): 243–260. ISSN: 20954697. doi:10.1007/s40484-016-0089-7.
Soneson, Charlotte, and Mark D. Robinson. 2018. “Bias, robustness and scalability in single-cell differential expression analysis.” Nature Methods 15 (4): 255–261. ISSN: 15487105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612.
Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. “Single-cell RNA-sequencing: Assessment of differential expression analysis methods.” Frontiers in Genetics 8 (MAY). ISSN: 16648021. doi:10.3389/fgene.2017.00062.
Miao, Zhun, et al. 2017. “DEsingle for detecting three types of differential expression in single-cell RNA-seq data,” no. May: 1–2. ISSN: 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549.
Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. “Computational and analytical challenges in single-cell transcriptomics.” Nature Reviews Genetics 16 (January 2014): 133–145.
Angerer, Philipp, et al. 2017. “Single cells make big data: New challenges and opportunities in transcriptomics.” Current Opinion in Systems Biology 4: 85–91. ISSN: 24523100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004.
Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. “Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database.” bioRxiv: 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573.
Introduction
Figure: Simplified scRNA-seq workflow [adapted from Wikipedia]
adapted from Wu et al. 2017
Differential expression means taking read count data and performing statistical analysis to discover quantitative changes in expression levels between experimental groups,
i.e. to decide whether, for a given gene, an observed difference in read counts is significant (greater than what would be expected just due to natural random variation)
Differential expression is an old problem, known from bulk RNA-seq and microarray studies,
in fact building on one of the most common statistical problems, i.e. comparing groups for statistical differences
Differential expression is an old problem
So what is all the commotion about?
https://www.menti.com & 70 52 87
scRNA-seq special characteristics
high noise levels (technical and biological factors)
low library sizes
low amount of available mRNAs results in amplification biases and dropout events
3' bias, partial coverage and uneven depth (technical)
stochastic nature of transcription (biological)
multimodality in gene expression: presence of multiple possible cell states within a cell population (biological)
[Figure: per-gene density plots (x-axis: value, y-axis: density) for 20 genes: 1300018J18Rik, Arid2, Bend3, Ccdc104, Ccnt1, Crispld2, Fbxw13, Hbxip, Katna1, Lcorl, Mybpc1, Nars, Ndufa3, Nono, Pgam2, Rbm17, Rragc, Slc1a3, Slc22a20, Smarcd1]
Based on tutorial data
Common methods
Simplified scRNA-seq workflow [adopted from http://hemberg-lab.github.io
Generic
parametric tests, e.g. t-test
non-parametric tests, e.g. Kruskal-Wallis
RNA-seq based
edgeR
limma
DESeq2
scRNA-seq specific
MAST, SCDE, Monocle
D3E, Pagoda
Miao and Zhang 2016
Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that “raw counts” here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle.
Short name | Method | Software version | Input | Available from | Reference
BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11]
D3E | D3E | D3E 1.0 | raw counts | GitHub | [12]
DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13]
DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14]
edgeRLRT | edgeRLRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17]
edgeRLRTcensus | edgeRLRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17]
edgeRLRTdeconv | edgeRLRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18]
edgeRLRTrobust | edgeRLRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19]
edgeRQLF | edgeRQLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
edgeRQLFDetRate | edgeRQLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22]
MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24]
monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25]
monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26]
monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25]
NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27]
ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29]
ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29]
ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29]
SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30]
scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31]
SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32]
SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33]
ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35]
voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22]
Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36]
Nature Methods, doi:10.1038/nmeth.4612
Soneson and Robinson 2018
More detailed examples
MAST
uses a generalized linear hurdle model
designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene g, with Z indicating whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
A logistic regression model for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):
$\operatorname{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$
$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
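The two-part idea can be sketched for a single gene and a two-group design. This is a simplified illustration, not the MAST implementation: it uses plain maximum likelihood (no empirical Bayes shrinkage), only a group covariate, and hypothetical names such as `hurdle_lrt`. Input values are assumed to be on a log scale, with 0 meaning "not detected".

```python
import math


def _bernoulli_loglik(k, n):
    """Maximised log-likelihood of k detections among n cells."""
    if n == 0 or k == 0 or k == n:
        return 0.0
    p = k / n
    return k * math.log(p) + (n - k) * math.log(1 - p)


def _rss(values):
    """Residual sum of squares around the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def hurdle_lrt(group_a, group_b):
    """Likelihood ratio test for one gene under a two-part (hurdle) model.

    Discrete part: detection rate per group (logistic model with a group term).
    Continuous part: Gaussian model for the non-zero values with a group term.
    Returns (statistic, degrees of freedom, p-value).
    """
    pos_a = [y for y in group_a if y > 0]
    pos_b = [y for y in group_b if y > 0]

    # Discrete part: separate detection rates vs one common rate.
    stat_d = 2 * (
        _bernoulli_loglik(len(pos_a), len(group_a))
        + _bernoulli_loglik(len(pos_b), len(group_b))
        - _bernoulli_loglik(len(pos_a) + len(pos_b), len(group_a) + len(group_b))
    )

    # Continuous part: separate means vs one common mean (shared variance).
    stat_c, df = 0.0, 1
    if len(pos_a) >= 2 and len(pos_b) >= 2:
        rss_null = _rss(pos_a + pos_b)
        rss_alt = _rss(pos_a) + _rss(pos_b)
        if rss_alt > 0:
            stat_c = (len(pos_a) + len(pos_b)) * math.log(rss_null / rss_alt)
            df = 2

    stat = max(stat_d + stat_c, 0.0)
    # Chi-square survival function, closed form for 1 and 2 degrees of freedom.
    if df == 2:
        p_value = math.exp(-stat / 2)
    else:
        p_value = math.erfc(math.sqrt(stat / 2))
    return stat, df, p_value


# Toy log-expression values (assumption), 8 cells per group.
control = [0, 0, 0, 0, 0, 0, 1.0, 1.2]
treated = [3.0, 3.5, 4.0, 3.2, 3.8, 0, 4.1, 3.6]
print(hurdle_lrt(control, treated))
```

Summing the two statistics reflects that a gene can differ between groups in how often it is detected, in how strongly it is expressed when detected, or in both.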
SCDE
models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution
the NB distribution models the transcripts that are amplified and detected
the Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)
a subset of robust genes is used to fit, via the EM algorithm, the parameters of the mixture of models
For DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach
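The mixture idea can be illustrated by asking how likely an observed count is to come from the dropout component. This is a sketch under stated assumptions, not the scde package: the mixing weight, NB parameters and the background rate of 0.1 are illustrative values chosen here, and the function names are hypothetical.

```python
import math


def nb_pmf(k, mean, size):
    """Negative binomial probability of count k (mean/size parametrisation)."""
    log_p = (
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(size / (size + mean))
        + k * math.log(mean / (size + mean))
    )
    return math.exp(log_p)


def poisson_pmf(k, rate):
    """Poisson probability of count k."""
    return math.exp(k * math.log(rate) - rate - math.lgamma(k + 1))


def dropout_posterior(k, dropout_prob, nb_mean, nb_size, background_rate=0.1):
    """Posterior probability that count k comes from the dropout component.

    Mixture: dropout_prob * Poisson(background_rate)
             + (1 - dropout_prob) * NB(nb_mean, nb_size)
    """
    dropout = dropout_prob * poisson_pmf(k, background_rate)
    amplified = (1 - dropout_prob) * nb_pmf(k, nb_mean, nb_size)
    return dropout / (dropout + amplified)


# Illustrative parameters (assumption): 30% dropouts, NB mean 20, size 2.
for count in (0, 1, 5, 20):
    print(count, dropout_posterior(count, 0.3, nb_mean=20, nb_size=2))
```

A zero count is attributed mostly to the dropout component, while larger counts are attributed almost entirely to the amplified (NB) component.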
Monocle
Originally designed for ordering cells by progress through differentiation stages (pseudo-time)
The mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as
$g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$
where Y is a specific gene expression level, $x_i$ are predictor variables, g is a link function (typically the log function) and $f_i$ are non-parametric functions (e.g. cubic splines)
The observable expression level Y is then modeled using the GAM
$E(Y) = s(\phi_t(b_x, s_i)) + \epsilon$
where $\phi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and s is a cubic smoothing function with three degrees of freedom. The error term $\epsilon$ is normally distributed with a mean of zero
The DE test is performed using an approximate $\chi^2$ likelihood ratio test
Let's stop for a minute
The key
$\text{Outcome}_i = (\text{Model}_i) + \text{error}_i$
we collect data on a sample from a much larger population
statistics lets us make inferences about the population from which the sample was derived
we try to predict the outcome given a model fitted to the data
$t = \dfrac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$
[Figure: histogram of height [cm] (y-axis: Frequency)]
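The two-sample t statistic can be computed directly. The sketch below assumes that $s_p$ denotes the pooled standard deviation of the two groups, which is the usual reading of this formula; the height values are made up for illustration.

```python
import math


def pooled_t_statistic(x1, x2):
    """Two-sample t statistic with pooled standard deviation.

    Returns the statistic and its degrees of freedom (n1 + n2 - 2).
    """
    n1, n2 = len(x1), len(x2)
    mean1, mean2 = sum(x1) / n1, sum(x2) / n2
    # Sums of squared deviations within each group.
    ss1 = sum((v - mean1) ** 2 for v in x1)
    ss2 = sum((v - mean2) ** 2 for v in x2)
    # Pooled standard deviation s_p.
    sp = math.sqrt((ss1 + ss2) / (n1 + n2 - 2))
    t = (mean1 - mean2) / (sp * math.sqrt(1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


# Hypothetical heights in cm for two groups.
group1 = [168.0, 171.5, 173.0, 175.5, 170.0]
group2 = [174.0, 177.5, 179.0, 176.0, 180.5]
t_value, df = pooled_t_statistic(group1, group2)
print(t_value, df)
```

The same logic carries over to DE testing: the numerator is the observed difference between groups, and the denominator expresses how much difference would be expected from random variation alone.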
Generic recipe
model data, e.g. gene expression
fit model to the data and/or data to the model
estimate model parameters
use model for prediction and/or inference
MAST (revisited)
uses a generalized linear hurdle model
designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene g, with Z indicating whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
A logistic regression model for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):
$\operatorname{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$
$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
Generic recipe
model, e.g. gene expression with random error
fit model to the data and/or data to the model, estimate model parameters
use model for prediction and/or inference
Important implication
the better the model fits the data, the better the statistics
Common distributions
[Figure: three histograms of read counts (x-axis: Read Counts, y-axis: Frequency) for the Negative Binomial, Zero-inflated NB and Poisson-Beta distributions]
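How zero inflation changes a negative binomial can be sketched with the probability mass functions. The parametrisation (mean and size) and the parameter values are assumptions made for illustration; the function names are hypothetical.

```python
import math


def nb_pmf(k, mean, size):
    """Negative binomial probability of count k (mean/size parametrisation)."""
    log_p = (
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(size / (size + mean))
        + k * math.log(mean / (size + mean))
    )
    return math.exp(log_p)


def zinb_pmf(k, mean, size, zero_prob):
    """Zero-inflated NB: with probability zero_prob the count is forced to 0,
    otherwise it is drawn from NB(mean, size)."""
    base = (1 - zero_prob) * nb_pmf(k, mean, size)
    return zero_prob + base if k == 0 else base


# Illustrative parameters (assumption): mean 5, size 2, 40% extra zeros.
for count in range(6):
    print(count, round(nb_pmf(count, 5, 2), 4), round(zinb_pmf(count, 5, 2, 0.4), 4))
```

Compared with the plain NB, the zero-inflated version puts extra mass at zero and scales all other probabilities down by the same factor.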
[Figure: three histograms of read counts (x-axis: Read Counts, y-axis: Frequency), each for a Negative Binomial distribution]
Performance
Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them (Soneson and Robinson 2018, Nature Methods, doi:10.1038/nmeth.4612)
No ground truth, i.e. no independently validated truth is available for testing
Known data
using data we know something about to get positive controls
Simulated data
null data sets by re-sampling
modeling data sets based on various distributions
Comparing between methods and scenarios
Comparing numbers of DEs, incl. as a function of group size
Investigating results
What do the expression levels and distributions of detected DEs look like?
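One way to build a null data set by re-sampling is to draw two pseudo-groups from cells that all belong to the same biological group, so that any gene called DE between them is a false positive. The sketch below shows only the re-sampling step; the function name `null_split` and the cell identifiers are hypothetical.

```python
import random


def null_split(cell_ids, group_size, seed=0):
    """Randomly draw two disjoint pseudo-groups from a single cell population.

    Since all cells come from the same population, no true differential
    expression is expected between the two returned groups.
    """
    rng = random.Random(seed)
    chosen = rng.sample(cell_ids, 2 * group_size)
    return chosen[:group_size], chosen[group_size:]


# 96 hypothetical cell identifiers, all from one condition.
cells = [f"cell_{i:02d}" for i in range(96)]
pseudo_a, pseudo_b = null_split(cells, group_size=12, seed=1)
print(pseudo_a)
print(pseudo_b)
```

Repeating the split with different seeds and group sizes gives several null instances, which allows the false positive behaviour of a method to be studied as a function of group size.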
False positives (type I error) vs false negatives (type II error)
Sensitivity and specificity
Precision and recall
adapted from Wikipedia
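When the true DE genes are known (e.g. in simulated data), these quantities follow directly from the confusion counts. The sketch below uses made-up gene indices; the function name `de_metrics` is hypothetical.

```python
def de_metrics(called, truth, n_genes):
    """Compare genes called DE with the known truth.

    called, truth: collections of gene identifiers; n_genes: total genes tested.
    Ratios with an empty denominator are returned as None.
    """
    called, truth = set(called), set(truth)
    tp = len(called & truth)      # true positives
    fp = len(called - truth)      # false positives (type I errors)
    fn = len(truth - called)      # false negatives (type II errors)
    tn = n_genes - tp - fp - fn   # true negatives

    def ratio(num, den):
        return num / den if den else None

    return {
        "sensitivity": ratio(tp, tp + fn),  # also called recall
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
    }


# Toy example (assumption): 10 genes, 4 truly DE, 4 called DE.
metrics = de_metrics(called=[0, 1, 2, 7], truth=[0, 1, 2, 3], n_genes=10)
print(metrics)
```

Sensitivity and recall are the same quantity; specificity depends on the true negatives, which dominate in DE analyses where most genes are not DE, so precision is often the more informative companion to recall.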
Dal Molin, Baruzzo and Di Camillo 2017: 2 conditions of 100 cells each, simulated with 10 000 genes, out of which 2 000 set to DEs (based on NB and bimodal distributions)
Consistency
Miao et al 2017
And so much more
Soneson and Robinson 2018
Bias, robustness and scalability in single-cell differential expression analysis
36 statistical approaches for DE analysis to compare the expression levels in the two groups of cells
based on 9 data sets with 11 - 21 separate instances (sample size effect)
extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.
conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets
Introduction
Introduction
Olga (NBIS) scRNA-seq DE May 2018 4 43
Introduction
Figure Simplified scRNA-seq workflow [adapted from Wikipedia]
Olga (NBIS) scRNA-seq DE May 2018 5 43
Introduction
Figure Simplified scRNA-seq workflow [adapted from Wikipedia]
Olga (NBIS) scRNA-seq DE May 2018 6 43
Introduction
adapted from Wu et al 2017
Differential expression meanstaking read count data amp
performing statistical analysis to discoverquantitative changes in expression levels betweenexperimental groups
ie to decide whether for a given gene anobserved difference in read counts is significant(greater than what would be expected just due tonatural random variation)
Differential expression is an old problemknown from bulk RNA-seq and microarray studies
in fact building on one of the most commonstatistical problems ie comparing groups forstatistical differences
Olga (NBIS) scRNA-seq DE May 2018 7 43
Introduction
adapted from Wu et al 2017
Differential expression meanstaking read count data amp
performing statistical analysis to discoverquantitative changes in expression levels betweenexperimental groups
ie to decide whether for a given gene anobserved difference in read counts is significant(greater than what would be expected just due tonatural random variation)
Differential expression is an old problemknown from bulk RNA-seq and microarray studies
in fact building on one of the most commonstatistical problems ie comparing groups forstatistical differences
Olga (NBIS) scRNA-seq DE May 2018 7 43
Introduction
Differential expression is an old problem
So what is all the commotion about
httpswwwmenticom amp 70 52 87
scRNA-seq special characteristicshigh noise levels (technical and biological factors)low library sizeslow amount of available mRNAs results in amplification biases anddropout events3rsquo bias partial coverage and uneven depth (technical)stochastic nature of transcription (biological)multimodality in gene expression presence of multiple possiblecell states within a cell population (biological)
Olga (NBIS) scRNA-seq DE May 2018 8 43
Introduction
Differential expression is an old problem
So what is all the commotion about
httpswwwmenticom amp 70 52 87
scRNA-seq special characteristicshigh noise levels (technical and biological factors)low library sizeslow amount of available mRNAs results in amplification biases anddropout events3rsquo bias partial coverage and uneven depth (technical)stochastic nature of transcription (biological)multimodality in gene expression presence of multiple possiblecell states within a cell population (biological)
Olga (NBIS) scRNA-seq DE May 2018 8 43
Introduction
Rbm17 Rragc Slc1a3 Slc22a20 Smarcd1
Mybpc1 Nars Ndufa3 Nono Pgam2
Crispld2 Fbxw13 Hbxip Katna1 Lcorl
1300018J18Rik Arid2 Bend3 Ccdc104 Ccnt1
00 25 50 75 100 00 25 50 75 100 00 25 50 75 100 00 25 50 75 100 00 25 50 75 100
00
01
02
03
0
1
2
3
00
01
02
03
00
01
02
03
00
01
02
03
000
005
010
015
020
025
00
02
04
06
00
01
02
03
04
05
00
01
02
03
00
01
02
03
00
02
04
00
05
10
15
00
01
02
03
00
05
10
15
20
00
02
04
000
005
010
015
0
1
2
3
4
00
05
10
15
00
01
02
03
04
05
00
01
02
03
04
value
dens
ity
Based on tutorial data
Olga (NBIS) scRNA-seq DE May 2018 9 43
Common methods
Common methods
Olga (NBIS) scRNA-seq DE May 2018 10 43
Common methods
Simplified scRNA-seq workflow [adopted from httphemberg-labgithubio
Olga (NBIS) scRNA-seq DE May 2018 11 43
Common methods
Genericparametric tests eg t-testnon-parametric tests eg Kruskal-Wallis
RNA-seq basededgeRlimmaDEseq2
scRNA-seq specificMAST SCDE MonocleD3E Pagoda
Olga (NBIS) scRNA-seq DE May 2018 12 43
Common methods
Miao and Zhang 2016
Olga (NBIS) scRNA-seq DE May 2018 13 43
Common methodsSupplementary Table 2 Evaluated dicrarrerential expression methods together with package versions and thetype of input values provided to each of them Note that ldquoraw countsrdquo here refers to length-scaled TPMswhich are on the scale of the raw counts but are unacrarrected by dicrarrerential isoform usage [10] CPM valuesare calculated with edgeR and Census counts with monocle
Short name Method Software version InputAvailablefrom
Reference
BPSC BPSC BPSC 09901 CPM GitHub [11]
D3E D3E D3E 10 raw counts GitHub [12]
DESeq2 DESeq2 DESeq2 1141 raw counts Bioconductor [13]
DESeq2betapFALSE DESeq2 without beta prior DESeq2 1141 raw counts Bioconductor [13]
DESeq2census DESeq2 DESeq2 1141 Census counts Bioconductor [13]
DESeq2nofiltDESeq2 without the built-in in-dependent filtering
DESeq2 1141 raw counts Bioconductor [13]
DEsingle DEsingle DEsingle 010 raw counts GitHub [14]
edgeRLRT edgeRLRT edgeR 3191 raw counts Bioconductor [15ndash17]
edgeRLRTcensus edgeRLRT edgeR 3191 Census counts Bioconductor [15ndash17]
edgeRLRTdeconvedgeRLRT with deconvolutionnormalization
edgeR 3191scran 120
raw counts Bioconductor [15 17 18]
edgeRLRTrobustedgeRLRT with robust disper-sion estimation
edgeR 3191 raw counts Bioconductor [15ndash17 19]
edgeRQLF edgeRQLF edgeR 3191 raw counts Bioconductor [15 16 20]
edgeRQLFDetRateedgeRQLF with cellular detec-tion rate as covariate
edgeR 3191 raw counts Bioconductor [15 16 20]
limmatrend limma-trend limma 33013 log2(CPM) Bioconductor [21 22]
MASTcpm MAST MAST 105 log2(CPM+1) Bioconductor [23]
MASTcpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(CPM+1) Bioconductor [23]
MASTtpm MAST MAST 105 log2(TPM+1) Bioconductor [23]
MASTtpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(TPM+1) Bioconductor [23]
metagenomeSeq metagenomeSeqmetagenomeSeq1160
raw counts Bioconductor [24]
monocle monocle (tobit) monocle 220 TPM Bioconductor [25]
monoclecensus monocle (Negative Binomial) monocle 220 Census counts Bioconductor [25 26]
monoclecount monocle (Negative Binomial) monocle 220 raw counts Bioconductor [25]
NODES NODESNODES0009010
raw countsAuthor-providedlink
[27]
ROTScpm ROTS ROTS 120 CPM Bioconductor [28 29]
ROTStpm ROTS ROTS 120 TPM Bioconductor [28 29]
ROTSvoom ROTS ROTS 120voom-transformedraw counts
Bioconductor [28 29]
SAMseq SAMseq samr 20 raw counts CRAN [30]
scDD scDD scDD 100 raw counts Bioconductor [31]
SCDE SCDE scde 220 raw counts Bioconductor [32]
SeuratBimod Seurat (bimod test) Seurat 1407 raw counts GitHub [33 34]
SeuratBimodnofiltSeurat (bimod test) without theinternal filtering
Seurat 1407 raw counts GitHub [33 34]
SeuratBimodIsExpr2Seurat (bimod test) with internalexpression threshold set to 2
Seurat 1407 raw counts GitHub [33 34]
SeuratTobit Seurat (tobit test) Seurat 1407 TPM GitHub [25 33]
ttest t-test stats (R v 33)TMM-normalizedTPM
CRAN [16 35]
voomlimma voom-limma limma 33013 raw counts Bioconductor [21 22]
Wilcoxon Wilcoxon test stats (R v 33)TMM-normalizedTPM
CRAN [16 36]
3
Nature Methods doi101038nmeth4612Soneson and Robinson 2018
Olga (NBIS) scRNA-seq DE May 2018 14 43
Common methods More detailed examples
More detailed examples
Olga (NBIS) scRNA-seq DE May 2018 15 43
Common methods More detailed examples
MAST
uses generalized linear hurdle model
designed to account for stochastic dropouts and bimodal expression distribution in whichexpression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene gindicating whether gene g is expressed in cell i (ie Zig = 0 if yig = 0 and zig = 1 ifyig gt 0)
A logistic regression model for the discrete variable Z and a Gaussian linear model for thecontinuous variable (Y|Z=1)
logit(Pr (Zig = 1)) = XiβDg
Pr (Yig = Y |Zig = 1) = N(XiβCg σ
2g) where Xi is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
Olga (NBIS) scRNA-seq DE May 2018 16 43
Common methods More detailed examples
SCDE
models the read counts for each gene using a mixture of a NB negative binomial and aPoisson distribution
NB distribution models the transcripts that are amplified and detected
Poisson distribution models the unobserved or background-level signal of transcripts thatare not amplified (eg dropout events)
subset of robust genes is used to fit via EM algorithm the parameters to the mixture ofmodels
For DE the posterior probability that the gene shows a fold expression difference betweentwo conditions is computed using a Bayesian approach
Olga (NBIS) scRNA-seq DE May 2018 17 43
Common methods More detailed examples
Monocle
Originally designed for ordering cells by progress through differentiation stages (pseudo-time)
The mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as
$g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$
where Y is a specific gene expression level, $x_i$ are predictor variables, g is a link function (typically the log function) and $f_i$ are non-parametric functions (e.g. cubic splines)
The observable expression level Y is then modeled using a GAM:
$E(Y) = s(\phi_t(b_x, s_i)) + \varepsilon$
where $\phi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and s is a cubic smoothing function with three degrees of freedom. The error term $\varepsilon$ is normally distributed with a mean of zero
The DE test is performed using an approximate $\chi^2$ likelihood ratio test
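The test against pseudo-time can be sketched as a comparison of two nested models: a smooth curve in pseudo-time versus a flat line. The example below is a simplification and not the monocle code: it assumes a plain cubic polynomial as a stand-in for the cubic smoothing function, Gaussian errors, and the names `pseudotime_lrt` and `chi2_sf_3df` are hypothetical. The full model has three more parameters than the null model, so the likelihood ratio statistic is compared with a chi-square distribution with 3 degrees of freedom.

```python
import math

import numpy as np


def chi2_sf_3df(x):
    """Survival function (1 - CDF) of the chi-square distribution with 3 df."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


def pseudotime_lrt(pseudotime, expression):
    """Likelihood ratio test of a cubic trend in pseudo-time against a flat mean.

    Returns (statistic, p_value). Assumes Gaussian errors and a fit that is not
    perfect (residual sum of squares of the full model must be positive).
    """
    t = np.asarray(pseudotime, dtype=float)
    y = np.asarray(expression, dtype=float)

    # Full model: columns 1, t, t^2, t^3 fitted by least squares.
    design = np.vander(t, N=4, increasing=True)
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss_full = float(np.sum((y - design @ coef) ** 2))

    # Null model: intercept only.
    rss_null = float(np.sum((y - y.mean()) ** 2))

    # Gaussian likelihood ratio statistic, n * log(RSS_null / RSS_full).
    stat = max(y.size * math.log(rss_null / rss_full), 0.0)
    return stat, chi2_sf_3df(stat)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.linspace(0.0, 1.0, 60)               # hypothetical pseudo-time values
    noise = rng.normal(0.0, 0.3, size=t.size)
    print("trending gene:", pseudotime_lrt(t, 2.0 + 3.0 * t + noise))
    print("flat gene:    ", pseudotime_lrt(t, 2.0 + noise))
```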
Let's stop for a minute
The key
$\text{outcome}_i = \text{model}_i + \text{error}_i$
we collect data on a sample from a much larger population
statistics lets us make inferences about the population from which the sample was derived
we try to predict the outcome given a model fitted to the data
$t = \dfrac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$
[Figure: histogram of height [cm]; y-axis: frequency]
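The t statistic above can be computed in a few lines. The sketch assumes the classical two-sample setting with equal variances, where $s_p$ is the pooled standard deviation of the two groups; the function name and the height values are made up for illustration.

```python
import math
import statistics


def pooled_t_statistic(x1, x2):
    """Two-sample t statistic with pooled standard deviation.

    Returns (t, degrees_of_freedom).
    """
    n1, n2 = len(x1), len(x2)
    var1 = statistics.variance(x1)  # sample variance, divides by n - 1
    var2 = statistics.variance(x2)
    # Pooled standard deviation s_p: weighted average of the two variances.
    sp = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    t = (statistics.mean(x1) - statistics.mean(x2)) / (sp * math.sqrt(1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


if __name__ == "__main__":
    heights_a = [170, 172, 174, 176, 178]  # hypothetical heights in cm
    heights_b = [165, 167, 169, 171, 173]
    print(pooled_t_statistic(heights_a, heights_b))
```

Here the model is "each group has its own mean", the error is the scatter around those means, and the statistic measures the difference in means relative to that scatter.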
Generic recipe
model data, e.g. gene expression
fit model to the data and/or data to the model
estimate model parameters
use model for prediction and/or inference
MAST (revisited)
uses a generalized linear hurdle model
designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene g, where $Z_{ig}$ indicates whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
A logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):
$\operatorname{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$
$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
Generic recipe
model, e.g. gene expression with random error
fit model to the data and/or data to the model, estimate model parameters
use model for prediction and/or inference
Important implication
the better the model fits the data, the better the statistics
Common distributions
[Figure: three histograms of read counts (x-axis: read counts, y-axis: frequency); panels: Negative Binomial, Zero-inflated NB, Poisson-Beta]
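One way to get a feel for these three distributions is to simulate from them. The sketch below uses only the Python standard library; the parameter values are arbitrary choices for illustration, and the helper names are hypothetical. The NB is drawn as a Gamma-Poisson mixture, the zero-inflated NB adds extra zeros on top of the NB, and the Poisson-Beta draws the Poisson rate from a scaled Beta distribution.

```python
import math
import random


def sample_poisson(lam, rng):
    """Poisson draw by Knuth's multiplication method (fine for modest rates)."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def sample_nb(mu, size, rng):
    """Negative binomial as a Gamma-Poisson mixture: mean mu, variance mu + mu**2/size."""
    rate = rng.gammavariate(size, mu / size)  # shape = size, scale = mu / size
    return sample_poisson(rate, rng)


def sample_zinb(mu, size, zero_prob, rng):
    """Zero-inflated NB: an extra zero with probability zero_prob, else an NB draw."""
    if rng.random() < zero_prob:
        return 0
    return sample_nb(mu, size, rng)


def sample_poisson_beta(alpha, beta, scale, rng):
    """Poisson-Beta: Poisson rate is scale times a Beta(alpha, beta) draw."""
    return sample_poisson(scale * rng.betavariate(alpha, beta), rng)


if __name__ == "__main__":
    rng = random.Random(1)
    n = 2000
    samples = {
        "NB": [sample_nb(5.0, 2.0, rng) for _ in range(n)],
        "ZINB": [sample_zinb(5.0, 2.0, 0.4, rng) for _ in range(n)],
        "Poisson-Beta": [sample_poisson_beta(0.5, 0.5, 100.0, rng) for _ in range(n)],
    }
    for name, values in samples.items():
        print(name, "mean:", sum(values) / n, "fraction of zeros:", values.count(0) / n)
```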
[Figure: three histograms of Negative Binomial read counts (x-axis: read counts, y-axis: frequency)]
Performance
Supplementary Table 2. Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle.
Short name | Method | Software version | Input | Available from | Reference
BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11]
D3E | D3E | D3E 1.0 | raw counts | GitHub | [12]
DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13]
DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14]
edgeRLRT | edgeRLRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17]
edgeRLRTcensus | edgeRLRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17]
edgeRLRTdeconv | edgeRLRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18]
edgeRLRTrobust | edgeRLRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19]
edgeRQLF | edgeRQLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
edgeRQLFDetRate | edgeRQLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22]
MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24]
monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25]
monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26]
monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25]
NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27]
ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29]
ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29]
ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29]
SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30]
scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31]
SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32]
SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33]
ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35]
voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22]
Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36]
Source: Soneson and Robinson 2018, Nature Methods, doi:10.1038/nmeth.4612
No ground truth, i.e. no independently validated truth is available for testing
Known data
using data we know something about to get positive controls
Simulated data
null data sets by re-sampling
modeling data sets based on various distributions
Comparing between methods and scenarios
Comparing numbers of DE genes, incl. as a function of group size
Investigating results
What do the expression levels and distributions of detected DE genes look like?
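The null data set idea can be sketched as follows: take cells from a single condition, split them at random into two mock groups, and run a DE test. Since both groups come from the same population, every gene called DE is a false positive. The code below is an illustration under simplifying assumptions: it uses a large-sample z-test as a stand-in for a real DE method, and the function names and simulated values are hypothetical.

```python
import math
import random


def z_test_p_value(a, b):
    """Two-sided large-sample z-test for a difference in means."""
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    var_a = sum((v - mean_a) ** 2 for v in a) / (na - 1)
    var_b = sum((v - mean_b) ** 2 for v in b) / (nb - 1)
    se = math.sqrt(var_a / na + var_b / nb)
    if se == 0:
        return 1.0  # no variation at all: nothing to test
    z = (mean_a - mean_b) / se
    return math.erfc(abs(z) / math.sqrt(2))


def null_false_positive_rate(expression, n_splits, alpha, rng):
    """Average fraction of genes called DE over random splits of one condition.

    expression: list of genes, each a list of values across cells of ONE condition.
    """
    n_cells = len(expression[0])
    half = n_cells // 2
    rates = []
    for _ in range(n_splits):
        order = list(range(n_cells))
        rng.shuffle(order)
        group_a, group_b = order[:half], order[half:]
        calls = 0
        for gene in expression:
            p = z_test_p_value([gene[i] for i in group_a], [gene[i] for i in group_b])
            if p < alpha:
                calls += 1
        rates.append(calls / len(expression))
    return sum(rates) / len(rates)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical data: 200 genes x 60 cells from a single population.
    expression = [[rng.gauss(5.0, 1.0) for _ in range(60)] for _ in range(200)]
    print(null_false_positive_rate(expression, n_splits=5, alpha=0.05, rng=rng))
```

A well-calibrated method should call roughly a fraction alpha of genes on such data; a much larger fraction points to poor type I error control.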
False positives (type I error) vs false negatives (type II error)
Sensitivity and specificity
Precision and recall
[Figure adapted from Wikipedia]
[Figure from Dal Molin, Baruzzo and Di Camillo 2017: 2 conditions of 100 cells each, simulated with 10 000 genes, of which 2 000 were set to be DE (based on NB and bimodal distributions)]
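When the truth is known, as in a simulation, these metrics follow directly from comparing the set of genes a method calls DE with the set of truly DE genes. The sketch below borrows the sizes of the simulated setting above (10 000 genes, 2 000 truly DE); the calls of the method are invented for illustration and do not come from the paper.

```python
def de_metrics(called_de, true_de, n_genes):
    """Performance metrics for one DE method on data with known truth.

    called_de, true_de: sets of gene identifiers; n_genes: total genes tested.
    Assumes at least one gene is called and at least one is truly DE.
    """
    tp = len(called_de & true_de)   # true positives
    fp = len(called_de - true_de)   # false positives (type I errors)
    fn = len(true_de - called_de)   # false negatives (type II errors)
    tn = n_genes - tp - fp - fn     # true negatives
    return {
        "sensitivity": tp / (tp + fn),  # also called recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "false_positive_rate": fp / (fp + tn),
    }


if __name__ == "__main__":
    true_de = set(range(2000))                               # genes 0..1999 are truly DE
    called_de = set(range(1500)) | set(range(2000, 2300))    # hypothetical method output
    print(de_metrics(called_de, true_de, n_genes=10000))
```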
Consistency
Miao et al. 2017
And so much more
Soneson and Robinson 2018
Bias, robustness and scalability in single-cell differential expression analysis
36 statistical approaches for DE analysis, comparing the expression levels in two groups of cells
based on 9 data sets with 11–21 separate instances (sample size effect)
extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.
conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets
Practicalities
Getting to know your data
Example data: 46078 genes x 96 cells
22229 genes with no expression at all
[Figure: two histograms (y-axis: frequency); x-axes: read counts and 0 counts]
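A first look at a count matrix along these lines can be sketched in plain Python. The toy matrix below is invented (4 genes x 3 cells) and the function name is hypothetical; the same summaries on real data would give the number of genes with no expression at all, the number of zero counts per gene, and the fraction of detected genes per cell.

```python
def summarize_counts(counts):
    """Basic summaries of a genes x cells count matrix (list of rows).

    Returns the number of genes with no expression in any cell, the number of
    zero counts per gene, and the detection rate (fraction of genes with a
    non-zero count) per cell.
    """
    n_genes = len(counts)
    n_cells = len(counts[0])
    silent_genes = sum(all(c == 0 for c in gene) for gene in counts)
    zeros_per_gene = [gene.count(0) for gene in counts]
    detection_rate = [
        sum(gene[j] > 0 for gene in counts) / n_genes for j in range(n_cells)
    ]
    return silent_genes, zeros_per_gene, detection_rate


if __name__ == "__main__":
    toy_counts = [
        [0, 0, 0],    # gene never detected
        [5, 0, 2],
        [0, 0, 1],
        [10, 3, 7],
    ]
    silent, zeros, rates = summarize_counts(toy_counts)
    print("genes with no expression:", silent)
    print("zero counts per gene:", zeros)
    print("detection rate per cell:", rates)
```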
Choosing DE methods
Soneson and Robinson 2018
Remembering the bigger picture
Stegle, Teichmann and Marioni 2015
QC filtering
Cell-cycle phase
Normalization of cell-specific biases
Confounding factors, incl. batch effects
Detection rate, i.e. the fraction of detected genes per cell
Imputation strategies for dropout values
What is pragmatic: programming language, platform, speed, collaborative workflows, etc.
Staying critical
Summary
What to remember from this hour
https://www.menti.com & 70 52 87
Growing field
Angerer et al. 2017
https://www.scrna-tools.org/tools
Zappia, Phipson and Oshlack 2018
Summary
scRNA-seq is a rapidly growing field
DE is a common task, so many newer and better methods will be developed
understanding basic statistical concepts enables one to think more like a statistician, and to choose and evaluate methods for a given data set
staying critical, staying updated, staying connected
Bibliography
Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1–9. ISSN 1367-4803. doi:10.1093/bioinformatics/bty329.
Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243–260. ISSN 2095-4697. doi:10.1007/s40484-016-0089-7.
Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255–261. ISSN 1548-7105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612.
Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN 1664-8021. doi:10.3389/fgene.2017.00062.
Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data," no. May: 1–2. ISSN 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549.
Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133–145.
Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85–91. ISSN 2452-3100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004.
Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573.
Dal Molin Alessandra Giacomo Baruzzo and Barbara Di Camillo 2017 ldquoSingle-cellRNA-sequencing Assessment of differential expression analysis methodsrdquo Frontiers inGenetics 8 (MAY) ISSN 16648021 doi103389fgene201700062
Miao Zhun et al 2017 ldquoDEsingle for detecting three types of differential expression in single-cellRNA-seq datardquo no May 1ndash2 ISSN 1367-4803 doi101093bioinformaticsbty332arXiv 103549
Stegle Oliver Sarah A Teichmann and John C Marioni 2015 ldquoComputational and analyticalchallenges in single-cell transcriptomicsrdquo Nature reviews Genetics 16 (January 2014)133ndash145
Angerer Philipp et al 2017 ldquoSingle cells make big data New challenges and opportunities intranscriptomicsrdquo Current Opinion in Systems Biology 485ndash91 ISSN 24523100doi101016jcoisb201707004httpdxdoiorg101016jcoisb201707004
Zappia Luke Belinda Phipson and Alicia Oshlack 2018 ldquoExploring the single-cell RNA-seqanalysis landscape with the scRNA-tools databaserdquo bioRxiv 206573 doi101101206573httpswwwbiorxivorgcontentearly20180323206573
Olga (NBIS) scRNA-seq DE May 2018 43 43
Introduction
Figure: Simplified scRNA-seq workflow [adapted from Wikipedia]
[adapted from Wu et al. 2017]
Differential expression means
- taking read count data and
- performing statistical analysis to discover quantitative changes in expression levels between experimental groups
- i.e. deciding whether, for a given gene, an observed difference in read counts is significant (greater than what would be expected from natural random variation alone)
Differential expression is an old problem
- known from bulk RNA-seq and microarray studies
- in fact, it builds on one of the most common statistical problems, i.e. comparing groups for statistical differences
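A minimal sketch of what "greater than expected from natural random variation" can mean for one gene, assuming a permutation test on the difference in mean read counts. The counts, the function name `permutation_pvalue` and the number of permutations are made up for illustration; none of the packages discussed in this talk is claimed to work this way internally.

```python
import random


def permutation_pvalue(group_a, group_b, n_perm=2000, seed=1):
    """Two-sided permutation p-value for a difference in mean counts."""
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        # Shuffling the group labels mimics "no difference between groups"
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:n_a], pooled[n_a:]
        diff = abs(sum(perm_a) / n_a - sum(perm_b) / len(perm_b))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive
    return (hits + 1) / (n_perm + 1)


# Hypothetical read counts of one gene in two groups of cells
control = [0, 3, 5, 2, 0, 4, 1, 3]
treated = [9, 12, 0, 15, 8, 11, 10, 14]
p_value = permutation_pvalue(control, treated)
print(f"permutation p-value: {p_value:.4f}")
```

A small p-value says that label shuffling rarely produces a difference as large as the observed one.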
Differential expression is an old problem
So what is all the commotion about?
https://www.menti.com & 70 52 87
scRNA-seq special characteristics
- high noise levels (technical and biological factors)
- low library sizes
- low amount of available mRNA, resulting in amplification biases and dropout events
- 3' bias, partial coverage and uneven depth (technical)
- stochastic nature of transcription (biological)
- multimodality in gene expression; presence of multiple possible cell states within a cell population (biological)
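To make the link between low mRNA amounts and dropout events concrete, here is a toy simulation, assuming that every mRNA molecule is captured independently with a fixed probability (binomial thinning). The capture efficiency of 0.1 and the 5 copies per cell are invented numbers, and real protocols are more complicated than this.

```python
import random


def capture(true_molecules, efficiency, rng):
    """Binomial thinning: each molecule is captured with probability `efficiency`."""
    return sum(1 for _ in range(true_molecules) if rng.random() < efficiency)


rng = random.Random(0)
efficiency = 0.1                 # assumed capture efficiency
true_counts = [5] * 1000         # hypothetical: 5 mRNA copies in each of 1000 cells
observed = [capture(c, efficiency, rng) for c in true_counts]

# Fraction of cells in which an expressed gene is observed as zero
dropout_rate = sum(1 for c in observed if c == 0) / len(observed)
# Under this model the chance of missing all 5 copies is (1 - efficiency)^5
expected_rate = (1 - efficiency) ** 5
print(f"simulated dropout rate {dropout_rate:.2f}, expected {expected_rate:.2f}")
```

Even though every cell expresses the gene, more than half of the cells show a zero count under these assumptions.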
Figure: density plots of expression values (x-axis: value, y-axis: density), one panel per gene: Rbm17, Rragc, Slc1a3, Slc22a20, Smarcd1, Mybpc1, Nars, Ndufa3, Nono, Pgam2, Crispld2, Fbxw13, Hbxip, Katna1, Lcorl, 1300018J18Rik, Arid2, Bend3, Ccdc104, Ccnt1. Based on tutorial data.
Common methods
Figure: Simplified scRNA-seq workflow [adapted from http://hemberg-lab.github.io]
Generic
- parametric tests, e.g. t-test
- non-parametric tests, e.g. Kruskal-Wallis
RNA-seq based
- edgeR
- limma
- DESeq2
scRNA-seq specific
- MAST, SCDE, Monocle
- D3E, Pagoda
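As an illustration of the generic non-parametric option, the sketch below computes the Kruskal-Wallis H statistic for the two-group case, with the usual tie correction and a chi-square approximation with one degree of freedom. The function names and the expression values are hypothetical; in practice one would rely on a tested statistics library rather than this hand-written version.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def kruskal_wallis_two_groups(a, b):
    """Tie-corrected H statistic and chi-square (df = 1) p-value."""
    pooled = list(a) + list(b)
    n = len(pooled)
    ranks = average_ranks(pooled)
    rank_sum_a = sum(ranks[:len(a)])
    rank_sum_b = sum(ranks[len(a):])
    h = (12 / (n * (n + 1))
         * (rank_sum_a ** 2 / len(a) + rank_sum_b ** 2 / len(b))
         - 3 * (n + 1))
    # Tie correction: divide by 1 - sum(t^3 - t) / (n^3 - n)
    tie_sizes = {}
    for v in pooled:
        tie_sizes[v] = tie_sizes.get(v, 0) + 1
    correction = 1 - sum(t ** 3 - t for t in tie_sizes.values()) / (n ** 3 - n)
    if correction == 0:          # all values identical: no evidence of a difference
        return 0.0, 1.0
    h = max(h / correction, 0.0)
    # Survival function of a chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(h / 2))
    return h, p_value


# Hypothetical normalized expression of one gene; the many zeros create ties
control = [0, 0, 1.2, 0, 2.5, 0.8]
treated = [3.1, 0, 4.4, 2.9, 5.0, 3.6]
h_stat, p_value = kruskal_wallis_two_groups(control, treated)
print(f"H = {h_stat:.3f}, p = {p_value:.4f}")
```

Because the test only uses ranks, it makes no assumption about the shape of the expression distribution, which is part of its appeal for zero-heavy data.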
Miao and Zhang 2016
Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle.

| Short name | Method | Software version | Input | Available from | Reference |
|---|---|---|---|---|---|
| BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11] |
| D3E | D3E | D3E 1.0 | raw counts | GitHub | [12] |
| DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13] |
| DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14] |
| edgeRLRT | edgeR/LRT | edgeR 3.19.1 | raw counts | Bioconductor | [15-17] |
| edgeRLRTcensus | edgeR/LRT | edgeR 3.19.1 | Census counts | Bioconductor | [15-17] |
| edgeRLRTdeconv | edgeR/LRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18] |
| edgeRLRTrobust | edgeR/LRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15-17, 19] |
| edgeRQLF | edgeR/QLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20] |
| edgeRQLFDetRate | edgeR/QLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20] |
| limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22] |
| MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23] |
| MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23] |
| MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23] |
| MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23] |
| metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24] |
| monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25] |
| monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26] |
| monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25] |
| NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27] |
| ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29] |
| ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29] |
| ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29] |
| SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30] |
| scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31] |
| SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32] |
| SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33] |
| ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35] |
| voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22] |
| Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36] |

Nature Methods, doi:10.1038/nmeth.4612 (Soneson and Robinson 2018)
More detailed examples
MAST
- uses a generalized linear hurdle model
- designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable
- The rate of expression Z and the level of expression Y are modeled for each gene g, with $Z_{ig}$ indicating whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
- A logistic regression model for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):
  $\operatorname{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$
  $\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix
- Model parameters are fitted using an empirical Bayesian framework
- Allows for a joint estimate of nuisance and treatment effects
- DE is determined using the likelihood ratio test
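The two-part idea can be sketched for a single gene and a single two-level factor. This is a strongly simplified illustration and not the MAST implementation: it assumes no covariates and no empirical Bayes shrinkage, uses a shared variance for the Gaussian part, and adds the two likelihood ratio statistics into one chi-square test with 2 degrees of freedom. All function names and values are hypothetical.

```python
import math


def bernoulli_loglik(k, n):
    """Maximised Bernoulli log-likelihood for k detected cells out of n."""
    ll = 0.0
    if k > 0:
        ll += k * math.log(k / n)
    if k < n:
        ll += (n - k) * math.log(1 - k / n)
    return ll


def rss(values):
    """Residual sum of squares around the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def hurdle_lrt(group_a, group_b):
    """Likelihood ratio test of a two-part (hurdle) model, group effect vs none.

    Needs detected cells in both groups and some residual variation among them.
    """
    pos_a = [v for v in group_a if v > 0]      # Y | Z = 1 in group a
    pos_b = [v for v in group_b if v > 0]      # Y | Z = 1 in group b
    # Discrete part: group-specific detection rates vs one pooled rate
    lr_discrete = 2 * (
        bernoulli_loglik(len(pos_a), len(group_a))
        + bernoulli_loglik(len(pos_b), len(group_b))
        - bernoulli_loglik(len(pos_a) + len(pos_b), len(group_a) + len(group_b))
    )
    # Continuous part: group-specific means vs one pooled mean (shared variance)
    n_pos = len(pos_a) + len(pos_b)
    lr_continuous = n_pos * math.log(rss(pos_a + pos_b) / (rss(pos_a) + rss(pos_b)))
    # Chi-square survival function with 2 degrees of freedom is exp(-x / 2)
    p_value = math.exp(-(lr_discrete + lr_continuous) / 2)
    return lr_discrete, lr_continuous, p_value


# Hypothetical log2(TPM+1) values of one gene in two groups of cells
group_a = [0, 0, 0, 2.1, 1.9, 0, 2.4, 0]
group_b = [3.9, 4.2, 0, 4.0, 3.8, 4.4, 4.1, 3.7]
lr_d, lr_c, p = hurdle_lrt(group_a, group_b)
print(f"discrete LR = {lr_d:.2f}, continuous LR = {lr_c:.2f}, p = {p:.2e}")
```

The example separates two sources of evidence: group b is detected more often (discrete part) and, when detected, is expressed at a higher level (continuous part).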
SCDE
- models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution
- the NB distribution models the transcripts that are amplified and detected
- the Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)
- a subset of robust genes is used to fit, via the EM algorithm, the parameters of the mixture of models
- For DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach
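To show how such a mixture separates dropouts from amplified signal, the sketch below computes the posterior probability that an observed count comes from the Poisson (background) component. The mixture weight and the distribution parameters are simply assumed here; SCDE estimates its error models from the data, which this toy example does not attempt.

```python
import math


def nb_pmf(k, mean, size):
    """Negative binomial probability of count k (mean / size parameterisation)."""
    log_p = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
             + size * math.log(size / (size + mean))
             + k * math.log(mean / (size + mean)))
    return math.exp(log_p)


def poisson_pmf(k, rate):
    """Poisson probability of count k."""
    return math.exp(-rate + k * math.log(rate) - math.lgamma(k + 1))


def dropout_posterior(k, dropout_weight, background_rate, mean, size):
    """Posterior probability that count k was generated by the dropout component."""
    dropped = dropout_weight * poisson_pmf(k, background_rate)
    amplified = (1 - dropout_weight) * nb_pmf(k, mean, size)
    return dropped / (dropped + amplified)


# Assumed (not fitted) parameters for one gene in one cell
params = dict(dropout_weight=0.3, background_rate=0.1, mean=50.0, size=2.0)
post_zero = dropout_posterior(0, **params)
post_high = dropout_posterior(40, **params)
print(f"P(dropout | count = 0)  = {post_zero:.3f}")
print(f"P(dropout | count = 40) = {post_high:.3g}")
```

A zero count is attributed almost entirely to the dropout component, whereas a count of 40 is attributed to the amplified (NB) component.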
Monocle
- Originally designed for ordering cells by progress through differentiation stages (pseudo-time)
- The mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as
  $g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$
  where $Y$ is a specific gene expression level, $x_i$ are predictor variables, $g$ is a link function (typically the log function) and $f_i$ are non-parametric functions (e.g. cubic splines)
- The observable expression level Y is then modeled using a GAM:
  $E(Y) = s(\varphi_t(b_x, s_i)) + \epsilon$
  where $\varphi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and $s$ is a cubic smoothing function with three degrees of freedom. The error term $\epsilon$ is normally distributed with a mean of zero
- The DE test is performed using an approximate $\chi^2$ likelihood ratio test
Let's stop for a minute
The key
$\text{Outcome}_i = (\text{Model}_i) + \text{error}_i$
- we collect data on a sample from a much larger population
- statistics lets us make inferences about the population from which the sample was derived
- we try to predict the outcome given a model fitted to the data
$t = \dfrac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$
Figure: histogram of height [cm] (x-axis) against frequency (y-axis)
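The t statistic can be computed by hand in a few lines. The sketch assumes that $s_p$ denotes the pooled standard deviation of the two groups (the slide does not spell this out) and uses made-up heights in cm.

```python
import math


def pooled_t_statistic(x1, x2):
    """Two-sample t statistic with pooled standard deviation s_p."""
    n1, n2 = len(x1), len(x2)
    mean1, mean2 = sum(x1) / n1, sum(x2) / n2
    ss1 = sum((v - mean1) ** 2 for v in x1)
    ss2 = sum((v - mean2) ** 2 for v in x2)
    # Pooled standard deviation: both groups share one variance estimate
    sp = math.sqrt((ss1 + ss2) / (n1 + n2 - 2))
    return (mean1 - mean2) / (sp * math.sqrt(1 / n1 + 1 / n2))


# Hypothetical heights [cm] in two groups
group_1 = [170, 172, 174]
group_2 = [176, 178, 180]
t_value = pooled_t_statistic(group_1, group_2)
print(f"t = {t_value:.4f}")
```

Here the means differ by 6 cm, $s_p = 2$, and the statistic is compared against a t distribution with $n_1 + n_2 - 2 = 4$ degrees of freedom.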
Generic recipe
- model data, e.g. gene expression
- fit model to the data and/or data to the model
- estimate model parameters
- use model for prediction and/or inference
MAST (revisited)
- uses a generalized linear hurdle model
- designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable
- The rate of expression Z and the level of expression Y are modeled for each gene g, with $Z_{ig}$ indicating whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
- A logistic regression model for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):
  $\operatorname{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$
  $\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix
- Model parameters are fitted using an empirical Bayesian framework
- Allows for a joint estimate of nuisance and treatment effects
- DE is determined using the likelihood ratio test
Generic recipe
- model, e.g. gene expression with random error
- fit model to the data and/or data to the model, estimate model parameters
- use model for prediction and/or inference
Important implication
- the better the model fits the data, the better the statistics
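The recipe and its implication can be tried on a handful of counts: fit two candidate models and compare how well each explains the data. The sketch assumes a Poisson model fitted by its mean and a negative binomial fitted by the method of moments (a quick estimate, not what the packages above do); the counts are invented.

```python
import math


def poisson_loglik(counts, rate):
    """Log-likelihood of the counts under a Poisson model."""
    return sum(-rate + k * math.log(rate) - math.lgamma(k + 1) for k in counts)


def nb_loglik(counts, mean, size):
    """Log-likelihood of the counts under a negative binomial model."""
    return sum(
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(size / (size + mean))
        + k * math.log(mean / (size + mean))
        for k in counts
    )


# Hypothetical read counts of one gene across 12 cells (many zeros, few high values)
counts = [0, 0, 1, 0, 12, 3, 0, 25, 0, 7, 0, 2]

# Step 1-3 of the recipe: choose models and estimate their parameters
mean = sum(counts) / len(counts)
variance = sum((k - mean) ** 2 for k in counts) / (len(counts) - 1)
# Method of moments for the NB size parameter; only valid when variance > mean
size = mean ** 2 / (variance - mean)

# Step 4: use the fitted models, here simply to compare their fit
ll_poisson = poisson_loglik(counts, mean)
ll_nb = nb_loglik(counts, mean, size)
print(f"mean = {mean:.2f}, variance = {variance:.2f}, NB size = {size:.3f}")
print(f"log-likelihood: Poisson = {ll_poisson:.1f}, NB = {ll_nb:.1f}")
```

The variance is far larger than the mean, so the Poisson model fits poorly, and any test built on it would be correspondingly unreliable for data like these.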
Common distributions
Figure: histograms of read counts (x-axis) against frequency (y-axis) for the Negative Binomial, Zero-inflated NB and Poisson-Beta distributions
Figure: three further Negative Binomial histograms of read counts (x-axis) against frequency (y-axis)
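Histograms like these can be reproduced by sampling. The sketch below draws counts from the three distributions using only the standard library, assuming the NB is generated as a gamma-Poisson mixture and the Poisson-Beta as a Poisson whose rate is a scaled Beta variable. All parameter values and function names are arbitrary choices for illustration.

```python
import math
import random


def sample_poisson(rate, rng):
    """Knuth's multiplication method; adequate for small to moderate rates."""
    if rate <= 0:
        return 0
    limit = math.exp(-rate)
    k, product = 0, rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k


def sample_nb(mean, size, rng):
    """Negative binomial as a Poisson with a gamma-distributed rate."""
    rate = rng.gammavariate(size, mean / size)   # shape = size, scale = mean / size
    return sample_poisson(rate, rng)


def sample_zinb(mean, size, zero_prob, rng):
    """Zero-inflated NB: an extra zero with probability `zero_prob`, else NB."""
    if rng.random() < zero_prob:
        return 0
    return sample_nb(mean, size, rng)


def sample_poisson_beta(alpha, beta, scale, rng):
    """Poisson whose rate is `scale` times a Beta(alpha, beta) variable."""
    return sample_poisson(scale * rng.betavariate(alpha, beta), rng)


rng = random.Random(42)
n = 5000
nb = [sample_nb(5.0, 2.0, rng) for _ in range(n)]
zinb = [sample_zinb(5.0, 2.0, 0.4, rng) for _ in range(n)]
pb = [sample_poisson_beta(0.5, 0.5, 60.0, rng) for _ in range(n)]

for name, sample in (("NB", nb), ("ZINB", zinb), ("Poisson-Beta", pb)):
    zero_fraction = sample.count(0) / n
    print(f"{name}: mean = {sum(sample) / n:.2f}, zeros = {zero_fraction:.2f}")
```

The zero-inflated sample shows many more zeros than the plain NB with the same mean and size, which is the feature that motivates such models for dropout-heavy data.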
Performance
No ground truth, i.e. no independently validated truth is available for testing
Known data
- using data we know something about to get positive controls
Simulated data
- null data sets obtained by re-sampling
- modeling data sets based on various distributions
Comparing between methods and scenarios
- comparing the numbers of DE genes, incl. as a function of group size
Investigating results
- what do the expression levels and distributions of the detected DE genes look like?
False positives (type I error) vs false negatives (type II error)
Sensitivity and specificity
Precision and recall
Figure adapted from Wikipedia
Dal Molin, Baruzzo and Di Camillo 2017: 2 conditions of 100 cells each, simulated with 10 000 genes, of which 2 000 were set to be DE (based on NB and bimodal distributions)
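These metrics are easy to compute once the truth is known, as in a simulation. The sketch reuses the simulation size above (10 000 genes, 2 000 truly DE) but the outcome of the DE method (1 800 calls, 1 500 of them correct) is invented purely for illustration and is not a result from the cited paper.

```python
def confusion_counts(truth, called):
    """Count true/false positives and negatives from two boolean lists."""
    tp = sum(1 for t, c in zip(truth, called) if t and c)
    fp = sum(1 for t, c in zip(truth, called) if not t and c)
    fn = sum(1 for t, c in zip(truth, called) if t and not c)
    tn = sum(1 for t, c in zip(truth, called) if not t and not c)
    return tp, fp, fn, tn


# 2 000 truly DE genes followed by 8 000 non-DE genes
truth = [True] * 2000 + [False] * 8000
# Hypothetical method output: 1 500 true DE genes found, 300 false calls
called = [True] * 1500 + [False] * 500 + [True] * 300 + [False] * 7700

tp, fp, fn, tn = confusion_counts(truth, called)
sensitivity = tp / (tp + fn)      # also called recall or true positive rate
specificity = tn / (tn + fp)      # true negative rate
precision = tp / (tp + fp)        # 1 - false discovery rate
false_positive_rate = fp / (fp + tn)   # type I error rate among non-DE genes

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"sensitivity/recall = {sensitivity:.3f}, specificity = {specificity:.4f}")
print(f"precision = {precision:.3f}, false positive rate = {false_positive_rate:.4f}")
```

Because non-DE genes dominate, specificity looks high even with 300 false calls, which is why precision and recall are often reported alongside it.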
Consistency
Miao et al. 2017
And so much more
Soneson and Robinson 2018, "Bias, robustness and scalability in single-cell differential expression analysis"
- 36 statistical approaches for DE analysis to compare the expression levels in the two groups of cells
- based on 9 data sets with 11 - 21 separate instances (sample size effect)
- extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.
- conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets
Practicalities
Getting to know your data
Example data: 46078 genes x 96 cells
22229 genes with no expression at all
Figure: histograms of read counts and of the number of 0 counts (x-axis) against frequency (y-axis)
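A first look at a count matrix can be as simple as counting zeros. The sketch uses a tiny made-up genes x cells matrix stored as a list of rows; the gene names and values are hypothetical, and a real data set would normally be handled with a dedicated matrix library.

```python
# Hypothetical count matrix: one row per gene, one column per cell
gene_names = ["geneA", "geneB", "geneC", "geneD"]
counts = [
    [0, 0, 0, 0],   # geneA: no expression at all
    [5, 0, 2, 0],   # geneB
    [0, 0, 0, 0],   # geneC: no expression at all
    [1, 3, 0, 7],   # geneD
]

n_genes, n_cells = len(counts), len(counts[0])

# Genes with a zero count in every cell carry no information for DE testing
genes_never_expressed = sum(1 for row in counts if all(v == 0 for v in row))

# Number of zero counts per gene (the quantity behind a "0 counts" histogram)
zeros_per_gene = [sum(1 for v in row if v == 0) for row in counts]

# Keep only genes detected in at least one cell
expressed = [(name, row) for name, row in zip(gene_names, counts)
             if any(v > 0 for v in row)]

print(f"{n_genes} genes x {n_cells} cells")
print(f"{genes_never_expressed} genes with no expression at all")
print("zeros per gene:", dict(zip(gene_names, zeros_per_gene)))
print("kept genes:", [name for name, _ in expressed])
```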
Choosing DE methods
Soneson and Robinson 2018
Remembering the bigger picture
Stegle, Teichmann and Marioni 2015
- QC filtering
- Cell-cycle phase
- Normalization of cell-specific biases
- Confounding factors, incl. batch effects
- Detection rate, i.e. the fraction of detected genes per cell
- Imputation strategies for dropout values
- What is pragmatic: programming language, platform, speed, collaborative workflows, etc.
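The detection rate from the list above is straightforward to compute and can then be supplied to a DE method as a cell-level covariate (the "DetRate" variants in the Soneson and Robinson table do this). The sketch assumes a small made-up genes x cells matrix and treats any non-zero count as "detected"; the function name is hypothetical.

```python
def detection_rate(counts):
    """Fraction of genes with a non-zero count, computed for each cell (column)."""
    n_genes = len(counts)
    n_cells = len(counts[0])
    return [
        sum(1 for g in range(n_genes) if counts[g][c] > 0) / n_genes
        for c in range(n_cells)
    ]


# Hypothetical count matrix: one row per gene, one column per cell
counts = [
    [0, 0, 0, 0],
    [5, 0, 2, 0],
    [0, 0, 0, 0],
    [1, 3, 0, 7],
]
rates = detection_rate(counts)
print("detection rate per cell:", rates)
```

Cells with very different detection rates can differ for technical reasons alone, which is why the rate is worth checking before interpreting DE results.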
Staying critical
Summary
What to remember from this hour?
https://www.menti.com & 70 52 87
Growing field
Angerer et al. 2017
https://www.scrna-tools.org/tools
Zappia, Phipson and Oshlack 2018
Summary
- scRNA-seq is a rapidly growing field
- DE is a common task, so many newer and better methods will be developed
- understanding basic statistical concepts enables one to think more like a statistician, and so to choose and evaluate methods for a given data set
- staying critical, staying updated, staying connected
Bibliography
Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1-9. ISSN 1367-4803. doi:10.1093/bioinformatics/bty329
Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243-260. ISSN 2095-4697. doi:10.1007/s40484-016-0089-7
Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255-261. ISSN 1548-7105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612
Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN 1664-8021. doi:10.3389/fgene.2017.00062
Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data," no. May: 1-2. ISSN 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549
Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133-145
Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85-91. ISSN 2452-3100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004
Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573
- Outline
- Introduction
- Common methods
-
- More detailed examples
-
- Performance
- Practicalities
- Summary
- Bibliography
-
Introduction
adapted from Wu et al 2017
Differential expression meanstaking read count data amp
performing statistical analysis to discoverquantitative changes in expression levels betweenexperimental groups
ie to decide whether for a given gene anobserved difference in read counts is significant(greater than what would be expected just due tonatural random variation)
Differential expression is an old problemknown from bulk RNA-seq and microarray studies
in fact building on one of the most commonstatistical problems ie comparing groups forstatistical differences
Olga (NBIS) scRNA-seq DE May 2018 7 43
Introduction
adapted from Wu et al 2017
Differential expression meanstaking read count data amp
performing statistical analysis to discoverquantitative changes in expression levels betweenexperimental groups
ie to decide whether for a given gene anobserved difference in read counts is significant(greater than what would be expected just due tonatural random variation)
Differential expression is an old problemknown from bulk RNA-seq and microarray studies
in fact building on one of the most commonstatistical problems ie comparing groups forstatistical differences
Olga (NBIS) scRNA-seq DE May 2018 7 43
Introduction
Differential expression is an old problem
So what is all the commotion about
httpswwwmenticom amp 70 52 87
scRNA-seq special characteristicshigh noise levels (technical and biological factors)low library sizeslow amount of available mRNAs results in amplification biases anddropout events3rsquo bias partial coverage and uneven depth (technical)stochastic nature of transcription (biological)multimodality in gene expression presence of multiple possiblecell states within a cell population (biological)
Olga (NBIS) scRNA-seq DE May 2018 8 43
Introduction
Differential expression is an old problem
So what is all the commotion about
httpswwwmenticom amp 70 52 87
scRNA-seq special characteristicshigh noise levels (technical and biological factors)low library sizeslow amount of available mRNAs results in amplification biases anddropout events3rsquo bias partial coverage and uneven depth (technical)stochastic nature of transcription (biological)multimodality in gene expression presence of multiple possiblecell states within a cell population (biological)
Olga (NBIS) scRNA-seq DE May 2018 8 43
Introduction
Rbm17 Rragc Slc1a3 Slc22a20 Smarcd1
Mybpc1 Nars Ndufa3 Nono Pgam2
Crispld2 Fbxw13 Hbxip Katna1 Lcorl
1300018J18Rik Arid2 Bend3 Ccdc104 Ccnt1
00 25 50 75 100 00 25 50 75 100 00 25 50 75 100 00 25 50 75 100 00 25 50 75 100
00
01
02
03
0
1
2
3
00
01
02
03
00
01
02
03
00
01
02
03
000
005
010
015
020
025
00
02
04
06
00
01
02
03
04
05
00
01
02
03
00
01
02
03
00
02
04
00
05
10
15
00
01
02
03
00
05
10
15
20
00
02
04
000
005
010
015
0
1
2
3
4
00
05
10
15
00
01
02
03
04
05
00
01
02
03
04
value
dens
ity
Based on tutorial data
Olga (NBIS) scRNA-seq DE May 2018 9 43
Common methods
Common methods
Olga (NBIS) scRNA-seq DE May 2018 10 43
Common methods
Simplified scRNA-seq workflow [adopted from httphemberg-labgithubio
Olga (NBIS) scRNA-seq DE May 2018 11 43
Common methods
Genericparametric tests eg t-testnon-parametric tests eg Kruskal-Wallis
RNA-seq basededgeRlimmaDEseq2
scRNA-seq specificMAST SCDE MonocleD3E Pagoda
Olga (NBIS) scRNA-seq DE May 2018 12 43
Common methods
Miao and Zhang 2016
Olga (NBIS) scRNA-seq DE May 2018 13 43
Common methodsSupplementary Table 2 Evaluated dicrarrerential expression methods together with package versions and thetype of input values provided to each of them Note that ldquoraw countsrdquo here refers to length-scaled TPMswhich are on the scale of the raw counts but are unacrarrected by dicrarrerential isoform usage [10] CPM valuesare calculated with edgeR and Census counts with monocle
Short name Method Software version InputAvailablefrom
Reference
BPSC BPSC BPSC 09901 CPM GitHub [11]
D3E D3E D3E 10 raw counts GitHub [12]
DESeq2 DESeq2 DESeq2 1141 raw counts Bioconductor [13]
DESeq2betapFALSE DESeq2 without beta prior DESeq2 1141 raw counts Bioconductor [13]
DESeq2census DESeq2 DESeq2 1141 Census counts Bioconductor [13]
DESeq2nofiltDESeq2 without the built-in in-dependent filtering
DESeq2 1141 raw counts Bioconductor [13]
DEsingle DEsingle DEsingle 010 raw counts GitHub [14]
edgeRLRT edgeRLRT edgeR 3191 raw counts Bioconductor [15ndash17]
edgeRLRTcensus edgeRLRT edgeR 3191 Census counts Bioconductor [15ndash17]
edgeRLRTdeconvedgeRLRT with deconvolutionnormalization
edgeR 3191scran 120
raw counts Bioconductor [15 17 18]
edgeRLRTrobustedgeRLRT with robust disper-sion estimation
edgeR 3191 raw counts Bioconductor [15ndash17 19]
edgeRQLF edgeRQLF edgeR 3191 raw counts Bioconductor [15 16 20]
edgeRQLFDetRateedgeRQLF with cellular detec-tion rate as covariate
edgeR 3191 raw counts Bioconductor [15 16 20]
limmatrend limma-trend limma 33013 log2(CPM) Bioconductor [21 22]
MASTcpm MAST MAST 105 log2(CPM+1) Bioconductor [23]
MASTcpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(CPM+1) Bioconductor [23]
MASTtpm MAST MAST 105 log2(TPM+1) Bioconductor [23]
MASTtpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(TPM+1) Bioconductor [23]
metagenomeSeq metagenomeSeqmetagenomeSeq1160
raw counts Bioconductor [24]
monocle monocle (tobit) monocle 220 TPM Bioconductor [25]
monoclecensus monocle (Negative Binomial) monocle 220 Census counts Bioconductor [25 26]
monoclecount monocle (Negative Binomial) monocle 220 raw counts Bioconductor [25]
NODES NODESNODES0009010
raw countsAuthor-providedlink
[27]
ROTScpm ROTS ROTS 120 CPM Bioconductor [28 29]
ROTStpm ROTS ROTS 120 TPM Bioconductor [28 29]
ROTSvoom ROTS ROTS 120voom-transformedraw counts
Bioconductor [28 29]
SAMseq SAMseq samr 20 raw counts CRAN [30]
scDD scDD scDD 100 raw counts Bioconductor [31]
SCDE SCDE scde 220 raw counts Bioconductor [32]
SeuratBimod Seurat (bimod test) Seurat 1407 raw counts GitHub [33 34]
SeuratBimodnofiltSeurat (bimod test) without theinternal filtering
Seurat 1407 raw counts GitHub [33 34]
SeuratBimodIsExpr2Seurat (bimod test) with internalexpression threshold set to 2
Seurat 1407 raw counts GitHub [33 34]
SeuratTobit Seurat (tobit test) Seurat 1407 TPM GitHub [25 33]
ttest t-test stats (R v 33)TMM-normalizedTPM
CRAN [16 35]
voomlimma voom-limma limma 33013 raw counts Bioconductor [21 22]
Wilcoxon Wilcoxon test stats (R v 33)TMM-normalizedTPM
CRAN [16 36]
3
Nature Methods doi101038nmeth4612Soneson and Robinson 2018
Olga (NBIS) scRNA-seq DE May 2018 14 43
Common methods More detailed examples
More detailed examples
Olga (NBIS) scRNA-seq DE May 2018 15 43
Common methods More detailed examples
MAST
uses a generalized linear hurdle model
designed to account for stochastic dropouts and a bimodal expression distribution, in which expression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene g, where $Z_{ig}$ indicates whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
A logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):
$\mathrm{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$
$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
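To make the two-part idea concrete, here is a minimal sketch of a hurdle likelihood ratio test for a single gene and two groups of cells. It is not the MAST implementation: it assumes no covariates beyond the group label, uses plain maximum likelihood instead of the empirical Bayes fitting, and the function name `hurdle_lrt` and the example values are made up for illustration.

```python
import math


def _binom_loglik(k, n):
    """Maximised Bernoulli log-likelihood for k detected cells out of n."""
    if n == 0 or k == 0 or k == n:
        return 0.0  # the fitted rate is 0 or 1, so every term is log(1)
    p = k / n
    return k * math.log(p) + (n - k) * math.log(1 - p)


def _sum_sq(values, mean):
    return sum((v - mean) ** 2 for v in values)


def hurdle_lrt(group_a, group_b):
    """Two-group hurdle LRT for one gene (illustrative, not MAST itself).

    group_a, group_b: log-scale expression values, 0 meaning "not detected".
    Returns (chi-square statistic, p-value on 2 degrees of freedom).
    """
    pos_a = [y for y in group_a if y > 0]
    pos_b = [y for y in group_b if y > 0]

    # Discrete part: does the detection rate Pr(Z = 1) differ between groups?
    ll_full = (_binom_loglik(len(pos_a), len(group_a))
               + _binom_loglik(len(pos_b), len(group_b)))
    ll_null = _binom_loglik(len(pos_a) + len(pos_b),
                            len(group_a) + len(group_b))
    stat_discrete = 2 * (ll_full - ll_null)

    # Continuous part: Gaussian model for Y | Z = 1 with a shared variance.
    # If a group has no detected cells the means cannot be compared, so the
    # continuous part contributes nothing (a conservative simplification).
    stat_continuous = 0.0
    if pos_a and pos_b:
        pooled = pos_a + pos_b
        rss_full = (_sum_sq(pos_a, sum(pos_a) / len(pos_a))
                    + _sum_sq(pos_b, sum(pos_b) / len(pos_b)))
        rss_null = _sum_sq(pooled, sum(pooled) / len(pooled))
        if rss_full > 0:
            stat_continuous = len(pooled) * math.log(rss_null / rss_full)

    stat = stat_discrete + stat_continuous
    p_value = math.exp(-stat / 2)  # chi-square survival function for df = 2
    return stat, p_value


# Hypothetical gene: detected in all cells of group A, rarely in group B.
stat, p = hurdle_lrt([5.1, 4.8, 5.3, 5.0, 4.9, 5.2], [0, 0, 0, 0, 1.0, 0.8])
```

The statistics of the discrete and the continuous part are added, so a gene can be called DE because of a change in detection rate, a change in expression level among detected cells, or both.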
SCDE
models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution
the NB distribution models the transcripts that are amplified and detected
the Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)
a subset of robust genes is used to fit, via the EM algorithm, the parameters of the mixture model
For DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach
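The mixture idea can be sketched as follows: given a mixing weight and the parameters of both components, the posterior probability that an observed count stems from the dropout (Poisson) component is what an E-step of the EM algorithm would compute. This is an illustration only, not the scde package; the function names and all parameter values are assumptions chosen for the example.

```python
import math


def poisson_pmf(k, rate):
    """Poisson probability of observing k counts (rate must be > 0)."""
    return math.exp(k * math.log(rate) - rate - math.lgamma(k + 1))


def nb_pmf(k, mean, size):
    """Negative binomial probability, parameterised by mean and size."""
    log_p = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
             + size * math.log(size / (size + mean))
             + k * math.log(mean / (size + mean)))
    return math.exp(log_p)


def dropout_posterior(count, dropout_weight, background_rate, nb_mean, nb_size):
    """Posterior probability that `count` came from the dropout component."""
    drop = dropout_weight * poisson_pmf(count, background_rate)
    amplified = (1 - dropout_weight) * nb_pmf(count, nb_mean, nb_size)
    return drop / (drop + amplified)


# Hypothetical parameters: 30% dropouts, background rate 0.1,
# amplified signal with mean 50 and size 2.
p_zero = dropout_posterior(0, 0.3, 0.1, 50, 2)   # a zero is most likely a dropout
p_high = dropout_posterior(40, 0.3, 0.1, 50, 2)  # a high count is not
```

In a full EM fit these posteriors would be used to re-estimate the mixing weight and the component parameters, and the two steps would be repeated until convergence.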
Monocle
Originally designed for ordering cells by progress through differentiation stages (pseudo-time)
The mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as
$g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$
where $Y$ is a specific gene expression level, $x_i$ are predictor variables, $g$ is a link function (typically the log function) and $f_i$ are non-parametric functions (e.g. cubic splines)
The observable expression level Y is then modeled using a GAM:
$E(Y) = s(\phi_t(b_x, s_i)) + \varepsilon$
where $\phi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and $s$ is a cubic smoothing function with three degrees of freedom. The error term $\varepsilon$ is normally distributed with a mean of zero
The DE test is performed using an approximate $\chi^2$ likelihood ratio test
Let's stop for a minute
The key
$\text{Outcome}_i = (\text{Model}_i) + \text{error}_i$
we collect data on a sample from a much larger population
statistics lets us make inferences about the population from which the sample was derived
we try to predict the outcome given a model fitted to the data
$t = \dfrac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$
[Figure: histogram of height [cm] against frequency]
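As a small worked example of the two-sample t statistic with pooled standard deviation $s_p$, the sketch below computes it for two made-up groups of heights. The function name and the data are hypothetical and only meant to show where each term of the formula comes from.

```python
import math


def pooled_t_statistic(x1, x2):
    """Two-sample t statistic with pooled standard deviation.

    Returns (t, degrees of freedom).
    """
    n1, n2 = len(x1), len(x2)
    mean1 = sum(x1) / n1
    mean2 = sum(x2) / n2
    ss1 = sum((v - mean1) ** 2 for v in x1)
    ss2 = sum((v - mean2) ** 2 for v in x2)
    # Pooled standard deviation s_p: both groups share one variance estimate.
    sp = math.sqrt((ss1 + ss2) / (n1 + n2 - 2))
    t = (mean1 - mean2) / (sp * math.sqrt(1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


# Hypothetical heights in cm for two groups of three people each.
t, df = pooled_t_statistic([170, 172, 174], [176, 178, 180])
```

Here the group means are 172 and 178, the pooled standard deviation is 2, and the resulting t is about -3.67 on 4 degrees of freedom.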
Generic recipe
model data, e.g. gene expression
fit model to the data and/or data to the model
estimate model parameters
use model for prediction and/or inference
MAST (revisited)
uses a generalized linear hurdle model
designed to account for stochastic dropouts and a bimodal expression distribution, in which expression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene g, where $Z_{ig}$ indicates whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
A logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):
$\mathrm{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$
$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
Generic recipe
model, e.g. gene expression with random error
fit model to the data and/or data to the model, estimate model parameters
use model for prediction and/or inference
Important implication: the better the model fits the data, the better the statistics
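The implication can be illustrated by fitting two candidate models to the same counts and comparing their log-likelihoods. The sketch below assumes made-up, overdispersed counts for one gene, fits a Poisson model by its mean, and fits a negative binomial by the method of moments (a simplification; real tools use more careful dispersion estimates).

```python
import math
import statistics


def poisson_loglik(counts, rate):
    """Log-likelihood of the counts under a Poisson model."""
    return sum(k * math.log(rate) - rate - math.lgamma(k + 1) for k in counts)


def nb_loglik(counts, mean, size):
    """Log-likelihood under a negative binomial with the given mean and size."""
    log_p = math.log(size / (size + mean))
    log_q = math.log(mean / (size + mean))
    return sum(
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * log_p + k * log_q
        for k in counts
    )


def fit_nb_by_moments(counts):
    """Method-of-moments NB fit, using variance = mean + mean^2 / size."""
    mean = statistics.mean(counts)
    var = statistics.variance(counts)
    if var <= mean:
        raise ValueError("no overdispersion: the NB size estimate is undefined")
    return mean, mean ** 2 / (var - mean)


# Hypothetical read counts for one gene across 14 cells.
counts = [0, 0, 0, 0, 0, 1, 1, 2, 3, 5, 8, 12, 20, 35]
mean, size = fit_nb_by_moments(counts)
ll_poisson = poisson_loglik(counts, mean)
ll_nb = nb_loglik(counts, mean, size)
```

For these counts the variance is far larger than the mean, so the negative binomial reaches a much higher log-likelihood than the Poisson; any test built on the Poisson model would understate the variability.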
Common distributions
[Figure: frequency histograms of read counts for three distributions: Negative Binomial, Zero-inflated NB and Poisson-Beta]
[Figure: frequency histograms of read counts for three Negative Binomial distributions]
Performance
Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle.
Short name | Method | Software version | Input | Available from | Reference
BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11]
D3E | D3E | D3E 1.0 | raw counts | GitHub | [12]
DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13]
DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14]
edgeRLRT | edgeRLRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17]
edgeRLRTcensus | edgeRLRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17]
edgeRLRTdeconv | edgeRLRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18]
edgeRLRTrobust | edgeRLRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19]
edgeRQLF | edgeRQLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
edgeRQLFDetRate | edgeRQLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22]
MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24]
monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25]
monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26]
monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25]
NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27]
ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29]
ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29]
ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29]
SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30]
scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31]
SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32]
SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33]
ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35]
voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22]
Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36]
Source: Nature Methods, doi:10.1038/nmeth.4612
No ground truth, i.e. no independently validated truth is available for testing
Known data: using data we know something about to get positive controls
Simulated data: null data sets by re-sampling; modeling data sets based on various distributions
Comparing between methods and scenarios: comparing numbers of DE genes, incl. as a function of group size
Investigating results: what do the expression levels and distributions of the detected DE genes look like?
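One way to build a null data set by re-sampling is to take cells from a single homogeneous group and split them at random into two artificial groups. No gene is truly DE between these groups, so every gene a method reports is a false positive. The sketch below only performs the split; the function name, the cell labels and the group size are hypothetical.

```python
import random


def make_null_split(cell_ids, group_size, seed=0):
    """Randomly draw two disjoint groups of cells from one population.

    cell_ids: list of cell identifiers from a single biological condition.
    Returns two lists of `group_size` cells each.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    chosen = rng.sample(cell_ids, 2 * group_size)
    return chosen[:group_size], chosen[group_size:]


# Hypothetical population of 20 cells, split into two null groups of 6.
cells = [f"cell{i}" for i in range(20)]
group_a, group_b = make_null_split(cells, 6, seed=1)
```

Repeating the split with different seeds and group sizes gives several null instances, which is how the number of false detections can be studied as a function of group size.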
False positives (type I error) vs false negatives (type II error)
Sensitivity and specificity
Precision and recall
[Figure: adapted from Wikipedia]
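These quantities follow directly from comparing the genes a method calls DE with the genes that are truly DE in a simulation. The sketch below assumes a simulated setting with 10,000 genes of which 2,000 are truly DE, and a hypothetical method that calls 1,800 genes; the function name and the gene sets are invented for illustration.

```python
def classification_metrics(called_de, truly_de, all_genes):
    """Sensitivity, specificity and precision of a set of DE calls.

    Assumes that at least one gene is called and that both truly DE
    and truly non-DE genes exist, so no denominator is zero.
    """
    called, truth, universe = set(called_de), set(truly_de), set(all_genes)
    tp = len(called & truth)              # true positives
    fp = len(called - truth)              # false positives (type I errors)
    fn = len(truth - called)              # false negatives (type II errors)
    tn = len(universe - called - truth)   # true negatives
    return {
        "sensitivity": tp / (tp + fn),    # also called recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }


genes = range(10000)
truth = range(2000)          # genes 0..1999 are truly DE
called = range(500, 2300)    # hypothetical calls: 1,500 correct, 300 wrong
metrics = classification_metrics(called, truth, genes)
```

With these sets there are 1,500 true positives, 300 false positives, 500 false negatives and 7,700 true negatives, giving a sensitivity of 0.75, a specificity of 0.9625 and a precision of about 0.83.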
[Figure: Dal Molin, Baruzzo and Di Camillo 2017; 2 conditions of 100 cells each, simulated with 10,000 genes, of which 2,000 were set to be DE (based on NB and bimodal distributions)]
Consistency
[Figure: Miao et al. 2017]
And so much more
Soneson and Robinson 2018: "Bias, robustness and scalability in single-cell differential expression analysis"
36 statistical approaches for DE analysis, comparing the expression levels in two groups of cells
based on 9 data sets with 11-21 separate instances (sample size effect)
extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.
conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets
Practicalities
Getting to know your data
Example data: 46,078 genes x 96 cells
22,229 genes with no expression at all
[Figure: two frequency histograms, one of read counts and one of "0 counts"]
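In the example data, 22,229 of 46,078 genes (about 48%) have no expression in any cell. A first look of this kind can be sketched as below, assuming the counts are held as a plain list of rows with one row per gene and one column per cell; the function name and the toy matrix are hypothetical.

```python
def zero_summary(counts):
    """Count zeros per gene and the number of genes that are never expressed.

    counts: list of rows, one row per gene, one column per cell.
    Returns (zeros per gene, number of genes with zero counts in every cell).
    """
    n_cells = len(counts[0])
    zeros_per_gene = [sum(1 for c in row if c == 0) for row in counts]
    silent_genes = sum(1 for z in zeros_per_gene if z == n_cells)
    return zeros_per_gene, silent_genes


# Toy matrix: 4 genes x 3 cells.
toy_counts = [
    [0, 0, 0],
    [1, 0, 2],
    [0, 0, 5],
    [0, 0, 0],
]
zeros_per_gene, silent_genes = zero_summary(toy_counts)
```

Genes that are silent in every cell carry no information for DE testing and are usually removed before the analysis.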
Choosing DE methods
[Figure: Soneson and Robinson 2018]
Remembering the bigger picture
Stegle, Teichmann and Marioni 2015
QC filtering
Cell-cycle phase
Normalization of cell-specific biases
Confounding factors, incl. batch effects
Detection rate, i.e. the fraction of detected genes per cell
Imputation strategies for dropout values
What is pragmatic: programming language, platform, speed, collaborative workflows, etc.
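The cellular detection rate from the list above can be computed directly from a count matrix. The sketch assumes one row per gene and one column per cell, and treats a gene as detected when its count exceeds a threshold (0 by default); the function name and the toy matrix are hypothetical.

```python
def detection_rate(counts, threshold=0):
    """Fraction of genes detected in each cell.

    counts: list of rows, one row per gene, one column per cell.
    Returns one value between 0 and 1 per cell.
    """
    n_genes = len(counts)
    n_cells = len(counts[0])
    return [
        sum(1 for g in range(n_genes) if counts[g][c] > threshold) / n_genes
        for c in range(n_cells)
    ]


# Toy matrix: 4 genes x 3 cells.
toy_counts = [
    [0, 0, 0],
    [1, 0, 2],
    [0, 0, 5],
    [0, 0, 0],
]
rates = detection_rate(toy_counts)
```

The resulting per-cell value is what variants such as MASTcpmDetRate or edgeRQLFDetRate include as a covariate in the model.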
Staying critical
Summary
What to remember from this hour
https://www.menti.com & 70 52 87
Growing field
[Figure: Angerer et al. 2017]
https://www.scrna-tools.org/tools (Zappia, Phipson and Oshlack 2018)
Summary
scRNA-seq is a rapidly growing field
DE is a common task, so many newer and better methods will be developed
understanding basic statistical concepts enables one to think more like a statistician, and to choose and evaluate methods for a given data set
staying critical, staying updated, staying connected
Bibliography
Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1–9. ISSN 1367-4803. doi:10.1093/bioinformatics/bty329
Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243–260. ISSN 2095-4697. doi:10.1007/s40484-016-0089-7
Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255–261. ISSN 1548-7105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612
Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN 1664-8021. doi:10.3389/fgene.2017.00062
Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data." No. May: 1–2. ISSN 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549
Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133–145
Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85–91. ISSN 2452-3100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004
Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573
Introduction
Differential expression means taking read count data and performing statistical analysis to discover quantitative changes in expression levels between experimental groups (adapted from Wu et al. 2017)
i.e. deciding whether, for a given gene, an observed difference in read counts is significant (greater than what would be expected just due to natural random variation)
Differential expression is an old problem, known from bulk RNA-seq and microarray studies
in fact it builds on one of the most common statistical problems, i.e. comparing groups for statistical differences
Differential expression is an old problem
So what is all the commotion about?
https://www.menti.com & 70 52 87
scRNA-seq special characteristics
high noise levels (technical and biological factors)
low library sizes
low amount of available mRNAs results in amplification biases and dropout events
3' bias, partial coverage and uneven depth (technical)
stochastic nature of transcription (biological)
multimodality in gene expression: presence of multiple possible cell states within a cell population (biological)
[Figure: density plots (x-axis: value, y-axis: density) of expression for 20 genes: Rbm17, Rragc, Slc1a3, Slc22a20, Smarcd1, Mybpc1, Nars, Ndufa3, Nono, Pgam2, Crispld2, Fbxw13, Hbxip, Katna1, Lcorl, 1300018J18Rik, Arid2, Bend3, Ccdc104, Ccnt1; based on tutorial data]
Common methods
[Figure: Simplified scRNA-seq workflow, adapted from http://hemberg-lab.github.io]
Generic: parametric tests, e.g. t-test; non-parametric tests, e.g. Kruskal-Wallis
RNA-seq based: edgeR, limma, DESeq2
scRNA-seq specific: MAST, SCDE, Monocle, D3E, Pagoda
[Figure: Miao and Zhang 2016]
Common methodsSupplementary Table 2 Evaluated dicrarrerential expression methods together with package versions and thetype of input values provided to each of them Note that ldquoraw countsrdquo here refers to length-scaled TPMswhich are on the scale of the raw counts but are unacrarrected by dicrarrerential isoform usage [10] CPM valuesare calculated with edgeR and Census counts with monocle
Short name Method Software version InputAvailablefrom
Reference
BPSC BPSC BPSC 09901 CPM GitHub [11]
D3E D3E D3E 10 raw counts GitHub [12]
DESeq2 DESeq2 DESeq2 1141 raw counts Bioconductor [13]
DESeq2betapFALSE DESeq2 without beta prior DESeq2 1141 raw counts Bioconductor [13]
DESeq2census DESeq2 DESeq2 1141 Census counts Bioconductor [13]
DESeq2nofiltDESeq2 without the built-in in-dependent filtering
DESeq2 1141 raw counts Bioconductor [13]
DEsingle DEsingle DEsingle 010 raw counts GitHub [14]
edgeRLRT edgeRLRT edgeR 3191 raw counts Bioconductor [15ndash17]
edgeRLRTcensus edgeRLRT edgeR 3191 Census counts Bioconductor [15ndash17]
edgeRLRTdeconvedgeRLRT with deconvolutionnormalization
edgeR 3191scran 120
raw counts Bioconductor [15 17 18]
edgeRLRTrobustedgeRLRT with robust disper-sion estimation
edgeR 3191 raw counts Bioconductor [15ndash17 19]
edgeRQLF edgeRQLF edgeR 3191 raw counts Bioconductor [15 16 20]
edgeRQLFDetRateedgeRQLF with cellular detec-tion rate as covariate
edgeR 3191 raw counts Bioconductor [15 16 20]
limmatrend limma-trend limma 33013 log2(CPM) Bioconductor [21 22]
MASTcpm MAST MAST 105 log2(CPM+1) Bioconductor [23]
MASTcpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(CPM+1) Bioconductor [23]
MASTtpm MAST MAST 105 log2(TPM+1) Bioconductor [23]
MASTtpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(TPM+1) Bioconductor [23]
metagenomeSeq metagenomeSeqmetagenomeSeq1160
raw counts Bioconductor [24]
monocle monocle (tobit) monocle 220 TPM Bioconductor [25]
monoclecensus monocle (Negative Binomial) monocle 220 Census counts Bioconductor [25 26]
monoclecount monocle (Negative Binomial) monocle 220 raw counts Bioconductor [25]
NODES NODESNODES0009010
raw countsAuthor-providedlink
[27]
ROTScpm ROTS ROTS 120 CPM Bioconductor [28 29]
ROTStpm ROTS ROTS 120 TPM Bioconductor [28 29]
ROTSvoom ROTS ROTS 120voom-transformedraw counts
Bioconductor [28 29]
SAMseq SAMseq samr 20 raw counts CRAN [30]
scDD scDD scDD 100 raw counts Bioconductor [31]
SCDE SCDE scde 220 raw counts Bioconductor [32]
SeuratBimod Seurat (bimod test) Seurat 1407 raw counts GitHub [33 34]
SeuratBimodnofiltSeurat (bimod test) without theinternal filtering
Seurat 1407 raw counts GitHub [33 34]
SeuratBimodIsExpr2Seurat (bimod test) with internalexpression threshold set to 2
Seurat 1407 raw counts GitHub [33 34]
SeuratTobit Seurat (tobit test) Seurat 1407 TPM GitHub [25 33]
ttest t-test stats (R v 33)TMM-normalizedTPM
CRAN [16 35]
voomlimma voom-limma limma 33013 raw counts Bioconductor [21 22]
Wilcoxon Wilcoxon test stats (R v 33)TMM-normalizedTPM
CRAN [16 36]
3
Nature Methods doi101038nmeth4612Soneson and Robinson 2018
Olga (NBIS) scRNA-seq DE May 2018 14 43
Common methods More detailed examples
More detailed examples
Olga (NBIS) scRNA-seq DE May 2018 15 43
Common methods More detailed examples
MAST
uses generalized linear hurdle model
designed to account for stochastic dropouts and bimodal expression distribution in whichexpression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene gindicating whether gene g is expressed in cell i (ie Zig = 0 if yig = 0 and zig = 1 ifyig gt 0)
A logistic regression model for the discrete variable Z and a Gaussian linear model for thecontinuous variable (Y|Z=1)
logit(Pr (Zig = 1)) = XiβDg
Pr (Yig = Y |Zig = 1) = N(XiβCg σ
2g) where Xi is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
Olga (NBIS) scRNA-seq DE May 2018 16 43
Common methods More detailed examples
SCDE
models the read counts for each gene using a mixture of a NB negative binomial and aPoisson distribution
NB distribution models the transcripts that are amplified and detected
Poisson distribution models the unobserved or background-level signal of transcripts thatare not amplified (eg dropout events)
subset of robust genes is used to fit via EM algorithm the parameters to the mixture ofmodels
For DE the posterior probability that the gene shows a fold expression difference betweentwo conditions is computed using a Bayesian approach
Olga (NBIS) scRNA-seq DE May 2018 17 43
Common methods More detailed examples
Monocole
Originally designed for ordering cells by progress through differentiation stages(pseudo-time)
The mean expression level of each gene is modeled with a GAM generalized additivemodel which relates one or more predictor variables to a response variable as
g(E(Y )) = β0 + f1(x1) + f2(x2) + + fm(xm) where Y is a specific gene expression level xi arepredictor variables g is a link function typically log function and fi are non-parametric functions
(eg cubic splines)
The observable expression level Y is then modeled using GAM
E(Y ) = s(ϕt (bx si )) + ε where ϕt (bx si ) is the assigned pseudo-time of a cell and s is a cubicsmoothing function with three degrees of freedom The error term ε is normally distributed with amean of zero
The DE test is performed using an approx χ2 likelihood ratio test
Olga (NBIS) scRNA-seq DE May 2018 18 43
Common methods More detailed examples
Letrsquos stop for a minute
Olga (NBIS) scRNA-seq DE May 2018 19 43
Common methods More detailed examples
The key
Outcomei = (Modeli) + errori
we collect data on a sample from a much larger population
statistics lets us to make inferences about the population from which sample wasderived
we try to predict the outcome given a model fitted to the data
Olga (NBIS) scRNA-seq DE May 2018 20 43
Common methods More detailed examples
The key
t = x1minusx2
sp
radic1
n1+ 1
n2
height [cm]
Fre
quen
cy
165 170 175 180
010
3050
Olga (NBIS) scRNA-seq DE May 2018 21 43
Common methods More detailed examples
Generic recipemodel data eg gene expressionfit model to the data andor data to the modelestimate model parametersuse model for prediction andor inference
Olga (NBIS) scRNA-seq DE May 2018 22 43
Common methods More detailed examples
MAST (revisited)
uses generalized linear hurdle model
designed to account for stochastic dropouts and bimodal expression distribution in whichexpression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene gindicating whether gene g is expressed in cell i (ie Zig = 0 if yig = 0 and zig = 1 ifyig gt 0)
A logistic regression model for the discrete variable Z and a Gaussian linear model for thecontinuous variable (Y|Z=1)
logit(Pr (Zig = 1)) = XiβDg
Pr (Yig = Y |Zig = 1) = N(XiβCg σ
2g) where Xi is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
Olga (NBIS) scRNA-seq DE May 2018 23 43
Common methods More detailed examples
Generic recipemodel eg gene expression with random errorfit model to the data andor data to the model estimate modelparametersuse model for prediction andor inference
Important implicationthe better model fits to the data the better statistics
Olga (NBIS) scRNA-seq DE May 2018 24 43
Common methods More detailed examples
Common distributions
Negative Binomial
Read Counts
Fre
quen
cy
5 10 15 20
020
4060
8010
012
0
Zerominusinflated NB
Read Counts
Fre
quen
cy
0 5 10 15 20
010
020
030
040
050
0
PoissonminusBeta
Read Counts
Fre
quen
cy
0 20 40 60 80 120
010
020
030
0
Olga (NBIS) scRNA-seq DE May 2018 25 43
Common methods More detailed examples
Common distributions
Negative Binomial
Read Counts
Fre
quen
cy
0 5 10 15 20
010
020
030
040
050
0
Negative Binomial
Read Counts
Fre
quen
cy
0 2 4 6 8 10
050
100
150
Negative Binomial
Read Counts
Fre
quen
cy
0 2 4 6 8 10 12
050
100
150
200
Olga (NBIS) scRNA-seq DE May 2018 26 43
Performance
Performance
Olga (NBIS) scRNA-seq DE May 2018 27 43
Supplementary Table 2 Evaluated dicrarrerential expression methods together with package versions and thetype of input values provided to each of them Note that ldquoraw countsrdquo here refers to length-scaled TPMswhich are on the scale of the raw counts but are unacrarrected by dicrarrerential isoform usage [10] CPM valuesare calculated with edgeR and Census counts with monocle
Short name Method Software version InputAvailablefrom
Reference
BPSC BPSC BPSC 09901 CPM GitHub [11]
D3E D3E D3E 10 raw counts GitHub [12]
DESeq2 DESeq2 DESeq2 1141 raw counts Bioconductor [13]
DESeq2betapFALSE DESeq2 without beta prior DESeq2 1141 raw counts Bioconductor [13]
DESeq2census DESeq2 DESeq2 1141 Census counts Bioconductor [13]
DESeq2nofiltDESeq2 without the built-in in-dependent filtering
DESeq2 1141 raw counts Bioconductor [13]
DEsingle DEsingle DEsingle 010 raw counts GitHub [14]
edgeRLRT edgeRLRT edgeR 3191 raw counts Bioconductor [15ndash17]
edgeRLRTcensus edgeRLRT edgeR 3191 Census counts Bioconductor [15ndash17]
edgeRLRTdeconvedgeRLRT with deconvolutionnormalization
edgeR 3191scran 120
raw counts Bioconductor [15 17 18]
edgeRLRTrobustedgeRLRT with robust disper-sion estimation
edgeR 3191 raw counts Bioconductor [15ndash17 19]
edgeRQLF edgeRQLF edgeR 3191 raw counts Bioconductor [15 16 20]
edgeRQLFDetRateedgeRQLF with cellular detec-tion rate as covariate
edgeR 3191 raw counts Bioconductor [15 16 20]
limmatrend limma-trend limma 33013 log2(CPM) Bioconductor [21 22]
MASTcpm MAST MAST 105 log2(CPM+1) Bioconductor [23]
MASTcpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(CPM+1) Bioconductor [23]
MASTtpm MAST MAST 105 log2(TPM+1) Bioconductor [23]
MASTtpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(TPM+1) Bioconductor [23]
metagenomeSeq metagenomeSeqmetagenomeSeq1160
raw counts Bioconductor [24]
monocle monocle (tobit) monocle 220 TPM Bioconductor [25]
monoclecensus monocle (Negative Binomial) monocle 220 Census counts Bioconductor [25 26]
monoclecount monocle (Negative Binomial) monocle 220 raw counts Bioconductor [25]
NODES NODESNODES0009010
raw countsAuthor-providedlink
[27]
ROTScpm ROTS ROTS 120 CPM Bioconductor [28 29]
ROTStpm ROTS ROTS 120 TPM Bioconductor [28 29]
ROTSvoom ROTS ROTS 120voom-transformedraw counts
Bioconductor [28 29]
SAMseq SAMseq samr 20 raw counts CRAN [30]
scDD scDD scDD 100 raw counts Bioconductor [31]
SCDE SCDE scde 220 raw counts Bioconductor [32]
SeuratBimod Seurat (bimod test) Seurat 1407 raw counts GitHub [33 34]
SeuratBimodnofiltSeurat (bimod test) without theinternal filtering
Seurat 1407 raw counts GitHub [33 34]
SeuratBimodIsExpr2Seurat (bimod test) with internalexpression threshold set to 2
Seurat 1407 raw counts GitHub [33 34]
SeuratTobit Seurat (tobit test) Seurat 1407 TPM GitHub [25 33]
ttest t-test stats (R v 33)TMM-normalizedTPM
CRAN [16 35]
voomlimma voom-limma limma 33013 raw counts Bioconductor [21 22]
Wilcoxon Wilcoxon test stats (R v 33)TMM-normalizedTPM
CRAN [16 36]
3
Nature Methods doi101038nmeth4612
Performance
Olga (NBIS) scRNA-seq DE May 2018 28 43
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE May 2018 29 43
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE May 2018 29 43
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE May 2018 29 43
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE May 2018 29 43
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE May 2018 29 43
Performance
False positives (type I error) vs false negatives (type II error)Sensitivity and specificityPrecision and recall
adapted from Wikipedia
Olga (NBIS) scRNA-seq DE May 2018 30 43
Performance
False positives (type I error) vs false negatives (type II error)Sensitivity and specificityPrecision and recall
Dal Molin Baruzzo and Di Camillo 2017 2 conditions of 100 cells each simulated with 10 000genes out of which 2 000 set to DEs (based on NB and bimodal distributions)
Olga (NBIS) scRNA-seq DE May 2018 31 43
Performance
Consistency
Miao et al 2017
Olga (NBIS) scRNA-seq DE May 2018 32 43
And so much more
Soneson and Robinson 2018: Bias, robustness and scalability in single-cell differential expression analysis
36 statistical approaches for DE analysis, comparing the expression levels in two groups of cells
based on 9 data sets with 11-21 separate instances (sample size effect)
extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.
conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets
Practicalities
Getting to know your data
Example data: 46078 genes x 96 cells; 22229 genes with no expression at all
Figure: two histograms (y-axis: Frequency), one with x-axis "Read Counts" and one with x-axis "0 counts"
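A small sketch of this kind of first look at a count matrix. The matrix values are hypothetical, and reading the "0 counts" panel as the number of zero counts per gene is an assumption, not something the slide states.

```python
# Toy count matrix: rows are genes, columns are cells (hypothetical values).
counts = [
    [0, 0, 0, 0],      # gene never detected
    [5, 0, 12, 0],     # gene with dropouts
    [0, 0, 0, 0],      # gene never detected
    [30, 25, 41, 18],  # gene detected in every cell
]

n_genes, n_cells = len(counts), len(counts[0])

# Genes with no expression at all (total count of zero across all cells).
unexpressed = [g for g, row in enumerate(counts) if sum(row) == 0]

# Number of zero counts per gene (assumed meaning of the "0 counts" histogram).
zeros_per_gene = [sum(1 for c in row if c == 0) for row in counts]

print(f"{n_genes} genes x {n_cells} cells")
print(f"{len(unexpressed)} genes with no expression at all")
print("zero counts per gene:", zeros_per_gene)
```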
Choosing DE methods
Soneson and Robinson 2018
Remembering the bigger picture
Stegle, Teichmann and Marioni 2015
QC filtering
Cell-cycle phase
Normalization of cell-specific biases
Confounding factors, incl. batch effects
Detection rate, i.e. the fraction of detected genes per cell
Imputation strategies for dropout values
What is pragmatic: programming language, platform, speed, collaborative workflows, etc.
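As an illustration of the detection rate defined above, the sketch below computes it per cell next to the library size. The counts are hypothetical and chosen so that all cells have the same library size but different detection rates, which is the situation where the detection rate carries information that library-size normalization alone does not.

```python
# Toy count matrix: rows are genes, columns are cells (hypothetical values).
counts = [
    [0, 3, 0],
    [7, 0, 0],
    [2, 5, 0],
    [0, 1, 9],
]
n_genes, n_cells = len(counts), len(counts[0])

# Library size: total count per cell.
library_size = [sum(counts[g][c] for g in range(n_genes)) for c in range(n_cells)]

# Detection rate: fraction of genes with a non-zero count in each cell.
detection_rate = [
    sum(1 for g in range(n_genes) if counts[g][c] > 0) / n_genes
    for c in range(n_cells)
]

for c in range(n_cells):
    print(f"cell {c}: library size {library_size[c]}, detection rate {detection_rate[c]}")
```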
Staying critical
Summary
What to remember from this hour
https://www.menti.com & 70 52 87
Growing field
Angerer et al. 2017
https://www.scrna-tools.org/tools
Zappia, Phipson and Oshlack 2018
scRNA-seq is a rapidly growing field
DE is a common task, so many newer and better methods will be developed
understanding basic statistical concepts enables one to think more like a statistician, to choose and evaluate methods for a given data set
staying critical, staying updated, staying connected
Bibliography
Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1-9. ISSN 1367-4803. doi:10.1093/bioinformatics/bty329
Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243-260. ISSN 2095-4697. doi:10.1007/s40484-016-0089-7
Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255-261. ISSN 1548-7105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612
Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN 1664-8021. doi:10.3389/fgene.2017.00062
Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data." no. May: 1-2. ISSN 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549
Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133-145
Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85-91. ISSN 2452-3100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004
Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573
Introduction
Differential expression is an old problem
So what is all the commotion about?
https://www.menti.com & 70 52 87
scRNA-seq special characteristics
high noise levels (technical and biological factors)
low library sizes
low amount of available mRNAs results in amplification biases and dropout events
3' bias, partial coverage and uneven depth (technical)
stochastic nature of transcription (biological)
multimodality in gene expression; presence of multiple possible cell states within a cell population (biological)
Figure: density plots of expression values (x-axis: value, y-axis: density), one panel per gene: Rbm17, Rragc, Slc1a3, Slc22a20, Smarcd1, Mybpc1, Nars, Ndufa3, Nono, Pgam2, Crispld2, Fbxw13, Hbxip, Katna1, Lcorl, 1300018J18Rik, Arid2, Bend3, Ccdc104, Ccnt1. Based on tutorial data
Common methods
Figure: Simplified scRNA-seq workflow [adapted from http://hemberg-lab.github.io]
Generic: parametric tests, e.g. t-test; non-parametric tests, e.g. Kruskal-Wallis
RNA-seq based: edgeR, limma, DESeq2
scRNA-seq specific: MAST, SCDE, Monocle, D3E, Pagoda
Miao and Zhang 2016
Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle.
Short name | Method | Software version | Input | Available from | Reference
BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11]
D3E | D3E | D3E 1.0 | raw counts | GitHub | [12]
DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13]
DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14]
edgeRLRT | edgeRLRT | edgeR 3.19.1 | raw counts | Bioconductor | [15-17]
edgeRLRTcensus | edgeRLRT | edgeR 3.19.1 | Census counts | Bioconductor | [15-17]
edgeRLRTdeconv | edgeRLRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18]
edgeRLRTrobust | edgeRLRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15-17, 19]
edgeRQLF | edgeRQLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
edgeRQLFDetRate | edgeRQLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22]
MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24]
monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25]
monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26]
monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25]
NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27]
ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29]
ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29]
ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29]
SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30]
scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31]
SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32]
SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33]
ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35]
voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22]
Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36]
Source: Soneson and Robinson 2018, Nature Methods, doi:10.1038/nmeth.4612
More detailed examples
MAST
uses a generalized linear hurdle model
designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene g, with Z indicating whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
A logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):
$\operatorname{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$
$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
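To make the two-part idea concrete, here is a heavily simplified sketch for one gene and two groups of cells: a binomial likelihood ratio for the discrete part Z and a Gaussian likelihood ratio for the continuous part (Y | Z = 1), summed into one statistic. It is not the MAST implementation: there is no design matrix beyond the group label, no empirical Bayes shrinkage and no nuisance covariates, and the function names and input values are made up for illustration.

```python
import math


def xlogy(x, y):
    """x * log(y) with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def binom_loglik(expressed, total, p):
    """Log-likelihood (up to a constant) of `expressed` out of `total` cells."""
    return xlogy(expressed, p) + xlogy(total - expressed, 1.0 - p)


def hurdle_lrt(group1, group2):
    """Two-part likelihood ratio test for one gene.

    group1, group2: log-scale expression values, 0 meaning "not detected".
    Returns (statistic, p_value), with the statistic referred to a chi-square
    distribution with 2 degrees of freedom.
    """
    pos1 = [y for y in group1 if y > 0]
    pos2 = [y for y in group2 if y > 0]
    n1, n2, e1, e2 = len(group1), len(group2), len(pos1), len(pos2)

    # Discrete part (Z): does the fraction of expressing cells differ?
    p_pooled = (e1 + e2) / (n1 + n2)
    ll0 = binom_loglik(e1, n1, p_pooled) + binom_loglik(e2, n2, p_pooled)
    ll1 = binom_loglik(e1, n1, e1 / n1) + binom_loglik(e2, n2, e2 / n2)
    stat_discrete = 2.0 * (ll1 - ll0)

    # Continuous part (Y | Z = 1): does the mean level among expressing cells
    # differ? Gaussian model with a common variance, so the likelihood ratio
    # statistic reduces to N * log(RSS0 / RSS1).
    stat_continuous = 0.0
    if e1 > 0 and e2 > 0:
        m1, m2 = sum(pos1) / e1, sum(pos2) / e2
        m0 = (sum(pos1) + sum(pos2)) / (e1 + e2)
        rss1 = sum((y - m1) ** 2 for y in pos1) + sum((y - m2) ** 2 for y in pos2)
        rss0 = sum((y - m0) ** 2 for y in pos1 + pos2)
        if rss1 > 0:
            stat_continuous = (e1 + e2) * math.log(rss0 / rss1)

    stat = max(0.0, stat_discrete + stat_continuous)
    # Chi-square survival function for 2 df is exp(-x / 2). Simplification:
    # 2 df are used even when the continuous part could not be tested.
    return stat, math.exp(-stat / 2.0)


# Hypothetical log-expression values for one gene in two groups of 8 cells.
group_a = [0, 0, 0, 0, 0, 0, 1.0, 1.2]
group_b = [5.0, 5.5, 6.0, 5.2, 5.8, 6.1, 0, 5.9]
stat, p_value = hurdle_lrt(group_a, group_b)
print(stat, p_value)
```

Summing the two statistics is what lets a gene be called DE because of a change in the fraction of expressing cells, a change in level among expressing cells, or both.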
SCDE
models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution
the NB distribution models the transcripts that are amplified and detected
the Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)
a subset of robust genes is used to fit, via the EM algorithm, the parameters of the mixture of models
For DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach
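A sketch of the mixture idea only, not of the SCDE package: given a fixed mixing weight and fixed component parameters, Bayes' rule gives the probability that an observed count came from the Poisson (dropout) component rather than the NB (amplified) component. All parameter values below are hypothetical, and the fixed mixing weight is a simplifying assumption.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability of observing count k with rate lam."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def nb_pmf(k, mean, size):
    """Negative binomial probability, parameterised by mean and size."""
    log_p = (
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(size / (size + mean))
        + k * math.log(mean / (size + mean))
    )
    return math.exp(log_p)


def dropout_posterior(k, pi_dropout, background, mean, size):
    """Posterior probability that count k came from the dropout component."""
    drop = pi_dropout * poisson_pmf(k, background)
    amplified = (1.0 - pi_dropout) * nb_pmf(k, mean, size)
    return drop / (drop + amplified)


# Hypothetical parameters: 30% dropouts, background rate 0.1, NB mean 50, size 2.
params = dict(pi_dropout=0.3, background=0.1, mean=50.0, size=2.0)
post_zero = dropout_posterior(0, **params)
post_high = dropout_posterior(40, **params)
print(post_zero, post_high)
```

With these numbers a zero count is almost certainly attributed to dropout, while a count of 40 is attributed to the amplified component.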
Monocle
Originally designed for ordering cells by progress through differentiation stages (pseudo-time)
The mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as
$g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$
where $Y$ is a specific gene expression level, $x_i$ are predictor variables, $g$ is a link function (typically the log function) and $f_i$ are non-parametric functions (e.g. cubic splines)
The observable expression level $Y$ is then modeled using a GAM:
$E(Y) = s(\varphi_t(b_x, s_i)) + \epsilon$
where $\varphi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and $s$ is a cubic smoothing function with three degrees of freedom. The error term $\epsilon$ is normally distributed with a mean of zero
The DE test is performed using an approximate $\chi^2$ likelihood ratio test
Let's stop for a minute
The key
$\text{Outcome}_i = (\text{Model}_i) + \text{error}_i$
we collect data on a sample from a much larger population
statistics lets us make inferences about the population from which the sample was derived
we try to predict the outcome given a model fitted to the data
$t = \dfrac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$
Figure: histogram of height [cm] (y-axis: Frequency)
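The t statistic above can be computed in a few lines. The slide does not define $s_p$; the sketch assumes the usual pooled standard deviation of the two samples, and the height values are invented for illustration.

```python
import math
from statistics import mean, variance


def pooled_t(x1, x2):
    """Two-sample t statistic with a pooled standard deviation s_p."""
    n1, n2 = len(x1), len(x2)
    # Pooled variance: sample variances weighted by their degrees of freedom.
    sp2 = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    return (mean(x1) - mean(x2)) / (math.sqrt(sp2) * math.sqrt(1 / n1 + 1 / n2))


# Hypothetical heights [cm] in two groups.
heights_a = [170.0, 172.0, 174.0, 176.0]
heights_b = [166.0, 168.0, 170.0, 172.0]
t = pooled_t(heights_a, heights_b)
print(round(t, 4))
```

Here the model is "each group has its own mean, both share one variance", and t measures the difference in means relative to the error that model leaves unexplained.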
Generic recipe
model data, e.g. gene expression
fit model to the data and/or data to the model
estimate model parameters
use model for prediction and/or inference
MAST (revisited)
uses a generalized linear hurdle model
designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene g, with Z indicating whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
A logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):
$\operatorname{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$
$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
Generic recipe
model, e.g. gene expression with random error
fit model to the data and/or data to the model; estimate model parameters
use model for prediction and/or inference
Important implication: the better the model fits the data, the better the statistics
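One way to see the implication is to fit two candidate models to the same counts and compare how well they describe them. The sketch below fits a Poisson and a negative binomial model to hypothetical overdispersed counts and compares log-likelihoods; the NB size parameter comes from a simple method-of-moments estimate, chosen here for brevity rather than taken from any of the packages discussed.

```python
import math
from statistics import mean, variance


def poisson_loglik(counts, lam):
    """Log-likelihood of the counts under a Poisson model with rate lam."""
    return sum(k * math.log(lam) - lam - math.lgamma(k + 1) for k in counts)


def nb_loglik(counts, mu, size):
    """Log-likelihood under a negative binomial model with mean mu and size."""
    return sum(
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(size / (size + mu))
        + k * math.log(mu / (size + mu))
        for k in counts
    )


# Hypothetical counts for one gene across 12 cells: many zeros, a few large values.
counts = [0, 0, 1, 0, 12, 3, 0, 25, 7, 0, 2, 40]

mu = mean(counts)
var = variance(counts)
size = mu ** 2 / (var - mu)  # method of moments; only valid when var > mu

ll_poisson = poisson_loglik(counts, mu)
ll_nb = nb_loglik(counts, mu, size)
print(f"mean {mu:.2f}, variance {var:.2f}, NB size {size:.3f}")
print(f"Poisson log-likelihood {ll_poisson:.1f}, NB log-likelihood {ll_nb:.1f}")
```

The Poisson model forces the variance to equal the mean, so it fits these counts much worse than the NB model; any test built on the poorer fit inherits that mismatch.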
Common distributions
Figure: three histograms of read counts (x-axis: Read Counts, y-axis: Frequency), with panels Negative Binomial, Zero-inflated NB and Poisson-Beta
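For readers who want to reproduce histograms of this kind, the sketch below draws samples from the three distributions using only the Python standard library. The parameter values are hypothetical, the NB is sampled as a gamma-Poisson mixture, and the Poisson sampler uses Knuth's multiplication method, which is adequate only for small to moderate rates.

```python
import math
import random


def sample_poisson(rng, lam):
    """Knuth's multiplication method for a Poisson draw with rate lam."""
    if lam <= 0:
        return 0
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def sample_nb(rng, mean, size):
    """Negative binomial as a Poisson whose rate is gamma distributed."""
    return sample_poisson(rng, rng.gammavariate(size, mean / size))


def sample_zinb(rng, mean, size, pi_zero):
    """Zero-inflated NB: an extra zero with probability pi_zero, else an NB draw."""
    return 0 if rng.random() < pi_zero else sample_nb(rng, mean, size)


def sample_poisson_beta(rng, scale, a, b):
    """Poisson-Beta: a Poisson whose rate is scale times a Beta(a, b) variable."""
    return sample_poisson(rng, scale * rng.betavariate(a, b))


rng = random.Random(1)  # fixed seed so the run is repeatable
n = 2000
nb = [sample_nb(rng, 5.0, 2.0) for _ in range(n)]
zinb = [sample_zinb(rng, 5.0, 2.0, 0.5) for _ in range(n)]
pois_beta = [sample_poisson_beta(rng, 100.0, 0.5, 0.5) for _ in range(n)]

for name, xs in [("NB", nb), ("ZINB", zinb), ("Poisson-Beta", pois_beta)]:
    zero_fraction = sum(1 for x in xs if x == 0) / n
    print(f"{name}: mean {sum(xs) / n:.2f}, fraction of zeros {zero_fraction:.2f}")
```

The zero-inflated sample has the same non-zero shape as the NB sample but far more zeros, which is the feature the zero-inflated panel is meant to show.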
Figure: three further histograms of read counts (x-axis: Read Counts, y-axis: Frequency), all panels titled Negative Binomial
Performance
Introduction
Differential expression is an old problem
So what is all the commotion about
httpswwwmenticom amp 70 52 87
scRNA-seq special characteristicshigh noise levels (technical and biological factors)low library sizeslow amount of available mRNAs results in amplification biases anddropout events3rsquo bias partial coverage and uneven depth (technical)stochastic nature of transcription (biological)multimodality in gene expression presence of multiple possiblecell states within a cell population (biological)
Olga (NBIS) scRNA-seq DE May 2018 8 43
Introduction
Rbm17 Rragc Slc1a3 Slc22a20 Smarcd1
Mybpc1 Nars Ndufa3 Nono Pgam2
Crispld2 Fbxw13 Hbxip Katna1 Lcorl
1300018J18Rik Arid2 Bend3 Ccdc104 Ccnt1
00 25 50 75 100 00 25 50 75 100 00 25 50 75 100 00 25 50 75 100 00 25 50 75 100
00
01
02
03
0
1
2
3
00
01
02
03
00
01
02
03
00
01
02
03
000
005
010
015
020
025
00
02
04
06
00
01
02
03
04
05
00
01
02
03
00
01
02
03
00
02
04
00
05
10
15
00
01
02
03
00
05
10
15
20
00
02
04
000
005
010
015
0
1
2
3
4
00
05
10
15
00
01
02
03
04
05
00
01
02
03
04
value
dens
ity
Based on tutorial data
Olga (NBIS) scRNA-seq DE May 2018 9 43
Common methods
Common methods
Olga (NBIS) scRNA-seq DE May 2018 10 43
Common methods
Simplified scRNA-seq workflow [adopted from httphemberg-labgithubio
Olga (NBIS) scRNA-seq DE May 2018 11 43
Common methods
Genericparametric tests eg t-testnon-parametric tests eg Kruskal-Wallis
RNA-seq basededgeRlimmaDEseq2
scRNA-seq specificMAST SCDE MonocleD3E Pagoda
Olga (NBIS) scRNA-seq DE May 2018 12 43
Common methods
Miao and Zhang 2016
Olga (NBIS) scRNA-seq DE May 2018 13 43
Common methodsSupplementary Table 2 Evaluated dicrarrerential expression methods together with package versions and thetype of input values provided to each of them Note that ldquoraw countsrdquo here refers to length-scaled TPMswhich are on the scale of the raw counts but are unacrarrected by dicrarrerential isoform usage [10] CPM valuesare calculated with edgeR and Census counts with monocle
Short name Method Software version InputAvailablefrom
Reference
BPSC BPSC BPSC 09901 CPM GitHub [11]
D3E D3E D3E 10 raw counts GitHub [12]
DESeq2 DESeq2 DESeq2 1141 raw counts Bioconductor [13]
DESeq2betapFALSE DESeq2 without beta prior DESeq2 1141 raw counts Bioconductor [13]
DESeq2census DESeq2 DESeq2 1141 Census counts Bioconductor [13]
DESeq2nofiltDESeq2 without the built-in in-dependent filtering
DESeq2 1141 raw counts Bioconductor [13]
DEsingle DEsingle DEsingle 010 raw counts GitHub [14]
edgeRLRT edgeRLRT edgeR 3191 raw counts Bioconductor [15ndash17]
edgeRLRTcensus edgeRLRT edgeR 3191 Census counts Bioconductor [15ndash17]
edgeRLRTdeconvedgeRLRT with deconvolutionnormalization
edgeR 3191scran 120
raw counts Bioconductor [15 17 18]
edgeRLRTrobustedgeRLRT with robust disper-sion estimation
edgeR 3191 raw counts Bioconductor [15ndash17 19]
edgeRQLF edgeRQLF edgeR 3191 raw counts Bioconductor [15 16 20]
edgeRQLFDetRateedgeRQLF with cellular detec-tion rate as covariate
edgeR 3191 raw counts Bioconductor [15 16 20]
limmatrend limma-trend limma 33013 log2(CPM) Bioconductor [21 22]
MASTcpm MAST MAST 105 log2(CPM+1) Bioconductor [23]
MASTcpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(CPM+1) Bioconductor [23]
MASTtpm MAST MAST 105 log2(TPM+1) Bioconductor [23]
MASTtpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(TPM+1) Bioconductor [23]
metagenomeSeq metagenomeSeqmetagenomeSeq1160
raw counts Bioconductor [24]
monocle monocle (tobit) monocle 220 TPM Bioconductor [25]
monoclecensus monocle (Negative Binomial) monocle 220 Census counts Bioconductor [25 26]
monoclecount monocle (Negative Binomial) monocle 220 raw counts Bioconductor [25]
NODES NODESNODES0009010
raw countsAuthor-providedlink
[27]
ROTScpm ROTS ROTS 120 CPM Bioconductor [28 29]
ROTStpm ROTS ROTS 120 TPM Bioconductor [28 29]
ROTSvoom ROTS ROTS 120voom-transformedraw counts
Bioconductor [28 29]
SAMseq SAMseq samr 20 raw counts CRAN [30]
scDD scDD scDD 100 raw counts Bioconductor [31]
SCDE SCDE scde 220 raw counts Bioconductor [32]
SeuratBimod Seurat (bimod test) Seurat 1407 raw counts GitHub [33 34]
SeuratBimodnofiltSeurat (bimod test) without theinternal filtering
Seurat 1407 raw counts GitHub [33 34]
SeuratBimodIsExpr2Seurat (bimod test) with internalexpression threshold set to 2
Seurat 1407 raw counts GitHub [33 34]
SeuratTobit Seurat (tobit test) Seurat 1407 TPM GitHub [25 33]
ttest t-test stats (R v 33)TMM-normalizedTPM
CRAN [16 35]
voomlimma voom-limma limma 33013 raw counts Bioconductor [21 22]
Wilcoxon Wilcoxon test stats (R v 33)TMM-normalizedTPM
CRAN [16 36]
3
Nature Methods doi101038nmeth4612Soneson and Robinson 2018
Olga (NBIS) scRNA-seq DE May 2018 14 43
Common methods More detailed examples
More detailed examples
Olga (NBIS) scRNA-seq DE May 2018 15 43
Common methods More detailed examples
MAST
uses generalized linear hurdle model
designed to account for stochastic dropouts and bimodal expression distribution in whichexpression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene gindicating whether gene g is expressed in cell i (ie Zig = 0 if yig = 0 and zig = 1 ifyig gt 0)
A logistic regression model for the discrete variable Z and a Gaussian linear model for thecontinuous variable (Y|Z=1)
logit(Pr (Zig = 1)) = XiβDg
Pr (Yig = Y |Zig = 1) = N(XiβCg σ
2g) where Xi is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
Olga (NBIS) scRNA-seq DE May 2018 16 43
Common methods More detailed examples
SCDE
models the read counts for each gene using a mixture of a NB negative binomial and aPoisson distribution
NB distribution models the transcripts that are amplified and detected
Poisson distribution models the unobserved or background-level signal of transcripts thatare not amplified (eg dropout events)
subset of robust genes is used to fit via EM algorithm the parameters to the mixture ofmodels
For DE the posterior probability that the gene shows a fold expression difference betweentwo conditions is computed using a Bayesian approach
Olga (NBIS) scRNA-seq DE May 2018 17 43
Common methods More detailed examples
Monocole
Originally designed for ordering cells by progress through differentiation stages(pseudo-time)
The mean expression level of each gene is modeled with a GAM generalized additivemodel which relates one or more predictor variables to a response variable as
g(E(Y )) = β0 + f1(x1) + f2(x2) + + fm(xm) where Y is a specific gene expression level xi arepredictor variables g is a link function typically log function and fi are non-parametric functions
(eg cubic splines)
The observable expression level Y is then modeled using GAM
E(Y ) = s(ϕt (bx si )) + ε where ϕt (bx si ) is the assigned pseudo-time of a cell and s is a cubicsmoothing function with three degrees of freedom The error term ε is normally distributed with amean of zero
The DE test is performed using an approx χ2 likelihood ratio test
Olga (NBIS) scRNA-seq DE May 2018 18 43
Common methods More detailed examples
Letrsquos stop for a minute
Olga (NBIS) scRNA-seq DE May 2018 19 43
Common methods More detailed examples
The key
Outcomei = (Modeli) + errori
we collect data on a sample from a much larger population
statistics lets us to make inferences about the population from which sample wasderived
we try to predict the outcome given a model fitted to the data
Olga (NBIS) scRNA-seq DE May 2018 20 43
Common methods More detailed examples
The key
$t = \dfrac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$
[Figure: histogram of height in cm (x-axis, about 165 to 180) against frequency (y-axis)]
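The pooled two-sample t statistic from this slide can be computed in a few lines. The sketch below is only a worked illustration: the function name `pooled_t` and the height values are made up, and equal variances in the two groups are assumed (that is what the pooled standard deviation $s_p$ implies).

```python
import math
from statistics import mean, variance


def pooled_t(x1, x2):
    """Two-sample t statistic with a pooled standard deviation s_p."""
    n1, n2 = len(x1), len(x2)
    # Pooled variance: weighted average of the two sample variances.
    sp2 = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    return (mean(x1) - mean(x2)) / (math.sqrt(sp2) * math.sqrt(1 / n1 + 1 / n2))


# Hypothetical heights in cm for two small groups.
group_1 = [170, 172, 174]
group_2 = [176, 178, 180]
print(round(pooled_t(group_1, group_2), 3))
```

Here the "model" for each group is simply its mean, and the statistic measures the difference between the two means relative to the remaining error.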
Generic recipe
model data, e.g. gene expression
fit model to the data and/or data to the model
estimate model parameters
use model for prediction and/or inference
MAST (revisited)
Uses a generalized linear hurdle model
Designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene g, with Z indicating whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
A logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):
$\operatorname{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$
$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
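To make the two-part (hurdle) idea concrete, here is a minimal sketch for one gene and two groups of cells. It is a deliberate simplification and not MAST's implementation: the design matrix is reduced to a single group indicator, there is no empirical Bayes shrinkage and no nuisance covariates, and the names `hurdle_lrt`, `bernoulli_loglik` and `rss` are hypothetical. The discrete and continuous likelihood ratio statistics are added and compared with a $\chi^2$ distribution with 2 degrees of freedom.

```python
import math


def bernoulli_loglik(k, n):
    """Maximised Bernoulli log-likelihood for k detected cells out of n."""
    if k == 0 or k == n:
        return 0.0
    p = k / n
    return k * math.log(p) + (n - k) * math.log(1.0 - p)


def rss(values):
    """Residual sum of squares around the mean of `values`."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def hurdle_lrt(y_a, y_b):
    """Two-part likelihood ratio test for one gene in two groups of cells.

    y_a, y_b: log-scale expression values, with 0 meaning "not detected".
    Assumes each group has at least two detected cells whose values differ.
    """
    pos_a = [y for y in y_a if y > 0]
    pos_b = [y for y in y_b if y > 0]

    # Discrete part: does the detection rate differ between the groups?
    ll_alt = bernoulli_loglik(len(pos_a), len(y_a)) + bernoulli_loglik(len(pos_b), len(y_b))
    ll_null = bernoulli_loglik(len(pos_a) + len(pos_b), len(y_a) + len(y_b))
    stat_discrete = 2.0 * (ll_alt - ll_null)

    # Continuous part: does the mean of the detected values differ?
    # For a Gaussian with the variance estimated by maximum likelihood,
    # the LRT statistic is n * log(RSS_null / RSS_alt).
    n_pos = len(pos_a) + len(pos_b)
    stat_continuous = n_pos * math.log(rss(pos_a + pos_b) / (rss(pos_a) + rss(pos_b)))

    stat = max(0.0, stat_discrete + stat_continuous)  # guard against rounding error
    p_value = math.exp(-stat / 2.0)  # chi-square survival function for 2 df
    return stat, p_value


# Hypothetical log-expression values for one gene in two groups of 8 cells.
group_a = [0, 0, 0, 1.0, 1.2, 0, 0.9, 0]
group_b = [3.1, 2.9, 0, 3.3, 3.0, 2.8, 3.2, 3.1]
print(hurdle_lrt(group_a, group_b))
```

In this toy example the groups differ both in how often the gene is detected and in its level when detected, so both parts contribute to the test statistic.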
Generic recipe
model, e.g. gene expression with random error
fit model to the data and/or data to the model, estimate model parameters
use model for prediction and/or inference
Important implication: the better the model fits the data, the better the statistics
Common distributions
[Figure: three histograms of read counts (x-axis) against frequency (y-axis), with panels Negative Binomial, Zero-inflated NB and Poisson-Beta]
[Figure: three further Negative Binomial histograms of read counts (x-axis) against frequency (y-axis)]
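The difference between a negative binomial and a zero-inflated negative binomial can be checked numerically. The sketch below evaluates both probability mass functions and summarises them; the parameterisation by mean `mu` and `size` (inverse dispersion), the extra zero mass `pi_zero` and all numeric values are assumptions chosen for illustration. The Poisson-Beta distribution is not covered here.

```python
import math


def nb_pmf(k, mu, size):
    """NB probability of count k, with mean mu and size (inverse dispersion)."""
    log_p = (
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(size / (size + mu))
        + k * math.log(mu / (size + mu))
    )
    return math.exp(log_p)


def zinb_pmf(k, mu, size, pi_zero):
    """Zero-inflated NB: an extra point mass pi_zero at zero."""
    base = (1.0 - pi_zero) * nb_pmf(k, mu, size)
    return pi_zero + base if k == 0 else base


def summarise(pmf, upper=2000):
    """Return P(0), mean and variance of a count distribution given its pmf."""
    probs = [pmf(k) for k in range(upper)]
    mean = sum(k * p for k, p in enumerate(probs))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return probs[0], mean, var


# Hypothetical parameters: mean 5, size 2, and 30% extra zeros.
print("NB  :", summarise(lambda k: nb_pmf(k, 5.0, 2.0)))
print("ZINB:", summarise(lambda k: zinb_pmf(k, 5.0, 2.0, 0.3)))
```

The NB variance (mu + mu^2 / size) exceeds its mean, which is the overdispersion that a plain Poisson cannot capture; zero inflation additionally raises the probability of observing a zero.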
Performance
Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR, and Census counts with monocle.

| Short name | Method | Software version | Input | Available from | Reference |
|---|---|---|---|---|---|
| BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11] |
| D3E | D3E | D3E 1.0 | raw counts | GitHub | [12] |
| DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13] |
| DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14] |
| edgeRLRT | edgeRLRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17] |
| edgeRLRTcensus | edgeRLRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17] |
| edgeRLRTdeconv | edgeRLRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18] |
| edgeRLRTrobust | edgeRLRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19] |
| edgeRQLF | edgeRQLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20] |
| edgeRQLFDetRate | edgeRQLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20] |
| limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22] |
| MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23] |
| MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23] |
| MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23] |
| MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23] |
| metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24] |
| monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25] |
| monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26] |
| monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25] |
| NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27] |
| ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29] |
| ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29] |
| ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29] |
| SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30] |
| scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31] |
| SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32] |
| SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33] |
| ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35] |
| voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22] |
| Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36] |

Nature Methods, doi:10.1038/nmeth.4612
No ground truth, i.e. no independently validated truth is available for testing
Known data: using data we know something about to get positive controls
Simulated data: null data sets by re-sampling; modeling data sets based on various distributions
Comparing between methods and scenarios: comparing numbers of DE genes, incl. as a function of group size
Investigating results: what do the expression levels and distributions of the detected DE genes look like?
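One way to build a null data set by re-sampling is to take cells from a single homogeneous population and split them at random into two mock groups, so that any gene called DE between them is a false positive. The sketch below shows only the splitting step; the function name `null_split`, the cell identifiers and the group size are hypothetical.

```python
import random


def null_split(cell_ids, group_size, seed=0):
    """Randomly split cells from ONE homogeneous population into two mock groups."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    picked = rng.sample(list(cell_ids), 2 * group_size)
    return picked[:group_size], picked[group_size:]


# Hypothetical example: 96 cells, two mock groups of 24 cells each.
cells = [f"cell_{i}" for i in range(96)]
mock_a, mock_b = null_split(cells, group_size=24, seed=1)
print(len(mock_a), len(mock_b), len(set(mock_a) & set(mock_b)))
```

Repeating the split with different seeds and group sizes gives several null instances, which is useful when studying the effect of sample size.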
False positives (type I error) vs false negatives (type II error)
Sensitivity and specificity
Precision and recall
[Figure adapted from Wikipedia]
Dal Molin, Baruzzo and Di Camillo 2017: 2 conditions of 100 cells each, simulated with 10 000 genes, of which 2 000 were set to be DE (based on NB and bimodal distributions)
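The performance measures named above follow directly from the four cells of a confusion matrix. The sketch below uses the simulation layout just described (10 000 genes, 2 000 truly DE), but the counts of true and false positives are invented for illustration and are not results from the paper; `de_metrics` is a hypothetical helper.

```python
def de_metrics(tp, fp, fn, tn):
    """Standard performance measures from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),          # also called recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "false_positive_rate": fp / (fp + tn),  # type I error rate
        "false_negative_rate": fn / (fn + tp),  # type II error rate
    }


# Hypothetical method output: 1 500 of the 2 000 true DE genes are found,
# and 400 of the 8 000 non-DE genes are called DE by mistake.
metrics = de_metrics(tp=1500, fp=400, fn=500, tn=7600)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that sensitivity and recall are the same quantity, whereas precision depends on how many genes are truly DE in the first place.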
Consistency
Miao et al. 2017
And so much more
Soneson and Robinson 2018: Bias, robustness and scalability in single-cell differential expression analysis
36 statistical approaches for DE analysis, comparing the expression levels in two groups of cells
based on 9 data sets with 11-21 separate instances (sample size effect)
extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.
conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets
Practicalities
Getting to know your data
Example data: 46078 genes x 96 cells
22229 genes with no expression at all
[Figure: two histograms, one labelled Read Counts and one labelled 0 counts, each against frequency (y-axis)]
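Summaries like the ones on this slide are easy to compute from a genes x cells count matrix. The sketch below uses a tiny made-up matrix stored as a list of rows (one row per gene); the helper names are hypothetical, and a real analysis would more likely use a dedicated matrix or data-frame library.

```python
def undetected_genes(counts):
    """Number of genes (rows) whose count is zero in every cell (column)."""
    return sum(1 for gene in counts if not any(gene))


def zero_cells_per_gene(counts):
    """For each gene, the number of cells with a zero count."""
    return [sum(1 for c in gene if c == 0) for gene in counts]


# Toy matrix: 4 genes x 3 cells (values are invented).
counts = [
    [0, 0, 0],    # never detected
    [5, 0, 12],
    [0, 0, 3],
    [7, 9, 1],
]
print(undetected_genes(counts))     # genes with no expression at all
print(zero_cells_per_gene(counts))  # basis for a histogram of zero counts
```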
Choosing DE methods
Soneson and Robinson 2018
Remembering the bigger picture
Stegle, Teichmann and Marioni 2015
QC filtering
Cell-cycle phase
Normalization of cell-specific biases
Confounding factors, incl. batch effects
Detection rate, i.e. the fraction of detected genes per cell
Imputation strategies for dropout values
What is pragmatic: programming language, platform, speed, collaborative workflows, etc.
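Of the items above, the detection rate is the simplest to compute: for each cell, it is the fraction of genes with a non-zero count. The sketch below assumes a genes x cells matrix stored as a list of rows; the function name and the toy values are made up.

```python
def detection_rate(counts):
    """Fraction of genes with a non-zero count, for each cell (column)."""
    n_genes = len(counts)
    n_cells = len(counts[0])
    return [
        sum(1 for gene in counts if gene[j] > 0) / n_genes
        for j in range(n_cells)
    ]


# Toy matrix: 4 genes x 3 cells (values are invented).
counts = [
    [0, 0, 0],
    [5, 0, 12],
    [0, 0, 3],
    [7, 9, 1],
]
print(detection_rate(counts))
```

The resulting per-cell values can then be supplied as a covariate to methods that accept one, as in the "DetRate" variants evaluated by Soneson and Robinson 2018.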
Staying critical
Summary
What to remember from this hour
https://www.menti.com & 70 52 87
Growing field
Angerer et al. 2017
https://www.scrna-tools.org/tools
Zappia, Phipson and Oshlack 2018
Summary
scRNA-seq is a rapidly growing field
DE is a common task, so many newer and better methods will be developed
understanding basic statistical concepts enables one to think more like a statistician, and to choose and evaluate methods for a given data set
staying critical, staying updated, staying connected
Bibliography
Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1–9. ISSN 1367-4803. doi:10.1093/bioinformatics/bty329
Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243–260. ISSN 20954697. doi:10.1007/s40484-016-0089-7
Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255–261. ISSN 15487105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612
Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN 16648021. doi:10.3389/fgene.2017.00062
Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data." no. May: 1–2. ISSN 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549
Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133–145
Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85–91. ISSN 24523100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004
Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573
Introduction
[Figure: density plots (x-axis: value, y-axis: density) for 20 example genes: Rbm17, Rragc, Slc1a3, Slc22a20, Smarcd1, Mybpc1, Nars, Ndufa3, Nono, Pgam2, Crispld2, Fbxw13, Hbxip, Katna1, Lcorl, 1300018J18Rik, Arid2, Bend3, Ccdc104, Ccnt1. Based on tutorial data]
Common methods
Figure: Simplified scRNA-seq workflow [adapted from http://hemberg-lab.github.io]
Generic: parametric tests, e.g. t-test; non-parametric tests, e.g. Kruskal-Wallis
RNA-seq based: edgeR, limma, DESeq2
scRNA-seq specific: MAST, SCDE, Monocle, D3E, Pagoda
Miao and Zhang 2016
Supplementary Table 2: evaluated differential expression methods, with package versions and the type of input values provided to each of them (Soneson and Robinson 2018, Nature Methods, doi:10.1038/nmeth.4612)
More detailed examples
MAST
Uses a generalized linear hurdle model
Designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene g, with Z indicating whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
A logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):
$\operatorname{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$
$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
- Outline
- Introduction
- Common methods
-
- More detailed examples
-
- Performance
- Practicalities
- Summary
- Bibliography
-
Common methods
Common methods
Olga (NBIS) scRNA-seq DE May 2018 10 43
Common methods
Simplified scRNA-seq workflow [adopted from httphemberg-labgithubio
Olga (NBIS) scRNA-seq DE May 2018 11 43
Common methods
Genericparametric tests eg t-testnon-parametric tests eg Kruskal-Wallis
RNA-seq basededgeRlimmaDEseq2
scRNA-seq specificMAST SCDE MonocleD3E Pagoda
Miao and Zhang 2016
Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that “raw counts” here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle.

Short name | Method | Software version | Input | Available from | Reference
--- | --- | --- | --- | --- | ---
BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11]
D3E | D3E | D3E 1.0 | raw counts | GitHub | [12]
DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13]
DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14]
edgeRLRT | edgeR/LRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17]
edgeRLRTcensus | edgeR/LRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17]
edgeRLRTdeconv | edgeR/LRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18]
edgeRLRTrobust | edgeR/LRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19]
edgeRQLF | edgeR/QLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
edgeRQLFDetRate | edgeR/QLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22]
MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24]
monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25]
monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26]
monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25]
NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27]
ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29]
ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29]
ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29]
SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30]
scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31]
SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32]
SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33]
ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35]
voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22]
Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36]

Source: Soneson and Robinson 2018, Nature Methods, doi:10.1038/nmeth.4612
More detailed examples
MAST
uses a generalized linear hurdle model
designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene g, with $Z_{ig}$ indicating whether gene g is expressed in cell i (i.e. $z_{ig} = 0$ if $y_{ig} = 0$ and $z_{ig} = 1$ if $y_{ig} > 0$)
A logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):
$\mathrm{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$
$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
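As a rough illustration of the hurdle idea described above, the sketch below tests a single gene between two groups of cells, with no covariates. It is not the MAST implementation: it uses plain maximum likelihood rather than the empirical Bayesian fit, and the function names and toy expression values are assumptions made for this example. With only a group indicator in the design, the logistic part reduces to comparing detection fractions and the Gaussian part to comparing the means of the non-zero values. The two likelihood ratio statistics are added and compared to a χ² distribution.

```python
import math


def bernoulli_loglik(k, n):
    """Maximised Bernoulli log-likelihood for k detected cells out of n."""
    if k == 0 or k == n:
        return 0.0
    p = k / n
    return k * math.log(p) + (n - k) * math.log(1 - p)


def rss(values):
    """Residual sum of squares around the mean of `values`."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def chi2_sf(x, df):
    """Upper tail of the chi-square distribution (closed forms for df 1 and 2)."""
    return math.erfc(math.sqrt(x / 2)) if df == 1 else math.exp(-x / 2)


def hurdle_lrt(group_a, group_b):
    """Two-part likelihood ratio test for one gene.

    Inputs are log-scale expression values per cell; 0 means not detected.
    Returns (test statistic, p-value).
    """
    pos_a = [v for v in group_a if v > 0]
    pos_b = [v for v in group_b if v > 0]

    # Discrete part (Z): does the detection rate differ between the groups?
    ll_alt = (bernoulli_loglik(len(pos_a), len(group_a))
              + bernoulli_loglik(len(pos_b), len(group_b)))
    ll_null = bernoulli_loglik(len(pos_a) + len(pos_b),
                               len(group_a) + len(group_b))
    stat = max(0.0, 2 * (ll_alt - ll_null))
    df = 1

    # Continuous part (Y | Z = 1): Gaussian with a shared variance.
    # For this model the LR statistic equals n * log(RSS_null / RSS_alt).
    if len(pos_a) >= 2 and len(pos_b) >= 2:
        rss_alt = rss(pos_a) + rss(pos_b)
        rss_null = rss(pos_a + pos_b)
        if rss_alt > 0:
            n = len(pos_a) + len(pos_b)
            stat += max(0.0, n * math.log(rss_null / rss_alt))
            df = 2

    return stat, chi2_sf(stat, df)


# Toy data (made up): the gene is detected more often and at a higher level in group B.
group_a = [0, 0, 0, 2.1, 0, 1.9, 0, 0]
group_b = [3.0, 2.8, 0, 3.3, 3.1, 2.9, 3.2, 0]
statistic, p_value = hurdle_lrt(group_a, group_b)
print(f"LR statistic = {statistic:.2f}, p = {p_value:.2g}")
```

A real analysis would add covariates such as the cellular detection rate to the design matrix, which is where the joint estimation of nuisance and treatment effects comes in.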
SCDE
models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution
the NB distribution models the transcripts that are amplified and detected
the Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)
a subset of robust genes is used to fit, via the EM algorithm, the parameters of the mixture of models
For DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach
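To make the mixture idea more concrete, the sketch below computes the posterior probability that an observed count comes from the Poisson (dropout or background) component rather than from the NB (amplified) component. This corresponds to the kind of quantity an EM algorithm evaluates in its E-step. It is only an illustration and not the SCDE code: the function names and all parameter values (mixture weight, Poisson rate, NB mean and size) are assumptions chosen for this example.

```python
import math


def nb_pmf(k, mu, size):
    """Negative binomial probability of count k, parameterised by mean and size."""
    log_p = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
             + size * math.log(size / (size + mu))
             + k * math.log(mu / (size + mu)))
    return math.exp(log_p)


def poisson_pmf(k, lam):
    """Poisson probability of count k with rate lam."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def dropout_posterior(k, dropout_weight, lam, mu, size):
    """Posterior probability that count k belongs to the Poisson component."""
    drop = dropout_weight * poisson_pmf(k, lam)
    amplified = (1 - dropout_weight) * nb_pmf(k, mu, size)
    return drop / (drop + amplified)


# Hypothetical parameters: 30% dropout weight, background rate 0.1,
# amplified signal with mean 50 and size 2.
params = dict(dropout_weight=0.3, lam=0.1, mu=50, size=2)
for count in (0, 1, 5, 50):
    print(count, round(dropout_posterior(count, **params), 4))
```

With these values a count of 0 is attributed almost entirely to the dropout component, while a count of 50 is attributed almost entirely to the amplified component.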
Monocle
Originally designed for ordering cells by progress through differentiation stages (pseudo-time)
The mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as
$g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$, where Y is a specific gene expression level, $x_i$ are predictor variables, g is a link function (typically the log function) and $f_i$ are non-parametric functions (e.g. cubic splines)
The observable expression level Y is then modeled using a GAM:
$E(Y) = s(\varphi_t(b_x, s_i)) + \varepsilon$, where $\varphi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and s is a cubic smoothing function with three degrees of freedom. The error term $\varepsilon$ is normally distributed with a mean of zero
The DE test is performed using an approximate $\chi^2$ likelihood ratio test
Let's stop for a minute
The key
$\text{Outcome}_i = (\text{Model}_i) + \text{error}_i$
we collect data on a sample from a much larger population
statistics lets us make inferences about the population from which the sample was derived
we try to predict the outcome given a model fitted to the data
$t = \dfrac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$
Figure: histogram of height [cm] (x-axis) against frequency (y-axis)
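The two-sample t statistic with pooled standard deviation $s_p$ can be computed in a few lines. The sketch below is a minimal example using only the Python standard library; the function name `pooled_t` and the height values are made up for illustration.

```python
import math
from statistics import mean, variance


def pooled_t(x1, x2):
    """Two-sample t statistic assuming equal variances in both groups."""
    n1, n2 = len(x1), len(x2)
    # Pooled standard deviation: weighted average of the two sample variances.
    sp = math.sqrt(((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2))
                   / (n1 + n2 - 2))
    return (mean(x1) - mean(x2)) / (sp * math.sqrt(1 / n1 + 1 / n2))


# Hypothetical heights in cm for two groups of five people.
heights_1 = [170, 172, 168, 175, 171]
heights_2 = [178, 180, 176, 182, 179]
t_stat = pooled_t(heights_1, heights_2)
print(f"t = {t_stat:.3f} with {len(heights_1) + len(heights_2) - 2} degrees of freedom")
```

Here the model is simply "each group has its own mean", and the statistic measures how large the difference in means is relative to the error left unexplained by that model.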
Generic recipe
model data, e.g. gene expression
fit model to the data and/or data to the model
estimate model parameters
use model for prediction and/or inference
MAST (revisited)
uses a generalized linear hurdle model
designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene g, with $Z_{ig}$ indicating whether gene g is expressed in cell i (i.e. $z_{ig} = 0$ if $y_{ig} = 0$ and $z_{ig} = 1$ if $y_{ig} > 0$)
A logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):
$\mathrm{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$
$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
Generic recipe
model, e.g. gene expression with random error
fit model to the data and/or data to the model, estimate model parameters
use model for prediction and/or inference
Important implication
the better the model fits the data, the better the statistics
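One way to see why model fit matters is to fit two candidate models to the same counts and compare their log-likelihoods. The sketch below does this for a Poisson and a negative binomial model on a small, made-up vector of overdispersed counts with many zeros. The NB size parameter is estimated by the method of moments, which is a simplification chosen to keep the example short (a real tool would typically use maximum likelihood), and all names and values are assumptions.

```python
import math
from statistics import mean, variance


def poisson_loglik(counts, lam):
    """Log-likelihood of the counts under a Poisson model with rate lam."""
    return sum(-lam + k * math.log(lam) - math.lgamma(k + 1) for k in counts)


def nb_loglik(counts, mu, size):
    """Log-likelihood of the counts under a negative binomial (mean mu, size)."""
    return sum(math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
               + size * math.log(size / (size + mu))
               + k * math.log(mu / (size + mu))
               for k in counts)


# Hypothetical read counts of one gene across 16 cells.
counts = [0, 0, 0, 1, 0, 12, 0, 3, 25, 0, 0, 7, 0, 40, 2, 0]

mu = mean(counts)
var = variance(counts)
# Method of moments for the NB: var = mu + mu^2 / size (requires var > mu).
size = mu ** 2 / (var - mu)

ll_poisson = poisson_loglik(counts, mu)
ll_nb = nb_loglik(counts, mu, size)

# AIC = 2 * (number of parameters) - 2 * log-likelihood; lower is better.
aic_poisson = 2 * 1 - 2 * ll_poisson
aic_nb = 2 * 2 - 2 * ll_nb
print(f"Poisson: logLik = {ll_poisson:.1f}, AIC = {aic_poisson:.1f}")
print(f"NB:      logLik = {ll_nb:.1f}, AIC = {aic_nb:.1f}")
```

For counts like these the Poisson model, which forces the variance to equal the mean, fits much worse than the NB, so tests built on the Poisson assumption would be poorly calibrated.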
Common distributions
Figure: histograms of read counts (x-axis) against frequency (y-axis) for three distributions: Negative Binomial, Zero-inflated NB and Poisson-Beta
Figure: three further histograms of read counts (x-axis) against frequency (y-axis), each titled Negative Binomial
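Counts from the three distributions named above can be simulated with the standard library alone, which is a convenient way to get a feeling for their shapes. The sketch below is illustrative: the sampler names and every parameter value are assumptions, and the Poisson sampler uses Knuth's simple multiplication method, which is only suitable for small to moderate rates.

```python
import math
import random


def sample_poisson(lam, rng):
    """Poisson draw via Knuth's multiplication method (small/moderate lam only)."""
    if lam <= 0:
        return 0
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def sample_nb(mu, size, rng):
    """Negative binomial as a gamma-Poisson mixture with mean mu."""
    return sample_poisson(rng.gammavariate(size, mu / size), rng)


def sample_zinb(mu, size, zero_prob, rng):
    """Zero-inflated NB: an extra zero with probability zero_prob, else NB."""
    return 0 if rng.random() < zero_prob else sample_nb(mu, size, rng)


def sample_poisson_beta(max_rate, a, b, rng):
    """Poisson-Beta: Poisson rate scaled by a Beta(a, b) draw."""
    return sample_poisson(max_rate * rng.betavariate(a, b), rng)


rng = random.Random(2018)
n = 5000
nb = [sample_nb(5, 2, rng) for _ in range(n)]
zinb = [sample_zinb(5, 2, 0.5, rng) for _ in range(n)]
pb = [sample_poisson_beta(100, 0.5, 2, rng) for _ in range(n)]

for name, draws in (("NB", nb), ("ZINB", zinb), ("Poisson-Beta", pb)):
    zero_fraction = draws.count(0) / n
    print(f"{name}: mean = {sum(draws) / n:.2f}, zeros = {zero_fraction:.2f}, max = {max(draws)}")
```

Changing `mu` and `size` in `sample_nb` shows how differently shaped histograms can all come from the same family.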
Performance
No ground truth, i.e. no independently validated truth is available for testing
Known data
using data we know something about to get positive controls
Simulated data
null data sets by re-sampling
modeling data sets based on various distributions
Comparing between methods and scenarios
Comparing numbers of DE genes, incl. as a function of group size
Investigating results
What do the expression and distributions of detected DE genes look like?
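A null data set by re-sampling can be built by taking cells from a single condition and randomly splitting them into two pseudo-groups. No gene is truly DE between such groups, so anything a method reports is a false positive by construction. The sketch below shows only the splitting step for several group sizes; the function name, cell identifiers and group sizes are assumptions for illustration.

```python
import random


def split_null(cell_ids, group_size, rng):
    """Draw two disjoint pseudo-groups of equal size from cells of ONE condition."""
    picked = rng.sample(cell_ids, 2 * group_size)
    return picked[:group_size], picked[group_size:]


rng = random.Random(1)
cells = [f"cell_{i}" for i in range(96)]  # hypothetical cells, all from the same condition

null_instances = {}
for group_size in (6, 12, 24, 48):
    group_a, group_b = split_null(cells, group_size, rng)
    null_instances[group_size] = (group_a, group_b)
    print(group_size, group_a[:3], group_b[:3])
```

Running a DE method on each instance and counting the reported genes gives an empirical view of its false positive behaviour as a function of group size.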
False positives (type I error) vs false negatives (type II error)
Sensitivity and specificity
Precision and recall
adapted from Wikipedia
Dal Molin, Baruzzo and Di Camillo 2017: 2 conditions of 100 cells each, simulated with 10 000 genes, out of which 2 000 were set to be DE (based on NB and bimodal distributions)
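When the truth is known, as in a simulation with 10 000 genes of which 2 000 are DE, the performance measures listed above follow directly from the four cells of the confusion matrix. The sketch below uses the simulation sizes from the slide, but the number of genes called by the method (1 800, of which 1 500 are correct) is a hypothetical outcome invented for this example.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard performance measures from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),          # also called recall or true positive rate
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "false_positive_rate": fp / (fp + tn),  # type I error rate
        "false_negative_rate": fn / (fn + tp),  # type II error rate
    }


n_genes, n_true_de = 10_000, 2_000
n_called, n_called_correct = 1_800, 1_500  # hypothetical result of one DE method

tp = n_called_correct
fp = n_called - n_called_correct
fn = n_true_de - n_called_correct
tn = n_genes - tp - fp - fn

metrics = classification_metrics(tp, fp, fn, tn)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that precision depends on the share of true DE genes in the data set, while sensitivity and specificity do not, which is one reason to report more than one measure.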
Consistency
Miao et al 2017
And so much more
Soneson and Robinson 2018
Bias, robustness and scalability in single-cell differential expression analysis
36 statistical approaches for DE analysis to compare the expression levels in the two groups of cells
based on 9 data sets with 11 to 21 separate instances (sample size effect)
extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.
conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets
Practicalities
Getting to know your data
Example data: 46078 genes x 96 cells
22229 genes with no expression at all
Figure: two histograms with frequency on the y-axis, one with "Read Counts" and one with "0 counts" on the x-axis
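The same kind of first look can be taken at any count matrix with a few summaries: how many genes are never detected, how many zeros each gene has, and what fraction of genes is detected in each cell. The sketch below uses a tiny made-up matrix (genes in rows, cells in columns) in place of real data; the function names are assumptions for this example.

```python
# Hypothetical count matrix: 5 genes (rows) x 4 cells (columns).
counts = [
    [0, 0, 0, 0],        # never detected
    [5, 0, 12, 0],
    [0, 0, 0, 3],
    [100, 80, 95, 120],
    [0, 0, 0, 0],        # never detected
]


def genes_without_expression(matrix):
    """Number of genes with zero counts in every cell."""
    return sum(1 for row in matrix if all(v == 0 for v in row))


def zeros_per_gene(matrix):
    """For each gene, the number of cells in which it has a zero count."""
    return [sum(1 for v in row if v == 0) for row in matrix]


def detection_rate_per_cell(matrix):
    """For each cell, the fraction of genes with a non-zero count."""
    n_genes = len(matrix)
    n_cells = len(matrix[0])
    return [sum(1 for row in matrix if row[j] > 0) / n_genes for j in range(n_cells)]


n_silent = genes_without_expression(counts)
zero_counts = zeros_per_gene(counts)
detection_rates = detection_rate_per_cell(counts)
print("genes with no expression at all:", n_silent)
print("zeros per gene:", zero_counts)
print("detection rate per cell:", detection_rates)
```

Genes with no expression at all carry no information for DE testing and are usually filtered out before any method is applied.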
Choosing DE methods
Soneson and Robinson 2018
Remembering the bigger picture
Stegle, Teichmann and Marioni 2015
QC filtering
Cell-cycle phase
Normalization of cell-specific biases
Confounding factors, incl. batch effects
Detection rate, i.e. the fraction of detected genes per cell
Imputation strategies for dropout values
What is pragmatic: programming language, platform, speed, collaborative workflows, etc.
Staying critical
Summary
What to remember from this hour
https://www.menti.com & 70 52 87
Growing field
Angerer et al 2017
https://www.scrna-tools.org/tools
Zappia, Phipson and Oshlack 2018
Summary
scRNA-seq is a rapidly growing field
DE is a common task, so many newer and better methods will be developed
understanding basic statistical concepts enables one to think more like a statistician and to choose and evaluate methods for a given data set
staying critical, staying updated, staying connected
Bibliography
Wu, Zhijin, et al. 2017. “Two-phase differential expression analysis for single cell RNA-seq.” Bioinformatics 00 (00): 1–9. ISSN 1367-4803. doi:10.1093/bioinformatics/bty329
Miao, Zhun, and Xuegong Zhang. 2016. “Differential expression analyses for single-cell RNA-Seq: old questions on new data.” Quantitative Biology 4 (4): 243–260. ISSN 20954697. doi:10.1007/s40484-016-0089-7
Soneson, Charlotte, and Mark D. Robinson. 2018. “Bias, robustness and scalability in single-cell differential expression analysis.” Nature Methods 15 (4): 255–261. ISSN 15487105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612
Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. “Single-cell RNA-sequencing: Assessment of differential expression analysis methods.” Frontiers in Genetics 8 (MAY). ISSN 16648021. doi:10.3389/fgene.2017.00062
Miao, Zhun, et al. 2017. “DEsingle for detecting three types of differential expression in single-cell RNA-seq data.” no. May: 1–2. ISSN 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549
Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. “Computational and analytical challenges in single-cell transcriptomics.” Nature Reviews Genetics 16 (January 2014): 133–145
Angerer, Philipp, et al. 2017. “Single cells make big data: New challenges and opportunities in transcriptomics.” Current Opinion in Systems Biology 4: 85–91. ISSN 24523100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004
Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. “Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database.” bioRxiv: 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573
Olga (NBIS) scRNA-seq DE May 2018 43 43
- Outline
- Introduction
- Common methods
-
- More detailed examples
-
- Performance
- Practicalities
- Summary
- Bibliography
-
Common methods
Simplified scRNA-seq workflow [adopted from httphemberg-labgithubio
Olga (NBIS) scRNA-seq DE May 2018 11 43
Common methods
Genericparametric tests eg t-testnon-parametric tests eg Kruskal-Wallis
RNA-seq basededgeRlimmaDEseq2
scRNA-seq specificMAST SCDE MonocleD3E Pagoda
Olga (NBIS) scRNA-seq DE May 2018 12 43
Common methods
Miao and Zhang 2016
Olga (NBIS) scRNA-seq DE May 2018 13 43
Common methodsSupplementary Table 2 Evaluated dicrarrerential expression methods together with package versions and thetype of input values provided to each of them Note that ldquoraw countsrdquo here refers to length-scaled TPMswhich are on the scale of the raw counts but are unacrarrected by dicrarrerential isoform usage [10] CPM valuesare calculated with edgeR and Census counts with monocle
Short name Method Software version InputAvailablefrom
Reference
BPSC BPSC BPSC 09901 CPM GitHub [11]
D3E D3E D3E 10 raw counts GitHub [12]
DESeq2 DESeq2 DESeq2 1141 raw counts Bioconductor [13]
DESeq2betapFALSE DESeq2 without beta prior DESeq2 1141 raw counts Bioconductor [13]
DESeq2census DESeq2 DESeq2 1141 Census counts Bioconductor [13]
DESeq2nofiltDESeq2 without the built-in in-dependent filtering
DESeq2 1141 raw counts Bioconductor [13]
DEsingle DEsingle DEsingle 010 raw counts GitHub [14]
edgeRLRT edgeRLRT edgeR 3191 raw counts Bioconductor [15ndash17]
edgeRLRTcensus edgeRLRT edgeR 3191 Census counts Bioconductor [15ndash17]
edgeRLRTdeconvedgeRLRT with deconvolutionnormalization
edgeR 3191scran 120
raw counts Bioconductor [15 17 18]
edgeRLRTrobustedgeRLRT with robust disper-sion estimation
edgeR 3191 raw counts Bioconductor [15ndash17 19]
edgeRQLF edgeRQLF edgeR 3191 raw counts Bioconductor [15 16 20]
edgeRQLFDetRateedgeRQLF with cellular detec-tion rate as covariate
edgeR 3191 raw counts Bioconductor [15 16 20]
limmatrend limma-trend limma 33013 log2(CPM) Bioconductor [21 22]
MASTcpm MAST MAST 105 log2(CPM+1) Bioconductor [23]
MASTcpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(CPM+1) Bioconductor [23]
MASTtpm MAST MAST 105 log2(TPM+1) Bioconductor [23]
MASTtpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(TPM+1) Bioconductor [23]
metagenomeSeq metagenomeSeqmetagenomeSeq1160
raw counts Bioconductor [24]
monocle monocle (tobit) monocle 220 TPM Bioconductor [25]
monoclecensus monocle (Negative Binomial) monocle 220 Census counts Bioconductor [25 26]
monoclecount monocle (Negative Binomial) monocle 220 raw counts Bioconductor [25]
NODES NODESNODES0009010
raw countsAuthor-providedlink
[27]
ROTScpm ROTS ROTS 120 CPM Bioconductor [28 29]
ROTStpm ROTS ROTS 120 TPM Bioconductor [28 29]
ROTSvoom ROTS ROTS 120voom-transformedraw counts
Bioconductor [28 29]
SAMseq SAMseq samr 20 raw counts CRAN [30]
scDD scDD scDD 100 raw counts Bioconductor [31]
SCDE SCDE scde 220 raw counts Bioconductor [32]
SeuratBimod Seurat (bimod test) Seurat 1407 raw counts GitHub [33 34]
SeuratBimodnofiltSeurat (bimod test) without theinternal filtering
Seurat 1407 raw counts GitHub [33 34]
SeuratBimodIsExpr2Seurat (bimod test) with internalexpression threshold set to 2
Seurat 1407 raw counts GitHub [33 34]
SeuratTobit Seurat (tobit test) Seurat 1407 TPM GitHub [25 33]
ttest t-test stats (R v 33)TMM-normalizedTPM
CRAN [16 35]
voomlimma voom-limma limma 33013 raw counts Bioconductor [21 22]
Wilcoxon Wilcoxon test stats (R v 33)TMM-normalizedTPM
CRAN [16 36]
3
Nature Methods doi101038nmeth4612Soneson and Robinson 2018
Olga (NBIS) scRNA-seq DE May 2018 14 43
Common methods More detailed examples
More detailed examples
Olga (NBIS) scRNA-seq DE May 2018 15 43
Common methods More detailed examples
MAST
uses generalized linear hurdle model
designed to account for stochastic dropouts and bimodal expression distribution in whichexpression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene gindicating whether gene g is expressed in cell i (ie Zig = 0 if yig = 0 and zig = 1 ifyig gt 0)
A logistic regression model for the discrete variable Z and a Gaussian linear model for thecontinuous variable (Y|Z=1)
logit(Pr (Zig = 1)) = XiβDg
Pr (Yig = Y |Zig = 1) = N(XiβCg σ
2g) where Xi is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
Olga (NBIS) scRNA-seq DE May 2018 16 43
Common methods More detailed examples
SCDE
models the read counts for each gene using a mixture of a NB negative binomial and aPoisson distribution
NB distribution models the transcripts that are amplified and detected
Poisson distribution models the unobserved or background-level signal of transcripts thatare not amplified (eg dropout events)
subset of robust genes is used to fit via EM algorithm the parameters to the mixture ofmodels
For DE the posterior probability that the gene shows a fold expression difference betweentwo conditions is computed using a Bayesian approach
Olga (NBIS) scRNA-seq DE May 2018 17 43
Common methods More detailed examples
Monocole
Originally designed for ordering cells by progress through differentiation stages(pseudo-time)
The mean expression level of each gene is modeled with a GAM generalized additivemodel which relates one or more predictor variables to a response variable as
g(E(Y )) = β0 + f1(x1) + f2(x2) + + fm(xm) where Y is a specific gene expression level xi arepredictor variables g is a link function typically log function and fi are non-parametric functions
(eg cubic splines)
The observable expression level Y is then modeled using GAM
E(Y ) = s(ϕt (bx si )) + ε where ϕt (bx si ) is the assigned pseudo-time of a cell and s is a cubicsmoothing function with three degrees of freedom The error term ε is normally distributed with amean of zero
The DE test is performed using an approx χ2 likelihood ratio test
Olga (NBIS) scRNA-seq DE May 2018 18 43
Common methods More detailed examples
Letrsquos stop for a minute
Olga (NBIS) scRNA-seq DE May 2018 19 43
Common methods More detailed examples
The key
Outcomei = (Modeli) + errori
we collect data on a sample from a much larger population
statistics lets us to make inferences about the population from which sample wasderived
we try to predict the outcome given a model fitted to the data
Olga (NBIS) scRNA-seq DE May 2018 20 43
Common methods More detailed examples
The key
t = x1minusx2
sp
radic1
n1+ 1
n2
height [cm]
Fre
quen
cy
165 170 175 180
010
3050
Olga (NBIS) scRNA-seq DE May 2018 21 43
Common methods More detailed examples
Generic recipemodel data eg gene expressionfit model to the data andor data to the modelestimate model parametersuse model for prediction andor inference
Olga (NBIS) scRNA-seq DE May 2018 22 43
Common methods More detailed examples
MAST (revisited)
uses generalized linear hurdle model
designed to account for stochastic dropouts and bimodal expression distribution in whichexpression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene gindicating whether gene g is expressed in cell i (ie Zig = 0 if yig = 0 and zig = 1 ifyig gt 0)
A logistic regression model for the discrete variable Z and a Gaussian linear model for thecontinuous variable (Y|Z=1)
logit(Pr (Zig = 1)) = XiβDg
Pr (Yig = Y |Zig = 1) = N(XiβCg σ
2g) where Xi is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
Olga (NBIS) scRNA-seq DE May 2018 23 43
Common methods More detailed examples
Generic recipemodel eg gene expression with random errorfit model to the data andor data to the model estimate modelparametersuse model for prediction andor inference
Important implicationthe better model fits to the data the better statistics
Olga (NBIS) scRNA-seq DE May 2018 24 43
Common methods More detailed examples
Common distributions
Negative Binomial
Read Counts
Fre
quen
cy
5 10 15 20
020
4060
8010
012
0
Zerominusinflated NB
Read Counts
Fre
quen
cy
0 5 10 15 20
010
020
030
040
050
0
PoissonminusBeta
Read Counts
Fre
quen
cy
0 20 40 60 80 120
010
020
030
0
Olga (NBIS) scRNA-seq DE May 2018 25 43
Common methods More detailed examples
Common distributions
Negative Binomial
Read Counts
Fre
quen
cy
0 5 10 15 20
010
020
030
040
050
0
Negative Binomial
Read Counts
Fre
quen
cy
0 2 4 6 8 10
050
100
150
Negative Binomial
Read Counts
Fre
quen
cy
0 2 4 6 8 10 12
050
100
150
200
Olga (NBIS) scRNA-seq DE May 2018 26 43
Performance
Performance
Olga (NBIS) scRNA-seq DE May 2018 27 43
Supplementary Table 2 Evaluated dicrarrerential expression methods together with package versions and thetype of input values provided to each of them Note that ldquoraw countsrdquo here refers to length-scaled TPMswhich are on the scale of the raw counts but are unacrarrected by dicrarrerential isoform usage [10] CPM valuesare calculated with edgeR and Census counts with monocle
Short name Method Software version InputAvailablefrom
Reference
BPSC BPSC BPSC 09901 CPM GitHub [11]
D3E D3E D3E 10 raw counts GitHub [12]
DESeq2 DESeq2 DESeq2 1141 raw counts Bioconductor [13]
DESeq2betapFALSE DESeq2 without beta prior DESeq2 1141 raw counts Bioconductor [13]
DESeq2census DESeq2 DESeq2 1141 Census counts Bioconductor [13]
DESeq2nofiltDESeq2 without the built-in in-dependent filtering
DESeq2 1141 raw counts Bioconductor [13]
DEsingle DEsingle DEsingle 010 raw counts GitHub [14]
edgeRLRT edgeRLRT edgeR 3191 raw counts Bioconductor [15ndash17]
edgeRLRTcensus edgeRLRT edgeR 3191 Census counts Bioconductor [15ndash17]
edgeRLRTdeconvedgeRLRT with deconvolutionnormalization
edgeR 3191scran 120
raw counts Bioconductor [15 17 18]
edgeRLRTrobustedgeRLRT with robust disper-sion estimation
edgeR 3191 raw counts Bioconductor [15ndash17 19]
edgeRQLF edgeRQLF edgeR 3191 raw counts Bioconductor [15 16 20]
edgeRQLFDetRateedgeRQLF with cellular detec-tion rate as covariate
edgeR 3191 raw counts Bioconductor [15 16 20]
limmatrend limma-trend limma 33013 log2(CPM) Bioconductor [21 22]
MASTcpm MAST MAST 105 log2(CPM+1) Bioconductor [23]
MASTcpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(CPM+1) Bioconductor [23]
MASTtpm MAST MAST 105 log2(TPM+1) Bioconductor [23]
MASTtpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(TPM+1) Bioconductor [23]
metagenomeSeq metagenomeSeqmetagenomeSeq1160
raw counts Bioconductor [24]
monocle monocle (tobit) monocle 220 TPM Bioconductor [25]
monoclecensus monocle (Negative Binomial) monocle 220 Census counts Bioconductor [25 26]
monoclecount monocle (Negative Binomial) monocle 220 raw counts Bioconductor [25]
NODES NODESNODES0009010
raw countsAuthor-providedlink
[27]
ROTScpm ROTS ROTS 120 CPM Bioconductor [28 29]
ROTStpm ROTS ROTS 120 TPM Bioconductor [28 29]
ROTSvoom ROTS ROTS 120voom-transformedraw counts
Bioconductor [28 29]
SAMseq SAMseq samr 20 raw counts CRAN [30]
scDD scDD scDD 100 raw counts Bioconductor [31]
SCDE SCDE scde 220 raw counts Bioconductor [32]
SeuratBimod Seurat (bimod test) Seurat 1407 raw counts GitHub [33 34]
SeuratBimodnofiltSeurat (bimod test) without theinternal filtering
Seurat 1407 raw counts GitHub [33 34]
SeuratBimodIsExpr2Seurat (bimod test) with internalexpression threshold set to 2
Seurat 1407 raw counts GitHub [33 34]
SeuratTobit Seurat (tobit test) Seurat 1407 TPM GitHub [25 33]
ttest t-test stats (R v 33)TMM-normalizedTPM
CRAN [16 35]
voomlimma voom-limma limma 33013 raw counts Bioconductor [21 22]
Wilcoxon Wilcoxon test stats (R v 33)TMM-normalizedTPM
CRAN [16 36]
3
Nature Methods doi101038nmeth4612
Performance
Olga (NBIS) scRNA-seq DE May 2018 28 43
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE May 2018 29 43
- False positives (type I error) vs. false negatives (type II error)
- Sensitivity and specificity
- Precision and recall
[Figure: adapted from Wikipedia]
[Figure: Dal Molin, Baruzzo and Di Camillo 2017. Two conditions of 100 cells each, simulated with 10,000 genes, of which 2,000 were set to be DE (based on NB and bimodal distributions)]
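The error types and the measures derived from them can be computed from a simple confusion table. The sketch below assumes the simulation set-up quoted from Dal Molin, Baruzzo and Di Camillo 2017 (10,000 genes, of which 2,000 are truly DE). The numbers of calls made by the DE method (1,500 true positives and 400 false positives) are hypothetical and only serve as an illustration; the function name is likewise an assumption.

```python
def classification_metrics(tp, fp, fn, tn):
    """Summarise a confusion table of DE calls against the simulated truth."""
    return {
        "sensitivity": tp / (tp + fn),          # also called recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "false_positive_rate": fp / (fp + tn),  # type I error rate
        "false_negative_rate": fn / (fn + tp),  # type II error rate
    }

n_genes, n_true_de = 10_000, 2_000   # simulation set-up from the slide
tp, fp = 1_500, 400                  # hypothetical calls of one DE method
fn = n_true_de - tp                  # truly DE genes that were missed
tn = n_genes - n_true_de - fp        # non-DE genes correctly left uncalled

metrics = classification_metrics(tp, fp, fn, tn)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that precision depends on the proportion of truly DE genes, whereas sensitivity and specificity do not, which is one reason why both pairs of measures are usually reported.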
Consistency
[Figure: Miao et al. 2017]
And so much more
Soneson and Robinson 2018, "Bias, robustness and scalability in single-cell differential expression analysis":
- 36 statistical approaches for DE analysis, comparing the expression levels in two groups of cells
- based on 9 data sets with 11-21 separate instances (sample size effect)
- extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.
- conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets
Practicalities
Getting to know your data
Example data: 46,078 genes x 96 cells; 22,229 genes with no expression at all
[Figure: two histograms (y-axis: Frequency). Left panel x-axis: Read Counts. Right panel x-axis: 0 counts.]
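In the example data, 22,229 of 46,078 genes (about 48%) have no expression in any of the 96 cells. A quick look at the zeros can be sketched as follows. The toy count matrix and the function name are hypothetical; in practice the matrix would be read from the count table of the experiment.

```python
def summarize_zeros(counts):
    """counts: list of rows, one row per gene and one column per cell."""
    # Genes that are zero in every cell carry no information for DE testing
    n_all_zero = sum(1 for row in counts if all(c == 0 for c in row))
    # Number of cells with a zero count, per gene
    zeros_per_gene = [sum(1 for c in row if c == 0) for row in counts]
    return n_all_zero, zeros_per_gene

# Hypothetical toy matrix: 4 genes x 4 cells
toy_counts = [
    [0, 0, 0, 0],
    [5, 0, 12, 0],
    [0, 0, 0, 3],
    [7, 9, 4, 11],
]

n_all_zero, zeros_per_gene = summarize_zeros(toy_counts)
print("genes with no expression at all:", n_all_zero)
print("zero counts per gene:", zeros_per_gene)
```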
Choosing DE methods
[Figure: Soneson and Robinson 2018]
Remembering the bigger picture
[Figure: Stegle, Teichmann and Marioni 2015]
- QC filtering
- Cell-cycle phase
- Normalization of cell-specific biases
- Confounding factors, incl. batch effects
- Detection rate, i.e. the fraction of detected genes per cell
- Imputation strategies for dropout values
- What is pragmatic: programming language, platform, speed, collaborative workflows, etc.
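The detection rate mentioned above is simple to compute and is often passed to a DE method as a covariate. A minimal sketch, assuming a gene-by-cell count matrix and taking "detected" to mean a count above zero; the toy matrix and the function name are hypothetical.

```python
def detection_rate(counts):
    """Fraction of genes with a non-zero count, computed for each cell.

    counts: list of rows, one row per gene and one column per cell.
    """
    n_genes = len(counts)
    n_cells = len(counts[0])
    return [
        sum(1 for gene in range(n_genes) if counts[gene][cell] > 0) / n_genes
        for cell in range(n_cells)
    ]

# Hypothetical toy matrix: 4 genes x 4 cells
toy_counts = [
    [0, 0, 0, 0],
    [5, 0, 12, 0],
    [0, 0, 0, 3],
    [7, 9, 4, 11],
]

rates = detection_rate(toy_counts)
print(rates)
```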
Staying critical
Summary
What to remember from this hour?
https://www.menti.com & 70 52 87
Growing field
[Figure: Angerer et al. 2017]
https://www.scrna-tools.org/tools
[Figure: Zappia, Phipson and Oshlack 2018]
Summary
- scRNA-seq is a rapidly growing field
- DE is a common task, so many newer and better methods will be developed
- understanding basic statistical concepts enables one to think more like a statistician, and to choose and evaluate methods for a given data set
- staying critical, staying updated, staying connected
Bibliography
Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1–9. ISSN 1367-4803. doi:10.1093/bioinformatics/bty329
Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243–260. ISSN 2095-4697. doi:10.1007/s40484-016-0089-7
Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255–261. ISSN 1548-7105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612
Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN 1664-8021. doi:10.3389/fgene.2017.00062
Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data," no. May: 1–2. ISSN 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549
Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133–145
Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85–91. ISSN 2452-3100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004
Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573
Common methods
- Generic: parametric tests, e.g. t-test; non-parametric tests, e.g. Kruskal-Wallis
- RNA-seq based: edgeR, limma, DESeq2
- scRNA-seq specific: MAST, SCDE, Monocle, D3E, Pagoda
[Figure: Miao and Zhang 2016]
Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR, and Census counts with monocle.
Short name | Method | Software version | Input | Available from | Reference
--- | --- | --- | --- | --- | ---
BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11]
D3E | D3E | D3E 1.0 | raw counts | GitHub | [12]
DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13]
DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14]
edgeRLRT | edgeR/LRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17]
edgeRLRTcensus | edgeR/LRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17]
edgeRLRTdeconv | edgeR/LRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18]
edgeRLRTrobust | edgeR/LRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19]
edgeRQLF | edgeR/QLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
edgeRQLFDetRate | edgeR/QLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22]
MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24]
monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25]
monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26]
monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25]
NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27]
ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29]
ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29]
ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29]
SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30]
scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31]
SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32]
SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33]
ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35]
voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22]
Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36]
Source: Soneson and Robinson 2018, Nature Methods, doi:10.1038/nmeth.4612
More detailed examples
MAST
- uses a generalized linear hurdle model
- designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable
- the rate of expression Z and the level of expression Y are modeled for each gene g, where $Z_{ig}$ indicates whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
- a logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):
  $\operatorname{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$
  $\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix
- model parameters are fitted using an empirical Bayesian framework
- allows for a joint estimate of nuisance and treatment effects
- DE is determined using the likelihood ratio test
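To make the two-part structure of a hurdle model concrete, here is a minimal sketch of a likelihood ratio test for a single gene and two groups of cells. It is not the MAST implementation: it assumes the only covariate is group membership (so the maximum likelihood estimates are simple proportions and means), it omits the empirical Bayes regularization, and it uses the asymptotic chi-square approximation with 2 degrees of freedom (one per model part). All function names are hypothetical.

```python
import math

def bernoulli_loglik(n_detected, n_cells):
    """Maximised log-likelihood of the discrete part (detected or not)."""
    if n_cells == 0 or n_detected in (0, n_cells):
        return 0.0
    p = n_detected / n_cells
    return n_detected * math.log(p) + (n_cells - n_detected) * math.log(1 - p)

def residual_ss(values):
    """Sum of squared deviations from the mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def gaussian_loglik(rss, n):
    """Maximised Gaussian log-likelihood given a residual sum of squares."""
    if n == 0:
        return 0.0
    sigma2 = max(rss / n, 1e-12)  # guard against a zero variance estimate
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)

def hurdle_lrt(expr_a, expr_b):
    """LRT for one gene; values are log-expression, 0 means not detected."""
    pos_a = [y for y in expr_a if y > 0]
    pos_b = [y for y in expr_b if y > 0]
    # Discrete part: group-specific detection rates vs. one common rate
    ll_d_alt = (bernoulli_loglik(len(pos_a), len(expr_a))
                + bernoulli_loglik(len(pos_b), len(expr_b)))
    ll_d_null = bernoulli_loglik(len(pos_a) + len(pos_b),
                                 len(expr_a) + len(expr_b))
    # Continuous part (detected cells only): group means vs. one common mean,
    # with a variance shared between the groups
    n_pos = len(pos_a) + len(pos_b)
    ll_c_alt = gaussian_loglik(residual_ss(pos_a) + residual_ss(pos_b), n_pos)
    ll_c_null = gaussian_loglik(residual_ss(pos_a + pos_b), n_pos)
    stat = max(0.0, 2 * ((ll_d_alt - ll_d_null) + (ll_c_alt - ll_c_null)))
    p_value = math.exp(-stat / 2)  # chi-square survival function for 2 df
    return stat, p_value

stat, p_value = hurdle_lrt([0, 0, 0, 0, 0, 1.0], [3.0, 3.2, 2.9, 3.1, 3.0, 0])
print(f"LRT statistic = {stat:.2f}, p-value = {p_value:.2e}")
```

The gene in the usage example differs between the groups both in how often it is detected and in its level when detected, so both parts of the model contribute to the test statistic.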
SCDE
- models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution
- the NB distribution models the transcripts that are amplified and detected
- the Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)
- a subset of robust genes is used to fit, via the EM algorithm, the parameters of the mixture of models
- for DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach
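The idea of a two-component mixture can be illustrated with the posterior probability that an observed count comes from the dropout (Poisson) component rather than from the amplified (NB) component, which is the quantity an EM algorithm evaluates in its E-step. This is a sketch under stated assumptions, not the scde implementation: the mixture weight and all distribution parameters below are hypothetical, and the function names are illustrative.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability of observing count k with rate lam > 0."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def nb_pmf(k, mean, size):
    """Negative binomial probability, parameterised by mean and size."""
    p = size / (size + mean)
    log_pmf = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
               + size * math.log(p) + k * math.log(1 - p))
    return math.exp(log_pmf)

def dropout_posterior(k, pi_drop, lam, mean, size):
    """Posterior probability that count k belongs to the dropout component."""
    drop = pi_drop * poisson_pmf(k, lam)
    amplified = (1 - pi_drop) * nb_pmf(k, mean, size)
    return drop / (drop + amplified)

# Hypothetical parameters: 30% dropouts with background rate 0.1,
# amplified counts with mean 50 and size 2
for count in (0, 1, 5, 40):
    prob = dropout_posterior(count, pi_drop=0.3, lam=0.1, mean=50, size=2)
    print(f"count {count:2d}: P(dropout | count) = {prob:.3g}")
```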
Monocle
- originally designed for ordering cells by progress through differentiation stages (pseudo-time)
- the mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as
  $g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$
  where $Y$ is a specific gene expression level, $x_i$ are predictor variables, $g$ is a link function (typically the log function), and $f_i$ are non-parametric functions (e.g. cubic splines)
- the observable expression level Y is then modeled using a GAM:
  $E(Y) = s(\varphi_t(b_x, s_i)) + \epsilon$
  where $\varphi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and $s$ is a cubic smoothing function with three degrees of freedom; the error term $\epsilon$ is normally distributed with a mean of zero
- the DE test is performed using an approximate $\chi^2$ likelihood ratio test
Let's stop for a minute
The key
$\text{Outcome}_i = (\text{Model}_i) + \text{error}_i$
- we collect data on a sample from a much larger population
- statistics lets us make inferences about the population from which the sample was derived
- we try to predict the outcome given a model fitted to the data
The key
$t = \dfrac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$
[Figure: histogram of height [cm] (y-axis: Frequency)]
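The two-sample t statistic can be computed in a few lines. In the sketch below, $s_p$ is taken to be the standard pooled standard deviation, $s_p = \sqrt{\frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}}$, which the slide does not spell out. The height values and the function name are hypothetical.

```python
import math

def pooled_t_statistic(x1, x2):
    """Two-sample t statistic with a pooled standard deviation."""
    n1, n2 = len(x1), len(x2)
    mean1, mean2 = sum(x1) / n1, sum(x2) / n2
    # Sums of squared deviations from each group mean
    ss1 = sum((v - mean1) ** 2 for v in x1)
    ss2 = sum((v - mean2) ** 2 for v in x2)
    s_pooled = math.sqrt((ss1 + ss2) / (n1 + n2 - 2))
    return (mean1 - mean2) / (s_pooled * math.sqrt(1 / n1 + 1 / n2))

# Hypothetical heights [cm] in two groups
group_1 = [170, 172, 174]
group_2 = [176, 178, 180]
t_stat = pooled_t_statistic(group_1, group_2)
print(f"t = {t_stat:.4f} on {len(group_1) + len(group_2) - 2} degrees of freedom")
```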
Generic recipe
- model data, e.g. gene expression
- fit model to the data and/or data to the model
- estimate model parameters
- use model for prediction and/or inference
MAST (revisited)
- uses a generalized linear hurdle model
- designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable
- the rate of expression Z and the level of expression Y are modeled for each gene g, where $Z_{ig}$ indicates whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
- a logistic regression model is used for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):
  $\operatorname{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$
  $\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix
- model parameters are fitted using an empirical Bayesian framework
- allows for a joint estimate of nuisance and treatment effects
- DE is determined using the likelihood ratio test
Generic recipe
- model, e.g. gene expression with random error
- fit model to the data and/or data to the model, estimate model parameters
- use model for prediction and/or inference
Important implication: the better the model fits the data, the better the statistics
Common distributions
[Figure: histograms of read counts (x-axis: Read Counts, y-axis: Frequency) for three distributions: Negative Binomial, Zero-inflated NB, Poisson-Beta]
[Figure: three further Negative Binomial histograms of read counts (x-axis: Read Counts, y-axis: Frequency)]
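The three distributions named in the figure can be simulated with the Python standard library alone, which helps to build intuition for their shapes. This is a sketch under stated assumptions: the NB is drawn as a gamma-Poisson mixture, the zero-inflated NB adds extra zeros with a fixed probability, and the Poisson-Beta draws a Poisson count whose rate is a scaled Beta variable. All parameter values and function names are hypothetical.

```python
import math
import random

def rpois(lam, rng):
    """Poisson draw by Knuth's multiplication method (fine for moderate lam)."""
    if lam <= 0:
        return 0
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def rnbinom(mean, size, rng):
    """Negative binomial draw as a gamma-Poisson mixture."""
    # gammavariate(shape, scale): the rate has mean = size * (mean / size)
    return rpois(rng.gammavariate(size, mean / size), rng)

def rzinb(mean, size, pi_zero, rng):
    """Zero-inflated NB: an extra zero with probability pi_zero."""
    return 0 if rng.random() < pi_zero else rnbinom(mean, size, rng)

def rpoisbeta(alpha, beta, scale, rng):
    """Poisson-Beta: Poisson count with rate scale * Beta(alpha, beta)."""
    return rpois(scale * rng.betavariate(alpha, beta), rng)

rng = random.Random(2018)
samples = {
    "Negative Binomial": [rnbinom(10, 2, rng) for _ in range(1000)],
    "Zero-inflated NB": [rzinb(10, 2, 0.5, rng) for _ in range(1000)],
    "Poisson-Beta": [rpoisbeta(0.5, 0.5, 50, rng) for _ in range(1000)],
}
for name, values in samples.items():
    zero_fraction = sum(1 for v in values if v == 0) / len(values)
    print(f"{name}: mean = {sum(values) / len(values):.1f}, "
          f"fraction of zeros = {zero_fraction:.2f}")
```

Comparing the fraction of zeros and the spread of the simulated counts with those of real data is one informal way to judge which distribution is a plausible model for a given gene.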
Common methods
Miao and Zhang 2016
Olga (NBIS) scRNA-seq DE May 2018 13 43
Common methodsSupplementary Table 2 Evaluated dicrarrerential expression methods together with package versions and thetype of input values provided to each of them Note that ldquoraw countsrdquo here refers to length-scaled TPMswhich are on the scale of the raw counts but are unacrarrected by dicrarrerential isoform usage [10] CPM valuesare calculated with edgeR and Census counts with monocle
Short name Method Software version InputAvailablefrom
Reference
BPSC BPSC BPSC 09901 CPM GitHub [11]
D3E D3E D3E 10 raw counts GitHub [12]
DESeq2 DESeq2 DESeq2 1141 raw counts Bioconductor [13]
DESeq2betapFALSE DESeq2 without beta prior DESeq2 1141 raw counts Bioconductor [13]
DESeq2census DESeq2 DESeq2 1141 Census counts Bioconductor [13]
DESeq2nofiltDESeq2 without the built-in in-dependent filtering
DESeq2 1141 raw counts Bioconductor [13]
DEsingle DEsingle DEsingle 010 raw counts GitHub [14]
edgeRLRT edgeRLRT edgeR 3191 raw counts Bioconductor [15ndash17]
edgeRLRTcensus edgeRLRT edgeR 3191 Census counts Bioconductor [15ndash17]
edgeRLRTdeconvedgeRLRT with deconvolutionnormalization
edgeR 3191scran 120
raw counts Bioconductor [15 17 18]
edgeRLRTrobustedgeRLRT with robust disper-sion estimation
edgeR 3191 raw counts Bioconductor [15ndash17 19]
edgeRQLF edgeRQLF edgeR 3191 raw counts Bioconductor [15 16 20]
edgeRQLFDetRateedgeRQLF with cellular detec-tion rate as covariate
edgeR 3191 raw counts Bioconductor [15 16 20]
limmatrend limma-trend limma 33013 log2(CPM) Bioconductor [21 22]
MASTcpm MAST MAST 105 log2(CPM+1) Bioconductor [23]
MASTcpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(CPM+1) Bioconductor [23]
MASTtpm MAST MAST 105 log2(TPM+1) Bioconductor [23]
MASTtpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(TPM+1) Bioconductor [23]
metagenomeSeq metagenomeSeqmetagenomeSeq1160
raw counts Bioconductor [24]
monocle monocle (tobit) monocle 220 TPM Bioconductor [25]
monoclecensus monocle (Negative Binomial) monocle 220 Census counts Bioconductor [25 26]
monoclecount monocle (Negative Binomial) monocle 220 raw counts Bioconductor [25]
NODES NODESNODES0009010
raw countsAuthor-providedlink
[27]
ROTScpm ROTS ROTS 120 CPM Bioconductor [28 29]
ROTStpm ROTS ROTS 120 TPM Bioconductor [28 29]
ROTSvoom ROTS ROTS 120voom-transformedraw counts
Bioconductor [28 29]
SAMseq SAMseq samr 20 raw counts CRAN [30]
scDD scDD scDD 100 raw counts Bioconductor [31]
SCDE SCDE scde 220 raw counts Bioconductor [32]
SeuratBimod Seurat (bimod test) Seurat 1407 raw counts GitHub [33 34]
SeuratBimodnofiltSeurat (bimod test) without theinternal filtering
Seurat 1407 raw counts GitHub [33 34]
SeuratBimodIsExpr2Seurat (bimod test) with internalexpression threshold set to 2
Seurat 1407 raw counts GitHub [33 34]
SeuratTobit Seurat (tobit test) Seurat 1407 TPM GitHub [25 33]
ttest t-test stats (R v 33)TMM-normalizedTPM
CRAN [16 35]
voomlimma voom-limma limma 33013 raw counts Bioconductor [21 22]
Wilcoxon Wilcoxon test stats (R v 33)TMM-normalizedTPM
CRAN [16 36]
3
Nature Methods doi101038nmeth4612Soneson and Robinson 2018
Olga (NBIS) scRNA-seq DE May 2018 14 43
Common methods More detailed examples
More detailed examples
Olga (NBIS) scRNA-seq DE May 2018 15 43
Common methods More detailed examples
MAST
uses generalized linear hurdle model
designed to account for stochastic dropouts and bimodal expression distribution in whichexpression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene gindicating whether gene g is expressed in cell i (ie Zig = 0 if yig = 0 and zig = 1 ifyig gt 0)
A logistic regression model for the discrete variable Z and a Gaussian linear model for thecontinuous variable (Y|Z=1)
logit(Pr (Zig = 1)) = XiβDg
Pr (Yig = Y |Zig = 1) = N(XiβCg σ
2g) where Xi is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
Olga (NBIS) scRNA-seq DE May 2018 16 43
Common methods More detailed examples
SCDE
models the read counts for each gene using a mixture of a NB negative binomial and aPoisson distribution
NB distribution models the transcripts that are amplified and detected
Poisson distribution models the unobserved or background-level signal of transcripts thatare not amplified (eg dropout events)
subset of robust genes is used to fit via EM algorithm the parameters to the mixture ofmodels
For DE the posterior probability that the gene shows a fold expression difference betweentwo conditions is computed using a Bayesian approach
Olga (NBIS) scRNA-seq DE May 2018 17 43
Common methods More detailed examples
Monocole
Originally designed for ordering cells by progress through differentiation stages(pseudo-time)
The mean expression level of each gene is modeled with a GAM generalized additivemodel which relates one or more predictor variables to a response variable as
g(E(Y )) = β0 + f1(x1) + f2(x2) + + fm(xm) where Y is a specific gene expression level xi arepredictor variables g is a link function typically log function and fi are non-parametric functions
(eg cubic splines)
The observable expression level Y is then modeled using GAM
E(Y ) = s(ϕt (bx si )) + ε where ϕt (bx si ) is the assigned pseudo-time of a cell and s is a cubicsmoothing function with three degrees of freedom The error term ε is normally distributed with amean of zero
The DE test is performed using an approx χ2 likelihood ratio test
Olga (NBIS) scRNA-seq DE May 2018 18 43
Common methods More detailed examples
Letrsquos stop for a minute
Olga (NBIS) scRNA-seq DE May 2018 19 43
Common methods More detailed examples
The key
Outcomei = (Modeli) + errori
we collect data on a sample from a much larger population
statistics lets us to make inferences about the population from which sample wasderived
we try to predict the outcome given a model fitted to the data
Olga (NBIS) scRNA-seq DE May 2018 20 43
Common methods More detailed examples
The key
t = x1minusx2
sp
radic1
n1+ 1
n2
height [cm]
Fre
quen
cy
165 170 175 180
010
3050
Olga (NBIS) scRNA-seq DE May 2018 21 43
Common methods More detailed examples
Generic recipemodel data eg gene expressionfit model to the data andor data to the modelestimate model parametersuse model for prediction andor inference
Olga (NBIS) scRNA-seq DE May 2018 22 43
Common methods More detailed examples
MAST (revisited)
uses generalized linear hurdle model
designed to account for stochastic dropouts and bimodal expression distribution in whichexpression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene gindicating whether gene g is expressed in cell i (ie Zig = 0 if yig = 0 and zig = 1 ifyig gt 0)
A logistic regression model for the discrete variable Z and a Gaussian linear model for thecontinuous variable (Y|Z=1)
logit(Pr (Zig = 1)) = XiβDg
Pr (Yig = Y |Zig = 1) = N(XiβCg σ
2g) where Xi is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
Olga (NBIS) scRNA-seq DE May 2018 23 43
Common methods More detailed examples
Generic recipemodel eg gene expression with random errorfit model to the data andor data to the model estimate modelparametersuse model for prediction andor inference
Important implicationthe better model fits to the data the better statistics
Olga (NBIS) scRNA-seq DE May 2018 24 43
Common methods More detailed examples
Common distributions
Negative Binomial
Read Counts
Fre
quen
cy
5 10 15 20
020
4060
8010
012
0
Zerominusinflated NB
Read Counts
Fre
quen
cy
0 5 10 15 20
010
020
030
040
050
0
PoissonminusBeta
Read Counts
Fre
quen
cy
0 20 40 60 80 120
010
020
030
0
Olga (NBIS) scRNA-seq DE May 2018 25 43
Common methods More detailed examples
Common distributions
Negative Binomial
Read Counts
Fre
quen
cy
0 5 10 15 20
010
020
030
040
050
0
Negative Binomial
Read Counts
Fre
quen
cy
0 2 4 6 8 10
050
100
150
Negative Binomial
Read Counts
Fre
quen
cy
0 2 4 6 8 10 12
050
100
150
200
Olga (NBIS) scRNA-seq DE May 2018 26 43
Performance
Performance
Olga (NBIS) scRNA-seq DE May 2018 27 43
Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle.
Short name | Method | Software version | Input | Available from | Reference
BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11]
D3E | D3E | D3E 1.0 | raw counts | GitHub | [12]
DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13]
DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14]
edgeRLRT | edgeR/LRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17]
edgeRLRTcensus | edgeR/LRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17]
edgeRLRTdeconv | edgeR/LRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18]
edgeRLRTrobust | edgeR/LRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19]
edgeRQLF | edgeR/QLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
edgeRQLFDetRate | edgeR/QLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22]
MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24]
monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25]
monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26]
monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25]
NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27]
ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29]
ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29]
ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29]
SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30]
scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31]
SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32]
SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33]
ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35]
voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22]
Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36]
Nature Methods, doi:10.1038/nmeth.4612
No ground truth, i.e. no independently validated truth is available for testing
Known data: using data we know something about to get positive controls
Simulated data: null data sets by re-sampling; modeling data sets based on various distributions
Comparing between methods and scenarios: comparing numbers of DE genes, incl. as a function of group size
Investigating results: what do the expression and distributions of the detected DE genes look like?
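The "null data sets by re-sampling" idea can be sketched as follows: cells from one and the same condition are split at random into two mock groups, so any gene called DE is a false positive by construction. The example is an assumption-laden illustration, not the procedure of any cited paper: it uses a Wilcoxon rank-sum test with a normal approximation (no tie correction), raw p-values without multiple-testing adjustment, and hypothetical function names.

```python
import math
import random


def rank_sum_p(a, b):
    """Two-sided Wilcoxon rank-sum p-value, normal approximation, no tie correction."""
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        # Extend j over a run of tied values and give them the average rank.
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2.0 + 1.0
        i = j + 1
    n_a, n_b = len(a), len(b)
    rank_sum_a = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u = rank_sum_a - n_a * (n_a + 1) / 2.0
    mean_u = n_a * n_b / 2.0
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12.0)
    z = (u - mean_u) / sd_u
    return math.erfc(abs(z) / math.sqrt(2.0))


def null_false_positive_rate(expr, n_splits, alpha, rng):
    """Fraction of (gene, split) tests with p < alpha on mock two-group splits.

    expr: list of genes, each a list of values for cells from ONE condition.
    """
    n_cells = len(expr[0])
    calls = 0
    tests = 0
    for _ in range(n_splits):
        cells = list(range(n_cells))
        rng.shuffle(cells)
        half = n_cells // 2
        idx_a, idx_b = cells[:half], cells[half:]
        for gene in expr:
            p = rank_sum_p([gene[c] for c in idx_a], [gene[c] for c in idx_b])
            if p < alpha:
                calls += 1
            tests += 1
    return calls / tests


# Toy null data: 200 genes x 40 cells, all from the same distribution.
rng = random.Random(0)
toy_expr = [[rng.gauss(0.0, 1.0) for _ in range(40)] for _ in range(200)]
fpr = null_false_positive_rate(toy_expr, n_splits=5, alpha=0.05, rng=rng)
```

For a well-calibrated test the returned fraction should be close to `alpha`; a method that reports many more calls on such mock splits is producing false positives.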
False positives (type I error) vs false negatives (type II error)
Sensitivity and specificity
Precision and recall
[adapted from Wikipedia]
Dal Molin, Baruzzo and Di Camillo 2017: 2 conditions of 100 cells each, simulated with 10 000 genes, out of which 2 000 were set to be DE (based on NB and bimodal distributions)
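When the truth is known, as in a simulation, these quantities follow directly from the confusion matrix. A small sketch, using a setting of the same size as the simulation above (10 000 genes, 2 000 truly DE); the set of genes "called" by the method is made up for illustration and the function name is hypothetical.

```python
def performance(called, truth, all_genes):
    """Sensitivity (= recall), specificity and precision from sets of gene ids."""
    called, truth, all_genes = set(called), set(truth), set(all_genes)
    tp = len(called & truth)              # true positives
    fp = len(called - truth)              # false positives (type I errors)
    fn = len(truth - called)              # false negatives (type II errors)
    tn = len(all_genes - called - truth)  # true negatives
    nan = float("nan")
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else nan,
        "specificity": tn / (tn + fp) if tn + fp else nan,
        "precision": tp / (tp + fp) if tp + fp else nan,
    }


all_genes = range(10000)
truly_de = range(2000)
# Hypothetical method output: 1 500 true DE genes found plus 400 false calls.
called_de = list(range(1500)) + list(range(2000, 2400))
metrics = performance(called_de, truly_de, all_genes)
```

Sensitivity and recall are the same quantity; precision differs from specificity because it is computed relative to the genes called, not relative to the genes that are truly unchanged.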
Consistency
Miao et al. 2017
And so much more
Soneson and Robinson 2018: Bias, robustness and scalability in single-cell differential expression analysis
36 statistical approaches for DE analysis to compare the expression levels in the two groups of cells
based on 9 data sets with 11–21 separate instances (sample size effect)
extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.
conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets
Practicalities
Getting to know your data
Example data: 46078 genes x 96 cells
22229 genes with no expression at all
Figure: two histograms (y-axis: frequency), one of read counts and one of 0 counts
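A first look at the data of the kind summarised above (how many genes are never detected, how many zeros there are per gene and per cell) needs very little code. A minimal sketch on a made-up toy matrix; the function name and the list-of-lists layout are assumptions, and real data would normally sit in a dedicated matrix object.

```python
def zero_summary(counts):
    """Summarise zeros in a genes x cells matrix given as a list of per-gene lists."""
    n_genes = len(counts)
    n_cells = len(counts[0]) if counts else 0
    zeros_per_gene = [sum(1 for c in gene if c == 0) for gene in counts]
    zeros_per_cell = [sum(1 for gene in counts if gene[j] == 0)
                      for j in range(n_cells)]
    return {
        "n_genes": n_genes,
        "n_cells": n_cells,
        # Genes with a zero count in every cell, i.e. no expression at all.
        "undetected_genes": sum(1 for z in zeros_per_gene if z == n_cells),
        "zeros_per_gene": zeros_per_gene,
        "zeros_per_cell": zeros_per_cell,
    }


# Toy matrix: 4 genes (rows) x 4 cells (columns), values made up.
toy_counts = [
    [0, 0, 0, 0],
    [5, 0, 12, 0],
    [0, 0, 0, 0],
    [3, 1, 0, 7],
]
summary = zero_summary(toy_counts)
```

Histograms of `zeros_per_gene` and of the raw counts correspond to the two panels described for the example data.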
Choosing DE methods
Soneson and Robinson 2018
Remembering the bigger picture
Stegle, Teichmann and Marioni 2015
QC filtering
Cell-cycle phase
Normalization of cell-specific biases
Confounding factors, incl. batch effects
Detection rate, i.e. the fraction of detected genes per cell
Imputation strategies for dropout values
What is pragmatic: programming language, platform, speed, collaborative workflows, etc.
Staying critical
Summary
What to remember from this hour
https://www.menti.com & 70 52 87
Growing field
Angerer et al. 2017
https://www.scrna-tools.org/tools
Zappia, Phipson and Oshlack 2018
scRNA-seq is a rapidly growing field
DE is a common task, so many newer and better methods will be developed
understanding basic statistical concepts enables one to think more like a statistician, to choose and evaluate methods given a data set
staying critical, staying updated, staying connected
Bibliography
Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1–9. ISSN: 1367-4803. doi:10.1093/bioinformatics/bty329.
Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243–260. ISSN: 20954697. doi:10.1007/s40484-016-0089-7.
Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255–261. ISSN: 15487105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612.
Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN: 16648021. doi:10.3389/fgene.2017.00062.
Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data," no. May: 1–2. ISSN: 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549.
Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature reviews. Genetics 16 (January 2014): 133–145.
Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85–91. ISSN: 24523100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004.
Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv: 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573.
Common methods More detailed examples
More detailed examples
SCDE
models the read counts for each gene using a mixture of a negative binomial (NB) and a Poisson distribution
the NB distribution models the transcripts that are amplified and detected
the Poisson distribution models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)
a subset of robust genes is used to fit, via the EM algorithm, the parameters of the mixture of models
For DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a Bayesian approach
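The mixture idea can be made concrete with one step of the kind used inside an EM fit: given mixture parameters, compute the posterior probability that an observed count comes from the Poisson (dropout or background) component rather than from the NB (amplified) component. This is a sketch only and not the scde package's error model; the function names and all parameter values are hypothetical.

```python
import math


def poisson_logpmf(k, lam):
    """Log probability of count k under Poisson(lam)."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def nb_logpmf(k, mu, size):
    """Log probability of count k under a negative binomial with mean mu and dispersion size."""
    return (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
            + size * math.log(size / (size + mu))
            + k * math.log(mu / (size + mu)))


def dropout_posterior(k, pi_dropout, lam_background, mu, size):
    """Posterior probability that count k belongs to the Poisson (dropout) component."""
    log_drop = math.log(pi_dropout) + poisson_logpmf(k, lam_background)
    log_amp = math.log(1.0 - pi_dropout) + nb_logpmf(k, mu, size)
    # Subtract the larger log-weight before exponentiating to avoid overflow.
    top = max(log_drop, log_amp)
    w_drop = math.exp(log_drop - top)
    w_amp = math.exp(log_amp - top)
    return w_drop / (w_drop + w_amp)


# Hypothetical parameters: 30% dropout weight, background rate 0.1,
# amplified component with mean 40 and dispersion 2.
p_zero = dropout_posterior(0, 0.3, 0.1, 40.0, 2.0)
p_high = dropout_posterior(50, 0.3, 0.1, 40.0, 2.0)
```

With these made-up parameters a zero count is attributed almost entirely to the dropout component, while a count of 50 is attributed to the amplified component.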
Monocle
Originally designed for ordering cells by progress through differentiation stages (pseudo-time)
The mean expression level of each gene is modeled with a generalized additive model (GAM), which relates one or more predictor variables to a response variable as
$g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$, where $Y$ is a specific gene expression level, $x_i$ are predictor variables, $g$ is a link function (typically the log function) and $f_i$ are non-parametric functions (e.g. cubic splines)
The observable expression level $Y$ is then modeled using the GAM
$E(Y) = s(\phi_t(b_x, s_i)) + \epsilon$, where $\phi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and $s$ is a cubic smoothing function with three degrees of freedom. The error term $\epsilon$ is normally distributed with a mean of zero
The DE test is performed using an approximate $\chi^2$ likelihood ratio test
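For reference, the general form of a likelihood ratio test is written out below. The notation is an assumption introduced here, not taken from the slides: $\ell$ is the maximised log-likelihood, the "full" model contains the smooth pseudo-time term, the "reduced" model is assumed to omit it, and $p$ counts the fitted parameters of each model.

```latex
\Lambda = 2\left[\ell_{\text{full}} - \ell_{\text{reduced}}\right]
\;\overset{\text{approx.}}{\sim}\; \chi^2_{d},
\qquad d = p_{\text{full}} - p_{\text{reduced}}
```

A gene is called DE along pseudo-time when $\Lambda$ is large relative to the $\chi^2_d$ distribution, i.e. when adding the smooth term improves the fit more than expected by chance.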
Let's stop for a minute
The key
$\text{Outcome}_i = (\text{Model}_i) + \text{error}_i$
we collect data on a sample from a much larger population
statistics lets us make inferences about the population from which the sample was derived
we try to predict the outcome given a model fitted to the data
The key
$t = \dfrac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$
Figure: histogram of height [cm] (y-axis: frequency)
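The two-sample t statistic above is easy to compute by hand. The sketch below assumes that $s_p$ is the usual pooled standard deviation (the slide does not define it), and the height values are made up for illustration.

```python
import math
import statistics


def pooled_t(x1, x2):
    """Two-sample t statistic with pooled standard deviation s_p."""
    n1, n2 = len(x1), len(x2)
    # Pooled variance: sample variances weighted by their degrees of freedom.
    pooled_var = ((n1 - 1) * statistics.variance(x1)
                  + (n2 - 1) * statistics.variance(x2)) / (n1 + n2 - 2)
    s_p = math.sqrt(pooled_var)
    return (statistics.mean(x1) - statistics.mean(x2)) / (s_p * math.sqrt(1.0 / n1 + 1.0 / n2))


# Made-up heights [cm] for two groups.
group_1 = [170, 172, 168, 174]
group_2 = [176, 178, 180, 182]
t_stat = pooled_t(group_1, group_2)
```

The statistic is the "model" part (difference in group means) scaled by the "error" part (pooled variability), which is the Outcome = Model + error idea in its simplest form.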
Common methods More detailed examples
More detailed examples
Olga (NBIS) scRNA-seq DE May 2018 15 43
Common methods More detailed examples
MAST
uses generalized linear hurdle model
designed to account for stochastic dropouts and bimodal expression distribution in whichexpression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene gindicating whether gene g is expressed in cell i (ie Zig = 0 if yig = 0 and zig = 1 ifyig gt 0)
A logistic regression model for the discrete variable Z and a Gaussian linear model for thecontinuous variable (Y|Z=1)
logit(Pr (Zig = 1)) = XiβDg
Pr (Yig = Y |Zig = 1) = N(XiβCg σ
2g) where Xi is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
Olga (NBIS) scRNA-seq DE May 2018 16 43
Common methods More detailed examples
SCDE
models the read counts for each gene using a mixture of a NB negative binomial and aPoisson distribution
NB distribution models the transcripts that are amplified and detected
Poisson distribution models the unobserved or background-level signal of transcripts thatare not amplified (eg dropout events)
subset of robust genes is used to fit via EM algorithm the parameters to the mixture ofmodels
For DE the posterior probability that the gene shows a fold expression difference betweentwo conditions is computed using a Bayesian approach
Olga (NBIS) scRNA-seq DE May 2018 17 43
Common methods More detailed examples
Monocole
Originally designed for ordering cells by progress through differentiation stages(pseudo-time)
The mean expression level of each gene is modeled with a GAM generalized additivemodel which relates one or more predictor variables to a response variable as
g(E(Y )) = β0 + f1(x1) + f2(x2) + + fm(xm) where Y is a specific gene expression level xi arepredictor variables g is a link function typically log function and fi are non-parametric functions
(eg cubic splines)
The observable expression level Y is then modeled using GAM
E(Y ) = s(ϕt (bx si )) + ε where ϕt (bx si ) is the assigned pseudo-time of a cell and s is a cubicsmoothing function with three degrees of freedom The error term ε is normally distributed with amean of zero
The DE test is performed using an approx χ2 likelihood ratio test
Olga (NBIS) scRNA-seq DE May 2018 18 43
Common methods More detailed examples
Letrsquos stop for a minute
Olga (NBIS) scRNA-seq DE May 2018 19 43
Common methods More detailed examples
The key
Outcomei = (Modeli) + errori
we collect data on a sample from a much larger population
statistics lets us to make inferences about the population from which sample wasderived
we try to predict the outcome given a model fitted to the data
Olga (NBIS) scRNA-seq DE May 2018 20 43
Common methods More detailed examples
The key
t = x1minusx2
sp
radic1
n1+ 1
n2
height [cm]
Fre
quen
cy
165 170 175 180
010
3050
Olga (NBIS) scRNA-seq DE May 2018 21 43
Common methods More detailed examples
Generic recipemodel data eg gene expressionfit model to the data andor data to the modelestimate model parametersuse model for prediction andor inference
Olga (NBIS) scRNA-seq DE May 2018 22 43
Common methods More detailed examples
MAST (revisited)
uses a generalized linear hurdle model
designed to account for stochastic dropouts and a bimodal expression distribution, in which expression is either strongly non-zero or non-detectable
The rate of expression $Z$ and the level of expression $Y$ are modeled for each gene $g$, with $Z$ indicating whether gene $g$ is expressed in cell $i$ (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
A logistic regression model is used for the discrete variable $Z$ and a Gaussian linear model for the continuous variable $(Y \mid Z = 1)$
$\text{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$
$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
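The two-part structure of a hurdle model can be sketched for a single gene and a two-group design. This is a simplified illustration, not MAST itself: it has no empirical Bayes shrinkage and no covariates, and it assumes each group has at least two detected cells with non-identical values. The discrete part compares group-specific against pooled detection rates, the continuous part compares group-specific against pooled means of the detected values, and the two likelihood ratio statistics are added and compared against a $\chi^2$ distribution with two degrees of freedom. Function names and data are hypothetical.

```python
import math

def binom_loglik(k, n):
    """Maximised Bernoulli log-likelihood for k detections among n cells."""
    ll = 0.0
    if k > 0:
        ll += k * math.log(k / n)
    if k < n:
        ll += (n - k) * math.log((n - k) / n)
    return ll

def sum_sq(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def hurdle_lrt(group_a, group_b):
    """Combined LRT for one gene: detection rate (discrete) plus expression level (continuous)."""
    pos_a = [v for v in group_a if v > 0]
    pos_b = [v for v in group_b if v > 0]

    # Discrete part: is the gene detected at different rates in the two groups?
    ll_full = binom_loglik(len(pos_a), len(group_a)) + binom_loglik(len(pos_b), len(group_b))
    ll_null = binom_loglik(len(pos_a) + len(pos_b), len(group_a) + len(group_b))
    stat_discrete = 2 * (ll_full - ll_null)

    # Continuous part: Gaussian model on detected values only
    pooled = pos_a + pos_b
    stat_continuous = len(pooled) * math.log(sum_sq(pooled) / (sum_sq(pos_a) + sum_sq(pos_b)))

    # One degree of freedom per part, so the sum is referred to chi-square with 2 df,
    # whose upper tail probability is exp(-x / 2)
    stat = max(stat_discrete + stat_continuous, 0.0)
    return stat, math.exp(-stat / 2)

# Hypothetical log-scale expression values for one gene (0 = not detected)
group_a = [0, 0, 0, 1.2, 0, 1.0, 0, 0.8]
group_b = [2.1, 2.5, 0, 2.3, 1.9, 2.2, 2.4, 2.0]
stat, p_value = hurdle_lrt(group_a, group_b)
print(f"LRT statistic = {stat:.2f}, p = {p_value:.3g}")
```

In this toy example both parts contribute: the gene is detected in 3 of 8 cells in one group and 7 of 8 in the other, and the detected values are also higher in the second group.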
Generic recipe
model, e.g. gene expression with random error
fit the model to the data and/or the data to the model, estimate the model parameters
use the model for prediction and/or inference
Important implication: the better the model fits the data, the better the statistics
Common distributions
[Figure: histograms of read counts (x-axis) against frequency (y-axis) for three distributions: Negative Binomial, Zero-inflated NB and Poisson-Beta]
[Figure: three Negative Binomial histograms of read counts (x-axis) against frequency (y-axis)]
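To make the difference between the negative binomial and its zero-inflated version concrete, the probability mass functions can be written out directly. The sketch below uses the mean/size parameterisation (variance = mean + mean^2 / size); the parameter values and the function names are illustrative assumptions, not values taken from the slides.

```python
import math

def nb_pmf(k, mean, size):
    """Negative binomial pmf with mean `mean` and dispersion `size`
    (variance = mean + mean**2 / size)."""
    log_p = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
             + size * math.log(size / (size + mean))
             + k * math.log(mean / (size + mean)))
    return math.exp(log_p)

def zinb_pmf(k, mean, size, pi_zero):
    """Zero-inflated NB: with probability pi_zero the count is a structural zero,
    otherwise it is drawn from the negative binomial."""
    base = (1 - pi_zero) * nb_pmf(k, mean, size)
    return pi_zero + base if k == 0 else base

# Illustrative parameters
mean, size, pi_zero = 5.0, 2.0, 0.3
print(f"P(count = 0) under NB:   {nb_pmf(0, mean, size):.3f}")
print(f"P(count = 0) under ZINB: {zinb_pmf(0, mean, size, pi_zero):.3f}")
```

The extra mass at zero is what the zero-inflated model adds; for all non-zero counts the shape is the negative binomial scaled by (1 - pi_zero).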
Performance
Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR, and Census counts with monocle.

| Short name | Method | Software version | Input | Available from | Reference |
|---|---|---|---|---|---|
| BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11] |
| D3E | D3E | D3E 1.0 | raw counts | GitHub | [12] |
| DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13] |
| DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14] |
| edgeRLRT | edgeRLRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17] |
| edgeRLRTcensus | edgeRLRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17] |
| edgeRLRTdeconv | edgeRLRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18] |
| edgeRLRTrobust | edgeRLRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19] |
| edgeRQLF | edgeRQLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20] |
| edgeRQLFDetRate | edgeRQLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20] |
| limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22] |
| MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23] |
| MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23] |
| MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23] |
| MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23] |
| metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24] |
| monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25] |
| monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26] |
| monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25] |
| NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27] |
| ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29] |
| ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29] |
| ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29] |
| SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30] |
| scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31] |
| SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32] |
| SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33] |
| ttest | t-test | stats (R v3.3) | TMM-normalized TPM | CRAN | [16, 35] |
| voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22] |
| Wilcoxon | Wilcoxon test | stats (R v3.3) | TMM-normalized TPM | CRAN | [16, 36] |

Source: Nature Methods, doi:10.1038/nmeth.4612
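Several of the inputs listed in the table are simple transformations of a count matrix. As a rough sketch, counts per million (CPM) and log2(CPM+1) can be computed as below. This is a plain library-size scaling written for illustration; it does not reproduce edgeR's normalization factors, and the function names and toy matrix are hypothetical.

```python
import math

def cpm(counts):
    """Counts per million for a genes x cells matrix given as a list of rows."""
    library_sizes = [sum(column) for column in zip(*counts)]
    return [[c / lib * 1e6 for c, lib in zip(row, library_sizes)] for row in counts]

def log2_cpm_plus_one(counts):
    """log2(CPM + 1), the kind of input listed for the MASTcpm entries."""
    return [[math.log2(value + 1) for value in row] for row in cpm(counts)]

# Toy matrix: 3 genes (rows) x 2 cells (columns)
toy_counts = [
    [10, 0],
    [30, 5],
    [60, 15],
]
toy_cpm = cpm(toy_counts)
toy_log = log2_cpm_plus_one(toy_counts)
print(toy_cpm[0][0], toy_cpm[1][1], toy_log[0][1])
```

After scaling, every cell (column) sums to one million, so cells sequenced to different depths become comparable; zeros stay at zero on the log2(CPM+1) scale.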
No ground truth, i.e. no independently validated truth is available for testing
Known data
using data we know something about to get positive controls
Simulated data
null data sets by re-sampling; modeling data sets based on various distributions
Comparing between methods and scenarios
comparing the numbers of DE genes, incl. as a function of group size
Investigating results
What do the expression and distributions of the detected DE genes look like?
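A null data set by re-sampling can be sketched as follows: cells from a single, homogeneous group are split at random into two artificial groups, so any gene a method calls DE between them is a false positive. The function name, cell identifiers and group size are hypothetical.

```python
import random

def null_split(cell_ids, group_size, seed=0):
    """Randomly draw two non-overlapping groups of cells from one population."""
    rng = random.Random(seed)
    picked = rng.sample(cell_ids, 2 * group_size)
    return picked[:group_size], picked[group_size:]

# Hypothetical identifiers for 96 cells from the same condition
cells = [f"cell_{i:02d}" for i in range(96)]
group_a, group_b = null_split(cells, group_size=24, seed=1)
print(len(group_a), len(group_b), len(set(group_a) & set(group_b)))
```

Repeating the split with different seeds and group sizes gives several null instances, which is one way to study how the number of false positives depends on sample size.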
False positives (type I error) vs false negatives (type II error)
Sensitivity and specificity
Precision and recall
[Figure adapted from Wikipedia]
Dal Molin, Baruzzo and Di Camillo 2017: 2 conditions of 100 cells each, simulated with 10 000 genes, out of which 2 000 are set to be DE (based on NB and bimodal distributions)
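When the truth is known, as in a simulation with 10 000 genes of which 2 000 are truly DE, the performance measures follow directly from the confusion matrix. In the sketch below only the totals come from the simulation setup; the number of calls made by the method (1 800 calls, of which 1 500 correct) is a made-up example.

```python
def performance(tp, fp, fn, tn):
    """Standard performance measures from confusion-matrix counts."""
    return {
        "sensitivity (recall)": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "false discovery rate": fp / (tp + fp),
    }

n_genes, n_true_de = 10_000, 2_000
# Hypothetical method output: 1 800 genes called DE, 1 500 of them truly DE
tp = 1_500
fp = 1_800 - tp
fn = n_true_de - tp
tn = n_genes - tp - fp - fn

metrics = performance(tp, fp, fn, tn)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because non-DE genes are the large majority, specificity can look high even when a sizeable fraction of the calls are false, which is why precision (or the false discovery rate) is usually reported alongside it.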
Consistency
Miao et al. 2017
And so much more
Soneson and Robinson 2018
"Bias, robustness and scalability in single-cell differential expression analysis"
36 statistical approaches for DE analysis, comparing the expression levels in two groups of cells
based on 9 data sets with 11-21 separate instances (sample size effect)
extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.
conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets
Practicalities
Getting to know your data
Example data: 46078 genes x 96 cells
22229 genes with no expression at all
[Figure: histogram of read counts (x-axis) against frequency (y-axis), and histogram of the number of zero counts ("0 counts", x-axis) against frequency (y-axis)]
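The summaries above (genes with no expression at all, number of zero counts per gene) are quick to compute from a genes x cells count matrix, together with the per-cell detection rate. The sketch below is only an illustration on a tiny made-up matrix; the function name and data are hypothetical.

```python
def summarize_zeros(counts):
    """Basic sparsity summaries for a genes x cells count matrix (list of rows)."""
    n_genes, n_cells = len(counts), len(counts[0])
    return {
        # genes with zero counts in every cell
        "undetected_genes": sum(1 for row in counts if all(c == 0 for c in row)),
        # number of cells with a zero count, per gene
        "zeros_per_gene": [sum(1 for c in row if c == 0) for row in counts],
        # fraction of genes detected (count > 0), per cell
        "detection_rate": [sum(1 for row in counts if row[j] > 0) / n_genes
                           for j in range(n_cells)],
    }

# Toy matrix: 4 genes (rows) x 3 cells (columns)
toy_counts = [
    [0, 0, 0],
    [5, 0, 2],
    [0, 0, 1],
    [7, 3, 9],
]
summary = summarize_zeros(toy_counts)
print(summary)
```

On real data the same numbers show how sparse the matrix is and which genes or cells are candidates for filtering before any DE test is run.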
Choosing DE methods
Soneson and Robinson 2018
Remembering the bigger picture
Stegle, Teichmann and Marioni 2015
QC filtering
Cell-cycle phase
Normalization of cell-specific biases
Confounding factors, incl. batch effects
Detection rate, i.e. the fraction of detected genes per cell
Imputation strategies for dropout values
What is pragmatic: programming language, platform, speed, collaborative workflows, etc.
Staying critical
Summary
What to remember from this hour
https://www.menti.com & 70 52 87
Growing field
Angerer et al 2017
https://www.scrna-tools.org/tools
Zappia Phipson and Oshlack 2018
Summary
scRNA-seq is a rapidly growing field
DE is a common task, so many newer and better methods will be developed
understanding basic statistical concepts enables one to think more like a statistician when choosing and evaluating methods for a given data set
staying critical, staying updated, staying connected
Bibliography
Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1–9. ISSN: 1367-4803. doi:10.1093/bioinformatics/bty329.
Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243–260. ISSN: 20954697. doi:10.1007/s40484-016-0089-7.
Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255–261. ISSN: 15487105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612.
Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN: 16648021. doi:10.3389/fgene.2017.00062.
Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data," no. May: 1–2. ISSN: 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549.
Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature reviews. Genetics 16 (January 2014): 133–145.
Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85–91. ISSN: 24523100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004.
Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv: 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573.
Common methods More detailed examples
SCDE
models the read counts for each gene using a mixture of a NB negative binomial and aPoisson distribution
NB distribution models the transcripts that are amplified and detected
Poisson distribution models the unobserved or background-level signal of transcripts thatare not amplified (eg dropout events)
subset of robust genes is used to fit via EM algorithm the parameters to the mixture ofmodels
For DE the posterior probability that the gene shows a fold expression difference betweentwo conditions is computed using a Bayesian approach
Olga (NBIS) scRNA-seq DE May 2018 17 43
Common methods More detailed examples
Monocole
Originally designed for ordering cells by progress through differentiation stages(pseudo-time)
The mean expression level of each gene is modeled with a GAM generalized additivemodel which relates one or more predictor variables to a response variable as
g(E(Y )) = β0 + f1(x1) + f2(x2) + + fm(xm) where Y is a specific gene expression level xi arepredictor variables g is a link function typically log function and fi are non-parametric functions
(eg cubic splines)
The observable expression level Y is then modeled using GAM
E(Y ) = s(ϕt (bx si )) + ε where ϕt (bx si ) is the assigned pseudo-time of a cell and s is a cubicsmoothing function with three degrees of freedom The error term ε is normally distributed with amean of zero
The DE test is performed using an approx χ2 likelihood ratio test
Olga (NBIS) scRNA-seq DE May 2018 18 43
Common methods More detailed examples
Letrsquos stop for a minute
Olga (NBIS) scRNA-seq DE May 2018 19 43
Common methods More detailed examples
The key
Outcomei = (Modeli) + errori
we collect data on a sample from a much larger population
statistics lets us to make inferences about the population from which sample wasderived
we try to predict the outcome given a model fitted to the data
Olga (NBIS) scRNA-seq DE May 2018 20 43
Common methods More detailed examples
The key
t = x1minusx2
sp
radic1
n1+ 1
n2
height [cm]
Fre
quen
cy
165 170 175 180
010
3050
Olga (NBIS) scRNA-seq DE May 2018 21 43
Common methods More detailed examples
Generic recipemodel data eg gene expressionfit model to the data andor data to the modelestimate model parametersuse model for prediction andor inference
Olga (NBIS) scRNA-seq DE May 2018 22 43
Common methods More detailed examples
MAST (revisited)
uses generalized linear hurdle model
designed to account for stochastic dropouts and bimodal expression distribution in whichexpression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene gindicating whether gene g is expressed in cell i (ie Zig = 0 if yig = 0 and zig = 1 ifyig gt 0)
A logistic regression model for the discrete variable Z and a Gaussian linear model for thecontinuous variable (Y|Z=1)
logit(Pr (Zig = 1)) = XiβDg
Pr (Yig = Y |Zig = 1) = N(XiβCg σ
2g) where Xi is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
Olga (NBIS) scRNA-seq DE May 2018 23 43
Common methods More detailed examples
Generic recipemodel eg gene expression with random errorfit model to the data andor data to the model estimate modelparametersuse model for prediction andor inference
Important implicationthe better model fits to the data the better statistics
Olga (NBIS) scRNA-seq DE May 2018 24 43
Common methods More detailed examples
Common distributions
Negative Binomial
Read Counts
Fre
quen
cy
5 10 15 20
020
4060
8010
012
0
Zerominusinflated NB
Read Counts
Fre
quen
cy
0 5 10 15 20
010
020
030
040
050
0
PoissonminusBeta
Read Counts
Fre
quen
cy
0 20 40 60 80 120
010
020
030
0
Olga (NBIS) scRNA-seq DE May 2018 25 43
Common methods More detailed examples
Common distributions
Negative Binomial
Read Counts
Fre
quen
cy
0 5 10 15 20
010
020
030
040
050
0
Negative Binomial
Read Counts
Fre
quen
cy
0 2 4 6 8 10
050
100
150
Negative Binomial
Read Counts
Fre
quen
cy
0 2 4 6 8 10 12
050
100
150
200
Olga (NBIS) scRNA-seq DE May 2018 26 43
Performance
Performance
Olga (NBIS) scRNA-seq DE May 2018 27 43
Supplementary Table 2. Evaluated differential expression methods together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle.
Short name | Method | Software version | Input | Available from | Reference
BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11]
D3E | D3E | D3E 1.0 | raw counts | GitHub | [12]
DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13]
DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14]
edgeRLRT | edgeR/LRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17]
edgeRLRTcensus | edgeR/LRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17]
edgeRLRTdeconv | edgeR/LRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18]
edgeRLRTrobust | edgeR/LRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19]
edgeRQLF | edgeR/QLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
edgeRQLFDetRate | edgeR/QLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22]
MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24]
monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25]
monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26]
monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25]
NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27]
ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29]
ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29]
ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29]
SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30]
scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31]
SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32]
SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33]
ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35]
voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22]
Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36]
Nature Methods, doi:10.1038/nmeth.4612
No ground truth, i.e. no independently validated truth is available for testing
Known data
using data we know something about to get positive controls
Simulated data
null data sets by re-sampling; modeling data sets based on various distributions
Comparing between methods and scenarios
Comparing numbers of DEs, incl. as a function of group size
Investigating results
What do the expression and distributions of detected DEs look like?
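The idea of a null data set by re-sampling can be sketched as follows: cells from a single condition are split at random into two pseudo-groups, so any gene called DE is a false positive by construction. This is a minimal illustration on hypothetical Gaussian toy data; the test used here (a difference in means with a normal approximation for the p-value) is an assumption chosen to keep the example dependency-free, not one of the methods evaluated in the talk.

```python
import random
from statistics import NormalDist, mean, variance


def approx_p_value(a, b):
    """Two-sided p-value for a difference in means (normal approximation)."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    if se == 0:
        return 1.0
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def null_split(values, rng):
    """Randomly split the cells of ONE condition into two pseudo-groups."""
    shuffled = values[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


rng = random.Random(1)
# Hypothetical toy data: 500 genes x 60 cells, all from the same condition.
genes = [[rng.gauss(5, 1) for _ in range(60)] for _ in range(500)]

p_values = [approx_p_value(*null_split(gene, rng)) for gene in genes]
false_positive_rate = sum(p < 0.05 for p in p_values) / len(p_values)
print(f"fraction of genes called DE at p < 0.05: {false_positive_rate:.3f}")
```

For a well-calibrated test the printed fraction should be close to the nominal 5%; a much larger value would indicate that the method produces too many false positives on this kind of data.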
False positives (type I error) vs false negatives (type II error)
Sensitivity and specificity
Precision and recall
Figure: adapted from Wikipedia
Figure: Dal Molin, Baruzzo and Di Camillo 2017; 2 conditions of 100 cells each, simulated with 10 000 genes, out of which 2 000 set to DEs (based on NB and bimodal distributions)
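The metrics listed above follow directly from the four cells of a confusion matrix. A small sketch, using the simulation size quoted from Dal Molin et al. (10 000 genes, 2 000 truly DE) together with hypothetical detection counts that are made up for illustration:

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard performance metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # also called recall or true positive rate
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "false_positive_rate": fp / (fp + tn),  # type I error rate
        "false_negative_rate": fn / (fn + tp),  # type II error rate
    }


# 10 000 genes, 2 000 truly DE. Hypothetical method: calls 1 800 genes DE,
# of which 1 500 are truly DE (these detection counts are assumptions).
tp, fp = 1500, 300
fn = 2000 - tp       # truly DE genes that were missed
tn = 8000 - fp       # non-DE genes correctly left uncalled

metrics = classification_metrics(tp, fp, fn, tn)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```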
Consistency
Miao et al. 2017
And so much more
Soneson and Robinson 2018
Bias, robustness and scalability in single-cell differential expression analysis
36 statistical approaches for DE analysis to compare the expression levels in the two groups of cells
based on 9 data sets with 11-21 separate instances (sample size effect)
extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.
conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets
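One simple way to look at similarities between methods is the overlap of their DE gene lists. The sketch below uses the Jaccard index on hypothetical gene sets; the method names and gene identifiers are placeholders, and this is not necessarily the similarity measure used in the cited benchmark.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical DE calls from three methods (placeholder names and genes).
de_calls = {
    "methodA": {"g1", "g2", "g3", "g4"},
    "methodB": {"g2", "g3", "g4", "g5"},
    "methodC": {"g1", "g9"},
}

similarity = {
    (m1, m2): jaccard(de_calls[m1], de_calls[m2])
    for m1, m2 in combinations(sorted(de_calls), 2)
}
for (m1, m2), value in similarity.items():
    print(f"{m1} vs {m2}: {value:.2f}")
```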
Practicalities
Getting to know your data
Example data: 46078 genes x 96 cells; 22229 genes with no expression at all
Figure: Histograms (y-axis: Frequency) of read counts and of the number of 0 counts
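A first look at the data along these lines can be as simple as counting zeros per gene. The sketch below uses a tiny made-up count matrix (genes in rows, cells in columns) rather than the 46078 x 96 example data set from the slide.

```python
# Hypothetical count matrix: 5 genes (rows) x 4 cells (columns).
counts = [
    [0, 0, 0, 0],
    [3, 0, 1, 0],
    [0, 0, 0, 0],
    [12, 7, 0, 9],
    [0, 1, 0, 0],
]

n_cells = len(counts[0])

# Number of cells with a zero count, per gene.
zeros_per_gene = [sum(c == 0 for c in row) for row in counts]

# Genes with no expression at all have a zero in every cell.
n_unexpressed = sum(z == n_cells for z in zeros_per_gene)

# Such genes carry no information for DE testing and are typically filtered out.
expressed = [row for row, z in zip(counts, zeros_per_gene) if z < n_cells]

print("zeros per gene:", zeros_per_gene)
print("genes with no expression at all:", n_unexpressed)
print("genes kept:", len(expressed))
```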
Choosing DE methods
Soneson and Robinson 2018
Remembering the bigger picture
Stegle, Teichmann and Marioni 2015
QC filtering
Cell-cycle phase
Normalization of cell-specific biases
Confounding factors, incl. batch effects
Detection rate, i.e. the fraction of detected genes per cell
Imputation strategies for dropout values
What is pragmatic: programming language, platform, speed, collaborative workflows, etc.
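The detection rate defined above is straightforward to compute and can then be used as a covariate in the DE model. A minimal sketch on a made-up count matrix, assuming that a gene counts as detected in a cell when its count is greater than zero:

```python
# Hypothetical count matrix: 4 genes (rows) x 3 cells (columns).
counts = [
    [0, 2, 0],
    [5, 0, 0],
    [1, 1, 0],
    [0, 3, 4],
]

n_genes = len(counts)
n_cells = len(counts[0])

# Fraction of genes with a non-zero count, computed separately for each cell.
detection_rate = [
    sum(counts[g][c] > 0 for g in range(n_genes)) / n_genes
    for c in range(n_cells)
]

print(detection_rate)  # [0.5, 0.75, 0.25]
```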
Staying critical
Summary
What to remember from this hour
https://www.menti.com & 70 52 87
Growing field
Angerer et al. 2017
https://www.scrna-tools.org/tools
Zappia, Phipson and Oshlack 2018
Summary
scRNA-seq is a rapidly growing field
DE is a common task, so many newer and better methods will be developed
understanding basic statistical concepts enables one to think more like a statistician, to choose and evaluate methods given a data set
staying critical, staying updated, staying connected
Bibliography
Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1–9. ISSN 1367-4803. doi:10.1093/bioinformatics/bty329
Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243–260. ISSN 2095-4697. doi:10.1007/s40484-016-0089-7
Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255–261. ISSN 1548-7105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612
Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN 1664-8021. doi:10.3389/fgene.2017.00062
Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data." no. May: 1–2. ISSN 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549
Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133–145
Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85–91. ISSN 2452-3100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004
Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573
Common methods: More detailed examples
Monocle
Originally designed for ordering cells by progress through differentiation stages (pseudo-time)
The mean expression level of each gene is modeled with a GAM (generalized additive model), which relates one or more predictor variables to a response variable as
$g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m)$, where $Y$ is a specific gene expression level, $x_i$ are predictor variables, $g$ is a link function (typically the log function) and $f_i$ are non-parametric functions (e.g. cubic splines)
The observable expression level $Y$ is then modeled using a GAM:
$E(Y) = s(\phi_t(b_x, s_i)) + \epsilon$, where $\phi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and $s$ is a cubic smoothing function with three degrees of freedom. The error term $\epsilon$ is normally distributed with a mean of zero
The DE test is performed using an approximate $\chi^2$ likelihood ratio test
Let's stop for a minute
The key
$\text{Outcome}_i = (\text{Model}_i) + \text{error}_i$
we collect data on a sample from a much larger population
statistics lets us make inferences about the population from which the sample was derived
we try to predict the outcome given a model fitted to the data
The key
$t = \dfrac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$
Figure: Histogram of height [cm] (y-axis: Frequency)
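The t statistic on this slide can be computed by hand in a few lines. The sketch below assumes that $s_p$ denotes the pooled standard deviation of the two groups (the usual equal-variance two-sample t-test); the height values are made up for illustration.

```python
from statistics import mean, variance


def pooled_t_statistic(x1, x2):
    """Two-sample t statistic with pooled standard deviation s_p."""
    n1, n2 = len(x1), len(x2)
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    s_p = pooled_var ** 0.5
    return (mean(x1) - mean(x2)) / (s_p * (1 / n1 + 1 / n2) ** 0.5)


# Hypothetical heights [cm] in two groups.
group1 = [170, 172, 174]
group2 = [166, 168, 170]

t = pooled_t_statistic(group1, group2)
print(f"t = {t:.3f}")  # t = 2.449
```

Here both groups have a sample variance of 4, so $s_p = 2$ and $t = 4 / (2\sqrt{2/3}) = \sqrt{6}$.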
Generic recipe
model data, e.g. gene expression
fit model to the data and/or data to the model
estimate model parameters
use model for prediction and/or inference
MAST (revisited)
uses a generalized linear hurdle model
designed to account for stochastic dropouts and bimodal expression distribution, in which expression is either strongly non-zero or non-detectable
The rate of expression $Z$ and the level of expression $Y$ are modeled for each gene $g$, with $Z_{ig}$ indicating whether gene $g$ is expressed in cell $i$ (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
A logistic regression model for the discrete variable $Z$ and a Gaussian linear model for the continuous variable $(Y \mid Z = 1)$:
$\mathrm{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$
$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
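The two-part structure of a hurdle model can be illustrated for a single gene and a plain two-group comparison. This is a simplified sketch and not the MAST implementation: it uses ordinary maximum likelihood instead of the empirical Bayesian fitting, has no covariates beyond the group label, and assumes that each group contains at least one expressing cell and that the expressed values are not all identical. The function names and the toy data are made up. The likelihood ratio statistics of the discrete and the continuous part are added and compared against a $\chi^2$ distribution with 2 degrees of freedom, whose survival function is $\exp(-x/2)$.

```python
import math


def xlogy(x, y):
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def bernoulli_loglik(k, n):
    """Maximised Bernoulli log-likelihood for k detections among n cells."""
    p = k / n
    return xlogy(k, p) + xlogy(n - k, 1 - p)


def rss(values):
    """Residual sum of squares around the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def hurdle_lrt(group_a, group_b):
    """Likelihood ratio test for one gene; values are log-expression, 0 = not detected."""
    pos_a = [v for v in group_a if v > 0]
    pos_b = [v for v in group_b if v > 0]

    # Discrete part: is the detection rate (Z = 1) different between groups?
    ll_alt = (bernoulli_loglik(len(pos_a), len(group_a))
              + bernoulli_loglik(len(pos_b), len(group_b)))
    ll_null = bernoulli_loglik(len(pos_a) + len(pos_b),
                               len(group_a) + len(group_b))
    lr_discrete = max(0.0, 2 * (ll_alt - ll_null))

    # Continuous part: Gaussian model for Y given Z = 1, group-specific means.
    n_pos = len(pos_a) + len(pos_b)
    lr_continuous = max(0.0, n_pos * math.log(rss(pos_a + pos_b)
                                              / (rss(pos_a) + rss(pos_b))))

    # Combined statistic, chi-square with 2 degrees of freedom.
    p_value = math.exp(-(lr_discrete + lr_continuous) / 2)
    return lr_discrete, lr_continuous, p_value


# Hypothetical log-expression values for one gene in two groups of 8 cells.
group_a = [0, 0, 0, 1.0, 1.2, 0.8, 0, 1.1]
group_b = [2.0, 2.4, 0, 2.2, 1.9, 2.1, 2.3, 2.0]

lr_d, lr_c, p = hurdle_lrt(group_a, group_b)
print(f"discrete LR = {lr_d:.2f}, continuous LR = {lr_c:.2f}, p = {p:.2e}")
```

In this toy example both the detection rate and the level of expression among expressing cells differ between the groups, so both parts contribute to the combined statistic.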
Generic recipe
model, e.g. gene expression with random error
fit model to the data and/or data to the model; estimate model parameters
use model for prediction and/or inference
Important implication: the better the model fits the data, the better the statistics
Common methods More detailed examples
Common distributions
Negative Binomial
Read Counts
Fre
quen
cy
5 10 15 20
020
4060
8010
012
0
Zerominusinflated NB
Read Counts
Fre
quen
cy
0 5 10 15 20
010
020
030
040
050
0
PoissonminusBeta
Read Counts
Fre
quen
cy
0 20 40 60 80 120
010
020
030
0
Olga (NBIS) scRNA-seq DE May 2018 25 43
Common methods More detailed examples
Common distributions
Negative Binomial
Read Counts
Fre
quen
cy
0 5 10 15 20
010
020
030
040
050
0
Negative Binomial
Read Counts
Fre
quen
cy
0 2 4 6 8 10
050
100
150
Negative Binomial
Read Counts
Fre
quen
cy
0 2 4 6 8 10 12
050
100
150
200
Olga (NBIS) scRNA-seq DE May 2018 26 43
Performance
Performance
Olga (NBIS) scRNA-seq DE May 2018 27 43
Supplementary Table 2 Evaluated dicrarrerential expression methods together with package versions and thetype of input values provided to each of them Note that ldquoraw countsrdquo here refers to length-scaled TPMswhich are on the scale of the raw counts but are unacrarrected by dicrarrerential isoform usage [10] CPM valuesare calculated with edgeR and Census counts with monocle
Short name Method Software version InputAvailablefrom
Reference
BPSC BPSC BPSC 09901 CPM GitHub [11]
D3E D3E D3E 10 raw counts GitHub [12]
DESeq2 DESeq2 DESeq2 1141 raw counts Bioconductor [13]
DESeq2betapFALSE DESeq2 without beta prior DESeq2 1141 raw counts Bioconductor [13]
DESeq2census DESeq2 DESeq2 1141 Census counts Bioconductor [13]
DESeq2nofiltDESeq2 without the built-in in-dependent filtering
DESeq2 1141 raw counts Bioconductor [13]
DEsingle DEsingle DEsingle 010 raw counts GitHub [14]
edgeRLRT edgeRLRT edgeR 3191 raw counts Bioconductor [15ndash17]
edgeRLRTcensus edgeRLRT edgeR 3191 Census counts Bioconductor [15ndash17]
edgeRLRTdeconvedgeRLRT with deconvolutionnormalization
edgeR 3191scran 120
raw counts Bioconductor [15 17 18]
edgeRLRTrobustedgeRLRT with robust disper-sion estimation
edgeR 3191 raw counts Bioconductor [15ndash17 19]
edgeRQLF edgeRQLF edgeR 3191 raw counts Bioconductor [15 16 20]
edgeRQLFDetRateedgeRQLF with cellular detec-tion rate as covariate
edgeR 3191 raw counts Bioconductor [15 16 20]
limmatrend limma-trend limma 33013 log2(CPM) Bioconductor [21 22]
MASTcpm MAST MAST 105 log2(CPM+1) Bioconductor [23]
MASTcpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(CPM+1) Bioconductor [23]
MASTtpm MAST MAST 105 log2(TPM+1) Bioconductor [23]
MASTtpmDetRateMAST with cellular detectionrate as covariate
MAST 105 log2(TPM+1) Bioconductor [23]
metagenomeSeq metagenomeSeqmetagenomeSeq1160
raw counts Bioconductor [24]
monocle monocle (tobit) monocle 220 TPM Bioconductor [25]
monoclecensus monocle (Negative Binomial) monocle 220 Census counts Bioconductor [25 26]
monoclecount monocle (Negative Binomial) monocle 220 raw counts Bioconductor [25]
NODES NODESNODES0009010
raw countsAuthor-providedlink
[27]
ROTScpm ROTS ROTS 120 CPM Bioconductor [28 29]
ROTStpm ROTS ROTS 120 TPM Bioconductor [28 29]
ROTSvoom ROTS ROTS 120voom-transformedraw counts
Bioconductor [28 29]
SAMseq SAMseq samr 20 raw counts CRAN [30]
scDD scDD scDD 100 raw counts Bioconductor [31]
SCDE SCDE scde 220 raw counts Bioconductor [32]
SeuratBimod Seurat (bimod test) Seurat 1407 raw counts GitHub [33 34]
SeuratBimodnofiltSeurat (bimod test) without theinternal filtering
Seurat 1407 raw counts GitHub [33 34]
SeuratBimodIsExpr2Seurat (bimod test) with internalexpression threshold set to 2
Seurat 1407 raw counts GitHub [33 34]
SeuratTobit Seurat (tobit test) Seurat 1407 TPM GitHub [25 33]
ttest t-test stats (R v 33)TMM-normalizedTPM
CRAN [16 35]
voomlimma voom-limma limma 33013 raw counts Bioconductor [21 22]
Wilcoxon Wilcoxon test stats (R v 33)TMM-normalizedTPM
CRAN [16 36]
3
Nature Methods doi101038nmeth4612
Performance
Olga (NBIS) scRNA-seq DE May 2018 28 43
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE May 2018 29 43
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE May 2018 29 43
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE May 2018 29 43
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE May 2018 29 43
Performance
No ground truth ie no independently validated truth is available fortesting
Known data
using data we know something aboutto get positive controls
Simulated data
null-data sets by re-samplingmodeling data sets based on variousdistributions
Comparing between methods andscenarios
Comparing numbers of DEs incl as afunction of group size
Investigating results
How does the expression anddistributions of detected DEs look like
Olga (NBIS) scRNA-seq DE May 2018 29 43
Performance
False positives (type I error) vs false negatives (type II error)Sensitivity and specificityPrecision and recall
adapted from Wikipedia
Olga (NBIS) scRNA-seq DE May 2018 30 43
Performance
False positives (type I error) vs false negatives (type II error)Sensitivity and specificityPrecision and recall
Dal Molin Baruzzo and Di Camillo 2017 2 conditions of 100 cells each simulated with 10 000genes out of which 2 000 set to DEs (based on NB and bimodal distributions)
Olga (NBIS) scRNA-seq DE May 2018 31 43
Performance
Consistency
Miao et al 2017
Olga (NBIS) scRNA-seq DE May 2018 32 43
Performance
And so much more
Soneson and Robinson 2018
Bias robustness and scalability in single-celldifferential expression analysis
36 statistical approaches for DE analysisto compare the expression levels in thetwo groups of cells
based on 9 data sets with 11 - 21separate instances (sample size effect)
extensive evaluation metrics incl numberof genes found characteristics of the falsepositive detections robustness ofmethods similarities between methodsetc
conquer a collection of consistentlyprocessed analysis-ready publicscRNA-seq data sets
Olga (NBIS) scRNA-seq DE May 2018 33 43
Practicalities
Getting to know your data
Example data: 46078 genes x 96 cells, 22229 genes with no expression at all
Figure: two histograms, panels "Read Counts" and "0 counts" (y-axis: Frequency)
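A first look at the data usually starts with counting zeros. The sketch below uses a tiny made-up count matrix (four hypothetical genes, four cells) in place of the 46078 x 96 example; gene names and values are assumptions for illustration only.

```python
# Tiny made-up count matrix: one entry per gene, one value per cell
counts = {
    "gene_a": [0, 0, 0, 0],
    "gene_b": [5, 0, 12, 0],
    "gene_c": [0, 1, 0, 0],
    "gene_d": [30, 22, 41, 18],
}

# Genes with no expression at all (all counts are zero)
unexpressed = [gene for gene, row in counts.items() if sum(row) == 0]

# Number of zero counts per gene, the quantity behind a "0 counts" histogram
zeros_per_gene = {gene: row.count(0) for gene, row in counts.items()}

# Keep only genes that are detected in at least one cell
expressed = {gene: row for gene, row in counts.items() if gene not in unexpressed}
```

Genes that are never detected carry no information for DE testing, so they are typically dropped before any method is run.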
Practicalities
Choosing DE methods
Soneson and Robinson 2018
Practicalities
Remembering the bigger picture
Stegle, Teichmann and Marioni 2015
- QC filtering
- Cell-cycle phase
- Normalization of cell-specific biases
- Confounding factors, incl. batch effects
- Detection rate, i.e. the fraction of detected genes per cell
- Imputation strategies for dropout values
- What is pragmatic: programming language, platform, speed, collaborative workflows, etc.
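The detection rate from the list above is simple to compute. A minimal sketch, assuming made-up counts for three hypothetical cells and five genes; the function name is an assumption, not a package API.

```python
def detection_rate(cell_counts):
    """Fraction of genes with a non-zero count in one cell."""
    return sum(1 for count in cell_counts if count > 0) / len(cell_counts)

# Made-up data: each list holds the counts of five genes in one cell
cells = {
    "cell_1": [3, 0, 0, 7, 1],
    "cell_2": [0, 0, 0, 2, 0],
    "cell_3": [5, 1, 2, 9, 4],
}

rates = {name: detection_rate(cell_counts) for name, cell_counts in cells.items()}
```

The per-cell value can then be passed to a DE method as a covariate, which is what the "DetRate" variants of edgeRQLF and MAST in the Soneson and Robinson comparison do.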
Practicalities
Staying critical
Summary
What to remember from this hour
https://www.menti.com & 70 52 87
Summary
Growing field
Angerer et al. 2017
Summary
Growing field
https://www.scrna-tools.org/tools
Zappia, Phipson and Oshlack 2018
Summary
- scRNA-seq is a rapidly growing field
- DE is a common task, so many newer and better methods will be developed
- understanding basic statistical concepts enables one to think more like a statistician, to choose and evaluate methods given a data set
- staying critical, staying updated, staying connected
Bibliography
Wu, Zhijin et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq". Bioinformatics 00 (00): 1–9. ISSN 1367-4803. doi:10.1093/bioinformatics/bty329
Miao, Zhun and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data". Quantitative Biology 4 (4): 243–260. ISSN 2095-4697. doi:10.1007/s40484-016-0089-7
Soneson, Charlotte and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis". Nature Methods 15 (4): 255–261. ISSN 1548-7105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612
Dal Molin, Alessandra, Giacomo Baruzzo and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods". Frontiers in Genetics 8 (MAY). ISSN 1664-8021. doi:10.3389/fgene.2017.00062
Miao, Zhun et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data". no. May: 1–2. ISSN 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549
Stegle, Oliver, Sarah A. Teichmann and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics". Nature Reviews Genetics 16 (January 2014): 133–145
Angerer, Philipp et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics". Current Opinion in Systems Biology 4: 85–91. ISSN 2452-3100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004
Zappia, Luke, Belinda Phipson and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database". bioRxiv 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573
Common methods: More detailed examples
The key
$\text{Outcome}_i = (\text{Model}_i) + \text{error}_i$
- we collect data on a sample from a much larger population
- statistics lets us make inferences about the population from which the sample was derived
- we try to predict the outcome given a model fitted to the data
Common methods: More detailed examples
The key
$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$
Figure: histogram of height [cm] (y-axis: Frequency)
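The two-sample t statistic above can be computed directly. This sketch assumes the usual definition of the pooled standard deviation, $s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$, which the slide does not spell out; the height values are made up.

```python
import math
from statistics import mean, variance

def pooled_t(x1, x2):
    """Two-sample t statistic with pooled standard deviation s_p."""
    n1, n2 = len(x1), len(x2)
    # Pooled variance: sample variances weighted by their degrees of freedom
    sp2 = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    return (mean(x1) - mean(x2)) / (math.sqrt(sp2) * math.sqrt(1 / n1 + 1 / n2))

# Made-up heights [cm] for two small groups
heights_a = [170.0, 172.0, 174.0, 176.0]
heights_b = [166.0, 168.0, 170.0, 172.0]

t_stat = pooled_t(heights_a, heights_b)
```

The same recipe applies to one gene at a time in a DE analysis: the two groups of cells play the role of the two samples, and the model behind the test assumes roughly normal values with equal variance.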
Common methods: More detailed examples
Generic recipe
- model data, e.g. gene expression
- fit model to the data and/or data to the model
- estimate model parameters
- use model for prediction and/or inference
Common methods: More detailed examples
MAST (revisited)
- uses a generalized linear hurdle model
- designed to account for stochastic dropouts and bimodal expression distributions, in which expression is either strongly non-zero or non-detectable
- the rate of expression Z and the level of expression Y are modeled for each gene g, with $Z_{ig}$ indicating whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
- a logistic regression model for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):
$$\operatorname{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$$
$$\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$$
where $X_i$ is a design matrix
- model parameters are fitted using an empirical Bayesian framework
- allows for a joint estimate of nuisance and treatment effects
- DE is determined using the likelihood ratio test
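The two parts of the hurdle model can be illustrated for a single gene and two groups of cells. This is not the MAST implementation: it skips the empirical Bayes shrinkage and the likelihood ratio test, and only shows the unregularised special case of one binary group covariate, where the discrete effect is a difference in logit detection rates and the continuous effect is a difference in mean expression among expressing cells. All values and function names are made up.

```python
import math

def hurdle_summary(values):
    """Split one gene's expression values into the two hurdle components."""
    z = [1 if v > 0 else 0 for v in values]  # Z_ig: is the gene detected in cell i?
    positive = [v for v in values if v > 0]  # Y_ig for cells with Z_ig = 1
    rate = sum(z) / len(z)                   # estimate of Pr(Z = 1)
    level = sum(positive) / len(positive) if positive else float("nan")
    return rate, level

def logit(p):
    return math.log(p / (1 - p))

# Made-up log-scale expression of one gene in two groups of six cells
group_a = [0.0, 0.0, 3.1, 2.9, 0.0, 3.4]
group_b = [2.8, 3.0, 0.0, 3.3, 3.1, 2.6]

rate_a, level_a = hurdle_summary(group_a)
rate_b, level_b = hurdle_summary(group_b)

delta_discrete = logit(rate_b) - logit(rate_a)  # group effect in the logistic part
delta_continuous = level_b - level_a            # group effect in the Gaussian part
```

In this toy gene the groups differ mainly in how often the gene is detected, not in its level when detected, which is the kind of difference a test on mean expression alone can miss.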
Common methods: More detailed examples
Generic recipe
- model, e.g. gene expression with random error
- fit model to the data and/or data to the model, estimate model parameters
- use model for prediction and/or inference
Important implication: the better the model fits the data, the better the statistics
Common methods: More detailed examples
Common distributions
Figure: histograms of read counts (y-axis: Frequency) for three distributions: Negative Binomial, Zero-inflated NB, Poisson-Beta
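Counts from the three distributions can be simulated with the standard library alone, which helps to see how they differ in their share of zeros. This is a sketch under assumptions: all parameter values are arbitrary, the NB is drawn as a gamma-Poisson mixture, the Poisson-Beta as a Poisson whose rate is a scaled Beta variable, and the helper names are hypothetical.

```python
import math
import random

def rpois(lam, rng):
    """Poisson draw via Knuth's algorithm; adequate for the small means used here."""
    limit = math.exp(-lam)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def rnbinom(mu, size, rng):
    """Negative binomial with mean mu and dispersion parameter size (gamma-Poisson)."""
    return rpois(rng.gammavariate(size, mu / size), rng)

def rzinb(mu, size, pi_zero, rng):
    """Zero-inflated NB: an extra zero with probability pi_zero, otherwise an NB draw."""
    return 0 if rng.random() < pi_zero else rnbinom(mu, size, rng)

def rpoisbeta(alpha, beta, scale, rng):
    """Poisson-Beta: Poisson with rate scale * Beta(alpha, beta)."""
    return rpois(scale * rng.betavariate(alpha, beta), rng)

rng = random.Random(1)  # fixed seed for reproducibility
n = 500
nb = [rnbinom(10, 5, rng) for _ in range(n)]
zinb = [rzinb(10, 5, 0.4, rng) for _ in range(n)]
pb = [rpoisbeta(0.5, 0.5, 100, rng) for _ in range(n)]

# Fraction of zero counts under each model
zero_fraction = {
    name: sum(1 for c in draws if c == 0) / n
    for name, draws in (("nb", nb), ("zinb", zinb), ("pb", pb))
}
```

Plotting histograms of `nb`, `zinb` and `pb` should give shapes of the same kind as the three panels: a single mode, a single mode plus a spike at zero, and a broad, possibly bimodal spread.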
Common methods: More detailed examples
Common distributions
Figure: three Negative Binomial histograms of read counts (y-axis: Frequency)
Performance
Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle.
Short name | Method | Software version | Input | Available from | Reference
BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11]
D3E | D3E | D3E 1.0 | raw counts | GitHub | [12]
DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13]
DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14]
edgeRLRT | edgeRLRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17]
edgeRLRTcensus | edgeRLRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17]
edgeRLRTdeconv | edgeRLRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18]
edgeRLRTrobust | edgeRLRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19]
edgeRQLF | edgeRQLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
edgeRQLFDetRate | edgeRQLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22]
MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24]
monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25]
monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26]
monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25]
NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27]
ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29]
ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29]
ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29]
SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30]
scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31]
SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32]
SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33]
ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35]
voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22]
Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36]
Nature Methods, doi:10.1038/nmeth.4612
Common methods More detailed examples
Generic recipemodel data eg gene expressionfit model to the data andor data to the modelestimate model parametersuse model for prediction andor inference
Olga (NBIS) scRNA-seq DE May 2018 22 43
Common methods More detailed examples
MAST (revisited)
uses generalized linear hurdle model
designed to account for stochastic dropouts and bimodal expression distribution in whichexpression is either strongly non-zero or non-detectable
The rate of expression Z and the level of expression Y are modeled for each gene gindicating whether gene g is expressed in cell i (ie Zig = 0 if yig = 0 and zig = 1 ifyig gt 0)
A logistic regression model for the discrete variable Z and a Gaussian linear model for thecontinuous variable (Y|Z=1)
logit(Pr (Zig = 1)) = XiβDg
Pr (Yig = Y |Zig = 1) = N(XiβCg σ
2g) where Xi is a design matrix
Model parameters are fitted using an empirical Bayesian framework
Allows for a joint estimate of nuisance and treatment effects
DE is determined using the likelihood ratio test
Olga (NBIS) scRNA-seq DE May 2018 23 43
Common methods More detailed examples
Generic recipemodel eg gene expression with random errorfit model to the data andor data to the model estimate modelparametersuse model for prediction andor inference
Important implicationthe better model fits to the data the better statistics
Olga (NBIS) scRNA-seq DE May 2018 24 43
Common methods More detailed examples
Common distributions
Negative Binomial
Read Counts
Fre
quen
cy
5 10 15 20
020
4060
8010
012
0
Zerominusinflated NB
Read Counts
Fre
quen
cy
0 5 10 15 20
010
020
030
040
050
0
PoissonminusBeta
Read Counts
Fre
quen
cy
0 20 40 60 80 120
010
020
030
0
Olga (NBIS) scRNA-seq DE May 2018 25 43
Common methods More detailed examples
Common distributions
Negative Binomial
Read Counts
Fre
quen
cy
0 5 10 15 20
010
020
030
040
050
0
Negative Binomial
Read Counts
Fre
quen
cy
0 2 4 6 8 10
050
100
150
Negative Binomial
Read Counts
Fre
quen
cy
0 2 4 6 8 10 12
050
100
150
200
Olga (NBIS) scRNA-seq DE May 2018 26 43
Performance
Performance
Olga (NBIS) scRNA-seq DE May 2018 27 43
Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that "raw counts" here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle. Source: Nature Methods, doi:10.1038/nmeth.4612

Short name | Method | Software version | Input | Available from | Reference
BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11]
D3E | D3E | D3E 1.0 | raw counts | GitHub | [12]
DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13]
DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14]
edgeRLRT | edgeR/LRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17]
edgeRLRTcensus | edgeR/LRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17]
edgeRLRTdeconv | edgeR/LRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18]
edgeRLRTrobust | edgeR/LRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19]
edgeRQLF | edgeR/QLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
edgeRQLFDetRate | edgeR/QLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22]
MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24]
monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25]
monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26]
monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25]
NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27]
ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29]
ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29]
ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29]
SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30]
scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31]
SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32]
SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33]
ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35]
voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22]
Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36]
No ground truth, i.e. no independently validated truth is available for testing
Known data
- using data we know something about to get positive controls
Simulated data
- null data sets by re-sampling
- modeling data sets based on various distributions
Comparing between methods and scenarios
- comparing numbers of DE genes, incl. as a function of group size
Investigating results
- What do the expression and distributions of detected DE genes look like?
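The "null data sets by re-sampling" idea can be sketched as follows: take cells from a single condition, split them at random into two mock groups, and count how many genes a test calls DE. Since no true difference exists, every call is a false positive. This is a minimal illustration, not the procedure of any cited paper; the function names, the toy data and the Welch test with a normal approximation are all assumptions made for the example.

```python
import math
import random
import statistics


def welch_p_value(a, b):
    """Two-sided p-value from a Welch t statistic, using a normal approximation."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    if se == 0:
        return 1.0  # no variation at all, nothing to test
    t = (statistics.mean(a) - statistics.mean(b)) / se
    return math.erfc(abs(t) / math.sqrt(2))


def null_false_positive_rate(expr, alpha=0.05, rng=None):
    """expr: list of genes, each a list of values for cells from ONE condition."""
    rng = rng or random.Random(0)
    n_cells = len(expr[0])
    cells = list(range(n_cells))
    rng.shuffle(cells)  # random mock group labels
    g1, g2 = cells[: n_cells // 2], cells[n_cells // 2:]
    calls = 0
    for gene in expr:
        p = welch_p_value([gene[i] for i in g1], [gene[i] for i in g2])
        calls += p < alpha
    return calls / len(expr)


rng = random.Random(1)
# Toy homogeneous population (hypothetical): 500 genes x 80 cells,
# about half of the values are zeros standing in for dropouts.
toy = [[0.0 if rng.random() < 0.5 else rng.gauss(5.0, 1.0) for _ in range(80)]
       for _ in range(500)]
fpr = null_false_positive_rate(toy, rng=rng)
print(f"fraction of genes called DE on null data: {fpr:.3f}")
```

A well-calibrated test should return a fraction close to the chosen alpha; a clearly larger fraction suggests the method is too liberal for data of this kind.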
- False positives (type I error) vs. false negatives (type II error)
- Sensitivity and specificity
- Precision and recall
(adapted from Wikipedia)
Dal Molin, Baruzzo and Di Camillo 2017: 2 conditions of 100 cells each, simulated with 10,000 genes, out of which 2,000 were set to be DE (based on NB and bimodal distributions)
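When the true DE genes are known, as in a simulation, these metrics follow directly from the four cells of the confusion matrix. The sketch below uses the same dimensions as the simulation quoted above (10,000 genes, 2,000 true DE), but the set of called genes is a made-up example and `de_metrics` is a hypothetical helper name.

```python
def de_metrics(called, truth, n_genes):
    """called, truth: sets of gene indices; n_genes: total number of genes tested."""
    tp = len(called & truth)   # true DE genes that were called
    fp = len(called - truth)   # type I errors
    fn = len(truth - called)   # type II errors
    tn = n_genes - tp - fp - fn
    return {
        "sensitivity (recall)": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "false positive rate": fp / (fp + tn),
    }


truth = set(range(2000))                                # 2,000 true DE genes
called = set(range(1500)) | set(range(2000, 2500))      # hypothetical method output
m = de_metrics(called, truth, 10_000)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Note that precision is undefined when a method calls no genes at all; a real evaluation would need to handle that case explicitly.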
Consistency
(Miao et al. 2017)
And so much more
Soneson and Robinson 2018, "Bias, robustness and scalability in single-cell differential expression analysis"
- 36 statistical approaches for DE analysis to compare the expression levels in the two groups of cells
- based on 9 data sets with 11-21 separate instances (sample size effect)
- extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.
- conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets
Practicalities
Getting to know your data
Example data: 46,078 genes x 96 cells; 22,229 genes with no expression at all
[Figure: two histograms with y-axis Frequency; x-axes: Read Counts and 0 counts]
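A first look at a count matrix usually covers the quantities on this slide: how many genes are never detected, how many zeros each gene has, and what fraction of genes each cell detects. The sketch below assumes a plain genes x cells list of lists; `summarize_counts` and the toy matrix are hypothetical and only illustrate the bookkeeping.

```python
def summarize_counts(counts):
    """counts: list of genes, each a list of read counts per cell."""
    n_genes, n_cells = len(counts), len(counts[0])
    # Genes with no expression in any cell
    silent = sum(1 for gene in counts if not any(gene))
    # Number of zero counts per gene
    zeros_per_gene = [sum(1 for c in gene if c == 0) for gene in counts]
    # Detection rate: fraction of genes with a non-zero count in each cell
    detection_rate = [
        sum(1 for gene in counts if gene[j] > 0) / n_genes for j in range(n_cells)
    ]
    return silent, zeros_per_gene, detection_rate


toy = [
    [0, 0, 0, 0],
    [5, 0, 2, 0],
    [1, 3, 0, 7],
    [0, 0, 0, 9],
]
silent, zeros_per_gene, detection_rate = summarize_counts(toy)
print(silent, zeros_per_gene, detection_rate)
```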
Choosing DE methods
(Soneson and Robinson 2018)
Remembering the bigger picture
(Stegle, Teichmann and Marioni 2015)
- QC filtering
- Cell-cycle phase
- Normalization of cell-specific biases
- Confounding factors, incl. batch effects
- Detection rate, i.e. the fraction of detected genes per cell
- Imputation strategies for dropout values
- What is pragmatic: programming language, platform, speed, collaborative workflows, etc.
Staying critical
Summary
What to remember from this hour
https://www.menti.com & 70 52 87
Growing field
(Angerer et al. 2017)
https://www.scrna-tools.org/tools (Zappia, Phipson and Oshlack 2018)
Summary
- scRNA-seq is a rapidly growing field
- DE is a common task, so many newer and better methods will be developed
- understanding basic statistical concepts enables one to think more like a statistician, to choose and evaluate methods for a given data set
- staying critical, staying updated, staying connected
Bibliography
Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1–9. ISSN 1367-4803. doi:10.1093/bioinformatics/bty329
Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243–260. ISSN 20954697. doi:10.1007/s40484-016-0089-7
Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255–261. ISSN 15487105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612
Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN 16648021. doi:10.3389/fgene.2017.00062
Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data," no. May: 1–2. ISSN 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549
Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133–145
Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85–91. ISSN 24523100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004
Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573
Common methods: More detailed examples
MAST (revisited)
- uses a generalized linear hurdle model
- designed to account for stochastic dropouts and a bimodal expression distribution, in which expression is either strongly non-zero or non-detectable
- The rate of expression Z and the level of expression Y are modeled for each gene g, with Z indicating whether gene g is expressed in cell i (i.e. $Z_{ig} = 0$ if $y_{ig} = 0$ and $Z_{ig} = 1$ if $y_{ig} > 0$)
- A logistic regression model for the discrete variable Z and a Gaussian linear model for the continuous variable (Y | Z = 1):
  $\operatorname{logit}(\Pr(Z_{ig} = 1)) = X_i \beta_g^D$
  $\Pr(Y_{ig} = y \mid Z_{ig} = 1) = N(X_i \beta_g^C, \sigma_g^2)$, where $X_i$ is the design-matrix row for cell i
- Model parameters are fitted using an empirical Bayesian framework
- Allows for a joint estimate of nuisance and treatment effects
- DE is determined using the likelihood ratio test
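To make the two-part structure concrete, here is a heavily simplified hurdle-style likelihood ratio test for one gene and two groups of cells. It is not MAST's implementation: it assumes no covariates besides the group label, uses plain maximum likelihood instead of the empirical Bayesian fitting, and all function names are hypothetical. The discrete part compares the fraction of expressing cells, the continuous part compares the mean of the positive values, and the two statistics are added.

```python
import math


def _binom_loglik(k, n):
    """Maximised Bernoulli log-likelihood for k expressing cells out of n."""
    ll = 0.0
    if k > 0:
        ll += k * math.log(k / n)
    if k < n:
        ll += (n - k) * math.log(1 - k / n)
    return ll


def _rss(values, center):
    return sum((v - center) ** 2 for v in values)


def hurdle_lrt(group_a, group_b):
    """group_a, group_b: log-scale expression of one gene (0 = not detected)."""
    pos_a = [v for v in group_a if v > 0]
    pos_b = [v for v in group_b if v > 0]

    # Discrete part: does the expression rate (Z) differ between groups?
    ll_alt = (_binom_loglik(len(pos_a), len(group_a))
              + _binom_loglik(len(pos_b), len(group_b)))
    ll_null = _binom_loglik(len(pos_a) + len(pos_b), len(group_a) + len(group_b))
    stat, df = max(2 * (ll_alt - ll_null), 0.0), 1

    # Continuous part: does the mean of Y given Z = 1 differ between groups?
    if len(pos_a) >= 2 and len(pos_b) >= 2:
        pooled = pos_a + pos_b
        rss_null = _rss(pooled, sum(pooled) / len(pooled))
        rss_alt = (_rss(pos_a, sum(pos_a) / len(pos_a))
                   + _rss(pos_b, sum(pos_b) / len(pos_b)))
        if rss_alt > 0:
            stat += len(pooled) * math.log(rss_null / rss_alt)
            df += 1

    # Chi-square tail probability, closed forms for 1 and 2 degrees of freedom
    p = math.exp(-stat / 2) if df == 2 else math.erfc(math.sqrt(stat / 2))
    return stat, df, p


a = [0, 0, 0, 0, 1.0, 1.2]
b = [3.0, 3.1, 2.9, 3.2, 0, 3.0]
print(hurdle_lrt(a, b))
```

When too few cells express the gene for the continuous part to be fitted, the sketch falls back to the discrete part alone with one degree of freedom.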
Generic recipe
- model, e.g. gene expression with random error
- fit model to the data and/or data to the model, estimate model parameters
- use model for prediction and/or inference
Important implication
- the better the model fits the data, the better the statistics
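As a small instance of the "fit model, estimate parameters" step, the sketch below fits a negative binomial to the counts of one gene by the method of moments, assuming the common parameterisation variance = mean + dispersion x mean^2. This is only an illustration: the packages discussed in this lecture use more elaborate estimators, and `fit_nb_moments` and the example counts are made up.

```python
import statistics


def fit_nb_moments(counts):
    """Return (mean, dispersion) with variance = mean + dispersion * mean**2."""
    mu = statistics.mean(counts)
    var = statistics.variance(counts)
    # Dispersion is clipped at 0: no overdispersion beyond Poisson
    phi = max((var - mu) / mu ** 2, 0.0) if mu > 0 else 0.0
    return mu, phi


counts = [0, 0, 1, 3, 8, 0, 2, 15, 0, 1]   # hypothetical counts for one gene
mu, phi = fit_nb_moments(counts)
print(f"mean = {mu}, dispersion = {phi:.3f}")
```

A dispersion well above zero indicates that a Poisson model would understate the variability of this gene, which is one way a poorly fitting model leads to poor statistics.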
Common distributions
[Figure: histograms of Read Counts (x-axis) vs. Frequency (y-axis) for three distributions: Negative Binomial, Zero-inflated NB, Poisson-Beta]
[Figure: three Negative Binomial histograms of Read Counts (x-axis) vs. Frequency (y-axis)]
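The three distributions named on these slides can be simulated with the standard library alone, which helps build intuition for how their shapes differ. The sketch below is an assumption-laden illustration: the parameter values are arbitrary, the function names are hypothetical, and the negative binomial is drawn as a gamma-Poisson mixture with variance = mean + dispersion x mean^2.

```python
import math
import random


def poisson(lam, rng):
    """Knuth's multiplication method; adequate for the small rates used here."""
    if lam <= 0:
        return 0
    threshold, k, prod = math.exp(-lam), 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def neg_binomial(mean, dispersion, rng):
    """Gamma-Poisson mixture: gamma rate with shape 1/dispersion, scale mean*dispersion."""
    rate = rng.gammavariate(1.0 / dispersion, mean * dispersion)
    return poisson(rate, rng)


def zero_inflated_nb(mean, dispersion, p_dropout, rng):
    """With probability p_dropout the count is forced to zero, else NB."""
    return 0 if rng.random() < p_dropout else neg_binomial(mean, dispersion, rng)


def poisson_beta(max_rate, alpha, beta, rng):
    """Poisson whose rate is max_rate scaled by a Beta(alpha, beta) variable."""
    return poisson(max_rate * rng.betavariate(alpha, beta), rng)


rng = random.Random(42)
nb = [neg_binomial(3.0, 0.5, rng) for _ in range(5000)]
zinb = [zero_inflated_nb(3.0, 0.5, 0.4, rng) for _ in range(5000)]
pb = [poisson_beta(20.0, 0.5, 0.5, rng) for _ in range(5000)]
for name, sample in [("NB", nb), ("ZINB", zinb), ("Poisson-Beta", pb)]:
    zero_frac = sample.count(0) / len(sample)
    print(f"{name}: mean = {sum(sample) / len(sample):.2f}, zeros = {zero_frac:.2f}")
```

Comparing the fraction of zeros across the three samples shows why a plain negative binomial can fit poorly when dropouts are frequent.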
Common methods More detailed examples
Common distributions
Negative Binomial
Read Counts
Fre
quen
cy
5 10 15 20
020
4060
8010
012
0
Zerominusinflated NB
Read Counts
Fre
quen
cy
0 5 10 15 20
010
020
030
040
050
0
PoissonminusBeta
Read Counts
Fre
quen
cy
0 20 40 60 80 120
010
020
030
0
Olga (NBIS) scRNA-seq DE May 2018 25 43
Common methods More detailed examples
Common distributions
Negative Binomial
Read Counts
Fre
quen
cy
0 5 10 15 20
010
020
030
040
050
0
Negative Binomial
Read Counts
Fre
quen
cy
0 2 4 6 8 10
050
100
150
Negative Binomial
Read Counts
Fre
quen
cy
0 2 4 6 8 10 12
050
100
150
200
Olga (NBIS) scRNA-seq DE May 2018 26 43
Performance
Performance
Olga (NBIS) scRNA-seq DE May 2018 27 43
Supplementary Table 2. Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that “raw counts” here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle.
Short name | Method | Software version | Input | Available from | Reference
BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11]
D3E | D3E | D3E 1.0 | raw counts | GitHub | [12]
DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13]
DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13]
DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14]
edgeRLRT | edgeR/LRT | edgeR 3.19.1 | raw counts | Bioconductor | [15-17]
edgeRLRTcensus | edgeR/LRT | edgeR 3.19.1 | Census counts | Bioconductor | [15-17]
edgeRLRTdeconv | edgeR/LRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18]
edgeRLRTrobust | edgeR/LRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15-17, 19]
edgeRQLF | edgeR/QLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
edgeRQLFDetRate | edgeR/QLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20]
limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22]
MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23]
MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23]
metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24]
monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25]
monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26]
monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25]
NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27]
ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29]
ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29]
ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29]
SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30]
scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31]
SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32]
SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34]
SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33]
ttest | t-test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 35]
voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22]
Wilcoxon | Wilcoxon test | stats (R v 3.3) | TMM-normalized TPM | CRAN | [16, 36]
Source: Nature Methods, doi:10.1038/nmeth.4612
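Several methods in the table take CPM or log2(CPM+1) values as input rather than counts. As a rough illustration of what these input types mean, the sketch below scales a toy genes x cells count matrix to counts per million and then log-transforms it. This is plain library-size scaling written for this note, not edgeR's implementation, and the function names and toy numbers are assumptions.

```python
import math


def cpm(counts_by_gene):
    """Counts per million: scale every cell (column) to a library size of 1e6.

    counts_by_gene is a list of rows, one row per gene, one column per cell.
    """
    n_cells = len(counts_by_gene[0])
    lib_sizes = [sum(row[j] for row in counts_by_gene) for j in range(n_cells)]
    return [
        [1e6 * row[j] / lib_sizes[j] if lib_sizes[j] > 0 else 0.0
         for j in range(n_cells)]
        for row in counts_by_gene
    ]


def log2_plus_one(matrix):
    """Element-wise log2(x + 1), so that zeros stay at zero."""
    return [[math.log2(x + 1.0) for x in row] for row in matrix]


# Toy data: 3 genes (rows) x 3 cells (columns); the numbers are made up.
toy_counts = [
    [10, 0, 5],
    [0, 0, 15],
    [90, 20, 80],
]
toy_cpm = cpm(toy_counts)
toy_log = log2_plus_one(toy_cpm)
```

After scaling, every cell's column sums to one million, so cells with very different sequencing depth become comparable; the +1 keeps zero counts at zero on the log scale.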
No ground truth, i.e. no independently validated truth is available for testing
Known data
- using data we know something about to get positive controls
Simulated data
- null data sets by re-sampling
- modeling data sets based on various distributions
Comparing between methods and scenarios
- comparing numbers of DE genes, incl. as a function of group size
Investigating results
- what do the expression levels and distributions of the detected DE genes look like?
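The idea of a null data set by re-sampling can be sketched as follows: cells from one single condition are split at random into two pseudo-groups, so any gene called DE between them is a false positive. The code below is a toy illustration under stated assumptions. The expression values are simulated, the group size of 20 is arbitrary, and a simple permutation test stands in for whichever DE method is being evaluated.

```python
import random
import statistics


def split_null_groups(cell_ids, group_size, rng):
    """Draw two non-overlapping pseudo-groups from ONE real group of cells."""
    picked = rng.sample(cell_ids, 2 * group_size)
    return picked[:group_size], picked[group_size:]


def permutation_pvalue(x, y, rng, n_perm=200):
    """Two-sided permutation p-value for a difference in group means."""
    observed = abs(statistics.fmean(x) - statistics.fmean(y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:len(x)])
                   - statistics.fmean(pooled[len(x):]))
        hits += diff >= observed
    return (hits + 1) / (n_perm + 1)


rng = random.Random(42)
cells = list(range(60))  # 60 cells that all belong to the same condition
# Toy expression values: 100 genes, no true difference between any cells.
expression = {f"gene{g}": [rng.gammavariate(2.0, 3.0) for _ in cells]
              for g in range(100)}

group_a, group_b = split_null_groups(cells, 20, rng)
p_values = {
    gene: permutation_pvalue([values[c] for c in group_a],
                             [values[c] for c in group_b], rng)
    for gene, values in expression.items()
}
false_positive_rate = sum(p < 0.05 for p in p_values.values()) / len(p_values)
print(f"fraction of genes called DE on null data: {false_positive_rate:.2f}")
```

Repeating the split with different group sizes gives the fraction of false calls as a function of group size, which is the kind of comparison listed above.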
False positives (type I error) vs false negatives (type II error)
Sensitivity and specificity
Precision and recall
Figure adapted from Wikipedia
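These terms have standard definitions in terms of the four cells of a confusion matrix. The abbreviations TP, FP, TN and FN (true/false positives/negatives) are introduced here for the note and do not appear on the slide.

```latex
\text{sensitivity} = \text{recall} = \frac{TP}{TP + FN}
\qquad
\text{specificity} = \frac{TN}{TN + FP}
\qquad
\text{precision} = \frac{TP}{TP + FP}

\text{type I error rate} = \frac{FP}{FP + TN} = 1 - \text{specificity}
\qquad
\text{type II error rate} = \frac{FN}{FN + TP} = 1 - \text{sensitivity}
```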
Dal Molin, Baruzzo and Di Camillo 2017: 2 conditions of 100 cells each, simulated with 10 000 genes, out of which 2 000 were set to be DE (based on NB and bimodal distributions)
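With simulated data the truth is known, so the calls of a method can be scored directly. The sketch below uses the design sizes quoted above (10 000 genes, 2 000 of them DE), but the calls themselves are invented for illustration: a hypothetical method that finds 1 500 of the true DE genes and also calls 300 non-DE genes. These numbers are not results from the paper.

```python
def confusion_counts(truth, called):
    """Count TP, FP, TN, FN from two equally long lists of booleans."""
    tp = sum(t and c for t, c in zip(truth, called))
    fp = sum((not t) and c for t, c in zip(truth, called))
    fn = sum(t and (not c) for t, c in zip(truth, called))
    tn = sum((not t) and (not c) for t, c in zip(truth, called))
    return tp, fp, tn, fn


def performance(tp, fp, tn, fn):
    """Sensitivity (recall), specificity and precision from the four counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }


n_genes, n_de = 10_000, 2_000
truth = [i < n_de for i in range(n_genes)]
# Invented calls: 1 500 of the true DE genes plus 300 non-DE genes.
called = [i < 1_500 or n_de <= i < n_de + 300 for i in range(n_genes)]

tp, fp, tn, fn = confusion_counts(truth, called)
scores = performance(tp, fp, tn, fn)
print(tp, fp, tn, fn, scores)
```

Because only 20% of the genes are truly DE, specificity stays high even with 300 false calls, while precision shows more directly how many of the reported genes are wrong.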
Consistency
Miao et al. 2017
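The slide does not say how consistency was measured. One simple possibility, shown here purely as an assumption, is the pairwise overlap of the DE gene sets reported by different methods, for example as a Jaccard index. The method names and gene identifiers below are hypothetical.

```python
def jaccard(set_a, set_b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(set_a), set(set_b)
    if not a and not b:
        return 1.0  # two empty call sets agree trivially
    return len(a & b) / len(a | b)


# Hypothetical DE calls from three methods on the same data set.
calls = {
    "method_A": {"g1", "g2", "g3", "g4"},
    "method_B": {"g2", "g3", "g4", "g5"},
    "method_C": {"g1", "g9"},
}

names = sorted(calls)
overlap = {
    (m, n): jaccard(calls[m], calls[n])
    for i, m in enumerate(names)
    for n in names[i + 1:]
}
for pair, value in overlap.items():
    print(pair, round(value, 2))
```

A value of 1 means two methods report exactly the same genes and 0 means they share none.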
And so much more
Soneson and Robinson 2018: Bias, robustness and scalability in single-cell differential expression analysis
- 36 statistical approaches for DE analysis to compare the expression levels in the two groups of cells
- based on 9 data sets with 11-21 separate instances (sample size effect)
- extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.
- conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets
Practicalities
Getting to know your data
Example data: 46078 genes x 96 cells
22229 genes with no expression at all
Figure: two histograms (y-axis: Frequency), one with x-axis Read Counts and one with x-axis 0 counts
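A first look at a count matrix along these lines can be as simple as counting zeros per gene. The sketch below assumes the data are held as a list of rows (one per gene, one column per cell); the toy matrix and the function name are made up for illustration and are unrelated to the example data on the slide.

```python
def zero_summary(counts_by_gene):
    """Summarise zeros in a genes x cells count matrix (list of rows)."""
    n_cells = len(counts_by_gene[0])
    zeros_per_gene = [sum(c == 0 for c in row) for row in counts_by_gene]
    never_expressed = sum(z == n_cells for z in zeros_per_gene)
    return zeros_per_gene, never_expressed


# Toy data: 4 genes (rows) x 4 cells (columns).
toy_counts = [
    [0, 0, 0, 0],
    [3, 0, 1, 0],
    [12, 7, 0, 25],
    [0, 0, 0, 0],
]
zeros_per_gene, never_expressed = zero_summary(toy_counts)
# Keep only genes that are detected in at least one cell.
expressed = [row for row, z in zip(toy_counts, zeros_per_gene) if z < len(row)]
print(zeros_per_gene, never_expressed, len(expressed))
```

Genes with no expression in any cell carry no information for DE testing, so they are usually dropped before any method is run.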
Choosing DE methods
Soneson and Robinson 2018
Remembering the bigger picture
Stegle, Teichmann and Marioni 2015
- QC filtering
- Cell-cycle phase
- Normalization of cell-specific biases
- Confounding factors, incl. batch effects
- Detection rate, i.e. the fraction of detected genes per cell
- Imputation strategies for dropout values
- What is pragmatic: programming language, platform, speed, collaborative workflows, etc.
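The detection rate in the list above is straightforward to compute from a count matrix. The sketch below treats a gene as detected in a cell when its count is above zero, which is an assumption (other thresholds are possible), and uses a made-up toy matrix.

```python
def detection_rate(counts_by_gene):
    """Fraction of genes with a non-zero count, computed per cell (column)."""
    n_genes = len(counts_by_gene)
    n_cells = len(counts_by_gene[0])
    return [
        sum(row[j] > 0 for row in counts_by_gene) / n_genes
        for j in range(n_cells)
    ]


# Toy data: 4 genes (rows) x 3 cells (columns).
toy_counts = [
    [5, 0, 2],
    [0, 0, 1],
    [3, 4, 9],
    [0, 0, 7],
]
rates = detection_rate(toy_counts)
print(rates)
```

The resulting per-cell value can then be passed as a covariate to DE methods that accept one.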
Staying critical
Summary
What to remember from this hour?
https://www.menti.com & 70 52 87
Growing field
Angerer et al. 2017
https://www.scrna-tools.org/tools
Zappia, Phipson and Oshlack 2018
- scRNA-seq is a rapidly growing field
- DE is a common task, so many newer and better methods will be developed
- understanding basic statistical concepts enables one to think more like a statistician, to choose and evaluate methods for a given data set
- staying critical, staying updated, staying connected
Bibliography
Wu, Zhijin, et al. 2017. “Two-phase differential expression analysis for single cell RNA-seq.” Bioinformatics 00 (00): 1-9. ISSN 1367-4803. doi:10.1093/bioinformatics/bty329
Miao, Zhun, and Xuegong Zhang. 2016. “Differential expression analyses for single-cell RNA-Seq: old questions on new data.” Quantitative Biology 4 (4): 243-260. ISSN 20954697. doi:10.1007/s40484-016-0089-7
Soneson, Charlotte, and Mark D. Robinson. 2018. “Bias, robustness and scalability in single-cell differential expression analysis.” Nature Methods 15 (4): 255-261. ISSN 15487105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612
Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. “Single-cell RNA-sequencing: Assessment of differential expression analysis methods.” Frontiers in Genetics 8 (MAY). ISSN 16648021. doi:10.3389/fgene.2017.00062
Miao, Zhun, et al. 2017. “DEsingle for detecting three types of differential expression in single-cell RNA-seq data.” no. May: 1-2. ISSN 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549
Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. “Computational and analytical challenges in single-cell transcriptomics.” Nature Reviews Genetics 16 (January 2014): 133-145
Angerer, Philipp, et al. 2017. “Single cells make big data: New challenges and opportunities in transcriptomics.” Current Opinion in Systems Biology 4: 85-91. ISSN 24523100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004
Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. “Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database.” bioRxiv 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573
Supplementary Table 2: Evaluated differential expression methods, together with package versions and the type of input values provided to each of them. Note that “raw counts” here refers to length-scaled TPMs, which are on the scale of the raw counts but are unaffected by differential isoform usage [10]. CPM values are calculated with edgeR and Census counts with monocle.

| Short name | Method | Software version | Input | Available from | Reference |
|---|---|---|---|---|---|
| BPSC | BPSC | BPSC 0.99.0/1 | CPM | GitHub | [11] |
| D3E | D3E | D3E 1.0 | raw counts | GitHub | [12] |
| DESeq2 | DESeq2 | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DESeq2betapFALSE | DESeq2 without beta prior | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DESeq2census | DESeq2 | DESeq2 1.14.1 | Census counts | Bioconductor | [13] |
| DESeq2nofilt | DESeq2 without the built-in independent filtering | DESeq2 1.14.1 | raw counts | Bioconductor | [13] |
| DEsingle | DEsingle | DEsingle 0.1.0 | raw counts | GitHub | [14] |
| edgeRLRT | edgeR/LRT | edgeR 3.19.1 | raw counts | Bioconductor | [15–17] |
| edgeRLRTcensus | edgeR/LRT | edgeR 3.19.1 | Census counts | Bioconductor | [15–17] |
| edgeRLRTdeconv | edgeR/LRT with deconvolution normalization | edgeR 3.19.1, scran 1.2.0 | raw counts | Bioconductor | [15, 17, 18] |
| edgeRLRTrobust | edgeR/LRT with robust dispersion estimation | edgeR 3.19.1 | raw counts | Bioconductor | [15–17, 19] |
| edgeRQLF | edgeR/QLF | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20] |
| edgeRQLFDetRate | edgeR/QLF with cellular detection rate as covariate | edgeR 3.19.1 | raw counts | Bioconductor | [15, 16, 20] |
| limmatrend | limma-trend | limma 3.30.13 | log2(CPM) | Bioconductor | [21, 22] |
| MASTcpm | MAST | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23] |
| MASTcpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(CPM+1) | Bioconductor | [23] |
| MASTtpm | MAST | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23] |
| MASTtpmDetRate | MAST with cellular detection rate as covariate | MAST 1.0.5 | log2(TPM+1) | Bioconductor | [23] |
| metagenomeSeq | metagenomeSeq | metagenomeSeq 1.16.0 | raw counts | Bioconductor | [24] |
| monocle | monocle (tobit) | monocle 2.2.0 | TPM | Bioconductor | [25] |
| monoclecensus | monocle (Negative Binomial) | monocle 2.2.0 | Census counts | Bioconductor | [25, 26] |
| monoclecount | monocle (Negative Binomial) | monocle 2.2.0 | raw counts | Bioconductor | [25] |
| NODES | NODES | NODES 0.0.0.9010 | raw counts | Author-provided link | [27] |
| ROTScpm | ROTS | ROTS 1.2.0 | CPM | Bioconductor | [28, 29] |
| ROTStpm | ROTS | ROTS 1.2.0 | TPM | Bioconductor | [28, 29] |
| ROTSvoom | ROTS | ROTS 1.2.0 | voom-transformed raw counts | Bioconductor | [28, 29] |
| SAMseq | SAMseq | samr 2.0 | raw counts | CRAN | [30] |
| scDD | scDD | scDD 1.0.0 | raw counts | Bioconductor | [31] |
| SCDE | SCDE | scde 2.2.0 | raw counts | Bioconductor | [32] |
| SeuratBimod | Seurat (bimod test) | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratBimodnofilt | Seurat (bimod test) without the internal filtering | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratBimodIsExpr2 | Seurat (bimod test) with internal expression threshold set to 2 | Seurat 1.4.0.7 | raw counts | GitHub | [33, 34] |
| SeuratTobit | Seurat (tobit test) | Seurat 1.4.0.7 | TPM | GitHub | [25, 33] |
| ttest | t-test | stats (R v3.3) | TMM-normalized TPM | CRAN | [16, 35] |
| voomlimma | voom-limma | limma 3.30.13 | raw counts | Bioconductor | [21, 22] |
| Wilcoxon | Wilcoxon test | stats (R v3.3) | TMM-normalized TPM | CRAN | [16, 36] |

Nature Methods, doi:10.1038/nmeth.4612
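Several methods in the table take CPM or log2(CPM+1) as input. A minimal sketch of what these values are, assuming a genes x cells matrix stored as nested Python lists; the function names `cpm` and `log2_cpm_plus_one` and the toy matrix are illustrative only. The benchmark itself calculated CPM with edgeR, which can additionally scale library sizes by normalization factors; that step is left out here.

```python
import math

def cpm(counts):
    """Counts per million. counts: list of genes, each a list of counts per cell."""
    n_cells = len(counts[0])
    # Library size = total counts observed in each cell (column sums).
    lib_sizes = [sum(gene[j] for gene in counts) for j in range(n_cells)]
    return [
        [gene[j] * 1e6 / lib_sizes[j] if lib_sizes[j] > 0 else 0.0
         for j in range(n_cells)]
        for gene in counts
    ]

def log2_cpm_plus_one(counts):
    """log2(CPM + 1); the +1 keeps zero counts at zero after the transform."""
    return [[math.log2(v + 1.0) for v in gene] for gene in cpm(counts)]

# Toy data: 3 genes x 2 cells (hypothetical values).
toy_counts = [[10, 0], [30, 5], [60, 15]]
print(cpm(toy_counts))
print(log2_cpm_plus_one(toy_counts))
```

After the CPM step every cell sums to one million, so cells sequenced to different depths become comparable before any test is applied.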
Performance
No ground truth, i.e. no independently validated truth, is available for testing
Known data: using data we know something about to get positive controls
Simulated data: null data sets by re-sampling; modeling data sets based on various distributions
Comparing between methods and scenarios: comparing numbers of DE genes, incl. as a function of group size
Investigating results: what do the expression and distributions of the detected DE genes look like?
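The null data set idea can be sketched in a few lines: draw two pseudo-groups from one homogeneous population, run a test per gene, and treat every call as a false positive. This is only an illustration under stated assumptions: the data are simulated Gaussian values, the test is a Welch statistic with a normal approximation (not one of the benchmarked methods), and all function names are made up for this example.

```python
import math
import random

def null_split(n_cells, group_size, rng):
    """Randomly draw two disjoint pseudo-groups from one homogeneous population."""
    idx = rng.sample(range(n_cells), 2 * group_size)
    return idx[:group_size], idx[group_size:]

def welch_p(x, y):
    """Two-sided p-value for a Welch statistic, using a normal approximation."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    vx = sum((v - mx) ** 2 for v in x) / (len(x) - 1)
    vy = sum((v - my) ** 2 for v in y) / (len(y) - 1)
    se = math.sqrt(vx / len(x) + vy / len(y))
    if se == 0:
        return 1.0  # no variation at all: nothing to call
    z = (mx - my) / se
    return math.erfc(abs(z) / math.sqrt(2))

def false_positive_fraction(expr, group_size, rng, alpha=0.05):
    """expr: list of genes, each a list of expression values per cell."""
    g1, g2 = null_split(len(expr[0]), group_size, rng)
    calls = sum(
        1 for gene in expr
        if welch_p([gene[i] for i in g1], [gene[i] for i in g2]) < alpha
    )
    return calls / len(expr)

rng = random.Random(1)
# Toy population: 500 genes x 96 cells, all cells drawn from the same distribution.
expr = [[rng.gauss(5.0, 1.0) for _ in range(96)] for _ in range(500)]
for size in (6, 12, 24, 48):
    print(size, false_positive_fraction(expr, size, rng))
```

Because no gene is truly DE here, the printed fractions estimate the type I error rate at each group size; a well-calibrated test should stay close to the chosen alpha.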
False positives (type I error) vs. false negatives (type II error)
Sensitivity and specificity
Precision and recall
Figure [adapted from Wikipedia]
Dal Molin, Baruzzo and Di Camillo 2017: 2 conditions of 100 cells each, simulated with 10 000 genes, of which 2 000 were set to be DE (based on NB and bimodal distributions)
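When the truth is known, as in a simulation, these quantities follow directly from the four cells of the confusion matrix. A small sketch using the same dimensions as the simulation above (10 000 genes, 2 000 true DE genes); the calls of the "method" are invented for illustration and do not come from any of the cited papers.

```python
def confusion_counts(truth, called):
    """truth, called: lists of booleans, one entry per gene."""
    tp = sum(1 for t, c in zip(truth, called) if t and c)
    fp = sum(1 for t, c in zip(truth, called) if not t and c)
    fn = sum(1 for t, c in zip(truth, called) if t and not c)
    tn = sum(1 for t, c in zip(truth, called) if not t and not c)
    return tp, fp, fn, tn

def ratio(num, den):
    """Guarded division; undefined ratios are reported as NaN."""
    return num / den if den else float("nan")

def performance(truth, called):
    tp, fp, fn, tn = confusion_counts(truth, called)
    return {
        "sensitivity": ratio(tp, tp + fn),        # same as recall
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
        "type_I_error_rate": ratio(fp, fp + tn),   # 1 - specificity
        "type_II_error_rate": ratio(fn, fn + tp),  # 1 - sensitivity
    }

n_genes, n_de = 10_000, 2_000
truth = [i < n_de for i in range(n_genes)]
# Hypothetical method: finds 1 500 of the true DE genes and wrongly calls 400 others.
called = [i < 1_500 or 2_000 <= i < 2_400 for i in range(n_genes)]
print(confusion_counts(truth, called))
print(performance(truth, called))
```

Note that specificity looks high here mostly because non-DE genes are the majority; precision is often the more informative number when only a minority of genes is truly DE.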
Consistency
Miao et al. 2017
And so much more
Soneson and Robinson 2018: Bias, robustness and scalability in single-cell differential expression analysis
36 statistical approaches for DE analysis, comparing the expression levels in two groups of cells
based on 9 data sets with 11-21 separate instances (sample size effect)
extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.
conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets
Practicalities
Getting to know your data
Example data: 46078 genes x 96 cells; 22229 genes with no expression at all
Figure Two histograms: Frequency vs. Read Counts, and Frequency vs. 0 counts
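A first look at a count matrix along these lines could be sketched as follows. The 4 x 4 toy matrix and the function name `summarize_counts` are made up for illustration; the real example data (46078 genes x 96 cells) is not reproduced here.

```python
from collections import Counter

def summarize_counts(counts):
    """counts: list of genes, each a list of read counts per cell."""
    n_cells = len(counts[0])
    # For every gene, in how many cells is the count exactly zero?
    zeros_per_gene = [sum(1 for c in gene if c == 0) for gene in counts]
    return {
        "n_genes": len(counts),
        "n_cells": n_cells,
        "genes_all_zero": sum(1 for z in zeros_per_gene if z == n_cells),
        "zeros_per_gene": zeros_per_gene,
        "zero_histogram": Counter(zeros_per_gene),
    }

# Toy data: 4 genes x 4 cells (hypothetical values).
toy_matrix = [
    [0, 0, 0, 0],
    [3, 0, 1, 0],
    [12, 7, 9, 20],
    [0, 0, 0, 5],
]
summary = summarize_counts(toy_matrix)
print(summary["n_genes"], "genes x", summary["n_cells"], "cells")
print("genes with no expression at all:", summary["genes_all_zero"])
for n_zero in sorted(summary["zero_histogram"]):
    print(n_zero, "zero cells:", summary["zero_histogram"][n_zero], "gene(s)")
```

Genes that are zero in every cell carry no information for DE testing and are usually dropped before any method is run.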
Choosing DE methods
Soneson and Robinson 2018
Remembering the bigger picture
Stegle, Teichmann and Marioni 2015
QC filtering
Cell-cycle phase
Normalization of cell-specific biases
Confounding factors, incl. batch effects
Detection rate, i.e. the fraction of detected genes per cell
Imputation strategies for dropout values
What is pragmatic: programming language, platform, speed, collaborative workflows, etc.
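The detection rate from the list above is simple to compute. A minimal sketch, assuming a genes x cells count matrix as nested lists and treating any count above a threshold (default 0) as "detected"; the function name and the toy values are illustrative.

```python
def detection_rate(counts, threshold=0):
    """Fraction of genes with count > threshold, computed separately for each cell."""
    n_genes = len(counts)
    n_cells = len(counts[0])
    return [
        sum(1 for gene in counts if gene[j] > threshold) / n_genes
        for j in range(n_cells)
    ]

# Toy data: 4 genes x 3 cells (hypothetical values).
toy_cells = [
    [0, 1, 2],
    [0, 0, 3],
    [5, 0, 1],
    [0, 0, 0],
]
print(detection_rate(toy_cells))  # [0.25, 0.25, 0.75]
```

The resulting per-cell value is what variants such as edgeRQLFDetRate and MASTcpmDetRate in the benchmark table include as a covariate in the model.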
Staying critical
Summary
What to remember from this hour
https://www.menti.com & 70 52 87
Growing field
Angerer et al. 2017
https://www.scrna-tools.org/tools
Zappia, Phipson and Oshlack 2018
scRNA-seq is a rapidly growing field
DE is a common task, so many newer and better methods will be developed
understanding basic statistical concepts enables one to think more like a statistician, and to choose and evaluate methods for a given data set
staying critical, staying updated, staying connected
Bibliography
Wu, Zhijin, et al. 2017. “Two-phase differential expression analysis for single cell RNA-seq”. Bioinformatics 00 (00): 1–9. ISSN 1367-4803. doi:10.1093/bioinformatics/bty329
Miao, Zhun, and Xuegong Zhang. 2016. “Differential expression analyses for single-cell RNA-Seq: old questions on new data”. Quantitative Biology 4 (4): 243–260. ISSN 2095-4697. doi:10.1007/s40484-016-0089-7
Soneson, Charlotte, and Mark D. Robinson. 2018. “Bias, robustness and scalability in single-cell differential expression analysis”. Nature Methods 15 (4): 255–261. ISSN 1548-7105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612
Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. “Single-cell RNA-sequencing: Assessment of differential expression analysis methods”. Frontiers in Genetics 8 (MAY). ISSN 1664-8021. doi:10.3389/fgene.2017.00062
Miao, Zhun, et al. 2017. “DEsingle for detecting three types of differential expression in single-cell RNA-seq data”, no. May: 1–2. ISSN 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv 103549
Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. “Computational and analytical challenges in single-cell transcriptomics”. Nature Reviews Genetics 16 (January 2014): 133–145
Angerer, Philipp, et al. 2017. “Single cells make big data: New challenges and opportunities in transcriptomics”. Current Opinion in Systems Biology 4: 85–91. ISSN 2452-3100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004
Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. “Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database”. bioRxiv 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573
Performance
No ground truth, i.e. no independently validated truth, is available for testing
Known data
using data we know something about to get positive controls
Simulated data
null data sets by re-sampling; modeling data sets based on various distributions
Comparing between methods and scenarios
comparing numbers of DE genes, incl. as a function of group size
Investigating results
What do the expression levels and distributions of the detected DE genes look like?
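The "null data sets by re-sampling" idea can be sketched as follows: cells from one homogeneous population are randomly split into two mock groups, so no gene is truly DE between them and every gene a method reports counts as a false positive. This is a minimal sketch only; the function name `make_null_split`, the cell labels and the group size are illustrative assumptions, not taken from the slides.

```python
import random


def make_null_split(cell_ids, group_size, seed=None):
    """Randomly draw two disjoint mock groups from one homogeneous population.

    Both groups come from the same population, so no gene is truly DE
    between them; any gene reported as DE is a false positive.
    """
    cell_ids = list(cell_ids)
    if 2 * group_size > len(cell_ids):
        raise ValueError("need at least 2 * group_size cells")
    rng = random.Random(seed)
    drawn = rng.sample(cell_ids, 2 * group_size)  # sampling without replacement
    return drawn[:group_size], drawn[group_size:]


# Hypothetical example: 96 cells of one cell type, mock groups of 12 cells each.
cells = [f"cell_{i:02d}" for i in range(96)]
group_a, group_b = make_null_split(cells, group_size=12, seed=1)
```

Repeating the split for several values of `group_size` gives the number of false detections as a function of group size.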
False positives (type I error) vs false negatives (type II error)
Sensitivity and specificity
Precision and recall
[Figure adapted from Wikipedia]
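When the true DE status of each gene is known (e.g. in simulated data), these quantities follow directly from the four confusion-matrix counts. A minimal sketch using the standard definitions; the function names and the toy vectors are illustrative assumptions.

```python
def confusion_counts(truly_de, called_de):
    """Count TP, FP, FN, TN from two equal-length lists of booleans (one per gene)."""
    if len(truly_de) != len(called_de):
        raise ValueError("inputs must have the same length")
    pairs = list(zip(truly_de, called_de))
    tp = sum(1 for t, c in pairs if t and c)
    fp = sum(1 for t, c in pairs if not t and c)      # type I errors
    fn = sum(1 for t, c in pairs if t and not c)      # type II errors
    tn = sum(1 for t, c in pairs if not t and not c)
    return tp, fp, fn, tn


def performance_metrics(tp, fp, fn, tn):
    """Standard rates; returns NaN where a denominator is zero."""
    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # same quantity as recall
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
    }


# Toy example: 8 genes, 3 truly DE, 3 called DE by some method.
truth = [True, True, True, False, False, False, False, False]
called = [True, True, False, True, False, False, False, False]
tp, fp, fn, tn = confusion_counts(truth, called)
metrics = performance_metrics(tp, fp, fn, tn)
```

In this toy case there are 2 true positives, 1 false positive, 1 false negative and 4 true negatives, so specificity is 4/5 and both precision and recall are 2/3.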
Dal Molin, Baruzzo and Di Camillo 2017: 2 conditions of 100 cells each, simulated with 10 000 genes, out of which 2 000 were set to be DE (based on NB and bimodal distributions)
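A simulation of this general shape can be sketched with the standard library alone. This is not the authors' simulation code: only the negative binomial (NB) part is sketched, the bimodal component is left out, all DE genes are shifted upwards, and `base_mean`, `fold_change` and `size` are assumed values chosen for illustration. NB counts are drawn as a gamma-Poisson mixture.

```python
import math
import random


def _poisson(rng, lam):
    """Knuth's multiplication method; adequate for the small rates used here."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def _neg_binomial(rng, mean, size):
    """NB draw with the given mean and dispersion parameter `size` (gamma-Poisson mixture)."""
    lam = rng.gammavariate(size, mean / size)  # gamma with shape=size, scale=mean/size
    return _poisson(rng, lam)


def simulate_two_groups(n_genes=10_000, n_de=2_000, cells_per_group=100,
                        base_mean=5.0, fold_change=4.0, size=1.0, seed=0):
    """Return (group1, group2, is_de); each group is a genes x cells list of counts."""
    rng = random.Random(seed)
    de_genes = set(rng.sample(range(n_genes), n_de))
    is_de = [g in de_genes for g in range(n_genes)]
    group1, group2 = [], []
    for g in range(n_genes):
        mean2 = base_mean * fold_change if is_de[g] else base_mean
        group1.append([_neg_binomial(rng, base_mean, size) for _ in range(cells_per_group)])
        group2.append([_neg_binomial(rng, mean2, size) for _ in range(cells_per_group)])
    return group1, group2, is_de


# Scaled-down run (pure Python is slow at the full 10 000 genes x 200 cells design).
g1, g2, is_de = simulate_two_groups(n_genes=500, n_de=100, cells_per_group=20, seed=42)
```

The returned `is_de` vector is the known truth against which the calls of a DE method can be scored.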
Consistency
Miao et al. 2017
And so much more
Soneson and Robinson 2018
"Bias, robustness and scalability in single-cell differential expression analysis"
36 statistical approaches for DE analysis to compare the expression levels in the two groups of cells
based on 9 data sets with 11-21 separate instances (sample size effect)
extensive evaluation metrics, incl. number of genes found, characteristics of the false positive detections, robustness of methods, similarities between methods, etc.
conquer: a collection of consistently processed, analysis-ready public scRNA-seq data sets
Practicalities
Getting to know your data
Example data: 46078 genes x 96 cells
22229 genes with no expression at all
[Figure: two histograms with y-axis "Frequency"; x-axes "Read Counts" and "0 counts"]
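A first look of this kind can be sketched in a few lines: count, for each gene, in how many cells it has zero counts, and how many genes are never expressed. The genes x cells layout, the function name and the toy matrix are assumptions made for illustration; the example data set from the slide is not reproduced here.

```python
from collections import Counter


def summarize_zeros(counts):
    """counts: list of rows, one row per gene, one column per cell.

    Returns the number of zero-count cells per gene and the number of
    genes with no expression at all (zero in every cell).
    """
    n_cells = len(counts[0]) if counts else 0
    zeros_per_gene = [sum(1 for c in row if c == 0) for row in counts]
    n_silent = sum(1 for z in zeros_per_gene if z == n_cells)
    return zeros_per_gene, n_silent


# Toy matrix: 4 genes x 4 cells.
toy_counts = [
    [0, 0, 0, 0],
    [5, 0, 2, 0],
    [1, 3, 0, 7],
    [0, 0, 0, 0],
]
zeros_per_gene, n_silent = summarize_zeros(toy_counts)
zero_histogram = Counter(zeros_per_gene)  # number of genes per zero-count value
```

Here two of the four genes are silent in every cell; such genes carry no information for DE testing and are typically filtered out first.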
Choosing DE methods
Soneson and Robinson 2018
Remembering the bigger picture
Stegle, Teichmann and Marioni 2015
QC filtering
Cell-cycle phase
Normalization of cell-specific biases
Confounding factors, incl. batch effects
Detection rate, i.e. the fraction of detected genes per cell
Imputation strategies for dropout values
What is pragmatic: programming language, platform, speed, collaborative workflows, etc.
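The detection rate defined in the list above is simple to compute. A minimal sketch, assuming a genes x cells count matrix and treating a gene as detected when its count is above a threshold (0 by default); the function name and the toy matrix are illustrative.

```python
def detection_rate_per_cell(counts, threshold=0):
    """Fraction of genes with count > threshold in each cell (genes x cells matrix)."""
    n_genes = len(counts)
    if n_genes == 0:
        return []
    n_cells = len(counts[0])
    return [
        sum(1 for g in range(n_genes) if counts[g][c] > threshold) / n_genes
        for c in range(n_cells)
    ]


# Toy matrix: 4 genes x 4 cells.
toy_matrix = [
    [0, 0, 0, 0],
    [5, 0, 2, 0],
    [1, 3, 0, 7],
    [0, 0, 0, 0],
]
rates = detection_rate_per_cell(toy_matrix)
```

The first toy cell detects 2 of 4 genes (rate 0.5), the others 1 of 4 (rate 0.25). A per-cell value like this can be inspected during QC or passed as a covariate to DE methods that accept one.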
Staying critical
Summary
What to remember from this hour
https://www.menti.com & 70 52 87
Growing field
Angerer et al. 2017
Growing field
https://www.scrna-tools.org/tools
Zappia, Phipson and Oshlack 2018
Summary
scRNA-seq is a rapidly growing field
DE is a common task, so many newer and better methods will be developed
understanding basic statistical concepts enables one to think more like a statistician, to choose and evaluate methods for a given data set
staying critical, staying updated, staying connected
Bibliography
Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1–9. ISSN: 1367-4803. doi:10.1093/bioinformatics/bty329.
Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243–260. ISSN: 20954697. doi:10.1007/s40484-016-0089-7.
Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255–261. ISSN: 15487105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612.
Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN: 16648021. doi:10.3389/fgene.2017.00062.
Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data," no. May: 1–2. ISSN: 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv: 103549.
Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133–145.
Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85–91. ISSN: 24523100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004.
Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573.
Olga (NBIS) scRNA-seq DE May 2018 43 43
- Outline
- Introduction
- Common methods
-
- More detailed examples
-
- Performance
- Practicalities
- Summary
- Bibliography
-
Performance
False positives (type I error) vs false negatives (type II error)Sensitivity and specificityPrecision and recall
adapted from Wikipedia
Olga (NBIS) scRNA-seq DE May 2018 30 43
Performance
False positives (type I error) vs false negatives (type II error)Sensitivity and specificityPrecision and recall
Dal Molin Baruzzo and Di Camillo 2017 2 conditions of 100 cells each simulated with 10 000genes out of which 2 000 set to DEs (based on NB and bimodal distributions)
Olga (NBIS) scRNA-seq DE May 2018 31 43
Performance
Consistency
Miao et al 2017
Olga (NBIS) scRNA-seq DE May 2018 32 43
Performance
And so much more
Soneson and Robinson 2018
Bias robustness and scalability in single-celldifferential expression analysis
36 statistical approaches for DE analysisto compare the expression levels in thetwo groups of cells
based on 9 data sets with 11 - 21separate instances (sample size effect)
extensive evaluation metrics incl numberof genes found characteristics of the falsepositive detections robustness ofmethods similarities between methodsetc
conquer a collection of consistentlyprocessed analysis-ready publicscRNA-seq data sets
Olga (NBIS) scRNA-seq DE May 2018 33 43
Practicalities
Practicalities
Olga (NBIS) scRNA-seq DE May 2018 34 43
Practicalities
Getting to know your dataExample data 46078 genes x 96 cells22229 genes with no expression at all
Read Counts
Fre
quen
cy
0 500 1000 1500
050
0015
000
0 counts
Fre
quen
cy
0 20 40 60 800
2000
4000
6000
Olga (NBIS) scRNA-seq DE May 2018 35 43
Practicalities
Choosing DE methods
Soneson and Robinson 2018
Olga (NBIS) scRNA-seq DE May 2018 36 43
Practicalities
Rembering the bigger picture
Stegle Teichmann and Marioni 2015
QC filtering
Cell-cycle phase
Normalization of cell-specific biases
Confounding factors incl batcheffects
Detection rate ie the fraction ofdetected genes per cell
Imputations strategies for dropoutvalues
What is pragmatic programminglanguage platform speedcollaborative workflows etc
Olga (NBIS) scRNA-seq DE May 2018 37 43
Practicalities
Staying critical
Olga (NBIS) scRNA-seq DE May 2018 38 43
Summary
What to remember from this hour
httpswwwmenticom amp 70 52 87
Olga (NBIS) scRNA-seq DE May 2018 39 43
Summary
Growing field
Angerer et al 2017
Olga (NBIS) scRNA-seq DE May 2018 40 43
Summary
Growing field
httpswwwscrna-toolsorgtools
Zappia Phipson and Oshlack 2018
Olga (NBIS) scRNA-seq DE May 2018 41 43
Summary
SummaryscRNA-seq is a rapidly growing field
DE is a common task so many newer and better methods will be developed
understanding basic statistical concepts enables one to think more like a statistician tochoose and evaluate methods given data set
staying critical staying updated staying connected
Olga (NBIS) scRNA-seq DE May 2018 42 43
Bibliography
Wu Zhijin et al 2017 ldquoTwo-phase differential expression analysis for single cell RNA-seqrdquoBioinformatics 00 (00) 1ndash9 ISSN 1367-4803 doi101093bioinformaticsbty329
Miao Zhun and Xuegong Zhang 2016 ldquoDifferential expression analyses for single-cellRNA-Seq old questions on new datardquo Quantitative Biology 4 (4) 243ndash260 ISSN 20954697doi101007s40484-016-0089-7
Soneson Charlotte and Mark D Robinson 2018 ldquoBias robustness and scalability in single-celldifferential expression analysisrdquo Nature Methods 15 (4) 255ndash261 ISSN 15487105doi101038nmeth4612 httpdxdoiorg101038nmeth4612
Dal Molin Alessandra Giacomo Baruzzo and Barbara Di Camillo 2017 ldquoSingle-cellRNA-sequencing Assessment of differential expression analysis methodsrdquo Frontiers inGenetics 8 (MAY) ISSN 16648021 doi103389fgene201700062
Miao Zhun et al 2017 ldquoDEsingle for detecting three types of differential expression in single-cellRNA-seq datardquo no May 1ndash2 ISSN 1367-4803 doi101093bioinformaticsbty332arXiv 103549
Stegle Oliver Sarah A Teichmann and John C Marioni 2015 ldquoComputational and analyticalchallenges in single-cell transcriptomicsrdquo Nature reviews Genetics 16 (January 2014)133ndash145
Angerer Philipp et al 2017 ldquoSingle cells make big data New challenges and opportunities intranscriptomicsrdquo Current Opinion in Systems Biology 485ndash91 ISSN 24523100doi101016jcoisb201707004httpdxdoiorg101016jcoisb201707004
Zappia Luke Belinda Phipson and Alicia Oshlack 2018 ldquoExploring the single-cell RNA-seqanalysis landscape with the scRNA-tools databaserdquo bioRxiv 206573 doi101101206573httpswwwbiorxivorgcontentearly20180323206573
Olga (NBIS) scRNA-seq DE May 2018 43 43
- Outline
- Introduction
- Common methods
-
- More detailed examples
-
- Performance
- Practicalities
- Summary
- Bibliography
-
Performance
False positives (type I error) vs false negatives (type II error)Sensitivity and specificityPrecision and recall
Dal Molin Baruzzo and Di Camillo 2017 2 conditions of 100 cells each simulated with 10 000genes out of which 2 000 set to DEs (based on NB and bimodal distributions)
Olga (NBIS) scRNA-seq DE May 2018 31 43
Performance
Consistency
Miao et al 2017
Olga (NBIS) scRNA-seq DE May 2018 32 43
Performance
And so much more
Soneson and Robinson 2018
Bias robustness and scalability in single-celldifferential expression analysis
36 statistical approaches for DE analysisto compare the expression levels in thetwo groups of cells
based on 9 data sets with 11 - 21separate instances (sample size effect)
extensive evaluation metrics incl numberof genes found characteristics of the falsepositive detections robustness ofmethods similarities between methodsetc
conquer a collection of consistentlyprocessed analysis-ready publicscRNA-seq data sets
Olga (NBIS) scRNA-seq DE May 2018 33 43
Practicalities
Practicalities
Olga (NBIS) scRNA-seq DE May 2018 34 43
Practicalities
Getting to know your dataExample data 46078 genes x 96 cells22229 genes with no expression at all
Read Counts
Fre
quen
cy
0 500 1000 1500
050
0015
000
0 counts
Fre
quen
cy
0 20 40 60 800
2000
4000
6000
Olga (NBIS) scRNA-seq DE May 2018 35 43
Practicalities
Choosing DE methods
Soneson and Robinson 2018
Olga (NBIS) scRNA-seq DE May 2018 36 43
Practicalities
Rembering the bigger picture
Stegle Teichmann and Marioni 2015
QC filtering
Cell-cycle phase
Normalization of cell-specific biases
Confounding factors incl batcheffects
Detection rate ie the fraction ofdetected genes per cell
Imputations strategies for dropoutvalues
What is pragmatic programminglanguage platform speedcollaborative workflows etc
Olga (NBIS) scRNA-seq DE May 2018 37 43
Practicalities
Staying critical
Olga (NBIS) scRNA-seq DE May 2018 38 43
Summary
What to remember from this hour
httpswwwmenticom amp 70 52 87
Olga (NBIS) scRNA-seq DE May 2018 39 43
Summary
Growing field
Angerer et al 2017
Olga (NBIS) scRNA-seq DE May 2018 40 43
Summary
Growing field
httpswwwscrna-toolsorgtools
Zappia Phipson and Oshlack 2018
Olga (NBIS) scRNA-seq DE May 2018 41 43
Summary
SummaryscRNA-seq is a rapidly growing field
DE is a common task so many newer and better methods will be developed
understanding basic statistical concepts enables one to think more like a statistician tochoose and evaluate methods given data set
staying critical staying updated staying connected
Olga (NBIS) scRNA-seq DE May 2018 42 43
Bibliography
Wu Zhijin et al 2017 ldquoTwo-phase differential expression analysis for single cell RNA-seqrdquoBioinformatics 00 (00) 1ndash9 ISSN 1367-4803 doi101093bioinformaticsbty329
Miao Zhun and Xuegong Zhang 2016 ldquoDifferential expression analyses for single-cellRNA-Seq old questions on new datardquo Quantitative Biology 4 (4) 243ndash260 ISSN 20954697doi101007s40484-016-0089-7
Soneson Charlotte and Mark D Robinson 2018 ldquoBias robustness and scalability in single-celldifferential expression analysisrdquo Nature Methods 15 (4) 255ndash261 ISSN 15487105doi101038nmeth4612 httpdxdoiorg101038nmeth4612
Dal Molin Alessandra Giacomo Baruzzo and Barbara Di Camillo 2017 ldquoSingle-cellRNA-sequencing Assessment of differential expression analysis methodsrdquo Frontiers inGenetics 8 (MAY) ISSN 16648021 doi103389fgene201700062
Miao Zhun et al 2017 ldquoDEsingle for detecting three types of differential expression in single-cellRNA-seq datardquo no May 1ndash2 ISSN 1367-4803 doi101093bioinformaticsbty332arXiv 103549
Stegle Oliver Sarah A Teichmann and John C Marioni 2015 ldquoComputational and analyticalchallenges in single-cell transcriptomicsrdquo Nature reviews Genetics 16 (January 2014)133ndash145
Angerer Philipp et al 2017 ldquoSingle cells make big data New challenges and opportunities intranscriptomicsrdquo Current Opinion in Systems Biology 485ndash91 ISSN 24523100doi101016jcoisb201707004httpdxdoiorg101016jcoisb201707004
Zappia Luke Belinda Phipson and Alicia Oshlack 2018 ldquoExploring the single-cell RNA-seqanalysis landscape with the scRNA-tools databaserdquo bioRxiv 206573 doi101101206573httpswwwbiorxivorgcontentearly20180323206573
Olga (NBIS) scRNA-seq DE May 2018 43 43
- Outline
- Introduction
- Common methods
-
- More detailed examples
-
- Performance
- Practicalities
- Summary
- Bibliography
-
Performance
Consistency
Miao et al 2017
Olga (NBIS) scRNA-seq DE May 2018 32 43
Performance
And so much more
Soneson and Robinson 2018
Bias robustness and scalability in single-celldifferential expression analysis
36 statistical approaches for DE analysisto compare the expression levels in thetwo groups of cells
based on 9 data sets with 11 - 21separate instances (sample size effect)
extensive evaluation metrics incl numberof genes found characteristics of the falsepositive detections robustness ofmethods similarities between methodsetc
conquer a collection of consistentlyprocessed analysis-ready publicscRNA-seq data sets
Olga (NBIS) scRNA-seq DE May 2018 33 43
Practicalities
Practicalities
Olga (NBIS) scRNA-seq DE May 2018 34 43
Practicalities
Getting to know your dataExample data 46078 genes x 96 cells22229 genes with no expression at all
Read Counts
Fre
quen
cy
0 500 1000 1500
050
0015
000
0 counts
Fre
quen
cy
0 20 40 60 800
2000
4000
6000
Olga (NBIS) scRNA-seq DE May 2018 35 43
Practicalities
Choosing DE methods
Soneson and Robinson 2018
Olga (NBIS) scRNA-seq DE May 2018 36 43
Practicalities
Rembering the bigger picture
Stegle Teichmann and Marioni 2015
QC filtering
Cell-cycle phase
Normalization of cell-specific biases
Confounding factors incl batcheffects
Detection rate ie the fraction ofdetected genes per cell
Imputations strategies for dropoutvalues
What is pragmatic programminglanguage platform speedcollaborative workflows etc
Olga (NBIS) scRNA-seq DE May 2018 37 43
Practicalities
Staying critical
Olga (NBIS) scRNA-seq DE May 2018 38 43
Summary
What to remember from this hour
httpswwwmenticom amp 70 52 87
Olga (NBIS) scRNA-seq DE May 2018 39 43
Summary
Growing field
Angerer et al 2017
Olga (NBIS) scRNA-seq DE May 2018 40 43
Summary
Growing field
httpswwwscrna-toolsorgtools
Zappia Phipson and Oshlack 2018
Olga (NBIS) scRNA-seq DE May 2018 41 43
Summary
SummaryscRNA-seq is a rapidly growing field
DE is a common task so many newer and better methods will be developed
understanding basic statistical concepts enables one to think more like a statistician tochoose and evaluate methods given data set
staying critical staying updated staying connected
Olga (NBIS) scRNA-seq DE May 2018 42 43
Bibliography
Wu Zhijin et al 2017 ldquoTwo-phase differential expression analysis for single cell RNA-seqrdquoBioinformatics 00 (00) 1ndash9 ISSN 1367-4803 doi101093bioinformaticsbty329
Miao Zhun and Xuegong Zhang 2016 ldquoDifferential expression analyses for single-cellRNA-Seq old questions on new datardquo Quantitative Biology 4 (4) 243ndash260 ISSN 20954697doi101007s40484-016-0089-7
Soneson Charlotte and Mark D Robinson 2018 ldquoBias robustness and scalability in single-celldifferential expression analysisrdquo Nature Methods 15 (4) 255ndash261 ISSN 15487105doi101038nmeth4612 httpdxdoiorg101038nmeth4612
Dal Molin Alessandra Giacomo Baruzzo and Barbara Di Camillo 2017 ldquoSingle-cellRNA-sequencing Assessment of differential expression analysis methodsrdquo Frontiers inGenetics 8 (MAY) ISSN 16648021 doi103389fgene201700062
Miao Zhun et al 2017 ldquoDEsingle for detecting three types of differential expression in single-cellRNA-seq datardquo no May 1ndash2 ISSN 1367-4803 doi101093bioinformaticsbty332arXiv 103549
Stegle Oliver Sarah A Teichmann and John C Marioni 2015 ldquoComputational and analyticalchallenges in single-cell transcriptomicsrdquo Nature reviews Genetics 16 (January 2014)133ndash145
Angerer Philipp et al 2017 ldquoSingle cells make big data New challenges and opportunities intranscriptomicsrdquo Current Opinion in Systems Biology 485ndash91 ISSN 24523100doi101016jcoisb201707004httpdxdoiorg101016jcoisb201707004
Zappia Luke Belinda Phipson and Alicia Oshlack 2018 ldquoExploring the single-cell RNA-seqanalysis landscape with the scRNA-tools databaserdquo bioRxiv 206573 doi101101206573httpswwwbiorxivorgcontentearly20180323206573
Olga (NBIS) scRNA-seq DE May 2018 43 43
- Outline
- Introduction
- Common methods
-
- More detailed examples
-
- Performance
- Practicalities
- Summary
- Bibliography
-
Performance
And so much more
Soneson and Robinson 2018
Bias robustness and scalability in single-celldifferential expression analysis
36 statistical approaches for DE analysisto compare the expression levels in thetwo groups of cells
based on 9 data sets with 11 - 21separate instances (sample size effect)
extensive evaluation metrics incl numberof genes found characteristics of the falsepositive detections robustness ofmethods similarities between methodsetc
conquer a collection of consistentlyprocessed analysis-ready publicscRNA-seq data sets
Olga (NBIS) scRNA-seq DE May 2018 33 43
Practicalities
Practicalities
Olga (NBIS) scRNA-seq DE May 2018 34 43
Practicalities
Getting to know your dataExample data 46078 genes x 96 cells22229 genes with no expression at all
Read Counts
Fre
quen
cy
0 500 1000 1500
050
0015
000
0 counts
Fre
quen
cy
0 20 40 60 800
2000
4000
6000
Olga (NBIS) scRNA-seq DE May 2018 35 43
Practicalities
Choosing DE methods
Soneson and Robinson 2018
Olga (NBIS) scRNA-seq DE May 2018 36 43
Practicalities
Rembering the bigger picture
Stegle Teichmann and Marioni 2015
QC filtering
Cell-cycle phase
Normalization of cell-specific biases
Confounding factors incl batcheffects
Detection rate ie the fraction ofdetected genes per cell
Imputations strategies for dropoutvalues
What is pragmatic programminglanguage platform speedcollaborative workflows etc
Olga (NBIS) scRNA-seq DE May 2018 37 43
Practicalities
Staying critical
Olga (NBIS) scRNA-seq DE May 2018 38 43
Summary
What to remember from this hour
httpswwwmenticom amp 70 52 87
Olga (NBIS) scRNA-seq DE May 2018 39 43
Summary
Growing field
Angerer et al 2017
Olga (NBIS) scRNA-seq DE May 2018 40 43
Summary
Growing field
httpswwwscrna-toolsorgtools
Zappia Phipson and Oshlack 2018
Olga (NBIS) scRNA-seq DE May 2018 41 43
Summary
SummaryscRNA-seq is a rapidly growing field
DE is a common task so many newer and better methods will be developed
understanding basic statistical concepts enables one to think more like a statistician tochoose and evaluate methods given data set
staying critical staying updated staying connected
Olga (NBIS) scRNA-seq DE May 2018 42 43
Bibliography
Wu Zhijin et al 2017 ldquoTwo-phase differential expression analysis for single cell RNA-seqrdquoBioinformatics 00 (00) 1ndash9 ISSN 1367-4803 doi101093bioinformaticsbty329
Miao Zhun and Xuegong Zhang 2016 ldquoDifferential expression analyses for single-cellRNA-Seq old questions on new datardquo Quantitative Biology 4 (4) 243ndash260 ISSN 20954697doi101007s40484-016-0089-7
Soneson Charlotte and Mark D Robinson 2018 ldquoBias robustness and scalability in single-celldifferential expression analysisrdquo Nature Methods 15 (4) 255ndash261 ISSN 15487105doi101038nmeth4612 httpdxdoiorg101038nmeth4612
Dal Molin Alessandra Giacomo Baruzzo and Barbara Di Camillo 2017 ldquoSingle-cellRNA-sequencing Assessment of differential expression analysis methodsrdquo Frontiers inGenetics 8 (MAY) ISSN 16648021 doi103389fgene201700062
Miao Zhun et al 2017 ldquoDEsingle for detecting three types of differential expression in single-cellRNA-seq datardquo no May 1ndash2 ISSN 1367-4803 doi101093bioinformaticsbty332arXiv 103549
Stegle Oliver Sarah A Teichmann and John C Marioni 2015 ldquoComputational and analyticalchallenges in single-cell transcriptomicsrdquo Nature reviews Genetics 16 (January 2014)133ndash145
Angerer Philipp et al 2017 ldquoSingle cells make big data New challenges and opportunities intranscriptomicsrdquo Current Opinion in Systems Biology 485ndash91 ISSN 24523100doi101016jcoisb201707004httpdxdoiorg101016jcoisb201707004
Zappia Luke Belinda Phipson and Alicia Oshlack 2018 ldquoExploring the single-cell RNA-seqanalysis landscape with the scRNA-tools databaserdquo bioRxiv 206573 doi101101206573httpswwwbiorxivorgcontentearly20180323206573
Olga (NBIS) scRNA-seq DE May 2018 43 43
- Outline
- Introduction
- Common methods
-
- More detailed examples
-
- Performance
- Practicalities
- Summary
- Bibliography
-
Practicalities
Practicalities
Olga (NBIS) scRNA-seq DE May 2018 34 43
Practicalities
Getting to know your dataExample data 46078 genes x 96 cells22229 genes with no expression at all
Read Counts
Fre
quen
cy
0 500 1000 1500
050
0015
000
0 counts
Fre
quen
cy
0 20 40 60 800
2000
4000
6000
Olga (NBIS) scRNA-seq DE May 2018 35 43
Practicalities
Choosing DE methods
Soneson and Robinson 2018
Olga (NBIS) scRNA-seq DE May 2018 36 43
Practicalities
Rembering the bigger picture
Stegle Teichmann and Marioni 2015
QC filtering
Cell-cycle phase
Normalization of cell-specific biases
Confounding factors incl batcheffects
Detection rate ie the fraction ofdetected genes per cell
Imputations strategies for dropoutvalues
What is pragmatic programminglanguage platform speedcollaborative workflows etc
Olga (NBIS) scRNA-seq DE May 2018 37 43
Practicalities
Staying critical
Olga (NBIS) scRNA-seq DE May 2018 38 43
Summary
What to remember from this hour
httpswwwmenticom amp 70 52 87
Olga (NBIS) scRNA-seq DE May 2018 39 43
Summary
Growing field
Angerer et al 2017
Olga (NBIS) scRNA-seq DE May 2018 40 43
Summary
Growing field
httpswwwscrna-toolsorgtools
Zappia Phipson and Oshlack 2018
Olga (NBIS) scRNA-seq DE May 2018 41 43
Summary
SummaryscRNA-seq is a rapidly growing field
DE is a common task so many newer and better methods will be developed
understanding basic statistical concepts enables one to think more like a statistician tochoose and evaluate methods given data set
staying critical staying updated staying connected
Olga (NBIS) scRNA-seq DE May 2018 42 43
Bibliography
Wu Zhijin et al 2017 ldquoTwo-phase differential expression analysis for single cell RNA-seqrdquoBioinformatics 00 (00) 1ndash9 ISSN 1367-4803 doi101093bioinformaticsbty329
Miao Zhun and Xuegong Zhang 2016 ldquoDifferential expression analyses for single-cellRNA-Seq old questions on new datardquo Quantitative Biology 4 (4) 243ndash260 ISSN 20954697doi101007s40484-016-0089-7
Soneson Charlotte and Mark D Robinson 2018 ldquoBias robustness and scalability in single-celldifferential expression analysisrdquo Nature Methods 15 (4) 255ndash261 ISSN 15487105doi101038nmeth4612 httpdxdoiorg101038nmeth4612
Dal Molin Alessandra Giacomo Baruzzo and Barbara Di Camillo 2017 ldquoSingle-cellRNA-sequencing Assessment of differential expression analysis methodsrdquo Frontiers inGenetics 8 (MAY) ISSN 16648021 doi103389fgene201700062
Miao Zhun et al 2017 ldquoDEsingle for detecting three types of differential expression in single-cellRNA-seq datardquo no May 1ndash2 ISSN 1367-4803 doi101093bioinformaticsbty332arXiv 103549
Stegle Oliver Sarah A Teichmann and John C Marioni 2015 ldquoComputational and analyticalchallenges in single-cell transcriptomicsrdquo Nature reviews Genetics 16 (January 2014)133ndash145
Angerer Philipp et al 2017 ldquoSingle cells make big data New challenges and opportunities intranscriptomicsrdquo Current Opinion in Systems Biology 485ndash91 ISSN 24523100doi101016jcoisb201707004httpdxdoiorg101016jcoisb201707004
Zappia Luke Belinda Phipson and Alicia Oshlack 2018 ldquoExploring the single-cell RNA-seqanalysis landscape with the scRNA-tools databaserdquo bioRxiv 206573 doi101101206573httpswwwbiorxivorgcontentearly20180323206573
Olga (NBIS) scRNA-seq DE May 2018 43 43
- Outline
- Introduction
- Common methods
-
- More detailed examples
-
- Performance
- Practicalities
- Summary
- Bibliography
-
Practicalities
Getting to know your dataExample data 46078 genes x 96 cells22229 genes with no expression at all
Read Counts
Fre
quen
cy
0 500 1000 1500
050
0015
000
0 counts
Fre
quen
cy
0 20 40 60 800
2000
4000
6000
Olga (NBIS) scRNA-seq DE May 2018 35 43
Practicalities
Choosing DE methods
Soneson and Robinson 2018
Olga (NBIS) scRNA-seq DE May 2018 36 43
Practicalities
Rembering the bigger picture
Stegle Teichmann and Marioni 2015
QC filtering
Cell-cycle phase
Normalization of cell-specific biases
Confounding factors incl batcheffects
Detection rate ie the fraction ofdetected genes per cell
Imputations strategies for dropoutvalues
What is pragmatic programminglanguage platform speedcollaborative workflows etc
Olga (NBIS) scRNA-seq DE May 2018 37 43
Practicalities
Staying critical
Olga (NBIS) scRNA-seq DE May 2018 38 43
Summary
What to remember from this hour
httpswwwmenticom amp 70 52 87
Olga (NBIS) scRNA-seq DE May 2018 39 43
Summary
Growing field
Angerer et al 2017
Olga (NBIS) scRNA-seq DE May 2018 40 43
Summary
Growing field
httpswwwscrna-toolsorgtools
Zappia Phipson and Oshlack 2018
Olga (NBIS) scRNA-seq DE May 2018 41 43
Summary
SummaryscRNA-seq is a rapidly growing field
DE is a common task so many newer and better methods will be developed
understanding basic statistical concepts enables one to think more like a statistician tochoose and evaluate methods given data set
staying critical staying updated staying connected
Olga (NBIS) scRNA-seq DE May 2018 42 43
Bibliography
Wu Zhijin et al 2017 ldquoTwo-phase differential expression analysis for single cell RNA-seqrdquoBioinformatics 00 (00) 1ndash9 ISSN 1367-4803 doi101093bioinformaticsbty329
Miao Zhun and Xuegong Zhang 2016 ldquoDifferential expression analyses for single-cellRNA-Seq old questions on new datardquo Quantitative Biology 4 (4) 243ndash260 ISSN 20954697doi101007s40484-016-0089-7
Soneson Charlotte and Mark D Robinson 2018 ldquoBias robustness and scalability in single-celldifferential expression analysisrdquo Nature Methods 15 (4) 255ndash261 ISSN 15487105doi101038nmeth4612 httpdxdoiorg101038nmeth4612
Dal Molin Alessandra Giacomo Baruzzo and Barbara Di Camillo 2017 ldquoSingle-cellRNA-sequencing Assessment of differential expression analysis methodsrdquo Frontiers inGenetics 8 (MAY) ISSN 16648021 doi103389fgene201700062
Miao Zhun et al 2017 ldquoDEsingle for detecting three types of differential expression in single-cellRNA-seq datardquo no May 1ndash2 ISSN 1367-4803 doi101093bioinformaticsbty332arXiv 103549
Stegle Oliver Sarah A Teichmann and John C Marioni 2015 ldquoComputational and analyticalchallenges in single-cell transcriptomicsrdquo Nature reviews Genetics 16 (January 2014)133ndash145
Angerer Philipp et al 2017 ldquoSingle cells make big data New challenges and opportunities intranscriptomicsrdquo Current Opinion in Systems Biology 485ndash91 ISSN 24523100doi101016jcoisb201707004httpdxdoiorg101016jcoisb201707004
Zappia Luke Belinda Phipson and Alicia Oshlack 2018 ldquoExploring the single-cell RNA-seqanalysis landscape with the scRNA-tools databaserdquo bioRxiv 206573 doi101101206573httpswwwbiorxivorgcontentearly20180323206573
Olga (NBIS) scRNA-seq DE May 2018 43 43
- Outline
- Introduction
- Common methods
-
- More detailed examples
-
- Performance
- Practicalities
- Summary
- Bibliography
-
Practicalities
Choosing DE methods
Soneson and Robinson 2018
Olga (NBIS) scRNA-seq DE May 2018 36 43
Practicalities
Rembering the bigger picture
Stegle Teichmann and Marioni 2015
QC filtering
Cell-cycle phase
Normalization of cell-specific biases
Confounding factors incl batcheffects
Detection rate ie the fraction ofdetected genes per cell
Imputations strategies for dropoutvalues
What is pragmatic programminglanguage platform speedcollaborative workflows etc
Olga (NBIS) scRNA-seq DE May 2018 37 43
Practicalities
Staying critical
Olga (NBIS) scRNA-seq DE May 2018 38 43
Summary
What to remember from this hour
httpswwwmenticom amp 70 52 87
Olga (NBIS) scRNA-seq DE May 2018 39 43
Summary
Growing field
Angerer et al 2017
Olga (NBIS) scRNA-seq DE May 2018 40 43
Summary
Growing field
httpswwwscrna-toolsorgtools
Zappia Phipson and Oshlack 2018
Olga (NBIS) scRNA-seq DE May 2018 41 43
Summary
SummaryscRNA-seq is a rapidly growing field
DE is a common task so many newer and better methods will be developed
understanding basic statistical concepts enables one to think more like a statistician tochoose and evaluate methods given data set
staying critical staying updated staying connected
Olga (NBIS) scRNA-seq DE May 2018 42 43
Bibliography
Wu, Zhijin, et al. 2017. "Two-phase differential expression analysis for single cell RNA-seq." Bioinformatics 00 (00): 1–9. ISSN 1367-4803. doi:10.1093/bioinformatics/bty329
Miao, Zhun, and Xuegong Zhang. 2016. "Differential expression analyses for single-cell RNA-Seq: old questions on new data." Quantitative Biology 4 (4): 243–260. ISSN 2095-4697. doi:10.1007/s40484-016-0089-7
Soneson, Charlotte, and Mark D. Robinson. 2018. "Bias, robustness and scalability in single-cell differential expression analysis." Nature Methods 15 (4): 255–261. ISSN 1548-7105. doi:10.1038/nmeth.4612. http://dx.doi.org/10.1038/nmeth.4612
Dal Molin, Alessandra, Giacomo Baruzzo, and Barbara Di Camillo. 2017. "Single-cell RNA-sequencing: Assessment of differential expression analysis methods." Frontiers in Genetics 8 (MAY). ISSN 1664-8021. doi:10.3389/fgene.2017.00062
Miao, Zhun, et al. 2017. "DEsingle for detecting three types of differential expression in single-cell RNA-seq data." no. May: 1–2. ISSN 1367-4803. doi:10.1093/bioinformatics/bty332. arXiv 103549
Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. 2015. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16 (January 2014): 133–145
Angerer, Philipp, et al. 2017. "Single cells make big data: New challenges and opportunities in transcriptomics." Current Opinion in Systems Biology 4: 85–91. ISSN 2452-3100. doi:10.1016/j.coisb.2017.07.004. http://dx.doi.org/10.1016/j.coisb.2017.07.004
Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2018. "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database." bioRxiv 206573. doi:10.1101/206573. https://www.biorxiv.org/content/early/2018/03/23/206573