Comparison of tools to detect differential expression in
RNA-seq studies
Fatemeh Seyednasrollah, Asta Laiho, Laura L. Elo
University of Turku
Turku Center for Biotechnology
Computational Biomedicine Group
Contents
Biological Background
Computational Background
Experimental design
Results
Conclusions
For more information, please refer to:
“Comparison of software packages for detecting differential expression in RNA-seq studies”, Fatemeh Seyednasrollah, Asta Laiho, Laura L. Elo, Briefings in Bioinformatics
Genome
• Genetic material is encoded in DNA (RNA in some viruses) molecules and carries the instructions for the functional elements of living organisms.
• Coding elements: A, T, C, G (4^n possible sequences of length n)
http://ndla.no/nb/node/127042?fag=52234
RNA: Phenotype indicator
• Complementary rule: A binds T (U in RNA) and C binds G
http://hyperphysics.phy-astr.gsu.edu/hbase/organic/transcription.html
Gene Expression: From DNA to Proteins
1999 Addison Wesley Longman Inc.
Sequencing
• Process of determining the precise order of
nucleotides within a DNA strand.
• Next Generation Sequencing (NGS): massively parallel
RNA-sequencing
RNA sequencing:
The most recent powerful technology in “Gene Expression Profiling” studies or “Transcriptomics”
Overall Mechanism:
Detection and quantification of any genomic features of interest in terms of “read counts”
RNA-seq in a Nutshell
Millions of short reads from transcripts are produced, sequenced and then reassembled and/or mapped to the reference of origin using mapping algorithms.
Computational Background
There is a linear relation between the read counts and the expression level of the biological feature of interest.
Counts are non-negative integers (discrete distribution).
Library size: the total number of reads for a specific sample; it is determined by the sequencing depth.
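As a minimal sketch of what library size means in practice, the snippet below computes per-sample library sizes from a toy table of counts and rescales the counts to counts per million. The gene names, sample names and values are hypothetical, and plain library-size scaling is used only for illustration; it is not the TMM or package-default normalization used in the comparison.

```python
# Toy table of counts: rows are genomic features, columns are samples.
# Gene names, sample names and values are hypothetical.
samples = ["sample_1", "sample_2"]
counts = {
    "gene_a": [10, 40],
    "gene_b": [30, 120],
    "gene_c": [60, 40],
}


def library_sizes(counts, n_samples):
    """Total number of reads per sample (column sums of the count table)."""
    return [sum(row[j] for row in counts.values()) for j in range(n_samples)]


def counts_per_million(counts, n_samples):
    """Rescale each count by its sample's library size to counts per million."""
    sizes = library_sizes(counts, n_samples)
    return {
        gene: [row[j] * 1_000_000 / sizes[j] for j in range(n_samples)]
        for gene, row in counts.items()
    }


sizes = library_sizes(counts, len(samples))
cpm = counts_per_million(counts, len(samples))
print(sizes)
print(cpm["gene_a"], cpm["gene_b"])
```

In this toy table, gene_b has four times more reads in sample_2 than in sample_1, but sample_2 was also sequenced twice as deeply, which is why raw counts are not compared directly across samples.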
Analysis Pipeline
Quality control (FASTQ files): FastQC
Alignment: TopHat2
Expression level quantification: HTSeq
Normalization: package default / TMM
Statistical analysis: eight state-of-the-art methods
Millions of short reads
Quality Control
Mapping / read assembly
Summarization: Table of counts
Normalization
DE testing
Flowchart is based on: “From RNA-seq reads to differential expression results”, Alicia Oshlack et al, 2010, Genome Biology
Table of Counts?
Matrix of data with genomic features as rows and experimental samples as columns.
Is the difference across the conditions greater than what we expect from normal biological variation?
Can we detect reliable differentially expressed biomarkers?
Differential Expression Analysis
Normalization: technical biases (different platforms), length biases, RNA composition biases
Statistical modeling: parametric or non-parametric
Testing the differential expression: investigating the significance of the differences across conditions
Data Complexity
Accurate estimation of variance
Data sampling: individual level, RNA extraction level, RNA sequencing level
Small number of samples
Variation between biological samples leads to “overdispersion”
Poisson
Negative Binomial
Anders and Huber Genome Biology 2010 11:R106 doi:10.1186/gb-2010-11-10-r106
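The contrast between the two distributions can be sketched with a small simulation. The snippet below is an illustration only: it draws Poisson counts and negative binomial counts (generated as a gamma-Poisson mixture) with the same mean and compares their variances. The mean, the dispersion value and the sample size are assumptions chosen for the example, and the parametrization variance = mu + dispersion * mu^2 is one common convention, not necessarily the exact one used by each package.

```python
import math
import random
import statistics


def sample_poisson(lam, rng):
    """Draw one Poisson count with mean lam (Knuth's multiplication method)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def sample_negative_binomial(mu, dispersion, rng):
    """Draw one count from a gamma-Poisson mixture with mean mu.

    Under this parametrization the variance is mu + dispersion * mu**2.
    """
    shape = 1.0 / dispersion
    scale = mu * dispersion
    return sample_poisson(rng.gammavariate(shape, scale), rng)


rng = random.Random(1)
mu, dispersion, n = 10.0, 0.5, 5000  # assumed values, for illustration only

poisson_counts = [sample_poisson(mu, rng) for _ in range(n)]
nb_counts = [sample_negative_binomial(mu, dispersion, rng) for _ in range(n)]

for name, values in (("Poisson", poisson_counts), ("negative binomial", nb_counts)):
    print(name, statistics.mean(values), statistics.variance(values))
```

For the Poisson draws the variance stays close to the mean, while for the negative binomial draws it is several times larger; this extra variance is what "overdispersion" refers to.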
Variance and Statistical Testing
Null hypothesis:
The expression of gene “i” is equal across different
conditions.
If the null hypothesis is rejected, is it because of
natural biological variation (estimated variance) or is it
because of the experimental condition difference?
Main challenge:
Accurate estimation of different types of variance
Methods Algorithms
Research Question
Various statistical methods are available.
Yet neither a method optimized for all data sets nor clear guidance for choosing the best methodology has been reported.
To assist in making a biologically and statistically meaningful decision, we present a systematic, practical pipeline comparison of eight state-of-the-art computational methods.
Data sets
Data sets:
Generated on the Illumina platform
Publicly available, to make the analysis reproducible
Different levels of heterogeneity
Different organisms
Large number of samples
Human: 28 female, 28 male
Mouse: 10 C57BL/6J strain, 10 DBA/2J strain
Human data set: lymphoblastoid cell lines of 56 unrelated Nigerian individuals
Performance Criteria
Estimation Criteria:
Number of detections and their consistency
False positive discoveries
Correlation between methods
Run times
FDR < 0.05
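The slides give the FDR threshold but not how the FDR is computed, and each package reports it in its own way. As a hedged illustration, the sketch below implements the Benjamini-Hochberg adjustment, one widely used FDR procedure, and applies the 0.05 cut-off. The p-values are hypothetical, and the choice of Benjamini-Hochberg is an assumption made for this example.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-gene p-values.
pvalues = [0.001, 0.008, 0.039, 0.041, 0.6]
adjusted = benjamini_hochberg(pvalues)
detections = [i for i, q in enumerate(adjusted) if q < 0.05]
print(adjusted)
print(detections)
```

Note that two of the raw p-values below 0.05 no longer pass after adjustment, which is why the number of detections is counted on adjusted values rather than on raw p-values.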
Experimental Design
Randomly select an initial N samples from each distinct group
Run the statistical analysis
Add x more samples to the input until all samples are included
Repeat the task ten times
For the false discoveries, we followed the same procedure, but the selection was made within a single group (for example, sampling only within the female samples)
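The sampling scheme above can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' actual code: the sample names, the function names, the random seed and the way a single group is split in half for the mock comparison are all hypothetical. The subset sizes 2, 3, 6, 8 and 10 follow the mouse design; the mock sizes are chosen only so that they fit inside one half of a group.

```python
import random


def nested_subsets(group, sizes, rng):
    """Shuffle one group once, then return growing (nested) subsets of it."""
    shuffled = list(group)
    rng.shuffle(shuffled)
    return [shuffled[:n] for n in sizes]


def comparison_design(group_a, group_b, sizes, n_repeats=10, seed=0):
    """Real comparison: the two sides are drawn from two distinct groups."""
    rng = random.Random(seed)
    design = []
    for repeat in range(n_repeats):
        subsets_a = nested_subsets(group_a, sizes, rng)
        subsets_b = nested_subsets(group_b, sizes, rng)
        for n, a, b in zip(sizes, subsets_a, subsets_b):
            design.append({"repeat": repeat, "n": n, "a": a, "b": b})
    return design


def mock_design(group, sizes, n_repeats=10, seed=0):
    """Mock comparison: both sides are drawn from one and the same group."""
    rng = random.Random(seed)
    design = []
    for repeat in range(n_repeats):
        shuffled = list(group)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        side_a, side_b = shuffled[:half], shuffled[half:]
        for n in sizes:
            if n <= half:
                design.append({"repeat": repeat, "n": n,
                               "a": side_a[:n], "b": side_b[:n]})
    return design


# Hypothetical sample labels for the two mouse strains.
b6 = [f"B6_{i}" for i in range(1, 11)]
d2 = [f"D2_{i}" for i in range(1, 11)]

real_runs = comparison_design(b6, d2, sizes=[2, 3, 6, 8, 10])
mock_runs = mock_design(b6, sizes=[2, 3, 5])
print(len(real_runs), len(mock_runs))
print(real_runs[0])
```

Each entry describes one input to the statistical analysis. Any gene called significant in a mock run counts as a false discovery, since both sides come from the same condition.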
Experimental Design
D2 strain: randomly 2 samples, then 3, 6, 8 and 10 samples
B6 strain: randomly 2 samples, then 3, 6, 8 and 10 samples
Mouse data set experimental design: samples are added step by step to simulate a wide range of possible situations
Data Sets: Intrinsic Properties
Mouse data set: more homogeneous
Human data set: heterogeneous plus outliers (male)
[Figure: Spearman correlation]
Results: Number of Detections
The number of detections increases as the sample size increases; exceptions: NOIseq and Cuffdiff 2.0.0 (poor detection)
At some point the curve starts to stabilize, especially in the human data set
Moderate methods: DESeq (more conservative) and Limma
Data-dependent methods: EBSeq and baySeq
Results: Methods consistency
Limma and DESeq among the top methods
Results: Relative False Discoveries
The number of false discoveries decreases with increasing sample size, especially in the less heterogeneous data set (mouse)
In general, Limma, DESeq and baySeq perform well in mock comparisons
EBSeq, SAMseq and edgeR find more false positives
Results: Methods Overlaps (mouse)
Do you need to run a combinatorial analysis?
Results consider only significantly differentially expressed genes (FDR < 0.05)
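One simple way to look at overlaps between methods is to treat each method's significant genes as a set. The sketch below is illustrative only: the method labels and gene identifiers are hypothetical placeholders, and the "detected by at least k methods" rule is just one possible form of combined analysis, not a rule stated in the slides.

```python
from itertools import combinations

# Hypothetical sets of genes called significant (FDR < 0.05) by each method.
detected = {
    "method_a": {"gene_1", "gene_2", "gene_3", "gene_4"},
    "method_b": {"gene_2", "gene_3", "gene_5"},
    "method_c": {"gene_2", "gene_3", "gene_4", "gene_6"},
}


def pairwise_overlaps(detected):
    """Number of shared detections for every pair of methods."""
    return {
        (a, b): len(detected[a] & detected[b])
        for a, b in combinations(sorted(detected), 2)
    }


def detected_by_at_least(detected, k):
    """Genes reported by at least k of the methods."""
    votes = {}
    for genes in detected.values():
        for gene in genes:
            votes[gene] = votes.get(gene, 0) + 1
    return {gene for gene, n in votes.items() if n >= k}


overlaps = pairwise_overlaps(detected)
consensus_all = detected_by_at_least(detected, 3)
consensus_two = detected_by_at_least(detected, 2)
print(overlaps)
print(sorted(consensus_all), sorted(consensus_two))
```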
Results: Methods Ranking Correlations
Results are based on the 1952 genes that were among the top 1000 ranked genes across all methods for the mouse data set, and the corresponding Spearman rank correlations
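For readers unfamiliar with the statistic, a Spearman rank correlation is the Pearson correlation computed on ranks. The sketch below shows this with tied values receiving their average rank. The two score lists are hypothetical per-gene significance scores from two methods, not values from the study.

```python
import math


def average_ranks(values):
    """Rank values from 1 (smallest) upward; ties share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores for the same five genes from two methods.
scores_method_a = [0.001, 0.020, 0.300, 0.045, 0.900]
scores_method_b = [0.002, 0.030, 0.200, 0.010, 0.700]
rho = spearman(scores_method_a, scores_method_b)
print(rho)
```

Because only the ordering of the genes matters, two methods can agree perfectly under this measure even when their p-values are on very different scales.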
Results: Run Time
Run time: the most efficient method has the shortest run time
Conclusions
Which methods can you rely on?
Limma and DESeq show higher performance and a higher level of consistency under different conditions
Do you have a small number of biological replicates?
We do not recommend non-parametric approaches like SAMseq
Try a combinatorial analysis to verify the results
Do you have more than five replicates?
Avoid using NOIseq and Cuffdiff
Do you use edgeR in your analyses?
Be aware of the possibility of inconsistency between results and the high potential risk of false discoveries
Do you have a heterogeneous data set?
baySeq can be a powerful method
And, most important of all: thoroughly investigate the properties of the input data set in advance; then you can choose the statistical method