comparison of tools to detect differential expression in...

Comparison of tools to detect differential expression in

RNA-seq studies

Fatemeh Seyednasrollah, Asta Laiho, Laura L. Elo

University of Turku

Turku Center for Biotechnology

Computational Biomedicine Group

1

Contents Biological Background

Computational Background

Experimental design

Results

Conclusions

2

For more information please refer to:

'Comparison of software packages for detecting differential expression in RNA-seq studies‘ Fatemeh Seyednasrollah, Asta Laiho, Laura L. Elo “Briefings in Bioinformatics”

Genome • Genetic material is encoded by DNA (RNA in some

viruses) molecules and determines the instruction of

functional elements of living organisms.

• Coding elements: A, T, C, G (4n)

3 http://ndla.no/nb/node/127042?fag=52234

RNA: Phenotype indicator

• Complementary rule: A binds T and C binds G

4 http://hyperphysics.phy-astr.gsu.edu/hbase/organic/transcription.html

Gene Expression: From DNA to

Proteins

5 1999 Addison Wesley Longman Inc.

Sequencing

• Process of determining the precise order of

nucleotides within a DNA strand.

• Next Generation Sequencing(NGS): massively

parallel

6

RNA-sequencing

RNA sequencing:

The most recent powerful technology in

“Gene Expression Profiling” studies or

“Transcriptomics”

Overall Mechanism:

Detection and quantification of any genomic

features of interest in terms of “read counts”

7

RNA-seq in a Nutshell

Millions of short

reads from

transcripts are

produced,

sequenced and

then

reassemble

and/or

mapped to the

reference of

origination using

mapping

algorithms.

8

Computational Background

There is a linear relation between the read counts

and expression level of biological feature of

interest.

Counts are positive integers (discrete distribution).

Library size: total number of reads for a specific

sample and is determined with sequencing depth.

9

Analysis Pipeline Quality control(fastq files)

fastQC

Alignment

TopHat2

Expression level quantification

HTSeq

Normalization

Package default/TMM

Statistical analysis

Eight state-of-the-art

methods

10

Millions of short reads

Quality Control

Mapping /reads reassembling

Summarization: Table of counts

Normalization

DE testing

Flowchart is based on: “From RNA-seq reads to differential expression results”, Alicia Oshlack et al, 2010, Genome Biology

Table of Counts?

11

Matrix of data with genomic features as rows and experiment samples as column Is the difference across the conditions greater than what we expect from normal biological variation? Can we detect reliable differentially expressed biomarkers?

Differential Expression Analysis

Normalization

Technical biases: different platforms

Length biases

RNA composition biases

Statistical modeling

Parametric or non-parametric

Testing the differential expression

Investigating the significance of differentiation

across different conditions

12

Data Complexity Accurate estimation of variance

Data sampling: Individual level

RNA extraction level

RNA sequencing level

Few number of samples

Biological samples leads to “overdispersion”

13

Poisson

Negative Binomial

Anders and Huber Genome Biology 2010 11:R106 doi:10.1186/gb-2010-11-10-r106

Variance and Statistical Testing

Null hypothesis:

The expression of gene “i” is equal across different

conditions.

If the null hypothesis is rejected, is it because of

natural biological variation (estimated variance) or is it

because of the experimental condition difference?

Main challenge:

Accurate estimation of different types of variance

14

Methods Algorithms

15

Research Question

Various statistical methods are available

Yet, neither an optimized method for all datasets,

nor a clear instruction for choosing the best

methodology has been reported.

To assist in making a biologically

and statistically meaningful

decision, we present a systematic

practical pipeline comparison

of eight state-of-the-art

computational methods.

16

Data sets

Datasets: Generate by Illumina platform Publicly available to make the analysis

reproducible

Different level of heterogeneity

Different organisms

large number of samples

17

28 Female

28 Male

Human

10 C57BL/6J strain

10 DBA/2J strain

Mouse

Human data set: lymphoblastoid cell lines of 56 unrelated Nigerian individuals

Performance Criteria

Estimation Criteria:

Number of detections and their consistency

False positive discoveries

Correlation between methods

Run times

FDR < 0.05

18

Experimental Design

19

Select initial N samples from each distinct groups randomly

Run the statistical analysis

Add x more samples as the input until all

Repeat the task for ten times

For the false discoveries, we did the same but selection procedure happened within the same group like sampling within female samples

Experimental Design

20

D2 strain

Randomly 2 samples

3 samples

6 samples

8 samples

10 samples

B6 strain

Randomly 2 samples

3 samples

6 samples

8 samples

10 samples

Add more

Mouse data set experimental design -> try to simulate wide range of possible situations

Data sets Intrinsic Properties Mouse Data set : more homogenous

Human Data set : heterogeneous plus outliers

(male)

21

Sp

earm

an C

orr

elat

ion

Results : Number of Detections

22

Number of detections increase as the sample sizes increase, exception: NOIseq and Cuffdiff2.0.0 (poor detection)

At some points the curve starts to be stabilized especially in human data set Moderate methods : DESeq (more conservative) and Limma Data dependent methods : EBSeq and baySeq

Results: Methods consistency

23

Limma and DESeq among the top methods

Results : Relative False Discoveries

24

Number of False discoveries decreases by increasing sample sizes , especially in less heterogeneous data set (mouse)

In general, Limma, DESeq and baySeq performs well in mock comparisons EBSeq, SAMseq and edgeR find more number of false positives

Results : Methods Overlaps

(mouse)

25

Do you need running combinatorial analysis? Results only consider on significantly differential expressed genes (FDR <0.05)

Results : Methods Ranking Correlations

26

Results consider

on 1952 genes

which were

among the top

1000 ranked

genes within all

methods for

Mouse data set

and

corresponding

spearman rank

correlations

Results : Run Time Run Time : The most efficient method has the least

run time

27

Conclusions

Which methods you can rely on?

Limma and DESeq represent higher performance and level of consistency under different conditions

Do you have small number of biological replicates? We do not recommend non-parametric approaches like SAMseq

Try combinational analysis to verify the results

Do you have more than five replicates?

Avoid using NOIseq and Cuffdiff

Do you use edgeR in your analyses?

be aware of possibility of inconsistency between results and high potential risk of false discoveries

Do you have heterogeneous data set?

baySeq can be a powerful method

And … more important of all: Investigate thoroughly input data set properties in advance next you can choose the statistical method

28

comparison of tools to detect differential expression in...

Documents