RNA-seq DE methods review: Applied Bioinformatics Journal Club

Post on 11-May-2015

Category:

Education

Applied Bioinformatics Journal Club
Wednesday, March 5

Background

• Comparison of commonly used DE software packages
  – Cuffdiff
  – edgeR
  – DESeq
  – PoissonSeq
  – baySeq
  – limma

• Two benchmark datasets
  – Sequencing Quality Control (SEQC) dataset
    • Includes qRT-PCR for 1,000 genes
  – Biological replicates from 3 cell lines as part of ENCODE project

Focus of paper: Comparison of relevant measures for DE detection

• Normalization of count data

• Sensitivity and specificity of DE detection

• Genes expressed in one condition but no expression in the other condition

• Sequencing depth and number of replicates

Theoretical background

• Count matrix—number of reads assigned to gene i in sequencing experiment j

• Length bias when measuring gene expression by RNA-seq
  – Reduces the ability to detect differential expression among shorter genes

• Differential gene expression analysis consists of 3 components:
  – Normalization of counts
  – Parameter estimation of the statistical model
  – Tests for differential expression

Normalization

• Commonly used
  – RPKM
  – FPKM
  – Biases: proportional representation of each gene is dependent on expression levels of other genes

• DESeq: scaling-factor-based normalization
  – Median, taken over genes, of the ratio of each gene's read count to its geometric mean across all samples

• Cuffdiff: extension of DESeq normalization
  – Intra-condition library scaling
  – Second scaling between conditions
  – Also accounts for changes in isoform levels
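
The DESeq-style scaling factor described on this slide can be sketched in a few lines of NumPy. This is a minimal illustration, not DESeq's actual code: the function name `median_of_ratios_size_factors`, the toy count matrix, and the choice to drop genes with a zero count in any sample (their geometric mean would be zero) are assumptions made for the example.

```python
import numpy as np

def median_of_ratios_size_factors(counts):
    """Return one scaling factor per sample for a genes x samples count matrix."""
    counts = np.asarray(counts, dtype=float)
    # Assumption: ignore genes with a zero count in any sample, since their
    # geometric mean across samples is zero and the ratio is undefined.
    keep = np.all(counts > 0, axis=1)
    kept = counts[keep]
    # Per-gene geometric mean across all samples (computed in log space).
    geo_mean = np.exp(np.log(kept).mean(axis=1, keepdims=True))
    # Ratio of each count to its gene's geometric mean, then the median
    # of those ratios within each sample.
    return np.median(kept / geo_mean, axis=0)

# Toy data: sample 2 was sequenced twice as deeply as sample 1.
counts = np.array([[10, 20],
                   [100, 200],
                   [50, 100],
                   [0, 5]])
size_factors = median_of_ratios_size_factors(counts)
normalized = counts / size_factors  # counts on a common scale
```

Because the median is taken over genes, a handful of very highly expressed or strongly changing genes has little influence on the factor, which is the bias that the RPKM/FPKM bullet points at.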

Normalization

• edgeR
  – Trimmed mean of M values (TMM)
  – Weighted average of a subset of genes (excluding genes with high average read counts and genes with large differences in expression)

• baySeq
  – Sum gene counts up to the upper 25% quantile to normalize library size

• PoissonSeq
  – Goodness-of-fit estimate defines a gene set that is least differentiated between 2 conditions; this set is then used to compute library normalization factors
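
As a rough illustration of the baySeq bullet above, the sketch below sums each sample's counts up to its 75th percentile so that a few dominant genes do not drive the library size. The function name, the use of non-zero counts to locate the quantile, and the toy matrix are assumptions for this example; baySeq's own implementation may differ in detail.

```python
import numpy as np

def upper_quartile_library_sizes(counts, quantile=0.75):
    """Sum, per sample, the counts that fall at or below the given quantile."""
    counts = np.asarray(counts, dtype=float)
    lib_sizes = np.zeros(counts.shape[1])
    for j in range(counts.shape[1]):
        col = counts[:, j]
        nonzero = col[col > 0]
        if nonzero.size == 0:
            continue  # no reads in this sample; library size stays 0
        # Assumption: the quantile is located among non-zero counts only.
        threshold = np.quantile(nonzero, quantile)
        lib_sizes[j] = col[col <= threshold].sum()
    return lib_sizes

# Toy data: one gene dominates sample 1 and is excluded from its library size.
counts = np.array([[1, 2],
                   [2, 4],
                   [3, 6],
                   [4, 8],
                   [1000, 10],
                   [0, 0]])
lib_sizes = upper_quartile_library_sizes(counts)
```

With a plain column sum, sample 1 would look far larger than sample 2 purely because of the single gene with 1000 reads.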

Normalization

• limma (2 normalization procedures)
  – Quantile normalization: sorts counts from each sample and sets the values to be equal to the quantile mean from all samples
  – Voom: LOWESS regression to estimate the mean-variance relation; transforms read counts to log form for linear modeling
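
The quantile normalization step described above can be sketched as follows. This is a minimal version that handles ties naively (tied values receive different rank means in arbitrary order); the function name and toy matrix are assumptions for the example, and this is not limma's own code.

```python
import numpy as np

def quantile_normalize(values):
    """Give every sample (column) the same distribution of values."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, axis=0)        # row indices that sort each column
    sorted_vals = np.sort(values, axis=0)     # each column sorted ascending
    rank_means = sorted_vals.mean(axis=1)     # mean across samples at each rank
    normalized = np.empty_like(values)
    for j in range(values.shape[1]):
        # The k-th smallest value in column j is replaced by the k-th rank mean.
        normalized[order[:, j], j] = rank_means
    return normalized

# Toy genes x samples matrix with no ties within a column.
values = np.array([[5, 4, 3],
                   [2, 1, 4],
                   [3, 6, 6],
                   [4, 2, 8]])
normalized = quantile_normalize(values)
```

After normalization every column contains exactly the same set of values, and the ranking of genes within each sample is unchanged.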

Statistical modeling of gene expression

• edgeR and DESeq
  – Negative binomial distribution (estimation of dispersion factor)
• edgeR
  – Estimation of dispersion factor as weighted combination of 2 components
    • Gene-specific dispersion effect and common dispersion effect calculated for all genes

Statistical modeling of gene expression

• DESeq
  – Variance estimate is decomposed into a combination of a Poisson estimate and a second term that models biological variability

• Cuffdiff
  – Separate variance models for single-isoform and multiple-isoform genes
    • Single isoform: similar to DESeq
    • Multiple isoform: mixed model of negative binomial and beta distributions
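
The "Poisson estimate plus a second term" idea can be made concrete with the standard negative binomial parameterization, where the variance is the mean (the Poisson part) plus dispersion times the mean squared (the biological-variability part). The sketch below checks that relation by simulation. Using a single constant dispersion is a simplifying assumption for illustration; the packages discussed here estimate dispersion from the data, and the function name and parameter values are hypothetical.

```python
import numpy as np

def nb_variance(mean, dispersion):
    """Variance of a negative binomial count: Poisson term + overdispersion term."""
    return mean + dispersion * mean ** 2

mean, dispersion = 100.0, 0.1
expected_var = nb_variance(mean, dispersion)

# NumPy parameterizes the negative binomial by (n, p); convert from
# (mean, dispersion) using n = 1/dispersion and p = n / (n + mean).
n = 1.0 / dispersion
p = n / (n + mean)
rng = np.random.default_rng(0)
samples = rng.negative_binomial(n, p, size=200_000)

print("theoretical variance:", expected_var)
print("simulated variance:  ", samples.var())
```

Setting the dispersion to zero recovers the Poisson case, where the variance equals the mean.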

Statistical modeling of gene expression

• baySeq
  – Full Bayesian model of negative binomial distributions
  – Prior probability parameters are estimated by numerical sampling of the data
• PoissonSeq
  – Models gene counts as a Poisson variable
  – Mean of distribution represented by log-linear relationship of library size, expression of gene, and correlation of gene with condition

Test for differential expression

• edgeR and DESeq
  – Variation of Fisher exact test modified for negative binomial distribution
  – Returns exact P value from derived probabilities

• Cuffdiff
  – Ratio of normalized counts between 2 conditions (follows normal distribution)
  – t-test to calculate P value

Test for differential expression

• limma
  – Moderated t-statistic with modified standard error and degrees of freedom
• baySeq
  – Estimates 2 models for every gene
    • No differential expression
    • Differential expression
  – Posterior likelihood of DE given the data is used to identify differentially expressed genes (no P value)

Test for differential expression

• PoissonSeq
  – Test for significance of correlation term
  – Evaluated by score statistics which follow a Chi-squared distribution (used to derive P values)

• Multiple hypothesis corrections
  – Benjamini-Hochberg
  – PoissonSeq: permutation-based FDR
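
For reference, the Benjamini-Hochberg adjustment listed above can be written as a short NumPy function. This is a minimal sketch of the standard step-up procedure with a hypothetical function name and toy P values; in practice an existing statistics library would normally be used.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Return BH-adjusted P values in the same order as the input."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale the k-th smallest P value by m / k.
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity: each adjusted value is the minimum of the
    # scaled values at its rank and all higher ranks.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(scaled, 1.0)
    return adjusted

p_values = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(p_values)
significant = adjusted <= 0.05  # genes called DE at a 5% FDR threshold
```

Genes whose adjusted value falls at or below the chosen threshold are reported as differentially expressed at that false discovery rate.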

Results

• Normalization and log expression correlation

• Differential expression analysis

• Evaluation of type I errors

• Evaluation of genes expressed in one condition

• Impact of sequencing depth and replication on DE detection
