
Negative Binomial Additive Model for RNA-Seq Data Analysis

Xu Ren and Pei Fen Kuan*

Department of Applied Mathematics and Statistics,

Stony Brook University,

Stony Brook, NY 11794, USA

April 4, 2019

SUMMARY. High-throughput sequencing experiments followed by differential expression analysis are a widely used approach for detecting genomic biomarkers. A fundamental step in differential expression analysis is to model the association between gene counts and covariates of interest. Existing models assume linear effects of covariates, which is restrictive and may not be sufficient for some phenotypes. In this paper, we introduce NBAMSeq, a flexible statistical model that is based on the generalized additive model and allows for information sharing across genes in variance estimation. Specifically, we model the logarithm of mean gene counts as sums of smooth functions, with the smoothing parameters and coefficients estimated simultaneously within a nested iterative method. The variance is estimated by a Bayesian shrinkage approach to fully exploit the information across all genes. Based on extensive simulations and case studies of RNA-Seq data, we show that NBAMSeq offers improved performance in detecting nonlinear effects and maintains equivalent performance in detecting linear effects compared to existing methods. NBAMSeq is available for download at https://github.com/reese3928/NBAMSeq and is in submission to the Bioconductor repository.

KEY WORDS: Bayesian shrinkage; Differential expression analysis; Generalized additive model; RNA-Seq; Spline model.

*Correspondence to [email protected]


bioRxiv preprint doi: https://doi.org/10.1101/599811; this version posted April 4, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

1 Introduction

In recent years, RNA-Seq has become the state-of-the-art method for quantifying mRNA levels by measuring gene expression digitally in biological samples. An RNA-Seq experiment usually starts with isolating RNA sequences from biological samples using the Illumina Genome Analyzer, the most commonly used platform for generating high-throughput sequencing data. These mRNA sequences are reverse transcribed into cDNA fragments. To reduce the sequencing cost and increase the speed of reading the cDNA fragments (typically a few thousand bp), these fragments are sheared into short reads (50-450 bp). These reads are mapped back to the reference genome, and the number of reads mapping to each gene/transcript region is computed. An RNA-Seq experiment is usually summarized as a count table, with each row representing a gene/transcript and each column representing a sample.

An important aspect of statistical inference in RNA-Seq data is differential expression (DE) analysis, which performs a statistical test on each gene to ascertain whether it is DE or not. Several methods have been developed for DE testing using gene count data, including DESeq2 [1] and edgeR [2], which are based on the negative binomial regression model, and BBSeq [3], which is based on the beta-binomial regression model. DESeq2 [1] performs DE analysis in a three-step procedure. The normalization factors for sequencing depth adjustment are first estimated using the median-of-ratios method. Both the coefficients and the dispersion parameters in the negative binomial distribution are then estimated by a Bayesian shrinkage approach to effectively borrow information across all genes. Finally, Wald tests or likelihood ratio tests are performed to identify DE genes. edgeR [2] also uses the negative binomial distribution to model gene counts. It assumes that the variance of gene counts depends on two dispersion parameters, namely the negative binomial dispersion and the quasi-likelihood dispersion. The negative binomial dispersion is estimated by fitting a mean-dispersion trend across all genes, whereas the quasi-likelihood dispersion is estimated by a Bayesian shrinkage approach [4]. The DE statistical tests are conducted based on either likelihood ratio tests [5, 6] or quasi-likelihood F-tests [7, 8]. Both DESeq2 [1] and edgeR [2] are widely used, due in part to their shrinkage estimators, which can improve the stability of DE tests across a large range of RNA-Seq data and data from other technologies that generate genomic read counts. BBSeq [3] is an alternative tool for DE analysis. It assumes that the count data follow a beta-binomial distribution, where the beta distribution is parameterized such that the variance accounts for overdispersion. The parameters are estimated by maximum likelihood, and DE analysis is based on either the Wald test or the likelihood ratio test. Both the negative binomial and beta-binomial regression models belong to the generalized linear model (GLM) class. GLMs are popular owing to their ease of implementation and interpretation, which follows from assuming that the mean, on the scale of the link function, is a linear combination of covariates. However, as illustrated in the motivating datasets below, the linearity assumption is not always appropriate, and nonlinear models for linking the phenotype to gene expression in RNA-Seq data may be important. Stricker et al. [9] proposed GenoGAM, a generalized additive model for ChIP-Seq data. However, GenoGAM was developed for a different purpose: its nonlinear component smooths the read count frequencies across the genome, rather than modeling a nonlinear association between the phenotype and read counts.

In this paper, we introduce NBAMSeq, a generalized additive model for RNA-Seq data. NBAMSeq brings together the negative binomial additive model [10] and information sharing across genes for variance estimation, which enhances the power for detecting DE genes: treating each gene independently suffers from power loss due to the high uncertainty of the variance estimate, especially when the sample size is small [1]. By borrowing information across genes, NBAMSeq is able to model nonlinear associations while improving the accuracy of variance estimation, and it shows a significant gain in power for detecting DE genes when a nonlinear relationship exists.

2 Motivating Datasets

We demonstrate the existence of nonlinear relationships between gene counts and covariates/phenotypes of interest on two real RNA-Seq datasets. The first dataset is obtained from The Cancer Genome Atlas (TCGA) consortium, a rich repository of high-throughput sequencing omics data for multiple types of cancer, downloadable from the Broad GDAC Firehose. For ease of exposition, we focus on glioblastoma multiforme (GBM), an aggressive form of brain cancer. The dataset consists of RNA-Seq count data on 20,501 genes and 158 samples generated using the Illumina HiSeq2000 platform, along with the clinical information of these samples. After filtering out genes for which more than 10 samples had a count per million (CPM) below one, 16,092 genes were retained. Gene counts are normalized using the median-of-ratios method as described in DESeq [11] and DESeq2 [1] to account for different library sizes across samples. Age has been identified as an important prognostic marker for malignant cancers including GBM [12]. To better understand the prognostic value of age in GBM, we aim to identify an age-specific gene expression signature. For each individual gene $i$, three models were fitted: (i) base model: $y_{ij} = \beta_0 + \beta_1 x_{1j} + \beta_2 x_{2j}$, (ii) linear model: $y_{ij} = \beta_0 + \beta_1 x_{1j} + \beta_2 x_{2j} + \beta_3 x_{3j}$, (iii) cubic spline regression with 3 knots: $y_{ij} = \beta_0 + \beta_1 x_{1j} + \beta_2 x_{2j} + f(x_{3j})$, where $y_{ij}$ denotes the logarithm-transformed normalized counts

[Figure 1 image: scatterplots of log(counts + 1) versus age with a local regression fit, one panel per gene: MTPAP, RANBP17, CNOT2, CPSF6.]

Figure 1: Nonlinear relationship between gene counts and age.

of sample $j$, and $x_{1j}$, $x_{2j}$, $x_{3j}$ denote gender, race, and age, respectively. Race is a binary variable indicating whether the sample is Caucasian or non-Caucasian. We then compared the three models using ANOVA F-tests, and the p-values were adjusted using the Benjamini & Hochberg [13] false discovery rate (FDR) procedure. The test statistics, degrees of freedom of the F-statistics, and numbers of significant genes at FDR < 0.05 are summarized in Web Table 1, where SSE1, SSE2, and SSE3 denote the sum of squared errors for models (i), (ii), and (iii), respectively. 250 genes are significant when comparing model (i) to model (iii). We further investigated the p-values of these 250 genes when comparing model (i) to model (ii) and found that 30 of them have FDR > 0.2 (see Web Table 2), implying that a simple linear regression is unable to identify these 30 genes even if a less stringent FDR threshold is used. Moreover, 9 genes are significant when comparing model (ii) to model (iii). Figure 1 shows scatterplots of counts versus age for the most significant genes. For visualization, a local regression is fitted on each scatterplot, suggesting significant nonlinear relationships between normalized gene counts and age. Although only GBM is shown here, we observed similar nonlinear age effects for other types of cancer from the Broad GDAC Firehose database (see Web Table 3).

The second dataset comes from our study of gene expression associated with posttraumatic stress disorder (PTSD) in World Trade Center (WTC) responders [14]. The dataset consists of RNA-Seq read counts of 25,830 genes in whole blood profiled in 324 male WTC responders, along with their clinical information. The phenotype of interest is the Posttraumatic Stress Disorder Checklist-Specific Version (PCL), a self-report questionnaire assessing the severity of WTC-related DSM-IV PTSD symptoms [15]. To better understand the nonlinear effect of PTSD severity on gene expression, we aim to identify a gene expression signature related to the PCL score. The gene filtering criterion and normalization method are similar to those used for the first motivating dataset, and samples with missing clinical information were omitted from the analysis. 16,193 genes and 321 samples were retained for DE analysis. In addition, cell type proportions have been implicated in the analysis of whole blood samples. The proportions of CD8T, CD4T, natural killer, B-cell, and monocytes were estimated as described in our previous paper [16]. Similar to the first motivating dataset, the base model, linear model, and cubic spline regression model were fitted on each gene to study the effect of PCL score. Age, race, and the cell proportions of CD8T, CD4T, natural killer, B-cell, and monocytes were adjusted for in each model. When comparing the cubic spline regression model with the base model using the ANOVA F-test, 6 genes (AP1G1, NFYA, SCAF11, SPTLC2, TAOK1, ZNF106) are significant at FDR < 0.05. Web Figure 1 shows scatterplots of counts versus PCL score for these 6 genes, with a local regression fitted on each scatterplot. These plots show a consistent pattern: gene expression first decreases and then gradually increases, with the minimum at a PCL score of around 30.

These observations indicate that the relationship between gene counts and a covariate of interest may be nonlinear, and it can be challenging to capture the true underlying nonlinear relationship using parametric methods. Thus, more flexible statistical methods to characterize the nonlinear pattern in RNA-Seq data will improve the DE analysis of complex phenotypes. To this end, we propose a negative binomial additive model to capture the nonlinear association in RNA-Seq data analysis.

3 Model Specification

3.1 Sequencing Depth Normalization

For RNA-Seq data, the raw counts of each sample are proportional to its sequencing depth, i.e., the total number of reads. The raw counts are therefore not directly comparable across samples, and normalization factors are needed to adjust the gene counts. Several methods have been developed for RNA-Seq data normalization, including total count, upper quartile [17], median, quantile [18, 19], conditional quantile [20], reads per kilobase per million mapped reads (RPKM) [21], median-of-ratios [11], and trimmed mean of M values [22]. Comprehensive comparisons of these normalization methods are provided in Dillies et al. [23] and Li et al. [24]. By comparing the Spearman correlation between normalized counts and quantitative reverse transcription polymerase chain reaction (qRT-PCR) measurements, Li et al. [24] showed that RPKM yields the best normalization results when alignment accuracy is low. However, by considering the effect of the normalization method on DE analysis, Dillies et al. [23] showed that RPKM is ineffective and not recommended. They also showed that median-of-ratios [11] and trimmed mean of M values [22] maintain a low false-positive rate without compromising the power of DE analysis.

Throughout this paper, we adopt the median-of-ratios method used by DESeq [11], DEXSeq [25] and DESeq2 [1] for sequencing depth adjustment. Let $K_{ij}$ be the read count for gene $i$ in sample $j$. The normalization factor of sample $j$ is defined as:

$$s_j = \operatorname*{median}_{i:\,K_i^{*} \neq 0} \frac{K_{ij}}{K_i^{*}}, \quad \text{where } K_i^{*} = \Big(\prod_{j=1}^{m} K_{ij}\Big)^{1/m} \tag{1}$$

where m is the total number of samples.
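As an illustration of Equation (1), the following is a minimal Python sketch of the median-of-ratios computation. The function name `size_factors` and the toy count table are assumptions made for this example; they are not part of NBAMSeq, DESeq, or DESeq2.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios normalization factors (illustrative sketch).

    counts: list of genes, each a list of raw counts over the m samples.
    Returns a list with one factor s_j per sample.
    """
    m = len(counts[0])
    # Geometric mean K*_i of each gene. A gene with any zero count has
    # K*_i = 0 and is excluded, matching the condition K*_i != 0.
    geo_means = []
    for row in counts:
        if all(k > 0 for k in row):
            geo_means.append(math.exp(sum(math.log(k) for k in row) / m))
        else:
            geo_means.append(0.0)
    factors = []
    for j in range(m):
        ratios = [row[j] / g for row, g in zip(counts, geo_means) if g > 0]
        factors.append(median(ratios))
    return factors


# Toy table (hypothetical): sample 2 is sequenced twice as deeply as sample 1.
toy_counts = [[10, 20], [100, 200], [5, 10], [0, 7]]
s = size_factors(toy_counts)
```

In this toy table the ratio of the two factors is 2, reflecting the doubled sequencing depth, and the last gene is ignored because one of its counts is zero.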

3.2 Generalized Additive Model

For gene $i$ and sample $j$, we assume that the gene count $K_{ij}$ follows the generalized additive model:

$$K_{ij} \sim \mathrm{NB}(\text{mean} = \mu_{ij},\ \text{dispersion} = \alpha_i)$$

$$\log(\mu_{ij}) = \log(s_j) + \sum_r f_r(x_{jr})$$

where $\mu_{ij} = E(K_{ij})$ is the mean count, $\alpha_i$ is the dispersion parameter that relates the mean to the variance by $\mathrm{Var}(K_{ij}) = \mu_{ij} + \alpha_i \mu_{ij}^2$, and the $x_{jr}$ are the $R$ covariates of interest. The logarithm of the normalization factor, $\log(s_j)$, is an offset in the model, and $s_j$ is computed by Equation (1) in Section 3.1. The functions $f_r(\cdot)$, $r = 1, 2, ..., R$, are the spline or smooth functions that capture the nonlinear associations between covariate $r$ and gene expression. Since our statistical goal is to conduct inference on genes to ascertain whether they are DE with respect to the covariate/phenotype of interest, we choose $f_r(\cdot)$ to be spline functions to facilitate the inference, i.e., $f_r(x_{jr}) = \sum_{q=1}^{Q} \beta_{rq} B_{rq}(x_{jr})$, where the $\beta_{rq}$ are unknown coefficients and the $B_{rq}(\cdot)$ are B-spline basis functions. The coefficients $\beta = (\beta_{11}, ..., \beta_{1Q}, ..., \beta_{R1}, ..., \beta_{RQ})^T$ are estimated by penalized log-likelihood maximization to avoid overfitting. In particular, the model is estimated by maximizing

$$L(\beta) = l(\beta) - \frac{1}{2}\sum_r \lambda_r \beta^T S_r \beta \tag{2}$$

with respect to $\beta$, where $l(\cdot)$ is the log-likelihood function of the negative binomial distribution, $\lambda_r$ is the smoothing parameter that controls the smoothness of $f_r(\cdot)$, and $S_r$ is a positive semidefinite matrix.
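To make the mean-dispersion parameterization concrete, the sketch below evaluates the negative binomial log-probability for a given mean and dispersion and checks numerically that the variance equals $\mu + \alpha\mu^2$. The function names `nb_logpmf` and `gam_mean` and the numeric inputs are hypothetical choices for this example only.

```python
import math


def nb_logpmf(k, mu, alpha):
    """log P(K = k) for NB(mean = mu, dispersion = alpha)."""
    r = 1.0 / alpha  # "size" parameter of the usual NB parameterization
    return (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
            + r * math.log(r / (r + mu)) + k * math.log(mu / (r + mu)))


def gam_mean(s_j, smooth_values):
    """mu_ij from log(mu_ij) = log(s_j) + sum_r f_r(x_jr)."""
    return s_j * math.exp(sum(smooth_values))


# Hypothetical inputs: size factor 1.5 and two smooth terms evaluated at x_j.
mu = gam_mean(1.5, [1.2, 0.4])
alpha = 0.2
probs = [math.exp(nb_logpmf(k, mu, alpha)) for k in range(2000)]
mean = sum(k * p for k, p in enumerate(probs))
var = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
```

Summing over a long enough range of counts, `mean` agrees with `mu` and `var` agrees with `mu + alpha * mu ** 2` up to numerical error.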

3.3 Model Fitting

Several methods have been proposed for fitting the generalized additive model. Hastie and Tibshirani [26, 27] proposed a general backfitting algorithm, and Thurston et al. [10] adapted the backfitting algorithm to the negative binomial distribution. In the backfitting algorithm, given the smoothing parameters $\lambda_r$ and dispersion parameters $\alpha_i$, the coefficients $\beta$ can be computed by penalized iteratively reweighted least squares. However, the most challenging part is the estimation of the smoothing parameters [28], which is not easy to incorporate into the backfitting algorithm. Wood et al. [29, 30, 31] proposed a nested iterative algorithm that estimates the smoothing parameters and coefficients simultaneously. Their algorithm estimates the smoothing parameters using the cross-validation framework in the outer iteration and the coefficients using penalized iteratively reweighted least squares in the inner iteration. For model fitting, we adapted the algorithm implemented in the mgcv package [31, 30, 32, 28, 33], which provides various options for model selection criteria and computational strategies. Specifically, NBAMSeq uses the gam function from mgcv to fit the generalized additive model on each gene.

The smoothing parameter tuning criteria can be divided into two types, namely (a) prediction error based criteria, including AIC, cross-validation, and generalized cross-validation (GCV), and (b) likelihood based criteria, including maximum likelihood and restricted maximum likelihood [30]. The likelihood based methods offer some advantages over the prediction error based methods, since they impose a larger penalty on overfitting and exhibit faster convergence of the smoothing parameter estimates [34]. In NBAMSeq, we use the restricted maximum likelihood (REML) method. Maximizing the penalized log-likelihood (2) can be viewed as a Bayesian approach with Gaussian prior $\beta \sim N(0, S_\lambda^{-})$, where $S_\lambda = \sum_r \lambda_r S_r$ and $S_\lambda^{-}$ is the inverse or pseudoinverse [35] of $S_\lambda$. Under the Bayesian framework, estimating the smoothing parameters can be viewed as maximizing the log-marginal likelihood [29, 30, 31]

$$V(\lambda) = \log \int f(y \mid \beta) f(\beta)\, d\beta. \tag{3}$$

The Laplace approximation of (3) is given by:

$$V^{*}(\lambda) \simeq L(\hat\beta) + \frac{1}{2}\log |S_\lambda|_{+} - \frac{1}{2}\log |X^T W X + S_\lambda| + \frac{M}{2}\log(2\pi),$$

where $\hat\beta$ is the maximizer of (2) and $X$ is the model matrix of covariates, i.e., $X = [X_1, X_2, ..., X_R]$ with $X^{r}_{jk} = B_{rk}(x_{jr})$. $|S_\lambda|_{+}$ denotes the product of the positive eigenvalues of $S_\lambda$, $M$ is the number of zero eigenvalues of $S_\lambda$, and $W$ is the diagonal weight matrix. For the negative binomial distribution, the link function is the logarithm and the variance is given by $\mu_{ij} + \alpha_i \mu_{ij}^2$, thus the diagonal elements of $W$ are given by

$$w_{jj} = \frac{1}{(1/\mu_{ij})^2 (\mu_{ij} + \alpha_i \mu_{ij}^2)} = \frac{1}{1/\mu_{ij} + \alpha_i}.$$

Let $\lambda_{i1}, ..., \lambda_{iR}$ denote the smoothing parameters of gene $i$. The nested iteration implemented in mgcv [31, 30, 32, 28, 33] can be summarized as Algorithms 1 and 2.

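Algorithm 1 Outer iteration for $\lambda$ and $\alpha_i$

1: Initialize $\rho = (\log \alpha_i, \log \lambda_{i1}, ..., \log \lambda_{iR})$ and $\beta$.
2: Calculate $\hat\beta$ by Algorithm 2.
3: Calculate $\partial V / \partial \rho_x$ and $\partial^2 V / \partial \rho_x \partial \rho_y$ for all $x, y$.
4: Denote the gradient by $\nabla V$ and the Hessian matrix by $H$, where $H_{xy} = \partial^2 V / \partial \rho_x \partial \rho_y$. Calculate $\Delta = -H^{-1} \nabla V$.
5: If $V(\rho + \Delta) < V(\rho)$, set $\Delta$ to be $\Delta/2$ until $V(\rho + \Delta) > V(\rho)$.
6: Set $\rho$ to be $\rho + \Delta$.
7: Repeat steps 2 to 6 until $\partial V / \partial \rho_x \simeq 0$ for all $x$ and $-H$ is positive semidefinite.

Algorithm 2 Inner iteration for $\beta$

1: Initialize $\beta$.
2: Update $\beta$ by $\beta_{new} = (X^T W X + S_\lambda)^{-1} X^T W z$, where $z$ is a vector with elements $z_j = \log(\mu_{ij}/s_j) + (K_{ij} - \mu_{ij})/\mu_{ij}$.
3: Update $W$ by $w_{jj} = \frac{1}{1/\mu_{ij} + \alpha_i}$.
4: Repeat steps 2 and 3 until the change in $\beta$ is below a threshold.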
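The inner iteration (Algorithm 2) can be sketched in a few lines for a deliberately small case. The code below is an illustrative assumption, not the mgcv implementation: the model matrix has only two columns (intercept and a single covariate) instead of a B-spline basis, the penalty matrix is the hypothetical choice $S_\lambda = \lambda\,\mathrm{diag}(0, 1)$ that shrinks the slope only, and the dispersion and smoothing parameter are treated as known.

```python
import math


def pirls_nb(counts, x, size_factors, alpha, lam, tol=1e-10, max_iter=200):
    """Penalized IRLS for beta = (b0, b1) with model matrix X = [1, x].

    Hypothetical penalty: S_lambda = lam * diag(0, 1).
    """
    n = len(counts)
    # Step 1: start from an intercept-only fit on the normalized counts.
    b0 = math.log(sum(k / s for k, s in zip(counts, size_factors)) / n)
    b1 = 0.0
    for _ in range(max_iter):
        # Accumulate the entries of X^T W X and X^T W z.
        a00 = a01 = a11 = r0 = r1 = 0.0
        for k, xj, s in zip(counts, x, size_factors):
            eta = b0 + b1 * xj                # log(mu_ij / s_j)
            mu = s * math.exp(eta)
            w = 1.0 / (1.0 / mu + alpha)      # step 3: weight w_jj
            z = eta + (k - mu) / mu           # working response z_j
            a00 += w
            a01 += w * xj
            a11 += w * xj * xj
            r0 += w * z
            r1 += w * xj * z
        a11 += lam                            # add S_lambda
        # Step 2: solve the 2 x 2 system (X^T W X + S_lambda) beta = X^T W z.
        det = a00 * a11 - a01 * a01
        new_b0 = (a11 * r0 - a01 * r1) / det
        new_b1 = (a00 * r1 - a01 * r0) / det
        change = abs(new_b0 - b0) + abs(new_b1 - b1)
        b0, b1 = new_b0, new_b1
        if change < tol:                      # step 4: stop when beta settles
            break
    return b0, b1


# Noise-free toy data generated from log(mu_j) = log(s_j) + 1.0 + 0.3 * x_j.
x_toy = [0.0, 1.0, 2.0, 3.0, 4.0]
s_toy = [1.0, 2.0, 1.0, 2.0, 1.0]
k_toy = [s * math.exp(1.0 + 0.3 * v) for v, s in zip(x_toy, s_toy)]
fit_unpenalized = pirls_nb(k_toy, x_toy, s_toy, alpha=0.1, lam=0.0)
fit_penalized = pirls_nb(k_toy, x_toy, s_toy, alpha=0.1, lam=1e6)
```

Without a penalty the iteration recovers the generating coefficients on this noise-free example, while a very large `lam` drives the slope toward zero, which is the role the smoothing parameters play for the spline coefficients.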

3.4 Dispersion Estimation

Information sharing across genes in the estimation of dispersion parameter has been shown to

increase accuracy compared to treating each gene independently, especially when the number

of replicates is small. Numerous methods have been proposed to incorporate information

sharing, and most of them were based on the empirical Bayesian approach which shrinks the

gene-wise dispersion estimates toward a common dispersion parameter across all genes.

Following DESeq2 [1], we apply the empirical Bayesian shrinkage estimate of the dispersion parameter. For each individual gene $i$, we assume a log-normal prior for the dispersion parameter $\alpha_i$:

$$\log(\alpha_i) \sim N(\log \alpha_{tr}(\mu_i), \sigma_d^2),$$

where the dispersion trend relates the dispersion to the mean of normalized counts, $\mu_i = \frac{1}{m}\sum_j \frac{K_{ij}}{s_j}$, by $\alpha_{tr}(\mu_i) = \frac{b_1}{\mu_i} + b_0$, and $\sigma_d^2$ is the prior variance, which is assumed to be constant across all genes. A Gamma regression is fitted to estimate the parameters $b_1$ and $b_0$. The prior variance $\sigma_d^2$ is estimated by subtracting the expected sampling variance of the log-dispersion estimator from the variance estimate of the residuals $\log(\alpha_i) - \log \alpha_{tr}(\mu_i)$, as described in DESeq2 [1].

We first obtain gene-wise dispersion estimates $\alpha_i$ by Algorithms 1 and 2. These gene-wise dispersion estimates are used to fit the dispersion trend and to estimate the variance of the prior distribution. The final dispersion parameter is estimated by maximizing

$$l_{CR}(\alpha) - \frac{(\log \alpha - \log \alpha_{tr}(\mu_i))^2}{2\sigma_d^2}, \tag{4}$$

where $l_{CR}$ is the Cox-Reid adjusted log-likelihood [36]. Note that maximizing (4) can be viewed as a maximum a posteriori (MAP) approach. The entire workflow implemented in NBAMSeq is summarized in Algorithm 3.

Algorithm 3 Workflow of NBAMSeq

1: Estimate normalization factors $s_j$.
2: Estimate smoothing parameters and gene-wise dispersions by Algorithms 1 and 2.
3: Estimate the dispersion trend and prior variance using the gene-wise dispersion estimates from Step 2.
4: Calculate the MAP estimates of the dispersions.
5: Estimate the coefficients $\beta$ by Algorithm 2 using the smoothing parameters from Step 2 and the dispersion parameters from Step 4.
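The shrinkage in Step 4 can be illustrated with a simplified version of objective (4). The sketch below rests on several assumptions that differ from NBAMSeq: the mean is a single intercept (the sample mean, with all size factors equal to 1), the plain negative binomial log-likelihood stands in for the Cox-Reid adjusted one, the maximization is a grid search over $\log \alpha$, and the trend coefficients, prior standard deviations, and counts are made-up values.

```python
import math


def nb_loglik(counts, mu, alpha):
    """Plain NB log-likelihood (no Cox-Reid adjustment) for a common mean."""
    r = 1.0 / alpha
    return sum(math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
               + r * math.log(r / (r + mu)) + k * math.log(mu / (r + mu))
               for k in counts)


def dispersion_trend(mu_i, b1, b0):
    """alpha_tr(mu_i) = b1 / mu_i + b0."""
    return b1 / mu_i + b0


def map_dispersion(counts, alpha_trend, sigma_d, grid_size=1200):
    """Grid-search maximizer of loglik(alpha) - prior penalty in log(alpha)."""
    mu = sum(counts) / len(counts)
    lo, hi = math.log(1e-3), math.log(1e2)
    best_alpha, best_value = None, -math.inf
    for g in range(grid_size + 1):
        log_alpha = lo + (hi - lo) * g / grid_size
        penalty = (log_alpha - math.log(alpha_trend)) ** 2 / (2 * sigma_d ** 2)
        value = nb_loglik(counts, mu, math.exp(log_alpha)) - penalty
        if value > best_value:
            best_alpha, best_value = math.exp(log_alpha), value
    return best_alpha


# Hypothetical overdispersed counts for one gene across eight samples.
gene_counts = [2, 30, 8, 55, 14, 3, 41, 20]
trend = dispersion_trend(sum(gene_counts) / len(gene_counts), b1=3.0, b0=0.05)
alpha_genewise = map_dispersion(gene_counts, trend, sigma_d=math.inf)
alpha_tight = map_dispersion(gene_counts, trend, sigma_d=1e-3)
alpha_loose = map_dispersion(gene_counts, trend, sigma_d=1e3)
```

With an infinite prior standard deviation the penalty vanishes and the result is the gene-wise estimate on the grid; a very small `sigma_d` pins the estimate to the trend, and intermediate values trade off the two sources of information.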

4 Inference

Our main objective is to identify the genes that exhibit either linear or nonlinear association with the covariate/phenotype of interest. We formulate this as a hypothesis test of $H_0: f_r(x_r) = 0$ versus $H_1: f_r(x_r) \neq 0$, where $f_r(\cdot)$ is a smooth function of the covariate of interest. Since we use a Bayesian framework in the coefficient estimation, we have $f(\beta \mid y) \propto f(y \mid \beta) f(\beta)$. In addition, $\beta \sim N(0, S_\lambda^{-})$, thus $\log f(\beta \mid y)$ equals $\log f(y \mid \beta) - \frac{1}{2}\beta^T S_\lambda \beta$ up to an additive constant. By applying a second-order Taylor expansion around $\hat\beta$, where the first-order term vanishes because $\hat\beta$ maximizes the penalized log-likelihood, we obtain

$$\log f(y \mid \beta) - \frac{1}{2}\beta^T S_\lambda \beta \approx \log f(y \mid \hat\beta) - \frac{1}{2}\hat\beta^T S_\lambda \hat\beta + \frac{1}{2}(\beta - \hat\beta)^T \big(D^2 \log f(y \mid \hat\beta) - S_\lambda\big)(\beta - \hat\beta), \tag{5}$$

where $D^2 \log f(y \mid \hat\beta)$ is the Hessian matrix of the log-likelihood of the negative binomial distribution with respect to $\beta$. We use the expected information matrix to replace the Hessian matrix (see Web Appendix A), and equation (5) simplifies to

$$\log f(y \mid \hat\beta) - \frac{1}{2}\hat\beta^T S_\lambda \hat\beta - \frac{1}{2}(\beta - \hat\beta)^T (X^T W X + S_\lambda)(\beta - \hat\beta). \tag{6}$$

In the large-sample limit, (6) corresponds to the log density of a multivariate normal distribution, thus $\beta \mid y \sim N(\hat\beta, (X^T W X + S_\lambda)^{-1})$ and $V_\beta = (X^T W X + S_\lambda)^{-1}$ is the Bayesian covariance matrix of $\beta$. Let $f_r$ be the vector of values of $f_r(x_r)$ evaluated at the observed values of $x_r$, and let $\tilde{X}$ denote the matrix such that $\hat{f}_r = \tilde{X}\hat\beta$. Wood [37] showed that $\hat{f}_r \sim N(f_r, \tilde{X} V_\beta \tilde{X}^T)$ approximately, and that under the null hypothesis the test statistic $\hat{f}_r^T (\tilde{X} V_\beta \tilde{X}^T)^{-1} \hat{f}_r$ follows a $\chi^2$ distribution with degrees of freedom equal to the effective degrees of freedom (edf) corresponding to variable $x_r$, if the edf is an integer. If the edf of $x_r$ is not an integer, Wood [37] showed that the distribution of $\hat{f}_r^T (\tilde{X} V_\beta \tilde{X}^T)^{-1} \hat{f}_r$ can be approximated by a non-central $\chi^2$ distribution using the method of Liu et al. [38]. Specifically, the degrees of freedom and the noncentrality parameter of the $\chi^2$ distribution are determined so that the skewness of $\hat{f}_r^T (\tilde{X} V_\beta \tilde{X}^T)^{-1} \hat{f}_r$ and that of the target $\chi^2$ distribution are equal, whereas the difference between the kurtosis of $\hat{f}_r^T (\tilde{X} V_\beta \tilde{X}^T)^{-1} \hat{f}_r$ and that of the target $\chi^2$ distribution is minimized. We utilize this result to calculate a p-value for each gene. The p-values are further adjusted using an FDR procedure to account for multiple hypothesis testing.
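The final multiplicity adjustment can be sketched as follows, assuming that the FDR procedure is the Benjamini & Hochberg [13] step-up adjustment used for the motivating datasets in Section 2. The function name and the four p-values are illustrative only.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values are monotone in the rank of the raw p-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical gene-level p-values.
raw_p = [0.01, 0.04, 0.03, 0.005]
adj_p = benjamini_hochberg(raw_p)
significant = [a < 0.05 for a in adj_p]
```

Each raw p-value is multiplied by the number of tests divided by its rank, and the running minimum keeps the adjusted values from decreasing as the raw p-values increase.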

5 Simulation

To evaluate the performance of NBAMSeq, we designed simulation studies in which NBAMSeq was compared with DESeq2 [1] and edgeR [2], two of the most popular software tools for identifying DE genes in RNA-Seq data. Two scenarios were considered: the first was to evaluate the accuracy of dispersion estimates and the power of detecting DE genes when the true association is nonlinear, and the second was to investigate the robustness of NBAMSeq when the true association is linear.

In both scenarios, for gene $i$ and sample $j$, we assumed that the gene counts follow a negative binomial distribution, i.e., $K_{ij} \sim \mathrm{NB}(\text{mean} = \mu_{ij},\ \text{dispersion} = \alpha_i)$ with $\mu_{ij} = s_j q_{ij}$. For simplicity and without loss of generality, we set all normalization factors $s_j$ to 1. Since the dispersion parameter decreases with increasing mean counts, we assumed that the true mean-dispersion relationship is $\alpha_i = a/\mu_i + 0.05$. The performance under different degrees of dispersion ($a = 1, 3, 5$) was studied. In practice $a$ could be larger than 5, but a larger value of $a$ is compensated by a larger mean of normalized counts, so the magnitude of the dispersion $\alpha_i$ is on a similar scale regardless of $a$; thus only $a = 1, 3, 5$ were studied. The covariate of interest $x_j$ was simulated from a uniform distribution $U(u_1, u_2)$. For simplicity, we used $u_1 = 20$ and $u_2 = 80$, because 20 to 80 covers the range of age and PCL score in our motivating datasets.
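The count-generating step of this setup can be sketched with the Python standard library. This is an illustrative assumption rather than the simulation code used in the paper: negative binomial counts are drawn as a gamma-Poisson mixture, the Poisson draw counts unit-rate exponential arrivals, and the mean of 20, the seed, and the number of draws are arbitrary choices for the example.

```python
import math
import random


def true_dispersion(mu_i, a):
    """Mean-dispersion relationship alpha_i = a / mu_i + 0.05."""
    return a / mu_i + 0.05


def rpois(lam, rng):
    """Poisson draw: number of unit-rate exponential arrivals in [0, lam]."""
    count = 0
    t = rng.expovariate(1.0)
    while t <= lam:
        count += 1
        t += rng.expovariate(1.0)
    return count


def rnbinom(mu, alpha, rng):
    """NB(mean = mu, dispersion = alpha) draw via a gamma-Poisson mixture."""
    # The rate has mean mu and variance alpha * mu^2, so the mixture has
    # variance mu + alpha * mu^2.
    rate = rng.gammavariate(1.0 / alpha, alpha * mu)
    return rpois(rate, rng)


rng = random.Random(2019)
mu_i = 20.0
alpha_i = true_dispersion(mu_i, a=1)
draws = [rnbinom(mu_i, alpha_i, rng) for _ in range(4000)]
sample_mean = sum(draws) / len(draws)
sample_var = sum((d - sample_mean) ** 2 for d in draws) / (len(draws) - 1)
```

For these settings the theoretical variance is $20 + 0.1 \times 20^2 = 60$, and the sample mean and variance of the draws should be close to 20 and 60 respectively.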

5.1 Scenario I

The aim of this scenario was two-fold: (i) to evaluate the accuracy of NBAMSeq dispersion estimates when the nonlinear effect is significant, and (ii) to investigate the power and error rate control of NBAMSeq.

We simulated several count matrices that contain the same number of genes ($n = 15{,}000$) across different sample sizes ($m = 15, 20, 25, 30, 35, 40$). We randomly selected 5% of the 15,000 genes to be DE. Within each count matrix, the mean of the negative binomial distribution $\mu_{ij}$ was simulated from

$$\eta_{ij} = \log_2(\mu_{ij}) = \begin{cases} b_{i0} + b_{i1}B_1(x_j) + b_{i2}B_2(x_j) + \dots + b_{i5}B_5(x_j), & i \in \{\text{DE indices}\} \\ b_{i0} + c, & \text{otherwise} \end{cases}$$

and $\mu_{ij}$ was calculated by $\mu_{ij} = 2^{\eta_{ij}}$, where $B(\cdot)$ are cubic B-spline basis functions that induce a nonlinear effect on DE genes, and $c$ is a constant that ensures that the count distributions of DE and non-DE genes are comparable. The intercepts $b_{i0}$ were simulated from a normal distribution, and the other coefficients $b_{i1}, b_{i2}, ..., b_{i5}$ were simulated from a uniform distribution. The mean and standard deviation of the normal distribution, as well as the lower and upper bounds of the uniform distribution, were chosen such that the distribution of $\mu_i = \frac{1}{m}\sum_{j=1}^{m}\mu_{ij}$ across all genes mimicked our second motivating dataset. Finally, the count of gene $i$ in sample $j$ was simulated from a negative binomial distribution with mean $\mu_{ij}$ and dispersion $\alpha_i$. The simulation was repeated 50 times for each combination of dispersion and sample size.

To evaluate the accuracy of dispersion estimates, three methods were compared: (a) DESeq2 [1]; (b) edgeR [2]; (c) NBAMSeq. To illustrate that information sharing across genes produces more accurate dispersion estimates, the gene-wise dispersion estimated by NBAMSeq without incorporating the prior dispersion trend (NBAMSeq gene-wise) was also included in the comparison. The average mean squared error (MSE) across 50 repetitions was calculated, and the results are shown in Figure 2. Among the DE genes, for which the nonlinear effect is significant, NBAMSeq outperforms DESeq2 [1] and edgeR [2] across different sample sizes. On the other hand, among the non-DE genes, the MSE of edgeR [2] is lower than that of DESeq2 [1] across all degrees of dispersion and is comparable to that of NBAMSeq. NBAMSeq also has lower MSE than NBAMSeq (gene-wise), which shows that the Bayesian shrinkage estimation is better than gene-wise dispersion estimation. This simulation scenario shows that NBAMSeq is able to maintain good dispersion accuracy for non-DE genes and

[Figure 2 image: a 3 x 2 grid of line plots of MSE against sample size (15 to 40). Rows: a = 1, 3, 5; columns: DE genes, non-DE genes. Legend: DESeq2, edgeR, NBAMSeq, NBAMSeq (gene-wise).]

Figure 2: Scenario I: MSE of dispersion estimates.

shows improvement in the accuracy over DESeq2 [1] and edgeR [2] for DE genes when the

nonlinear association is significant.

Figure 3 shows the average true positive rate (TPR), empirical false discovery rate (FDR), empirical false non-discovery rate (FNR), and area under the curve (AUC) of DESeq2 [1], edgeR [2], and NBAMSeq. For edgeR [2], two approaches for calculating the DE test statistics are available, namely the quasi-likelihood (QL) approach and the likelihood ratio test (LRT) approach, as described in Lund et al. [7]. The authors further showed that detecting DE genes using the QL approach controls the FDR better than the LRT approach. When calculating TPR, FDR, and FNR, genes are declared significant if the nominal FDR is < 0.05. The TPR results indicate that NBAMSeq has the highest power for detecting nonlinear DE genes regardless of the degree of dispersion and the sample size. In addition, NBAMSeq controls the FDR at 0.05 in all cases except for high dispersion ($a = 5$) and low sample size ($m = 15$), in which the FDR is slightly inflated (empirical FDR 0.0576). The FNR results show that NBAMSeq has lower FNR than the other methods. To ascertain that the

[Figure 3 image: a 3 x 4 grid of line plots against sample size (15 to 40). Rows: a = 1, 3, 5; columns: TPR, FDR, FNR, AUC. Legend: DESeq2, edgeR_LRT, edgeR_QL, NBAMSeq.]

Figure 3: Scenario I: TPR, FDR, FNR, AUC of DESeq2, edgeR and NBAMSeq.

power of NBAMSeq is not sensitive to the nominal FDR cutoff for DE genes, we also compared the AUCs, which show that NBAMSeq is consistently more powerful in all cases. The results also show that the performance of DESeq2 [1] and the edgeR [2] LRT approach is comparable, whereas the edgeR [2] QL approach is conservative, as shown by its higher FNR. Additional comparisons of the QL approach for NBAMSeq and Gaussian additive models on logarithm-transformed count data are provided in Web Figures 2 and 3.

5.2 Scenario II

The goal of this scenario was to evaluate the robustness of NBAMSeq when the true association is a linear effect. The simulation setup in this scenario is similar to Scenario I except that $\eta_{ij}$ is given by:

$$\eta_{ij} = \log_2(\mu_{ij}) = \begin{cases} b_{i1}x_j + c_1, & i \in \{\text{linear DE indices}\} \\ b_{i0} + c_2, & \text{otherwise} \end{cases}$$

where $c_1$ and $c_2$ are constants that ensure that the count distributions of DE and non-DE genes are comparable, to avoid bias in detecting DE genes. The linear coefficient $b_{i1}$ was simulated from a normal distribution. Similar to Scenario I, the mean and standard deviation of the normal distribution were chosen such that the distribution of $\mu_i$ mimicked the distribution of mean normalized counts in real data. The simulation was repeated 50 times, and the average MSE of the dispersion estimates is shown in Web Figure 4. The accuracy of NBAMSeq is comparable to that of DESeq2 [1] and edgeR [2] in almost all cases. This simulation shows that although NBAMSeq is designed to detect nonlinear effects, the method is robust and yields accurate dispersion estimates when the true association is linear.

To evaluate the power and error rate control of NBAMSeq in detecting DE genes with a linear effect, we compared the TPR, FDR, FNR, and AUC at nominal FDR 0.05. Web Figure 5 illustrates that NBAMSeq is robust and maintains performance comparable to that of the other methods.
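For readers who want to reproduce this kind of evaluation on their own simulations, the following sketch computes empirical TPR, FDR, and FNR from raw p-values and known DE labels. It assumes Benjamini-Hochberg adjustment [13], a strict `padj < 0.05` calling rule, and the conventional definitions TPR = TP/(TP+FN), FDR = FP/(TP+FP), and FNR = FN/(TP+FN); the definitions used for the figures in this paper are given in an earlier section and may differ. The function names are hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def error_rates(pvalues, is_de, nominal_fdr=0.05):
    """Empirical TPR, FDR and FNR given raw p-values and true DE labels."""
    padj = bh_adjust(pvalues)
    called = [p < nominal_fdr for p in padj]
    tp = sum(c and t for c, t in zip(called, is_de))
    fp = sum(c and not t for c, t in zip(called, is_de))
    fn = sum((not c) and t for c, t in zip(called, is_de))
    return {
        "TPR": tp / max(tp + fn, 1),  # guard against division by zero
        "FDR": fp / max(tp + fp, 1),
        "FNR": fn / max(tp + fn, 1),
    }
```

In a simulation study, `is_de` is the list of true DE indicators used to generate the data, and the rates are averaged over replicates.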

6 Real Data Analysis

In this section, we applied DESeq2 [1] and our proposed method NBAMSeq to our motivating datasets (Section 2), namely the TCGA GBM dataset and the WTC dataset. In the TCGA GBM dataset, the association between age and the expression of each gene was investigated, adjusting for sex and race. At FDR 0.05, 845 and 1,439 genes were detected by DESeq2 [1] and


Table 1: KEGG pathways selected by NBAMSeq.

ID        Description                             pvalue    p.adjust
hsa04310  Wnt signaling pathway                   0.000252  0.033951
hsa04216  Ferroptosis                             0.000334  0.033951
hsa04060  Cytokine-cytokine receptor interaction  0.000344  0.033951
hsa04974  Protein digestion and absorption        0.000463  0.034281

our proposed NBAMSeq, respectively. A total of 818 genes were common to DESeq2 [1] and NBAMSeq, 621 genes were unique to NBAMSeq, and 27 genes were unique to DESeq2 [1] (Figure 4). Because the set of significant genes is sensitive to the FDR threshold, we examined the adjusted p-values in DESeq2 [1] of the genes unique to NBAMSeq (Figure 5A), and vice versa for the genes unique to DESeq2 (Figure 5B). A large proportion of the 621 genes unique to NBAMSeq have FDR above 0.1 in DESeq2 [1], implying that DESeq2 [1] is unable to detect these nonlinear genes even if a less stringent FDR threshold is used. In contrast, the adjusted p-values in NBAMSeq of all 27 genes unique to DESeq2 [1] are < 0.1, implying that NBAMSeq is able to detect all of these genes at FDR < 0.1. Next, we performed pathway analysis on the DE genes using the hypergeometric tests implemented in the clusterProfiler software [39] for both the Kyoto Encyclopedia of Genes and Genomes (KEGG) [40] and Gene Ontology [41] gene sets. Significant pathways were chosen at FDR 0.05. Gene sets with fewer than 15 genes or more than 500 genes were omitted from the analysis. For the KEGG pathways, no gene set was significant among the DESeq2 DE genes, whereas four gene sets were significant among the NBAMSeq DE genes (Table 1). This result is particularly interesting because mutations in the Wnt signaling pathway are implicated in a variety of developmental defects in animals and are associated with aging [42, 43]. For the Gene Ontology, a total of 5,000 gene sets were tested; 45 and 78 gene sets were identified by DESeq2 and NBAMSeq, respectively, and 34 gene sets were common to both. Full lists of the significant gene sets identified by the two models are provided in Web Tables 4 and 5. Among the top gene sets unique to the NBAMSeq DE genes (Web Table 6) are extracellular matrix pathways and catabolic process pathways. Changes to the extracellular matrix with aging have been implicated in the progression of cancer [44], whereas Barzilai et al. [45] showed that metabolic pathways are critical in the aging process. These observations suggest that the gene sets detected by NBAMSeq provide additional insights into the development of cancer and aging as a risk factor.
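The hypergeometric enrichment test mentioned above can be illustrated with a short standard-library sketch. It is not the clusterProfiler implementation; it only shows the upper-tail probability that such over-representation tests compute, together with the gene set size filter (15 to 500 genes) described in the text. The function names and any numbers passed to them are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(universe_size, set_size, n_de, overlap):
    """P(X >= overlap) for X ~ Hypergeometric(universe_size, set_size, n_de).

    universe_size: number of background genes
    set_size:      number of background genes annotated to the gene set
    n_de:          number of DE genes drawn from the background
    overlap:       number of DE genes that fall in the gene set
    """
    upper = min(set_size, n_de)
    total = comb(universe_size, n_de)
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, n_de - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


def keep_gene_set(set_size, min_size=15, max_size=500):
    """Size filter from the text: drop sets with < 15 or > 500 genes."""
    return min_size <= set_size <= max_size
```

The resulting p-values would then be adjusted across all tested gene sets before applying the FDR 0.05 threshold.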

In the WTC dataset, the relationship between PCL and gene expression was investigated, adjusting for age, race and cell proportions (CD8T, CD4T, natural killer, B-cell, monocytes). At FDR 0.05, 836 and 775 genes were detected by DESeq2 and NBAMSeq, respectively (Web Figure 6). Unlike in the TCGA GBM dataset, the genes detected by the two


[Figure 4, Venn diagram: 621 genes unique to NBAMSeq, 27 genes unique to DESeq2, 818 genes in common.]

Figure 4: Number of significant genes by DESeq2 and NBAMSeq (GBM dataset).

[Figure 5, panels A and B: histograms of adjusted p-values (x-axis: padj; y-axis: count). Panel A covers padj from 0 to 1; panel B covers padj from 0.05 to 0.10.]

Figure 5: Histogram of adjusted p-values. The blue dashed line is the FDR cut-off of 0.1.


methods have a large overlap, implying that the primary effect of PCL on gene expression is linear. Although NBAMSeq identified fewer PCL gene expression signatures than DESeq2 [1], the 69 genes unique to DESeq2 were marginally significant in NBAMSeq (Web Figure 7). In contrast, the 8 genes unique to NBAMSeq were not necessarily significant in DESeq2 (Web Table 7). We performed pathway analysis on KEGG [40], and Web Table 8 shows the significant KEGG pathways among the NBAMSeq DE genes. As expected, these KEGG pathways were also significant among the DESeq2 DE genes owing to the large overlap between the two gene lists.
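As a quick consistency check on the gene counts reported in this section, the arithmetic behind the overlaps can be written out as follows. The totals and unique counts are taken from the text; the size of the WTC overlap is not stated explicitly and is derived here, so it should be read as an inference. The helper name is hypothetical.

```python
def unique_counts(n_first, n_second, n_common):
    """Genes unique to each method, given both totals and the shared count."""
    return n_first - n_common, n_second - n_common


# TCGA GBM: 845 (DESeq2), 1,439 (NBAMSeq), 818 in common.
gbm_unique_deseq2, gbm_unique_nbamseq = unique_counts(845, 1439, 818)

# WTC: 836 (DESeq2) with 69 unique, 775 (NBAMSeq) with 8 unique.
# The implied overlap should be the same from either side.
wtc_common_from_deseq2 = 836 - 69
wtc_common_from_nbamseq = 775 - 8
```

Both routes give the same implied WTC overlap, and the GBM values match the 27 and 621 unique genes reported above.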

7 Discussion

This paper introduced a flexible negative binomial additive model combined with Bayesian shrinkage dispersion estimates for RNA-Seq data (NBAMSeq). Motivated by the nonlinear effect of age in the TCGA GBM data and of PTSD severity (PCL score) in the WTC data, our model aims to detect genes that exhibit a nonlinear association with the phenotype of interest. The smoothing parameters and coefficients in NBAMSeq were estimated efficiently by adapting the nested iterative method implemented in mgcv [28, 30, 31, 32, 33]. To increase the accuracy of dispersion parameter estimation in small sample size scenarios, the Bayesian shrinkage approach was applied to model the dispersion trend across all genes. Hypothesis tests to identify DE genes were based on chi-squared approximations.

Although our model was developed for RNA-Seq data, it can be extended to identify nonlinear associations in other genomic biomarkers generated from high-throughput sequencing that share similar features with RNA-Seq (e.g., overdispersion in count data). Our simulation studies show that, compared to DESeq2 [1] and edgeR [2], the proposed NBAMSeq is powerful when the underlying true association is nonlinear and robust when the underlying true association is linear. In addition, the real data analysis showed that NBAMSeq offers additional biological insights into the role of aging in the development of cancer by modeling the nonlinear age effect. Both the simulation and case studies suggest the advantage of investigating potential nonlinear associations to detect more effective biomarkers, in order to better understand the biological underpinnings of complex diseases.

NBAMSeq exhibits slight empirical FDR inflation (approximately 0.06 at nominal FDR 0.05) in scenarios with small sample size (n < 15) and large dispersion, which is attributed to the chi-squared approximation in the hypothesis tests. One way to circumvent this issue is to use a permutation test to obtain the null distribution of the test statistic, but this approach is computationally intensive. Future work includes developing an efficient parallel computing framework for the permutation test. In addition, similar to other software for RNA-Seq DE analysis, currently


NBAMSeq is developed for inference at the gene level in the absence of prior information. Recent methods that take into account the gene network structure in DE analysis have been shown to be powerful [46]. Thus, an interesting future research direction is to incorporate the network topology into NBAMSeq to detect nonlinear associations and decipher the collective dynamics and regulatory effects of gene expression. Finally, NBAMSeq is available for download at https://github.com/reese3928/NBAMSeq and is in submission to the Bioconductor repository.
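The permutation test suggested above as a remedy for FDR inflation can be sketched for a single gene as follows. This is a minimal illustration, not part of NBAMSeq: the test statistic is a stand-in (absolute Pearson correlation between the covariate and log2(count + 1)), whereas an actual implementation would refit the additive model on each permuted covariate and reuse the model's own test statistic. Adjustment covariates are ignored here, and all names are hypothetical.

```python
import math
import random


def abs_correlation(x, y):
    """Stand-in test statistic: absolute Pearson correlation of x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return abs(sxy) / math.sqrt(sxx * syy)


def permutation_pvalue(covariate, counts, n_perm=999, seed=0,
                       statistic=abs_correlation):
    """Permutation p-value for association between a covariate and one gene."""
    rng = random.Random(seed)
    response = [math.log2(c + 1.0) for c in counts]
    observed = statistic(covariate, response)
    shuffled = list(covariate)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the covariate-response pairing
        if statistic(shuffled, response) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return (exceed + 1) / (n_perm + 1)
```

Because each gene is tested independently, the loop over genes is a natural unit for parallel execution, which is where the computational cost noted in the text arises.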

Acknowledgements

This work was supported in part by the CDC/NIOSH award U01 OH011478.


References

[1] Michael I Love, Wolfgang Huber, and Simon Anders. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome biology, 15(12):550, 2014.

[2] Mark D Robinson, Davis J McCarthy, and Gordon K Smyth. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1):139–140, 2010.

[3] Yi-Hui Zhou, Kai Xia, and Fred A Wright. A powerful and flexible approach to the analysis of RNA sequence count data. Bioinformatics, 27(19):2672–2678, 2011.

[4] Mark D Robinson and Gordon K Smyth. Moderated statistical tests for assessing differences in tag abundance. Bioinformatics, 23(21):2881–2887, 2007.

[5] Davis J McCarthy, Yunshun Chen, and Gordon K Smyth. Differential expression analysis of multifactor RNA-seq experiments with respect to biological variation. Nucleic acids research, 40(10):4288–4297, 2012.

[6] Yunshun Chen, Aaron TL Lun, and Gordon K Smyth. Differential expression analysis of complex RNA-seq experiments using edgeR. In Statistical analysis of next generation sequencing data, pages 51–74. Springer, 2014.

[7] Steven P Lund, Dan Nettleton, Davis J McCarthy, and Gordon K Smyth. Detecting differential expression in RNA-sequence data using quasi-likelihood with shrunken dispersion estimates. Statistical applications in genetics and molecular biology, 11(5), 2012.

[8] Aaron TL Lun, Yunshun Chen, and Gordon K Smyth. It's DE-licious: a recipe for differential expression analyses of RNA-seq experiments using quasi-likelihood methods in edgeR. In Statistical Genomics, pages 391–416. Springer, 2016.

[9] Georg Stricker, Alexander Engelhardt, Daniel Schulz, Matthias Schmid, Achim Tresch, and Julien Gagneur. GenoGAM: genome-wide generalized additive models for ChIP-seq analysis. Bioinformatics, 33(15):2258–2265, 2017.

[10] Sally W Thurston, MP Wand, and John K Wiencke. Negative binomial additive models. Biometrics, 56(1):139–144, 2000.

[11] Simon Anders and Wolfgang Huber. Differential expression analysis for sequence count data. Genome biology, 11(10):R106, 2010.


[12] Serdar Bozdag, Aiguo Li, Gregory Riddick, Yuri Kotliarov, Mehmet Baysan, Fabio M Iwamoto, Margaret C Cam, Svetlana Kotliarova, and Howard A Fine. Age-specific signatures of glioblastoma at the genomic, genetic, and epigenetic levels. PloS one, 8(4):e62982, 2013.

[13] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), pages 289–300, 1995.

[14] Pei-Fen Kuan, Monika A Waszczuk, Roman Kotov, Sean Clouston, Xiaohua Yang, Prashant K Singh, Sean T Glenn, Eduardo Cortes Gomez, Jianmin Wang, Evelyn Bromet, et al. Gene expression associated with PTSD in World Trade Center responders: An RNA sequencing study. Translational psychiatry, 7(12):1297, 2017.

[15] Frank W Weathers, Brett T Litz, Debra S Herman, Jennifer A Huska, Terence M Keane, et al. The PTSD checklist (PCL): Reliability, validity, and diagnostic utility. In annual convention of the international society for traumatic stress studies, San Antonio, TX, volume 462. San Antonio, TX., 1993.

[16] Pei Fen Kuan, Monika Waszczuk, Roman Kotov, Carmen Marsit, Guia Guffanti, Xiaohua Yang, Karestan Koenen, Evelyn Bromet, and Benjamin Luft. DNA methylation associated with PTSD and depression in World Trade Center responders: An epigenome-wide study. Biological Psychiatry, 81(10):S365, 2017.

[17] James H Bullard, Elizabeth Purdom, Kasper D Hansen, and Sandrine Dudoit. Evaluation of statistical methods for normalization and differential expression in mRNA-seq experiments. BMC bioinformatics, 11(1):94, 2010.

[18] Benjamin M Bolstad, Rafael A Irizarry, Magnus Astrand, and Terence P Speed. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19(2):185–193, 2003.

[19] Yee Hwa Yang and Natalie P Thorne. Normalization for two-color cDNA microarray data. Lecture Notes-Monograph Series, pages 403–418, 2003.

[20] Kasper D Hansen, Rafael A Irizarry, and Zhijin Wu. Removing technical variability in RNA-seq data using conditional quantile normalization. Biostatistics, 13(2):204–216, 2012.


[21] Ali Mortazavi, Brian A Williams, Kenneth McCue, Lorian Schaeffer, and Barbara Wold. Mapping and quantifying mammalian transcriptomes by RNA-seq. Nature methods, 5(7):621, 2008.

[22] Mark D Robinson and Alicia Oshlack. A scaling normalization method for differential expression analysis of RNA-seq data. Genome biology, 11(3):R25, 2010.

[23] Marie-Agnes Dillies, Andrea Rau, Julie Aubert, Christelle Hennequet-Antier, Marine Jeanmougin, Nicolas Servant, Celine Keime, Guillemette Marot, David Castel, Jordi Estelle, et al. A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Briefings in bioinformatics, 14(6):671–683, 2013.

[24] Peipei Li, Yongjun Piao, Ho Sun Shon, and Keun Ho Ryu. Comparing the normalization methods for the differential analysis of Illumina high-throughput RNA-seq data. BMC bioinformatics, 16(1):347, 2015.

[25] Simon Anders, Alejandro Reyes, and Wolfgang Huber. Detecting differential usage of exons from RNA-seq data. Genome research, 22(10):2008–2017, 2012.

[26] Trevor Hastie and Robert Tibshirani. Generalized additive models. Statist. Sci., 1(3):297–310, 1986.

[27] TJ Hastie and RJ Tibshirani. Generalized additive models. Chapman & Hall: CRC Monographs on Statistics & Applied Probability. London, 1990.

[28] Simon N Wood. Generalized additive models: an introduction with R. CRC press, 2017.

[29] Simon N Wood. Generalized additive models: an introduction with R. Chapman and Hall/CRC, 2006.

[30] Simon N Wood. Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(1):3–36, 2011.

[31] Simon N Wood, Natalya Pya, and Benjamin Safken. Smoothing parameter and model selection for general smooth models. Journal of the American Statistical Association, 111(516):1548–1563, 2016.

[32] Simon N Wood. Stable and efficient multiple smoothing parameter estimation for generalized additive models. Journal of the American Statistical Association, 99(467):673–686, 2004.


[33] Simon N Wood. Thin plate regression splines. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(1):95–114, 2003.

[34] Wolfgang Hardle, Peter Hall, and James Stephen Marron. How far are automatically chosen regression smoothing parameters from their optimum? Journal of the American Statistical Association, 83(401):86–95, 1988.

[35] Adi Ben-Israel and Thomas NE Greville. Generalized inverses: theory and applications, volume 15. Springer Science & Business Media, 2003.

[36] David Roxbee Cox and Nancy Reid. Parameter orthogonality and approximate conditional inference. Journal of the Royal Statistical Society. Series B (Methodological), pages 1–39, 1987.

[37] Simon N Wood. On p-values for smooth components of an extended generalized additive model. Biometrika, 100(1):221–228, 2012.

[38] Huan Liu, Yongqiang Tang, and Hao Helen Zhang. A new chi-square approximation to the distribution of non-negative definite quadratic forms in non-central normal variables. Computational Statistics & Data Analysis, 53(4):853–856, 2009.

[39] Guangchuang Yu, Li-Gen Wang, Yanyan Han, and Qing-Yu He. clusterProfiler: an R package for comparing biological themes among gene clusters. Omics: a journal of integrative biology, 16(5):284–287, 2012.

[40] Minoru Kanehisa and Susumu Goto. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic acids research, 28(1):27–30, 2000.

[41] Michael Ashburner, Catherine A Ball, Judith A Blake, David Botstein, Heather Butler, J Michael Cherry, Allan P Davis, Kara Dolinski, Selina S Dwight, Janan T Eppig, et al. Gene ontology: tool for the unification of biology. Nature genetics, 25(1):25, 2000.

[42] Tannishtha Reya and Hans Clevers. Wnt signalling in stem cells and cancer. Nature, 434(7035):843, 2005.

[43] Jan Gruber, Zhuangli Yee, and Nicholas Tolwinski. Developmental drift and the role of Wnt signaling in aging. Cancers, 8(8):73, 2016.

[44] Cynthia C Sprenger, Stephen R Plymate, and May J Reed. Aging-related alterations in the extracellular matrix modulate the microenvironment and influence tumor progression. International journal of cancer, 127(12):2739–2748, 2010.


[45] Nir Barzilai, Derek M Huffman, Radhika H Muzumdar, and Andrzej Bartke. The critical role of metabolic pathways in aging. Diabetes, 61(6):1315–1322, 2012.

[46] Laurent Jacob, Pierre Neuvial, and Sandrine Dudoit. More power via graph-structured tests for differential expression of gene networks. The Annals of Applied Statistics, 6(2):561–600, 2012.
