
Central Annals of Biometrics & Biostatistics

Cite this article: Hossain A, Khan H, Beyene J (2015) Bayesian Regression Technique to Estimate Area Under the Receiver Operating Characteristic Curve and its Application to Microrna Data. Ann Biom Biostat 2(1): 1013.

*Corresponding authors: Ahmed Hossain, Department of Clinical Epidemiology and Biostatistics, McMaster University, ON L8S4L8, Canada, Email:

Joseph Beyene, Department of Clinical Epidemiology and Biostatistics, McMaster University, ON L8S4L8, Canada, Email: [email protected]

Submitted: 02 June 2014

Accepted: 10 January 2015

Published: 15 January 2015

Copyright © 2015 Hossain et al.

OPEN ACCESS

Keywords: Bayesian regression technique; microRNA; Area under ROC curve; WinBUGS; Differential expression; False discovery rate

Review Article

Bayesian Regression Technique to Estimate Area Under the Receiver Operating Characteristic Curve and its Application to MicroRNA Data

Ahmed Hossain1,2,3*, Hafiz Khan4 and Joseph Beyene1,2

1 Department of Clinical Epidemiology and Biostatistics, McMaster University, Canada
2 Statistics for Integrative Genomics and Meta-Analysis, McMaster University, Canada
3 Department of Applied Statistics, East West University, Bangladesh
4 Department of Biostatistics, Florida International University, USA

Abstract

Detecting differential biomarkers in a genomic study poses major challenges for statistical analysis because the number of markers is large and the number of samples is small. Because of the large number of markers, Bayesian hierarchical approaches are not popular for analyzing such data. However, the number of microRNAs in microRNA-microarray experiments is low, typically in the hundreds, compared with the few thousand genes measured in conventional gene expression profiling. This motivates us to introduce a Bayesian regression technique to analyze microRNA expression data. We incorporate patient covariate information and priors on the regression coefficients into the regression models and estimate the area under the receiver operating characteristic curve (AUC) comparing two conditions. The Bayesian estimate of the AUC and its variance are used to develop a statistic for testing whether the AUC for each microRNA equals 0.5, allowing a different variance for each microRNA. Our Bayesian regression approach provides a new inferential framework for such genomic data. We focus on the primary step of the microRNA selection process, namely ranking microRNAs by the test statistic to identify differential expression under two conditions. A dataset is analyzed to illustrate the method, and a simulation study is carried out to assess the relative performance of different statistical measures. Simulation results suggest that, for identifying truly differentially expressed microRNAs, the Bayesian technique performs better than the linear regression model, especially with small sample sizes and in nonlinear scenarios.

INTRODUCTION

Although Bayesian hierarchical methods have been studied for many years, they are not popular in genomic studies, especially for identifying differentially expressed biomarkers. This is due in large part to the relatively high computational burden of such high-dimensional analysis. For this reason, more traditional approaches based on point estimation of parameters have typically been the method of choice for identifying differentially expressed biomarkers in a genomic study. MicroRNA (miRNA) data, however, are comparatively small among genomic data sets, and the widespread availability of fast computers allows Bayesian computations to be performed in reasonable time for such applications. Furthermore, the development of Markov chain Monte Carlo techniques has greatly extended the range of models amenable to Bayesian analysis.

MicroRNAs are small non-coding RNA molecules that have a central role in regulating gene expression as part of post-transcriptional functions. Further, microRNA deregulation is associated with cancer development and with tumor progression [1]. Therefore, it is essential to apply a robust computational and statistical method to identify differentially expressed miRNAs. When performing miRNA analysis for inference of relationships between microRNAs and conditions/phenotypes, we commonly analyze a small number of samples, each composed of expression values from hundreds of miRNAs.


The receiver operating characteristic (ROC) curve has come into use for identifying differentially expressed biomarkers in genomic studies [2,3]. It represents the relationship between the true positive rate (TPR, the probability that a diseased subject has a positive test) and the false positive rate (FPR, the probability that a healthy subject has a positive test) resulting from a set of binary classification tests across all possible threshold values. ROC curves are frequently summarized with a single measure, the area under the curve (AUC). The AUC quantifies how accurately a diagnostic test discriminates between two study groups (for example, a diseased group versus a healthy group, or genes in a certain functional group versus genes not in that functional group). In miRNA expression analysis, ROC curve analysis is commonly used for selecting the most informative miRNAs [4,5]. When a ROC curve is drawn using a specific microRNA expression profile, the AUC estimates the probability that a subject randomly selected from one condition (e.g., a cancer group; Y) has a higher expression value than a subject randomly selected from the other condition (e.g., individuals without cancer; X). Wolfe and Hogg [6] argued that using the AUC, or P(Y > X), as a measure of the difference between two populations is often more meaningful than looking at mean differences.
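To make the P(Y > X) interpretation concrete, the following minimal R sketch estimates the AUC for a single miRNA as the proportion of (Y, X) pairs in which the Y value is larger. The helper name `empirical_auc` and the toy expression values are illustrative assumptions and are not taken from the paper.

```r
# Empirical AUC for a single miRNA: the proportion of (Y, X) pairs with Y > X,
# counting ties as 1/2 (the Mann-Whitney estimate of P(Y > X)).
# `empirical_auc` is a hypothetical helper, not a function from the paper.
empirical_auc <- function(y, x) {
  diffs <- outer(y, x, "-")               # all pairwise differences y_j - x_k
  mean((diffs > 0) + 0.5 * (diffs == 0))  # wins plus half-credit for ties
}

# Toy expression values (made up for illustration)
x <- c(1.2, 0.8, 1.5, 1.0)   # condition X, e.g. individuals without cancer
y <- c(1.6, 1.1, 2.0, 1.4)   # condition Y, e.g. cancer group
auc <- empirical_auc(y, x)
```

Here 13 of the 16 pairs have Y > X, so the estimate is 13/16. An AUC of 0.5 would indicate no separation between the two conditions.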

In a Bayesian regression technique we make predictions by integrating over the distribution of model parameters, β, rather than by using a specific estimated value of β. Such integrations may often be analytically intractable and require sophisticated Markov chain Monte Carlo methods to approximate them. On the other hand, the integration implied by the Bayesian framework overcomes the issue of over-fitting (by averaging over many different possible solutions) and typically results in improved predictive capability.

It is known that expression values may be strongly influenced by covariates. Hence, it is important to take covariates into account to estimate the AUC. To assess possible covariate effects on the AUC estimation, two different regression approaches have been suggested in the statistical literature. Induced methodology [7,8] is based on using separate regression models for the result of the diagnostic test in healthy and diseased populations respectively. Covariate effects on the associated ROC curve can then be computed by deriving the induced form of the ROC curve. Instead of targeting the diagnostic test, direct methodology [2,9] assumes a regression model for the ROC curve, with the effect of the covariates thus being directly evaluated on the ROC curve. Roughly speaking, both approaches use a generalized linear model (GLM) framework to characterize covariate effects on the ROC curve.

DATA

Wang et al. [10] analyzed the miRNA expression data of uterine leiomyomata (ULM) patients. They identified some tumorigenic genes previously identified in ULMs that may be targeted by the deregulated miRNAs, and they found a subset of miRNAs that are strongly associated with tumor size and race. The data were obtained from GEO under accession number GSE5244. We analyzed 47 patients: 25 patients with tumor size ≤ 9 cm and 22 patients with tumor size above 10 cm. Our interest is to determine statistically significant miRNAs that are differentially expressed between the two tumor size groups, taking race and age into account. Importantly, Wang et al. [10] reported that the let-7 family, miR-21, miR-23b, miR-29b, and miR-197 were all deregulated miRNAs that are highly correlated with tumor size and race. In this paper we introduce the Bayesian technique to estimate the area under the ROC curve and apply the method to identify differentially expressed miRNAs.

METHOD

Linear regression model

The R package limma uses linear models to analyze designed microarray experiments. Details about fitting the linear regression model for each gene are given in Smyth [11]. Consider two-class microRNA data, where we have measured the expression levels of $m$ miRNAs for $n_g$, $g = 1, 2$, samples from the two conditions, with $n_1 + n_2 = n$. In addition, each sample has a set of $K$ covariates labeled $X_k$, $k = 1, \ldots, K$. Denote the measured expression values as

$$y_{ij}, \quad i = 1, \ldots, m; \; j = 1, \ldots, n,$$

and assume that the necessary preprocessing has been applied [12,13]. We introduce the following class indicators:

$$L_j = \begin{cases} 1, & \text{if sample } j \text{ is from class 1 (the diseased group)} \\ 0, & \text{otherwise.} \end{cases}$$

In differential miRNA expression detection, the basic idea is to compare the expression levels across the two conditions. We can do the comparison for miRNA $i$ using the following linear regression model:

$$y_j = \alpha + \beta_0 L_j + \sum_{k=1}^{K} \beta_k x_{jk} + \varepsilon_j = x_j'\beta + \varepsilon_j, \quad j = 1, \ldots, n,$$

where $\beta = (\alpha, \beta_0, \beta_1, \ldots, \beta_K)'$ and $x_j = (1, L_j, x_{j1}, \ldots, x_{jK})'$. Classical (non-Bayesian) techniques determine a specific value for the parameter vector $\beta$ by minimizing the sum-of-squares error function $\sum_j \varepsilon_j^2$. A well-known problem with error function minimization is that such models can "over-fit" the data, leading to poor generalization. Indeed, when the number of parameters equals the number of data points, the least squares solution for a model of this form can achieve a perfect fit to the data. Over-fitting can be avoided by adopting a Bayesian technique.
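As a rough illustration of the least-squares fit for one miRNA, the R sketch below simulates a class indicator and two covariates and fits the model with `lm`. The covariate names (`age`, `race`), the coefficient values and all data are assumptions made up for the example; only base R is used rather than limma. The second part shows the over-fitting remark in its extreme form: four observations and four parameters give a perfect fit.

```r
# Least-squares fit of y_j = alpha + beta0 * L_j + beta1 * age_j + beta2 * race_j + e_j
# for a single miRNA. All data are simulated for illustration only.
set.seed(1)
n    <- 40
L    <- rep(c(0, 1), each = n / 2)       # class indicator (1 = diseased)
age  <- round(runif(n, 30, 60))          # hypothetical covariate
race <- rbinom(n, 1, 0.5)                # hypothetical binary covariate
y    <- 1 + 0.8 * L + 0.02 * age + 0.3 * race + rnorm(n, sd = 0.5)

fit   <- lm(y ~ L + age + race)
beta0 <- unname(coef(fit)["L"])          # estimated class (disease) effect
rss   <- sum(resid(fit)^2)               # minimized sum-of-squares error

# Over-fitting in the extreme: as many parameters as data points,
# so least squares reproduces the four observations exactly.
small <- data.frame(y    = c(2.1, 1.7, 3.0, 2.4),
                    L    = c(0, 0, 1, 1),
                    age  = c(30, 40, 35, 50),
                    race = c(0, 1, 1, 0))
fit_sat <- lm(y ~ L + age + race, data = small)
rss_sat <- sum(resid(fit_sat)^2)         # numerically zero
```

The zero residual sum of squares in the saturated fit says nothing about how well the model would predict a new sample, which is the generalization problem the text describes.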

Area under the ROC curve from regression model

Let $\mu_1$ and $\mu_0$ denote the mean of the miRNA expression at condition 1 and condition 0, respectively, and suppose we have additional covariates $x = (x_1, \ldots, x_K)$. For simplicity, the variances of the outcomes are $\sigma_1^2$ and $\sigma_0^2$ for condition 1 and condition 0, respectively. Furthermore, let the ROC parameters be

$$\gamma_1 = \frac{\mu_1 - \mu_0}{\sigma_1} \quad \text{and} \quad \gamma_2 = \frac{\sigma_0}{\sigma_1}.$$

It can easily be shown in the linear regression model that

$$\gamma_1 = \frac{\beta_0}{\sigma_1},$$

free of $x$, where $\beta_0$ is the regression coefficient for $L$. Thus, there is a single ROC curve under this model. The area under the curve can then be defined as

$$A = \Phi\left(\frac{\gamma_1}{\sqrt{1 + \gamma_2^2}}\right),$$

where $\Phi(\cdot)$ is the standard normal cumulative distribution function. Note that the area under the curve is equal to the probability that the outcome for a randomly drawn subject from condition 1 is higher than for a randomly drawn subject from condition 0.
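The AUC formula can be checked numerically. The R sketch below computes the area from a class effect and the two standard deviations, and compares it with a Monte Carlo estimate of the probability that a condition 1 outcome exceeds a condition 0 outcome. The helper name `binormal_auc` and the numerical values are assumptions chosen for illustration.

```r
# AUC under the normal regression model: A = Phi(gamma1 / sqrt(1 + gamma2^2)),
# with gamma1 = beta0 / sigma1 and gamma2 = sigma0 / sigma1.
# `binormal_auc` is a hypothetical helper name.
binormal_auc <- function(beta0, sigma1, sigma0) {
  gamma1 <- beta0 / sigma1
  gamma2 <- sigma0 / sigma1
  pnorm(gamma1 / sqrt(1 + gamma2^2))
}

# Illustrative values: class effect 0.5, SD 1.5 (condition 1) and 1 (condition 0)
A <- binormal_auc(beta0 = 0.5, sigma1 = 1.5, sigma0 = 1)

# Monte Carlo check of A = P(Y1 > Y0). The common covariate contribution
# `shift` cancels in Y1 - Y0, which is why the AUC is free of x.
set.seed(42)
shift <- 2
y1 <- rnorm(2e5, mean = shift + 0.5, sd = 1.5)
y0 <- rnorm(2e5, mean = shift,       sd = 1)
A_mc <- mean(y1 > y0)
```

With these values the formula gives an AUC of about 0.61, and the simulated proportion agrees to within Monte Carlo error.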

Bayesian technique

For convenience, let $y_{0j_0}$ and $y_{1j_1}$ be the real-valued continuous miRNA expression values for the $j_0$-th and $j_1$-th subjects in the healthy (i.e., condition 0) and diseased (i.e., condition 1) groups, respectively, with $j_0 = 1, \ldots, n_1$; $j_1 = 1, \ldots, n_2$ and $n_1 + n_2 = n$. We denote the expression values for the $i$-th miRNA as $y_i = (y_0, y_1)$. We assume that, given the covariates $x$, the miRNA expression values are independent in the healthy and diseased groups and that

$$y_{0j_0} \sim f_0(\cdot \mid x_{0j_0}), \quad j_0 = 1, \ldots, n_1,$$

and

$$y_{1j_1} \sim f_1(\cdot \mid x_{1j_1}), \quad j_1 = 1, \ldots, n_2,$$

where $f_0(\cdot \mid x)$ and $f_1(\cdot \mid x)$ denote the conditional densities of the marker, given the predictors $x$, in the healthy and diseased groups, respectively. Let $\mu_x$ denote the underlying mean of the $i$-th miRNA, which depends on both the condition and the additional covariates $x$:

$$\mu_x = x'\hat{\beta}.$$

We also assume that the underlying variances of the two groups, $\sigma_1^2$ and $\sigma_2^2$, depend only on the condition $L$. With the pre-specified regression model and the observed data, the posterior distribution of $\beta$ is obtained from Bayes theorem (16):

$$f(\beta, \sigma^2 \mid y, L, x) = \frac{f(y \mid \cdot)\, f(\beta, \sigma^2)}{f(y \mid L, x)},$$

where $f(y \mid \cdot)$ is the joint probability distribution function of the two conditions, $\beta = (\alpha, \beta_0, \beta_1, \ldots, \beta_K)'$, $\sigma^2 = (\sigma_1^2, \sigma_2^2)$ and

$$f(y \mid L, x) = \iint f(y \mid \cdot)\, f(\beta, \sigma^2)\, d\beta\, d\sigma^2.$$
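As a worked instance of the likelihood term, suppose (as in the BUGS model of the appendix, where `yt[i] ~ dnorm(mu[i], precy[L[i]+1])`) that both conditional densities are normal with a condition-specific variance. Writing $x_j$ for the full design vector of sample $j$ (intercept, $L_j$ and covariates), which is notation assumed here for compactness, the likelihood and the unnormalized posterior would then take the form:

```latex
f(y \mid \beta, \sigma^2, L, x)
  = \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi\,\sigma_{L_j+1}^{2}}}
    \exp\left\{ -\frac{\left(y_j - x_j'\beta\right)^{2}}{2\,\sigma_{L_j+1}^{2}} \right\},
\qquad
f(\beta, \sigma^2 \mid y, L, x) \;\propto\; f(y \mid \beta, \sigma^2, L, x)\, f(\beta, \sigma^2).
```

The normalizing integral in the denominator does not depend on $\beta$ or $\sigma^2$, which is why Markov chain Monte Carlo sampling can work with the unnormalized product alone.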

Prior Distributions

We consider two types of prior distributions:

Prior I. The prior distributions for all regression coefficients, the $\beta$s, are assumed to be independent normal $N(0, 10^3)$ distributions; that is, we take a very large variance for each regression coefficient. The prior distributions for the variance terms, the $\sigma^2$s, are assumed to be independent inverse gamma $IG(0.001, 0.001)$ distributions.

Prior II. Here we consider an informative prior for the disease effect $\beta_0$. The prior for $\beta_0$ is based on the mean of the treatment effects across all the miRNAs, i.e. the mean of $\mu_{1i} - \mu_{0i}$. In the uterine leiomyomata data, the prior distribution for $\beta_0$ is taken as $N(0.05, 0.05)$, because the mean of the treatment effects is found to be 0.0411. We also consider another prior with a higher variance for $\beta_0$, $N(0.05, 1)$.
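When priors such as these are written in BUGS, it helps to remember that `dnorm` in BUGS takes a mean and a precision (the reciprocal of the variance), not a variance. The short R sketch below converts the prior variances stated in the text to precisions; the helper `var_to_precision` is a hypothetical name, and the sketch assumes the second argument of each $N(\cdot, \cdot)$ in the text is a variance, as the surrounding wording suggests.

```r
# BUGS/WinBUGS parameterizes dnorm(mean, precision), with precision = 1 / variance.
# `var_to_precision` is a hypothetical helper for translating the stated priors.
var_to_precision <- function(variance) 1 / variance

prec_vague <- var_to_precision(10^3)   # N(0, 10^3)    corresponds to dnorm(0, 0.001)
prec_inf_a <- var_to_precision(0.05)   # N(0.05, 0.05) corresponds to dnorm(0.05, 20)
prec_inf_b <- var_to_precision(1)      # N(0.05, 1)    corresponds to dnorm(0.05, 1)

# Central 95% prior interval for beta0 under the tighter informative prior
ci_inf_a <- qnorm(c(0.025, 0.975), mean = 0.05, sd = sqrt(0.05))
```

The interval shows how strongly the tighter informative prior concentrates the disease effect near 0.05 before any data are seen.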

Test statistic and false discovery rate

To test the hypothesis $H_0: A_i = 0.5$ for the $i$-th miRNA, the test statistic is

$$Z_i = \frac{\hat{A}_i - 0.5}{SE(\hat{A}_i)},$$

which is approximately standard normally distributed. $\hat{A}_i$ and $SE(\hat{A}_i)$ are estimated with the Bayesian regression technique. We can rank miRNAs according to the values of $Z_i$. Moreover, we can smooth the denominator of the $Z_i$ statistic by following the approach of [11,14-16]. For example, the offset $s_0$ can be taken as the quantile of the miRNA-wise standard errors that minimizes the coefficient of variation of the $Z_i$ statistic. We can then calculate the $Z_i(\text{adj})$ statistic to test for a treatment effect as

$$Z_i(\text{adj}) = \frac{\hat{A}_i - 0.5}{SE(\hat{A}_i) + s_0}.$$

One of the challenges of using the $Z_i(\text{adj})$ statistic is that a permutation method is needed to obtain a p-value, because the statistic does not follow a specific distribution. This paper uses the $Z_i$ statistic to obtain the p-values, which saves computational time. The BUGS and the R code for testing the difference between two AUCs are provided in the appendix.
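A minimal R sketch of the two statistics is given below. The matrix `auc_draws` stands in for MCMC output (one column of posterior AUC draws per miRNA) and is simulated from Beta distributions purely for illustration; the miRNA labels are made up. Using the median standard error as the offset $s_0$ is an arbitrary placeholder, not the tuned quantile described in the text.

```r
# Z_i = (A_hat_i - 0.5) / SE(A_hat_i) from posterior draws of the AUC.
# `auc_draws` stands in for MCMC output (rows = draws, columns = miRNAs);
# here it is simulated from Beta distributions purely for illustration.
set.seed(7)
auc_draws <- cbind(
  miR_a = rbeta(3000, 30, 30),   # centred at 0.5: no differential expression
  miR_b = rbeta(3000, 45, 15)    # centred at 0.75: differential expression
)

A_hat <- colMeans(auc_draws)             # Bayesian point estimate of A_i
se    <- apply(auc_draws, 2, sd)         # posterior SD used as SE(A_hat_i)
Z     <- (A_hat - 0.5) / se
p_val <- 2 * pnorm(-abs(Z))              # two-sided normal p-value

# Adjusted statistic with an offset s0 added to the denominator.
# The median SE is an arbitrary placeholder for s0.
s0    <- median(se)
Z_adj <- (A_hat - 0.5) / (se + s0)
```

Because the offset only enlarges the denominator, the adjusted statistic is always closer to zero than the unadjusted one, which damps miRNAs whose large statistic comes from a very small standard error.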

In miRNA analysis, multiplicity arises from testing hundreds of hypotheses. The false discovery rate (FDR) is commonly used in such genomic studies to correct for multiple comparisons. FDR procedures are designed to control the expected proportion of false positives among the results declared significant. FDR is estimated using permutation and thresholding the statistic [16-19]. Here we have used the R package qvalue to estimate the FDR for a given set of p-values.
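For readers without the qvalue package, the idea can be sketched with base R. The example below uses Benjamini-Hochberg adjusted p-values from `p.adjust` as a stand-in; this is a related but different FDR procedure from the q-value estimate used in the paper, and the p-values are made up for illustration.

```r
# Benjamini-Hochberg adjusted p-values with base R, used here as a stand-in
# for the q-values of the qvalue package. The p-values are made up.
p   <- c(0.0002, 0.004, 0.012, 0.03, 0.2, 0.45, 0.6, 0.9)
fdr <- p.adjust(p, method = "BH")

# miRNAs declared significant at a 5% FDR threshold
selected <- which(fdr <= 0.05)
```

In this toy example the first three hypotheses are selected; the fourth has an adjusted value of 0.06 and falls just outside the threshold even though its raw p-value is below 0.05.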

Simulation study

To evaluate the performance of the test statistic associated with our model, we analyzed simulated data sets under two different scenarios. Specifically, we considered a linear-mean scenario with different variances for the two groups, and a non-linear-mean scenario with predictor-dependent variance. Each scenario has two conditions, treatment versus control, with sample sizes of $n_0 = n_1 = 10, 20, 30, 40$ and $50$ per condition. We generated data for 200 miRNAs and set the proportion of differentially expressed (DE) miRNAs to 0.1, i.e., 20 of the 200 miRNAs are DE. The following simulation scenarios are considered here:


Scenario I. We consider linear-mean regression models for the diseased and healthy groups, and later add a constant treatment effect to the diseased group. Specifically, we assume that, for $i = 1, \ldots, 200$,

$$y_{0j_0} \mid x_{0j_0} \sim N(0.5 + x_{0j_0},\, 1); \qquad y_{1j_1} \mid x_{1j_1} \sim N(0.5 + x_{1j_1},\, 1.5^2).$$

The purpose of including this linear scenario is to investigate the performance of the methods when the standard parametric assumptions hold. We take the treatment effect size to be 0.5, which is added to the diseased group for the first 10% of the miRNAs. Similar simulation details based on the skew-normal distribution are given in [25].

Scenario II. We consider nonlinear regression models for both groups, and later add a treatment effect to the diseased group. We generate expression data for the $i$-th miRNA from a skew-normal distribution relating the expression values and covariates through the following model:

$$y_j = 0.5 + x_j + \delta z_j + \varepsilon_j,$$

where $\varepsilon_j \sim N(0, 1)$ and $Z_j \sim HN(0, 1)$, $j = 1, \ldots, n$, are all independent, and $HN(0, 1)$ denotes the univariate standardized half-normal distribution. By Sahu et al. (2003) [23], it follows that $\delta Z_j + \varepsilon_j \sim SN(0, 1, \delta)$. From the properties of the skew-normal distribution, it follows that $y_j \sim SN(x_j'\beta, 1, \delta)$, where $\beta = (0.5, 1)'$ and $x_j' = (1, x_j)$. In addition, we added a treatment effect of 0.5 to the diseased group for the first 10% of the miRNAs.
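The two data-generating scenarios can be sketched in R as follows. The function name `simulate_scenario`, the standard normal covariate and the skewness value `delta = 2` are assumptions made for the example, since the text does not state the covariate distribution or the value of the skewness parameter.

```r
# Simulate one data set per scenario: 200 miRNAs, the first 20 differentially
# expressed with effect 0.5 added in the diseased group.
# The covariate distribution (standard normal) and delta = 2 are assumptions.
simulate_scenario <- function(n_per_group, scenario = c("I", "II"),
                              m = 200, n_de = 20, effect = 0.5, delta = 2) {
  scenario <- match.arg(scenario)
  n <- 2 * n_per_group
  L <- rep(c(0, 1), each = n_per_group)      # 0 = healthy, 1 = diseased
  x <- rnorm(n)                              # one covariate value per subject
  y <- matrix(NA_real_, nrow = m, ncol = n)
  for (i in seq_len(m)) {
    if (scenario == "I") {
      # linear mean 0.5 + x, SD 1 (healthy) or 1.5 (diseased)
      y[i, ] <- 0.5 + x + rnorm(n, sd = ifelse(L == 1, 1.5, 1))
    } else {
      # skew-normal errors: delta * |z| + e, with z and e standard normal
      y[i, ] <- 0.5 + x + delta * abs(rnorm(n)) + rnorm(n)
    }
    # treatment effect for the first n_de miRNAs, diseased group only
    if (i <= n_de) y[i, L == 1] <- y[i, L == 1] + effect
  }
  list(y = y, L = L, x = x, de = seq_len(m) <= n_de)
}

set.seed(2015)
sim1 <- simulate_scenario(20, "I")
sim2 <- simulate_scenario(20, "II")
```

Each call returns a 200 by 40 expression matrix together with the class indicator, the covariate and a logical vector marking the truly DE miRNAs, which is what a ranking method would be scored against.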

The number of true positives is calculated as the average number of miRNAs that are correctly identified among the 20 top-ranked miRNAs, based on 200 simulations. The proportion of true positives is calculated by dividing the number of true positives by 20. We rank miRNAs by the FDR value corresponding to each miRNA. Figures 1(a) and 1(b) show the proportion of true positive miRNAs against sample size for simulation scenarios I and II, respectively. For example, when applying LIMMA to the simulated dataset following scenario I with sample size 20 per condition, we observe that, on average, 9.46 miRNAs are correctly identified in the list of top 20 ranked miRNAs. The Bayesian methods with prior I, prior II ($\beta_0 \sim N(0.05, 0.05)$) and prior II ($\beta_0 \sim N(0.05, 1)$) correctly identified, on average, 10.42, 10.72 and 10.58 miRNAs, respectively. The results suggest that the Bayesian method with informative priors (prior II) performs best among all the methods, and that the Bayesian method performs better than LIMMA at every sample size considered. It appears from the results of both simulation scenarios that the Bayesian technique performs robustly, especially with small sample sizes. It should be noted that noise is very common in real miRNA data. Therefore, even when the normality assumption holds for given miRNA data, the Bayesian technique performs well for identifying differentially expressed miRNAs.
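The true positive proportion for a single simulated data set can be computed as in the R sketch below. The helper name `top_k_tpr` and the toy scores are assumptions; `score` can be any ranking statistic for which larger values indicate stronger evidence (for example the absolute test statistic, or the negative of the FDR value).

```r
# Proportion of truly DE miRNAs among the top-k ranked ones.
# `top_k_tpr` is a hypothetical helper; larger `score` means stronger evidence.
top_k_tpr <- function(score, is_de, k = 20) {
  top <- order(score, decreasing = TRUE)[seq_len(k)]
  sum(is_de[top]) / k
}

# Toy check: 200 miRNAs, first 20 DE, scores favour the DE ones on average
set.seed(3)
is_de <- seq_len(200) <= 20
score <- abs(rnorm(200, mean = ifelse(is_de, 2, 0)))
tpr   <- top_k_tpr(score, is_de)
```

Averaging this proportion over repeated simulations gives the quantity plotted against sample size; for instance, 9.46 correctly identified miRNAs out of 20 corresponds to a proportion of 0.473.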

APPLICATION

A detailed evaluation of statistical methods on real biological data is challenging because the true positive miRNAs are not known. Here, we evaluated and applied the methods to the uterine leiomyoma microRNA expression data of Wang et al. [10]. The main objective is to determine statistically significant miRNAs that are differentially expressed between patients with tumor size ≤ 9 cm and patients with tumor size above 10 cm. Importantly, Wang et al. [10] reported that the let-7 family, miR-21, miR-23b, miR-29b, and miR-197 were all deregulated miRNAs that are highly correlated with tumor size and race. We observed that LIMMA does not place miR-21 among its top 20 miRNAs, whereas the Bayesian technique with prior 2 identifies all of these miRNAs when the top 20 ranked miRNAs are selected.

Here we evaluate the effectiveness of the miRNA list by different methods to form a classifier which could predict the class of a test sample. In using classification to obtain the best method, we assumed that a better miRNA list should discriminate between the groups more effectively. Therefore we evaluate the classification performance of the four ranking measures using a simple Gaussian maximum likelihood discriminant rule [20-23]. The form of the algorithm is as follows:

1. Split the patients into 3 folds (subsamples).

2. Of the 3 folds, a single fold is retained as the validation data for testing the model, and the remaining 2 folds are used as training data. In choosing the folds we ensured at least 5 patients are chosen from each group.

3. The cross-validation process is repeated for each of the folds, with each of the 3 folds used exactly once as the validation data.

4. Record the number of top ranked genes from the training data and use these genes for classification with a simple Gaussian maximum likelihood discriminant rule. Errors for a given classification relative to a known truth are calculated by the classError function of the R package mclust.

5. Report the average error over all the test results.

Figure 1: True positive rate corresponding to sample sizes at two simulation scenarios.
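The cross-validation procedure in steps 1 to 5 can be sketched in base R as follows. This is only an illustration under stated assumptions: miRNAs are ranked by the absolute difference in group means instead of the paper's ranking statistics, the discriminant rule uses independent (diagonal) Gaussian densities, the misclassification rate is computed with `mean` rather than `mclust::classError`, and the function name `cv_error` and the simulated data are hypothetical.

```r
# Three-fold cross-validation of a diagonal Gaussian maximum likelihood
# discriminant rule on the top-k0 ranked miRNAs (rows of y = miRNAs,
# columns = patients, cls = 0/1 class labels). Illustrative sketch only.
cv_error <- function(y, cls, k0 = 10, folds = 3) {
  n <- ncol(y)
  fold <- integer(n)
  for (g in unique(cls)) {                   # stratify folds by class
    idx <- which(cls == g)
    fold[idx] <- sample(rep_len(seq_len(folds), length(idx)))
  }
  # summed log-density of each test sample under one class model
  loglik <- function(test, m, s) {
    colSums(matrix(dnorm(test, m, s, log = TRUE), nrow = nrow(test)))
  }
  errs <- numeric(folds)
  for (f in seq_len(folds)) {
    train <- fold != f
    tr0 <- y[, train & cls == 0, drop = FALSE]
    tr1 <- y[, train & cls == 1, drop = FALSE]
    # rank miRNAs on the training data only (placeholder ranking statistic)
    top <- order(abs(rowMeans(tr1) - rowMeans(tr0)), decreasing = TRUE)[seq_len(k0)]
    tr0 <- tr0[top, , drop = FALSE]
    tr1 <- tr1[top, , drop = FALSE]
    test <- y[top, !train, drop = FALSE]
    ll0 <- loglik(test, rowMeans(tr0), apply(tr0, 1, sd))
    ll1 <- loglik(test, rowMeans(tr1), apply(tr1, 1, sd))
    pred <- as.integer(ll1 > ll0)            # assign the more likely class
    errs[f] <- mean(pred != cls[!train])     # fold misclassification rate
  }
  mean(errs)                                 # average error over the folds
}

# Simulated example: 100 miRNAs, 48 patients, first 10 miRNAs shifted in class 1
set.seed(11)
cls <- rep(c(0, 1), each = 24)
y <- matrix(rnorm(100 * 48), nrow = 100)
y[1:10, cls == 1] <- y[1:10, cls == 1] + 2
err <- cv_error(y, cls, k0 = 10)
```

Selecting the top-ranked miRNAs inside each training fold, rather than once on the full data, keeps the validation fold from influencing the feature list and so avoids an optimistic error estimate.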

The top-ranking miRNAs, for a specified number (k0) between 5 and 30, are used to create the classification rule. The means of the misclassification errors and their corresponding standard errors for the different methods are summarized in Table 1. It appears that the Bayesian techniques perform better than LIMMA, giving lower classification errors. Among the Bayesian techniques, the informative priors produce more consistent results. Therefore, the Bayesian technique with an informative prior can be used effectively on this dataset to identify the important miRNAs.

DISCUSSION

It is important to apply an appropriate statistical technique for identifying differentially expressed miRNAs between two conditions in the presence of covariates. Here we introduced a Bayesian regression technique based on ROC curve analysis, incorporating covariates, to identify differentially expressed miRNAs. The Bayesian regression technique is a linear model that uses a priori information to estimate the coefficients and predict the AUC in a probability framework. The methodology developed in this paper is based on the assumption that the expression values are normally distributed.

The simulation study shows that our Bayesian technique for identifying differentially expressed miRNAs performs well. The proposed test including covariate effects yields better results than LIMMA regardless of sample size. In the real data application, the Bayesian technique was also a better tool for identifying the true positive miRNAs, resulting in the lowest classification errors. The Bayesian technique has the limitation of taking more computational time than the classical methods, and the computational load is expected to increase significantly with each additional covariate added to the model. Further work is needed to extend the proposed methodology to non-linear models and to explore interactions between covariates and treatment effects.

Table 1: Average of classification errors (standard errors in parentheses) for ULM dataset.

| k0 | LIMMA | Bayesian, Prior 1 | Bayesian, Prior 2, β0 ~ N(0.05, 0.05) | Bayesian, Prior 2, β0 ~ N(0.05, 1) |
|---|---|---|---|---|
| | 0.293 (0.145) | 0.296 (0.152) | 0.282 (0.153) | 0.283 (0.151) |
| | 0.244 (0.139) | 0.234 (0.142) | 0.230 (0.140) | 0.231 (0.139) |
| | 0.249 (0.137) | 0.239 (0.138) | 0.232 (0.136) | 0.230 (0.136) |
| | 0.217 (0.143) | 0.208 (0.139) | 0.205 (0.140) | 0.208 (0.138) |
| | 0.188 (0.136) | 0.179 (0.137) | 0.175 (0.136) | 0.179 (0.136) |
| | 0.156 (0.151) | 0.148 (0.152) | 0.143 (0.158) | 0.142 (0.156) |

APPENDIX

    # BUGS code saved as aucx.bug
    # mu[i] is the regression model; covariate terms can be added to it.
    model{
      for (i in 1:N){
        # group-specific precision: precy[1] for L = 0, precy[2] for L = 1
        yt[i] ~ dnorm(mu[i], precy[L[i]+1])
        mu[i] <- beta[1] + beta[2]*L[i]
      }
      for (i in 1:P){
        # normal prior on each coefficient (second argument is a precision)
        beta[i] ~ dnorm(0, 1.0E-6)
      }
      for (i in 1:K){
        precy[i] ~ dgamma(1.0, 1.0)
        vary[i] <- 1/precy[i]
      }
      la1 <- beta[2]/sqrt(vary[1])
      la2 <- vary[2]/vary[1]
      auc <- phi(la1/sqrt(1+la2))
    }

    ### R code
    library(R2WinBUGS)
    # Set the location where the BUGS code is saved
    setwd("C://Users/Ahmed/Documents/Bayesian/WINBUGS R FILES/AUC")
    # x1: expression values for the non-diseased group
    # x2: expression values for the diseased group
    bayes.AUC <- function(x1, x2){
      yt <- c(x1, x2)
      # class indicator read by the BUGS model as L (0 = non-diseased, 1 = diseased)
      L <- c(rep(0, length(x1)), rep(1, length(x2)))
      P <- 2  # number of model parameters
      K <- 2  # number of conditions
      N <- length(x1) + length(x2)
      data <- list(yt = yt, L = L, P = P, K = K, N = N)
      # initial values for the stochastic nodes (auc itself is deterministic)
      inits <- function() list(beta = rnorm(P, 0, 1), precy = rep(1, K))
      parameters <- c("auc")
      auc.sim <- bugs(data, inits, parameters, "aucx.bug", n.chains = 3, n.iter = 1000)
      # fold the draws so that the estimate is at least 0.5 in either direction
      auc.draws <- pmax(auc.sim$sims.list$auc, 1 - auc.sim$sims.list$auc)
      auc.hat <- mean(auc.draws)
      var.AUC <- var(auc.draws)
      stat.AUC <- (auc.hat - 0.5)/sqrt(var.AUC)
      p.val <- 2*pnorm(-abs(stat.AUC))
      return(p.val)
      # return(list(bayes.AUC = auc.hat, var.AUC = var.AUC, stat.AUC = stat.AUC))
    }

REFERENCES

1. Negrini M, Ferracin M, Sabbioni S, Croce CM. MicroRNAs in human cancer: from research to therapy. J Cell Sci. 2007; 120: 1833-1840.

2. Pepe MS, Longton G, Anderson GL, Schummer M. Selecting differentially expressed genes from microarray experiments. Biometrics. 2003; 59: 133-142.

3. Hossain A, Beyene J. Estimation of weighted log partial Area Under the ROC curve and its application to MicroRNA expression data. Stat Appl Genet Mol Biol. 2013; 12: 743-755.

4. Lewis BP, Burge CB, Bartel DP. Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are micro RNA targets. Cell. 2005; 120: 15-20.

5. Nikas JB, Low WC. Linear Discriminant Functions in Connection with the micro-RNA Diagnosis of Colon Cancer. Cancer Inform. 2012; 11: 1-14.

6. Wolfe DA, Hogg RV. On constructing statistics and reporting data. American Statistician. 1971; 25: 27-30.

7. Faraggi D. Adjusting receiver operating characteristic curves and related indices for covariates. Journal of the Royal Statistical Society. 2003; 52: 1152-1174.

8. Pepe MS. Three approaches to regression analysis of receiver operating characteristic curves for continuous test results. Biometrics. 1998; 54: 124-135.

9. Cai T, Pepe MS. Semiparametric Receiver Operating Characteristic Analysis to Evaluate Biomarkers for Disease. Journal of the American Statistical Association. 2002; 97: 1099-1107.

10. Wang T, Zhang X, Obijuru L, Laser J, Aris V, Lee P, et al. A micro-RNA signature associated with race, tumor size, and target gene activity in human uterine leiomyomas. Genes Chromosomes Cancer. 2007; 46: 336-347.

11. Smyth GK. Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol. 2004; 3.

12. Yang YH, Dudoit S, Luu P, Lin DM, Peng V, Ngai J, Speed TP. Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res. 2002; 30: 15.

13. Bolstad BM, Irizarry RA, Astrand M, Speed TP. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003; 19: 185-193.

14. Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A. 2001; 98: 5116-5121.

15. Eckel JE, Gennings C, Chinchilli VM, Burgoon LD, Zacharewski TR. Empirical bayes gene screening tool for time-course or dose-response microarray data. J Biopharm Stat. 2004; 14: 647-670.

16. Hossain A, Willan RA, Beyene J. An Improved method on Wilcoxon Rank Sum Test for Gene selection from Microarray Experiments, Communications in Statistics. 2013; 7: 1563-1577.

17. Storey JD. A direct approach to false discovery rates. Journal of the Royal Statistical Society, Series B. 2002; 64: 479-498.

18. Storey JD. The positive false discovery rate: A Bayesian interpretation and the q-value. Annals of Statistics. 2003; 31: 2013-2035.

19. Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc Natl Acad Sci U S A. 2003; 100: 9440-9445.

20. Speed T. Statistical Analysis of Gene Expression Microarray Data, Chapman and hall/CRC. 2004.

21. González-Manteiga W, Pardo-Fernández JC, Van Keilegom I. ROC curves in non-parametric location-scale regression models. Scandinavian Journal of Statistics. 2011; 38: 169-184.

22. Bargaje R, Hariharan M, Scaria V, Pillai B. Consensus miRNA expression profiles derived from interplatform normalization of microarray data. RNA. 2010; 16: 16-25.

23. Sahu SK, Dey DK, Branco M. A new class of multivariate skew distributions with application to Bayesian regression models. Canadian Journal of Statistics. 2003; 3: 129-150.

24. Hossain A, Willan AR, Beyene J. An Improved Method on Wilcoxon Rank Sum Test for Gene Selection from Microarray Experiments, Communications in Statistics - Simulation and Computation. 2013; 42: 1563-1577

25. Hossain A, Beyene J. Application of skew-normal distribution for detecting differential expression to microRNA data, Journal of Applied Statistics. 2015; 42: 477-491.