statistical modeling of rna-seq data by julia salzman … 252.pdf · ultra high throughput...

31
STATISTICAL MODELING OF RNA-SEQ DATA By Julia Salzman Hui Jiang Wing Hung Wong Technical Report 252 March 2010 Division of Biostatistics STANFORD UNIVERSITY Stanford, California

Upload: others

Post on 18-Mar-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: STATISTICAL MODELING OF RNA-SEQ DATA By Julia Salzman … 252.pdf · Ultra High Throughput Sequencing (UHTS or simply ‘sequencing’) is an emerging tech- ... which can take values

STATISTICAL MODELING OF RNA-SEQ DATA

By

Julia Salzman Hui Jiang

Wing Hung Wong

Technical Report 252 March 2010

Division of Biostatistics STANFORD UNIVERSITY

Stanford, California

Page 2: STATISTICAL MODELING OF RNA-SEQ DATA By Julia Salzman … 252.pdf · Ultra High Throughput Sequencing (UHTS or simply ‘sequencing’) is an emerging tech- ... which can take values

STATISTICAL MODELING OF RNA-SEQ DATA

By

Julia Salzman Hui Jiang

Wing Hung Wong

Department of Statistics Stanford University

Technical Report 252 March 2010

This research was supported in part by National Institutes of Health grant R01 HG004634 and by National Science Foundation grant DMS 0805157.

Division of Biostatistics STANFORD UNIVERSITY

Sequoia Hall, 390 Serra Mall Stanford, CA 94305-4065

http://statistics.stanford.edu

Page 3: STATISTICAL MODELING OF RNA-SEQ DATA By Julia Salzman … 252.pdf · Ultra High Throughput Sequencing (UHTS or simply ‘sequencing’) is an emerging tech- ... which can take values

Statistical Modeling of RNA-Seq Data

Julia Salzman†, Hui Jiang† and Wing Hung Wong

Stanford University, Stanford, USA

Summary . In mammalian cells, RNAs can have highly similar sequences yet can encodeproteins with different functional roles. Isoforms of a gene are an example of a collectionof such RNAs. Accumulating evidence suggests that a key factor charactering cell functionin mammals is differential isoform expression. Quantifying differences in cellular abundanceof isoforms is therefore of significant biological interest. Ultra High Throughput Sequencing(UHTS) is an emerging technology which promises to become as (or more) powerful, popularand cost-effective than current microarray technology for estimating gene expression, particu-larly at the level of isoforms. Yet, statistical methods for performing such estimation are not welldeveloped. This paper introduces a statistical model for estimating isoform abundance fromUHTS data. The model is analyzed in detail, including derivation of minimal sufficient statisticsand a study of the Fisher information. Further, a computationally feasible implementation ofthe maximum likelihood estimator of the model is provided.

Keywords: Paired end RNA-Seq data analysis; Minimal sufficiency; Isoform abundance esti-mation; Fisher information

1. Introduction

1.1. Biological BackgroundAll cells in an individual mammal have the almost identical DNA. Yet, cell function withinan organism has huge variation. One mechanism that differentiates cell function is its geneexpression pattern. Recent research has shown that this differentiation may be on a finescale: that subtle sequence variants of expressed genes, called isoforms, have significantimpact on the function of the proteins encoded by the RNA and hence their function inthe cell. The purpose of this paper is to develop and analyze statistical methodology formeasuring differential expression of isoforms using an emerging powerful technology calledUltra High Throughput Sequencing (UHTS). Such study has the potential to reveal newinsights into cellular gene expression programs, including characteristics of cell specificspecialization and how cells respond to environmental cues.

The central dogma in biology describes the information transfer that allows cells togenerate proteins, the building blocks of biological function. This dogma states that DNAis transcribed to messenger RNA (mRNA) which is in turn translated into proteins. Recentdiscoveries have highlighted the importance of regulation at the level of mRNA, showingthat protein levels and function can be regulated by subtle differences in the sequence ofmRNAs in a cell.

In bacteria, short DNA sequences are transcribed in a one to one fashion to mRNA.This mRNA is referred to as a gene or a transcript. Like DNA, mRNA are a string of

†These authors contributed equally to this work and are both corresponding athors: [email protected], [email protected].

Page 4: STATISTICAL MODELING OF RNA-SEQ DATA By Julia Salzman … 252.pdf · Ultra High Throughput Sequencing (UHTS or simply ‘sequencing’) is an emerging tech- ... which can take values

2 J. Salzman, H. Jiang and W. H. Wong

nucleotides, each position taking four possible values. Mammalian cells commonly generatea large class of mRNAs from a single relatively short DNA sequence. The set of suchmRNAs are called isoforms of a gene. This paper concentrates on one common mechanismgenerating isoforms called alternative splicing. An example of alternative splicing is depictedin Figure 1: two isoforms can arise from the same gene when the DNA, which is comprisedof three sequence blocks (called exons), can be transcribed into two different mRNAs: oneof which contains all three exons and one of which only contains the first and third exon. Asthis example shows, isoforms typically have highly similar sequence. Despite this sequencesimilarity, isoforms can encode proteins which may have different functional roles. Further,most genes have more than three exons, and alternative use of exons can give rise to largenumbers of isoforms. Thus, it has been historically difficult for technology and statisticalmethods to allow researchers to distinguish between different isoforms of the same gene.

1.2. Technological Background

Ultra High Throughput Sequencing (UHTS or simply ‘sequencing’) is an emerging tech-nology which promises to become as (or more) powerful, popular and cost-effective thancurrent microarray technology for several applications, including isoform estimation. Whenused to study mRNA levels, UHTS is referred to as RNA-Seq. In the past year, studiesusing UHTS to study genome organization, including isoform expression, have been promi-nent (see Pan et al. (2008), Zhang et al. (2009), Wahlstedt et al. (2009), Hansen et al.(2009), Maher et al. (2009)) and featured in the journals Science and Nature (see Sultanet al. (2008), Wang et al. (2008)) , which dubbed 2007 as the “year of sequencing” (see Chi(2008)).

The rapid technological advances in sequencing have spawned a large number of al-gorithms for analyzing sequence data (see Langmead et al. (2009), Trapnell et al. (2009),Trapnell et al. (2009), Mortazavi et al. (2008)), some of which aim to estimate mRNA abun-dance. These advances have out-paced development of statistical methodology for analyzingthe results of isoform specific RNA-Seq experiments. Notable exceptions are the work ofMarioni et al. (2008) and Jiang and Wong (2009). However, researchers now commonlygenerate data which require more complex statistical models than previously developed.The work in this paper significantly extends a model developed in Jiang and Wong (2009)to allow for more accurate inference and investigates its statistical theory in much greaterdetail.

Briefly, UHTS is a method that relies on directly sequencing the nucleotides in a samplerather than inferring abundance of mRNA by measuring intensities using pre-determinedhomologous DNA probes as microarrays do. Thus, the data generated from an UHTSexperiment are large numbers of discrete strings of nucleotides, called base pairs (bp),which can take values of A,C,G or T. The number of sequences generated in an experimentis growing very quickly. In 2008, standard technology produced millions of up to 35bp readsper experiment; in 2009, each experiment produced tens of millions of up to 100bp reads.The throughput of this technology is expected to continue its rapid growth.

Two experimental protocols for RNA-Seq are in common use: a) single end and b) pairedend sequencing experiments. For single end experiments, one end (typically about 50 bp)of a long (typically 200-400 nucleotide) molecule is sequenced. For paired end experiments,typically 50 bp of both ends of a typically 200-400 nucleotide molecule are sequenced. Usingcurrent Illumina technology, a particular experiment produces 8 parallel lanes.

Page 5: STATISTICAL MODELING OF RNA-SEQ DATA By Julia Salzman … 252.pdf · Ultra High Throughput Sequencing (UHTS or simply ‘sequencing’) is an emerging tech- ... which can take values

Statistical Modeling of RNA-Seq Data 3

1.3. Statistical BackgroundAn important application and use of UHTS technology is to quantify the abundance ofmRNA in a cell (RNA-Seq). This is done by matching the sequences generated in an UHTSexperiment to a database of known mRNAs sequences (called alignment) and inferringthe abundance of each mRNA from the number of experimental reads (fragments of theoriginal mRNAs) aligning to it. Sometimes, a statistical model is used for this estimate.Importantly, experimental steps involved in an UHTS experiment can affect the probabilityof each fragment being observed, although modeling of these processes is not the focus ofthis paper.

To date, inference on the abundance of mRNA has been made by aligning reads toknown genes and estimating a gene’s expression by averaging the number of reads which mapuniquely to it using the simplifying assumption that the transcript is sampled uniformly (seeJiang and Wong (2009); Mortazavi et al. (2008)), and sometimes using heuristic approachesto accommodate reads which map to multiple locations (see Mortazavi et al. (2008)). Thesemodels do not provide optimal estimators of isoform-specific expression levels and do notaccommodate modeling of important steps in the experimental procedure.

In this paper, we introduce an estimator of isoform specific expression with larger Fisherinformation than previously proposed estimators. Further, this paper introduces a flexiblemodel which can accommodate models of non-uniform sampling of a particular locationin a transcript. Work presented here also advances statistical modeling and analysis ofcurrent methods for UHTS by providing a method for inference in expression studies whenpaired end reads are used. In addition, it quantifies the advantage of using paired end se-quencing technology compared to single end technology for RNA-Seq. Statistical propertiesof the model such as minimal sufficiency are investigated. Using Fisher information andsimulation, this paper establishes gains in accuracy of estimates of gene expression usingpaired end libraries. To our knowledge, this is the first such methodology and analysis tobe developed.

2. Data Format

Isoforms of a gene are subtle differences in a gene sequence, sometimes resulting frominclusion or exclusion of exon, a discrete piece of sequence depicted in Figure 1. In principle,compared to microarrays, UHTS has the potential to provide high resolution estimatesof isoform use. However, signal deconvolution must take place for these estimates to beaccurate.

In order to estimate the expression of different isoforms of the same gene, several mea-surements of that gene’s expression, whether from a microarray or sequencing, must bedeconvolved– unless the entire gene sequence is observed once. Several studies have inves-tigated this deconvolution problem when measurements are made from a microarray (seeHiller et al. (2009) or She et al. (2009)). This paper presents an estimator for deconvolutionfor ultra high throughput sequencing experiments.

As mentioned, two experimental approaches for RNA-Seq are in wide use, but to ourknowledge, no formal statistical comparison of the approaches exists in the literature. Insingle read experiments, reads are generated from one end of a molecule (depicted schemat-ically in Figure 2); in paired end reads, reads are generated from both ends of a molecule,called an insert, but typically a large number of nucleotides interior to the insert are leftunsequenced (depicted schematically in Figure 3).

Page 6: STATISTICAL MODELING OF RNA-SEQ DATA By Julia Salzman … 252.pdf · Ultra High Throughput Sequencing (UHTS or simply ‘sequencing’) is an emerging tech- ... which can take values

4 J. Salzman, H. Jiang and W. H. Wong

exon 1 exon 2 exon 3

gene

transcription &

splicing

alternatively

spliced mRNA

isoform 1 isoform 2

Fig. 1. A gene (DNA sequence) with three exons. During transcription, two isoforms are generated.The first isoform contains all of the gene’s three exons. The second isoform contains the first andthird exon, skipping the middle exon. This process is called alternative splicing and the middle exonis called an alternatively spliced exon.

To appreciate the additional information provided by the paired end reads, considerFigure 2 which depicts single reads randomly sampled from a transcript of a gene. Supposethere are two possible isoforms for the transcript of this gene depending on whether anexon of length l is retained or skipped. In this case, only the reads that come from thealternatively spliced exon (AS exon), or come from junctions involving either the AS exonor the two neighboring exons, can provide information to distinguish the two isoforms fromeach other, i.e. only these reads are isoform informative. If the AS exon is short compared tothe transcript, then the majority of the single reads contain information only on gene levelexpression but not isoform level expression. Assuming uniform distribution on the reads’positions in the gene, it is evident that a read is related to the AS exon with probability

P =l + r

L − rif the read comes from the AS exon inclusive isoform, where r is the length of

the reads. Thus, P is a strictly increasing function with respect to the read length r as wellas the AS exon length l. As an example, for a typical gene of length 5000bp with a shortAS exon of length 100bp; P = 0.0262 for reads of length 30bp; P = 0.0303 for reads oflength 50bp, and P = 0.0408 for reads of length 100bp.

Currently, technical limitations limit the length of sequenced reads. These limitationsvary by particular platform used for UHTS. The two platforms in widest use are the Illuminaplatform and the ABI SOLiD platform. To date, The longest read that can be sequencedon the Illumina platform is 100bp, and the most reliable read length is still 30-50bp †.

Paired end reads are an attractive way to decouple the isoform specific gene expression.By performing paired end sequencing, instead of using single reads, reads are producedfrom both ends of the fragments, but the interior of the fragment remains unsequenced.This method of sequencing both sides of the fragment increases the number of isoform-informative reads as illustrated in Figure 3. Paired end reads that are mapped to the genes

†The sequencing quality typically deteriorates quickly after the 50th base on the Illumina plat-form. It is about the same for the ABI SOLiD platform, and although for the 454 platform theread length can be as long as 400bp, the throughput is much lower comparing to the other twoplatforms so not considered here.

Page 7: STATISTICAL MODELING OF RNA-SEQ DATA By Julia Salzman … 252.pdf · Ultra High Throughput Sequencing (UHTS or simply ‘sequencing’) is an emerging tech- ... which can take values

Statistical Modeling of RNA-Seq Data 5

chr1

L

gene

lalternatively spliced exon

reads

r

Fig. 2. A genes of length L is shown, which has an AS exon of length l (shown in broken lines).Reads that come from this gene are shown in short thin lines above the gene. Reads that are relatedto the AS exon are shown in red color. In this case only the reads in red are isoform informative.Notice that there is a read that is mapped to the junction that skips the AS exon, which is also isoforminformative.

are shown in short lines above the gene, with read pairs connected by broken lines.As shown in Figure 3, some reads (colored red) are directly informative on the retention

or skipping of the AS exon. In addition, some read pairs span both sides of the AS exon(colored green). For these read pairs, the length of the fragment that they span (i.e. theinsert size) depends on whether the AS exon is used or skipped in the transcript. If thedistribution of the insert size is given, then these read pairs can also provide discriminatoryinformation on the isoforms as shown in Figure 3 and developed rigorously through theinsert length model in Section 3.4.2. For example, suppose the experimental protocol selectsfragments of sizes around 200bp for pair-end sequencing ‡. In such an experiment, if theinsert size between a pair of read is 200bp or 350bp depending on whether a AS exon (ofsize 150bp) is used, then this read is unlikely to have come from a transcript that retainedthe AS exon.

chr1

L

gene

lalternatively spliced exon

paired-end reads

r

Fig. 3. Paired end sequencing provides indirect information to discriminate between isoforms. Agenes of length L is shown, which has an alternatively spliced (AS) exon of length l (shown inbroken lines). Paired end reads that come from this gene are shown in short thin lines above thegene, and the pairing information is shown in thin broken lines with broken lines representing theinferred position of the unsequenced middle nucleotides separating the two paired ends. Reads thatare directly related to the AS exon are in red as before. Reads that provide indirect information forseparating isoform expressions are in green.

It is easy to see from Figure 3 that the fraction of reads that contain information to

‡The insert size can be controlled by tuning the parameters involved in the fragmentation, randompriming and size selection steps in the sample preparation process.

Page 8: STATISTICAL MODELING OF RNA-SEQ DATA By Julia Salzman … 252.pdf · Ultra High Throughput Sequencing (UHTS or simply ‘sequencing’) is an emerging tech- ... which can take values

6 J. Salzman, H. Jiang and W. H. Wong

Table 1. Notation

Symbol Meaning

I Total number of different types of transcripts in the sample.J Total number of different types of reads.θi The abundance of transcript type i, i = 1, . . . , I.θ The isoform abundance vector [θ1, θ2, . . . , θI ].sj Read type j, j = 1, . . . , J .ni,j The number of reads sj that are generated from transcripts i.nj The number of read sj that are generated from all the transcripts,

i.e. nj =PI

i=1ni,j .

ai,j Up to proportionality, the sampling rate of ni,j , i.e., the rate thatread sj is generated from each individual transcript i.

aj The sampling rate vector [a1,j , a2,j , . . . , aI,j ] for read sj .θ · aj The sampling rate of nj , i.e., the rate that read sj is generated

from all the transcripts.

A The J × I matrix of the sampling rates {ai,j}I,Ji=1,j=1

.

ci The copy number of the ith transcript in the sample.

li The length of the ith transcript in the sample.n The total number of reads.

distinguish the two isoforms from each other increases not only with the read length andthe length of the AS exon, but also with the insert size. Since it is possible to have a muchlonger insert size than read length§, a considerable amount of information can be extractedfrom the paired end reads for decoupling the isoform-specific gene expression. This conceptis developed precisely in the following sections.

3. RNA-Seq Model

3.1. NotationThe notation in Table 1 is used to present the statistical model.

3.2. AssumptionsThe following assumptions on the process of UHTS are used in this paper.

(a) The sample contains I types of transcripts e.g., all the isoforms of the genes of interest.The abundances for the transcripts are the parameters of interest and denoted {θi}I

i=1.(b) After sequencing the sample, J different types of reads denoted as {sj}J

j=1 cab beobtained. A type of read refers to single read that is mapped to a specific positionin a transcript in single read sequencing or a pair of reads that are mapped to twospecific positions in paired end sequencing.

(c) Each transcript is independently processed and then sequenced.

§The current technology allows the insert size to be up to 700bp on the Illumina platform and3,000bp on the ABI platform

Page 9: STATISTICAL MODELING OF RNA-SEQ DATA By Julia Salzman … 252.pdf · Ultra High Throughput Sequencing (UHTS or simply ‘sequencing’) is an emerging tech- ... which can take values

Statistical Modeling of RNA-Seq Data 7

(d) ni,j , the number of reads of type sj that are generated from transcript i, are ap-proximated as Poisson with parameter θiai,j , where ai,j is the relative rate that eachindividual transcript i generates read sj , called the sampling rate defined below.

(e) Given {θi}1≤i≤I , {ni,j}1≤i≤I, 1≤j≤J are independent random variables.

If transcript i can not generate read sj , ai,j is set to zero: ai,j = 0. More specifically,for 1 ≤ i1, i2 ≤ I, 1 ≤ j1, j2 ≤ J , assuming none of the aik,jk

for k = 1, 2 are zero, aik,jkare

defined so that

ai1,j1

ai2,j2

=Pr (read sj1 observed after processing one copy of transcript i1)

Pr (read sj2 observed after processing one copy of transcript i2). (1)

Therefore, up to a multiplicative constant, ai,j is the sampling rate of the jth read fromthe ith transcript. This constant is chosen so that the estimates of θi are normalized in orderto be comparable across experiments. Two such choices are described in the Section 3.4.With appropriate choice of aij , the probabilistic interpretation of aij can be maintainedacross different experiments.

3.3. Likelihood FunctionThe challenge of estimating isoform abundance arises from the fact that isoforms of agene have common sequence characteristics. Thus, the ni,j ’s can not be directly observed.Rather, the observed quantities are sequences that are necessarily collapsed over the po-tentially multiple transcripts generating them. The observed quantities in an RNA-Seqexperiment are therefore nj where

nj := n.,j =I

i=1

ni,j ,

denoted as nj for simplicity.

Since {ni,j}1≤i≤I, 1≤j≤J are assumed to be independent, and it is assumed that thenumber of reads of type sj that are generated from transcript i follows a Poisson distribution

with parameter θiai,j , nj follows a Poisson distribution with parameter∑I

i=1 θiai,j = θ ·aj ,where θ is the vector of isoform abundance [θ1, θ2, . . . , θI ] and aj is the vector of samplingrates [a1,j , a2,j , . . . , aI,j ] for read sj , in which there is a component for each isoform.

Under the assumption that each read is independently generated, given {θi}Ii=1, {nj}J

j=1

are independent Poisson random variables, and therefore have the joint probability densityfunction

fθ(n1, n2, . . . , nJ ) =

J∏

j=1

(θ · aj)nj e−θ·aj

nj !. (2)

Note that since E(nj) = θ · aj =∑I

i=1 θiai,j , for all i, j, θ, the density (2) is a curvedexponential family: the natural parameter of the model is in R

J while the underlyingparameter is in R

I with J > I.

Page 10: STATISTICAL MODELING OF RNA-SEQ DATA By Julia Salzman … 252.pdf · Ultra High Throughput Sequencing (UHTS or simply ‘sequencing’) is an emerging tech- ... which can take values

8 J. Salzman, H. Jiang and W. H. Wong

3.4. Statistical Models for the Sampling Rate: aij

This paper focuses on two choices of aij which each give rise to a model and illustratesthe assumptions and interpretation of the resulting {θi}I

i=1 parameters. The first model iscalled the uniform sampling model, and the second is called the insert length model.

While these two models differ by whether insert length is taken into consideration, bothare motivated by the same model of sample preparation below. To facilitate such modeling,the biochemical steps preparation for sequencing that may generate bias in which fragmentsare sequenced are represented schematically as the composition of the following steps:

(a) Transcript fragmentation: each full length mRNA is fragmented at positions accordingto a Poisson process with rate parameter λ.

(b) Size selection: each fragment is selected with some probability depending on only itslength.

(c) Sequence specific amplification or selection: each sequence is amplified or furtherselected based on sequence characteristics.

The sampling rate matrices A for the uniform sampling model and the insert lengthmodel presented below are approximated from the same statistical model for steps (a) and(b) above. Namely, transcript fragmentation (positions where the transcript is cut) is mod-eled as a Poisson point process. Let p(·) denote the probability mass function of fragmentlengths obtained from this process. Note that p(·) is an unobserved quantity because thesample is subject to a size selection step after fragmentation and before sequencing. Thesize selection step is modeled as follows: a length l fragment of transcript is obtained withprobability r(l) independently of the identity of the molecule. r(·) is called the filteringfunction.

While the model in steps (a) and (b) are realistic across experiments, modeling step(c) is more involved and variable across experiments. Modeling how the specific nucleotidesequences affect the probability of being amplified and selected for sequencing varies sig-nificantly by experiment and is beyond the scope of this work. However, it is importantto emphasize that the model presented in this section is flexible enough to account for es-timation of the affect of step (c). Moreover, the model can be adapted to accommodatedifferent model choices in any of steps (a), (b), or (c). In the two models presented below,it is assumed that sequence selection and amplification is uniform.

Modeling the random processes (a) and (b) above as independent and only dependenton a fragment’s length and assuming that sequence selection and amplification is uniformproduces a model for the distribution of fragment lengths in the sample. This distribu-tion is represented by q(·) and can be estimated empirically from a paired end sequencingrun: namely, mapping both pairs from each read to a database and inferring the insertlength. Such an empirical function q(·) is depicted in Figure 5 and represents a reasonableapproximation to the overall distribution of molecule sizes sequenced in an experiment.Further, note that a consequence of the modeling in steps (a), (b) and (c) above producesthe expression q(l) = r(l)p(l).

3.4.1. Uniform sampling model

The uniform sampling model assumes that during the sequencing process, each read (re-garded as a point) is sampled independently and uniformly from every possible nucleotide in

Page 11: STATISTICAL MODELING OF RNA-SEQ DATA By Julia Salzman … 252.pdf · Ultra High Throughput Sequencing (UHTS or simply ‘sequencing’) is an emerging tech- ... which can take values

Statistical Modeling of RNA-Seq Data 9

the biological sample. Uniform sampling is a good approximation to sampling from a Pois-son fragmentation process and subsequent filtering step when the filtering function q(·) hassupport on a set that is small compared to the transcript lengths; under these conditions,the process is approximately stationary.

To investigate if the uniform sampling model satisfactorily approximates the Poissonfragmentation and filtering above for numerical regimes of transcript length and fragmenta-tion rate encountered in sequencing, the following three simulations were performed: readsare generated from 10, 100 or 1000 copies of a transcript of length 1,500bp with λ = 0.01.All the fragments of length 100± 20bp are then compared to the sampled read positions asmodeled by the uniform sampling model (see Figure 4). It can be seen that as the samplesize increases, the two models are very similar except at the two ends of the transcript. Atthe two ends the Poisson process has some boundary effects, and the sequencing protocolcannot be explained by a simple model. For most situations, these effects will be small, andhence are ignored in the uniform sampling model.

Thus, the uniform sampling model is appropriate for sequencing single short reads wherethe sequencing process can be regarded as a simple random sampling process, during whicheach read (regarded as a point) is sampled independently and uniformly from every possiblenucleotide in the sample. The assumption of uniformity implies that a constant samplingrate for all ai,j > 0 is used. Specifically, let ai,j = 0 if transcript i can not generate read sj ,and otherwise, ai,j = n, where n is the total number of reads. As seen below, n serves as anormalization factor.

To motivate this choice of ai,j , consider the interpretation of {θi}Ii=1 induced by A. Un-

der the uniform model, the (unobserved) counts from the jth nucleotide which is generatedfrom the ith transcript are modeled as

ni,j = Po(aijθi).

Computing E (nij) using the uniform sampling model with n total reads,

E (nij) = nPr ( jth nucleotide generated by transcript i)

= nci

i lici

,

where li is the length of the ith transcript and ci is the copy number of the ith transcriptin the sample. Thus, setting aij = n iff transcript i can generate read j produces theidentity

nθi =nci

i cili

so the uniform sampling model has parameter

θi =ci

i lici

.

This choice of A has the property that it normalizes {θi}Ii=1 so that

i

θili = 1,

Page 12: STATISTICAL MODELING OF RNA-SEQ DATA By Julia Salzman … 252.pdf · Ultra High Throughput Sequencing (UHTS or simply ‘sequencing’) is an emerging tech- ... which can take values

10 J. Salzman, H. Jiang and W. H. Wong

0 500 1000 1500

050

010

0015

00

Theoretical Uniform Quantiles

Sam

pled

Rea

d P

ositi

ons

0 500 1000 1500

050

010

0015

00

Theoretical Uniform Quantiles

Sam

pled

Rea

d P

ositi

ons

(a) (b)

0 500 1000 1500

050

010

0015

00

Theoretical Uniform Quantiles

Sam

pled

Rea

d P

ositi

ons

(c)

Fig. 4. Uniform Q-Q plot with sampled read positions. (a), (b) and (c) are generated by simulationswith 10, 100 and 1000 copies of transcripts, respectively.

that is, it normalizes θi as a fraction of the total nucleotides sequenced, as shown in Jiangand Wong (2009), making it conceptually compatible with the RPKM (Reads Per Kilobaseof exon model per Million mapped reads) normalization scheme in Mortazavi et al. (2008),which is widely used by the RNA-Seq community.

3.4.2. Insert length model

In paired end sequencing, the insert length is usually controlled to have a small range.Therefore, as suggested in Figure 3, besides read positions, information can also be extractedfrom insert lengths inferred from reads. By modeling insert lengths properly, this piece ofinformation can be utilized and statistical inference can be improved. Example 1 below

Page 13: STATISTICAL MODELING OF RNA-SEQ DATA By Julia Salzman … 252.pdf · Ultra High Throughput Sequencing (UHTS or simply ‘sequencing’) is an emerging tech- ... which can take values

Statistical Modeling of RNA-Seq Data 11

illustrates this concept and Section 6 quantifies the gain in statistical efficiency using theparing information.

The insert length model models the sampling of transcripts, conditional on insert length,as uniform. The insert length model sets each ai,j using the empirical distribution of theinsert lengths of the sample (see Figure 5) such that conditional on the insert length, readsare sampled from the transcriptome uniformly. This is specified mathematically as

aij = q(lij)n (3)

where lij is the length of corresponding fragment of sj on the ith transcript, n is the totalnumber of read counts and q(l) is the probability of a fragment of length l in the sample afterthe filtering. In application, for the insert length model, q(·) is taken as q(·), the empiricalprobability mass function computed from all the mapped read pairs. Empirical evidencesuggests this function is unimodal. A typical mass function is illustrated in Figure 5.

0 50 100 150 200 250 300 350 400 450 5000

0.002

0.004

0.006

0.008

0.01

0.012

0.014

0.016

0.018

insert length

freq

uenc

y

Fig. 5. A typical empirical mass function of the insert length.

To see the relationship between this choice of sampling rate matrix and a model wherereads are subject to Poisson fragmentation and length dependent filtering, suppose thatpaired end read sj is mapped to transcript 1 at coordinates (x1, y1) and transcript 2 atcoordinates (x2, y2) and both reads are in the forward direction. Then, assuming none ofx1, x2, y1, y2 is at the boundary of a transcript, under the Poisson fragmentation model (a)and length dependent size selection (b),

Page 14: STATISTICAL MODELING OF RNA-SEQ DATA By Julia Salzman … 252.pdf · Ultra High Throughput Sequencing (UHTS or simply ‘sequencing’) is an emerging tech- ... which can take values

12 J. Salzman, H. Jiang and W. H. Wong

Pr(read sj observed after processing one copy of transcript i1)

Pr(read sj observed after processing one copy of transcript i2)

=Pr(cut at x1, y1, no intermediate cut, and transcript of length x1 − y1 retained)

Pr(cut at x2, y2, no intermediate cut, and transcript of length x2 − y2 retained)

=Pr(tr. of length x1 − y1 retained | cut at x1, y1, no int. cut) Pr(cut at x1, y1, no int. cut)

Pr(tr. of length x2 − y2 retained | cut at x2, y2, no int. cut) Pr(cut at x2, y2, no int. cut)

=r(|x1 − y1|)r(|x2 − y2|)

p(|x1 − y1|)p(|x2 − y2|)

=q(|x1 − y1|)q(|x2 − y2|)

Thus, the ratioai1,j

ai2,j

is approximately the same as defined by the sampling rate matrix A for the insert lengthmodel, with the assumption that none of x1, x2, y1 or y2 is on the boundary of the transcript.As long as the insert length distribution has support which is small compared to transcriptlength, relatively few transcripts map exactly to the boundary, and little data is lost byignoring them; doing so allows the above conditions to be satisfied. Further, the argumentabove shows that the insert length model is consistent with assumptions (a), (b) and (c) ofthe sample preparation.

The insert length model yields a similar interpretation for the normalization of {θi}Ii=1

as in the uniform sampling model, illustrated in following computation: The paired endread model specifies that the reads of type j from transcript i are Poisson with parameter

nij = Po(aijθi).

The insert length model assumes that reads are filtered based on length independent oftheir sequence. This produces a method of estimating the expectation of nij . The followingapproximates E (nij) under the insert length model:

E (nij) = nPr(read j observed after processing one copy transcript i )

:= nPr(A ∩ B ∩ C)

where A,B and C are defined as follows. Let Y be a random variable representing a readin the sample after fragmentation. Let A be the event that Y is a fragment of transcript i,B the event that Y is read j of transcript i and C the event that Y is a fragment of lengthlij and is observed after filtering. Using product rule,

Pr(A ∩ B ∩ C) = Pr(B|A ∩ C)Pr(C|A)Pr(A).

Each term is analyzed separately. Assuming uniform fragmentation across the transcriptand length dependent filtering,

Page 15: STATISTICAL MODELING OF RNA-SEQ DATA By Julia Salzman … 252.pdf · Ultra High Throughput Sequencing (UHTS or simply ‘sequencing’) is an emerging tech- ... which can take values

Statistical Modeling of RNA-Seq Data 13

Pr(B|A ∩ C) =1

li − lij.

The basic assumption of the insert length model is that the probability of observinga transcript of length lij does not depend on the transcript and is equal to the empiricalinsert length, q(lij), hence

Pr(C|A)·= q(lij).

To estimate Pr(A), consider the random variables Xi, the number of fragments in thesample from transcript i and X, the total number of transcript fragments in the sample.Then, assuming transcript i is sufficiently and not overly abundant in the sample,

Pr(A) = E

(

Xi

X

)

·=

E (Xi)

E (X).

Assuming a Poisson fragmentation model, up to a boundary effect which has smallimpact on the approximation,

E (Xi)

E (X)

·=

cili∑

i cili.

Combining these approximations yields

E (nij)·= nq(lij)

1

li − lij

cili∑

i cili.

Thus, if lili−lij

is close to 1, θi is identified in this model as

θi·=

ci∑

i cili.

Thus, in both models, the choice of aij is consistent with its definition in Equation (1).To illustrate the difference between the insert length and uniform sampling models, considerthe following example:

Example 1. Consider a case of two isoforms labeled 1 and 2 with an alternative includedexon as in Figure 1. Suppose the middle exon 2 has length 50. Suppose read sj has animputed length of 50 when mapped to 2 and of 100 when mapped to 1 as will be the case ifone of the reads is in exon 1 and one in exon 3. Suppose the empirical insert length functionis modelled as uniform [60, 140]. Then, in the uniform model,

a1j = a2j = n

whereas in the insert length model,

a1j =n

80and a2j = 0.

Page 16: STATISTICAL MODELING OF RNA-SEQ DATA By Julia Salzman … 252.pdf · Ultra High Throughput Sequencing (UHTS or simply ‘sequencing’) is an emerging tech- ... which can take values

14 J. Salzman, H. Jiang and W. H. Wong

3.5. Maximum Likelihood EstimationIn this paper, θ is estimated using the MLE. Standard theory shows that the MLE ofmodel (2) will be asymptotically unbiased and consistent provided the parameters in themodel as in the interior of the parameter space (see Theorem 6.3.10 of Lehmann (1998)).Computationally efficient procedures are needed to solve for these estimates in practice.

The fact that the distribution of minimum sufficient statistics are Poisson allows for asimplification of the calculation of the MLE by regarding the distribution of the minimumsufficient statistics as a generalized linear model (GLM) problem with Poisson density andidentity link function (see McCullagh and Nelder (1989)) with extra linear constraints thatrequire all the parameters {θi}I

i=1 to be non-negative. The optimization problem in matrixform is

maximize nT log(Aθ) − sum(Aθ)s.t. θ ≥ 0

(4)

where n is a J × 1 column vector for the observed read counts [n1, n2, . . . , nJ ], A is

a J × I matrix for the sampling rates {ai,j}I,Ji=1,j=1 and θ is the I × 1 isoform abundance

vector [θ1, θ2, . . . , θI ]. log(·) takes logarithm over each element of a vector and sum(·) takessummation over all the elements of a vector.

As shown in Jiang and Wong (2009), the log-likelihood function

log(L(θ)) = log(fθ(n1, n2, . . . , nJ ))

is always concave and therefore any linear constraint convex optimization method can beused to solve this non-negative GLM problem¶.

4. Sufficiency and Minimal Sufficiency

Because J is usually very large, it is extremely inefficient to work with the statistics {ni}Ji=1

in (2) directly: in single read sequencing of a human or mouse cell, J can exceed 2K fora typical gene, and in paired end sequencing with variable insert length it can easily reach100K. For computational purposes, it is therefore necessary to use sufficient statistics forthe likelihood function (2). Because these statistics have an intuitive interpretation, theyare referred to as a collapsing. This section analyzes sufficiency and minimal sufficiency inmodel (2) and its relation to collapsing.

4.1. Sufficient Statistics and CollapsingAs will be shown below, sufficient statistics have a natural interpretation as collapsing readcounts. Proposition 2 shows that to group reads j and k into the same category, it issufficient that reads have the same normalized sampling rate vector (i.e.,

aj

||aj ||=

ak

||ak||,

where || · || is the vector 2-norm).Such grouping of reads will be called (maximal) collapsings: reads with the same normal-

ized sampling rate vector are grouped together. Intuitively, a maximal collapsing reducesthe number of such groups to be as small as possible.

¶The experiments in this paper use a simple coordinate-wise hill climbing method.

Page 17: STATISTICAL MODELING OF RNA-SEQ DATA By Julia Salzman … 252.pdf · Ultra High Throughput Sequencing (UHTS or simply ‘sequencing’) is an emerging tech- ... which can take values

Statistical Modeling of RNA-Seq Data 15

Definition 1. Let Ck be a collection of mk reads so

Ck = {sj1 , . . . , sjmk}1≤j1<j2<...<jmk

≤J

A set C = {Ck}Kk=1 is called a collapsing, if for any Ck ∈ C and any sj1 , sj2 ∈ Ck,

aj1 = caj2

for some positive number c.Furthermore, if for any k1 6= k2 and any sj1 ∈ Ck1

, sj2 ∈ Ck2,

aj1 6= caj2

for any positive number c, {Ck}Kk=1 is called a maximal collapsing. In a collapsing, each Ck

is called a category.

As will be seen in Theorem 3, the maximal collapsing gives rise to a set of minimalsufficient statistics making it useful from a computational perspective. A real data exampleof such a collapsing is provided in Section 5. The collapsed read counts also has a standardstatistical interpretation as the sum of independent Poisson random variables. Supposecategories

{Ck | k = 1, 2, . . . ,K} with Ck ⊆ {s1, s2, . . . , sJ}are non-overlapping, i.e., Ck1

∩ Ck2= ∅ when k1 6= k2. Then, assuming each nj follows a

Poisson distribution with parameter θ · aj , nCk, the number of observed reads that belong

to category Ck (i.e., nCk=

sj∈Cknj) follows a Poisson distribution with parameter

a(k) · θ :=

mk∑

j=1

θ · a(k)j

where for 1 ≤ j ≤ mk, a(k)j is the sampling rate vector of the jth read in category k.

Proposition 1. The maximal collapsing is unique.

Proof. The relation satisfied by two types of reads in a category in the maximal collaps-ing is an equivalence relation. This makes the maximal collapsing a grouping of reads intoequivalence classes which is always uniquely determined. To show a relation is a equivalencerelation, it suffices to show that the reflexivity, symmetry and transitivity hold.

Reflexivity: For any sj ,aj = aj ,

i.e., sj ∼ sj .Symmetry: For any sj and sk,

aj = cak ⇒ ak =1

caj ,

i.e., sj ∼ sk ⇒ sk ∼ sj .Transitivity: For any sj , sk and sl,

aj = c1ak

Page 18: STATISTICAL MODELING OF RNA-SEQ DATA By Julia Salzman … 252.pdf · Ultra High Throughput Sequencing (UHTS or simply ‘sequencing’) is an emerging tech- ... which can take values

16 J. Salzman, H. Jiang and W. H. Wong

andak = c2al ⇒ aj = c1c2al,

i.e., sj ∼ sk and sk ∼ sl ⇒ sj ∼ sl.

To illustrate how maximal collapsing can be derived from the choice of aij in the uniformmodel to produce the maximal collapsing, reads with the same normalized sampling ratevector are grouped into one category. Because ai,j is either 0 or n, two reads sj1 and sj2

will have the same normalized sampling rate vector, i.e., aj1/||aj1 || = aj2/||aj2 ||, if and onlyif they can be generated by the same set of transcripts.

Example 2. Consider a continuation of the setup in Example 1. Suppose a uniformsampling model and suppose reads s1 and s2 can be generated by both transcripts 1 and 2,whereas read s3 can only be generated by transcript 1. Then

a1 = a2 = [n, n]

anda3 = [n, 0].

Grouping s1 and s2 together produces the maximal collapsing C = {{s1, s2}, {s3}}; thefirst category containing reads that can be produced by both transcripts and the second cat-egory containing reads only generated by transcript 1.

4.1.1. Collapsing and Sufficiency

Analysis of the likelihood function (2) shows that collapsing the reads produces sufficientstatistics and maximal collapsings are equivalent to minimal sufficient statistics.

Recall that a statistic T (X) is sufficient for the parameter θ in a model with likelihoodfunction fθ(x) if

fθ(X) = h(x)gθ(T (X)).

It is easy to show that the observed count vector n = [n1, n2, . . . , nJ ] is sufficient for θ.

Proposition 2. For any collapsing C = [C1, C2, . . . , CK ], the observed read count vec-tor nC = [nC1

, nC2, . . . , nCK

] is a sufficient statistic for θ.

Proof. From the definition of collapsing, consider the kth category Ck with the re-

enumerated reads {s(k)j }1≤k≤K,1≤j≤mk

, the reads in category k are enumerated

Ck = {s(k)1 , s

(k)2 , . . . , s(k)

mk}.

Define a(k)j to be the sampling rate vector for s

(k)j , 1 ≤ j ≤ mk. By definition, for all

1 ≤ j ≤ mk, for some scalar c(k)j > 0,

a(k)j = c

(k)j a

(k)1 .

Therefore,

θ · a(k)j = c

(k)j θ · a(k)

1 .

Page 19: STATISTICAL MODELING OF RNA-SEQ DATA By Julia Salzman … 252.pdf · Ultra High Throughput Sequencing (UHTS or simply ‘sequencing’) is an emerging tech- ... which can take values

Statistical Modeling of RNA-Seq Data 17

Rearranging the product in the RHS of equation (2) as a product over each read by the

category into which it falls, and denoting the ith read ni and parameter θ · ai as x(k)j with

parameter θ · a(k)j when it falls as the jth enumerated read in the kth category,

fθ(n1, n2, . . . , nJ ) =

J∏

i=1

(θ · ai)nie−θ·ai

ni!

=

K∏

k=1

mk∏

j=1

(θ · a(k)j )x

(k)j e−θ·a

(k)j

x(k)j !

=

K∏

k=1

mk∏

j=1

(c(k)j θ · a(k)

1 )x(k)j e−θ·a

(k)j

x(k)j !

=

K∏

k=1

(θ · a(k)1 )

Pmkj=1 x

(k)j e−

Pmkj=1 θ·a

(k)j

mk∏

j=1

(c(k)j )x

(k)j

x(k)j !

= h(n1, n2, . . . , nJ )gθ(nC1, nC2

, . . . , nCK)

(5)

where, since {ni}Ji=1 = {x(k)

j }1≤j≤mk, 1≤k≤K ,

h(n1, n2, . . . , nJ ) =

K∏

k=1

mk∏

j=1

(c(k)j )x

(k)j

x(k)j !

and

gθ(nC1, nC2

, . . . , nCK) =

K∏

k=1

(θ · a(k)1 )nCk e−θ·a(k)

establishing the sufficiency of nC = [nC1, nC2

, . . . , nCK].

In addition to the sufficiency proved in Proposition 2, nC is minimal sufficient if thecorresponding collapsing C is a maximal collapsing. This is detailed in the next section.

4.2. Minimal SufficiencyTo prove that the read counts derived from a maximal collapsing are minimal sufficientstatistics, recall

Definition 2 (Definition 6.2.13 of Casella and Berger (2002)). for the fam-ily of densities fθ(·), the statistic T (X) is minimal sufficient if and only if

fθ(x)

fθ(y)does not depend on θ ⇔ T (X) = T (Y )

Theorem 3. In the likelihood specified by Equation (2), counts on maximally collapsedcategories are minimal sufficient statistics.

Page 20: STATISTICAL MODELING OF RNA-SEQ DATA By Julia Salzman … 252.pdf · Ultra High Throughput Sequencing (UHTS or simply ‘sequencing’) is an emerging tech- ... which can take values

18 J. Salzman, H. Jiang and W. H. Wong

Proof (Proof of Theorem 3). Let T (X) be the collapsed vector of counts xC1, xC2

, . . . , xCK

and let T (Y ) be the vector of counts yC1, yC2

, . . . , yCKeach of which are maximal collaps-

ings. If T (X) = T (Y ), Equation (5) shows that the ratio of densities

fθ(x)

fθ(y)

does not depend on θ. To show the reverse implication, suppose T (X) 6= T (Y ). To showthat

fθ(x)

fθ(y)does not depend on θ,

it suffices to show thatgθ(x)

gθ(y)does not depend on θ.

It is possible to simplify this ratio as

gθ(x)

gθ(y)=

g∈G(θ · ag)ng

h∈H(θ · ah)nh. (6)

where {ng}g∈G and {nh}h∈H are positive numbers and G and H are subsets of thecategories and are disjoint since if G and H share a common j, the ratio in Equation (6)can be reduced. Further, since the collapsings are maximal, for any ai, aj appearing inany product in the numerator or denominator, there is no c so that ai = caj . Using theseproperties, it will be shown that the ratio of densities does depend on θ by contradiction.

Suppose for some (now fixed) T (X) 6= T (Y ), Equation (6) does not depend on θ and isequal to a constant c. Note that since θ can be the vector of all 1s, if Equation (6) does notdepend on θ, c > 0 as when θ is the vector of all 1′s both the numerator and denominatorof Equation (6) are positive. Then Equation (6) is equivalent to a polynomial equation

0 = c∏

h∈H

(θ · ah)nh −∏

g∈G

(θ · ag)ng (7)

∀θ ∈ (R+)I . By basic algebraic geometry, any polynomial in θ which is identically zeroin the space (R+)I is identically zero in all of R

I . Therefore, the last step is to show thatthe right hand side of Equation (7) is not actually zero for some θ ∈ R

I . To proceed, fixh ∈ H. The claim is that there exists v ∈ R

I with 〈v, ah〉 = 0 but ∀g ∈ G,

〈v, ag〉 6= 0.

This v will be be the choice of θ producing the contradiction. For a vector z ∈ RI , let

z⊥ denote the I − 1 dimensional subspace of vectors orthogonal to it. Then, to finish theproof, it suffices to showing that there is some vector in a⊥

h which is not in ∪g∈Ga⊥g . It is

equivalent to show there is a strict containment

(∪g∈Ga⊥g ) ∩ a⊥

h = ∪g∈G(a⊥g ∩ a⊥

h ) ⊂ a⊥h .

Strict containment follows since for any h ∈ H,

a⊥h ∩ a⊥

g

Page 21: STATISTICAL MODELING OF RNA-SEQ DATA By Julia Salzman … 252.pdf · Ultra High Throughput Sequencing (UHTS or simply ‘sequencing’) is an emerging tech- ... which can take values

Statistical Modeling of RNA-Seq Data 19

is a subspace of dimension at most I − 2, thus a countable union of such spaces cannotequal a subspace of dimension I − 1.

The next section illustrates the relationship of minimal sufficient statistics to raw dataobserved in sequencing experiments.

5. Application

This section illustrates how minimal sufficient statistics are calculated in an example withreal RNA-Seq data from an experiment on cultured mouse B cells. After the sequencingreads were generated, they were mapped to a database of known mouse mRNA transcriptsusing the RefSeq annotation database (see Pruitt et al. (2005)) and the mouse referencegenome (mm9, NCBI Build 37). The reads were mapped using SeqMap, a short sequencemapping tool developed in Jiang and Wong (2008). A total of 2,789,546 read pairs (32bpfor each end) mapped to the reference transcript sequences. The empirical distribution ofthe insert length was inferred. This distribution has a mean of 251bp and a single mode of234bp (See Figure 5).

Because more than 99% of the data has an inferred length between 73 bp and 324 bp,reads outside of this range are not considered in subsequent analysis for this example asit is likely these reads come from unannoted isoforms. This resulted in 27,118 (about 1%)read pairs being excluded from a total of 2,762,428 (denoted as n below) read pairs used inthe computation.

The mouse gene Rnpep is used to demonstrate the computation of minimal sufficientstatistics. Rnpep has an alternatively spliced exon which gives rise to two different isoforms(see Figure 6). The gene itself is an amino peptidase, meaning that it is used to degradeproteins in the cell. After mapping, 116 read pairs were assigned to this gene, out of which113 read pairs were used in the computation after outlier removal. Figure 6 presents thepositions where the reads are mapped. The gene was picked because it has two alterna-tively spliced isoforms with a structure that makes distinguishing reads from each isoformchallenging, and because the number of reads was small enough to visualize all of them ina simple figure.

5.1. Uniform Sampling ModelAny paired end read experiment can be treated as a single read experiment by takingeach paired end read and treating it as two distinct single end reads, one from each sideof the pair. In this, the 113 paired end reads become 226 single reads (without pairinginformation).

In the uniform sampling model, for either isoform, the sampling rate vector for eachread sj can take at most two values: 2n when the isoform can generate read j and 0 whenit cannot. Because there are only two isoforms, one of which (isoform 2) excludes one ofthe exons of the other (isoform 1), it is evident that in the uniform sampling model, thereare only three categories for the two isoforms.

The total length of isoform 1 is 2,300. The total length of isoform 2 is 2,183. Hence,computing aij by summing over the sampling rate vectors of the reads in the same category,the three categories can be represented by their sampling vectors: [4242n, 4242n], [296n, 0],[0, 62n]. Using minimal sufficient statistics reduces the the data from a vector representing

Page 22: STATISTICAL MODELING OF RNA-SEQ DATA By Julia Salzman … 252.pdf · Ultra High Throughput Sequencing (UHTS or simply ‘sequencing’) is an emerging tech- ... which can take values

20 J. Salzman, H. Jiang and W. H. Wong

Fig. 6. Visualization of RNA-Seq read pairs mapped to the mouse gene Rnpep in CisGenomeBrowser (see (Ji et al., 2008)). From top to bottom: genomic coordinates, gene structure whereexons are magnified for better visualization, read pairs mapped to the gene. Reads are 32bp at eachends. A read that spans a junction between two exons is represented by a wider box.

Page 23: STATISTICAL MODELING OF RNA-SEQ DATA By Julia Salzman … 252.pdf · Ultra High Throughput Sequencing (UHTS or simply ‘sequencing’) is an emerging tech- ... which can take values

Statistical Modeling of RNA-Seq Data 21

Table 2. Single read categories for Rnpep

Category ID Sampling rate vector Read count

1 [4242n, 4242n] 2162 [296n, 0] 103 [0, 62n] 0

counts on the 2, 300 possible reads sj from the two isoforms to the 3 minimal sufficientstatistics which are counts on these categories.

The three categories representing minimal sufficient statistics are tabulated in Table 2.Each category refers to a group of reads that are generated by a particular set of isoforms.For example, category 1 consists of reads generated by both isoforms and category 3 con-sists of reads generated by isoform 2 only. Using these statistics to solve the optimizationproblem (4), the MLE for the two isoforms is [θ1, θ2] = [15.47, 2.70]‡. Bayesian confidenceintervals for these estimates can be obtained by sampling from the posterior space of theparameters (as outlined in Jiang and Wong (2009)), the marginal 95% probability intervalsfor θ1 and θ2, are (7.89, 18.81) and (0.25, 10.83) respectively.

5.2. Insert Length ModelTo visualize how the insert length model can be used to produce potentially stronger sta-tistical inference as compared to the uniform sampling model, consider Figure 6. Eachpaired end read is depicted by two boxes with arrows joining pairs of reads. The directionof the arrows represent which side of the read was sequenced first. For those interested,the direction of the arrows in the Rnpep gene itself indicates the transcriptional directionof the gene in genomic coordinates, although this concept can be ignored for the purposeshere. Note that there is no direct evidence that isoform 2 is present in the sample, as noread crosses the junction between the two exons which are adjacent in isoform 2 but not inisoform 1. There is direct evidence of the presence of isoform 1, for example, as depictedin the fifth read from the left in the first row which directly crosses a junction between twoexons only adjacent in isoform 1.

Because of the small gap between exons in the figure, reads spanning exons will beslightly longer than reads not spanning exons. Also, some inserts are very short, and absenceof the arrow connecting two reads indicates that the entire insert has been fully sequenced.Note that several of the reads spanning the alternatively spliced exon are exceedingly long.This suggests that such reads are actually generated from isoform 2 rather than isoform 1.If such reads are generated from isoform 2, they would likely have a smaller insert lengththan the inferred insert length when generated by isoform 1, which are the lengths depictedin the figure. Because the empirical insert length distribution has its only mode near 250bp, conditional on observing the 6th and 7th reads from the top of the figure spanning thealternatively spliced exon, the read is more likely to come from isoform 2. Thus, there isindirect evidence of the presence of isoform 2 in the sample.

Such indirect evidence is utilized by the insert length model; the model produces quan-titative estimates of the relative abundance of the two isoforms. As will be seen in the next

‡All the expression estimates in this paper are in units compatible with RPKM(Reads Per Kilo-base of exon model per Million mapped reads) (see Mortazavi et al. (2008))

Page 24: STATISTICAL MODELING OF RNA-SEQ DATA By Julia Salzman … 252.pdf · Ultra High Throughput Sequencing (UHTS or simply ‘sequencing’) is an emerging tech- ... which can take values

22 J. Salzman, H. Jiang and W. H. Wong

Table 3. Paired read categories for Rnpep

Category ID Sampling rate vector Read count

1 [1681.82n, 1681.82n] 952 [294.60n, 0] 103 [0, 245.80n] 2...

......

138 [0.0057n, 0.0018n] 0

section, the abundance estimates from the insert length model have larger Fisher informa-tion than the estimates from the uniform sampling model.

In the insert length model, each of the possible insert lengths where q(·) has supportproduce a unique read sj yeilding a total of 569, 205 possible reads from the two isoforms.The maximal collapsing produces a total of 138 categories some of which are representedin Table 3. For intuition, all of the reads with a fixed insert length where both ends fall inthe left most 7 or right most 3 exons of Rnpep will be in the same category as they havethe same probability of being sampled.

Using the minimal sufficient statistics, the MLE is computed to be [θ1, θ2] = [16.73, 3.43].The marginal 95% probability intervals for θ1 and θ2 are (11.22, 21.02) and (1.03, 9.29)respectively. The computed marginal 95% probability intervals for θ1 and θ2 are non-overlapping, whereas in the single read model, one cannot conclude that the expressionof isoform 1 and 2 differ. Further, the insert length model has slightly smaller marginalconfidence intervals for each parameter.

This example suggests that the insert length model provides estimates with smallerstandard errors than those generated by the uniform sampling model. This difference canbe quantified by analyzing the Fisher information of each model, the subject of the nextsection.

6. Information Theoretic Analysis

Many considerations impact the choice of sequencing protocol in an experimental design.One such choice is relative cost of sequencing. In this case, the experimentalist may beinterested in choosing the sequencing protocol (paired or single read) that provides the bestestimate of isoform abundance at least relative cost. This section outlines the statisticalargument for why, in typical situations, paired end sequencing can produce better estimatesof transcript abundance compared to single read sequencing at a fixed number of sequencednucleotides (cost). The theoretical analysis aims to show that for the same number ofsequenced nucleotides, the Fisher information in the insert length model is more than doublethe Fisher information in the single read model. Since estimates in RNA-Seq are maximumlikelihood estimators, their variance of the estimator converges to the reciprocal of theFisher information. Thus, larger Fisher information produces estimators with improvedaccuracy.

Page 25: STATISTICAL MODELING OF RNA-SEQ DATA By Julia Salzman … 252.pdf · Ultra High Throughput Sequencing (UHTS or simply ‘sequencing’) is an emerging tech- ... which can take values

Statistical Modeling of RNA-Seq Data 23

6.1. Theoretical AnalysisConsider the following quite simple example showing the increase in information as thefraction of reads unique to each isoform grows:

Example 3. Continuing Example 1, suppose that isoform 1 and isoform 2 have Poissonrate parameters θ1 and θ2 respectively where θ2 = 1 − θ1 and probability 0 < α, β < 1respectively of producing a read read unique to the isoform. Let n1 be the reads unique to 1,n2 the reads unique to 2 and n3 the reads which cannot be distinguished between the isoforms.Assume there are n total reads in the sample, and assume there is uniform fragmentationwhich gives rise to three categories.

n1 = Po(n αθ1)

n2 = Po(n βθ2)

n3 = Po (n ((1 − α)θ1 + (1 − β)θ2))

Fix α < β as known and compute the information in this distribution with θ1 as theunknown parameter as a function of α using the definition that the information is equal tothe variance of the derivative of the log likelihood with respect to θ1:

I(θ1) = var(n1

θ1− n2

θ2+

n3(α − β)

θ1(α − β) + β)

= n

(

α

θ1+

β

θ2+

(α − β)

θ1(α − β) + β

)

where x = 1 − x. Thus, α − β = β − α, and δ := β − α and α are re-parameterizationsof β, α. The last equation shows that the partial derivatives of the information with respectto α and with respect to δ are positive.

Note that in the example above, no generality is lost by assuming β > α since θ1 andθ2 can be interchanged with no effect on the model.

To see that for a fixed cost of sequencing (number of sequenced nucleotides), the statis-tical model produced by paired end sequencing has more information than single read, it isnecessary to show that the information obtained by twice as many single read sequencingexperiments is smaller than that obtained by a paired end sequencing experiment. Sucha comparison necessarily depends on each gene, its isoforms and their relative abundance.The computation of the Fisher information for a typical such example is presented below,and the computation shows that the examples easily generalizes to other configurations ofisoforms.

Example 4. Continuing the running example, consider reads of length 30 bp and pairedend distance between reads (non sequenced nucleotides) of a length of 200 ± 20 bp in theschematic of three exons in Figure 1 where the length of exons 1 and 3 is 500 and exon 2 is50.

For a single read experiment, αs is the probability that a read includes any part of theincluded exon (i.e. uniquely identifies isoform 2) so for the read length of r = 30,

αs =r − 1 + 50

1050 − r + 1

Page 26: STATISTICAL MODELING OF RNA-SEQ DATA By Julia Salzman … 252.pdf · Ultra High Throughput Sequencing (UHTS or simply ‘sequencing’) is an emerging tech- ... which can take values

24 J. Salzman, H. Jiang and W. H. Wong

Isoform 1

Isoform 2

exon 1 exon 2 exon 3

500bp 500bp50bp

Model gene structure

Fig. 7. A model gene for study of Fisher information and accuracy of the single end and insert lengthmodels.

and βs is the probability that a read includes any part of the spliced junction (i.e. uniquelyidentifies isoform 1) so

βs =r − 1

1000 − r + 1.

For a paired end read experiment, with x the insert length, αp, the probability that a readuniquely identifies the second isoform is

αp =50 + x − 1

1050 − x + 1

and βp, the probability that a read uniquely identifies the first isoform is

βp =x − 1

1000 − x + 1.

For a concrete example, suppose θ1 = 2θ2. Assume further that there are twice as manysingle reads (a sample size of 2n) compared to the n reads in a paired-end run:

Is := 2n

(

3

2αs + 3βs +

(αs − βs)23 (αs − βs) + βs

)

and the information in a paired end run for a fixed insert size is:

Ip := n

(

3

2αp + 3βp +

(αp − βp)23 (αp − βp) + βp

)

Assuming the insert length distribution is uniform and symmetric around an insertlength of 200, the effective αp and βp for the model are the parameters when evaluatedat x = 200. Taking the ratios of these quantities and plugging in,

Is

Ip

=.31

1.12= .27.

In other words, in the insert length model, the maximum likelihood estimator of θ1 hasasymptotic variance roughly 3 times larger in the single read experiment than in the pairedend experiment.

The next section gives simulation results for a related example.

Page 27: STATISTICAL MODELING OF RNA-SEQ DATA By Julia Salzman … 252.pdf · Ultra High Throughput Sequencing (UHTS or simply ‘sequencing’) is an emerging tech- ... which can take values

Statistical Modeling of RNA-Seq Data 25

6.2. Simulation studySimulations were used to study the following questions: 1) the quality of the proposedmodel at estimating isoform-specific gene expression, especially when the insert length isvariable and 2) whether abundance estimates from paired end reads are more reliable thanabundance estimates from single end reads.

To address these questions, reads were simulated from a “hard case” where a gene hasthree exons of lengths 500bp, 50bp and 500bp, respectively (See Figure 7); the middleexon can be skipped, producing two different isoforms of the gene. Since the middle exon isshort, this case has been shown to be difficult for isoform-specific gene expression estimationin Jiang and Wong (2009).

In the simulation, the two isoforms were assumed to have equal abundance. Readswere simulated using different models and parameters described in detail in below andestimate isoform abundances as described in Section 3.5. The relative error of estimationwas computed based on the empirical relative L2 loss:

||θ − θ||2||θ||2

=

(θ1 − θ1)2 + (θ2 − θ2)2√

θ21 + θ2

2

=

( 12 − θ1)2 + (1

2 − θ2)2√2/2

where θ = [θ1, θ2] = [12 , 12 ] is the true isoform abundance vector, and θ = [θ1, θ2] is

the estimated isoform abundance vector after normalization so that θ1 + θ2 = 1. Eachsimulation experiment was repeated 200 times to get the sample mean and standard errorof the relative error.

6.2.1. Simulating single reads with uniform sampling

To explore the quality of estimation in the uniform sampling approach, single reads withlength 30bp using the uniform sampling model were generated. Five separated experimentswere performed to investigate the effect of sample size on the estimation procedure usingsample sizes of 10, 50, 200, 1000 and 5000, respectively. The red curve in Figure 8(a) givesthe sample mean and standard error of the relative error. It is clear that relative errordecreases as the sample size increases.

To examine whether longer reads can provide better estimates, all the simulation exper-iments were repeated with read length 100bp. Figure 8(a) shows the comparison betweenread lengths of 30bp and 100bp. As expected, 100bp reads produce smaller error than 30bpreads.

6.2.2. Simulating single reads with non-uniform sampling

In real UHTS data, the read distribution is not uniform. To evaluate how well the RNA-Seqmethodology performs in this regime, simulations were performed where the positions ofreads were sampled from a log-normal distribution. Specifically, up to a scalar multiple,the true sampling rates ai,j are independently and identically distributed random variableswhich follows log-normal distribution with mean µ = 0 and standard deviation σ = 1.

Figure 8(b) gives the comparison between reads that were sampled from uniform dis-tribution and reads that were sampled from log-normal distribution The figure shows thatnon-uniform reads produce estimates which appear consistent, albeit with larger error thanwith uniform reads.

Page 28: STATISTICAL MODELING OF RNA-SEQ DATA By Julia Salzman … 252.pdf · Ultra High Throughput Sequencing (UHTS or simply ‘sequencing’) is an emerging tech- ... which can take values

26 J. Salzman, H. Jiang and W. H. Wong

0.2

0.4

0.6

0.8

1.0

10 50 200 1000 5000

0.1

0.3

0.5

0.7

0.9

Number of reads

Rel

ativ

e er

ror

single read 30bpsingle read 100bp

0.2

0.4

0.6

0.8

1.0

10 50 200 1000 5000

0.1

0.3

0.5

0.7

0.9

Number of reads

Rel

ativ

e er

ror

single read 30bp withuniform sampling ratesingle read 30bp withlognormal sampling rate

(a) (b)

0.2

0.4

0.6

0.8

1.0

10 50 200 1000 5000

0.1

0.3

0.5

0.7

0.9

Number of reads

Rel

ativ

e er

ror

single read 30bppaired read 30bp withGaussian insert length

0.2

0.4

0.6

0.8

1.0

10 50 200 1000 5000

0.1

0.3

0.5

0.7

0.9

Number of reads

Rel

ativ

e er

ror

single read 30bppaired read 30bp withuniform insert length

(c) (d)

Fig. 8. Relative error of different read generation models. X axis is the sample size, i.e., the numberof reads that are generated in each simulation experiment. Y axis is the mean relative error basedon 200 simulation experiments. The error bars give the standard errors of the sample means. In thefigures, single 30bp reads generated with uniform sampling rate (red curves) is compared to (bluecurves) (a) single 100bp reads, (b) single 30bp reads generated with lognormal sampling rate, (c)paired end 30bp reads generated with Gaussian insert size and (d) paired end 30bp reads generatedwith uniform insert size. When compared with n (e.g., 5000) single reads, n/2 (i.e., 2500) pairs ofpaired end reads were used.

6.2.3. Simulating paired end reads

This section investigates whether, in simulation, paired end reads can provide more infor-mation than single reads. When insert lengths do not have a simple distribution, closedform expressions for the information are difficult to obtain. Simulation studies are thusimportant tools for analyzing such situations. For this purpose, paired end reads of length

Page 29: STATISTICAL MODELING OF RNA-SEQ DATA By Julia Salzman … 252.pdf · Ultra High Throughput Sequencing (UHTS or simply ‘sequencing’) is an emerging tech- ... which can take values

Statistical Modeling of RNA-Seq Data 27

30bp with insert size following a normal distribution with mean µ = 200bp and standarddeviation σ = 20bp were generated. For a given insert size, read pairs were generated usinga uniform sampling model. Figure 8(c) shows that the paired end reads produce smallererrors than single end reads with the same number of sequenced nucleotides: to make thecomparison comparable on the level of total sequenced bases, n/2 pairs of paired end readswere used when compared with n single reads.

When the insert size was generated using a uniform distribution, e.g., the effectiveinsert size is uniform within 200 ± 20bp, similar results were produced (See Figure 8(d)).Comparing Figure 8(d) with Figure 8(a) shows that paired end 30bp reads produce similarlyaccurate estimates as 100bp single reads, which means that on average paired end readsprovide more information per nucleotide being sequenced.

7. Discussion

The insert length model presented in this paper is a flexible statistical tool. In Sections 3.2,the model has been derived when the experimental step of fragmentation is assumed to beapproximated by a Poisson point process, and a transcript is assumed to be retained in thesample in proportion to the fraction of transcripts of its length estimated after sequencing.These assumptions are at once simplifying and realistic. As experimental protocols improve,it is likely they will better model RNA-Seq data.

At the current time, several improvements may be made to the model to increase itsaccuracy. Firstly, the read sampling rate is undoubtedly non-uniform, as it depends onbiochemical properties of the sample and fragmentation process as experimental studieshave highlighted (see Ingolia et al. (2009), Vega et al. (2009), and Quail et al. (2008)). Thiseffect becomes more apparent for longer fragments such as those used in paired end librarypreparation. Explicit models for the sampling rates are difficult to obtain, but doing so isan area of future research.

Statistical tests of the reproducibility of the non-uniformity of reads shows a consistentsequence specific bias across biological and technical replicates of a gene. This effect couldbe due to bias in RNA fragmentation, bias in other biochemical sample preparation stepsor boundary effects when a gene of fixed length is fragmented. The last cause of bias canbe modeled using Monte Carlo simulations of a fixed length mRNA sequence subject to aPoisson fragmentation process and incorporated into the insert length model.

Similarly, the fragmentation and filtering steps have not been explicitly modeled in theinsert length model presented here. Rather, the probability mass function of read lengths,what is necessary for defining the model, has been estimated empirically. Improvements tothe model could be made by increasing the precision of the estimate of the probability massfunction of read lengths, for example by simulating a fragmentation and filtering processby Monte Carlo and matching the output of the simulations to the empirical distributionfunction q(·). If such modeling were desired, as described in Section 3.2, the affects couldbe easily incorporated into the insert length model. On the other hand, as experimentalprotocols improve, they may reduce this bias and increase the accuracy of the insert lengthmodel as presented in this paper.

In conclusion, this paper has presented a statistical model for RNA-Seq experimentswhich provides estimates for isoform specific expression. Finding such estimates is difficultusing microarray technology, focusing interest in UHTS to address this question. To thebest of our knowledge, this is the first statistical model for paired end isoform specific

Page 30: STATISTICAL MODELING OF RNA-SEQ DATA By Julia Salzman … 252.pdf · Ultra High Throughput Sequencing (UHTS or simply ‘sequencing’) is an emerging tech- ... which can take values

28 J. Salzman, H. Jiang and W. H. Wong

abundance estimation.In addition to modeling, the paper has presented an in depth statistical analysis. By

using the classical statistical concept of minimal sufficiency, a computationally feasible so-lution to isoform estimation in RNA-Seq is provided. Further, statistical analysis quantifiesthe perceived gain in experimental efficiency from using paired end rather than single endread data to provide reliable isoform specific gene expression estimates.

8. Acknowledgements

The authors would like to acknowledge Jamie Geier Bates for providing the data used inSection 5, and Pat Brown for useful discussions. Salzman’s research was supported in partby a grant from the National Science Foundation (DMS-0805157). Wong is funded by aNational Institute of Health grant (R01-HG004634).

References

Casella, G. and R. Berger (2002). Statistical inference (2 ed.). Thomson Learning.

Chi, K. R. (2008). The year of sequencing. Nature Methods 5, 11 – 14.

Hansen, K. D., L. F. Lareau, M. Blanchette, R. E. Green, Q. Meng, J. Rehwinkel, F. L. Gal-lusser, E. Izaurralde, D. C. R. andSandrine Dudoit, and S. E. Brenner (2009). Genome-wide identification of alternative splice forms down-regulated by nonsense-mediated mrnadecay in drosophila. PLoS Genetics (6).

Hiller, D., H. Jiang, W. Xu, and W. H. Wong (2009, Dec). Identifiability of isoform decon-volution from junction arrays and rna-seq. Bioinformatics 25 (23), 3056–3059.

Ingolia, N. T., S. Ghaemmaghami, J. R. S. Newman, and J. S. Weissman (2009). Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling.Science 324 (5924), 218 – 223.

Ji, H., H. Jiang, W. Ma, D. S. Johnson, R. M. Myers, and W. H. Wong (2008, Nov). An inte-grated software system for analyzing chip-chip and chip-seq data. Nat Biotechnol 26 (11),1293–1300.

Jiang, H. and W. H. Wong (2008, Aug). Seqmap : mapping massive amount of oligonu-cleotides to the genome. Bioinformatics 24 (20), 2395–2396.

Jiang, H. and W. H. Wong (2009, Feb). Statistical inferences for isoform expression inrna-seq. Bioinformatics.

Langmead, B., C. Trapnell, M. Pop, and S. Salzberg (2009, Mar). Ultrafast and memory-efficient alignment of short dna sequences to the human genome. Genome Biol 10 (3),R25.

Lehmann, E. L. (1998). Theory of point estimation (2 ed.). Springer.

Maher, C. A., C. Kumar-Sinha, X. Cao, S. Kalyana-Sundaram, B. Han, X. Jing, L. Sam,T. Barrette, N. Palanisamy, and A. M. Chinnaiyan (2009). Transcriptome sequencing todetect gene fusions in cancer. Nature 458, 97–101.

Page 31: STATISTICAL MODELING OF RNA-SEQ DATA By Julia Salzman … 252.pdf · Ultra High Throughput Sequencing (UHTS or simply ‘sequencing’) is an emerging tech- ... which can take values

Statistical Modeling of RNA-Seq Data 29

Marioni, J. C., C. E. Mason, S. M. Mane, M. Stephens, and Y. Gilad (2008, Sep). Rna-seq:an assessment of technical reproducibility and comparison with gene expression arrays.Genome Res 18 (9), 1509–1517.

McCullagh, P. and J. Nelder (1989). Generalized Linear Models, Second Edition. BocaRaton: Chapman and Hall/CRC.

Mortazavi, A., B. A. Williams, K. McCue, L. Schaeffer, and B. Wold (2008, Jul). Mappingand quantifying mammalian transcriptomes by rna-seq. Nat Methods 5 (7), 621–628.

Pan, Q., O. Shai, L. J. Lee, B. J. Frey, and B. J. Blencowe (2008, Dec). Deep survey-ing of alternative splicing complexity in the human transcriptome by high-throughputsequencing. Nat Genet 40 (12), 1413–1415.

Pruitt, K. D., T. Tatusova, and D. R. Maglott (2005, Jan). Ncbi reference sequence (refseq):a curated non-redundant sequence database of genomes, transcripts and proteins. NucleicAcids Res 33 (Database issue), D501–D504.

Quail, M. A., I. Kozarewa, F. Smith, A. Scally, P. J. Stephens, R. Durbin, H. Swerdlow, , andD. J. Turner1 (2008). A large genome centres improvements to the illumina sequencingsystem. Nat Methods., 1005–1010.

She, Y., E. Hubbell, and H. Wang (2009). Resolving deconvolution ambiguity in genealternative splicing. BMC Bioinformatics.

Sultan, M., M. H. Schulz, H. Richard, A. Magen, A. Klingenhoff, M. Scherf, M. Seifert,T. Borodina, A. Soldatov, D. Parkhomchuk, D. Schmidt, S. O’Keeffe, S. Haas, M. Vin-gron, H. Lehrach, and M.-L. Yaspo (2008, Aug). A global view of gene activity andalternative splicing by deep sequencing of the human transcriptome. Science 321 (5891),956–960.

Trapnell, C., L. Pachter, and S. L. Salzberg (2009, May). Tophat: discovering splice junc-tions with rna-seq. Bioinformatics 25 (9), 1105–1111.

Trapnell, C., G. Pertea, B. Williams, A. Mortazavi, J. van Baren, S. Salzberg, B. Wold,and L. Pachter (2009). http://cufflinks.cbcb.umd.edu/.

Vega, V. B., E. Cheung, N. Palanisamy, and W.-K. Sung (2009). Inherent signals insequencing-based chromatin-immunoprecipitation control libraries. PLoS ONE 4.

Wahlstedt, H., C. Daniel, M. Enster, and M. hman (2009). Large-scale mrna sequencingdetermines global regulation of rna editing during brain development. Genome Res. (19),978–986.

Wang, E. T., R. Sandberg, S. Luo, I. Khrebtukova, L. Zhang, C. Mayr, S. F. Kingsmore,G. P. Schroth, and C. B. Burge (2008, Nov). Alternative isoform regulation in humantissue transcriptomes. Nature 456 (7221), 470–476.

Zhang, W., S. Duan, W. K. Bleibel, S. A. Wisel, R. S. Huang, X. Wu, L. He, T. A.Clark, T. X. Chen, A. C. Schweitzer, J. E. Blume, M. E. Dolan, and N. J. Cox (2009).Identification of common genetic variants that account for transcript isoform variationbetween human populations. Journal Human Genetics (1), 81–93.