nm11

IsoEM vs. DGE-EM

DGE-EM

Observed data:
• tag classes = tags and their matching sites in the isoforms
• for every tag, matching sites are weighted based on tag quality scores (Q)

Hidden variables:
• n(i,s) = number of tags coming from site s in isoform i

Parameters:
• fi = frequency of isoform i, for all i

IsoEM

Observed data:
• read classes = reads and their matching positions in the isoforms
• for every read, matching positions are weighted based on read quality scores (Q), read orientation (O) and fragment length probability distribution (F)

Hidden variables:
• n(i) = number of reads coming from isoform i

Parameters:
• fi = frequency of isoform i, for all i

Motivation

• Massively parallel sequencing is quickly replacing microarrays as the technology of choice for gene expression profiling.
• Two main protocols have been proposed in the literature:
– RNA-Seq: generates short (single or paired) sequencing tags from the ends of randomly generated cDNA fragments.
– Digital Gene Expression (DGE) [1], or high-throughput sequencing based Serial Analysis of Gene Expression (SAGE-Seq) [2]: generates single cDNA tags using an assay whose main steps are transcript capture and cDNA synthesis using oligo(dT) beads, cDNA cleavage with an anchoring enzyme, and release of cDNA tags using a tagging enzyme whose recognition site is ligated upstream of the recognition site of the anchoring enzyme.
• We present results of a simulation study comparing the accuracy achieved by the two protocols, based on state-of-the-art inference algorithms using the expectation-maximization (EM) framework.

In Silico Experimental Setup

• UCSC human genome hg19
• 5 tissues from the GNFAtlas2 table

RNA-Seq:
• 1-30 million reads
• lengths 14-100 bp
• normal fragment length distribution N(250, 25)
• Algorithm: IsoEM [3]

DGE:
• 1-30 million tags
• lengths 14-26 bp
• cutting enzymes from the Restriction Enzyme Database:
– DpnII [1] with recognition site GATC
– NlaIII [2] with recognition site CATG
– CviJI with degenerate recognition site RGCY (R=G or A, Y=C or T)
• Algorithm: DGE-EM
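The recognition sites listed above, including the degenerate site RGCY, can be located in a transcript sequence with a small pattern search. The following is a minimal sketch, not the code used in the study; the helper names `site_regex`, `find_sites` and `is_cut` are hypothetical, and only the IUPAC codes needed for these three sites are handled.

```python
import re

# IUPAC codes needed for GATC, CATG and RGCY (R = G or A, Y = C or T).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "[AG]", "Y": "[CT]"}

def site_regex(recognition_site):
    """Translate a recognition site into a regex; the lookahead allows overlapping hits."""
    body = "".join(IUPAC[c] for c in recognition_site.upper())
    return re.compile("(?=(" + body + "))")

def find_sites(sequence, recognition_site):
    """Return the 0-based start positions of all occurrences of the site."""
    pattern = site_regex(recognition_site)
    return [m.start() for m in pattern.finditer(sequence.upper())]

def is_cut(sequence, recognition_site):
    """An isoform counts as 'cut' here if it contains at least one site (assumption)."""
    return len(find_sites(sequence, recognition_site)) > 0

if __name__ == "__main__":
    toy = "AAGATCCATGAGCT"  # made-up sequence for illustration only
    for site in ("GATC", "CATG", "RGCY"):
        print(site, find_sites(toy, site))
```

Under this sketch, the percentage of isoforms cut by an enzyme would be the share of isoform sequences for which `is_cut` returns true.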

DGE-EM algorithm

E-step

M-step
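An EM iteration over the quantities named on this poster (tag classes, hidden counts n(i,s), frequencies fi) might look like the sketch below. Several details are assumptions that go beyond the poster text: sites are numbered 1, 2, ... starting from the 3' end, each site is cut independently with probability p so that site s is the tag source with probability proportional to p(1-p)^(s-1), and the M-step divides each isoform's count by 1-(1-p)^k, the probability that an isoform with k sites yields a tag. The function name `dge_em` and the input layout are hypothetical.

```python
def dge_em(tag_classes, num_sites, p=0.5, iters=200):
    """Sketch of an EM loop for DGE data (assumed model, see lead-in).

    tag_classes: list of (count, [(isoform, site, weight), ...]) entries,
                 one per class of identical tags.
    num_sites:   dict isoform -> number of sites (must be >= 1).
    Returns (f, n): isoform frequencies and expected counts n[(isoform, site)].
    """
    isoforms = sorted(num_sites)
    f = {i: 1.0 / len(isoforms) for i in isoforms}  # uniform start
    n = {}
    for _ in range(iters):
        # E-step: split each tag class among its matching (isoform, site) pairs.
        n = {}
        for count, matches in tag_classes:
            scores = [f[i] * p * (1.0 - p) ** (s - 1) * w for i, s, w in matches]
            total = sum(scores)
            if total == 0.0:
                continue
            for (i, s, _w), score in zip(matches, scores):
                n[(i, s)] = n.get((i, s), 0.0) + count * score / total
        # M-step: sum counts per isoform, correct for uncut copies, renormalize.
        per_isoform = {i: 0.0 for i in isoforms}
        for (i, _s), value in n.items():
            per_isoform[i] += value
        adjusted = {i: per_isoform[i] / (1.0 - (1.0 - p) ** num_sites[i])
                    for i in isoforms}
        z = sum(adjusted.values())
        f = {i: adjusted[i] / z for i in isoforms}
    return f, n

if __name__ == "__main__":
    # Made-up toy input: 30 tags unique to A, 10 unique to B, 20 shared.
    classes = [(30, [("A", 1, 1.0)]),
               (10, [("B", 1, 1.0)]),
               (20, [("A", 1, 1.0), ("B", 1, 1.0)])]
    freqs, counts = dge_em(classes, {"A": 1, "B": 1}, p=0.5)
    print(freqs)
```

With p=1 the model reduces to full digestion, where only the site closest to the 3' end can produce a tag.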

[Figure: RNA-Seq. MPE as a function of read length (14, 18, 21, 26, 36, 50, 75, 100 bp) and number of reads (1,000,000; 5,000,000; 10,000,000; 15,000,000; 30,000,000). MPE axis 0-25, shaded in bands 0-5, 5-10, 10-15, 15-20, 20-25.]

Median Percent Error (MPE)

                                      Isoform MPE                    Gene MPE
Protocol                    Enzyme    Avg.    S.D.    %Isof. Cut     Avg.    S.D.    %Genes Cut
RNA-Seq                     -         20.6    0.14    N/A            8.3     0.17    N/A
DGE, full digest (p=1)      GATC      30.1    0.09    95%            4.7     0.27    95%
DGE, full digest (p=1)      CATG      25.8    0.07    97%            2.9     0.08    98%
DGE, full digest (p=1)      RGCY      24.1    0.08    98%            2.5     0.05    99.9%
DGE, partial digest (p=0.5) GATC      24.4    0.22    95%            3.7     0.17    95%
DGE, partial digest (p=0.5) CATG      22.8    0.19    97%            2.5     0.05    98%
DGE, partial digest (p=0.5) RGCY      23.9    0.11    98%            2.3     0.06    99.9%
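The accuracy measure reported above can be sketched as follows. The poster does not spell out the formula, so this assumes the usual reading: the percent error of one isoform or gene is 100 * |estimated - true| / true, and MPE is the median of these values over all items with nonzero true frequency. The function name `median_percent_error` is hypothetical.

```python
from statistics import median

def median_percent_error(true_freq, est_freq):
    """Median over items of 100 * |est - true| / true (assumed definition).

    true_freq, est_freq: dicts mapping an isoform or gene id to a frequency.
    Items missing from est_freq are treated as estimated at 0;
    items with a true frequency of 0 are skipped.
    """
    errors = [100.0 * abs(est_freq.get(key, 0.0) - true) / true
              for key, true in true_freq.items() if true > 0]
    return median(errors)

if __name__ == "__main__":
    # Made-up numbers for illustration only.
    truth = {"iso1": 0.5, "iso2": 0.3, "iso3": 0.2}
    estimate = {"iso1": 0.55, "iso2": 0.3, "iso3": 0.15}
    print(median_percent_error(truth, estimate))
```

Because the median ignores the tails, a few badly estimated low-abundance isoforms do not dominate the score.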

[Figure: DGE, enzyme CATG, p=.5. MPE as a function of tag length (14, 18, 21, 24, 26 bp) and number of tags (1,000,000; 5,000,000; 10,000,000; 15,000,000; 30,000,000). MPE axis 0-14, shaded in bands 0-2, 2-4, 4-6, 6-8, 8-10, 10-12, 12-14.]

[Figure: DGE, enzyme RGCY, p=.5. MPE as a function of tag length (14, 18, 21, 24, 26 bp) and number of tags (1,000,000; 5,000,000; 10,000,000; 15,000,000; 30,000,000). MPE axis 0-14, shaded in bands 0-2, 2-4, 4-6, 6-8, 8-10, 10-12, 12-14.]

[Figure: DGE, enzyme GATC, p=.5. MPE as a function of tag length (14, 18, 21, 24, 26 bp) and number of tags (1,000,000; 5,000,000; 10,000,000; 15,000,000; 30,000,000). MPE axis 0-16, shaded in bands 0-2, 2-4, 4-6, 6-8, 8-10, 10-12, 12-14, 14-16.]

Tag-Isoform Compatibility

DGE
• Cleave with anchoring enzyme (AE)
• Attach primer for tagging enzyme (TE)
• Cleave with tagging enzyme
• Map tags
→ Gene Expression (GE), Isoform Expression (IE)

[Figure: DGE protocol diagram. Labels: transcript segments A-E, poly(A) tail (AAAAA), AE site CATG, TE primer TCCRAC.]
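The DGE steps above can be mimicked in a small simulation that produces one tag from one transcript copy. This is only a sketch under assumptions not stated on the poster: the transcript is given 5' to 3', sites are tried starting from the 3' end, each site is cut with probability p, and the tag is taken to start at the first cut site and extend toward the 3' end, recognition site included. The function name `simulate_tag` is hypothetical, and only non-degenerate sites are handled.

```python
import random

def simulate_tag(transcript, ae_site="CATG", tag_length=21, p=1.0, rng=random):
    """Return the tag produced by one transcript copy, or None if no site is cut.

    Sites are visited from the 3' end; each is cut with probability p
    (p=1.0 models full digestion, p=0.5 partial digestion).
    """
    last_start = len(transcript) - len(ae_site)
    sites = [k for k in range(last_start + 1) if transcript.startswith(ae_site, k)]
    for start in reversed(sites):          # walk from the 3' end
        if rng.random() < p:               # this site is cut
            return transcript[start:start + tag_length]
    return None                            # transcript yields no tag

if __name__ == "__main__":
    # Made-up transcript with two CATG sites and a short poly(A) tail.
    toy = "AACATGTTTTCATGCCCCCCAAAAA"
    print(simulate_tag(toy, tag_length=8, p=1.0))
    print(simulate_tag(toy, tag_length=8, p=0.5, rng=random.Random(0)))
```

Repeating the call many times with p=0.5 gives tags from sites further from the 3' end as well, which is what distinguishes partial from full digestion in this sketch.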

RNA-Seq
• Make cDNA & shatter into fragments
• Sequence fragment ends
• Map reads
→ Gene Expression (GE), Isoform Expression (IE)

[Figure: RNA-Seq protocol diagram on a transcript with segments A-E.]

IsoEM algorithm

E-step

M-step
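An EM loop over read classes, hidden counts n(i) and frequencies fi could be sketched as below. The normalization in the M-step, dividing each count by an effective isoform length before renormalizing, is an assumption not stated on this poster, and the function name `isoem_sketch` and its input layout are hypothetical.

```python
def isoem_sketch(read_classes, eff_len, iters=500):
    """Sketch of an EM loop for RNA-Seq read classes (assumed model, see lead-in).

    read_classes: list of (count, {isoform: weight}) entries, where the weight
                  is the compatibility of the read class with that isoform.
    eff_len:      dict isoform -> effective length (positive number).
    Returns (f, n): isoform frequencies and expected read counts per isoform.
    """
    isoforms = list(eff_len)
    f = {i: 1.0 / len(isoforms) for i in isoforms}  # uniform start
    n = {i: 0.0 for i in isoforms}
    for _ in range(iters):
        # E-step: fractionally assign each read class to compatible isoforms.
        n = {i: 0.0 for i in isoforms}
        for count, weights in read_classes:
            total = sum(f[i] * w for i, w in weights.items())
            if total == 0.0:
                continue
            for i, w in weights.items():
                n[i] += count * f[i] * w / total
        # M-step: length-normalize the counts and rescale to sum to 1.
        adjusted = {i: n[i] / eff_len[i] for i in isoforms}
        z = sum(adjusted.values())
        f = {i: adjusted[i] / z for i in isoforms}
    return f, n

if __name__ == "__main__":
    # Made-up toy input: reads unique to A, unique to B, and shared by both.
    classes = [(30, {"A": 1.0}), (10, {"B": 1.0}), (20, {"A": 0.5, "B": 0.5})]
    freqs, counts = isoem_sketch(classes, {"A": 300.0, "B": 100.0})
    print(freqs, counts)
```

The main structural difference from a DGE model in this sketch is that the hidden counts are kept per isoform rather than per site, and that isoform length enters the M-step.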

Conclusions

• For isoform expression inference, RNA-Seq yields slightly better accuracy compared to DGE, whereas for gene expression DGE has significantly better accuracy when enough isoforms are cut by the restriction enzyme used.
• DGE accuracy improves with the percentage of isoforms cut and is better for partial digestion compared to complete digestion.
• Using enzymes with degenerate recognition sites, such as CviJI, yields better accuracy than using enzymes from published studies.

ACKNOWLEDGEMENTS

This work was supported in part by NSF awards IIS-0546457 and IIS-0916948.

REFERENCES

[1] Y. Asmann, E. Klee, E. A. Thompson, E. Perez, S. Middha, A. Oberg, T. Therneau, D. Smith, G. Poland, E. Wieben, and J.-P. Kocher, “3' tag digital gene expression profiling of human brain and universal reference RNA using Illumina Genome Analyzer,” BMC Genomics, vol. 10, no. 1, p. 531, 2009.

[2] Z. J. Wu, C. A. Meyer, S. Choudhury, M. Shipitsin, R. Maruyama, M. Bessarabova, T. Nikolskaya, S. Sukumar, A. Schwartzman, J. S. Liu, K. Polyak, and X. S. Liu, “Gene expression profiling of human breast tissue samples using SAGE-Seq,” Genome Research, vol. 20, no. 12, pp. 1730–1739, 2010.

[3] M. Nicolae, S. Mangul, I. I. Mandoiu, and A. Zelikovsky, “Estimation of alternative splicing isoform frequencies from RNA-Seq data,” in Proc. WABI, ser. Lecture Notes in Computer Science, V. Moulton and M. Singh, Eds., vol. 6293. Springer, 2010, pp. 202–214.

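Read-Isoform Compatibility

\[ w_{r,i} = \sum_{a} Q_a \, O_a \, F_a \]

where w_{r,i} is the weight of read r for isoform i and a ranges over the matching positions of r in i.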
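A read-isoform weight that combines the three factors Q, O and F over all matching positions might be computed as in the sketch below. The concrete form of each factor is an assumption: Q is derived from Phred quality scores with a uniform substitution model, O is 1 for a consistent orientation and 0 otherwise, and F is a normal density that reads N(250, 25) as mean 250 and standard deviation 25. All function names and the alignment record layout are hypothetical.

```python
import math

def quality_term(read, ref, phred):
    """Q: probability of observing the read bases given the reference bases."""
    q = 1.0
    for base, ref_base, score in zip(read, ref, phred):
        err = 10.0 ** (-score / 10.0)          # Phred score -> error probability
        q *= (1.0 - err) if base == ref_base else err / 3.0
    return q

def orientation_term(orientation_ok):
    """O: 1 if the alignment orientation is consistent, else 0."""
    return 1.0 if orientation_ok else 0.0

def fragment_term(length, mean=250.0, sd=25.0):
    """F: normal density of the implied fragment length (assumed mean and sd)."""
    z = (length - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def read_isoform_weight(alignments):
    """Sum Q * O * F over all matching positions of one read in one isoform."""
    return sum(quality_term(a["read"], a["ref"], a["phred"])
               * orientation_term(a["orientation_ok"])
               * fragment_term(a["fragment_length"])
               for a in alignments)

if __name__ == "__main__":
    # Made-up alignment record for illustration only.
    aln = {"read": "ACGT", "ref": "ACGT", "phred": [30, 30, 30, 30],
           "orientation_ok": True, "fragment_length": 250}
    print(read_isoform_weight([aln]))
```

A weight of zero from the orientation factor removes an alignment entirely, while the other two factors only scale its contribution.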

Empirical Comparison of Protocols for Sequencing-Based Gene and Isoform Expression Profiling

Marius Nicolae and Ion Măndoiu (University of Connecticut)
