Empirical Comparison of Protocols for Sequencing-Based Gene and Isoform Expression Profiling
Marius Nicolae and Ion Măndoiu (University of Connecticut)
IsoEM vs. DGE-EM
DGE-EM
Observed data:
• tag classes = tags and their matching sites in the isoforms
• for every tag, matching sites are weighted based on tag quality scores (Q)
Hidden variables:
• n(i,s) = number of tags coming from site s in isoform i
Parameters:
• fi = frequency of isoform i, for all i

IsoEM
Observed data:
• read classes = reads and their matching positions in the isoforms
• for every read, matching positions are weighted based on read quality scores (Q), read orientation (O) and fragment length probability distribution (F)
Hidden variables:
• n(i) = number of reads coming from isoform i
Parameters:
• fi = frequency of isoform i, for all i
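
The roles of the observed data, hidden variables, and parameters listed above can be illustrated with a minimal EM sketch. It is not the authors' implementation: the function name, the input layout (a list of read classes, each a count plus a dictionary of isoform weights), and the fixed iteration count are assumptions made for illustration, and the sketch estimates the fraction of reads per isoform without any isoform-length normalization.

```python
from collections import defaultdict


def em_isoform_frequencies(read_classes, isoforms, iterations=100):
    """Estimate isoform frequencies from weighted read classes.

    read_classes: list of (count, {isoform: weight}) pairs (hypothetical layout).
    isoforms: list of isoform names.
    """
    # Start from uniform frequencies.
    f = {i: 1.0 / len(isoforms) for i in isoforms}
    for _ in range(iterations):
        # E-step: expected number of reads n(i) coming from each isoform,
        # splitting every read class in proportion to f[i] * weight.
        n = defaultdict(float)
        for count, weights in read_classes:
            total = sum(f[i] * w for i, w in weights.items())
            if total == 0:
                continue  # read class incompatible with all current isoforms
            for i, w in weights.items():
                n[i] += count * f[i] * w / total
        # M-step: frequencies proportional to the expected read counts.
        assigned = sum(n.values())
        if assigned == 0:
            break
        f = {i: n[i] / assigned for i in isoforms}
    return f
```

For example, `em_isoform_frequencies([(30, {"A": 1.0}), (10, {"B": 1.0}), (20, {"A": 1.0, "B": 1.0})], ["A", "B"])` splits the 20 ambiguous reads between A and B according to the current frequency estimates at each iteration.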
Motivation
• Massively parallel sequencing is quickly replacing microarrays as the technology of choice for performing gene expression profiling
• Two main protocols proposed in the literature:
– RNA-Seq: generates short (single or paired) sequencing tags from the ends of randomly generated cDNA fragments.
– Digital Gene Expression (DGE) [1], or high-throughput sequencing based Serial Analysis of Gene Expression (SAGE-Seq) [2]: generates single cDNA tags using an assay whose main steps are transcript capture and cDNA synthesis using oligo(dT) beads, cDNA cleavage with an anchoring enzyme, and release of cDNA tags using a tagging enzyme whose recognition site is ligated upstream of the recognition site of the anchoring enzyme.
• We present results of a simulation study comparing the accuracy achieved by the two protocols based on state-of-the-art inference algorithms using the expectation-maximization (EM) framework
In Silico Experimental Setup
• UCSC human genome hg19
• 5 tissues from the GNFAtlas2 table

RNA-Seq:
• 1-30 million reads
• lengths 14-100 bp
• normal fragment length distribution N(250, 25)
• Algorithm: IsoEM [3]

DGE:
• 1-30 million tags
• lengths 14-26 bp
• cutting enzymes from the Restriction Enzyme Database:
– DpnII [1] with recognition site GATC
– NlaIII [2] with recognition site CATG
– CviJI with degenerate recognition site RGCY (R=G or A, Y=C or T)
• Algorithm: DGE-EM
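
To make the recognition sites above concrete, here is a small sketch of how occurrences of a site such as GATC, CATG, or the degenerate RGCY could be located in an isoform sequence. The function name and the regular-expression approach are illustrative assumptions, not the simulator used in the study; only the codes R and Y given above are expanded.

```python
import re

# Expansion of the nucleotide codes used in the setup (R=G or A, Y=C or T).
CODES = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "[AG]", "Y": "[CT]"}


def find_sites(sequence, recognition_site):
    """Return 0-based start positions of every occurrence of the site."""
    pattern = "".join(CODES[base] for base in recognition_site)
    # A lookahead reports overlapping occurrences as well.
    return [m.start() for m in re.finditer(f"(?={pattern})", sequence)]
```

Because RGCY matches four different 4-mers (AGCC, AGCT, GGCC, GGCT), it tends to occur in more sequences than a fixed 4-base site, which is consistent with the higher percentage of isoforms and genes cut reported for RGCY.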
DGE-EM algorithm
E-step
M-step
[Figure: RNA-Seq. MPE (bands 0-5, 5-10, 10-15, 15-20, 20-25) as a function of Read Length (14, 18, 21, 26, 36, 50, 75, 100) and #Reads (1,000,000; 5,000,000; 10,000,000; 15,000,000; 30,000,000).]
Median Percent Error (MPE)

                           Isoform MPE                  Gene MPE
Protocol                   Avg.   S.D.   %Isof. Cut     Avg.   S.D.   %Genes Cut
RNA-Seq                    20.6   0.14   N/A            8.3    0.17   N/A
DGE, full digest (p=1)
  GATC                     30.1   0.09   95%            4.7    0.27   95%
  CATG                     25.8   0.07   97%            2.9    0.08   98%
  RGCY                     24.1   0.08   98%            2.5    0.05   99.9%
DGE, partial digest (p=0.5)
  GATC                     24.4   0.22   95%            3.7    0.17   95%
  CATG                     22.8   0.19   97%            2.5    0.05   98%
  RGCY                     23.9   0.11   98%            2.3    0.06   99.9%
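
As a reading aid for the MPE figures above, the sketch below shows one plausible way to compute a median percent error. The exact definition is an assumption here (relative error of each estimate with respect to the true value, in percent, with the median taken over isoforms or genes), as are the function name and the choice to skip entries whose true value is zero.

```python
from statistics import median


def median_percent_error(true_values, estimates):
    """Median of 100 * |estimate - true| / true over entries with true > 0."""
    errors = [
        100.0 * abs(est - true) / true
        for true, est in zip(true_values, estimates)
        if true > 0  # assumption: entries with zero true expression are skipped
    ]
    return median(errors)
```

Under this definition, a lower MPE means that at least half of the isoforms (or genes) are estimated within that relative error.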
[Figure: DGE enzyme CATG p=.5. MPE (bands 0-2 to 12-14) as a function of Tag Length (14, 18, 21, 24, 26) and #Tags (1,000,000; 5,000,000; 10,000,000; 15,000,000; 30,000,000).]
[Figure: DGE enzyme RGCY p=.5. MPE (bands 0-2 to 12-14) as a function of Tag Length (14, 18, 21, 24, 26) and #Tags (1,000,000; 5,000,000; 10,000,000; 15,000,000; 30,000,000).]
[Figure: DGE enzyme GATC p=.5. MPE (bands 0-2 to 14-16) as a function of Tag Length (14, 18, 21, 24, 26) and #Tags (1,000,000; 5,000,000; 10,000,000; 15,000,000; 30,000,000).]
Tag-Isoform Compatibility
DGE
[Diagram: transcript with exons A B C D E; sequence labels AAAAA, CATG (AE), TCCRAC (TE)]
• Cleave with anchoring enzyme (AE)
• Attach primer for tagging enzyme (TE)
• Cleave with tagging enzyme
• Map tags
• Gene Expression (GE), Isoform Expression (IE)
RNA-Seq
[Diagram: transcript with exons A B C D E]
• Make cDNA & shatter into fragments
• Sequence fragment ends
• Map reads
• Gene Expression (GE), Isoform Expression (IE)
IsoEM algorithm
E-step
M-step
Conclusions
• For isoform expression inference, RNA-Seq yields slightly better accuracy compared to DGE, whereas for gene expression DGE has significantly better accuracy when enough isoforms are cut by the restriction enzyme used.
• DGE accuracy improves with the percentage of isoforms cut and is better for partial digestion compared to complete digestion.
• Using enzymes with degenerate recognition sites, such as CviJI, yields better accuracy than using enzymes from published studies.
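
The difference between complete (p=1) and partial (p=0.5) digestion can be illustrated with a short calculation. The sketch assumes a simple model that the poster does not spell out: each recognition site in a transcript is cut independently with probability p, and the tag comes from the cut site closest to the 3' (poly-A) end. The function names are hypothetical.

```python
def tag_site_probabilities(num_sites, p):
    """Probability that the tag comes from site j, for j = 1..num_sites.

    Site 1 is the recognition site closest to the 3' end. Site j yields the
    tag when it is cut and the j - 1 sites closer to the 3' end are not.
    """
    return [p * (1 - p) ** (j - 1) for j in range(1, num_sites + 1)]


def probability_no_tag(num_sites, p):
    """Probability that none of the sites is cut, so no tag is produced."""
    return (1 - p) ** num_sites
```

Under this model, complete digestion always yields the tag at the site closest to the 3' end, while partial digestion spreads tags over several sites of the same transcript. That extra spread is one possible reason for the better accuracy of partial digestion, since isoforms that share their 3'-most site may still differ at sites further upstream.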
ACKNOWLEDGEMENTS
This work was supported in part by NSF awards IIS-0546457 and IIS-0916948.

REFERENCES
[1] Y. Asmann, E. Klee, E. A. Thompson, E. Perez, S. Middha, A. Oberg, T. Therneau, D. Smith, G. Poland, E. Wieben, and J.-P. Kocher, “3' tag digital gene expression profiling of human brain and universal reference RNA using Illumina Genome Analyzer,” BMC Genomics, vol. 10, no. 1, p. 531, 2009.
[2] Z. J. Wu, C. A. Meyer, S. Choudhury, M. Shipitsin, R. Maruyama, M. Bessarabova, T. Nikolskaya, S. Sukumar, A. Schwartzman, J. S. Liu, K. Polyak, and X. S. Liu, “Gene expression profiling of human breast tissue samples using SAGE-Seq,” Genome Research, vol. 20, no. 12, pp. 1730–1739, 2010.
[3] M. Nicolae, S. Mangul, I. I. Mandoiu, and A. Zelikovsky, “Estimation of alternative splicing isoform frequencies from RNA-Seq data,” in Proc. WABI, ser. Lecture Notes in Computer Science, V. Moulton and M. Singh, Eds., vol. 6293. Springer, 2010, pp. 202–214.
Read-Isoform Compatibility
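$$w_{r,i} = \sum_{a} O_a \, Q_a \, F_a$$
where $w_{r,i}$ is the weight of read $r$ for isoform $i$, the sum runs over the matching positions $a$ of the read in the isoform, and $O_a$, $Q_a$, $F_a$ are the read orientation, quality score, and fragment length factors for that position.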