20140711 4 e_tseng_ercc2.0_workshop
Post on 28-Jul-2015
FIND MEANING IN COMPLEXITY For Research Use Only. Not for use in diagnostic procedures.
© Copyright 2014 by Pacific Biosciences of California, Inc. All rights reserved.
Elizabeth Tseng, Staff Scientist / 2014.07.11
Technical Variability in PacBio® Full-length cDNA (Iso-Seq™) Sequencing
SampleNet: Iso-Seq Method with Clontech® cDNA Synthesis Kit
PacBio’s Iso-Seq™ Method for High-quality, Full-length Transcripts
[Figure 1: Iso-Seq workflow, panels a (experimental pipeline) and b (informatics pipeline)]
Experimental Pipeline
• PolyA mRNA
• cDNA synthesis with adapters
• Size partitioning & PCR amplification
• SMRTbell™ ligation
• PacBio® RS II Sequencing
Informatics Pipeline
• PacBio raw sequence reads
• Remove adapters, remove artifacts
• Clean sequence reads
• Reads clustering
• Isoform clusters
• Consensus calling
• Nonredundant transcript isoforms
• Quality filtering
• Final isoforms
• Map to reference genome
• Evidence-based gene models
Read structure labels: 5’ primer, coding sequence, polyA tail, 3’ primer, SMRT® adapter, Reads of Insert
DevNet: Iso-Seq wiki page
Iso-Seq Full-length cDNA Library Protocol
[Figure: protocol flowchart]
• Input: Total RNA (with Optional Poly-A Selection) or polyA+ RNA
• Reverse Transcription (SMARTScribe RT)
• Full-length 1st Strand cDNA
• PCR Optimization
• Large-scale Amplification
• Amplified cDNA
• Size Selection (1–2 kb, 2–3 kb, 3–6 kb)
• Re-Amplification (1–2 kb, 2–3 kb, 3–6 kb)
• SMRTbell™ Template Preparation (1–2 kb, 2–3 kb, 3–6 kb)
• SMRT® Sequencing
• Optional Size Selection (3–6 kb)
Iso-Seq Informatics Pipeline
[Figure: informatics pipeline, illustrated with Transcript 1, Transcript 2, Transcript 3]
• Per-molecule reads
• Full-length (FL) reads and non-FL reads
• Isoform-level clusters
• Clusters of transcript alignments using FL + nFL reads
• Final transcript consensus
Key Features of Current Iso-Seq Bioinformatics
• Non-redundant, full-length transcript consensus sequences
– No assembly
– De novo
– Achieves high-quality consensus (≥ 99%)
– Universal PacBio features: robust to GC%, repeat structure, etc.
• Applications
– Alternative splicing
– Fusion transcripts
– Alternative polyadenylation
– Alternative start sites (possible with proper protocol)
Disclaimer
• Everything shown from here on is at the transcript/isoform level, not the gene level
• Data shown are preliminary and far from final
• Concept analysis
Count Information Associated with Each Unique Transcript
[Figure: clusters of transcript alignments using FL + nFL reads, and the final transcript consensus, for Transcript 1, Transcript 2, Transcript 3]
Count matrix
Transcript   Count   Norm_Count
1            8       0.08
2            5       0.05
3            7       0.07
…            …       …
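The slide does not say how Norm_Count is derived. A minimal sketch, assuming it is each transcript's count divided by the total count over all transcripts (the slide's values, 8 → 0.08, would be consistent with a total of 100 once the elided rows are included). The function name `normalize_counts` is hypothetical, and only the three visible rows are used, so the fractions below differ from the slide's.

```python
def normalize_counts(counts):
    """Divide each transcript's count by the total count.

    Assumption: Norm_Count = count / sum of all counts. The slide does not
    state the normalization, so treat this as one plausible reading.
    """
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("total count must be positive")
    return {tid: c / total for tid, c in counts.items()}


# Only the three rows visible on the slide; the "…" rows are not filled in.
visible_counts = {"1": 8, "2": 5, "3": 7}
norm_counts = normalize_counts(visible_counts)

for tid in visible_counts:
    print(tid, visible_counts[tid], round(norm_counts[tid], 3))
```

With the full matrix the same call would reproduce the slide's column only if the assumed normalization is the one actually used.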
Count Information from non-FL reads
For non-FL reads:
• If uniquely associated with a transcript, assume it came from that transcript
• If ambiguously associated, it is most likely because the read is a partial match
• For now, the weight of an ambiguous nFL read is just 1 / (number of associated transcripts)
read_count = (# of FL) + (# of unique nFL) + (weighted # of ambiguous nFL)
In the current dataset, about 40–60% of nFL reads partially match multiple isoforms (FL reads are always fully and uniquely associated).
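The counting rule above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's actual code: the function name, the input layout (one transcript ID per FL read, one list of candidate transcript IDs per nFL read), and the example reads are all hypothetical.

```python
from collections import defaultdict


def transcript_read_counts(fl_hits, nfl_hits):
    """Apply: read_count = # FL + # unique nFL + weighted # ambiguous nFL.

    fl_hits:  one transcript ID per FL read (FL reads are uniquely associated).
    nfl_hits: one list of candidate transcript IDs per nFL read.
    """
    counts = defaultdict(float)
    for tid in fl_hits:
        counts[tid] += 1.0
    for hits in nfl_hits:
        candidates = set(hits)
        if not candidates:
            continue  # nFL read associated with no transcript: not counted
        # A unique nFL read gets weight 1; an ambiguous one is split evenly,
        # i.e. weight = 1 / (number of associated transcripts).
        weight = 1.0 / len(candidates)
        for tid in candidates:
            counts[tid] += weight
    return dict(counts)


# Hypothetical reads, for illustration only.
fl = ["T1", "T1", "T2"]
nfl = [["T1"], ["T1", "T2"], ["T2", "T3"], []]
read_counts = transcript_read_counts(fl, nfl)
print(read_counts)
```

Splitting each ambiguous read evenly keeps the total across transcripts equal to the number of assigned reads (here 3 FL + 3 nFL = 6).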
Read Count Variation in Technical Replicates
Rat Heart
• Technical replicates (same starting RNA & protocol)
• 3 size libraries (1–2 kb, 2–3 kb, 3–6 kb)
• Runs from different size libraries pooled for the bioinformatics pipeline
Boxplot of log2 read counts
Scatterplot of log2 read count for each transcript
Rat Heart, technical replicates
Read Count Variation in Technical Replicates
Rat Lung, technical replicates
All technical replicates were sequenced with a total of ~8 SMRT® Cells (low depth).
Most NA transcripts have low counts.
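A replicate comparison like the log2 scatterplots above could be prepared along these lines. This is a sketch only: it assumes "NA" means a transcript observed in one replicate but not the other, uses a plain Pearson correlation as one possible summary (the slides do not name a statistic), and all names and counts are hypothetical.

```python
import math


def log2_count_pairs(rep_a, rep_b):
    """Pair log2 read counts for transcripts seen in both replicates.

    Returns (pairs, na_ids); na_ids lists transcripts missing (or zero) in
    one replicate, which is how "NA" is interpreted in this sketch.
    """
    seen_a = {t for t, c in rep_a.items() if c > 0}
    seen_b = {t for t, c in rep_b.items() if c > 0}
    shared = sorted(seen_a & seen_b)
    pairs = [(t, math.log2(rep_a[t]), math.log2(rep_b[t])) for t in shared]
    na_ids = sorted(seen_a ^ seen_b)
    return pairs, na_ids


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical counts for two technical replicates.
rep1 = {"T1": 8, "T2": 4, "T3": 2, "T4": 1}
rep2 = {"T1": 16, "T2": 8, "T3": 4, "T5": 1}

pairs, na_ids = log2_count_pairs(rep1, rep2)
r = pearson([p[1] for p in pairs], [p[2] for p in pairs])
print(pairs, na_ids, r)
```

In this toy input the NA transcripts (T4, T5) are the ones with a count of 1, mirroring the slide's observation that most NA transcripts are low counts.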
Choice of Chemistry Does Not Bias Sequencing
Rat Brain
Same 3-size library (not a technical replicate)
• Sequenced with P4-C2 chemistry
• Sequenced with P5-C3 chemistry
However, for longer (> 3 kb) transcripts, P5-C3 chemistry will increase the chance of seeing FL reads.
Choice of PCR Enzyme May Bias Amplification
Human Brain, 2 – 3 kb library
Human Brain, 3 – 6 kb library
Current Iso-Seq Protocol Amplifies Sample Twice
[Figure: Iso-Seq full-length cDNA library protocol flowchart, as in the earlier protocol slide. The two amplification steps are Large-scale Amplification (before Size Selection) and Re-Amplification (after Size Selection).]
2nd Amplification Does Not Introduce Strong Bias
FL Read Length Distribution
• Std. vs. skipping 2nd amp
• Std. vs. skipping 1st amp
Skipping the 1st amplification results in size selection of first-strand cDNA, which may be hard to optimize.
Expected Transcript Variability in Different Rat Tissues
Rat Heart vs Rat Lung
Rat Heart vs Rat Brain
Conclusion
• Technical variation is not a big issue
– If done with the same library protocol
– Different (PCR) enzymes bias amplification
– Amplification can be tolerated if kept at a reasonable # of cycles
• Potential for DE
– Still many unknown factors
– Everything shown in previous slides is merely “proof of concept”
– With control comes better modeling
Looking Ahead
Wet Lab
• Detection limit
• Amplification bias
– Adding control at known %
– Factors: GC? Length? Enzyme?
Bioinformatics
• Account for library pooling
• Ambiguous mapping
• Modeling bias
• DE isoform detection
• Combining short-read data
For Research Use Only. Not for use in diagnostic procedures. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell and Iso-Seq are trademarks of Pacific Biosciences in the United States and/or other countries. All other trademarks are the sole property of their respective owners.