Sequencing Technologies and Applications at JGI
Feng Chen, Ph.D.
05/14/2012
MGM Workshops
Outline
• Overview of sequencing technologies at JGI
• Pacific Biosciences potentials
• Highlights of application development
Staying State of the Art
• 454 early access: 07/2005
• Solexa early access: 01/2007
• 454 in production: 04/2007
• SOLiD early access: 10/2007
• Solexa in production: 07/2008
• Megabace offline: 08/2007
• AB 3730 reduced: 12/2007
• Illumina GAIIx
• 454 Titanium: 05/2009
• 454 1K: 12/2009
• Illumina HiSeq 2000
• PacBio
• Ion Torrent
• Illumina MiSeq
• ONT
Emerging Sequencing Technologies
• Illumina MiSeq (improvement)
• Illumina HiSeq 2500
• Ion Torrent PGM
• Ion Torrent Proton
Illumina Improvement
• Longer read length (250 bp)
• 3-fold more reads (15 M)
• Higher throughput (5-7 Gb)
• Faster run time
Two run configurations
• Fast run config can be done in 27 hours and produce 120 Gb
• Standard run config remains the same (600 Gb in 17 days)
Promises from Ion Torrent
Oxford Nanopore Technologies
• Long read length: > 50 kb
• High output: > 1 Gb/hr
• “Run until…”
• Cheap: ~$40/Gb
• Error rate: < 4%
Evolution of JGI Sequencing Platforms
[Figure: sequencing units by platform (ABI3730xl, Roche/454, GAIIx, HiSeq), budget ($ millions), output (trillions of bases), and staff (FTE), 2009 to 2011]
• FY2009: 49 units, 24 FTEs, $11M, 1 Tb
• FY2010: 22 units, 15 FTEs, $11M, 6 Tb
• FY2011: 15 units, 9 FTEs, $8M, 29 Tb
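The fiscal-year figures on this slide (FY2009: $11M and 1 Tb with 24 FTEs; FY2010: $11M and 6 Tb with 15 FTEs; FY2011: $8M and 29 Tb with 9 FTEs) can be turned into a rough cost-per-terabase trend. The sketch below is only arithmetic on those slide numbers; the dictionary layout and function names are illustrative, and it assumes the whole budget is attributed to sequencing output, which the slide does not state.

```python
# Fiscal-year figures copied from the slide: budget in $ millions, output in Tb.
# Assumption: the full budget is attributed to sequencing output.
fiscal_years = {
    "FY2009": {"budget_musd": 11, "output_tb": 1, "fte": 24},
    "FY2010": {"budget_musd": 11, "output_tb": 6, "fte": 15},
    "FY2011": {"budget_musd": 8, "output_tb": 29, "fte": 9},
}


def cost_per_tb(budget_musd, output_tb):
    """Budget divided by output, in $ millions per Tb."""
    return budget_musd / output_tb


def output_per_fte(output_tb, fte):
    """Output divided by staff, in Tb per FTE."""
    return output_tb / fte


summary = {
    fy: {
        "musd_per_tb": cost_per_tb(row["budget_musd"], row["output_tb"]),
        "tb_per_fte": output_per_fte(row["output_tb"], row["fte"]),
    }
    for fy, row in fiscal_years.items()
}

for fy, row in summary.items():
    print(f"{fy}: ${row['musd_per_tb']:.2f}M per Tb, {row['tb_per_fte']:.2f} Tb per FTE")
```

Under that assumption the cost falls from about $11M per Tb in FY2009 to roughly $0.28M per Tb in FY2011.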
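JGI Current Sequencing Platforms

Major Platforms
• Illumina HiSeq
– Units: 8
– Reads: 1,400 million per flowcell
– Average read length: 150 bp
– Total bases: 325 billion per flowcell
– Run time: 16.5 days
– Applications: primary sequence generator at JGI
• Pacific Biosciences RS
– Units: 2
– Reads: 0.04 million per SMRT Cell
– Average read length: 2,700 bp
– Total bases: 0.100 billion per SMRT Cell
– Run time: 0.08 days (2 hours)
– Applications: de novo, cDNA, 16S ID, validation

Supplement Platform
• Illumina MiSeq
– Units: 2
– Reads: 5 million per flowcell
– Average read length: 150 bp
– Total bases: 2.1 billion per flowcell
– Run time: 1 day
– Applications: 16S, sample QC, R&D

Platforms being Phased-out
• Illumina GAIIx
– Units: 5
– Reads: 210 million per flowcell
– Average read length: 150 bp
– Total bases: 75 billion per flowcell
– Run time: 14 days
– Applications: replaced by HiSeq
• Roche/454 FLX-Ti
– Units: 2
– Reads: 1 million per PTP
– Average read length: 450 bp
– Total bases: 0.450 billion per PTP
– Run time: 0.3 days (8 hours)
– Applications: 16S (replaced by MiSeq)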
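One way to compare the platforms in this table is output per day of instrument time. The sketch below divides the "Total Bases" figure by the "Run Time" figure for each platform, using only the numbers on the slide. It is per run unit (flowcell, SMRT Cell, or PTP) and ignores library preparation, the number of instruments, and how many units run in parallel, so treat it as a rough illustration rather than a benchmark. The variable and function names are illustrative.

```python
# (total bases per run unit in Gb, run time in days), as listed on the slide.
platforms = {
    "Illumina HiSeq": (325.0, 16.5),
    "Pacific Biosciences RS": (0.100, 0.08),
    "Illumina MiSeq": (2.1, 1.0),
    "Illumina GAIIx": (75.0, 14.0),
    "Roche/454 FLX-Ti": (0.450, 0.3),
}


def gb_per_day(total_gb, run_days):
    """Sequence output per day of run time, in Gb/day."""
    return total_gb / run_days


rates = {name: gb_per_day(gb, days) for name, (gb, days) in platforms.items()}

# Rank platforms from highest to lowest daily output per run unit.
ranked = sorted(rates.items(), key=lambda item: item[1], reverse=True)

for name, rate in ranked:
    print(f"{name}: {rate:.2f} Gb/day")
```

By this measure the HiSeq leads at close to 20 Gb/day per flowcell, while the PacBio RS trades throughput for a 2-hour turnaround and much longer reads.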
Portfolio of Library Capabilities

STANDARD
DNA De Novo and Reseq:
• Std frag 270bp, 500bp (amplified/unamp)
• tight insert 250bp, 500bp (amplified/unamp)
• CLIP-PE 4kb, 8kb
Transcriptome Diversity/Counting:
• RNASeq stranded
• RNASeq with/without rRNA depletion (Prok and Meta)
• small RNASeq
• PET RNASeq (5’ and 3’)
Environmental Diversity Profiling:
• 16S Profiling

CUSTOM/R&D
DNA De Novo:
• CLIP-PE fosmid
• CLIP-PE 20kb
• LFPE 4kb, 8kb
• Haplotype resolved sequencing
• single cells/fragments
• PacBio WGS
• PacBio amplicon sequencing
Functional Genomics:
• TSS prokaryotic RNAseq
• Tn insertion site profiling sequencing
• PacBio FL RNA
• PacBio methylation sequencing
• pools of 96 fosmids indexed libraries
• Bisulfite Seq
• chromatin IP
• nano RNAseq
Outline
• Overview of sequencing technologies at JGI
• Pacific Biosciences potentials
• Highlights of application development
Pacific Biosciences Technology
• Single Molecule
– Sequence directly from the molecules in your sample, not the amplification product
• Real time
– Direct observation of natural DNA synthesis in a continuous and processive manner
• Phospholinked Nucleotides
– Fluorescent label is at the gamma-phosphate position
– Naturally cleaved during incorporation
Pacific Biosciences Advantages
• Fast run time
• Long read length
• No amplification biases
• Able to measure DNA polymerase kinetics
– Inter-pulse distance
– Pulse duration
• Multiple sequencing modes
– Standard
– Strobe
– Circular consensus
• Disadvantages: high error (indel), low throughput
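The two kinetic measurements named above can be illustrated with a toy calculation. The sketch below assumes each incorporation event is recorded as a (start, end) time pair in seconds. It takes pulse duration as end minus start, and inter-pulse distance as the gap between the end of one pulse and the start of the next. The input format, the function name, and the example values are hypothetical; this is not the instrument's actual data format or software.

```python
from typing import List, Tuple


def pulse_kinetics(pulses: List[Tuple[float, float]]):
    """Compute pulse durations and inter-pulse distances.

    pulses: (start, end) times in seconds, ordered by start time.
    Returns (durations, inter_pulse_distances).
    """
    # Pulse duration: how long each fluorescence pulse lasts.
    durations = [end - start for start, end in pulses]
    # Inter-pulse distance: gap between the end of one pulse
    # and the start of the next one.
    ipds = [pulses[i][0] - pulses[i - 1][1] for i in range(1, len(pulses))]
    return durations, ipds


# Hypothetical pulses for three consecutive incorporations.
example_pulses = [(0.0, 0.5), (1.0, 1.25), (3.25, 3.75)]
durations, ipds = pulse_kinetics(example_pulses)
print("durations:", durations)
print("inter-pulse distances:", ipds)
```

In this made-up trace the second gap is four times longer than the first. Kinetic analyses look for shifts of this kind in the polymerase's timing.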
Less GC Bias Than Newest Illumina Chemistry
[Figure panels: 28% GC and 73% GC; V3 HiSeq, V2 HiSeq, V2 GAIIx]
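GC bias is judged against the GC content of the sequence being covered, such as the 28% GC and 73% GC examples on this slide. As a minimal sketch, GC content can be computed per window as below. The function names and the choice to ignore ambiguous bases such as N are assumptions made for illustration, not something the slide specifies.

```python
from typing import List


def gc_fraction(seq: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T) in seq."""
    called = [base for base in seq.upper() if base in "ACGT"]
    if not called:
        return 0.0  # no unambiguous bases to measure
    return sum(base in "GC" for base in called) / len(called)


def windowed_gc(seq: str, window: int) -> List[float]:
    """GC fraction of consecutive non-overlapping windows of the given size."""
    return [
        gc_fraction(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, window)
    ]


print(gc_fraction("GGCCAATT"))
print(windowed_gc("GGGGAAAA", 4))
```

Plotting read depth against windowed GC values like these is one common way to visualize the bias the slide refers to.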
PacBio Data Improves Assembly
Least improved genomes (but they started out in good shape)
Most improved genome: 53 / 71 (75%) gaps closed
11% of gaps were closed incorrectly with either errors in consensus or misassemblies
PacBio Data Coverage
[Figure: Allpaths assembly (Illumina only) with Illumina coverage and PacBio coverage tracks; great coverage of PacBio in gap region, ~100x coverage]
PacBio Read Length
[Figure: read length (bp, 0 to 3,500) over a timeline from Oct-10 to Dec-11. Annotations: "We started from here", V 1.1.2, V 1.2, V 1.2.1, V 1.2.2, C2 chemistry, successful upgrades, laser overpower, instrument fine tuning]
Alignment before and after correction
[Figure: alignments before and after correction, with annotation and coverage (0x to 800x) tracks]
Transcriptome/FL-cDNA Sequencing
Goals: capture the 5’ and 3’ end of the transcripts and splicing variants
Transcriptome Coverage
[Figure: PacBio reads aligned against annotated transcripts]
• Transcripts hit (73.3%)
• Transcripts tiled (38.6%)
• Transcripts covered by > 1 subread (36.5%)
• About 1/3 of all transcripts (1/2 of the transcripts hit by this dataset) are covered by at least one PacBio subread
• There is NO ambiguity if splice variants are detected
Error Correction revealed isoforms
J. Martin, Z. Wang
Outline
• Overview of sequencing technologies at JGI
• Pacific Biosciences potentials
• Highlights of application development
Application Development
• Large-insert paired-end sequencing
– 3-5 kb, 8-10 kb, and >20 kb insert size
– CLIP-PE: developed in-house
• RNA sequencing
– 5’ and 3’ end targeted and full-length sequencing
– Metatranscriptome sequencing
• 16S rRNA profiling and identification
– iTag on Illumina MiSeq and 16S ID on PacBio
• Haplotype-resolved sequencing
– Single chromosome sequencing
• Functional genomics
– Gene synthesis
– Large scale gene disruption
16S Tagging on MiSeq
• Targeting V4 region in 16S gene (291 nt in length)
• Use 3rd-read indexing strategy and custom forward sequencing primer to maximize the use of Illumina’s limited read length
• 2x250 bp run to ensure read overlap
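The reason a 2x250 bp run ensures overlap on a 291 nt target is simple arithmetic. The sketch below assumes each read of the pair starts at one end of the 291 nt region and extends inward for its full length. Whether the 291 nt figure includes primer sequence is not stated on the slide, so the numbers are illustrative. The function name is hypothetical.

```python
def expected_overlap(read_length: int, amplicon_length: int) -> int:
    """Bases covered by both reads of a pair.

    Assumes read 1 starts at one end of the amplicon and read 2 at the
    other, each extending inward for read_length bases.
    """
    covered_twice = 2 * read_length - amplicon_length
    # The overlap cannot be negative (reads too short to meet) and cannot
    # exceed the amplicon itself (reads longer than the amplicon).
    return max(0, min(covered_twice, amplicon_length))


for read_length in (100, 150, 250):
    overlap = expected_overlap(read_length, 291)
    print(f"2x{read_length} bp on 291 nt: {overlap} nt overlap")
```

Under these assumptions a 2x250 bp run gives 209 nt of overlap, a 2x150 bp run leaves only 9 nt, and a 2x100 bp run would not overlap at all.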
Amplicon Modification
[Figure: amplicon layout for Illumina vs. 454, showing the 16S gene HVR (V4), 16S-specific primer, spacer, Illumina adapters 1 and 2, and the Read1, Read2, and barcode priming sites]
Illumina MiSeq Suitable for 16S Tagging
• 96 samples are pooled in one MiSeq run
• High quality sequencing data were obtained from both reads
• MiSeq data largely agrees with 454 PyroTag data
• Major differences are in low abundance clusters
Functional Genomics through “Transposon bombing”
• Random Tn insertion mutagenesis
• Cell growth at multiple conditions
• High throughput insertion site sequencing
• Map insertion sites to reference sequence for functional annotation
High-throughput sequencing reveals “essential” genes, which appear as transposon-free regions
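The idea that essential genes show up as transposon-free regions can be sketched as a count of mapped insertion sites per gene. The example below is a simplified illustration, not the pipeline used for this work. The gene coordinates, insertion positions, function names, and the zero-insertion cutoff are all hypothetical. A real analysis would also weigh gene length and insertion density, as the expected-versus-observed comparison in this talk suggests.

```python
from bisect import bisect_left, bisect_right


def insertions_per_gene(genes, insertion_sites):
    """Count insertion sites falling inside each gene.

    genes: dict mapping gene name to (start, end), inclusive coordinates.
    insertion_sites: iterable of mapped insertion positions.
    """
    sites = sorted(insertion_sites)
    counts = {}
    for name, (start, end) in genes.items():
        # Number of sorted sites with start <= position <= end.
        counts[name] = bisect_right(sites, end) - bisect_left(sites, start)
    return counts


def candidate_essential(genes, insertion_sites, max_insertions=0):
    """Genes with at most max_insertions insertions (hypothetical cutoff)."""
    counts = insertions_per_gene(genes, insertion_sites)
    return [name for name, n in counts.items() if n <= max_insertions]


# Hypothetical genes and mapped insertion positions.
genes = {"geneA": (1, 1000), "geneB": (1001, 2500), "geneC": (2501, 3200)}
sites = [15, 230, 640, 990, 2600, 2610, 3100]

print(insertions_per_gene(genes, sites))
print(candidate_essential(genes, sites))
```

Here the hypothetical geneB carries no insertions while its neighbours do, which is the pattern the slide describes for an essential gene.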
Tn Insertion Reveals Essential Genes
[Figure: Illumina read depth (0 to 230) across genes. Transposon insertions fall in the flanking non-essential genes; the insertion-free site is an essential gene, dihydroxy-acid dehydratase (required for biosynthesis of amino acids)]
Pseudomonas stutzeri RCH2
• Essential genes: 508 (12%)
• Non-essential genes: 3,542 (80%)
• Uncertain: 362 (8%)
[Figure: expected distribution from random insertions vs. observed distribution of insertions]
Single Chromosome Sequencing
Metaphase chromosomes
Single chromosome in droplet or micro-well
MDA/PCR amplification
MM: micromanipulator
MF: microfluidics
LCM: Laser Capture Microdissector
Thank you very much!
Questions?