sequencing technologies and applications at jgi feng chen, ph.d. 05/14/2012 mgm workshops

Sequencing Technologies and Applications at JGI

Feng Chen, Ph.D.

05/14/2012

MGM Workshops

Outline

• Overview of sequencing technologies at JGI

• Potential of Pacific Biosciences sequencing

• Highlights of application development

Staying State of the Art

• 454 early access: 07/2005
• Solexa early access: 01/2007
• 454 in production: 04/2007
• SOLiD early access: 10/2007
• Solexa in production: 07/2008
• Megabace offline: 08/2007
• AB 3730 reduced: 12/2007

Illumina GAIIx

454 Titanium

05/2009

454 1K

12/2009

Illumina HiSeq 2000

PacBio

Ion Torrent

Illumina MiSeq

ONT

Emerging Sequencing Technologies

• Illumina MiSeq (improvement)
• Illumina HiSeq 2500
• Ion Torrent PGM
• Ion Torrent Proton

Illumina Improvement

• Longer read length (250 bp)
• 3-fold more reads (15 M)
• Higher throughput (5-7 Gb)
• Faster run time

Two run configurations
• Fast run config can be done in 27 hours and produce 120 Gb
• Standard run config remains the same (600 Gb in 17 days)
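To compare the two run configurations on a common footing, both can be converted to output per day. The sketch below uses only the figures quoted on the slide; the function name `gb_per_day` is made up for illustration.

```python
def gb_per_day(output_gb: float, run_hours: float) -> float:
    """Average output in gigabases per 24-hour day."""
    return output_gb / (run_hours / 24.0)

# Figures quoted on the slide:
#   fast run     = 120 Gb in 27 hours
#   standard run = 600 Gb in 17 days
fast = gb_per_day(120, 27)
standard = gb_per_day(600, 17 * 24)

print(f"fast run:     {fast:.1f} Gb/day")
print(f"standard run: {standard:.1f} Gb/day")
```

On these numbers the fast configuration delivers roughly three times the daily output of the standard one, at a fifth of the total yield per run.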

Promises from Ion Torrent

Oxford Nanopore Technologies

• Long read length: > 50 kb
• High output: > 1 Gb/hr
• “Run until…”
• Cheap: ~$40/Gb
• Error rate: < 4%

Evolution of JGI Sequencing Platforms

[Figure: instrument units by platform (ABI 3730xl, Roche/454, GAIIx, HiSeq), budget ($ millions), output (trillions of bases), and staff (FTE) for 2009, 2010, and 2011; axis ticks run from 0 to 40.]

Data labels, in the order shown:

• FY2009: 49 units, 24 FTEs
• $11M, 1 Tb
• $11M, 6 Tb
• $8M, 29 Tb

[Figure: budget ($M), staff (#), and output (Tb) panels, with platform labels 3730, 454, GAII, HiSeq.]
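The budget and output labels on the chart ($11M for 1 Tb, $11M for 6 Tb, $8M for 29 Tb) can be turned into a rough cost per terabase. This is only a sketch: it assumes each pair of labels belongs together, and it treats the whole budget as sequencing cost, which the slide does not state.

```python
# (budget in $ millions, output in Tb), in the order the labels appear on the chart.
labels = [(11, 1), (11, 6), (8, 29)]

def cost_per_tb(budget_musd: float, output_tb: float) -> float:
    """Budget divided by output, in $ millions per terabase."""
    return budget_musd / output_tb

costs = [cost_per_tb(budget, output) for budget, output in labels]

for (budget, output), cost in zip(labels, costs):
    print(f"${budget}M for {output} Tb -> ${cost:.2f}M per Tb")
```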

JGI Current Sequencing Platforms

Platform | Units | Reads | Average read length | Total bases | Run time | Applications
Illumina HiSeq | 8 | 1,400 million per flowcell | 150 bp | 325 billion per flowcell | 16.5 days | Primary sequence generator at JGI
Pacific Biosciences RS | 2 | 0.04 million per SMRT Cell | 2,700 bp | 0.100 billion per SMRT Cell | 0.08 days (2 hours) | de novo, cDNA, 16S ID, validation
Illumina MiSeq | 2 | 5 million per flowcell | 150 bp | 2.1 billion per flowcell | 1 day | 16S, sample QC, R&D
Illumina GAIIx | 5 | 210 million per flowcell | 150 bp | 75 billion per flowcell | 14 days | Replaced by HiSeq
Roche/454 FLX-Ti | 2 | 1 million per PTP | 450 bp | 0.450 billion per PTP | 0.3 days (8 hours) | 16S (replaced by MiSeq)

Platform groups shown on the slide: major platforms, supplement platform, platforms being phased out.
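Using the total-bases and run-time figures listed above, the platforms can be compared by output per day for a single run unit (one flowcell, SMRT Cell, or PTP). This is a rough sketch: it ignores how many units an instrument runs in parallel, library preparation time, and data quality.

```python
# platform: (total bases per run unit in billions, run time in days), from the slide
platforms = {
    "Illumina HiSeq": (325, 16.5),
    "PacBio RS": (0.100, 0.08),
    "Illumina MiSeq": (2.1, 1),
    "Illumina GAIIx": (75, 14),
    "Roche/454 FLX-Ti": (0.450, 0.3),
}

# Billions of bases per day for one run unit.
rate = {name: bases / days for name, (bases, days) in platforms.items()}

for name, gb_per_day in sorted(rate.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:18s} {gb_per_day:6.2f} Gb/day")
```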

Portfolio of Library Capabilities

STANDARD

DNA De Novo and Reseq:
• Std frag 270bp, 500bp (amplified/unamp)
• tight insert 250bp, 500bp (amplified/unamp)
• CLIP-PE 4kb, 8kb

Transcriptome Diversity/Counting:
• RNASeq stranded
• RNASeq with/without rRNA depletion (Prok and Meta)
• small RNASeq
• PET RNASeq (5’ and 3’)

Environmental Diversity Profiling:
• 16S Profiling

CUSTOM/R&D

DNA De Novo:
• CLIP-PE fosmid
• CLIP-PE 20kb
• LFPE 4kb, 8kb
• Haplotype resolved sequencing
• single cells/fragments
• PacBio WGS
• PacBio amplicon sequencing

Functional Genomics:
• TSS prokaryotic RNAseq
• Tn insertion site profiling sequencing
• PacBio FL RNA
• PacBio methylation sequencing
• pools of 96 fosmids indexed libraries
• Bisulfite Seq
• chromatin IP
• nano RNAseq

Pacific Biosciences Technology

• Single Molecule
– Sequence directly from the molecules in your sample, not the amplification product

• Real time
– Direct observation of natural DNA synthesis in a continuous and processive manner

• Phospholinked Nucleotides
– Fluorescent label is at the gamma-phosphate position
– Naturally cleaved during incorporation

Pacific Biosciences Advantages

• Fast run time
• Long read length
• No amplification biases
• Able to measure DNA polymerase kinetics
– Inter-pulse distance
– Pulse duration
• Multiple sequencing modes
– Standard
– Strobe
– Circular consensus
• Disadvantages: high error (indel), low throughput
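The two kinetic measurements named above can be illustrated with a few lines of code. This is a minimal sketch, assuming each fluorescence pulse is available as a (start, end) time pair and that inter-pulse distance is measured from the end of one pulse to the start of the next. The function name and the pulse times are hypothetical and do not come from any PacBio software.

```python
from typing import List, Tuple

def pulse_kinetics(pulses: List[Tuple[float, float]]):
    """pulses: (start, end) times in seconds, sorted by start time.

    Returns (pulse durations, inter-pulse distances).
    """
    # Pulse duration: how long each incorporation signal lasts.
    durations = [end - start for start, end in pulses]
    # Inter-pulse distance: gap between the end of one pulse and the start of the next.
    ipds = [
        next_start - prev_end
        for (_, prev_end), (next_start, _) in zip(pulses, pulses[1:])
    ]
    return durations, ipds

# Hypothetical pulse times, for illustration only.
pulses = [(0.0, 0.5), (1.0, 1.25), (3.0, 3.5)]
durations, ipds = pulse_kinetics(pulses)
print("durations:", durations)
print("inter-pulse distances:", ipds)
```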

Less GC Bias Than Newest Illumina Chemistry

[Figure: panels for 28% GC and 73% GC samples; Illumina series labeled V3 HiSeq, V2 HiSeq, and V2 GAIIx.]

PacBio Data Improves Assembly

Least improved genomes (but these started out in good shape)

Most improved genome: 53 / 71 (75%) gaps closed

11% of gaps were closed incorrectly, with either errors in consensus or misassemblies

PacBio Data Coverage

[Figure: Allpaths assembly (Illumina only) with Illumina coverage and PacBio coverage tracks. Annotations: great coverage of PacBio in gap region; ~100x coverage.]

PacBio Read Length

[Figure: read length (bp, axis 0 to 3,500) over the timeline Oct-10 to Dec-11. Annotations: V 1.1.2, V 1.2, V 1.2.1, V 1.2.2, C2 chemistry; “We started from here”; successful upgrades; laser overpower; instrument fine tuning.]

Alignment before and after correction

[Figure: “before” and “after” panels with coverage (scale 0x to 800x) and annotation tracks.]

Transcriptome/FL-cDNA Sequencing

Goals: capture the 5’ and 3’ ends of transcripts, and their splice variants

annotated transcript



Transcriptome Coverage

• Transcripts hit: 73.3%
• Transcripts tiled: 38.6%
• Transcripts covered by ≥ 1 subread: 36.5%

• About 1/3 of all transcripts (1/2 of the transcripts hit by this dataset) are covered by at least one single PacBio subread

• There is NO ambiguity if splice variants are detected

Error Correction revealed isoforms

J. Martin, Z. Wang

Application Development

• Large-insert paired-end sequencing
- 3-5 kb, 8-10 kb, and >20 kb insert size
- CLIP-PE: developed in-house

• RNA sequencing
- 5’ and 3’ end targeted and full-length sequencing
- Metatranscriptome sequencing

• 16S rRNA profiling and identification
- iTag on Illumina MiSeq and 16S ID on PacBio

• Haplotype-resolved sequencing
- Single chromosome sequencing

• Functional genomics:
- Gene synthesis
- Large scale gene disruption

16S Tagging on MiSeq

Targeting V4 region in 16S gene (291 nt in length)
• Use 3rd-read indexing strategy and custom forward sequencing primer to maximize the use of Illumina’s limited read length
• 2x250 bp run to ensure read overlap
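The reason for the 2x250 bp run can be checked with simple arithmetic: two reads sequenced from opposite ends of a fragment overlap by twice the read length minus the fragment length. The sketch below treats the 291 nt V4 region as the full sequenced fragment, which is a simplification (it ignores primers and any spacer), and the function name is made up for illustration.

```python
def paired_end_overlap(read_length: int, fragment_length: int) -> int:
    """Number of bases covered by both reads of a pair.

    A negative value means the two reads leave an unsequenced gap.
    The overlap can never exceed the fragment itself.
    """
    return min(2 * read_length - fragment_length, fragment_length)

V4_LENGTH = 291  # nt, as stated on the slide

overlap_150 = paired_end_overlap(150, V4_LENGTH)
overlap_250 = paired_end_overlap(250, V4_LENGTH)

print("2x150 bp overlap:", overlap_150)
print("2x250 bp overlap:", overlap_250)
```

Under this simplification a 2x150 bp run would leave only a handful of overlapping bases, while a 2x250 bp run covers most of the region twice.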

[Figure: Amplicon Modification. Illumina and 454 amplicon layouts, with labels: 16S gene, HVR, V4, 16S-specific primer, Illumina adapter 1, Illumina adapter 2, Read1 priming site, Read2 priming site, barcode priming site, spacer.]

Illumina MiSeq Suitable for 16S Tagging

96 samples are pooled in one MiSeq run

High quality sequencing data were obtained from both reads

• MiSeq data largely agrees with 454 PyroTag data
• Major differences are in low abundance clusters
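Pooling 96 samples in one run relies on the index (third) read to assign each read pair back to its sample. The sketch below shows the idea with exact barcode matching only. The barcodes, sample names, and record layout are hypothetical, and it is not the JGI pipeline.

```python
from collections import defaultdict

def demultiplex(records, barcode_to_sample):
    """Group read pairs by sample using the index read.

    records: iterable of (index_read, read1, read2) tuples.
    Returns {sample: [(read1, read2), ...]}; pairs whose index read
    matches no known barcode are collected under 'undetermined'.
    """
    bins = defaultdict(list)
    for index_read, read1, read2 in records:
        sample = barcode_to_sample.get(index_read, "undetermined")
        bins[sample].append((read1, read2))
    return dict(bins)

# Hypothetical barcodes and reads, for illustration only.
barcodes = {"ACGT": "sample_01", "TGCA": "sample_02"}
records = [
    ("ACGT", "AAA", "TTT"),
    ("TGCA", "CCC", "GGG"),
    ("ACGT", "AAC", "TTG"),
    ("NNNN", "AAA", "TTT"),
]
bins = demultiplex(records, barcodes)
for sample, pairs in sorted(bins.items()):
    print(sample, len(pairs))
```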

Functional Genomics through “Transposon bombing”

• Random Tn insertion mutagenesis
• Cell growth under multiple conditions
• High-throughput insertion site sequencing
• Map insertion sites to the reference sequence for functional annotation

Tn Insertion Reveals Essential Genes

High-throughput sequencing reveals that “essential” genes appear as transposon-free regions.

[Figure: Illumina read depth (scale 0 to 230) of transposon insertions along a run of genes. The non-essential genes carry transposon insertions; the essential gene, dihydroxy-acid dehydratase (required for biosynthesis of amino acids), is an insertion-free site.]
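The idea that essential genes show up as insertion-free regions can be sketched by counting mapped insertion sites per gene and listing the genes with none. The gene coordinates and insertion sites below are hypothetical. A real analysis would also need to account for gene length and for the expected distribution of random insertions, so a zero count here marks only a candidate.

```python
from bisect import bisect_left, bisect_right

def insertion_counts(genes, insertion_sites):
    """Count insertion sites falling inside each gene.

    genes: {name: (start, end)} with inclusive coordinates.
    insertion_sites: iterable of mapped insertion positions.
    """
    sites = sorted(insertion_sites)
    return {
        name: bisect_right(sites, end) - bisect_left(sites, start)
        for name, (start, end) in genes.items()
    }

def insertion_free(genes, insertion_sites):
    """Names of genes with no insertion at all (candidate essential genes)."""
    counts = insertion_counts(genes, insertion_sites)
    return sorted(name for name, n in counts.items() if n == 0)

# Hypothetical coordinates and insertion sites, for illustration only.
genes = {"geneA": (100, 400), "geneB": (500, 900), "geneC": (1000, 1300)}
sites = [120, 250, 399, 1005, 1200, 1300]

counts = insertion_counts(genes, sites)
candidates = insertion_free(genes, sites)
print(counts)
print("insertion-free:", candidates)
```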

Pseudomonas stutzeri RCH2

• Essential genes: 508 (12%)
• Non-essential genes: 3,542 (80%)
• Uncertain: 362 (8%)

[Figure: expected distribution from random insertions compared with the observed distribution of insertions.]
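One way to think about the expected distribution from random insertions is to ask how likely a gene is to receive no insertion by chance. The sketch below assumes insertions land uniformly and independently along the genome, which the slide does not state. The genome size, gene length, and insertion count are hypothetical round numbers, not values for this strain.

```python
import math

def prob_no_insertion(gene_length: int, genome_length: int, n_insertions: int) -> float:
    """P(gene receives zero insertions) under uniform, independent insertions."""
    return (1.0 - gene_length / genome_length) ** n_insertions

def prob_no_insertion_poisson(gene_length: int, genome_length: int, n_insertions: int) -> float:
    """Poisson approximation: exp(-expected insertions in the gene)."""
    return math.exp(-n_insertions * gene_length / genome_length)

# Hypothetical values, for illustration only.
GENOME_LENGTH = 4_000_000   # bp
N_INSERTIONS = 40_000
GENE_LENGTH = 1_000         # bp, so 10 insertions are expected in this gene

p_exact = prob_no_insertion(GENE_LENGTH, GENOME_LENGTH, N_INSERTIONS)
p_approx = prob_no_insertion_poisson(GENE_LENGTH, GENOME_LENGTH, N_INSERTIONS)
print(f"exact:   {p_exact:.3e}")
print(f"Poisson: {p_approx:.3e}")
```

With ten insertions expected, an empty gene is very unlikely to arise by chance, which is why an insertion-free gene stands out against the random expectation.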

Single Chromosome Sequencing

[Figure: single chromosome sequencing workflow. Labels, in order: metaphase chromosomes; single chromosome in droplet or micro-well; MDA/PCR amplification. Methods shown: MM, MF, LCM.]

MM: micromanipulator
MF: microfluidics
LCM: laser capture microdissector

Thank you very much!

Questions?
