![Page 1: James!Gurtowski! Schatz!Lab! 5/29/2014!schatzlab.cshl.edu/presentations/2014.05.29.SFAF... · 2014. 5. 29. · Near perfect de novo assemblies of eukaryotic genomes using PacBio long](https://reader036.vdocument.in/reader036/viewer/2022071501/611fd38374109c7c196d0014/html5/thumbnails/1.jpg)
Near perfect de novo assemblies of eukaryotic genomes using PacBio long read sequencing
James Gurtowski
Schatz Lab
5/29/2014
![Page 2: James!Gurtowski! Schatz!Lab! 5/29/2014!schatzlab.cshl.edu/presentations/2014.05.29.SFAF... · 2014. 5. 29. · Near perfect de novo assemblies of eukaryotic genomes using PacBio long](https://reader036.vdocument.in/reader036/viewer/2022071501/611fd38374109c7c196d0014/html5/thumbnails/2.jpg)
Assembly Complexity of Long Reads
[Figure: SVR Fit: Genome Assembly Using Genome Size and Read Length. X-axis: Genome Size (10^6 to 10^9). Y-axis: Target Percentage, Assembly N50 / Chromosome N50 (0% to 100%). Series: mean1 (3,650 ± 140bp), mean2 (7,400 ± 245bp), mean4 (15,000 ± 435bp), mean8 (30,000 ± 692bp), each with an SVR Fit curve. Chemistry annotations: "C2" 2012, "C3" 2013, "C4" ????, "C5" ????.]
Species on the x-axis, by increasing genome size: M. jannaschii (Euryarchaeota), C. hydrogenoformans (Firmicutes), E. coli (Eubacteria), Y. pestis (Proteobacteria), B. anthracis (Firmicutes), A. mirum (Actinobacteria), S. cerevisiae (Yeast), Y. lipolytica (Fungus), D. discoideum (Slime mold), N. crassa (Red bread mold), C. intestinalis (Sea squirt), C. elegans (Roundworm), C. reinhardtii (Green algae), A. thaliana (Arabidopsis), D. melanogaster (Fruitfly), P. persica (Peach), O. sativa (Rice), P. trichocarpa (Poplar), S. lycopersicum (Tomato), G. max (Soybean), M. gallopavo (Turkey), D. rerio (Zebrafish), A. carolinensis (Lizard), Z. mays (Corn), M. musculus (Mouse), H. sapiens (Human).
Assembly complexity of long read sequencing. Lee, H*, Gurtowski, J*, Yoo, S, Marcus, S, McCombie, WR, Schatz MC. (2014) In preparation
![Page 3: James!Gurtowski! Schatz!Lab! 5/29/2014!schatzlab.cshl.edu/presentations/2014.05.29.SFAF... · 2014. 5. 29. · Near perfect de novo assemblies of eukaryotic genomes using PacBio long](https://reader036.vdocument.in/reader036/viewer/2022071501/611fd38374109c7c196d0014/html5/thumbnails/3.jpg)
S. pombe dg21
103x over 10kbp
7.6x over 20kb
PacBio RS II sequencing at CSHL
• Size selection using a 7 Kb elution window on a BluePippin™ device from Sage Science
Max: 35,415bp
Mean: 5,170bp
Over 275x coverage in 5 SMRTcells using P5-C3
![Page 4: James!Gurtowski! Schatz!Lab! 5/29/2014!schatzlab.cshl.edu/presentations/2014.05.29.SFAF... · 2014. 5. 29. · Near perfect de novo assemblies of eukaryotic genomes using PacBio long](https://reader036.vdocument.in/reader036/viewer/2022071501/611fd38374109c7c196d0014/html5/thumbnails/4.jpg)
S. pombe dg21
ASM294 Reference sequence
• 12.6Mbp; 3 chromosomes + mitochondria; N50: 4.53Mbp
PacBio assembly using HGAP + Celera Assembler
• 12.7Mbp; 13 non-redundant contigs; N50: 3.83Mbp; >99.98% identity
Near perfect assembly:
• Chr1: 1 contig
• Chr2: 2 contigs
• Chr3: 2 contigs
• MT: 1 contig
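The N50 values quoted for the reference and the PacBio assembly can be computed from a list of sequence lengths. The following is a minimal sketch of the usual N50 definition; the function name `n50` and the toy lengths are illustrative and are not taken from the talk.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the largest length L such that sequences of length >= L
    together contain at least half of the total bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50() requires at least one length")


# Toy contig lengths (made up for illustration, not S. pombe data).
toy_contigs = [5, 4, 3, 2, 1]
print(n50(toy_contigs))
```

Dividing an assembly's contig N50 by the chromosome N50 of the reference gives the "Assembly N50 / Chromosome N50" ratio used elsewhere in these slides.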
![Page 5: James!Gurtowski! Schatz!Lab! 5/29/2014!schatzlab.cshl.edu/presentations/2014.05.29.SFAF... · 2014. 5. 29. · Near perfect de novo assemblies of eukaryotic genomes using PacBio long](https://reader036.vdocument.in/reader036/viewer/2022071501/611fd38374109c7c196d0014/html5/thumbnails/5.jpg)
Spanning vs Standard Coverage
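Standard Coverage (SpanLength = 1bp):

$$\frac{\sum_{\text{reads}} \text{Length}(\text{read})}{\text{GenomeSize}}$$

Spanning Coverage (SpanLength > 1bp):

$$\frac{\sum_{\text{reads}} \max\bigl(0,\ \text{Length}(\text{read}) - \text{SpanLength}\bigr)}{\text{GenomeSize}}$$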
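As a rough illustration of the two definitions on this slide, the sketch below computes both quantities from a list of read lengths. The function names and the toy read lengths are assumptions made for the example; only the formulas come from the slide.

```python
def standard_coverage(read_lengths, genome_size):
    """Total bases in all reads divided by the genome size."""
    return sum(read_lengths) / genome_size


def spanning_coverage(read_lengths, genome_size, span_length):
    """Coverage counting only the part of each read beyond span_length.

    Reads shorter than span_length contribute nothing, because they
    cannot span a region of that size.
    """
    usable = sum(max(0, length - span_length) for length in read_lengths)
    return usable / genome_size


# Toy data (made up): four reads on a 100 kbp genome.
reads = [20_000, 18_000, 10_000, 5_000]
genome = 100_000

print(standard_coverage(reads, genome))          # all bases count
print(spanning_coverage(reads, genome, 16_000))  # only bases beyond 16 kbp count
```

In this toy case the two reads longer than 16 kbp hold 38,000 bases between them, yet they contribute only 6,000 bases of spanning coverage, which is why spanning coverage falls well below the coverage from long reads alone.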
![Page 6: James!Gurtowski! Schatz!Lab! 5/29/2014!schatzlab.cshl.edu/presentations/2014.05.29.SFAF... · 2014. 5. 29. · Near perfect de novo assemblies of eukaryotic genomes using PacBio long](https://reader036.vdocument.in/reader036/viewer/2022071501/611fd38374109c7c196d0014/html5/thumbnails/6.jpg)
Spanning Coverage (S. pombe)
• How many reads span a particular 16kb region?
23x coverage of reads > 16kb, but only expect 3.6 reads to span a particular 16kb region
[Figure: Coverage and Spanning Coverage (y-axis, 0 to 300) plotted against length (x-axis ticks from 1,000 to 29,000). Annotations: 23x coverage > 16kb; 3.6x spanning coverage at 16kb.]
![Page 7: James!Gurtowski! Schatz!Lab! 5/29/2014!schatzlab.cshl.edu/presentations/2014.05.29.SFAF... · 2014. 5. 29. · Near perfect de novo assemblies of eukaryotic genomes using PacBio long](https://reader036.vdocument.in/reader036/viewer/2022071501/611fd38374109c7c196d0014/html5/thumbnails/7.jpg)
PacBio Correction/Assembly Algorithms
• PacBioToCA & ECTools: Hybrid/PB-only Error Correction
  Koren, Schatz, et al (2012) Nature Biotechnology. 30:693–700
• HGAP & Quiver: PB-only Correction & Polishing
  Chin et al (2013) Nature Methods. 10:563–569
• PBJelly: Gap Filling and Assembly Upgrade
  English et al (2012) PLOS One. 7(11): e47768
PacBio Coverage: < 5x to > 50x
![Page 8: James!Gurtowski! Schatz!Lab! 5/29/2014!schatzlab.cshl.edu/presentations/2014.05.29.SFAF... · 2014. 5. 29. · Near perfect de novo assemblies of eukaryotic genomes using PacBio long](https://reader036.vdocument.in/reader036/viewer/2022071501/611fd38374109c7c196d0014/html5/thumbnails/8.jpg)
Hybrid Approaches for Larger Genomes
PacBioToCA fails in complex regions
1. Error Dense Regions – Difficult to compute overlaps with many errors
2. Simple Repeats – Kmer Frequency Too High to Seed Overlaps
3. Extreme GC – Lacks Illumina Coverage
[Figure: Position Specific Coverage and Error Rate. X-axis: Read Position (0 to 4,000). Y-axes: Observed Coverage (0 to 30) and Observed Error Rate (15 to 30). Series: Coverage, Error Rate.]
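The "Simple Repeats" failure mode on this slide can be illustrated with a small k-mer count. The sketch below assumes a seeding scheme that ignores k-mers above a frequency cutoff; that cutoff, the function names, and the toy sequences are hypothetical and are not a description of PacBioToCA's actual implementation.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every overlapping k-mer in a sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def usable_seeds(sequence, k, max_count):
    """Return the k-mers whose frequency is at or below max_count."""
    counts = kmer_counts(sequence, k)
    return {kmer for kmer, n in counts.items() if n <= max_count}


simple_repeat = "AC" * 10              # toy dinucleotide repeat, 20 bp
mixed = "ACGTTGCAAGCTTAGGCATC"         # toy non-repetitive sequence, 20 bp

print(kmer_counts(simple_repeat, 4))          # only two distinct 4-mers, both frequent
print(len(usable_seeds(simple_repeat, 4, 2))) # no seeds survive the cutoff
print(len(usable_seeds(mixed, 4, 2)))         # every 4-mer is usable
```

Under this assumed cutoff the repeat offers no seeds at all, so no overlap can start inside it, while the non-repetitive sequence of the same length keeps all of its k-mers.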
![Page 9: James!Gurtowski! Schatz!Lab! 5/29/2014!schatzlab.cshl.edu/presentations/2014.05.29.SFAF... · 2014. 5. 29. · Near perfect de novo assemblies of eukaryotic genomes using PacBio long](https://reader036.vdocument.in/reader036/viewer/2022071501/611fd38374109c7c196d0014/html5/thumbnails/9.jpg)
ECTools: Error Correction with pre-assembled reads
Short Reads -> Assemble Unitigs -> Align & Select -> Error Correct
Can help us overcome:
1. Error Dense Regions – Longer sequences have more seeds to match
2. Simple Repeats – Longer sequences easier to resolve
However, cannot overcome Illumina coverage gaps & other biases
https://github.com/jgurtowski/ectools
![Page 10: James!Gurtowski! Schatz!Lab! 5/29/2014!schatzlab.cshl.edu/presentations/2014.05.29.SFAF... · 2014. 5. 29. · Near perfect de novo assemblies of eukaryotic genomes using PacBio long](https://reader036.vdocument.in/reader036/viewer/2022071501/611fd38374109c7c196d0014/html5/thumbnails/10.jpg)
ECTools Pipeline
1. Celera Unitigs: Generate unitigs with Celera using Illumina or another high identity sequencing technology
2. Nucmer: Align unitigs to PacBio reads with Nucmer
3. Delta-Filter: Use Delta-Filter to generate unitig layout
4. Show-Snps: Show-Snps shows differences between trusted Illumina unitig sequence and PacBio read
5. Custom Script: Script to “correct” PacBio read
Note: Reads are never split or trimmed
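The final step of the pipeline, applying the reported differences to a read, might look roughly like the sketch below. It is not the ECTools script: the `(kind, position, base)` edit format and the function name `apply_corrections` are hypothetical stand-ins for whatever the real script parses from the Show-Snps output.

```python
def apply_corrections(read, edits):
    """Apply unitig-derived edits to a read and return the corrected read.

    Each edit is a hypothetical (kind, position, base) tuple in 0-based
    read coordinates:
      ("sub", i, b)    replace read[i] with b
      ("del", i, None) drop read[i] (extra base in the read)
      ("ins", i, b)    insert b before read[i] (base missing from the read)
    """
    bases = list(read)
    # Work from the right end so that earlier positions are not shifted.
    for kind, pos, base in sorted(edits, key=lambda e: e[1], reverse=True):
        if kind == "sub":
            bases[pos] = base
        elif kind == "del":
            del bases[pos]
        elif kind == "ins":
            bases.insert(pos, base)
        else:
            raise ValueError(f"unknown edit kind: {kind}")
    return "".join(bases)


# Toy read and edits (made up for illustration).
raw_read = "ACGTTGCA"
edits = [("sub", 2, "C"), ("del", 4, None), ("ins", 6, "T")]
print(apply_corrections(raw_read, edits))
```

Positions with no edit are left untouched, which fits the note that reads are never split or trimmed: stretches without a trusted unitig alignment simply stay uncorrected.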
![Page 11: James!Gurtowski! Schatz!Lab! 5/29/2014!schatzlab.cshl.edu/presentations/2014.05.29.SFAF... · 2014. 5. 29. · Near perfect de novo assemblies of eukaryotic genomes using PacBio long](https://reader036.vdocument.in/reader036/viewer/2022071501/611fd38374109c7c196d0014/html5/thumbnails/11.jpg)
Delta-Filter Alignment filtering
Short-Read Unitigs
![Page 12: James!Gurtowski! Schatz!Lab! 5/29/2014!schatzlab.cshl.edu/presentations/2014.05.29.SFAF... · 2014. 5. 29. · Near perfect de novo assemblies of eukaryotic genomes using PacBio long](https://reader036.vdocument.in/reader036/viewer/2022071501/611fd38374109c7c196d0014/html5/thumbnails/12.jpg)
A. thaliana Ler-0
http://blog.pacificbiosciences.com/2013/08/new-data-release-arabidopsis-assembly.html
High quality assembly of chromosome arms
Assembly Performance: 8.4Mbp/23Mbp = 36%
MiSeq assembly: 63kbp/23Mbp = 0.2%
Mean: 4,137bp
Max: 41,753bp
Cov: 118x
![Page 13: James!Gurtowski! Schatz!Lab! 5/29/2014!schatzlab.cshl.edu/presentations/2014.05.29.SFAF... · 2014. 5. 29. · Near perfect de novo assemblies of eukaryotic genomes using PacBio long](https://reader036.vdocument.in/reader036/viewer/2022071501/611fd38374109c7c196d0014/html5/thumbnails/13.jpg)
O. sativa pv Indica (IR64)
Genome size: ~370 Mb
Chromosome N50: ~29.7 Mbp

| Assembly | Contig NG50 |
| --- | --- |
| “ALLPATHS-recipe”: 50x 2x100bp @ 180; 36x 2x50bp @ 2100; 51x 2x50bp @ 4800 | 18,450 |
| MiSeq Fragments: 25x 456bp (3 runs 2x300 @ 450 FLASH) | 19,078 |
| PacBioToCA – 47 SMRTCells, 10.7x @ 10kbp | 144,042 |
| ECTools – 47 SMRTCells, 10.7x @ 10kbp | 272,137 |
| HGAP – 114 SMRTCells, 29.2x @ 10kbp | 600,021 |

ECTools Read Lengths
Mean: 9,348bp
Max: 54,288bp
10.75x over 10kbp
![Page 14: James!Gurtowski! Schatz!Lab! 5/29/2014!schatzlab.cshl.edu/presentations/2014.05.29.SFAF... · 2014. 5. 29. · Near perfect de novo assemblies of eukaryotic genomes using PacBio long](https://reader036.vdocument.in/reader036/viewer/2022071501/611fd38374109c7c196d0014/html5/thumbnails/14.jpg)
Real Data Results
[Figure: SVR Fit: Genome Assembly Using Genome Size and Read Length, with the same axes (Genome Size, 10^6 to 10^9; Target Percentage, Assembly N50 / Chromosome N50, 0% to 100%), species, series (mean1 to mean8 with SVR Fit curves), and "C2" to "C5" annotations as on the "Assembly Complexity of Long Reads" slide.]
Assembly complexity of long read sequencing. Lee, H*, Gurtowski, J*, Yoo, S, Marcus, S, McCombie, WR, Schatz MC. (2014) In preparation
![Page 15: James!Gurtowski! Schatz!Lab! 5/29/2014!schatzlab.cshl.edu/presentations/2014.05.29.SFAF... · 2014. 5. 29. · Near perfect de novo assemblies of eukaryotic genomes using PacBio long](https://reader036.vdocument.in/reader036/viewer/2022071501/611fd38374109c7c196d0014/html5/thumbnails/15.jpg)
Acknowledgements
McCombie Lab: Dick McCombie, Panchajanya Deshpande, Senem Eskipehlivan, Melissa Kramer, Sara Goodwin, Eric Antoniou
PacBio: Cheryl Heiner, Greg Khitrov
Schatz Lab: Mike Schatz, Hayan Lee
ECTools: https://github.com/jgurtowski/ectools
Email: