next generation sequencing of tomato - bioinformatics.nlsmit089/docs/2009_10_toledo.pdf · ready...

Next generation sequencing of tomato

Sandra Smit

Wageningen University

Project goal

Produce a draft sequence assembly of the tomato genome (cultivar Heinz 1706), anchored to a novel physical map

1.  Generate the pieces
2.  Assemble the pieces into chunks
3.  Order the chunks and link them to the map

A short history

International Tomato Sequencing Project

BAC-by-BAC approach

Oct 2008: Next Generation Sequencing Initiative

Tomato genome structure

Peters, Datema et al, Plant J. 2009

Participants NGS initiative

Italy, France, Japan, India, Spain, USA, UK, Netherlands

Whole Genome Assembly

[Figure: Sequence read 1 and Sequence read 2 share an overlap and merge into a contig. Paired-end reads link Contig 1 and Contig 2 into a scaffold; the gap between them is written as NNNN.]
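The terms above (read, overlap, contig, scaffold) can be illustrated with a toy Python sketch: two reads with an exact suffix-prefix overlap merge into a contig, and two contigs linked by a read pair are joined into a scaffold with a run of N for the unsequenced gap. This is only an illustration. The function names, the minimum overlap of 3 bases, the example sequences, and the gap size are hypothetical, and real assemblers tolerate sequencing errors and use far more elaborate data structures.

```python
from typing import Optional


def longest_overlap(left: str, right: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(left: str, right: str, min_len: int = 3) -> Optional[str]:
    """Merge two reads into a contig if they overlap by at least `min_len` bases."""
    size = longest_overlap(left, right, min_len)
    if size == 0:
        return None
    return left + right[size:]


def make_scaffold(contig_1: str, contig_2: str, gap: int) -> str:
    """Join two linked contigs; the gap of estimated size is written as N."""
    return contig_1 + "N" * gap + contig_2


# Hypothetical example reads (not project data).
read_1 = "GAATTCAAAACTAG"
read_2 = "AACTAGAGGAATGA"
contig = merge_reads(read_1, read_2)
scaffold = make_scaffold(contig, "CCATGGTT", gap=4)
print(contig)
print(scaffold)
```

In practice the gap length comes from the known insert size of the paired-end library, which is why libraries with several insert sizes are useful for scaffolding.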

Sequence data

Type           Amount
454            25 Gb
SOLiD          ~ 60 Gb
SBM            3 Gb
BAC ends       300,000 reads (~150,000 pairs)
Fosmid ends    150,000 reads (~75,000 pairs)
BAC contigs    70 Mb, 36% of euchromatin
Physical map   10X BAC coverage

NGS (454, SOLiD, SBM): ~90 Gb

Currently more than 80% available

Available physical map
•  Generated with KeyGene’s Whole Genome Profiling method
•  Input: ~ 92,000 BACs from 4 BAC libraries (3 enzyme libraries and a new random shear library)
•  Assembled into approx. 2500 contigs!

Physical map
•  “Classical” FPC: bands
•  Sequence-based FPC: sequence tags

26 nt sequence tag: GAATTCAAAACTAGAGGAATGAACCA

BAC 1

BAC 2
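In sequence-based fingerprinting, the evidence that two BACs overlap is the set of sequence tags they have in common, in the same way that classical FPC compares shared bands. The sketch below shows that idea as a plain set intersection. Apart from the tag quoted above, all tags, the BAC contents, and the threshold of two shared tags are hypothetical, and the actual Whole Genome Profiling and FPC scoring is more involved than a simple count.

```python
TAG_LENGTH = 26  # tag length quoted on the slide

# The first tag is the example from the slide; the others are made up.
TAG_A = "GAATTCAAAACTAGAGGAATGAACCA"
TAG_B = "GAATTCTTGACCGTAGCTAAGGCTAT"
TAG_C = "GAATTCGGCATTACGATCCGATATGC"
TAG_D = "GAATTCACGTTGCAACGGTTCAGTCA"


def shared_tags(tags_1: set, tags_2: set) -> set:
    """Tags present in both BACs."""
    return tags_1 & tags_2


def bacs_overlap(tags_1: set, tags_2: set, min_shared: int = 2) -> bool:
    """Call two BACs overlapping if they share at least `min_shared` tags."""
    return len(shared_tags(tags_1, tags_2)) >= min_shared


# Hypothetical tag content of two BACs.
bac_1 = {TAG_A, TAG_B, TAG_C}
bac_2 = {TAG_A, TAG_C, TAG_D}

common = shared_tags(bac_1, bac_2)
print(sorted(common))
print(bacs_overlap(bac_1, bac_2))
```

Because a 26 nt tag is far more specific than a band size, fewer shared tags are needed for a confident overlap call than shared bands in classical fingerprinting.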

Assembly strategy

Map Assemble Integrate

•  De novo assembly of sequence data
–  454, SOLiD, SBM, clone ends
•  Data and assembly integration
•  Mapping of data and assemblies onto WGP map

A possible assembly scenario

[Figure: 454, SOLiD and SBM reads and BAC contigs placed along the physical map. Inputs shown: 454 assembly, SOLiD assembly, SBM assembly, existing BAC contigs, 400,000 BAC-end pairs, 400,000 fosmid-end pairs.]

First international 454 assembly

Approx. # runs       58
# reads              58 × 10^6
# bases              21 Gb (22.4X)
Approx. # nts WGS    15.0 Gb (15.8X)
Approx. # nts PE     6.3 Gb (6.6X)

•  Selection of high-quality 454 runs
–  Approx. 85 % of currently available 454 data
–  Criteria: read length, linker presence (if applicable)

22 X genome coverage

                             All contigs   Large contigs   Scaffolds
# Sequences                  152,034       85,762          10,091
Total sequence length (Mb)   742           725             792
Avg. sequence length         4,882         8,458           78,476
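The coverage values quoted for the 454 data are sequenced bases divided by genome size, so the genome size assumed on the slide can be back-calculated, and the average sequence lengths can be cross-checked against total length and sequence count. A minimal sketch of that arithmetic follows, using only numbers quoted above; the function and variable names are illustrative.

```python
def implied_genome_size_gb(bases_gb: float, coverage: float) -> float:
    """Genome size (Gb) implied by a base count and its fold coverage."""
    return bases_gb / coverage


def average_length(total_mb: float, n_sequences: int) -> float:
    """Average sequence length in bases from a total in Mb and a count."""
    return total_mb * 1_000_000 / n_sequences


# (bases in Gb, fold coverage) as quoted for the 454 data.
datasets = {"all 454": (21.0, 22.4), "WGS": (15.0, 15.8), "PE": (6.3, 6.6)}
genome_sizes = {
    name: implied_genome_size_gb(gb, cov) for name, (gb, cov) in datasets.items()
}
for name, size in genome_sizes.items():
    print(f"{name}: implied genome size {size:.2f} Gb")

# (total length in Mb, number of sequences) from the assembly table.
assembly = {
    "All contigs": (742, 152_034),
    "Large contigs": (725, 85_762),
    "Scaffolds": (792, 10_091),
}
averages = {name: average_length(mb, n) for name, (mb, n) in assembly.items()}
for name, avg in averages.items():
    print(f"{name}: average length {avg:,.0f} bases")
```

The three data sets imply nearly the same genome size, and the recomputed averages land close to the reported ones; small differences are expected because the totals are rounded to whole Mb.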

Scaffolds of the international assembly

95% of assembly in 5% of the scaffolds

75% of assembly in scaffolds larger than 1 Mb

First 250 Mb (size of euchromatin) in 49 scaffolds larger than 3.4 Mb
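Statements like these are cumulative statistics over the sorted scaffold lengths. The sketch below shows one way to compute them; the function names and the toy length list are hypothetical and are not the project's scaffold lengths.

```python
def fraction_in_top(lengths, top_fraction):
    """Share of total length held by the largest `top_fraction` of scaffolds."""
    ordered = sorted(lengths, reverse=True)
    n_top = max(1, round(len(ordered) * top_fraction))
    return sum(ordered[:n_top]) / sum(ordered)


def fraction_larger_than(lengths, min_length):
    """Share of total length in scaffolds larger than `min_length` bases."""
    return sum(x for x in lengths if x > min_length) / sum(lengths)


def scaffolds_to_reach(lengths, target):
    """Number of largest scaffolds needed to reach `target` bases,
    together with the length of the smallest scaffold among them."""
    total = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        total += length
        if total >= target:
            return count, length
    raise ValueError("target exceeds total assembly length")


# Toy scaffold lengths in bases (hypothetical), 10 Mb in total.
toy_lengths = [5_000_000, 3_000_000, 1_500_000, 400_000, 50_000,
               30_000, 10_000, 5_000, 3_000, 2_000]

print(fraction_in_top(toy_lengths, 0.2))
print(fraction_larger_than(toy_lengths, 1_000_000))
print(scaffolds_to_reach(toy_lengths, 9_000_000))
```

Applied to the real scaffold lengths with a target of 250 Mb, `scaffolds_to_reach` would return the kind of pair reported above (a scaffold count and the size of the smallest scaffold included).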

Assembly vs. known sequences

Mitochondrion   95%
Chloroplast     99%
BAC ends        93%
ESTs            95%

Link to physical map

Physical map
•  2521 contigs
•  261,913 tags

Assembly
•  10091 scaffolds
•  90% of tags (236,454) on unique location

MAP, LINK
•  10085 links
•  Avg. overlap 26.8 tags
•  2521 contigs, 3940 scaffolds, 770 Mb
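The map-and-link step can be sketched as follows: each physical-map tag is searched in the scaffolds, only tags with a single location are kept, and a map contig is linked to a scaffold by the number of uniquely placed tags they share. This is a simplified illustration and not the project's pipeline. It assumes exact matches on the forward strand only, uses 6 nt toy tags instead of 26 nt, and all names and sequences are hypothetical.

```python
from collections import defaultdict


def unique_tag_locations(tags, scaffolds):
    """Map each tag to (scaffold, position) if it occurs exactly once
    across all scaffolds (forward strand, exact match only)."""
    locations = {}
    for tag in tags:
        hits = []
        for name, sequence in scaffolds.items():
            start = sequence.find(tag)
            while start != -1:
                hits.append((name, start))
                start = sequence.find(tag, start + 1)
        if len(hits) == 1:
            locations[tag] = hits[0]
    return locations


def link_contigs_to_scaffolds(contig_tags, locations):
    """Count uniquely placed tags shared by each (map contig, scaffold) pair."""
    links = defaultdict(int)
    for contig, tags in contig_tags.items():
        for tag in tags:
            if tag in locations:
                links[(contig, locations[tag][0])] += 1
    return dict(links)


# Hypothetical toy data.
scaffolds = {
    "scaffold_1": "TTGACCAAAGGCTATCCCGATATG",
    "scaffold_2": "ACGTTGAAATTGACCAAA",
}
contig_tags = {
    "map_contig_1": ["GGCTAT", "GATATG", "TTGACC"],
    "map_contig_2": ["ACGTTG", "CCCCCC"],
}

all_tags = [tag for tags in contig_tags.values() for tag in tags]
locations = unique_tag_locations(all_tags, scaffolds)
links = link_contigs_to_scaffolds(contig_tags, locations)
print(f"{len(locations)} of {len(all_tags)} tags on a unique location")
print(links)
```

In the toy data one tag occurs in both scaffolds and one is absent, so neither contributes to a link; the number of tags supporting each link corresponds to the "overlap" figure reported above.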

SOLiD: Genomic DNA

60X genome coverage in SOLiD reads

Genomic DNA of Heinz: 1 Kb, 4-5 Kb and 10 Kb insert sizes

De novo assembly

Assembly cleanup / Mapping on 454 assembly

SOLiD RNAseq

•  Annotation
•  Assembly

HEINZ Poly A mRNA Ready for RNase III

2x

Total RNA from different tissues and organs of HEINZ

Marco Piettrella

•  Annotation
•  Assembly
•  SNP discovery

SOLiD RNA seq

Pimpi Poly A mRNA Ready for RNase III

Total RNA from different tissues and organs of pimpi

EELM

We are not done yet…!
•  Complete 454 assembly and hybrid assemblies
–  454, SBM, clone ends
•  Quality control of data and assemblies
•  Assembly cleanup by SOLiD data
•  Anchoring to physical map
•  Annotation

More details and discussion tomorrow in joint WP-meeting 5.4 and 6.1 (16.00-17.30)

Conclusions
•  Exciting NextGen Sequencing approach
•  > 80% of data has been produced
–  Sequence data (454, SOLiD, SBM) and high-quality physical map
•  Preliminary results are promising
–  Assembled 792 Mb in scaffolds
–  More than 90% of the ESTs are in these scaffolds
–  95% of assembly in only 5% of the scaffolds

First draft of the genome to be presented in January 2010 (PAG, San Diego)

Acknowledgements
•  Italy: Giovanni Giuliano, Giorgio Valle et al.
•  France: Mondher Bouzayen et al.
•  Japan: Satoshi Tabata et al.
•  India: Akhilesh Tyagi et al.
•  Spain: Antonio Granell et al.
•  US: Jim Giovannoni et al.
•  UK: Gerhard Bishop et al.
•  Netherlands:
–  WUR/CBSG/EUSOL: Elio Schijlen, Jose van de Belt, Marjo van Staveren, Erwin Datema, Jan van Haarst, Bas te Lintel, Roeland van Ham, Willem Stiekema, Rene Klein Lankhorst
–  KeyGene: Richard Feron, Jan van Oeveren, Marcel Prins, Michiel van Eijk, Marco van Schriek
•  Roche: Lei Du, Jason Affourtit, Gerard Irzyk, Jim Knight, Marcus Droege, Hans Lunstroo
