Next generation sequencing of tomato
Project goal
Produce a draft sequence assembly of the tomato genome (cultivar Heinz 1706), anchored to a novel physical map
1. Generate the pieces
2. Assemble the pieces into chunks
3. Order the chunks and link them to the map
A short history International Tomato Sequencing Project
BAC-by-BAC approach
Oct 2008: Next Generation Sequencing Initiative
Tomato genome structure
Peters, Datema et al, Plant J. 2009
Participants NGS initiative
Italy, France, Japan, India, Spain, USA, UK, Netherlands
Whole Genome Assembly
Sequence read 1 Sequence read 2
Contig
Overlap
NNNN
Contig 1 Contig 2
Scaffold
Paired-end reads
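The overlap, contig and scaffold terms above can be illustrated with a toy sketch. This is not the assembler used in the project: the function names, the minimum overlap and the gap length are assumptions chosen for illustration, and real assemblers also handle sequencing errors and reverse complements.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for length in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


def merge_reads(left, right, min_overlap=3):
    """Merge two overlapping reads into one contig, or return None if no overlap."""
    n = overlap_length(left, right, min_overlap)
    if n == 0:
        return None
    return left + right[n:]


def scaffold(contig1, contig2, gap=4):
    """Join two contigs with a run of N for the unsequenced gap.

    In practice the order and gap size come from paired-end reads;
    here the gap length is simply passed in (hypothetical value).
    """
    return contig1 + "N" * gap + contig2


# Toy reads (made up for this example)
print(merge_reads("GAATTCAAAACT", "AAACTAGAGG"))
print(scaffold("ACGT", "TTGA"))
```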
Sequence data
Type            Amount
454             25 Gb
SOLiD           ~60 Gb
SBM             3 Gb
BAC ends        300,000 reads (~150,000 pairs)
Fosmid ends     150,000 reads (~75,000 pairs)
BAC contigs     70 Mb, 36% of euchromatin
Physical map    10X BAC coverage
NGS (454, SOLiD, SBM): ~90 Gb
Currently more than 80% available
Available physical map
• Generated with KeyGene’s Whole Genome Profiling method
• Input: ~92,000 BACs from 4 BAC libraries (3 enzyme libraries and a new random shear library)
• Assembled into approx. 2500 contigs!
Physical map
• “Classical” FPC: bands
• Sequence-based FPC: sequence tags
26 nt sequence tag GAATTCAAAACTAGAGGAATGAACCA
BAC 1
BAC 2
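A minimal sketch of the idea behind sequence-based fingerprinting: two BACs that share sequence tags probably overlap. The set representation, the function names and the threshold of two shared tags are assumptions for illustration, not the Whole Genome Profiling implementation. Only the first tag comes from the slide; the others are placeholders.

```python
# Tag shown on the slide (26 nt)
SLIDE_TAG = "GAATTCAAAACTAGAGGAATGAACCA"


def shared_tags(bac_a, bac_b):
    """Tags present in both BACs."""
    return set(bac_a) & set(bac_b)


def likely_overlap(bac_a, bac_b, min_shared=2):
    """Call two BACs overlapping if they share at least `min_shared` tags.

    The threshold is a hypothetical value chosen for this example.
    """
    return len(shared_tags(bac_a, bac_b)) >= min_shared


# Placeholder tags stand in for real 26 nt sequences
bac1 = {SLIDE_TAG, "tag_B", "tag_C"}
bac2 = {SLIDE_TAG, "tag_C", "tag_D"}
bac3 = {"tag_E", "tag_F"}

print(len(SLIDE_TAG))
print(likely_overlap(bac1, bac2))
print(likely_overlap(bac1, bac3))
```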
Assembly strategy
Map Assemble Integrate
• De novo assembly of sequence data
  – 454, SOLiD, SBM, clone ends
• Data and assembly integration
• Mapping of data and assemblies onto WGP map
A possible assembly scenario
Physical map
454 454 454
SOLID SOLID SOLID
SBM SBM SBM SBM
BAC Contig BAC Contig BAC Contig
454 Assembly
SOLID Assembly
Existing BAC contigs
SBM Assembly
400,000 BAC-end pairs
400,000 Fosmid-end pairs
First international 454 assembly
Approx. # runs: 58
# reads: 58 × 10^6
# bases: 21 Gb (22.4X)
Approx. # nts WGS: 15.0 Gb (15.8X)
Approx. # nts PE: 6.3 Gb (6.6X)
• Selection of high-quality 454 runs
  – Approx. 85% of currently available 454 data
  – Criteria: read length, linker presence (if applicable)
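The coverage figures for the 454 data follow from coverage = bases / genome size. The slides do not state the genome size used, so the sketch below back-calculates it from the quoted numbers; the function names are illustrative, and the 0.95 Gb value in the last line is an assumption derived from that back-calculation.

```python
def implied_genome_size_gb(bases_gb, coverage):
    """Genome size (Gb) implied by a data volume and its quoted coverage."""
    return bases_gb / coverage


def coverage(bases_gb, genome_gb):
    """Fold coverage for a given data volume and genome size."""
    return bases_gb / genome_gb


# (data volume in Gb, quoted coverage) from the 454 slide
quoted = {
    "all 454 bases": (21.0, 22.4),
    "WGS": (15.0, 15.8),
    "PE": (6.3, 6.6),
}

for label, (gb, cov) in quoted.items():
    print(label, round(implied_genome_size_gb(gb, cov), 3), "Gb")

# Assumption: ~0.95 Gb genome, applied to the ~60 Gb of SOLiD data
print(round(coverage(60.0, 0.95), 1), "X")
```

All three rows point to a genome size a little under 1 Gb.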
22X genome coverage
                              All contigs   Large contigs   Scaffolds
# Sequences                   152,034       85,762          10,091
Total sequence length (Mb)    742           725             792
Avg. sequence length          4,882         8,458           78,476
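As a consistency check on the table above, the average sequence length should equal total length divided by number of sequences. The sketch below recomputes it from the rounded Mb totals, so small deviations from the reported averages are expected; the dictionary layout and function name are illustrative.

```python
# (number of sequences, total length in Mb, reported average length) from the table
TABLE = {
    "All contigs": (152_034, 742, 4_882),
    "Large contigs": (85_762, 725, 8_458),
    "Scaffolds": (10_091, 792, 78_476),
}


def average_length(n_sequences, total_mb):
    """Average sequence length in bases, from a count and a total in Mb."""
    return total_mb * 1_000_000 / n_sequences


for name, (n, total_mb, reported) in TABLE.items():
    computed = average_length(n, total_mb)
    deviation = abs(computed - reported) / reported
    print(f"{name}: computed {computed:.0f}, reported {reported}, "
          f"deviation {deviation:.2%}")
```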
Scaffolds of the international assembly
95% of assembly in 5% of the scaffolds
75% of assembly in scaffolds larger than 1 Mb
First 250 Mb (size of euchromatin) in 49 scaffolds larger than 3.4 Mb
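Statements of this kind can be derived from a list of scaffold lengths. The sketch below shows one way to compute them. The function names are illustrative and the lengths are made-up toy values, not the project's scaffolds.

```python
def fraction_in_largest(lengths, top_fraction):
    """Fraction of total assembly length held by the largest `top_fraction` of sequences."""
    ordered = sorted(lengths, reverse=True)
    n_top = max(1, round(len(ordered) * top_fraction))
    return sum(ordered[:n_top]) / sum(ordered)


def fraction_above(lengths, min_length):
    """Fraction of total assembly length in sequences longer than `min_length`."""
    return sum(n for n in lengths if n > min_length) / sum(lengths)


def count_to_reach(lengths, target_bases):
    """Number of largest sequences needed to reach `target_bases`, or None if unreachable."""
    running = 0
    for count, n in enumerate(sorted(lengths, reverse=True), start=1):
        running += n
        if running >= target_bases:
            return count
    return None


# Toy scaffold lengths in bases (hypothetical)
toy = [5_000_000, 3_000_000, 1_500_000, 400_000, 100_000]
print(fraction_in_largest(toy, 0.20))
print(fraction_above(toy, 1_000_000))
print(count_to_reach(toy, 8_000_000))
```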
Assembly vs. known sequences
Mitochondrion: 95%
Chloroplast: 99%
BAC ends: 93%
ESTs: 95%
Link to physical map
Physical map: 2521 contigs, 261,913 tags
Assembly: 10,091 scaffolds
MAP: 90% of tags (236,454) on unique location
LINK: 10,085 links, avg. overlap 26.8 tags
2521 contigs, 3940 scaffolds, 770 Mb
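The MAP and LINK steps can be sketched as follows: place each tag on the assembly, keep only tags with a single location, then count the tags shared by each physical-map contig and scaffold. This is an illustrative outline under assumed data structures, not the pipeline used in the project. All names, the toy data and the minimum of two tags per link are hypothetical.

```python
from collections import Counter


def unique_tag_locations(tag_hits):
    """MAP: keep tags that hit exactly one scaffold; returns {tag: scaffold}."""
    return {tag: hits[0] for tag, hits in tag_hits.items() if len(hits) == 1}


def link_contigs_to_scaffolds(contig_tags, tag_to_scaffold, min_tags=2):
    """LINK: count uniquely placed tags shared by each (contig, scaffold) pair."""
    links = Counter()
    for contig, tags in contig_tags.items():
        for tag in tags:
            scaffold = tag_to_scaffold.get(tag)
            if scaffold is not None:
                links[(contig, scaffold)] += 1
    return {pair: n for pair, n in links.items() if n >= min_tags}


# Toy input: scaffolds hit by each tag (t3 is repetitive, t5 is unplaced)
tag_hits = {
    "t1": ["scaf_1"],
    "t2": ["scaf_1"],
    "t3": ["scaf_1", "scaf_2"],
    "t4": ["scaf_2"],
    "t5": [],
}
# Toy input: tags belonging to each physical-map contig
contig_tags = {"ctg_A": ["t1", "t2", "t3"], "ctg_B": ["t4", "t5"]}

placed = unique_tag_locations(tag_hits)
print(placed)
print(link_contigs_to_scaffolds(contig_tags, placed))

# Check of the slide's figure: 236,454 of 261,913 tags is about 90%
print(round(236_454 / 261_913, 3))
```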
SOLiD: Genomic DNA 60X genome coverage in SOLiD reads
Genomic DNA of Heinz: 1 kb, 4-5 kb and 10 kb insert sizes
De novo assembly
Assembly cleanup/ Mapping on 454 assembly
SOLiD RNAseq
• Annotation
• Assembly
HEINZ Poly A mRNA Ready for RNase III
2x
Total RNA from different tissues and organs of HEINZ
Marco Piettrella
• Annotation
• Assembly
• SNP discovery
Pimpi Poly A mRNA
Ready for RNase III
SOLiD RNA seq
Total RNA from different tissues and organs of Pimpi
EELM
We are not done yet…!
• Complete 454 assembly and hybrid assemblies
  – 454, SBM, clone ends
• Quality control of data and assemblies
• Assembly cleanup by SOLiD data
• Anchoring to physical map
• Annotation
More details and discussion tomorrow in joint WP-meeting 5.4 and 6.1 (16.00-17.30)
Conclusions
• Exciting NextGen Sequencing approach
• > 80% of data has been produced
  – Sequence data (454, SOLiD, SBM) and high-quality physical map
• Preliminary results are promising
  – Assembled 792 Mb in scaffolds
  – More than 90% of the ESTs are in these scaffolds
  – 95% of assembly in only 5% of the scaffolds
First draft of the genome to be presented in January 2010 (PAG, San Diego)
Acknowledgements
• Italy: Giovanni Giuliano, Giorgio Valle et al.
• France: Mondher Bouzayen et al.
• Japan: Satoshi Tabata et al.
• India: Akhilesh Tyagi et al.
• Spain: Antonio Granell et al.
• US: Jim Giovannoni et al.
• UK: Gerhard Bishop et al.
• Netherlands:
  – WUR/CBSG/EUSOL: Elio Schijlen, Jose van de Belt, Marjo van Staveren, Erwin Datema, Jan van Haarst, Bas te Lintel, Roeland van Ham, Willem Stiekema, Rene Klein Lankhorst
  – Keygene: Richard Feron, Jan van Oeveren, Marcel Prins, Michiel van Eijk, Marco van Schriek
• Roche: Lei Du, Jason Affourtit, Gerard Irzyk, Jim Knight, Marcus Droege, Hans Lunstroo