Matthew Hudson
Dept of Crop Sciences
University of Illinois
DNA and genome sequencing
Genome projects
2,424 ongoing genome projects
696 for eukaryotes
520 completed genomes
47 from eukaryotes
Almost every crop now has a genome project
DNA Sequencing
• Dideoxy sequencing was developed by Fred Sanger at Cambridge in the 1970s. Often called “Sanger sequencing”.
Nobel prize number 2 for Fred Sanger in 1980, shared with Walter Gilbert from Harvard (inventor of the now little-used Maxam-Gilbert sequencing method).
Sanger’s Dideoxy DNA sequencing method - How it works:
1. DNA template is denatured to single strands.
2. DNA primer (with 3’ end near sequence of interest) is annealed to the template DNA and extended with DNA polymerase.
3. Four reactions are set up, each containing:
   1. DNA template – e.g. a plasmid
   2. Primer
   3. DNA polymerase
   4. dNTPs (dATP, dTTP, dCTP, and dGTP)
4. Next, a different radio-labeled dideoxynucleotide (ddATP, ddTTP, ddCTP, or ddGTP) is added to each of the four reaction tubes at 1/100th the concentration of normal dNTPs…
ddNTPs are terminators: they possess a 3’-H instead of 3’-OH, compete in the reaction with normal dNTPs, and cannot form a phosphodiester bond with the next nucleotide.
Whenever the radio-labeled ddNTPs are incorporated in the chain, DNA synthesis terminates.
Terminators stop further elongation of a DNA deoxyribose-phosphate backbone
“hasta la vista”
Manual Dideoxy DNA sequencing - How it works (cont.):
5. Each of the four reaction mixtures produces a population of DNA molecules with DNA chains terminating at each “terminator” base.
6. Extension products in each of the four reaction mixtures also end with a different radio-labeled ddNTP (depending on the base).
7. Next, each reaction mixture is electrophoresed in a separate lane (4 lanes) at high voltage on a polyacrylamide gel.
8. Pattern of bands in each of the four lanes is visualized on X-ray film.
9. Location of “bands” in each of the four lanes indicate the size of the fragment terminating with a respective radio-labeled ddNTP.
10. DNA sequence is deduced from the pattern of bands in the 4 lanes.
Vigilant et al. 1989 PNAS 86:9350-9354
Short products
Long products
Radio-labeled ddNTPs (4 rxns)
Sequence (5’ to 3’)
GGATATAACCCCTGT
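The logic of reading a four-lane gel can be sketched in a few lines of Python. This is a toy model, not a simulation of the chemistry: it assumes every position in the newly synthesized strand yields a detectable terminated fragment, and the function names `termination_products` and `read_gel` are made up for illustration.

```python
# Toy model of manual dideoxy sequencing: four reactions, one per ddNTP.

def termination_products(new_strand):
    """Return {base: [fragment lengths]} for the four ddNTP reactions.

    new_strand is the strand being synthesized, written 5' to 3'.
    A fragment of length n ends in the ddNTP matching base n of that strand.
    """
    lanes = {"A": [], "C": [], "G": [], "T": []}
    for position, base in enumerate(new_strand, start=1):
        lanes[base].append(position)
    return lanes


def read_gel(lanes):
    """Read bands from the shortest fragment to the longest across all lanes."""
    bands = [(length, base) for base, lengths in lanes.items() for length in lengths]
    return "".join(base for _, base in sorted(bands))


sequence = "GGATATAACCCCTGT"  # the sequence shown with the gel figure
lanes = termination_products(sequence)
recovered = read_gel(lanes)

print(lanes["C"])   # [9, 10, 11, 12]: band positions in the ddCTP lane
print(recovered)    # GGATATAACCCCTGT
```

Short products run furthest on the gel, so sorting the bands by fragment length is equivalent to reading the film from the bottom up.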
Manual vs automatic sequencing
Manual sequencing has basically died out.
It needs four lanes per template and radioactive gels. In one day, a technician can run four sets of four lanes on one gel, with maybe 300 base pairs of data from each template.
Everyone now uses “automatic sequencing” – the downside is no one lab can afford the machine, so it is done in a central facility (eg. Keck center).
Most automated DNA sequencers can load robotically and operate around the clock for weeks with minimal labor.
Dye deoxy terminators
One tube. One gel lane or capillary
Robotic 96 capillary machine: ABI 3730 xl
DNA sequence output from ABI 377 (a gel-based sequencer)
1. Trace files (dye signals) are analyzed and bases called to create chromatograms.
2. Chromatograms from opposite strands are reconciled with software to create double-stranded sequence data.
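Reconciling reads from opposite strands comes down to reverse-complementing one read and comparing it with the other. The sketch below is a simplified illustration: it assumes both reads cover exactly the same region, and it marks any disagreement as `N`. Real software also weighs the trace quality at each position; the `reconcile` function here is hypothetical.

```python
# Minimal sketch: compare a forward read with a read from the opposite strand.

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the opposite strand, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def reconcile(forward, reverse):
    """Build a consensus; positions where the two strands disagree become N."""
    reverse_as_forward = reverse_complement(reverse)
    if len(forward) != len(reverse_as_forward):
        raise ValueError("this sketch assumes both reads cover the same region")
    return "".join(f if f == r else "N" for f, r in zip(forward, reverse_as_forward))


forward_read = "GGATATAACC"
good_reverse = "GGTTATATCC"   # exact reverse complement of forward_read
bad_reverse = "GGTTATATCG"    # same read with one miscalled base

print(reconcile(forward_read, good_reverse))  # GGATATAACC
print(reconcile(forward_read, bad_reverse))   # NGATATAACC
```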
Genome sequencing
How do you use these chunks of sequence to make a “whole genome” sequence?
The “traditional” genome
A physical map is made
A BAC “tiling path” is created
BACs are farmed out to hundreds of collaborating laboratories
Each lab does a few BACs
Arabidopsis, E. coli etc were done this way, but since Craig Venter got interested, everything is “going shotgun”
Shotgun Genome Sequencing
Slow and expensive... but accurate and complete, and assembly is straightforward
Much faster and cheaper; very hard to get complete genome assembly of large (>10Mb) genomes

Finished genome            | Shotgun genome       | Maize now
Whole chromosome sequences | 100kb average chunks | Some BAC contigs
Done clone by clone        | Need physical map    | MAGIs
e.g. human, Arabidopsis    | e.g. poplar          |
Shotgun sequencing
~700 bases per read
One or two reads per clone
Shotgun sequence of mouse, ~2.6Gb, 7x coverage
That’s 26,000,000 sequencing reactions, 13,000,000 minipreps…
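The read and miniprep counts follow directly from the genome size, coverage, and read length. The short calculation below uses the numbers from these slides and assumes two reads per clone (one miniprep per clone).

```python
# Back-of-envelope shotgun arithmetic for the mouse genome.
genome_size_bp = 2_600_000_000   # ~2.6 Gb
coverage = 7                     # 7x coverage
read_length_bp = 700             # ~700 bases per read
reads_per_clone = 2              # assumption: both ends of each clone are read

reads_needed = genome_size_bp * coverage // read_length_bp
clones_needed = reads_needed // reads_per_clone

print(f"{reads_needed:,} sequencing reactions")   # 26,000,000 sequencing reactions
print(f"{clones_needed:,} minipreps")             # 13,000,000 minipreps
```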
Extract DNA
Shear
Ligate into library
Pick clones
Grow clones
Extract vector DNA
Sequence using ddNTPs
Read fragments with gel or capillary
The genome factory
There are a few centers around the world that have a “factory” big enough to do shotgun sequence of a large eukaryotic genome:
Broad Institute, MIT
Baylor College of Medicine, Houston
Washington University, St Louis
DoE Joint Genome Institute, Walnut Creek, CA
Sanger Centre, Cambridge
Beijing Genomics Institute, Chinese Academy of Sciences
Pictures from JGI
Qpix robot – picks colonies
Biomek – PCR / cleanup robot
PCR – 384 x 4 x 48 x 3
About 150 sequencers, at $200,000 each…
Sequence analysis
Bioinformatics
Armies of programmers and large supercomputers are necessary to assemble and annotate the sequence
Assembly and annotation
Assembly – we have to compare those 30,000,000 sequences with each other and work out how they fit together. Nasty mathematical problem…
Annotation – when we have the sequence, we have to work out where the genes are and what they do. Mostly a computational problem – very large databases.
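To give a feel for the assembly problem, here is a toy greedy assembler that repeatedly merges the two reads with the longest suffix-to-prefix overlap. It is only a sketch under strong assumptions: error-free reads, all from the same strand, no repeats, and no read contained inside another. Production assemblers use far more sophisticated methods, and the all-against-all comparison below would be hopeless for millions of reads.

```python
# Toy greedy overlap assembly (illustration only).

def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # nothing overlaps any more: leave separate contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


reads = ["GGATATAA", "TATAACCCC", "ACCCCTGT"]
print(greedy_assemble(reads))   # ['GGATATAACCCCTGT']
```

Every pass compares all pairs of sequences, which is why the real problem with tens of millions of reads needs serious computing power.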
Whole-genome resequencing
Wouldn’t it be great to have the whole genome of each line you work with? Then the whole genome would be haplotyped.
Whole plant or metazoan genomes still cost $40-50m
NIH have target for human genome to cost $100,000 in 2010
$1,000 in 2020
This is likely to be achieved ahead of schedule
Human resequencing technology is likely to have a big impact on plant biology also.
Cost of sequencing is falling exponentially
[Figure: cost per base ($), log scale from 0.001 to 10, plotted against year, 1994 to 2006]
Robotic 96 capillary machine: ABI 3730 xl
DNA sequence output from ABI 377 (a gel-based sequencer)
1. Trace files (~350KB / run)
2. Analyzed and bases called to create sequence and quality files (~2KB / run)
3. One run is about 700 base pairs (bp)
4. Typical genome project – soybean – 6M runs so far
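These per-run figures add up quickly over a whole project. The rough calculation below uses the soybean numbers from this slide; it assumes both file sizes are in kilobytes and uses decimal units (1 GB = 1,000,000 KB).

```python
# Rough data volume for a Sanger-based genome project.
runs = 6_000_000          # soybean project, runs so far
trace_kb_per_run = 350    # raw trace file
called_kb_per_run = 2     # sequence + quality files (assumed to be kilobytes)
bases_per_run = 700

trace_total_gb = runs * trace_kb_per_run // 1_000_000
called_total_gb = runs * called_kb_per_run // 1_000_000
raw_gigabases = runs * bases_per_run / 1e9

print(f"trace files:   ~{trace_total_gb:,} GB")    # ~2,100 GB
print(f"called files:  ~{called_total_gb:,} GB")   # ~12 GB
print(f"raw sequence:  ~{raw_gigabases:.1f} Gb")   # ~4.2 Gb
```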
Limits to how cheap sequencing can get using the Sanger method
~700 bases per read
One or two reads per clone
Cost: $2 per read, high throughput
Plus costs of clone generation ~$1
Total current lowest cost, ~$5/kb, 0.5c / Q20 base
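The per-kilobase figure can be checked from the per-read costs. The calculation below assumes all 700 bases of a read are usable. In practice fewer bases reach Q20, which is presumably why the slide rounds up to ~$5/kb and 0.5c per Q20 base.

```python
# Sanger cost per kb from per-read costs.
sequencing_cost_per_read = 2.0   # dollars, high throughput
clone_cost_per_read = 1.0        # dollars, clone generation
read_length_bp = 700             # assumes every base in the read is usable

cost_per_read = sequencing_cost_per_read + clone_cost_per_read
cost_per_kb = cost_per_read / read_length_bp * 1000
cents_per_base = cost_per_read * 100 / read_length_bp

print(f"${cost_per_kb:.2f} per kb")         # $4.29 per kb
print(f"{cents_per_base:.2f}c per base")    # 0.43c per base
```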
Extract DNA
Shear
Ligate into library
Pick clones
Grow clones
Extract vector DNA
Sequence using ddNTPs
Read fragments with gel or capillary
Next-generation sequencing
A number of proprietary technologies, most based on the manipulation of microbeads and/or nanobeads where sequencing is performed without gels or capillaries
First on the market was a company called 454 (now Roche), now on its second generation of instruments.
454 have a major competitor in Solexa (now Illumina)
Recently AB announced its own next-generation platform, SOLiD (AB acquired Agencourt)
Next-generation sequencing approach
Extract and shear DNA
Isolate clonal molecules on beads
“polony” amplification
Immobilize on solid support
Fluorescent or luminescent readout in situ
No E. coli
No plasmids
No freezers
No hydras
No gels
No capillaries
454 Sequencing technology
Picowell (50nm) technology
Sequencing by synthesis using chemiluminescence
GS20:
20Mb of sequence for ~$5,000 in running costs
Quality is similar to early ESTs (97-98% at best)
We have no clone information, so no read pairings
Homopolymer…
“flowgram file” – binary SFF format
About 250 MB per run
Similar to trace file – contains luminosity readings for each of 1.6M wells from a photomultiplier, for each of four bases, for each of 42 flow cycles
Processed using on-board FPGA with instrument
Others have tried to improve software, but 454’s is still best all round
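As a sanity check on the flowgram file size, the number of luminosity readings per run can be counted from the figures above. The calculation assumes "about 250 MB" means decimal megabytes. The bytes-per-reading result is only an inference from these slide numbers, not a statement about how the SFF format actually stores data.

```python
# How many luminosity readings does one 454 run produce?
wells = 1_600_000
bases = 4
flow_cycles = 42
file_size_bytes = 250 * 10**6   # "about 250 MB", decimal megabytes assumed

readings = wells * bases * flow_cycles
bytes_per_reading = file_size_bytes / readings

print(f"{readings:,} readings per run")                # 268,800,000 readings per run
print(f"~{bytes_per_reading:.2f} bytes per reading")   # ~0.93 bytes per reading
```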
Data output
454 “FLX”
Claimed: 100 Mb per run, 200+ base reads
Cost: ~$12,000 / run in reagents & basic maintenance
Ours delivered Tues June 12 – no data yet
1Gb of sequence for < $3,000 in running costs
Data output
No access to data yet, reportedly:
A series of huge image files
Each is color
Analysis uses image analysis techniques
Raw data output is ~ 500GB per run
Current customers say compute infrastructure cannot cope
100s of CPU hours to process one run
Raw data currently must be discarded
Polony sequencing / ABI SOLiD
George Church’s group invented “polony” method
Since developed by Agencourt
Now bought by ABI
Similar to Solexa – no wells, small beads, 4-color fluorescent detection, about 1G per run, about $3,000 per run
Uses ligation of nucleotide-specific probes rather than reversible terminators
Summary of NGS technologies
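One way to compare the platforms is running cost per megabase, using only the approximate figures quoted in these slides. This is a rough sketch, not the original summary table. It reads "100 MB" and "1G" as megabases and gigabases (an assumption) and ignores instrument purchase, labor, and differences in read length and accuracy.

```python
# Approximate running cost per megabase, from the figures quoted above.
# Each entry: (cost in dollars, yield in kilobases)
platforms = {
    "Sanger capillary": (5, 1),            # ~$5 per kb
    "454 GS20": (5_000, 20_000),           # 20 Mb for ~$5,000
    "454 FLX (claimed)": (12_000, 100_000),  # 100 Mb for ~$12,000
    "ABI SOLiD": (3_000, 1_000_000),       # about 1 Gb for about $3,000
}

cost_per_mb = {
    name: dollars * 1000 / kilobases
    for name, (dollars, kilobases) in platforms.items()
}

for name, cost in cost_per_mb.items():
    print(f"{name:20s} ~${cost:,.0f} per Mb")
```

On these numbers, each platform generation lowers the cost per megabase by roughly one to two orders of magnitude relative to Sanger sequencing.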