dna and genome sequencing - illinoisstan.cropsci.uiuc.edu/courses/cpsc265/class9.pdf · dna...

Matthew Hudson, Dept of Crop Sciences

University of Illinois

DNA and genome sequencing

Genome projects

2,424 ongoing genome projects

696 for eukaryotes

520 completed genomes

47 from eukaryotes

Almost every crop now has a genome project

DNA Sequencing

• Dideoxy sequencing was developed by Fred Sanger at Cambridge in the 1970s. Often called “Sanger sequencing”.

Nobel Prize number 2 for Fred Sanger in 1980, shared with Walter Gilbert from Harvard (inventor of the now little-used Maxam-Gilbert sequencing method).

Sanger’s Dideoxy DNA sequencing method - How it works:

1. DNA template is denatured to single strands.

2. DNA primer (with 3’ end near sequence of interest) is annealed to the template DNA and extended with DNA polymerase.

3. Four reactions are set up, each containing:

   1. DNA template – e.g. a plasmid
   2. Primer
   3. DNA polymerase
   4. dNTPs (dATP, dTTP, dCTP, and dGTP)

4. Next, a different radio-labeled dideoxynucleotide (ddATP, ddTTP, ddCTP, or ddGTP) is added to each of the four reaction tubes at 1/100th the concentration of normal dNTPs……

ddNTPs are terminators: they possess a 3’-H instead of a 3’-OH, compete in the reaction with normal dNTPs, and cannot form a phosphodiester bond with the next nucleotide.

Whenever the radio-labeled ddNTPs are incorporated in the chain, DNA synthesis terminates.

Terminators stop further elongation of a DNA deoxyribose-phosphate backbone

“hasta la vista”

Manual Dideoxy DNA sequencing - How it works (cont.):

5. Each of the four reaction mixtures produces a population of DNA molecules with chains terminating at each “terminator” base.

6. Extension products in each of the four reaction mixtures also end with a different radio-labeled ddNTP (depending on the base).

7. Next, each reaction mixture is electrophoresed in a separate lane (4 lanes) at high voltage on a polyacrylamide gel.

8. Pattern of bands in each of the four lanes is visualized on X-ray film.

9. Location of “bands” in each of the four lanes indicate the size of the fragment terminating with a respective radio-labeled ddNTP.

10. DNA sequence is deduced from the pattern of bands in the 4 lanes.
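The steps above can be sketched in a few lines of Python. This is a toy model, not a simulation of the chemistry: it assumes every possible termination product is present, counts lengths from the end of the primer, and uses hypothetical function names. Each lane holds the lengths of the products that end in that lane's ddNTP, and the sequence is read by walking the bands from shortest to longest.

```python
def chain_termination_fragments(new_strand):
    """Return, for each ddNTP lane, the lengths of the terminated products.

    new_strand is the newly synthesized strand, written 5' to 3'.
    A product of length n ends in the base at position n, so it appears
    in the lane whose ddNTP matches that base.
    """
    lanes = {"A": [], "C": [], "G": [], "T": []}
    for length, base in enumerate(new_strand, start=1):
        lanes[base].append(length)
    return lanes


def read_gel(lanes):
    """Read bands from shortest to longest product to recover the sequence."""
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


lanes = chain_termination_fragments("GGATATAACCCCTGT")
print(lanes["G"])       # [1, 2, 14]
print(read_gel(lanes))  # GGATATAACCCCTGT
```

Short products run furthest on the gel, so reading from the shortest band upward gives the new strand 5’ to 3’.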

[Figure: autoradiograph of a manual sequencing gel, from Vigilant et al. 1989, PNAS 86:9350-9354. Labels: short products; long products; radio-labeled ddNTPs (4 rxns). Sequence read (5’ to 3’): GGATATAACCCCTGT]

Manual vs automatic sequencing

Manual sequencing has basically died out.

It needs four lanes per template and radioactive gels. In one day a technician can get four sets of four lanes from one gel, with maybe 300 base pairs of data from each template.

Everyone now uses “automatic sequencing” – the downside is that no single lab can afford the machine, so it is done in a central facility (e.g. the Keck Center).

Most automated DNA sequencers can load robotically and operate around the clock for weeks with minimal labor.

Dye deoxy terminators

One tube. One gel lane or capillary

Robotic 96 capillary machine: ABI 3730 xl

DNA sequence output from ABI 377 (a gel-based sequencer)

1. Trace files (dye signals) are analyzed and bases called to create chromatograms.

2. Chromatograms from opposite strands are reconciled with software to create double-stranded sequence data.
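A minimal sketch of step 2, assuming the two reads cover exactly the same region and ignoring quality values (real base-calling and editing software weighs the trace quality at each position). The function names are hypothetical. The reverse-strand read is reverse-complemented and compared base by base with the forward read; disagreements are flagged as N.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the opposite strand, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def reconcile(forward, reverse):
    """Merge a forward read with a reverse-strand read of the same region.

    Positions where the two strands disagree are reported as N.
    """
    flipped = reverse_complement(reverse)
    if len(flipped) != len(forward):
        raise ValueError("this toy example needs reads of equal length")
    return "".join(f if f == r else "N" for f, r in zip(forward, flipped))


print(reconcile("GGATATAACCCCTGT", "ACAGGGGTTATATCC"))  # GGATATAACCCCTGT
print(reconcile("GGATATAACCCCTGT", "ACAGGGGTTATATCT"))  # NGATATAACCCCTGT
```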

Genome sequencing

How do you use these chunks of sequence to make a “whole genome” sequence?

The “traditional” genome

A physical map is made
A BAC “tiling path” is created
BACs are farmed out to hundreds of collaborating laboratories
Each lab does a few BACs

Arabidopsis, E. coli etc were done this way, but since Craig Venter got interested, everything is “going shotgun”

Shotgun Genome Sequencing

Slow and expensive
...but accurate and complete
and assembly is straightforward

Much faster and cheaper
Very hard to get complete genome
Assembly of large (>10Mb) genomes

Finished genome: whole chromosome sequences; done clone by clone; e.g. human, Arabidopsis
Shotgun genome: 100kb average chunks; need physical map; e.g. poplar
Maize now: some BAC contigs; MAGIs

Shotgun sequencing

~700 bases per read

One or two reads per clone

Shotgun sequence of mouse, ~2.6 Gb, 7x coverage

That’s 26,000,000 sequencing reactions, 13,000,000 minipreps…
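The read and miniprep counts follow directly from the figures on this slide, as the sketch below shows. It reads the genome size as ~2.6 gigabases and assumes two reads per clone. The last function is an extra assumption not taken from the slide: if reads landed uniformly at random, the expected fraction of the genome left uncovered at coverage c would be about e^-c. Real libraries are not uniform, so treat that figure as optimistic.

```python
import math


def reads_needed(genome_size, coverage, read_length):
    """Number of sequencing reads for a given fold coverage."""
    return genome_size * coverage // read_length


def uncovered_fraction(coverage):
    """Expected uncovered fraction, ASSUMING uniformly random reads."""
    return math.exp(-coverage)


genome_size = 2_600_000_000  # ~2.6 Gb mouse genome
reads = reads_needed(genome_size, coverage=7, read_length=700)
minipreps = reads // 2       # assumes two reads per clone

print(reads)      # 26000000
print(minipreps)  # 13000000
print(round(genome_size * uncovered_fraction(7)))  # bases never sampled
```

Even under this idealized model a 7x shotgun leaves a couple of million bases unsampled, which is one reason a complete genome is hard to reach by shotgun alone.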

Extract DNA -> Shear -> Ligate into library -> Pick clones -> Grow clones -> Extract vector DNA -> Sequence using ddNTPs -> Read fragments with gel or capillary

The genome factory

There are a few centers around the world that have a “factory” big enough to do shotgun sequence of a large eukaryotic genome:

Broad Institute, MIT
Baylor College of Medicine, Houston
Washington University, St Louis
DoE Joint Genome Institute, Walnut Creek, CA
Sanger Centre, Cambridge
Beijing Genomics Institute, Chinese Academy of Sciences

Pictures from JGI

Qpix robot – picks colonies

Biomek – PCR / cleanup robot

PCR – 384 x 4 x 48 x 3

About 150 sequencers, at $200,000 each…

Sequence analysis

Bioinformatics

Armies of programmers and large supercomputers are necessary to assemble and annotate the sequence

Assembly and annotation

Assembly – we have to compare those 30,000,000 sequences with each other and work out how they fit together. Nasty mathematical problem…

Annotation – when we have the sequence, we have to work out where the genes are and what they do. Mostly a computational problem – very large databases.
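To give a feel for the assembly step, here is a toy greedy overlap assembler. It is a sketch only, with hypothetical function names: it assumes error-free reads, all from the same strand, and no repeats. Production assemblers use far more elaborate methods. It repeatedly merges the two reads with the longest suffix-to-prefix overlap.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no overlaps left: the remaining contigs stay separate
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


print(greedy_assemble(["GGATATAAC", "ATAACCCC", "CCCCTGT"]))
# ['GGATATAACCCCTGT']
```

Every pass compares every sequence with every other one. With 30,000,000 reads that is on the order of 4.5 x 10^14 pairs, before sequencing errors and repeats are even considered, which is why the problem is nasty.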

Whole-genome resequencing

Wouldn’t it be great to have the whole genome of each line you work with? Then the whole genome would be haplotyped.

Whole plant or metazoan genomes still cost $40-50m

NIH have target for human genome to cost $100,000 in 2010

$1,000 in 2020

This is likely to be achieved ahead of schedule

Human resequencing technology is likely to have a big impact on plant biology also.
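The two NIH targets above imply a rate of decline, which can be worked out with a short calculation. The sketch assumes a steady exponential fall between the two target years. That is a simplification, and the function name is hypothetical.

```python
import math


def halving_time(cost_start, year_start, cost_end, year_end):
    """Years for the cost to halve, ASSUMING a steady exponential decline."""
    fold_change = cost_start / cost_end
    return (year_end - year_start) * math.log(2) / math.log(fold_change)


# NIH targets: $100,000 per human genome in 2010, $1,000 in 2020
print(round(halving_time(100_000, 2010, 1_000, 2020), 2))  # 1.51
```

A hundredfold drop in ten years means the cost must halve roughly every year and a half.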

Cost of sequencing is falling exponentially

[Figure: cost per base ($), logarithmic axis from 0.001 to 10, plotted against year, 1994 to 2006]

Robotic 96 capillary machine: ABI 3730 xl

DNA sequence output from ABI 377 (a gel-based sequencer)

1. Trace files (~350KB / run)

2. Analyzed and bases called to create sequence and quality files (~2KB / run)

3. One run is about 700 base pairs (bp)

4. Typical genome project – soybean – 6M runs so far
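Multiplying out the figures in this list shows the scale of the data. This is a rough sketch: it uses decimal units (1 TB = 10^9 KB) and treats the ~2 figure for sequence and quality files as a file size in kilobytes, which is an assumption.

```python
def project_totals(runs, trace_kb, called_kb, bases_per_run):
    """Total storage and raw sequence for a Sanger genome project."""
    return {
        "trace_tb": runs * trace_kb / 1e9,             # KB -> TB
        "called_gb": runs * called_kb / 1e6,           # KB -> GB
        "raw_gigabases": runs * bases_per_run / 1e9,   # bases -> Gb
    }


# Soybean: 6M runs so far, ~350 KB of trace and ~2 KB of calls per run
totals = project_totals(6_000_000, trace_kb=350, called_kb=2, bases_per_run=700)
print(totals)
```

On these numbers the traces come to about 2.1 TB, the called sequence and quality files to about 12 GB, and the raw sequence to about 4.2 gigabases.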

Limits to how cheap sequencing can get using the Sanger method

~700 bases per read

One or two reads per clone

Cost: $2 per read high throughput
Plus costs of clone generation ~$1
Total current lowest cost, ~$5/kb, 0.5c / Q20 base
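The per-kilobase figure can be checked from the per-read costs. The sketch assumes the ~$1 clone generation cost applies per read and that a read is 700 bases. Both are readings of the slide, not confirmed figures. Q20 is a Phred quality score: Q = -10 log10(p), so Q20 corresponds to a 1 in 100 chance that the base call is wrong.

```python
def cost_per_kb(cost_per_read, clone_cost_per_read, read_length):
    """Dollars per kilobase of raw sequence."""
    return (cost_per_read + clone_cost_per_read) * 1000 / read_length


def phred_error_probability(q):
    """Probability that a base with Phred quality q is called wrongly."""
    return 10 ** (-q / 10)


dollars_per_kb = cost_per_kb(2.0, 1.0, 700)
print(round(dollars_per_kb, 2))          # 4.29
print(round(dollars_per_kb / 10, 2))     # cents per base: 0.43
print(phred_error_probability(20))
```

About $4.29 per kb, or 0.43 cents per base, is in line with the quoted ~$5/kb and 0.5c per Q20 base, since not every base in a read reaches Q20.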

Extract DNA -> Shear -> Ligate into library -> Pick clones -> Grow clones -> Extract vector DNA -> Sequence using ddNTPs -> Read fragments with gel or capillary

Next-generation sequencing

A number of proprietary technologies, most based on the manipulation of microbeads and/or nanobeads where sequencing is performed without gels or capillaries

First on the market was a company called 454 (now part of Roche), now on its second generation of instruments.

454 have a major competitor in Solexa (now Illumina)

Recently AB announced its own next-generation platform, SOLiD (AB acquired Agencourt)

Next-generation sequencing approach

Extract and shear DNA
Isolate clonal molecules on beads
“Polony” amplification
Immobilize on solid support
Fluorescent or luminescent readout in situ

No E. coli

No plasmids

No freezers

No hydras

No gels

No capillaries

454 Sequencing technology

Picowell (50nm) technology

Sequencing by synthesis using chemiluminescence

GS20:

20Mb of sequence for ~$5,000 in running costs
Quality is similar to early ESTs (97-98% at best)
We have no clone information, so no read pairings

Homopolymer…

“flowgram file” – binary SFF format

About 250 MB per run

Similar to trace file – contains luminosity readings for each of 1.6M wells from a photomultiplier, for each of four bases, for each of 42 flow cycles

Processed using on-board FPGA with instrument

Others have tried to improve software, but 454’s is still best all round
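A rough sketch of what a flowgram holds and how it could be turned into bases. The flow order `TACG`, the rounding rule, and the function names are assumptions for illustration, not the instrument's actual base-calling algorithm. Each flow adds one nucleotide type, and the light signal is roughly proportional to how many identical bases were incorporated, so rounding the signal gives the homopolymer length.

```python
def readings_per_run(wells, bases, flow_cycles):
    """Number of luminosity readings stored for one run."""
    return wells * bases * flow_cycles


def decode_flowgram(signals, flow_order="TACG"):
    """Convert per-flow light signals into a base sequence.

    ASSUMES a repeating flow order and signal proportional to the number
    of bases incorporated; each signal is rounded to a whole count.
    """
    seq = []
    for i, signal in enumerate(signals):
        base = flow_order[i % len(flow_order)]
        seq.append(base * round(signal))
    return "".join(seq)


print(readings_per_run(1_600_000, 4, 42))  # 268800000
print(decode_flowgram([1.02, 0.05, 0.0, 2.1, 0.9, 1.1, 3.9, 0.1]))  # TGGTACCCC
```

The reading count is of the same order as the quoted file size in bytes. The rounding step also shows why homopolymers are the weak point: a signal of 3.9 is easy to call as four bases, but for longer runs the gap between n and n+1 bases shrinks relative to the noise.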

Data output

454 “FLX”

Claimed: 100 MB per run, 200+ base reads

Cost: ~$12,000 / run in reagents & basic maintenance

Ours delivered Tues June 12 – no data yet

1Gb of sequence for < $3,000 in running costs

Data output

No access to data yet, reportedly:

A series of huge image files
Each is color
Analysis uses image analysis techniques
Raw data output is ~500GB per run
Current customers say compute infrastructure cannot cope
100s of CPU hours to process one run
Raw data currently must be discarded

Polony sequencing / ABI SOLiD

George Church’s group invented “polony” method

Since developed by Agencourt

Now bought by ABI

Similar to Solexa – no wells, small beads, 4-color fluorescent detection, about 1G per run, about $3,000 per run

Uses ligation of nucleotide-specific probes rather than reversible terminators
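Putting the running-cost figures quoted in these slides on a common scale makes the comparison easier. The sketch below uses only numbers given above and reads “100 MB” and “1G” as megabases and gigabases of sequence, which is an assumption. The costs are reagent and running costs only, so instrument purchase and compute are excluded.

```python
def cost_per_megabase(run_cost_dollars, yield_kilobases):
    """Running cost in dollars per megabase of raw sequence."""
    return run_cost_dollars * 1000 / yield_kilobases


# (cost in dollars, yield in kilobases), as quoted in the slides
platforms = {
    "Sanger capillary": (5, 1),           # ~$5 per kb
    "454 GS20": (5_000, 20_000),          # 20 Mb for ~$5,000
    "454 FLX": (12_000, 100_000),         # claimed 100 Mb for ~$12,000
    "ABI SOLiD": (3_000, 1_000_000),      # about 1 Gb for about $3,000
}

for name, (cost, kilobases) in platforms.items():
    print(f"{name}: ${cost_per_megabase(cost, kilobases):,.0f} per Mb")
```

On these figures the cost per megabase falls from about $5,000 for Sanger to $250 and $120 for the two 454 instruments and about $3 for SOLiD, without accounting for differences in read length or accuracy.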

Summary of NGS technologies
