Genome Sequencing--Strategies - nslc · 2005-04-25
Genome Sequencing--Strategies
Bio 4342, Spring ’05
[Figure: Vertebrate genomes: completed or in progress. Tree of the Chordata with branches labeled Fishes, Amphibians, Reptiles, Birds, Monotremes, Marsupials, Rodents, Carnivores, Primates, and H. sapiens.]
Overview
Jargon--terms used to describe assembled sequence data
Genomes--what are they and how do we approach sequencing them
Genome sequencing strategies, attributes and examples
Jargon
Contig: an assembly of clones, based on shared sequence (assembly) or shared restriction fragments (physical map)
Supercontig: contigs that are associated in order by virtue of long-range read pair links
Ultracontig: supercontigs that are associated in order by virtue of linkage to a physical map
N50 number: a length-weighted median of contig or supercontig (scontig) size, such that half of the nucleotides in an assembly lie in contigs (or scontigs) of N50 size or greater
Paired end reads: the forward- and reverse-primed sequence reads that define clone ends and insert size
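The N50 definition can be made concrete with a short calculation. The following is a minimal Python sketch, assuming the common convention that N50 is the length of the smallest contig in the set of largest contigs that together hold at least half of the assembled bases. The function name and the contig lengths are made up for illustration.

```python
def n50(lengths):
    """Return the N50 of a list of contig (or supercontig) lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk from the largest contig down, accumulating bases, and stop at
    # the contig that takes the running sum to half of the assembly.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig sizes in kb (illustrative values only).
contig_lengths = [80, 70, 50, 40, 30, 20]
print(n50(contig_lengths))  # 70: the 80 kb and 70 kb contigs hold 150 of 290 kb
```

Because the statistic is weighted by length, a handful of large contigs can dominate it, which is why it is usually reported alongside the total assembly size.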
What is a genome?
A genome can be defined as the entire DNA content of each nucleated cell in an organism
Each organism has one or more chromosomes that contain all of its genetic information: its genome
Humans, for example, have a genome that is encoded on 46 chromosomes, organized into 23 pairs. One chromosome of each pair is inherited from the mother and one from the father. One pair of chromosomes determines sex (“sex chromosomes”) and the other 22 pairs are called autosomes
The object of the Human Genome Project was to determine the entire DNA sequence of each of these long DNA molecules (chromosomes) to a high quality (finished sequence), and to locate and identify each of the human genes
Genome sequencing process
Library creation
Production sequencing
Assembly
Prefinishing (x 2)
Finishing
Factors that determine sequencing strategy
Genome size
Chromosomal structure
Repeat content and character
Polymorphism/inbreeding
Desired end product (draft, finished)
Supporting resources and information (physical or genetic maps, closely related sequenced genomes, EST sequences)
Genome size determines…
Number and types of subclones
Number of sequence reads
Compute power and algorithm for assembly of raw sequencing data
In general, genome size and complexity scale with the size of the organism
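To see how genome size drives the number of sequence reads, a back-of-the-envelope calculation helps. The sketch below is illustrative only: the 8x coverage target and 600-base read length are assumed values, not figures from these slides, and the covered-fraction estimate uses the idealized Poisson approximation in which reads land uniformly at random.

```python
import math


def reads_needed(genome_size_bp, coverage, read_length_bp):
    """Reads required so that total read bases = coverage x genome size."""
    return math.ceil(genome_size_bp * coverage / read_length_bp)


def expected_fraction_covered(coverage):
    """Expected fraction of bases hit by at least one read (Poisson model)."""
    return 1.0 - math.exp(-coverage)


# Assumed inputs: a ~3 Gb genome, 8x coverage, 600-base reads.
print(reads_needed(3_000_000_000, 8, 600))    # 40000000
print(round(expected_fraction_covered(8), 5))
```

Real projects need more than this estimate suggests, since failed reads, vector sequence, and repeats all reduce the usable coverage.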
Chromosomal structure
Centromeres and telomeres
Pairing at meiosis (even numbers?)
Haploid or diploid genome
Numbers and types/extent of sequence duplications
Repeat content and character
Length and degree of degeneracy of repeats are important to know
Cot-based methods can help to characterize the repeat classes in a genome
Some newer strategies eliminate repeat content from libraries via “Cot subtraction”
Polymorphism/inbreeding
The extent to which an organism is inbred determines the degree of polymorphism (heterozygosity) within the genome
A high degree of heterozygosity can complicate assembly, depending upon the sequencing strategy and the assembly algorithm used
A preliminary assessment of polymorphism (“het testing”) can determine the sequencing strategy
The desired end product helps to determine strategy
“Finished” means completed to contiguity and high quality, with no or few gaps in the sequence and any remaining gaps sized
“Comparative draft” means ordered and oriented contigs with gaps that are sized
“Draft” means assembled into contigs (and supercontigs if possible), without order and orientation
Supporting genomic resources
Physical maps provide a context or scaffold for the sequence assembly contigs, or provide an ordered series (“tile path”) of clones for clone-based sequencing
Genetic maps typically provide context in terms of simple sequence repeats that occur “near” genes
The availability of genome sequence for a closely related organism can provide some support for assembly validation
ESTs (expressed sequence tags) are partial gene sequences obtained by cloning and sequencing mRNA populations
General considerations
Timeline
Desired end product
Library making capacity/ability
Cost ($$)
Algorithms available for sequence assembly
Common sequencing strategies
Whole genome shotgun (WGS)
Clone-by-clone (physical map required)
Hybrid approach (WGS and clone-by-clone)
Map-assisted WGS (physical map required)
Whole genome sequencing strategy
Sequencing by whole genome shotgun using several subclone insert sizes/types
Large insert clone (fosmid/BAC) end-sequences used for increased contig alignment and scaffolding
Anchoring of assembled contigs to the genome accomplished through computational identification of known markers or sequences within assembled WGS contigs
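The scaffolding step, in which paired end sequences from large insert clones tie contigs together, can be sketched as a simple link count. This is a toy illustration and not the method of any particular assembler: the function name, the read and contig names, and the two-link threshold are all assumptions, and real scaffolders also check read orientation and insert size.

```python
from collections import Counter


def contig_links(read_to_contig, read_pairs, min_links=2):
    """Count paired-end links between different contigs.

    read_to_contig maps each placed read name to its contig.
    read_pairs lists (forward, reverse) read names from the same clone.
    Only contig pairs joined by at least min_links clones are returned.
    """
    counts = Counter()
    for fwd, rev in read_pairs:
        a = read_to_contig.get(fwd)
        b = read_to_contig.get(rev)
        # Skip unplaced reads and pairs that fall within a single contig.
        if a is None or b is None or a == b:
            continue
        counts[tuple(sorted((a, b)))] += 1
    return {pair: n for pair, n in counts.items() if n >= min_links}


# Hypothetical placements: two clones span c1-c2, one spans c2-c3.
placements = {
    "r1f": "c1", "r1r": "c2",
    "r2f": "c1", "r2r": "c2",
    "r3f": "c2", "r3r": "c3",
    "r4f": "c1", "r4r": "c1",
}
pairs = [("r1f", "r1r"), ("r2f", "r2r"), ("r3f", "r3r"), ("r4f", "r4r")]
print(contig_links(placements, pairs))  # {('c1', 'c2'): 2}
```

Requiring more than one supporting clone is a cheap guard against chimeric clones and misplaced repeat reads creating false joins.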
[Figure: Whole genome shotgun. Labels: genome; shotgun (paired ends); large insert clones; assemble sequence reads; scaffold.]
WGS Strategy
Advantages include few libraries and rapid accumulation of sequence data
Disadvantages include lack of confirmation of genome-scale order/organization, including the assembly of repeats
Absent a physical map for organization, a WGS approach is best suited for low complexity, small genomes (e.g. bacterial)
WGS Assembly algorithms
Arachne
PCAP
Apollo
Phusion
Etc.
All WGS algorithms deal poorly with repeated regions
Clone-by-clone strategy
A clone-by-clone strategy involves the initial creation of a physical map of large-insert clones (BACs or fosmids)
Physical maps are built by generating restriction digests of individual clones (typically 15x coverage), then using computer-based algorithms to assemble contigs of related clones
A “minimal tile path” is then generated, to provide minimally overlapping large insert clones that span the genome
These clones provide the input DNA for sequencing, including the DNA for shotgun libraries. Assembly and finishing are organized by individual clone
Computer assembly of finished clones produces the whole chromosome sequence
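Choosing a minimal tile path from a mapped contig resembles the classic interval-covering problem. The sketch below is a simplified greedy version, assuming each clone already has start and end coordinates on the map. The function name, clone names, and coordinates are hypothetical, and a real tile path would also enforce a minimum overlap between neighbors.

```python
def minimal_tile_path(clones, region_start, region_end):
    """Greedily pick the fewest clones that cover a region.

    clones is a list of (name, start, end) tuples in map coordinates.
    Returns the chosen clone names in order along the region.
    """
    path = []
    covered_to = region_start
    remaining = sorted(clones, key=lambda clone: clone[1])
    i = 0
    while covered_to < region_end:
        best = None
        # Among clones starting at or before the covered point,
        # keep the one that reaches farthest to the right.
        while i < len(remaining) and remaining[i][1] <= covered_to:
            if best is None or remaining[i][2] > best[2]:
                best = remaining[i]
            i += 1
        if best is None or best[2] <= covered_to:
            raise ValueError(f"gap in clone coverage at position {covered_to}")
        path.append(best[0])
        covered_to = best[2]
    return path


# Hypothetical clones with map coordinates in kb.
clones = [("A", 0, 100), ("B", 20, 120), ("C", 90, 200),
          ("D", 110, 210), ("E", 190, 300)]
print(minimal_tile_path(clones, 0, 300))  # ['A', 'C', 'E']
```

The error raised on a gap mirrors the practical situation: where the map has no spanning clone, a tile path cannot be chosen until the map is extended.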
Bacterial Artificial Chromosomes (BACs)
BACs are DNA “vectors” that contain specific sequences:
- a BAC can be put into a bacterial cell, such as E. coli
- the BAC DNA is replicated and copies go to daughter cells during cell division
- ~100,000 base long pieces of DNA are cloned into BACs
- BAC clones containing genomic DNA represent a manageable size of DNA for mapping studies: they can be amplified, isolated and characterized by restriction digestion to determine shared sequences
BAC Fingerprints
[Figure: fingerprint gel image, with size labels 29,950 bp and 540 bp.]
Digested BAC DNA is electrophoresed on 1.2% agarose gels for 8 hours at 3 V/cm. Each gel contains 96 sample and 25 marker lanes.
How are BAC fingerprints used??
We use them to create a fingerprint map…
A fingerprint map is a set of fingerprints (restriction digests of BACs) that are assembled into “contigs”.
- Contigs are clusters of related BAC clones (they share restriction fragments, judging by their fingerprints)
- Together, the BAC clones in a contig represent the genomic region from which the clones were derived
FACT: The human and mouse genomes each required over 300,000 BAC fingerprints!!
How did we put all those together?
Computer-aided BAC fingerprint assembly
Individual gel lanes are identified
Peaks are used to represent the sizes of the fragments
Fingerprint contig assembly
A computer program called FPC is used to look at each BAC clone fingerprint and determine relationships
Using FPC, overlapping clones with common restriction fragments are identified
The process is repeated to assemble the contig until no further clones can be incorporated
Manual editing of contigs is usually required
[Figure: overlapping clones A through G.]
[Figure: A contig from the human BAC fingerprint map.]
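The idea of relating clones by their common restriction fragments can be illustrated with a toy comparison of two fingerprints. This is only a stand-in for what FPC does and does not reproduce its actual scoring. The function name, the size tolerance, and the fragment sizes below are assumptions chosen for illustration.

```python
def shared_fragments(fp_a, fp_b, tolerance=10):
    """Count fragments that match in size between two fingerprints.

    Each fingerprint is a list of fragment sizes in bp. Two fragments
    match when their sizes differ by at most tolerance, and each
    fragment can be matched only once.
    """
    a, b = sorted(fp_a), sorted(fp_b)
    i = j = shared = 0
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= tolerance:
            shared += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return shared


# Hypothetical fingerprints (fragment sizes in bp).
clone_a = [540, 1200, 3300, 8000, 29950]
clone_b = [1203, 3295, 8004, 15000]
clone_c = [700, 2500, 15010]

print(shared_fragments(clone_a, clone_b))  # 3 shared bands: likely neighbors
print(shared_fragments(clone_a, clone_c))  # 0 shared bands: no evidence of overlap
print(shared_fragments(clone_b, clone_c))  # 1 shared band: weak evidence
```

A raw count like this ignores how many bands each clone has, so chance matches become more likely for clones with many fragments. That is one reason manual editing of contigs remains necessary.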
What is the end result of fingerprint mapping?
1. Most BAC clones have been assembled into contigs.
2. The length (in base pairs) of the contigs is roughly equal to the anticipated size of the genome.
3. The BAC fingerprint map can be used to select a set of BAC clones that overlap one another to a minimal extent. These so-called “minimal tiling path” BACs will be used for the next phase of DNA sequencing…
After finishing, each BAC fingerprint can be compared to an “in silico digest” of the clone to ensure quality in the finished sequence
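An in silico digest simply cuts the finished sequence at every occurrence of the enzyme's recognition site and reports the fragment sizes, which can then be compared with the bands seen on the gel. The sketch below assumes EcoRI (site GAATTC, cut after the G) purely as an example, since the slides do not name the enzyme. It also treats the sequence as linear, and the test sequence is invented.

```python
def in_silico_digest(sequence, site="GAATTC", cut_offset=1):
    """Return fragment lengths from cutting a linear sequence at each site.

    cut_offset is the number of bases into the recognition site at which
    the enzyme cuts (1 for EcoRI, which cuts G^AATTC).
    """
    sequence = sequence.upper()
    cuts = []
    pos = sequence.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = sequence.find(site, pos + 1)
    boundaries = [0] + cuts + [len(sequence)]
    return [end - start for start, end in zip(boundaries, boundaries[1:])]


# Invented 23-base sequence containing two EcoRI sites.
finished = "AAAA" + "GAATTC" + "CCCCC" + "GAATTC" + "TT"
print(in_silico_digest(finished))  # [5, 11, 7]
```

If the predicted fragment sizes disagree with the clone's gel fingerprint, that points to a possible misassembly or a missing stretch in the finished sequence.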
Clone-by-clone strategy
Advantage is provided by the physical map: an independent means of confirming sequence assembly in advance of sequencing. A well developed algorithm for clone-based assembly is freely available (phrap)
Disadvantages include the large numbers and types of libraries required, and the need to wait for the physical map to be ~complete before selecting a minimal tile path for sequencing
Mainly suited for large, complex genomes with a high degree of heterozygosity
Hybrid approach
Assemble WGS reads and add to BAC shotgun reads
Eric D. Green, “Strategies for the systematic sequencing of complex genomes,” Nature Reviews Genetics, Volume 2, August 2001, 573
Mouse/Rat genome sequencing
Examples of hybrid approach
Two main components:
Whole genome shotgun
- high genome coverage
- rapid genome survey (aid human annotation)
BAC-by-BAC
- BACs selected from fingerprint database
- low sequence coverage
BAC end-sequences of good quality
Hybrid approach strategy
Advantages include lots of early data from WGS, time to build physical map, pre-determined genome assembly
Disadvantages include need to generate many different libraries/types, few/poor assembly algorithms available, expensive
Mainly applied to large, more complex genomes
Map-assisted WGS strategy
Physical map constructed of large insert clones concurrent with WGS sequencing of variable insert size subclones
End sequences of mapped clones enable linking of map contigs to sequence assembly contigs; supercontigs result
Linkage enables the organization of finishing into discrete units (supercontigs) of known order
[Figure: Maplink Viewer]
Map-assisted WGS strategy
Ultimately, this strategy seeks to organize the sequence assembly using the map, and to complete the map using the sequence assembly
Gap closure of the map and sequence can be accomplished by identifying gap-spanning clones, shotgun sequencing them, and assembling in the completed clone sequence
This is a newer strategy, and we have used it on smaller genomes such as fungi to develop the computational tools necessary to implement it
Conclusions
Many factors contribute to selecting a genome sequencing strategy
Strategies and the algorithms to enable them are constantly being developed and refined
Generating the production-style data for a genome project is rarely the rate-limiting step; finishing and gap closure are
Having a physical map is key to being able to verify a genome assembly, but newer strategies entwine the sequence and map building processes