genome sequencing-- strategies - nslc · 2005-04-25 · 2 overview jargon--terms used to describe...


Genome Sequencing--Strategies

Bio 4342, Spring '05

[Figure: Chordata. Vertebrate genomes: completed or in progress. Group labels: Amphibians, Fishes, Reptiles, Birds, Marsupials, Monotremes, Rodents, Primates, Carnivores, H. sapiens]


Overview

Jargon--terms used to describe assembled sequence data

Genomes--what are they and how do we approach sequencing them

Genome sequencing strategies, attributes and examples

Jargon

Contig: an assembly of clones, based on shared sequence (assembly) or shared restriction fragments (physical map)

Supercontig: contigs that are associated in order by virtue of long-range read pair links

Ultracontig: supercontigs that are associated in order by virtue of linkage to a physical map

N50 number: a length-weighted median of contig or supercontig (“scontig”) size, such that half of the nucleotides in an assembly lie in contigs (or supercontigs) of N50 size or greater

Paired end reads: the forward- and reverse-primed sequence reads that define clone ends and insert size
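The N50 definition above can be made concrete with a short calculation. The sketch below is only an illustration: the function name `n50` and the contig sizes are made up for the example, not taken from any assembly.

```python
def n50(lengths):
    """Return the N50 of a list of contig (or supercontig) lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk the contigs from longest to shortest, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        # The first contig that brings the running sum to half of all
        # bases defines the N50.
        if 2 * running >= total:
            return length


# Hypothetical contig sizes in kb (total 300 kb).
contig_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(contig_lengths))
```

With these sizes the two longest contigs (80 + 70) already hold half of the 300 kb, so the N50 is 70 kb.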


What is a genome?

A genome can be defined as the entire DNA content of each nucleated cell in an organism

Each organism has one or more chromosomes that contain all of its genetic information: its genome

Humans, for example, have a genome that is encoded on 46 chromosomes, organized into 23 pairs. One chromosome of each pair is inherited from the mother and one from the father. One pair of chromosomes determines sex (the “sex chromosomes”) and the other 22 pairs are called autosomes

The object of the Human Genome Project was to determine the entire DNA sequence of each of these long DNA molecules (chromosomes) to a high quality (finished sequence), and to locate and identify each of the human genes

Genome sequencing process

Library creation

Production sequencing

Assembly

Prefinishing

X 2

Finishing


Factors that determine sequencing strategy

Genome size
Chromosomal structure
Repeat content and character
Polymorphism/inbreeding
Desired end product (draft, finished)
Supporting resources and information (physical or genetic maps, closely related sequenced genomes, EST sequences)

Genome size determines…

Number and types of subclones
Number of sequence reads
Compute power & algorithm for assembly of raw sequencing data

As a rough rule, genome size and complexity scale with the size of the organism, though with notable exceptions
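How genome size drives the number of sequence reads can be shown with simple arithmetic. The sketch below assumes values that are not in the slides: a 600 bp average read length, 8x target coverage, and the Lander-Waterman approximation for the fraction of bases covered. The function names are hypothetical.

```python
import math


def reads_needed(genome_size_bp, read_length_bp, coverage):
    """Number of reads that gives the requested average coverage."""
    return math.ceil(coverage * genome_size_bp / read_length_bp)


def expected_fraction_covered(coverage):
    """Lander-Waterman estimate of the fraction of bases hit by >= 1 read."""
    return 1.0 - math.exp(-coverage)


# Assumed inputs: 600 bp reads at 8x coverage.
for name, size in [("bacterium, 5 Mb", 5_000_000),
                   ("mammal, 3 Gb", 3_000_000_000)]:
    print(name, reads_needed(size, 600, 8))

print(round(expected_fraction_covered(8), 4))
```

Under these assumptions, a 600-fold increase in genome size means a 600-fold increase in reads, which is why the number of subclones and the compute power for assembly grow with the genome.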


Chromosomal structure

Centromeres and telomeres
Pairing at meiosis (even numbers?)
Haploid or diploid genome
Numbers and types/extent of sequence duplications

Repeat content and character

Length and degree of degeneracy of repeats is important to know

Cot-based methods can help to characterize the repeat classes in a genome

Some newer strategies eliminate repeat content from libraries via “Cot subtraction”


Polymorphism/inbreeding

The extent to which an organism is inbred determines the degree of polymorphism (heterozygosity) within the genome

A high degree of heterozygosity can complicate assembly, depending upon the sequencing strategy and the assembly algorithm used

A preliminary assessment of polymorphism (“het testing”) can determine the sequencing strategy

The desired end product helps to determine strategy

“Finished”: completed to contiguity and high quality, with no/few gaps in the sequence, gaps sized

“Comparative draft”: ordered and oriented contigs with gaps that are sized

“Draft”: assembled into contigs (and supercontigs if possible), without order and orientation


Supporting genomic resources

Physical maps provide a context or scaffold for the sequence assembly contigs, or provide an ordered series (“tile path”) of clones for clone-based sequencing

Genetic maps typically provide context in terms of simple sequence repeats that occur “near” genes

The availability of genome sequence for a closely related organism can provide some support for assembly validation

ESTs are partial gene sequences obtained by cloning and sequencing mRNA populations

General considerations

Timeline
Desired end product
Library making capacity/ability
Cost ($$)
Algorithms available for sequence assembly


Common sequencing strategies

Whole genome shotgun (WGS)
Clone-by-clone (physical map required)
Hybrid approach (WGS and clone-by-clone)
Map-assisted WGS (physical map required)

Whole genome sequencing strategy

Sequencing by whole genome shotgun using several subclone insert sizes/types

Large insert clone end-sequences (fosmids/BACs) used for increased contig alignment and scaffolding

Anchoring of assembled contigs to the genome accomplished through computational identification of known markers or sequences within assembled WGS contigs
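The way paired end sequences scaffold contigs can be sketched as a small linking step. This is a toy illustration, not how any production assembler works: it assumes contig orientation is already resolved, that each link is given as (left contig, right contig), and that a join needs at least `min_links` supporting clones. The function name and all data are hypothetical.

```python
from collections import Counter


def build_scaffolds(pair_links, min_links=2):
    """Chain contigs into supercontigs from read-pair links.

    pair_links: iterable of (left_contig, right_contig) tuples, one per
    clone whose two end reads landed in different contigs.
    """
    counts = Counter(pair_links)
    successor = {}
    has_predecessor = set()
    # Consider the best-supported joins first; keep each contig to at
    # most one neighbor on each side so the result is a set of chains.
    for (left, right), n in counts.most_common():
        if (n >= min_links and left not in successor
                and right not in has_predecessor):
            successor[left] = right
            has_predecessor.add(right)

    contigs = {c for pair in counts for c in pair}
    scaffolds = []
    # Start a chain at every contig with no accepted predecessor.
    for start in sorted(contigs - has_predecessor):
        chain = [start]
        while chain[-1] in successor:
            chain.append(successor[chain[-1]])
        scaffolds.append(chain)
    return scaffolds


links = ([("c1", "c2")] * 3 + [("c2", "c3")] * 2
         + [("c4", "c1")] + [("c5", "c6")] * 2)
print(build_scaffolds(links))
```

Here the single clone linking c4 to c1 falls below the support threshold, so c4 stays unplaced while c1, c2 and c3 form one supercontig.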


[Figure: Whole genome shotgun. Labels: genome; shotgun (paired ends); large insert clones; assemble sequence reads; scaffold]

WGS Strategy

Advantages include few libraries and rapid accumulation of sequence data

Disadvantages include lack of genome-scale order/organization confirmation, including assembly of repeats

Absent a physical map for organization, a WGS approach is best suited for low complexity, small genomes (e.g. bacterial)


WGS Assembly algorithms

Arachne
PCAP
Apollo
phusion
Etc.

All WGS algorithms deal poorly with repeated regions

Clone-by-clone strategy

A clone-by-clone strategy involves the initial creation of a physical map of large-insert clones (BACs or fosmids)

Physical maps are built by generating restriction digests of individual clones (typically 15x coverage), then using computer-based algorithms to assemble contigs of related clones

A “minimal tile path” is then generated, to provide minimally overlapping large insert clones that span the genome

These clones provide the input DNA for sequencing, including the DNA for shotgun libraries. Assembly and finishing are organized by individual clone

Computer assembly of finished clones produces the whole chromosome sequence
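Selecting a minimal tile path from a physical map can be sketched as a greedy interval cover. The code below is an illustration under simplifying assumptions: each clone has known start and end coordinates on the map, and any overlap (even a touching boundary) is accepted, whereas a real tile path would require enough overlap to join the clone sequences. The function name, clone names and coordinates are hypothetical.

```python
def minimal_tile_path(clones):
    """Pick a small set of overlapping clones spanning the mapped region.

    clones: dict of clone name -> (start, end) position on the map.
    """
    ordered = sorted(clones.items(), key=lambda item: item[1][0])
    region_end = max(end for _, end in clones.values())
    covered_to = ordered[0][1][0]  # leftmost mapped position
    path = []
    i = 0
    while covered_to < region_end:
        best = None
        # Among clones that start inside the covered region, take the
        # one that reaches farthest to the right.
        while i < len(ordered) and ordered[i][1][0] <= covered_to:
            if best is None or ordered[i][1][1] > best[1][1]:
                best = ordered[i]
            i += 1
        if best is None or best[1][1] <= covered_to:
            raise ValueError(f"gap in the map at position {covered_to}")
        path.append(best[0])
        covered_to = best[1][1]
    return path


# Hypothetical map coordinates in kb.
clones = {"A": (0, 150), "B": (50, 200), "C": (100, 260),
          "D": (180, 330), "E": (250, 400), "F": (300, 420)}
print(minimal_tile_path(clones))
```

Four of the six clones are enough to span this region; B and D are redundant and would not be sequenced.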


Bacterial Artificial Chromosomes (BACs)

BACs are DNA “vectors” that contain specific sequences:
- a BAC can be put into a bacterial cell, such as E. coli
- the BAC DNA is replicated and copies go to daughter cells during cell division
- ~100,000 base long pieces of DNA are cloned into BACs
- BAC clones containing genomic DNA represent a manageable size of DNA for mapping studies: they can be amplified, isolated and characterized by restriction digestion to determine shared sequences

BAC Fingerprints

[Figure: fingerprint gel image, with size labels 29,950 bp and 540 bp]

Digested BAC DNA is electrophoresed on 1.2% agarose gels for 8 hours at 3 V/cm. Each gel contains 96 sample and 25 marker lanes.


How are BAC fingerprints used??

We use them to create a fingerprint map…

A fingerprint map is a set of fingerprints (restriction digests of BACs) that are assembled into “contigs”.

- Contigs are clusters of related BAC clones (they share restriction fragments, judging by their fingerprints)
- Together, the BAC clones in a contig represent the genomic region from which the clones were derived

FACT: The human and mouse genomes each required over 300,000 BAC fingerprints!!

How did we put all those together?

Computer-aided BAC fingerprint assembly

Individual gel lanes are identified

Peaks are used to represent the sizes of the fragments


Fingerprint contig assembly

A computer program called FPC is used to look at each BAC clone fingerprint and determine relationships

Using FPC, overlapping clones with common restriction fragments are identified

The process is repeated to assemble the contig until no further clones can be incorporated

Manual editing of contigs is usually required

[Figure: clones A through G, with shared restriction fragments marked (*)]

[Figure: A contig from the human BAC fingerprint map]
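The idea of grouping clones by shared restriction fragments can be sketched in a few lines. This is not FPC's algorithm, whose scoring is not reproduced here. It is a simplified stand-in that counts bands matching within an assumed size tolerance and joins clones that share at least `min_shared` bands. All names, thresholds and fragment sizes are hypothetical.

```python
def shared_bands(fp_a, fp_b, tolerance=7):
    """Count fragments in fp_a that pair with a fragment in fp_b."""
    a, b = sorted(fp_a), sorted(fp_b)
    i = j = shared = 0
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= tolerance:
            shared += 1  # sizes agree within tolerance: same band
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return shared


def build_contigs(fingerprints, min_shared=3, tolerance=7):
    """Single-linkage grouping of clones that share enough bands."""
    names = sorted(fingerprints)
    parent = {name: name for name in names}  # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for idx, a in enumerate(names):
        for b in names[idx + 1:]:
            n = shared_bands(fingerprints[a], fingerprints[b], tolerance)
            if n >= min_shared:
                parent[find(b)] = find(a)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values())


# Hypothetical fragment sizes in bp for four clones.
fingerprints = {
    "A": [540, 1200, 2300, 4100, 8000],
    "B": [1203, 2298, 4105, 6000, 9500],
    "C": [4100, 6003, 9498, 12000, 15000],
    "D": [700, 3300, 5200, 20000, 29950],
}
print(build_contigs(fingerprints))
```

Clones A and C share only one band directly, but both overlap B, so all three end up in one contig while D remains a singleton. A naive threshold like this is easily misled by chance matches, which is one reason manual editing of contigs is usually required.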


What is the end result of fingerprint mapping?

1. Most BAC clones have been assembled into contigs.

2. The length (in base pairs) of the contigs is roughly equal to the anticipated size of the genome.

3. The BAC fingerprint map can be used to select a set of BAC clones that overlap one another to a minimal extent. These so-called “minimal tiling path” BACs will be used for the next phase of DNA sequencing…

After finishing, each BAC fingerprint can be compared to an “in silico digest” of the clone to ensure quality in the finished sequence
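An in silico digest simply cuts the finished sequence at every recognition site and lists the fragment sizes, which can then be compared with the bands seen on the gel. The sketch below makes several assumptions: the slides do not name the enzyme, so EcoRI (site GAATTC, cut after the G) is used purely as an illustration; the sequence is treated as linear; and the toy sequence, the "observed" bands and the tolerance are made up.

```python
def in_silico_digest(sequence, site, cut_offset):
    """Fragment lengths from cutting a linear sequence at each site."""
    sequence = sequence.upper()
    cuts = []
    pos = sequence.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = sequence.find(site, pos + 1)
    edges = [0] + cuts + [len(sequence)]
    return sorted(b - a for a, b in zip(edges, edges[1:]) if b > a)


def bands_agree(observed, predicted, tolerance):
    """True if both digests have the same band count and similar sizes."""
    if len(observed) != len(predicted):
        return False
    pairs = zip(sorted(observed), sorted(predicted))
    return all(abs(o - p) <= tolerance for o, p in pairs)


# Toy sequence with two EcoRI sites (illustrative only).
seq = "AAAA" + "GAATTC" + "CCCCC" + "GAATTC" + "TT"
predicted = in_silico_digest(seq, "GAATTC", 1)
print(predicted)
print(bands_agree([5, 7, 12], predicted, tolerance=1))
```

A missing or extra predicted fragment relative to the gel would point to a misassembly or a deletion in the finished clone sequence.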

Clone-by-clone strategy

Advantage is provided by the physical map: an independent means of confirming sequence assembly in advance of sequencing. A well developed algorithm for clone-based assembly is freely available (phrap)

Disadvantages include large numbers & types of libraries required, need to wait for physical map to be ~complete before selecting minimal tile path for sequencing

Mainly suited for large complex genomes with high degree of heterozygosity


Hybrid approach

Assemble WGS reads and add to BAC shotgun reads

Source: Eric D. Green, “Strategies for the systematic sequencing of complex genomes,” Nature Reviews Genetics, Volume 2, August 2001, p. 573

Mouse/Rat genome sequencing

Examples of hybrid approach

Two main components:

Whole genome shotgun
- high genome coverage
- rapid genome survey (aid human annotation)

BAC-by-BAC
- BACs selected from fingerprint database
- low sequence coverage

BAC end-sequences of good quality


Hybrid approach strategy

Advantages include lots of early data from WGS, time to build physical map, pre-determined genome assembly

Disadvantages include need to generate many different libraries/types, few/poor assembly algorithms available, expensive

Mainly applied to large, more complex genomes

Map-assisted WGS strategy

Physical map constructed of large insert clones concurrent with WGS sequencing of variable insert size subclones

End sequences of mapped clones enable linking of map contigs to sequence assembly contigs; supercontigs result

Linkage enables the organization of finishing into discrete units (supercontigs) of known order
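Linking map contigs to sequence contigs through clone end sequences can be sketched as follows. This is a toy under stated assumptions, not the tooling used in the course: the clone order along one map contig is taken as given, each end read is assumed to align to a single sequence contig, and each sequence contig is placed at the average map position of the clones that hit it. All names and data are hypothetical.

```python
from collections import defaultdict


def order_sequence_contigs(map_order, end_hits):
    """Order sequence contigs along one physical-map contig.

    map_order: clone names in their order along the map contig.
    end_hits: dict of clone name -> sequence contigs hit by its end reads.
    """
    rank = {clone: i for i, clone in enumerate(map_order)}
    positions = defaultdict(list)
    for clone, contigs in end_hits.items():
        if clone not in rank:
            continue  # clone is not part of this map contig
        for contig in contigs:
            positions[contig].append(rank[clone])
    # Place each sequence contig at the mean map rank of its clones.
    return sorted(
        positions,
        key=lambda c: (sum(positions[c]) / len(positions[c]), c),
    )


map_order = ["b1", "b2", "b3", "b4", "b5"]
end_hits = {
    "b1": ["s7", "s7"],
    "b2": ["s7", "s2"],
    "b3": ["s2", "s2"],
    "b4": ["s2", "s9"],
    "b5": ["s9"],
    "bX": ["s4"],  # unmapped clone, ignored
}
print(order_sequence_contigs(map_order, end_hits))
```

Clones such as b2 and b4, whose two ends land in different sequence contigs, are the ones that bridge neighboring contigs into a supercontig.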


[Figure: Maplink Viewer]

Map-assisted WGS strategy

Ultimately, this strategy seeks to organize the sequence assembly using the map, and to complete the map using the sequence assembly

Gap closure of the map and sequence can be accomplished by identifying gap-spanning clones, shotgun sequencing them, and assembling in the completed clone sequence

This is a newer strategy, and we have used it on smaller genomes such as fungi to develop the computational tools necessary to implement it


Conclusions

Many factors contribute to selecting a genome sequencing strategy

Strategies and the algorithms to enable them are constantly being developed and refined

Generating the production-style data for a genome project is rarely the rate-limiting step; finishing and gap closure are

Having a physical map is key to being able to verify a genome assembly, but newer strategies entwine the sequence and map building processes