Last lecture summary

Upload: kory

Post on 23-Mar-2016



Page 1: Last lecture summary

Last lecture summary

Page 2: Last lecture summary

• recombinant DNA technology
  • DNA polymerase (copy DNA), restriction endonucleases (cut DNA), ligases (join DNA)
  • DNA cloning – vector (plasmid, BAC), PCR

• genome mapping

relative locations of genes are established by following inheritance patterns

visual appearance of a chromosome when stained and examined under a microscope

the order and spacing of the genes, measured in base pairs

sequence map

Page 3: Last lecture summary

• genetic markers
  • polymorphic (alternative alleles)
  • restriction fragment length polymorphisms (RFLPs)
    • some restriction sites exist as two alleles
  • simple sequence length polymorphisms (SSLPs)
    • repeat sequences: minisatellites (repeat unit up to 25 bp), microsatellites (repeat unit of 2-4 bp)
  • single nucleotide polymorphisms (SNPs, pronounced “snips”)
    • positions in a genome where some individuals have one nucleotide and others have a different nucleotide

RFLP SSLP
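To make the RFLP idea concrete, here is a minimal Python sketch. The two allele sequences are invented for illustration, and the enzyme is assumed to be EcoRI (recognition site GAATTC, cutting after the first G). A single base change removes the site, so the two alleles give different fragment lengths.

```python
def digest(seq: str, site: str = "GAATTC", cut_offset: int = 1) -> list[int]:
    """Return the fragment lengths obtained by cutting seq at every site.

    cut_offset is the cut position inside the recognition site
    (1 for EcoRI, which cuts G^AATTC).
    """
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = seq.find(site, start + 1)
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


# Hypothetical alleles: the second differs by one base, which destroys the site.
allele_1 = "ATGCCGAATTCTTAGGC"
allele_2 = "ATGCCGAGTTCTTAGGC"

print(digest(allele_1))  # [6, 11]
print(digest(allele_2))  # [17]
```

The same single-base difference is also an example of a SNP; it shows up as an RFLP only because it happens to fall inside a restriction site.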

Page 4: Last lecture summary

New stuff

Page 5: Last lecture summary

DNA sequencing
• Sanger method (chain-termination method), developed 1977, Nobel Prize in Chemistry 1980
• The key principle: use of dideoxynucleotide triphosphates (ddNTPs) as DNA chain terminators.

source: http://openwetware.org/wiki/BE.109:Bio-material_engineering/Sequence_analysis

dNTP ddNTP
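The chain-termination principle can be sketched in a few lines of Python. This is an idealized toy model, not a simulation of real chemistry: it assumes that in each of the four ddNTP reactions, synthesis stops at every position carrying that base, and that fragments are separated perfectly by length. The function names are made up for this sketch.

```python
def sanger_fragments(new_strand: str) -> dict[str, list[int]]:
    """For each ddNTP reaction (A, C, G, T), list the lengths of terminated fragments."""
    lengths = {base: [] for base in "ACGT"}
    for position, base in enumerate(new_strand, start=1):
        # A ddNTP of this base can be incorporated here and stop synthesis.
        lengths[base].append(position)
    return lengths


def read_gel(lengths: dict[str, list[int]]) -> str:
    """Order all fragments by length and read off the terminating base of each."""
    by_length = sorted((n, base) for base, ns in lengths.items() for n in ns)
    return "".join(base for _, base in by_length)


strand = "GATTACA"
fragments = sanger_fragments(strand)
print(fragments)            # {'A': [2, 5, 7], 'C': [6], 'G': [1], 'T': [3, 4]}
print(read_gel(fragments))  # GATTACA
```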

Daniel Svozil
Show Flash video from How to sequence a genome (http://www.genome.gov/25019885 - c. 6, 7, 8, 9)
Page 7: Last lecture summary

Shotgun sequencing
• Current technology can only reliably sequence a short stretch: a ‘read’ is typically ~1000 bp.
• However, genomes are large. The sequence of a long DNA molecule has to be constructed from a series of shorter sequences.
• This is done by breaking the molecule into fragments (cleaving it with a restriction endonuclease), determining their sequences, and using a computer to search for overlaps and build up the master sequence.
• Shotgun sequencing is the standard approach for sequencing small prokaryotic genomes.
• It is much more difficult with larger eukaryotic genomes, because repeats can lead to assembly errors.
  • The human genome is repeat-rich: >50% repeats (50-500 kbp duplicated regions with >98% identity).
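The "search for overlaps and build up the master sequence" step can be sketched as a greedy merge. This is a toy illustration under strong assumptions (error-free reads, a single strand, no repeats longer than the overlap); real assemblers are far more elaborate. The function names and the reads are made up.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i == j:
                    continue
                n = overlap(a, b, min_len)
                if n > best_len:
                    best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # no overlaps left: the remaining contigs stay separate
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


reads = ["GTACGTTA", "ATGCGTAC", "GTTAGC"]
print(greedy_assemble(reads))  # ['ATGCGTACGTTAGC']
```

With repeats, two different reads can overlap equally well with the same neighbour, and a greedy merge like this one may join the wrong pair. That is the source of the errors mentioned for repeat-rich genomes.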

Page 8: Last lecture summary

Target

Copies

Shotgun

Sequence each short piece

Sequence assembly

Consensus

Finalizing (directed read)

source: slides by Martin Farach-Colton

Page 9: Last lecture summary

source: Brown T. A. , Genomes. 2nd ed. http://www.ncbi.nlm.nih.gov/books/NBK21129/

Page 10: Last lecture summary

Human Genome Project (HGP)
• Determine the sequence of the haploid human genome
• Governmentally funded (DOE)
• Began in 1990, working draft published in 2001, complete in 2003, last chromosome finished in 2006
• Cost: $3 billion
• Whose genome was sequenced?
  • The “reference genome” is a composite from several people who donated blood samples.

Page 11: Last lecture summary

Celera - competition begins
• In 1998, a similar, privately funded quest was launched by the American researcher Craig Venter and his company Celera Genomics.
• The $300,000,000 Celera effort was intended to proceed at a faster pace and at a fraction of the cost.
• Celera wanted to patent identified genes.
• Celera promised to release data annually (whereas the HGP released data daily). However, unlike the HGP, Celera would not permit free redistribution or scientific use of the data.
• For this reason, the HGP was compelled to release the first draft of the human genome (7 July 2000) before Celera.

Page 12: Last lecture summary

How did it end?
• March 2000: President Clinton announced that the genome sequence could not be patented and should be made freely available to all researchers.
• The statement sent Celera's stock plummeting and dragged down the biotechnology-heavy Nasdaq. The biotechnology sector lost about $50 billion in two days.
• Celera and the HGP jointly announced the draft sequence in 2000.
• The drafts covered about 83% of the genome.
• Improved drafts were announced in 2003 and 2005, filling in to ≈92% of the sequence currently.

Page 13: Last lecture summary

Human genome
• 3 billion bp, ~20 000 – 25 000 genes
• Only 1.1 – 1.4 % of the genome sequence codes for proteins.
• State of completion:
  • best estimate: 92.3% is complete
  • problematic unfinished regions: centromeres, telomeres (both contain highly repetitive sequences), some unclosed gaps
  • It is likely that the centromeres and telomeres will remain unsequenced until new technology is developed.
• Genome is stored in databases
  • Primary database: GenBank (http://www.ncbi.nlm.nih.gov/sites/entrez?db=nucleotide)
  • Additional data and annotation, tools for visualizing and searching
    • UCSC (http://genome.ucsc.edu)
    • Ensembl (http://www.ensembl.org)

Page 14: Last lecture summary

Hierarchical genome shotgun – HGS
• Also called: hierarchical shotgun sequencing, clone-by-clone sequencing, map-based shotgun sequencing, clone contig sequencing
• Adopted by HGP
• Strategy “map first, sequence second”
  • Create a physical map.
  • Divide chromosomes into smaller fragments.
  • Order (map) them to correspond to their respective locations on the chromosomes.
  • Determine the base sequence of each of the mapped fragments.

Page 15: Last lecture summary

Hierarchical genome shotgun – HGS
1. Map genome
  • Sequence-tagged sites (STSs) were used as genetic markers (landmarks): 200 to 500 bp DNA sequences that each occur only once in the genome
2. Copy target DNA
3. Make BAC library
  • cleave all target DNA copies randomly (partial cleavage by restriction endonuclease), insert these sub-clones into BACs
4. Physically map all BACs
5. Find a subset of BACs that cover target DNA
  • minimal tiling path
6. Shotgun sequence only BACs on the minimal tiling path
  • Divide BACs into fragments (ultrasound or pressure), do plasmid cloning, reconstruct BAC sequence
7. Fill in gaps between BACs
8. Merge into consensus sequence

The sequenced sub-clones are linked up to produce the DNA contig, which is the decoded version of the original source DNA. As this method progresses, larger and larger contigs are produced, until a single ordered contig of the genome is achieved.

Daniel Svozil
Show students video How_to_Sequence_a_Genome_1._Mapping (this can be found either at youtube http://youtu.be/UhQgSAIMs_s or at http://www.genome.gov/25019885, part Mapping)
Daniel Svozil
Show video Human_genome_sequencing-Animated_tutorial (downloaded from http://youtu.be/-gVh3z6MwdU)
Daniel Svozil
Show video WGS.swf
Page 16: Last lecture summary

http://www.nature.com/scitable/content/idealized-representation-of-the-hierarchical-shotgun-sequencing-48221

Page 17: Last lecture summary

Minimal tiling path

A collection of overlapping bacterial artificial chromosome (BAC) clones. The clones outlined in red, which provide a minimal tiling path across the corresponding genomic region, are selected for sequencing.
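Once the BACs are physically mapped, choosing a minimal tiling path is an interval-covering problem. The sketch below uses the classic greedy rule (always take the clone that starts within the covered region and reaches furthest), which yields the smallest number of clones for this simplified model. The clone names and coordinates are invented, and a real path would also require a minimum overlap between neighbouring clones, which is not modelled here.

```python
def minimal_tiling_path(clones, region_start, region_end):
    """Pick the fewest clones that cover [region_start, region_end).

    clones is a list of (name, start, end) tuples in map coordinates.
    """
    path = []
    covered = region_start
    while covered < region_end:
        # Clones that contain the first position not yet covered.
        candidates = [c for c in clones if c[1] <= covered < c[2]]
        if not candidates:
            raise ValueError(f"gap in clone coverage at position {covered}")
        best = max(candidates, key=lambda c: c[2])  # reaches furthest
        path.append(best[0])
        covered = best[2]
    return path


# Hypothetical BAC map (coordinates in kbp).
bacs = [
    ("A", 0, 150), ("B", 50, 180), ("C", 100, 300), ("D", 160, 320),
    ("E", 280, 450), ("F", 290, 500), ("G", 420, 500),
]
print(minimal_tiling_path(bacs, 0, 500))  # ['A', 'C', 'F']
```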

Page 18: Last lecture summary

Coverage
• As shown, individual nucleotides are represented by more than one read.
• Coverage is the average number of reads representing a given nucleotide in the reconstructed sequence.
• Let’s say that for a source strand of length G = 100 kbp we sequence R = 1 500 reads of average length L = 500 bp.
• Thus, we collect N = RL = 750 kbp of data.
• So we have sequenced every bp in the source N/G = 7.5 times on average.
• The coverage is 7.5X.
• Coverage in HGS adopted by HGP was 8X.
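The coverage arithmetic from this slide, as a small Python check. The second function is an addition not on the slide: it uses the standard Lander-Waterman approximation, which assumes reads land uniformly at random, so the chance that a given base is missed by all reads is about e^(-coverage).

```python
import math


def coverage(num_reads: int, read_length: float, genome_length: float) -> float:
    """Average number of reads covering each base: N / G with N = R * L."""
    return num_reads * read_length / genome_length


def fraction_uncovered(cov: float) -> float:
    """Approximate fraction of bases hit by no read (Poisson assumption, not from the slide)."""
    return math.exp(-cov)


c = coverage(num_reads=1500, read_length=500, genome_length=100_000)
print(c)                      # 7.5
print(fraction_uncovered(c))  # a small value, well below 0.1%
```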

Page 19: Last lecture summary

Whole genome shotgun – WGS
• Adopted by Celera.
• De facto application of shotgun sequencing to a large genome. Never done before on such a large scale.
• Expensive and time-consuming mapping is not performed.
• Each piece of DNA is cut into smaller fragments. Each fragment is sequenced first, and then overlapping sequences are joined together to create the contig.
• To achieve sufficient accuracy, higher coverage (20X) had to be used.
• The development of new algorithms was crucial for the assembly.

Daniel Svozil
Show video Shotgun_sequencing.flv
Page 20: Last lecture summary

Genome assembly
• Aligning and merging short fragments of DNA sequence in order to reconstruct the original (longer) sequence.
• reads (typically 500-1000 bp) are merged into contigs; contigs are arranged/merged into scaffolds
  • scaffold: a series of contigs that are in the right order but are not necessarily connected in one continuous stretch of sequence

source: Xiong, Essential Bioinformatics

Page 21: Last lecture summary

Genome assembly
• Can be very computationally intensive when performed at the whole-genome level.
• Major challenges:
  • sequence errors – can be corrected by drawing consensus sequence from an alignment of multiple overlapped sequences
  • contamination by bacterial vectors – can be removed using filtering programs prior to assembly
  • repeats – RepeatMasker (http://www.repeatmasker.org/) can be used to detect and mask repeats
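Correcting sequence errors by consensus can be illustrated with a column-wise majority vote. This is a simplified sketch: it assumes the reads are already aligned and padded with "-" to the same length, and it ignores base quality scores. The reads are invented; the second one carries a deliberate error.

```python
from collections import Counter


def consensus(aligned_reads: list[str], gap: str = "-") -> str:
    """Majority base in each alignment column, ignoring gap characters."""
    result = []
    for column in zip(*aligned_reads):
        counts = Counter(base for base in column if base != gap)
        # If no read covers the column, keep the gap; ties go to the first base seen.
        result.append(counts.most_common(1)[0][0] if counts else gap)
    return "".join(result)


aligned = [
    "ATGCGTAC----",
    "--GCGAACGT--",  # sequencing error: A instead of T in the sixth column
    "----GTACGTTA",
]
print(consensus(aligned))  # ATGCGTACGTTA
```

The erroneous A is outvoted by the two reads that agree on T, which is why higher coverage gives a more accurate consensus.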

Page 22: Last lecture summary

Forward-reverse constraint
• A common constraint for avoiding errors due to repeats is the forward-reverse constraint.
  • Sequence is generated from both ends of a clone, so the distance between the two opposing fragments of a clone is fixed to a certain range (defined by the clone length). When the constraint is applied, a fragment cannot be moved to a location outside this range, even if it has a perfect match with a repetitive element there, so it cannot cause misassembly.

source: Xiong, Essential Bioinformatics

Figure panels: no constraint, constraint. The red fragment is misassembled.
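A minimal sketch of how the forward-reverse constraint filters candidate placements. It assumes the forward read is already placed, that positions are start coordinates on the assembly, and that the clone size is known to within a range; all numbers and function names are made up for illustration.

```python
def satisfies_constraint(fwd_start: int, rev_start: int, read_len: int,
                         min_insert: int, max_insert: int) -> bool:
    """True if the two end reads span a distance compatible with the clone length."""
    span = rev_start + read_len - fwd_start
    return min_insert <= span <= max_insert


def allowed_placements(fwd_start: int, candidate_rev_starts: list[int], read_len: int,
                       min_insert: int, max_insert: int) -> list[int]:
    """Keep only the candidate positions of the reverse read that respect the constraint."""
    return [p for p in candidate_rev_starts
            if satisfies_constraint(fwd_start, p, read_len, min_insert, max_insert)]


# The reverse read matches perfectly at two places: its true position (2500)
# and a copy of the same repeat far away (9500). The clone is about 2 kbp long.
print(allowed_placements(1000, [2500, 9500], read_len=500,
                         min_insert=1800, max_insert=2200))  # [2500]
```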

Page 23: Last lecture summary

Sequence assemblers
• base calling: convert raw/processed data from a sequencing instrument into sequences and scores
  • individual bases have scores that reflect the likelihood that the base is correct/incorrect
  • in capillary sequencing, the sequence is identified from the chromatogram

source: Lee SH, Vigliotti VS, Pappu S., J Clin Pathol. 2010, 63(3) 235-9 PMID: 19858529

Page 24: Last lecture summary

PHRED
• Base caller
• Reads DNA sequence chromatogram files and analyzes the peaks to call bases.
• base quality score: PHRED examines the peaks around each base call to assign a score q to each base call that is logarithmically linked to the error probability P: $q = -10 \log_{10} P$. A typical threshold is q = 20, corresponding to a 1% error probability (99% correct base calling).
• Phred is a two-step process:
  • Training: Given a set of reads, labels as to which bases are correct, and a set of quality statistics for each base, produce a model that can predict error rates for unseen bases.
  • Application: Given new reads and quality statistics, predict the quality for each of the bases.
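The logarithmic link between the quality score q and the error probability P is the standard Phred definition, q = -10 log10(P). The short Python sketch below converts in both directions; the function names are made up for this example and are not part of the PHRED program.

```python
import math


def phred_quality(p_error: float) -> float:
    """Quality score q for a given error probability P."""
    return -10 * math.log10(p_error)


def error_probability(q: float) -> float:
    """Error probability P for a given quality score q."""
    return 10 ** (-q / 10)


for q in (10, 20, 30, 40):
    p = error_probability(q)
    print(f"q = {q}: P = {p:.4%}, accuracy = {1 - p:.4%}")
```

Each step of 10 in q divides the error probability by ten, so q = 20 means 1 error in 100 calls and q = 30 means 1 in 1000.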

Page 25: Last lecture summary

PHRAP
• Sequence assembly
• Takes PHRED base-call files with quality scores as input.
• Aligns individual fragments in a pairwise fashion. The base quality information is taken into account during the pairwise alignment.

Page 26: Last lecture summary

Personal human genomes
• Personal genomes were not sequenced in the Human Genome Project, to protect the identity of the volunteers who provided DNA samples.
• The following personal genomes were available as of July 2011:
  • Japanese male (2010, PMID: 20972442)
  • Korean male (2009, PMID: 19470904)
  • Chinese male (2008, PMID: 18987735)
  • Nigerian male (2008, PMID: 18987734)
  • J. D. Watson (2008, PMID: 18421352)
  • J. C. Venter (2007, PMID: 17803354)
• The HGP sequence is haploid; the sequence maps for Venter and Watson, for example, are diploid, representing both sets of chromosomes.