Whole Genome Sequencing (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 13, 2005 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign Most slides are taken/adapted from Serafim Batzoglou’s lectures


Whole Genome Sequencing

(Lecture for CS498-CXZ Algorithms in Bioinformatics)

Sept. 13, 2005

ChengXiang Zhai

Department of Computer Science, University of Illinois, Urbana-Champaign

Most slides are taken/adapted from Serafim Batzoglou’s lectures


Outline

• Practical challenges in genome sequencing

• Whole genome sequencing strategies

• Sequencing coverage (Lander-Waterman model)

• Overlap-Layout-Consensus approach


Challenges with Fragment Assembly

• Sequencing errors: ~1-2% of bases are wrong

• Repeats: cause false overlaps between reads (repeat content: bacterial genomes ~5%, mammals ~50%)

• Computation: ~ O(N^2), where N = # reads


Repeat Types

• Low-complexity DNA (e.g. ATATATATACATA…)

• Microsatellite repeats: (a1…ak)^N where k ~ 3-6 (e.g. CAGCAGTAGCAGCACCAG)

• Transposons/retrotransposons
  – SINE: Short Interspersed Nuclear Elements (e.g., Alu: ~300 bp long, 10^6 copies)
  – LINE: Long Interspersed Nuclear Elements (~500 - 5,000 bp long, 200,000 copies)
  – LTR retroposons: Long Terminal Repeats (~700 bp) at each end

• Gene families: genes duplicate & then diverge

• Segmental duplications: very long, very similar copies


The sequencing errors, repeats, and the complexity of genomes make it necessary to use many heuristics in practice…

The Shortest Superstring formulation is an over-simplification of the problem


Strategies for whole-genome sequencing

1. Hierarchical (clone-by-clone): yeast, worm, human
   i. Break genome into many long fragments
   ii. Map each long fragment onto the genome
   iii. Sequence each fragment with shotgun

2. Online version of (1) (walking): rice genome
   i. Break genome into many long fragments
   ii. Start sequencing each fragment with shotgun
   iii. Construct map as you go

3. Whole Genome Shotgun: fly, human, mouse, rat, fugu
   One large shotgun pass on the whole genome


Hierarchical Sequencing vs. Whole Genome Shotgun

• Hierarchical Sequencing
  – Advantages: easy assembly
  – Disadvantages: need to build a library & physical map; redundant sequencing

• Whole Genome Shotgun (WGS)
  – Advantages: no mapping, no redundant sequencing
  – Disadvantages: difficult to assemble and resolve repeats

Whole Genome Shotgun appears to be getting more popular…


Whole Genome Shotgun Sequencing

[Figure: the genome is cut many times at random; each fragment gives forward-reverse paired reads of ~500 bp each, a known distance apart]


Fragment Assembly

Cover region with ~7-fold redundancy

Overlap reads and extend to reconstruct the original genomic region

[Figure: overlapping reads covering the region]


Read Coverage

Length of genomic segment: G
Number of reads: N
Length of each read: L

Definition: Coverage C = NL / G

[Figure: reads stacked over the genomic segment, with coverage C]
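The coverage definition is simple arithmetic, so a short sketch may help. The function names and the example numbers (a 1 Mbp segment, 500 bp reads) are illustrative assumptions, not values from the lecture.

```python
import math


def coverage(num_reads, read_len, genome_len):
    """Coverage C = N * L / G."""
    return num_reads * read_len / genome_len


def reads_needed(target_coverage, read_len, genome_len):
    """Smallest N such that N * L / G >= target coverage."""
    return math.ceil(target_coverage * genome_len / read_len)


# Hypothetical example: 1 Mbp segment, 500 bp reads.
example_c = coverage(14_000, 500, 1_000_000)   # 7-fold coverage
example_n = reads_needed(7, 500, 1_000_000)    # reads required for C = 7
```

Solving the same formula for N is how one would size a sequencing run for a chosen redundancy.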


Enough Coverage

How much coverage is enough?

According to the Lander-Waterman model:

Assuming uniform distribution of reads, C = 7 leaves about 1 nucleotide in every 1,000 unsequenced (e^-7 ≈ 0.0009)


Lander-Waterman Model

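• Major assumptions
  – Reads are randomly distributed in the genome
  – The number of times a base is sequenced follows a Poisson distribution:
    $p(X = x) = \frac{\lambda^x e^{-\lambda}}{x!}$, where $\lambda$ is the average number of times a base is sequenced

• Implications
  – G = genome length, L = read length, N = # reads
  – Mean of Poisson: $\lambda = LN/G$ (coverage)
  – % bases not sequenced: $p(X = 0) = e^{-\lambda}$; for $\lambda = 7$, $p(X = 0) \approx 0.0009 = 0.09\%$
  – Total gap length: $p(X = 0) \cdot G$
  – Total number of gaps: $p(X = 0) \cdot N$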

This model was used to plan the Human Genome Project…
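The implications of the model can be checked numerically. The sketch below follows the formulas on this slide (Poisson mean LN/G, uncovered fraction p(X=0), total gap length p(X=0)*G, number of gaps p(X=0)*N); the function names and the example genome size are assumptions made for illustration.

```python
import math


def poisson_pmf(x, lam):
    """p(X = x) for a Poisson distribution with mean lam."""
    return lam ** x * math.exp(-lam) / math.factorial(x)


def lander_waterman(genome_len, read_len, num_reads):
    """Expected gap statistics under the Lander-Waterman assumptions."""
    lam = read_len * num_reads / genome_len     # mean of Poisson = coverage
    p_uncovered = poisson_pmf(0, lam)           # fraction of bases never sequenced
    return {
        "coverage": lam,
        "p_uncovered": p_uncovered,
        "gap_length": p_uncovered * genome_len,  # total number of unsequenced bases
        "num_gaps": p_uncovered * num_reads,     # expected number of gaps
    }


# Hypothetical example: 1 Mbp genome, 500 bp reads, 14,000 reads (C = 7).
stats = lander_waterman(1_000_000, 500, 14_000)
```

At 7-fold coverage the uncovered fraction is e^-7, roughly 0.0009, which matches the 0.09% figure above.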


Repeats, Errors, and Read Lengths

• Repeats shorter than read length are OK

• Repeats with more base pair diffs than sequencing error rate are OK

• To make a smaller portion of the genome appear repetitive, try to:
  – Increase read length
  – Decrease sequencing error rate

Role of error correction: discards ~90% of single-letter sequencing errors; a lower error rate decreases the effective repeat content

However, we have only limited read length.

Many heuristics have been introduced to handle repeats…


Overlap-Layout-Consensus

Assemblers: ARACHNE, PHRAP, CAP, TIGR, CELERA

Overlap: find potentially overlapping reads

Layout: merge reads into contigs and contigs into supercontigs

Consensus: derive the DNA sequence and correct read errors (e.g. ..ACGATTACAATAGGTT..)


Overlap

• Find the best match between the suffix of one read and the prefix of another

• Due to sequencing errors, need to use dynamic programming to find the optimal overlap alignment

• Apply a filtration method to filter out pairs of fragments that do not share a significantly long common substring
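The suffix-prefix matching described above can be sketched as a small dynamic program. This is a minimal illustration, not the method of any particular assembler: the scoring values (+1 match, -1 mismatch, -1 gap) and the function name are assumptions. Skipping a prefix of the first read is free, and the alignment must end at the end of the first read.

```python
def overlap_alignment(a, b, match=1, mismatch=-1, gap=-1):
    """Best alignment of a suffix of `a` with a prefix of `b`.

    Returns (score, length of the prefix of b used in the overlap).
    """
    n, m = len(a), len(b)
    # score[i][j]: best score aligning some suffix of a[:i] with all of b[:j].
    # Column 0 stays 0, so starting anywhere in `a` costs nothing.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        score[0][j] = j * gap  # b's prefix cannot be skipped
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # The overlap must reach the end of `a`; it may stop anywhere in `b`.
    best_j = max(range(m + 1), key=lambda j: score[n][j])
    return score[n][best_j], best_j
```

For example, `overlap_alignment("TAGATTACACAGATTAC", "ACAGATTACTGA")` finds the 9-base overlap `ACAGATTAC`. The table costs O(|a|·|b|) per pair, which is why the filtration step matters before running it on all pairs.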


Overlapping Reads

[Figure: two copies of the read sequence TAGATTACACAGATTAC aligned base by base, with shorter partial alignments]

• Sort all k-mers in reads (k ~ 24)

• Find pairs of reads sharing a k-mer

• Extend to full alignment; throw away if not >95% similar
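The k-mer filtration step can be sketched as follows: list every (k-mer, read id) pair, sort, and report reads that share a k-mer as candidates for full alignment. The function name is hypothetical, and the sketch ignores reverse complements and the later >95% identity check.

```python
from itertools import combinations, groupby


def candidate_pairs(reads, k=24):
    """Return pairs of read indices that share at least one k-mer."""
    # All (k-mer, read id) occurrences, sorted so equal k-mers are adjacent.
    kmers = sorted(
        (read[i:i + k], rid)
        for rid, read in enumerate(reads)
        for i in range(len(read) - k + 1)
    )
    pairs = set()
    for _, group in groupby(kmers, key=lambda item: item[0]):
        rids = sorted({rid for _, rid in group})
        pairs.update(combinations(rids, 2))
    return pairs


# Toy example with a small k so the shared k-mers are easy to see.
toy_reads = ["TAGATTACACAG", "TTACACAGATTA", "GGGGGGGGGGGG"]
toy_pairs = candidate_pairs(toy_reads, k=6)
```

Only the pairs returned here would be passed on to the more expensive alignment step.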


Overlapping Reads and Repeats

• A k-mer that appears N times initiates N^2 comparisons

• For an Alu that appears 10^6 times: 10^12 comparisons – too much

• Solution: discard all k-mers that appear more than t × Coverage times (t ~ 10)
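The frequency cutoff can be sketched by counting k-mer occurrences and flagging those above t times the coverage. The function name and the toy inputs are assumptions for illustration; a real assembler would count k-mers at a much larger scale.

```python
from collections import Counter


def overrepresented_kmers(reads, k, coverage, t=10):
    """k-mers occurring more than t * coverage times (likely repeat-derived)."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    cutoff = t * coverage
    return {kmer for kmer, count in counts.items() if count > cutoff}


# Toy example: "AAA" occurs 6 times, above the cutoff of t * coverage = 2.
toy_repeats = overrepresented_kmers(["AAAAAA", "AAAACG"], k=3, coverage=1, t=2)
```

A unique k-mer is expected to appear about Coverage times, so one seen far more often than that most likely comes from a repeat and would only trigger spurious comparisons.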


Finding Overlapping Reads

Create local multiple alignments from the overlapping reads

TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA


Finding Overlapping Reads (cont’d)

• Correct errors using multiple alignment

TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA

Substitution column (base: score per read)
  before: C: 20, C: 35, T: 30, C: 35, C: 40
  after:  C: 20, C: 35, C: 0, C: 35, C: 40

Gap column (base: score per read)
  before: A: 15, A: 25, A: 40, A: 25, -
  after:  A: 15, A: 25, A: 40, A: 25, A: 0

• Score alignments

• Accept alignments with good scores

Multiple alignments will be covered later in the course…
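The column-wise correction shown in this slide's example can be sketched as a score-weighted vote. This assumes the numbers next to each base are per-read quality scores and that an outvoted base is replaced by the winning base with score 0, as the before/after values suggest; the actual rule used by an assembler is likely more elaborate.

```python
from collections import defaultdict


def correct_column(column):
    """Correct one alignment column.

    `column` is a list of (base, score) pairs, one per read; a gap is ("-", 0).
    Bases that disagree with the best-supported base are replaced by it
    and given score 0.
    """
    support = defaultdict(int)
    for base, score in column:
        support[base] += score
    winner = max(support, key=support.get)
    return [(b, s) if b == winner else (winner, 0) for b, s in column]


# The two columns from the slide.
substitution = correct_column([("C", 20), ("C", 35), ("T", 30), ("C", 35), ("C", 40)])
gap = correct_column([("A", 15), ("A", 25), ("A", 40), ("A", 25), ("-", 0)])
```

Giving the corrected base a score of 0 keeps it from counting as independent evidence in later voting.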


Layout

• Repeats are a major challenge

• Do two aligned fragments really overlap, or are they from two copies of a repeat?


Merge Reads into Contigs

Merge reads up to potential repeat boundaries

[Figure: repeat region]


Merge Reads into Contigs (cont’d)

• Ignore non-maximal reads

• Merge only maximal reads into contigs

[Figure: repeat region]


Merge Reads into Contigs (cont’d)

• Ignore “hanging” reads when detecting repeat boundaries

[Figure: a hanging read at a contig end, labeled "sequencing error" vs. "repeat boundary???"; reads a and b]


Merge Reads into Contigs (cont’d)

• Insert non-maximal reads whenever unambiguous

[Figure: candidate read placements, some marked "?????" and one marked "Unambiguous"]


Link Contigs into Supercontigs

[Figure (Myers et al. 2000): contigs classified by read density and link consistency]

• Normal density

• Too dense: overcollapsed?

• Inconsistent links: overcollapsed?


Link Contigs into Supercontigs (cont’d)

Find all links between unique contigs

Connect contigs incrementally, if ≥ 2 links
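The linking rule can be sketched with a union-find structure. This assumes the threshold means at least two paired-read links between a pair of unique contigs; the function name and input format are hypothetical, and the sketch only groups contigs without ordering or orienting them or checking link distances.

```python
from collections import Counter


def link_contigs(num_contigs, links, min_links=2):
    """Group contigs into supercontigs.

    `links` is a list of (contig_a, contig_b) pairs, one per paired-read link.
    Two contigs are joined when at least `min_links` links connect them.
    """
    parent = list(range(num_contigs))

    def find(x):
        # Find the group representative, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    link_counts = Counter(tuple(sorted(link)) for link in links)
    for (a, b), count in link_counts.items():
        if count >= min_links:
            parent[find(a)] = find(b)

    groups = {}
    for contig in range(num_contigs):
        groups.setdefault(find(contig), []).append(contig)
    return sorted(groups.values())


# Contigs 0-1 and 2-3 have two links each; 1-2 has only one.
supercontigs = link_contigs(4, [(0, 1), (1, 0), (1, 2), (2, 3), (3, 2)])
```

Requiring more than one link guards against a single chimeric or misplaced read pair joining unrelated contigs.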


Link Contigs into Supercontigs (cont’d)

Fill gaps in supercontigs with paths of overcollapsed contigs


Link Contigs into Supercontigs (cont’d)

Define G = ( V, E )

V := contigs

E := ( A, B ) such that d( A, B ) < C

Reason to do so: efficiency; full shortest paths cannot be computed

[Figure: Contig A and Contig B separated by distance d( A, B )]


Link Contigs into Supercontigs (cont’d)

[Figure: Contig A and Contig B with a gap between them]

Define T: contigs linked to either A or B

Fill gap between A and B if there is a path in G passing only through contigs in T
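The gap-filling test is a restricted path search, sketched here with breadth-first search. The graph representation (a dict of adjacency lists), the contig names, and the function name are assumptions for illustration; edge distances are ignored.

```python
from collections import deque


def gap_path(graph, a, b, allowed):
    """Find a path from contig a to contig b whose interior lies in `allowed`.

    `graph` maps each contig to the contigs it has an edge to (the graph G);
    `allowed` is the set T of contigs linked to either a or b.
    Returns the path as a list of contigs, or None if no such path exists.
    """
    queue = deque([[a]])
    seen = {a}
    while queue:
        path = queue.popleft()
        for nxt in graph.get(path[-1], ()):
            if nxt == b:
                return path + [b]
            if nxt in allowed and nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# Hypothetical example: r1 and r2 are in T, x is not.
toy_graph = {"A": ["r1", "x"], "r1": ["r2"], "x": ["B"], "r2": ["B"]}
filled = gap_path(toy_graph, "A", "B", allowed={"r1", "r2"})
```

Restricting the search to T keeps it local to the gap, in line with the efficiency concern noted when G was defined.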


Consensus

• A consensus sequence is derived from a profile of the assembled fragments

• A sufficient number of reads is required to ensure a statistically significant consensus

• Reading errors are corrected


Derive Consensus Sequence

Derive multiple alignment from pairwise read alignments

TAGATTACACAGATTACTGA TTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
TAG TTACACAGATTATTGACTTCATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGGGTAA CTA

TAGATTACACAGATTACTGACTTGATGGCGTAA CTA

Derive each consensus base by weighted voting
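Column-wise voting over the multiple alignment can be sketched as below, using the five aligned rows from this slide with a space standing for a gap. The slide does not say what the weights are, so the sketch defaults to equal weights (an assumption); per-read weights can be passed in. The function name is hypothetical.

```python
from collections import defaultdict

# The aligned reads from the slide; a space marks a gap.
ROWS = [
    "TAGATTACACAGATTACTGA TTGATGGCGTAA CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAAACTA",
    "TAG TTACACAGATTATTGACTTCATGGCGTAA CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAA CTA",
    "TAGATTACACAGATTACTGACTTGATGGGGTAA CTA",
]


def consensus(rows, weights=None):
    """Pick, for each column, the symbol with the largest total weight."""
    if weights is None:
        weights = [1.0] * len(rows)  # assumption: equal weight per read
    result = []
    for col in range(len(rows[0])):
        votes = defaultdict(float)
        for row, weight in zip(rows, weights):
            votes[row[col]] += weight
        result.append(max(votes, key=votes.get))
    return "".join(result)


consensus_seq = consensus(ROWS)
```

With equal weights this reproduces the consensus row shown on the slide, including the column where only one read carries an extra base.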


What You Should Know

• The challenges in assembling fragments for whole genome shotgun sequencing

• Lander-Waterman model

• Main heuristics used in Overlap-Layout-Consensus