9/13/10

CSCI2950-C DNA Sequencing and Fragment Assembly

Lecture 3: Sept. 9, 2010 http://cs.brown.edu/courses/csci2950-c/

Sequencing Longer Regions

Cover region with K-fold redundancy

Overlap reads and extend to reconstruct the original genomic region

reads

Questions

1.  How many reads to sequence?
2.  How to assemble in the presence of missing data and/or repeated sequences?

Dominant paradigms
1.  Overlap-Layout-Consensus
2.  de Bruijn graphs

Overlap-Layout-Consensus Assemblers: ARACHNE, PHRAP, CAP, TIGR, CELERA

Overlap: find potentially overlapping reads

Layout: Merge reads into contigs and contigs into supercontigs

Consensus: derive the DNA sequence and correct read errors

..ACGATTACAATAGGTT..

Overlapping Reads

TAGATTACACAGATTAC
|||||||||||||||||
TAGATTACACAGATTAC

•  Identify all k-mers in reads (k ~ 24)

•  Find pairs of reads sharing a k-mer via hashing

•  Extend each candidate pair to a full alignment; throw the pair away if the alignment is not >95% similar
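The hashing step above can be sketched as follows. This is a minimal illustration, not the method of any particular assembler: the function names `kmers` and `candidate_pairs` are hypothetical, and the full-alignment check (>95% similarity) is left out.

```python
from collections import defaultdict
from itertools import combinations

def kmers(read, k):
    """Yield every k-mer of `read` (there are len(read) - k + 1 of them)."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]

def candidate_pairs(reads, k):
    """Return index pairs (i, j), i < j, of reads sharing at least one k-mer."""
    index = defaultdict(set)              # k-mer -> indices of reads containing it
    for idx, read in enumerate(reads):
        for kmer in kmers(read, k):
            index[kmer].add(idx)
    pairs = set()
    for owners in index.values():         # every pair of reads under one k-mer
        pairs.update(combinations(sorted(owners), 2))
    return pairs

reads = ["TAGATTACACAG", "TACACAGATTAC", "GGGGCCCCGGGG"]
print(candidate_pairs(reads, k=6))        # {(0, 1)}
```

Only the pairs returned here would go on to the more expensive full alignment.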

[Figure: example read alignments containing mismatches]

Overlapping Reads and Repeats

•  A k-mer that appears M times initiates $M^2$ comparisons

•  For an Alu that appears $10^6$ times: $10^{12}$ comparisons, which is too many

•  Solution: discard all k-mers that appear more than t × Coverage times (t ~ 10)
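A minimal sketch of the filtering rule, assuming the coverage is already known; the function name `overrepresented_kmers` and its parameters are illustrative choices, not part of any assembler's interface.

```python
from collections import Counter

def overrepresented_kmers(reads, k, coverage, t=10):
    """Return the k-mers occurring more than t * coverage times across all reads."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return {kmer for kmer, c in counts.items() if c > t * coverage}

# Toy data: "ACGT" occurs 5 times, above the threshold t * coverage = 2 * 2 = 4.
reads = ["ACGT"] * 5 + ["TTTT"]
print(overrepresented_kmers(reads, k=4, coverage=2, t=2))   # {'ACGT'}
```

K-mers in the returned set would be skipped when seeding overlaps, so a repeat seen M times no longer triggers on the order of $M^2$ comparisons.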

Layout

Merge reads into contigs: greedy approach

Overlap Graph

Build a directed graph G = (V, E):
V = {s1, s2, …, sn}
e = (si, sj) is in E if a prefix of sj matches a suffix of si
w(si, sj) = length of the overlap between si and sj

Goal: Find a maximum-weight path visiting every VERTEX exactly once in the OVERLAP graph.

This is the Travelling Salesman (Hamiltonian path) problem. Convert max to min by replacing w with −w.
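Because the exact problem is a Hamiltonian path problem, assemblers fall back on heuristics. Below is a minimal sketch of the greedy approach: repeatedly merge the pair of reads with the largest suffix-prefix overlap. The names `overlap` and `greedy_merge` and the `min_len` cutoff are assumptions made for illustration, overlaps are exact matches only, and the greedy result is not guaranteed to be a maximum-weight path.

```python
def overlap(a, b, min_len=1):
    """Length of the longest suffix of `a` equal to a prefix of `b` (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_merge(reads, min_len=1):
    """Merge the best-overlapping pair until no overlap of at least min_len remains."""
    reads = list(dict.fromkeys(reads))        # drop exact duplicates, keep order
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break                             # remaining reads form separate contigs
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads

print(greedy_merge(["ATGGC", "GGCGTG", "GTGCA"], min_len=3))   # ['ATGGCGTGCA']
```

The quadratic scan over all read pairs is only for clarity; in practice the candidate pairs would come from the k-mer hashing step.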

Merge Reads into Contigs

We want to merge reads up to potential repeat boundaries.
1)  Find “unique regions”.
2)  Merge unique regions using various heuristics (greedy, weighted paths, etc.)

[Figure: reads around a repeat region, contrasting a unique contig with an overcollapsed contig]

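Removing Repeats

•  A-statistic (Myers), used in the Celera assembler
•  Use Lander-Waterman statistics to estimate the likelihood ratio of a unique region vs. an over-collapsed repeat

[Figure: read density along contigs. Normal density ⇒ unique; too dense ⇒ overcollapsed]

With n reads sampled uniformly from a genome of length G, the number of reads in an interval of length Δ is approximately Poisson with mean Δn/G:

$\Pr[k \text{ reads in interval of length } \Delta] = e^{-\Delta n/G} \, \frac{(\Delta n/G)^k}{k!}$

A contig of length Δ that is really two collapsed repeat copies draws its reads from 2Δ of genome, so the log-likelihood ratio of unique vs. over-collapsed is

$\log \frac{\Pr[k \text{ reads in interval of length } \Delta]}{\Pr[k \text{ reads in interval of length } 2\Delta]} \approx \Delta \frac{n}{G} - k \log 2$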
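A small numeric sketch of this log-odds score, assuming the Poisson read-count model with mean Δn/G and natural logarithms. It scores "unique" (length Δ) against "two collapsed copies" (length 2Δ), so positive values favour a unique contig. The function names are illustrative and this is not the Celera implementation.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events under a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def a_statistic(k, delta, n, genome_len):
    """Log-odds that a contig of length delta holding k reads is unique
    rather than an over-collapsed two-copy repeat."""
    return delta * n / genome_len - k * math.log(2)

# 10,000 reads on a 1 Mbp genome: a 1 kbp interval expects 10 reads.
print(a_statistic(k=10, delta=1000, n=10_000, genome_len=1_000_000))  # positive: looks unique
print(a_statistic(k=30, delta=1000, n=10_000, genome_len=1_000_000))  # negative: too dense

# The closed form agrees with the ratio of the two Poisson probabilities.
lam = 10.0
direct = math.log(poisson_pmf(10, lam) / poisson_pmf(10, 2 * lam))
assert math.isclose(direct, a_statistic(10, 1000, 10_000, 1_000_000))
```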

Assemble Contigs

1)  Find unique regions.
2)  Assemble into contigs (greedy or weighted path approaches)

The resulting assembly is an unordered collection of contigs.

G-contigs: result from a gap in read coverage
R-contigs: result from a repeat at the boundary

Repeats, errors, and contig lengths

•  Repeats shorter than the read length are OK
   –  A read that spans a repeat disambiguates the order of the flanking regions

•  Repeats with more base-pair differences than the sequencing error rate are OK
   –  We throw away overlaps between two reads in different copies of the repeat

•  To make the genome appear less repetitive, try to:
   –  Increase read length
   –  Decrease sequencing error rate

Role of error correction: discarding up to 98% of single-letter sequencing errors decreases the error rate ⇒ decreases effective repeat content ⇒ increases contig length

Double-barreled sequencing (1990)

[Figure: a fragment with a 500 bp read at each end]

Both the leftmost and rightmost ends are sequenced, and the two reads are paired

Link Contigs into Supercontigs (cont’d)

Find all links between unique contigs

Connect contigs incrementally, if ≥ 2 links
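The "≥ 2 links" rule can be sketched as a simple count of mate pairs whose two reads land in different contigs. The data layout here (a `read_to_contig` dictionary and a list of `mate_pairs`) is a hypothetical simplification; a real scaffolder would also check read orientation and the expected insert distance.

```python
from collections import Counter

def supported_links(mate_pairs, read_to_contig, min_links=2):
    """Return the contig pairs joined by at least `min_links` mate pairs."""
    links = Counter()
    for left, right in mate_pairs:
        a = read_to_contig.get(left)
        b = read_to_contig.get(right)
        if a is not None and b is not None and a != b:
            links[frozenset((a, b))] += 1     # unordered pair of contig names
    return {pair for pair, count in links.items() if count >= min_links}

read_to_contig = {"r1": "A", "r2": "B", "r3": "A", "r4": "B", "r5": "A", "r6": "C"}
mate_pairs = [("r1", "r2"), ("r3", "r4"), ("r5", "r6")]
print(supported_links(mate_pairs, read_to_contig))   # {frozenset({'A', 'B'})}
```

Requiring two or more links guards against a single chimeric or misplaced mate pair joining unrelated contigs.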

supercontig (aka scaffold)

Link Contigs into Supercontigs

Too dense ⇒ Overcollapsed

Inconsistent links ⇒ Overcollapsed?

Normal density

Scaffolding Problem: various heuristic approaches

Fill gaps in supercontigs with paths of repeat contigs

Link Contigs into Supercontigs

Use length/distance constraints

Consensus

•  A consensus sequence is derived from a profile of the assembled fragments

•  A sufficient number of reads is required to ensure a statistically significant consensus

•  Reading errors are corrected

Derive Consensus Sequence

Derive multiple alignment from pairwise read alignments

Aligned reads (- marks a gap):

    TAGATTACACAGATTACTGA-TTGATGGCGTAA-CTA
    TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
    TAG-TTACACAGATTATTGACTTCATGGCGTAA-CTA
    TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA
    TAGATTACACAGATTACTGACTTGATGGGGTAA-CTA

Consensus:

    TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA

Derive each consensus base by weighted voting

(Alternative: take maximum-quality letter)
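A minimal sketch of column-wise weighted voting over a multiple alignment. The function name `consensus` and the per-base `weights` argument (standing in for base quality scores) are assumptions; with no weights given, every base counts equally, which reduces to majority voting. Gaps are written as "-".

```python
from collections import defaultdict

def consensus(rows, weights=None):
    """Pick, for each alignment column, the symbol with the largest total weight."""
    if weights is None:
        weights = [[1.0] * len(row) for row in rows]   # unweighted: simple majority
    result = []
    for col in range(len(rows[0])):
        votes = defaultdict(float)
        for row, row_weights in zip(rows, weights):
            votes[row[col]] += row_weights[col]
        result.append(max(votes, key=votes.get))
    return "".join(result)

rows = [
    "TAGATTACACAGATTACTGA-TTGATGGCGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAAACTA",
    "TAG-TTACACAGATTATTGACTTCATGGCGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGGGTAA-CTA",
]
print(consensus(rows))   # TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA
```

Each disagreeing column in this example has a single dissenting read, so the vote is 4 to 1 everywhere and the result matches the consensus line on the slide.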

Strategies for whole-genome sequencing

1.  Hierarchical: clone-by-clone
    i.  Break genome into many long pieces
    ii.  Map each long piece onto the genome
    iii.  Sequence each piece with shotgun

    Example: Yeast, Worm, Human, Rat

2.  Online version of (1): walking
    i.  Break genome into many long pieces
    ii.  Start sequencing each piece with shotgun
    iii.  Construct map as you go

    Example: Rice genome

3.  Whole genome shotgun

    One large shotgun pass on the whole genome

    Example: Drosophila, Human (Celera), Neurospora, Mouse, Rat, Dog

Until the late 1990s, shotgun fragment assembly of the human genome was viewed as an intractable problem

Whole Genome Shotgun Sequencing

[Figure: the genome is cut many times at random (shotgun) into plasmids (2 ‒ 10 Kbp) and fosmids (40 Kbp); a 500 bp read is taken from each end, giving forward-reverse paired reads (mate pairs) at a known distance]

Another sequencing approach

Sequencing by synthesis vs. sequencing by hybridization

DNA Basepairing

[Figure: two antiparallel strands, one running 5’ to 3’ and its complement running 3’ to 5’]

Single-stranded DNA has extremely high affinity for its complementary strand

DNA Microarrays

Sequencing by Hybridization (SBH)

•  Build a microarray with all $4^l$ DNA sequences of length l (l ~ 20)

•  For DNA sequence s, measure its l-mer composition

l-mer composition

Def: Given a string s of length n, Spectrum(s, l) is the unordered multiset of all (n – l + 1) l-mers in s

•  The order of individual elements in Spectrum(s, l) does not matter

•  Example: s = TATGGTGC, Spectrum(s, 3) = {TAT, ATG, TGG, GGT, GTG, TGC}
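The definition translates directly into code. In this sketch the multiset is returned as a sorted list so that order carries no information; the function name `spectrum` is chosen to mirror the notation above.

```python
def spectrum(s, l):
    """Return the multiset of all len(s) - l + 1 l-mers of s, as a sorted list."""
    return sorted(s[i:i + l] for i in range(len(s) - l + 1))

print(spectrum("TATGGTGC", 3))
# ['ATG', 'GGT', 'GTG', 'TAT', 'TGC', 'TGG']
```

Repeated l-mers are kept, so a string such as "AAAA" with l = 2 yields three copies of "AA".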

The SBH Problem

Goal: Reconstruct a string from its l-mer composition

Input: A multiset S, representing all l-mers from an (unknown) string s

Output: A string s such that Spectrum(s, l) = S

SBH: An Example

s = ATCCAGT has the 3-mers ATC, TCC, CCA, CAG, AGT

S = { ATC, CCA, CAG, TCC, AGT }

SBH: Hamiltonian Path Approach

S = { ATG, AGG, TGC, TCC, GTC, GGT, GCA, CAG }

Directed graph G = (S, E): edge e_{s,t} is in E if suffix of s = prefix of t, where length(suffix/prefix) = l – 1

A path that visits every VERTEX exactly once is a Hamiltonian path:

H: ATG → TGC → GCA → CAG → AGG → GGT → GTC → TCC

Reconstructed sequence: ATGCAGGTCC
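The Hamiltonian path formulation can be illustrated by brute force: try every ordering of the l-mers and keep those in which each l-mer's (l – 1)-suffix equals the next one's (l – 1)-prefix. This is a sketch for tiny inputs only (the function name `sbh_hamiltonian` is hypothetical); it enumerates all orderings, which is exactly why the approach does not scale.

```python
from itertools import permutations

def sbh_hamiltonian(lmers):
    """Return every string spelled by a Hamiltonian path in the l-mer overlap graph."""
    results = set()
    for order in permutations(lmers):
        # Consecutive l-mers must overlap by l - 1 characters.
        if all(a[1:] == b[:-1] for a, b in zip(order, order[1:])):
            results.add(order[0] + "".join(x[-1] for x in order[1:]))
    return sorted(results)

print(sbh_hamiltonian(["ATG", "AGG", "TGC", "TCC", "GTC", "GGT", "GCA", "CAG"]))
# ['ATGCAGGTCC']
print(sbh_hamiltonian(["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]))
# ['ATGCGTGGCA', 'ATGGCGTGCA']
```

The second spectrum admits two Hamiltonian paths, so the reconstruction is ambiguous.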


A more complicated graph:

S = { ATG TGG TGC GTG GGC GCA GCG CGT }


S = { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT }

Path 1: ATGCGTGGCA

Path 2: ATGGCGTGCA

SBH: Eulerian Path Approach

S = { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT }

Vertices correspond to (l – 1)-mers: { AT, TG, GC, GG, GT, CA, CG }

Edges correspond to l-mers from S

[Figure: de Bruijn graph on the vertices AT, TG, GC, GG, GT, CA, CG]

Find a path that visits every EDGE exactly once

(l – 1) de Bruijn graph of S:
Vertices: (l – 1)-mers in S
Directed edges: consecutive (l – 1)-mers in S
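A minimal sketch of the construction and of finding one Eulerian path with Hierholzer's algorithm (a standard linear-time method, used here as an illustration rather than taken from the slides). It assumes an Eulerian path exists; the function names are hypothetical.

```python
from collections import defaultdict

def de_bruijn(lmers):
    """Each l-mer becomes an edge from its (l-1)-prefix to its (l-1)-suffix."""
    graph = defaultdict(list)
    for lmer in lmers:
        graph[lmer[:-1]].append(lmer[1:])
    return graph

def eulerian_path(graph):
    """Return the vertices of one Eulerian path (assumes such a path exists)."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for w in targets:
            in_deg[w] += 1
    nodes = sorted(set(graph) | set(in_deg))
    # Start at the vertex with one more outgoing than incoming edge, if any.
    start = next((v for v in nodes if len(graph.get(v, [])) - in_deg[v] == 1), nodes[0])
    remaining = {v: list(ws) for v, ws in graph.items()}
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if remaining.get(v):
            stack.append(remaining[v].pop())   # follow an unused edge
        else:
            path.append(stack.pop())           # dead end: emit vertex
    return path[::-1]

def spell(path):
    """Spell the string obtained by walking the (l-1)-mers of a path."""
    return path[0] + "".join(v[-1] for v in path[1:])

S = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
print(spell(eulerian_path(de_bruijn(S))))   # one of ATGGCGTGCA / ATGCGTGGCA
```

Which of the two valid reconstructions is returned depends on the order in which edges are followed.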

SBH: Eulerian Path Approach

S = { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT }

Two different paths give different sequence reconstructions:

ATGGCGTGCA
ATGCGTGGCA

[Figure: the two Eulerian paths drawn on the de Bruijn graph with vertices AT, TG, GC, GG, GT, CA, CG]

Euler Theorem

•  A graph is balanced if for every vertex the number of incoming edges equals the number of outgoing edges: in(v) = out(v)

•  Theorem: A connected graph is Eulerian (has an Eulerian cycle) if and only if each of its vertices is balanced.
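The balance condition is easy to check in code. This sketch (the function name `is_balanced` is illustrative) tests only in(v) = out(v); connectivity, the other hypothesis of the theorem, is not checked here.

```python
from collections import Counter

def is_balanced(edges):
    """True if every vertex has equal in-degree and out-degree.
    `edges` is an iterable of (u, v) pairs, one per directed edge."""
    in_deg, out_deg = Counter(), Counter()
    for u, v in edges:
        out_deg[u] += 1
        in_deg[v] += 1
    return all(in_deg[x] == out_deg[x] for x in set(in_deg) | set(out_deg))

S = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
edges = [(lmer[:-1], lmer[1:]) for lmer in S]
print(is_balanced(edges))                    # False: AT has no incoming edge
print(is_balanced(edges + [("CA", "AT")]))   # True after closing the path into a cycle
```

A graph with an Eulerian path but no Eulerian cycle has exactly two unbalanced vertices (the start and the end), and adding one edge from the end back to the start makes it balanced.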

Repeat Graph

Vertices correspond to (k – 1)-mers in each read

Edges correspond to k-mers in each read

Example: S = ATGGCGTGCA

Reads = {ATGGC, GGCGTG, GTGCA}

3-mers = { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT }

[Figure: de Bruijn graph on the vertices AT, TG, GC, GG, GT, CA, CG]

Two Eulerian paths (each visits every EDGE once): ATGGCGTGCA and ATGCGTGGCA


Eulerian Superpath

Example: S = ATGGCGTGCA

Reads = {ATGGC, GGCGTG, GTGCA}

3-mers = { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT }

[Figure: de Bruijn graph on the vertices AT, TG, GC, GG, GT, CA, CG, with the read paths marked]


Eulerian superpath: an Eulerian path that contains set of paths (reads) as subpaths.
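For the small example, the superpath condition can be checked by enumerating every Eulerian path and keeping those whose spelled string contains every read. A read is a subpath exactly when it appears as a substring of the spelled sequence. This backtracking sketch (function names hypothetical) takes exponential time in general and is not how the Euler assembler works internally.

```python
def all_eulerian_spellings(kmers):
    """Spell every path that uses each k-mer (edge) exactly once."""
    results = set()

    def extend(node, unused, text):
        if not unused:
            results.add(text)
            return
        for i, kmer in enumerate(unused):
            if kmer[:-1] == node:                     # edge leaves the current vertex
                extend(kmer[1:], unused[:i] + unused[i + 1:], text + kmer[-1])

    for start in {kmer[:-1] for kmer in kmers}:
        extend(start, list(kmers), start)
    return results

def eulerian_superpaths(kmers, reads):
    """Keep only the Eulerian paths that contain every read as a subpath."""
    return sorted(t for t in all_eulerian_spellings(kmers) if all(r in t for r in reads))

kmers = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
reads = ["ATGGC", "GGCGTG", "GTGCA"]
print(sorted(all_eulerian_spellings(kmers)))   # ['ATGCGTGGCA', 'ATGGCGTGCA']
print(eulerian_superpaths(kmers, reads))       # ['ATGGCGTGCA']
```

The reads rule out ATGCGTGGCA, which does not contain ATGGC, so the read information resolves the ambiguity left by the 3-mers alone.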

Sources

•  Serafim Batzoglou http://ai.stanford.edu/~serafim/CS262_2006/ (Sequencing slides)

•  http://bioalgorithms.info (Euler slides)

•  Euler assembler: Pevzner, Tang, and Waterman (PNAS 2001), on website
