csci2950-c dna sequencing and fragment...


Page 1: CSCI2950-C DNA Sequencing and Fragment …cs.brown.edu/courses/csci2950-c/Fall2010/LectureSlides/...Overlap reads and extend to reconstruct the original genomic region reads 9/13/10

9/13/10

1

CSCI2950-C DNA Sequencing and Fragment Assembly

Lecture 3: Sept. 9, 2010 http://cs.brown.edu/courses/csci2950-c/

Sequencing Longer Regions

Cover region with K-fold redundancy

Overlap reads and extend to reconstruct the original genomic region

reads


Questions

1.  How many reads to sequence?
2.  How to assemble in presence of missing data and/or repeated sequences?
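As a rough illustration of the first question, the sketch below relates the number of reads to K-fold coverage. It assumes the standard Lander-Waterman (Poisson) model with uniformly placed reads; the function names are illustrative and do not come from the slides.

```python
import math


def coverage(num_reads: int, read_len: int, genome_len: int) -> float:
    """Average number of reads covering a base: c = N * L / G."""
    return num_reads * read_len / genome_len


def reads_needed(target_coverage: float, read_len: int, genome_len: int) -> int:
    """Smallest number of reads that reaches the target coverage."""
    return math.ceil(target_coverage * genome_len / read_len)


def expected_uncovered_fraction(c: float) -> float:
    """Poisson approximation: chance that a given base is hit by no read."""
    return math.exp(-c)


if __name__ == "__main__":
    # 10-fold coverage of a 1 Mbp region with 500 bp reads
    print(reads_needed(10, 500, 1_000_000))
    print(expected_uncovered_fraction(10))
```

Under this model the uncovered fraction falls exponentially with coverage, which is one reason high redundancy is used.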

Dominant paradigms
1.  Overlap-Layout-Consensus
2.  de Bruijn graphs

Overlap-Layout-Consensus Assemblers: ARACHNE, PHRAP, CAP, TIGR, CELERA

Overlap: find potentially overlapping reads

Layout: Merge reads into contigs and contigs into supercontigs

Consensus: derive the DNA sequence and correct read errors ..ACGATTACAATAGGTT..


Overlapping Reads

TAGATTACACAGATTAC
|||||||||||||||||
TAGATTACACAGATTAC

•  Identify all k-mers in reads (k ~ 24)

•  Find pairs of reads sharing a k-mer via hashing

•  Extend to full alignment ‒ throw away if not >95% similar
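The three steps above can be sketched as follows. This is a minimal illustration, not the code of any assembler named in the slides: the function names are hypothetical, and the "extension" is simplified to an ungapped comparison at a given offset, whereas a real assembler would use a gapped alignment.

```python
from collections import defaultdict
from itertools import combinations


def kmers(read: str, k: int) -> set:
    """All distinct k-mers of a read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def candidate_pairs(reads: list, k: int) -> set:
    """Pairs of read indices that share at least one k-mer (hash-table lookup)."""
    index = defaultdict(set)
    for rid, read in enumerate(reads):
        for kmer in kmers(read, k):
            index[kmer].add(rid)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def overlap_identity(a: str, b: str, offset: int) -> float:
    """Ungapped identity when b[0] is aligned to a[offset] (offset may be negative)."""
    start = max(0, offset)
    end = min(len(a), offset + len(b))
    if end <= start:
        return 0.0
    matches = sum(a[i] == b[i - offset] for i in range(start, end))
    return matches / (end - start)


if __name__ == "__main__":
    reads = ["TAGATTACACAG", "TACACAGATTAC", "GGGGGGGGGGGG"]
    print(candidate_pairs(reads, 5))
    # keep the overlap only if it is more than 95% similar
    print(overlap_identity(reads[0], reads[1], 5) > 0.95)
```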


Overlapping Reads and Repeats

•  A k-mer that appears M times initiates $M^2$ comparisons

•  For an Alu that appears $10^6$ times ⇒ $10^{12}$ comparisons, which is too many

•  Solution: Discard all k-mers that appear more than t × Coverage times (t ~ 10)
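A small sketch of this filter, assuming k-mers are counted across all reads and that `coverage` is supplied by the caller; the helper names are hypothetical.

```python
from collections import Counter


def kmer_counts(reads: list, k: int) -> Counter:
    """Number of occurrences of every k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def usable_kmers(reads: list, k: int, coverage: float, t: float = 10) -> set:
    """k-mers kept for overlap detection: those seen at most t * coverage times."""
    limit = t * coverage
    return {kmer for kmer, c in kmer_counts(reads, k).items() if c <= limit}


def pairwise_comparisons(m: int) -> int:
    """Distinct read pairs triggered by a k-mer seen m times (order of m squared)."""
    return m * (m - 1) // 2


if __name__ == "__main__":
    print(usable_kmers(["AAAAAA", "ACGT"], k=2, coverage=1, t=2))
    print(pairwise_comparisons(10**6))
```

Counting unordered pairs gives about half of $M^2$, which is the same order of magnitude as the figure quoted above.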



Layout

Merge reads into contigs: greedy approach
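One common reading of the greedy approach is: repeatedly merge the pair of sequences with the longest suffix-prefix overlap. The sketch below assumes error-free reads and exact overlaps, and ignores reverse complements and contained reads; `min_len` is an illustrative parameter.

```python
def overlap(a: str, b: str, min_len: int = 1) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_merge(reads: list, min_len: int = 3) -> list:
    """Merge the best-overlapping pair until no overlap of min_len remains."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return contigs
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (i, j)]
        contigs.append(merged)


if __name__ == "__main__":
    print(greedy_merge(["ATGGC", "GGCGTG", "GTGCA"]))
```

Greedy merging is a heuristic: on repetitive input it can join reads from different repeat copies, which is the over-collapsing problem the later slides address.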


Overlap Graph

Build directed graph G = (V, E)
V = {s1, s2, …, sn}
e = (si, sj) if a prefix of sj matches a suffix of si
w(si, sj) = length of overlap between si and sj

Goal: Find a maximum-weight path visiting every VERTEX exactly once in the OVERLAP graph.

This is a Travelling Salesman (Hamiltonian path) problem. Convert max to min by replacing w with −w.
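To make the definition concrete, here is a brute-force sketch that builds the overlap graph and tries every vertex ordering. It is only feasible for a handful of reads, since the problem is NP-hard in general; the names and the `min_len` threshold are assumptions for illustration.

```python
from itertools import permutations


def overlap(a: str, b: str, min_len: int = 1) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def overlap_graph(reads: list, min_len: int = 1) -> dict:
    """Weighted directed edges {(si, sj): overlap length} for positive overlaps."""
    graph = {}
    for a in reads:
        for b in reads:
            if a != b:
                w = overlap(a, b, min_len)
                if w > 0:
                    graph[(a, b)] = w
    return graph


def best_hamiltonian_path(reads: list, min_len: int = 1):
    """Exhaustively search for the maximum-weight path through every read."""
    graph = overlap_graph(reads, min_len)
    best_path, best_weight = None, -1
    for order in permutations(reads):
        edges = list(zip(order, order[1:]))
        if all(e in graph for e in edges):
            weight = sum(graph[e] for e in edges)
            if weight > best_weight:
                best_path, best_weight = order, weight
    return best_path, best_weight


if __name__ == "__main__":
    print(best_hamiltonian_path(["GTGCA", "ATGGC", "GGCGTG"], min_len=3))
```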

Merge Reads into Contigs

We want to merge reads up to potential repeat boundaries.
1)  Find “unique regions”.
2)  Merge unique regions using various heuristics (greedy, weighted paths, etc.)

[Figure: reads over a repeat region, contrasting a Unique Contig with an Overcollapsed Contig]


Removing Repeats

•  A-statistic (Myers) used in Celera assembler
•  Use Lander-Waterman statistics to estimate likelihood ratio of unique region vs. over-collapsed repeat

Too dense ⇒ Overcollapsed

Normal density

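With n reads sampled uniformly from a genome of length G, the number of reads falling in an interval of length Δ is approximately Poisson with mean Δ n / G:

$$\Pr[k \text{ reads in interval of length } \Delta] = \frac{e^{-\Delta n / G}\,(\Delta n / G)^k}{k!}$$

The A-statistic is the log-likelihood ratio of a unique region of length Δ against two collapsed repeat copies (effective length 2Δ):

$$\ln \frac{\Pr[k \text{ reads in interval of length } \Delta]}{\Pr[k \text{ reads in interval of length } 2\Delta]} = \Delta \frac{n}{G} - k \ln 2$$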
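A numeric sketch of this statistic, assuming the Poisson read-count model and natural logarithms. Positive values favor a unique region and negative values favor an over-collapsed two-copy repeat. The function names are illustrative, and any decision threshold used in practice is not taken from the slides.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Pr[k events] under a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def a_statistic(k: int, delta: int, n_reads: int, genome_len: int) -> float:
    """Log-odds of 'unique' vs 'two collapsed copies': delta * n / G - k * ln 2."""
    return delta * n_reads / genome_len - k * math.log(2)


if __name__ == "__main__":
    # 10,000 reads on a 1 Mbp genome, contig of length 1,000 (10 reads expected)
    print(a_statistic(10, 1000, 10_000, 1_000_000))  # positive: looks unique
    print(a_statistic(30, 1000, 10_000, 1_000_000))  # negative: too dense
```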

Assemble Contigs

1)  Find unique regions.
2)  Assemble into contigs (greedy or weighted path approaches)

Resulting assembly is an unordered collection of contigs.

G-contigs: result from gap in read coverage
R-contigs: result from repeat at boundary


Repeats, errors, and contig lengths

•  Repeats shorter than read length are OK
   –  A read that spans a repeat disambiguates the order of the flanking regions

•  Repeats with more base pair diffs than sequencing error rate are OK
   –  We throw away overlaps between two reads in different copies of the repeat

•  To make the genome appear less repetitive, try to:
   –  Increase read length
   –  Decrease sequencing error rate

Role of error correction: Discards up to 98% of single-letter sequencing errors
decreases error rate ⇒ decreases effective repeat content ⇒ increases contig length

Double-barreled sequencing (1990)

Both the leftmost and rightmost ends are sequenced (500 bp each), and the two reads are paired


Link Contigs into Supercontigs (cont’d)

Find all links between unique contigs

Connect contigs incrementally, if ≥ 2 links

supercontig (aka scaffold)

Link Contigs into Supercontigs

Too dense ⇒ Overcollapsed

Inconsistent links ⇒ Overcollapsed?

Normal density

Scaffolding Problem: various heuristic approaches
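As one very simplified heuristic, the "connect if ≥ 2 links" rule can be sketched with a union-find structure. This sketch only groups contigs; it ignores orientation, ordering, and distance constraints, all of which a real scaffolder must handle. The input format, a list of contig-name pairs with one pair per mate-pair link, is an assumption.

```python
from collections import Counter


def scaffold_groups(contigs: list, mate_links: list, min_links: int = 2) -> list:
    """Group contigs joined by at least min_links mate-pair links."""
    support = Counter(frozenset(p) for p in mate_links if p[0] != p[1])
    parent = {c: c for c in contigs}

    def find(c):
        # follow parent pointers to the representative, compressing the path
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for pair, count in support.items():
        if count >= min_links:
            a, b = tuple(pair)
            parent[find(a)] = find(b)

    groups = {}
    for c in contigs:
        groups.setdefault(find(c), []).append(c)
    return sorted(sorted(g) for g in groups.values())


if __name__ == "__main__":
    links = [("c1", "c2"), ("c2", "c1"), ("c2", "c3"), ("c3", "c4"), ("c3", "c4")]
    print(scaffold_groups(["c1", "c2", "c3", "c4"], links))
```

Here the single link between c2 and c3 is not enough to join the two groups, since one link alone could come from a chimeric clone or a misplaced read.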


Fill gaps in supercontigs with paths of repeat contigs

Link Contigs into Supercontigs

Use length/distance constraints

Consensus

•  A consensus sequence is derived from a profile of the assembled fragments

•  A sufficient number of reads is required to ensure a statistically significant consensus

•  Reading errors are corrected


Derive Consensus Sequence

Derive multiple alignment from pairwise read alignments

Reads (gaps shown as -):

TAGATTACACAGATTACTGA-TTGATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
TAG-TTACACAGATTATTGACTTCATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGGGTAA-CTA

Consensus:

TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA

Derive each consensus base by weighted voting

(Alternative: take maximum-quality letter)
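Column-wise voting over an already computed multiple alignment can be sketched as below. The per-read `weights` stand in for quality values and are hypothetical; with equal weights this reduces to a simple majority vote. Rows are assumed to have equal length, with `-` marking a gap.

```python
from collections import defaultdict


def consensus(rows: list, weights: list = None) -> str:
    """Pick, for each alignment column, the symbol with the largest total weight."""
    if weights is None:
        weights = [1.0] * len(rows)
    out = []
    for col in range(len(rows[0])):
        votes = defaultdict(float)
        for row, w in zip(rows, weights):
            votes[row[col]] += w
        out.append(max(votes, key=votes.get))
    return "".join(out)


if __name__ == "__main__":
    rows = [
        "TAGATTACACAGATTACTGA-TTGATGGCGTAA-CTA",
        "TAGATTACACAGATTACTGACTTGATGGCGTAAACTA",
        "TAG-TTACACAGATTATTGACTTCATGGCGTAA-CTA",
        "TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA",
        "TAGATTACACAGATTACTGACTTGATGGGGTAA-CTA",
    ]
    print(consensus(rows))
    print(consensus(rows).replace("-", ""))  # final sequence without gap columns
```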

Strategies for whole-genome sequencing

1.  Hierarchical ‒ Clone-by-clone
    i.  Break genome into many long pieces
    ii.  Map each long piece onto the genome
    iii.  Sequence each piece with shotgun
    Example: Yeast, Worm, Human, Rat

2.  Online version of (1) ‒ Walking
    i.  Break genome into many long pieces
    ii.  Start sequencing each piece with shotgun
    iii.  Construct map as you go
    Example: Rice genome

3.  Whole genome shotgun
    One large shotgun pass on the whole genome
    Example: Drosophila, Human (Celera), Neurospora, Mouse, Rat, Dog

Until the late 1990s, shotgun fragment assembly of the human genome was viewed as an intractable problem


Whole Genome Shotgun Sequencing

•  Genome is cut many times at random (shotgun)
•  Fragments: plasmids (2 ‒ 10 Kbp), fosmids (40 Kbp); the distance between the two ends is known
•  Both ends are read (500 bp each), giving forward-reverse paired reads (mate pair)

Another sequencing approach

Sequencing by synthesis vs. Sequencing by hybridization


DNA Basepairing

[Figure: two antiparallel strands, one running 5’ to 3’ and its complement running 3’ to 5’]

Single-stranded DNA has extremely high affinity for its complementary strand

DNA Microarrays


Sequencing by Hybridization (SBH)

•  Build a microarray with all $4^l$ DNA sequences of length l (l ~ 20)

•  For DNA sequence s, measure l-mer composition

l-mer composition

Def: Given a string s of length n, Spectrum ( s, l ) is the unordered multiset of all (n – l + 1) l-mers in s

•  The order of individual elements in Spectrum ( s, l ) does not matter

•  Example: s = TATGGTGC, Spectrum ( s, 3 ) =

{TAT, ATG, TGG, GGT, GTG, TGC}
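The definition translates directly into code. In this sketch a `Counter` is used as the multiset, so order is ignored and repeated l-mers are counted; the function name is illustrative.

```python
from collections import Counter


def spectrum(s: str, l: int) -> Counter:
    """Multiset of all (n - l + 1) l-mers of s, ignoring their order."""
    return Counter(s[i:i + l] for i in range(len(s) - l + 1))


if __name__ == "__main__":
    print(sorted(spectrum("TATGGTGC", 3).elements()))
```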


The SBH Problem

Goal: Reconstruct a string from its l-mer composition

Input: A multiset S, representing all l-mers from an (unknown) string s

Output: String s such that Spectrum ( s,l ) = S

SBH: An Example

S = { ATC, CCA, CAG, TCC, AGT }

Ordering the l-mers so that consecutive ones overlap (ATC, TCC, CCA, CAG, AGT) spells s = ATCCAGT


SBH: Hamiltonian Path Approach

S = { ATG AGG TGC TCC GTC GGT GCA CAG }

Path visits every VERTEX once: Hamiltonian Path

ATG → TGC → GCA → CAG → AGG → GGT → GTC → TCC

which spells ATGCAGGTCC

Directed graph G = (S, E). Edge (s, t) is in E if suffix of s = prefix of t, where length(suffix/prefix) = l - 1

SBH: Hamiltonian Path Approach

A more complicated graph:

S = { ATG TGG TGC GTG GGC GCA GCG CGT }


SBH: Hamiltonian Path Approach

S = { ATG TGG TGC GTG GGC GCA GCG CGT }

Path 1: ATGCGTGGCA

Path 2: ATGGCGTGCA
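For small spectra, the Hamiltonian path formulation can be checked by brute force. This sketch tries every ordering of the l-mers, so its running time grows factorially and it is only meant to illustrate the definition. It assumes the l-mers in S are distinct.

```python
from itertools import permutations


def hamiltonian_reconstructions(lmers: list) -> list:
    """All strings spelled by orderings where each (l-1)-suffix equals the next (l-1)-prefix."""
    results = set()
    for order in permutations(lmers):
        if all(a[1:] == b[:-1] for a, b in zip(order, order[1:])):
            results.add(order[0] + "".join(b[-1] for b in order[1:]))
    return sorted(results)


if __name__ == "__main__":
    print(hamiltonian_reconstructions(
        ["ATG", "AGG", "TGC", "TCC", "GTC", "GGT", "GCA", "CAG"]))
    print(hamiltonian_reconstructions(
        ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]))
```

The first spectrum has a single reconstruction, while the second yields the two strings listed above.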

SBH: Eulerian Path Approach

S = { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT }

Vertices correspond to (l – 1)-mers: { AT, TG, GC, GG, GT, CA, CG }

Edges correspond to l-mers from S:
AT → TG, TG → GG, TG → GC, GT → TG, GG → GC, GC → CA, GC → CG, CG → GT

Find path that visits every EDGE once

(l-1) de Bruijn graph of S:
Vertices: (l-1)-mers in S
Directed edges: consecutive (l-1)-mers in S
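A compact sketch of this construction, followed by an Eulerian path search. The search uses Hierholzer's algorithm, a standard choice that the slides do not name, and it assumes an Eulerian path exists. Which of the valid paths is returned depends on edge order.

```python
from collections import defaultdict


def de_bruijn(lmers: list) -> dict:
    """Adjacency lists: one edge prefix -> suffix per l-mer."""
    graph = defaultdict(list)
    for lmer in lmers:
        graph[lmer[:-1]].append(lmer[1:])
    return graph


def eulerian_path(graph: dict) -> list:
    """Hierholzer's algorithm; assumes the graph has an Eulerian path."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for w in targets:
            in_deg[w] += 1
    # start at the vertex with one more outgoing than incoming edge, if any
    start = next((v for v in graph if len(graph[v]) - in_deg[v] == 1),
                 next(iter(graph)))
    remaining = {v: list(ws) for v, ws in graph.items()}
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if remaining.get(v):
            stack.append(remaining[v].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path: list) -> str:
    """String spelled by a walk over (l-1)-mers."""
    return path[0] + "".join(v[-1] for v in path[1:])


if __name__ == "__main__":
    S = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
    print(spell(eulerian_path(de_bruijn(S))))
```

Unlike the Hamiltonian formulation, this runs in time linear in the number of edges.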


SBH: Eulerian Path Approach

S = { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT }

Two different paths give different sequence reconstructions:

ATGGCGTGCA: AT → TG → GG → GC → CG → GT → TG → GC → CA

ATGCGTGGCA: AT → TG → GC → CG → GT → TG → GG → GC → CA

Euler Theorem

•  A graph is balanced if for every vertex the number of incoming edges equals the number of outgoing edges:

in(v) = out(v)

•  Theorem: A connected graph is Eulerian (contains an Eulerian cycle) if and only if each of its vertices is balanced.
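The degree condition is easy to test. The sketch below also checks the standard corollary for Eulerian paths: every vertex is balanced except possibly one start vertex with out = in + 1 and one end vertex with in = out + 1. Connectivity is assumed rather than checked, and the function names are illustrative.

```python
from collections import Counter


def degree_imbalance(edges: list) -> Counter:
    """out(v) - in(v) for every vertex that appears in an edge."""
    diff = Counter()
    for u, v in edges:
        diff[u] += 1
        diff[v] -= 1
    return diff


def is_balanced(edges: list) -> bool:
    """True if in(v) == out(v) for all vertices (Eulerian cycle, if connected)."""
    return all(d == 0 for d in degree_imbalance(edges).values())


def path_degrees_ok(edges: list) -> bool:
    """Degree condition for an Eulerian path or cycle (connectivity not checked)."""
    nonzero = sorted(d for d in degree_imbalance(edges).values() if d != 0)
    return nonzero in ([], [-1, 1])


if __name__ == "__main__":
    S = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
    edges = [(x[:-1], x[1:]) for x in S]
    print(is_balanced(edges), path_degrees_ok(edges))
```

For the example spectrum, AT has one extra outgoing edge and CA one extra incoming edge, so the graph has an Eulerian path from AT to CA but no Eulerian cycle.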


Repeat Graph

Vertices correspond to (k – 1)-mers in each read

Edges correspond to k-mers in each read

Example: S = ATGGCGTGCA

Reads = {ATGGC, GGCGTG, GTGCA}

3-mers = { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT }

[Graph on vertices AT, TG, GG, GC, CG, GT, CA with one edge per 3-mer]

Two Eulerian paths (visit every EDGE once):

ATGGCGTGCA
ATGCGTGGCA

Eulerian Superpath

Example: S = ATGGCGTGCA

Reads = {ATGGC, GGCGTG, GTGCA}

3-mers = { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT }

[Graph on vertices AT, TG, GG, GC, CG, GT, CA with one edge per 3-mer]

Candidate Eulerian paths:

ATGGCGTGCA
ATGCGTGGCA

Eulerian superpath: an Eulerian path that contains set of paths (reads) as subpaths.
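To see how reads disambiguate the two Eulerian paths, the sketch below maps each sequence to its walk over (k-1)-mers and checks whether every read's walk appears as a contiguous subpath. In this example only ATGGCGTGCA passes, because ATGCGTGGCA does not contain the read ATGGC. The helper names are hypothetical, and this check only verifies a candidate; it does not search for one.

```python
def node_path(seq: str, k: int) -> list:
    """Walk over (k-1)-mers visited by seq in the graph."""
    return [seq[i:i + k - 1] for i in range(len(seq) - k + 2)]


def contains_subpath(path: list, sub: list) -> bool:
    """True if sub occurs as a contiguous run inside path."""
    n = len(sub)
    return any(path[i:i + n] == sub for i in range(len(path) - n + 1))


def is_superpath(candidate: str, reads: list, k: int) -> bool:
    """True if the candidate's walk contains every read's walk as a subpath."""
    full = node_path(candidate, k)
    return all(contains_subpath(full, node_path(r, k)) for r in reads)


if __name__ == "__main__":
    reads = ["ATGGC", "GGCGTG", "GTGCA"]
    for candidate in ("ATGGCGTGCA", "ATGCGTGGCA"):
        print(candidate, is_superpath(candidate, reads, 3))
```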


Sources

•  Serafim Batzoglou, http://ai.stanford.edu/~serafim/CS262_2006/ (Sequencing slides)
•  http://bioalgorithms.info (Euler slides)
•  Euler assembler: Pevzner, Tang, and Waterman (PNAS 2001), on website