9/13/10
CSCI2950-C DNA Sequencing and Fragment Assembly
Lecture 3: Sept. 9, 2010 http://cs.brown.edu/courses/csci2950-c/
Sequencing Longer Regions
Cover region with K-fold redundancy
Overlap reads and extend to reconstruct the original genomic region
reads
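As a rough illustration of K-fold redundancy: if the region has length G and each read has length L, about K·G/L reads are needed. The function name and the numbers below are illustrative assumptions, not values from the lecture.

```python
import math

def reads_for_coverage(region_len: int, read_len: int, fold: float) -> int:
    """Smallest number of reads whose total length is >= fold * region_len."""
    return math.ceil(fold * region_len / read_len)

# Hypothetical numbers: a 1 Mbp region, 500 bp reads, 10-fold redundancy.
print(reads_for_coverage(1_000_000, 500, 10))
```

This only counts bases; it says nothing about where the reads land, so gaps in coverage can still occur.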
Questions
1. How many reads to sequence?
2. How to assemble in presence of missing data and/or repeated sequences?
Dominant paradigms
1. Overlap-Layout-Consensus
2. de Bruijn graphs
Overlap-Layout-Consensus Assemblers: ARACHNE, PHRAP, CAP, TIGR, CELERA
Overlap: find potentially overlapping reads
Layout: Merge reads into contigs and contigs into supercontigs
Consensus: derive the DNA sequence and correct read errors ..ACGATTACAATAGGTT..
Overlapping Reads
TAGATTACACAGATTAC
TAGATTACACAGATTAC |||||||||||||||||
• Identify all k-mers in reads (k ~ 24)
• Find pairs of reads sharing a k-mer via hashing
• Extend to full alignment ‒ throw away if not >95% similar
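The k-mer hashing step above can be sketched as follows. This is a minimal sketch, not the code of any of the named assemblers: the function names are hypothetical, the toy example uses k = 6 instead of k ~ 24, and `identity` compares two already-aligned, equal-length strings rather than computing a real alignment with gaps.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(reads, k):
    """Return index pairs (i, j), i < j, of reads that share at least one k-mer."""
    index = defaultdict(set)              # k-mer -> indices of reads containing it
    for i, read in enumerate(reads):
        for p in range(len(read) - k + 1):
            index[read[p:p + k]].add(i)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

def identity(a, b):
    """Fraction of matching positions between two equal-length aligned strings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

reads = ["TAGATTACACAG", "TTACACAGATTAC", "GGGGCCCCAAAA"]
print(candidate_pairs(reads, k=6))        # reads 0 and 1 share the 6-mer TTACAC
```

A candidate pair would then be extended to a full alignment and kept only if its identity exceeds the 95% threshold.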
Overlapping Reads and Repeats
• A k-mer that appears M times initiates M^2 comparisons
• For an Alu that appears 10^6 times: 10^12 comparisons, too much
• Solution: discard all k-mers that appear more than t × Coverage times (t ~ 10)
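A small sketch of the frequency cutoff, assuming exact k-mer counting over all reads. The function name and the toy data are hypothetical.

```python
from collections import Counter

def frequent_kmers(reads, k, coverage, t=10):
    """k-mers occurring more than t * coverage times; these are skipped as overlap seeds."""
    counts = Counter(read[p:p + k] for read in reads
                     for p in range(len(read) - k + 1))
    cutoff = t * coverage
    return {kmer for kmer, c in counts.items() if c > cutoff}

# Toy data: ACGT occurs 5 times, TTTT once; with coverage 1 and t = 2 the cutoff is 2.
reads = ["ACGT"] * 5 + ["TTTT"]
print(frequent_kmers(reads, k=4, coverage=1, t=2))

# The quadratic blow-up from the slide: 10^6 copies give 10^12 comparisons.
print((10 ** 6) ** 2 == 10 ** 12)
```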
Layout: merge reads into contigs (greedy approach)
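One way to picture the greedy approach: repeatedly merge the pair of sequences with the longest suffix-prefix overlap until no overlap of at least `min_len` remains. This is a minimal sketch under simplifying assumptions (exact overlaps, no sequencing errors, no reverse complements); the function names and `min_len` are hypothetical, and real assemblers use more careful heuristics.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a equal to a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_merge(reads, min_len=3):
    """Merge the best-overlapping pair until no pair overlaps by min_len or more."""
    contigs = list(reads)
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return contigs
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (i, j)] + [merged]

print(greedy_merge(["ATGGC", "GGCGTG", "GTGCA"]))
```

Greedy merging can go wrong at repeats, which is why the later slides restrict merging to unique regions.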
Overlap Graph
Build directed graph G = (V, E)
V = {s1, s2, …, sn}
e = (si, sj) if a prefix of sj matches a suffix of si
w(si, sj) = length of overlap between si and sj
Goal: Find a maximum-weight path visiting every VERTEX exactly once in the OVERLAP graph:
Travelling Salesman (Hamiltonian path) problem
Convert max to min by w → -w
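The overlap graph definition can be made concrete with a short sketch. It assumes exact suffix-prefix matches and a hypothetical minimum overlap `min_len`; reads are identified by their list index.

```python
def overlap(a, b, min_len=1):
    """Length of the longest suffix of a equal to a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def overlap_graph(reads, min_len=3):
    """Weighted directed edges {(i, j): w(s_i, s_j)} for every ordered pair i != j."""
    edges = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                w = overlap(a, b, min_len)
                if w:
                    edges[(i, j)] = w
    return edges

reads = ["ATGGC", "GGCGTG", "GTGCA"]
print(overlap_graph(reads))
```

On this toy input the graph is a single chain 0 → 1 → 2, so the maximum-weight Hamiltonian path is obvious; in general finding it is the hard part.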
Merge Reads into Contigs
We want to merge reads up to potential repeat boundaries.
1) Find “unique regions”.
2) Merge unique regions using various heuristics (greedy, weighted paths, etc.)
repeat region
Unique Contig
Overcollapsed Contig
Removing Repeats
• A-statistic (Myers), used in the Celera assembler
• Use Lander-Waterman statistics to estimate the likelihood ratio of a unique region vs. an over-collapsed repeat
Too dense ⇒ Overcollapsed
Normal density
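Pr[k reads in interval of length Δ] = e^{-λ} λ^k / k!, where λ = Δ (n / G) for n reads on a genome of length G
log( Pr[k reads in interval of length Δ] / Pr[k reads in interval of length 2Δ] ) = Δ (n / G) - k log 2
(positive ⇒ likely unique; negative ⇒ likely an over-collapsed repeat)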
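A numeric check of the log-likelihood ratio, assuming reads start according to a Poisson process with rate n/G per base (the Lander-Waterman model) and taking the ratio as unique region over two collapsed copies. The function names and the example numbers are hypothetical.

```python
import math

def log_poisson(k, lam):
    """Natural log of the Poisson probability of k events at mean lam."""
    return -lam + k * math.log(lam) - math.lgamma(k + 1)

def a_statistic(k, delta, n, G):
    """log(Pr[k reads | unique] / Pr[k reads | two collapsed copies])."""
    lam = delta * n / G
    return lam - k * math.log(2)

# Hypothetical contig: length 1000, n = 100,000 reads, G = 10 Mbp, so 10 reads expected.
for k in (10, 20):
    direct = log_poisson(k, 10.0) - log_poisson(k, 20.0)
    print(k, round(a_statistic(k, 1000, 100_000, 10_000_000), 3), round(direct, 3))
```

With the expected 10 reads the statistic is positive; with 20 reads (twice the normal density) it turns negative, matching the "too dense ⇒ overcollapsed" picture.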
Assemble Contigs
1) Find unique regions.
2) Assemble into contigs (greedy or weighted path approaches)
Resulting assembly is unordered collection of contigs.
G-contigs: result from a gap in read coverage
R-contigs: result from a repeat at the boundary
Repeats, errors, and contig lengths
• Repeats shorter than read length are OK
  – A read that spans a repeat disambiguates the order of the flanking regions
• Repeats with more base-pair differences than the sequencing error rate are OK
  – We throw away overlaps between two reads in different copies of the repeat
• To make the genome appear less repetitive, try to:
  – Increase read length
  – Decrease sequencing error rate
Role of error correction: discards up to 98% of single-letter sequencing errors. This decreases the error rate ⇒ decreases effective repeat content ⇒ increases contig length
Double-barreled sequencing (1990)
Both the leftmost and rightmost ends (500 bp each) are sequenced, and the two reads are paired
Link Contigs into Supercontigs (cont’d)
Find all links between unique contigs
Connect contigs incrementally, if ≥ 2 links
supercontig (aka scaffold)
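The "≥ 2 links" rule can be sketched as a simple count. Here each mate-pair link is assumed to be given as a pair of contig names, one per read of the pair; the names `supported_links`, `c1`, `c2`, `c3` are hypothetical, and orientation and distance constraints are ignored.

```python
from collections import Counter

def supported_links(mate_links, min_links=2):
    """Contig pairs joined by at least min_links mate pairs (order within a pair ignored)."""
    counts = Counter(tuple(sorted(pair)) for pair in mate_links)
    return {pair for pair, c in counts.items() if c >= min_links}

links = [("c1", "c2"), ("c2", "c1"), ("c2", "c3")]
print(supported_links(links))   # only c1-c2 has two supporting links
```

Requiring two links guards against a single chimeric or misplaced mate pair joining unrelated contigs.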
Link Contigs into Supercontigs
Too dense ⇒ Overcollapsed
Inconsistent links ⇒ Overcollapsed?
Normal density
Scaffolding Problem: various heuristic approaches
Fill gaps in supercontigs with paths of repeat contigs
Link Contigs into Supercontigs
Use length/distance constraints
Consensus
• A consensus sequence is derived from a profile of the assembled fragments
• A sufficient number of reads is required to ensure a statistically significant consensus
• Reading errors are corrected
Derive Consensus Sequence
Derive multiple alignment from pairwise read alignments
Reads (gaps shown as -):
TAGATTACACAGATTACTGA-TTGATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
TAG-TTACACAGATTATTGACTTCATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGGGTAA-CTA
Consensus:
TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA
Derive each consensus base by weighted voting
(Alternative: take maximum-quality letter)
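Column-wise voting over the multiple alignment can be sketched as below, with gaps written as `-`. Equal weights are an assumption made here for simplicity; in practice the weights would come from per-base quality values. The function name is hypothetical.

```python
from collections import defaultdict

def consensus(rows, weights=None, gap="-"):
    """Pick, for each alignment column, the symbol with the largest total weight."""
    weights = weights or [1.0] * len(rows)
    out = []
    for column in zip(*rows):
        votes = defaultdict(float)
        for symbol, w in zip(column, weights):
            votes[symbol] += w
        out.append(max(votes, key=votes.get))
    return "".join(out)

rows = [
    "TAGATTACACAGATTACTGA-TTGATGGCGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAAACTA",
    "TAG-TTACACAGATTATTGACTTCATGGCGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGGGTAA-CTA",
]
result = consensus(rows)
print(result)
print(result.replace("-", ""))   # consensus with gap columns removed
```

Each isolated read error is outvoted four to one, which is why sufficient coverage is needed for a reliable consensus.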
Strategies for whole-genome sequencing
1. Hierarchical: clone-by-clone
   i. Break genome into many long pieces
   ii. Map each long piece onto the genome
   iii. Sequence each piece with shotgun
   Example: Yeast, Worm, Human, Rat
2. Online version of (1): walking
   i. Break genome into many long pieces
   ii. Start sequencing each piece with shotgun
   iii. Construct map as you go
   Example: Rice genome
3. Whole genome shotgun
   One large shotgun pass on the whole genome
   Example: Drosophila, Human (Celera), Neurospora, Mouse, Rat, Dog
Until the late 1990s, shotgun fragment assembly of the human genome was viewed as an intractable problem
Whole Genome Shotgun Sequencing
Genome is cut many times at random (shotgun)
Fragments are cloned into plasmids (2-10 Kbp) or fosmids (40 Kbp)
500 bp is read from each end, giving forward-reverse paired reads (mate pairs) a known distance apart
Another sequencing approach
Sequencing by synthesis vs. sequencing by hybridization
DNA Basepairing
(Figure: two paired strands running in opposite directions, one 5’ to 3’ and the other 3’ to 5’)
Single-stranded DNA has extremely high affinity for its complementary strand
DNA Microarrays
Sequencing by Hybridization (SBH)
• Build a microarray with all 4^l DNA sequences of length l (l ~ 20)
• For DNA sequence s, measure l-mer composition
l-mer composition
Def: Given a string s of length n, Spectrum ( s, l ) is the unordered multiset of all n - l + 1 l-mers in s
• The order of individual elements in Spectrum ( s, l ) does not matter
• Example: s = TATGGTGC, Spectrum ( s, 3 ) =
{TAT, ATG, TGG, GGT, GTG, TGC}
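The spectrum definition translates directly into code. A `Counter` is used here as the multiset, which is a representation choice made for this sketch; the function name is hypothetical.

```python
from collections import Counter

def spectrum(s, l):
    """Multiset of all len(s) - l + 1 l-mers of s; order is deliberately not recorded."""
    return Counter(s[i:i + l] for i in range(len(s) - l + 1))

spec = spectrum("TATGGTGC", 3)
print(sorted(spec.elements()))
print(sum(spec.values()))        # n - l + 1 = 8 - 3 + 1
```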
The SBH Problem
Goal: Reconstruct a string from its l-mer composition
Input: A multiset S, representing all l-mers from an (unknown) string s
Output: String s such that Spectrum ( s,l ) = S
SBH: An Example
s = ATCCAGT, l = 3; its 3-mers are ATC, TCC, CCA, CAG, AGT
S = { ATC, CCA, CAG, TCC, AGT }
SBH: Hamiltonian Path Approach
S = { ATG AGG TGC TCC GTC GGT GCA CAG }
Path visits every VERTEX once: Hamiltonian path
H: ATG → TGC → GCA → CAG → AGG → GGT → GTC → TCC
Reconstructed string: ATGCAGGTCC
Directed graph G = (S, E). Edge (s, t) is in E if suffix of s = prefix of t, where length(suffix/prefix) = l - 1
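For a spectrum this small, the Hamiltonian path formulation can be checked by brute force. The sketch below tries every ordering of the l-mers, which is exponential and only meant for toy inputs; the function name is hypothetical and all l-mers are assumed distinct.

```python
from itertools import permutations

def hamiltonian_strings(lmers):
    """Strings spelled by orderings in which each l-mer's (l-1)-suffix
    equals the next l-mer's (l-1)-prefix."""
    results = set()
    for order in permutations(lmers):
        if all(a[1:] == b[:-1] for a, b in zip(order, order[1:])):
            results.add(order[0] + "".join(b[-1] for b in order[1:]))
    return results

S = ["ATG", "AGG", "TGC", "TCC", "GTC", "GGT", "GCA", "CAG"]
print(hamiltonian_strings(S))
```

No efficient algorithm is known for the Hamiltonian path problem in general, which motivates the Eulerian reformulation.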
SBH: Hamiltonian Path Approach
A more complicated graph:
S = { ATG TGG TGC GTG GGC GCA GCG CGT }
SBH: Hamiltonian Path Approach
S = { ATG TGG TGC GTG GGC GCA GCG CGT }
Path 1: ATGCGTGGCA
Path 2: ATGGCGTGCA
SBH: Eulerian Path Approach
S = { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT }
Vertices correspond to (l - 1)-mers: { AT, TG, GC, GG, GT, CA, CG }
Edges correspond to l-mers from S:
AT → TG (ATG), TG → GG (TGG), TG → GC (TGC), GT → TG (GTG), GG → GC (GGC), GC → CA (GCA), GC → CG (GCG), CG → GT (CGT)
Find path that visits every EDGE once
(l-1) de Bruijn graph of S:
Vertices: (l-1)-mers in S
Directed edges: consecutive (l-1)-mers in S
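The de Bruijn construction and the search for Eulerian paths can be sketched as follows. The enumeration uses plain backtracking so that it returns every reconstruction; this is an illustrative choice for toy inputs, not an efficient algorithm (a single Eulerian path can be found in linear time). Function names are hypothetical.

```python
from collections import defaultdict

def de_bruijn(lmers):
    """Adjacency lists: each l-mer contributes one edge prefix -> suffix."""
    graph = defaultdict(list)
    for m in lmers:
        graph[m[:-1]].append(m[1:])
    return graph

def eulerian_strings(lmers):
    """All strings spelled by paths that use every edge exactly once."""
    graph = de_bruijn(lmers)
    total = len(lmers)
    found = set()

    def walk(node, used, text):
        if len(used) == total:
            found.add(text)
            return
        for idx, nxt in enumerate(graph.get(node, [])):
            if (node, idx) not in used:          # edge identified by (tail, position)
                walk(nxt, used | {(node, idx)}, text + nxt[-1])

    for start in list(graph):
        walk(start, frozenset(), start)
    return found

S = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
print(sorted(de_bruijn(S).items()))
print(sorted(eulerian_strings(S)))
```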
SBH: Eulerian Path Approach
S = {ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT } Two different paths give different sequence reconstructions:
ATGGCGTGCA: AT → TG → GG → GC → CG → GT → TG → GC → CA
ATGCGTGGCA: AT → TG → GC → CG → GT → TG → GG → GC → CA
Euler Theorem
• A graph is balanced if for every vertex the number of incoming edges equals the number of outgoing edges:
in(v)=out(v)
• Theorem: A connected graph is Eulerian if and only if each of its vertices is balanced.
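The balance condition is easy to test on the de Bruijn graph of a spectrum. The sketch below also reports the "semi-balanced" case (exactly one vertex with one extra outgoing edge and one with one extra incoming edge), which is the standard condition for an Eulerian path rather than a cycle; that variant is added here as an assumption and is not stated on the slide. Connectivity is not checked, and the function names are hypothetical.

```python
from collections import Counter

def degree_imbalance(lmers):
    """out(v) - in(v) for every (l-1)-mer vertex."""
    diff = Counter()
    for m in lmers:
        diff[m[:-1]] += 1    # edge leaves its prefix
        diff[m[1:]] -= 1     # edge enters its suffix
    return dict(diff)

def classify(lmers):
    diff = degree_imbalance(lmers)
    unbalanced = [d for d in diff.values() if d != 0]
    if not unbalanced:
        return "balanced"
    if sorted(unbalanced) == [-1, 1]:
        return "semi-balanced"
    return "neither"

S = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
print(degree_imbalance(S))
print(classify(S))
```

For this S only AT (start) and CA (end) are unbalanced, which is consistent with both reconstructions beginning with AT and ending with CA.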
Repeat Graph
Vertices correspond to (k - 1)-mers in each read
Edges correspond to k – mers in each read
Example: S = ATGGCGTGCA
Reads = {ATGGC, GGCGTG, GTGCA}
3-mers = { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT }
(Figure: graph on vertices AT, TG, GG, GC, CG, GT, CA with one edge per 3-mer)
Two Eulerian paths (visit every EDGE once): ATGGCGTGCA and ATGCGTGGCA
Eulerian Superpath
Example: S = ATGGCGTGCA
Reads = {ATGGC, GGCGTG, GTGCA}
3-mers = { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT }
(Figure: the same graph, with each read drawn as a path)
Eulerian paths: ATGGCGTGCA and ATGCGTGGCA
Eulerian superpath: an Eulerian path that contains each path in a given set of paths (the reads) as a subpath.
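On the running example, the superpath condition can be checked by testing which candidate reconstruction contains every read as a substring, since a read occurring contiguously in the spelled string corresponds to a run of consecutive edges in the path. This is only a filter over already-enumerated Eulerian paths, not the Eulerian superpath algorithm itself; the function name is hypothetical.

```python
def consistent_with_reads(candidates, reads):
    """Keep the reconstructions in which every read occurs as a contiguous substring."""
    return [c for c in candidates if all(r in c for r in reads)]

candidates = ["ATGGCGTGCA", "ATGCGTGGCA"]
reads = ["ATGGC", "GGCGTG", "GTGCA"]
print(consistent_with_reads(candidates, reads))
```

Only ATGGCGTGCA survives: the read ATGGC does not occur in ATGCGTGGCA, so the reads resolve the ambiguity that the 3-mers alone leave open.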
Sources
• Serafim Batzoglou http://ai.stanford.edu/~serafim/CS262_2006/ (Sequencing slides)
• http://bioalgorithms.info (Euler slides)
• Euler assembler. Pevzner, Tang, and Waterman (PNAS 2001): on website