Bioinformatics Algorithms
Department of Computer Science and Engineering
BUET
Maximum Likelihood Genome Assembly
Paul Medvedev, Michael Brudno
Presented by Md. Tanvir Al Amin, Md. Shaifur Rahman, Khalid Mahmood
*Some of the slides are taken from other sources
04/20/23
Maximum Likelihood Genome Assembly {Medvedev, Brudno}
Computational Genomics
Our genome encodes an enormous amount of information about our beings: our looks, our size, how our bodies work, our health, our behaviors ... who we are!
gcgtacgtacgtagagtgctagtctagtcgtagcgccgtagtcgatcgtgtgggtagtagctgatatgatgcgaggtaggggataggatagcaacagatgagcggatgctgagtgcagtggcatgcgatgtcgatgatagcggtaggtagacttcgcgcataaagctgcgcgagatgattgcaaagragttagatgagctgatgctagaggtcagtgactgatgatcgatgcatgcatggatgatgcagctgatcgatgtagatgcaataagtcgatgatcgatgatgatgctagatgatagctagatgtgatcgatggtaggtaggatggtaggtaaattgatagatgctagatcgtaggtagtagctagatgcagggataaacacacggaggcgagtgatcggtaccgggctgaggtgttagctaatgatgagtacgtatgaggcaggatgagtgacccgatgaggctagatgcgatggatggatcgatgatcgatgcatggtgatgcgatgctagatgatgtgtgtcagtaagtaagcgatgcggctgctgagagcgtaggcccg…….
Contributions of the paper
Two-fold. The first: the first exact polynomial time algorithm for finding the shortest double-stranded genome, given its k-molecule spectrum
A problem that was solved for strings, but remained open for molecules
Contributions of the paper
The second: oppose the idea of the shortest genome, because it over-collapses repeats
Instead, propose a new objective: a maximum likelihood framework for assembling the genome that is most likely the source of the reads.
Contributions of the paper
Maximum likelihood framework
Assumes perfect reads
Assumes a uniform sampling distribution
Takes advantage of high coverage (NGS)
Estimates copy counts of repeats
Combines with matepair data
Reads => Contigs
Outline
Whole Genome Shotgun Assembly
Review of Related Work
The Medvedev-Brudno Method
Bidirected Overlap Graph
Adjustments to the Standard Min-cost Biflow Problem
Maximizing the Global Read-Count Likelihood
Efficiently Solving a Min-cost Biflow
Flow to Contigs
Conflict node resolution
Results
Discussion
Whole Genome Shotgun Sequencing
[Figure: pipeline DNA -> SEQUENCER -> reads -> ASSEMBLER -> contigs -> FINISHING -> sequence]
Sanger vs. NGS
Problems in Assembly
Sequencing Errors
Unknown Orientation
Incomplete Coverage
Repeats
Whole Genome Shotgun Sequencing
Break the genome into shotgun-sized fragments and sequence them
Match the overlapping regions of contiguous sequences
Demonstrated by Celera Genomics to be feasible for whole genome assembly
Sequenced the human genome at one tenth of the cost of the public Human Genome Project
Whole Genome Assembly
Next Generation Sequencing (NGS)
Improved speed and cost-effectiveness relative to the other methods ...
... but much shorter read length (25-200 bp)
Only proven on resequencing projects, i.e. where a reference genome is already available
Poses significant challenges to the problem of de novo genome assembly: the determination of a completely unknown genome.
Assemblers
Previous (Sanger) Assemblers
NGS Assemblers
SSAKE (Warren et al. 2007)
VCAKE (Jeck et al. 2007)
SHARCGS (Dohm et al. 2007)
Shorty (Chen and Skiena 2007)
ALLPATHS (Butler et al. 2008)
Edena (Hernandez et al. 2008)
Euler-(U)SR (Chaisson and Pevzner 2008, 2009)
Velvet (Zerbino and Birney 2008)
Theoretical view
Input: a set of strings over {A,C,G,T}, called reads
Output: a common superstring of the reads
Example: {TACAT, CATAC, ACGTAC} -> TACATACGTAC
Initially: Shortest Common Superstring (SCS)
NP-hard [Gallant et al 1980]
Over-collapsing of repeats
Can be found using a TSP solver
de Bruijn graphs [Pevzner, Tang, Waterman 01], string graphs [Myers 05]
Both formulations are NP-hard.
String graph (Myers)
Represent reads as vertices, and read overlaps as edges
Remove redundant edges
Establish edge constraints
Unique? (flow is exactly one)
Required? (min. flow is 1)
Optional? (min. flow is 0)
Find shortest walk
EULER assembler (Pevzner, Tang and Waterman)
Represent reads as edges and overlaps as vertices in a de Bruijn graph
Assembly can be efficiently solved as an Eulerian Path Problem: each edge must be visited exactly once
Repeats dealt with by using multiple edges for a single repeat read
Overlap Graph
Nodes are reads
Edges are overlaps
Weights are lengths of the non-overlapping prefix
TSP tour is SCS
Example: {TACAT, CATAC, ACGTAC} -> TACATACGTAC
[Figure: overlap graph on the reads ACGTAC, CATAC and TACAT, with edge weights 3, 5, 5, 3, 2, 2]
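The edge weights in this example can be recomputed with a short sketch. The function names (`overlap`, `prefix_weight`, `spell`) are illustrative and not from the slides; the sketch only handles single-stranded reads and exact overlaps.

```python
from itertools import permutations

def overlap(a: str, b: str) -> int:
    """Length of the longest proper suffix of a that is also a prefix of b."""
    for length in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def prefix_weight(a: str, b: str) -> int:
    """Edge weight a -> b: length of the part of a that b does not overlap."""
    return len(a) - overlap(a, b)

def spell(path):
    """Merge the reads along a path of the overlap graph into one string."""
    result = path[0]
    for prev, nxt in zip(path, path[1:]):
        result += nxt[overlap(prev, nxt):]
    return result

reads = ["TACAT", "CATAC", "ACGTAC"]
weights = {(a, b): prefix_weight(a, b) for a, b in permutations(reads, 2)}
superstring = spell(["TACAT", "CATAC", "ACGTAC"])
```

Here `weights` reproduces the six edge labels of the figure, and `spell` applied to the path TACAT, CATAC, ACGTAC yields the common superstring given on the slide.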
Why Shortest CS?
DNA is full of repeats: identical and nearly identical copies that appear multiple times
The Alu repeat is 300 bases long and is present 1,000,000 times in the human genome
The SCS approach “over-collapses” the repeats: they are only present once in the answer
Solution: model repeats explicitly through either de Bruijn graphs or string graphs
Maybe this will also become tractable?
De Bruijn Graphs
Nodes are (k-1)-mers
Edges are k-mers
The set of k-mers is called a k-spectrum
Finding the shortest string with a given k-spectrum is equivalent to the Chinese Postman problem
Example: {AGC, ATC, ATT, CAG, CAT, GCA, TCA, TTC}
[Figure: de Bruijn graph on the nodes CA, GC, AG, TC, AT, TT]
Pevzner 1989
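A minimal sketch of the construction for the example 3-spectrum, assuming a plain (single-stranded) de Bruijn graph. The helper names `de_bruijn` and `imbalance` are hypothetical. The imbalance of a node (out-degree minus in-degree) shows which nodes a Chinese postman tour has to fix by duplicating edges.

```python
from collections import defaultdict

def de_bruijn(spectrum):
    """Each k-mer becomes an edge from its (k-1)-prefix to its (k-1)-suffix."""
    graph = defaultdict(list)
    for kmer in spectrum:
        graph[kmer[:-1]].append(kmer[1:])
        graph[kmer[1:]]  # make sure nodes without out-edges are present too
    return dict(graph)

def imbalance(graph):
    """Out-degree minus in-degree for every node."""
    balance = {node: len(successors) for node, successors in graph.items()}
    for successors in graph.values():
        for node in successors:
            balance[node] -= 1
    return balance

spectrum = ["AGC", "ATC", "ATT", "CAG", "CAT", "GCA", "TCA", "TTC"]
graph = de_bruijn(spectrum)
unbalanced = {node: b for node, b in imbalance(graph).items() if b != 0}
```

For this spectrum only the nodes AT and TC are unbalanced, so the graph has an Eulerian path but no Eulerian circuit.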
De Bruijn Graphs with Walks
Nodes are (k-1)-mers
Edges are k-mers
Reads are walks
Goal: find a superwalk (one that includes all walks)
No polynomial time algorithm is known: De Bruijn Superwalk is NP-hard
Example: {AGC, ATTCA, CATT, GCAG, ATG}
[Figure: de Bruijn graph on the nodes CA, GC, AG, TC, AT, TT]
Pevzner et al 2001
Chinese Postman Tours
Solving Chinese Postman: an Eulerian tour is a solution
Eulerization: make the graph Eulerian
Can be done with min cost flow:
Unbalanced nodes are sources/sinks
Duplicate all edges used in the flow
Pevzner 1989
Example: {AGC, ATTCA, CATT, GCAG, ATG}
[Figure: de Bruijn graph on the nodes CA, GC, AG, TC, AT, TT]
DNA is not a String
One strand, read 5’ to 3’: ATTGCCAAC
The opposite strand, read 5’ to 3’: GTTGGCAAT
3-spectrum of ATTGCCAAC: {AAC, ATT, CAA, CCA, GCC, TGC, TTG}
3-spectrum of GTTGGCAAT: {GTT, AAT, TTG, TGG, GGC, GCA, CAA}
[Figure: one de Bruijn graph per strand, on the nodes CC, CA, TG, GC, AA, TT, AT, AC, GG, GT]
• The shortest walk that visits every edge at least once (a Chinese postman tour) is the shortest string with the given k-spectrum [Pevzner 1989]
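The two strands and their 3-spectra can be checked with a small sketch. The function names `reverse_complement` and `spectrum` are illustrative only.

```python
def reverse_complement(seq: str) -> str:
    """The opposite strand of a DNA string, read 5' to 3'."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))

def spectrum(seq: str, k: int) -> set:
    """The set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

top = "ATTGCCAAC"
bottom = reverse_complement(top)
top_spectrum = spectrum(top, 3)
bottom_spectrum = spectrum(bottom, 3)
```

A string-based assembler sees these two spectra as unrelated inputs, although they describe the same double-stranded molecule. That is the motivation for modelling both strands in one graph.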
Complexity of CPT
Graph type    Complexity         Equivalent to
Undirected    Polynomial time    Matching
Directed      Polynomial time    Network flow
Mixed         NP-hard
Bidirected    Polynomial time    Bidirected flow
Modeling Double Strandedness
Kececioglu 91, Kececioglu-Myers 95
Modeling Double-Strandedness
How can two DNA molecules overlap?
[Figure: the three ways the molecule +AAC/-GTT can overlap another molecule, shown with +CTT/-AAG, +TCG/-CGA and +TGG/-CCA]
Example molecule: ATTGCCAAC read 5’ to 3’ on one strand, GTTGGCAAT read 5’ to 3’ on the other
Kececioglu 1992
Walks in bidirected graphs
A walk has to “match” directions at each node.
Consider the node +AA/-TT.
Edge orientations correspond to strands
A path can use a node in both orientations
[Figure: bidirected graph on the nodes +AT/-AT, +AA/-TT, +AC/-GT, +GC/-GC, +CA/-TG, +CC/-GG]
Rules for Matching Directions
When we walk through a node, we can:
Come in using an in arrow, then leave using an out arrow. This is forward, so read the “+” strand, i.e. AA here
Come in using an out arrow, then leave using an in arrow. This is backward, so read the “-” strand, i.e. TT here
[Figure: bidirected graph on the nodes +AT/-AT, +AA/-TT, +AC/-GT, +GC/-GC, +CA/-TG, +CC/-GG]
Bidirected Graphs
So what does this walk correspond to?
• GGCAAT
• ATTGCC
[Figure: bidirected graph on the nodes +AT/-AT, +AA/-TT, +AC/-GT, +GC/-GC, +CA/-TG, +CC/-GG]
Bidirected de Bruijn Graphs
The shortest walk that visits every edge at least once (a Chinese postman tour) is the shortest DNA molecule with the given k-spectrum
3-spectrum of ATTGCCAAC (5’ to 3’): {AAC, ATT, CAA, CCA, GCC, TGC, TTG}
3-spectrum of the opposite strand GTTGGCAAT (5’ to 3’): {GTT, AAT, TTG, TGG, GGC, GCA, CAA}
[Figure: bidirected de Bruijn graph on the nodes +AT/-AT, +AA/-TT, +AC/-GT, +GC/-GC, +CA/-TG, +CC/-GG, shown twice]
Representing Bidirected graphs
Motivation: Overlap Graphs
Several downsides of the de Bruijn approach
Division into k-mers is arbitrary
Very sensitive to sequencing errors
Not memory efficient (one node per k-mer)
Goal
One node per read (or better)
No division into k-mers
Flexibility in the presence of sequencing errors
Myers 2005
How To Build An Overlap Graph (1)
{ACGTAC, CATAC, TACAT}
Nodes are reads
Edges are overlaps
Weights are lengths of the non-overlapping prefix
Transitively inferable overlaps are removed
[Figure: overlap graph on ACGTAC, TACAT and CATAC with edge weights 3, 5, 3, 2, 2, spelling TACATACGTAC]
Bidirected Overlap Graph
In this work, the authors use a bidirected overlap graph.
In a bidirected overlap graph, each vertex is a double-stranded read
Edges represent read overlaps
Bidirected Overlap Graph
There are three possible ways that two double-stranded reads can overlap (corresponding to the three types of edges)
Suppose we have two reads r1 and r2
Each read can be oriented to the left or to the right
The three possible overlaps are:
i) Both reads point in the same direction (both can point left, or both can point right; it’s the same overlap either way)
ii) r1 points left and r2 points right
iii) r1 points right and r2 points left
Bidirected Overlap Graph
The overlap graph is constructed by placing an edge between two reads if they overlap by at least a minimum number of characters, omin
Question: How is omin determined?
Then perform transitive edge reduction: remove overlaps covered by two shorter overlaps
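The two steps can be sketched on a toy input. This is a deliberately simplified, single-stranded version: it ignores reverse complements and edge orientations, and it treats an edge a -> c as redundant whenever edges a -> b and b -> c both exist. The reads and the names `build_overlap_graph` and `transitive_reduction` are made up for illustration and are not the authors' implementation.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest proper suffix of a that is also a prefix of b."""
    for length in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def build_overlap_graph(reads, o_min):
    """Keep an edge a -> b only if the overlap has at least o_min characters."""
    edges = {}
    for a in reads:
        for b in reads:
            if a != b and overlap(a, b) >= o_min:
                edges[(a, b)] = overlap(a, b)
    return edges

def transitive_reduction(edges):
    """Drop a -> c when some b gives a -> b and b -> c (simplified check)."""
    successors = {}
    for a, b in edges:
        successors.setdefault(a, set()).add(b)
    reduced = dict(edges)
    for a, c in edges:
        if any(b != c and (b, c) in edges for b in successors[a]):
            del reduced[(a, c)]
    return reduced

reads = ["AACGT", "ACGTT", "CGTTG"]
edges = build_overlap_graph(reads, o_min=3)
reduced = transitive_reduction(edges)
```

With omin = 3 the three reads give three overlaps; the overlap between the first and the last read is implied by the other two and is removed.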
Observation
A bidirected graph contains an Eulerian circuit if and only if it is connected and balanced.
Chinese postman Problem on Bidirected Graphs
Let G be a weighted bidirected graph.
There exists a circuit of weight i if and only if there exists an Eulerian extension of weight i.
G has a circuit if and only if it is strongly connected.
The minimum weight Eulerian extension of G has at most 2|E||V| edges.
Chinese postman Problem on Bidirected Graphs
The running time of Algorithm 1 is $O(|E|^2 \log(|V|) \log(|E|))$.
Gabow’s algorithm runs in $O(|E|^2 \log(|V|) \log(\max_e u(e)))$
u is the flow upper bound function
f(e) <= 2|E||V| for every edge e, so we can safely let u(e) = 2|E||V|
Chinese postman Problem on Bidirected Graphs
Hence the theorem is proved: Given a set of k-molecules S, we can find the shortest (k-1)-circular DNA molecule whose k-molecule spectrum is S in time $O(|S|^2 \log^2(|S|))$.
This is a polynomial time algorithm that explicitly handles double-strandedness
It is the first main result of the paper.
Sequence assembly using NGS
Several methods are available now (e.g. SSAKE, VCAKE, SHARCGS, etc.)
All of these assume that the length of the assembled genome must be minimized
This results in over-collapsing of repeats
Given the ubiquity of repeats in eukaryotic genomes, the authors considered this a poor assumption
Goal of an Assembler
What should the goal of an assembler be?
Shortest string?
Problem of over-collapse
Maximum Likelihood Genome Assembly
Change the goal of sequence assembly
Maximize the likelihood that the resultant genome was the source of the given reads
Take advantage of the high coverage of NGS to statistically estimate the copy-count of each read: identify and quantify repeats
Maximizing the likelihood of observed read frequencies can be cast as a minimum cost bidirected flow (biflow) problem
Allows the solution to be obtained with an off-the-shelf network flow solver
Authors claim 99.99% accuracy
Maximum Likelihood Genome Assembly
The second important aspect is the use of matepair information for joining contigs
Other systems look for all paths between mated reads
The proposed method looks only for short paths between some pairs of reads
Question: How to decide the upper bound for these “short paths”? And how to decide which pairs of reads to examine?
Adjustments to the Standard Min-cost Biflow Problem
Standard Min-cost Biflow Problem
Set upper and lower flow bounds on each edge
The flow function f : E → N must obey the constraint $l(e) \le f(e) \le u(e)$ for each edge e
For each vertex, the incoming flow is balanced with the outgoing flow
Objective: find the flow that minimizes $\sum_{e \in E} c_e \, f(e)$
Adjustments to the Standard Min-cost Biflow Problem
Medvedev-Brudno Min-cost Biflow Problem
Upper and lower flow bounds on vertices as well
Accomplished by splitting every vertex v into two: v+ and v-
Adjustments to the Standard Min-cost Biflow Problem
v- serves as the “incoming” vertex, and inherits v’s incoming edges
v+ serves as the “outgoing” vertex, and inherits v’s outgoing edges
Finally add one edge between v- and v+ and assign it the upper and lower flow bounds for v
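The splitting step can be sketched for an ordinary directed graph, which is a simplification of the bidirected case on the slides. The function name `split_vertices` and the tuple layout are assumptions made for this example; original edges get the bounds 0 and infinity.

```python
INF = float("inf")

def split_vertices(edges, vertex_bounds):
    """Turn vertex bounds into edge bounds.

    edges: list of directed edges (u, v).
    vertex_bounds: {vertex: (lower, upper)}.
    Returns a list of (tail, head, lower, upper) edges over the split
    vertices (v, "-") and (v, "+").
    """
    result = []
    for v, (lower, upper) in vertex_bounds.items():
        # the internal edge v- -> v+ carries the bounds of v
        result.append(((v, "-"), (v, "+"), lower, upper))
    for u, v in edges:
        # an original edge leaves u through u+ and enters v through v-
        result.append(((u, "+"), (v, "-"), 0, INF))
    return result

edges = [("a", "b"), ("b", "c")]
bounds = {"a": (1, INF), "b": (1, INF), "c": (1, INF)}
split = split_vertices(edges, bounds)
```

All flow passing through v now has to cross the single edge from v- to v+, so an edge bound on that edge acts as a vertex bound on v.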
Adjustments to the Standard Min-cost Biflow Problem
Second variation: represent the cost ce as a convex function
A function is convex if the set of points on or above its graph forms a convex set
A convex set is a region where, for every pair of points within the region, every point on the straight line segment connecting those points also lies within the region
Convex Function
Adjustments to the Standard Min-cost Biflow Problem
A region that is not convex has some concave portion that contradicts the above property of convex sets
In the overlap graph, convex functions are modelled with piecewise-linear approximations, allowing the flow to be solved as a linear min-cost flow problem
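One common way to make a convex cost linear, shown here as a sketch and not necessarily the exact scheme the authors use, is to replace an edge by parallel unit-capacity edges whose linear costs are the marginal costs of the convex function. The function names and the quadratic example cost are hypothetical.

```python
def unit_segments(cost, lower, upper):
    """Marginal cost of each additional unit of flow between lower and upper."""
    return [cost(d + 1) - cost(d) for d in range(lower, upper)]

def piecewise_cost(cost, lower, flow, segments):
    """Cost of sending `flow` units when the cheapest segments are used first."""
    return cost(lower) + sum(segments[: flow - lower])

def example_cost(d):
    # an arbitrary convex cost with its minimum at d = 3
    return (d - 3) ** 2

segments = unit_segments(example_cost, 1, 6)
total = piecewise_cost(example_cost, 1, 4, segments)
```

Because the cost is convex, the marginal costs never decrease. A linear min-cost solver therefore fills the parallel edges in order, and the summed segment costs match the convex cost at every integer flow value.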
Adjustments to the Standard Min-cost Biflow Problem
Supersource and supersink added to convert flow problem into circulation problem
Each vertex has a lower bound of 1, since each read must appear in the finished genome at least once
Edge bounds are set to 0 (lower bound) and infinity (upper bound)
Adjustments to the Standard Min-cost Biflow Problem
Prohibitively large cost on the edge leading from the supersource and the edge leading to the supersink to ensure that the assembly uses the smallest number of contigs possible
Flow through each vertex represents number of times it appears in the assembled genome
Supersource and Supersink
Maximum Likelihood Framework
Let D be a circular genome of length N(D)
$d_i$ = number of times the k-molecule i appears in D
Suppose i = ACGT
For simplicity, they are drawn as strings instead of molecules
Maximum Likelihood Framework
Random trial
Sample a position and take a k-molecule
What is the probability that the k-molecule is i? $\frac{d_i}{N(D)}$
For simplicity, they are drawn as strings instead of molecules
Maximum Likelihood Framework
Sample uniformly
We call it a success if we get i
So, p = success probability = $\frac{d_i}{N(D)}$
We do the experiment n times
Let $X_i$ be the random variable indicating the number of times we get i
What is the distribution of $X_i$?
Binomial distribution
Maximum Likelihood Framework
How many options are there for i?
There are $4^k$ possibilities, hence $4^k$ random variables
Suppose k = 3: $X_1, X_2, X_3, X_4, X_5, X_6, \ldots, X_{64}$
They are $X_{AAA}, X_{AAC}, X_{AAG}, X_{AAT}, \ldots, X_{TTT}$
Maximizing the Global Read-Count Likelihood
Taking all random variables over n experiments:
What is the probability that AAA comes $x_{AAA}$ times, and AAC comes $x_{AAC}$ times, ..., and CGT comes $x_{CGT}$ times, ..., and TTT comes $x_{TTT}$ times?
Each random variable for every possible k-mer has a binomial distribution. Their joint distribution is the following multinomial distribution:
$$P(X_1 = x_1, X_2 = x_2, \ldots, X_{4^k} = x_{4^k}) = \frac{n!}{\prod_i x_i!} \prod_i \left(\frac{d_i}{N(D)}\right)^{x_i}$$
Maximum Likelihood Framework
D is not known, but the results of the n trials are known!
The probability can be considered as the likelihood of the parameters of the distribution, $d_i$, given the outcomes of the trials, $x_i$. This is called the Global Read-count Likelihood:
$$L(d_1, \ldots, d_{4^k} \mid x_1, \ldots, x_{4^k})$$
Maximizing the Global Read-Count Likelihood
Goal is to maximize L, or, equivalently, minimize the negative log of L
$$L(d_1, \ldots, d_{4^k} \mid x_1, \ldots, x_{4^k}) = \frac{n!}{\prod_i x_i!} \prod_i \left(\frac{d_i}{N(D)}\right)^{x_i}$$
Maximizing the Global Read-Count Likelihood
To translate this problem into a convex min-cost biflow problem, we need convex functions $c_i$ for each k-mer
We need something like: $-\log L = \sum_i c_i(d_i)$
Problem: the $X_i$ random variables are not independent, because we have the constraint $\sum_i d_i = N(D)$
Maximizing the Global Read-Count Likelihood
But, as the number of trials goes to infinity, the Xi random variables become independent.
In NGS techniques, the number of trials is usually large enough to warrant the approximation of the multinomial distribution as the product of the binomial distributions for each Xi
Maximizing the Global Read-Count Likelihood
In this binomial approximation, the genome length N(D) is treated as a constant, independent of the sampling frequencies
Therefore, use N instead, which is the actual length of the genome
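Maximizing the Global Read-Count Likelihood
New approximation of L:
$$L(d_1, \ldots, d_{4^k} \mid x_1, \ldots, x_{4^k}) \approx \prod_i P(X_i = x_i) = \prod_i \binom{n}{x_i} \left(\frac{d_i}{N}\right)^{x_i} \left(1 - \frac{d_i}{N}\right)^{n - x_i}$$
Now
$$-\log L = K + \sum_i c_i(d_i)$$
where K does not depend on the $d_i$, and
$$c_i(d_i) = -x_i \log d_i - (n - x_i) \log(N - d_i)$$
And $c_i$ is used as the convex function for the vertices of the min-cost biflow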
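As a numerical sanity check, the per-vertex cost can be evaluated directly. This sketch assumes the cost has the form $c_i(d_i) = -x_i \log d_i - (n - x_i) \log(N - d_i)$; the function names and the toy numbers (N = 1000, n = 5000) are made up for illustration.

```python
import math

def read_count_cost(d, x, n, N):
    """c(d) = -x*log(d) - (n - x)*log(N - d) for a read seen x times in n trials."""
    return -x * math.log(d) - (n - x) * math.log(N - d)

def best_copy_count(x, n, N):
    """Integer copy count d in 1..N-1 that minimizes the cost."""
    return min(range(1, N), key=lambda d: read_count_cost(d, x, n, N))

N, n = 1000, 5000          # toy genome length and number of sampled reads
costs = [read_count_cost(d, 10, n, N) for d in range(1, 10)]
second_differences = [costs[i] - 2 * costs[i + 1] + costs[i + 2] for i in range(len(costs) - 2)]
```

Setting the derivative of the cost to zero gives d = xN/n, so a read seen 10 times out of 5000 samples in a genome of length 1000 gets an estimated copy count of 2. The positive second differences illustrate the convexity that the min-cost flow formulation relies on.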
Efficiently Solving a Min-Cost Biflow
Problem: no existing efficient implementation of a min-cost biflow algorithm
Gabow (1983) presented a polynomial time algorithm for min-cost biflow, but it is difficult to implement
The authors didn’t find any existing implementation either
The authors solve it by converting the bidirected flow problem into a directed flow problem.
Efficiently Solving a Min-Cost Biflow
Directed network flow is solved by reducing the problem to a linear program (LP)
Use a |V| x |E| edge incidence matrix I derived from the overlap graph
If cell $I_{m,n}$ has a value of 1, then edge n is an in-edge for vertex m
If the value is -1, n is an out-edge
0 means n and m are not on speaking terms (edge n does not touch vertex m)
Use the incidence matrix as the constraint matrix for the LP: the optimal LP solution corresponds to a minimum cost flow
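A small sketch of the incidence matrix for a directed graph, using the sign convention from this slide (+1 for an in-edge, -1 for an out-edge). The function names and the three-vertex cycle are illustrative. Multiplying the matrix by a flow vector gives the net inflow of every vertex, which the LP constrains to zero.

```python
def incidence_matrix(vertices, edges):
    """I[m][n] is +1 if edge n enters vertex m, -1 if it leaves, 0 otherwise."""
    row = {v: m for m, v in enumerate(vertices)}
    matrix = [[0] * len(edges) for _ in vertices]
    for n, (tail, head) in enumerate(edges):
        matrix[row[head]][n] += 1
        matrix[row[tail]][n] -= 1
    return matrix

def net_inflow(matrix, flow):
    """Matrix-vector product: inflow minus outflow at every vertex."""
    return [sum(entry * f for entry, f in zip(matrix_row, flow)) for matrix_row in matrix]

vertices = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c"), ("c", "a")]
I = incidence_matrix(vertices, edges)
balanced = net_inflow(I, [2, 2, 2])
unbalanced = net_inflow(I, [2, 1, 2])
```

A flow that sends the same amount around the cycle is balanced at every vertex; changing the flow on one edge breaks conservation at its two endpoints.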
Efficiently Solving a Min-Cost Biflow
The incidence matrix of a directed graph is Totally Unimodular (TU)
This leads to linear programs that always have integer solutions.
Makes it possible to produce an integral solution with LP, rather than resorting to Integer Programming, which is NP-hard
Efficiently Solving a Min-Cost Biflow
In the bidirected case it is possible for +2 or -2 to appear in the incidence matrix, since two in-ends or two out-ends of an edge can meet at a single vertex
The incidence matrix is actually a binet matrix
The optimal LP solution for binet matrices is guaranteed to be half-integral (i.e. the values are multiples of 0.5)
Hochbaum 2004
Efficiently Solving a Min-Cost Biflow
Monotonization Procedure
For every vertex v in the bidirected graph, replace it with two vertices v1 and v2 in the new graph
Each of v’s in-edges is replaced with two edges, one of which points into v1, while the other points out of v2
Likewise, each of v’s out-edges is replaced with two edges, one of which points out of v1, while the other points into v2
Bounds and costs from the original graph are transferred to the new graph, and the solution on the new graph is transferred back to the original graph
Hochbaum 2004
Efficiently Solving a Min Cost Flow
The problem can now be solved with off-the-shelf software
After finding the min cost flow in the directed graph, transfer the results to the original bidirected graph by adding the flows through each pair of twin edges and dividing by two.
Hence, the result is half-integral, and the monotonized flow is at worst a 2-approximation to the optimal integral flow.
Hochbaum 2004
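The monotonization rules and the transfer back can be written down as a sketch. The encoding is an assumption made for this example: a bidirected edge is a tuple (u, side_u, v, side_v), where each side is "in" or "out" depending on whether the arrowhead at that end points into or out of the vertex. The function names are hypothetical, and the directed min-cost flow itself is left to an external solver.

```python
def is_head(side, copy):
    """In-ends are heads at copy 1 and tails at copy 2; out-ends the reverse."""
    return (side == "in") == (copy == 1)

def monotonize(biedges):
    """Replace every bidirected edge by two directed twin edges.

    Returns a list of (tail, head, original_edge_index), where the new
    vertices are pairs (v, 1) and (v, 2).
    """
    directed = []
    for index, (u, side_u, v, side_v) in enumerate(biedges):
        for u_copy in (1, 2):
            # choose the copy of v so the twin has exactly one tail and one head
            v_copy = u_copy if side_u != side_v else 3 - u_copy
            if is_head(side_u, u_copy):
                directed.append(((v, v_copy), (u, u_copy), index))
            else:
                directed.append(((u, u_copy), (v, v_copy), index))
    return directed

def to_biflow(directed, directed_flow, edge_count):
    """Average the flows on each pair of twin edges."""
    flow = [0.0] * edge_count
    for (_, _, index), value in zip(directed, directed_flow):
        flow[index] += value / 2
    return flow

biedges = [("a", "out", "b", "in"), ("b", "in", "c", "in"), ("a", "out", "c", "out")]
directed = monotonize(biedges)
biflow = to_biflow(directed, [1, 2, 0, 1, 3, 3], len(biedges))
```

An ordinary directed edge from a to b becomes a1 -> b1 and b2 -> a2, while an edge with two in-ends or two out-ends connects a copy-1 vertex with a copy-2 vertex. Averaging the twin flows can produce values such as 1.5, which is why the result is only half-integral.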
Flow to Contigs
The flow has been solved. Now decompose it into a collection of walks, which translate into assembled contigs
The graph is first simplified by removing all edges with a flow of zero
Additional simplifications are possible ...
Flow to Contigs
... by removing vertices v where:
There is exactly one edge going into v and one edge leading out of v, and the flow on both edges is the same
There is additionally a loop with the same flow as the other two edges
v is a split or join vertex, where the flow on the in-edges is the same as that on the out-edges
Flow to Contigs
After at most 2|V| of these simplifications, the remaining vertices are conflict vertices: those that didn’t match the previous criteria
Conflict Node Resolution
Using matepair information
Look for edges at these vertices with opposite orientations supported by matepairs
Use BFS to find all reads within a certain distance from the vertex (in both directions)
We get two sets of vertices, L and R, corresponding to reads that were observed on the incoming side of the vertex and on the outgoing side.
Match those reads that are matepairs.
For those matepairs where one read is on the incoming side and the other is on the outgoing side, find the shortest path between them using Dijkstra’s algorithm
Resolving Conflict Nodes with Mate Pairs
[Figure: a conflict node with incident edges A, B, A’ and B’]
Does there exist a short path between A’ and B’?
• Dijkstra’s shortest path algorithm, bounded
• Greedily join edges if they have enough supporting reads.
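A bounded shortest-path search can be sketched with the standard library heap. This is a generic directed-graph version; it does not model bidirected edge orientations, and the graph, the node names and the distance bounds below are made up for illustration.

```python
import heapq

def bounded_shortest_path(graph, source, target, max_dist):
    """Dijkstra that gives up on paths longer than max_dist.

    graph: {node: [(neighbor, length), ...]}.
    Returns the shortest distance, or None if no path within the bound exists.
    """
    best = {source: 0}
    heap = [(0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == target:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, length in graph.get(node, []):
            candidate = dist + length
            if candidate <= max_dist and candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return None

graph = {"A'": [("x", 40), ("y", 10)], "y": [("x", 10)], "x": [("B'", 30)]}
within_bound = bounded_shortest_path(graph, "A'", "B'", max_dist=60)
too_far = bounded_shortest_path(graph, "A'", "B'", max_dist=45)
```

Pruning every candidate beyond the bound keeps the search local to the conflict node, which matters when it is repeated for many matepairs.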
Greedy Matching
Make note of the number of mates that fall within the expected insert distance
Pairs of in/out edges that have a significant number of matepairs that fall within the insert distance are joined into a common edge
The previous step is repeated until no more edges can be joined in this manner
Graph simplification continues in iterative phases until convergence
Results
Generated synthetic reads from E. coli genome, which has a total length of 4.6 Mega basepairs.
Simulated matepairs’ distances were uniformly distributed within 10% of the expected insert size
Reads were 25 bp long, and error-free
Results
Coverage levels tested: 50x, 75x, 100x, and 200x
Minimum overlap length varied between 17 and 21
The authors report that the overall running time of the algorithm is approximately 1 hour on one machine
Question: What kind of machine?
Copy Count Results
Authors compared the flow going through every vertex in the overlap graph to the number of times that the corresponding read appears.
Read Count Results
Compared vertex flow with read frequency in the original genome
High degree of accuracy
Error rate between $10^{-4}$ and $10^{-6}$
Generally more tendency to overestimate read frequency
Authors claim only slight improvements beyond 75x coverage, but 200x coverage is fantastically good
Assembly Results
Take the edges of the graph produced after the conflict node resolution and generate the sequences they spell out
Compute N50: the length of the shortest contig such that 50% of the genome lies in contigs of at least that length
Also compute N90: similar to N50, but the cutoff is 90%
Finally, compute errors by aligning each contig to the reference genome and counting how many local alignments it takes to completely tile the contig (minus one, because it always takes at least one alignment to do it)
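N50 and N90 can be computed with a few lines. The function name `n_stat` and the contig lengths are made up for illustration. By default the sketch measures the fraction against the total assembled length; pass the genome length as `total` to measure against the genome, as the slide's definition does.

```python
def n_stat(contig_lengths, fraction, total=None):
    """Length L of the shortest contig such that contigs of length >= L
    cover at least `fraction` of `total`."""
    if total is None:
        total = sum(contig_lengths)
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if covered >= fraction * total:
            return length
    return 0  # the contigs never reach the requested fraction

contigs = [80, 70, 50, 40, 30, 20]
n50 = n_stat(contigs, 0.5)
n90 = n_stat(contigs, 0.9)
```

The contigs are visited from longest to shortest, and the statistic is the length of the contig at which the running total first reaches the requested fraction.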
Assembly Results
[Figures: N50 results and N90 results]
Assembly Results (cont’d)
N50 (the contig length at which 50% of the genome is covered) varied between 23 and 28 kb
N90 (the contig length at which 90% of the genome is covered) varied between 7 and 8 kb
N50 error rate: about 1 error per 100-180 kb
N90 error rate: about 1 error per 100-160 kb
The greedy algorithm can be fooled by several strong edge matches
Contig size is good relative to other whole genome assemblies involving small read sizes
Discussion
Demonstrated that bidirected flow is a powerful method for genome assembly.
Introduced a maximum likelihood framework for sequence assembly
By unifying Pevzner’s work on de Bruijn graphs, Kececioglu and Myers’s work on bidirected graphs in assembly, and Edmonds and Gabow’s work on bidirected flow, the paper gives an exact polynomial time assembly algorithm in the parsimony setting that explicitly deals with double-strandedness.
Discussion
First major assumption: reads are error-free
Can be overcome with higher coverage
Second major assumption: uniform sampling of all genomic regions
Reality: certain portions of the genome are easier to sample than others
More difficult to overcome
Could be overcome by establishing the biases of the sequencing apparatus used
Future Research
Exploration of the exact biases of the NGS platforms
Correction for these
Is there any better heuristic for the greedy resolution?
Questions ??
Thank you