alignment problem
DESCRIPTION
Alignment Problem. (Optimal) pairwise alignment consists of considering all possible alignments of two sequences and choosing the optimal one. Sub-optimal (heuristic) alignment algorithms are also very important: e.g. BLAST. Key Issues. Types of alignments (local vs. global)
![Page 1: Alignment Problem](https://reader036.vdocument.in/reader036/viewer/2022070501/56816953550346895de0fbae/html5/thumbnails/1.jpg)
Alignment Problem
(Optimal) pairwise alignment consists of considering all possible alignments of two sequences and choosing the optimal one.
Sub-optimal (heuristic) alignment algorithms are also very important: e.g. BLAST
![Page 2: Alignment Problem](https://reader036.vdocument.in/reader036/viewer/2022070501/56816953550346895de0fbae/html5/thumbnails/2.jpg)
Key Issues
Types of alignments (local vs. global)
The scoring system
The alignment algorithm
Measuring alignment significance
![Page 3: Alignment Problem](https://reader036.vdocument.in/reader036/viewer/2022070501/56816953550346895de0fbae/html5/thumbnails/3.jpg)
Types of Alignment
Global: sequences are aligned from end to end.
Local: alignments may start and end in the middle of either sequence.
Ungapped: no insertions or deletions are allowed.
Other types: overlap alignments, repeated match alignments.
![Page 4: Alignment Problem](https://reader036.vdocument.in/reader036/viewer/2022070501/56816953550346895de0fbae/html5/thumbnails/4.jpg)
Local vs. Global Pairwise Alignments
A global alignment includes all elements of the sequences and includes gaps.
A global alignment may or may not include "end gap" penalties.
Global alignments are better indicators of homology and take longer to compute.
A local alignment includes only subsequences, and is sometimes computed without gaps.
Local alignments can find shared domains in divergent proteins and are fast to compute.
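To make the global/local distinction concrete, here is a minimal sketch of the two classic dynamic-programming scores: Needleman-Wunsch for global alignment and Smith-Waterman for local alignment. The scoring values (match +1, mismatch -1, linear gap -2) and the function names are illustrative assumptions, not values taken from the slides.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch: best score aligning all of a against all of b."""
    # Row 0: aligning the empty prefix of a against prefixes of b costs only gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman: best score over all pairs of substrings of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets an alignment restart anywhere.
            cell = max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best
```

For two sequences that share only a short core, such as `TTTACGTTT` and `GGGACGGGG`, the local score rewards the shared `ACG` while the global score is dragged down by the mismatched flanks.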
![Page 5: Alignment Problem](https://reader036.vdocument.in/reader036/viewer/2022070501/56816953550346895de0fbae/html5/thumbnails/5.jpg)
How do you compare alignments?
Scoring scheme
What events do we score? Matches, mismatches, gaps.
What scores will you give these events?
What assumptions are you making?
Score your alignment
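As a small illustration of scoring a given alignment, the sketch below sums a score over the columns of two gapped rows. The values (match +1, mismatch -1, gap -2) and the use of `-` as the gap character are assumptions chosen for the example; choosing them is exactly the modelling decision the questions above are asking about.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Score two equal-length alignment rows column by column."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += gap        # insertion or deletion
        elif x == y:
            total += match
        else:
            total += mismatch
    return total
```

For example, `score_alignment("ACGT", "A-GT")` counts three matches and one gap.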
![Page 6: Alignment Problem](https://reader036.vdocument.in/reader036/viewer/2022070501/56816953550346895de0fbae/html5/thumbnails/6.jpg)
Scoring Matrices
How do you determine scores?
What is out there already for your use?
DNA versus amino acids?
TTACGGAGCTTC
CTGAGATCC
![Page 7: Alignment Problem](https://reader036.vdocument.in/reader036/viewer/2022070501/56816953550346895de0fbae/html5/thumbnails/7.jpg)
Multiple Sequence Alignment
Global versus Local Alignments
Progressive alignment
Estimate guide tree
Do pairwise alignment on subtrees
ClustalX
![Page 8: Alignment Problem](https://reader036.vdocument.in/reader036/viewer/2022070501/56816953550346895de0fbae/html5/thumbnails/8.jpg)
Improvements
Consistency-based Algorithms
T-Coffee: consistency-based objective function to minimize potential errors
Generates pairwise global (Clustal) and local (Lalign) alignments
Then combine, reweight, progressive alignment
![Page 9: Alignment Problem](https://reader036.vdocument.in/reader036/viewer/2022070501/56816953550346895de0fbae/html5/thumbnails/9.jpg)
Iterative Algorithms
Estimate draft progressive alignment (uncorrected distances)
Improved progressive (re-estimate guide tree using Kimura 2-parameter)
Refinement: divide into 2 subtrees, estimate two profiles, then re-align the 2 profiles
Continue refinement until convergence
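The difference between uncorrected and Kimura 2-parameter distances can be sketched as follows. This uses the standard K2P formula, d = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)), where P and Q are the proportions of sites showing transitions and transversions; the function names are hypothetical, and the input is assumed to be two already-aligned DNA rows of equal length.

```python
import math

# Transitions are purine<->purine or pyrimidine<->pyrimidine changes.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def uncorrected_distance(seq_a, seq_b):
    """Proportion of compared sites that differ (the p-distance)."""
    pairs = [(x, y) for x, y in zip(seq_a, seq_b) if x in "ACGT" and y in "ACGT"]
    return sum(1 for x, y in pairs if x != y) / len(pairs)


def k2p_distance(seq_a, seq_b):
    """Kimura 2-parameter distance between two aligned DNA sequences."""
    # Columns containing gaps or ambiguous symbols are skipped.
    pairs = [(x, y) for x, y in zip(seq_a, seq_b) if x in "ACGT" and y in "ACGT"]
    n = len(pairs)
    p = sum(1 for pair in pairs if pair in TRANSITIONS) / n
    q = sum(1 for x, y in pairs if x != y and (x, y) not in TRANSITIONS) / n
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
```

The corrected distance is always at least as large as the uncorrected one, because it accounts for multiple substitutions at the same site.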
![Page 10: Alignment Problem](https://reader036.vdocument.in/reader036/viewer/2022070501/56816953550346895de0fbae/html5/thumbnails/10.jpg)
Software
Clustal
T-Coffee
MUSCLE (limited models)
MAFFT (wide variety of models)
![Page 11: Alignment Problem](https://reader036.vdocument.in/reader036/viewer/2022070501/56816953550346895de0fbae/html5/thumbnails/11.jpg)
Comparisons
Speed: MUSCLE > MAFFT > CLUSTALW > T-COFFEE
Accuracy: MAFFT > MUSCLE > T-COFFEE > CLUSTALW
Lots more work to do here!
![Page 12: Alignment Problem](https://reader036.vdocument.in/reader036/viewer/2022070501/56816953550346895de0fbae/html5/thumbnails/12.jpg)
Why Genome Sequencing?
![Page 13: Alignment Problem](https://reader036.vdocument.in/reader036/viewer/2022070501/56816953550346895de0fbae/html5/thumbnails/13.jpg)
Modern Sequencing Methods
Sanger et al. (1977) introduced a sequencing method amenable to automation.
Whole-genome sequencing: Clone-By-Clone vs. Shotgun Assembly
Drosophila melanogaster sequenced (Myers et al. 2000)
Homo sapiens sequenced (Venter et al. 2001)
![Page 14: Alignment Problem](https://reader036.vdocument.in/reader036/viewer/2022070501/56816953550346895de0fbae/html5/thumbnails/14.jpg)
Main idea: Obtain fragments of all possible lengths, ending in A, C, T, G.
Using gel electrophoresis, we can separate fragments of differing lengths, and then assemble them.
Sanger et al. (1977) introduced chain-termination sequencing.
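The main idea can be mimicked in a few lines. The sketch below is an idealization under stated assumptions: each of the four termination reactions yields exactly one fragment per occurrence of its base, the gel resolves every length perfectly, and the complementary-strand detail of real chemistry is ignored. The function names are hypothetical.

```python
def chain_termination_fragments(template):
    """Fragment lengths seen in each of the four termination reactions."""
    lanes = {base: [] for base in "ACGT"}
    for position, base in enumerate(template, start=1):
        # A fragment of this length ends in this base.
        lanes[base].append(position)
    return lanes


def read_gel(lanes):
    """Sort fragments by length, as the gel does, and read the bases in order."""
    by_length = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in by_length)
```

Reading the four lanes from the shortest fragment to the longest recovers the original sequence.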
![Page 15: Alignment Problem](https://reader036.vdocument.in/reader036/viewer/2022070501/56816953550346895de0fbae/html5/thumbnails/15.jpg)
Automated Sequencing
Perkin-Elmer 3700: Can sequence ~500 bp with 98.5% accuracy
![Page 16: Alignment Problem](https://reader036.vdocument.in/reader036/viewer/2022070501/56816953550346895de0fbae/html5/thumbnails/16.jpg)
Reads and Contigs
Sequencing machines are limited to reads of about 500-750 bp, so we must break up DNA into short and long fragments, with reads on either end.
Reads are then assembled into contigs, then scaffolds.
![Page 17: Alignment Problem](https://reader036.vdocument.in/reader036/viewer/2022070501/56816953550346895de0fbae/html5/thumbnails/17.jpg)
Clone-by-Clone vs. Shotgun
Traditionally (clone-by-clone), long fragments are mapped and a minimum tiling path through them is found. Shotgun assembly is then used to sequence each long fragment.
Whole-genome shotgun assembly is cheaper, but requires more computational resources.
Drosophila was successfully sequenced using shotgun assembly.
![Page 18: Alignment Problem](https://reader036.vdocument.in/reader036/viewer/2022070501/56816953550346895de0fbae/html5/thumbnails/18.jpg)
In a Perfect World
![Page 19: Alignment Problem](https://reader036.vdocument.in/reader036/viewer/2022070501/56816953550346895de0fbae/html5/thumbnails/19.jpg)
Difficulties?
Good coverage does not guarantee that we can “see” repeats.
Read coverage is generally not “truly” random, due to complications in fragmentation and cloning.
Any automated approach requires extensive post-processing.
Phrap: www.phrap.org
![Page 20: Alignment Problem](https://reader036.vdocument.in/reader036/viewer/2022070501/56816953550346895de0fbae/html5/thumbnails/20.jpg)
The Fruit Fly
Drosophila melanogaster was sequenced in 2000 using whole genome shotgun assembly.
Genome size is ~120 Mbp for the euchromatic (gene-rich) portion, with roughly 13,600 genes.
The genome is still being refined.
![Page 21: Alignment Problem](https://reader036.vdocument.in/reader036/viewer/2022070501/56816953550346895de0fbae/html5/thumbnails/21.jpg)
NIH used a Clone-By-Clone strategy; Celera used shotgun assembly.
Celera used 300 sequencing machines in parallel to obtain 175,000 reads per day.
Efforts were combined, resulting in 8x coverage of the human genome; consensus sequence is 2.91 billion base pairs.
![Page 22: Alignment Problem](https://reader036.vdocument.in/reader036/viewer/2022070501/56816953550346895de0fbae/html5/thumbnails/22.jpg)
Abstraction
The basic question is: given a set of fragments from a long string, can we reconstruct the string?
What is the shortest common superstring of the given fragments?
![Page 23: Alignment Problem](https://reader036.vdocument.in/reader036/viewer/2022070501/56816953550346895de0fbae/html5/thumbnails/23.jpg)
Overlap-Layout-Consensus
Construct a (directed) overlap graph, where nodes represent reads and edges represent overlaps. Paths in this graph are contigs.
Problem: Find the consensus sequence by finding a path that visits every node of the layout graph exactly once (a Hamiltonian path).
Note: This is an idealization, since we must handle errors!
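A minimal sketch of the overlap step, assuming error-free reads and exact suffix-prefix matches (the idealization noted above). The `min_overlap` threshold and the function names are hypothetical choices for the example; real assemblers use approximate matching and much faster indexing than this quadratic loop.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that is a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def overlap_graph(reads, min_overlap=3):
    """Directed overlap graph as a dict: (i, j) -> overlap length."""
    edges = {}
    for i, left in enumerate(reads):
        for j, right in enumerate(reads):
            if i != j:
                size = overlap_length(left, right, min_overlap)
                if size:
                    edges[(i, j)] = size
    return edges
```

With the reads `ATGGCG`, `GCGTGC` and `TGCAAT`, the graph has the two edges 0 -> 1 and 1 -> 2, and the path through them lays the reads out as a single contig.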
![Page 24: Alignment Problem](https://reader036.vdocument.in/reader036/viewer/2022070501/56816953550346895de0fbae/html5/thumbnails/24.jpg)
Approximation Algorithms
The shortest common superstring problem is NP-hard (its decision version is NP-complete).
Greedily choosing edges is a 4-approximation, conjectured to be a 2-approximation.
Another idea: TSP has a 2-approximation if the edge weights are metric (Waterman et al. 1976 gives such metrics).
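The greedy strategy mentioned above can be sketched as: repeatedly merge the two fragments with the largest suffix-prefix overlap until one string remains. This is a toy version under the assumption of error-free fragments; the function names are hypothetical, and ties are broken by iteration order, so other equally short answers may exist.

```python
def overlap(left, right):
    """Length of the longest suffix of `left` that is a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_superstring(fragments):
    """Greedy approximation to the shortest common superstring."""
    unique = set(fragments)
    # Fragments contained in another fragment add nothing to the superstring.
    frags = sorted(f for f in unique if not any(f != g and f in g for g in unique))
    while len(frags) > 1:
        best_size, best_i, best_j = -1, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    size = overlap(a, b)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        # Merge the best pair, writing the shared overlap only once.
        merged = frags[best_i] + frags[best_j][best_size:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags[0] if frags else ""
```

The result always contains every input fragment; what the approximation bounds quoted above describe is how much longer than optimal it can be.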
![Page 25: Alignment Problem](https://reader036.vdocument.in/reader036/viewer/2022070501/56816953550346895de0fbae/html5/thumbnails/25.jpg)
Handling Repeats
Based on the overall coverage, we can estimate how deep the read coverage of any given region should be.
Repeats will “seem” to have unusually good coverage.
Celera’s algorithms are proprietary, but there is no explicit way to handle repeats in the overlap-layout-consensus paradigm.
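One simple way to act on the coverage observation is to flag contigs whose read depth is far above the expected coverage. The sketch below is only an illustration: the input layout, the function name and the `factor=1.8` threshold are all assumptions made for the example, and it is not a description of Celera's method.

```python
def flag_possible_repeats(contigs, expected_coverage, factor=1.8):
    """Return names of contigs whose read depth looks suspiciously high.

    contigs maps a contig name to (contig_length_bp, total_read_bases).
    """
    flagged = []
    for name, (length, read_bases) in contigs.items():
        depth = read_bases / length
        # Collapsed repeats pile reads from several copies onto one contig.
        if depth > factor * expected_coverage:
            flagged.append(name)
    return flagged
```

A contig built from two collapsed copies of a repeat would show roughly twice the expected depth and would be flagged by this rule.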
![Page 26: Alignment Problem](https://reader036.vdocument.in/reader036/viewer/2022070501/56816953550346895de0fbae/html5/thumbnails/26.jpg)
The Big Picture
![Page 27: Alignment Problem](https://reader036.vdocument.in/reader036/viewer/2022070501/56816953550346895de0fbae/html5/thumbnails/27.jpg)
Hybridization
Suppose we had a way to probe which fragments of length k are present in our sequence, using a hybridization assay.
Commercial products: Affymetrix GeneChip, Agilent, Amersham, etc.
![Page 28: Alignment Problem](https://reader036.vdocument.in/reader036/viewer/2022070501/56816953550346895de0fbae/html5/thumbnails/28.jpg)
Sequencing-By-Hybridization
Then instead of reads, we have regularly sized fragments, k-mers.
Construct a multigraph G with (k-1)-mers as nodes, with edges representing k-mers. G is a de Bruijn graph.
Idea: An Eulerian path in G corresponds to the assembled sequence, and we don’t lose repeats (Pevzner 1989).
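The graph construction is short enough to sketch directly. Each k-mer contributes one directed edge from its first k-1 letters to its last k-1 letters; a k-mer that occurs twice contributes two parallel edges, which is why G is a multigraph. The function name and the adjacency-list representation are choices made for this example.

```python
from collections import defaultdict


def de_bruijn_graph(kmers):
    """Adjacency lists of the de Bruijn multigraph built from k-mers."""
    graph = defaultdict(list)
    for kmer in kmers:
        prefix, suffix = kmer[:-1], kmer[1:]
        graph[prefix].append(suffix)  # one edge per k-mer occurrence
    return dict(graph)
```

For the 3-mers of `ATGGCGT` this gives a simple chain AT -> TG -> GG -> GC -> CG -> GT.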
![Page 29: Alignment Problem](https://reader036.vdocument.in/reader036/viewer/2022070501/56816953550346895de0fbae/html5/thumbnails/29.jpg)
Bridges of Königsberg
Theorem (Euler 1736): A graph has a path visiting every edge exactly once if and only if it is connected and has 2 or fewer vertices of odd degree.
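The theorem translates into a direct check: count the vertices of odd degree and confirm the graph is connected. The sketch below treats the graph as an undirected multigraph given as a list of edges; the function name is hypothetical, and the Königsberg layout in the usage note (one island `A` joined to banks `B` and `C` by two bridges each, plus `D`) is the usual textbook encoding.

```python
from collections import defaultdict


def has_euler_path(edges):
    """True if the undirected multigraph has a path using every edge once."""
    degree = defaultdict(int)
    neighbours = defaultdict(set)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        neighbours[u].add(v)
        neighbours[v].add(u)
    if not degree:
        return True  # no edges: the empty path works
    # Depth-first search to check that all vertices with edges are connected.
    start = next(iter(degree))
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in neighbours[node] - seen:
            seen.add(nxt)
            stack.append(nxt)
    odd = sum(1 for d in degree.values() if d % 2)
    return len(seen) == len(degree) and odd <= 2
```

With the seven bridges `[("A","B"), ("A","B"), ("A","C"), ("A","C"), ("A","D"), ("B","D"), ("C","D")]` all four land masses have odd degree, so no such walk exists.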
![Page 30: Alignment Problem](https://reader036.vdocument.in/reader036/viewer/2022070501/56816953550346895de0fbae/html5/thumbnails/30.jpg)
Pros and Cons
An Eulerian path in a graph can be found in linear time, if one exists.
Errors in the hybridization experiments may prevent us from finding a solution.
Can we just use reads as “virtual” hybridization data?
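The linear-time claim refers to algorithms such as Hierholzer's, sketched here on the directed graph whose nodes are (k-1)-mers and whose edges are k-mers. This is a minimal sketch assuming an error-free k-mer spectrum; the function names are hypothetical. When the sequence contains repeats, several Eulerian paths may exist, and the code returns just one of them.

```python
from collections import defaultdict


def eulerian_path(kmers):
    """Hierholzer's algorithm on the de Bruijn graph of the given k-mers."""
    if not kmers:
        return []
    graph = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree
    for kmer in kmers:
        left, right = kmer[:-1], kmer[1:]
        graph[left].append(right)
        balance[left] += 1
        balance[right] -= 1
    # Start at the node with one surplus outgoing edge, if there is one.
    start = next((n for n, b in balance.items() if b == 1), next(iter(graph)))
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())         # dead end: emit the node
    path.reverse()
    if len(path) != len(kmers) + 1:
        raise ValueError("no Eulerian path uses every k-mer")
    return path


def spell(path):
    """Turn a path of (k-1)-mers back into a sequence."""
    return path[0] + "".join(node[-1] for node in path[1:])
```

Every edge is pushed and popped once, which is where the linear running time comes from.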
![Page 31: Alignment Problem](https://reader036.vdocument.in/reader036/viewer/2022070501/56816953550346895de0fbae/html5/thumbnails/31.jpg)
Graph Preprocessing
A single read error means up to k missing/erroneous edges. But we cannot correct this until we are done assembling!
Greedily mutate reads to minimize the size of the set of k-mers.
We also need to deal with repeats, which requires contracting certain paths to single edges…
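The "up to k" figure is easy to check on a toy read: one substituted base sits inside at most k consecutive k-mers, so it can introduce at most k k-mers that the true sequence does not contain. The helper names below are hypothetical, and the example reads are made up.

```python
def kmer_set(read, k):
    """All distinct k-mers of a read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def spurious_kmers(true_read, observed_read, k):
    """k-mers present in the observed read but absent from the true read."""
    return kmer_set(observed_read, k) - kmer_set(true_read, k)
```

For instance, with k = 4, changing one base in the middle of `ACGTTGCAAGCTTAGC` to give `ACGTTGCATGCTTAGC` produces four spurious 4-mers, each of which becomes an erroneous edge in the graph.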
![Page 32: Alignment Problem](https://reader036.vdocument.in/reader036/viewer/2022070501/56816953550346895de0fbae/html5/thumbnails/32.jpg)
![Page 33: Alignment Problem](https://reader036.vdocument.in/reader036/viewer/2022070501/56816953550346895de0fbae/html5/thumbnails/33.jpg)
Sizes of genomes and numbers of genes
![Page 34: Alignment Problem](https://reader036.vdocument.in/reader036/viewer/2022070501/56816953550346895de0fbae/html5/thumbnails/34.jpg)
Sequencing parameters
Difficulty and cost of large-scale sequencing projects depend on the following parameters:
Accuracy: how many errors are tolerated
Coverage: how many times the same region is sequenced
The two parameters are related:
More coverage usually means higher accuracy
Accuracy is also dependent on the finishing effort
![Page 35: Alignment Problem](https://reader036.vdocument.in/reader036/viewer/2022070501/56816953550346895de0fbae/html5/thumbnails/35.jpg)
Sequence accuracy
Highly accurate sequences are needed for the following:
Diagnostics, e.g., forensics, identifying disease alleles in a patient
Protein coding prediction: one insertion or deletion changes the reading frame
Lower accuracy is sufficient for homology searches: differences in sequence are tolerated by search programs
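The reading-frame point can be seen by splitting a sequence into codons before and after a single inserted base. The sequence and the helper name are made up for this illustration.

```python
def codons(seq):
    """Split a sequence into complete codons, starting at the first base."""
    usable = len(seq) - len(seq) % 3  # drop a trailing partial codon
    return [seq[i:i + 3] for i in range(0, usable, 3)]
```

`codons("ATGGCCAAGTAA")` gives `ATG GCC AAG TAA`, while inserting a single `T` after the first codon (`ATGTGCCAAGTAA`) gives `ATG TGC CAA GTA`: every codon downstream of the error changes, and the stop codon is lost.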
![Page 36: Alignment Problem](https://reader036.vdocument.in/reader036/viewer/2022070501/56816953550346895de0fbae/html5/thumbnails/36.jpg)
Sequence accuracy and sequencing cost
Level of accuracy determines cost of project
Increasing accuracy from one error in 100 to one error in 10,000 increases costs three- to fivefold
Need to determine appropriate level of accuracy for each project
If a reference sequence already exists, then a lower level of accuracy should suffice
Can find genes in genome, but not their position
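Error rates like these are commonly expressed on the Phred quality scale, Q = -10 log10(p), where p is the per-base error probability. The slides do not use this scale, so treat the sketch below as a side note relating the two accuracy levels mentioned above; the function name is hypothetical.

```python
import math


def phred_quality(error_probability):
    """Phred-scaled quality for a per-base error probability."""
    return -10 * math.log10(error_probability)
```

On this scale, one error in 100 corresponds to Q20 and one error in 10,000 to Q40.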
![Page 37: Alignment Problem](https://reader036.vdocument.in/reader036/viewer/2022070501/56816953550346895de0fbae/html5/thumbnails/37.jpg)
Sequencing coverage
Coverage is the number of times the same region is sequenced
Ideally, one wants an equal number of sequences in each direction
To obtain accuracy of one error in 10,000 bases, one needs the following:
10x coverage
Stringent finishing
Complete sequence
Base-perfect sequencing
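Average coverage is simple arithmetic: number of reads times read length, divided by genome length. The sketch below also includes the Lander-Waterman estimate that, under idealized uniformly random read placement, a fraction of about e^(-c) of the genome is left uncovered at coverage c. The function names and the read counts in the usage note are illustrative assumptions.

```python
import math


def coverage(num_reads, read_length, genome_length):
    """Average number of times each base is sequenced."""
    return num_reads * read_length / genome_length


def expected_uncovered_fraction(cov):
    """Lander-Waterman estimate of the fraction of bases no read covers."""
    return math.exp(-cov)
```

For example, 2.4 million reads of 500 bp over a 120 Mbp genome give 10x coverage, at which the idealized model leaves well under 0.01% of bases uncovered. Real coverage is less uniform, so actual gaps are larger.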