Post on 22-Dec-2015
Multiple Sequence Alignments
Algorithms
MLAGAN: progressive alignment of DNA
Given N sequences, phylogenetic tree
Align pairwise, in order of the tree (LAGAN)
Human
Baboon
Mouse
Rat
MLAGAN: main steps
Given a collection of sequences, and a phylogenetic tree
1. Find local alignments for every pair of sequences x, y
2. Find anchors between every pair of sequences, similar to LAGAN anchoring
3. Progressive alignment
• Multi-anchoring based on reconciling the pairwise anchors
• LAGAN-style limited-area DP
4. Optional refinement steps
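The progressive step visits the tree bottom-up. A minimal sketch of that ordering, assuming the tree is given as nested pairs and assuming the topology ((Human, Baboon), (Mouse, Rat)) for the four species on the earlier slide. `merge_order` is a hypothetical helper, not part of MLAGAN, and it only lists which groups get aligned in which order:

```python
def merge_order(tree):
    """List the alignment steps implied by a binary guide tree.

    `tree` is either a leaf name (str) or a pair (left, right).
    Each step is a pair of leaf groups that would be aligned together.
    """
    steps = []

    def visit(node):
        if isinstance(node, str):
            return (node,)
        left = visit(node[0])
        right = visit(node[1])
        steps.append((left, right))  # children are merged before their parent
        return left + right

    visit(tree)
    return steps


# Assumed topology for illustration only.
tree = (("Human", "Baboon"), ("Mouse", "Rat"))
steps = merge_order(tree)
for left, right in steps:
    print("align", "/".join(left), "with", "/".join(right))
```

The last step aligns the Human/Baboon alignment with the Mouse/Rat alignment, which is where multi-anchoring would be needed.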
MLAGAN: multi-anchoring
XZ
YZ
X/Y
Z
To anchor the (X/Y) and (Z) alignments:
Whole-genome alignment Human/Mouse/Rat
Insertion/Deletion Rate Analysis
Heuristics to improve multiple alignments
• Iterative refinement schemes
• A*-based search
• Consistency
• Simulated Annealing
• …
Iterative Refinement
One problem of progressive alignment:
• Initial alignments are “frozen” even when new evidence arrives
Example:
x: GAAGTT
y: GAC-TT
z: GAACTG
w: GTACTG
Frozen!
Now it is clear that the correct alignment is y = GA-CTT
Iterative Refinement
Algorithm (Barton-Sternberg):
1. Align most similar xi, xj
2. Align xk most similar to (xi xj)
3. Repeat 2 until (x1…xN) are aligned
4. For j = 1 to N, remove xj and realign it to x1…x(j-1), x(j+1)…xN
5. Repeat 4 until convergence
Note: Guaranteed to converge
Iterative Refinement
For each sequence y:
1. Remove y
2. Realign y (while the rest stay fixed)
[Figure: sequences x, y, z; the x,z projection is fixed and y is allowed to vary]
Iterative Refinement
Example: align (x,y), (z,w), (xy, zw):
x: GAAGTTA
y: GAC-TTA
z: GAACTGA
w: GTACTGA
After realigning y:
x: GAAGTTA
y: G-ACTTA   + 3 matches
z: GAACTGA
w: GTACTGA
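A minimal sketch of the "remove y, realign y" step on this example. It makes two simplifying assumptions that are not from the slides: the score is just the number of identical residues between y and the other rows in each column (no gap penalty), and y may only receive gaps, so the fixed columns of x, z, w are never changed. The function names are hypothetical.

```python
def column_matches(residue, column):
    """Number of rows in a fixed column that carry the same residue."""
    return sum(1 for c in column if c == residue)


def realign(seq, fixed_rows):
    """Place ungapped `seq` into the columns of a fixed alignment.

    Maximizes the total number of matches; only `seq` receives gaps,
    so len(seq) must not exceed the alignment length.
    """
    columns = list(zip(*fixed_rows))
    n, m = len(seq), len(columns)
    neg = float("-inf")
    best = [[neg] * (m + 1) for _ in range(n + 1)]
    for j in range(m + 1):
        best[0][j] = 0  # leading gaps in seq cost nothing
    for i in range(1, n + 1):
        for j in range(i, m + 1):
            use = best[i - 1][j - 1] + column_matches(seq[i - 1], columns[j - 1])
            skip = best[i][j - 1]  # column j-1 gets a gap in seq
            best[i][j] = max(use, skip)
    # Trace back from the last cell to recover the gapped row.
    row, i, j = [], n, m
    while j > 0:
        if i > 0 and best[i][j] == best[i - 1][j - 1] + column_matches(seq[i - 1], columns[j - 1]):
            row.append(seq[i - 1])
            i -= 1
        else:
            row.append("-")
        j -= 1
    return "".join(reversed(row)), best[n][m]


def matches_with(row, others):
    """Matches between a gapped row and the other rows, column by column."""
    return sum(column_matches(c, col) for c, col in zip(row, zip(*others)) if c != "-")


others = ["GAAGTTA", "GAACTGA", "GTACTGA"]  # x, z, w stay fixed
before = matches_with("GAC-TTA", others)
new_y, after = realign("GACTTA", others)
print(new_y, after - before)
```

Under this scoring the frozen row has 12 matches and the realigned row has 15, which is the "+ 3 matches" noted on the slide.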
Iterative Refinement
Example not handled well:
x: GAAGTTA
y1: GAC-TTA
y2: GAC-TTA
y3: GAC-TTA
z: GAACTGA
w: GTACTGA
Realigning any single yi changes nothing
Restricted MDP
Run MDP, restricted to radius R from m
[Figure: DP matrix with axes x, y, z]
Running Time: O(2^N R^(N-1) L)
Tree Refinement
Run 3D-DP, restricted to radius R from m, for each tree node
[Figure: DP matrix with axes x, y, z]
Running Time: ~7 R^2 L N
R: Radius
L: Alignment Length
N: Number of Sequences
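To get a feel for the two running times, here is a small arithmetic sketch. It reads the slide formulas as 2^N · R^(N-1) · L for restricted MDP and 7 · R^2 · L · N for tree refinement; the example values of R and L are arbitrary, and the function names are hypothetical.

```python
def restricted_mdp_cost(n_seqs, radius, length):
    """2^N * R^(N-1) * L, the restricted MDP estimate."""
    return 2 ** n_seqs * radius ** (n_seqs - 1) * length


def tree_refinement_cost(n_seqs, radius, length):
    """7 * R^2 * L * N, the restricted 3D-DP run once per tree node."""
    return 7 * radius ** 2 * length * n_seqs


for n in (3, 5, 8):
    print(n, restricted_mdp_cost(n, 10, 1000), tree_refinement_cost(n, 10, 1000))
```

The first estimate grows exponentially in N while the second grows linearly, so tree refinement is the one that remains practical as sequences are added.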
A* for Multiple Alignments
Review of the A* algorithm
[Figure: path from START through node v to GOAL, with g(v) and h(v) marked]
• g(v) is the cost so far
• h(v) is an estimate of the minimum cost from v to GOAL
• f(v) = g(v) + h(v) estimates the minimum cost of a path passing through v; if h(v) never overestimates, every such path costs ≥ f(v)
1. Expand v with the smallest f(v)
2. Never expand v if f(v) ≥ shortest path to the goal found so far
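The two rules above can be sketched on a toy graph. This is a generic A* with a priority queue, not an alignment-specific implementation; the graph, the edge costs, and the heuristic values are made up for illustration, and the heuristic is chosen so that it never overestimates the remaining cost.

```python
import heapq


def a_star(graph, h, start, goal):
    """Return the cost of the cheapest path from start to goal, or None.

    graph: dict node -> list of (neighbour, edge_cost)
    h:     dict node -> estimate of remaining cost (must not overestimate)
    """
    open_heap = [(h[start], 0, start)]  # entries are (f, g, node)
    best_g = {start: 0}
    while open_heap:
        f, g, v = heapq.heappop(open_heap)  # rule 1: smallest f(v) first
        if v == goal:
            return g
        if g > best_g.get(v, float("inf")):
            continue  # stale entry, a cheaper route to v was found later
        for w, cost in graph.get(v, []):
            new_g = g + cost
            if new_g < best_g.get(w, float("inf")):
                best_g[w] = new_g
                heapq.heappush(open_heap, (new_g + h[w], new_g, w))
    return None


# Hypothetical example graph and heuristic.
graph = {
    "START": [("a", 1), ("b", 4)],
    "a": [("b", 2), ("GOAL", 6)],
    "b": [("GOAL", 2)],
}
h = {"START": 4, "a": 3, "b": 2, "GOAL": 0}
print(a_star(graph, h, "START", "GOAL"))
```

Because the search stops as soon as GOAL is popped, any node whose f(v) is at least the cost of that path is never expanded, which is rule 2.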
A* for Multiple Alignments
• Nodes: cells in the DP matrix
• g(v): alignment cost so far
• h(v): sum-of-pairs of individual pairwise alignments
• Initial minimum alignment cost estimate: sum-of-pairs of global pairwise alignments
Consistency – T-Coffee
[Figure: three sequences x, y, z with residues xi, yj, yj’ and zk]
T – Coffee Layout
[Figure: sequences A, B, C passed pairwise to LALIGN and CLUSTALW]
Generating Primary Library
• ClustalW Primary Library: global pairwise alignments
• Lalign Primary Library: 10 top-scoring non-intersecting local pairwise alignments
The library has information for each of the N(N-1)/2 sequence pairs.
Primary Library
Seq A  GARFIELD THE LAST FAT CAT
Seq B  GARFIELD THE FAST CAT ---
Prim. Weight = 88
Seq A  GARFIELD THE LAST FA-T CAT
Seq C  GARFIELD THE VERY FAST ---
Prim. Weight = 77
Seq A  GARFIELD THE LAST FAT CAT
Seq D  -------- THE ---- FAT CAT
Prim. Weight = 100
Seq B  GARFIELD THE ---- FAST CAT
Seq C  GARFIELD THE VERY FAST CAT
Prim. Weight = 100
Seq B  GARFIELD THE FAST CAT
Seq D  ------- THE FA-T CAT
Prim. Weight = 100
Seq C  GARFIELD THE VERY FAST CAT
Seq D  -------- THE ---- FA-T CAT
Prim. Weight = 100
Combining the libraries
Seq A GARFIELD THE LAST FAT CAT
Seq B GARFIELD THE FAST CAT ---
Primary Weight(ClustalW) = 88
Primary Weight(Lalign) = 88
W(A(G),B(G)) = 88 + 88 = 176
If a pair is duplicated across the two libraries, it is merged into a single entry whose weight is the sum of the two weights.
Pairs of residues that did not occur are not present (weight 0).
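The merge rule can be sketched with plain dictionaries. The key layout (a pair of (sequence name, residue index) tuples) and the function names are assumptions for illustration, not T-Coffee's actual data structures.

```python
def combine_libraries(*libraries):
    """Merge primary libraries; weights of duplicated pairs are summed."""
    combined = {}
    for library in libraries:
        for pair, weight in library.items():
            combined[pair] = combined.get(pair, 0) + weight
    return combined


def weight(library, pair):
    """Pairs that never occurred are simply absent, i.e. weight 0."""
    return library.get(pair, 0)


# A(G) paired with B(G): first residue of Seq A and of Seq B.
a_g_b_g = (("A", 0), ("B", 0))
a_a_b_a = (("A", 1), ("B", 1))

clustalw_library = {a_g_b_g: 88}
lalign_library = {a_g_b_g: 88, a_a_b_a: 88}

combined = combine_libraries(clustalw_library, lalign_library)
print(weight(combined, a_g_b_g))
```

The second pair appears in only one of the two example libraries, so it keeps its single weight of 88.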
Library Extension
• Complete extension requires examination of all triplets.
• Not all triplets bring information (e.g. A and B through D).
• Weight of a pair = weights gathered through examination of all triplets involving that pair.
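A sketch of the triplet rule, under an assumption the slide does not spell out: each third residue c that is paired with both a and b contributes min(W(a,c), W(c,b)), and these contributions are added to the direct weight. The example uses the G of GARFIELD in A, B, C with the primary weights 88, 77, 100 from the earlier slide; D has a gap there, so it contributes nothing. Names and key layout are hypothetical.

```python
def extend_library(primary):
    """Add triplet evidence to every residue pair.

    primary: dict frozenset({res1, res2}) -> weight, res = (seq_name, index).
    Returns a new dict; the input is left untouched.
    """
    partners = {}
    for pair, w in primary.items():
        a, b = tuple(pair)
        partners.setdefault(a, {})[b] = w
        partners.setdefault(b, {})[a] = w

    extended = dict(primary)
    residues = sorted(partners)
    for i, a in enumerate(residues):
        for b in residues[i + 1:]:
            if a[0] == b[0]:
                continue  # residues of the same sequence are never paired
            # Every residue c paired with both a and b links them.
            bonus = sum(
                min(w_ac, partners[b][c])
                for c, w_ac in partners[a].items()
                if c in partners[b]
            )
            if bonus:
                key = frozenset((a, b))
                extended[key] = extended.get(key, 0) + bonus
    return extended


A, B, C = ("A", 0), ("B", 0), ("C", 0)
primary = {
    frozenset((A, B)): 88,
    frozenset((A, C)): 77,
    frozenset((B, C)): 100,
}
extended = extend_library(primary)
print(extended[frozenset((A, B))])
```

Here A and B receive 88 directly plus min(77, 100) = 77 through C.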
Running Time
• Complexity of entire procedure:
O(N^2 L^2) + O(N^3 L) + O(N^3) + O(N L^2)
• O(N^2 L^2): pairwise library computation
• O(N^3 L): library extension
• O(N^3): computation of NJ tree
• O(N L^2): progressive alignment computation
Where:
L: average sequence length
N: number of sequences
T-Coffee compared with other methods
Method     Cat1(81)  Cat2(23)  Cat3(4)  Cat4(12)  Cat5(11)  Total(141)
Dialign      71.0      25.2     35.1      74.7      80.4       61.5
ClustalW     78.5      32.2     42.5      65.7      74.3       66.4
Prrp         78.6      32.5     50.2      51.1      82.7       66.4
T-Coffee     80.7      37.3     52.9      83.2      88.7       72.1
Gene Recognition
Credits for slides:
Marina Alexandersson
Lior Pachter
Serge Saxonov
Reading
• GENSCAN
• EasyGene
• SLAM
• Twinscan
Optional:
Chris Burge’s Thesis
Gene expression
DNA: CCTGAGCCAACTATTGATGAA
↓ transcription
RNA: CCUGAGCCAACUAUUGAUGAA
↓ translation
Protein: PEPTIDE
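The slide's example can be reproduced in a few lines. The codon table below is deliberately partial: it holds only the seven codons that occur in this sequence (taken from the standard genetic code), so it is a sketch for this example and not a general translator.

```python
# Partial standard genetic code: only the codons needed for the example.
CODON_TABLE = {
    "CCU": "P", "CCA": "P",  # proline
    "GAG": "E", "GAA": "E",  # glutamate
    "ACU": "T",              # threonine
    "AUU": "I",              # isoleucine
    "GAU": "D",              # aspartate
}


def transcribe(dna):
    """Coding-strand DNA to mRNA: replace T with U."""
    return dna.replace("T", "U")


def translate(rna):
    """Read the mRNA codon by codon, starting at position 0."""
    return "".join(CODON_TABLE[rna[i:i + 3]] for i in range(0, len(rna) - 2, 3))


dna = "CCTGAGCCAACTATTGATGAA"
rna = transcribe(dna)
protein = translate(rna)
print(rna, protein)
```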
Gene structure
exon1 intron1 exon2 intron2 exon3
transcription
splicing
translation
exon = coding
intron = non-coding
Finding genes
5’ Exon 1 Intron 1 Exon 2 Intron 2 Exon 3 3’
Start codon: ATG
Stop codon: TAG/TGA/TAA
Splice sites
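The start and stop signals alone already give a naive gene finder for the intron-free case. The sketch below scans one strand for ATG and reports the first in-frame stop codon; it ignores splice sites entirely, so it only illustrates the signals and is not how GENSCAN or the other tools listed work. The function name and test sequence are made up.

```python
STOP_CODONS = {"TAG", "TGA", "TAA"}


def find_orfs(dna):
    """Return (start, end) index pairs of ATG...stop open reading frames.

    `end` is exclusive and includes the stop codon. Forward strand only.
    """
    orfs = []
    for start in range(len(dna) - 2):
        if dna[start:start + 3] != "ATG":
            continue
        # Walk codon by codon in the frame set by the start codon.
        for pos in range(start + 3, len(dna) - 2, 3):
            if dna[pos:pos + 3] in STOP_CODONS:
                orfs.append((start, pos + 3))
                break
    return orfs


dna = "GGATGCCCTAGTT"
orfs = find_orfs(dna)
print([dna[s:e] for s, e in orfs])
```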
Approaches to gene finding
• Homology: BLAST, Procrustes
• Ab initio: Genscan, Genie, GeneID
• Hybrids: GenomeScan, GenieEST, Twinscan, SGP, ROSETTA, CEM, TBLASTX, SLAM
HMMs for single species gene finding: Generalized HMMs
HMMs for gene finding
GTCAGAGTAGCAAAGTAGACACTCCAGTAACGC
intergene exon intron exon intron exon intergene
GHMM for gene finding
[Figure: nucleotide sequence partitioned among states Exon1, Exon2, Exon3, each with its own duration]
Observed duration times
Better way to do it: negative binomial
• EasyGene: prokaryotic gene-finder (Larsen TS, Krogh A)
• Negative binomial with n = 3
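A sketch of what the negative binomial buys as a duration model. It assumes the "number of trials until the n-th success" form, P(d) = C(d-1, n-1) p^n (1-p)^(d-n) for d ≥ n, which is the length distribution of n geometric stages in series; n = 1 gives the plain geometric duration of a single self-looping state. The value p = 0.1 is arbitrary, and whether EasyGene parameterizes it exactly this way is not stated on the slide.

```python
from math import comb


def duration_pmf(d, n, p):
    """Probability that a region lasts exactly d steps (d >= n)."""
    if d < n:
        return 0.0
    return comb(d - 1, n - 1) * p ** n * (1 - p) ** (d - n)


p = 0.1
for d in (1, 3, 10, 20, 40):
    print(d, round(duration_pmf(d, 1, p), 4), round(duration_pmf(d, 3, p), 4))
```

With n = 1 the shortest duration is always the most likely one, while with n = 3 the distribution rises to a peak at longer durations, which is closer in shape to observed gene lengths.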
Biology of Splicing
(http://genes.mit.edu/chris/)
Consensus splice sites
(http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html)
Donor: 7.9 bits
Acceptor: 9.4 bits
(Stephens & Schneider, 1996)
Splice site detection
Donor site (5’ → 3’), base frequencies (%) by position:
Position  -8  …  -2  -1   0   1   2  …  17
A         26  …  60   9   0   1  54  …  21
C         26  …  15   5   0   1   2  …  27
G         25  …  12  78  99   0  41  …  27
T         23  …  13   8   1  98   3  …  25
Splice Site Models
• WMM: weight matrix model = PSSM (Staden 1984)
• WAM: weight array model = 1st order Markov (Zhang & Marr 1993)
• MDD: maximal dependence decomposition (Burge & Karlin 1997), a decision-tree-like algorithm that takes significant pairwise dependencies into account
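A minimal WMM (PSSM) scorer, using the five donor-site columns that the frequency table on the earlier slide gives in full (positions -2 to 2). The uniform 0.25 background and the pseudocount of 1 are assumptions added here so that zero entries do not produce log(0); they are not from the slides.

```python
from math import log2

# Donor-site base frequencies (%) for positions -2, -1, 0, 1, 2.
DONOR_PERCENT = [
    {"A": 60, "C": 15, "G": 12, "T": 13},
    {"A": 9, "C": 5, "G": 78, "T": 8},
    {"A": 0, "C": 0, "G": 99, "T": 1},
    {"A": 1, "C": 1, "G": 0, "T": 98},
    {"A": 54, "C": 2, "G": 41, "T": 3},
]


def wmm_score(site, table=DONOR_PERCENT, background=0.25, pseudo=1.0):
    """Log-odds score (bits) of `site`, treating positions as independent."""
    if len(site) != len(table):
        raise ValueError("site length must match the number of columns")
    score = 0.0
    for base, column in zip(site, table):
        total = sum(column.values()) + pseudo * len(column)
        score += log2((column[base] + pseudo) / total / background)
    return score


consensus = "".join(max(col, key=col.get) for col in DONOR_PERCENT)
print(consensus, round(wmm_score(consensus), 2), round(wmm_score("CCCCC"), 2))
```

A WAM would replace each column with probabilities conditioned on the previous base, and MDD would choose which positions to condition on; both are beyond this sketch.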
[Figure: example sequence with labeled fragments atg, tga, ggtgag, caggtg, cagatg, cagttg, caggccggtgag]