Multiple Sequence Alignments Algorithms

Post on 22-Dec-2015


Page 1: Multiple Sequence Alignments Algorithms. MLAGAN: progressive alignment of DNA Given N sequences, phylogenetic tree Align pairwise, in order of the tree

Multiple Sequence Alignments Algorithms

Page 2

MLAGAN: progressive alignment of DNA

Given N sequences, phylogenetic tree

Align pairwise, in order of the tree (LAGAN)

[Figure: phylogenetic tree over Human, Baboon, Mouse, Rat]
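"In order of the tree" can be read as a post-order walk of the guide tree: the two children of a node are aligned first, then their alignments are aligned to each other. The sketch below only lists that merge order. The tree shape (Human with Baboon, Mouse with Rat) is an assumption, since the figure did not survive, and `progressive_order` is a hypothetical helper name.

```python
def progressive_order(tree):
    """List the pairwise merges of a binary guide tree in post-order.

    A tree is either a leaf name (str) or a pair (left, right).
    """
    merges = []

    def visit(node):
        if isinstance(node, str):
            return (node,)
        left, right = node
        left_group, right_group = visit(left), visit(right)
        merges.append((left_group, right_group))  # align these two groups
        return left_group + right_group

    visit(tree)
    return merges


# Assumed topology for illustration only.
tree = (("Human", "Baboon"), ("Mouse", "Rat"))
order = progressive_order(tree)
for left_group, right_group in order:
    print("align", "/".join(left_group), "with", "/".join(right_group))
```

Each printed line corresponds to one pairwise alignment step (done with LAGAN in MLAGAN); the alignment itself is not implemented here.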

Page 3

MLAGAN: main steps

Given a collection of sequences, and a phylogenetic tree

1. Find local alignments for every pair of sequences x, y

2. Find anchors between every pair of sequences, similar to LAGAN anchoring

3. Progressive alignment
   • Multi-anchoring based on reconciling the pairwise anchors
   • LAGAN-style limited-area DP

4. Optional refinement steps

Page 4

MLAGAN: multi-anchoring

To anchor the (X/Y) and (Z) alignments:

[Figure: the pairwise anchor sets XZ and YZ are combined to anchor the X/Y alignment against Z]

Page 5

Whole-genome alignment Human/Mouse/Rat

Page 6

Insertion/Deletion Rate Analysis

Page 7

Heuristics to improve multiple alignments

• Iterative refinement schemes

• A*-based search

• Consistency

• Simulated Annealing

• …

Page 8

Iterative Refinement

One problem of progressive alignment:

• Initial alignments are “frozen” even when new evidence comes

Example:

x: GAAGTT
y: GAC-TT   (frozen!)

z: GAACTG
w: GTACTG

Once z and w are added, it is clear that the correct alignment of y is GA-CTT.

Page 9

Iterative Refinement

Algorithm (Barton-Sternberg):

1. Align the most similar pair xi, xj

2. Align the xk that is most similar to (xi xj)

3. Repeat 2 until (x1…xN) are aligned

4. For j = 1 to N, remove xj and realign it to x1…xj-1 xj+1…xN

5. Repeat 4 until convergence

Note: guaranteed to converge, though possibly only to a local optimum

Page 10

Iterative Refinement

For each sequence y:

1. Remove y
2. Realign y (while the rest stay fixed)

[Figure: alignment path over axes x, y, z; the x,z projection is fixed and y is allowed to vary]

Page 11

Iterative Refinement

Example: align (x,y), (z,w), (xy, zw):

x: GAAGTTA
y: GAC-TTA
z: GAACTGA
w: GTACTGA

After realigning y:

x: GAAGTTA
y: G-ACTTA   (+ 3 matches)
z: GAACTGA
w: GTACTGA
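The remove-and-realign step can be sketched as a sequence-to-alignment DP followed by a loop that accepts a realignment only when the total score strictly improves. This is a minimal illustration, not the Barton-Sternberg implementation: the sum-of-pairs scores (match +1, mismatch -1, gap -1, gap against gap 0) and all function names are assumptions.

```python
def column_score(c, column, match=1, mismatch=-1, gap=-1):
    """Sum-of-pairs score of character c against every character of a column."""
    total = 0
    for d in column:
        if c == "-" and d == "-":
            continue                      # gap against gap is free
        if c == "-" or d == "-":
            total += gap
        elif c == d:
            total += match
        else:
            total += mismatch
    return total


def sp_score(rows):
    """Sum-of-pairs score of a whole alignment (equal-length strings)."""
    return sum(column_score(col[i], col[i + 1:])
               for col in zip(*rows) for i in range(len(col)))


def realign(seq, fixed_rows):
    """Optimally align ungapped seq to a fixed alignment.

    Returns (fixed rows, possibly with new all-gap columns, gapped seq).
    """
    cols = list(zip(*fixed_rows))
    n, m = len(seq), len(cols)
    empty = "-" * len(fixed_rows)
    score = [[float("-inf")] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    score[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            options = []
            if i and j:   # seq[i-1] placed in column j-1
                options.append((score[i - 1][j - 1]
                                + column_score(seq[i - 1], cols[j - 1]), "diag"))
            if j:         # gap in seq under column j-1
                options.append((score[i][j - 1]
                                + column_score("-", cols[j - 1]), "left"))
            if i:         # seq[i-1] opens a new column in the fixed rows
                options.append((score[i - 1][j]
                                + column_score(seq[i - 1], empty), "up"))
            if options:
                score[i][j], move[i][j] = max(options)
    out_seq, out_cols, i, j = [], [], n, m
    while i or j:         # trace back from the bottom-right corner
        step = move[i][j]
        out_seq.append("-" if step == "left" else seq[i - 1])
        out_cols.append(tuple(empty) if step == "up" else cols[j - 1])
        i -= step != "left"
        j -= step != "up"
    out_seq.reverse()
    out_cols.reverse()
    return ["".join(r) for r in zip(*out_cols)], "".join(out_seq)


def drop_empty_columns(rows):
    kept = [col for col in zip(*rows) if set(col) != {"-"}]
    return ["".join(r) for r in zip(*kept)]


def refine(rows, max_rounds=10):
    """Remove each row in turn, realign it, keep the change only if SP improves."""
    rows = list(rows)
    for _ in range(max_rounds):
        improved = False
        for j in range(len(rows)):
            rest = drop_empty_columns(rows[:j] + rows[j + 1:])
            new_rest, new_row = realign(rows[j].replace("-", ""), rest)
            candidate = new_rest[:j] + [new_row] + new_rest[j:]
            if sp_score(candidate) > sp_score(rows):
                rows, improved = candidate, True
        if not improved:
            break
    return rows


start = ["GAAGTTA", "GAC-TTA", "GAACTGA", "GTACTGA"]   # x, y, z, w
print(refine(start))
```

Because a change is accepted only on a strict score increase and the score is bounded, the loop must stop; under these assumed scores it moves the gap in y so that its C lines up with the C in z and w.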

Page 12

Iterative Refinement

Example not handled well:

x:  GAAGTTA
y1: GAC-TTA
y2: GAC-TTA
y3: GAC-TTA

z:  GAACTGA
w:  GTACTGA

Realigning any single yi changes nothing

Page 13

Restricted MDP

Run MDP, restricted to radius R from m

[Figure: DP volume over sequences x, y, z, with a band of radius R around the current alignment m]

Running Time: $O(2^N R^{N-1} L)$
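One way to read the bound (an interpretation, not something the slide states): about L positions along the alignment, about $R^{N-1}$ cells in the band around each position, and about $2^N$ neighbouring cells examined per cell. The sketch below just compares that count with the unrestricted $2^N L^N$; the function names are hypothetical.

```python
def restricted_mdp_work(n_seqs, radius, length):
    """Approximate cell updates for DP restricted to a band: 2^N * R^(N-1) * L."""
    return 2 ** n_seqs * radius ** (n_seqs - 1) * length


def full_mdp_work(n_seqs, length):
    """Approximate cell updates for unrestricted multidimensional DP: 2^N * L^N."""
    return 2 ** n_seqs * length ** n_seqs


# Three sequences of length about 1000, band radius 10.
print(restricted_mdp_work(3, 10, 1000))
print(full_mdp_work(3, 1000))
```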

Page 14

Tree Refinement

Run 3D-DP, restricted to radius R from m, for each tree node

[Figure: DP volume over x, y, z with a band of radius R around m]

Running Time: $\sim 7 R^2 L N$

R: Radius
L: Alignment Length
N: Number of Sequences

Page 15

A* for Multiple Alignments

Review of the A* algorithm

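[Figure: a path from START to GOAL through node v; g(v) covers START to v, h(v) covers v to GOAL]

• g(v) is the cost so far
• h(v) is an estimate of the minimum cost from v to GOAL
• f(v) = g(v) + h(v) estimates the minimum cost of a path passing through v; when h(v) never overestimates, the true cost of such a path is ≥ f(v)

1. Expand the v with the smallest f(v)
2. Never expand v if f(v) ≥ the shortest path to the goal found so far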
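The review above corresponds to the usual priority-queue formulation. The following is a generic sketch on a small hand-made graph (the graph, the heuristic values, and the name `a_star` are illustrative assumptions). Rule 2 is implicit: the search stops as soon as GOAL has the smallest f in the queue.

```python
import heapq


def a_star(graph, start, goal, h):
    """Return the cost of a cheapest start-to-goal path, or None.

    graph: {node: {neighbour: edge_cost}}
    h:     function node -> estimated remaining cost (must not overestimate)
    """
    best_g = {start: 0}                       # g(v): best cost found so far
    frontier = [(h(start), start)]            # ordered by f(v) = g(v) + h(v)
    while frontier:
        f, v = heapq.heappop(frontier)
        if v == goal:
            return best_g[v]
        if f > best_g[v] + h(v):
            continue                          # stale entry, a better g exists
        for w, cost in graph.get(v, {}).items():
            g = best_g[v] + cost
            if g < best_g.get(w, float("inf")):
                best_g[w] = g
                heapq.heappush(frontier, (g + h(w), w))
    return None


graph = {
    "START": {"a": 1, "b": 4},
    "a": {"b": 2, "GOAL": 6},
    "b": {"GOAL": 2},
    "GOAL": {},
}
estimate = {"START": 4, "a": 3, "b": 2, "GOAL": 0}
print(a_star(graph, "START", "GOAL", estimate.get))
```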

Page 16

A* for Multiple Alignments

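• Nodes: cells in the DP matrix
• g(v): alignment cost so far
• h(v): sum over all pairs of the cost of the best pairwise alignment of the remaining suffixes (a lower bound on the remaining sum-of-pairs cost)

• Initial minimum alignment cost estimate: sum-of-pairs of global pairwise alignments

[Figure: path from START to GOAL through cell v, split into g(v) and h(v)]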
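A compact sketch of this setup, under assumed unit costs (mismatch 1, gap 1, gap against gap 0): nodes are index tuples into the sequences, each step emits one alignment column, and h(v) sums precomputed pairwise suffix-alignment costs. `astar_msa` and its helpers are hypothetical names; the second return value counts expanded cells so the effect of the heuristic can be inspected.

```python
import heapq
from itertools import combinations, product
from math import inf


def pair_cost(a, b, mismatch=1, gap=1):
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return 0 if a == b else mismatch


def suffix_table(x, y):
    """table[i][j] = minimum pairwise alignment cost of x[i:] and y[j:]."""
    n, m = len(x), len(y)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                continue
            best = inf
            if i < n and j < m:
                best = min(best, table[i + 1][j + 1] + pair_cost(x[i], y[j]))
            if i < n:
                best = min(best, table[i + 1][j] + pair_cost(x[i], "-"))
            if j < m:
                best = min(best, table[i][j + 1] + pair_cost("-", y[j]))
            table[i][j] = best
    return table


def astar_msa(seqs, use_heuristic=True):
    """Return (optimal sum-of-pairs cost, number of expanded cells)."""
    k = len(seqs)
    pairs = list(combinations(range(k), 2))
    tables = {(a, b): suffix_table(seqs[a], seqs[b]) for a, b in pairs}

    def h(v):
        if not use_heuristic:
            return 0
        return sum(tables[a, b][v[a]][v[b]] for a, b in pairs)

    start, goal = (0,) * k, tuple(len(s) for s in seqs)
    best_g = {start: 0}
    heap = [(h(start), 0, start)]
    expanded = 0
    while heap:
        _, g, v = heapq.heappop(heap)
        if g > best_g[v]:
            continue                          # stale queue entry
        if v == goal:
            return g, expanded
        expanded += 1
        for step in product((0, 1), repeat=k):
            if not any(step):
                continue
            w = tuple(vi + si for vi, si in zip(v, step))
            if any(wi > len(s) for wi, s in zip(w, seqs)):
                continue
            column = [seqs[i][v[i]] if step[i] else "-" for i in range(k)]
            ng = g + sum(pair_cost(column[a], column[b]) for a, b in pairs)
            if ng < best_g.get(w, inf):
                best_g[w] = ng
                heapq.heappush(heap, (ng + h(w), ng, w))
    return None


seqs = ["GAAGTT", "GACTT", "GAACTG"]
print(astar_msa(seqs), astar_msa(seqs, use_heuristic=False))
```

Running with `use_heuristic=False` turns the search into plain uniform-cost search, which should give the same cost while typically expanding more cells.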

Page 17

Consistency – T-Coffee

[Figure: sequences x, y, z with positions xi, yj, yj’, and zk marked]

Page 18
Page 19

T-Coffee Layout

[Figure: T-Coffee layout, with LALIGN and CLUSTALW as the pairwise alignment sources]

Page 20

Generating Primary Library

[Figure: pairwise alignments among sequences A, B, and C feeding two libraries]

• ClustalW Primary Library (global pairwise alignment)
• Lalign Primary Library (10 top-scoring non-intersecting local pairwise alignments)

The library has information for each of the N(N-1)/2 sequence pairs.

Page 21

Primary Library

Seq A  GARFIELD THE LAST FAT CAT
Seq B  GARFIELD THE FAST CAT ---
Prim. Weight = 88

Seq A  GARFIELD THE LAST FA-T CAT
Seq C  GARFIELD THE VERY FAST ---
Prim. Weight = 77

Seq A  GARFIELD THE LAST FAT CAT
Seq D  -------- THE ---- FAT CAT
Prim. Weight = 100

Seq B  GARFIELD THE ---- FAST CAT
Seq C  GARFIELD THE VERY FAST CAT
Prim. Weight = 100

Seq B  GARFIELD THE FAST CAT
Seq D  -------- THE FA-T CAT
Prim. Weight = 100

Seq C  GARFIELD THE VERY FAST CAT
Seq D  -------- THE ---- FA-T CAT
Prim. Weight = 100
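The slide does not say how the primary weight is computed. Assuming it is the percent identity over aligned (non-gap) residue pairs, truncated to an integer, the sketch below reproduces the 88, 77, and 100 shown for the A/B, A/C, and A/D alignments. `primary_weight` is a hypothetical name.

```python
def primary_weight(row_a, row_b):
    """Percent identity over positions where neither row has a gap."""
    a = row_a.replace(" ", "")
    b = row_b.replace(" ", "")
    aligned = [(p, q) for p, q in zip(a, b) if p != "-" and q != "-"]
    matches = sum(p == q for p, q in aligned)
    return 100 * matches // len(aligned)      # truncated, not rounded


print(primary_weight("GARFIELD THE LAST FAT CAT", "GARFIELD THE FAST CAT ---"))
print(primary_weight("GARFIELD THE LAST FA-T CAT", "GARFIELD THE VERY FAST ---"))
print(primary_weight("GARFIELD THE LAST FAT CAT", "-------- THE ---- FAT CAT"))
```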

Page 22

Combining the libraries

Seq A  GARFIELD THE LAST FAT CAT
Seq B  GARFIELD THE FAST CAT ---

Primary weight (ClustalW) = 88

Primary weight (Lalign) = 88

W(A(G), B(G)) = 88 + 88 = 176

If a pair is duplicated across the two libraries, it is merged into a single entry whose weight is the sum of the two weights.

Pairs of residues that did not occur are not present (weight 0).
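The merge rule can be sketched with dictionaries keyed by residue pairs. The key layout `((seq, position), (seq, position))` and the name `combine_libraries` are assumptions for illustration; absent pairs simply have no entry, so a lookup with a default of 0 gives their weight.

```python
from collections import Counter


def combine_libraries(*libraries):
    """Merge primary libraries; weights of duplicated residue pairs are summed."""
    combined = Counter()
    for library in libraries:
        for residue_pair, weight in library.items():
            combined[residue_pair] += weight
    return dict(combined)


# First residue (G) of Seq A paired with first residue (G) of Seq B.
pair = (("A", 0), ("B", 0))
clustalw_library = {pair: 88}
lalign_library = {pair: 88}

library = combine_libraries(clustalw_library, lalign_library)
print(library[pair])                                  # 88 + 88
print(library.get((("A", 0), ("B", 5)), 0))           # pair never observed
```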

Page 23

Library Extension

• Complete extension requires examination of all triplets.

• Not all triplets bring information (e.g. A and B through D).

• Weight of a pair = weights gathered through examination of all triplets involving that pair.
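The slide does not give the formula for the weight gathered through a triplet. The sketch below assumes the rule described in the T-Coffee paper: a path from residue a to residue b through a third residue c contributes min(W(a,c), W(c,b)), and contributions are added to the direct weight. Names and the key layout are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def extend_library(library):
    """Extended weight = direct weight + sum over c of min(W(a,c), W(c,b)).

    library: {((seq, pos), (seq, pos)): weight}, pairs between different sequences.
    Returns a dict keyed by frozenset({residue_a, residue_b}).
    """
    neighbours = defaultdict(dict)
    extended = defaultdict(int)
    for (a, b), weight in library.items():
        neighbours[a][b] = weight
        neighbours[b][a] = weight
        extended[frozenset((a, b))] += weight          # direct contribution
    for c, linked in neighbours.items():
        for a, b in combinations(sorted(linked), 2):
            if a[0] != b[0]:                           # residues of different sequences
                extended[frozenset((a, b))] += min(linked[a], linked[b])
    return dict(extended)


a0, b0, c0 = ("A", 0), ("B", 0), ("C", 0)
primary = {(a0, b0): 88, (a0, c0): 77, (c0, b0): 100}
extended = extend_library(primary)
print(extended[frozenset((a0, b0))])   # 88 direct + min(77, 100) through C
```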

Page 24

Running Time

• Complexity of entire procedure:

$O(N^2 L^2) + O(N^3 L) + O(N^3) + O(N L^2)$

• $O(N^2 L^2)$: pairwise library computation
• $O(N^3 L)$: library extension
• $O(N^3)$: computation of NJ tree
• $O(N L^2)$: progressive alignment computation

Where:
L: average sequence length
N: number of sequences

Page 25

T-Coffee compared with other methods

Method     Cat1(81)  Cat2(23)  Cat3(4)  Cat4(12)  Cat5(11)  Total(141)
Dialign    71.0      25.2      35.1     74.7      80.4      61.5
ClustalW   78.5      32.2      42.5     65.7      74.3      66.4
Prrp       78.6      32.5      50.2     51.1      82.7      66.4
T-Coffee   80.7      37.3      52.9     83.2      88.7      72.1

Page 26

Gene Recognition

Credits for slides: Marina Alexandersson, Lior Pachter, Serge Saxonov

Page 27

Reading

• GENSCAN

• EasyGene

• SLAM

• Twinscan

Optional:

Chris Burge’s Thesis

Page 28

Gene expression

DNA:      CCTGAGCCAACTATTGATGAA

  (transcription)

RNA:      CCUGAGCCAACUAUUGAUGAA

  (translation)

Protein:  PEPTIDE
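The example sequence can be checked in a few lines. This sketch treats transcription as replacing T with U on the strand shown (a simplification) and translates with the standard genetic code using one-letter amino acid codes; the function names are illustrative.

```python
BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... GGG ('*' = stop).
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(dna):
    """DNA (coding strand) to mRNA: same letters with T replaced by U."""
    return dna.replace("T", "U")


def translate(rna):
    """Read the RNA three bases at a time, ignoring any trailing partial codon."""
    usable = len(rna) - len(rna) % 3
    return "".join(CODON_TABLE[rna[i:i + 3]] for i in range(0, usable, 3))


rna = transcribe("CCTGAGCCAACTATTGATGAA")
print(rna)
print(translate(rna))
```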

Page 29

Gene structure

[Figure: a gene laid out as exon1, intron1, exon2, intron2, exon3, followed by the steps transcription, splicing, and translation]

exon = coding
intron = non-coding

Page 30

Finding genes

[Figure: a gene drawn 5’ to 3’ as Exon 1, Intron 1, Exon 2, Intron 2, Exon 3, with the signals below marked]

• Start codon: ATG
• Stop codon: TAG / TGA / TAA
• Splice sites
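The start and stop signals alone already allow a naive scan for open reading frames. The sketch below ignores introns and splice sites and looks only at the forward strand, so it illustrates the signals rather than gene finding as done by the tools in these slides. `find_orfs` is a hypothetical helper.

```python
STOP_CODONS = {"TAG", "TGA", "TAA"}


def find_orfs(dna):
    """Return (start, end) index pairs of ATG...stop stretches, read in frame.

    end is exclusive and includes the stop codon; introns are ignored.
    """
    orfs = []
    for start in range(len(dna) - 2):
        if dna[start:start + 3] != "ATG":
            continue
        for pos in range(start + 3, len(dna) - 2, 3):
            if dna[pos:pos + 3] in STOP_CODONS:
                orfs.append((start, pos + 3))
                break
    return orfs


dna = "CCATGAAATAGGG"
for start, end in find_orfs(dna):
    print(start, end, dna[start:end])
```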

Page 31
Page 32
Page 33

Approaches to gene finding

• Homology: BLAST, Procrustes.

• Ab initio: Genscan, Genie, GeneID.

• Hybrids: GenomeScan, GenieEST, Twinscan, SGP, ROSETTA, CEM, TBLASTX, SLAM.

Page 34

HMMs for single species gene finding: Generalized HMMs

Page 35

HMMs for gene finding

GTCAGAGTAGCAAAGTAGACACTCCAGTAACGC

[Figure: the sequence is labeled with hidden states: intergene, exon, intron, exon, intron, exon, intergene]

Page 36

GHMM for gene finding

[Figure: a DNA sequence segmented into Exon1, Exon2, Exon3; each segment is emitted as a whole, with its own duration]

Page 37

Observed duration times

Page 38

Better way to do it: negative binomial

• EasyGene: prokaryotic gene-finder (Larsen TS, Krogh A)

• Negative binomial with n = 3
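For context: a single HMM state with a self-loop has geometric durations, which always peak at length 1, while chaining n copies of the state with the same exit probability gives a negative binomial, which peaks at a longer length. The sketch below compares the two; the exit probability p = 0.1 is an arbitrary illustrative value, not a parameter from EasyGene.

```python
from math import comb


def geometric_pmf(d, p):
    """P(duration = d) for one self-looping state with exit probability p, d >= 1."""
    return (1 - p) ** (d - 1) * p


def negative_binomial_pmf(d, p, n=3):
    """P(total duration = d) for n chained states, each with exit probability p."""
    if d < n:
        return 0.0
    return comb(d - 1, n - 1) * p ** n * (1 - p) ** (d - n)


p = 0.1
durations = range(1, 200)
geometric_mode = max(durations, key=lambda d: geometric_pmf(d, p))
negbin_mode = max(durations, key=lambda d: negative_binomial_pmf(d, p))
print(geometric_mode, negbin_mode)
```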

Page 39

Biology of Splicing

(http://genes.mit.edu/chris/)

Page 40

Consensus splice sites

(http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html)

Donor: 7.9 bits
Acceptor: 9.4 bits
(Stephens & Schneider, 1996)

Page 41

Splice site detection

Donor site (5’ to 3’): base frequencies (%) by position

Position   -8   …   -2   -1    0    1    2   …   17
A          26   …   60    9    0    1   54   …   21
C          26   …   15    5    0    1    2   …   27
G          25   …   12   78   99    0   41   …   27
T          23   …   13    8    1   98    3   …   25

Page 42

Splice Site Models

• WMM: weight matrix model = PSSM (Staden 1984)

• WAM: weight array model = 1st-order Markov (Zhang & Marr 1993)

• MDD: maximal dependence decomposition (Burge & Karlin 1997), a decision-tree-like algorithm that takes significant pairwise dependencies into account
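A WMM scores each position independently. The sketch below uses the donor-site percentages for positions -2 to 2 from the table on the previous slide and scores a 5-base candidate as a sum of log-odds. The uniform background of 0.25, the pseudocount of 1 (to avoid log of zero), and the name `wmm_score` are assumptions for illustration.

```python
from math import log2

# Donor-site base percentages at positions -2, -1, 0, 1, 2.
DONOR_PERCENT = {
    "A": [60, 9, 0, 1, 54],
    "C": [15, 5, 0, 1, 2],
    "G": [12, 78, 99, 0, 41],
    "T": [13, 8, 1, 98, 3],
}


def wmm_score(site, pseudocount=1.0, background=0.25):
    """Log-odds score (bits) of a 5-base window under the weight matrix."""
    if len(site) != 5:
        raise ValueError("expected a window covering positions -2..2")
    score = 0.0
    for i, base in enumerate(site):
        total = sum(DONOR_PERCENT[b][i] + pseudocount for b in "ACGT")
        probability = (DONOR_PERCENT[base][i] + pseudocount) / total
        score += log2(probability / background)
    return score


for candidate in ("AGGTA", "AGGTG", "CCCCC"):
    print(candidate, round(wmm_score(candidate), 2))
```

A WAM would replace the per-position probabilities with probabilities conditioned on the previous base; that extension is not shown here.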

Page 43

[Figure: gene model annotated with signal sequences: atg, tga, ggtgag (three times), caggtg, cagatg, cagttg, caggccggtgag]