multiple sequence alignments algorithms. mlagan: progressive alignment of dna given n sequences,...

Multiple Sequence Alignments

Algorithms

MLAGAN: progressive alignment of DNA

Given N sequences, phylogenetic tree

Align pairwise, in order of the tree (LAGAN)

Human

Baboon

Mouse

Rat

MLAGAN: main steps

Given a collection of sequences, and a phylogenetic tree

1. Find local alignments for every pair of sequences x, y

2. Find anchors between every pair of sequences, similar to LAGAN anchoring

3. Progressive alignment
   • Multi-anchoring based on reconciling the pairwise anchors
   • LAGAN-style limited-area DP

4. Optional refinement steps
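The tree-guided order used in the progressive step can be sketched as a post-order walk of the guide tree. This is only an illustration: the nested-tuple tree encoding, the example tree, and the function name `merge_order` are assumptions, and the pairwise aligner itself (LAGAN in the slides) is not modelled.

```python
def merge_order(tree):
    """Return (label, merges) for a guide tree given as nested 2-tuples of names.

    merges lists the pairwise alignment steps in the order a progressive
    aligner would perform them (children before parents).
    """
    if isinstance(tree, str):          # leaf: a single sequence
        return tree, []
    left, right = tree
    left_label, left_merges = merge_order(left)
    right_label, right_merges = merge_order(right)
    label = f"({left_label}/{right_label})"
    return label, left_merges + right_merges + [(left_label, right_label)]


# Hypothetical guide tree for the four species named on the slide.
tree = (("Human", "Baboon"), ("Mouse", "Rat"))
root, steps = merge_order(tree)
for a, b in steps:
    print(f"align {a} with {b}")
```

Each printed step aligns two already-built alignments (or single sequences), so the last step is the alignment of the (Human/Baboon) group with the (Mouse/Rat) group.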

MLAGAN: multi-anchoring

To anchor the (X/Y) and (Z) alignments:

[Figure: the pairwise anchors XZ and YZ are reconciled into anchors between the X/Y alignment and Z]

Whole-genome alignment Human/Mouse/Rat

Insertion/Deletion Rate Analysis

Heuristics to improve multiple alignments

• Iterative refinement schemes

• A*-based search

• Consistency

• Simulated Annealing

• …

Iterative Refinement

One problem of progressive alignment:
• Initial alignments are “frozen” even when new evidence arrives

Example:

x: GAAGTT
y: GAC-TT   (frozen!)

z: GAACTG
w: GTACTG

Once z and w are added, it is clear that the correct alignment is y = GA-CTT


Algorithm (Barton-Sternberg):

1. Align the most similar pair xi, xj

2. Align the xk most similar to (xi xj)

3. Repeat 2 until (x1…xN) are aligned

4. For j = 1 to N, remove xj and realign it to x1…xj-1 xj+1…xN

5. Repeat 4 until convergence

Note: Guaranteed to converge
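A minimal sketch of the remove-and-realign loop (steps 4 and 5), under simplifying assumptions that are not in the slides: the score is the number of identical residues, gaps are free, the removed sequence may only receive gaps (no new columns are opened), and a realignment is accepted only if it strictly improves the score. The function names are hypothetical, and the construction of the initial alignment (steps 1 to 3) is not modelled.

```python
def column_matches(residue, column):
    """Number of residues in a fixed column identical to `residue`."""
    return sum(1 for c in column if c == residue and c != "-")


def realign(seq, columns):
    """Place ungapped `seq` into the fixed `columns`, maximising matches.

    Simplification: only gaps in `seq` are allowed; no new columns are opened,
    so len(seq) must not exceed len(columns).
    """
    n, m = len(seq), len(columns)
    NEG = float("-inf")
    best = [[NEG] * (m + 1) for _ in range(n + 1)]
    for j in range(m + 1):
        best[0][j] = 0                      # leading gaps in seq are free
    for i in range(1, n + 1):
        for j in range(i, m + 1):
            match = best[i - 1][j - 1] + column_matches(seq[i - 1], columns[j - 1])
            gap = best[i][j - 1]            # column j-1 gets a gap in seq
            best[i][j] = max(match, gap)
    # Trace back from the bottom-right corner.
    row, i, j = [], n, m
    while j > 0:
        if i > 0 and best[i][j] == best[i - 1][j - 1] + column_matches(seq[i - 1], columns[j - 1]):
            row.append(seq[i - 1])
            i -= 1
        else:
            row.append("-")
        j -= 1
    return "".join(reversed(row)), best[n][m]


def refine(rows, max_rounds=10):
    """Remove each row in turn and realign it to the rest until nothing improves."""
    rows = list(rows)
    for _ in range(max_rounds):
        improved = False
        for k, row in enumerate(rows):
            others = rows[:k] + rows[k + 1:]
            columns = list(zip(*others))
            old_score = sum(column_matches(r, col) for r, col in zip(row, columns))
            new_row, new_score = realign(row.replace("-", ""), columns)
            if new_score > old_score:       # accept strict improvements only
                rows[k], improved = new_row, True
        if not improved:
            break
    return rows


print(refine(["GAAGTTA", "GAC-TTA", "GAACTGA", "GTACTGA"]))
```

Accepting only strict improvements of a bounded score is what makes this sketch terminate, which mirrors the convergence note above.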


For each sequence y:
1. Remove y
2. Realign y (while the rest stay fixed)

[Figure: x and z are held fixed as a projection; y is allowed to vary]


Example: align (x,y), (z,w), (xy, zw):

x: GAAGTTA
y: GAC-TTA
z: GAACTGA
w: GTACTGA

After realigning y:

x: GAAGTTA
y: G-ACTTA   (+3 matches)
z: GAACTGA
w: GTACTGA
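The "+3 matches" figure can be checked by counting residue identities between y and the other three rows before and after the realignment. The helper name `matches_with_rest` is hypothetical, and counting plain identities (gaps score nothing) is an assumption about the scoring the slide has in mind.

```python
def matches_with_rest(alignment, name):
    """Count residue identities between row `name` and every other row."""
    target = alignment[name]
    total = 0
    for other_name, other in alignment.items():
        if other_name == name:
            continue
        total += sum(1 for a, b in zip(target, other) if a == b and a != "-")
    return total


before = {"x": "GAAGTTA", "y": "GAC-TTA", "z": "GAACTGA", "w": "GTACTGA"}
after = dict(before, y="G-ACTTA")

gain = matches_with_rest(after, "y") - matches_with_rest(before, "y")
print(gain)  # 3
```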


Example not handled well:

x:  GAAGTTA
y1: GAC-TTA
y2: GAC-TTA
y3: GAC-TTA

z: GAACTGA
w: GTACTGA

Realigning any single yi changes nothing: the other two copies of y still support the old gap position

Restricted MDP

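Run MDP (multidimensional DP), restricted to radius R from m (the current alignment)

[Figure: DP lattice over sequences x, y, z with the search restricted to a band around m]

Running Time: $O(2^N R^{N-1} L)$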
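For N = 2 the restricted DP reduces to a banded pairwise alignment that visits roughly (2R + 1)·L cells. The sketch below illustrates that case only; the unit-cost edit distance and the band centred on the main diagonal (rather than on an existing alignment m) are assumptions made to keep the example short.

```python
def banded_edit_distance(a, b, radius):
    """Edit distance between a and b, visiting only cells with |i - j| <= radius.

    Returns None when the end cell lies outside the band.
    """
    INF = float("inf")
    n, m = len(a), len(b)
    if abs(n - m) > radius:
        return None
    prev = {j: j for j in range(0, min(m, radius) + 1)}   # row 0
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - radius), min(m, i + radius) + 1):
            if j == 0:
                cur[j] = i                                 # only gaps so far
                continue
            sub = prev.get(j - 1, INF) + (a[i - 1] != b[j - 1])
            dele = prev.get(j, INF) + 1                    # gap in b
            ins = cur.get(j - 1, INF) + 1                  # gap in a
            cur[j] = min(sub, dele, ins)
        prev = cur
    return prev[m]


print(banded_edit_distance("GAAGTT", "GACTT", 1))  # 2
```

A band that is too narrow can miss the optimum or the end cell entirely, which is why the restricted search is a heuristic refinement around an alignment that is already reasonable.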

Tree Refinement

Run 3D-DP, restricted to radius R from m, for each tree node

[Figure: 3D DP lattice over x, y, z]

Running Time: $\sim 7 R^2 L N$

R: radius
L: alignment length
N: number of sequences

A* for Multiple Alignments

Review of the A* algorithm

[Figure: a path from START through v to GOAL; g(v) covers START to v, h(v) covers v to GOAL]

• g(v) is the cost so far
• h(v) is an estimate of the minimum cost from v to GOAL
• f(v) ≥ g(v) + h(v) is the minimum cost of a path passing through v

1. Expand the v with the smallest f(v)
2. Never expand v if f(v) ≥ the shortest path to the goal found so far
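A compact A* sketch on the cells of a pairwise DP matrix, to make the roles of g(v) and h(v) concrete. The unit-cost scoring, the function names, and the heuristic (the remaining length difference, which is a lower bound on the gaps still needed) are assumptions for illustration; the multiple-alignment version would use pairwise alignment costs as h(v) instead.

```python
import heapq


def a_star(start, goal, neighbors, h):
    """Return the cost of a cheapest start-goal path, or None if unreachable.

    neighbors(v) yields (next_vertex, edge_cost); h(v) must never overestimate
    the remaining cost for the result to be optimal.
    """
    g = {start: 0}
    frontier = [(h(start), start)]          # ordered by g(v) + h(v)
    while frontier:
        f, v = heapq.heappop(frontier)
        if v == goal:
            return g[v]
        if f > g[v] + h(v):                 # stale queue entry
            continue
        for w, cost in neighbors(v):
            new_g = g[v] + cost
            if new_g < g.get(w, float("inf")):
                g[w] = new_g
                heapq.heappush(frontier, (new_g + h(w), w))
    return None


a, b = "GAAGTT", "GACTT"
n, m = len(a), len(b)


def neighbors(v):
    i, j = v
    if i < n and j < m:
        yield (i + 1, j + 1), int(a[i] != b[j])   # match / mismatch
    if i < n:
        yield (i + 1, j), 1                        # gap in b
    if j < m:
        yield (i, j + 1), 1                        # gap in a


def h(v):
    i, j = v
    return abs((n - i) - (m - j))   # at least this many gaps are still needed


print(a_star((0, 0), (n, m), neighbors, h))  # 2
```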

A* for Multiple Alignments

• Nodes: cells in the DP matrix
• g(v): alignment cost so far
• h(v): sum-of-pairs of individual pairwise alignments

• Initial minimum alignment cost estimate: sum-of-pairs of global pairwise alignments

Consistency – T-Coffee

[Figure: residues xi (in x), yj and yj’ (in y), and zk (in z)]

T-Coffee Layout

[Figure: T-Coffee pipeline; LALIGN and CLUSTALW produce the primary libraries]

Generating Primary Library

[Figure: pairwise alignments among sequences A, B, C]

• ClustalW primary library: global pairwise alignments
• Lalign primary library: local pairwise alignments (10 top-scoring non-intersecting)

The library has information for each of the N(N-1)/2 sequence pairs.

Primary Library

Seq A  GARFIELD THE LAST FAT CAT
Seq B  GARFIELD THE FAST CAT ---
Prim. Weight = 88

Seq A  GARFIELD THE LAST FA-T CAT
Seq C  GARFIELD THE VERY FAST ---
Prim. Weight = 77

Seq A  GARFIELD THE LAST FAT CAT
Seq D  -------- THE ---- FAT CAT
Prim. Weight = 100

Seq B  GARFIELD THE ---- FAST CAT
Seq C  GARFIELD THE VERY FAST CAT
Prim. Weight = 100

Seq B  GARFIELD THE FAST CAT
Seq D  -------- THE FA-T CAT
Prim. Weight = 100

Seq C  GARFIELD THE VERY FAST CAT
Seq D  -------- THE ---- FA-T CAT
Prim. Weight = 100

Combining the libraries

Seq A  GARFIELD THE LAST FAT CAT
Seq B  GARFIELD THE FAST CAT ---

Primary weight (ClustalW) = 88
Primary weight (Lalign) = 88

W(A(G),B(G)) = 88 + 88 = 176

• If a pair is duplicated across the two libraries, it is merged into a single entry with weight = sum of the two weights
• Pairs of residues that did not occur are not present (weight 0)
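The merging rule can be sketched with two counters keyed by residue pairs. The key layout `((sequence, position), (sequence, position))` and the variable names are assumptions for illustration; only the weights 88 and 176 come from the slide.

```python
from collections import Counter

# Keys are residue pairs ((sequence, position), (sequence, position)).
clustalw_lib = Counter({(("A", 0), ("B", 0)): 88})
lalign_lib = Counter({(("A", 0), ("B", 0)): 88})

# Adding counters sums the weights of pairs present in both libraries.
combined = clustalw_lib + lalign_lib
print(combined[(("A", 0), ("B", 0))])   # 176
print(combined[(("A", 0), ("C", 5))])   # 0 (pair never observed)
```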

Library Extension

• Complete extension requires examination of all triplets.
• Not all bring information (e.g. A and B through D).

• Weight of a pair = weights gathered through examination of all triplets involving that pair.
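A sketch of triplet extension on a toy library. It assumes the rule described in the T-Coffee paper, where a pair aligned through a third residue contributes the minimum of the two primary weights; the slide itself does not state the rule, so treat it as an assumption. The residue encoding and the function name `extend_library` are also hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def extend_library(primary):
    """Triplet extension: W'(a,b) = W(a,b) + sum over c of min(W(a,c), W(c,b)).

    `primary` maps frozenset({residue_a, residue_b}) -> weight, where a residue
    is a (sequence_name, position) tuple.
    """
    partners = defaultdict(dict)
    for pair, weight in primary.items():
        a, b = tuple(pair)
        partners[a][b] = weight
        partners[b][a] = weight

    extended = dict(primary)
    for c, linked in partners.items():          # c is the intermediate residue
        for (a, w_ac), (b, w_cb) in combinations(linked.items(), 2):
            if a[0] == b[0]:
                continue                        # same sequence: not a valid pair
            key = frozenset((a, b))
            extended[key] = extended.get(key, 0) + min(w_ac, w_cb)
    return extended


# First residue (G) of sequences A, B, C with the primary weights from the slides.
primary = {
    frozenset({("A", 0), ("B", 0)}): 88,
    frozenset({("A", 0), ("C", 0)}): 77,
    frozenset({("B", 0), ("C", 0)}): 100,
}
extended = extend_library(primary)
print(extended[frozenset({("A", 0), ("B", 0)})])   # 88 + min(77, 100) = 165
```

Sequence D contributes nothing here because it has no residue aligned to this position, which is the sense in which "not all triplets bring information".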

Running Time

• Complexity of the entire procedure:

$O(N^2 L^2) + O(N^3 L) + O(N^3) + O(N L^2)$

• $O(N^2 L^2)$: pairwise library computation
• $O(N^3 L)$: library extension
• $O(N^3)$: computation of the NJ tree
• $O(N L^2)$: progressive alignment computation

Where:
L: average sequence length
N: number of sequences

T-Coffee compared with other methods

Method     Cat1(81)  Cat2(23)  Cat3(4)  Cat4(12)  Cat5(11)  Total(141)
Dialign      71.0      25.2     35.1     74.7      80.4       61.5
ClustalW     78.5      32.2     42.5     65.7      74.3       66.4
Prrp         78.6      32.5     50.2     51.1      82.7       66.4
T-Coffee     80.7      37.3     52.9     83.2      88.7       72.1

Gene Recognition

Credits for slides:
Marina Alexandersson
Lior Pachter
Serge Saxonov

Reading

• GENSCAN

• EasyGene

• SLAM

• Twinscan

Optional:

Chris Burge’s Thesis

Gene expression

DNA      CCTGAGCCAACTATTGATGAA
   ↓ transcription
RNA      CCUGAGCCAACUAUUGAUGAA
   ↓ translation
Protein  PEPTIDE
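The two steps can be reproduced on the slide's own sequence. The codon table below is deliberately partial: it lists only the standard-code codons this example needs, and the function names are hypothetical.

```python
# Partial codon table: only the codons needed for this example (standard genetic code).
CODON_TABLE = {
    "CCU": "P", "CCA": "P", "GAG": "E", "GAA": "E",
    "ACU": "T", "AUU": "I", "GAU": "D",
}


def transcribe(dna):
    """Coding-strand DNA -> mRNA: replace T with U."""
    return dna.replace("T", "U")


def translate(rna):
    """Translate an mRNA codon by codon, starting at position 0."""
    codons = [rna[i:i + 3] for i in range(0, len(rna) - len(rna) % 3, 3)]
    return "".join(CODON_TABLE[c] for c in codons)


dna = "CCTGAGCCAACTATTGATGAA"
rna = transcribe(dna)
print(rna)              # CCUGAGCCAACUAUUGAUGAA
print(translate(rna))   # PEPTIDE
```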

Gene structure

exon1  intron1  exon2  intron2  exon3

transcription → splicing → translation

exon = coding
intron = non-coding

Finding genes

5’  Exon 1 | Intron 1 | Exon 2 | Intron 2 | Exon 3  3’

Start codon: ATG
Stop codon: TAG/TGA/TAA
Splice sites: the exon/intron boundaries
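As a first approximation, the start and stop signals alone define open reading frames. The sketch below scans the forward strand only and ignores introns and splice sites entirely, so it is closer to prokaryotic gene finding than to the spliced gene structure shown here; the function name, the `min_codons` parameter, and the test sequence are made up for illustration.

```python
STOP_CODONS = {"TAG", "TGA", "TAA"}


def find_orfs(dna, min_codons=2):
    """Yield (start, end) of ATG...stop open reading frames on the forward strand."""
    for start in range(len(dna) - 2):
        if dna[start:start + 3] != "ATG":
            continue
        # Walk in-frame codons after the start codon until the first stop.
        for pos in range(start + 3, len(dna) - 2, 3):
            if dna[pos:pos + 3] in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    yield start, pos + 3
                break


seq = "CCATGGCTTGAAATGTAA"
print(list(find_orfs(seq)))   # [(2, 11)]
```

The second ATG in the test sequence is followed immediately by a stop codon, so it is filtered out by `min_codons`.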

Approaches to gene finding

• Homology: BLAST, Procrustes

• Ab initio: Genscan, Genie, GeneID

• Hybrids: GenomeScan, GenieEST, Twinscan, SGP, ROSETTA, CEM, TBLASTX, SLAM

HMMs for single species gene finding: Generalized HMMs

HMMs for gene finding

GTCAGAGTAGCAAAGTAGACACTCCAGTAACGC

intergene | exon | intron | exon | intron | exon | intergene

GHMM for gene finding

[Figure: nucleotide sequence segmented into Exon1, Exon2 and Exon3; each state emits a whole segment with its own duration]

Observed duration times

Better way to do it: negative binomial

• EasyGene: prokaryotic gene-finder (Larsen TS, Krogh A)

• Negative binomial with n = 3
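To see why a negative binomial can fit duration data better, compare it with the geometric distribution that a single self-looping HMM state produces. The sketch treats the duration as the number of steps until the n-th exit event, which corresponds to n identical self-looping states in series; the value p = 0.1 and the function name are illustrative assumptions, not taken from the slides.

```python
from math import comb


def duration_pmf(d, n, p):
    """P(duration = d) when a state is left after the n-th 'exit' event,
    each step exiting with probability p (n = 1 is the geometric case)."""
    if d < n:
        return 0.0
    return comb(d - 1, n - 1) * p**n * (1 - p) ** (d - n)


p = 0.1   # illustrative value, not taken from the slides
lengths = range(1, 200)
geometric_mode = max(lengths, key=lambda d: duration_pmf(d, 1, p))
negbin_mode = max(lengths, key=lambda d: duration_pmf(d, 3, p))
print(geometric_mode, negbin_mode)
```

The geometric case always puts its peak at the shortest duration, while n = 3 moves the peak away from it, which is the shape a length distribution with a typical non-minimal length needs.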

Biology of Splicing

(http://genes.mit.edu/chris/)

Consensus splice sites

(http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html)

Donor: 7.9 bits
Acceptor: 9.4 bits
(Stephens & Schneider, 1996)

Splice site detection

Donor site (5’ → 3’), base frequencies in % by position:

Position   -8   …   -2   -1    0    1    2   …   17
A          26   …   60    9    0    1   54   …   21
C          26   …   15    5    0    1    2   …   27
G          25   …   12   78   99    0   41   …   27
T          23   …   13    8    1   98    3   …   25

Splice Site Models

• WMM: weight matrix model = PSSM (Staden 1984)

• WAM: weight array model = 1st order Markov (Zhang & Marr 1993)

• MDD: maximal dependence decomposition (Burge & Karlin 1997); a decision-tree-like algorithm that takes significant pairwise dependencies into account
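A weight matrix model scores each position independently. The sketch below uses the donor-site percentages for positions -2 to 2 from the table above; the pseudocount, the uniform background of 0.25, and the function name `wmm_score` are assumptions added for the example.

```python
from math import log2

# Donor-site percentages for positions -2..2, copied from the table above.
DONOR_PERCENT = {
    "A": [60, 9, 0, 1, 54],
    "C": [15, 5, 0, 1, 2],
    "G": [12, 78, 99, 0, 41],
    "T": [13, 8, 1, 98, 3],
}


def wmm_score(site, pseudo=0.5, background=0.25):
    """Log-odds score (bits) of a 5-mer under the weight matrix vs. a uniform background."""
    score = 0.0
    for pos, base in enumerate(site):
        # The pseudocount keeps zero-frequency bases from giving log2(0).
        freq = (DONOR_PERCENT[base][pos] + pseudo) / (100 + 4 * pseudo)
        score += log2(freq / background)
    return score


print(round(wmm_score("AGGTA"), 2))   # consensus-like site: positive score
print(round(wmm_score("CCCCC"), 2))   # unlikely site: negative score
```

A WAM would replace the per-position frequencies with frequencies conditioned on the previous base, and MDD would choose which model to apply based on the most informative positions.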

[Figure: gene model annotated with start codon atg, stop codon tga, and splice-site sequences ggtgag, caggtg, cagatg, cagttg, caggccggtgag]
