
Multiple Sequence Alignments

Algorithms

MLAGAN: progressive alignment of DNA

Given N sequences, phylogenetic tree

Align pairwise, in order of the tree (LAGAN)

Baboon

MLAGAN: main steps

Given a collection of sequences, and a phylogenetic tree

1. Find local alignments for every pair of sequences x, y

2. Find anchors between every pair of sequences, similar to LAGAN anchoring

3. Progressive alignment
   • Multi-anchoring based on reconciling the pairwise anchors
   • LAGAN-style limited-area DP

4. Optional refinement steps
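The tree-guided order used by the progressive step can be sketched as a post-order traversal of the guide tree. This is only an illustration: the nested-tuple tree encoding, the function name `progressive_order`, and the four-species tree are assumptions made for the example, not part of MLAGAN.

```python
def progressive_order(tree):
    """Return the merges (left_group, right_group) in the order a
    progressive aligner would perform them: children before parents."""
    merges = []

    def visit(node):
        # A leaf is a sequence name; an internal node is a (left, right) pair.
        if isinstance(node, str):
            return (node,)
        left = visit(node[0])
        right = visit(node[1])
        merges.append((left, right))  # align the two already-built groups
        return left + right

    visit(tree)
    return merges


# Hypothetical guide tree: ((human, baboon), (mouse, rat))
tree = (("human", "baboon"), ("mouse", "rat"))
order = progressive_order(tree)
for left, right in order:
    print("align", left, "with", right)
```

Each merge would be carried out by a pairwise (sequence or profile) aligner such as LAGAN; the sketch only fixes the order.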

MLAGAN: multi-anchoring

To anchor the (X/Y), and (Z) alignments:

Whole-genome alignment Human/Mouse/Rat

Insertion/Deletion Rate Analysis

Heuristics to improve multiple alignments

• Iterative refinement schemes

• A*-based search

• Consistency

• Simulated Annealing

• …

Iterative Refinement

One problem of progressive alignment:
• Initial alignments are “frozen” even when new evidence comes

Example:

x: GAAGTT
y: GAC-TT

z: GAACTG
w: GTACTG

The x/y alignment is frozen!

Once z and w are seen, it is clear that the correct alignment of y is GA-CTT

Algorithm (Barton-Sternberg):

1. Align the most similar pair xi, xj

2. Align xk most similar to (xixj)

3. Repeat 2 until (x1…xN) are aligned

4. For j = 1 to N, remove xj and realign it to x1…xj-1xj+1…xN

5. Repeat 4 until convergence

Note: Guaranteed to converge

For each sequence y:
1. Remove y
2. Realign y (while the rest stay fixed)

[Figure: x and z are held as a fixed projection while y is allowed to vary]

Example: align (x,y), (z,w), (xy, zw):

x: GAAGTTA
y: GAC-TTA
z: GAACTGA
w: GTACTGA

After realigning y:

x: GAAGTTA
y: G-ACTTA   (+3 matches)
z: GAACTGA
w: GTACTGA
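The "+3 matches" figure can be checked by counting, column by column, how many residues of y agree with the other rows before and after the realignment. The sketch below assumes a simple identity count (gaps never match); the function name `matches_with_rest` is made up for the example.

```python
def matches_with_rest(aln, name):
    """Count column-wise identical residues between row `name` and every other row."""
    target = aln[name]
    total = 0
    for other, row in aln.items():
        if other == name:
            continue
        for a, b in zip(target, row):
            if a == b and a != "-":  # a gap never counts as a match
                total += 1
    return total


before = {"x": "GAAGTTA", "y": "GAC-TTA", "z": "GAACTGA", "w": "GTACTGA"}
after = dict(before, y="G-ACTTA")  # only y is realigned; x, z, w stay fixed

gain = matches_with_rest(after, "y") - matches_with_rest(before, "y")
print(matches_with_rest(before, "y"), matches_with_rest(after, "y"), gain)
```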

Example not handled well:

x:  GAAGTTA
y1: GAC-TTA
y2: GAC-TTA
y3: GAC-TTA

z:  GAACTGA
w:  GTACTGA

Realigning any single yi changes nothing

Restricted MDP

Run MDP, restricted to radius R from m

Running Time: $O(2^N R^{N-1} L)$

Tree Refinement

Run 3D-DP, restricted to radius R from m, for each tree node

Running Time: $\sim 7 R^2 L N$

R: Radius
L: Alignment Length
N: Number of Sequences
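To get a feel for how differently the two running-time expressions grow, they can be evaluated for a few values of N. This is rough arithmetic only: hidden constants in the O(·) bound are ignored, and the values R = 10 and L = 1000 are arbitrary choices for the example.

```python
def restricted_mdp_cost(n, r, length):
    """Evaluate 2^N * R^(N-1) * L, the restricted-MDP bound without constants."""
    return (2 ** n) * (r ** (n - 1)) * length


def tree_refinement_cost(n, r, length):
    """Evaluate 7 * R^2 * L * N, the tree-refinement estimate."""
    return 7 * r ** 2 * length * n


for n in (3, 5, 10):
    print(n, restricted_mdp_cost(n, 10, 1000), tree_refinement_cost(n, 10, 1000))
```

The first expression is exponential in N, while the second is linear in N, which is why restricting the DP to three dimensions per tree node scales to many more sequences.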

A* for Multiple Alignments

Review of the A* algorithm

• g(v) is the cost so far
• h(v) is an estimate of the minimum cost from v to GOAL
• f(v) = g(v) + h(v) estimates the minimum cost of a path passing through v; when h(v) never overestimates, the true cost of any such path is ≥ g(v) + h(v)

1. Expand v with the smallest f(v)
2. Never expand v, if f(v) ≥ shortest path to the goal found so far
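A minimal sketch of A* on a toy weighted graph may make the two rules concrete. The graph, the heuristic values, and the function name `a_star` are invented for the example; the heuristic is chosen so that it never overestimates the remaining cost.

```python
import heapq


def a_star(graph, h, start, goal):
    """graph: node -> list of (neighbor, edge_cost); h: node -> lower bound to goal.
    Returns the cost of a cheapest start-to-goal path, or None if unreachable."""
    best_g = {start: 0}
    frontier = [(h[start], start)]  # priority queue ordered by f = g + h
    while frontier:
        f, v = heapq.heappop(frontier)  # expand the node with the smallest f(v)
        if v == goal:
            return best_g[v]
        if f > best_g[v] + h[v]:
            continue  # stale queue entry: a cheaper route to v was found later
        for w, cost in graph.get(v, []):
            g_new = best_g[v] + cost
            if g_new < best_g.get(w, float("inf")):
                best_g[w] = g_new
                heapq.heappush(frontier, (g_new + h[w], w))
    return None


# Toy example (assumed data): the cheapest path is S -> A -> B -> G.
graph = {"S": [("A", 1), ("B", 4)], "A": [("B", 2), ("G", 6)], "B": [("G", 2)]}
h = {"S": 4, "A": 3, "B": 2, "G": 0}
print(a_star(graph, h, "S", "G"))
```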

A* for Multiple Alignments

• Nodes: Cells in the DP matrix
• g(v): alignment cost so far
• h(v): sum-of-pairs of individual pairwise alignments
• Initial minimum alignment cost estimate: sum-of-pairs of global pairwise alignments
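The pairwise heuristic can be sketched as follows: for every pair of sequences, precompute the optimal cost of aligning each pair of suffixes, then let h(v) be the sum of those costs at cell v. The unit mismatch and gap costs, the example sequences, and the function names are assumptions for illustration. Under a sum-of-pairs cost in which gap-gap columns are free, this sum is a lower bound on the remaining multiple-alignment cost.

```python
from itertools import combinations


def suffix_costs(a, b, mismatch=1, gap=1):
    """table[i][j] = minimum cost of aligning suffix a[i:] with suffix b[j:]."""
    n, m = len(a), len(b)
    t = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        t[i][m] = (n - i) * gap  # rest of a aligned against gaps
    for j in range(m - 1, -1, -1):
        t[n][j] = (m - j) * gap  # rest of b aligned against gaps
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            sub = 0 if a[i] == b[j] else mismatch
            t[i][j] = min(t[i + 1][j + 1] + sub,
                          t[i + 1][j] + gap,
                          t[i][j + 1] + gap)
    return t


def make_heuristic(seqs):
    """Return h(v) for a cell v = (i_1, ..., i_N) of the N-dimensional DP matrix."""
    tables = {(p, q): suffix_costs(seqs[p], seqs[q])
              for p, q in combinations(range(len(seqs)), 2)}

    def h(v):
        return sum(tables[p, q][v[p]][v[q]] for (p, q) in tables)

    return h


seqs = ["GAAGTT", "GACTT", "GAACTG"]
h = make_heuristic(seqs)
print(h((0, 0, 0)))  # initial estimate: sum of the global pairwise costs
```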


Consistency – T-Coffee

yj yj’

T – Coffee Layout

LALIGN CLUSTALW

Generating Primary Library

ClustalW primary library: global pairwise alignments

Lalign primary library: the 10 top-scoring non-intersecting local pairwise alignments

The library holds information for each of the N(N-1)/2 sequence pairs.

Primary Library

Seq A GARFIELD THE LAST FAT CAT
Seq B GARFIELD THE FAST CAT - - -
Prim. Weight = 88

Seq A GARFIELD THE LAST FA-T CAT
Seq C GARFIELD THE VERY FAST ---
Prim. Weight = 77

Seq A GARFIELD THE LAST FAT CAT
Seq D - - - -- - -- THE - - - - FAT CAT
Prim. Weight = 100

Seq B GARFIELD THE - - - - FAST CAT
Seq C GARFIELD THE VERY FAST CAT
Prim. Weight = 100

Seq B GARFIELD THE FAST CAT
Seq D - - - - -- - THE FA-T CAT
Prim. Weight = 100

Seq C GARFIELD THE VERY FAST CAT
Seq D - - - -- -- - THE - - - - FA-T CAT
Prim. Weight = 100

Combining the libraries

Seq A GARFIELD THE LAST FAT CAT

Seq B GARFIELD THE FAST CAT - - -

Primary weight(ClustalW)=88

Primary Weight(Lalign)=88

W(A(G),B(G)) = 88 + 88 = 176

If a pair is duplicated across the two libraries, it is merged into a single entry with weight = sum of the two weights.

Pairs of residues that did not occur are not present (weight 0).
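The merging rule can be sketched with a dictionary keyed by residue pairs. The key format ((sequence, position), (sequence, position)), the toy entries, and the function name `combine_libraries` are assumptions for the example; only the weight 88 + 88 = 176 comes from the slide.

```python
from collections import Counter


def combine_libraries(*libraries):
    """Merge primary libraries; a pair present in several libraries gets the summed weight."""
    combined = Counter()
    for lib in libraries:
        for pair, weight in lib.items():
            combined[pair] += weight
    return combined


# Residue pair A(G) / B(G): first residue of Seq A aligned with first residue of Seq B.
a_g_b_g = (("A", 0), ("B", 0))
clustalw_lib = {a_g_b_g: 88}
lalign_lib = {a_g_b_g: 88, (("A", 1), ("B", 1)): 88}

library = combine_libraries(clustalw_lib, lalign_lib)
print(library[a_g_b_g])               # summed weight of the duplicated pair
print(library[(("A", 5), ("B", 7))])  # a pair that never occurred has weight 0
```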

Library Extension

• Complete extension requires examination of all triplets.
• Not all bring information (e.g. A and B through D).
• Weight of a pair = weights gathered through examination of all triplets involving that pair.
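A naive sketch of the extension step is given below. It assumes the commonly described T-Coffee rule in which a triplet (a residue of a third sequence aligned to both members of the pair) contributes the minimum of its two primary weights; the slide itself only says that weights are gathered over triplets. The function name, the key format, and the toy library are made up, and the brute-force loops are slower than the O(N³L) quoted for the real procedure.

```python
def extend_library(lib, seq_lengths):
    """lib maps ((s, i), (t, j)) -> primary weight; returns the extended weights."""

    def w(a, b):
        # Look up a primary weight regardless of the order of the two residues.
        return lib.get((a, b), lib.get((b, a), 0))

    extended = {}
    names = sorted(seq_lengths)
    for x, s in enumerate(names):
        for t in names[x + 1:]:
            for i in range(seq_lengths[s]):
                for j in range(seq_lengths[t]):
                    a, b = (s, i), (t, j)
                    total = w(a, b)  # direct evidence
                    for u in names:  # evidence through every third sequence
                        if u in (s, t):
                            continue
                        for k in range(seq_lengths[u]):
                            total += min(w(a, (u, k)), w((u, k), b))
                    if total:
                        extended[(a, b)] = total
    return extended


# Toy library with one residue per sequence (assumed data).
lib = {(("A", 0), ("B", 0)): 88,
       (("A", 0), ("C", 0)): 77,
       (("B", 0), ("C", 0)): 100}
ext = extend_library(lib, {"A": 1, "B": 1, "C": 1})
print(ext[(("A", 0), ("B", 0))])  # 88 direct + min(77, 100) through C
```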

Running Time

• Complexity of entire procedure:

$O(N^2 L^2) + O(N^3 L) + O(N^3) + O(N L^2)$

$O(N^2 L^2)$ - pairwise library computation
$O(N^3 L)$ - library extension
$O(N^3)$ - computation of NJ tree
$O(N L^2)$ - progressive alignment computation

Where:
L – average sequence length
N – number of sequences

T-Coffee compared with other methods

Method    Cat1(81)  Cat2(23)  Cat3(4)  Cat4(12)  Cat5(11)  Total(141)
Dialign   71.0      25.2      35.1     74.7      80.4      61.5
ClustalW  78.5      32.2      42.5     65.7      74.3      66.4
Prrp      78.6      32.5      50.2     51.1      82.7      66.4
T-Coffee  80.7      37.3      52.9     83.2      88.7      72.1

Gene Recognition

Credits for slides: Marina Alexandersson, Lior Pachter, Serge Saxonov

Reading

• GENSCAN

• EasyGene

• SLAM

• Twinscan

Optional:

Chris Burge’s Thesis

Gene expression

DNA:     CCTGAGCCAACTATTGATGAA
   ↓ transcription
RNA:     CCUGAGCCAACUAUUGAUGAA
   ↓ translation
Protein: PEPTIDE
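The two steps can be reproduced in a few lines using the standard genetic code. The helper names `transcribe` and `translate` are made up for the example; transcription is simplified to replacing T with U on the coding strand.

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(dna):
    """Coding-strand DNA to mRNA: replace T with U."""
    return dna.replace("T", "U")


def translate(rna):
    """Translate an mRNA codon by codon, ignoring any incomplete trailing codon."""
    dna = rna.replace("U", "T")
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


rna = transcribe("CCTGAGCCAACTATTGATGAA")
print(rna)
print(translate(rna))
```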

Gene structure

[Figure: gene with exon1, intron1, exon2, intron2, exon3; transcription, then splicing, then translation]

exon = coding
intron = non-coding

Finding genes

[Figure: 5’ to 3’ gene layout: Exon 1, Intron 1, Exon 2, Intron 2, Exon 3]

Start codon: ATG

Stop codon: TAG/TGA/TAA

Splice sites
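The start and stop signals alone already allow a very crude gene finder: scan each reading frame for ATG followed in frame by a stop codon. The sketch below is a deliberately simplified illustration that ignores introns, splice sites, and the reverse strand; the function name, the `min_codons` parameter, and the test sequence are invented for the example.

```python
STOPS = {"TAG", "TGA", "TAA"}


def find_orfs(dna, min_codons=2):
    """Return half-open (start, end) index pairs of ATG...stop open reading
    frames on the forward strand, scanning all three frames."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon in this frame
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return orfs


dna = "CCATGGCTTAACC"
orfs = find_orfs(dna)
for s, e in orfs:
    print(s, e, dna[s:e])
```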

Approaches to gene finding

• Homology: BLAST, Procrustes.

• Ab initio: Genscan, Genie, GeneID.

• Hybrids: GenomeScan, GenieEST, Twinscan, SGP, ROSETTA, CEM, TBLASTX, SLAM.

HMMs for single species gene finding: Generalized HMMs

HMMs for gene finding

GTCAGAGTAGCAAAGTAGACACTCCAGTAACGC

[Figure: state labels along the sequence: intergene, exon, intron, exon, intron, exon, intergene]

GHMM for gene finding

[Figure: DNA sequence partitioned into segments Exon1, Exon2, Exon3, each with its own duration]

Observed duration times

Better way to do it: negative binomial

• EasyGene: prokaryotic gene-finder (Larsen TS, Krogh A)

• Negative binomial with n = 3
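The difference between the two duration models can be seen by comparing their probability mass functions. A single HMM state with a self-loop gives a geometric duration; the sum of n such durations is negative binomial, which is one common way to realize it (n states chained in series, each with the same exit probability). The value p = 0.1 and the function names are assumptions for illustration.

```python
from math import comb


def geometric_pmf(d, p):
    """P(duration = d), d >= 1, for one state that exits with probability p."""
    return (1 - p) ** (d - 1) * p


def negative_binomial_pmf(d, p, n=3):
    """P(total duration = d) for n chained states, each geometric with exit probability p."""
    if d < n:
        return 0.0  # at least one step must be spent in each of the n states
    return comb(d - 1, n - 1) * p ** n * (1 - p) ** (d - n)


p = 0.1
for d in (1, 3, 10, 20, 40):
    print(d, round(geometric_pmf(d, p), 4), round(negative_binomial_pmf(d, p), 4))
```

The geometric distribution always peaks at d = 1, whereas the negative binomial with n = 3 has its peak at a longer duration, which is closer to how observed gene lengths behave.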

Biology of Splicing

(http://genes.mit.edu/chris/)

Consensus splice sites

(http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html)

Donor: 7.9 bits
Acceptor: 9.4 bits
(Stephens & Schneider, 1996)

Splice site detection

Donor site (5’ → 3’)

Position   -8   …   -2   -1    0    1    2   …   17
A          26   …   60    9    0    1   54   …   21
C          26   …   15    5    0    1    2   …   27
G          25   …   12   78   99    0   41   …   27
T          23   …   13    8    1   98    3   …   25

Splice Site Models

• WMM: weight matrix model = PSSM (Staden 1984)

• WAM: weight array model = 1st order Markov (Zhang & Marr 1993)

• MDD: maximal dependence decomposition (Burge & Karlin 1997): decision-tree-like algorithm that takes significant pairwise dependencies into account
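As an illustration of the simplest of these models, a WMM scores a candidate site by summing per-position log-odds. The sketch below uses the five contiguous columns (positions -2 to 2) of the donor-site frequency table from the splice site detection slide. The uniform background of 0.25, the pseudocount of 0.5 (needed because some table entries are 0), and the function name are assumptions.

```python
from math import log2

# Percent frequencies at donor-site positions -2, -1, 0, 1, 2.
DONOR_FREQ = {
    "A": [60, 9, 0, 1, 54],
    "C": [15, 5, 0, 1, 2],
    "G": [12, 78, 99, 0, 41],
    "T": [13, 8, 1, 98, 3],
}


def wmm_score(site, freq=DONOR_FREQ, background=0.25, pseudo=0.5):
    """Log-odds score in bits of a 5-mer under the weight matrix versus a
    uniform background; positions are treated as independent."""
    if len(site) != 5:
        raise ValueError("expected a 5-base site covering positions -2..2")
    score = 0.0
    for pos, base in enumerate(site):
        # Pseudocount keeps zero-frequency bases from giving log(0).
        prob = (freq[base][pos] + pseudo) / (100 + 4 * pseudo)
        score += log2(prob / background)
    return score


print(round(wmm_score("AGGTA"), 2))  # consensus-like donor site
print(round(wmm_score("CCCCC"), 2))  # unlikely donor site
```

A WAM would replace the per-position probabilities with probabilities conditioned on the preceding base, and MDD would additionally split the training sites on the most dependent positions.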

ggtgag

caggtg

cagatg

cagttg

caggccggtgag
