Download - Project Phase II Report
Project Phase II Report Due on 10/20, send me through email Write on top of Phase I report. 5-20 Pages Free style in writing (use 11pt font or larger) Methods
Overview (high-level description) Source of data Algorithm (pseudo code) Prove or argue why the algorithm will work Analyze the computational complexity of the algorithm Plan of implementation
Multiple Sequence Alignment
Dong Xu
Computer Science Department109 Engineering Building WestE-mail: [email protected]
573-882-7064http://digbio.missouri.edu
As we proceed …
Warning:Muddy Road
Ahead!!!
Outline
Background (what, why) Scoring function Dynamic programming Star alignment Progressive alignment Profile and Psi-Blast
Introduction
The multiple sequence alignment of a set of sequences may be viewed as an evolutionary history of the sequences.
No sequence ordering is required.
An Example of Multiple Alignment
VTISCTGSESNIGAG-NHVKWYQQLPGVTISCTGTESNIGS--ITVNWYQQLPGLRLSCSSSDFIFSS--YAMYWVRQAPGLSLTCTVSETSFDD--YYSTWVRQPPGPEVTCVVVDVSHEDPQVKFNWYVDG--ATLVCLISDFYPGA--VTVAWKADS--AALGCLVKDYFPEP--VTVSWNSG---VSLTCLVKEFYPSD--IAVEWWSNG--
Why Multiple Alignment (1)
Natural extension of Pairwise Sequence Alignment
“Pairwise alignment whispers… multiple alignment shouts out loud” Hubbard et al 1996
Much more sensitive in detecting sequence relationship and patterns
Why Multiple Alignment (2)
Give hints about the function and evolutionary history of a set of sequences
Foundation for phylogenic tree construction and protein family classification
Useful for protein structure prediction…
Outline
Background (what, why) Scoring function Dynamic programming Star alignment Progressive alignment Profile and Psi-Blast
Idea of Scoring
Requirement of a good quality of alignment measureAdditive function
Function must be independent of order of arguments
Should reward presence of many equal or strongly related symbols (in the same column) and penalize unrelated symbols and spaces.
Distance from Consensus
Consensus sequnece: single sequence which represents the most common amino acid/base in that positionD D G A V - E A LD G G - - - E A LE G G I L V E A LD - G I L V Q A VE G G A V V Q A LD G G A/I V/L V E A L
distance from consensus: total number of characters in the alignment that differs from the consensus character of their columns
Sum-of-pairs’ (SP) measure
The Sum of Pairs (SP) method is as follows:Given (1) a set of N aligned sequences each of length L in the form of a L x N MSA alignment matrix M and (2) a substitution matrix (PAM or BLOSUM) that gives a cost c(x,y) for aligning two characters x, y .
The SP score SP(mi) for the I-th column of M denoted by mi is calculated:
SP(mi) = SUM(j, k) [ c( mij , mi
k) ] The SP score for M is SUM(i) [SP(mi) ]
Feature of SP Measure
Theorem: Let alpha be a multiple alignment of the set of sequences s1, …, sk; and alpha(I,j) denote the pairwise alignment of si and sj as induced by alpha. Then
SP-score(alpha) = Sum over i,j [score(alpha(i,j)]
The above is only true if we have s(-,-) = 0.
This is because in pairwise alignment the presence of two aligned spaces (–) in the two sequences are ignored.
In the minimum entropy approach the score of multiple alignment is defined as the sum of entropies of the columns (the entropy of a column is related with the frequency of letter x in the column)
Information content Profile score
Profile Distance
Outline
Background (what, why) Scoring function Dynamic programming Star alignment Progressive alignment Profile and Psi-Blast
Dynamic Programming
Multiple sequence alignment can be done with dynamic programming, using (n+1)k memory cells, where n is the length of each of the k sequences to be aligned.
Sequence 1
Sequence 2
Sequence 3
Complexity Analysis of Dynamic Programming
Running time: (n+1)k number of entries in the table
For each entry we need to find the maximum of 2k -1 elements
Finding the SP-score corresponding to each element means adding O(k2) numbers
Total = O(k22knk) i.e., NP-hard
Outline
Background (what, why) Scoring function Dynamic programming Star alignment Progressive alignment Profile and Psi-Blast
Center Star Method
The center star method is an approximation algorithm, if scores can be decoupled into pairwise distances.
minimizing sum of pairwise distances. Works well when a distance satisfies
the triangle inequality: dst <= dsc + dct for all sequences s, t, and c.
Star Alignments
Select a sequence sc as the center of the star
For each sequence s1, …, sk such that index i c, perform a Needleman-Wunsch global alignment
Aggregate alignments with the principle “once a gap, always a gap.”
Star Alignments Example
s2
s1 s3
s4
s1: MPEs2: MKEs3: MSKEs4: SKE
MPE
| |
MKE
MSKE
- ||
MKE
SKE
||
MKE
MPEMKE
-MPE-MKEMSKE
-MPE-MKEMSKE-SKE
Choosing a Center
Try them all and pick the one with the best score
Calculate all O(k2) alignments, and pick the sequence sc that maximizes
ci
ci ssscore ),(
Performance
Therefore, if C is the center star method SP score, then C/2 is a lower bound to the SP score of any multiple alignment.
C is at most as bad as twice of the optimum, and this bound on how bad the result can be makes the center star method an approximation algorithm to the NP-hard problem.
Complexity Analysis
Assuming k sequences have length n
O(n2) to calculate 1 global alignment
O(k2) global alignments to calculate
O(k2n2) overall cost
Outline
Background (what, why) Scoring function Dynamic programming Star alignment Progressive alignment Profile and Psi-Blast
Progressive Alignment
Devised by Feng and Doolittle in 1987.
Essentially a heuristic method, not guaranteed to find the ‘optimal’ alignment.
Multiple alignment is achieved by successive application of pairwise methods.
Basic Algorithm
Compare all sequences pairwise. Perform cluster analysis on the pairwise
data to generate a hierarchy for alignment (guide tree).
Build alignment step by step according to the guide tree. Build the multiple alignment by first aligning the most similar pair of sequences, then add another sequence or another pairwise alignments.
Steps in Progressive Multiple Alignment
Compare pairwise sequences Perform cluster analysis on pairwise data
to generate hierarchy for alignment
Building Guided Tree
Using WeightHbb_Human -
Hbb_Horse .17 -Hba_Human .59 .60 -
Hba_Horse .59 .59 .13 -Myg_Whale .77 .77 .75 .75 -
Hbb_Human
Hbb_Horse
Hba_Horse
Hba_Human
Myg_Whale
2
1
3
4
Quick pairwise alignment: calculate distance matrix
Neighbor-joining tree (guide tree)
Alignment (1)
Build multiple alignments by first aligning most similar pair, then next similar pair etc.
Alignment (2)
Scoring
PQRRZWYQRKZX
YZTUOPTZZ_FO
Total Score = [w(R, U ) + w(R, ) + w(K, U) + w(K, ) ] / 4
Most successful implementation of progressive alignment (Des Higgins)
CLUSTAL - gives equal weight to all sequences
CLUSTALW - has the ability to give different weights to the sequences
CLUSTALX - provides a GUI to CLUSTAL
http://searchlauncher.bcm.tmc.edu/multi-align/multi-align.html
CLUSTAL
Problems with Progressive Alignments
Greedy Nature: No guarantee for global optimum
Depends on pairwise alignments If sequences are very distantly related,
much higher likelihood of errors Care must be made in choosing scoring
matrices and penalties Other approaches using Bayesian
methods such as hidden Markov models
Outline
Background (what, why) Scoring function Dynamic programming Star alignment Progressive alignment Profile and Psi-Blast
Concept of Profile
Seq1>-
Seq3>-Seq4>-
Seq2>-
Information about the degree of conservation of sequence positions is included
Position-specific Score Matrix (PSSM)
For protein of length L, scoring matrix is L x 20, PSSM(i,j) --“Profile”: specific scores for each of the 20 amino acids at each position in a sequence.
For highly conserved residues at a particular position, a high positive score is assigned, and others are assigned high negatives.
For a weakly conserved position, a value close to zero is assigned to all the amino acid types.
Building a Profile
First, get multiple sequence alignments using substitution matrix, Sjk.
Second, count the number of occurrences of amino acid k at position i, Cik.
(1) Average-score method: Wij = k CikSjk / N. (2) log-odds-ratio formula: Wij = log(qij/pj).
qij = Cij / N.
pj : background probability of residue j.
Calculating Profiles (1)
Gribskov et al, Proc. Natl. Acad. Sci. USA 84, 4355-4358, 1987
ACGCTAFKI
GCGCTAFKI
ACGCTAFKL
GCGCTGFKI
GCGCTLFKI
ASGCTAFKL
ACACTAFKL
C1A = 4, C1G = 3W1A = (4 SAA + 3 SAG) / 7 = (4 4 + 3 0) / 7 = 2.3
Wij = k CikSjk / N
Calculating profile (2)
Wij = log(qij/pj).qij = Cij / N.
pj : background probability of residue j.
For small N, formula qij = Cij / N is not good A large set of too closely related sequences carries little more information than a single member. Absence of Leu does not mean no Leu at this position when Ile is abundant! Pseudocount frequency, gij
Frequency Matrix
Effective frequency, fij
ij iij
q gf
Frequency matrix element, fij, is the
probability of amino acid j at position i.
j
gij = pj k qik exp ( Skj )pj: : background frequency
Frequency matrix, example
0.03 0.02 0.01 0.01 0.01 0.02 0.02 0.02 0.01 0.22 0.27 0.02 0.12 0.03 0.01 0.02 0.03 0.00 0.01 0.080.04 0.02 0.17 0.03 0.01 0.02 0.03 0.03 0.01 0.03 0.08 0.03 0.01 0.01 0.02 0.07 0.34 0.00 0.01 0.040.07 0.03 0.02 0.03 0.01 0.12 0.13 0.03 0.01 0.01 0.02 0.04 0.01 0.01 0.38 0.07 0.07 0.00 0.01 0.020.07 0.03 0.09 0.06 0.01 0.03 0.14 0.03 0.06 0.01 0.02 0.09 0.01 0.01 0.15 0.16 0.05 0.00 0.01 0.020.03 0.11 0.08 0.03 0.01 0.12 0.13 0.02 0.01 0.04 0.04 0.13 0.09 0.01 0.02 0.04 0.03 0.00 0.01 0.040.03 0.12 0.05 0.03 0.01 0.10 0.05 0.02 0.01 0.04 0.14 0.10 0.16 0.01 0.01 0.04 0.03 0.00 0.01 0.040.04 0.04 0.12 0.06 0.00 0.06 0.12 0.03 0.03 0.01 0.02 0.18 0.02 0.02 0.02 0.15 0.04 0.03 0.01 0.020.06 0.02 0.08 0.03 0.00 0.08 0.19 0.02 0.04 0.05 0.07 0.09 0.04 0.01 0.02 0.12 0.04 0.00 0.01 0.030.03 0.01 0.01 0.01 0.01 0.02 0.01 0.01 0.02 0.15 0.31 0.01 0.02 0.12 0.01 0.02 0.02 0.01 0.02 0.130.06 0.03 0.05 0.06 0.03 0.07 0.19 0.02 0.02 0.03 0.06 0.10 0.01 0.05 0.02 0.07 0.07 0.00 0.05 0.030.22 0.10 0.03 0.08 0.01 0.03 0.12 0.02 0.01 0.01 0.06 0.11 0.01 0.01 0.03 0.05 0.07 0.00 0.02 0.030.02 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.08 0.34 0.01 0.02 0.26 0.01 0.02 0.02 0.01 0.02 0.100.03 0.05 0.02 0.03 0.00 0.10 0.29 0.12 0.01 0.01 0.08 0.13 0.02 0.03 0.02 0.03 0.03 0.00 0.01 0.020.06 0.10 0.10 0.07 0.00 0.06 0.13 0.03 0.06 0.01 0.02 0.18 0.01 0.01 0.02 0.08 0.04 0.00 0.03 0.010.09 0.22 0.07 0.03 0.00 0.06 0.04 0.02 0.02 0.04 0.07 0.18 0.01 0.01 0.01 0.07 0.05 0.00 0.01 0.030.03 0.01 0.02 0.01 0.01 0.01 0.02 0.01 0.01 0.17 0.41 0.01 0.03 0.05 0.01 0.03 0.04 0.00 0.04 0.050.04 0.02 0.10 0.05 0.00 0.06 0.02 0.48 0.02 0.01 0.02 0.05 0.01 0.01 0.01 0.05 0.03 0.00 0.01 0.010.05 0.01 0.01 0.01 0.01 0.02 0.08 0.01 0.06 0.13 0.09 0.02 0.04 0.05 0.01 0.02 0.02 0.01 0.25 0.080.06 0.06 0.07 0.05 0.01 0.06 0.09 0.02 0.01 0.02 0.02 0.07 0.01 0.02 0.09 0.17 0.13 0.00 0.02 0.020.02 0.01 0.01 0.04 0.00 0.02 0.05 0.01 0.01 0.06 0.11 0.01 0.01 0.43 0.03 0.03 0.03 0.01 0.04 0.03
i (p
ositi
on) 1
,…
,L
j (amino acid type) 1,…,20
Profile Alignment (1)
ACD……VWY
sequence
profile
alignment of alignments• Sequence – Profile Alignment.• Profile – Profile Alignment.
Penalize gaps in conserved regions more heavily than gaps in more variable regions
Dynamic Programming. (same idea as in Pairwise Sequence Alignment)Optimal alignment in time O(a2l2)a = alphabet size, l = sequence length
Profile Alignment (2)
PsiBlast Psi (Position Specific Iterated) is an
automatic profile-like search The program first performs a gapped blast
search of the database. The information of the significant alignments is then used to construct a “position specific” score matrix. This matrix replaces the query sequence in the next round of database searching
The program may be iterated until no new significant are found
Search (Building Profile) with PSI-BLAST
Query sequence is “master” for multiple alignment. Profile length = query sequence length. Find family members with low E-value (e.g. <0.01). Exclude sequences with >98% identity. Build frequency matrix. Build PSSM. Build alignment using PSSM. Add new sequences. Iterate…
PSI-BLAST estimates the statistical significance of the local alignments found. Because profile substitution scores are constructed to a fixed scale, and gap scores remain independent of position, the statistical theory and parameters for gapped BLAST alignments remain applicable to profile alignments.
Much more sensitive than BLAST E-values can be misleading!
Significance of Psi Blast
Reading Assignments
Suggested reading: Chapter 4 in “Current Topics in
Computational Molecular Biology, edited by Tao Jiang, Ying Xu, and Michael Zhang. MIT Press. 2002.”
Optional reading: Chapter 7 in “Pavel Pevzner: Computational
Molecular Biology - An Algorithmic Approach. MIT Press, 2000.”
Develop a program that implement the Center Star algorithm
1. Modify your code in the first assignment for global alignment.
2. Use edit distance (match 1; otherwise 0) with gap penalty –1 – k (k is gap size).
Project Assignment