Lecture 8

Chapter 6: Multiple Sequence Alignment Methods

The major goal of computational sequence analysis is to predict the structure and function of genes and proteins from their sequence.

Biological Motivation

• Compare a new sequence with the sequences in a protein family. Proteins can be categorized into families: a protein family is a collection of homologous proteins with similar sequence, 3-D structure, function, and/or evolutionary history.
• Gain insight into evolutionary relationships. The number of mutations needed to go from an ancestor sequence to an extant sequence gives an estimate of how long ago the two sequences diverged.

The Global Alignment Problem

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

(Figure: sequences labeled x, y and z.)

Contents

• What a multiple alignment means
• Scoring a multiple alignment
  – Position-specific (minimum entropy) scores
  – Sum of pairs scores
• Multidimensional dynamic programming
• Progressive alignment methods
• Multiple alignment by profile HMM training

Multiple Sequence Alignment

• In chapter 5, we assumed that a reasonable multiple sequence alignment was already known and provided the starting point for constructing a profile HMM.

• We now look at what a “reasonable” multiple alignment is, and at ways to construct one automatically from unaligned sequences.

• An MSA must usually be inferred from primary sequences alone (the primary structure of a protein is the chain of amino acids, in a specific order, joined by peptide bonds).

MSA

Biological sequences are typically grouped into functional families. Biologists produce high-quality multiple sequence alignments by hand using expert knowledge. Important factors are:

• Specific sorts of columns in alignments, such as highly conserved residues or buried hydrophobic residues.
• The influence of the secondary structure (α-helices, β-strands, etc.) and tertiary structure, such as the alternation of hydrophobic and hydrophilic columns in exposed β-strands.
• Expected patterns of insertions and deletions, which tend to alternate with blocks of conserved sequence.
• Phylogenetic relationships between sequences, which dictate constraints on the changes that occur in columns and in the patterns of gaps.

MSA

• Manual multiple alignment is tedious.
• An automatic method must have a way to assign a score so that better multiple alignments get better scores.

Multiple Sequence Alignment: Why?

• Identify highly conserved residues
  – Likely to be essential sites for structure/function
  – More precision from multiple sequences
  – Better structure/function prediction, pairwise alignments
• Building gene/protein families
  – Use conserved regions to guide search
• Basis for phylogenetic analysis
  – Infer evolutionary relationships between genes

Multiple Sequence Alignment: Why? (cont'd)

• Remember: the goal of biological sequence comparison is to discover functional (or structural) similarities.

• Unfortunately, if the sequence similarity is weak, pairwise alignment can fail to identify biologically related sequences (because weak pairwise similarities may fail the statistical test for significance). Indeed, similar proteins may not exhibit a strong sequence similarity.

• The good news is that simultaneous comparison of many sequences often allows one to find similarities that are invisible in pairwise sequence comparison.

• [Hubbard et al., 1996]: “Pairwise alignment whispers… multiple alignment shouts out loud.”


What a multiple alignment means

• In a MSA, homologous residues among a set of sequences are aligned together in columns

• Homologous is meant in both the structural and evolutionary sense

• Ideally, the residues in an aligned column occupy similar 3-D structural positions and all diverge from a common ancestral residue.

Figure 6.1

• A manually generated multiple alignment of 10 immunoglobulin superfamily sequences (a group of proteins that all contain part of the immunoglobulin fold is collectively called the immunoglobulin superfamily).
• A crystal structure of one of the sequences (1tlk, telokin) is known.

Figure 6.1 (cont'd)

• At the top: β-strands (a-g). At the bottom: identical residues (letters) or highly conservative residues (+).
• The conserved regions include 8 β-strands and certain key residues, such as two completely conserved cysteines (C) in the b and f strands.
• The other 9 sequences have been manually aligned to 1tlk based on this expert structural knowledge.

Issues

• Automatic multiple sequence alignment methods are a topic of extensive research in bioinformatics.
• Except for trivial cases of highly identical sequences, it is not possible to unambiguously identify structurally or evolutionarily homologous positions and create a single “correct” multiple alignment.
• Since protein structures also evolve, we do not expect two protein structures with different sequences to be entirely superposable.
• Very similar sequences will generally be aligned unambiguously (a simple program can get the alignment right).
• For cases of interest (e.g. a family of proteins with only 30% average pairwise sequence identity), there is no objective way to define an unambiguously correct alignment.
• Once again, in general, an automatic method must assign a score so that better multiple alignments get better scores.

Issues (cont'd)

• For cases of interest (e.g. a family of proteins with only 30% average pairwise sequence identity), there is no objective way to define an unambiguously correct alignment.

• The globin family, often used as a “typical” protein family in computational work, is in fact exceptional: almost the entire structure is conserved among divergent sequences

• The choice of the sequences: sequences sharing a common ancestor (homologous sequences)
  – PSI-BLAST, FASTA, various search tools
• The choice of an objective function: a biological problem that lies in the definition of correctness
  – Sum of pairs, entropy score, consistency-based, …
• The optimization of that function
  – Exact algorithms (dynamic programming)
  – Progressive alignment (ClustalW)
  – Iterative approaches (SA, GA, …)

Problem Statement

What are the conserved regions among a set of sequences over the same alphabet?

12345678   Position index
EMQPILLL   Sequence 1
DMLR-LL-   Sequence 2
NMK-ILLL   Sequence 3
DMPPVLIL   Sequence 4
DM   LL    Consensus sequence

Scoring a Multiple Alignment

The scoring system should take into account that:

• Some positions are more conserved than others, e.g. position-specific scoring.
• The sequences are not independent, but instead related by a phylogenetic tree.

Complex Scoring

• Specify a complete probabilistic model of molecular sequence evolution.
• Given the correct phylogenetic tree for the sequences to be aligned, the probability of a multiple alignment is the product of the probabilities of all the evolutionary events necessary to produce that alignment via ancestral intermediate sequences, times the prior probability of the root ancestral sequence.

Complex Scoring (cont'd)

• The probabilities of evolutionary change would depend on the evolutionary times along each branch of the tree, as well as on position-specific structural and functional constraints imposed by natural selection, so that the key residues and structural elements would be conserved.

• High-probability alignments would then be good structural and evolutionary alignments under this model.

• Unfortunately, we do not have enough data to parametrise such a complex evolutionary model.


Simplifying Assumptions

• Partly or (as we do in this chapter) entirely ignore the phylogenetic tree

• Consider that individual columns of an alignment are statistically independent


Defining a scoring function for multiple alignment

• Almost all multiple alignment methods assume that the individual columns are statistically independent.

• However, most multiple alignment methods use affine gap scoring functions, so successive gap residues are not treated independently.

• For simplicity, here we will focus on definitions of $S(m_i)$ for scoring a column of aligned residues with no gaps, which leads to

$S(m) = \sum_i S(m_i)$

where $m$ is the multiple alignment and the $m_i$ are its columns.

Sum of Pairs (SP) Scores

This is the standard method for scoring multiple alignments.

• Assumes the statistical independence of columns.
• Columns are scored by a “sum of pairs” (SP) function. The SP score for a column is defined as:

$S(m_i) = \sum_{k<l} s(m_i^k, m_i^l)$

where the scores $s(a,b)$ come from a substitution matrix such as PAM or BLOSUM.

Sum of Pairs (SP) Scores

Drawbacks:

• There is no probabilistic justification of the SP score.
• Each sequence is scored as if it descended from the N-1 other sequences instead of from a single ancestor. Evolutionary events are over-counted, a problem which grows as the number of sequences increases. Altschul, Carroll & Lipman [1989] proposed a weighting scheme designed to partially compensate for this defect in SP scores.

Scoring Function: Sum of Pairs

Definition: Induced pairwise alignment

A pairwise alignment induced by the multiple alignment.

Example:

x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG

Induces:

x: ACGCGG-C      x: AC-GCGG-C      y: AC-GCGAG
y: ACGC-GAG      z: GCCGC-GAG      z: GCCGCGAG

Example: Sum of pairs score

Pairwise sequence alignment:

Seq A: ARGTCAGATACGLAG---PGMCTETWV
Seq B: ARATCGGAT---IAGTIYPGMCTHTWV

Substitution scores are represented in matrices. The popular ones are PAM and BLOSUM.

$S(A_L, B_L) = \sum_{i=1}^{L} s(a_i, b_i)$

Similarity Measurement: SP-score

The Sum of Pairs (SP) score is the similarity score among the amino acids (or bases) at a particular position of a multiple sequence alignment.

A gap-gap alignment has zero similarity and distance score: s(-,-) = 0.

$S(M) = \sum_{i<j} s(m_i, m_j)$

where M is the collection of amino acids at one position of the alignment and $m_i$ is the symbol contributed by sequence i.

S(P,R,-,P) = s(P,R) + s(P,-) + s(P,P) + s(R,-) + s(R,P) + s(-,P)
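The column score above can be sketched in a few lines of Python. This is only an illustration: `toy_score` and its match, mismatch and gap values are hypothetical stand-ins for a real PAM or BLOSUM lookup, and only the rule s(-,-) = 0 is taken from the slide.

```python
from itertools import combinations

def toy_score(a, b, match=2, mismatch=-1, gap=-2):
    """Hypothetical pair score; a real application would look up PAM/BLOSUM."""
    if a == "-" and b == "-":
        return 0              # s(-,-) = 0, as stated on the slide
    if a == "-" or b == "-":
        return gap            # residue aligned to a gap (assumed value)
    return match if a == b else mismatch

def sp_column_score(column, s=toy_score):
    """Sum s(m_i, m_j) over all unordered pairs i < j of one column."""
    return sum(s(a, b) for a, b in combinations(column, 2))

# The column (P, R, -, P) from the slide expands into six pair terms.
print(sp_column_score("PR-P"))
```

Swapping `toy_score` for a function backed by a real substitution matrix changes the numbers but not the structure of the sum.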

Multiple alignment:

1 PEAALFGKFT---IKSDVW
2 AESALYGRFT---IKSDVW
3 PDTAIWGKF---SIKSETW
4 PEVIRMGDDNPFSFQSDVW

Use only sequences 2 and 3:

2 AESALYGRFT---IKSDVW
3 PDTAIWGKF---SIKSETW

Remove positions which contain only gaps (which produces an induced pairwise alignment, or a projection of the multiple alignment in 2 dimensions):

2 AESALYGRFT-IKSDVW
3 PDTAIWGKF-SIKSETW
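The projection step just described (keep two rows, drop columns where both rows have a gap) is small enough to write out. The function name `induced_pairwise` is an illustrative choice, not something defined in the lecture.

```python
def induced_pairwise(row_a, row_b, gap="-"):
    """Project two rows of a multiple alignment onto a pairwise alignment
    by dropping every column in which both rows contain a gap."""
    kept = [(a, b) for a, b in zip(row_a, row_b)
            if not (a == gap and b == gap)]
    return ("".join(a for a, _ in kept),
            "".join(b for _, b in kept))

# Rows 2 and 3 of the multiple alignment shown on the slide.
seq2 = "AESALYGRFT---IKSDVW"
seq3 = "PDTAIWGKF---SIKSETW"
for row in induced_pairwise(seq2, seq3):
    print(row)
```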

12345678   Position index
EMQPILLL   Sequence 1
DMLR-LL-   Sequence 2
NMK-ILLL   Sequence 3
DMPPVLIL   Sequence 4

Given a multiple sequence alignment scored with Sum of Pairs (SP-score), we may compute the score of each position of the alignment and then add all the position scores to get the total score of the whole alignment. Or, we may compute the score for each induced pairwise alignment and add these scores.

If we have N sequences, the number of pairs is N(N-1)/2.
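A short check that the two routes to the total SP score agree, using the four-sequence alignment from the slide. The pair score `toy_score` is a hypothetical placeholder with made-up values. The equality depends on s(-,-) = 0 and on gaps being scored per column (linear gap cost), which is an assumption of this sketch.

```python
from itertools import combinations

def toy_score(a, b, match=2, mismatch=-1, gap=-2):
    """Hypothetical pair score standing in for a PAM/BLOSUM lookup."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def sp_by_columns(msa, s=toy_score):
    """Score every column with the SP function, then add the column scores."""
    return sum(s(a, b)
               for column in zip(*msa)
               for a, b in combinations(column, 2))

def sp_by_pairs(msa, s=toy_score):
    """Score each pair of rows position by position, then add the pair scores.
    Gap-gap columns contribute 0, so they need not be removed first."""
    return sum(s(a, b)
               for row_k, row_l in combinations(msa, 2)
               for a, b in zip(row_k, row_l))

msa = ["EMQPILLL", "DMLR-LL-", "NMK-ILLL", "DMPPVLIL"]
n_pairs = len(msa) * (len(msa) - 1) // 2   # N(N-1)/2 sequence pairs
print(n_pairs, sp_by_columns(msa), sp_by_pairs(msa))
```

Both functions add up exactly the same pair terms, only in a different order.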



3-D Dynamic Programming Hyper-lattice (similarity matrix)

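Example: Sum of pairs score (cont'd)

Multiple sequence alignment:

Seq A1: ARGTCAGATACGLAG---PGMCTETWV----
Seq A2: ARATCGGAT---IAGTIYPGMCTHTWVIAGQ
Seq A3: ARATCE--TACG--GTI-PGMCTHTWVIA--

$S(A_n) = \sum_{k=1}^{L} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} s(A_{i,k}, A_{j,k}) + \sum_{i=1}^{n} G(A_i)$

where $A_{i,k}$ is the symbol of sequence $A_i$ in column k and $G(A_i)$ is the gap term for sequence $A_i$.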

A problem with SP scores: Example

• Consider an alignment of N sequences which all have leucine (L) at a certain position. The score of an L aligned to L is 5 (BLOSUM50), so the score of the column is 5 × N(N-1)/2, where N(N-1)/2 is the number of symbol pairs in the column.

• If there were one glycine (G) in the column and N-1 Ls, the score would be 9 × (N-1) less, because a G-L pair scores -4 (9 less than an L-L pair) and N-1 pairs are affected.

• So the SP score for a column with one G is worse than the score for a column of all Ls by a fraction of

$\frac{9(N-1)}{5N(N-1)/2} = \frac{18}{5N}$

• Notice the inverse dependence on N: the relative difference in score between the correct alignment and the incorrect alignment decreases with the number of sequences in the alignment. This is counter-intuitive, because the relative difference ought to increase as we gather more evidence for a conserved leucine.
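The arithmetic of this example can be checked numerically. The sketch below uses the two scores quoted on the slide (5 for L-L, -4 for G-L); the function name `relative_drop` is illustrative.

```python
def relative_drop(n, s_ll=5, s_gl=-4):
    """Fraction by which the SP column score falls when one of n leucines
    is replaced by a glycine."""
    all_l = s_ll * n * (n - 1) / 2               # n(n-1)/2 L-L pairs
    one_g = all_l - (s_ll - s_gl) * (n - 1)      # n-1 pairs lose 9 points each
    return (all_l - one_g) / all_l

for n in (3, 10, 100):
    # The second number is the closed form 18/(5N) from the slide.
    print(n, relative_drop(n), 18 / (5 * n))
```

The printed pairs agree, and the fraction shrinks as N grows, which is the counter-intuitive behaviour the slide points out.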

Position Specific Scores

• $m$ is a multiple alignment; $m_i$ is the column of aligned symbols $m_i^1, \ldots, m_i^N$ in column i; $m_i^j$ is the symbol in column i for sequence j.
• $c_{ia} = \sum_j \delta(m_i^j = a)$ is the observed count for residue a in column i, where $\delta(m_i^j = a)$ is 1 if $m_i^j = a$ and 0 otherwise.

Position Specific Scores

• $c_i = (c_{i1}, \ldots, c_{iK})$ is the count vector of observed symbols in column i for an alphabet of K different residues.

Minimum Entropy Scores

• We assume that residues within a column are independent, as well as between columns.

• The probability of a column $m_i$ is:

$P(m_i) = \prod_a p_{ia}^{c_{ia}}$

where $p_{ia}$ is the probability of residue a in column i.

Minimum Entropy Scores

• We define a column score as

$S(m_i) = -\log P(m_i) = -\sum_a c_{ia} \log p_{ia}$

The column score is an entropy measure. A completely conserved column would score 0.

• The maximum likelihood estimate for the parameter $p_{ia}$ is

$p_{ia} = \frac{c_{ia}}{\sum_{a'} c_{ia'}}$
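A minimal sketch of the minimum entropy column score, plugging in the maximum likelihood estimate for the residue probabilities. It assumes a column without gaps, as in the definitions above; natural logarithms and the function name `min_entropy_score` are choices made here, not fixed by the lecture.

```python
import math
from collections import Counter

def min_entropy_score(column):
    """S(m_i) = -sum_a c_ia * log(p_ia) with p_ia = c_ia / sum_a' c_ia'.
    Written as c * log(total / c) so each term is non-negative."""
    counts = Counter(column)              # c_ia for every residue a observed
    total = sum(counts.values())
    return sum(c * math.log(total / c) for c in counts.values())

for column in ("LLLL", "LLLG", "ACGT"):
    print(column, min_entropy_score(column))
```

A fully conserved column scores 0 and more variable columns score higher, so a good alignment is one that minimises the total.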

Simultaneous multiple sequence alignment by multidimensional dynamic programming

Assumptions:

- the columns of an alignment are statistically independent
- gaps are scored with a linear gap cost $\gamma(g) = gd$ for a gap of length g and some gap cost d

Note: Extension to affine gap costs is possible, but the formalism becomes tedious.

Therefore the overall score for an alignment can be computed as the sum of the scores for each column i:

$S(m) = \sum_i S(m_i)$

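Note

Using the notation $\Delta_i \cdot x = x$ if $\Delta_i = 1$ and $\Delta_i \cdot x = -$ (a gap) if $\Delta_i = 0$, the recursion relation becomes:

$\alpha_{i_1, \ldots, i_N} = \max_{\Delta_1 + \cdots + \Delta_N > 0} \left\{ \alpha_{i_1 - \Delta_1, \ldots, i_N - \Delta_N} + S(\Delta_1 \cdot x^1_{i_1}, \ldots, \Delta_N \cdot x^N_{i_N}) \right\}$

where $\alpha_{i_1, \ldots, i_N}$ is the maximum score of an alignment of the prefixes ending at $x^1_{i_1}, \ldots, x^N_{i_N}$.

Complexity

In general, if we assume that the sequences are roughly the same length $L$, the memory complexity of the (naive) dynamic programming algorithm for multiple sequence alignment is $O(L^N)$ and the time complexity is $O(2^N L^N)$.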
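The recursion can be sketched directly for small inputs. The code below is a naive illustration, not an efficient implementation: it stores one score per lattice cell (the $O(L^N)$ memory) and tries all $2^N - 1$ non-zero move vectors Δ per cell. Columns are scored with an SP function built from a hypothetical `toy_score`, with the linear gap cost folded into the pair score.

```python
from itertools import combinations, product

def toy_score(a, b, match=1, mismatch=-1, gap=-1):
    """Hypothetical pair score; s(-,-) = 0 and a linear gap cost are assumed."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def msa_dp_score(seqs, s=toy_score):
    """Optimal SP score of a multiple alignment by full N-dimensional DP."""
    n = len(seqs)
    lengths = [len(x) for x in seqs]
    # Every Delta in {0,1}^N except the all-zero vector.
    moves = [d for d in product((0, 1), repeat=n) if any(d)]
    alpha = {(0,) * n: 0}
    # Lexicographic order guarantees predecessors are filled in first.
    for idx in product(*(range(length + 1) for length in lengths)):
        if not any(idx):
            continue
        best = None
        for d in moves:
            prev = tuple(i - di for i, di in zip(idx, d))
            if min(prev) < 0:
                continue                  # move would leave the lattice
            # Delta_i * x: the residue if Delta_i = 1, a gap otherwise.
            column = [x[i - 1] if di else "-"
                      for x, i, di in zip(seqs, idx, d)]
            cand = alpha[prev] + sum(s(a, b)
                                     for a, b in combinations(column, 2))
            if best is None or cand > best:
                best = cand
        alpha[idx] = best
    return alpha[tuple(lengths)]

print(msa_dp_score(["GAT", "GT", "GAT"]))
```

Only the score is returned here; recovering the alignment itself would need traceback pointers, omitted to keep the sketch short.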

Carrillo-Lipman Algorithm (1988)

Implementation: MSA by Lipman, Altschul & Kececioglu (1989)

• This algorithm reduces the volume of the multidimensional dynamic programming matrix.

• MSA can optimally align up to five to seven protein sequences of reasonable length (200-300 residues).

• Assumption: the score of a multiple alignment is the sum of the scores of all pairwise alignments defined by the multiple alignment.

• The score of a complete alignment a is defined as

$S(a) = \sum_{k<l} S(a^{kl})$

where $a^{kl}$ denotes the pairwise alignment between sequences k and l.

• Let $\hat{a}^{kl}$ be the optimal pairwise alignment of k, l. Obviously, $S(a^{kl}) \le S(\hat{a}^{kl})$.

• Assume that we have a lower bound $\sigma(a)$ on $S(a)$, the score of the optimal multiple alignment a, i.e. $\sigma(a) \le S(a)$.

(We can obtain a good bound $\sigma(a)$ by any fast heuristic multiple alignment algorithm, for instance the progressive alignment algorithms introduced later.)

• Due to the sum of pairs (SP) score definition, we have:

$S(a) = \sum_{k'<l'} S(a^{k'l'}) \le S(a^{kl}) - S(\hat{a}^{kl}) + \sum_{k'<l'} S(\hat{a}^{k'l'})$

and thus

$\sigma(a) \le S(a^{kl}) - S(\hat{a}^{kl}) + \sum_{k'<l'} S(\hat{a}^{k'l'})$

• Therefore we can set a lower bound $\beta^{kl}$ on $S(a^{kl})$:

$S(a^{kl}) \ge \beta^{kl}$, where $\beta^{kl} = \sigma(a) + S(\hat{a}^{kl}) - \sum_{k'<l'} S(\hat{a}^{k'l'})$

• The N(N-1)/2 optimal pairwise alignments $\hat{a}^{kl}$ are each calculated and scored by standard pairwise alignment.

• The higher the bounds $\beta^{kl}$ are, the smaller the volume of the multidimensional dynamic programming matrix that must be calculated.

• For each pair k, l we can find the complete set $B^{kl}$ of coordinate pairs $(i_k, i_l)$ such that the best alignment of $x^k$ to $x^l$ through $(i_k, i_l)$ scores more than $\beta^{kl}$. This set is calculated in $O(L^2)$ time by adding the forward and backward Viterbi scores for each cell of the complete pairwise dynamic programming table.

• The costly multidimensional dynamic programming algorithm can then be restricted to evaluate only cells in the intersection of these sets, i.e. cells $(i_1, i_2, \ldots, i_N)$ for which $(i_k, i_l)$ is in $B^{kl}$ for all k, l.
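The pairwise sets can be sketched with two ordinary global alignment tables. Several things here are assumptions of the sketch rather than details from the lecture: the scoring (`toy_score`, linear gap cost `d`), treating a "cell" as a lattice point (i, j) that the alignment path passes through, and using `>=` so that a tight bound still keeps the optimal path. The backward table is obtained by running the forward recursion on the reversed sequences.

```python
def toy_score(a, b):
    """Hypothetical substitution score: +1 match, -1 mismatch."""
    return 1 if a == b else -1

def forward(x, y, s, d):
    """F[i][j] = best global score of x[:i] against y[:j], linear gap cost d."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -i * d
    for j in range(1, m + 1):
        F[0][j] = -j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
    return F

def cells_within_bound(x, y, beta, s=toy_score, d=1):
    """Lattice points (i, j) whose best alignment through them scores >= beta."""
    n, m = len(x), len(y)
    F = forward(x, y, s, d)
    # R[n-i][m-j] is the best score of the suffixes x[i:] and y[j:].
    R = forward(x[::-1], y[::-1], s, d)
    return {(i, j)
            for i in range(n + 1) for j in range(m + 1)
            if F[i][j] + R[n - i][m - j] >= beta}

print(sorted(cells_within_bound("GAT", "GT", beta=1)))
```

With the bound equal to the optimal pairwise score only the optimal path survives; lowering `beta` admits more cells, which mirrors the remark that higher bounds shrink the volume to be searched.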

Progressive Multiple Alignment Methods

These (greedy) methods are the most commonly used approach to multiple sequence alignment. The general idea:

• Most progressive alignment algorithms build a “guide tree”, a binary tree whose leaves represent sequences and whose interior nodes represent alignments.

(The methods for constructing guide trees can be “quick and dirty” versions of those for phylogenetic trees.)

Progressive Multiple Alignment Methods

• Main heuristic: first align the most similar pairs of sequences, using a pairwise alignment method. Then walk up the tree and compute at each interior node the alignment of (alignments of) sequences associated with the direct descendants of that node.

• The root node will represent a complete multiple alignment of the input sequences.

Note: Progressive alignment methods use no global scoring function of alignment correctness.

Feng-Doolittle Progressive Multiple Alignment (1987): Specific Points (I)

• The guide tree is constructed using the clustering algorithm of Fitch & Margoliash (1967), starting from a distance matrix obtained by converting pairwise alignment scores to (approximate) pairwise distances:

$D = -\log S_{eff} = -\log \frac{S_{obs} - S_{rand}}{S_{max} - S_{rand}}$

where $S_{obs}$ is the observed pairwise alignment score; $S_{max}$ is the maximum score, the average of the scores of aligning either sequence to itself; and $S_{rand}$ is the expected score for aligning random shufflings of the two sequences (or an approximate calculation given in [Feng & Doolittle, 1996]).


Note: The effective score can be viewed as a normalized percentage similarity; it is expected to decay exponentially towards 0 with increasing evolutionary distance, hence the –log to make the measure more approximately linear with evolutionary distance.
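The score-to-distance conversion is a one-liner once the three scores are available. The sketch below takes them as plain numbers; how $S_{rand}$ is estimated (shuffling or the approximate formula) is left outside the function, and the error handling for a non-positive effective score is a choice made here.

```python
import math

def feng_doolittle_distance(s_obs, s_max, s_rand):
    """D = -log S_eff with S_eff = (S_obs - S_rand) / (S_max - S_rand)."""
    s_eff = (s_obs - s_rand) / (s_max - s_rand)
    if s_eff <= 0:
        # The observed score is no better than random: distance is undefined.
        raise ValueError("effective score must be positive")
    return -math.log(s_eff)

# Illustrative numbers only: identical sequences give S_eff = 1 and D = 0.
print(feng_doolittle_distance(s_obs=110, s_max=110, s_rand=10))
print(feng_doolittle_distance(s_obs=60, s_max=110, s_rand=10))
```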

Feng-Doolittle Algorithm’s Specific Points (II)

• Sequence to group alignment: a sequence is added to an existing group by aligning it pairwise to each sequence in the group in turn. The highest scoring pairwise alignment determines how the sequence will be aligned to the group.

• Group to group alignment: all sequence pairs between the two groups are tried; the best pairwise sequence alignment determines the alignment of the two groups.

Feng-Doolittle Algorithm’s Specific Points (II, cont'd)

• After an alignment is completed, gap symbols are replaced with a neutral X character. The cost of aligning an X with anything (including a gap symbol) is 0, which gives a desirable effect (“once a gap, always a gap”): gaps tend to occur in the same columns in subsequent pairwise alignments.

Note: The X rewriting is not needed in the profile-based progressive alignment algorithms (introduced later).

Profile-based progressive multiple alignment: aligning MAs using SP scoring with linear gaps

The gap scores can be included in the SP score by setting $s(-, a) = s(a, -) = -g$ and $s(-, -) = 0$.

Here an alignment of two MAs is done so that gaps are inserted in whole columns, so the alignment within each of the two MAs is not changed.

Assuming that we have two MAs, one containing sequences 1 to n and the other containing sequences n+1 to N, the global alignment score is:

$\sum_i S(m_i) = \sum_i \sum_{k<l \le N} s(m_i^k, m_i^l)$

$= \sum_i \sum_{k<l \le n} s(m_i^k, m_i^l) + \sum_i \sum_{n<k<l \le N} s(m_i^k, m_i^l) + \sum_i \sum_{k \le n,\; n<l \le N} s(m_i^k, m_i^l)$


Aligning MAs using SP scoring with linear gaps (cont’d)

• Note that the first two sums are unaffected by the global alignment, since adding columns of gap characters to a profile adds 0 score (s(-,-)=0).

• Therefore the optimal alignment of the two profiles can be obtained by only optimising the last sum with the cross terms. This can be done exactly like standard pairwise alignment, where columns are scored against columns by adding pair scores.

• One of the profiles can consist of a single sequence only, which corresponds to aligning a single sequence to a profile.
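The cross-term optimisation can be sketched as a standard global alignment in which each "symbol" is a whole column. This is an illustration under assumptions: `pair_score` is a hypothetical scoring with s(a,-) = -g and s(-,-) = 0, integer scores are used so the traceback comparisons are exact, and ties are broken in a fixed order (diagonal first).

```python
def pair_score(a, b, g=1):
    """Hypothetical pair score with s(a,-) = -g and s(-,-) = 0."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -g
    return 1 if a == b else -1

def cross_score(col_a, col_b, s=pair_score):
    """Cross terms only: every symbol of one column against every symbol of the other."""
    return sum(s(a, b) for a in col_a for b in col_b)

def align_profiles(ma_a, ma_b, s=pair_score):
    """Align two multiple alignments, inserting gaps only as whole columns."""
    cols_a, cols_b = list(zip(*ma_a)), list(zip(*ma_b))
    gap_a, gap_b = ("-",) * len(ma_a), ("-",) * len(ma_b)
    n, m = len(cols_a), len(cols_b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = F[i - 1][0] + cross_score(cols_a[i - 1], gap_b, s)
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] + cross_score(gap_a, cols_b[j - 1], s)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + cross_score(cols_a[i - 1], cols_b[j - 1], s),
                F[i - 1][j] + cross_score(cols_a[i - 1], gap_b, s),
                F[i][j - 1] + cross_score(gap_a, cols_b[j - 1], s))
    # Traceback, rebuilding merged columns from the end.
    merged, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and F[i][j] ==
                F[i - 1][j - 1] + cross_score(cols_a[i - 1], cols_b[j - 1], s)):
            merged.append(cols_a[i - 1] + cols_b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + cross_score(cols_a[i - 1], gap_b, s):
            merged.append(cols_a[i - 1] + gap_b); i -= 1
        else:
            merged.append(gap_a + cols_b[j - 1]); j -= 1
    merged.reverse()
    return ["".join(row) for row in zip(*merged)], F[n][m]

print(align_profiles(["GAT", "GAT"], ["GT"]))
```

Passing a one-row alignment for either argument gives the sequence-to-profile case mentioned in the last bullet. The returned score covers only the cross terms, since the within-profile sums do not change.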

Remark

Once an aligned group has been built up, it is advantageous to use position-specific information from the group’s multiple alignment to align a new sequence to it.

• The degree of sequence conservation at each position should be taken into account, and mismatches at highly conserved positions penalized more stringently than mismatches at variable positions.

• Gap penalties might be reduced where lots of gaps occur in the cluster alignment, and increased where no gaps occur.

Profile-based Progressive Alignment: The CLUSTALW algorithm

• Construct a distance matrix of all N(N-1)/2 pairs by pairwise dynamic programming alignment, followed by approximate conversion of similarity scores to evolutionary distances using the model of Kimura [1983].
• Construct a guide tree by using the Neighbor-Joining clustering algorithm [Saitou & Nei, 1987].
• Progressively align at nodes in order of decreasing similarity, using sequence-sequence, sequence-profile, and profile-profile alignment.

Additional Heuristics Contributing to CLUSTALW’s Accuracy

• In order to compensate for biased representation in large subfamilies, individual sequences are weighted according to the branch length in the Neighbor-Joining tree.
• The substitution matrix is chosen on the basis of the similarity expected of the alignment, e.g. BLOSUM80 for closely related sequences and BLOSUM50 for distant sequences.
• Position-specific gap-open profile penalties are multiplied by a modifier that is a function of the residues observed at the position.

Additional Heuristics Contributing to CLUSTALW’s Accuracy (cont'd)

• Both gap-open and gap-extend penalties are increased if there are no gaps in a column but gaps occur nearby in the alignment. This rule tries to force all the gaps to occur in the same places in an alignment.
• In the progressive alignment stage, if the score of an alignment is low, the guide tree may be adjusted on the fly to defer the low-scoring alignment until later in the progressive alignment phase, when more profile information has been accumulated.

Iterative Refinement Methods for Multiple Sequence Alignment

A problem with progressive alignment methods: the subalignments are ‘frozen’, i.e. once a group of sequences has been aligned, their alignment to each other cannot be changed at a later stage.

Example: align (x, y), (z, w), then (xy, zw).

x: GAAGTT
y: GAC-TT   (frozen!)
z: GAACTG
w: GTACTG

Once z and w are included, the alignment y = GA-CTT is clearly the correct one.

Basic idea for iterative refinement MA methods

• An initial alignment is generated.
• Then one sequence (or a set of sequences) is taken out and realigned to a profile of the remaining aligned sequences. If a meaningful(!) score is being optimized, this either increases the overall score or results in the same score.
• Another sequence is chosen and realigned, and so on, until the alignment does not change.
• The procedure is guaranteed to converge to a local maximum of the score provided that all the sequences are tried and a maximum score exists, simply because the space of possible alignments is finite.

Barton-Sternberg algorithm (1987)

• Find the two sequences with the highest pairwise similarity and align them using standard pairwise dynamic programming alignment.
• Find the sequence that is most similar to a profile of the alignment of the first two, and align it to them by profile-sequence alignment. Repeat until all sequences have been included in the MA.
• Remove sequence $x^1$ and realign it to a profile of the other aligned sequences $x^2, \ldots, x^N$ by profile-sequence alignment. Repeat for sequences $x^2, \ldots, x^N$.
• Repeat the previous realignment step a fixed number of times, or until the alignment score converges.
