sequence alignment


Sequence Alignment
• Why:
  – To match a new sequence to others with known functions
  – To search for ESTs and other signs of gene expression
  – To understand population dynamics and evolutionary relationships between genes and species
  – To find important regions within proteins

• Issues:
  – Alignment should mimic evolutionary descent: the actual history of mutation and selection that led to this gene
    • But it is too complicated to get perfectly correct
    • Protein alignments work over larger evolutionary distances than nucleotide alignments
  – How to treat substitutions, insertions and deletions (gaps)
  – How to score possible alignments
    • Global vs. local alignment
  – Multiple alignment (as an extension of pairwise alignment)
    • Hidden Markov Models and other ways of abstracting multiple alignment information

• Homology: related by evolutionary descent, as opposed to similarity, which is not necessarily based on descent from a common ancestor
  – But in practice, long aligned sequences seem to arise only by evolution
  – Short alignments can be due to chance or convergent evolution.

Example Alignments
• THISSEQUENCE vs. THATSEQUENCE
  – Same length, just 2 mismatches
• THISISASEQUENCE vs. THATSEQUENCE
  – Length is different, need to introduce gaps to maximize identities.

Scoring by Identity
• One simple way to score an alignment is by counting the number of perfect matches.
  – Get percentage of identities by dividing number of matches by total positions (including gap positions). This is a measure of relatedness between 2 proteins.
  – For previous example, 11 matches with 16 positions = 68.75% (69%) identities

• Length matters: it is harder to get a high percentage of identities in a long sequence than in a short one.

• Problem of random matches. If all residues are equally frequent, about 25% (1 in 4) of all positions in random nucleotide sequences match, and about 5% (1 in 20) for proteins.
  – General rule, based on proteins with known structural similarity:
    • Two proteins are probably structurally similar (and thus probably homologous) if they have 30% or more identical amino acids over their whole length when aligned.
    • Less than 20% amino acid identity means probably not homologous
    • Between 20% and 30% is a gray zone
    • My personal happiness with matches increases when it’s above 35%
    • Except for very unusual proteins, 100% identity doesn’t occur between homologous proteins in different species
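The identity score described on this slide can be sketched in a few lines of Python. This is only an illustration: the function names are made up here, and the gapped alignment of THISISASEQUENCE vs. THATSEQUENCE is one possible placement of gaps that is consistent with the stated 11 matches in 16 positions, not necessarily the one drawn on the original slide.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of alignment columns holding the same residue in both rows.

    Gap columns count toward the total number of positions but never
    count as matches, as described on the slide.
    """
    if len(aligned_a) != len(aligned_b) or not aligned_a:
        raise ValueError("aligned sequences must be non-empty and equal in length")
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap)
    return 100.0 * matches / len(aligned_a)


def random_match_rate(alphabet_size):
    """Chance that two random residues match, assuming equal frequencies."""
    return 1.0 / alphabet_size


# Hypothetical gap placement for the second example (assumption, see lead-in).
print(percent_identity("THISISA-SEQUENCE", "TH----ATSEQUENCE"))
print(random_match_rate(4), random_match_rate(20))
```

Real proteins do not use the 20 amino acids equally, so the uniform-frequency estimate is only a rough baseline.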

Dotplots

• Dotplots are a simple way of seeing alignments
  – We really like to see good visual demonstrations, not just tables of numbers

• It’s a grid: put one sequence along the top and the other down the side, and put a dot wherever they match.

• You see the alignment as a diagonal

Dotplot Noise
• A big problem is noise: there are lots of random matches (roughly 5% for proteins) that confuse the image.
  – Standard solution: create a sliding window (say 10 residues) and only mark a dot if a minimum number of matches occurs in that window (say 3).
  – A lot of noise goes away

• This is a sequence compared to itself, so there is a perfect diagonal.
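The sliding-window filter can be sketched as follows. This is a minimal illustration, assuming the window runs along the diagonal starting at each cell; the function name and the choice to return a set of window start positions are assumptions made for the example, not a description of any particular dotplot program.

```python
def windowed_dotplot(seq_a, seq_b, window=10, min_matches=3):
    """Return the set of (i, j) window starts that deserve a dot.

    A dot is placed at (i, j) when the diagonal stretch of `window`
    residues starting at seq_a[i] and seq_b[j] contains at least
    `min_matches` identical pairs.
    """
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= min_matches:
                dots.add((i, j))
    return dots


# A sequence compared to itself always gives the full main diagonal.
seq = "ACDEFGHIKLMNPQRSTVWY"
dots = windowed_dotplot(seq, seq)
print(sorted(dots))
```

Setting `window=1` and `min_matches=1` gives the unfiltered dotplot, which makes it easy to compare the two on the same pair of sequences.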

A Real Dotplot
• Two haptoglobin sequences. (Haptoglobin is a blood protein that binds to hemoglobin that has gotten out of the red blood cells.)
• You can see a gap in one sequence, a region of poor similarity just before it, and a simple sequence repeat near the beginning.

Similarity Matching
• In proteins, many substitutions occur that have little effect on structure or function
  – or, they alter the protein to make it more adapted for the lifestyles of the different species
  – This depends on where in the protein they occur and on the chemical and physical properties of the amino acids.

• Substitution matrices: scores of the probability of changing one amino acid into another.

  – Amino acids are similar if they can frequently be substituted for each other.
  – These are just overall numbers compiled over many sequences, not adapted to specific cases.

• Early attempts were based on amino acid properties, or on the number of nucleotide substitutions needed to change from one amino acid to the other.

• Now they are based on actual comparison between sequences.
  – The two most popular types: PAM and BLOSUM
  – There are other, more specialized substitution matrices, for comparing transmembrane regions, for example.

BLOSUM62 Matrix

Similarity Matrix Theory
• Think about aligning 2 proteins from similar species that are orthologs: same function and syntenic. At some point back in evolutionary time, there was a single DNA sequence that is the common ancestor of both proteins.
  – Most paired amino acids are identical, but a few are different.
• Reduce the problem: consider a single aligned pair of amino acids that are not identical: T-S
• We are comparing 2 theories of how these amino acids were derived from a common ancestor.
1. Random mutation followed by natural selection. Some substitutions will happen more frequently than others because they lead to functional proteins more often.
  • The frequency with which T and S are substituted for each other by evolution is derived from counting them in well-aligned sequences. = freq(T-S)
2. Completely random changes: every possible substitution happens in proportion to the relative frequencies of the different amino acids; the two amino acids are unrelated to each other.
  • In this case, the frequency of a T and an S is just the product of the frequency of T’s and the frequency of S’s in the entire protein (or proteome). = freq(T) • freq(S)
• The odds ratio is the evolutionary theory (observed data) frequency divided by the random theory frequency. OR = freq(T-S) / (freq(T) • freq(S))
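As a small worked illustration of the odds ratio, the sketch below follows the formula exactly as stated on this slide. The three frequencies are invented for the example and are not measured values.

```python
def odds_ratio(pair_freq, freq_a, freq_b):
    """Observed pair frequency divided by the frequency expected at random."""
    return pair_freq / (freq_a * freq_b)


# Hypothetical frequencies, chosen only to make the arithmetic easy to follow.
freq_T = 0.05      # fraction of all residues that are T (assumed)
freq_S = 0.06      # fraction of all residues that are S (assumed)
freq_TS = 0.0045   # fraction of aligned pairs that are T-S (assumed)

ratio = odds_ratio(freq_TS, freq_T, freq_S)
print(ratio)  # above 1: T-S pairs are seen more often than chance predicts
```

With these made-up numbers the random model predicts 0.05 × 0.06 = 0.003, so the observed pair frequency is about 1.5 times the chance expectation.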

More Theory
• We want to get the odds that a given alignment fits the evolutionary model better than a random model.
  – Good alignments give high odds ratios

• Need to multiply the OR’s for all amino acids in the alignment

• It is easier (and doesn’t overflow the computer’s floating point calculator) to take the logarithm of the odds ratio for each amino acid, and then add the logarithms.

– This is the lod score (log of odds).

• A negative score means that the given substitution is less likely than chance, and a positive score means it is more likely than chance.

• You can score each possible alignment by adding up over the whole protein

• Some fooling with constants (which don’t distort the results but are either more pleasing to the human eye or make further calculations easier): multiply lod score by 10, or add a constant to make all values 0 or greater
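The equivalence between multiplying odds ratios and adding their logarithms can be checked directly. This is a minimal sketch; the list of per-position odds ratios is invented for illustration, and base 10 is used only because the PAM slide that follows uses it.

```python
import math


def lod(odds_ratio):
    """Log of odds (base 10) for one aligned pair."""
    return math.log10(odds_ratio)


# Hypothetical odds ratios for four aligned positions (assumed values).
ratios = [1.5, 0.5, 4.0, 1.0]

product_of_odds = math.prod(ratios)          # multiply the odds ratios
sum_of_lods = sum(lod(r) for r in ratios)    # or add their logarithms

# Both routes describe the same alignment score.
print(math.log10(product_of_odds), sum_of_lods)
```

Ratios below 1 give negative lod scores and ratios above 1 give positive ones, which is why a summed score can be read as "better or worse than chance".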

PAM
• PAM = “Point Accepted Mutations”, meaning single amino acid substitutions (point mutations) that have been “accepted” by natural selection: they are functional in different species.

• Derived by Dayhoff and colleagues in the 1960’s and 1970’s (although there are some newer versions around)

• They give a measure of the frequency of changing from one amino acid to another, as compared to the frequency of random change

• Derived from global alignments of homologous sequences from different, but closely related, species. The sequences had an average of 1 amino acid change per hundred residues. Thus we assume at most 1 mutation has occurred at each position.
  – Do a phylogenetic analysis of the sequences to determine which mutations have occurred
  – Calculate the lod scores. Then multiply all of them by 10 and round to integers.
  – This set of scores derived from sequence alignments is the PAM1 matrix.

• Since most sequences being aligned are not between such closely related species, the PAM1 matrix is multiplied by itself many times to mimic lots of small changes.

  – This concept is a serious weakness: multiplying of errors magnifies them.
  – The number after “PAM” is the number of times the matrix has been multiplied by itself.
  – Common ones: PAM30, PAM70, PAM120, PAM250. Bigger number = better for more distant relationships
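The idea of multiplying the matrix by itself can be illustrated with a toy example. The sketch assumes the matrix being multiplied is a table of mutation probabilities (each row sums to 1), since repeated multiplication has a direct meaning in that form, and it uses an invented 3-letter alphabet with made-up probabilities instead of the real 20 × 20 data.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, times):
    """Multiply matrix m by itself `times` times (times=1 returns m's values)."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(times):
        result = mat_mul(result, m)
    return result


# Toy "PAM1-like" table: about 1% chance of change per position (assumed values).
toy_pam1 = [
    [0.990, 0.007, 0.003],
    [0.007, 0.990, 0.003],
    [0.003, 0.003, 0.994],
]

toy_pam250 = mat_power(toy_pam1, 250)
for row in toy_pam250:
    print([round(p, 3) for p in row])
```

After many multiplications the chance of staying unchanged drops well below its starting value, which is the behavior wanted for distant relationships. Any error in the starting table is carried through every multiplication, which is the weakness noted above.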

BLOSUM
• = BLOcks SUbstitution Matrix. Derived in the 1990’s by Henikoff and Henikoff.
• Based on local alignments of Blocks, which are short, highly homologous regions, with no gaps
• Sequences were grouped together if they were very similar, and then comparisons were made between the groups as in the PAM matrices.
  – No attempt at phylogenetic trees
  – The different BLOSUM matrices have specific cutoffs for amino acid identities. For example, for the BLOSUM62 matrix, sequences in a block that are at least 62% identical were grouped together and counted as a single sequence.

– The odds ratio for each substitution is calculated, but instead of taking the base 10 log and multiplying the result by 10 as in PAM, BLOSUM takes the base 2 log and multiplies by 2. This gives scores in “half-bits”.

• Bigger numbers imply closer evolutionary distance, so BLOSUM80 is better for closely related species than BLOSUM45.

• BLOSUM seems to work better than PAM
  – BLOSUM62 is the default used in BLAST searches.
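The two scaling conventions described above (base 10 log times 10 for PAM, base 2 log times 2 for BLOSUM) can be compared on the same odds ratio. The function names and the example odds ratios are made up for this sketch.

```python
import math


def pam_style_score(odds_ratio):
    """Base 10 log of the odds ratio, multiplied by 10 and rounded."""
    return round(10 * math.log10(odds_ratio))


def blosum_style_score(odds_ratio):
    """Base 2 log of the odds ratio, multiplied by 2 and rounded (half-bits)."""
    return round(2 * math.log2(odds_ratio))


# Hypothetical odds ratios: 4x more common than chance, chance, 4x rarer.
for ratio in (4.0, 1.0, 0.25):
    print(ratio, pam_style_score(ratio), blosum_style_score(ratio))
```

The two scales differ only by a constant factor before rounding, so scores from the two families should not be mixed within one alignment.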

BLOSUM62 and PAM120 Matrices

The colors represent different physicochemical properties.

Note that some substitutions are positive, which indicates that they occur more frequently than chance.

The average value is negative: it is more likely that an amino acid will stay the same than change.

The diagonal values are unchanged amino acids, all of which have positive values. Some are less changeable than others: tryptophan and cysteine especially.

Gaps
• Gaps occur with roughly 1/10 the frequency of base substitutions, so they are common in most alignments.
• Symbolized by hyphens ( --- ) paired with residues: like a mismatch with a blank space.
• You can assign a penalty for each gap position.

– This is called a linear gap penalty: the total penalty is proportional to the gap length.

• The problem is, once you start putting them in, you can get almost anything aligned.

• Alignment programs usually distinguish between creating a gap and extending a gap. Thus there is a gap opening penalty and a (smaller) gap extension penalty.
  – This is called an affine gap penalty.
• Although substitutions have a lot of theory behind them, gap penalties are generally determined by heuristic means.
  – Heuristic = a method or value determined by trial-and-error experiments, without a strong guiding theory.
  – In this case, gap opening and extension penalties are the result of trying many possibilities and seeing which ones give the most pleasing alignments.
  – The BLAST default is a -11 penalty for opening the gap and -1 for each additional position of gap. (11/1)
    • Other options on BLAST at NCBI are 7/2, 8/2, 9/2, 10/1, and 12/1
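The difference between linear and affine gap penalties is easy to see in code. The sketch below follows the wording on this slide, where the opening penalty covers the first gap position and the extension penalty applies to each additional one. That accounting is an assumption: some programs charge the extension penalty for every position including the first, so check the documentation of the tool you use. The function names are invented for the example.

```python
def linear_gap_penalty(length, per_position=6):
    """Linear penalty: the same cost for every gap position."""
    return -per_position * length


def affine_gap_penalty(length, opening=11, extension=1):
    """Affine penalty: `opening` for the first position, `extension` for the rest."""
    if length <= 0:
        return 0
    return -(opening + extension * (length - 1))


for length in (1, 3, 10):
    print(length, linear_gap_penalty(length), affine_gap_penalty(length))
```

With these values a single long gap is far cheaper under the affine scheme than under the linear one, while many short gaps remain expensive because each one pays the opening penalty again.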

Comparing 2 distantly related sequences with different gap penalties:
• Top sequence has fewer gaps and longer matches.
• Bottom sequence has more identities and similarities overall, but lots of little gaps. The matches near the C-terminal are absurd.
• Look at the short segment after the first gap in the lower sequence: gained 3 identities

How Do We Make Alignments?
• We have been working on scoring an alignment: identities and similarities, and gap penalties.
• But, how do you get an alignment to score in the first place?

– Trying all possibilities is one of those “more possibilities than there are atoms in the Universe” problems.

• The general solution: “dynamic programming”, a technique first applied to biological (protein) sequences by Needleman and Wunsch (1970)

– Their original method gave global alignments.

– Smith and Waterman (1981) provided a slight (but critical) modification that produced local alignments, which work better than global for most genes.

• These methods provide an optimal alignment, for a given substitution matrix and set of gap penalties.

• They are much faster than trying all possibilities, but still not quick enough. Various refinements and heuristic methods improve the speed.

Smith-Waterman Algorithm
• Start with a 2-dimensional matrix with one sequence along the top and the other sequence down the left side. All possible pairs of nucleotides or amino acids are represented by the cells of the matrix.
  – “Edge rows” along the top and left side.
• All possible alignments are represented by the paths through the matrix.
  – a diagonal step is an alignment between the query and the subject sequences at that position
  – a vertical step is a gap in the query sequence
  – a horizontal step is a gap in the subject sequence.
• Have a match reward and penalties for mismatches, gap openings, and gap extensions. For our example, we will use the BLOSUM62 matrix, with a linear gap penalty of -6
• Initialize the edge rows to scores of 0.

BLOSUM62, with positive scores marked

Calculating Cell Scores
• The cell at row i and column j has a score S(i, j)
• Starting at the top left cell, proceed row-by-row, calculating each cell’s score S(i, j). S(i, j) is the maximum of:
  – 0 (i.e. set to 0 if the calculated score is less than 0)
  – S(i-1, j-1) + match/mismatch score for cell (i, j) (diagonal move)
  – S(i, j-1) + match/mismatch score for cell (i, j) + gap penalty (horizontal move)
  – S(i-1, j) + match/mismatch score for cell (i, j) + gap penalty (vertical move)

      T   A
  T   5   7
  G   2   ?

For the cell in question, the bases don’t match, so it starts with a match/mismatch score of -1. The gap penalty in this example is -4. There are 3 possible alignment paths to this cell:
1. diagonal (query/subject alignment). Score = 5 – 1 = 4.
2. vertical (query gap). Score = 7 – 4 – 1 = 2
3. horizontal (subject gap). Score = 2 – 4 – 1 = -3 (set to 0)
Since 4 is the maximum, the cell’s value is set to 4.
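The cell calculation above can be written as a small function. By default the sketch follows the rule as stated on this slide, where gap moves add the cell's match/mismatch score as well as the gap penalty. The commonly published form of the recurrence adds only the gap penalty for gap moves, so a flag (an assumption added here for comparison, as is the function name) switches between the two.

```python
def cell_score(diag, up, left, sub, gap, sub_on_gap_moves=True):
    """Score of one cell from its three neighbours.

    diag, up, left: scores of the diagonal, upper and left neighbours.
    sub: match/mismatch score for this cell.  gap: gap penalty (negative).
    sub_on_gap_moves=True follows the slide's rule; False uses the
    commonly published rule where gap moves add only the gap penalty.
    """
    extra = sub if sub_on_gap_moves else 0
    return max(0,
               diag + sub,          # diagonal move
               up + gap + extra,    # vertical move
               left + gap + extra)  # horizontal move


# The worked example: neighbours 5 (diagonal), 7 (above), 2 (left).
print(cell_score(diag=5, up=7, left=2, sub=-1, gap=-4))
print(cell_score(diag=5, up=7, left=2, sub=-1, gap=-4, sub_on_gap_moves=False))
```

For this particular cell both rules pick the diagonal move, but they can disagree elsewhere in a matrix, so the variant in use should be stated whenever scores are compared.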

Smith-Waterman Details

• Start at the first row: T doesn’t match anything, and looking at BLOSUM62, the only positive score for a mismatch is +1 with S.

– We keep track of the 0 -> 1 diagonal

• Second row: H matches N = +1, but nothing else.

– The diagonal starting with the 1 in the previous row is an H-A mismatch = -2, so 1 - 2 = -1, which is scored as 0.

• Third row: I gives positive scores with M, L, and V. But, nothing builds on the previous row.

More S-W
• Fourth row: S has positive scores with N, A, and T.
  – S-S = +4 match, added to 4 from the diagonal = 8
  – S-A = 1. For a horizontal move (subject gap), 8 + 1 – 6 = 3.
  – S-I is -2 mismatch, added to 2 from the diagonal = 0.
  – S-G = 0 mismatch, added to 4 from the diagonal = 4.

More S-W

Still More!

Traceback
• Then, start at the highest score in the matrix and trace back the path leading through the highest previous scores to 0. Go left and up only, preferring the diagonal path if a choice needs to be made.
  – High score is 16, in the bottom row (but it could have been elsewhere).
• Write the alignment starting at the top.
  – It doesn’t cover the entire sequence: it is a local alignment, not global.
  – It isn’t perfect: the strong diagonal from LI and the 0 mismatch score from a G-N pairing overcame the gap penalty needed to put a gap where the G is.
  – Nevertheless, given the BLOSUM62 matrix and the -6 linear gap penalty, this is an optimal alignment:

ISALIGNE
IS-LIN-E
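Matrix fill and traceback can be put together in a short sketch. To stay compact it makes two simplifications that differ from the worked example in these slides: it uses a plain match/mismatch score instead of BLOSUM62, and it uses the commonly published recurrence in which a gap move adds only the gap penalty. It therefore will not reproduce the score of 16 above. The function name, default scores and test sequences are chosen for illustration only.

```python
def smith_waterman(a, b, match=3, mismatch=-3, gap=-2):
    """Local alignment with a linear gap penalty.

    Returns (best_score, aligned_a, aligned_b). Sequence `a` runs down
    the rows and `b` across the columns of the scoring matrix.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]   # edge row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,  # diagonal
                          H[i - 1][j] + gap,      # vertical
                          H[i][j - 1] + gap)      # horizontal
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest score until a 0 is reached,
    # preferring the diagonal when more than one move fits.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = smith_waterman("TGTTACGG", "GGTTGACTA")
print(score)
print(top)
print(bottom)
```

Swapping the match/mismatch test for a lookup in a substitution matrix such as BLOSUM62 would be the natural next step toward the protein example, at the cost of carrying the full table of scores.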

Changing the Gap Penalty

• The top one has a -4 gap penalty and the bottom one has a -8 gap penalty (both linear). They give somewhat different alignments.

A Needleman-Wunsch Alignment

Speeding Things Up
