Biological Sequence Comparison / Database Homology Searching

Aoife McLysaght

Summer Intern,

Compaq Computer Corporation

Ballybrit Business Park, Galway, Ireland

Database Homology Searching

• Use algorithms to increase efficiency and to provide a mathematical basis for searches which can be translated into statistical significance

• Assumes that sequence, structure and function are inter-related

• BLAST (Basic Local Alignment Search Tool) and FastA (FAST-All) – heuristic approximations of the Needleman-Wunsch and Smith-Waterman algorithms

– reduce computation

Needleman-Wunsch Algorithm

• General algorithm for sequence comparison

• Maximise a similarity score, to give ‘maximum match’

• Maximum match = largest number of residues of one sequence that can be matched with another allowing for all possible deletions

• Finds the best GLOBAL alignment of any two sequences

• N-W involves an iterative matrix method of calculation

– All possible pairs of residues (bases or amino acids), one from each sequence, are represented in a 2-dimensional array

– All possible alignments (comparisons) are represented by pathways through this array

Needleman-Wunsch Algorithm (cont.)

• Three main steps

1. Assign similarity values

2. For each cell, look at all possible pathways back to the beginning of the sequence (allowing insertions and deletions) and give that cell the value of the maximum scoring pathway

3. Construct an alignment (pathway) back from the highest scoring cell to give the highest scoring alignment


Similarity values

• A numerical value is assigned to every cell in the array depending on the similarity/dissimilarity of the two residues

• These may be simple scores or more complicated, e.g. related to chemical similarities or frequency of observed substitutions

• The example shown has

– match = +1

– mismatch = 0

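     M  P  R  C  L  C  Q  R  J  N  C  B  A
  P  0  1  0  0  0  0  0  0  0  0  0  0  0
  B  0  0  0  0  0  0  0  0  0  0  0  1  0
  R  0  0  1  0  0  0  0  1  0  0  0  0  0
  C  0  0  0  1  0  1  0  0  0  0  1  0  0
  K  0  0  0  0  0  0  0  0  0  0  0  0  0
  C  0  0  0  1  0  1  0  0  0  0  1  0  0
  R  0  0  1  0  0  0  0  1  0  0  0  0  0
  N  0  0  0  0  0  0  0  0  0  1  0  0  0
  J  0  0  0  0  0  0  0  0  1  0  0  0  0
  C  0  0  0  1  0  1  0  0  0  0  1  0  0
  J  0  0  0  0  0  0  0  0  1  0  0  0  0
  A  0  0  0  0  0  0  0  0  0  0  0  0  1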
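A minimal Python sketch of this step, assuming the simple scores of the example (match = +1, mismatch = 0). The function name `similarity_matrix` is illustrative and does not come from the slides.

```python
def similarity_matrix(rows, cols, match=1, mismatch=0):
    """Score every residue pair: one residue from each sequence."""
    return [[match if r == c else mismatch for c in cols] for r in rows]


cols = "MPRCLCQRJNCBA"  # sequence written across the top of the array
rows = "PBRCKCRNJCJA"   # sequence written down the side

S = similarity_matrix(rows, cols)
for residue, scores in zip(rows, S):
    print(residue, *scores)
```

A table of scores based on chemical similarity or observed substitution frequencies could replace the match/mismatch test here without changing the later steps.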


Score pathways through array

• For each cell, we want to know the maximum possible score for an alignment ending at that point

• Searches subrow and subcolumn, as shown, for the highest score

• Adds this to the score for the current cell

• Proceeds row by row through the array

• Gap penalty for the introduction of gaps in the alignment (presumed insertions or deletions into one sequence) … here = 0

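     M  P  R  C  L  C  Q  R  J  N  C  B  A
  P  0  1  0  0  0  0  0  0  0  0  0  0  0
  B  0  0  1  1  1  1  1  1  1  1  1  2  1
  R  0  0  2  1  1  1  1  2  1  1  1  1  2
  C  0  0  1  3  2  3  2  2  2  2  3  2  2
  K  0  0  1  2  3  3  3  3  3  3  3  3  3
  C  0  0  1  3  3  4  3  3  3  3  4  3  3
  R  0  0  2  2  3  3  4  ?
  N  0  0  0  0  0  0  0  0  0  1  0  0  0
  J  0  0  0  0  0  0  0  0  1  0  0  0  0
  C  0  0  0  1  0  1  0  0  0  0  1  0  0
  J  0  0  0  0  0  0  0  0  1  0  0  0  0
  A  0  0  0  0  0  0  0  0  0  0  0  0  1

(rows N to A are not yet scored and still show their similarity values)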

$$H_{i,j} = \max\Big\{\, H_{i-1,j-1} + s(a_i,b_j),\ \max_{k}\{H_{i-k,j-1} - W_k + s(a_i,b_j)\},\ \max_{l}\{H_{i-1,j-l} - W_l + s(a_i,b_j)\} \,\Big\}$$
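The row-by-row scoring can be sketched in Python as follows. This is one illustrative reading of the recurrence, not the slide author's code: `nw_fill` is a hypothetical name, and a single constant `gap` penalty per gap (0 in this example) stands in for the length-dependent W.

```python
def nw_fill(rows, cols, match=1, mismatch=0, gap=0):
    """Fill the array so each cell holds the best score of a path ending there."""
    n, m = len(rows), len(cols)
    H = [[0] * m for _ in range(n)]
    for i in range(n):                      # proceed row by row
        for j in range(m):
            s = match if rows[i] == cols[j] else mismatch
            best = 0                        # first row/column: nothing precedes
            if i > 0 and j > 0:
                best = H[i - 1][j - 1]      # diagonal neighbour, no gap
                for r in range(i - 1):      # rest of the subcolumn (column j-1)
                    best = max(best, H[r][j - 1] - gap)
                for c in range(j - 1):      # rest of the subrow (row i-1)
                    best = max(best, H[i - 1][c] - gap)
            H[i][j] = best + s
    return H


H = nw_fill("PBRCKCRNJCJA", "MPRCLCQRJNCBA")
print(H[6])  # the second R row, which contains the cell marked "?"
```

Under these scores the cell marked "?" evaluates to 5: the highest score in its subrow and subcolumn is 4, plus 1 for the R/R match.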


Construct alignment

• The alignment score is cumulative: it is obtained by adding cell values along a path through the array

• The best alignment has the highest score i.e. the maximum match

• Maximum match = largest total obtained by summing the cell values along any single pathway

• The maximum match will ALWAYS be somewhere in the outer (last) row or column

• The alignment is constructed by working backwards from the maximum match

     M  P  R  C  L  C  Q  R  J  N  C  B  A
  P  0  1  0  0  0  0  0  0  0  0  0  0  0
  B  0  0  1  1  1  1  1  1  1  1  1  2  1
  R  0  0  2  1  1  1  1  2  1  1  1  1  2
  C  0  0  1  3  2  3  2  2  2  2  3  2  2
  K  0  0  1  2  3  3  3  3  3  3  3  3  3
  C  0  0  1  3  3  4  3  3  3  3  4  3  3
  R  0  0  2  2  3  3  4  5  4  4  4  4  4
  N  0  0  1  2  3  3  4  4  5  6  5  5  5
  J  0  0  1  2  3  3  4  4  6  5  6  6  6
  C  0  0  1  3  3  4  4  4  5  6  7  6  6
  J  0  0  1  2  3  3  4  4  6  6  6  7  7
  A  0  0  1  2  3  3  4  4  5  6  6  7  8

    MP-RCLCQR-JNCBA
     | || | | | | |
    -PBRCKC-RNJ-CJA
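Working backwards from the maximum match can be sketched by remembering, for each cell, which earlier cell supplied its best score. The sketch below assumes the example's scores (match = +1, mismatch = 0, gap = 0) and a constant penalty per gap; `nw_align` and its variable names are hypothetical.

```python
def nw_align(rows, cols, match=1, mismatch=0, gap=0):
    """Return (score, aligned cols, aligned rows) for one maximum-match path."""
    n, m = len(rows), len(cols)
    H = [[0] * m for _ in range(n)]
    back = [[None] * m for _ in range(n)]   # predecessor of each cell on its best path
    for i in range(n):
        for j in range(m):
            best, prev = 0, None
            if i > 0 and j > 0:
                cands = [(H[i - 1][j - 1], (i - 1, j - 1))]                       # no gap
                cands += [(H[r][j - 1] - gap, (r, j - 1)) for r in range(i - 1)]  # subcolumn
                cands += [(H[i - 1][c] - gap, (i - 1, c)) for c in range(j - 1)]  # subrow
                best, prev = max(cands, key=lambda cand: cand[0])
            H[i][j] = best + (match if rows[i] == cols[j] else mismatch)
            back[i][j] = prev

    # The maximum match lies in the last row or the last column.
    edge = [(i, m - 1) for i in range(n)] + [(n - 1, j) for j in range(m)]
    cell = max(edge, key=lambda ij: H[ij[0]][ij[1]])
    score = H[cell[0]][cell[1]]

    path = []                               # walk backwards, then reverse
    while cell is not None:
        path.append(cell)
        cell = back[cell[0]][cell[1]]
    path.reverse()

    top, bottom, pi, pj = [], [], 0, 0      # pi, pj: next unplaced residue of each sequence
    for i, j in path:
        top += list(cols[pj:j]) + ["-"] * (i - pi)      # skipped residues pair with gaps
        bottom += ["-"] * (j - pj) + list(rows[pi:i])
        top.append(cols[j])
        bottom.append(rows[i])
        pi, pj = i + 1, j + 1
    top += list(cols[pj:]) + ["-"] * (n - pi)           # unaligned tails
    bottom += ["-"] * (m - pj) + list(rows[pi:])
    return score, "".join(top), "".join(bottom)


score, top, bottom = nw_align("PBRCKCRNJCJA", "MPRCLCQRJNCBA")
print(score)
print(top)
print(bottom)
```

Several pathways can tie for the maximum match (for instance, either J/J or N/N can be paired in the RJN/RNJ region). The sketch keeps the first candidate it meets, so the printed alignment may differ from the one drawn on the slide while having the same score of 8.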


Statistical Significance

• Maximum match is a function of sequence relationship and composition

• Would like to know probability of obtaining result (maximum match) from a pair of random sequences

• Estimate this experimentally

– form pairs of random sequences by randomly drawing one member from each set (i.e. sequences with the same composition as the real proteins)

– if the value found for the real proteins is significantly different from that for the random proteins then the difference is a function of the sequences alone and not of their composition
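One way to run such an experiment is sketched below. It rests on assumptions the slides do not spell out: each random set is built by shuffling one of the real sequences (which preserves composition), the difference is summarised as a z-score, and the names `max_match`, `shuffled` and `significance` are hypothetical.

```python
import random
import statistics


def max_match(rows, cols):
    """Maximum match with match = +1, mismatch = 0, gap penalty = 0."""
    n, m = len(rows), len(cols)
    H = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            prev = 0
            if i > 0 and j > 0:
                # best score in the subcolumn and subrow (both include the diagonal)
                prev = max([H[r][j - 1] for r in range(i)] +
                           [H[i - 1][c] for c in range(j)])
            H[i][j] = prev + (1 if rows[i] == cols[j] else 0)
    return max(max(row[-1] for row in H), max(H[-1]))


def shuffled(seq, rng):
    """Random sequence with the same composition as seq."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


def significance(rows, cols, trials=200, seed=0):
    """Compare the real maximum match with that of random sequence pairs."""
    rng = random.Random(seed)
    real = max_match(rows, cols)
    null = [max_match(shuffled(rows, rng), shuffled(cols, rng))
            for _ in range(trials)]
    mean, sd = statistics.mean(null), statistics.pstdev(null)
    z = (real - mean) / sd if sd else float("inf")
    return real, mean, sd, z


real, mean, sd, z = significance("PBRCKCRNJCJA", "MPRCLCQRJNCBA")
print(real, round(mean, 2), round(sd, 2), round(z, 2))
```

A large positive z suggests the real maximum match is unlikely to arise from composition alone. The number of trials and any significance threshold are choices left to the user.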

Smith-Waterman Algorithm

• Instead of looking at each sequence in its entirety, this compares segments of all possible lengths (LOCAL alignments) and chooses whichever maximises the similarity measure

• For every cell the algorithm calculates ALL possible paths leading to it. These paths can be of any length and can contain insertions and deletions

Smith-Waterman Algorithm (cont.)

• Only works effectively when gap penalties are used

• In example shown

– match = +1

– mismatch = -1/3

– gap = -(1 + k/3) (k = extent of gap)

• Start with all cell values = 0

• Looks in the subcolumn and subrow shown and along the direct diagonal, and takes the highest score once the alignment score or gap penalty has been applied

       C    A    G    C    C    U    C    G    C    U    U    A    G
  A  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0
  A  0.0  1.0  0.7  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.7
  U  0.0  0.0  0.8  0.3  0.0  0.0  0.0  0.0  0.0  1.0  1.0  0.0  0.7
  G  0.0  0.0  1.0  0.3  0.0  0.0  0.7  1.0  0.0  0.0  0.7  0.7  1.0
  C  1.0  0.0  0.0  2.0  1.3  0.3  1.0  0.3  2.0  0.7  0.3  0.3  0.3
  C  1.0  0.7  0.0  1.0  3.0  1.7    ?
  A
  U
  U
  G
  A
  C
  G
  G

$$H_{i,j} = \max\Big\{\, H_{i-1,j-1} + s(a_i,b_j),\ \max_{k}\{H_{i-k,j} - W_k\},\ \max_{l}\{H_{i,j-l} - W_l\},\ 0 \,\Big\}$$
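A short Python sketch of this recurrence, under stated assumptions: the gap weight is read as W_k = 1 + k/3, scores are held as exact fractions so that thirds do not pick up rounding error, and `sw_fill` is a hypothetical name rather than anything from the slides.

```python
from fractions import Fraction


def sw_fill(rows, cols, match=Fraction(1), mismatch=Fraction(-1, 3),
            gap=lambda k: 1 + Fraction(k, 3)):
    """Fill the array; row 0 and column 0 form an all-zero border."""
    n, m = len(rows), len(cols)
    H = [[Fraction(0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if rows[i - 1] == cols[j - 1] else mismatch
            diag = H[i - 1][j - 1] + s                                 # align the two residues
            up = max(H[i - k][j] - gap(k) for k in range(1, i + 1))    # gap of extent k, subcolumn
            left = max(H[i][j - k] - gap(k) for k in range(1, j + 1))  # gap of extent k, subrow
            H[i][j] = max(diag, up, left, Fraction(0))                 # 0 = start a new path here
    return H


H = sw_fill("AAUGCCAUUGACGG", "CAGCCUCGCUUAG")
best = max(max(row) for row in H)
print(round(float(best), 1))
```

For the two example sequences the highest cell value should come out as 10/3, which displays as 3.3 at one decimal place.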


• Four possible ways of forming a path

For every residue in the query sequence

1. Align with next residue of db sequence … score is previous score plus similarity score for the two residues

2. Deletion (i.e. match residue of query with a gap) … score is previous score minus gap penalty dependent on size of gap

3. Insertion (i.e. match residue of db sequence with a gap) … score is previous score minus gap penalty dependent on size of gap

4. Stop … score is zero

• Choose whichever of these is the highest
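As a worked example, the four options can be evaluated for the cell marked "?" in the partly filled array (second C row, seventh column, a C/C match). This is a sketch under stated assumptions: the one-decimal values in the array are taken to be rounded thirds (0.3 = 1/3, 0.7 = 2/3, 1.7 = 5/3), the gap weight is read as W_k = 1 + k/3, and `cell_options` is a hypothetical helper. Which gapped move counts as the deletion and which as the insertion depends on which sequence is the query, so the keys name the direction of the gap instead.

```python
from fractions import Fraction as F


def gap(k):
    """Assumed gap weight W_k = 1 + k/3 for a gap of extent k."""
    return 1 + F(k, 3)


def cell_options(diag, s, above, left):
    """Scores of the four ways to reach one cell.

    diag  -- score of the diagonal neighbour
    s     -- similarity score of the two residues at this cell
    above -- scores in the same column, nearest cell first
    left  -- scores in the same row, nearest cell first
    """
    return {
        "align": diag + s,
        "gap_vertical": max(h - gap(k) for k, h in enumerate(above, start=1)),
        "gap_horizontal": max(h - gap(k) for k, h in enumerate(left, start=1)),
        "stop": F(0),
    }


options = cell_options(
    diag=F(1, 3),                                     # 0.3, one row up and one column left
    s=F(1),                                           # C aligned with C
    above=[F(1), F(2, 3), F(0), F(0), F(0)],          # column 7, rows 5 up to 1
    left=[F(5, 3), F(3), F(1), F(0), F(2, 3), F(1)],  # row 6, columns 6 back to 1
)
best = max(options.values())
print({name: round(float(v), 1) for name, v in options.items()}, round(float(best), 1))
```

Here aligning (1/3 + 1) and a horizontal gap of extent 2 from the 3.0 cell (3 - 5/3) tie at 4/3, so the cell takes the value 1.3.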


Construct Alignment

• The score in each cell is the maximum possible score for an alignment of ANY LENGTH ending at those coordinates

• Trace pathway back from highest scoring cell

• This cell can be anywhere in the array

• Align highest scoring segment

       C    A    G    C    C    U    C    G    C    U    U    A    G
  A  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0
  A  0.0  1.0  0.7  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.7
  U  0.0  0.0  0.8  0.3  0.0  0.0  0.0  0.0  0.0  1.0  1.0  0.0  0.7
  G  0.0  0.0  1.0  0.3  0.0  0.0  0.7  1.0  0.0  0.0  0.7  0.7  1.0
  C  1.0  0.0  0.0  2.0  1.3  0.3  1.0  0.3  2.0  0.7  0.3  0.3  0.3
  C  1.0  0.7  0.0  1.0  3.0  1.7  1.3  1.0  1.3  1.7  0.3  0.0  0.0
  A  0.0  2.0  0.7  0.3  1.7  2.7  1.3  1.0  0.7  1.0  1.3  1.3  0.0
  U  0.0  0.7  1.7  0.3  1.3  2.7  2.3  1.0  0.7  1.7  2.0  1.0  1.0
  U  0.0  0.3  0.3  1.3  1.0  2.3  2.3  2.0  0.7  1.7  2.7  1.7  1.0
  G  0.0  0.0  1.3  0.0  1.0  1.0  2.0  3.3  2.0  1.7  1.3  2.3  2.7
  A  0.0  1.0  0.0  1.0  0.3  0.7  0.7  2.0  3.0  1.7  1.3  2.3  2.0
  C  1.0  0.0  0.7  1.0  2.0  0.7  1.7  1.7  3.0  2.7  1.3  1.0  2.0
  G  0.0  0.7  1.0  0.3  0.7  1.7  0.3  2.7  1.7  2.7  2.3  1.0  2.0
  G  0.0  0.0  1.7  0.7  0.3  0.3  1.3  1.3  2.3  1.3  2.3  2.0  2.0

    GCC-UCG
    GCCAUUG
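The trace-back can be sketched by storing, for each cell, the cell its score came from, then following those links from the highest scoring cell until a zero is reached. As before this is an illustration under assumptions: gap weight W_k = 1 + k/3, exact fractions for the scores, and hypothetical names (`sw_align`, `back`).

```python
from fractions import Fraction


def sw_align(rows, cols, match=Fraction(1), mismatch=Fraction(-1, 3),
             gap=lambda k: 1 + Fraction(k, 3)):
    """Return (best score, aligned cols segment, aligned rows segment)."""
    n, m = len(rows), len(cols)
    H = [[Fraction(0)] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]  # None marks a cell where a path stops
    best, best_cell = Fraction(0), (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if rows[i - 1] == cols[j - 1] else mismatch
            cands = [(H[i - 1][j - 1] + s, (i - 1, j - 1))]                         # align
            cands += [(H[i - k][j] - gap(k), (i - k, j)) for k in range(1, i + 1)]  # vertical gap
            cands += [(H[i][j - k] - gap(k), (i, j - k)) for k in range(1, j + 1)]  # horizontal gap
            score, prev = max(cands, key=lambda cand: cand[0])
            if score > 0:                      # otherwise stop: the cell stays at zero
                H[i][j], back[i][j] = score, prev
            if H[i][j] > best:
                best, best_cell = H[i][j], (i, j)

    top, bottom = [], []
    i, j = best_cell                           # can be anywhere in the array
    while back[i][j] is not None:
        pi, pj = back[i][j]
        if pi == i - 1 and pj == j - 1:        # two residues aligned
            top.append(cols[j - 1])
            bottom.append(rows[i - 1])
        elif pj == j:                          # residues of rows paired with gaps
            for r in range(i, pi, -1):
                top.append("-")
                bottom.append(rows[r - 1])
        else:                                  # residues of cols paired with gaps
            for c in range(j, pj, -1):
                top.append(cols[c - 1])
                bottom.append("-")
        i, j = pi, pj
    return best, "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = sw_align("AAUGCCAUUGACGG", "CAGCCUCGCUUAG")
print(round(float(score), 1))
print(top)
print(bottom)
```

For the example sequences this should recover the segment pair GCC-UCG / GCCAUUG with a score of 10/3 (3.3): five matches, one mismatch and one gap of extent 1.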

Differences

• Needleman-Wunsch

1. Global alignments

2. Requires alignment score for a pair of residues to be ≥ 0

3. No gap penalty required

4. Score cannot decrease between two cells of a pathway

• Smith-Waterman

1. Local alignments

2. Residue alignment score may be positive or negative

3. Requires a gap penalty to work effectively

4. Score can increase, decrease or stay level between two cells of a pathway
