Bioinformatics Workshop, Fall 2003

Algorithms in Bioinformatics

Lawrence D’Antonio

Ramapo College of New Jersey

Topics

• Algorithm basics

• Types of algorithms in bioinformatics

• Sequence alignment

• Database Searches

Algorithm basics

• What is an algorithm?

• Algorithm complexity

• P vs. NP

• NP completeness

What is an algorithm?

• An algorithm is a step-by-step procedure to solve a problem

• The word “algorithm” comes from the 9th century Islamic mathematician al-Khwarizmi

Algorithm Complexity

• If the algorithm works with n pieces of data and the number of steps is proportional to n, then we say that the running time is O(n).

• If the number of steps is proportional to log n, then the running time is O(log n).

Example

• Problem: find the largest element in a sequence of n elements.

• Solution idea: Iteratively compare size of elements in sequence.

Algorithm:

1. Initialize first element as largest.

2. For each remaining element:

If the current element is larger than the largest so far, make it the new largest.

Running time: O(n)
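
The two steps above can be sketched in Python as follows. This is a minimal illustration; the function name `find_largest` is chosen here and does not come from the slides.

```python
def find_largest(items):
    """Return the largest element of a non-empty sequence in O(n) steps."""
    if not items:
        raise ValueError("sequence must be non-empty")
    largest = items[0]            # step 1: first element is the largest so far
    for current in items[1:]:     # step 2: visit each remaining element once
        if current > largest:
            largest = current
    return largest


print(find_largest([3, 9, 2, 7]))
```

Each element is examined exactly once, which is where the O(n) bound comes from.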

Polynomial Time

• An algorithm is said to run in polynomial time if its running time can be written in the form O(n^k) for some power k.

• The underlying problem is said to be of class P.

Polynomial Time Examples

• Searching

Binary Search: O(log n)

• Sorting

Quick Sort: O(n log n) on average (O(n^2) in the worst case)
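
As an illustration of the O(log n) search bound, here is a small binary search sketch. It assumes the input list is already sorted in ascending order; the name `binary_search` and the convention of returning -1 for a missing value are choices made for this example.

```python
def binary_search(sorted_items, target):
    """Return the index of target in sorted_items, or -1 if it is absent."""
    lo, hi = 0, len(sorted_items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2          # the search interval halves on every pass
        if sorted_items[mid] == target:
            return mid
        if sorted_items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1


print(binary_search([1, 3, 5, 7, 9, 11], 7))
```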

NP Algorithms

• An algorithm is nondeterministic if it begins with guessing a solution to the problem and then verifies the guess.

• A problem is of category NP if there is a nondeterministic algorithm for that problem which runs in polynomial time.
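
The "guess, then verify" idea can be illustrated with the partition problem (split a list of numbers into two halves with equal sums). The sketch below only shows the verification step, which runs in polynomial time; the guess is simply passed in. The function name and the index-based encoding of a guess are assumptions made for this example.

```python
def verify_partition(numbers, guess_indices):
    """Check in polynomial time whether the guessed subset sums to half the total."""
    chosen = set(guess_indices)
    if not chosen <= set(range(len(numbers))):
        return False                      # guess refers to positions that do not exist
    chosen_sum = sum(numbers[i] for i in chosen)
    return 2 * chosen_sum == sum(numbers)


print(verify_partition([3, 1, 1, 2, 2, 1], [0, 3]))
```

Checking a guess is cheap; the difficulty of the problem lies in the number of possible guesses (2^n subsets for n numbers).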

NP Complete

• A problem is NP-complete if it is in NP and every other NP problem can be reduced to it in polynomial time, so an efficient solution to it would solve all NP problems.

• A problem is NP-hard if it is at least as hard as the NP-complete problems

NP Complete Examples

• Traveling salesman

• Knapsack problem

• Partition problem

• Graph coloring

P = NP ?

• P ⊆ NP

• If P ≠ NP, then no NP-complete problem can be solved in polynomial time; the known exact algorithms take super-polynomial (typically exponential) time.

Polynomial vs. Exponential
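
A quick way to see the difference is to tabulate a polynomial and an exponential function side by side. The choice of n^2 and 2^n and of the sample sizes is an assumption made for this sketch.

```python
def growth_table(sizes):
    """Return (n, n^2, 2^n) for each input size n."""
    return [(n, n ** 2, 2 ** n) for n in sizes]


for n, poly, expo in growth_table([10, 20, 30, 40]):
    print(f"n={n:>3}  n^2={poly:>5}  2^n={expo}")
```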

Algorithms in Bioinformatics

• Algorithms to compare DNA, RNA, or protein sequences

• Database searches to find homologous sequences

• Sequence assembly

• Construction of evolutionary trees

• Structure prediction

Edit operations on sequences

Substitution

AATAAGC

ATTAAGC

Insertion

AAT-AAGC

AATTAAGC

Deletion

AATAAGC

AA-AAGC

What is sequence alignment?

• Compare two sequences using matches, substitutions and indels.

G A A - - T C A T

G - T G G - C A -

• 3 matches, 1 substitution, 5 indels

Complexity of DNA Problems

• 3 billion base pairs in human genome

• Many NP complete problems

• 10^600 possible alignments for two 1000 character sequences

Types of sequence alignment

• Determine the alignment of two sequences that maximizes similarity (global alignment)

• Determine substrings of two sequences with maximum similarity (local alignment)

• Determine the alignment for several sequences that maximizes the sum of pairs similarity (multiple alignment)

Significance of Alignment

• Functional similarity

• Structural similarity

• Homology

Scoring System

• Assign a score for each possible match, substitution and indel

• Distance functions – Find alignment to minimize distance between sequences

• Similarity functions – Find alignment to maximize similarity between sequences

Edit Distance

G A A - - T C A T

G - T G G - C A -

• Similarity function: 1 for match, -1 for substitution, -2 for indel

• Score: -8
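
The score above can be checked with a short function that walks the two rows of the alignment column by column. This is a sketch: the function name is chosen here, and "-" is assumed to be the only gap character.

```python
def alignment_score(row_a, row_b, match=1, substitution=-1, indel=-2):
    """Score a given alignment: two equal-length rows, '-' marks a space."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    score = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            score += indel            # a character aligned with a space
        elif x == y:
            score += match
        else:
            score += substitution
    return score


print(alignment_score("GAA--TCAT", "G-TGG-CA-"))
```

With 3 matches, 1 substitution and 5 indels the total is 3 - 1 - 10 = -8.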

Dynamic Programming

• Used on optimization problems

• Bottom-up approach

• Recursively builds up solution from subproblem optimal solutions

Dynamic Programming Alignment Algorithm (Needleman-Wunsch)

• Given sequences a1,a2,…,an and b1,b2,…,bm to be aligned:

• Initialize alignment matrix (aligning with spaces)

• Entry [i,j] gives optimal alignment score for sequences a1,a2,…,ai and b1,b2,…,bj (where 1 ≤ i ≤ n, 1 ≤ j ≤ m)

Computing Alignment Matrix

If a1,a2,…,ai and b1,b2,…,bj have been aligned, there are three possible next moves:

• Match a_{i+1} with b_{j+1}

• Match a_{i+1} with a space -

• Match b_{j+1} with a space -

Choose the move that maximizes the similarity of the two sequences

Global Alignment Matrix

       -    G    G    A    C    A
  -    0   -2   -4   -6   -8  -10
  G   -2    1   -1   -3   -5   -7
  G   -4   -1    2    0   -2   -4
  G   -6   -3    0    1   -1   -3
  C   -8   -5   -2   -1    2    0
  A  -10   -7   -4   -1    0    3
  T  -12   -9   -6   -3   -2    1

Optimal Global Alignment

G G G C A T

G G A C A —
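
The matrix and the alignment above can be reproduced with a compact Needleman-Wunsch sketch. It assumes the scoring used earlier in the deck (1 for a match, -1 for a substitution, -2 for an indel); the function name and the tie-breaking order in the traceback are choices made for this example, so other equally good alignments may exist.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, indel=-2):
    """Return (score matrix, aligned a, aligned b) for a global alignment."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):              # first column: a aligned with spaces
        score[i][0] = i * indel
    for j in range(1, m + 1):              # first row: b aligned with spaces
        score[0][j] = j * indel

    def pair(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + pair(i, j),   # a_i with b_j
                              score[i - 1][j] + indel,            # a_i with a space
                              score[i][j - 1] + indel)            # b_j with a space

    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + indel:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return score, "".join(reversed(row_a)), "".join(reversed(row_b))


matrix, top, bottom = needleman_wunsch("GGGCAT", "GGACA")
print(matrix[-1][-1])
print(top)
print(bottom)
```

The bottom-right entry of the matrix is the optimal global score; the traceback follows the moves that produced it.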

Alignment Running Time

• Assuming two sequences n characters each

• Running time is O(n^2) (each entry of matrix must be calculated)

Variations of Alignment Algorithm

• Gap penalty

• Local alignment

• Multiple alignment

Gap Penalty

• A gap is a number k of consecutive spaces

• k consecutive spaces are more probable than k isolated spaces

• Typical gap penalty function: a + b·k (affine gap penalty)

• Here the first space in a gap is penalized a+b, further spaces are penalized b each.

Gap Penalty Example

• Use the penalty 1 + k for a gap of k spaces (with 1 for a match and -1 for a substitution, as before)

A - A - C - A

A C T A T C A

• Score: -6

A A C - - - A

A C T A T C A

• Score: -4
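
Both scores can be reproduced with a small scorer for the affine penalty a + b·k. The sketch assumes a = 1 and b = 1 (the penalty 1 + k used in this example), 1 for a match and -1 for a substitution; the parameter names `gap_open` and `gap_extend` stand for a and b and are chosen here.

```python
def affine_gap_score(row_a, row_b, match=1, substitution=-1,
                     gap_open=1, gap_extend=1):
    """Score an alignment where a gap of k spaces costs gap_open + gap_extend * k."""
    score = 0
    gap_row = None                        # row holding the gap we are currently inside
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            side = "a" if x == "-" else "b"
            if side != gap_row:           # first space of a new gap pays the opening cost
                score -= gap_open
                gap_row = side
            score -= gap_extend           # every space pays the extension cost
        else:
            gap_row = None
            score += match if x == y else substitution
    return score


print(affine_gap_score("A-A-C-A", "ACTATCA"))
print(affine_gap_score("AAC---A", "ACTATCA"))
```

Three isolated spaces pay the opening cost three times, while one gap of three spaces pays it once, which is why the second alignment scores higher.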

Local Alignment

• Find conserved regions in otherwise dissimilar sequences (e.g., viral and host DNA)

• Smith-Waterman algorithm

• Includes a fourth possibility at each step (don’t align)

Local Alignment Example

• Align the following

G C T C T G C G A A T A

C G T T G A G A T A C T

Optimal Local Alignment

G C T C T G C G A A T A

C G T T G A G A T A C T

(G C T C) T G C G A A T A

(C G T) T G A G - A T A (C T)
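
A minimal Smith-Waterman sketch is given below. The "don't align" option appears as the extra 0 in the maximum, and the traceback stops when it reaches a 0 entry. The scoring values (1, -1, -2) are assumed from the earlier slides; the slides do not state which scoring produced the alignment shown above, so this code may report a different but equally valid local alignment.

```python
def smith_waterman(a, b, match=1, mismatch=-1, indel=-2):
    """Return (best score, aligned piece of a, aligned piece of b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    def pair(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(0,                                   # don't align
                              score[i - 1][j - 1] + pair(i, j),
                              score[i - 1][j] + indel,
                              score[i][j - 1] + indel)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score drops to 0.
    row_a, row_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + pair(i, j):
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + indel:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


print(smith_waterman("GCTCTGCGAATA", "CGTTGAGATACT"))
```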

Multiple Alignment

• Find the alignment among a set of sequences that maximizes the sum of scores for all pairs of sequences

• Dynamic programming run-time for k sequences of length n: O(k^2 2^k n^k)

• Multiple alignment is NP-complete
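
The sum-of-pairs objective can be sketched as follows: score every pair of rows of a given multiple alignment and add the results. The simple scoring (1, -1, -2) is an assumption carried over from the pairwise slides (protein alignments would normally use a substitution matrix), as is the convention that a column with a space in both rows contributes 0.

```python
from itertools import combinations


def sum_of_pairs(rows, match=1, substitution=-1, indel=-2):
    """Sum of pairwise scores over all pairs of rows in a multiple alignment."""
    total = 0
    for r1, r2 in combinations(rows, 2):
        for x, y in zip(r1, r2):
            if x == "-" and y == "-":
                continue                  # space against space: assumed to score 0
            if x == "-" or y == "-":
                total += indel
            elif x == y:
                total += match
            else:
                total += substitution
    return total


print(sum_of_pairs(["AT", "A-", "-T"]))
```

This only evaluates a given alignment; finding the alignment that maximizes this sum is the hard part.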

Other Features

• Usually used for protein alignment

• Can be used for global or local alignment

Multiple Alignment Example

P E A A L Y G R F T - - - I K S D V W

P E S L A Y N K F - - - S I K S D V W

P E A L N Y G R Y - - - S S E S D V W

P E A L N Y G W Y - - - S S E S D V W

P E V I R M Q D D N P F S F Q S D V Y

Multiple vs. Pairwise Alignment

• Optimal multiple alignment does not imply optimal pairwise alignment

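Multiple alignment of AT, A and T:

A T

A -

- T

Pairwise alignment of A and T induced by it:

A -

- T

• With -1 per substitution and -2 per indel, the induced alignment scores -4, while aligning A directly with T scores -1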

Substitution Matrices

• In homologous sequences certain amino acid substitutions are more likely to occur than others

• Types of substitution matrices

* PAM

* BLOSUM

PAM Matrices

• Defines units of evolutionary distance

• 1 PAM unit represents an average of one mutation per 100 amino acids

• Start with a set of highly similar sequences and compute

* pa = probability of occurrence of amino acid a

* Mab = probability of a mutating to b

PAM Matrix Formula

• Entries in a k-PAM matrix:

10 log10( (M^k)_{ab} / p_b )

where M^k is the k-th power of the mutation matrix M
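
The k-PAM entry 10 log10( (M^k)_{ab} / p_b ) can be computed directly once M and p are known. The sketch below uses a made-up two-letter alphabet purely for illustration (real PAM matrices are built from observed protein data), and rounding to the nearest integer is an assumption based on how published PAM tables are usually printed.

```python
import math


def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    size = len(x)
    return [[sum(x[i][t] * y[t][j] for t in range(size)) for j in range(size)]
            for i in range(size)]


def mat_power(m, k):
    """Return the k-th power of a square matrix (k >= 0)."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(k):
        result = mat_mul(result, m)
    return result


def pam_scores(m, p, k):
    """Entries 10 * log10((M^k)[a][b] / p[b]), rounded to integers."""
    mk = mat_power(m, k)
    size = len(p)
    return [[round(10 * math.log10(mk[a][b] / p[b])) for b in range(size)]
            for a in range(size)]


# Hypothetical toy data: two equally frequent letters, 1% mutation per PAM unit.
toy_m = [[0.99, 0.01], [0.01, 0.99]]
toy_p = [0.5, 0.5]
print(pam_scores(toy_m, toy_p, 1))
print(pam_scores(toy_m, toy_p, 250))
```

As k grows, (M^k)_{ab} drifts toward p_b, so all scores move toward 0.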

PAM250 Matrix

    C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
C  12
S   0   2
T  -2   1   3
P  -3   1   0   6
A  -2   1   1   1   2
G  -3   1   0  -1   1   5
N  -4   1   0  -1   0   0   2
D  -5   0   0  -1   0   1   2   4
E  -5   0   0  -1   0   0   1   3   4
Q  -5  -1  -1   0   0  -1   1   2   2   4
H  -3  -1  -1   0  -1  -2   2   1   1   3   6
R  -4   0  -1   0  -2  -3   0  -1  -1   1   2   6
K  -5   0   0  -1  -1  -2   1   0   0   1   0   3   5
M  -5  -2  -1  -2  -1  -3  -2  -3  -2  -1  -2   0   0   6
I  -2  -1   0  -2  -1  -3  -2  -2  -2  -2  -2  -2  -2   2   5
L  -6  -3  -2  -3  -2  -4  -3  -4  -3  -2  -2  -3  -3   4   2   6
V  -2  -1   0  -1   0  -1  -2  -2  -2  -2  -2  -2  -2   2   4   2   4
F  -4  -3  -3  -5  -4  -5  -4  -6  -5  -5  -2  -4  -5   0   1   2  -1   9
Y   0  -3  -3  -5  -3  -5  -2  -4  -4  -4   0  -4  -4  -2  -1  -1  -2   7  10
W  -8  -2  -5  -6  -6  -7  -4  -7  -7  -5  -3   2  -3  -4  -5  -2  -6   0   0  17

BLOSUM Matrices (Omit)

• Uses log-odds ratio similar to PAM

• Uses short highly conserved sequences

• BLOSUM x matrices created after removing sequences that are more than x percent identical

• Better at local alignments

BLOSUM Matrices

• A motif is a conserved amino acid pattern found in a group of proteins with similar biological meaning (PROSITE)

• A block is a conserved amino acid pattern in a group of proteins (no spaces allowed in the pattern) (BLOCKS)

Motif Example

• Motif obtained from a group of 34 tubulin proteins

M[FYW] . . F[VLI]H . [FYW] . . EGM
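
A motif written this way maps almost directly onto a regular expression. The sketch assumes that "." stands for any single residue and that a bracket lists the residues allowed at that position; the test sequence in the example is made up, not a real tubulin.

```python
import re

# The motif from the slide with the spaces removed.
TUBULIN_MOTIF = re.compile(r"M[FYW]..F[VLI]H.[FYW]..EGM")


def find_motif(sequence):
    """Return the start positions (0-based) of every motif occurrence."""
    return [hit.start() for hit in TUBULIN_MOTIF.finditer(sequence)]


# Hypothetical sequence constructed so that the motif occurs at position 2.
print(find_motif("AAMFRRFVHWFTGEGMKK"))
```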

Defining BLOSUM (I)

• BLOSUMn uses blocks that are n% identical (BLOSUM62 is most common)

• Consider all pairs of amino acids appearing in the same column in the blocks

Defining BLOSUM (II)

• Define n(i,j) to be the frequency that amino acids i,j appear in a column pair

• Define e(i,j) to be the frequency that amino acids i,j appear in any pair

• Define BLOSUM entry

s(i,j) = log2( n(i,j) / e(i,j) )
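
The log-odds entry s(i,j) = log2( n(i,j) / e(i,j) ) can be sketched on a toy block. The slides do not spell out how e(i,j) is obtained; this example assumes the usual construction in which each amino acid's background frequency is taken from the observed pairs, with e(i,i) = p_i^2 and e(i,j) = 2·p_i·p_j for i ≠ j. The block itself is invented for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def blosum_scores(block):
    """Log-odds scores for every amino acid pair seen in the block's columns."""
    pair_counts = Counter()
    for column in zip(*block):                       # one column of the block at a time
        for x, y in combinations(column, 2):
            pair_counts[tuple(sorted((x, y)))] += 1
    total = sum(pair_counts.values())
    observed = {pair: c / total for pair, c in pair_counts.items()}   # n(i,j)

    background = Counter()                           # assumed p_i, derived from the pairs
    for (x, y), freq in observed.items():
        background[x] += freq / 2
        background[y] += freq / 2

    scores = {}
    for (x, y), freq in observed.items():
        if x == y:
            expected = background[x] ** 2            # e(i,i)
        else:
            expected = 2 * background[x] * background[y]   # e(i,j), i != j
        scores[(x, y)] = math.log2(freq / expected)
    return scores


# Hypothetical block of three sequences, two columns wide.
print(blosum_scores(["AS", "AS", "AT"]))
```

A positive score means the pair occurs in columns more often than expected by chance.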

PAM vs. BLOSUM

• PAM derived from highly similar sequences (evolutionary model)

• BLOSUM derived from protein families sharing a common ancestor (conserved domain model)

Database Searches

• FASTA

• BLAST

FASTA

• Looks for sequences in a database similar to a query sequence

• Heuristic, exclusion method

• Compares query sequence to each database sequence (called the text)

FASTA Algorithm (I)

• Look for small substrings in query and text that exactly match (“hot spots”)

• Find ten best “diagonal runs” of hot spots

Hot Spot Example

Query (columns): E K L A S R K L

Text (rows): H A S H K L

Hot spots (*) appear in rows A, S, K and L of the text
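
The hot-spot step can be sketched for the two sequences of this example. The sketch assumes hot spots are exact matches of length k = 2 and identifies a diagonal by the difference between query position and text position; the function names and the choice of k are made for this illustration only.

```python
from collections import defaultdict


def hot_spots(query, text, k=2):
    """Return (query_pos, text_pos) for every exact k-letter match."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)              # where each k-mer occurs in the query
    spots = []
    for j in range(len(text) - k + 1):
        for i in index.get(text[j:j + k], []):
            spots.append((i, j))
    return spots


def diagonal_runs(spots):
    """Group hot spots by diagonal (query_pos - text_pos)."""
    runs = defaultdict(list)
    for i, j in spots:
        runs[i - j].append((i, j))
    return dict(runs)


runs = diagonal_runs(hot_spots("EKLASRKL", "HASHKL"))
best = max(runs, key=lambda d: len(runs[d]))
print(best, runs[best])
```

Here the matches AS and KL fall on the same diagonal, which makes that diagonal the best candidate run.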

FASTA Algorithm (II)

• Find best local alignment for each run

• Combine these into larger alignment

• Do multiple alignment on query and texts having highest score in last step

BLAST

• Basic Local Alignment Search Tool

• Heuristic, exclusion method

• Computes statistical significance of alignment scores

BLAST Algorithm

• Find all w-length substrings in text that align to some w-length substring in query with score above a given threshold (called “hits”)

• Extend these hits as far as possible (“segment pairs”)

• Report the highest scoring segment pairs
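
The hit-and-extend idea can be sketched in a few lines. This is a simplified illustration, not the BLAST implementation: it scores words with 1 for a match and -1 for a mismatch instead of a substitution matrix, extends without gaps, and stops extending once the score falls more than `drop` below the best seen. The names `find_hits`, `extend_hit`, `w`, `threshold` and `drop` are chosen here.

```python
def word_score(u, v, match=1, mismatch=-1):
    """Ungapped score of two equal-length words."""
    return sum(match if x == y else mismatch for x, y in zip(u, v))


def find_hits(query, text, w=3, threshold=3):
    """All (query_pos, text_pos) whose w-length words score at least threshold."""
    return [(i, j)
            for i in range(len(query) - w + 1)
            for j in range(len(text) - w + 1)
            if word_score(query[i:i + w], text[j:j + w]) >= threshold]


def extend_hit(query, text, i, j, w=3, drop=2):
    """Extend a hit in both directions; return (score, query_start, query_end)."""
    def extend(qi, tj, step):
        best = score = best_len = length = 0
        while 0 <= qi < len(query) and 0 <= tj < len(text) and best - score <= drop:
            score += 1 if query[qi] == text[tj] else -1
            length += 1
            if score > best:
                best, best_len = score, length       # remember the best extension so far
            qi += step
            tj += step
        return best, best_len

    seed = word_score(query[i:i + w], text[j:j + w])
    right, right_len = extend(i + w, j + w, 1)
    left, left_len = extend(i - 1, j - 1, -1)
    return seed + left + right, i - left_len, i + w + right_len


query, text = "TTACGTAACC", "GGACGTAAGG"
hits = find_hits(query, text)
score, start, end = extend_hit(query, text, *hits[0])
print(hits, score, query[start:end])
```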

Other Bioinformatics Algorithms

• Palindromes

• Tandem Repeats

• Longest Common Subsequence

• Double Digest (NP complete)

• Shortest Common Superstring (NP complete)

References

• Clote and Backofen, Computational Molecular Biology, Wiley

• Gusfield, Algorithms on Strings, Trees, and Sequences, Cambridge University Press

• Mount, Bioinformatics, Cold Spring Harbor Press

• Setubal and Meidanis, Introduction to Computational Molecular Biology, PWS

• Waterman, Introduction to Computational Biology, CRC Press