Bioinformatics Workshop, Fall 2003
Algorithms in Bioinformatics
Lawrence D’Antonio
Ramapo College of New Jersey
Topics
• Algorithm basics
• Types of algorithms in bioinformatics
• Sequence alignment
• Database Searches
Algorithm basics
• What is an algorithm?
• Algorithm complexity
• P vs. NP
• NP completeness
What is an algorithm?
• An algorithm is a step-by-step procedure to solve a problem
• The word “algorithm” comes from the name of the 9th-century Islamic mathematician al-Khwarizmi
Algorithm Complexity
• If the algorithm works with n pieces of data and the number of steps is proportional to n, then we say that the running time is O(n).
• If the number of steps is proportional to log n, then the running time is O(log n).
Example
• Problem: find the largest element in a sequence of n elements.
• Solution idea: Iteratively compare size of elements in sequence.
Algorithm:
1. Initialize first element as largest.
2. For each remaining element:
   if the current element is larger than the largest so far, make it the new largest.
Running time: O(n)
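The two steps above can be sketched in Python as follows. This is only an illustration; the function name `find_largest` is an assumption made for this example and does not come from the slides.

```python
def find_largest(items):
    """Return the largest element of a non-empty sequence in O(n) steps."""
    largest = items[0]             # step 1: the first element is the initial candidate
    for current in items[1:]:      # step 2: visit each remaining element exactly once
        if current > largest:
            largest = current      # a larger element replaces the candidate
    return largest


print(find_largest([7, 3, 12, 5]))
```

Each element is examined once, which is where the O(n) running time comes from.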
Polynomial Time
• An algorithm is said to run in polynomial time if its running time can be written in the form O(n^k) for some constant power k.
• The underlying problem is said to be of class P.
Polynomial Time Examples
• Searching
Binary Search: O(log n)
• Sorting
Quick Sort: O(n log n) on average, O(n^2) in the worst case
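As an illustration of the O(log n) search bound, here is a minimal binary search sketch. It assumes the input list is already sorted; the function name `binary_search` is chosen for this example only.

```python
def binary_search(sorted_items, target):
    """Return the index of target in sorted_items, or -1 if it is absent."""
    low, high = 0, len(sorted_items) - 1
    while low <= high:
        mid = (low + high) // 2
        if sorted_items[mid] == target:
            return mid
        if sorted_items[mid] < target:
            low = mid + 1          # target can only be in the upper half
        else:
            high = mid - 1         # target can only be in the lower half
    return -1


print(binary_search([1, 3, 5, 7, 9, 11], 7))
```

Every iteration halves the remaining range, so at most about log2(n) comparisons are needed.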
NP Algorithms
• An algorithm is nondeterministic if it begins with guessing a solution to the problem and then verifies the guess.
• A problem is in the class NP if there is a nondeterministic algorithm for that problem which runs in polynomial time.
NP Complete
• A problem is NP-complete if it is in NP and every other problem in NP can be reduced to it in polynomial time.
• A problem is NP-hard if it is at least as hard as the NP-complete problems (it need not be in NP itself)
NP Complete Examples
• Traveling salesman
• Knapsack problem
• Partition problem
• Graph coloring
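The "guess, then verify" view of NP can be illustrated with the partition problem from the list above: finding a split of the numbers into two halves of equal sum may be hard, but checking a guessed split is fast. The sketch below shows only the verification step; the function name and the choice of a list of indices as the certificate are assumptions made for this example.

```python
def verify_partition(numbers, chosen_indices):
    """Check in O(n) time whether the chosen indices split numbers into two equal sums."""
    chosen = set(chosen_indices)
    if not all(0 <= i < len(numbers) for i in chosen):
        return False                               # certificate refers to missing items
    left = sum(numbers[i] for i in chosen)         # sum of the guessed half
    right = sum(numbers) - left                    # sum of everything else
    return left == right


print(verify_partition([3, 1, 1, 2, 2, 1], [0, 3]))
```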
P = NP ?
• P ⊆ NP
• If P ≠ NP, then no NP-complete problem can be solved in polynomial time.
Algorithms in Bioinformatics
• Algorithms to compare DNA, RNA, or protein sequences
• Database searches to find homologous sequences
• Sequence assembly
• Construction of evolutionary trees
• Structure prediction
Edit operations on sequences
Substitution    Insertion    Deletion
AATAAGC         AAT-AAGC     AATAAGC
ATTAAGC         AATTAAGC     AA-AAGC
What is sequence alignment?
• Compare two sequences using matches, substitutions and indels.
G A A - - T C A T
G - T G G - C A -
• 3 matches, 1 substitution, 5 indels
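The column counts for this alignment can be checked with a short sketch. It assumes the two rows are written as equal-length strings with "-" marking a space; the function name `classify_columns` is an assumption for this example.

```python
def classify_columns(top, bottom):
    """Count matches, substitutions and indels in a pairwise alignment."""
    matches = substitutions = indels = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            indels += 1            # one sequence has a space in this column
        elif x == y:
            matches += 1
        else:
            substitutions += 1
    return matches, substitutions, indels


print(classify_columns("GAA--TCAT", "G-TGG-CA-"))  # (3, 1, 5)
```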
Complexity of DNA Problems
• 3 billion base pairs in human genome
• Many NP complete problems
• About 10^600 possible alignments for two 1000-character sequences
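The slides do not say how the alignment count for two 1000-character sequences was obtained. One common estimate, used here purely as an assumption, is the binomial coefficient C(2n, n) for two sequences of length n. The sketch below computes how many decimal digits that number has.

```python
import math

n = 1000
count = math.comb(2 * n, n)        # C(2n, n), an assumed way of counting alignments
digits = len(str(count))           # number of decimal digits of the count

# The digit count comes out at roughly 600, in line with the figure quoted above.
print(digits)
```

Whatever the exact counting convention, the number is far too large to enumerate, which motivates the dynamic programming approach.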
Types of sequence alignment
• Determine the alignment of two sequences that maximizes similarity (global alignment)
• Determine substrings of two sequences with maximum similarity (local alignment)
• Determine the alignment for several sequences that maximizes the sum of pairs similarity (multiple alignment)
Significance of Alignment
• Functional similarity
• Structural similarity
• Homology
Scoring System
• Assign a score for each possible match, substitution and indel
• Distance functions – Find alignment to minimize distance between sequences
• Similarity functions – Find alignment to maximize similarity between sequences
Edit Distance
G A A - - T C A T
G - T G G - C A -
• Similarity function: 1 for match, -1 for substitution, -2 for indel
• Score: -8
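The score can be reproduced by summing the per-column values of the similarity function. In this sketch the alignment rows are assumed to be equal-length strings with "-" for a space, and the function name `similarity_score` is chosen for illustration.

```python
def similarity_score(top, bottom, match=1, substitution=-1, indel=-2):
    """Sum the column scores of a pairwise alignment under a simple scoring scheme."""
    score = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            score += indel
        elif x == y:
            score += match
        else:
            score += substitution
    return score


# 3 matches, 1 substitution, 5 indels: 3 - 1 - 10
print(similarity_score("GAA--TCAT", "G-TGG-CA-"))  # -8
```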
Dynamic Programming
• Used on optimization problems
• Bottom-up approach
• Recursively builds up solution from subproblem optimal solutions
Dynamic Programming Alignment Algorithm (Needleman-Wunsch)
• Given sequences a1,a2,…,an and b1,b2,…,bm to be aligned:
• Initialize alignment matrix (aligning with spaces)
• Entry [i,j] gives optimal alignment score for sequences a1,a2,…,ai and b1,b2,…,bj (where 1 ≤ i ≤ n, 1 ≤ j ≤ m)
Computing Alignment Matrix
If a1,a2,…,ai and b1,b2,…,bj have been aligned, there are three possible next moves:
• Match a_{i+1} with b_{j+1}
• Match a_{i+1} with a space —
• Match b_{j+1} with a space —
Choose the move that maximizes the similarity of the two sequences
Global Alignment Matrix
       —    G    G    A    C    A
—      0   -2   -4   -6   -8  -10
G     -2    1   -1   -3   -5   -7
G     -4   -1    2    0   -2   -4
G     -6   -3    0    1   -1   -3
C     -8   -5   -2   -1    2    0
A    -10   -7   -4   -1    0    3
T    -12   -9   -6   -3   -2    1
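The matrix above can be reproduced with a short sketch of the dynamic programming fill. It assumes the scoring used earlier in the slides (1 for a match, -1 for a substitution, -2 for an indel); the function name `alignment_matrix` is an assumption for this example, and the traceback that recovers the alignment itself is omitted.

```python
def alignment_matrix(a, b, match=1, substitution=-1, indel=-2):
    """Fill the global alignment score matrix for sequences a (rows) and b (columns)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * indel            # a prefix of a aligned only with spaces
    for j in range(1, m + 1):
        score[0][j] = j * indel            # a prefix of b aligned only with spaces
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else substitution
            score[i][j] = max(
                score[i - 1][j - 1] + pair,   # match a[i] with b[j]
                score[i - 1][j] + indel,      # match a[i] with a space
                score[i][j - 1] + indel,      # match b[j] with a space
            )
    return score


matrix = alignment_matrix("GGGCAT", "GGACA")
for row in matrix:
    print(row)
```

The bottom-right entry, 1, is the optimal global alignment score of the two sequences.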
Alignment Running Time
• Assume two sequences of n characters each
• Running time is O(n^2) (each entry of the matrix must be calculated)
Variations of Alignment Algorithm
• Gap penalty
• Local alignment
• Multiple alignment
Gap Penalty
• A gap is a number k of consecutive spaces
• k consecutive spaces are more probable than k isolated spaces
• Typical gap penalty function: a + b·k (affine gap penalty)
• Here the first space in a gap is penalized a+b, further spaces are penalized b each.
Gap Penalty Example
• Use gap penalty 1 + k for a gap of k spaces (with 1 for a match, -1 for a substitution)
A - A - C - A
A C T A T C A
• Score: -6
A A C - - - A
A C T A T C A
• Score: -4
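Both scores can be checked with a small sketch of affine gap scoring. It assumes 1 for a match and -1 for a substitution, as in the earlier similarity function, and a gap penalty of a + b·k with a = b = 1. The function and parameter names are assumptions for this example.

```python
def affine_score(top, bottom, match=1, substitution=-1, gap_open=1, gap_extend=1):
    """Score an alignment where a gap of k spaces costs gap_open + gap_extend * k."""
    score = 0
    gap_in_top = gap_in_bottom = False     # are we inside a run of spaces?
    for x, y in zip(top, bottom):
        if x == "-":
            score -= gap_extend + (0 if gap_in_top else gap_open)
            gap_in_top, gap_in_bottom = True, False
        elif y == "-":
            score -= gap_extend + (0 if gap_in_bottom else gap_open)
            gap_in_top, gap_in_bottom = False, True
        else:
            score += match if x == y else substitution
            gap_in_top = gap_in_bottom = False
    return score


print(affine_score("A-A-C-A", "ACTATCA"))  # three isolated spaces: -6
print(affine_score("AAC---A", "ACTATCA"))  # one gap of three spaces: -4
```

The single long gap scores better than three isolated spaces because the opening cost is paid only once.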
Local Alignment
• Find conserved regions in otherwise dissimilar sequences (e.g., viral and host DNA)
• Smith-Waterman algorithm
• Includes a fourth possibility at each step (don’t align: reset the score to 0)
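A minimal sketch of the local alignment recurrence follows. It differs from the global fill only by the extra option of scoring 0, which lets an alignment start anywhere. The scoring values (1, -1, -2) are assumed from the earlier slides, the function name is illustrative, and only the best score is returned, not the aligned substrings.

```python
def local_alignment_score(a, b, match=1, substitution=-1, indel=-2):
    """Return the best local alignment score of a and b (Smith-Waterman recurrence)."""
    best = 0
    previous = [0] * (len(b) + 1)          # row i-1 of the score matrix
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)       # row i; column 0 stays 0
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else substitution
            current[j] = max(
                0,                         # fourth option: do not align
                previous[j - 1] + pair,
                previous[j] + indel,
                current[j - 1] + indel,
            )
            best = max(best, current[j])   # the optimum may end at any cell
        previous = current
    return best


print(local_alignment_score("TTTACGTTT", "GGACGGG"))
```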
Local Alignment Example
• Align the following
G C T C T G C G A A T A
C G T T G A G A T A C T
Optimal Local Alignment
G C T C T G C G A A T A
C G T T G A G A T A C T
(G C T C) T G C G A A T A
(C G T) T G A G - A T A (C T)
Multiple Alignment
• Find the alignment among a set of sequences that maximizes the sum of scores for all pairs of sequences
• Dynamic programming run-time for k sequences of length n: O(k^2 2^k n^k)
• Multiple alignment is NP-complete
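The sum-of-pairs objective can be sketched as follows. The pair scores (1, -1, -2) are assumed from the earlier slides, and scoring a space against a space as 0 is a further assumption, since the slides do not specify it. Function names are illustrative.

```python
from itertools import combinations


def pair_score(x, y, match=1, substitution=-1, indel=-2):
    """Score one pair of symbols taken from the same alignment column."""
    if x == "-" and y == "-":
        return 0                           # assumed: two spaces contribute nothing
    if x == "-" or y == "-":
        return indel
    return match if x == y else substitution


def sum_of_pairs(rows):
    """Sum the pair scores over every column and every pair of rows."""
    total = 0
    for column in zip(*rows):
        for x, y in combinations(column, 2):
            total += pair_score(x, y)
    return total


print(sum_of_pairs(["AT", "A-", "-T"]))  # -6
```

This only evaluates a given multiple alignment; finding the alignment that maximizes the sum is the expensive part.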
Other Features
• Usually used for protein alignment
• Can be used for global or local alignment
Multiple Alignment Example
P E A A L Y G R F T - - - I K S D V W
P E S L A Y N K F - - - S I K S D V W
P E A L N Y G R Y - - - S S E S D V W
P E A L N Y G W Y - - - S S E S D V W
P E V I R M Q D D N P F S F Q S D V Y
Multiple vs. Pairwise Alignment
• Optimal multiple alignment does not imply optimal pairwise alignment
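Multiple alignment    Induced pairwise alignment (rows 2 and 3)
A T                   A -
A -                   - T
- T
• Aligned on their own, A and T would be placed in a single substitution column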
Substitution Matrices
• In homologous sequences certain amino acid substitutions are more likely to occur than others
• Types of substitution matrices
* PAM
* BLOSUM
PAM Matrices
• Defines units of evolutionary distance
• 1 PAM unit represents an average of one accepted point mutation per 100 amino acids
• Start with a set of highly similar sequences and compute
* pa = probability of occurrence of amino acid a
* Mab = probability of a mutating to b
PAM250 Matrix
  C S T P A G N D E Q H R K M I L V F Y W
C 12
S 0 2
T -2 1 3
P -3 1 0 6
A -2 1 1 1 2
G -3 1 0 -1 1 5
N -4 1 0 -1 0 0 2
D -5 0 0 -1 0 1 2 4
E -5 0 0 -1 0 0 1 3 4
Q -5 -1 -1 0 0 -1 1 2 2 4
H -3 -1 -1 0 -1 -2 2 1 1 3 6
R -4 0 -1 0 -2 -3 0 -1 -1 1 2 6
K -5 0 0 -1 -1 -2 1 0 0 1 0 3 5
M -5 -2 -1 -2 -1 -3 -2 -3 -2 -1 -2 0 0 6
I -2 -1 0 -2 -1 -3 -2 -2 -2 -2 -2 -2 -2 2 5
L -6 -3 -2 -3 -2 -4 -3 -4 -3 -2 -2 -3 -3 4 2 6
V -2 -1 0 -1 0 -1 -2 -2 -2 -2 -2 -2 -2 2 4 2 4
F -4 -3 -3 -5 -4 -5 -4 -6 -5 -5 -2 -4 -5 0 1 2 -1 9
Y 0 -3 -3 -5 -3 -5 -2 -4 -4 -4 0 -4 -4 -2 -1 -1 -2 7 10
W -8 -2 -5 -6 -6 -7 -4 -7 -7 -5 -3 2 -3 -4 -5 -2 -6 0 0 17
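To show how such a matrix is used, the sketch below scores an ungapped protein alignment with a handful of entries copied from the table above. The dictionary `PAM250_EXCERPT`, the function names and the two short example peptides are assumptions made for illustration; a real program would load the full matrix.

```python
# A few entries copied from the lower-triangular PAM250 table above (row, column).
PAM250_EXCERPT = {
    ("C", "C"): 12, ("S", "S"): 2, ("T", "S"): 1, ("T", "T"): 3,
    ("F", "F"): 9, ("Y", "F"): 7, ("Y", "Y"): 10, ("W", "W"): 17,
}


def pam_score(x, y):
    """Look up a symmetric substitution score stored in triangular form."""
    if (x, y) in PAM250_EXCERPT:
        return PAM250_EXCERPT[(x, y)]
    return PAM250_EXCERPT[(y, x)]          # the matrix is symmetric


def ungapped_score(top, bottom):
    """Sum substitution scores column by column (no spaces allowed)."""
    return sum(pam_score(x, y) for x, y in zip(top, bottom))


# Made-up peptides: C/C = 12, S/T = 1, F/Y = 7, W/W = 17
print(ungapped_score("CSFW", "CTYW"))  # 37
```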
BLOSUM Matrices
• Uses log-odds ratio similar to PAM
• Uses short highly conserved sequences
• BLOSUMx matrices are created after removing sequences that are more than x percent identical
• Better at local alignments
BLOSUM Matrices
• A motif is a conserved amino acid pattern found in a group of proteins with similar biological meaning (PROSITE)
• A block is a conserved amino acid pattern in a group of proteins (no spaces allowed in the pattern) (BLOCKS)
Motif Example
• Motif obtained from a group of 34 tubulin proteins
M[FYW] . . F[VLI]H . [FYW] . . EGM
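A motif written this way maps naturally onto a regular expression. The sketch below assumes that each "." stands for any single residue and that the spaces in the slide are only typographic. The test sequence is made up for illustration and is not a real tubulin sequence.

```python
import re

# Bracketed sets list the allowed residues; "." matches any single residue.
TUBULIN_MOTIF = re.compile(r"M[FYW]..F[VLI]H.[FYW]..EGM")


def find_motif(protein):
    """Return the start index of the first motif occurrence, or -1 if none."""
    found = TUBULIN_MOTIF.search(protein)
    return found.start() if found else -1


print(find_motif("AAMFAAFVHAWAAEGMAA"))  # made-up sequence containing the pattern
```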
Defining BLOSUM (I)
• BLOSUMn is built from blocks whose sequences are at most n% identical (BLOSUM62 is the most common)
• Consider all pairs of amino acids appearing in the same column in the blocks
Defining BLOSUM (II)
• Define n(i,j) to be the frequency that amino acids i,j appear in a column pair
• Define e(i,j) to be the frequency that amino acids i,j appear in any pair
• Define the BLOSUM entry
\[ s(i,j) = \log_2 \frac{n(i,j)}{e(i,j)} \]
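The entry is a log-odds ratio: positive when a pair is seen more often than expected, negative when it is seen less often. A minimal sketch, using made-up frequencies and an illustrative function name:

```python
import math


def blosum_entry(observed, expected):
    """Log-odds score log2(n(i,j) / e(i,j)) for one amino acid pair."""
    return math.log2(observed / expected)


# Made-up frequencies: a pair seen twice as often as expected scores about +1,
# a pair seen half as often as expected scores about -1.
print(blosum_entry(0.02, 0.01))
print(blosum_entry(0.005, 0.01))
```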
PAM vs. BLOSUM
• PAM derived from highly similar sequences (evolutionary model)
• BLOSUM derived from protein families sharing a common ancestor (conserved domain model)
FASTA
• Looks for sequences in a database similar to a query sequence
• Heuristic, exclusion method
• Compares query sequence to each database sequence (called the text)
FASTA Algorithm (I)
• Look for small substrings in query and text that exactly match (“hot spots”)
• Find ten best “diagonal runs” of hot spots
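The first two steps can be sketched as follows. A hot spot is taken to be an exact match of length k between query and text, and hot spots with the same offset j - i lie on the same diagonal. This sketch simply ranks diagonals by their number of hot spots, which is a simplification of the scoring FASTA applies to diagonal runs; all names are assumptions for this example.

```python
from collections import defaultdict


def hot_spots(query, text, k):
    """Return (i, j) pairs where query[i:i+k] equals text[j:j+k]."""
    positions = defaultdict(list)          # k-mer -> start positions in the query
    for i in range(len(query) - k + 1):
        positions[query[i:i + k]].append(i)
    spots = []
    for j in range(len(text) - k + 1):
        for i in positions.get(text[j:j + k], []):
            spots.append((i, j))
    return spots


def best_diagonals(spots, top=10):
    """Rank diagonals (offset j - i) by how many hot spots they contain."""
    counts = defaultdict(int)
    for i, j in spots:
        counts[j - i] += 1
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)[:top]


spots = hot_spots("ACGT", "TACGT", 2)
print(spots)
print(best_diagonals(spots))
```

Indexing the query's k-mers first means the text is scanned only once, which is what makes this step fast compared with a full alignment.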
FASTA Algorithm (II)
• Find best local alignment for each run
• Combine these into larger alignment
• Do multiple alignment on query and texts having highest score in last step
BLAST
• Basic Local Alignment Search Tool
• Heuristic, exclusion method
• Computes statistical significance of alignment scores
BLAST Algorithm
• Find all w-length substrings in text that align to some w-length substring in query with score above a given threshold (called “hits”)
• Extend these hits as far as possible (“segment pairs”)
• Report the highest scoring segment pairs
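The hit-and-extend idea can be sketched in a simplified form. This version assumes a plain match/mismatch score (1 and -1) instead of a substitution matrix, compares every word pair directly rather than using an index, and extends each hit without spaces to the best-scoring extent. It is not the actual BLAST implementation, and all names are assumptions.

```python
def pair(x, y, match=1, mismatch=-1):
    return match if x == y else mismatch


def find_hits(query, text, w, threshold):
    """Return (i, j) where the w-length words at query[i:] and text[j:] score >= threshold."""
    hits = []
    for i in range(len(query) - w + 1):
        for j in range(len(text) - w + 1):
            score = sum(pair(x, y) for x, y in zip(query[i:i + w], text[j:j + w]))
            if score >= threshold:
                hits.append((i, j))
    return hits


def extend_hit(query, text, i, j, w):
    """Extend a hit without spaces; return (query_start, text_start, length, score)."""
    best = running = sum(pair(x, y) for x, y in zip(query[i:i + w], text[j:j + w]))
    right, k = w, w                        # right = length of the best extent so far
    while i + k < len(query) and j + k < len(text):
        running += pair(query[i + k], text[j + k])
        k += 1
        if running > best:
            best, right = running, k
    running, left, k = best, 0, 1          # now grow to the left of the hit
    while i - k >= 0 and j - k >= 0:
        running += pair(query[i - k], text[j - k])
        if running > best:
            best, left = running, k
        k += 1
    return i - left, j - left, left + right, best


hits = find_hits("GGACGTCC", "TTACGTAA", 3, 3)
print(hits)
print(extend_hit("GGACGTCC", "TTACGTAA", 2, 2, 3))
```

In this toy example both hits lie on the same diagonal and extend to the same segment pair, ACGT against ACGT with score 4.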
Other Bioinformatics Algorithms
• Palindromes
• Tandem Repeats
• Longest Common Subsequence
• Double Digest (NP complete)
• Shortest Common Superstring (NP complete)