BIOINFORMATICS ALGORITHMS
BASED ON THE 2009 TEACHING OF THE CAMBRIDGE COMPUTER SCIENCE PART II BIOINFORMATICS COURSE BY PIETRO LIÒ
Vaughan Eveleigh
Jesus College, Cambridge University
CONTENTS
1 DNA and Protein Sequences
  1.1 Preparation
    1.1.1 Manhattan Tourist
      1.1.1.1 Naive Algorithm
      1.1.1.2 Dynamic Algorithm
  1.2 Strings
    1.2.1 Longest Common Subsequence
    1.2.2 Needleman-Wunsch (Global Alignment)
    1.2.3 Smith-Waterman (Local Alignment)
    1.2.4 Affine Gaps
    1.2.5 Banded Dynamic Programming
    1.2.6 Computing Path with Linear Space
    1.2.7 Block Alignment
    1.2.8 Four Russians Block Alignment Speedup
    1.2.9 Four Russians Technique - Longest Common Sub-Expression
    1.2.10 Nussinov Algorithm
    1.2.11 BLAST (Multiple Alignment)
    1.2.12 Pattern Hunter (Multiple Alignment)
    1.2.13 BLAT (Multiple Alignment)
  1.3 Trees
    1.3.1 Parsimony
      1.3.1.1 Sankoff Algorithm
      1.3.1.2 Fitch's Algorithm
    1.3.2 Large Parsimony Problem
    1.3.3 Distance
      1.3.3.1 UPGMA
      1.3.3.2 Neighbour Joining
    1.3.4 Likelihood
    1.3.5 Bootstrapping Algorithm
    1.3.6 Prim's Algorithm
  1.4 Information Theory and DNA
    1.4.1 Information Content of a DNA Motif
    1.4.2 Entropy of Multiple Alignment
    1.4.3 Information Content of a String
    1.4.4 Motifs
    1.4.5 Exhaustive Search
    1.4.6 Gibbs Sampling
  1.5 Hidden Markov Models
    1.5.1 Forward Algorithm
    1.5.2 Viterbi Algorithm
    1.5.3 Backward Algorithm
2 Working with Microarrays
  2.1 Clustering
    2.1.1 Lloyd Algorithm (k-means)
    2.1.2 Greedy Algorithm (k-means)
    2.1.3 CAST (Cluster Affinity Search Technique)
    2.1.4 QT Clustering
    2.1.5 Markov Clustering Algorithm
  2.2 Genetic Networks Analysis
  2.3 Systems Biology
    2.3.1 Gillespie Algorithm
Complexity Summary
1 DNA AND PROTEIN SEQUENCES
1.1 PREPARATION
DNA (Deoxyribonucleic acid) uses a 4-letter alphabet (A,T,C,G)
RNA (Ribonucleic acid) also uses a 4-letter alphabet (A,U,C,G)
Proteins use 20 amino acids (A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y)
1.1.1 MANHATTAN TOURIST
The problem
Given a weighted grid G, travel from the source (top left) to the sink (bottom right) along the highest-scoring path, travelling only south and east.
The solution
The problem can be generalised to finding the longest path from the source to an arbitrary destination (i,j).

1.1.1.1 Naive Algorithm
Start at the destination node and calculate which of the immediately adjacent nodes
has the highest path score from the source. For each of these edges, recurse.
path(i,j)
    if (i = 0 and j = 0)
        return 0
    if (i = 0)
        return path(0, j-1) + edge (0,j-1) to (0,j)
    if (j = 0)
        return path(i-1, 0) + edge (i-1,0) to (i,0)
    X = path(i-1, j) + edge (i-1,j) to (i,j)
    Y = path(i, j-1) + edge (i,j-1) to (i,j)
    return max(X, Y)

T = O((i+j)! / (i!·j!))    S = O(1)
Although this exhaustive algorithm produces accurate results, it is not efficient: many path values are repeatedly computed.
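To make the repeated computation concrete, here is a minimal Python sketch of the naive recursion. The input representation is an assumption for illustration, not part of the original notes: south[x][y] and east[x][y] are taken to hold the weights of the edges entering node (x,y) from the north and from the west respectively.

# Naive recursive sketch (assumed inputs):
# south[x][y] = weight of edge (x-1,y) to (x,y); east[x][y] = weight of edge (x,y-1) to (x,y)
def naive_path(south, east, i, j):
    if i == 0 and j == 0:
        return 0
    if i == 0:                   # on the top row we can only have arrived from the west
        return naive_path(south, east, 0, j - 1) + east[0][j]
    if j == 0:                   # in the left column we can only have arrived from the north
        return naive_path(south, east, i - 1, 0) + south[i][0]
    x = naive_path(south, east, i - 1, j) + south[i][j]   # path(i-1,j) is recomputed many times
    y = naive_path(south, east, i, j - 1) + east[i][j]    # path(i,j-1) is recomputed many times
    return max(x, y)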
1.1.1.2 Dynamic Algorithm
Dynamic programming improves the naive algorithm by storing the results of previous
computations and reusing them when required at a later stage. The idea behind a
dynamic algorithm is that unnecessary calculations are not re-computed. Although this
significantly improves time complexity, in many cases the space complexity can be
quite demanding.
In the case of the Manhattan tourist problem we only need to store the values of 1 row
and 1 column at any time.
DynamicPath(i,j)
    S[0,0] = 0
    for x = 1 to i
        S[x,0] = S[x-1,0] + edge (x-1,0) to (x,0)
    for y = 1 to j
        S[0,y] = S[0,y-1] + edge (0,y-1) to (0,y)
    for x = 1 to i
        for y = 1 to j
            A = S[x,y-1] + edge (x,y-1) to (x,y)
            B = S[x-1,y] + edge (x-1,y) to (x,y)
            S[x,y] = max(A, B)
    return S[i,j]
where S[x,y] are the stored values.

T = O(ij)    S = O(i + j)

If our DAG representing the city were also to contain diagonal paths, we would require a third condition in the final for loop.
DynamicDiagonalPath(i,j)
    S[0,0] = 0
    for x = 1 to i
        S[x,0] = S[x-1,0] + edge (x-1,0) to (x,0)
    for y = 1 to j
        S[0,y] = S[0,y-1] + edge (0,y-1) to (0,y)
    for x = 1 to i
        for y = 1 to j
            A = S[x,y-1] + edge (x,y-1) to (x,y)
            B = S[x-1,y] + edge (x-1,y) to (x,y)
            C = S[x-1,y-1] + edge (x-1,y-1) to (x,y)
            S[x,y] = max(A, B, C)
    return S[i,j]

T = O(ij)    S = O(i + j + 1) = O(i + j)

Many of the algorithms that follow will resemble the Manhattan tourist problem.
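Before moving on, here is a minimal Python sketch of the dynamic version, with the diagonal move included as an optional third candidate. The weight arrays south, east and diag are the same illustrative assumption as in the earlier sketch, not part of the original notes.

# Dynamic-programming sketch; pass diag = None when the grid has no diagonal edges
def dynamic_path(south, east, diag, i, j):
    S = [[0] * (j + 1) for _ in range(i + 1)]   # S[x][y] = best score from (0,0) to (x,y)
    for x in range(1, i + 1):                   # left column: only south moves are possible
        S[x][0] = S[x - 1][0] + south[x][0]
    for y in range(1, j + 1):                   # top row: only east moves are possible
        S[0][y] = S[0][y - 1] + east[0][y]
    for x in range(1, i + 1):
        for y in range(1, j + 1):
            candidates = [S[x - 1][y] + south[x][y],
                          S[x][y - 1] + east[x][y]]
            if diag is not None:                # the third condition for the diagonal DAG
                candidates.append(S[x - 1][y - 1] + diag[x][y])
            S[x][y] = max(candidates)
    return S[i][j]

The sketch stores the whole table for clarity; since S[x][y] depends only on the current and previous rows, the table can be shrunk to a single row updated in place, which is the point made above about only needing to store one row and one column at a time.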
1.2 STRINGS
There are several ways in which we can compare the similarity of strings:

- Edit distance (non-trivial): the minimum number of operations (insertions, deletions and substitutions) required to transform one string into another
- Hamming distance (trivial): the number of differences found when comparing the value of one string against the value of another, position by position
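For example, the Hamming distance can be computed in a single pass; this small Python sketch (the function name and example strings are illustrative, not from the notes) assumes the two strings have equal length, which is exactly the restriction that makes the edit distance more useful for sequences of differing lengths.

def hamming_distance(a, b):
    # Number of positions at which two equal-length strings differ
    if len(a) != len(b):
        raise ValueError("Hamming distance is only defined for strings of equal length")
    return sum(1 for ca, cb in zip(a, b) if ca != cb)

# e.g. hamming_distance("ATCTGAT", "ATCAGTT") == 2   (illustrative strings)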
Consider the two strings v of length 7 and w of length 6:

v: ATCTGAT
w: TGCATA
After comparing string v against string w we can count the number of matches, insertions and deletions.
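To show what that comparison involves, here is a hedged Python sketch of the edit distance defined above, using the standard dynamic-programming recurrence; the alignment algorithms in the following sections develop this idea properly, and the function name is an assumption for illustration.

def edit_distance(v, w):
    # D[x][y] = minimum insertions, deletions and substitutions needed to turn
    # the first x characters of v into the first y characters of w
    D = [[0] * (len(w) + 1) for _ in range(len(v) + 1)]
    for x in range(1, len(v) + 1):
        D[x][0] = x                                 # delete x characters of v
    for y in range(1, len(w) + 1):
        D[0][y] = y                                 # insert y characters of w
    for x in range(1, len(v) + 1):
        for y in range(1, len(w) + 1):
            cost = 0 if v[x - 1] == w[y - 1] else 1
            D[x][y] = min(D[x - 1][y] + 1,          # deletion
                          D[x][y - 1] + 1,          # insertion
                          D[x - 1][y - 1] + cost)   # match or substitution
    return D[len(v)][len(w)]

# edit_distance("ATCTGAT", "TGCATA") gives the edit distance between v and w above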
1.2.1 LONGEST COMMON SUBSEQUENCE

Although the Hamming distance is commonly used in computer science, the edit distance is of greater use in biology. By aligning two strings by their longest common