tics algorithms

Post on 05-Apr-2018

  • 7/31/2019 tics Algorithms

    1/47

    BIOINFORMATICS ALGORITHMS

    BASED ON THE 2009 TEACHING OF THE CAMBRIDGE COMPUTER SCIENCE PART II BIOINFORMATICS COURSE BY PIETRO LIÒ

    Vaughan Eveleigh, Jesus College, Cambridge University

    CONTENTS

    1 DNA and Protein Sequences
    1.1 Preparation
    1.1.1 Manhattan Tourist
    1.1.1.1 Naive Algorithm
    1.1.1.2 Dynamic Algorithm
    1.2 Strings
    1.2.1 Longest Common Subsequence
    1.2.2 Needleman-Wunsch (Global Alignment)
    1.2.3 Smith Waterman (Local Alignment)
    1.2.4 Affine Gaps
    1.2.5 Banded Dynamic Programming
    1.2.6 Computing Path with Linear Space
    1.2.7 Block Alignment
    1.2.8 Four Russians Block Alignment Speedup
    1.2.9 Four Russians Technique - Longest Common Sub-Expression
    1.2.10 Nussinov Algorithm
    1.2.11 BLAST (Multiple Alignment)
    1.2.12 Pattern Hunter (Multiple Alignment)
    1.2.13 BLAT (Multiple Alignment)
    1.3 Trees
    1.3.1 Parsimony
    1.3.1.1 Sankoff Algorithm
    1.3.1.2 Fitch's Algorithm
    1.3.2 Large Parsimony Problem
    1.3.3 Distance
    1.3.3.1 UPGMA
    1.3.3.2 Neighbour Joining
    1.3.4 Likelihood
    1.3.5 Bootstrapping Algorithm
    1.3.6 Prim's Algorithm
    1.4 Information Theory and DNA
    1.4.1 Information Content of a DNA Motif
    1.4.2 Entropy of Multiple Alignment
    1.4.3 Information Content of a String
    1.4.4 Motifs
    1.4.5 Exhaustive Search
    1.4.6 Gibbs Sampling
    1.5 Hidden Markov Models
    1.5.1 Forward Algorithm
    1.5.2 Viterbi Algorithm
    1.5.3 Backward Algorithm
    2 Working with Microarray
    2.1 Clustering
    2.1.1 Lloyd Algorithm (k-means)
    2.1.2 Greedy Algorithm (k-means)
    2.1.3 CAST (Cluster Affinity Search Technique)
    2.1.4 QT clustering
    2.1.5 Markov Clustering Algorithm
    2.2 Genetic Networks Analysis
    2.3 Systems Biology
    2.3.1 Gillespie Algorithm
    Complexity Summary

    1 DNA AND PROTEIN SEQUENCES

    1.1 PREPARATION

    DNA (deoxyribonucleic acid) uses a 4-letter alphabet (A, T, C, G).

    RNA (ribonucleic acid) also uses a 4-letter alphabet (A, U, C, G).

    Proteins use 20 amino acids (A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y).

    1.1.1 MANHATTAN TOURIST

    The problem: Given a weighted grid G, travel from the source (top left) to the sink (bottom right) along the highest-scoring path, travelling only south and east.

    The solution: The problem can be generalised to finding the longest path from the source to an arbitrary destination (i, j).

    1.1.1.1 Naive Algorithm

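    Start at the destination node and determine which of its immediate predecessors (the node to the north or the node to the west) gives the highest path score from the source. Recurse along each of these incoming edges.

    path(i,j)
        if (i = 0 and j = 0)
            return 0
        if (i = 0)
            return path(0, j-1) + edge (0,j-1) to (0,j)
        if (j = 0)
            return path(i-1, 0) + edge (i-1,0) to (i,0)
        X = path(i-1, j) + edge (i-1,j) to (i,j)
        Y = path(i, j-1) + edge (i,j-1) to (i,j)
        return max(X,Y)

    Time = O((i+j)!/(i! j!)), the number of source-to-destination paths; Space = O(1) apart from the recursion stack

    Although this exhaustive algorithm produces correct results, it is not efficient: many path values are computed repeatedly.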
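    The recursion above can be sketched in Python as follows. This is a minimal illustration, not the course's own code. The names naive_path, step_x and step_y are assumptions made for the example: step_x[x][y] holds the weight of the edge (x,y) to (x+1,y), and step_y[x][y] the weight of the edge (x,y) to (x,y+1). Nodes on the boundary are handled by following the single edge that leads into them.

```python
def naive_path(i, j, step_x, step_y):
    """Best score from (0, 0) to (i, j) by plain recursion (exponential time)."""
    if i == 0 and j == 0:
        return 0  # the source itself has score 0
    best = float("-inf")
    if i > 0:  # arrive along edge (i-1, j) -> (i, j)
        best = max(best, naive_path(i - 1, j, step_x, step_y) + step_x[i - 1][j])
    if j > 0:  # arrive along edge (i, j-1) -> (i, j)
        best = max(best, naive_path(i, j - 1, step_x, step_y) + step_y[i][j - 1])
    return best


# Hypothetical 3 x 3 grid of nodes (i = j = 2), weights chosen arbitrarily.
STEP_X = [[1, 0, 2], [4, 6, 5]]    # STEP_X[x][y]: edge (x, y) -> (x+1, y)
STEP_Y = [[3, 2], [0, 7], [3, 3]]  # STEP_Y[x][y]: edge (x, y) -> (x, y+1)

if __name__ == "__main__":
    print(naive_path(2, 2, STEP_X, STEP_Y))
```

    Every call spawns up to two further calls, so the same sub-paths are scored again and again, which is the inefficiency described above.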

    1.1.1.2 Dynamic Algorithm

    Dynamic programming improves on the naive algorithm by storing the results of previous computations and reusing them when they are needed again, so that no path value is computed more than once. Although this significantly improves the time complexity, the space complexity can in many cases be quite demanding.

    In the case of the Manhattan tourist problem we only need to store the values of 1 row and 1 column at any time.

    DynamicPath(i,j)
        S[0,0] = 0
        for x = 1 to i
            S[x,0] = S[x-1,0] + edge (x-1,0) to (x,0)
        for y = 1 to j
            S[0,y] = S[0,y-1] + edge (0,y-1) to (0,y)
        for x = 1 to i
            for y = 1 to j
                A = S[x,y-1] + edge (x,y-1) to (x,y)
                B = S[x-1,y] + edge (x-1,y) to (x,y)
                S[x,y] = max (A,B)
        return S[i,j]

    where S[x,y] are stored values.

    Time = O(ij), Space = O(i + j)

    If the DAG representing the city also contained diagonal paths, we would require a third condition in the final for loop.

    DynamicDiagonalPath(i,j)
        S[0,0] = 0
        for x = 1 to i
            S[x,0] = S[x-1,0] + edge (x-1,0) to (x,0)
        for y = 1 to j
            S[0,y] = S[0,y-1] + edge (0,y-1) to (0,y)
        for x = 1 to i
            for y = 1 to j
                A = S[x,y-1] + edge (x,y-1) to (x,y)
                B = S[x-1,y] + edge (x-1,y) to (x,y)
                C = S[x-1,y-1] + edge (x-1,y-1) to (x,y)
                S[x,y] = max (A,B,C)
        return S[i,j]

    Time = O(ij), Space = O(i + j + 1) = O(i + j)

    Many of the later algorithms will resemble the Manhattan tourist problem.
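    The two dynamic programming procedures can be sketched together in Python. This is a minimal illustration under assumed names: dynamic_path, step_x, step_y and diag are not from the course. step_x[x][y] is the weight of the edge (x,y) to (x+1,y), step_y[x][y] the weight of the edge (x,y) to (x,y+1), and the optional diag[x][y] the weight of the diagonal edge (x,y) to (x+1,y+1). For readability the sketch keeps the whole table S in memory rather than a single row and column.

```python
def dynamic_path(i, j, step_x, step_y, diag=None):
    """Best score from (0, 0) to (i, j), filling the table S once per node."""
    S = [[0] * (j + 1) for _ in range(i + 1)]
    # Boundary nodes can only be reached along the boundary itself.
    for x in range(1, i + 1):
        S[x][0] = S[x - 1][0] + step_x[x - 1][0]
    for y in range(1, j + 1):
        S[0][y] = S[0][y - 1] + step_y[0][y - 1]
    # Interior nodes take the best of their incoming edges.
    for x in range(1, i + 1):
        for y in range(1, j + 1):
            candidates = [
                S[x][y - 1] + step_y[x][y - 1],  # A in the pseudocode
                S[x - 1][y] + step_x[x - 1][y],  # B in the pseudocode
            ]
            if diag is not None:  # C, only when diagonal edges exist
                candidates.append(S[x - 1][y - 1] + diag[x - 1][y - 1])
            S[x][y] = max(candidates)
    return S[i][j]


# Hypothetical 3 x 3 grid of nodes (i = j = 2), weights chosen arbitrarily.
STEP_X = [[1, 0, 2], [4, 6, 5]]
STEP_Y = [[3, 2], [0, 7], [3, 3]]
DIAG = [[5, 1], [1, 9]]

if __name__ == "__main__":
    print(dynamic_path(2, 2, STEP_X, STEP_Y))
    print(dynamic_path(2, 2, STEP_X, STEP_Y, DIAG))
```

    Each node is visited exactly once, so the running time is proportional to the number of nodes in the grid.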

    1.2 STRINGS

    There are several ways by which we can compare the similarity of strings.

    Edit distance (non-trivial): the minimum number of operations (insertions, deletions and substitutions) required to transform one string into another.

    Hamming distance (trivial): the number of positions at which the i-th symbol of one string differs from the i-th symbol of another string of the same length.

    Consider two strings, the first of length 7 and the second of length 6:

    ATCTGAT

    TGCATA

    After comparing the first string against the second, we can count the number of matches, insertions and deletions.
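    The two distance measures can be sketched in Python as follows. This is an illustrative sketch rather than the course's own code. The function names hamming_distance and edit_distance are assumptions, and the edit distance shown is the standard unit-cost dynamic programming formulation, in which every insertion, deletion and substitution costs 1.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(ca != cb for ca, cb in zip(a, b))


def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for x, ca in enumerate(a, start=1):
        curr = [x]  # deleting the first x symbols of a gives the empty string
        for y, cb in enumerate(b, start=1):
            curr.append(min(
                prev[y] + 1,               # delete ca
                curr[y - 1] + 1,           # insert cb
                prev[y - 1] + (ca != cb),  # match (cost 0) or substitute (cost 1)
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # The two example strings have different lengths, so only the
    # edit distance is defined for them.
    print(edit_distance("ATCTGAT", "TGCATA"))
```

    The table filled by edit_distance has the same shape as the Manhattan tourist grid, with a minimum taken over the three incoming moves instead of a maximum.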

    1.2.1 LONGEST COMMON SUBSEQUENCE

    Although the Hamming distance is commonly used in computer science, the edit distance is of greater use in biology. By aligning two strings by their longest common