![Page 1: Part 2: Sequence search and comparisonkoutcher/lectures/lecture1-2.pdf · D.Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge](https://reader034.vdocument.in/reader034/viewer/2022042401/5f108c227e708231d449a5b8/html5/thumbnails/1.jpg)
Part 2: Sequence search and comparison
![Page 2: Part 2: Sequence search and comparisonkoutcher/lectures/lecture1-2.pdf · D.Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge](https://reader034.vdocument.in/reader034/viewer/2022042401/5f108c227e708231d449a5b8/html5/thumbnails/2.jpg)
Some books
• D. Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge University Press, 1997
• E. Ohlebusch, Bioinformatics Algorithms, 2013, www.oldenbusch-verlag.de
• V. Mäkinen et al., Genome-Scale Algorithm Design, Cambridge University Press, 2015
![Page 3: Part 2: Sequence search and comparisonkoutcher/lectures/lecture1-2.pdf · D.Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge](https://reader034.vdocument.in/reader034/viewer/2022042401/5f108c227e708231d449a5b8/html5/thumbnails/3.jpg)
Sequence alignment
![Page 4: Part 2: Sequence search and comparisonkoutcher/lectures/lecture1-2.pdf · D.Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge](https://reader034.vdocument.in/reader034/viewer/2022042401/5f108c227e708231d449a5b8/html5/thumbnails/4.jpg)
Sequence comparison
• Sequence comparison: the most ubiquitous task in bioinformatics
  ─ genome analysis: gene prediction, phylogeny reconstruction, repeats, …
  ─ RNA analysis
  ─ protein analysis
• Main assumptions:
  ─ similar sequences correspond to similar biological functions
  ─ similar sequences witness phylogenetic proximity
  ─ similar sequences fold to similar structures
![Page 5: Part 2: Sequence search and comparisonkoutcher/lectures/lecture1-2.pdf · D.Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge](https://reader034.vdocument.in/reader034/viewer/2022042401/5f108c227e708231d449a5b8/html5/thumbnails/5.jpg)
Example: insulin
[Figure: insulin sequences compared across species; row labels on the slide: elephant (three rows), hamster, whale, alligator]
![Page 6: Part 2: Sequence search and comparisonkoutcher/lectures/lecture1-2.pdf · D.Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge](https://reader034.vdocument.in/reader034/viewer/2022042401/5f108c227e708231d449a5b8/html5/thumbnails/6.jpg)
Another example
Image from: http://www.ncbi.nlm.nih.gov/
![Page 7: Part 2: Sequence search and comparisonkoutcher/lectures/lecture1-2.pdf · D.Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge](https://reader034.vdocument.in/reader034/viewer/2022042401/5f108c227e708231d449a5b8/html5/thumbnails/7.jpg)
Sequence alignment
• Given two sequences RDISLVKNAGI and RNILVSDAKNVGI
• 3 types of columns corresponding to 3 elementary evolutionary events
  ─ match
  ─ substitution (mismatch)
  ─ insertion, deletion (indel)
• Assign a score (positive or negative) to each event. Alignment score = sum of scores over all columns. Optimal alignment = one that maximizes the score
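The column-by-column scoring can be sketched as follows. This is a minimal illustration, not the lecture's code: the function name `alignment_score`, the score values (match 2, mismatch -1, indel -2) and the particular gapped rows are assumptions chosen for the example, and the alignment shown is just one possible alignment of the two sequences, not necessarily an optimal one.

```python
def alignment_score(row_s, row_t, match=2, mismatch=-1, indel=-2):
    """Sum the per-column scores of a gapped alignment given as two rows."""
    if len(row_s) != len(row_t):
        raise ValueError("alignment rows must have the same length")
    total = 0
    for a, b in zip(row_s, row_t):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        if a == "-" or b == "-":
            total += indel        # insertion or deletion column
        elif a == b:
            total += match        # match column
        else:
            total += mismatch     # substitution column
    return total


# One possible alignment of RDISLVKNAGI and RNILVSDAKNVGI (assumed, for illustration).
row_s = "RDISLV---KNAGI"
row_t = "RNI-LVSDAKNVGI"
print(alignment_score(row_s, row_t))  # 8 matches, 2 mismatches, 4 indels -> 6
```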
![Page 8: Part 2: Sequence search and comparisonkoutcher/lectures/lecture1-2.pdf · D.Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge](https://reader034.vdocument.in/reader034/viewer/2022042401/5f108c227e708231d449a5b8/html5/thumbnails/8.jpg)
Sequence alignment: scoring
• scoring function:
[Figure: scoring function with three scored examples: Score=19, Score=-11, Score=25]
![Page 9: Part 2: Sequence search and comparisonkoutcher/lectures/lecture1-2.pdf · D.Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge](https://reader034.vdocument.in/reader034/viewer/2022042401/5f108c227e708231d449a5b8/html5/thumbnails/9.jpg)
Sequence alignment: scoring
• BLOSUM62 matrix for protein sequences
![Page 10: Part 2: Sequence search and comparisonkoutcher/lectures/lecture1-2.pdf · D.Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge](https://reader034.vdocument.in/reader034/viewer/2022042401/5f108c227e708231d449a5b8/html5/thumbnails/10.jpg)
LCS: Longest Common Subsequence
• consider score match: 1, indel: 0, mismatch: -1

    -AGGCTCACCTGACT-CCAGGC-CGA--TGCC---
     || |||||  |||  |  ||  |||  ||||
    TAG-CTCAC--GAC-GC--GG-TCGATTTGCCGAC
• optimal alignment ~ longest common subsequence (LCS)
  ─ LCS(AGCGA,CAGATAGAG)=4
  ─ Score(S,T)=LCS(S,T)
  ─ d(S,T)=|S|+|T|-2·LCS(S,T): minimal number of indels required to transform S into T
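A minimal sketch of the LCS computation and of the indel distance derived from it. The function names `lcs_length` and `indel_distance` are illustrative; the code keeps only two rows of the dynamic-programming table.

```python
def lcs_length(s, t):
    """Length of a longest common subsequence of s and t."""
    prev = [0] * (len(t) + 1)
    for a in s:
        cur = [0]
        for j, b in enumerate(t, start=1):
            if a == b:
                cur.append(prev[j - 1] + 1)            # extend a common subsequence
            else:
                cur.append(max(prev[j], cur[j - 1]))   # skip a letter of s or of t
        prev = cur
    return prev[-1]


def indel_distance(s, t):
    """Minimal number of insertions and deletions turning s into t."""
    return len(s) + len(t) - 2 * lcs_length(s, t)


print(lcs_length("AGCGA", "CAGATAGAG"))      # 4
print(indel_distance("AGCGA", "CAGATAGAG"))  # 5 + 9 - 2*4 = 6
```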
![Page 12: Part 2: Sequence search and comparisonkoutcher/lectures/lecture1-2.pdf · D.Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge](https://reader034.vdocument.in/reader034/viewer/2022042401/5f108c227e708231d449a5b8/html5/thumbnails/12.jpg)
Levenshtein distance
• consider score match: 0, indel: -1, mismatch: -1

    -AGGCTCACCTGACTCCAGGCCGA--TGCC---
     || |||||  ||| |  || |||  ||||
    TAG-CTCAC--GACGC--GGTCGATTTGCCGAC

• optimal alignment ~ Levenshtein (edit) distance
  ─ minimal number of indels and substitutions required to transform S into T
  ─ edit(S,T) = -Score(S,T)
  ─ edit(ACAGT,CCGA)=3

    ACAGT
     | |
    CC-GA
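The edit distance can be computed with the same kind of table, minimizing costs instead of maximizing scores. A short sketch, with `edit_distance` as an illustrative name:

```python
def edit_distance(s, t):
    """Levenshtein distance: unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(t) + 1))          # distance from the empty prefix of s
    for i, a in enumerate(s, start=1):
        cur = [i]
        for j, b in enumerate(t, start=1):
            cur.append(min(prev[j - 1] + (a != b),  # match (0) or substitution (1)
                           prev[j] + 1,             # delete a from s
                           cur[j - 1] + 1))         # insert b into s
        prev = cur
    return prev[-1]


print(edit_distance("ACAGT", "CCGA"))  # 3
```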
![Page 13: Part 2: Sequence search and comparisonkoutcher/lectures/lecture1-2.pdf · D.Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge](https://reader034.vdocument.in/reader034/viewer/2022042401/5f108c227e708231d449a5b8/html5/thumbnails/13.jpg)
Bioinformatics: "CIGAR strings"
part of SAM format
    RefPos:     1  2  3  4  5  6  7     8  9 10 11 12 13 14 15 16 17 18 19
    Reference:  C  C  A  T  A  C  T     G  A  A  C  T  G  A  C  T  A  A  C
    Read:                   A  C  T  A  G  A  A     T  G  G  C  T

    POS: 5
    CIGAR: 3M1I3M1D5M
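To make the encoding concrete, here is a small sketch that rebuilds the two gapped rows from POS and CIGAR. The helper names `expand_cigar` and `render` are hypothetical, and only the operations M, I and D are handled; a real SAM parser has to deal with the remaining operations (soft clipping and so on) as well.

```python
import re


def expand_cigar(cigar):
    """Parse a CIGAR string into a list of (length, operation) pairs."""
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    if "".join(n + op for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return [(int(n), op) for n, op in ops]


def render(reference, read, pos, cigar):
    """Return (reference_row, read_row) for a read aligned at 1-based pos."""
    r, q = pos - 1, 0                      # cursors in reference and read
    ref_row, read_row = [], []
    for n, op in expand_cigar(cigar):
        if op == "M":                      # aligned columns (match or mismatch)
            ref_row.append(reference[r:r + n])
            read_row.append(read[q:q + n])
            r += n
            q += n
        elif op == "I":                    # letters present only in the read
            ref_row.append("-" * n)
            read_row.append(read[q:q + n])
            q += n
        elif op == "D":                    # letters present only in the reference
            ref_row.append(reference[r:r + n])
            read_row.append("-" * n)
            r += n
        else:
            raise NotImplementedError(f"operation {op} not handled in this sketch")
    return "".join(ref_row), "".join(read_row)


ref = "CCATACTGAACTGACTAAC"
read = "ACTAGAATGGCT"
print(render(ref, read, 5, "3M1I3M1D5M"))  # ('ACT-GAACTGACT', 'ACTAGAA-TGGCT')
```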
![Page 14: Part 2: Sequence search and comparisonkoutcher/lectures/lecture1-2.pdf · D.Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge](https://reader034.vdocument.in/reader034/viewer/2022042401/5f108c227e708231d449a5b8/html5/thumbnails/14.jpg)
Computing Score(S,T)
• assume each indel scores -d (d>0 is the indel penalty), s(x,y) is the score of aligning x with y (match or mismatch), and S[0..n-1] and T[0..m-1] are the input strings
• Idea: compute Score[i,j]: optimal score between S[0..i-1] and T[0..j-1]

$$
\mathrm{Score}[i,j] = \max
\begin{cases}
\mathrm{Score}[i-1,j-1] + s(S[i-1],T[j-1]) \\
\mathrm{Score}[i-1,j] - d \\
\mathrm{Score}[i,j-1] - d
\end{cases}
$$

• initialization: Score[0,0]=0, Score[0,j]=-jd, Score[i,0]=-id
• resulting score: Score[n,m]
• Dynamic Programming!
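The recurrence translates almost line by line into code. The sketch below assumes a simple scoring function with one match score and one mismatch score (defaults 2 and -1) and an indel penalty d=2; the name `nw_matrix` is illustrative. The strings in the usage line are those of the lecture's worked example.

```python
def nw_matrix(s, t, match=2, mismatch=-1, d=2):
    """Fill the global-alignment score matrix; score[i][j] compares s[:i] with t[:j]."""
    n, m = len(s), len(t)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = -i * d               # s[:i] against the empty string
    for j in range(1, m + 1):
        score[0][j] = -j * d               # empty string against t[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align s[i-1] with t[j-1]
                              score[i - 1][j] - d,        # s[i-1] against a gap
                              score[i][j - 1] - d)        # t[j-1] against a gap
    return score


table = nw_matrix("ACTGTAT", "ACGGCTAT")
print(table[-1][-1])  # 9
```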
![Page 15: Part 2: Sequence search and comparisonkoutcher/lectures/lecture1-2.pdf · D.Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge](https://reader034.vdocument.in/reader034/viewer/2022042401/5f108c227e708231d449a5b8/html5/thumbnails/15.jpg)
Example
s(x,x)=2, s(x,y)=-1 for x≠y, d=2 (each indel scores -2)
S=ACTGTAT (rows), T=ACGGCTAT (columns). Row 0 and column 0 are initialized first, then the matrix is filled row by row.

|   |   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|---|
|   |   |   | A | C | G | G | C | T | A | T |
| 0 |   | 0 | -2 | -4 | -6 | -8 | -10 | -12 | -14 | -16 |
| 1 | A | -2 | 2 | 0 | -2 | -4 | -6 | -8 | -10 | -12 |
| 2 | C | -4 | 0 | 4 | 2 | 0 | -2 | -4 | -6 | -8 |
| 3 | T | -6 | -2 | 2 | 3 | 1 | -1 | 0 | -2 | -4 |
| 4 | G | -8 | -4 | 0 | 4 | 5 | 3 | 1 | -1 | -3 |
| 5 | T | -10 | -6 | -2 | 2 | 3 | 4 | 5 | 3 | 1 |
| 6 | A | -12 | -8 | -4 | 0 | 1 | 2 | 3 | 7 | 5 |
| 7 | T | -14 | -10 | -6 | -2 | -1 | 0 | 4 | 5 | 9 |

Score(S,T) = Score[7,8] = 9
![Page 22: Part 2: Sequence search and comparisonkoutcher/lectures/lecture1-2.pdf · D.Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge](https://reader034.vdocument.in/reader034/viewer/2022042401/5f108c227e708231d449a5b8/html5/thumbnails/22.jpg)
How to recover the alignment?
s(x,x)=2, s(x,y)=-1 for x≠y, d=2
• trace back through the score matrix of the example from Score[7,8]=9 to Score[0,0]: from each cell, move to a neighbouring cell (diagonal, up or left) from which its value was obtained in the recurrence
• read from Score[0,0], the path gives the alignment:

    ACGGCTAT
    ACTG-TAT
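A possible traceback, sketched under the same assumptions as the example (match 2, mismatch -1, indel penalty d=2). The name `global_align` and the tie-breaking order (diagonal first, then up, then left) are choices of this sketch; a different order may return a different optimal alignment.

```python
def global_align(s, t, match=2, mismatch=-1, d=2):
    """Return (score, row_s, row_t) for one optimal global alignment."""
    n, m = len(s), len(t)

    def sub(a, b):
        return match if a == b else mismatch

    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = -i * d
    for j in range(1, m + 1):
        score[0][j] = -j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(s[i - 1], t[j - 1]),
                              score[i - 1][j] - d,
                              score[i][j - 1] - d)

    # Walk back from the bottom-right corner, re-checking which case gave the maximum.
    row_s, row_t = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(s[i - 1], t[j - 1]):
            row_s.append(s[i - 1])
            row_t.append(t[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] - d:
            row_s.append(s[i - 1])
            row_t.append("-")
            i -= 1
        else:
            row_s.append("-")
            row_t.append(t[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(row_s)), "".join(reversed(row_t))


print(global_align("ACTGTAT", "ACGGCTAT"))  # (9, 'ACTG-TAT', 'ACGGCTAT')
```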
![Page 24: Part 2: Sequence search and comparisonkoutcher/lectures/lecture1-2.pdf · D.Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge](https://reader034.vdocument.in/reader034/viewer/2022042401/5f108c227e708231d449a5b8/html5/thumbnails/24.jpg)
Alignment: graph formulation
[Figure: grid graph over the score matrix, columns labelled A C G G C T A T and rows labelled A C T G T A T; edge weights: -2 for indels, -1 for mismatch, 2 for match]
• an optimal alignment corresponds to a max-cost path in this graph; here cost=9
![Page 26: Part 2: Sequence search and comparisonkoutcher/lectures/lecture1-2.pdf · D.Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge](https://reader034.vdocument.in/reader034/viewer/2022042401/5f108c227e708231d449a5b8/html5/thumbnails/26.jpg)
Exercise
• Give all optimal alignments between ACCGTTG and CGAATGAA if the match score is 2, the mismatch score is -1 and the indel score (gap penalty) is -2
![Page 27: Part 2: Sequence search and comparisonkoutcher/lectures/lecture1-2.pdf · D.Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge](https://reader034.vdocument.in/reader034/viewer/2022042401/5f108c227e708231d449a5b8/html5/thumbnails/27.jpg)
Comments
• algorithm known as the Needleman-Wunsch algorithm (1970)
• note that the optimal alignment is generally not unique
• the problem considered is called global alignment
• both time and space complexity are O(n²)
• space complexity is O(n) if only the optimal score has to be computed (e.g. line by line, keeping two lines at a time)
• time can be reduced to O(n²/log² n) (assuming RAM model) [Masek, Paterson 80] using the "four-Russians technique" (another solution in [Crochemore, Landau, Ziv-Ukelson 03])
• proved unlikely to be solvable in time O(n^(2-ε)) [Abboud, Williams, Weimann 14] (by reduction from 3SUM to some versions of the alignment problem)
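The "two lines at a time" remark can be sketched as follows. The function name `nw_score_linear_space` and the default scores (match 2, mismatch -1, d=2) are assumptions of this sketch; swapping the strings so that the shorter one indexes the columns relies on the scoring function being symmetric.

```python
def nw_score_linear_space(s, t, match=2, mismatch=-1, d=2):
    """Optimal global alignment score using O(min(|s|, |t|)) memory."""
    if len(t) > len(s):
        s, t = t, s                        # keep the rows as short as possible
    prev = [-j * d for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        cur = [-i * d]
        for j, b in enumerate(t, start=1):
            sub = match if a == b else mismatch
            cur.append(max(prev[j - 1] + sub, prev[j] - d, cur[j - 1] - d))
        prev = cur                         # the older line is no longer needed
    return prev[-1]


print(nw_score_linear_space("ACTGTAT", "ACGGCTAT"))  # 9
```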
![Page 28: Part 2: Sequence search and comparisonkoutcher/lectures/lecture1-2.pdf · D.Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge](https://reader034.vdocument.in/reader034/viewer/2022042401/5f108c227e708231d449a5b8/html5/thumbnails/28.jpg)
Exercises (1)
• End-space free alignment of S and T
  ─ compute the best alignment of S and T such that spaces at string borders contribute 0
[Figure: three relative placements of S and T; the last one is labelled "suffix-prefix overlap"]
![Page 29: Part 2: Sequence search and comparisonkoutcher/lectures/lecture1-2.pdf · D.Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge](https://reader034.vdocument.in/reader034/viewer/2022042401/5f108c227e708231d449a5b8/html5/thumbnails/29.jpg)
Exercises (2)
[Figure: pattern P placed against a substring of text T]
• Approximate occurrences of P in T
  ─ compute all alignments such that Score(P,T[i..j])>δ
• Particular cases
  ─ edit distance (<k): O(kn) [Landau&Vishkin 85, Galil&Park 89, …]
  ─ Hamming distance: O(n·log(m)) [Fischer&Paterson 73], O(nk) [Galil&Giancarlo 86], O(n√(k·log(k))) [Amir&Lewenstein&Porat 04], …
![Page 31: Part 2: Sequence search and comparisonkoutcher/lectures/lecture1-2.pdf · D.Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge](https://reader034.vdocument.in/reader034/viewer/2022042401/5f108c227e708231d449a5b8/html5/thumbnails/31.jpg)
Computing alignment in linear space
• Hirschberg (1975) proposed a nice trick to compute the optimal alignment in linear space (at the price of doubling the time)
• Key observation: an optimal alignment path crosses row n/2 of the matrix at a column k* given by

$$
k^* = \arg\max_k \left( \mathrm{Score}[n/2, k] + \mathrm{Score}^R[n/2, m-k] \right)
$$

  where ScoreR is the score matrix computed on the reversed strings
• compute Score[n/2,k] for all k and ScoreR[n/2,m-k] for all k, each in linear space
• compute k*=argmax_k(Score[n/2,k]+ScoreR[n/2,m-k])
• recurse on the top-left part (first n/2 rows, first k* columns) and on the bottom-right part (remaining rows and columns)
[Figure: n×m matrix with row n/2 and the crossing column k* marked; the two sub-rectangles are processed recursively]
![Page 38: Part 2: Sequence search and comparisonkoutcher/lectures/lecture1-2.pdf · D.Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge](https://reader034.vdocument.in/reader034/viewer/2022042401/5f108c227e708231d449a5b8/html5/thumbnails/38.jpg)
Resulting complexity
• if the Score computation on a p×q matrix takes time c·pq, then computing the first "cut" takes 2·c·(n/2)·m = c·nm
• the two subproblems after the first cut take time c·(n/2)·k* + c·(n/2)·(m-k*) = 1/2·c·nm
• all recursive calls together take time c·nm + 1/2·c·nm + 1/4·c·nm + … ≤ 2c·nm
![Page 39: Part 2: Sequence search and comparisonkoutcher/lectures/lecture1-2.pdf · D.Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge](https://reader034.vdocument.in/reader034/viewer/2022042401/5f108c227e708231d449a5b8/html5/thumbnails/39.jpg)
Local alignment
• Biologists are mostly interested in local alignments, which may ignore arbitrary prefixes and suffixes of the input sequences
[Figure: a substring of S aligned with a substring of T]
• Problem: compute all significant local alignments, i.e. all alignments with a score above a threshold
![Page 41: Part 2: Sequence search and comparisonkoutcher/lectures/lecture1-2.pdf · D.Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge](https://reader034.vdocument.in/reader034/viewer/2022042401/5f108c227e708231d449a5b8/html5/thumbnails/41.jpg)
Smith-Waterman algorithm (1981)
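• Assume matches are scored positively and mismatches/indels are scored negatively
• Score[i,j]: maximal score over all substrings of S that end at position i and all substrings of T that end at position j (positions counted from 1, i.e. over all suffixes of S[0..i-1] and T[0..j-1]; empty substrings score 0)

$$
\mathrm{Score}[i,j] = \max
\begin{cases}
0 \\
\mathrm{Score}[i-1,j-1] + s(S[i-1],T[j-1]) \\
\mathrm{Score}[i-1,j] - d \\
\mathrm{Score}[i,j-1] - d
\end{cases}
$$

• initialization: Score[0,j]=Score[i,0]=0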
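A minimal sketch of the local-alignment recurrence with a traceback from the best cell. The name `smith_waterman`, the default scores (match 1, mismatch -3, d=1) and the tie-breaking order are assumptions of this sketch; it reports a single best-scoring local alignment rather than all alignments above a threshold.

```python
def smith_waterman(s, t, match=1, mismatch=-3, d=1):
    """Return (score, row_s, row_t) for one best local alignment."""
    n, m = len(s), len(t)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            score[i][j] = max(0,                          # start a new alignment
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] - d,
                              score[i][j - 1] - d)
            if score[i][j] > best:
                best, bi, bj = score[i][j], i, j

    # Trace back from the best cell until a cell of value 0 is reached.
    row_s, row_t = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            row_s.append(s[i - 1])
            row_t.append(t[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] - d:
            row_s.append(s[i - 1])
            row_t.append("-")
            i -= 1
        else:
            row_s.append("-")
            row_t.append(t[j - 1])
            j -= 1
    return best, "".join(reversed(row_s)), "".join(reversed(row_t))


print(smith_waterman("EAWACQGKL", "ERDAWCQPGKWY"))  # (4, 'AWACQ-GK', 'AW-CQPGK')
```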
![Page 42: Part 2: Sequence search and comparisonkoutcher/lectures/lecture1-2.pdf · D.Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge](https://reader034.vdocument.in/reader034/viewer/2022042401/5f108c227e708231d449a5b8/html5/thumbnails/42.jpg)
Smith-Waterman: example
EAWACQGKL vs ERDAWCQPGKWY
s(x,x)=1, s(x,y)=-3 for x≠y, d=1 (each indel scores -1)
resulting local alignment:
![Page 43: Part 2: Sequence search and comparisonkoutcher/lectures/lecture1-2.pdf · D.Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge](https://reader034.vdocument.in/reader034/viewer/2022042401/5f108c227e708231d449a5b8/html5/thumbnails/43.jpg)
Comments
• The score matrix is important
• The average value of the score matrix should be negative
• There exists a statistical model (Karlin&Altschul 90) that relates the score of a local alignment to the probability of this alignment appearing in random sequences (p-value)
![Page 44: Part 2: Sequence search and comparisonkoutcher/lectures/lecture1-2.pdf · D.Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge](https://reader034.vdocument.in/reader034/viewer/2022042401/5f108c227e708231d449a5b8/html5/thumbnails/44.jpg)
More complex gap penalty systems
• Affine gap penalty: h+q·i for a gap of length i
  ─ h: gap opening penalty
  ─ q: gap extension penalty
  ─ O(mn) algorithm [Gotoh 82]
• Convex gap penalty: O(mn·log n)
• Arbitrary gap penalty: O(mn²+nm²)
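For the affine case, a common way to reach O(mn) is to keep three matrices, one per type of last column. The sketch below follows that idea for global alignment and returns only the score; the name `affine_score`, the matrix names M, X, Y and the default values (match 2, mismatch -1, h=3, q=1) are assumptions of this sketch, not taken from the slides.

```python
NEG = float("-inf")


def affine_score(s, t, match=2, mismatch=-1, h=3, q=1):
    """Optimal global score when a gap of length i costs h + q*i."""
    n, m = len(s), len(t)
    # M: last column pairs two letters; X: letter of s over a gap; Y: gap over letter of t.
    M = [[NEG] * (m + 1) for _ in range(n + 1)]
    X = [[NEG] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = -h - q * i               # one gap of length i
    for j in range(1, m + 1):
        Y[0][j] = -h - q * j               # one gap of length j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + sub
            X[i][j] = max(M[i - 1][j] - h - q,   # open a new gap
                          Y[i - 1][j] - h - q,   # open a new gap after the other kind
                          X[i - 1][j] - q)       # extend the current gap
            Y[i][j] = max(M[i][j - 1] - h - q,
                          X[i][j - 1] - h - q,
                          Y[i][j - 1] - q)
    return max(M[n][m], X[n][m], Y[n][m])


print(affine_score("AAAA", "AA"))                      # 2 matches, one gap of length 2: 4 - (3 + 2) = -1
print(affine_score("ACTGTAT", "ACGGCTAT", h=0, q=2))   # h=0 gives the linear penalty d=2: 9
```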