A fast algorithm for approximate string matching on gene sequences
04/19/23 1
Zheng Liu, Xin Chen, James Borneman and Tao Jiang
University of California, Riverside
Outline
Background and motivation
Idea and analysis for FAAST
Experimental results
Conclusion
Background
Approximate string matching
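pattern: $P = p_1 p_2 \dots p_m$
text: $T = t_1 t_2 \dots t_n$
k-mismatch: find the positions where P occurs in T with at most k mismatched characters
k-difference: find the positions where P occurs in T with at most k insertions, deletions, or substitutions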
Applications: text processing and gene sequence analysis.
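As a baseline for the k-mismatch problem stated above, here is a minimal brute-force sketch. The function name `k_mismatch_naive` is a hypothetical helper, not something from the slides; it tests every alignment and takes O(mn) time, which is the cost the Boyer-Moore style algorithms on the following slides try to avoid.

```python
def k_mismatch_naive(text, pattern, k):
    """Return every 0-based start where pattern matches text with <= k mismatches."""
    m = len(pattern)
    hits = []
    for start in range(len(text) - m + 1):
        # Hamming distance between the pattern and the aligned text window.
        mismatches = sum(a != b for a, b in zip(pattern, text[start:start + m]))
        if mismatches <= k:
            hits.append(start)
    return hits


if __name__ == "__main__":
    # A primer-like pattern searched with at most one mismatch.
    print(k_mismatch_naive("AAGTCAACGTC", "AAGTC", 1))
```

With k = 0 the same routine reduces to exact matching.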
Motivation of FAAST
Motivation: Gene sequence acquisition
Modeled as the k-mismatch problem
Primers: AAGTC CCGTA
AAGTC………CCGTATACTT………CCGTT
…ACGTC………GCGTA
…
AAGTC………CCGTA…
ACGTC………GCGTA
Algorithms for the k-mismatch problem
1992: Shift-Add, by Baeza-Yates and Gonnet.
1996: Boyer-Moore (BM) with Shift-Add, by El-Mabrouk and Crochemore.
1993: BM extension (bad-character rule), by Tarhio and Ukkonen.
1994: BM extension (good-suffix rule), by Baeza-Yates and Gonnet.
FAAST
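A further generalization of the Tarhio-Ukkonen algorithm.
Tarhio-Ukkonen: check the last k+1 aligned characters.
$$
\begin{array}{l}
t_{j-m+1}\; t_{j-m+2}\; \dots\; t_{j-k}\; \dots\; t_{j-2}\; t_{j-1}\; t_j \\
p_1\; p_2\; \dots\; p_{m-k}\; \dots\; p_{m-2}\; p_{m-1}\; p_m
\end{array}
$$
FAAST: check the last k+x aligned characters.
$$
\begin{array}{l}
t_{j-m+1}\; t_{j-m+2}\; \dots\; t_{j-k-x+1}\; \dots\; t_{j-k}\; \dots\; t_{j-2}\; t_{j-1}\; t_j \\
p_1\; p_2\; \dots\; p_{m-k-x+1}\; \dots\; p_{m-k}\; \dots\; p_{m-2}\; p_{m-1}\; p_m
\end{array}
$$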
Algorithm outline
T: AACTGTTAACTTGCGACTAG (k=2, x=2)
P: AAATCGTAAC
AAATCGTAAC Χ AAATCGTAAC Χ
……… Χ AAATCGTAAC ☺ -after first shift
(6)
An example
k=2, x=3, m=10, n=20
T: AACTGTTAACTTGCGACTAG
P: AAATCGTAAC
Shift 1 by Tarhio-Ukkonen:
T: AACTGTTAACTTGCGACTAG
P:  AAATCGTAAC
Shift 6 by FAAST:
T: AACTGTTAACTTGCGACTAG
P:       AAATCGTAAC
Construction of shift table
Heuristic: after the shift, the last k+x text characters of the current window (or the y of them still aligned with P, if y ≤ k+x) must contain at least x matches (or y-k matches, if y ≤ k+x).
T: AACTGTTAACTTGCGACTA (k=2, x=3)
P: AAGTCGTAAC
…. AAGTCGTAAC
Construction details
$V_{kx}[t_{j-k-x+1} \dots t_j,\, l]$: marks the characters of $t_{j-k-x+1} \dots t_j$ that match P after P is shifted by $l$.
$d_{kx}[t_{j-k-x+1} \dots t_j]$: stores the minimum shift distance $l$ such that $V_{kx}[t_{j-k-x+1} \dots t_j,\, l]$ contains at least $\min(x,\, m-k-l)$ items.
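Construction details (cont'd)
P: AAGTCGTAAC (k=2, x=3, l = 1..8). Each cell of $V_{kx}[t_{j-k-x+1} \dots t_j,\, l]$ lists the matching window positions, counted from the right (position 0 is $t_j$); an empty cell means no match. The last column is $d_{kx}[t_{j-k-x+1} \dots t_j]$.

| Window | l=1 | l=2 | l=3 | l=4 | l=5 | l=6 | l=7 | l=8 | d_kx |
|---|---|---|---|---|---|---|---|---|---|
| AAAAA | 0,1 | 0 | | 4 | 3,4 | 2,3 | 1,2 | 0,1 | 6 |
| … | | | | | | | | | |
| GCGAC | 1 | 2,3 | 4 | | 0,2 | | 1 | 1 | 7 |
| … | | | | | | | | | |
| GTCGT | | | 0,1,2,3,4 | | | 0,1 | | | 3 |
| … | | | | | | | | | |
| TTAAC | 0 | 4 | 3 | | 0 | 2 | 1,2 | 1 | 7 |
| … | | | | | | | | | |
| TTTTT | 2 | 1,4 | 0,3 | 2 | 1 | 0 | | | 8 |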
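The definitions of Vkx and dkx can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the names `match_positions`, `shift_distance` and `build_shift_table` are hypothetical, the alphabet defaults to `"ACGT"`, and window positions are counted from the right (0 is the last window character).

```python
from itertools import product


def match_positions(pattern, window, l):
    """V_kx[window, l]: window positions (0 = rightmost) matching P after shifting P by l."""
    m, w = len(pattern), len(window)
    positions = []
    for i, ch in enumerate(window):
        idx = m - w + i - l  # pattern index aligned with window[i] after the shift
        if idx >= 0 and pattern[idx] == ch:
            positions.append(w - 1 - i)
    return sorted(positions)


def shift_distance(pattern, window, k):
    """d_kx[window]: smallest l whose V_kx[window, l] has at least min(x, m-k-l) items."""
    m = len(pattern)
    x = len(window) - k
    for l in range(1, m - k + 1):
        if len(match_positions(pattern, window, l)) >= min(x, m - k - l):
            return l
    raise ValueError("requires k < len(pattern)")


def build_shift_table(pattern, k, x, alphabet="ACGT"):
    """Precompute d_kx for every possible window of length k+x over the alphabet."""
    windows = ("".join(g) for g in product(alphabet, repeat=k + x))
    return {win: shift_distance(pattern, win, k) for win in windows}


if __name__ == "__main__":
    table = build_shift_table("AAGTCGTAAC", k=2, x=3)
    for win in ("AAAAA", "GCGAC", "GTCGT", "TTAAC", "TTTTT"):
        print(win, table[win])
```

The table holds one entry per window, that is |alphabet|^(k+x) entries, so both its size and its construction time grow quickly with x.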
Theoretical support
Correctness of FAAST
Under the random string assumption:
Average shift distance
Total number of character comparisons
Correctness of FAAST
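Theorem 1. When the last k+x characters of P are aligned with $t_{j-k-x+1} \dots t_j$, we can always shift P to the right by $d_{kx}[t_{j-k-x+1} \dots t_j]$ without missing any approximate occurrence of P.
$$
\begin{array}{l}
t_{j-m+1}\; t_{j-m+2}\; \dots\; t_{j-k-x+1}\; \dots\; t_{j-2}\; t_{j-1}\; t_j \\
p_1\; p_2\; \dots\; p_{i-k-x+1}\; \dots\; p_{i-2}\; p_{i-1}\; p_i\; \dots\; p_m \quad \text{(current)} \\
p_1\; p_2\; \dots\; p_{i-k-x+1}\; \dots\; p_{i'-k-x+1}\; \dots\; p_{i'-2}\; p_{i'-1}\; p_{i'}\; \dots\; p_m \quad (i < i')
\end{array}
$$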
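The shift rule behind the correctness claim can be exercised with a small search loop. The sketch below is an assumption-laden illustration rather than the authors' code: `faast_search` is a hypothetical name, shift distances are computed on demand and memoized instead of being read from a precomputed table, and each alignment is verified by a plain left-to-right mismatch count.

```python
from functools import lru_cache


def faast_search(text, pattern, k, x):
    """Report 0-based starts where pattern matches text with <= k mismatches."""
    m, n, w = len(pattern), len(text), k + x
    if x < 1 or w > m:
        raise ValueError("requires x >= 1 and k + x <= len(pattern)")

    @lru_cache(maxsize=None)
    def shift(window):
        # Smallest l such that, after shifting P by l, the window still holds
        # at least min(x, m - k - l) matches; l = m - k always qualifies.
        for l in range(1, m - k + 1):
            matches = sum(
                1
                for i, ch in enumerate(window)
                if m - w + i - l >= 0 and pattern[m - w + i - l] == ch
            )
            if matches >= min(x, m - k - l):
                return l
        return m - k

    hits = []
    j = m - 1  # text index aligned with the last pattern character
    while j < n:
        start = j - m + 1
        mismatches = 0
        for a, b in zip(pattern, text[start:j + 1]):
            if a != b:
                mismatches += 1
                if mismatches > k:
                    break
        if mismatches <= k:
            hits.append(start)
        j += shift(text[j - w + 1:j + 1])
    return hits


if __name__ == "__main__":
    print(faast_search("GGAAGTCGTAACTTAAGACGTTACGGAAGTCGTAAC", "AAGTCGTAAC", 2, 3))
```

Any alignment skipped by a shift has fewer than min(x, m-k-l) matches inside the window, hence more than k mismatches, which is why the skipped positions cannot be occurrences.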
Average shift distance
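Lemma 1. Let $p$ be the probability that a text character matches a pattern character. The probability $P_{kx}$ that the last $k+x$ characters of T have at least $x$ matches is:
$$P_{kx} = 1 - \sum_{i=0}^{x-1} \binom{k+x}{i} (1-p)^{k+x-i}\, p^{i}$$
Theorem 1. The average shift distance of FAAST is:
$$E_{kx}^{d} = \sum_{s=0}^{\infty} s\,(1-P_{kx})^{s-1} P_{kx} = \frac{1}{P_{kx}}$$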
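The two formulas above are easy to evaluate numerically. The sketch below assumes a uniform DNA alphabet, so p = 1/4 in the usage lines; the function names `match_probability` and `expected_shift` are hypothetical.

```python
from math import comb


def match_probability(k, x, p):
    """P_kx: probability that at least x of k+x aligned characters match."""
    w = k + x
    # Subtract the probability of having fewer than x matches.
    return 1.0 - sum(comb(w, i) * (1 - p) ** (w - i) * p ** i for i in range(x))


def expected_shift(k, x, p):
    """E^d_kx = 1 / P_kx, the mean of a geometric distribution with parameter P_kx."""
    return 1.0 / match_probability(k, x, p)


if __name__ == "__main__":
    for x in range(1, 8):
        print(x, round(expected_shift(3, x, 0.25), 2))
```

Because the bound is 1/P_kx, the predicted shift grows without limit as x increases, whereas the shift table caps the actual shift, so the formula is only indicative for large x.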
Average shift distance under different x
Total character comparisons
Lemma 2. The expected number of comparisons between two shifts is:
$$E_{kx}^{c} = \frac{k+x}{1-p}$$
Theorem 2. The expected total number of comparisons for a text of length $n$ is:
$$TE_{kx}^{c} = \frac{n\, P_{kx}\,(k+x)}{1-p}$$
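Theorem 2 combines the two earlier results: a text of length n is covered by about n / (1/Pkx) = n·Pkx shifts, and each shift costs (k+x)/(1-p) comparisons on average. A minimal sketch of that arithmetic follows; the function name `expected_total_comparisons` is hypothetical and p = 1/4 in the usage lines is an assumption for a uniform DNA alphabet.

```python
from math import comb


def expected_total_comparisons(n, k, x, p):
    """TE^c_kx = n * P_kx * (k + x) / (1 - p)."""
    w = k + x
    # P_kx from Lemma 1: at least x matches among the last k+x characters.
    p_kx = 1.0 - sum(comb(w, i) * (1 - p) ** (w - i) * p ** i for i in range(x))
    shifts = n * p_kx            # n divided by the average shift 1 / P_kx
    per_shift = w / (1 - p)      # Lemma 2
    return shifts * per_shift


if __name__ == "__main__":
    for x in range(1, 8):
        print(x, round(expected_total_comparisons(2_000_000, 3, x, 0.25)))
```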
Total character comparisons
Difference of total character comparisons under different x
Experimental results
A PC with a 2.8 GHz CPU and 1 GB of memory
Simulated random string testing
Real DNA gene sequence data
Result on simulated sequences
Text: 2M-base sequence; pattern: 39 bases; k=3.

| x | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Avg. shift dist. | 1.41 | 2.76 | 5.59 | 16.38 | 31.31 | 37.37 | 38.87 |
| Total comp. | 6.70 | 3.68 | 1.86 | 0.65 | 0.34 | 0.28 | 0.27 |
| Running time (sec.) | 210.2 | 114.4 | 58.1 | 20.6 | 11.2 | 10.8 | 16.7 |
| Prepro. time (sec.) | 0.01 | 0.01 | 0.03 | 0.08 | 0.36 | 1.58 | 6.90 |
Result on real sequences
Text: 150 bacteria DNA sequences, k=3

| x | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Running time (sec.) | 18.87 | 13.05 | 7.74 | 3.84 | 2.63 | 3.21 | 8.55 |
| Prepro. time (sec.) | 0.01 | 0.01 | 0.02 | 0.09 | 0.35 | 1.57 | 6.96 |
| Matching time (sec.) | 18.77 | 13.04 | 7.72 | 3.75 | 2.28 | 1.64 | 1.59 |

Text: 150 fungi DNA sequences, k=3

| x | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Running time (sec.) | 16.45 | 11.43 | 9.24 | 6.78 | 5.62 | 8.24 | 26.48 |
| Prepro. time (sec.) | 0.02 | 0.03 | 0.08 | 0.32 | 1.34 | 5.77 | 23.86 |
| Matching time (sec.) | 16.43 | 11.40 | 9.16 | 6.46 | 4.28 | 2.47 | 2.62 |
Conclusion
FAAST is a competitive algorithm for the k-mismatch problem on gene sequences.
Preprocessing time and memory increase with larger x and alphabet size.