approximate indexing: gapped suffix array
Post on 01-Jul-2015
King’s College London, University of London
MSc in Advanced Software Engineering
Approximate Indexing: Gapped Suffix Array
KyungHoon Park
Agenda
Research Objective
Gapped suffix array
Application
Going beyond gSA
Q&A
Research Objective
Main questions
1. Given an already constructed suffix array, can a gapped suffix array be built in O(n) time?
2. What are the limitations of the gapped suffix array, and how can they be overcome?
Research aims
1. To fully understand and implement the suffix array and the LCP array.
2. To implement a gapped suffix array from the suffix array in O(n) time.
3. To study and implement the gapped suffix array paper.
4. To investigate whether the approach can be extended to multiple gapped suffix arrays, and to identify further limitations.
Gapped Suffix Array
Main questions
1. Given an already constructed suffix array, can a gapped suffix array be built in O(n) time?
2. What are the limitations of the gapped suffix array, and how can they be overcome?
Definitions
T = t1 t2 … tn and P = p1 p2 … pm are strings of symbols over a finite alphabet
m = length of the search string P
n = length of the text T
k = number of allowed mismatches (Hamming distance)
Suffix Array
T = mississippi
i    T[i..]        SA[i]   T[SA[i]..]    LCP[i]
0    mississippi   10      i             0
1    ississippi    7       ippi          1
2    ssissippi     4       issippi       1
3    sissippi      1       ississippi    4
4    issippi       0       mississippi   0
5    ssippi        9       pi            0
6    sippi         8       ppi           1
7    ippi          6       sippi         0
8    ppi           3       sissippi      2
9    pi            5       ssippi        1
10   i             2       ssissippi     3
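The table can be reproduced with a few lines of Python. This is only a naive sketch for checking the values by hand: it sorts the suffixes directly instead of using the skew algorithm, and the function names `suffix_array` and `lcp_array` are illustrative, not taken from the project code.

```python
def suffix_array(text):
    # Naive construction: sort the start positions by the suffix they begin.
    # (Not linear time; the skew algorithm is what achieves O(n).)
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    # lcp[r] = length of the longest common prefix of the suffixes
    # at ranks r-1 and r in the suffix array; lcp[0] is 0 by convention.
    lcp = [0] * len(sa)
    for r in range(1, len(sa)):
        a, b = text[sa[r - 1]:], text[sa[r]:]
        length = 0
        while length < min(len(a), len(b)) and a[length] == b[length]:
            length += 1
        lcp[r] = length
    return lcp


T = "mississippi"
SA = suffix_array(T)
LCP = lcp_array(T, SA)
for rank, (pos, common) in enumerate(zip(SA, LCP)):
    print(rank, pos, T[pos:], common)
```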
Gapped Suffix Array
1. First introduced by Crochemore and Tischler (2010).
2. Constructed after the SA.
3. A suffix array that ignores a gap at a fixed range within each suffix, which provides an approximate index.
4. The gap range is defined before the gapped suffix array is constructed.
Gapped Suffix Array
T = mississippi, (1, 2)-gSA(3, 1)
i    T[i]          SA   gSA   (1, 2)-gSA(3, 1)
1    mississippi   10   10    i#
2    ississippi    7    7     i#pi
3    ssissippi     4    4     i#sippi
4    sissippi      1    1     i#sissippi
5    issippi       0    0     m#ssissippi
6    ssippi        9    9     p#
7    sippi         8    8     p#i
8    ippi          6    5     s#ippi
9    ppi           3    2     s#issippi
10   pi            5    6     s#ppi
11   i             2    3     s#ssippi
Definition
(g0, g1)-gSA(m, k)
gSA = gapped suffix array
g0 = offset within the suffix where the gap starts
g1 = offset within the suffix where the gap ends (the text resumes at g1)
m = length of the search string
k = Hamming distance
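To make the notation concrete, here is a small Python sketch that writes out the gapped form of each suffix and sorts on it. It is a naive illustration, not the linear-time construction: it sorts full strings and ignores the pattern length m. The helper names `gapped_suffix` and `naive_gsa` are assumptions made for this example.

```python
def gapped_suffix(text, pos, g0, g1):
    # The g0 characters before the gap, a '#' marking the gap,
    # then whatever follows from offset g1 onwards.
    return text[pos:pos + g0] + "#" + text[pos + g1:]


def naive_gsa(text, g0, g1):
    # Order the suffix positions by (part before the gap, part after the gap).
    return sorted(
        range(len(text)),
        key=lambda pos: (text[pos:pos + g0], text[pos + g1:]),
    )


T = "mississippi"
GSA = naive_gsa(T, 1, 2)
for pos in GSA:
    print(pos, gapped_suffix(T, pos, 1, 2))
```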
Flow of constructing the gSA
1. Constructing the SA
• Skew algorithm
2. Defining the limitations
• Number of mismatches k
• Range of the gap
3. Constructing the gSA
• Sorting based on GRANK & HRANK
Limitations of gSA
1. The Hamming distance, the pattern length and the gap range must be defined before construction.
2. The gSA cannot cover every approximate match within the defined k mismatches.
Example: k = 2, gap = (1, 3): coat -> c##t, ##at, co## (supported); #o#t, c#a# (not supported)
3. The gSA cannot support multiple gaps.
Example: coach -> c#a#h
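The split between supported and unsupported variants can be checked mechanically. The sketch below assumes that "supported" means all mismatch positions form one contiguous run, so that a single-gap index (or the plain SA) can serve the variant. The function names are illustrative only.

```python
from itertools import combinations


def variants(pattern, k):
    # Every way of replacing exactly k characters of the pattern by '#'.
    result = []
    for gaps in combinations(range(len(pattern)), k):
        result.append(
            "".join("#" if i in gaps else c for i, c in enumerate(pattern))
        )
    return result


def is_single_gap(variant):
    # True when all '#' characters form one contiguous run.
    first, last = variant.index("#"), variant.rindex("#")
    return all(c == "#" for c in variant[first:last + 1])


all_variants = variants("coat", 2)
supported = [v for v in all_variants if is_single_gap(v)]
unsupported = [v for v in all_variants if not is_single_gap(v)]
print(supported, unsupported)
```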
Constructing gSA - #1. GRANK
For example, m = 3, gap range = (1, 2):
i       0  1  2  3  4  5  6  7  8  9  10
T[i]    m  i  s  s  i  s  s  i  p  p  i
GRANK   5  1  8  8  1  8  8  1  6  6  1
GRANK contains the ranks of the factors of T with length up to g0, that is, the ranks obtained by cutting each suffix just before the gap begins at position g0.
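One way to obtain these GRANK values is sketched below in Python. It assumes 1-based ranks and gives every suffix the rank of the first suffix array entry that shares its first g0 characters, which matches the numbers in the table. The name `grank` and the naive suffix array are stand-ins for this example.

```python
def grank(text, sa, g0):
    # Rank (1-based) of the first suffix in SA order that starts with
    # the same g0 characters; all suffixes in that group share the rank.
    first_rank = {}
    for rank, pos in enumerate(sa, start=1):
        first_rank.setdefault(text[pos:pos + g0], rank)
    return [first_rank[text[i:i + g0]] for i in range(len(text))]


T = "mississippi"
SA = sorted(range(len(T)), key=lambda i: T[i:])  # naive suffix array
GRANK = grank(T, SA, 1)
print(GRANK)
```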
Constructing gSA - #2. HRANK
HRANK contains the ranks of the suffixes that start where the gap ends.
Because the suffix array is built before the gapped suffix array, the rank of the suffix starting at the end of the gap can be looked up directly through the inverse suffix array (ISA):
HRANK[r] = ISA[SA[r] + g1]
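The formula translates almost directly into code. This Python sketch assumes 1-based ranks in ISA and uses 0 when the position SA[r] + g1 lies beyond the end of the text. The name `hrank` is illustrative.

```python
def hrank(sa, g1):
    n = len(sa)
    # Inverse suffix array with 1-based ranks: isa[pos] = rank of suffix pos.
    isa = [0] * n
    for rank, pos in enumerate(sa, start=1):
        isa[pos] = rank
    # HRANK[r] = ISA[SA[r] + g1], or 0 if the gap runs past the end of the text.
    return [isa[pos + g1] if pos + g1 < n else 0 for pos in sa]


T = "mississippi"
SA = sorted(range(len(T)), key=lambda i: T[i:])  # naive suffix array
HRANK = hrank(SA, 2)
print(HRANK)
```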
GRANK & HRANK
For example, with gap range (1, 2), the fourth suffix sissippi is split into its GRANK part, gap and HRANK part as follows.
s       i     ssippi
GRANK   Gap   HRANK
Radix sorting the suffixes on the combined (GRANK, HRANK) key produces the gSA in linear time.
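A minimal Python sketch of this step follows, assuming a two-pass least-significant-digit radix sort: a stable bucket pass on HRANK, then a stable bucket pass on GRANK. It is not the project's C# code, and for brevity it detects GRANK groups by comparing string slices, where a real linear-time version would use the LCP array. All names are illustrative.

```python
def bucket_pass(items, key, max_key):
    # Stable counting-sort pass over integer keys in 0..max_key.
    buckets = [[] for _ in range(max_key + 1)]
    for item in items:
        buckets[key(item)].append(item)
    return [item for bucket in buckets for item in bucket]


def gapped_suffix_array(text, sa, g0, g1):
    n = len(text)
    isa = [0] * n
    grank = [0] * n
    group, previous_head = 0, None
    for rank, pos in enumerate(sa, start=1):
        isa[pos] = rank
        head = text[pos:pos + g0]
        if head != previous_head:  # a new group of equal heads starts here
            group, previous_head = rank, head
        grank[pos] = group

    def hrank(pos):
        return isa[pos + g1] if pos + g1 < n else 0

    # LSD radix sort: secondary key (HRANK) first, then primary key (GRANK).
    by_tail = bucket_pass(list(sa), hrank, n)
    return bucket_pass(by_tail, lambda pos: grank[pos], n)


T = "mississippi"
SA = sorted(range(len(T)), key=lambda i: T[i:])  # naive suffix array
GSA = gapped_suffix_array(T, SA, 1, 2)
print(GSA)
```

Both passes touch each suffix once and use at most n + 1 buckets, which is where the linear bound comes from once the suffix array is given.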
Example of (1,2)-gSA(3,1)
i    T[i]          SA   gSA   (1, 2)-gSA      GRANK   HRANK
1    mississippi   10   10    i#              5       0
2    ississippi    7    7     i#pi            1       6
3    ssissippi     4    4     i#sippi         8       8
4    sissippi      1    1     i#sissippi      8       9
5    issippi       0    0     m#ssissippi     1       11
6    ssippi        9    9     p#              8       0
7    sippi         8    8     p#i             8       1
8    ippi          6    5     s#ippi          1       7
9    ppi           3    2     s#issippi       6       10
10   pi            5    6     s#ppi           6       2
11   i             2    3     s#ssippi        1       3
GRANK is listed by text position (alongside T[i]); HRANK is listed by suffix array row (alongside SA).
Search in (1,2)-gSA(3,1)
For example, if P = mis (p0, p1, p2), three searches are needed:
- search mi (p0, p1) in the SA
- search is (p1, p2) in the SA
- search m#s (p0, p2) in the gSA
P = cot
(1,2)-gSA(3,1)     c#t                     #ot         co#
Searching array    in the (1,2)-gSA(3,1)   in the SA   in the SA
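The three searches can be sketched in Python as follows. This is a hedged illustration for m = 3 and k = 1 only: the index construction is naive, the binary search does not use LCP or gLCP, and every function name is an assumption made for the example.

```python
def build_sa(text):
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_gsa(text, g0, g1):
    return sorted(range(len(text)),
                  key=lambda i: (text[i:i + g0], text[i + g1:]))


def find_all(order, key_of, target):
    # Binary search for the first entry whose key is >= target,
    # then collect the run of entries whose key equals target.
    lo, hi = 0, len(order)
    while lo < hi:
        mid = (lo + hi) // 2
        if key_of(order[mid]) < target:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    while lo < len(order) and key_of(order[lo]) == target:
        hits.append(order[lo])
        lo += 1
    return hits


def search_one_mismatch(text, sa, gsa, pattern):
    p0, p1, p2 = pattern
    n = len(text)
    prefix2 = lambda i: text[i:i + 2]
    gapped = lambda i: (text[i:i + 1], text[i + 2:i + 3])
    hits = set()
    # p0 p1 # : look up "p0p1" in the SA; the third character is free.
    hits.update(r for r in find_all(sa, prefix2, p0 + p1) if r + 3 <= n)
    # # p1 p2 : look up "p1p2" in the SA; the match starts one position earlier.
    hits.update(r - 1 for r in find_all(sa, prefix2, p1 + p2) if r >= 1)
    # p0 # p2 : look up the gapped key in the (1,2)-gSA.
    hits.update(find_all(gsa, gapped, (p0, p2)))
    return sorted(hits)


T = "mississippi"
SA, GSA = build_sa(T), build_gsa(T, 1, 2)
print(search_one_mismatch(T, SA, GSA, "mis"))
```

For P = mis the sketch reports positions 0 (mis, exact) and 3 (sis, one mismatch).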
Application
Platform and Language
1. Language: C#
2. Platform: Microsoft .NET (.NET Framework 4.0)
Algorithms
1. Construction of the suffix array with LCP
- Radix sort
- Skew algorithm
2. Construction of the gapped suffix array with gLCP
- Radix sort
3. Approximate string search
- Pattern analysis
- Binary search with LCP
Gapped Suffix Array
Going beyond gSA
Main questions
1. Given an already constructed suffix array, can a gapped suffix array be built in O(n) time?
2. What are the limitations of the gapped suffix array, and how can they be overcome?
Limitation of gSA
P = coat
(2,3)-gSA(4,1)     #oat    c#at             co#t             coa#
Searching array    SA      Cannot support   gSA(4,1)         SA
P = coast
(3,4)-gSA(5,1)     #oast   c#ast            co#st            coa#t      coas#
Searching array    SA      Cannot support   Cannot support   gSA(5,1)   SA
This assumes that k is 1 and that the gap ends at m - 1.
Countermeasure
P = coat
(2,3)-gSA(4,1)     #oat    c#at       co#t       coa#
Searching array    SA      gSA(3,1)   gSA(4,1)   SA
P = coast
(3,4)-gSA(5,1)     #oast   c#ast      co#st      coa#t      coas#
Searching array    SA      gSA(3,1)   gSA(4,1)   gSA(5,1)   SA
Countermeasure
P = cot       c#t, #ot, co#
gSA(3, 1)     SA, gSA(3, 1)
P = coat      #oat, c#at, co#t, coa#
gSA(4, 1)     SA, gSA(3, 1), gSA(4, 1)
P = coast     #oast, c#ast, co#st, coa#t, coas#
gSA(5, 1)     SA, gSA(3, 1), gSA(4, 1), gSA(5, 1)
P = coasts    #oasts, c#asts, co#sts, coa#ts, coas#s, coast#
gSA(6, 1)     SA, gSA(3, 1), gSA(4, 1), gSA(5, 1), gSA(6, 1)
gSA(m, 1)     SA, gSA(3, 1), …, gSA(m-2, 1), gSA(m-1, 1), gSA(m, 1)
Theorem: If the length of the gap is 1, the required count of gSAs is |m - 2|, and both construction and search can be performed in linear time.
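The count in the theorem can be illustrated with a short Python sketch. It follows the pattern shown in the countermeasure slides for k = 1: a gap at the first or last position is served by the plain SA, and a gap at 0-based position j in between is served by gSA(j + 2, 1). The mapping and the function names are read off those examples and are assumptions, not a stated rule.

```python
def index_for_gap(m, j):
    # j is the 0-based position of the single '#' in a pattern of length m.
    if j == 0 or j == m - 1:
        return "SA"  # the remaining characters are contiguous
    return f"gSA({j + 2},1)"


def required_indexes(m):
    # Distinct indexes needed to cover every single-gap variant.
    needed = []
    for j in range(m):
        name = index_for_gap(m, j)
        if name not in needed:
            needed.append(name)
    return needed


for m in (3, 4, 5, 6):
    indexes = required_indexes(m)
    print(m, indexes, "gSA count:", len(indexes) - 1)
```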
Total count of required gSAs
gSA(m, p)   Required gapped suffix arrays
gSA(3,1)    SA, gSA(3,1)
gSA(4,1)    SA, gSA(3,1), gSA(4,1)
gSA(4,2)    SA, gSA(3,1), gSA(4,2)
gSA(5,1)    SA, gSA(3,1), gSA(4,1), gSA(5,1)
gSA(5,2)    SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,2)
gSA(5,3)    SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,3)
gSA(6,1)    SA, gSA(3,1), gSA(4,1), gSA(5,1), gSA(6,1)
gSA(6,2)    SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(6,2)
gSA(6,3)    SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(5,3), gSA(6,3)
gSA(6,4)    SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(5,3), gSA(6,4)
gSA(7,1)    SA, gSA(3,1), gSA(4,1), gSA(5,1), gSA(6,1), gSA(7,1)
gSA(7,2)    SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(6,1), gSA(6,2), gSA(7,2)
gSA(7,3)    SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(5,3), gSA(6,1), gSA(6,2), gSA(6,3), gSA(7,3)
gSA(7,4)    SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(5,3), gSA(6,1), gSA(6,2), gSA(6,3), gSA(6,4), gSA(7,4)
gSA(7,5)    SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(5,3), gSA(6,1), gSA(6,2), gSA(6,3), gSA(6,4), gSA(7,5)
gC = Total count of required gSAs
$$gC = \sum_{i=1}^{p-1} \begin{cases} k - i & \text{if } k - i > 0 \\ 1 & \text{otherwise} \end{cases}$$
Multiple gaps, for various m
P = coat      ##at, #o#t, #oa#, c##t, c#a#, co##
gSA(4,2)      SA, gSA(3,1), gSA(4,2)
P = coast     ##ast, #o#st, #oa#t, #oas#, c##st, c#a#t, c#as#, co##t, co#s#, coa##
gSA(5,2)      SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,2), (1,2)(3,4)-gSA(5,2)
P = coasts    ##asts, #o#sts, #oa#ts, #oas#s, #oast#, c##sts, c#a#ts, c#as#s, c#ast#, co##ts, co#s#s, co#st#, coa##s, coa#t#, coas##
gSA(6,2)      SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), (1,2)(3,4)-gSA(5,2), gSA(6,2), (1,2)(4,5)-gSA(6,2), (2,3)(4,5)-gSA(6,2)
P = coasts    ###sts, ##a#ts, ##as#s, ##ast#, #o##ts, #o#s#s, #o#st#, #oa##s, #oa#t#, #oas##, c###ts, c##s#s, c##st#, c#a##s, c#a#t#, c#as##, co###s, co##t#, co#s##, coa###
gSA(6,3)      SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), (1,2)(3,4)-gSA(5,2), gSA(5,3), gSA(6,3), (1,3)(4,5)-gSA(6,3), (1,2)(3,5)-gSA(6,3)
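The variant lists and the multi-gap prefixes such as (1,2)(4,5) can be derived mechanically. The Python sketch below enumerates the variants and reads off the gap ranges of each one, assuming each range is written as (start, end) with an exclusive end, as in the (g0, g1) notation. The function names are illustrative.

```python
from itertools import combinations
from math import comb


def variants(pattern, k):
    # Every way of replacing exactly k characters of the pattern by '#'.
    return [
        "".join("#" if i in gaps else c for i, c in enumerate(pattern))
        for gaps in combinations(range(len(pattern)), k)
    ]


def gap_layout(variant):
    # One (start, end) range per run of '#', with end exclusive.
    gaps, start = [], None
    for i, c in enumerate(variant + "."):  # sentinel closes a trailing run
        if c == "#" and start is None:
            start = i
        elif c != "#" and start is not None:
            gaps.append((start, i))
            start = None
    return gaps


for pattern, k in (("coat", 2), ("coast", 2), ("coasts", 2), ("coasts", 3)):
    found = variants(pattern, k)
    multi = [v for v in found if len(gap_layout(v)) > 1]
    print(pattern, k, len(found), "variants,", len(multi), "with multiple gaps")
```

The number of variants is the binomial coefficient C(m, k), so the number of gap layouts to cover grows quickly with m and k.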
Two approaches to support multiple gaps
1. Search with the index only up to the first gap of the search pattern, then compare every remaining character individually.
2. Keep building additional multiple gapped suffix arrays as in the method above.
First approach
r = gSA(3,1)[i]
c      #   a        #        t
T[r]       T[r+2]   T[r+3]   T[r+4]
r = gSA(3,1)[i]
c      #   a   s        #        s
T[r]           T[r+3]   T[r+4]   T[r+5]
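A minimal Python sketch of this first approach follows. The candidate positions r would come from a gSA(3,1) lookup on the first three pattern characters; here a linear scan stands in for that lookup so the example stays self-contained. The sample text and all function names are made up for illustration.

```python
def candidates_from_first_fragment(text, pattern):
    # Stand-in for the gSA(3,1) lookup: positions r where the first and
    # third characters match and the second one (the first gap) is free.
    return [
        r for r in range(len(text) - 2)
        if text[r] == pattern[0] and text[r + 2] == pattern[2]
    ]


def rest_matches(text, r, pattern, start):
    # Compare the characters from offset `start` onwards one by one,
    # skipping every further gap marked by '#'.
    if r + len(pattern) > len(text):
        return False
    return all(
        pattern[j] == "#" or text[r + j] == pattern[j]
        for j in range(start, len(pattern))
    )


def search_multi_gap(text, pattern):
    return [
        r for r in candidates_from_first_fragment(text, pattern)
        if rest_matches(text, r, pattern, 3)
    ]


text = "coast coasts"
print(search_multi_gap(text, "c#a#t"))
print(search_multi_gap(text, "c#as#s"))
```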
Worst case for searching with it
Let fm be the length of the first fragment.
Binary search for the first fragment with gLCP: O(log n + fm)
Comparing the rest of the pattern: O((m - fm) n)
Total: O((m - fm) n + log n + fm)
Summary
Further work
The gapped suffix array only supports searching for specific patterns.
Supporting approximate indexing in all situations will require more research and development into multiple gapped suffix arrays.
A future task is to study multiple gapped suffix arrays and their efficiency.
Conclusion
The result of Crochemore and Tischler that the gSA can be constructed in linear time was implemented and confirmed in practice.
In addition, the potential of multiple gSAs was examined; the conclusion is that this area requires more research.
Q&A