
Page 1: [IEEE 2011 Data Compression Conference (DCC) - Snowbird, UT, USA (2011.03.29-2011.03.31)] 2011 Data Compression Conference - Compressed Dictionary Matching with One Error

Compressed Dictionary Matching

With One Error⋆

Wing-Kai Hon1, Tsung-Han Ku1, Rahul Shah2

Sharma V. Thankachan2, and Jeffrey Scott Vitter3

1 Dept of CS, National Tsing Hua University, Taiwan. {wkhon,thku}@cs.nthu.edu.tw
2 Dept of CS, Louisiana State University, USA. {rahul,thanks}@csc.lsu.edu

3 Dept of EECS, The University of Kansas, USA. [email protected]

Abstract. Given a set D of d patterns of total length n, the dictionary matching problem is to index D such that for any query text T, we can locate the occurrences of any pattern within T efficiently. This problem can be solved in optimal O(|T| + occ) time by the classical AC automaton (Aho and Corasick, 1975), where occ denotes the number of occurrences. The space requirement is O(n) words. In the approximate dictionary matching problem with one error, we consider a substring T[i..j] an occurrence of P whenever the edit distance between T[i..j] and P is at most one. For this problem, the best known indexes are by Cole et al. (2004), which requires O(n + d log d) words of space and reports all occurrences in O(|T| log d log log d + occ) time, and by Ferragina et al. (1999), which requires O(n1+ε) words of space and reports all occurrences in O(|T| log log n + occ) time. Recently, there have been successes in compressing the dictionary matching index while keeping the query time optimal (Belazzougui, 2010; Hon et al., 2010). However, a compressed index for the approximate dictionary matching problem is still open. In this paper, we propose the first such index, which requires an optimal nHk + O(n) + o(n log σ)-bit index space, where Hk denotes the kth-order empirical entropy of D, and σ is the size of the alphabet from which all the characters in D and T are drawn. The query time of our index is O(σ|T| log3 n log log n + occ).

1 Introduction

Given a set D = {p1, p2, . . . , pd} of d patterns of total length n, the dictionary matching problem is to index D such that for any query text T, we can locate the occurrences of any pattern within T efficiently. This problem has many practical applications, such as locating genes in a genome, or detecting viruses in computer program segments. The classical AC automaton proposed by Aho and Corasick [1] requires O(n) words of index space, and can answer a query in optimal O(|T| + occ) time, where occ denotes the number of occurrences. Over the years, several advances were made to solve the dynamic version of this problem (where patterns can be inserted into or deleted from D) [2–4]. Recently, there has been interest in making the index space compressed or succinct [6, 7, 12, 13]. The best solution requires only a compressed space of nHk + O(n) bits, where Hk is the kth-order empirical entropy of D, while keeping the query time optimal.

We can generalize the dictionary matching problem to the approximate dictionary matching problem by relaxing the definition of an occurrence. Precisely, in approximate dictionary matching with k errors, we consider P to occur at position i in T

⋆ This work is supported in part by Taiwan NSC Grant 99-2221-E-007-123 (W. Hon) and US NSF Grant CCF-1017623 (R. Shah and J. S. Vitter).

2011 Data Compression Conference

1068-0314/11 $26.00 © 2011 IEEE

DOI 10.1109/DCC.2011.18


whenever there is some substring T[i..j] such that the edit distance between T[i..j] and P is at most k.4 In this paper, we focus on the case where k = 1. Ferragina et al. [9] proposed an index with O(n1+ε) words of space, which can answer a query in O(|T| log log n + occ) time. Then, Amir et al. [5] gave another index with a reduced space of O(n log2 n) words, and the query time became O(|T| log3 n log log n + occ). Later, Cole et al. [8] reduced the index space further to O(n + d log d) words, and simultaneously improved the query time to O(|T| log d log log d + occ).
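For intuition, deciding whether two strings are within edit distance one can be done with a single scan, rather than the full dynamic-programming table. The helper below is ours for illustration (0-indexed Python strings), not part of any index in the paper.

```python
def within_one_edit(x: str, y: str) -> bool:
    """Return True iff the edit distance between x and y is at most 1."""
    if abs(len(x) - len(y)) > 1:
        return False
    if len(x) > len(y):
        x, y = y, x  # ensure len(x) <= len(y)
    # Scan to the first mismatch, then compare the remainders:
    # equal lengths  -> skip one character in both (a substitution);
    # lengths differ -> skip one character in y only (an insertion).
    i = 0
    while i < len(x) and x[i] == y[i]:
        i += 1
    if len(x) == len(y):
        return True if i == len(x) else x[i + 1:] == y[i + 1:]
    return x[i:] == y[i + 1:]
```

This is exactly the "one mismatch splits the string into an exactly-matching prefix and suffix" observation that Amir et al.'s index (Section 3.1) exploits.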

Note that the space requirements of the above indexes are much more than optimal. Since data sets today are very large while computer memory is limited, it is essential to have compressed indexes. In this paper, we propose the first compressed-space index for approximate dictionary matching with one error, whose index space is nHk + O(n) + o(n log σ) bits, which is (almost) optimal. The query time of our index is O(σ|T| log3 n log log n + occ). Our approach is based on compressing Amir et al.'s index with sampling, where the latter is one of the powerful techniques in compressed text indexing [13–15].

The organization of the paper is as follows. Section 2 describes some existing results that form the building blocks of our index. Section 3 reviews the index of Amir et al. [5], while Section 4 gives the details of our compressed index. We conclude the paper in Section 5.

2 Preliminaries

2.1 Efficient Storage Scheme for Strings

Ferragina and Venturini [10] proposed the following efficient storage scheme for strings.

Lemma 1 ([10]) A string S[1...n] over an alphabet of size σ can be stored in nHk + o(n log σ) bits of space, such that any substring of S of length ℓ can be retrieved in O(1 + ℓ/ logσ n) time.

2.2 Suffix Trees and Suffix Arrays

Let S be a set of strings, and let τ be a compact (i.e., path-compressed) trie over this set of strings. For each node v in τ, we use path(v) to denote the concatenation of edge labels along the path from the root to v. Then we have the following definition.

Definition 1 For any string S, the locus of S in a compact trie τ is defined to be the lowest node v (i.e., the farthest node from the root) such that path(v) is a prefix of S.

4 The edit distance between a string X and a string Y is the minimum number of edit operations (i.e., substitution, insertion, or deletion of a character) to transform X into Y.


The suffix tree [16, 18] for a set of strings S = {S1, S2, ..., Sr} is a compact trie storing all suffixes of each string Si. Let n = |S1| + |S2| + · · · + |Sr| denote the total number of characters in S, so that there are n suffixes in the suffix tree. For each internal node v in the suffix tree, it is known that there exists a unique internal node u in the tree such that path(u) is equal to the string obtained by removing the first character of path(v). Usually, a pointer is stored from v to u, and this pointer is known as the suffix link of v. By utilizing the suffix links, the suffix tree admits very efficient string matching, as shown in the following lemma.

Lemma 2 Let T be the input text, and T(j) be the suffix of T starting at the jth character. The loci of all T(j) in the suffix tree for S can be found in O(|T|) time.

The suffix array SA[1...n] is an array which stores the starting positions of all the suffixes when the suffixes are sorted in lexicographical order. In other words, SA[i] is the starting position of the ith lexicographically smallest suffix (among all the n suffixes stored). We also define its inverse, SA−1, to be an array such that SA[i] = j if and only if SA−1[j] = i. Suffix trees and suffix arrays take O(n log n) bits of space and support pattern matching queries in O(|P| + occ) and O(|P| + log n + occ) time, respectively.
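As a concrete illustration of SA and SA−1 (not the efficient construction a real index would use), a naive O(n2 log n) builder can be sketched in a few lines; indices are 0-based here, and the helper name is ours.

```python
def suffix_array(s: str):
    """Naive suffix array: sorted starting positions of all suffixes of s,
    together with its inverse (inv[pos] = lexicographic rank of s[pos:])."""
    n = len(s)
    sa = sorted(range(n), key=lambda i: s[i:])  # O(n^2 log n): fine for a sketch
    inv = [0] * n
    for rank, pos in enumerate(sa):
        inv[pos] = rank
    return sa, inv
```

For s = "banana", the sorted suffixes are "a", "ana", "anana", "banana", "na", "nana", giving SA = [5, 3, 1, 0, 4, 2].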

For any pattern P, it is known that all suffixes whose prefix matches exactly with P correspond to a contiguous region in SA. The range of such a region is called the suffix range of P. By storing SA and its inverse, we have the following lemma.

Lemma 3 Let [ℓ1, r1] and [ℓ2, r2] be the suffix ranges of patterns P1 and P2. Then the suffix range [ℓ, r] of P1P2 (the concatenation of P1 and P2) can be computed in O(log n) time.

Proof. Note that ℓ1 ≤ ℓ ≤ r ≤ r1, and the task is to find the range of i such that ℓ1 ≤ i ≤ r1 and ℓ2 ≤ SA−1[SA[i] + |P1|] ≤ r2. This can be performed in O(log n) time by doing a binary search on i.
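The binary search of Lemma 3 can be sketched as follows. The key fact is that, within the suffix range of P1, all suffixes share the prefix P1, so the rank SA−1[SA[i] + |P1|] of the suffix that follows this prefix is increasing in i. For brevity the sketch below materializes these keys (taking time linear in the range) instead of binary-searching them implicitly in O(log n) time; the helper name and the 0-based, inclusive ranges are ours.

```python
from bisect import bisect_left, bisect_right

def concat_range(sa, inv, range1, range2, p1_len):
    """Suffix range of P1P2, given the suffix ranges range1 of P1 and
    range2 of P2 (0-based, inclusive). Returns None if P1P2 never occurs."""
    l1, h1 = range1
    l2, h2 = range2
    n = len(sa)

    def follow(i):
        j = sa[i] + p1_len
        return inv[j] if j < n else -1  # suffix is exactly P1: rank below all

    keys = [follow(i) for i in range(l1, h1 + 1)]  # sorted, by monotonicity
    lo = l1 + bisect_left(keys, l2)
    hi = l1 + bisect_right(keys, h2) - 1
    return (lo, hi) if lo <= hi else None
```

For "banana" (SA = [5, 3, 1, 0, 4, 2]), combining the range [1, 2] of "an" with the range [0, 2] of "a" yields the range [1, 2] of "ana".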

Lemma 4 Let T be the input text, and T(j) be the suffix of T starting at the jth character. By preprocessing T initially in O(|T| log n) time, for any character c, the locus of cT(j) in a suffix tree for any j can be answered in O(log2 n) time, where cT(j) is the string formed by the concatenation of c and T(j).

Proof. For i = 1, 2, 4, 8, . . ., we divide the text T into |T|/i blocks, each of length i, and find the locus of each block. This takes a total of O(|T| log n) time by Lemma 2. Now the locus of cT(j) can be obtained as follows: partition T(j) into O(log n) maximal intervals, each starting and ending at a block boundary. Next, repeatedly append the next interval of T(j) to c, and compute the suffix range by Lemma 3. Once the suffix range becomes empty, we backtrack to get the longest prefix of T(j) that can be appended to c with a non-empty suffix range. Thus O(log n) suffix ranges are computed, taking a total of O(log2 n) time by Lemma 3.


2.3 Centroid Path and Centroid Path Decomposition

Let ∆ be a tree with n nodes. We define the size of an internal node v to be the number of leaves in the subtree rooted at v. The centroid path of the tree ∆ is the path starting from the root such that each node v on the path is the largest-size child of its parent. The centroid path decomposition of ∆ is the operation where we decompose each off-path subtree of the centroid path recursively; as a result, the edges of ∆ are partitioned into disjoint centroid paths.

Lemma 5 ([8]) The path from the root of ∆ to any node v traverses at most log n centroid paths.
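The decomposition can be sketched as follows (a hypothetical helper, not from the paper): trees are given as child-lists, leaf counts are computed bottom-up, and head[v] identifies the first node of the centroid path containing v. Lemma 5 holds because every time the root-to-v path leaves a centroid path, the leaf count at least halves.

```python
def centroid_paths(children, root=0):
    """head[v] = first node of the centroid path containing v, where at each
    node the edge to the child with the most leaves continues the path."""
    n = len(children)
    leaves = [0] * n
    order = []
    stack = [root]
    while stack:                       # iterative DFS; parents before children
        v = stack.pop()
        order.append(v)
        stack.extend(children[v])
    for v in reversed(order):          # children are processed before parents
        leaves[v] = sum(leaves[c] for c in children[v]) or 1  # a leaf counts as 1
    head = [root] * n
    stack = [(root, root)]
    while stack:
        v, h = stack.pop()
        head[v] = h
        if children[v]:
            heavy = max(children[v], key=lambda c: leaves[c])
            for c in children[v]:
                # heavy child stays on the current path; others start new paths
                stack.append((c, h if c == heavy else c))
    return head
```

For the tree 0 → {1, 2}, 1 → {3, 4}, node 1 has two leaves below it and node 2 only one, so the centroid path from the root runs 0, 1, 3, while 2 and 4 each head their own path.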

2.4 Sparse Suffix Trees

Let α be a sampling factor. For a string S[1..|S|], we call every substring S[1 + iα..|S|] an α-sampled suffix of S, and similarly we call every substring S[1..iα] an α-sampled prefix of S. In our index, we maintain two tries ∆f and ∆r, where ∆f is a trie of all α-sampled suffixes of the patterns pi ∈ D, and ∆r is a trie containing the reverse of all α-sampled prefixes of pi ∈ D. The number of nodes in ∆f and ∆r is O(n/α), and hence their size is bounded by O((n/α) log n) bits.5 Note that along with ∆f and ∆r, we also store D. By applying the storage scheme in Section 2.1 (Lemma 1), we obtain the following lemma.

Lemma 6 By maintaining an index of size nHk(D) + o(n log σ) + O((n/α) log n) bits, the locus of a query text T (in ∆f or ∆r) can be computed in O(|T|) time.

Though a sparse suffix tree shares many similarities with the ordinary suffix tree, Lemma 4 does not work directly on a sparse suffix tree, since the sampled tree contains only the original suffixes that are α positions apart. However, by considering each character in the sparse suffix tree as a concatenation of α characters from Σ, Lemma 4 can be extended as follows.

Lemma 7 Let T be the input text, and T(j) be the suffix of T starting at the jth character. Then, by preprocessing T initially in O(|T| log n) time, the locus of CαT(j) (in ∆f or ∆r) for any j can be answered in O(α + log2 n) time, where Cα is a string of length α.
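The sampling of Section 2.4 can be illustrated as follows (0-indexed; a hypothetical helper that merely lists the sampled strings which would populate ∆f and ∆r — the trie construction itself is omitted).

```python
def alpha_sampled(p: str, alpha: int):
    """alpha-sampled suffixes of p (starting at positions 0, alpha, 2*alpha, ...)
    and reversed alpha-sampled prefixes (of lengths alpha, 2*alpha, ...)."""
    suffixes = [p[i:] for i in range(0, len(p), alpha)]
    rev_prefixes = [p[:i][::-1] for i in range(alpha, len(p) + 1, alpha)]
    return suffixes, rev_prefixes
```

A pattern of length |p| contributes O(|p|/α) strings to each trie, which is how the O(n/α) node bound arises.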

2.5 Range Minimum Query (RMQ)

Let A be an array of length n. A range minimum query (RMQ) on A takes a query interval [i, j], and the task is to return an index k such that A[k] ≤ A[x] for all i ≤ x ≤ j. This can be answered in constant time by maintaining a structure of 2n + o(n) bits over A [11].
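For illustration, the classical sparse-table construction below provides the same constant-time query interface; note it uses O(n log n) words rather than the 2n + o(n) bits of the structure of [11], so it is only a stand-in for that succinct structure.

```python
class RMQ:
    """Sparse-table RMQ: table[k][i] holds an argmin of a[i .. i + 2^k - 1].
    A query [i, j] is covered by two overlapping power-of-two windows."""

    def __init__(self, a):
        self.a = a
        n = len(a)
        self.table = [list(range(n))]          # windows of length 2^0 = 1
        k = 1
        while (1 << k) <= n:
            prev = self.table[-1]
            half = 1 << (k - 1)
            row = []
            for i in range(n - (1 << k) + 1):  # merge two half-windows
                l, r = prev[i], prev[i + half]
                row.append(l if a[l] <= a[r] else r)
            self.table.append(row)
            k += 1

    def query(self, i, j):
        """Index of a minimum element of a[i..j] (inclusive), in O(1) time."""
        k = (j - i + 1).bit_length() - 1
        l, r = self.table[k][i], self.table[k][j - (1 << k) + 1]
        return l if self.a[l] <= self.a[r] else r
```

Preprocessing is O(n log n); each query inspects exactly two precomputed windows.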

5 Assume |pi| ≥ α.


2.6 Three-Sided Range Query Structure

Let R = {(x1, y1), (x2, y2), . . . , (xn, yn)} be a set of n points in two-dimensional space. Without loss of generality, we assume that xi ≤ xi+1. A three-sided query on R is defined as follows: given a query range [xℓ, xr] × [−∞, y], report all points (xi, yi) such that xℓ ≤ xi ≤ xr and yi ≤ y.

For answering this query efficiently, we maintain two arrays X and Y such that X[i] = xi and Y[i] = yi. Further, we build a predecessor query structure [19] over X and an RMQ structure over Y. Whenever a query comes, we first find the maximal range X[ℓ′...r′] (using predecessor search in X in O(log log n) time) such that xℓ ≤ xℓ′ ≤ xr′ ≤ xr. Now our task is to retrieve all points Y[k] ≤ y such that ℓ′ ≤ k ≤ r′. This can be done by performing RMQs repeatedly as follows. First we obtain the minimum value in this range (say Y[k]) and check whether Y[k] ≤ y. If so, we report it as an output, and recursively perform this query in the subranges Y[ℓ′...(k − 1)] and Y[(k + 1)...r′]. Whenever an RMQ returns a value greater than y, we stop recursing into that interval. The total time is bounded by O(log log n + |output|), where |output| denotes the output size.
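The query procedure above can be sketched as follows. This is a hypothetical helper: Python's bisect plays the role of the predecessor structure [19], and a naive linear-scan minimum stands in for the constant-time RMQ structure of Section 2.5, so only the recursion structure (report the minimum, split, stop when the minimum exceeds y) matches the real data structure.

```python
from bisect import bisect_left, bisect_right

def three_sided(points, xl, xr, y):
    """Report all (xi, yi) in points (sorted by x) with xl <= xi <= xr
    and yi <= y, via predecessor search on X and RMQ-style recursion on Y."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    lo = bisect_left(xs, xl)            # predecessor-search stand-in
    hi = bisect_right(xs, xr) - 1
    out = []

    def rec(l, r):
        if l > r:
            return
        k = min(range(l, r + 1), key=lambda i: ys[i])  # RMQ stand-in
        if ys[k] <= y:                  # minimum qualifies: report and split
            out.append(points[k])
            rec(l, k - 1)
            rec(k + 1, r)
        # else: every y in this subrange exceeds y, so prune it entirely

    rec(lo, hi)
    return out
```

Each recursive call either reports a point or prunes a whole subrange, which is why the cost is proportional to the output size.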

3 Approximate Dictionary Matching with One Error

To simplify the discussion, throughout the remainder of this paper we assume that the error (if any) is due to a character substitution. Insertion or deletion errors can be handled easily in a similar fashion. In this section, we first introduce Amir et al.'s index [5] for approximate dictionary matching with one error. Then we show a simple substitution idea, which changes the query algorithm so that the space bound of Amir et al.'s index can be improved.

3.1 Amir et al.’s Index

Let S[1..s] be a string. To decide whether S matches a pattern P[1..p] with one error, suppose S[i] ≠ P[i] for some 1 ≤ i ≤ s. Then we need to check whether S[1..i − 1] is the same as P[1..i − 1] and S[i + 1..s] is the same as P[i + 1..p]. If so, S matches P with one error. Amir et al.'s index is based on this idea. Let S be a string and SR be the reverse of S. Given a set D = {p1, p2, ..., pd} of d patterns, let DR = {pR1, pR2, ..., pRd} be the set of patterns such that pRi is the reverse of pi. Amir et al.'s index consists of a suffix tree STD for all the patterns in D, and a suffix tree STDR for all patterns in DR. Next, all the nodes in STD and STDR are renamed according to the centroid path decomposition, such that the nodes in the same centroid path have contiguous id's. They also maintain a three-dimensional (n × n × σ) range searching structure which links the nodes v ∈ STD, u ∈ STDR, and a character c ∈ Σ, whenever the concatenation of (i) the reverse of path(u), (ii) c, and (iii) path(v) is equal to some pattern pi ∈ D. Note that u and v are represented by the node id's renamed according to the centroid path decomposition. The approximate dictionary matching algorithm is given as follows.

Let T = t1t2...t|T|. For i = 1, ..., |T| do:

1. Find the locus node v of the longest prefix of ti+1...t|T| in STD.
2. Find the locus node u of the longest prefix of ti−1...t1 in STDR.

Now our task is to quickly check whether there is any pattern pk which matches T with exactly one error, where the error appears at location ti. This query can be modeled as a range query on the linking structure between all the ancestors of u, all the ancestors of v, and characters c ≠ ti. Based on the property of the centroid path decomposition, this query can be decomposed into O(log2 n) 5-sided queries, and the overall query can be answered in O(|T| log3 n log log n + occ) time. Steps 1 and 2 can be answered by Lemma 2. The main space bottleneck in Amir et al.'s index is the three-dimensional range searching structure of Overmars [17], so that their total space complexity is O(n log2 n) words.

3.2 Substitution Idea

We show how to reduce the above three-dimensional range searching problem to a series of two-dimensional range searching problems, so that we can achieve an index with smaller space. In particular, we use a two-dimensional range searching structure, such that the nodes v ∈ STD and u ∈ STDR are linked if and only if the concatenation of (i) the reverse of path(u) and (ii) path(v) is equal to a pattern pi ∈ D. The query algorithm is modified as follows.

For i = 1, ..., |T| and for each character c ∈ Σ with c ≠ ti do:

1. Find the locus node v of the longest prefix of ti+1...tn in STD.
2. Find the locus node u of the longest prefix of cti−1...t1 in STDR.
3. Similar to the approach in Section 3.1, perform the intersection query on all ancestors of u and v, by splitting the query into O(log2 n) three-sided queries.

The three-sided queries of Step 3 can be answered using the structures described in Section 2.6. The result can be summarized in the following theorem.

Theorem 1 Approximate dictionary matching with one error can be performed in O(σ|T| log2 n log log n + occ) time by maintaining an O(n)-word index.

4 Compressed Approximate Dictionary Matching withOne Error

Now we describe our idea for compressed approximate dictionary matching with one error. Our index is similar to the index described in Section 3.2, and we use the sampling scheme of [13] to reduce its space requirement. We handle long patterns (|pi| ≥ α) and short patterns (|pi| < α) separately, using two different indexes, where α is a sampling factor.


4.1 Handling Long Patterns

For our index, we construct sparse suffix trees ∆r and ∆f of the patterns in D for a given sampling factor α. Also, we extract the longest prefix of each pattern whose length is a multiple of α, block every α characters of each such prefix into a big character, and construct an auxiliary suffix tree SSTD on these blocked prefixes. Similarly, we construct a suffix tree SSTDR on the reversed forms of the above blocked prefixes.

In ∆f and ∆r, if a node represents a suffix of a pattern (not a pattern itself) in D or DR, we mark this node and call it a marked suffix node. Then, for each node representing a pattern, we mark it in a different way and store a pointer to its nearest marked ancestor node which also represents a pattern. In SSTD (respectively SSTDR), for each node we store a pointer to the corresponding node in the compact trie ∆f (respectively ∆r) which represents the same suffix in D (respectively DR).

For each pattern P whose length is not a multiple of α, we define the residue of P to be its last x characters, where x = |P| (mod α). We use the following Query Algorithm 1 to find all the matches such that the error appears in the "blocked prefix" part of the pattern. Then we use Query Algorithm 2 to find all the matches such that the error appears in the residue part of the pattern.

Query Algorithm 1

For j = 1, . . . , α do
  For i = 1, . . . , m/α do (where we write T = t1 . . . tm, so m = |T|)
    For each character in the substring tj+(i−1)α . . . tj+iα−1, we change it σ times and perform the following steps. Thus we perform the following steps σα times.

1. Find the locus node v of the longest prefix of tj+iα . . . tm in SSTD.
2. Find the locus node u of the longest prefix of tj+iα−1 . . . t1 in SSTDR. Note that in Steps 1 and 2, we query the suffix trees with big characters.
3. Use v to find the corresponding node in ∆f, and then further traverse the sparse suffix tree ∆f to find the correct locus node v′ of the longest prefix of tj+iα . . . tm in ∆f. We do this further traversal because the length of some long patterns might not be a multiple of α.
4. Use u to find the corresponding node u′ in ∆r. From u′, we report all the marked ancestors (which correspond to long patterns in D) of u′.
5. Perform the intersection query on those marked suffix nodes which are ancestors of v′ in ∆f and those marked suffix nodes which are ancestors of u′ in ∆r.

For each round, in Step 1, we use the result of Lemma 2, so that it takes amortized O(1) time (and a total of O(|T|) time). In Step 2, we use the result of Lemma 6, so that it takes O(log2 n) time. Steps 3 and 4 are straightforward and take at most O(α) time. In Step 5, we use the same scheme as in Section 3.2, so that it takes O(log2 n log log n + |output|) time. Thus, for each round, the total time is O(log2 n log log n + |output|). Since there are σ|T|α rounds, the total time of our algorithm is bounded by O(σ|T|α log2 n log log n + occ). The space requirement of the sparse suffix trees and the auxiliary suffix trees is O(∑di=1(|pi|/α + 1) log n) = O((n/α) log n) bits, and the space requirement of the centroid path decompositions and the range searching structure is also O(∑di=1(|pi|/α + 1) log n) = O((n/α) log n) bits.

Query Algorithm 2

For j = 1, . . . , α do
  For i = m/α, . . . , 1 do
    For each character in the substring tj+iα . . . tj+(i+1)α−2, we change it σ times and perform the following steps. Thus we perform the following steps σ(α − 1) times.

1. Find the locus node v′ of the longest prefix of tj+iα . . . tj+(i+1)α−2 in ∆f.
2. Find the locus node u of the longest prefix of tj+iα−1 . . . t1 in SSTDR. Note that in Step 2, we query the suffix tree with big characters.
3. Use u to find the corresponding node u′ in ∆r.
4. Perform the intersection query of the marked suffix nodes (whose depth is larger than the length of the substring from tj+iα to the changed character) which are ancestors of v′ in ∆f and ancestors of u′ in ∆r. Note that there are O(α) marked suffix nodes which are ancestors of v′, so we can find these marked suffix nodes directly.

Query Algorithm 2 is very similar to Query Algorithm 1, and it is easy to show that they have the same time complexity. The space complexity is also the same, since we use the same structures. This gives the following theorem.

Theorem 2 Let D1 ⊆ D be the subset of all long patterns in D, where each pattern p ∈ D1 has length at least α. Let n1 = ∑p∈D1 |p| ≤ n. Then D1 can be indexed in n1Hk(D1) + o(n log σ) + O((n/α) log n) bits of space, such that all the occurrences of patterns p ∈ D1 which appear as a substring of an online text T with one error can be reported in O(σ|T|α log2 n log log n + occ) time.

4.2 Handling Short Patterns

In order to handle short patterns, we maintain an index for exact dictionary matching for the set of all short patterns (|pi| < α). Here we choose the nHk(D) + O(n)-bit index by Hon et al. [12], which can perform dictionary matching in optimal O(|T| + occ) time. Our approach is again by substitution. First, we block the text T into overlapping intervals T1, T2, . . . of length 2α, where Ti = T[1 + (i − 1)α, ..., (i + 1)α]. Then we obtain a set of new blocks by introducing an error at each position of each of these blocks. The total number of such new blocks (with one error) is O(σα · (|T|/α)) = O(σ|T|). Now all these new blocks can be checked separately to report the matches. However, this approach creates a small problem, as not all the reported occurrences need be approximate matches (some can be exact matches, since the introduced error need not lie within an occurrence). However, the number of occurrences within a block is bounded by C(2α, 2) = O(α2), hence the total number of outputs can be bounded by O(σ|T|α2). Among all outputs, those which match with an error at the substituted position are reported as the actual outputs.
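The blocking-and-substitution step can be sketched as follows (a hypothetical helper, 0-indexed; in the actual index, the exact-matching structure of [12] would then be queried on each variant, and only hits spanning the substituted position are kept).

```python
def error_blocks(t: str, alpha: int, sigma: str):
    """Overlapping blocks t[i*alpha : i*alpha + 2*alpha] of the text, plus
    every one-substitution variant of each block over the alphabet sigma."""
    blocks = [t[i:i + 2 * alpha]
              for i in range(0, max(1, len(t) - alpha), alpha)]
    variants = []
    for b in blocks:
        for j in range(len(b)):
            for c in sigma:
                if c != b[j]:               # substitute position j with c
                    variants.append(b[:j] + c + b[j + 1:])
    return blocks, variants
```

Each block of length 2α yields O(σα) variants, and there are O(|T|/α) blocks, giving the O(σ|T|) total above.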

Theorem 3 Let D2 ⊆ D be the subset of all short patterns in D, where each pattern p ∈ D2 has length less than α. Let n2 = ∑p∈D2 |p| ≤ n. Then D2 can be indexed in n2Hk(D2) + O(n2) bits of space, such that all the occurrences of patterns p ∈ D2 which appear as a substring of an online text T with one error can be reported in O(σ|T|α3) time.

Using the results of Theorems 2 and 3, we maintain separate indexes for short and long patterns. The following theorem is obtained by choosing α = log n. Note that Hk is the kth-order empirical entropy of the complete set (long and short) of patterns.

Theorem 4 Approximate dictionary matching with one error can be solved in (almost) optimal nHk + O(n) + o(n log σ) bits of space and O(σ|T| log3 n log log n + occ) query time.

5 Concluding Remarks

In this paper, we have proposed the first compressed index for approximate dictionary matching with one error, taking space close to optimal. There are several interesting open problems in this research direction:

1. Can we improve the query time so that it becomes independent of σ, while keeping the space succinct/compressed?

2. Can we obtain an efficient compressed index for approximate dictionary matching with k errors?

3. Can we modify the index to allow efficient updates of patterns in D?

References

1. A. Aho and M. Corasick. Efficient String Matching: An Aid to Bibliographic Search. Communications of the ACM, 18(6):333–340, 1975.

2. A. Amir and M. Farach. Adaptive Dictionary Matching. In Proceedings of Symposium on Foundations of Computer Science, pages 760–766, 1991.

3. A. Amir, M. Farach, Z. Galil, R. Giancarlo, and K. Park. Dynamic Dictionary Matching. Journal of Computer and System Sciences, 49(2):208–222, 1994.


4. A. Amir, M. Farach, R. Idury, A. La Poutre, and A. Schaffer. Improved Dynamic Dictionary Matching. Information and Computation, 119(2):258–282, 1995.

5. A. Amir, D. Keselman, G. M. Landau, M. Lewenstein, N. Lewenstein, and M. Rodeh. Text Indexing and Dictionary Matching With One Error. Journal of Algorithms, 37(2):309–325, 2000.

6. D. Belazzougui. Succinct Dictionary Matching With No Slowdown. In Proceedings of Symposium on Combinatorial Pattern Matching, 2010.

7. H. L. Chan, W. K. Hon, T. W. Lam, and K. Sadakane. Compressed Indexes for Dynamic Text Collections. ACM Transactions on Algorithms, 3(2), 2007.

8. R. Cole, L.-A. Gottlieb, and M. Lewenstein. Dictionary Matching and Indexing with Errors and Don't Cares. In Proceedings of Symposium on Theory of Computing, pages 91–100, 2004.

9. P. Ferragina, S. Muthukrishnan, and M. de Berg. Multi-method Dispatching: A Geometric Approach with Applications to String Matching Problems. In Proceedings of Symposium on Theory of Computing, pages 483–491, 1999.

10. P. Ferragina and R. Venturini. A Simple Storage Scheme for Strings Achieving Entropy Bounds. Theoretical Computer Science, 372(1):115–121, 2007.

11. J. Fischer and V. Heun. A New Succinct Representation of RMQ-Information and Improvements in the Enhanced Suffix Array. In Proceedings of Symposium on Combinatorics, Algorithms, Probabilistic and Experimental Methodologies, pages 459–470, 2007.

12. W. K. Hon, T. H. Ku, R. Shah, S. V. Thankachan, and J. S. Vitter. Faster Compressed Dictionary Matching. In Proceedings of International Symposium on String Processing and Information Retrieval, pages 191–200, 2010.

13. W. K. Hon, T. W. Lam, R. Shah, S. L. Tam, and J. S. Vitter. Compressed Index for Dictionary Matching. In Proceedings of Data Compression Conference, pages 23–32, 2008.

14. W. K. Hon, R. Shah, S. V. Thankachan, and J. S. Vitter. On Entropy-Compressed Text Indexing in External Memory. In Proceedings of International Symposium on String Processing and Information Retrieval, pages 75–89, 2009.

15. W. K. Hon, R. Shah, and J. S. Vitter. Compression, Indexing, and Retrieval for Massive String Data. In Proceedings of Symposium on Combinatorial Pattern Matching, pages 260–274, 2010.

16. E. M. McCreight. A Space-economical Suffix Tree Construction Algorithm. Journal of the ACM, 23(2):262–272, 1976.

17. M. H. Overmars. Efficient Data Structures for Range Searching on a Grid. Journal of Algorithms, 9:254–275, 1988.

18. P. Weiner. Linear Pattern Matching Algorithms. In Proceedings of Symposium on Switching and Automata Theory, pages 1–11, 1973.

19. D. E. Willard. Log-Logarithmic Worst-Case Range Queries are Possible in Space Θ(N). Information Processing Letters, 17(2):81–84, 1983.
