Shift-And Approach to Pattern Matching in LZW Compressed Text
Takuya KIDA, Masayuki TAKEDA, Ayumi SHINOHARA, Setsuo ARIKAWA. Department of Informatics, Kyushu University, Japan.
![Page 1: Shift-And Approach to Pattern Matching in LZW Compressed Text](https://reader035.vdocument.in/reader035/viewer/2022062520/56815e34550346895dcc943c/html5/thumbnails/1.jpg)
Shift-And Approach to Pattern Matching in LZW Compressed Text
Takuya KIDA, Masayuki TAKEDA, Ayumi SHINOHARA, Setsuo ARIKAWA
Department of Informatics, Kyushu University, Japan
![Page 2: Shift-And Approach to Pattern Matching in LZW Compressed Text](https://reader035.vdocument.in/reader035/viewer/2022062520/56815e34550346895dcc943c/html5/thumbnails/2.jpg)
<2/32>
Motivation
Address book
Schedule
Dictionary
Phone numbers
Memo
Electronic book
Database
The available storage devices are limited! I am eager to store as much information as possible! I want to do pattern matching as fast as possible!
...Yes! Data compression!
...but a suffix trie is very large...
![Page 3: Shift-And Approach to Pattern Matching in LZW Compressed Text](https://reader035.vdocument.in/reader035/viewer/2022062520/56815e34550346895dcc943c/html5/thumbnails/3.jpg)
<3/32>
Our goal
[Figure: the usual route decompresses the compressed text into the original text and runs a pattern matching machine on it; the goal is a new machine that runs directly on the compressed text.]
![Page 4: Shift-And Approach to Pattern Matching in LZW Compressed Text](https://reader035.vdocument.in/reader035/viewer/2022062520/56815e34550346895dcc943c/html5/thumbnails/4.jpg)
<4/32>
Previous research
| Year | Researchers | Compression method |
|---|---|---|
| 1988 | Eliam-Tsoreff and Vishkin | run-length |
| 1992 | Amir, Landau, and Vishkin | two-dimensional run-length |
| 1992 | Amir and Benson | two-dimensional run-length |
| 1994 | Amir, Benson, and Farach | two-dimensional run-length |
| 1994 | Manber | original compression scheme |
| 1995 | Farach and Thorup | LZ77 |
| 1996 | Amir, Benson and Farach | LZW |
| 1996 | Gasieniec, et al. | LZ77 |
| 1997 | Karpinski, Rytter, and Shinohara | straight-line programs |
| 1997 | Miyazaki, Shinohara, and Takeda | straight-line programs |
| 1997 | Takeda | finite state encoding |
| 1998 | Shibata | byte pair encoding |
| 1998 | Fukamachi, Shinohara, and Takeda | Huffman encoding |
| 1998 | Kida, et al. | LZW (AC automaton, DCC’98) |
![Page 5: Shift-And Approach to Pattern Matching in LZW Compressed Text](https://reader035.vdocument.in/reader035/viewer/2022062520/56815e34550346895dcc943c/html5/thumbnails/5.jpg)
<5/32>
Recent research
| Year | Researchers | Compression method |
|---|---|---|
| 1998 | de Moura, Navarro, Ziviani, and Baeza-Yates | Word based encoding |
| 1999 | Kida, Takeda, Shinohara, and Arikawa | LZW (Shift-And algorithm) |
| 1999 | Shibata, et al. | Byte pair encoding |
| 1999 | Kida, et al. | Dictionary based methods (Collage system) |
| 1999 | Navarro and Raffinot | LZ family |
| 1999 | Shibata, Takeda, Shinohara, and Arikawa | Antidictionaries |
Venue labels shown on the slide: CPM’99 (three entries) and SPIRE’99 (one entry).
![Page 6: Shift-And Approach to Pattern Matching in LZW Compressed Text](https://reader035.vdocument.in/reader035/viewer/2022062520/56815e34550346895dcc943c/html5/thumbnails/6.jpg)
<6/32>
Our main results
The new algorithm scans a compressed text in O(n+r) time using O(|D|) space, and reports all occurrences of the pattern after an O(m+|Σ|) time and O(|Σ|) space preprocessing.
The algorithm is about 1.3 times faster than our previous one, which simulates the AC automaton.
The algorithm is about 1.5 times faster than decompression followed by a simple search using the Shift-And algorithm.
|D| : size of the dictionary trie
n : compressed text length
m : pattern length
r : number of pattern occurrences
|Σ| : alphabet size
![Page 7: Shift-And Approach to Pattern Matching in LZW Compressed Text](https://reader035.vdocument.in/reader035/viewer/2022062520/56815e34550346895dcc943c/html5/thumbnails/7.jpg)
Lempel-Ziv-Welch Compression
how to compress and decompress
![Page 8: Shift-And Approach to Pattern Matching in LZW Compressed Text](https://reader035.vdocument.in/reader035/viewer/2022062520/56815e34550346895dcc943c/html5/thumbnails/8.jpg)
<8/32>
Lempel-Ziv-Welch (LZW) compression
Original text: a b ab ab ba b c aba bc abab
Compressed text: 1 2 4 4 5 2 3 6 9 11
[Figure: dictionary trie with root 0. Nodes and their strings: 1 = a, 2 = b, 3 = c, 4 = ab, 5 = ba, 6 = aba, 7 = abb, 8 = bab, 9 = bc, 10 = ca, 11 = abab, 12 = bca.]
Size of the dictionary trie: O(|D|) = O(n)
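The compression step on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `lzw_compress`, the `(parent, character)` map used for the trie, and the assumption that the initial alphabet is `a, b, c` with codes 1, 2, 3 are all choices made here to reproduce the slide's example. It has no dictionary size limit or reset, unlike the UNIX `compress` command.

```python
def lzw_compress(text, alphabet):
    """Return the LZW codes of text. Codes start at 1; 0 is the trie root."""
    # Dictionary trie stored as a map (parent code, character) -> child code.
    trie = {(0, ch): code for code, ch in enumerate(alphabet, start=1)}
    next_code = len(alphabet) + 1
    codes = []
    node = 0  # current trie node; 0 means nothing has been matched yet
    for ch in text:
        if (node, ch) in trie:
            node = trie[(node, ch)]        # extend the current match
        else:
            codes.append(node)             # emit the longest match found
            trie[(node, ch)] = next_code   # register match + next character
            next_code += 1
            node = trie[(0, ch)]           # restart from the unmatched character
    if node != 0:
        codes.append(node)                 # flush the last match
    return codes


# The slide's example: a b ab ab ba b c aba bc abab
print(lzw_compress("abababbabcababcabab", "abc"))
# prints [1, 2, 4, 4, 5, 2, 3, 6, 9, 11]
```

Each input character costs one dictionary lookup, and one trie node is created per emitted code, which is why the trie size is bounded by the compressed text length.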
![Page 9: Shift-And Approach to Pattern Matching in LZW Compressed Text](https://reader035.vdocument.in/reader035/viewer/2022062520/56815e34550346895dcc943c/html5/thumbnails/9.jpg)
<9/32>
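How to compress a text
Original text: a b ab ab ba b c aba bc abab
Compressed text: 1 2 4 4 5 2 3 6 9 11
[Figure: the dictionary trie grows by one node per emitted code. Edges (parent, character, child): (0, a, 1), (0, b, 2), (0, c, 3), (1, b, 4), (2, a, 5), (4, a, 6), (4, b, 7), (5, b, 8), (2, c, 9), (3, a, 10), (6, b, 11), (9, a, 12).]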
![Page 10: Shift-And Approach to Pattern Matching in LZW Compressed Text](https://reader035.vdocument.in/reader035/viewer/2022062520/56815e34550346895dcc943c/html5/thumbnails/10.jpg)
<10/32>
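How to decompress a compressed text
Compressed text: 1 2 4 4 5 2 3 6 9 11
Original text: a b ab ab ba b c aba bc abab
[Figure: the dictionary trie is rebuilt while the codes are read. Edges (parent, character, child): (0, a, 1), (0, b, 2), (0, c, 3), (1, b, 4), (2, a, 5), (4, a, 6), (4, b, 7), (5, b, 8), (2, c, 9), (3, a, 10), (6, b, 11), (9, a, 12). The slide is annotated with "O(n) time" and "O(N) time".]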
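Decompression rebuilds the same dictionary from the codes alone. The sketch below is an illustration under assumptions made here: the name `lzw_decompress` is hypothetical, the initial alphabet is `a, b, c`, and each node stores its whole string for readability. A real decoder would store only a parent pointer and one character per node, which this version does not attempt.

```python
def lzw_decompress(codes, alphabet):
    """Rebuild the original text from LZW codes (codes start at 1)."""
    # strings[k] is the string of trie node k; node 0 is the root (empty string).
    strings = [""] + list(alphabet)
    out = []
    prev = None  # string of the previously decoded code
    for code in codes:
        if 0 < code < len(strings):
            cur = strings[code]
        elif code == len(strings) and prev is not None:
            # The code refers to the node that is being created right now,
            # so its string must be prev followed by prev's first character.
            cur = prev + prev[0]
        else:
            raise ValueError(f"invalid LZW code {code}")
        if prev is not None:
            strings.append(prev + cur[0])  # same node the compressor added
        out.append(cur)
        prev = cur
    return "".join(out)


print(lzw_decompress([1, 2, 4, 4, 5, 2, 3, 6, 9, 11], "abc"))
# prints abababbabcababcabab
```

The second branch handles the one case where the decoder is a step behind the encoder, for example the code sequence 1 2 1 for the text aaaa over a one-letter alphabet.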
![Page 11: Shift-And Approach to Pattern Matching in LZW Compressed Text](https://reader035.vdocument.in/reader035/viewer/2022062520/56815e34550346895dcc943c/html5/thumbnails/11.jpg)
Compressed Pattern Matching in LZW Compressed Text
with Shift-And approach
![Page 12: Shift-And Approach to Pattern Matching in LZW Compressed Text](https://reader035.vdocument.in/reader035/viewer/2022062520/56815e34550346895dcc943c/html5/thumbnails/12.jpg)
<12/32>
Shift-And approach to pattern matching
(Baeza-Yates and Gonnet [1992], Wu and Manber [1992])
text: aabaacaabacab
pattern: aabac
mask bits (leftmost bit = pattern position 1): a = 11010, b = 00100, c = 00001
One step: state 10000, next character a: shift gives 01000, OR with 10000 gives 11000, AND with the mask of a (11010) gives 11000.
State after each text character:
a 10000
a 11000
b 00100
a 10010
a 11000
c 00000
a 10000
a 11000
b 00100
a 10010
c 00001 (Pattern was found!)
a 10000
b 00000
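The state sequence on this slide can be reproduced with a short Python sketch of the Shift-And scan. The function name `shift_and` and the return format are choices made here for illustration. Internally bit i-1 of an integer stands for pattern position i, so the bit strings are reversed before printing to match the slide, where the leftmost bit is position 1.

```python
def shift_and(text, pattern):
    """Return (end positions of occurrences, state after each character)."""
    m = len(pattern)
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)   # bit i <-> position i+1
    accept = 1 << (m - 1)                         # bit of the last position
    state = 0
    ends, states = [], []
    for pos, ch in enumerate(text, start=1):
        # bit shift, OR with position 1, AND with the mask of the character
        state = ((state << 1) | 1) & masks.get(ch, 0)
        states.append(format(state, f"0{m}b")[::-1])  # slide's bit order
        if state & accept:
            ends.append(pos)
    return ends, states


ends, states = shift_and("aabaacaabacab", "aabac")
print(ends)        # prints [11]
print(states[:4])  # prints ['10000', '11000', '00100', '10010']
```

The occurrence is reported at text position 11, where the state 00001 appears on the slide.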
![Page 13: Shift-And Approach to Pattern Matching in LZW Compressed Text](https://reader035.vdocument.in/reader035/viewer/2022062520/56815e34550346895dcc943c/html5/thumbnails/13.jpg)
<13/32>
Properties of Shift-And approach
Simple, but very fast when the pattern length m is not greater than the word length of typical computers (32 or 64).
Assuming m ≤ 32 (or 64) and that bit-shift operations and bitwise logical operations on integers can be performed in constant time, it runs in O(n) time.
This method has many variations:
generalized pattern matching
pattern matching with k-mismatches
pattern matching for multiple patterns
![Page 14: Shift-And Approach to Pattern Matching in LZW Compressed Text](https://reader035.vdocument.in/reader035/viewer/2022062520/56815e34550346895dcc943c/html5/thumbnails/14.jpg)
<14/32>
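Basic idea of our algorithm
text: aabaacaabacab
pattern: aabac
mask bits: a = 11010, b = 00100, c = 00001
[Figure: the Shift-And states for every text position (10000, 11000, 00100, 10010, 11000, 00000, 10000, 11000, 00100, 10010, 00001, 10000, 00000). The new algorithm reads the compressed text one code at a time and jumps directly to the state reached at the end of that code's string ("Jump! Jump!"), skipping the states in between.]
Can each jump be performed in O(1) time?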
![Page 15: Shift-And Approach to Pattern Matching in LZW Compressed Text](https://reader035.vdocument.in/reader035/viewer/2022062520/56815e34550346895dcc943c/html5/thumbnails/15.jpg)
<15/32>
Basic idea of our algorithm
text: aabaacaabacab
pattern: aabac
mask bits: a = 11010, b = 00100, c = 00001
[Figure: the same state sequence as on the previous slide. The state 00001 ("Pattern was found!") can be reached inside the string of a code, where a jump passes over it.]
We need a mechanism for reporting all pattern occurrences.
![Page 16: Shift-And Approach to Pattern Matching in LZW Compressed Text](https://reader035.vdocument.in/reader035/viewer/2022062520/56815e34550346895dcc943c/html5/thumbnails/16.jpg)
<16/32>
Technical details
Lemma 1 (Realization of ‘Jump’): The state transition function can be realized in O(|D|+m) time using O(|D|) space, so that it returns its value in O(1) time.
Lemma 2 (Realization of ‘Output’): The procedure which enumerates the pattern occurrences can be realized in O(|D|+m) time using O(|D|) space, so that it runs in O(r) time.
|D| : size of the dictionary trie
m : pattern length
r : number of pattern occurrences
![Page 17: Shift-And Approach to Pattern Matching in LZW Compressed Text](https://reader035.vdocument.in/reader035/viewer/2022062520/56815e34550346895dcc943c/html5/thumbnails/17.jpg)
<17/32>
Overview of the algorithm
Input: pattern P; LZW compressed text u1, u2, …, un.
Output: all occurrences of the pattern.

Construct mask bits from P.
Initialize the dictionary trie, M̂, U, and V;
l := 0; S := ∅;
for i := 1 to n do begin
    for each d ∈ Output(S, ui) do
        report ‘pattern occurs at position l+d’;
    S := f̂(S, ui);    /* Jump the state! */
    l := l + |ui|;    /* increment the offset */
    Update the dictionary trie, M̂, U, and V;
end
![Page 18: Shift-And Approach to Pattern Matching in LZW Compressed Text](https://reader035.vdocument.in/reader035/viewer/2022062520/56815e34550346895dcc943c/html5/thumbnails/18.jpg)
Detail of our Algorithm
Realization of Jump and Output
![Page 19: Shift-And Approach to Pattern Matching in LZW Compressed Text](https://reader035.vdocument.in/reader035/viewer/2022062520/56815e34550346895dcc943c/html5/thumbnails/19.jpg)
<19/32>
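Detail of ‘Jump’
For a ∈ Σ and S ⊆ {1, …, m}:
f(S, a) := ((S ⊕ 1) ∪ {1}) ∩ M(a)
M(a) := { 1 ≤ i ≤ m | Pattern[i] = a }
Here S ⊕ k = { i + k | i ∈ S }. In the bit representation, ⊕ is a bit shift, ∪ is OR, and ∩ is AND.
pattern: aabac
M(a) = {1, 2, 4}, M(b) = {3}, M(c) = {5}
mask bits: a = 11010, b = 00100, c = 00001
Example of a state transition: for S = {1, 3} (bits 10100), f(S, a) = ({2, 4} ∪ {1}) ∩ {1, 2, 4} = {1, 2, 4} (bits 11010).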
![Page 20: Shift-And Approach to Pattern Matching in LZW Compressed Text](https://reader035.vdocument.in/reader035/viewer/2022062520/56815e34550346895dcc943c/html5/thumbnails/20.jpg)
<20/32>
Detail of ‘Jump’
f(S, a) := ((S ⊕ 1) ∪ {1}) ∩ M(a)
M(a) := { 1 ≤ i ≤ m | Pattern[i] = a }
For a ∈ Σ, u ∈ Σ*, and S ⊆ {1, …, m}, define f̂ recursively:
f̂(S, ε) := S
f̂(S, ua) := f(f̂(S, u), a)
M̂(u) := f̂({1, …, m}, u)
Then
f̂(S, u) = ((S ⊕ |u|) ∪ {1, …, |u|}) ∩ M̂(u)
which can be evaluated in O(1) time once M̂(u) is known. Here S ⊕ k = { i + k | i ∈ S }.
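The closed form for the jump can be checked against the character-by-character definition with a small Python sketch. All names here (`make_masks`, `f`, `f_hat`, `f_hat_jump`) are hypothetical, sets are encoded as integers with bit i-1 standing for pattern position i, and M̂(u) is recomputed on the fly for clarity instead of being stored in the dictionary trie.

```python
def make_masks(pattern):
    """M(a) as an integer: bit i-1 is set iff Pattern[i] = a (1-based i)."""
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    return masks


def f(S, a, masks):
    """One Shift-And step: ((S shifted by 1) OR {1}) AND M(a)."""
    return ((S << 1) | 1) & masks.get(a, 0)


def f_hat(S, u, masks):
    """Recursive definition: apply f once per character of u."""
    for a in u:
        S = f(S, a, masks)
    return S


def f_hat_jump(S, u, masks, m):
    """Closed form: ((S shifted by |u|) OR {1..|u|}) AND M^(u)."""
    full = (1 << m) - 1                # the set {1, ..., m}
    m_hat_u = f_hat(full, u, masks)    # M^(u); precomputed per trie node in practice
    k = len(u)
    return ((S << k) | ((1 << k) - 1)) & m_hat_u


pattern = "aabac"
masks = make_masks(pattern)
m = len(pattern)
for u in ["", "a", "aba", "acaabac", "bb"]:
    for S in range(1 << m):            # every subset of {1, ..., m}
        assert f_hat(S, u, masks) == f_hat_jump(S, u, masks, m)
print("closed form agrees with the recursive definition")
```

With M̂(u) stored for every dictionary node, the jump costs one shift, one OR, and one AND, provided m fits in a machine word.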
![Page 21: Shift-And Approach to Pattern Matching in LZW Compressed Text](https://reader035.vdocument.in/reader035/viewer/2022062520/56815e34550346895dcc943c/html5/thumbnails/21.jpg)
<21/32>
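Move of f̂(S, u)
text: aabaacaabacab
pattern: aabac
Example: S = 10000 (the state after the first text character), u = aba, M̂(u) = 10010.
S ⊕ |u| = 00010
(S ⊕ |u|) ∪ {1, …, |u|} = 00010 OR 11100 = 11110
f̂(S, u) = 11110 AND 10010 = 10010
The result 10010 is the state after the fourth text character.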
![Page 22: Shift-And Approach to Pattern Matching in LZW Compressed Text](https://reader035.vdocument.in/reader035/viewer/2022062520/56815e34550346895dcc943c/html5/thumbnails/22.jpg)
<22/32>
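Move of f̂(S, u)
text: aabaacaabacab
pattern: aabac
Example: u = acaabac, M̂(u) = 00001.
Since |u| = 7 ≥ m = 5, S ⊕ |u| has no element within {1, …, m}, and {1, …, |u|} covers all of {1, …, m}.
f̂(S, u) = 11111 AND 00001 = 00001
The result 00001 contains position m, so the pattern was found at the end of u.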
![Page 23: Shift-And Approach to Pattern Matching in LZW Compressed Text](https://reader035.vdocument.in/reader035/viewer/2022062520/56815e34550346895dcc943c/html5/thumbnails/23.jpg)
<23/32>
How to calculate M̂(u)
M̂(ua) = f̂({1, …, m}, ua)
= f(f̂({1, …, m}, u), a)
= f(M̂(u), a)
= ((M̂(u) ⊕ 1) ∪ {1}) ∩ M(a)
[Figure: in the dictionary trie D, the node ua is the child of the node u via the edge labeled a; M̂(ua) is computed from M̂(u) stored at the parent.]
Each new node takes O(1) time; total: O(|D|) time and space.
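The incremental rule can be sketched as a single pass over the trie nodes in creation order. The edge list below is the dictionary trie of the earlier LZW example (text a b ab ab ba b c aba bc abab), and the pattern aabac is the one used on the Shift-And slides; combining the two, and the names `m_hat_table` and `as_positions`, are assumptions made here for illustration.

```python
def m_hat_table(edges, pattern):
    """edges[k-1] = (parent, character) of node k; returns M^ for nodes 0..|D|."""
    m = len(pattern)
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    table = [(1 << m) - 1]            # M^(empty string) = {1, ..., m}
    for parent, ch in edges:
        # M^(ua) = ((M^(u) shifted by 1) OR {1}) AND M(a): O(1) per node
        table.append(((table[parent] << 1) | 1) & masks.get(ch, 0))
    return table


def as_positions(bits):
    """Decode an integer into the set of pattern positions it represents."""
    return {i + 1 for i in range(bits.bit_length()) if (bits >> i) & 1}


EDGES = [(0, "a"), (0, "b"), (0, "c"), (1, "b"), (2, "a"), (4, "a"),
         (4, "b"), (5, "b"), (2, "c"), (3, "a"), (6, "b"), (9, "a")]
table = m_hat_table(EDGES, "aabac")
print(as_positions(table[1]))   # node 1 = "a"   -> {1, 2, 4}
print(as_positions(table[6]))   # node 6 = "aba" -> {1, 4}
```

Because a node is always created after its parent, one left-to-right pass suffices, and the table can be extended by one entry each time the dictionary grows.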
![Page 24: Shift-And Approach to Pattern Matching in LZW Compressed Text](https://reader035.vdocument.in/reader035/viewer/2022062520/56815e34550346895dcc943c/html5/thumbnails/24.jpg)
<24/32>
How to enumerate the occurrences
Output(S, u) := { 1 ≤ j ≤ |u| | m ∈ f̂(S, u[1..j]) }, defined on 2^{1, …, m} × D.
[Figure: the string u preceded by the length-i prefix of the pattern for the largest i ∈ S. Two pattern occurrences end at positions 2 and 11 of u, so Output(S, u) = {2, 11}.]
![Page 25: Shift-And Approach to Pattern Matching in LZW Compressed Text](https://reader035.vdocument.in/reader035/viewer/2022062520/56815e34550346895dcc943c/html5/thumbnails/25.jpg)
<25/32>
Realization of Output(S, u)
Two subsets U and V:
U(u) := { 1 ≤ i ≤ |u| | i < m and u[1..i] = Pattern[m−i+1..m] }
V(u) := { 1 ≤ i ≤ |u| | i ≥ m and u[i−m+1..i] = Pattern }
Output(S, u) = ((m ⊖ S) ∩ U(u)) ∪ V(u), where m ⊖ S = { m − i | i ∈ S }.
The first term, (m ⊖ S) ∩ U(u), is dependent on S; V(u) is independent of S.
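The decomposition of Output(S, u) can be checked against a direct character-by-character scan. The sketch below uses plain Python sets and hypothetical function names; it computes U(u) and V(u) straight from their definitions and makes no attempt at the O(1) updates discussed on the next slide.

```python
from itertools import combinations


def f(S, a, pattern):
    """One Shift-And step on a set of pattern positions."""
    M_a = {i for i, ch in enumerate(pattern, start=1) if ch == a}
    return ({i + 1 for i in S} | {1}) & M_a


def output_by_definition(S, u, pattern):
    """All j such that the state after u[1..j] contains m."""
    m, found = len(pattern), set()
    for j, a in enumerate(u, start=1):
        S = f(S, a, pattern)
        if m in S:
            found.add(j)
    return found


def U(u, pattern):
    """Lengths i < m for which the length-i prefix of u is a pattern suffix."""
    m = len(pattern)
    return {i for i in range(1, min(len(u), m - 1) + 1) if u[:i] == pattern[m - i:]}


def V(u, pattern):
    """Positions i >= m where a whole pattern occurrence ends inside u."""
    m = len(pattern)
    return {i for i in range(m, len(u) + 1) if u[i - m:i] == pattern}


def output_by_uv(S, u, pattern):
    m = len(pattern)
    return ({m - i for i in S} & U(u, pattern)) | V(u, pattern)


pattern, u = "aabac", "bacaabac"
print(output_by_uv({2}, u, pattern))   # prints {8, 3}

positions = range(1, len(pattern) + 1)
for size in range(len(pattern) + 1):
    for S in map(set, combinations(positions, size)):
        assert output_by_uv(S, u, pattern) == output_by_definition(S, u, pattern)
```

In the example, S = {2} means that aa has just been matched; the prefix bac of u completes that occurrence at position 3, and a second occurrence lies entirely inside u and ends at position 8.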
![Page 26: Shift-And Approach to Pattern Matching in LZW Compressed Text](https://reader035.vdocument.in/reader035/viewer/2022062520/56815e34550346895dcc943c/html5/thumbnails/26.jpg)
<26/32>
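How to calculate U(u) and V(u)
[Figure: in the dictionary trie D, the node ua is the child of the node u via the edge labeled a; U(ua) and V(ua) are computed from U(u) and V(u) stored at the parent.]
if m ∈ M̂(ua) then U(ua) = U(u) ∪ {|ua|} else U(ua) = U(u);
V(u) can be handled in the same way as in [DCC’98].
Each new node takes O(1) time; total: O(|D|) time and space.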
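Putting Jump and Output together, the whole scan can be sketched as follows. This is an illustration under several assumptions made here, not the authors' code: codes start at 1 and follow the plain LZW scheme of the earlier slides with no dictionary reset, the code sequence is valid, the pattern is non-empty, U(u) is stored bit-reversed so that (m ⊖ S) ∩ U(u) becomes a single AND, and V(u) is enumerated through a hypothetical `v_top` link to the deepest prefix node that ends with the pattern. How [DCC’98] represents V(u) is not shown on the slides.

```python
def search_lzw(codes, pattern, alphabet):
    """Return sorted 1-based end positions of pattern in the text encoded by codes."""
    m = len(pattern)
    full, accept = (1 << m) - 1, 1 << (m - 1)
    mask = {}
    for i, ch in enumerate(pattern):
        mask[ch] = mask.get(ch, 0) | (1 << i)

    # One entry per trie node; index = LZW code, node 0 = root (empty string).
    parent, first, length = [0], [""], [0]
    m_hat = [full]  # M^(u); bit i-1 stands for pattern position i
    u_rev = [0]     # U(u) stored reversed: bit i-1 is set iff m-i is in U(u)
    v_top = [0]     # deepest prefix node of u (u included) that ends with the pattern

    def add_node(p, ch):
        node, ln = len(parent), length[p] + 1
        mh = ((m_hat[p] << 1) | 1) & mask.get(ch, 0)
        hit = bool(mh & accept)              # m is in M^(ua)
        parent.append(p)
        first.append(first[p] if p else ch)
        length.append(ln)
        m_hat.append(mh)
        # |ua| < m: ua is a pattern suffix, so |ua| joins U; otherwise it joins V.
        u_rev.append(u_rev[p] | (1 << (m - ln - 1)) if hit and ln < m else u_rev[p])
        v_top.append(node if hit and ln >= m else v_top[p])

    for ch in alphabet:
        add_node(0, ch)

    S, offset, prev, ends = 0, 0, 0, []
    for code in codes:
        if prev:
            # Dictionary update, done before the code is used: a code equal to
            # the next free index denotes prev followed by prev's first character.
            add_node(prev, first[prev] if code == len(parent) else first[code])
        hits = S & u_rev[code]               # (m - S) intersected with U(u)
        while hits:
            low = hits & -hits               # lowest set bit, position i
            ends.append(offset + m - low.bit_length())
            hits ^= low
        node = v_top[code]                   # enumerate V(u)
        while node:
            ends.append(offset + length[node])
            node = v_top[parent[node]]
        k = length[code]                     # Jump: S := f^(S, u)
        S = (((S << k) | ((1 << k) - 1)) if k < m else full) & m_hat[code]
        offset += k
        prev = code
    return sorted(ends)


print(search_lzw([1, 1, 2, 4, 3, 4, 6, 8, 2], "aabac", "abc"))       # prints [11]
print(search_lzw([1, 2, 4, 4, 5, 2, 3, 6, 9, 11], "abab", "abc"))    # prints [4, 6, 14, 19]
```

The first call uses the LZW codes of the text aabaacaabacab and reports the occurrence ending at position 11, which lies inside the string of the eighth code. The final sort is only for readable output and is not part of the O(n+r) bound.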
![Page 27: Shift-And Approach to Pattern Matching in LZW Compressed Text](https://reader035.vdocument.in/reader035/viewer/2022062520/56815e34550346895dcc943c/html5/thumbnails/27.jpg)
-- Is this really practical? --
But... Is it really fast?
Uhmm....
![Page 28: Shift-And Approach to Pattern Matching in LZW Compressed Text](https://reader035.vdocument.in/reader035/viewer/2022062520/56815e34550346895dcc943c/html5/thumbnails/28.jpg)
<28/32>
Experimental Comparisons
◆ Method 1: Compressed text → Decompress! → Shift-And
◆ Method 2: Compressed text → Our previous algorithm (DCC’98)
◆ Method 3: Compressed text → Our new algorithm
![Page 29: Shift-And Approach to Pattern Matching in LZW Compressed Text](https://reader035.vdocument.in/reader035/viewer/2022062520/56815e34550346895dcc943c/html5/thumbnails/29.jpg)
<29/32>
Experimental Comparisons
Original text: "The Brown corpus", 6.8 Mbytes
Compressed text: 3.4 Mbytes, compressed with compress (UNIX command)
Language: C (with gcc compiler)
Machine: Sun SPARCstation 20 with remote disk storage
File transfer rate: 0.96 Mbyte/sec
![Page 30: Shift-And Approach to Pattern Matching in LZW Compressed Text](https://reader035.vdocument.in/reader035/viewer/2022062520/56815e34550346895dcc943c/html5/thumbnails/30.jpg)
<30/32>
Experimental results
| Method | Elapsed time (s) | CPU time (s) |
|---|---|---|
| Shift-And with decompression | 8.16 | 7.52 |
| Our previous algorithm (DCC’98) | 7.31 | 6.57 |
| New algorithm | 6.05 | 5.15 |
Elapsed time = CPU time + File I/O time.
The new algorithm is 1.3 times faster than our previous algorithm and 1.5 times faster than Shift-And with decompression.
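The two speedup figures can be re-derived from the CPU times on this slide. The pairing of numbers with methods is an assumption based on the order in which they appear on the slide (7.52 s for Shift-And with decompression, 6.57 s for the previous algorithm, 5.15 s for the new algorithm); the variable names are made up for this sketch.

```python
# CPU times in seconds, paired with methods in slide order (an assumption).
cpu_time = {
    "Shift-And with decompression": 7.52,
    "Previous algorithm (DCC'98)": 6.57,
    "New algorithm": 5.15,
}

new = cpu_time["New algorithm"]
speedup = {name: round(t / new, 2) for name, t in cpu_time.items()}
for name, ratio in speedup.items():
    print(f"{name}: {ratio}x the CPU time of the new algorithm")
```

The ratios come out at about 1.28 and 1.46, consistent with the rounded "1.3 times" and "1.5 times" claims.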
![Page 31: Shift-And Approach to Pattern Matching in LZW Compressed Text](https://reader035.vdocument.in/reader035/viewer/2022062520/56815e34550346895dcc943c/html5/thumbnails/31.jpg)
<31/32>
Experimental results
| Method | Elapsed time (s) | CPU time (s) |
|---|---|---|
| Shift-And in original text | 9.36 | 3.09 |
| Shift-And with decompression | 8.16 | 7.52 |
| Our previous algorithm (DCC’98) | 7.31 | 6.57 |
| New algorithm | 6.05 | 5.15 |
![Page 32: Shift-And Approach to Pattern Matching in LZW Compressed Text](https://reader035.vdocument.in/reader035/viewer/2022062520/56815e34550346895dcc943c/html5/thumbnails/32.jpg)
<32/32>
Conclusion
The proposed algorithm scans an LZW compressed text in O(n+r) time using O(|D|) space, and reports all occurrences of the pattern after an O(m+|Σ|) time and O(|Σ|) space preprocessing.
We implemented the algorithm, and showed that it is approximately 1.3 times faster than our previous algorithm.
Our new algorithm has several extensions:
generalized pattern matching
pattern matching with k-mismatches
pattern matching for multiple patterns