random access to grammar compressed strings

22
Random Access to Grammar Compressed Strings Philip Bille, Gad M. Landau, Rajeev Raman, Kunihiko Sadakane, Srinivasa Rao Satti, and Oren Weimann

Upload: vutu

Post on 02-Jan-2017

238 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Random Access to Grammar Compressed Strings

Random Access to Grammar Compressed Strings

Philip Bille, Gad M. Landau, Rajeev Raman, Kunihiko Sadakane, Srinivasa Rao Satti, and Oren Weimann

Page 2: Random Access to Grammar Compressed Strings

Random Access to Compressed Strings

textDNAXML

Page 3: Random Access to Grammar Compressed Strings

Random Access to Compressed Strings

textDNAXML

• What is the ith character?

• What is the substring at [i,j]?

• Does pattern P appear in text? (perhaps with k errors?)

Page 4: Random Access to Grammar Compressed Strings

Random Access to Grammar Compressed Strings

• Grammar based compression captures many popular compression schemes with no or little blowup in space [Charikar et al. 2002, Rytter 2003].

• Lempel-Ziv family, Sequitur, Run-Length Encoding, Re-Pair, ...

X3

X2X1

X5

X4 X3

X2X1

X5

X4

X1 X2

X3X6

X7

A G T A G T A G1 2 3 4 5 6 7 8

N

≤n

AGTAGTAG N = 8

X7 → X6X3X6 → X5X5X5 → X3X4X4 → TX3 → X1X2X2 → GX1 → A

n = 7

Page 5: Random Access to Grammar Compressed Strings

Tradeoffs and Results

• What is the ith character?

X3

X2X1

X5

X4 X3

X2X1

X5

X4

X1 X2

X3X6

X7

A G T A G T A G1 2 3 4 5 6 7 8

N

n1 1

1

1 1

26

3 3

2 1

1 1

2

O(N) spaceO(1) query

O(n) spaceO(n) query

O(n) spaceO(log N) query

• What is the substring at [i,j]?O(n) spaceO(log N + j - i) query

Page 6: Random Access to Grammar Compressed Strings

Application: Black-Box Compressed String Matching

• Does “AGGA” appear in the text (perhaps with k errors)?

X3

X2X1

X5

X4 X3

X2X1

X5

X4

X1 X2

X3X6

X7

A G T A G T A G1 2 3 4 5 6 7 8

1 1

1

1 1

26

3 3

2 1

1 1

2

Page 7: Random Access to Grammar Compressed Strings

Application: Black-Box Compressed String Matching

• Total time O(n (log N + m + Blackbox(m))).

X3X6

X7

A G T A G T A G1 2 3 4 5 6 7 8

26

✓✓m+k m+k

• Does “AGGA” appear in the text (perhaps with k errors)?

Page 8: Random Access to Grammar Compressed Strings

Extension: Compressed Trees

• Linear space in compressed tree.

• Fast navigation operations (select, access, parent, depth, height, subtree_size, first_child, next_sibling, level_ancestor, nca).

b c

db

ac a

d

b

c d

b

b

c d

c d b

db

ac a

c d

Page 9: Random Access to Grammar Compressed Strings

Heavy Path Decomposition

X3

X2X1

X5

X4 X3

X2X1

X5

X4

X1 X2

X3X6

X7

A G T A G T A G1 2 3 4 5 6 7 8

N

n1 1

1

1 1

26

3 3

2 1

1 1

2

Page 10: Random Access to Grammar Compressed Strings

Heavy Path Decomposition

X3

X2X1

X5

X4 X3

X2X1

X5

X4

X1 X2

X3X6

X7

A G T A G T A G1 2 3 4 5 6 7 8

N

n1 1

1

1 1

26

3 3

2 1

1 1

2

Page 11: Random Access to Grammar Compressed Strings

Heavy Path Decomposition

X3

X2X1

X5

X4 X3

X2X1

X5

X4

X1 X2

X3X6

X7

A G T A G T A G1 2 3 4 5 6 7 8

N

n1 1

1

1 1

26

3 3

2 1

1 1

2

Page 12: Random Access to Grammar Compressed Strings

Random Access Query

• The path from root to i goes through O(log N) heavy paths

• Query: Binary search all heavy paths on the way

X3

X2X1

X5

X4 X3

X2X1

X5

X4

X1 X2

X3X6

X7

A G T A G T A G1 2 i 4 5 6 7 8

N

n1 1

1

1 1

26

3 3

2 1

1 1

2

O(log n) ⋅ O(log N)

Page 13: Random Access to Grammar Compressed Strings

Random Access Query

• The path from root to i goes through O(log N) heavy paths

• Query: Binary search all heavy paths on the way

X3

X2X1

X5

X4 X3

X2X1

X5

X4

X1 X2

X3X6

X7

A G T A G T A G1 2 i 4 5 6 7 8

N

n1 1

1

1 1

26

3 3

2 1

1 1

2

O(log n) ⋅ O(log N)

Page 14: Random Access to Grammar Compressed Strings

Interval Biased Search Tree

• The path from root to i goes through O(log N) heavy paths

• Query: Binary search all heavy paths on the way

X3

X2X1

X5

X4 X3

X2X1

X5

X4

X1 X2

X3X6

X7

A G T A G T A G1 2 i 4 5 6 7 8

N

n1 1

1

1 1

26

3 3

2 1

1 1

2

O(log n) ⋅ O(log N)O(log N/x)

x =

0 N

?

x

Telescopes to O(log N)

• Space: Each IBSTs uses linear space => total O(n2) space for all heavy paths.

Page 15: Random Access to Grammar Compressed Strings

O(n) Representation of Heavy Paths

• Search for i on heavy path = lowest ancestor of distance i.

X3

X2X1

X5

X4 X3

X2X1

X5

X4

X1 X2

X3X6

X7

A G T A G T A G1 2 3 4 5 6 7 8

N

n

X7

X6

X5

X3

X2 X1

Xi Xj

Xk Xl Xp

X4

G A T

XtXm

1 1

1

1 1

26

3 3

2 1

1 1

2

Page 16: Random Access to Grammar Compressed Strings

O(n) Representation of Heavy Paths

• Search for i on heavy path = lowest ancestor of distance i.

• A heavy path decomposition of heavy path representation.

• In-path: O(log N/x) time, total O(n) space.

X3

X2X1

X5

X4 X3

X2X1

X5

X4

X1 X2

X3X6

X7

A G T A G T A G1 2 3 4 5 6 7 8

N

n

X7

X6

X5

X3

X2 X1

Xi Xj

Xk Xl Xp

X4

G A T

XtXm

1 1

1

1 1

26

3 3

2 1

1 1

2

Page 17: Random Access to Grammar Compressed Strings

O(n) Representation of Heavy Paths

• Search for i on heavy path = lowest ancestor of distance i.

• A heavy path decomposition of heavy path representation.

• In-path: O(log N/x) time, total O(n) space.

• Between-paths: O(log N/x) time, total O(n log n) space.

X3

X2X1

X5

X4 X3

X2X1

X5

X4

X1 X2

X3X6

X7

A G T A G T A G1 2 3 4 5 6 7 8

N

n

X7

X6

X5

X3

X2 X1

Xi Xj

Xk Xl Xp

X4

G A T

XtXm

1 1

1

1 1

26

3 3

2 1

1 1

2 log n

Page 18: Random Access to Grammar Compressed Strings

O(n) Representation of Heavy Paths

• Search for i on heavy path = lowest ancestor of distance i.

• A heavy path decomposition of heavy path decomposition.

• In-path: O(log N/x) time, total O(n) space.

• Between-paths: O(log N/x) time, total O(n log n) space.

X3

X2X1

X5

X4 X3

X2X1

X5

X4

X1 X2

X3X6

X7

A G T A G T A G1 2 3 4 5 6 7 8

N

n

X7

X6

X5

X3

X2 X1

Xi Xj

Xk Xl Xp

X4

G A T

XtXm

1 1

1

1 1

26

3 3

2 1

1 1

2

2

log n n/logn leaves

logn leaves

logn leaves

logn leaves

Page 19: Random Access to Grammar Compressed Strings

O(n) Representation of Heavy Paths

• Search for i on heavy path = lowest ancestor of distance i.

• A heavy path decomposition of heavy path decomposition.

• In-path: O(log N/x) time, total O(n) space.

• Between-paths: O(log N/x) time, total O(n log n) space.

X3

X2X1

X5

X4 X3

X2X1

X5

X4

X1 X2

X3X6

X7

A G T A G T A G1 2 3 4 5 6 7 8

N

n

X7

X6

X5

X3

X2 X1

Xi Xj

Xk Xl Xp

X4

G A T

XtXm

1 1

1

1 1

26

3 3

2 1

1 1

2

2

log n n/logn leaves

logn nodes

logn nodes

logn nodes

Page 20: Random Access to Grammar Compressed Strings

O(n) Representation of Heavy Paths

• Search for i on heavy path = lowest ancestor of distance i.

• A heavy path decomposition of heavy path decomposition.

• In-path: O(log N/x) time, total O(n) space.

• Between-paths: O(log N/x) time, total O(n log n) space.

X3

X2X1

X5

X4 X3

X2X1

X5

X4

X1 X2

X3X6

X7

A G T A G T A G1 2 3 4 5 6 7 8

N

n

X7

X6

X5

X3

X2 X1

Xi Xj

Xk Xl Xp

X4

G A T

XtXm

1 1

1

1 1

26

3 3

2 1

1 1

2

2

log n n/logn leaves

logn nodes

logn nodes

logn nodes

O(nα(n))

Page 21: Random Access to Grammar Compressed Strings

O(n) Representation of Heavy Paths

• Search for i on heavy path = lowest ancestor of distance i.

• A heavy path decomposition of heavy path decomposition.

• In-path: O(log N/x) time, total O(n) space.

• Between-paths: O(log N/x) time, total O(n log n) space.

X3

X2X1

X5

X4 X3

X2X1

X5

X4

X1 X2

X3X6

X7

A G T A G T A G1 2 3 4 5 6 7 8

N

n

X7

X6

X5

X3

X2 X1

Xi Xj

Xk Xl Xp

X4

G A T

XtXm

1 1

1

1 1

26

3 3

2 1

1 1

2

2

log n n/logn leaves

logn nodes

logn nodes

logn nodes

O(nα(n))O(n) with bittricks

Page 22: Random Access to Grammar Compressed Strings

Summary

• Random access and substring decompression.

• O(n) space and O(log N + length of substring) time.

• Black compressed (approximate) string matching.

• Random access in compressed trees.