TRANSCRIPT
On the Complexity of Optimal Grammar-Based Compression
By Jan Arpe and Rüdiger Reischuk
Schriftenreihe der Institute für Informatik/Mathematik
Presented by Jason Lustig
Grammar-Based Compression: What??
Encoding: Build a context-free grammar that generates exactly the input string of characters or bits
Encode the grammar and transmit it
Decoding: Parse the grammar and expand it back into the original string
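As a minimal sketch of the decoding side (not the paper's algorithm): a grammar of this kind is a straight-line grammar, so decompression is just a recursive expansion of the start symbol. The grammar `g` below is a hypothetical toy example.

```python
# Minimal sketch: a straight-line grammar maps each nonterminal to a
# sequence of symbols; anything not in the dictionary is a terminal.
# Decoding expands the start symbol recursively.

def expand(grammar, symbol):
    """Recursively expand a symbol of a straight-line grammar."""
    if symbol not in grammar:  # terminal: emit as-is
        return symbol
    return "".join(expand(grammar, s) for s in grammar[symbol])

# Hypothetical toy grammar for the string "abab"
g = {"S": ["A", "A"], "A": ["a", "b"]}
print(expand(g, "S"))  # -> abab
```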
A short example
X = “DOG EAT DOG”
T1 → D   T2 → O   T3 → G   T4 → [SPACE]   T5 → E   T6 → A   T7 → T
V1 → T1T2T3   V2 → T5T6T7   S → V1T4V2T4V1
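The slide's grammar can be checked mechanically; this sketch expands the rules (named as on the slide) and recovers X:

```python
# Sketch: expanding the slide's grammar to verify it reproduces
# X = "DOG EAT DOG". Rule names follow the slide (T1..T7, V1, V2, S).

rules = {
    "T1": ["D"], "T2": ["O"], "T3": ["G"], "T4": [" "],
    "T5": ["E"], "T6": ["A"], "T7": ["T"],
    "V1": ["T1", "T2", "T3"],        # derives "DOG"
    "V2": ["T5", "T6", "T7"],        # derives "EAT"
    "S":  ["V1", "T4", "V2", "T4", "V1"],
}

def derive(sym):
    if sym not in rules:             # terminal character
        return sym
    return "".join(derive(s) for s in rules[sym])

print(derive("S"))  # -> DOG EAT DOG
```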
Relationship to sliding window and other methods
Sliding window methods can be seen as a special case (and a particular coding) of grammar-based compression
You are doing text replacement... isn't it the same thing?
Sort of like arithmetic coding of grammars: you can start decompressing before the whole input shows up
Goals of grammar-based compression
It must be deterministic -- i.e. you can only get one expansion from a grammar
It should be as small as possible
This is the hard part
It seems very likely that finding the Minimum Grammar Compression (MGC) is hard
In fact, it is NP-complete when restricted to alphabets of size ≥ 3
However, it is open whether it remains NP-hard for binary alphabets (call this restriction 2MGC)
So what do we do?
Try to approximate the minimum grammar
The best known algorithms approximate the minimum grammar only to within a factor of O(log n), and it is not known whether this is the best possible
What is new here?
The authors relate grammars for strings over an alphabet of arbitrary size to grammars over a finite alphabet
By block-coding a string over an arbitrary alphabet into a finite one, they show how the size of a grammar for the coded string relates to that of a grammar for the original string
This reduces the case of arbitrary alphabets to that of finite ones
It’s all Greek to me
τ : a finite or infinite alphabet
Σ : a finite alphabet
φ : a block coding, φ : τ* → Σ*
Gx : a grammar for x, given by {Σ, V = nonterminals, P = productions, S = start symbol}
m(Gx) : the size of the grammar
m*(x) : the size of the smallest grammar for x
Coding grammar and string grammar
Let x ∈ τ* and let φ be an l-block code from τ* to Σ*
A grammar for x has size m(x); a grammar realizing the coding φ has size m(φ)
Then there is a grammar for φ(x) of size m(φ(x)) ≤ m(x) + m(φ)
An example
τ = {0, 1, 2, 3, 4, 5, 6, 7}; |τ| = 8
Σ = {0, 1}; |Σ| = 2
φ : τ* ➝ Σ* :
0 ➝ 000   4 ➝ 100
1 ➝ 001   5 ➝ 101
2 ➝ 010   6 ➝ 110
3 ➝ 011   7 ➝ 111
φ is an l-block coding with l = 3
x ∈ τ*, φ(x) ∈ Σ*
x = “2 5 4 7 6 1 2 2 2 5”
φ(x) = “010 101 100 111 110 001 010 010 010 101”
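The 3-block code above is just each symbol's 3-bit binary representation, so it is easy to compute; a minimal sketch:

```python
# Sketch of the 3-block code phi from the slide: each symbol of
# tau = {0,...,7} maps to its 3-bit binary representation.

L = 3  # block length l

def phi(x):
    """Encode a sequence of tau-symbols as a bit string of 3-bit blocks."""
    return "".join(format(sym, "03b") for sym in x)

x = [2, 5, 4, 7, 6, 1, 2, 2, 2, 5]
print(phi(x))  # -> 010101100111110001010010010101
```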
Gx
Terminals:
T2 ➝ 2   T5 ➝ 5   T4 ➝ 4   T7 ➝ 7   T6 ➝ 6   T1 ➝ 1
Nonterminals:
NT0 ➝ T2T5
NT1 ➝ T4T7
NT2 ➝ T6T1
NT3 ➝ T2T2
NT4 ➝ NT0NT1
NT5 ➝ NT2NT3
NT6 ➝ NT4NT5
Start:
Sx ➝ NT6NT0
m(x) = |Terminals| + |Nonterminals| + |Start| = 6 + 7 + 1 = 14
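A quick sketch expanding Gx; note that the start rule must be Sx ➝ NT6NT0 for the derivation to equal x = "2547612225" (NT6 derives "25476122" and NT0 derives "25"):

```python
# Sketch: expanding Gx from the slide. With start rule Sx -> NT6 NT0,
# the derivation is exactly x = "2547612225".

gx = {
    "T2": ["2"], "T5": ["5"], "T4": ["4"],
    "T7": ["7"], "T6": ["6"], "T1": ["1"],
    "NT0": ["T2", "T5"], "NT1": ["T4", "T7"], "NT2": ["T6", "T1"],
    "NT3": ["T2", "T2"], "NT4": ["NT0", "NT1"], "NT5": ["NT2", "NT3"],
    "NT6": ["NT4", "NT5"],
    "Sx":  ["NT6", "NT0"],
}

def expand(sym):
    if sym not in gx:                # terminal symbol of tau
        return sym
    return "".join(expand(s) for s in gx[sym])

print(expand("Sx"))  # -> 2547612225
```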
Gφ
Terminals:
T2 ➝ 010   T5 ➝ 101   T4 ➝ 100   T7 ➝ 111   T6 ➝ 110   T1 ➝ 001
Start rules (one per source symbol):
S2 ➝ T2
S5 ➝ T5
S4 ➝ T4
S7 ➝ T7
S6 ➝ T6
S1 ➝ T1
m(φ) = |Terminals| + |Nonterminals| + |Start| = 12
What we can do
Make a grammar for φ(x), of size m(φ(x))
m(φ(x)) ≤ m(x) + m(φ)
From Gφ(x), the authors construct another grammar for x with m(x) ≤ 2·l·m(φ(x))
If φ is overlap-free, this improves to m(x) ≤ 2·m(φ(x))
They also show that this holds for m*(x)
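The first direction, m(φ(x)) ≤ m(x) + m(φ), has a simple intuition: keep the structure of Gx and swap each terminal rule for the code word of its symbol. The sketch below illustrates that idea on the running example (it is an illustration of the size bound, not the paper's exact construction):

```python
# Sketch of the idea behind m(phi(x)) <= m(x) + m(phi): reuse the
# nonterminal structure of Gx, but let each terminal rule produce the
# 3-bit code word of its symbol instead of the symbol itself.

code = {s: format(s, "03b") for s in range(8)}  # the block code phi

gx_structure = {
    "NT0": ["T2", "T5"], "NT1": ["T4", "T7"], "NT2": ["T6", "T1"],
    "NT3": ["T2", "T2"], "NT4": ["NT0", "NT1"], "NT5": ["NT2", "NT3"],
    "NT6": ["NT4", "NT5"], "Sx": ["NT6", "NT0"],
}
# Terminal rules now emit code words: T2 -> 010, T5 -> 101, ...
terminals = {"T" + str(s): [code[s]] for s in [2, 5, 4, 7, 6, 1]}

g_phi_x = {**gx_structure, **terminals}

def expand(sym):
    if sym not in g_phi_x:           # a code-word string: emit as-is
        return sym
    return "".join(expand(s) for s in g_phi_x[sym])

print(expand("Sx"))  # -> the bit string phi(x)
```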
More on binary alphabets
The authors focus on binary alphabets because they are the most practical case
Take an l-block code φ : τ* ➝ {0, 1}*
m*(x) ≥ (1/24) · l · m*(φ(x))
In other words:
m*(φ(x)) ≤ (24/l) · m*(x)
Bounded v. Unbounded
For every ε > 0 there is a natural number n such that if |x| ≥ n, then from any Gx of size m(x) we can build a grammar for φ(x) of size m(φ(x)) ≤ (12 + ε)·m(x)
From this, the authors can construct another Gx of size m(x) ≤ 2·m(φ(x))
This shows that the sizes of minimal grammars for bounded vs. unbounded alphabets differ only by constant factors
So what?
This implies that if MGC cannot be approximated within constant factors for strings over arbitrary alphabets, then it cannot be for binary strings either
They also showed that minimal grammar sizes over unbounded and finite alphabets are related by constant factors
Open for further research: finding optimal grammar-based compression for a set of strings
?