Lecture 7: Sequitur
January 5, 2016 1 [email protected]
Contents
• Introduction
• Context Free Grammar
• Sequitur Principles
• Context-Free Grammar Example
Introduction
Sequitur (or the Nevill-Manning algorithm) is a recursive algorithm developed
by Craig Nevill-Manning and Ian H. Witten in 1997 that infers a hierarchical
structure (a context-free grammar) from a sequence of discrete symbols. The
algorithm operates in linear space and time. It can be used in data
compression applications.
Sequitur is based on the concept of context-free grammars, so we start
with a short review of this field.
Sequitur (from the Latin for “it follows”) is based on the concept of
context-free grammars.
It considers the input stream a valid sequence in some formal language.
It reads the input symbol by symbol and uses repeated phrases in the
input data to build a set of context-free production rules.
A (natural) language starts with a small number of building blocks (letters
and punctuation marks) and uses them to construct words and sentences.
A sentence is a finite sequence (a string) of symbols that obeys certain
grammar rules.
Similarly, a formal language uses a small number of symbols (called
terminal symbols) from which valid sequences can be constructed.
The rules can be used to construct valid sequences and also to
determine whether a given sequence is valid.
A production rule consists of a nonterminal symbol on the left and a
string of terminal and nonterminal symbols on the right.
– terminals: b, e
– non-terminals: S, A
– Production Rules:
– S is the start symbol
The nonterminal symbol on the left becomes the name of the string on
the right.
In general, the right-hand side may contain several alternative strings,
but the rules generated by sequitur have just a single string.
The BNF notation, used to describe the syntax of programming
languages, is based on the concept of production rules.
We use lowercase letters to denote terminal symbols and uppercase
letters for the non-terminals.
BNF is an acronym for “Backus-Naur Form”.
Context Free Grammar
Suppose that the following production rules are given:
A → ab, B → Ac, C → BdA.
Using these rules we can generate the valid strings ab (an application of
the nonterminal A), abc (an application of B), abcdab (an application of C),
as well as many others.
Now verify that the string abcdab is valid: C → BdA → AcdA → abcdab.
It is clear that the production rules reduce the redundancy of the original
sequence, so they can serve as the basis of a compression method.
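The derivation of abcdab from the rules A → ab, B → Ac, C → BdA can be checked mechanically. The following is a minimal Python sketch; the names RULES and expand are illustrative and not part of the lecture. It assumes every rule has a single right-hand side, as in the grammars sequitur produces.

```python
# Production rules from the slide: A -> ab, B -> Ac, C -> BdA.
# Uppercase letters are nonterminals, lowercase letters are terminals.
RULES = {"A": "ab", "B": "Ac", "C": "BdA"}


def expand(symbol, rules=RULES):
    """Recursively replace nonterminals by their right-hand sides."""
    if symbol not in rules:          # a terminal expands to itself
        return symbol
    return "".join(expand(s, rules) for s in rules[symbol])


print(expand("A"))  # ab
print(expand("B"))  # abc
print(expand("C"))  # abcdab
```

Because each nonterminal stands for exactly one string, expanding a rule is the same operation a decoder performs when it reconstructs the original data.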
Each repetition results in a rule and is replaced by the name of the rule (a
nonterminal symbol), thereby resulting in a shorter representation.
Generally, a set of production rules can be used to generate many valid
sequences, but the production rules produced by sequitur are not general.
They can be used only to reconstruct the original data.
The production rules themselves are not much smaller than the original
data, so sequitur has to go through one more step, where it compresses
the production rules.
The compressed production rules become the compressed stream, and
the sequitur decoder uses the rules (after decompressing them) to
reconstruct the original data.
If the input is a typical text in a natural language, the top-level rule
becomes very long, typically 10–20% of the size of the input, and the
other rules are short, with typically 2–3 symbols each.
Sequitur Principles
• Digram Uniqueness:
– no pair of adjacent symbols (digram) appears more than once in the
grammar.
• Rule Utility:
– Every production rule is used more than once.
• These two principles are maintained as an invariant while inferring a
grammar for the input string.
Sequitur constructs its grammars by observing two principles (or enforcing
two constraints) that we denote by p1 and p2.
Constraint p1 says: No pair of adjacent symbols will appear more than once in
the grammar (this can be rephrased as: Every digram in the grammar is
unique).
Constraint p2 says: Every rule should be used more than once.
This ensures that rules are useful. A rule that occurs just once is useless
and should be deleted.
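The two constraints can be tested on a finished grammar. The following is a minimal Python sketch under stated assumptions: a grammar is a dict mapping single-letter rule names to strings, S is the start symbol, and the helper names repeated_digrams and underused_rules are hypothetical. The sample grammars are illustrative. Overlapping occurrences of a digram (as in aaa) are not counted as a repetition.

```python
from collections import Counter


def repeated_digrams(grammar):
    """Return the digrams that violate p1 (appear more than once)."""
    counts = Counter()
    for body in grammar.values():
        last = {}                        # last counted position per digram
        for i in range(len(body) - 1):
            d = (body[i], body[i + 1])
            if last.get(d) == i - 1:     # overlaps the previous occurrence
                continue
            last[d] = i
            counts[d] += 1
    return {d for d, n in counts.items() if n > 1}


def underused_rules(grammar, start="S"):
    """Return the rules that violate p2 (used fewer than two times)."""
    uses = Counter(s for body in grammar.values() for s in body
                   if s in grammar)
    return {name for name in grammar if name != start and uses[name] < 2}


good = {"S": "AA", "A": "aBdB", "B": "bc"}       # satisfies p1 and p2
bad_p1 = {"S": "abcdbc"}                          # bc appears twice
bad_p2 = {"S": "AA", "A": "aBd", "B": "bc"}       # B is used only once

print(repeated_digrams(good), underused_rules(good))
print(repeated_digrams(bad_p1))
print(underused_rules(bad_p2))
```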
The result is a two-rule grammar, where the first rule is the input
sequence with its redundancy removed, and the second rule is short,
replacing the digram bc with the single nonterminal symbol A.
The input S is considered a one-rule grammar. It has redundancy, so each
occurrence of abcdbc is replaced with A. Rule A still has redundancy
because of a repetition of the phrase bc, which justifies the introduction of
a second rule B.
The figure above shows how the two constraints can be violated. The first
grammar in the figure contains two occurrences of bc, thereby violating p1.
The second grammar contains rule B, which is used just once. It is easy to
see how removing B reduces the size of the grammar. The resulting,
shorter grammar is shown in the following figure. It is one rule and one
symbol shorter.
The sequitur encoder constructs the grammar rules while enforcing the
two constraints at all times.
If constraint p1 is violated, the encoder generates a new production rule.
When p2 is violated, the useless rule is deleted.
The encoder starts by setting rule S to the first input symbol. It then goes
into a loop where new symbols are input and appended to S.
Each time a new symbol is appended to S, the symbol and its
predecessor become the current digram.
If the current digram already occurs in the grammar, then p1 has been
violated, and the encoder generates a new rule with the current digram
on the right-hand side and with a new nonterminal symbol on the left.
The two occurrences of the digram are replaced by this nonterminal.
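The encoder loop described above can be sketched in Python. This is a simplified illustration, not the linear-time algorithm: it rescans the whole grammar after every change instead of maintaining a digram index, so it runs in quadratic time or worse. All function names are hypothetical. Terminals are one-character strings, nonterminals are integers, and rule 0 plays the role of S. When the repeated digram is already the complete right-hand side of an existing rule, the sketch reuses that rule instead of creating a new one.

```python
from collections import Counter


def find_repeat(rules):
    """Return (digram, first_location, second_location) for a p1 violation."""
    seen = {}
    for r, body in rules.items():
        for i in range(len(body) - 1):
            d = (body[i], body[i + 1])
            first = seen.get(d)
            if first is None:
                seen[d] = (r, i)
            elif first != (r, i - 1):        # ignore overlaps such as aaa
                return d, first, (r, i)
    return None


def find_underused(rules):
    """Return the name of a rule used fewer than two times (p2), or None."""
    uses = Counter(s for body in rules.values() for s in body
                   if isinstance(s, int))
    for name in rules:
        if name != 0 and uses[name] < 2:
            return name
    return None


def inline(rules, name):
    """Delete rule `name`, substituting its body at its single use."""
    body = rules.pop(name)
    for other in rules.values():
        if name in other:
            i = other.index(name)
            other[i:i + 1] = body
            return


def sequitur_sketch(text):
    rules = {0: []}
    next_id = 1
    for ch in text:
        rules[0].append(ch)                  # append the new symbol to S
        while True:                          # restore p1 and p2
            hit = find_repeat(rules)
            if hit:
                d, (r1, i1), (r2, i2) = hit
                if r1 != 0 and len(rules[r1]) == 2:
                    rules[r2][i2:i2 + 2] = [r1]      # reuse existing rule
                elif r2 != 0 and len(rules[r2]) == 2:
                    rules[r1][i1:i1 + 2] = [r2]
                else:
                    rules[next_id] = list(d)         # new rule for the digram
                    rules[r2][i2:i2 + 2] = [next_id] # later occurrence first
                    rules[r1][i1:i1 + 2] = [next_id]
                    next_id += 1
                continue
            lone = find_underused(rules)
            if lone is None:
                break
            inline(rules, lone)
    return rules


def expand(rules, sym=0):
    """Reconstruct the text generated by a symbol (rule 0 by default)."""
    if isinstance(sym, str):
        return sym
    return "".join(expand(rules, s) for s in rules[sym])


grammar = sequitur_sketch("abcdbcabcdbc")
print(grammar)
print(expand(grammar))
```

For the input abcdbcabcdbc the sketch should end with three rules of the same shape as S → AA, A → aBdB, B → bc; only the rule names differ.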
Notice that generating rule C has made rule B underused (i.e., used just
once), which is why it was removed in the previous Figure.
One more detail, namely rule utilization, still needs to be discussed.
When a new rule X is generated, the encoder also generates a counter
associated with X, and initializes the counter to the number of times X is
used (a new rule is normally used twice when it is first generated). Each
time X is used in another rule Y, the encoder increments X’s counter by
1. When Y is deleted, the counter for X is decremented by 1. If X’s
counter reaches 1, rule X is deleted.
As an example, we show the information sent to the decoder for the input
string abcdbcabcdbc (see the figure above). Rule S consists of two copies of rule
A. The first time rule A is encountered, its contents aBdB are sent. This
involves sending rule B twice. The first time rule B is sent, its contents bc
are sent (and the decoder does not know that the string bc it is receiving is
the contents of a rule). The second time rule B is sent, the pair (1, 2) is sent
(offset 1, count 2).
The decoder identifies the pair and uses it to set up the rule 1 → bc.
Sending the first copy of rule A therefore amounts to sending abcd(1, 2).
The second copy of rule A is sent as the pair (0, 4) since A starts at offset
0 in S and its length is 4. The decoder identifies this pair and uses it to set
up the rule 2 → a 1 d 1. The final result is therefore abcd(1, 2)(0, 4).
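The decoding of abcd(1, 2)(0, 4) can be traced with a short Python sketch. It covers only the two kinds of items that appear in this example (a terminal symbol, or an (offset, count) pair that marks the second occurrence of a rule). The function name decode and the token representation are assumptions made for illustration. Rules are numbered 1, 2, ... in the order the decoder sets them up.

```python
def decode(tokens):
    """Rebuild the grammar and the text from terminals and (offset, count) pairs."""
    rules = {}          # rule number -> list of symbols
    s = []              # the top-level rule S as received so far

    for tok in tokens:
        if isinstance(tok, tuple):
            offset, count = tok
            name = len(rules) + 1
            # The symbols at S[offset : offset + count] become a new rule ...
            rules[name] = s[offset:offset + count]
            # ... its first occurrence is replaced by the rule name ...
            s[offset:offset + count] = [name]
            # ... and the pair itself stands for the second occurrence.
            s.append(name)
        else:
            s.append(tok)

    def expand(sym):
        if isinstance(sym, int):
            return "".join(expand(x) for x in rules[sym])
        return sym

    return "".join(expand(x) for x in s), rules, s


text, rules, top = decode(["a", "b", "c", "d", (1, 2), (0, 4)])
print(text)    # abcdbcabcdbc
print(rules)   # {1: ['b', 'c'], 2: ['a', 1, 'd', 1]}
print(top)     # [2, 2]
```

After the pair (1, 2) the decoder holds S = a 1 d 1 with rule 1 → bc. After the pair (0, 4) it holds S = 2 2 with rule 2 → a 1 d 1, which expands to the original input.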
Context-Free Grammar Example
Arithmetic Expressions
Sequitur Example
The Hierarchy