Parsing Techniques: A Practical Guide, by Dick Grune and Ceriel J. H. Jacobs

Page 1: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

1

Parsing Techniques: A Practical Guide

by Dick Grune and Ceriel J. H. Jacobs

Page 2: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

2

Book in slide format

• Fantastic book: Parsing Techniques
• I went through chapters 1 and 2 of the book and created slides of them.
• That is, the following slides are chapters 1 and 2, in slide form.
• Additionally, there are several slides from:
– Personal correspondence with one of the authors, Dick Grune.
– Material from other sources.
– Slides that I created, applying the concepts to XML.

Roger L. Costello
June 1, 2014

Page 3: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

3

Why grammars, not automata?

• There is a close relationship between formal grammars and other abstract notions used in computer science, such as automata and algorithms.

• Indeed, since the results in one theory can often be translated into another, it seems to be an arbitrary decision as to which interpretation is primary.

• In these slides formal grammars are given preferential treatment because they are probably the most commonly known of the various theories among computer scientists. This is due to the success of context-free grammars in describing the syntax of programming languages.

Page 4: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

4

Chapter 1
Defining Parsing and Grammars

Page 5: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

5

Parsing
• Parsing is the process of structuring a linear representation in accordance with a given grammar.
• This definition has been kept abstract on purpose to allow as wide an interpretation as possible.
• The “linear representation” may be:
– a sentence
– a computer program
– a knitting pattern
– a sequence of geological strata
– a piece of music
– actions of ritual behavior
In short, any linear sequence in which the preceding elements in some way restrict the next element.
• For some of the examples the grammar is well known, for some it is an object of research, and for some our notion of a grammar is only just beginning to take shape.

Page 6: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

6

Parsing

[Diagram: a grammar and a linear representation go into a parser, which produces a structure.]

Parsing is the process of structuring a linear representation in accordance with a given grammar. A “linear representation” is any linear sequence in which the preceding elements in some way restrict the next element.

Page 7: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

7

Grammar: a succinct summary

• For each grammar, there are generally an infinite number of linear representations (“sentences”) that can be structured with it.

• That is, a finite-sized grammar can supply structure to an infinite number of sentences.

• This is the main strength of the grammar paradigm and indeed the main source of the importance of grammars: they summarize succinctly the structure of an infinite number of objects of a certain class.

Page 8: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

8

Reasons for parsing

There are several reasons to perform this structuring process called parsing.

1. One reason derives from the fact that the obtained structure helps us to process the object further. When we know that a certain segment of a sentence is the subject, that information helps in understanding or translating the sentence. Once the structure of a document has been brought to the surface, it can be processed more easily.

2. A second reason is related to the fact that the grammar in a sense represents our understanding of the observed sentences: the better a grammar we can give for the movement of bees, the deeper our understanding of them.

3. A third lies in the completion of missing information that parsers, and especially error-repairing parsers, can provide. Given a reasonable grammar of the language, an error-repairing parser can suggest possible word classes for missing or unknown words on clay tablets.

Page 9: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

9

Grammatical inference

• Grammatical inference: Given a (large) set of sentences, find the/a grammar which produces them.

• Grammatical inference is also known as grammar induction or syntactic pattern recognition.

Page 10: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

10

XML Schema from an XML instance

The XML tool oXygen XML does grammatical inference when it creates an XML Schema from an XML instance document.

Page 11: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

11

The science of parsing

• Parsing is no longer an arcane art.
• In the 1970s Aho, Ullman, Knuth, and many others put parsing techniques solidly on their theoretical feet.

Page 12: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

12

Mathematician vs. Computer Scientist

• To a mathematician all structures are static. They have always existed and will always exist. The only time-dependence is that we have not discovered all the structures yet.
– Example: the Peano axioms create the integers without reference to time.
• The computer scientist is concerned with (and fascinated by) the continuous creation, combination, separation, and destruction of structures. Time is of the essence.
– Example: if the computer scientist uses the Peano axioms to implement integer addition, he finds they describe a very slow process, which is why he will look for a more efficient approach.

Page 13: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

13

Many uses for parsing

Parsing is for anyone who has parsing to do:
– The compiler writer
– The linguist
– The database interface writer
– The geologist who wants to test grammatical descriptions of a sequence of geological strata
– The musicologist who wants to test grammatical descriptions of a piece of music

Page 14: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

14

Requirements for a parser developer

It requires a good ability to visualize, some programming experience, and the willingness and patience to follow non-trivial examples.

Page 15: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

15

Chapter 2
Grammars as a Generating Device

Page 16: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

16

Need to define some terms

• In computer science as in everyday parlance, a grammar serves to describe a language.

• To establish our terminology and to demarcate the universe of discourse, we shall examine these terms:

– Language
– Grammar
– Language descriptions

Page 17: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

17

Language

We examine three views of the word “language”:

– How the larger part of mankind views language
– How the computer scientist views language
– How the formal-linguist views language

Page 18: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

18

Layman’s view of languages

• To the larger part of mankind, language is first and foremost a means of communication.

• Communication is brought about by sending messages, through air vibrations or through written symbols.

• Languages have three levels of composition:
– Messages fall apart into sentences,
– which are composed of words,
– which in turn consist of symbol sequences when written.

Page 19: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

19

Computer scientist view of languages

• A language has sentences, and these sentences possess structure.

• Information may possibly be derived from the sentence’s structure; that information is called the meaning of the sentence.

• Sentences consist of words called tokens, each possibly carrying a piece of information, which is its contribution to the meaning of the whole sentence.

Page 20: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

20

Computer scientist view of languages

• A language is a probably infinitely large set of sentences, each composed of tokens in such a way that it has structure.

• The tokens and structure cooperate to describe the semantics (meaning) of the sentence.

• To a computer scientist, 3+4×5 is a sentence in the language of “arithmetics on single digits”. Its structure can be shown by inserting parentheses: (3+(4×5)), and its semantics is 23.

Page 21: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

21

Formal-linguist view of languages

• A language is a “set” of sentences, and each sentence is a “sequence” of “symbols”.

• There is no meaning, no structure. Either a sentence belongs to the language or it does not.

• The only property of a symbol is that it has an identity.
• In any language there are a certain number of different symbols – the alphabet – and that number must be finite. Just for convenience we write these symbols as a, b, c, …, but ◊, ▪, ⱴ, … would do equally well, as long as there are enough symbols.

Page 22: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

22

Formal-linguist view of languages
• The word “sequence” means that the symbols in each sentence are in a fixed order and we should not shuffle them.

• The word “set” means an unordered collection with all the duplicates removed. A set can be written down by writing the objects in it, surrounded by curly braces.

• All this means is that to a formal-linguist the following is a language: {a, b, ab, ba}

• The formal-linguist also calls a sentence a “word” and he says that “the word ab is in the language {a, b, ab, ba}”

Page 23: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

23

Formal-linguist vs. computer scientist

• The formal-linguist holds his views of language because he wants to study the fundamental properties of languages in their naked beauty. It gives him a grip on a seemingly chaotic and perhaps infinitely complex object: natural language.

• The computer scientist holds his view of language because he wants a clear, well-understood, and unambiguous means of describing objects in the computer and of communication with the computer (a most exacting communication partner).

Page 24: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

24

Grammars

We examine three views of the word “grammar”:

– How the larger part of mankind views grammar
– How the formal-linguist views grammar
– How the computer scientist views grammar

Page 25: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

25

Layman’s view of grammars

A grammar is a book of rules and examples which describes and teaches the language.

Page 26: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

26

Formal-linguist’s view of grammars

• A generative grammar is an exact, finite-size, recipe for constructing the sentences in the language.

• This means that, following the recipe, it must be possible to construct each sentence of the language (in a finite number of actions) and no others.

• This does not mean that, given a sentence, the recipe tells us how to construct that particular sentence, only that it is possible to do so.

Page 27: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

27

Computer scientist’s view of grammars

The computer scientist has the same view as the formal-linguist, with the additional requirement that the recipe should imply how a sentence can be constructed.

Page 28: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

28

Infinite sets from finite descriptions

A language is a possibly infinite set of sequences of symbols and a grammar is a finite recipe to generate those sentences.

Page 29: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

29

Example of an infinite set from a finite description

The set of all positive integers is a very finite-size description of a definitely infinite-size set.

Page 30: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

30

Not all languages are describable

• Can all languages be described by finite descriptions?

• Answer: No.

Page 31: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

31

Outline of the proof

• The proof that not all languages can be described by finite descriptions is not trivial. But it is very interesting and famous. We will present an outline of it.

• The proof is based on two observations and a trick.

Page 32: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

32

Enumerate language descriptions

The language descriptions can be listed. This is done as follows:

1. Take all descriptions of size one, that is, those only one letter long, and sort them alphabetically.
• Depending on what, exactly, we accept as a description, there may be zero descriptions of size one, or 27 (all letters + space), or 95 (all printable ASCII characters), or something similar.
2. Take all descriptions of size two, sort them alphabetically. Do the same for sizes 3, 4, and further.

This is observation number one.

Page 33: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

33

Each description has a well-defined position

• Now we have a list of descriptions. Each describes a language.

• So each description has a position on the list.
• Example: our description “the set of all positive integers” is 32 characters long. To find its position on the list, we have to calculate how many descriptions there are with less than 32 characters, say L. We then have to generate all descriptions of size 32, sort them, and determine the position of our description in it, say P, and add the two numbers L and P. This will, of course, give a huge number, but it does ensure that the description is on the list in a well-defined position.

This is observation number two.
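In code, observation two is a small computation; here is a minimal Python sketch (mine, not the book's; the function name and alphabet are illustrative):

def position(description, alphabet):
    # L: count of all strictly shorter descriptions (sizes 1 .. n-1).
    k, n = len(alphabet), len(description)
    shorter = sum(k**m for m in range(1, n))
    # P: alphabetical rank among descriptions of the same size.
    rank = 0
    for ch in description:
        rank = rank * k + alphabet.index(ch)
    return shorter + rank + 1

# Tiny check with alphabet "ab": the list runs a, b, aa, ab, ba, bb, ...
assert position("ab", "ab") == 4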

Page 34: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

34

Our example description is at position L + P

[Diagram: the descriptions of sizes 1 through 31 fill the first L positions of the list; among the descriptions of size 32, “the set of all positive integers” sits at position P; its overall position is therefore L + P.]

Page 35: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

35

Two things to note

• Note #1: Just listing all descriptions alphabetically, without reference to their lengths, would not do. There are already infinitely many descriptions starting with an “a”, so no description starting with a higher letter could get a number on the list.

• Note #2: there is no need to actually do all this. It is just a thought experiment that allows us to examine and draw conclusions about the behavior of a system in a situation which we cannot possibly examine physically.

Page 36: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

36

Both nonsensical and meaningful descriptions

There will be many nonsensical descriptions on the list. This is immaterial to the argument. The important thing is that all meaningful descriptions are on the list, and the strategy ensures that.

Page 37: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

37

Alphabet

• The words (sentences) in a language are composed of a finite set of symbols.

• This set of symbols is called the alphabet.
• We will assume the symbols in the alphabet are ordered.
• Then the words in the language can be ordered too.
• We shall indicate the alphabet by Σ.

Page 38: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

38

Language that consists of all possible words

• The language that consists of all possible words that can be built from an alphabet is called Σ*

• For the alphabet Σ = {a, b} we get the language { , a, b, aa, ab, ba, bb, aaa, …}

The first element, before the comma, is the empty word (the word consisting of zero as and zero bs). It may be easily overlooked, so we shall write it as ε (epsilon), regardless of the alphabet. So, Σ* = {ε, a, b, aa, ab, ba, bb, aaa, …}

Page 39: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

39

Words in Σ* can be enumerated

• Since the symbols in the alphabet Σ are ordered, we can list the words in the language Σ*, using the same technique as in the previous slides:
– First, list all words of size zero, sorted; then list all words of size one, sorted; and so on.
• This is actually the order already used in our set notation for Σ*.

Page 40: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

40

Compare language L against Σ*

• Since Σ* contains all possible words, all languages using alphabet Σ are subsets of it.

• Let L be a language over Σ (the word “over” means “built out of”).

• We can go through the list of words in Σ* and put checkmarks on all words that are in L.

• Suppose our language L is “the set of all words that contain more as than bs”. L is {a, aa, aab, aba, baa, …}

ε
✓ a
b
✓ aa
ab
ba
bb
✓ aaa
✓ aab
✓ aba
abb
. . .

Page 41: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

41

Encode languages using 0 and 1

• The list of blanks and checkmarks is sufficient to identify and describe a language.

• For convenience we write the blank as 0 and the checkmark as 1, as if they were bits in a computer.

• We can now write L = 01010001110…
– So, we have attached the infinite bit-string 01010001110… to the language description “the set of all words that contain more as than bs”.
• The set of all words over an alphabet is Σ* = 1111111…
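This encoding is mechanical; a short Python sketch (not from the book) that enumerates Σ* in the order above and emits the first bits of L:

from itertools import count, product

def words(alphabet=("a", "b")):
    # Enumerate Sigma*: by length, then alphabetically within a length.
    for n in count(0):
        for w in product(alphabet, repeat=n):
            yield "".join(w)

def bits(in_language, n_bits):
    gen = words()
    return "".join("1" if in_language(next(gen)) else "0" for _ in range(n_bits))

more_as_than_bs = lambda w: w.count("a") > w.count("b")
assert bits(more_as_than_bs, 11) == "01010001110"   # the slide's bit-string for L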

Page 42: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

42

Languages are infinite bit-strings

• Any language can be encoded as an infinite bit-string, be it a formal language like L, a programming language like Java, or a natural language like English.

• For the English language the 1s in the bit-string will be very scarce, since hardly any arbitrary sequence of letters is a good English sentence.

Page 43: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

43

List of languages
• We attached the infinite bit-string 01010001110… to the language description “the set of all words that contain more as than bs”.
• In the same way, we can attach bit-strings to all descriptions.
• Some descriptions may not yield a language, in which case we can attach an arbitrary infinite bit-string to it.
• Since all descriptions can be put on a single numbered list, we get, for example, this table:

Description      Language
Description #1   000000100…
Description #2   110010001…
Description #3   011011010…
Description #4   110011010…
Description #5   100000011…
Description #6   111011011…
. . .            . . .

Page 44: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

44

The list is incomplete

• Many languages exist that are not on the list of languages above.

• The above list is far from complete, although the list of descriptions is complete.

• We shall prove this by using the diagonalization process (“Diagonalverfahren”) of Cantor.


Page 45: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

45

Flip the bits along the diagonal

• Consider the language C = 100110…, which has the property that its n-th bit is unequal to the n-th bit of the language described by Description #n.

• The first bit of C is 1 because the first bit of Description #1 is 0. The second bit of C is 0 because the second bit of Description #2 is 1. And so on.

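The diagonal flip itself is a one-liner; a sketch using the six sample rows from the table a few slides back (truncated to their nine visible bits):

rows = ["000000100", "110010001", "011011010",
        "110011010", "100000011", "111011011"]

# Bit n of C is the opposite of bit n of Description #n.
C = "".join("1" if rows[n][n] == "0" else "0" for n in range(len(rows)))
assert C == "100110"                                      # as on the next slides
assert all(C[n] != rows[n][n] for n in range(len(rows)))  # C differs from every row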

Page 46: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

46

Create a language

So C is created by walking the top-left to bottom-right diagonal of the language table and copying the opposites of the bits we meet.


C = 100110…

Page 47: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

47

It’s a new language!

• The language C cannot be on the list!
– C cannot equal line 1 since its first bit differs from that line.
– C cannot equal line 2 since its second bit differs from that line.
– And so forth.
• So, C cannot be on the list.

C = 100110…

Page 48: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

48

Infinite number of new languages
• So in spite of the fact that we exhaustively listed all possible finite descriptions, we have created a language that has no description on the list.
• There are many more languages not on the list:
– Construct, for example, the language whose n+5-th bit differs from the n+5-th bit in Description #n. Again, it cannot be on the list since for each Description #n the new language differs in the n+5-th bit. That means that bits 1…5 play no role, and can be chosen arbitrarily; this yields another 2⁵ = 32 languages that are not on the list.

C+5 = xxxxx1101…

Page 49: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

49

Even more new languages

And there are many more languages not on the list:
– Construct, for example, the language whose 2n-th bit differs from the 2n-th bit in Description #n. Again, it cannot be on the list since for each Description #n the new language differs in the 2n-th bit. That means that the odd bits play no role and can be chosen freely.

2C = x1x1x0x0…

Page 50: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

50

Infinitely many languages cannot be described

• We can create an infinite number of languages, none of which allows a finite description.

• For every language that can be described there are infinitely many that cannot.

Page 51: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

51

Many languages beyond our reach

We can only describe a tiny subset (not even a fraction) of all possible languages. There are infinitely many languages out there, forever beyond our reach.

Page 52: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

52

Unequal infinities

Although there are infinitely many descriptions and infinitely many languages, these infinities are not equal to each other.

# of languages = 2^ℵ₀ (uncountably infinite)

# of descriptions of languages = ℵ₀ (countably infinite)

Page 53: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

53

Generating a set of objects
• A good way to generate a set of objects is to start with a small object and to give rules for how to add to it and generate new objects from it.
• Example: start with these primitives: the integers and addition. Then define rules for generating new objects from the primitive objects:
“2 is an even number, and the sum of two even numbers is again an even number.”
This generates the set of all even numbers.

Page 54: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

54

Generate the set of even numbers

Primitives:
– the integers
– the addition operator

Rules:
– 2 is an even number
– the sum of two even numbers is an even number

Together these generate the set of all even numbers.

Examples:
– 4 is an even number since it is the sum of 2 + 2, and 2 is an even number.
– 6 is an even number since it is the sum of 4 and 2.
– 8 is an even number since it is the sum of 6 and 2.

Page 55: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

55

Generative rules

We can use those rules to generate the set of even numbers.

even → even + even → 2 + 2 → 4

even → even + even → 4 + 2 → 6

(→ means “may be replaced by”)
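Read as a closure computation, the two rules can be run directly; a sketch (the limit only keeps the infinite set finite for display):

def evens_up_to(limit):
    # Rule 1: 2 is an even number.
    # Rule 2: the sum of two even numbers is an even number.
    evens = {2}
    changed = True
    while changed:
        changed = False
        for x in list(evens):
            for y in list(evens):
                s = x + y
                if s <= limit and s not in evens:
                    evens.add(s)
                    changed = True
    return sorted(evens)

assert evens_up_to(12) == [2, 4, 6, 8, 10, 12]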

Page 56: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

56

Generate infinite set from a finite set of rules

• With those two rules we are able to generate an infinite set.

• Note the generative character of the recipe (rules).

Page 57: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

57

Terminal, non-terminal symbols

“2” is called a terminal.

“even” is a symbol that stands for a number. It is called a non-terminal (a singularly uninspired term).

Page 58: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

58

Naming convention

Even → Even + Even

(The part before the → is the left-hand side (LHS); the part after it is the right-hand side (RHS).)

Since we have identified terminal symbols and non-terminal symbols as technical objects, we shall write them in Cambria Math font.

We write terminals in lower case letters and start non-terminals with upper case letters.

Non-terminals are called variables or syntactic categories in linguistic contexts.

Page 59: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

59

Generate regular expressions

Primitives:
– the symbols |, *, ( and )
– characters

Rules:
1. Any character is a regular expression.
2. r1 | r2 is a regular expression, where r1 and r2 are regular expressions.
3. r1r2 is a regular expression, where r1 and r2 are regular expressions.
4. r* is a regular expression, where r is a regular expression.

These generate the regular expressions.

Examples:
– Using rule 1 we generate this regular expression: a, which denotes this set: {a}
– Using rules 3 and 1 we generate this regular expression: ab, which denotes this set: {ab}
– Using rules 2, 3, and 1 we generate this regular expression: ab | b, which denotes this set: {ab, b}
– Using rules 4 and 3 we generate this regular expression: a*b, which denotes this infinite set: {b, ab, aab, aaab, …}
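The four rules can also be applied mechanically. A sketch that generates every regular expression reachable within a given number of rule applications (function and parameter names are mine):

import itertools

def regexes(steps, chars=("a", "b")):
    exprs = set(chars)                          # rule 1: any character
    for _ in range(steps):
        new = set(exprs)
        for r1, r2 in itertools.product(exprs, repeat=2):
            new.add("(" + r1 + "|" + r2 + ")")  # rule 2: r1 | r2
            new.add(r1 + r2)                    # rule 3: r1 r2
        for r in exprs:
            new.add(r + "*")                    # rule 4: r*
        exprs = new
    return exprs

assert {"ab", "(a|b)", "a*"} <= regexes(1)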

Page 60: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

60

Rules for regular expressions

Page 61: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

61

Tom, Dick and Harry language

• Generate the set of all lists of names, each of the form: tom, dick and harry.

• All names but the last two are separated by commas.

• Duplicate names are okay, such as: tom, tom and dick.
• Although these are not complete sentences in normal English, we shall call them “sentences” since that is what they are in our midget language.

Page 62: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

62

Simple recipe for generating the language

1. tom is a Name, dick is a Name, harry is a Name

2. A Name is a Sentence
3. A Sentence followed by a “,” and a Name is again a Sentence
4. Before finishing, if the Sentence ends in “, Name” replace it by “and Name”

Page 63: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

63

Problem with the recipe
• Clause 4 has trouble:
4. Before finishing, if the Sentence ends in “, Name” replace it by “and Name”.
• A Sentence does not really end in “, Name”; it ends in “, harry” or such, and “Name” is just a symbol that stands for a real name. Such symbols cannot occur in a real sentence and must in the end be replaced by a real name as given in clause 1:
1. tom is a Name, dick is a Name, harry is a Name.

• Likewise, the word “Sentence” in the recipe is a symbol that stands for an actual sentence.

Page 64: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

64

Two kinds of symbols

• There are two kinds of symbols:
– Real symbols, which occur in finished sentences, like “tom”, a comma, and “and”
– Intermediate symbols, like “Sentence” and “Name”, that cannot occur in finished sentences
• The first kind are called terminal symbols (terminals for short).
• The second kind are called non-terminals, a singularly uninspired term. In linguistic contexts they are called variables or syntactic categories.

Page 65: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

65

The recipe generates sentences

To stress the generative character of the recipe, we shall replace X is Y by Y may be replaced by X.

Instead of saying:

tom is a name

we say:

Name may be replaced by tom

Page 66: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

66

Revised recipe

1. Name may be replaced by tom
Name may be replaced by dick
Name may be replaced by harry
2. Sentence may be replaced by Name
3. Sentence may be replaced by Sentence, Name
4. “, Name” at the end of a Sentence must be replaced by “and Name” before Name is replaced by any of its replacements

5. A sentence is finished only when it no longer contains non-terminals

6. We start our replacement procedure with Sentence

Page 67: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

67

Different types of clauses
1. Name may be replaced by tom
Name may be replaced by dick
Name may be replaced by harry
2. Sentence may be replaced by Name
3. Sentence may be replaced by Sentence, Name
4. “, Name” at the end of a Sentence must be replaced by “and Name” before Name is replaced by any of its replacements
5. A sentence is finished only when it no longer contains non-terminals
6. We start our replacement procedure with Sentence

Clauses 1 through 4 describe replacements, but 5 and 6 are different:

– Clause 5 is not specific to this grammar. It is valid generally and it is one of the rules of the game.

– Clause 6 tells us where to start generating. This symbol is called the start symbol, and it is required for every grammar.

Page 68: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

68

Conventions

• For brevity we write → instead of “may be replaced by”

Instead of writing: Name may be replaced by tom

We write: Name → tom

• The part before the → is called the left-hand side (LHS), the part after it is called the right-hand side (RHS).

Page 69: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

69

Finite recipe for generating strings in the t, d & h language

1. Name → tom
Name → dick
Name → harry
2. Sentence → Name
Sentence → List End
3. List → Name
List → List, Name
4. , Name End → and Name
5. the start symbol is Sentence

Page 70: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

70

Transformation of the form of the recipe rules

Our initial expression of each rule took this form:

tom is a Name

To emphasize the generative nature of the rules we revised the rules to this form:

Name may be replaced by tom

For brevity we replaced “may be replaced by” with an arrow:

Name → tom

That form is strong enough to serve as the basis for formal languages.

Page 71: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

71

Generating a sentence from the recipe

1. Name → tom
Name → dick
Name → harry
2. Sentence → Name
Sentence → List End
3. List → Name
List → List, Name
4. , Name End → and Name
5. the start symbol is Sentence

Sentence
→ List End                  (2. Sentence → List End)
→ List, Name End            (3. List → List, Name)
→ List, Name, Name End      (3. List → List, Name)
→ Name, Name, Name End      (3. List → Name)
→ tom, Name, Name End       (1. Name → tom)
→ tom, dick, Name End       (1. Name → dick)
→ tom, dick and Name        (4. , Name End → and Name)
→ tom, dick and harry       (1. Name → harry)
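The derivation can be replayed by a generic token-rewriting engine; a sketch (my code, not the book's) in which each step replaces the first occurrence of a rule's left-hand side:

def rewrite(form, lhs, rhs):
    # Replace the first occurrence of the token sequence lhs by rhs.
    for i in range(len(form) - len(lhs) + 1):
        if form[i:i + len(lhs)] == lhs:
            return form[:i] + rhs + form[i + len(lhs):]
    raise ValueError("rule not applicable")

steps = [                                    # the derivation above
    (["Sentence"], ["List", "End"]),         # rule 2
    (["List"], ["List", ",", "Name"]),       # rule 3
    (["List"], ["List", ",", "Name"]),       # rule 3
    (["List"], ["Name"]),                    # rule 3
    (["Name"], ["tom"]),                     # rule 1
    (["Name"], ["dick"]),                    # rule 1
    ([",", "Name", "End"], ["and", "Name"]), # rule 4
    (["Name"], ["harry"]),                   # rule 1
]

form = ["Sentence"]
for lhs, rhs in steps:
    form = rewrite(form, lhs, rhs)
assert " ".join(form).replace(" ,", ",") == "tom, dick and harry"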

Page 72: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

72

Form is the foundation for formal grammars

• This form: Name → tom (“Name may be replaced by tom”) is strong enough to serve as a basis for formal grammars.

• Similar forms, often called “rewriting systems”, have a long history among mathematicians, and were already in use several centuries B.C. in India.

Page 73: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

73

Chomsky

The specific form shown below was first studied extensively by Chomsky. His analysis has been the foundation for almost all research and progress in formal languages, parsers, and a considerable part of compiler construction and linguistics.

1. Name → tom
Name → dick
Name → harry
2. Sentence → Name
Sentence → List End
3. List → Name
List → List, Name
4. , Name End → and Name
5. the start symbol is Sentence

Page 74: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

74

Formal languages

• Formal languages are a branch of mathematics.

• The mathematics of formal languages uses a special notation that has to be learned. It allows a very concise expression of what and how but gives very little information on why. This tutorial gives the why.

Page 75: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

75

Formal definition of a grammar
• A generative grammar G is an ordered four-tuple (VN, VT, S, F) where VN and VT are finite alphabets with VN ∩ VT = ∅, S is a distinguished symbol of VN, and F is a finite set of ordered pairs (P, Q) such that P and Q are in (VN ∪ VT)* and P contains at least one symbol from VN.
• The symbols of VN are called nonterminal symbols or variables and will usually be denoted by capital letters.
• The symbols of VT are called terminal symbols and will usually be denoted by small letters.
• The sets VN and VT are disjoint in every grammar.
• The nonterminal symbol S is called the initial symbol and is used to start the derivations of the sentences of the language.
• The ordered pairs in F are called rewriting rules or productions and will be written in the form P → Q, where the symbol → is, of course, not in VN ∪ VT.

• Productions are used to derive new sentences from given ones by replacing a part equal to the left-hand side of a rule by the right-hand side of the same rule.
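The four-tuple translates directly into code; a sketch in which the class, field names, and validity check are mine:

from dataclasses import dataclass

@dataclass(frozen=True)
class Grammar:
    vn: frozenset   # non-terminal symbols (VN)
    vt: frozenset   # terminal symbols (VT)
    s: str          # initial symbol (S)
    f: tuple        # productions F: pairs (P, Q) of symbol tuples

    def check(self):
        assert not (self.vn & self.vt), "VN and VT must be disjoint"
        assert self.s in self.vn, "S must be in VN"
        for p, q in self.f:
            assert set(p) | set(q) <= self.vn | self.vt
            assert any(x in self.vn for x in p), "P needs a non-terminal"

tdh = Grammar(
    vn=frozenset({"Sentence", "List", "Name", "End"}),
    vt=frozenset({"tom", "dick", "harry", ",", "and"}),
    s="Sentence",
    f=((("Name",), ("tom",)), (("Name",), ("dick",)), (("Name",), ("harry",)),
       (("Sentence",), ("Name",)), (("Sentence",), ("List", "End")),
       (("List",), ("Name",)), (("List",), ("List", ",", "Name")),
       ((",", "Name", "End"), ("and", "Name"))),
)
tdh.check()   # the t,d&h grammar fits the definition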

Page 76: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

76

LHS must contain a non-terminal

A generative grammar G is an ordered four-tuple (VN, VT, S, F) where VN and VT are finite alphabets with VN ∩ VT = ∅, S is a distinguished symbol of VN, and F is a finite set of ordered pairs (P, Q) such that P and Q are in (VN ∪ VT)* and P contains at least one symbol from VN.

P → Q, where P must contain a non-terminal.

Page 77: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

77

Phrase structure grammars

• The grammar below is in the form of what is known as a phrase structure grammar for the t,d&h language (often abbreviated to PS grammar).

• PS grammars have no restriction on the right-hand side of production rules and on the left-hand side only the restriction that it contain at least one non-terminal.

• PS grammars are called Type 0 grammars.

1. Name → tom
Name → dick
Name → harry
2. Sentence → Name
Sentence → List End
3. List → Name
List → List, Name
4. , Name End → and Name
5. the start symbol is Sentence

Page 78: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

78

The alternative ( | ) symbol

Several right-hand sides with the same left-hand side are grouped together and separated by vertical bars, |. This bar symbol belongs to the formalism, just as the arrow →, and can be read “or else”. The right-hand sides separated by vertical bars are also called alternatives.

1. Name → tom
Name → dick
Name → harry
2. Sentence → Name
Sentence → List End
3. List → Name
List → List, Name
4. , Name End → and Name
5. the start symbol is Sentence

becomes:

1. Name → tom | dick | harry
2. Sentence → Name | List End
3. List → Name | List, Name
4. , Name End → and Name
5. the start symbol is Sentence

Page 79: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

79

Sentential forms

• In the process of generating a sentence from a grammar, a series of intermediate forms are produced, ultimately leading to the sentence.

• Each intermediate form is called a sentential form.

• The sentential forms are all the forms that occur from start symbol to final sentence.

Page 80: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

80

Sentential Forms

Sentence
→ List End
→ List, Name End
→ List, Name, Name End
→ Name, Name, Name End
→ tom, Name, Name End
→ tom, dick, Name End
→ tom, dick and Name
→ tom, dick and harry

1. Name → tom | dick | harry
2. Sentence → Name | List End
3. List → Name | List, Name
4. , Name End → and Name
5. the start symbol is Sentence

(All the intermediate forms in the derivation above are sentential forms.)

Page 81: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

81

Terminology

• If a sentential form contains no non-terminals it is called a sentence and belongs to the generated language.

• The transitions (separated by arrows →) are called production steps.

• The grammar rules are called production rules.

Page 82: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

82

Example of terminology usage

We have seen that the sentential forms occurring in the production process for a finite-state grammar all contain only one non-terminal, except the last.

Page 83: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

83

Terminal productions

The strings of terminals that are generated from the start symbol are called the terminal productions.

Page 84: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

84

Production graph:

Sentence
→ List End
→ List, Name End
→ List, Name, Name End
→ Name, Name, Name End
→ tom, Name, Name End
→ tom, dick, Name End
→ tom, dick and Name
→ tom, dick and harry

[Production graph: Sentence branches to List and End; List expands twice via “List , Name” and finally to Name; the closing “, Name End” group is rewritten to “and Name”; the leaves read: tom , dick and harry.]

Page 85: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

85

Graphs

• The production process can be made more visual by drawing connective lines between corresponding symbols, using a graph.

• A graph is a set of nodes connected by a set of edges.

• If the edges are arrows, the graph is a directed graph; if they are lines, the graph is undirected.
– Almost all graphs used in parsing techniques are directed.

Page 86: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

86

Production graph

A graph corresponding to a production process is called a production graph or syntactic graph and depicts the syntactic structure (with regard to the given grammar) of the final sentence.

Page 87: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

87

Production graph

[The production graph again, annotated: a starlike construction results from rewriting a group of symbols; the production graph normally fans out and downwards.]

Page 88: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

88

Graph, not tree

[The production graph once more: at the “, Name End → and Name” rewrite, branches fan out and then come back together.]

Trees don’t fan out and then come back together. So this is a production graph, not a production tree.

Page 89: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

89

Production graphs are acyclic

• A cycle in a graph is a path from node N following the arrows, leading back to N.

• A production graph cannot contain cycles. Here’s why: To get a cycle we would need a non-terminal node N in the production graph that has produced children that are directly or indirectly N again. But since the production process always makes new copies for the nodes it produces, it cannot produce an already existing node.

• So production graphs are always acyclic. Directed acyclic graphs are called dags.

Page 90: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

90

Exercise

• Draw the production graph for this grammar:
A → B
B → C
C → A

• Assume A is the start symbol.

Page 91: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

91

Ha! It’s a trick question

• You don’t draw a production graph for a grammar. You draw it for the process taken to generate a sentence.

• The grammar rules on the previous slide have no terminals. The grammar loops, so it cannot generate a sentence.

Page 92: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

92

Only legal sentences

• It is impossible to generate: tom, dick, harry (note the comma before harry)
• If a sentence has more than one name, this rule must be used: Sentence → List End
• The only way to remove End is with this rule: , Name End → and Name

1. Name → tom | dick | harry
2. Sentence → Name | List End
3. List → Name | List, Name
4. , Name End → and Name
5. the start symbol is Sentence

Page 93: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

93

Implementing “must replace”

• Recall our recipe: “, Name” at the end of a Sentence must be replaced by “and Name” before Name is replaced by any of its replacements.

• Our formalism uses the arrow →, which means may replace.
• Amazingly, we have succeeded in implementing the notion must replace in a system that only uses may replace.
• We accomplished this by splitting must replace into may replace (List End may replace Sentence) and must not be a non-terminal (the items in a finished sentence must not be non-terminals).

Page 94: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

94

Grammar produces many sentences

The grammar produces many sentences:
tom, dick and harry
harry and tom
harry
tom, tom, tom and tom
and an infinity of others

1. Name → tom | dick | harry
2. Sentence → Name | List End
3. List → Name | List, Name
4. , Name End → and Name
5. the start symbol is Sentence

Page 95: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

95

Blind alley

Sentence
→ List End
→ List, Name End
→ List, Name, Name End
→ Name, Name, Name End
→ tom, Name, Name End
→ tom, dick, Name End
→ tom, dick, harry End

1. Name → tom | dick | harry
2. Sentence → Name | List End
3. List → Name | List, Name
4. , Name End → and Name
5. the start symbol is Sentence

There is no rule for just the End non-terminal, so we can proceed no further with this sentential form. With the path we have taken, we have arrived at a blind alley.

Page 96: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

96

Frugal framework

• The main properties of a formal grammar are:
– it has production rules, which may be used for rewriting part of the sentential form
– it has a start symbol, which is the mother of all sentential forms
• In the production rules we find non-terminals and terminals; finished sentences contain terminals only.
• That’s it! It’s a frugal framework.

Page 97: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

97

The expressive power of this frugal framework

• Formal grammars are a framework of impressive frugality for generating sets.
• Question: Is it sufficient for generating sets?
• Answer: We do not have anything more expressive. All other methods known to mankind for generating sets have been proved to be equivalent to, or less powerful than, a phrase structure grammar.

Page 98: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

98

Computer programs generate sets

• A program reads some data and outputs a result. That result is called a sentence in formal language theory.

• Given another input, the program generates another sentence.

• And so on.
• So a program generates a set of sentences (a language).

Page 99: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

99

Are programs more expressive than phrase structure grammars?

It has been proved that any set (language) that can be generated by a program can be generated by a phrase structure grammar.

Page 100: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

100

A stronger method might exist

• There is no proof that a stronger method cannot exist.

• But in view of the fact that many quite different methods all turn out to halt (in expressivity) at the same barrier, it is highly unlikely that a stronger method will ever be found.

Page 101: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

101

Illustration of the expressive power of grammars

Below is a grammar for the movements of a Manhattan turtle. A Manhattan turtle moves in a plane and can only move north, east, south or west in distances of one block. The grammar below produces all paths that return to their own starting point.

Move → north Move south | east Move west | ε
north east → east north
north south → south north
north west → west north
east north → north east
east south → south east
east west → west east
south north → north south
south east → east south
south west → west south
west north → north west
west east → east west
west south → south west
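The property the grammar captures (every generated path returns to its own starting point) is easy to test directly; a sketch, independent of the grammar itself:

def returns_to_start(path):
    # Net displacement of a Manhattan-turtle path must be (0, 0).
    step = {"north": (0, 1), "south": (0, -1),
            "east": (1, 0), "west": (-1, 0)}
    x = sum(step[m][0] for m in path)
    y = sum(step[m][1] for m in path)
    return (x, y) == (0, 0)

assert returns_to_start("north east south west".split())
assert not returns_to_start("north north east".split())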

Page 102: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

102

[Production graph: Move → north Move south → north east Move west south → north east west south → north east south west.]

Production graph for the round trip: north east south west

The empty alternative in rule 1 (the ε) results in this Move dying out.

Page 103: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

103

The grammar for a set

• There can be infinitely many grammars for a set.

• By the grammar for a set we mean any grammar that does the job (generates the desired set) and is not obviously overly complicated.

Page 104: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

104

Easy grammars and hard grammars

• Some grammars are easy to understand.
• Some simple grammars generate very complicated sets.
• The grammar for any given set is, however, usually far from simple.
• Theory says that if a set can be generated at all (for example, by a program), it can be generated by a phrase structure grammar.
• But theory does not say that it will be easy to do so, or that the grammar will be understandable.

Page 105: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

105

Unmanageability of phrase structure grammars

Apart from the intellectual problems phrase structure grammars pose, they also exhibit fundamental and practical problems:

– No general parsing algorithm for them can exist.
– All known special parsing algorithms are either very inefficient or very complex.

Page 106: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

106

Chomsky hierarchy

• The desire to restrict the unmanageability of phrase structure grammars, while keeping as much of their generative powers as possible, has led to the Chomsky hierarchy of grammars.

• The hierarchy has four types of grammars, numbered 0 to 3.

• It is useful to include a fifth type, called Type 4.

Page 107: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

107

Chomsky hierarchy

Type 0 (phrase-structure grammars)
Type 1 (context-sensitive grammars)
Type 2 (context-free grammars)
Type 3 (regular grammars)
Type 4 (finite-choice grammars)

Page 108: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

108

Increasingly restricted grammars
• Type 0 grammars are the (unrestricted) phrase structure grammars.
• The other types originate from applying more and more restrictions to the allowed form of the rules of the grammar.
• Each of these restrictions has far-reaching consequences; the resulting grammars are gradually easier to understand and manipulate, but are also gradually less powerful.
• Fortunately, these less powerful types are still very useful, actually more useful even than Type 0.

Page 109: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

109

Type 0: arbitrary number of symbols on LHS and RHS

The characteristic property of a Type 0 grammar is that it may contain rules that transform an arbitrary (non-zero) number of symbols into an arbitrary (possibly zero) number of symbols. Example:

, N E → and N
in which 3 symbols are replaced by 2.

Page 110: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

110

Type 1 grammars

• Type 1 grammars restrict the freedom of the Type 0 grammars.

• There are two completely different definitions of Type 1 grammars, which can be easily proved to be equivalent:
– Type 1 monotonic
– Type 1 context-sensitive

Page 111: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

111

Type 1 monotonic grammars

• A grammar is Type 1 monotonic if every rule has the same or more symbols on the right-hand side (the tree expands, doesn’t contract).

• This forbids, for example, the rule , N E → and N: its LHS has 3 symbols, its RHS has 2 symbols, so the tree is contracting.

Page 112: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

Type 1 context-sensitive grammars

• A grammar is Type 1 context-sensitive if all of its rules are context-sensitive.

• A rule is context-sensitive if only one non-terminal symbol in its left-hand side gets replaced by other symbols, while we find the others back, undamaged and in the same order, in the right-hand side. Example:

Name Comma Name End → Name and Name End

“The rule Comma → and may be applied if the left context is Name and the right context is Name End.” The replacement must be at least one symbol long; thus context-sensitive grammars are always monotonic.

112
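The decomposition test in this definition can be written out; a sketch (my helper, not the book's) that looks for a split lhs = left A right and rhs = left replacement right, with A a non-terminal and the replacement non-empty:

def is_context_sensitive(lhs, rhs, nonterminals):
    for i, sym in enumerate(lhs):
        if sym not in nonterminals:
            continue
        left, right = lhs[:i], lhs[i + 1:]
        # The contexts must be found back, undamaged and in the same order.
        if (rhs[:len(left)] == left
                and rhs[len(rhs) - len(right):] == right
                and len(rhs) - len(left) - len(right) >= 1):
            return True
    return False

NT = {"Name", "Comma", "End", "Sentence", "List", "S", "Q"}
assert is_context_sensitive(["Name", "Comma", "Name", "End"],
                            ["Name", "and", "Name", "End"], NT)  # the slide's rule
assert not is_context_sensitive(["c", "Q"], ["Q", "c"], NT)      # cQ → Qc is not CS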

Page 113: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

113

Example of a CS grammar rule

• Question: Is this a context-sensitive grammar rule: bQc → bbcc?
• Answer: Yes, because Q is replaced by bc.
• “The replacement must be at least one symbol long”. Thus, the replacement may be more than one symbol. In this case the replacement is two symbols.
• Note: It is not the case that c is replaced by cc. Why? Because only a non-terminal may be replaced.

Page 114: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

114

Key to writing monotonic grammars

• In writing monotonic grammars one has to be careful never to produce symbols that must later be deleted.
• This rule produces a symbol that must eventually be deleted:
Sentence → List End
The End symbol does not produce anything and has to be deleted:
, Name End → and Name
But that rule is not monotonic.

Page 115: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

115

End symbol is deleted

[Diagram: Sentence → List End; later the group “, Name End” is rewritten to “and Name”, deleting End.]

Page 116: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

116

Monotonic grammar for the t,d&h language

We avoid the need to delete the End marker by incorporating End into the rightmost name:

Name → tom | dick | harry
Sentence → Name | List
List → EndName | Name , List
, EndName → and Name

Page 117: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

117

Context-sensitive grammar for the t,d&h language

Name → tom | dick | harry
Sentence → Name | List
List → EndName | Name Comma List
Comma EndName → and EndName     (the context is . . . EndName)
and EndName → and Name          (the context is and . . .)

We had to introduce a new non-terminal, Comma. Here’s why. Notice that this isn’t correct:

Name → tom | dick | harry
Sentence → Name | List
List → EndName | Name , List
, EndName → and EndName
and EndName → and Name

, is a terminal symbol. But in CS grammars only non-terminal symbols are replaced: “one non-terminal symbol in its left-hand side gets replaced by other symbols”.

Page 118: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

118

MT = CS and less powerful than PS

• Monotonic and context-sensitive grammars are equally powerful: for each language that can be generated by a monotonic grammar a context-sensitive grammar exists that generates the same language, and vice versa.

• They are less powerful than the Type 0 grammars. There are languages that can be generated by Type 0 grammars that cannot be generated by any Type 1 (or Type 2, 3, 4) grammar.

Page 119: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

119

No simple Type 0 grammars

Strangely enough, no simple examples of Type 0 languages are known, only their existence can be proved.

Page 120: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

120

Type 0 (phrase-structure grammars)
Type 1 (context-sensitive grammars)
Type 2 (context-free grammars)
Type 3 (regular grammars)
Type 4 (finite-choice grammars)

[Annotation on the Type 0 region of the hierarchy diagram:] It can be proven that there are languages in here, but we cannot create grammars that generate them.

[Marginal note:] Roger, not sure this is true.

Page 121: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

121

Contradiction?

• Two slides back it says: Strangely enough, no simple examples of Type 0 languages are known, only their existence can be proved.

• But hold on! Didn’t we already see a couple of Type 0 grammars? Here’s one:

1. Name → tom | dick | harry
2. Sentence → Name | List End
3. List → Name | List, Name
4. , Name End → and Name
5. the start symbol is Sentence

This is a PS grammar!

Page 122: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

122

Key concept

The grammar below is in the form of a phrase structure grammar, but the language (set) it generates can be generated by less powerful grammars.

1. Name → tom | dick | harry
2. Sentence → Name | List End
3. List → Name | List, Name
4. , Name End → and Name
5. the start symbol is Sentence

Page 123: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

123

Type of a grammar is its smallest class

• Any Type 1 grammar is also a Type 0 grammar since the class of Type 1 grammars is obtained from the class of Type 0 grammars by applying restrictions.

• But it would be confusing to call a Type 1 grammar a Type 0 grammar; it would be like calling a cat a mammal: correct but imprecise.

• A grammar is named after the smallest class (that is, the highest class number) in which it will still fit.

Page 124: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

124

Using a Type 0 grammar for the t,d&h language was overkill

• We saw a Type 0 grammar that generates the t,d&h language:

Name → tom | dick | harry
Sentence → Name | List End
List → Name | List, Name
, Name End → and Name

• We saw two different Type 1 grammars that generate the t,d&h language; here’s one of them:

Name → tom | dick | harry
Sentence → Name | List
List → EndName | Name Comma List
Comma EndName → and EndName
and EndName → and Name

Page 125: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

125

Type n language for a Type n grammar

• A Type n language can be generated by a Type n grammar or anything stronger, but not by a weaker Type n+1 grammar.

• If a language is generated by a Type n grammar, that does not necessarily mean that there is no (weaker) Type n+1 grammar for it.
– The t,d&h language can be generated by a Type 0 grammar, but it can also be generated by Type 1, 2, and 3 grammars.

Page 126: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

126

Constructing a Type 1 grammar

• The standard example of a Type 1 language is the set of strings that consist of an equal number of as, bs, and cs, in that order.

a a . . . a  b b . . . b  c c . . . c
(n of them each)

• We shall derive a grammar for this toy language.

Page 127: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

127

Constructing a grammar for aⁿbⁿcⁿ

• Starting with the simplest case, we have the rule:
0. S → abc

• Having obtained one instance of S, we may want to prepend more as to the beginning. If we want to remember how many there were, we shall have to append something to the end as well, and it cannot be a b or c. We shall use a yet unknown symbol Q. The following rule both prepends and appends:

1. S → aSQ

Page 128: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

128

Continued• Now, to get aabbcc from this, each Q must be worth one b

and one c, but we cannot just write:Q → bc

because that would allow bs after the first c. • The above rule would, however, be all right if we were

allowed to do replacement only between a b on the left and a c on the right. There the newly inserted bc will do no harm:

2. bQc → bbcc• Still, we cannot apply this rule since normally the Qs are to

the right of the c. This can be remedied by allowing Q to hop left over c:

3. cQ → Qc

Page 129: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

129

Grammar for anbncn

S (start)aSQ (rule 1)aaSQQ (rule 1)aaabcQQ (rule 1)aaabQcQ (rule 3)aaabbccQ (rule 2)aaabbcQc (rule 3)aaabbQcc (rule 3)aaabbbcccc (rule 2)

Derivation of a3b3c3

1. S → abc | aSQ2. bQc → bbcc3. cQ → Qc

Page 130: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

130

Derivation graph for a2b2c2

S

Sa Q

ba c Q

b Q c

bb c caa

1. S → abc | aSQ2. bQc → bbcc3. cQ → Qc

Page 131: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

131

Starlike forms

S

Sa Q

ba c Q

b Q c

bb c caa

Starlike forms

Page 132: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

132

Monotonic or CS?

• Is the following grammar monotonic or context-sensitive?

1. S → abc | aSQ2. bQc → bbcc3. cQ → Qc

• Answer: it is monotonic. The last rule is not context-sensitive since it does not conform to: only one non-terminal symbol in its left-hand side gets replaced by other symbols, while we find the others back, undamaged and in the same order, in the right-hand side.

Page 133: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

133

The anbncn language is Type 1

• It can be proved (using the pumping lemma for context-free grammars) that there is no Type 2 grammar for the anbncn language.

• We have created a Type 1 grammar for it.• Therefore it is of Type 1.

Page 134: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

134

CS grammars, MT grammars

• Although only context-sensitive Type 1 grammars can by rights be called context-sensitive grammars (CS grammars), that name is used even if the grammar is actually monotonic Type 1.

• There are no standard initials for monotonic, but MT will do.

Page 135: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

135

CF grammars

• Type 2 grammars are called context-free grammars (CF grammars).

• A CF grammar may contain only rules that have a single non-terminal on their left-hand side.

Page 136: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

136

CS versus CF grammars

Whereas context-sensitive grammars have rules in which a non-terminal symbol changes within a context, the rules in context-free grammars are independent of context (the left and right contexts are absent/empty).

– The rules in context-free grammars have a single non-terminal on their left-hand side.

Page 137: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

137

Independent production property

• Since there is always only one symbol on the left-hand side, each node in a production graph has the property that whatever it produces is independent of what its neighbors produce: the productive life of a non-terminal is independent of its context.

• Starlike forms cannot occur. Each node fans out, no nodes come together.

• Consequently the production graph has a pure tree-form and is called a production tree.

Page 138: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

138

Feature of XML that breaks production independence

In XML, if an element has an IDREF attribute, then it is dependent on what its neighbors produce.

<Book footnote_ref=“RB”> <Title>Illusions</Title></Book><Footnote id=“RB”> <Author>Richard Bach</Author></Footnote>

The Book element must reside in a context where there is an element with an ID attribute whose value matches the value of @footnote_ref.

Consider an XML Schema that declares the Book element to have an IDREF footnote_ref attribute. That XML Schema is a context-free grammar. But the sentences that are generated (i.e., the XML instances) have additional semantics that imposes a context-sensitivity on the Book element.

Page 139: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

139

Sentences in formal languages have no semantics

• The sentences that you generate from a formal grammar have no semantics.

• The symbols in a sentence have no semantics.• But in an XML instance document that conforms to an

XML Schema the symbols do have semantics.– Example: an element with an attribute of type IDREF must

reference an ID value. That element/attribute must reside in a context in which there is a matching ID value. So while the XML Schema is simply a context-free grammar, the semantics of the data types imposes an additional layer of constraint on the XML instance.

This is a key concept

Page 140: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

140

CF grammar for the t,d&h language

→→ →

tom | dick | harryName | List and NameName , List | Name

NameSentence

List

1.2.3.

Page 141: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

141

A production tree for the CF t,d&h grammar

→→ →

tom | dick | harryName | List and NameName , List | Name

NameSentence

List

1.2.3.

Sentence

NameList and

,Name List

Name

harryand,tom dick

Notice that it is a tree: all nodes fan out, there are no starlike forms.

Page 142: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

142

A rule “defines” a non-terminal

• All right-hand sides for a non-terminal are collected in one grammar rule.

• Each grammar rule reads like a definition of the left-hand side:– A Sentence is either a Name or a List followed by and

followed by Name.– A List is either a Name followed by a , followed by a List or

it is a Name.

→→ →

tom | dick | harryName | List and NameName , List | Name

NameSentence

List

1.2.3.

Page 143: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

143

Sentences are produced using two processes

Context-free grammars produce sentences by two processes:

– concatenation (“… followed by …”)– choice (“either … or …”)

Page 144: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

144

Identification mechanism

In addition to the concatenation and choice processes there is an identification mechanism which links the name of a non-terminal used in a right-hand side to its defining rule (“… is a …”).

– Example: “Name is a List” links Name to the rule that defines List.

Page 145: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

145

Each non-terminal generates a language

• Earlier we identified a “language” as a set of strings: the set of terminal productions of the start symbol.

• The independent production property allows us to extend this definition to any non-terminal in the grammar: each non-terminal produces a set—a language—independent of the other non-terminals.

• If we write the set of strings produced by as and has a production rule with, say, two alternatives, , then . “ ” ∪ is the union operator on sets.

• If then consists of, say, three members , we have “” is the concatenation operator on the strings in the sets.

Page 146: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

146

Nullable/Empty

• A non-terminal whose language contains ε is called nullable.

• One also says that it produces empty.

Page 147: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

147

In Type 0 or Type 1 grammars only the start symbol defines a language

• Recall the Type 1 grammar for anbncn:1. S → abc | aSQ2. bQc → bbcc3. cQ → Qc

• We cannot define a language L(Q) since Q does not produce anything meaningful by itself.

Page 148: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

148

Only for Type 2 and lower

• Defining a language for a non-start symbol is possible only for Type 2 grammars and lower.

• Defining a non-start non-terminal as nullable is only possible for Type 2 grammars and lower.

Page 149: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

149

Recursive non-terminals

A non-terminal A is recursive if an A in a sentential form can produce something that again contains an A. Example:

A → aAthe A is directly recursive. Here is an example of indirect recursion: A → aB B → bAA produces aB and B produces bA, which takes us back to the production for A.

Page 150: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

150

Right recursion

A non-terminal A is right-recursive if it can produce something that has an A at the right end of the rewrite rule. Example: A → abcA

Page 151: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

151

Left recursive

A non-terminal A is left-recursive if it can produce something that has an A at the left end of the rewrite rule. Example: A → Aabc

Page 152: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

152

Self-embedding

A non-terminal A is self-embedding if there is a derivation in which A produces A with something, say α, before it and something, say β, after it. Example: A → αAβ

Page 153: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

153

Nesting

• Self-embedding describes nesting: α is the part produced when entering another level of nesting, β is the part produced when leaving that level.

• The best-known example of nesting is the use of parentheses in arithmetic expressions:

→ →

… | Simple_expressionNumber | ‘(‘ Arith_expression ‘)’

Arith_expressionSimple_expression

Page 154: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

154

Both left- and right-recursive

A non-terminal can be left-recursive and right-recursive at the same time; it is then self-embedding. Example: A → Ab | cA | d

→→ → →

AbcAbccAbccdb

A

Page 155: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

155

Recursion is essential for infinite languages

• If no non-terminal in a grammar is recursive, each production step uses up one non-terminal since that non-terminal will never occur again.

• So the production process cannot continue unlimitedly, and a finite language results.

• Recursion is essential for life in grammars.

Page 156: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

156

Can create infinite languages using the repetition operator

• Thus far we have not used repetition operators in grammars.

• Later we will extend the grammar syntax to provide repetition operators.

• With the extended syntax we can create infinite languages without using recursion.

Page 157: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

157

Advantage/disadvantage of CF grammars

• In the actual world, many things are defined in terms of other things.

• The advantage of CF grammars is that they are a very concise way to formulate such interrelationships.

• The disadvantage of CF grammars is that they can generate a lot of good-looking nonsense.

Page 158: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

158

XML

• XML is a text-markup system.• Markup is used to express and control the

basic structure.• An XML instance document is a parse tree!

Page 159: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

159

CF grammar for English

If we ignore enough detail we can recognize an underlying context-free structure in the sentences of a natural language, for example, English:

→→ → →→→→→

Subject Verb ObjectNounPhraseNounPhrasethe QualifiedNounNoun | Adjective QualifiedNouncastle | caterpillar | catswell-read | white | wistful | …admires | bark | criticize | …

SentenceSubjectObject

NounPhraseQualifiedNoun

NounAdjective

Verb

Page 160: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

160

… which produces sentences like:

→→ → →→→→→

Subject Verb ObjectNounPhraseNounPhrasethe QualifiedNounNoun | Adjective QualifiedNouncastle | caterpillar | catswell-read | white | wistful | …admires | bark | criticize | …

SentenceSubjectObject

NounPhraseQualifiedNoun

NounAdjective

Verb

the well-read cats criticize the wistful caterpillar

Page 161: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

161

Since no context is incorporated, it will equally well produce this good-looking nonsense:

→→ → →→→→→

Subject Verb ObjectNounPhraseNounPhrasethe QualifiedNounNoun | Adjective QualifiedNouncastle | caterpillar | catswell-read | white | wistful | …admires | bark | criticize | …

SentenceSubjectObject

NounPhraseQualifiedNoun

NounAdjective

Verb

the cats admires the white well-read castle

Page 162: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

162

For keeping context we could use a phrase structure grammar:

The markers Singular and Plural control the production of English words. Still, this grammar allows the cats to bark …

For a better way to handle context, see various sections in Chapter 15, especially Van Wijngaarden grammars (Section 15.2) and attribute and affix grammars (Section 15.3).

→→

→ → →→→→

Noun Number VerbSingular | Pluralcastle Singular | caterpillar Singular | …Singular admires | …εcats PluralPlural bark | Plural criticize | …ε

SentenceNumber

Noun SingularSingular Verb

SingularNoun PluralPlural Verb

Plural

Page 163: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

163

Programming languages are defined using CF grammars

• The bulk of examples of CF grammars originate from programming languages.

• Sentences in these languages (that is, programs) have to be processed automatically (by a compiler) and it was recognized early (around 1958) that this is much easier if the language has a well-defined formal grammar.

• The syntaxes of all programming languages in use today are defined through formal grammars.

Page 164: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

164

XML

• XML Schemas specify XML languages using CF grammars.

• Sentences in XML languages (that is, XML instances) have to be validated automatically (by a validator) and it was recognized early (around 1999) that this is much easier if the XML language has a well-defined formal grammar.

• The syntaxes of nearly all XML languages in use today are defined through XML Schemas.

Page 165: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

165

ε-rules, ε-free

• A grammar rule that has an empty right-hand side: A → εis called an ε-rule. Read that rule as: A may be replaced by the empty string (which we denote by ε). – An empty string is a string of length zero, it contains no

characters from the alphabet, Σ.• A grammar that contains no such rules is called ε-

free.

Page 166: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

166

Non-monotonic CF grammar

The only way a CF rule can be non-monotonic is by having an ε-rule. A grammar containing this rule would not be monotonic:

A → ε

Page 167: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

167

Require monotonicity

• Some authors (for example, Chomsky) and some parsing algorithms require a CF grammar to be monotonic.

• This means that it must be ε-free.

Page 168: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

168

Making a grammar ε-free

• Almost any CF grammar can be made ε-free by systematic substitution of the ε-rules.

• The exception is a grammar in which the start symbol produces ε.

• The transformation of a CF grammar that contains ε-rules into an ε-free grammar is explained in Section 4.2.3.1

Page 169: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

169

Advantage of ε-free CF grammars

The proofs and parsers are less complicated, sometimes much less complicated.

Page 170: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

170

Disadvantage of ε-free CF grammars

• The disadvantage with transforming a CF grammar to remove ε-rules is that the resulting grammar will almost always be more complicated.

• Example: Suppose we have a system that can be fed bits of information like: “Amsterdam is the capital of the Netherlands”, “Truffles are expensive”, and can then be asked a question.

continued

Page 171: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

171

Disadvantage of ε-free CF grammars

• On a superficial level we can define its input as:

input → zero-or-more-bits-of-info questionzero-or-more-bits-of-info → bit-of-info zero-or-more-bits-of-info | ε

• This definition of input neatly fits the user’s view of the problem.• Here is an ε-free grammar for the input:

input → question-preceded-by-infoquestion-preceded-by-info → question | bit-of-info question-preceded-by-info

• This second definition does not fit the user’s view of the problem.• As a grammar becomes more and more complicated, the requirement that

it be ε-free becomes more and more of a nuisance: the grammar is working against us, not for us.

Page 172: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

172

No problem theoretically

• Requiring grammars be ε-free presents no problem from a theoretical point of view: any CF language can be described by an ε-free CF grammar and ε-rules are never needed.

• Better still, any grammar with ε-rules can be mechanically transformed into an ε-free grammar for the same language.

• But the price we pay is that of any grammar transformation: it is no longer our grammar and it does not reflect the original structure as well.

Page 173: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

173

The ε-rule is a useful tool

The bottom line is that the practitioner finds the ε-rule to be a useful tool.

Page 174: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

174

ε-rules make parsing problematic

• Many parsing methods will in principle work for ε-free grammars only: if something does not produce anything, you can’t very well see if it’s there.

• Often the parsing method can be doctored to handle ε-rules, but that invariably increases the complexity of the method.

• If ε-rules did not exist, then the topic of parsing would be 30% smaller – but then grammars would lose much more than 30% of their usefulness.

Page 175: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

175

Advantage of using ε-rules

The advantage is that ε-rules are very convenient for the grammar writer and user.

Page 176: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

176

Notational style: BNF

• There are several different styles of notation for CF grammars of programming languages.

• They are all functionally equivalent.• The first is the Backus-Naur Form (BNF) which

was first used to define ALGOL 60. Here is a sample:

<name> ::= tom | dick | harry<sentence> ::= <name> | <list> and <name><list> ::= <name>, <list> | <name>

• Angle brackets are used to enclose non-terminals and ::= is used for “may produce”

Page 177: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

177

Notational style: van Wijngaarden

• The second notational style is that of van Wijngaarden. Here is a sample:

name: tom symbol; dick symbol; harry symbol.sentence: name, list, and symbol, name.list: name, comma symbol, list; name.

• The names of symbols end in …symbol; their representations are hardware-dependent and are not defined in the grammar.

• Rules are terminated with a period.

Page 178: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

178

van Wijngaarden grammars

• Punctuation is used in the traditional way. For example, the comma binds tighter than the semicolon.

• The punctuation can be read as follows:

: is defined as a(n); , or as a(n)

, followed by a(n). , and as nothing else.

• So this rule

sentence: name, list, and symbol, name.

would be read as: A sentence is defined as a name followed by a list followed by an and-symbol followed by a name, and as nothing else.

name: tom symbol; dick symbol; harry symbol.sentence: name, list, and symbol, name.list: name, comma symbol, list; name.

Page 179: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

179

van Wijngaarden grammars

• The van Wijngaarden notation achieves its full power only when applied to the two-level van Wijngaarden grammars.

• But it also has merits on its own: it is formal and still quite readable.

Page 180: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

180

Extended CF grammars

• CF grammars are made more compact and more readable by introducing special shorthands for frequently used constructions.

• Rules like:List → Item | Item List

are written in an extended CF grammar as:List → Item+

Item+ means “one or more Items” • We do not need to give a rule for Item+, the rule:

Item+ → Item | Item Item+

is implicit.• This notation for grammars is called Extended BNF (EBNF).

Page 181: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

181

Extended CF grammars (cont.)

• Likewise, rules like:List → ε | Item List

are written in an extended CF grammar as:List → Item*

Item* means “zero or more Items”• The rule:

Item* → ε | Item Item*is implicit.

Page 182: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

182

Extended CF grammars (cont.)

• Finally, rules like:Item → ε | Item

are written in an extended CF grammar as: Item → Item?

Item? means “zero or one Item” (optional Item)

• The rule: Item? → ε | Itemis implicit.

Page 183: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

183

Repetition operators

+, *, and ? are called repetition operators

Page 184: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

184

Extending an operator’s range

• In the preceding examples the operators +, *, and ? work on the preceding symbol.

• Their range can be extended by using parentheses: (Item ;)? means “optionally an Item followed by a ; ”

Page 185: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

185

Advantage of extended grammar

The advantage of the repetition operators and parentheses is that grammars can be written more efficiently, more compactly, and more readable.

Page 186: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

186

Illustrate the gain in efficiency, compactness, and readability

→→

→ → →→→→

Preface ChapterSequence Conclusion“PREFACE” ParagraphSequenceChapter | Chapter ChapterSequence“CHAPTER” Number ParagraphSequenceParagraph | Paragraph ParagraphSequenceSentenceSequence…“CONCLUSION” ParagraphSequence

BookPreface

ChapterSequenceChapter

ParagraphSequenceParagraph

SentenceSequenceConclusion

→→ → →→→

Preface Chapter+ Conclusion“PREFACE” Paragraph+

“CHAPTER” Number Paragraph+

Sentence+

…“CONCLUSION” Paragraph+

BookPrefaceChapter

ParagraphSentence

Conclusion

Use extended operators

Page 187: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

187

Overdoing a good thing

Some styles even allow constructions like:– Item+4 meaning “One or more Item, with a

maximum of 4 ”– Item+ , meaning “One or more Items separated by

commas ”

Page 188: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

188

No increase in expressive power

• The extensions of an EBNF grammar do not increase its expressive powers.

• All implicit rules can be made explicit and then a normal CF grammar results.

• Their strength lies in their user-friendliness.

Page 189: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

189

Kleene star

• The star in the notation X* is called the Kleene star

• As we’ve seen, in a grammar X* should be read as “zero or more Xs ”

• If X is a set, X* should be read as “the set of zero or more elements of X concatenated ”

• We denote an alphabet by Σ and the set of all strings over the alphabet by Σ*

Page 190: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

190

Regular expressions

• Forms involving the repetition operators *, +, or ? and possibly the separators ( and ) are called regular expressions.

• EBNFs, which have regular expressions for their right-hand sides, are sometimes called regular right part grammars (RRP grammars), which is more descriptive than “extended context free” but is more of a tongue twister.

Page 191: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

191

Structural meaning of a regular right-hand side

• There are two schools of thought about the structural meaning of a regular right-hand side.

• One school maintains that a rule like:Book → Preface Chapter+ Conclusion

is an abbreviation of: Book → Preface α Conclusion α → Chapter | Chapter α

• This is, a right recursive interpretation.

Page 192: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

192

Right recursive interpretation

• The advantage of a right recursive interpretation is that it is easy to explain and the transformation to “normal” CF is simple.

• The disadvantages are:– The transformation entails anonymous rules

(identified by α).– The lopsided production tree does not correspond

to our idea of the structure of the Book (see figure on next slide).

Page 193: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

193

Production tree for a right recursive interpretation

Book

Preface α Con-clusion

Chapter α

Chapter α

Chapter α

Chapter

→→

Preface α ConclusionChapter | Chapter α

Bookα

Page 194: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

194

Iterative interpretation

• The second school of thought claims that: Book → Preface Chapter+ Conclusion

is an abbreviation of: Book → Preface Chapter Conclusion | Preface Chapter Chapter Conclusion | Preface Chapter Chapter Chapter Conclusion | …

• It has the advantage that it yields a beautiful production tree (see figure on next slide), but the disadvantages are that it involves an infinite number of production rules and that the nodes in the production tree have varying fan-out.

Page 195: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

195

Production tree for the iterative interpretation

Book

Preface Chapter Con-clusionChapter Chapter Chapter

Page 196: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

196

Which interpretation is commonly used?

Since the iterative implementation is complex, most practical parser generators use the recursive interpretation in some form or another, whereas most research has been done on the iterative interpretation.

Page 197: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

197

XML/XSD uses the iterative interpretation

<xs:element name="Book"> <xs:complexType> <xs:sequence> <xs:element name="Preface">...</xs:element> <xs:element name="Chapter" maxOccurs="unbounded">...</xs:element> <xs:element name="Conclusion">...</xs:element> </xs:sequence> </xs:complexType> </xs:element>

Book → Preface Chapter+ Conclusion

Is expressed in XSD like so:

<Book> <Preface>...</Preface> <Chapter>...</Chapter> <Chapter>...</Chapter> <Chapter>...</Chapter> <Conclusion>...</Conclusion></Book>

Which is interpreted using the iterative interpretation:

Wrong! The XML instance is a sentence of the grammar. The issue is: what is the parse tree (DOM tree) for this instance?

Page 198: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

198

Basic property of CF grammars

• The basic property of CF grammars is that they describe things that nest: an object may contain other objects in various places, which in turn may contain … etc.

• When during the production process we have finished producing one of the objects, the right-hand side still “remembers” what has to come after it.

Page 199: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

199

While working on expanding a non-terminal, the following symbols remain queued-up

After having descended into the depth of the non-terminal Subject to produce something like the wistful cat, the right-hand side Subject Verb Object still remembers that a Verb must follow. While we are working on the Subject, the Verb and Object symbols remain queued at the right in the sentential form. For example:

the wistful QualifiedNoun Verb Object

→→ → →→→→→

Subject Verb ObjectNounPhraseNounPhrasethe QualifiedNounNoun | Adjective QualifiedNouncastle | caterpillar | catswell-read | white | wistful | …admires | bark | criticize | …

SentenceSubjectObject

NounPhraseQualifiedNoun

NounAdjective

Verb

Page 200: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

200

Here’s what makes CF languages so useful

• It is the parsability that make CF languages so useful, not the fact that they stem from the Chomsky hierarchy.

• Parsing is the task of converting a string to the production tree.

Page 201: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

201

Type 3 grammars

• The restriction to Type 3 disallows the recollection of things that came before.

• A right-hand side may only contain one non-terminal and it must come at the end.

• This means that there are only two kinds of rules:– A non-terminal produces zero or more terminals.– A non-terminal produces zero or more terminals followed by one

non-terminal.• Example: the language a*b+ is generated by this Type 3

grammar:S → a*BB → b+

Page 202: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

202

Type 2 versus Type 3

• Type 2 allows queuing-up whereas Type 3 does not.

• Type 2 allows recollection of things that came before whereas Type 3 does not.

This is a key concept

Page 203: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

203

Chomsky definition of Type 3

• Our definition is equivalent and more convenient.• It is not completely trivial to convert a Type 3

grammar under our definition to a grammar under the Chomsky definition.

A non-terminal produces zero or more terminals.A non-terminal produces zero or more terminals followed by one non-terminal.

Our definition of Type 3:

A non-terminal produces one terminal.A non-terminal produces one terminal followed by one non-terminal.

Chomsky definition of Type 3:

Page 204: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

204

Type 2.5 grammar (linear grammar)

• A Type 2.5 grammar allows a single non-terminal on the right-hand side and it doesn’t have to be at the end.

• This kind of grammar is called a linear grammar.

Page 205: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

205

Note the equivalence between a Type 3 grammar and a finite-state automaton

S → a*BB → b+

Type 3 grammar:

S

b

B

a

b

Finite-state automaton

Page 206: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

206

Why grammars, not automata?• There is a close relationship between formal grammars and

other abstract notions used in computer science, such as automata and algorithms.

• Indeed, since the results in one theory can often be translated into another, it seems to be an arbitrary decision as to which interpretation is primary.

• In these slides formal grammars are given preferential treatment because they are probably the most commonly known of the various theories among computer scientists.

• This is due to the success of the context-free grammars in describing the syntax of programming languages.

Page 207: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

207

Type 3 = Regular grammars

• Type 3 grammars are also called regular grammars (RE grammars) or finite-state grammars (FS grammars)

• More precisely the version defined below is called right-regular since the only non-terminal in a rule is found at the right end of the right-hand side.

A non-terminal produces zero or more terminals.A non-terminal produces zero or more terminals followed by one non-terminal.

Our definition of Type 3:

Page 208: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

208

Type 3 grammars can recurse

• Recall the rules on Type 3 grammars:– A non-terminal produces zero or more terminals.– A non-terminal produces zero or more terminals

followed by one non-terminal• This grammar conforms to the rules; therefore

it is Type 3:A → a | aA (it is recursive)

Page 209: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

209

Left-regular grammars• The left-regular grammars are subject to the restriction that the

only non-terminal in a rule must be at the left end of the right-hand side:– A non-terminal produces zero or more terminals.– A non-terminal produces one non-terminal followed by zero or more

terminals.• Example: this Type 3 left-regular grammar generates the language

a*b*S → Ab*A → a*

• Left-regular grammars are less intuitive than right-regular grammars, occur less frequently, and are more difficult to process, but they do occur occasionally.

Page 210: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

210

Regular grammar = right-regular grammar

Given the prevalence of right-regular over left-regular, the term “regular grammar” is usually intended to mean “right-regular grammar.”

Page 211: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

211

Right-recursive vs. right-regular

• A non-terminal A is right-recursive if it can produce a sentential form that has an A at the right end.– Right-recursive means that rule A can be used

again in the production process• A rule is right-regular simply means that its

non-terminal is at the right end, following any terminal symbols. Right-regular has nothing to do with recursion.

Page 212: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

212

Non-nesting

Regular grammars don’t nest

Page 213: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

213

Common usage

• Regular grammars are used very often to describe the structure of text at the character level, in lexers.

• It is customary for the terminal symbols of a regular grammar to be single characters.

Page 214: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

214

Type 3 grammars for the t,d,&h language

→→ →

t | d | h | Listt ListTail | d ListTail | h ListTail, List | & t | & d | & h

SentenceList

ListTail

→→ →

t | d | h | ListListHead &t | ListHead &d | ListHead &hListHead , t | ListHead , d | ListHead , h | t | d | h

SentenceList

ListHead

Right-regular grammar:

Left-regular grammar:

Page 215: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

215

Grammar and equivalent automaton

t,d,h

ListTailt,d,h

Automaton

t,d,h

S ε List

,

→→ →

t | d | h | Listt ListTail | d ListTail | h ListTail, List | & t | & d | & h

SList

ListTail

Right-regular grammar:

&

Page 216: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

216

Challenge of left-regular grammars

• Here is a Type 3 grammar (left-regular):S → Ab*A → a*

• Doing recognition with a left-regular requires a non-deterministic automation. See Section 5.3.

Page 217: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

217

Production chain

• The production tree for a sentence from a Type 3 (right-regular) grammar degenerates into a production chain of non-terminals that drop a sequence of terminals on their left.

• See next slide for an example

Page 218: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

218

Sentence

List

t ListTail

, List

d ListTail

& h

→→ →

t | d | h | Listt ListTail | d ListTail | h ListTail, List | & t | & d | & h

SentenceList

ListTail

Right-regular grammar:

Production chain

Page 219: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

219

The […] notational device

There is a lot of repeating in the above grammar. A notational device has been invented to abate this nuisance. Square brackets are used to indicated “one out of a set of characters”: [tdh] is an abbreviation for t | d | h

→→

t | d | h | Listt ListTail | d ListTail | h ListTail, List | & t | & d | & h

SentenceList

ListTail

→→

[tdh] | List[tdh] ListTail, List | & [tdh]

SentenceList

ListTail

Page 220: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

220

The macro notational device

• A macro is a name for pieces of the grammar.• The macro is referenced by preceding the

name with a $ symbol.• A referenced macro is substituted by the

grammar.

→→ →

t | d | h | Listt ListTail | d ListTail | h ListTail, List | & t | & d | & h

SentenceList

ListTail

→→

→ →

t | d | h $Name | List$Name ListTail, List | & $Name

NameSentence

ListListTail

usemacro

Page 221: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

221

The lex parser

• lex is a popular parser for regular grammars.• It supports both notational devices: the […]

device and the macro device.

Page 222: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

222

Type 3 Chomsky definition of t,d&h

The following grammar does not adhere to the Chomsky definition of Type 3:

A non-terminal produces one terminal.A non-terminal produces one terminal followed by one non-terminal.

Chomsky definition of Type 3:

→→ →

t | d | h | Listt ListTail | d ListTail | h ListTail, List | & t | & d | & h

SentenceList

ListTail

2 terminals – not allowed in the Chomsky definition

Page 223: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

223

Convert to the Chomsky definition

If we adhere to the Chomsky definition of Type 3, our grammar will not be smaller than:

A non-terminal produces one terminal.A non-terminal produces one terminal followed by one non-terminal.

Chomsky definition of Type 3:

→→ →

t | d | h | Listt ListTail | d ListTail | h ListTail, List | & t | & d | & h

SentenceList

ListTail

→→ →

t | d | h | Listt ListTail | d ListTail | h ListTail, List | & Name | & Name | & Namet | d | h

SentenceList

ListTailName

Our Type 3 definition of the t,d&h language

Chomsky Type 3 definition of the t,d&h language

Page 224: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

224

Chomsky version of t,d&h

The Chomsky version is:– Bigger (an additional rule is needed)– Not as easy to read as the version that uses […] or

macros– But it is easier to process

→→ →

t | d | h | Listt ListTail | d ListTail | h ListTail, List | & Name | & Name | & Namet | d | h

SentenceList

ListTailName

Page 225: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

225

Key point

There is variation in how people define Type 0, Type 1, Type 2, and Type 3. Depending on which definition you use, the grammars you create may vary in user friendliness and in ease of processing.

Page 226: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

226

Formal linguist vs. Computer scientist

• Chomsky’s definition of Type 3 has minimal mechanisms. The formal linguist is interested in and helped by minimally sufficient grammars.

• The computer scientist values a form in which the concepts underlying the grammar ($Name, etc.) are easily expressed, at the expense of additional processing.

Page 227: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

227

Two observations about regular grammars

• The sentential forms will only contain one non-terminal and this will always be at the end.

• The size of regular grammars can be reduced considerably by using the repetition operators *, +, and ? along with the grouping operators ( and ).

Page 228: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

228

One non-terminal, at the end

Sentence → List → t ListTail → , List → d ListTail → & h

Notice in these sentential forms that there is one non-terminal and it is at the end.

→→

t | d | h | Listt ListTail | d ListTail | h ListTail, List | & t | & d | & h

SentenceList

ListTail

Page 229: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

229

Smaller and simpler

All regular grammars can be reduced considerably in size by using the repetition operators *, +, and ? for “zero or more”, “one or more” and “optionally one”, respectively.

Sentence → [tdh] | ListList → [tdh] ListTail ListTail → , List | & [tdh]

Using the repetition operators along with ( and ) for grouping, we can simply the grammar to:

Sentence → (( [tdh] , )* [tdh] & )? [tdh]

Page 230: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

230

Regular expressions

Regular expressions exist for all Type 3 grammars.

Sentence → (( [tdh] , )* [tdh] & )? [tdh]

regular expression

Regular grammar (Type 3 grammar) that uses repetition operators along with ( and )

Page 231: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

231

Type 4 grammars

• The last restriction we shall apply to what is allowed in a production rule is a pretty final one: no non-terminal is allowed in the right-hand side.

• This removes all generative power from the mechanism, except for the choosing of alternatives.

• The start symbol has a (finite) list of alternatives from which we are allowed to choose.

• Type 4 grammars are named finite-choice grammars (FC grammars).

Page 232: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

232

Example of a FC grammar

• There is no FC grammar for the t,d&h language.• If, however, we are willing to restrict ourselves to lists

of names of finite length (say, no more than three), then there is a FC grammar, since one could enumerate all combinations. For three names we get:

S → [tdh] | [tdh] & [tdh] | [tdh] , [tdh] & [tdh]

for a total of 3 + 3 x 3 + 3 x 3 x 3 = 39 production rules.

Page 233: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

233

Chomsky: No FC grammar

• FC grammars are not part of the official Chomsky hierarchy in that they are not identified by Chomsky.

• They are nevertheless very useful and are often required as a tail-piece in some process or reasoning.

• For example, the set of reserved words (keywords) in a programming language can be described by a FC grammar.

Page 234: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

234

Parts of grammars are FC

• Although not many grammars are FC in their entirety, some of the rules in many grammars are finite-choice.

• For example, the first t,d&h grammar we looked at has a FC rule:1. Name → tom

Name → dickName → harry

2. Sentence → NameSentence → List End

3. List → NameList → List, Name

4. , Name End → and Name5. the start symbol is Sentence

Finite choice

Page 235: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

235

Summary

The below table summarizes the most complicated data structures that can occur in the production of a sentence, with respect to the grammar type used.

Chomsky type Grammar type Most complicated data structure

0/1 PS/CS production dag

2 CF production tree

3 FS production chain

4 FC production element

Legend:dag: directed acyclic graphPS: phrase -structureCS: context-sensitiveFS: finite-stateFC: finite-choice

Page 236: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

236

XML is a graph

• XML is a graph due to ID-IDREF.• So it must be Type 0/1.

Page 237: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

237

Symbology

Let:VN denote the set of non-terminal symbolsVT the set of terminal symbolsS the start symbolF the production rules

Page 238: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

238

Formal definition of Type 0, 1, 2, 3 grammars

A generative grammar G = (VN, VT, S, F) is said to be of Type i if it satisfies the restrictions described in this list:

i = 0 No restrictions except the LHS must contain at least one non-terminal

i = 1: Every rewriting rule in F has the form Q1AQ2 → Q1PQ2, with Q1, Q2, and P in (VN V∪ T)*, A V∈ N, and P ≠ ε, except possibly for the rule S → ε, which may occur in F, in which case S does not occur on the right-hand sides of the rules.

i = 2: Every rule in F has form A → P, where A V∈ N, and P (V∈ N V∪ T)*.

i = 3: Every rule in F has form with A → PB or A → P, where A, B V∈ N, and P V∈ T*.

Page 239: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

239

Monotonic not necessarily a context-sensitive grammar

• A grammar is Type 1 monotonic if every rule has the same or more symbols on the right-hand side (the tree expands, doesn’t contract).

• A grammar is Type 1 context-sensitive if all of its rules are context-sensitive. A rule is context-sensitive if only one (non-terminal) symbol in its left-hand side gets replaced by other symbols, while we find the others back, undamaged and in the same order, in the right-hand side.

• The below grammar for anbncn is Type 1 monotonic but not Type 1 context-sensitive:

1. S → abc | aSQ2. bQc → bbcc3. cQ → Qc

Every rule has the same or more symbols on the right-hand side, so it is monotonic.

Page 240: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

240

Context-sensitive grammar for anbncn

The below grammar for anbncn is Type 1 context-sensitive:

1. S → aSBC | aBC2. CB → HB3. HB → HC4. HC → BC5. aB → ab6. bB → bb7. bC → bc8. cC → cc

http://www.answers.com/topic/context-sensitive-grammar

S (start)aSBC (rule 1)aaBCBC (rule 1)aabCBC (rule 5)aabHBC (rule 2)aabHCC (rule 3)aabBCC (rule 4)aabbCC (rule 6)aabbcC (rule 7)aabbcc (rule 8)

Derivation of a2b2c2CS grammar for anbncn

Page 241: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

241

Generating sentences from a grammar

• Until now we have only produced single sentences from our grammars, in an ad hoc fashion.

• But the purpose of a grammar is to generate all of its sentences.

• Fortunately there is a systematic way to do so.

Page 242: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

242

Production queue

Queue

aSQ

S → abc | aSQaQC → bbcccQ → Qc

aSQ

substitute S

aaSQQ, aabcQ

We can systematically generate all sentences using a queue.

Page 243: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

243

Systematic way to enumerate all sentences in a PS grammar

• Begin with S as the only sentential form in the queue. Now continue doing the following:– Consider the first sentential form in the queue.– Scan it from left to right, looking for a substring that matches the left-hand

side of a production rule.– For each such production rule found, make a copy of the sentential form,

replace the substring with the production rule’s right-hand side, add the revised sentential form to the end of the queue.

– If the original sentential form does not contain any non-terminals, write it down as a sentence in the language.

– Throw away the original sentential form; it has been fully processed. • If no rule matched and the sentential form was not a finished

sentence, it was a blind alley; they are removed automatically by the above process and leaves no trace.

Page 244: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

244

Recursively enumerable

• The procedure on the previous slide enumerates all strings in a PS language.

• Thus, PS languages are also called recursively enumerable sets, where “recursively” is taken to mean “by a possibly recursive algorithm.”

Page 245: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

245

Non-recursively enumerable sets

There are sets that are not recursively enumerable; the set of all Type 0 grammars that do not produce the empty string is an example. There is no grammar for it, because this set cannot be generated (you cannot tell whether a Type 0 grammar produces the empty string). If phrase-structure is not sufficient, only natural language description will do, as shown here.

Page 246: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

246

Language-generating procedure

The queue procedure is a systematic way of producing all the strings that a grammar is capable of generating. That is, it is a language-generating procedure.

Page 247: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

247

Let’s see the procedure in action for the anbncn grammar:S → abc | aSQbQc → bbcc cQ → Qc

Step Queue (front of queue on left) Result1. S2. abc aSQ abc

aSQ3. aabcQ aaSQQ4. aaSQQ aabQc 5. aabQc aaabcQQ aaaSQQQ6. aaabcQQ aaaSQQQ aabbcc7. aaaSQQQ aabbcc aaabQcQ8. aabbcc aaabQcQ aaaabcQQQ aaaaSQQQQ aabbcc9. aaabQcQ aaaabcQQQ aaaaSQQQQ10. aaaabcQQQ aaaaSQQQQ aaabbccQ11. aaaaSQQQQ aaabbccQ aaaabQcQQ… …

Page 248: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

248

Every sentence will be produced

• The table on the previous slide shows that we do not get a sentence each time we turn the crank (each time we process the item at the front of the queue).

• In fact, real sentences will get scarcer and scarcer. The reason is that during the process more and more side-lines develop which all require equal attention.

• Still, we can be certain that every sentence that can be produced, will be produced.

Page 249: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

249

Breadth-first generator

This way of doing things is called breadth-first production. Computers are better at it than humans.

S

abc aSQ

output

aabcQ aaSQQ

aaabcQQ aaaSQQQaabQc

Page 250: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

250

Replace all left-hand sides

Queue

AC

S → ACA → bAC → ac

AC

substitute A and AC

ac, bC

If we were to only substitute A, then the remainder is C which is a blind alley. Doing both substitutions (replace A by b and AC by ac) also leads to a blind alley, but there will be an output, ac.

Page 251: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

251

Sentential form provides a context

Queue

abcXYdef

The sentential form provides a context. If you ignore a context you run the risk of creating false productions.

abcXYdef

context

…………..

Remember, this discussion is just for phrase-structure grammars.

Page 252: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

252

Grammar that generates the empty set

• What language will this grammar generate?S → ABA → BB → A

• Let’s show a few sentential forms:

S → AB → BB → AB → BB → AB → …

• Every new sentential form contains non-terminals. • It generates no sentences. • Therefore, it produces the empty set. • The language generated is the empty set: L(G) = {}

Page 253: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

253

PS grammar that generates the empty set

The language generated by this grammar is also empty:

S → aABbB → aAb → SBbAa → SaBB → SAB → ab

Page 254: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

254

Undecidable• We have seen how to systematically generate, for PS

grammars, all sentential forms using a queue.• It is not at all certain that the process will obtain a sentence. • It is quite possible that every new sentential form never

terminates (see example on previous slide).• It is undecidable whether a PS grammar produces the

empty set.• “Undecidable” means that there cannot be an algorithm

that will, for every PS grammar, correctly tell if the grammar produces at least one sentence.

Page 255: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

255

Procedure

PS grammar

Yes (no), the language generated is the empty set

No such procedure exists

Page 256: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

256

This queue will run forever

Queue

AB

AB

S → ABA → BB → A

substitute

AA, BB

Page 257: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

257

No algorithm to determine if PS grammars will produce something

• There is no algorithm that can decide, given an arbitrary PS grammar, whether it will produce a sentence.

• This does not mean that we cannot prove for some given grammar that it generates nothing. It means that the proof method used will not work for all PS grammars.

• We could have a program that correctly says Yes in finite time if the answer is Yes but that takes infinite time if the answer is No. In fact, the queue procedure answers Yes in finite time but takes an infinite time if the answer is No.

Page 258: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

258

Many special cases can be identified

• For many PS grammars we can prove if they produce the empty set or not.

• For example, the grammar may have a rule S -> a, or we may find it has no rule without a non-terminal in its RHS.

Page 259: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

259

Can still get useful info

Procedure

PS grammar

The language generated is (not) the empty set

No such procedure exists

Even though we can’t get an exact answer, this does not prevent us from obtaining all sorts of useful information that gets close. The computer scientist is aware of but not daunted by the impossibilities from formal languages.

Page 260: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

260

The Halting Problem for PS grammars

• The previous slides say there is no algorithm to determine if an arbitrary PS grammar will produce a sentence.

• Question: What would cause a grammar to not produce a sentence?

• Answer: Each sentential form must have a non-terminal (otherwise we have a sentence). The production rules must produce another sentential form with a non-terminal. So the sentential forms never halt.

• There is no algorithm that can determine if an arbitrary PS grammar’s production graph will halt.

Page 261: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

261

The halting problem is undecidable

• Problem: write a tool that, given an arbitrary PS grammar, it determines whether it will produce a string.

• What algorithm would you devise?• You might use the algorithm presented earlier: use a queue, take the

first item off the queue, substitute, add the new sententials to the end of the queue, repeat. Upon the first item generated that has no non-terminals (is a sentence), return Yes (the grammar does produce at least one sentence) and stop. Clearly this algorithm will run forever on those grammars that don’t produce a string.

• Perhaps there is another algorithm that will solve the problem? It turns out, there is no other algorithm.

The halting problem is not decidable for PS grammars

Page 262: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

262

Example of a PS grammar?

• First, all Type 1-Type 4 grammars are Type 0 grammars too. And you can trivially rewrite (preserving the sets they generate) any of these so they are no longer Type 1-4.

• If we want interesting examples we will have to concentrate not on the form of the grammar, which is what Type N is concerned with, but on the sets they generate.

Page 263: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

263

Fundamental difference between Type 0 and Type 1

The fundamental difference between Type 0 and Type 1 lies in the sets (languages) they can generate:

– For a set that is Type 1 (context-sensitive), we can determine in finite time whether any item is or isn’t in the set, whereas

– For a set that is Type 0 (phrase-structure), an item not in the set cannot be ruled out in finite time. That is, it may take an infinite amount of time to determine that an item is not in the set. More formally, determining if a given item belongs to a set generated by a Type 0 grammar is undecidable.

Page 264: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

264

Time required to determine if an item is in the set defined by a grammar

                          CS grammar     PS grammar
Item is in the set        finite time    finite time
Item is not in the set    finite time    possibly infinite time

The difference between PS grammars and CS grammars is that for a PS grammar it may take an infinite amount of time to determine that an item is not in the set.

Page 265: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

265

Finding a PS grammar

• Finding a phrase-structure grammar that is not also a context-sensitive grammar amounts to finding a set for which we can determine in finite time that an item belongs to the set, but for which an infinite amount of time may be required to determine that an item does not belong to it.

• Here is a set that is well-known to be undecidable: The set of all programs that terminate.

Page 266: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

266

Creating a PS grammar

• Let L be a grammar for a simple but complete (with full Turing power) programming language.

• Write a breadth-first generator for all programs in L (generate the programs using the queue algorithm).

• Write an interpreter for L.

• Start interpreting the programs breadth-first as they come.

• When a program terminates, we produce it as part of the generated set.

• If the interpreter doesn't terminate, the program is not a member of the set.

• So this Type 0 grammar generates just the set of all terminating programs in L, a set whose membership is undecidable, so there cannot be a Type 1 grammar for this set.

Page 267: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

267

Expanding/shrinking PS grammars

Q → XYZ        (Q is replaced by XYZ: the sentential form grows)

XYZ → Q        (XYZ is replaced by Q: the sentential form shrinks)

Thus a production step can grow or shrink the sentential form.

Page 268: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

268

Expanding/shrinking PS sentential forms

[Graph: the length of the sentential form both grows and shrinks over the course of the production process.]

Page 269: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

269

Unsolvable

• When we do get sentences from the queue procedure, they may be produced in an unexplainable order.

• The sentential forms may grow for a while and then suddenly shrink, perhaps even to the empty string.

• It can be proven that there cannot be an algorithm that for all PS grammars produces their sentences in increasing length (actually, in non-decreasing length).

• In other words, the parsing problem for PS grammars is unsolvable.

Page 270: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

270

Terminology: Undecidable vs. Unsolvable

• Undecidable is the term used for Yes/No questions.
  – Example of an undecidable question: For an arbitrary PS grammar, does the grammar produce a sentence?

• Unsolvable is the term used for problems.
  – Example of an unsolvable problem: For an arbitrary PS grammar, generate its sentences in increasing length.

• Note: in the literature these terms are used interchangeably.

Page 271: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

271

Turn to CS grammars

• We have been discussing phrase-structure grammars:
  – How to systematically generate their languages (use the queue procedure)
  – Whether we can write a procedure to decide if an arbitrary PS grammar will generate a sentence (no)
  We noted that PS sentential forms can expand and shrink during a production process.

• Now let’s address the same issues for CS grammars.

Page 272: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

272

Language generation

The language-generating queue procedure is also applicable to CS grammars.

Page 273: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

273

CS grammars don’t shrink

Q1 A Q2 → Q1 P Q2

(Within the context Q1 … Q2, A is replaced by P.)

P may be multiple symbols, so a production step can grow the sentential form. P cannot be empty (ε), so a production step cannot shrink it. Each step leaves the sentential form the same length or makes it longer.

Page 274: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

274

Expanding CS sentential forms

[Graph: the length of the sentential form only grows (or stays the same) over the course of the production process.]

Page 275: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

275

Decidable

• The sentential forms in CS grammars never shrink: the strings are produced in order of non-decreasing length.

• This means: if we want to know whether a given string w is in the language, we can run the queue procedure, discarding every sentential form that is longer than w. Either w comes up, in which case the answer is Yes, or the (finite) supply of forms of length ≤ length(w) runs out, in which case the answer is No.

• It is decidable whether a CS grammar produces the string w.

Page 276: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

276

A procedure exists for deciding whether a string w is an element of a CS language L(G):

1) Generate sentential forms from the CS grammar G using the queue procedure, discarding any form longer than w.
2) If a generated string q equals w: stop, w ∈ L(G).
3) If the queue runs out of forms (every remaining form would be longer than w): stop, w ∉ L(G).
4) Otherwise: continue.
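Building on the earlier queue sketch, here is a minimal sketch of this decision procedure (Python; names are mine). Because no CS rule shrinks a sentential form, any form longer than w can never derive w and may be discarded, and only finitely many forms of length ≤ length(w) exist, so the procedure always terminates.

from collections import deque

def cs_member(rules, start, w):
    """Decide w in L(G) for a context-sensitive (non-shrinking) grammar.
    'rules' is a list of (lhs, rhs) string pairs."""
    queue = deque([start])
    seen = {start}
    while queue:
        form = queue.popleft()
        if form == w:
            return True
        for lhs, rhs in rules:
            pos = form.find(lhs)
            while pos != -1:
                new = form[:pos] + rhs + form[pos + len(lhs):]
                # prune: non-shrinking rules mean longer forms are hopeless
                if len(new) <= len(w) and new not in seen:
                    seen.add(new)
                    queue.append(new)
                pos = form.find(lhs, pos + 1)
    return False   # all forms of length <= len(w) explored

# Example with the monotonic grammar for anbncn that appears later in
# these slides (S → abc | aSQ, bQc → bbcc, cQ → Qc):
# cs_member([("S","abc"), ("S","aSQ"), ("bQc","bbcc"), ("cQ","Qc")],
#           "S", "aabbcc")  ->  True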

Page 277: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

277

Recursive sets

• Since the strings in a CS language can be recognized by an algorithm that always terminates (such decidable sets are called recursive in recursion theory), CS languages are also called recursive sets.

• So, the term recursive set means there exists a procedure for determining if a string w is an element of the set generated by a CS grammar.

Page 278: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

278

Halting Problem Decidable?

• We can systematically generate, for CS grammars, all sentential forms using a queue.

• However, it is not at all certain that the process will obtain a sentence.

• It is quite possible that every new sentential form never terminates.

• Is there a procedure for deciding whether a CS grammar produces the empty set?

Page 279: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

279

Halting Problem Decidable?

• Clearly the queue procedure will not work – it will loop endlessly if a grammar’s language is empty.

• Perhaps there is some other procedure that could be applied to a CS grammar to decide if it will produce something?

• It turns out that there is no such procedure. The halting problem is undecidable for CS grammars. See http://www.cs.cmu.edu/~./FLAC/pdf/ContSens-6up.pdf, bottom of page 12.

Page 280: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

280

Turn to CF grammars

• We have been discussing context-sensitive grammars:
  – How to systematically generate their languages (use the queue procedure)
  – A procedure to decide whether a string w is an element of the language generated by an arbitrary CS grammar (run the queue procedure, discarding forms longer than w, until a match is found or the queue is exhausted)
  We noted that CS sentential forms never shrink during a production process.

• Now let’s address the same issues for CF grammars.

Page 281: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

281

There is an algorithm to determine if a CF grammar will produce something

With CF grammars it may still happen that a grammar will never produce a sentence but we can determine that beforehand, as follows:

1) First, scan the grammar to find all non-terminals which have a right-hand side that contains terminals only or is empty. These non-terminals are guaranteed to produce something.

2) Now scan again to find non-terminals which have a right-hand side that consists only of terminals and non-terminals that are guaranteed to produce something. This will give us new non-terminals that are guaranteed to produce something.

3) Repeat 2) until we find no more new non-terminals.

4) If we have not met the start symbol this way, the grammar will not produce anything.

(A code sketch follows the worked example below.)

The halting problem is decidable for CF grammars

Page 282: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

282

Example

Determine that the CF grammar for the t, d & h language produces a sentence:

Sentence → t
Sentence → d
Sentence → h
List → t ListTail
List → d ListTail
List → h ListTail
ListTail → , List
ListTail → & t
ListTail → & d
ListTail → & h

1) Terminals-only right-hand sides: Sentence → t, Sentence → d, Sentence → h, ListTail → & t, ListTail → & d, ListTail → & h. So Sentence and ListTail are guaranteed to produce something.

2) Right-hand sides consisting only of terminals and non-terminals guaranteed to produce something: List → t ListTail, List → d ListTail, List → h ListTail, and then also ListTail → , List. So List is guaranteed to produce something too.

Since the start symbol Sentence is guaranteed to produce something, the grammar produces a sentence.
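Here is a minimal sketch of this closure procedure (Python; the representation is mine): rules are (lhs, rhs) pairs where rhs is a list of symbols, and a symbol counts as a non-terminal iff it occurs as some left-hand side.

def produces_something(rules, start):
    """Decide whether a CF grammar generates at least one sentence."""
    nonterminals = {lhs for lhs, _ in rules}
    guaranteed = set()            # non-terminals guaranteed to produce something
    changed = True
    while changed:                # repeat until no new non-terminals are found
        changed = False
        for lhs, rhs in rules:
            if lhs not in guaranteed and all(
                    s in guaranteed for s in rhs if s in nonterminals):
                guaranteed.add(lhs)
                changed = True
    return start in guaranteed

# The t, d & h grammar above:
rules = [("Sentence", ["t"]), ("Sentence", ["d"]), ("Sentence", ["h"]),
         ("List", ["t", "ListTail"]), ("List", ["d", "ListTail"]),
         ("List", ["h", "ListTail"]),
         ("ListTail", [",", "List"]), ("ListTail", ["&", "t"]),
         ("ListTail", ["&", "d"]), ("ListTail", ["&", "h"])]
print(produces_something(rules, "Sentence"))   # True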

Page 283: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

283

Leftmost rewriting

Leftmost rewriting: in the production process, rewrite the leftmost non-terminal every time.

Page 284: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

284

Rightmost rewriting

Rightmost rewriting: in the production process, rewrite the rightmost non-terminal every time.

Page 285: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

285

Notation for: rule 2, second alternative

1. Name → tom | dick | harry
2. Sentence → Name | List and Name
3. List → Name , List | Name

Consider this derivation:

Sentence → List and Name

"Sentence" was rewritten using rule 2's second alternative, i.e., 2b. We will write the derivation like so:

Sentence →2b List and Name

Page 286: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

286

Compare leftmost and rightmost rewriting

1. Name → tom | dick | harry
2. Sentence → Name | List and Name
3. List → Name , List | Name

Leftmost rewriting:
Sentence →2b List and Name →3a Name, List and Name →1a tom, List and Name →3b tom, Name and Name →1b tom, dick and Name →1c tom, dick and harry

Rightmost rewriting:
Sentence →2b List and Name →1c List and harry →3a Name, List and harry →3b Name, Name and harry →1b Name, dick and harry →1a tom, dick and harry

Notes:
a. The sequences of production rules are not as similar as we would expect. The sequences are neither equal nor each other's mirror image, nor is there any obvious relationship.
b. In grand total the same rules and alternatives are used.
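A tiny sketch of leftmost/rightmost rewriting in code (Python; the rule encoding, with rules keyed by non-terminal and alternative letter, is my own):

RULES = {
    ("Name", "a"): ["tom"], ("Name", "b"): ["dick"], ("Name", "c"): ["harry"],
    ("Sentence", "a"): ["Name"], ("Sentence", "b"): ["List", "and", "Name"],
    ("List", "a"): ["Name", ",", "List"], ("List", "b"): ["Name"],
}
NONTERMINALS = {"Name", "Sentence", "List"}

def rewrite(form, step, leftmost=True):
    """Apply one derivation step, e.g. ("Sentence", "b") for 2b, to the
    leftmost or rightmost non-terminal of a sentential form (a list of
    symbols)."""
    positions = [i for i, s in enumerate(form) if s in NONTERMINALS]
    i = positions[0] if leftmost else positions[-1]
    lhs, alt = step
    assert form[i] == lhs, "step does not apply to this non-terminal"
    return form[:i] + RULES[(lhs, alt)] + form[i + 1:]

form = ["Sentence"]
for step in [("Sentence", "b"), ("List", "a"), ("Name", "a"),
             ("List", "b"), ("Name", "b"), ("Name", "c")]:
    form = rewrite(form, step, leftmost=True)
    print(" ".join(form))
# last line printed (tokens joined with spaces): tom , dick and harry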

Page 287: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

287

Show the order that non-terminals are rewritten in the production tree

Leftmost rewriting:
Sentence →2b List and Name →3a Name, List and Name →1a tom, List and Name →3b tom, Name and Name →1b tom, dick and Name →1c tom, dick and harry

[Production tree for "tom, dick and harry", with the non-terminals numbered in the order they are rewritten: Sentence (1), List (2), the Name producing tom (3), the inner List (4), the Name producing dick (5), the Name producing harry (6).]

Page 288: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

288

Show the order that non-terminals are rewritten in the production tree

Rightmost rewriting:
Sentence →2b List and Name →1c List and harry →3a Name, List and harry →3b Name, Name and harry →1b Name, dick and harry →1a tom, dick and harry

[The same production tree, with the non-terminals numbered in the order they are rewritten: Sentence (1), the Name producing harry (2), List (3), the inner List (4), the Name producing dick (5), the Name producing tom (6).]

Page 289: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

289

Different order of rewriting

[The leftmost and rightmost production trees side by side: identical in shape, but with different rewrite-order numbering.]

Both rewrite-sequences define the same production tree. But the order of rewriting differs.

Page 290: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

290

Leftmost derivation

Leftmost rewriting:
Sentence →2b List and Name →3a Name, List and Name →1a tom, List and Name →3b tom, Name and Name →1b tom, dick and Name →1c tom, dick and harry

Here is the sequence of production rules used in leftmost rewriting:

Sentence → List and Name → Name, List and Name → tom, List and Name → tom, Name and Name → tom, dick and Name → tom, dick and harry

This sequence of production rules is called the leftmost derivation of a sentence.

Page 291: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

291

Indicating a leftmost production

A leftmost production step can be indicated by using an arrow marked with a small l, for example:

Name, List and Name →l tom, List and Name

The leftmost production sequence:

Sentence →l List and Name →l Name, List and Name →l tom, List and Name →l tom, Name and Name →l tom, dick and Name →l tom, dick and harry

can be abbreviated:

Sentence →l* tom, dick and harry

Page 292: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

292

Rightmost derivation

Rightmost rewriting:
Sentence →2b List and Name →1c List and harry →3a Name, List and harry →3b Name, Name and harry →1b Name, dick and harry →1a tom, dick and harry

Here is the sequence of production rules used in rightmost rewriting:

Sentence → List and Name → List and harry → Name, List and harry → Name, Name and harry → Name, dick and harry → tom, dick and harry

Page 293: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

293

Indicating a rightmost production

A rightmost production step can be indicated by using an arrow marked with a small r, for example:

List and Name →r List and harry

The rightmost production sequence:

Sentence →r List and Name →r List and harry →r Name, List and harry →r Name, Name and harry →r Name, dick and harry →r tom, dick and harry

can be abbreviated:

Sentence →r* tom, dick and harry

Page 294: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

294

Indicating a production

The fact that Sentence produces tom, dick and harry in any way is written:

Sentence →* tom, dick and harry

Page 295: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

295

Parsing (defined)

Parsing is the task of reconstructing the derivation tree (or graph) for a given input string.

[Diagram: grammar + input string → Parser → derivation tree (or graph)]

Page 296: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

296

1. Name → tom | dick | harry
2. Sentence → Name | List and Name
3. List → Name , List | Name

[Diagram: the input string "tom, dick and harry" goes into the Parser, which reconstructs the production tree shown earlier.]

Page 297: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

297

Most parsers use leftmost or rightmost derivation

Some of the most efficient parsing techniques can be understood more easily if viewed as attempts to reconstruct a leftmost or rightmost derivation process of the input string.

Page 298: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

298

The concept of zero is still not well accepted

Roughly 1500 years after the introduction of zero as a number by mathematicians in India, the concept is still not well accepted in computer science:

• Many programming languages do not support records with zero fields
• Many programming languages do not support arrays with zero elements
• Many programming languages do not support variable definitions with zero variables
• In some programming languages the syntax for calling a routine with zero parameters differs from that for a routine with one or more parameters
• XML provides a special syntax for empty elements
• Many compilers refuse to compile a module that defines zero names
• No parser generator can produce a parser for the empty language (the language with zero strings)

Page 299: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

299

Empty language vs. a language that consists of the empty string

• Empty language: {}

• Language with only the empty string: {ε}
  That language is easily generated by this grammar: S → ε

• What would the grammar for the empty language look like?

Page 300: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

300

Grammars that produce the empty language

• For a grammar to produce nothing, the production process cannot be allowed to terminate.

• Here’s one such grammar: S → S

That grammar is ugly for two reasons:
  – The generation process just loops and no information about the emptiness of the language is obtained
  – The use of the symbol S is arbitrary

Page 301: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

301

Force the production process to get stuck

• Another approach to force the production process to get stuck is by not having any production rules in the grammar.

• Recall that grammars are formally defined as G = (VN, VT, S, F), where F is the set of production rules.

• This approach produces G = ({S}, {}, S, {}).

• That is not very satisfactory either since:
  – Now we have a non-terminal without a defining rule
  – The symbol S is still arbitrary

Page 302: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

302

Don’t allow the production process to get started

• A better way is to never allow the production process to get started: have no start symbol.

• This can be accommodated by allowing a set of start symbols in the definition of a grammar rather than a single start symbol.

• There are good reasons for having a set of start symbols: each global element declaration in an XML Schema is a potential start symbol.

• If we extend the definition of a CF grammar to use a set of start symbols, the grammar for the empty language obtains the elegant and satisfactory form:

({}, {}, {}, {})

Page 303: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

303

Rules with empty left-hand side

• It might be useful to have grammar rules in which the left-hand side is empty:

ε → djakdlsaiewp

• Terminal productions of the right-hand sides of such rules may appear anywhere in the input, thus modeling noise and other everyday but extraneous events.

Page 304: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

304

Our preoccupation with empty is not frivolous

Our preoccupation with empty strings, sets, languages, etc. is not frivolous, since it is well-known that the ease with which a system handles empty cases is a measure of its cleanliness and robustness.

Page 305: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

305

CF grammars are limited

• Many things can be expressed using CF grammars.

• However, CF grammars have serious limitations.

Page 306: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

306

Lineage of a symbol

Here is a CF grammar for the t, d & h language (t = Tom, d = Dick, h = Harry, S = Start, L = List, N = Name):

1. S → L & N
2. S → N
3. L → N , L
4. L → N
5. N → t
6. N → d
7. N → h

Here is the production tree for a derivation of the sentence d, h & h:

[Production tree: S (rule 1) → L & N; L (rule 3) → N , L; the first N (rule 6) → d; the inner L (rule 4) → N; that N (rule 7) → h; the final N (rule 7) → h.]

When we have obtained a sentence from a CF grammar, we may look at each terminal symbol in it and ask: How did it get here? For example, looking at the production tree, we see that “d” was produced as the 1st member of the right-hand side of rule number 6. The left-hand side of this rule, the parent of our symbol, was produced as the 1st member of rule 3. And so on, until we reach the start symbol. We can, in a sense, trace the lineage of the symbol in this way.

Page 307: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

307

Express lineage as rule/member pairs

1. S → L & N
2. S → N
3. L → N , L
4. L → N
5. N → t
6. N → d
7. N → h

[The same production tree, with each node annotated with the (rule, member) pair that produced it.]

For example, the lineage of the symbol d is the set of rule/member pairs { (6,1), (3,1), (1,1) }: d is member 1 of rule 6, whose left-hand side N is member 1 of rule 3, whose left-hand side L is member 1 of rule 1.

Page 308: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

308

Original symbol, original sentence

• If all rule/member pairs in the lineage of a symbol are different, we call the symbol original.
  – Example: the lineage of the first h is { (7,1), (4,1), (3,3), (1,1) }. Since all rule/member pairs are different, this h is original.

• If all the symbols in a sentence are original, we call the sentence original.

Page 309: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

309

Same symbol, different lineage

1. S → L & N
2. S → N
3. L → N , L
4. L → N
5. N → t
6. N → d
7. N → h

The first h has rule/member pairs { (7,1), (4,1), (3,3), (1,1) }; the second h has rule/member pairs { (7,1), (1,3) }.

If a symbol occurs twice in an original sentence, both its lineages must be different: if they were the same, they would describe the same symbol in the same place.


Page 310: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

310

Any CF grammar produces a finite set of original sentences

The number of rules in a grammar is finite, and the number of members in each rule's right-hand side is finite.

So the number of unique (rule, member) pairs is finite. Therefore the number of original symbols is finite and the number of original sentences is finite.

We arrive at the surprising conclusion that any CF grammar produces a finite-size kernel of original sentences and (probably) an infinite number of unoriginal sentences.

Page 311: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

311

Unoriginal sentences

• What do "unoriginal" sentences look like?

• By definition, an unoriginal sentence has one or more symbols that are unoriginal.

• A symbol is unoriginal if its lineage contains two or more (rule, member) pairs that are the same.

• Two (rule, member) pairs being the same means that the same grammar rule is repeated in the lineage.

Page 312: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

312

Original sentence

1. S → b A b
2. A → a A a
3. A → a

[Production tree for baaab: S (rule 1) → b A b; A (rule 2) → a A a; the inner A (rule 3) → a.]

Rule/member pairs of the symbols: the first b has { (1,1) }; the last b has { (1,3) }; the two outer a's have { (2,1), (1,2) } and { (2,3), (1,2) }; the middle a has { (3,1), (2,2), (1,2) }.

Each symbol is original, so baaab is an original sentence.

Page 313: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

313

Unoriginal sentence

1. S → b A b
2. A → a A a
3. A → a

[Production tree for baaaaab: S (rule 1) → b A b; A (rule 2) → a A a; the inner A (rule 2 again) → a A a; the innermost A (rule 3) → a.]

The middle a has rule/member pairs { (3,1), (2,2), (2,2), (1,2) }. It has duplicate rule/member pairs, so this symbol is not original, and therefore baaaaab is not an original sentence.

Page 314: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

Repeated rule

314

Rule 2 is repeated

[The same production tree for baaaaab, with the two applications of rule 2 highlighted.]

Page 315: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

315

Partition the sentence

• Let's partition the sentence into parts.
• Let A be the non-terminal of the repeated rule (in this case rule 2):
  – w : the part produced by the A that is furthest down the tree
  – vwx : the part produced by the A that is furthest up the tree
  – uvwxy : the entire unoriginal sentence

[In the tree for baaaaab this gives u = b, v = a, w = aaa, x = a, y = b.]

Page 316: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

316

Pump up the sentence

We can get another unoriginal sentence by replacing the smaller A by the larger A.

[In the tree: replace the subtree of the lower (smaller) A by the subtree of the upper (larger) A, i.e., replace w by vwx.]

Page 317: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

317

Another unoriginal sentence

b a a a a a b        (u v w x y:  u = b, v = a, w = aaa, x = a, y = b)

replace w with vwx:

b a a a a a a a b    (u v v w x x y, i.e., uv2wx2y)

Page 318: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

318

and another unoriginal sentence

b a a a a a a a b        (uv2wx2y)

replace w with vwx:

b a a a a a a a a a b    (uv3wx3y)

Page 319: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

319

Family of nested sentences

• We can, in this way, construct a complete family of sentences uvnwxny for all n≥0.

• This form shows w nested in a number of v and x brackets, in an indifferent context of u and y.

[The production tree for baaaaab again, with the segments u, v, w, x, y marked under the leaves.]

Page 320: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

320

Original sentences become exhausted

When we examine longer and longer sentences in a CF language, the original sentences become exhausted and we meet only families of closely related sentences telescoping off into infinity.

Page 321: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

321

uvwxy theorem

• uvwxy theorem: any sentence generated by a CF grammar that is longer than the longest original sentence from that grammar can be cut into five pieces u, v, w, x, and y, in such a way that uvnwxny are sentences from that grammar for all n≥0.

• The uvwxy theorem is also called the pumping lemma for context-free languages.

Page 322: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

322

Language isn’t CF if long sentences don’t decay into nested sentences

• If a language keeps on providing longer and longer sentences without reducing to families of nested sentences, there cannot be a CF grammar for it.

• We have already encountered the context-sensitive language anbncn, and it is easy to see that it does not decay into such nested sentences as sentences get longer and longer. Consequently, there is no CF grammar for it.
  – A general technique for showing that a language is not context-free may be found in this article: Commun. ACM, April 1993, "Using the context-free pumping lemma", by David Billington.

Page 323: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

323

Increase the length of original sentences

• The longest original sentence is a property of the grammar, not the language.

• By making a more complicated grammar for a language we can increase the set of original sentences and push away the border beyond which we are forced to start nesting (repeating).

• If we make the grammar infinitely complicated, we can push the border to infinity and obtain a phrase structure language from it. How we can make a CF grammar infinitely complicated is described in the section on two-level grammars, 15.2.1.

Page 324: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

324

Regular grammars are limited

A simpler form of the uvwxy theorem applies to regular (Type 3) languages.

Page 325: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

325

Repeated non-terminals for long sentences

• We have seen that the sentential forms occurring in the production process for a regular grammar all contain only one non-terminal, which occurs at the end.

• During the production of a very long sentence, one or more non-terminals must occur two or more times, since there are only a finite number of non-terminals.

Page 326: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

326

Example Regular grammar

S → sP | sA
P → pQ
Q → qA
A → aR | aT
R → rS
T → tU
U → u

S → sP → spQ → spqA → spqaR → spqarS → spqarsA → spqarsaT → spqarsatU → spqarsatu

Clearly the sequence from the first A to this A can be repeated over and over

Page 327: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

327

uvnw

S → sP → spQ → spqA → spqaR → spqarS → spqarsA → spqarsaT → spqarsatU → spqarsatu

In spqarsatu: u = spq, v = ars, w = atu.

u is the part leading up to the first A, v is the part between the first A and the second A (it can be repeated), and w is the part after the last A to terminate the production process.

Page 328: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

328

uvw theorem

• uvw theorem: any sufficiently long string from a regular language can be cut into three pieces u, v, and w, so that uvnw are strings in the language for all n≥0.

• The uvw theorem is also called the pumping lemma for regular languages.
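And the corresponding sketch for the regular case (Python; the function is mine), using the split from the s/p/q example above, u = spq, v = ars, w = atu:

def pump_regular(u, v, w, n):
    """Build the pumped string u v^n w."""
    return u + v * n + w

for n in range(3):
    print(pump_regular("spq", "ars", "atu", n))
# spqatu, spqarsatu, spqarsarsatu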

Page 329: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

329

CF grammars as transition graphs

• A transition graph is a directed graph in which the arrows are labeled with zero or more symbols from the grammar.
• As you follow the arrows in the graph you produce one of the associated symbols, if there is one, and nothing otherwise.
• The nodes, often unlabeled, are resting points between producing the symbols.
• If there is more than one outgoing arrow from a node you can choose any to follow.
• Here is the transition graph for the tom, dick and harry language:

[Transition graph: from the start node, an arrow labeled tom, dick, or harry either leads directly to the end node, or loops back via an arrow labeled "," for another name, until an arrow labeled "and" is followed by a final tom, dick, or harry.]

Page 330: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

330

Produce the same strings

CF Grammar:
1. Sentence → Name | List and Name
2. Name → tom | dick | harry
3. List → Name , List | Name

[Transition Graph: the graph from the previous slide, which produces the same strings as this grammar.]

Page 331: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

Turn a grammar into a set of transition graphs

It is easy to turn a CF grammar into a set of transition graphs, one for each non-terminal, as shown below.

331

Sentence → Name | List & NameName → tom | dick | harryList → Name | Name , List

Sentence

Name

List & Name

tomdickharryName

List

Name

Name , List

Page 332: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

332

Recursive transition network

• The transition graphs on the previous slide have non-terminal labels above the arrows.

• Upon encountering an arrow that points to node n2 and labeled with non-terminal N: push n2 onto a stack, continue the walk at the entrance to the transition graph for N. When leaving the transition graph for N, pop n2 from the stack and continue at node n2.

• This is the recursive transition network interpretation of context-free grammars: the set of graphs is the transition network, and the stacking mechanism provides the recursion.
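A minimal sketch of this stacking mechanism (Python; the graph encoding and names are my own, not the book's): each non-terminal maps to its alternatives, and an arrow labeled with a non-terminal triggers a recursive walk, so the call stack plays the role of the RTN stack.

import random

# Hypothetical encoding: one "transition graph" per non-terminal,
# given as a list of alternatives (paths through the graph).
GRAPHS = {
    "Sentence": [["Name"], ["List", "and", "Name"]],
    "Name":     [["tom"], ["dick"], ["harry"]],
    "List":     [["Name"], ["Name", ",", "List"]],
}

def walk(symbol):
    """Walk the transition network starting at 'symbol'.  Entering an
    arrow labeled with a non-terminal recurses into that graph; the
    (implicit) call stack remembers where to continue afterwards."""
    if symbol not in GRAPHS:          # terminal: produce it
        return [symbol]
    path = random.choice(GRAPHS[symbol])
    out = []
    for sym in path:
        out.extend(walk(sym))
    return out

print(" ".join(walk("Sentence")))     # e.g. "tom , dick and harry"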

Page 333: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

333

Regular grammars as transition graphs

The rules of a regular grammar can also be turned into transition graphs:

Sentence → [tdh] | List
List → [tdh] ListTail
ListTail → & [tdh] | , List

[Three transition graphs, one each for Sentence, List, and ListTail; a non-terminal label occurs only on the arrow that leaves a graph.]

There is a non-terminal only when leaving a graph. No need for stacking: interpret an arrow marked with a non-terminal N as a jump to the transition graph for N. So a regular grammar corresponds to a non-recursive transition network.


Page 334: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

334

Useless rules, successful production process

• Grammars can contain useless rules: rules that cannot play a role in any successful production process.

• A production process is successful when it results in a terminal string.

• Production attempts can be unsuccessful by getting stuck (no further substitution possible) or by entering a situation in which no substitution sequence will ever remove all non-terminals.

• Example of a Type 0 grammar that can get stuck:

1. S → A B
2. S → B A
3. S → C
4. A B → x
5. C → C C

If we start with the first rule for S, all goes well and we produce the terminal string x. If we start with the second rule for S, we get stuck. It is a useless rule. If we start with the third rule for S, we get ourselves into an infinite loop, producing more and more Cs. Rules 2, 3, and 5 can never occur in a successful production process; they are useless rules and can be removed from the grammar without affecting the language produced.

Page 335: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

335

Remove useless rules

• Useless rules are not a fundamental problem: they do not obstruct the normal production process.

• Still, they are dead wood in the grammar and one would like to remove them.

• Also, when they occur in a grammar specified by a programmer they probably point at some error and one would like to detect them and give warning or error messages.

Page 336: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

336

Uselessness of a rule is undecidable for Type 0 and Type 1

It can be shown that in general it is undecidable whether a rule in a Type 0 or Type 1 grammar is useless: there cannot be an algorithm that does it correctly in all cases.

[Diagram: a box that takes a Type 0 or Type 1 grammar and a rule i and answers yes/no to "Is the rule useless?". Impossible to build this.]

Page 337: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

337

Uselessness of a rule is decidable for Type 2

The problem of deciding whether a rule in a CF grammar is useless is easily solved.

[Diagram: the same box, now for a Type 2 grammar. Easy to build this.]

Page 338: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

338

Key Concept

• It is important to know what type of grammar you are dealing with (Type 0, 1, 2, or 3). Why?

• Because if you know that it is Type 0 or 1, then you can take advantage of a result from the field of Formal Languages and not attempt to build a program to decide if a rule is useless. If you know that it is Type 2, then you can easily build the program.

Page 339: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

339

3 causes for useless rules in CF grammars

• A rule in a CF grammar can be useless through three causes:1. It may contain an undefined non-terminal.2. It may not be reachable from the start symbol.3. It may fail to produce anything.

• These are useless rules.

Page 340: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

340

Useless Rules

[Diagram: useless rules stem from undefined non-terminals, unreachable non-terminals, and non-productive non-terminals.]

Page 341: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

341

Undefined non-terminals

• The right-hand side of a rule may contain a non-terminal for which no production rule is given.

• Such a rule can be removed. • However, that may result in another non-

terminal becoming undefined.• Example: If the A in this rule B → . . . A . . .

is undefined, remove the rule. But now B may be undefined, so remove rules with B on the right-hand side, etc.

Page 342: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

342

Unreachable non-terminals

If a non-terminal cannot be reached from the start symbol, its defining rules will never be used, and it cannot contribute to the production of any sentence.

Page 343: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

343

Non-productive rules

• A rule that gets stuck in an infinite loop cannot contribute anything to the sentences of the language of the grammar, since once the rule is entered, there is no way to get rid of it: the rule has a non-productive non-terminal.

• Example: the rule X → aX is non-productive and any rule which has X in its right-hand side is non-productive.

• In an extreme case all non-terminals in a grammar are non-productive. This happens when all right-hand sides in the grammar contain at least one non-terminal. Then there is no way to get rid of the non-terminals, and the grammar itself is non-productive.

Page 344: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

344

Loops

• Rules of the form A → A are called loops.

• Loops can be indirect: A → B, B → C, C → A, so that A →* A.

• Loops can be hidden: A → PAQ, P → ε, Q → ε, so that again A →* A.

Page 345: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

345

Loops can legitimately occur

• A loop can legitimately occur in the production of a sentence, provided there is also a production that enables breaking out of the loop.

• Example: below, the first rule is a loop, but the second rule enables breaking out of the loop: A → aA A → a

Page 346: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

346

Proper grammar

A grammar without useless non-terminals and loops is called a proper grammar.

Page 347: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

347

Cleaning up a CF grammar

• Normally, grammars supplied by people do not contain undefined, unreachable, or non-productive non-terminals.

• If they do, it is almost certainly a mistake and we would like to detect and report them.

• Such anomalies can, however, occur in generated grammars or be introduced by some grammar transformations, in which case we wish to detect them to “clean up” the grammar.

Page 348: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

348

Algorithm to detect and remove useless non-terminals and rules

• The algorithm to detect and remove useless non-terminals and rules from a context-free grammar consists of two steps:
  1. Remove non-productive rules
  2. Remove unreachable non-terminals

• It is not necessary to remove rules with undefined non-terminals since the first step does this automatically.

Page 349: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

349

Let’s clean up this CF grammar

S → A B | D E
A → a
B → b C
C → c
D → d F
E → e
F → f D

This grammar looks innocent: all its non-terminals are defined and it does not exhibit any suspicious constructions.

Page 350: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

350

Step 1: Remove non-productive rules

• The following slides describe how to remove non-productive rules.

• Find the non-productive rules by finding the productive rules. After finding all productive rules, the other, remaining rules are the non-productive rules.

Page 351: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

351

Algorithm to find productive rules

• A rule is productive if its right-hand side consists of symbols all of which are productive.

• Productive symbols:
  – Terminal symbols are productive since they produce terminals.
  – Empty is productive since it produces the empty string.
  – A non-terminal is productive if there is a productive rule for it.

Page 352: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

352

Initial knowledge

Rule        Productive?
S → A B
S → D E
A → a       Productive
B → b C
C → c       Productive
D → d F
E → e       Productive
F → f D

Go through the grammar and for each rule for which we know that all its right-hand side symbols are productive, mark the rule and the non-terminal it defines as Productive.

Page 353: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

353

Build on top of our knowledge

Rule        Productive?
S → A B
S → D E
A → a       Productive
B → b C     Productive (since b is productive and C is productive)
C → c       Productive
D → d F
E → e       Productive
F → f D

Now we know more. Apply this knowledge in a second round through the grammar.

Page 354: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

354

Round three

Rule        Productive?
S → A B     Productive (since A is productive and B is productive)
S → D E
A → a       Productive
B → b C     Productive (since b is productive and C is productive)
C → c       Productive
D → d F
E → e       Productive
F → f D

Page 355: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

355

Round four

Rule        Productive?
S → A B     Productive (since A is productive and B is productive)
S → D E
A → a       Productive
B → b C     Productive (since b is productive and C is productive)
C → c       Productive
D → d F
E → e       Productive
F → f D

A fourth round yields nothing new.

Page 356: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

356

Recap

Rule        Productive?
S → A B     Productive (since A is productive and B is productive)
S → D E
A → a       Productive
B → b C     Productive (since b is productive and C is productive)
C → c       Productive
D → d F
E → e       Productive
F → f D

We now know that A, B, C, E and the rule S → A B are productive. D, F, and the rule S → D E are non-productive.

Page 357: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

357

Remove non-productive rules

Rule        Productive?
S → A B     Productive (since A is productive and B is productive)
A → a       Productive
B → b C     Productive (since b is productive and C is productive)
C → c       Productive
E → e       Productive

We have pursued all possible avenues for productivity and have not found any possibilities for D, F, and the second rule for S. That means they are non-productive and can be removed from the grammar.

The grammar after removing non-productive rules
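A minimal code sketch of this bottom-up closure (Python; essentially the same closure as the emptiness test shown earlier, now used to filter the rules; representation and names are mine):

def remove_nonproductive(rules):
    """Keep only productive rules.  'rules' is a list of (lhs, rhs)
    pairs with rhs a list of symbols; a symbol is a non-terminal iff
    it occurs as some lhs.  Terminals and empty (rhs == []) count as
    productive by definition."""
    nonterminals = {lhs for lhs, _ in rules}
    productive = set()
    changed = True
    while changed:                        # one "round" per iteration
        changed = False
        for lhs, rhs in rules:
            if lhs not in productive and all(
                    s in productive for s in rhs if s in nonterminals):
                productive.add(lhs)
                changed = True
    return [(lhs, rhs) for lhs, rhs in rules
            if lhs in productive and all(
                s in productive for s in rhs if s in nonterminals)]

rules = [("S", ["A", "B"]), ("S", ["D", "E"]), ("A", ["a"]),
         ("B", ["b", "C"]), ("C", ["c"]), ("D", ["d", "F"]),
         ("E", ["e"]), ("F", ["f", "D"])]
print(remove_nonproductive(rules))
# keeps S → A B, A → a, B → b C, C → c, E → e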

Page 358: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

358

Removing non-productive rules also removes undefined non-terminals

• Earlier we said: It is not necessary to remove rules with undefined non-terminals since the first step [remove non-productive rules] does this automatically.

• Consider a rule R that contains an undefined non-terminal, U. The algorithm shown on the previous slides will never mark R as "Productive", and hence R will be removed. If that leaves R's left-hand side non-terminal without a productive rule, rules that use that non-terminal in their right-hand side will not be marked "Productive" either, and will be removed. And so forth.

• So an undefined non-terminal is just a special case of a non-productive non-terminal: it is non-productive because there is no rule for it.

Page 359: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

359

Bottom-up process

Removing the non-productive rules is a bottom-up process: only at the bottom level, where the terminal symbols live, can we know what is productive.

Page 360: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

360

Knowledge-improving algorithm

• In the previous slides we increased our knowledge with each round.

• The previous slides give our first example of a closure algorithm.

Page 361: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

361

Closure algorithms

Closure algorithms are characterized by two components:

1. Initialization: an assessment of what we know initially. For our problem we knew the grammar rules, and that terminals and empty are productive.

2. Inference rule: a rule telling how knowledge from several places is to be combined. The inference rule for our problem was: if all the right-hand side symbols of a rule are productive, then the rule's left-hand side non-terminal is productive.

The inference rule is repeated until nothing changes any more.

Page 362: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

362

Step 2: Remove unreachable non-terminals

The second step in removing useless non-terminals and rules from a context-free grammar is to remove unreachable non-terminals.

Page 363: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

363

Reachable non-terminals

• A non-terminal is called reachable or accessible if there exists at least one sentential form, derivable from the start symbol, in which it occurs.

• Example: a non-terminal A is reachable if S →* αAβ for some α and β.

• Find the unreachable non-terminals by finding the reachable ones.

Page 364: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

364

Closure algorithm for finding reachable non-terminals

• Initialization: the start symbol is marked “reachable”.

• Inference rule: for each rule in the grammar of the form A → α with A marked “reachable”, all non-terminals in α are marked “reachable”.

• Continue applying the inference rule until nothing changes any more.

• The remaining unmarked non-terminals are not reachable and their rules can be removed.

Page 365: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

365

Initialization

Rule      Reachable?
S → A B   S is reachable
A → a
B → b C
C → c
E → e

Page 366: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

366

Round one

Rule      Reachable?
S → A B   S is reachable
A → a     A is reachable because it is reachable from S
B → b C   B is reachable because it is reachable from S
C → c
E → e

Page 367: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

367

Round two

Rule      Reachable?
S → A B   S is reachable
A → a     A is reachable because it is reachable from S
B → b C   B is reachable because it is reachable from S
C → c     C is reachable because it is reachable from B
E → e

Page 368: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

368

Round three

Rule      Reachable?
S → A B   S is reachable
A → a     A is reachable because it is reachable from S
B → b C   B is reachable because it is reachable from S
C → c     C is reachable because it is reachable from B
E → e

The third round produces no change. So the rule E → e is unreachable and can be removed.
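The same closure in code form (a sketch, with the same conventions as the earlier snippet):

def remove_unreachable(rules, start):
    """Keep only rules whose left-hand side is reachable from 'start'."""
    nonterminals = {lhs for lhs, _ in rules}
    reachable = {start}                   # initialization
    changed = True
    while changed:                        # inference rule, repeated
        changed = False
        for lhs, rhs in rules:
            if lhs in reachable:
                for s in rhs:
                    if s in nonterminals and s not in reachable:
                        reachable.add(s)
                        changed = True
    return [(lhs, rhs) for lhs, rhs in rules if lhs in reachable]

rules = [("S", ["A", "B"]), ("A", ["a"]), ("B", ["b", "C"]),
         ("C", ["c"]), ("E", ["e"])]
print(remove_unreachable(rules, "S"))     # drops E → e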

Page 369: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

369

Cleaned grammar

Initial grammar:
S → A B | D E
A → a
B → b C
C → c
D → d F
E → e
F → f D

Grammar after removing non-productive rules:
S → A B
A → a
B → b C
C → c
E → e

Grammar after removing unreachable non-terminals:
S → A B
A → a
B → b C
C → c

Page 370: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

370

Top-down process

Removing unreachable non-terminals is a top-down process: only at the top level, where the start symbol lives, can we know what is reachable.

Page 371: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

371

Order of cleaning matters

• The cleaning process must occur in this order:
  – First, remove non-productive rules
  – Second, remove unreachable non-terminals

• If the order is switched it may produce a grammar which again contains unreachable non-terminals. So you will have to redo the algorithm for removing unreachable non-terminals.

Page 372: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

372

Need to run the algorithms only once

• Suppose we remove non-productive rules and then unreachable rules.

• Consider a non-terminal N in a reachable rule: X → αNβ

• By removing unreachable rules could N become undefined? Will we have to run the algorithm for removing non-productive rules again?

• No. In the process of removing non-productive rules we determined that all symbols on the right-hand side of X are productive. That means that N is productive (N is defined).

Page 373: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

373

Cleaning may remove all rules

Cleaning a grammar may remove all rules, including those for the start symbol, in which case the grammar describes the empty language.

Page 374: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

374

Set properties of context-free and regular languages

Since languages are sets, it is natural to ask if the standard operations on sets can be performed on them, and if so, how. Set operations:

– union
– intersection
– negation (complement)

Page 375: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

375

Set operations

• The union of two sets A and B contains the elements that are in either set; it is written A ∪ B.

• The intersection contains the elements that are in both sets; it is written A ∩ B.

• The negation (complement) of a set A contains those elements that are in Σ* but not in A; it is written ¬A.

Page 376: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

376

Set operations on grammars

In the context of formal languages the sets are defined through grammars, so actually we want to do the operations on the grammars rather than on the languages.

Page 377: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

377

Union of two grammars

• Constructing the grammar for the union of two languages is trivial for context-free and regular languages (and in fact for all Chomsky types): just construct a new start symbol Su with the rule Su → S1 | S2, where S1 and S2 are the start symbols of the two grammars that describe the two languages.

• Make sure the non-terminal names of the two grammars are different (rename them if necessary).

A grammar describes a language.
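A sketch of this construction in code (Python; the renaming scheme and names are mine):

def union_grammar(rules1, start1, rules2, start2):
    """Grammar for L(G1) ∪ L(G2): tag the non-terminals of each
    grammar so they cannot clash, then add S_u → S1 | S2."""
    def tag(rules, suffix):
        nts = {lhs for lhs, _ in rules}
        return [(lhs + suffix,
                 [s + suffix if s in nts else s for s in rhs])
                for lhs, rhs in rules]
    rules = tag(rules1, "_1") + tag(rules2, "_2")
    rules += [("S_u", [start1 + "_1"]), ("S_u", [start2 + "_2"])]
    return rules, "S_u"

g1 = [("S", ["a", "S"]), ("S", ["a"])]    # one or more a's
g2 = [("S", ["b"])]                       # just {b}
print(union_grammar(g1, "S", g2, "S"))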

Page 378: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

378

context-free ∪ context-free = context-free

The union of two context-free languages is a context-free language.

Page 379: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

379

Intersection of two CF grammars might produce a non-CF grammar

• The intersection of two context-free languages might not be context-free.

• Consider the two CF languages:
  L1 = anbncm (same number of as and bs, arbitrary number of cs)
  L2 = ambncn (arbitrary number of as, same number of bs and cs)
  described by the CF grammars:

L1 → A P
A → a A b | ε
P → c P | ε

and

L2 → Q C
Q → a Q | ε
C → b C c | ε

• A string that occurs in both languages must have the same number of as and bs per L1 and the same number of bs and cs per L2. So the intersection language consists of strings of the form anbncn and we know that language is not context-free.

Page 380: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

380

context-free ∩ context-free = …

• The intersection of two context-free languages might be a context-sensitive language.
  – Example: the intersection of anbncm and ambncn is anbncn, and the latter is context-sensitive.

• The intersection of two context-free languages might be a context-free language.
  – Example: the intersection of a context-free language with itself is context-free.

Page 381: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

381

Set theory vs. language theory

• When languages are treated as sets it is easy to generate anbncn

anbncm ∩ ambncn = anbncn

• Conversely, when languages are treated as grammars it is quite difficult to generate anbncn

1. S → abc | aSQ
2. bQc → bbcc
3. cQ → Qc

Page 382: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

382

Easy to intersect two CF languages

Just enumerate them both (use the queue algorithm) and output words that appear on both lists. The queue algorithm outputs the strings in order of increasing length. Suppose grammar 1 generates string abc. We can determine if grammar 2 generates abc by running the queue algorithm on grammar 2 until (a) it outputs abc, or (b) it outputs a string with length greater than 3 (the length of abc).

Page 383: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

383

Easy to determine membership of the intersection of two XML languages

• Problem: is an XML instance document a member of the intersection of two XML Schemas?

• Validate the XML instance document twice, once for each XML Schema. The XML instance is a member of the intersection if and only if it conforms to both XML Schemas.

Page 384: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

384

The intersection of CF languages has weird properties

• The intersection of two CF languages might be a Type 1 (context-sensitive) language.

• The intersection of three CF languages is more powerful than the intersection of two of them.

• Remarkable phenomenon: any Type 1 language, and even any Type 0 language, can be constructed by intersecting just two CF languages, provided we are allowed to erase all symbols in the resulting strings that belong to a set of erasable symbols.

Page 385: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

385

erasure(L3 ∩ L4) = a CS language

• The CS language we will use to demonstrate this remarkable phenomenon is the set of all strings that consist of two identical parts: ww, where w is any string over the given alphabet.

• The two languages to be intersected are defined by:

L3 → A P
A → a A x | b A y | ε
P → a P | b P | ε

and

L4 → Q C
Q → a Q | b Q | ε
C → x C a | y C b | ε

x and y are the erasable symbols.

Page 386: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

386

Intersecting two CF grammars with erasable symbols

L3 produces strings consisting of three parts: a sequence A1 of as and bs, followed by its "dark mirror" image M1, in which a corresponds to x and b to y, followed by an arbitrary sequence G1 of as and bs. Example string: abaxyxaba.

L4 produces strings consisting of an arbitrary sequence G2 of as and bs, followed by a "dark mirror" image M2, followed by a sequence A2 of as and bs. Example string: abaxyxaba.

Erasing the erasable symbols x, y from abaxyxaba leaves abaaba.

Page 387: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

387

Dark mirror

In L3's example string abaxyxaba, the middle part xyx is the dark mirror of the leading part aba; in L4's example string abaxyxaba, the middle part xyx is the dark mirror of the trailing part aba.

Page 388: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

388

Each grammar has 3 parts

L3: A1 M1 G1 (for abaxyxaba: A1 = aba, M1 = xyx, G1 = aba)
L4: G2 M2 A2 (for abaxyxaba: G2 = aba, M2 = xyx, A2 = aba)

Page 389: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

389

Corresponding parts must match for an intersection

The intersection forces A1 = G2, M1 = M2, and G1 = A2.

Page 390: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

390

A1 equals A2

The intersection forces A1 = G2, M1 = M2, and G1 = A2. M1 is the dark mirror of A1, so M2 is the dark mirror of A1, so A2 equals A1.

Page 391: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

391

Erase the dark mirrors

The intersection forces A1 = G2, M1 = M2, and G1 = A2. M1 is the dark mirror of A1, so M2 is the dark mirror of A1, so A2 equals A1. After erasing the mirrors we have abaaba, which is (aba)2.

Page 392: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

392

Create Type 0 from the intersection of context-free languages

Using a massive application of the mirror-mirror trick, one can relatively easily prove that any Type 0 language can be constructed as the intersection of two CF languages, plus a set of erasable symbols.

The construction is described extremely formally in Harrison, Michael A., Introduction to Formal Language Theory, Addison-Wesley, 1978, on pages 307-311.

But like all text in this book the explanation is very dense and is severely complicated by the fact that the author inside the explanation wants to prove that the two CF languages you need are deterministic.

Roughly the same explanation can be found in @Article{ New.Ginsburg.1967, author = {Ginsburg, Seymour and Greibach, Sheila and Harrison, Michael A.}, title = {One-Way Stack Automata}, journal = {J. ACM}, volume = {14}, number = {2}, year = {April 1967}, pages = {389-418}, annote = {}} on pages 402-405.

Page 393: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

393

**** Provisional Explanation of Creating a Type 0 Language by Intersecting Two CF Languages (Dick Grune)**** The basic trick is for the string in the intersection to represent the complete production process of a terminal production of a Type 0 grammar G. The steps are encoded as follows: … ( X1l alpha_1 X1r → -X1r -beta_1 -X1l ) ( X2l alpha_2 X2r → -X2r -beta_2 -X2l )... (1) where the →, ( and ) are unique markers; Xnl(eft) and Xnr(ight) are arbitrary strings; -S means the reverse of S; and alpha_1 → beta_1, alpha_2 → beta_2, etc. are rules in G. In fact this is what L1 generates, and that is easy to do: it's just a repetition of ( Xnl alpha_n Xnr → -Xnr -beta_n -Xnl). It is easy to produce S x -S by CF grammar, so apply twice, once for the inner Xnr, once for the outer Xnl, and you're done. But the above steps are a good derivation only when the output of one step is the input of the next, so -(-X1r -beta_1 -X1l) must be equal X2l alpha_2 X2r. This is where L2 comes in. It generates … → Y1 ) ( -Y1 → Y2 ) ( -Y2 → … (2) which again can be produced easily with a CF grammar, since its structure is again S x -S. Intersecting the two enforces the reverse of -X1r -beta_1 -X1l (which itself was the reverse of the result of the step alpha_1 → beta_1) to be equal to X2l alpha_2 X2r, the input of the next step. This makes the intersection of string (1) and string (2) a representation of a valid Type 0 production process. There are two more details to care for. One is the start-up, which is next to trivial. The second is the close-down and the harvesting of the result. This is where the homomorphism (the erasing of the erasable symbols) comes in. Before we start the whole construction we replace all terminals in G by non-terminals with similar names, and declare all symbols in G erasable. This ensures that when we in the end apply the homomorphism (the erasure act) the whole production process disappears. But of course we want to keep the final product which consists exclusively of those non-terminals that represent terminals. We harvest them by letting the productions of L1 and L2 end in the language T ) #-T, where T is any string of the non-terminals created for the original terminals of G, and # replaces each of the non-terminals by its corresponding terminal. Again this is easy to do since its structure is again essentially S x -S. Now when we erase the erasable symbols, everything disappears except the final string of terminals, a production of G. Hurray!

Page 394: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

394

context-free ∩ regular = context-free

• The intersection of a context-free and a regular language is always a context-free language.

• There is a simple algorithm to construct a grammar for that intersection language.

Page 395: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

395

De Morgan’s Law

The intersection of two sets equals the negation of the two sets, unioned, then negated:

Page 396: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

396

The negation of a context-free language might not be context-free

• De Morgan’s law: L1 ∩ L2 = ¬((¬L1) (¬L∪ 2))• Suppose the negation of a CF language produces a CF

language.• L1 and L2 are CF languages. Then ¬L1 is a CF language, as is ¬L2.

We know that the union of two CF languages produces a CF language so (¬L1) (¬L∪ 2) produces a CF language. The negation of it then produces a CF language. So ¬((¬L1) (¬L∪ 2)) is a CF language. But that equals L1 ∩ L2 and we already know that the intersection of two CF languages might not be CF.

• Therefore, negation of a CF language is not guaranteed to produce a CF language.

Page 397: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

397

Set properties of regular (Type 3) languages

regular-language regular-language = regular-language ∪regular-language ∩ regular-language = regular-language ¬regular-language = regular-language

Page 398: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

398

Would there be programming languages? XML?

• It is interesting to speculate what would have happened if formal languages had been based on set theory with all the set operations right from the start, rather than on the Chomsky hierarchy.

• Would context-free languages still have been invented?• CF languages guarantee only set union, not set

intersection or set difference. If you insist on having set intersection (which is very tempting and convenient, see for example the ease with which you can construct anbncn by intersection), you'll never invent CF languages.

Page 399: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

399

Parsing for grammar conformance

• Scenario: you observe a pattern in the strings that you are dealing with. So you create a grammar to describe the pattern. Now you want to check that your grammar correctly describes the pattern. What is required of a parser?

• Parsing only needs to check that the string conforms to the grammar.

Page 400: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

400

Parsing to determine the string’s semantics

• Often we want to go further than simply check that a string conforms to the grammar we have designed for it.

• We want to know the string’s meaning, its semantics.

• The semantics of a string is directly related to the structure of its production tree. If it is not, we have the wrong grammar.

Page 401: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

401

Attaching semantics to a grammar

Attaching semantics to a (context-free) grammar is done in a very simple and effective way:

To each rule in the grammar, a semantic clause is attached, which relates the semantics of the members of the right-hand side of the rule to the semantics of the left-hand side.

Sum → Digit {A0 := A1}

semantic clause

Page 402: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

402

Flow of the semantic info

Semantic info can flow up, down, or both ways:• Up: semantic information flows from the leaves of the

tree upward to the start symbol. The semantics of the members of the right-hand side of each rule is used to define the semantics of the left-hand side.

• Down: semantic information flows downward from the start symbol to the leaves. The semantics of the left-hand side of each rule is used to define the semantics of the members of the right-hand side.

• Both: semantic information flows up and down for a while until a stable situation is reached.

Page 403: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

403

Inherited vs. derived semantic information

• Semantic information flowing down is called inherited: each rule inherits semantics from its parent in the tree.

• Semantic information flowing up is called derived: each rule derives semantics from its children. – Derived information is also called synthesized

information.

Page 404: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

404

Expressing semantics

• There are many ways to express semantics.• We will briefly describe two often-used and

well-studied techniques:1. Attribute grammars2. Transduction grammars

Page 405: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

405

Add semantic info to this grammar

• We will explain attribute grammars and transduction grammars using the language: sums of one-digit numbers.

• The semantics of a sentence in the language is the value of the sum.

• The language is generated by this grammar:1. Sum → Digit2. Sum → Sum + Digit3. Digit → 0 | 1 | … | 9

Here is one of the strings in the language: 3 + 5 + 1The semantics of that string is: 9

Page 406: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

406

Attribute grammars

• The semantic clause in an attribute grammar assume that each node in the production tree has room for one or more attributes, which are just values (numbers, strings, or anything else) sitting in nodes in production trees.

• For simplicity we restrict ourselves to attribute grammars with only one attribute per node.

• The semantic clause of a rule in such a grammar contains some formulas which compute the value of some of the non-terminals in that rule from those of other non-terminals in that same rule.

• These semantic actions connect only values that are local to the rule. The overall semantics is composed as the result of all the local computations.– Local actions produce global results – cool!

• If the semantic action of a rule R computes the value of the left-hand side of R, that value is derived (synthesized). If it computes a value of one of the non-terminals in the right-hand side of R, say A, then that value is inherited by A.

Page 407: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

407

Naming the attributes

Sum → Sum + Digit

Each non-terminal has an associated attribute.The attribute for the symbol on the left-hand side, Sum, is named A0

Each symbol, including terminals, are indexed. So the attributefor the right-side Sum is A1, the attribute for Digit is A3.

A0 A1 A3

Page 408: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

408

Attribute grammar for the Sum grammar

1. Sum → Digit {A0 := A1}2. Sum → Sum + Digit {A0 := A1 + A3}3. Digit → 0 {A0 := 0}4. Digit → 1 {A0 := 1}5. Digit → 2 {A0 := 2}6. Digit → 3 {A0 := 3}7. Digit → 4 {A0 := 4}8. Digit → 5 {A0 := 5}9. Digit → 6 {A0 := 6}10. Digit → 7 {A0 := 7}11. Digit → 8 {A0 := 8}12. Digit → 9 {A0 := 9}

Page 409: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

409

Initial production tree

+

+ 1

53A0 = 3 A0 = 5

A0 = 1

The initial production tree for 3 + 5 + 1 is given above. Initially only the attributes of the leaves are known, but as soon as all attributes in a right-hand side of a production rule are known, we can use its semantic clause to compute the attribute of its left-hand side. This way the attribute values (semantics) percolate up the tree, finally reaching the start symbol and providing us with the semantics of the whole string.

Page 410: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

410

Attribute values percolate up the tree

+

+ 1

53A0 = 3 A0 = 5

A0 = 1

+

+ 1

53A0 = 3 A0 = 5

A0 = 1A0 = A1 + A3

= 3 + 5 = 8

+

+ 1

53A0 = 3 A0 = 5

A0 = 1A0 = A1 + A3

= 3 + 5 = 8

A0 = A1 + A3

= 8 + 1 = 9

Attribute grammars are a very powerful method for handling the semantics of a language.

This is another example of a closure algorithm!

Page 411: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

411

XML attributes

• An XML Schema creates a grammar.• XML has “attributes”.• Are XML attributes in any way related to

attribute grammars?• Did the creators of XML create XML attributes

simply for tucking away a name-value pair in a tidy fashion? Or, did they have in mind a deeper usage for XML attributes: use them to define the semantics of the XML?

Page 412: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

412

Transduction grammars

• Transduction grammars define the semantics of a string (the “input string”) as another string, the “output string” or “translation”.

• The semantic clause in a production rule is the string that should be output for the node.

• The string is output for a node after the strings for all its children.

Page 413: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

413

Transduction grammar for Sum

Here is a transduction grammar which translates a sum of digits into an instruction to compute the value of the sum:

1. Sum → Digit “make it the result”2. Sum → Sum + Digit “add it to the previous

result3. Digit → 0 “take a 0”4. Digit → 1 “take a 1”5. Digit → 2 “take a 2”6. Digit → 3 “take a 3”7. Digit → 4 “take a 4”8. Digit → 5 “take a 5”9. Digit → 6 “take a 6”10. Digit → 7 “take a 7”11. Digit → 8 “take a 8”12. Digit → 9 “take a 9”

Page 414: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

414

Sequence of outputs

Sum

Sum

Sum

Digit

3

+ Digit

5

+ Digit

1take a 3

Sum

Sum

Sum

Digit

3

+ Digit

5

+ Digit

1

make it the result

Sum

Sum

Sum

Digit

3

+ Digit

5

+ Digit

1

take a 5

Sum

Sum

Sum

Digit

3

+ Digit

5

+ Digit

1

Add it to the previous result

Page 415: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

415

Meaning of 3 + 5 + 1

The transduction grammar translates 3 + 5 + 1 into: take a 3

make it the result take a 5 add it to the previous result

take a 1 add it to the previous resultwhich is indeed what 3 + 5 + 1 “means”

Page 416: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

416

Augmented Transition Network (ATN)

• Semantics can be introduced into a recursive transition network by attaching actions to the transitions in the graphs.

• These actions can set variables, construct data structures, etc.

• A thus augmented recursive transition network is known as an Augmented Transition Network (ATN).

Sentence

Name

List & Name

Recursive Transition Network:

Page 417: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

417

Generative Power

• Grammars generate languages.• By applying restrictions to grammar rules we

reduce the generative power of the grammars.

Page 418: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

418

Grammar power = language size?

• One often hears statements like these:Type grammars are more powerful than Type grammars, for A regular (Type 3) grammar is not powerful enough to match parentheses.

• What kind of power is meant?• One might think that it is the power to generate

larger and larger sets, but this is clearly incorrect: the largest possible set of strings, Σ* (the set of all strings over the alphabet) is easily generated by the (unpowerful) Type 3 grammar:

S → [Σ] S | ε where [Σ] is an abbreviation for the symbols of the alphabet

Page 419: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

419

Power means restriction!

• Restricting Σ* requires more powerful grammars.• More powerful grammars can define more

complicated boundaries between correct and incorrect sentences.Example: a Type 1 (context-sensitive) grammar can define sentences with the same number of as, bs, and cs (i.e., anbncn) whereas the best that a Type 2 (context-free) grammar can do is define sentences with the same number of as and bs, with no restrictions on the number of cs (i.e., anbncm).

• Some boundaries are so fine, they cannot be described by any grammar (that is, by any generative process).

Page 420: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

420

Power of a grammar

• A more powerful grammar does not mean that the grammar can generate larger languages (larger sets).

• More powerful means that the grammar can define more precise rules regarding what strings are allowed in the language.

Page 421: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

421

Metaphor for grammar power: outlining a rose

• Imagine drawing a rose. It is approximated by increasingly finer outlines.• In this metaphor, the rose corresponds to the language (imagine the

strings of the language as molecules in the rose); the grammar serves to delineate its silhouette.

• A regular grammar only allows us straight horizontal and vertical line segments to describe the flower. Ruler and T-square suffice, but the result is a course and mechanical looking picture.

• A CF grammar would approximate the outline by straight lines at any angle and by circle segments. The drawing could still be made using the classical tools of compass and ruler. The result is stilted but recognizable.

• A CS grammar would present us with a smooth curve tightly enveloping the flower, but the curve is too smooth: it cannot follow all the sharp turns, and it deviates slightly at complicated points. Still, a very realistic picture results.

• An unrestricted phrase structure grammar can represent the outline perfectly.

• The rose itself cannot be caught in a finite description. Its essence remains forever out of our reach.

Page 422: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

422

Set of Java programs that can be generated by the grammar types

• A regular grammar can generate the set of all “lexically correct” Java programs. A Java program is lexically correct if there are no newlines inside strings, comments are terminated before end-of-file, all numerical constants have the right form, etc.

• A context-free grammar can generate the set of all “syntactically correct” Java programs. These programs conform to the CF grammar in the language manual.

• A context-sensitive grammar can generate the set of all semantically correct Java programs. These are the programs that pass through a Java compiler without drawing error messages.

• An unrestricted phrase structure grammar can generate the set of all Java programs that would terminate in finite time when run with a given input. Such a grammar would, however, be very complicated, since it would incorporate detailed descriptions of the Java library routines and the Java run-time system.

• The set of all Java programs that solve a given problem (for example, play chess) cannot be generated by a grammar (although the description of the set is finite).

• Note that each of the above sets is a subset of the previous set.

Page 423: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

423

Sets generated by the grammar types

Page 424: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

424

Set versus LanguageSets that can be generated by a grammar type

Languages that can be generated by a grammar type

Page 425: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

425

Set of XSLT programs that can be generated by the grammar types

The set of all “lexically correct” XSLT programs can be generated by a regular grammar. An XSLT <xsl:value-of> element is lexically correct if it starts with <xsl:value-of, optionally followed by one or more attribute/value pairs, followed by </xsl:value-of>

1. Value-of → <xsl:value-of Rest 2. Rest → string = “ Expression3. Rest → End-tag4. Expression → string ” End-tag5. End-tag → > </xsl:value-of >

Page 426: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

426

Set of XSLT programs that can be generated by the grammar types

The set of all “syntactically correct” XSLT programs can be generated by a context-free grammar. These programs conform to the CF grammar in the XSLT specification.

1. Value-of → Start-tag End-tag2. Start-tag → <xsl:value-of Select >3. Start-tag → <xsl:value-of Separator >4. Start-tag → <xsl:value-of Disable-output-escaping >5. End-tag → </xsl:value-of >6. Select → select = “ Expression ” 7. Separator → separator = “ AVT ”8. Disable-output-escaping → disable-output-escaping = “ YESNO ”9. YESNO → yes | no

Page 427: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

427

Set of XSLT programs that can be generated by the grammar types

• The set of all semantically correct XSLT programs can be generated by a CS grammar. These are the programs that pass through an XSLT processor without drawing error messages.

• The set of all XSLT programs that would terminate in finite time when run with a given input can be generated by an unrestricted phrase structure grammar. Such a grammar would, however, be very complicated, since it would incorporate detailed descriptions of the XPath function routines and the XSLT run-time system (e.g. optimizations).

• The set of all XSLT programs that solve a given problem (for example, play chess) cannot be generated by a grammar (although the description of the set is finite).

Page 428: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

428

The basis for the importance of context-free grammars

• A Chomsky grammar is a finite mechanism that produces a (usually) infinite set of strings, a “language”.

• Unlike many other set generation mechanisms, this production process assigns a structure to the produced string, which can be utilized to attach semantics to it. – For context-free (Type 2) grammars, this structure is a tree,

which allows the semantics to be composed from the semantics of the branches. This is the basis of the importance of context-free grammars.

Page 429: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

429

Tom, Dick and Harry example

The following slides illustrate each of the grammar types, using the tdh language.

Page 430: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

430

Example of a phrase structure (PS) grammar, Type 0

1. Sentence → Name | List End2. Name → tom | dick | harry3. List → Name | Name , List4. , Name End → and Name

Notice that rule 4 has more stuff on the left-hand side than the right-hand side. That’s what characterizes a PS grammar.Key Point: this grammar is in the form of a phrase-structure grammar but the language (set) it generates can be generated by a mere regular grammar. Here is what characterizes a set that is Type 0: an item not in the set cannot be ruled out in finite time. That is, it may take an infinite amount of time to determine that an item is not in the set. More formally, determining if a given item belongs to a set generated by a Type 0 grammar is undecidable.

Page 431: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

431

PS grammar vs. PS language

• A PS grammar is a grammar that has the proper form: no restrictions on the LHS of each rule, other than it must contain a non-terminal.

• A PS language is a language (set) that can only be generated by a PS grammar, not a CS grammar or CF grammar or FS grammar.

Page 432: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

432

Example of a context-sensitive (CS) grammar , Type 1

1. Sentence → Name | List2. Name → tom | dick | harry3. List → EndName | Name Comma List4. Comma EndName → and EndName5. and EndName → and Name6. Comma → ,

Notice that each rule has at least as much stuff on the right-hand side as on the left-hand side. Further, the RHS is exactly like the LHS, except one non-terminal has been changed. That’s what characterizes a CS grammar.Key Point: this grammar is in the form of a context-sensitive grammar but the language (set) it generates can be generated by a mere context-free grammar. What distinguishes a Type 1 grammar from a Type 0 grammar is that an item not in the set can be determined in finite time.

Page 433: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

433

Example of a context-free (CF) grammar , Type 2

1. Sentence → Name | List and Name2. Name → tom | dick | harry3. List → Name , List | Name

Notice that each rule has exactly one non-terminal on the LHS: the non-terminal is defined independent of context.

Page 434: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

434

Example of a regular/finite-state (FS) grammar , Type 3

1. Sentence → tom | dick | harry | List2. List → tom ListTail | dick ListTail | harry ListTail3. ListTail →, List | and tom | and dick | and harry

This is a right-regular grammar: each rule’s non-terminal is at the right end of the rule.

Page 435: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

435

Example of a finite-choice (FC) grammar , Type 4

Sentence → [tdh] | [tdh] and [tdh] | [tdh] , [tdh] & [tdh]

Note: Type 4 is not part of the Chomsky hierarchy.

Page 436: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

436

Standard example of a Type 1 language: anbncn

1. S → aSBC | aBC2. CB → HB3. HB → HC4. HC → BC5. aB → ab6. bB → bb7. bC → bc8. cC → cc

S (start)aSBC (rule 1)aaBCBC (rule 1)aabCBC (rule 5)aabHBC (rule 2)aabHCC (rule 3)aabBCC (rule 4)aabbCC (rule 6)aabbcC (rule 7)aabbcc (rule 8)

Derivation of a2b2c2CS grammar for anbncn

Page 437: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

437

Formal summary of Type 0, 1, 2, 3 grammars

A generative grammar G = (VN, VT, S, F) is said to be of type i if it satisfies the restrictions described in this list:

i = 0 No restrictions except the LHS must contain a non-terminal

i = 1: Every rewriting rule in F has the form Q1AQ2 → Q1PQ2, with Q1, Q2, and P in (VN V∪ T)*, A V∈ N, and P ≠ ε, except possibly for the rule S → ε, which may occur in F, in which case S does not occur on the right-hand sides of the rules.

i = 2: Every rule in F has the form A → P, where A V∈ N, and P (V∈ N V∪ T)*.

i = 3: Every rule in F has the form A → PB or A → P, where A, B V∈ N, and P V∈ T*.

Page 438: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

438

Type 0-3 languages are infinite

• Every language that is Type 0-3 is infinite.• If a language is finite, we can enumerate the

sentences using a Type 4 grammar.

Page 439: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

439

Why does a parse “reconstruct”?

• Why is parsing about “reconstructing”?• That implies that at one time the input string

was in the form of a tree (or graph) but somehow it lost that form and now we are reconstructing it.

Parsing is the task of reconstructing the derivation tree (or graph) for a given input string.

Page 440: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

440

The brain creates parse trees

According to Chomsky, and I think he is right in this, sentences in a language, natural or artificial, are constructed according to a grammar. While being generated they obtain a structure, the generation tree (or graph in the case of PS grammars). This structure encodes the meaning. When the sentence is spoken or written the terminal symbols (words) alone are transferred to the listener or reader, losing the structure (linearized). But since the meaning is attached to that structure the listener or reader will have to reconstruct the generation tree, now called the parse tree, to retrieve the meaning. That's why we need parsing. Actually I do not think Chomsky is a 100% right. CF grammars are not strong enough, and people don't use PS grammars. I think they use affix or attribute grammars (they are equivalent). But the above paragraph still holds.

Dick Grune

Page 441: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

441

The brain creates parse trees

Sentences, linear sequences of symbols, are really just serializations of parse trees we humans grok natively in hardware. To get an idea across to someone, we have to conjure up the same parse tree in their head using a word stream.

The Definitive ANTLR 4 Reference, p. 11

Page 442: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

442

The computer scientist is undaunted by undecidable problems

The issue addressed here is the formal linguist saying "you can't do this" (and he is correct) and the computer scientist saying "true, but I can handle an increasing number of instances with increasingly complicated algorithms". A good example is the package AmbiDexter (http://homepages.cwi.nl/~storm/publications/ambidexter.pdf), which "solves" an undecidable problem: is a given CF grammar ambiguous? (impossible to decide according to formal language theory). The program does so by trying all kinds of tricks to have the grammar produce two identical sentences (it was written by Bas Basten, one of those undaunted computer scientists).

Dick Grune

Page 443: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

443

OTHER SLIDES

Page 444: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

444

Grammar-oriented programming• Grammar-oriented programming (GOP) and

Grammar-oriented Object Design (GOOD) are good for designing and creating a domain-specific programming language (DSL) for a specific business domain.

• GOOD can be used to drive the execution of the application or it can be used to embed the declarative processing logic of a context-aware component (CAC) or context-aware service (CAS). GOOD is a method for creating and maintaining dynamically reconfigurable software architectures driven by business-process architectures. The business compiler was used to capture business processes within real-time workshops for various lines of business and create an executable simulation of the processes used.

http://en.wikipedia.org/wiki/Grammar-oriented_programming

Page 445: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

445

Rodney Brooks

Once Turing came up with a formalism for computation we were able make great progress fairly quickly. Now if you took any late 19th-century mathematicians, you could explain the fundamental ideas of computation to them in two or three days, lead them through the theorems, they could understand it and they wouldn't find it mind boggling in any way. It follows on from 19th-century mathematics. Once you have that notion of computation, you are able to do a lot with it.

http://www.edge.org/conversation/the-deep-question

Page 446: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

446

CS grammar for generating the same number of as, bs, and cs

1. S → S’ | ε2. S’ → S’ABC |

ABC3. AB → BA4. AC → CA5. BA → AB6. BC → CB7. CA → AC8. CB → BC9. A → a10. B → b11. C → c

Shuffle the as, bs, and cs

Page 447: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

447

Grammar for generating 1, 2, 4, 8, … as (2i as)

S -> ACaBCa -> aaCCB -> DBCB -> EaD -> DaAD -> ACaE -> EaAE -> ε

S -> ACaB -> AaaCB -> AaaE -> AaEa -> AEaa -> aa

S -> ACaB -> AaaCB -> AaDB -> ADaB -> ACaB -> AaaCB -> AaaDB -> AaDaB -> ADaaB -> ACaaB -> AaaCaaB -> AaaaaCaB -> AaaaaaaCB -> AaaaaaaE -> AaaaaaEa -> AaaaaEaa -> AaaaEaaaa -> AaaEaaaaa -> AaEaaaaa -> AEaaaaaa -> aaaaaa

Page 448: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

448

CS languagesOne of the simplest context-sensitive languages is: the language of all strings consisting of n occurrences of the symbol "a", then n "b"'s, then n "c"'s (abc, aabbcc, aaabbbccc, etc.). A superset of this language, called the Bach language,[1] is defined as the set of all strings where "a", "b" and "c" (or any other set of three symbols) occurs equally often (aabccb, baabcaccb, etc.) and is also context-sensitive.[2][3]

Another example of a context-sensitive language that is not context-free is L = { ap : p is a prime number }. L can be shown to be a context-sensitive language by constructing a linear bounded automaton which accepts L. The language can easily be shown to be neither regular nor context free by applying the respective pumping lemmas for each of the language classes to L.

http://en.wikipedia.org/wiki/Context-sensitive_language

Page 449: Parsing Techniques A Practical Guide by  Dick Grune and Ceriel  J. H. Jacobs

449

Symbols• ∈ = an element of• ∉ = not an element of• ℕ = the set of natural numbers• ε = the empty string• Σ = the alphabet• Σ* = the set of all strings over the alphabet Σ• δ = transition function for a single token• δ* = transition function for a sequence of tokens• → • ⊆ = proper subset of• = a language• = the complement of language • ∩ = intersection• ∪ = union• ↦• P• ∅• ≠• ¬

b

{q0, q1}