lexical and syntactic analysis — an example · connected with the syntax analysis, where the...

94
Lexical and Syntactic Analysis — an example Example: We would like to recognize a language of arithmetic expressions containing expressions such as: 34 x+1 -x * 2 + 128 * (y - z / 3) The expressions can contain number constants — sequences of digits 0, 1,..., 9. The expressions can contain names of variables — sequences consisting of letters, digits, and symbol “ ”, which do not start with a digit. The expressions can contain basic arithmetic operations — “+”, “-”, *”, “/”, and unary “-”. It is possible to use parentheses — “(” and “)”, and to use a standard priority of arithmetic operations. Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 1 / 54

Upload: others

Post on 19-Feb-2020

17 views

Category:

Documents


2 download

TRANSCRIPT

Lexical and Syntactic Analysis — an example

Example: We would like to recognize a language of arithmetic expressionscontaining expressions such as:

34 x+1 -x * 2 + 128 * (y - z / 3)

The expressions can contain number constants — sequences of digits0, 1, . . . , 9.

The expressions can contain names of variables — sequencesconsisting of letters, digits, and symbol “ ”, which do not start witha digit.

The expressions can contain basic arithmetic operations — “+”, “-”,“*”, “/”, and unary “-”.

It is possible to use parentheses — “(” and “)”, and to usea standard priority of arithmetic operations.

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 1 / 54

Lexical and Syntactic Analysis — an example

The problem we want to solve:

Input: a sequence of characters (e.g., a string, a text file, etc.)

Output: an abstract syntax tree representing the structure of a givenexpression, or an information about a syntax error in the expression

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 2 / 54

Lexical and Syntactic Analysis — an example

It is convenient to decompose this problem into several parts:

Lexical analysis — recognizing of lexical elements (so calledtokens) such as for example identifiers, number constants, operators,etc.

Syntactic analysis — determining whether a given sequence oftokens corresponds to an allowed structure of expressions; basically, itmeans finding corresponding derivation (resp. derivation tree) fora given word in a context-free grammar representing the givenlanguage (e.g., in our case, the language of all well-formedexpressions).

Construction of an abstract syntax tree — this phase is usuallyconnected with the syntax analysis, where the result, actuallyproduced by the program, is typically not directly a derivation treebut rather some kind of abstract syntax tree or performing of someactions connected with rules of the given grammar.

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 3 / 54

Lexical and Syntactic Analysis — an example

Terminals for the grammar representing well-formed expressions:

〈ident〉 — identifier, e.g. “x”, “q3”, “count r12”〈num 〉 — number constant, e.g. “5”, “42”, “65535”“(” — left parenthesis“)” — right parenthesis“+” — plus“-” — minus“*” — star“/” — slash

Remark: Recognizing of sequences of symbols that correspond toindividual terminals is the goal of lexical analysis.

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 4 / 54

Lexical and Syntactic Analysis — an example

Example: Expression -x * 2 + 128 * (y - z / 3) is represented bythe following sequence of symbols:

- x * 2 + 1 2 8 * ( y - z / 3 )

The following sequence of tokens corresponds to this sequence of symbols;these tokens are terminal symbols of the given context-free grammar:

- 〈ident〉 * 〈num 〉 + 〈num 〉 * ( 〈ident〉 - 〈ident〉 / 〈num 〉 )

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 5 / 54

Lexical and Syntactic Analysis — an example

The context-free grammar for the given language — the first try:

E → 〈ident〉 | 〈num 〉 | (E ) | -E | E +E | E -E | E *E | E /E

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 6 / 54

Lexical and Syntactic Analysis — an example

The context-free grammar for the given language — the first try:

E → 〈ident〉 | 〈num 〉 | (E ) | -E | E +E | E -E | E *E | E /E

This grammar is ambiguous.

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 6 / 54

Lexical and Syntactic Analysis — an example

The context-free grammar for the given language — the second try:

E → T | T +E | T -E

T → F | F *T | F /T

F → 〈ident〉 | 〈num 〉 | (E ) | -F

Different levels of priority are represented by different nonterminals:

E — expression

T — term

F — factor

This grammar is unambiguous.

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 6 / 54

Lexical and Syntactic Analysis — an example

The context-free grammar for the given language — the third try:

E → T | T AE

A → + | -

T → F | F M T

M → * | /

F → 〈ident〉 | 〈num 〉 | (E ) | -F

We create separate nonterminals for operators on different levels of priority:

A — additive operator

M — multiplicative operator

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 6 / 54

Lexical and Syntactic Analysis — an example

The context-free grammar for the given language — the fourth try:

S → E 〈eof 〉E → T | T AE

A → + | -

T → F | F M T

M → * | /

F → 〈ident〉 | 〈num 〉 | (E ) | -F

It is useful to introduce special nonterminal 〈eof 〉 representing theend of input.

Moreover, in this grammar the initial nonterminal S does not occuron the right hand side of any grammar.

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 6 / 54

Implementation of Lexical Analysis

Enumerated type Token kind representing different kinds of tokens:

T EOF — the end of inputT Ident — identifierT Number — number constantT LParen — “(”T RParen — “)”T Plus — “+”T Minus — “-”T Star — “*”T Slash — “/”

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 7 / 54

Implementation of Lexical Analysis

Variable c : a currently processed character (resp. a special value 〈eof 〉representing the end of input):

at the beginning, the first character in the input is read to variable c

function next-char() returns a next charater from the input

Some helper functions:

error() — outputs an information about a syntax error and abortsthe processing of the expression

is-ident-start-char(c) — tests whether c is a charater that can occurat the beginning of an identifier

is-ident-normal-char(c) — tests whether c is a character that canoccur in an identifier (on other positions except beginning)

is-digit(c) — tests whether c is a digit

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 8 / 54

Implementation of Lexical Analysis

Some other helper functions:

create-ident(s) — creates an identifier from a given string s

create-number(s) — creates a number from a given string s

Auxiliary variables:

last-ident — the last processed identifier

last-num — the last processed number constant

Function next-token() — the main part of the lexical analyser, itreturns the following token from the input

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 9 / 54

Implementation of Lexical Analysis

next-token ():while c ∈ {“ ”,“\t”} do

c := next-char();

if c == 〈eof 〉 then return T EOFelse switch c do

case “(”: do c := next-char(); return T LParencase “)”: do c := next-char(); return T RParencase “+”: do c := next-char(); return T Pluscase “–”: do c := next-char(); return T Minuscase “*”: do c := next-char(); return T Starcase “/”: do c := next-char(); return T Slashotherwise do

if is-ident-start-char(c) then return scan-ident()else if is-digit(c) then return scan-number()else error()

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 10 / 54

Implementation of Lexical Analysis

scan-ident ():s := c

c := next-char()while is-ident-normal-char(c) do

s := s · cc := next-char()

last-ident := create-ident(s)return T Ident

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 11 / 54

Implementation of Lexical Analysis

scan-number ():s := c

c := next-char()while is-digit(c) do

s := s · cc := next-char()

last-num := create-number(s)return T Number

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 12 / 54

Implementation of Syntactic Analysis

Variable t :

the last processed token

A helper function:

init-scanner():

initializes the lexical analyser

reads the first character from the input into variable c, aby tam bylnachystan pro nasledna volanı funkce next-token()

Reading a next token:

next-token():

this is the previously described main function of the lexical analyserby repeatedly calling this function we read the tokensvariable c always contains the symbol that has been read last

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 13 / 54

Implementation of Syntactic Analysis

One of the often used methods of syntactic analysis is recursive descent:

For each nonterminal there is a corresponding function — thefunction corresponding to nonterminal A implements all rules withnonterminal A on the left-hand side.

In a given function, the next token is used to select betweencorresponding rules.

Instructions in the body of a function correspond to processing ofright-hand sides of the rules:

an occurrence of nonterminal B — the function corresponding tononterminal B is called

an occurrence of terminal a — it is checked that the following tokencorresponds to terminal a, when it does, the next token is read,otherwise an error is reported

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 14 / 54

Implementation of Syntactic Analysis

The previously described grammed is not very suitable for the recursivedescent because it is not possible for nonterminals E and T to determinein a deterministic way one of the given pair of rules by use of just onefollowing symbol:

S → E 〈eof 〉E → T | T AE

A → + | -

T → F | F M T

M → * | /

F → 〈ident〉 | 〈num 〉 | (E ) | -F

For example, if we want to rewrite nonterminal T and we know that thefollowing terminal in the input is 〈num 〉, this terminal can be generated byuse of any of the rules

T → F T → F M T

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 15 / 54

Implementation of Syntactic Analysis

The following modified grammar does not have this problem:

S → E 〈eof 〉E → T G

G → ε | AT G

A → + | -

T → F U

U → ε | M F U

M → * | /

F → -F | (E ) | 〈ident〉 | 〈num 〉

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 16 / 54

Implementation of Syntactic Analysis

Parse ():init-scanner()t := next-token()Parse-S()

S → E 〈eof 〉

Parse-S ():Parse-E()if t 6= T EOF then error()

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 17 / 54

Implementation of Syntactic Analysis

E → T G

Parse-E ():Parse-T()

Parse-G()

G → ε | AT G

Parse-G ():if t ∈ {T Plus,T Minus} then

Parse-A()

Parse-T()

Parse-G()

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 18 / 54

Implementation of Syntactic Analysis

T → F U

Parse-T ():Parse-F()Parse-U()

U → ε | M F U

Parse-U (e1):if t ∈ {T Star,T Slash} then

Parse-M()

Parse-F()parse-U()

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 19 / 54

Implementation of Syntactic Analysis

A → + | -

Parse-A ():switch t do

case T Plus dot := next-token()

case T Minus dot := next-token()

else error()

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 20 / 54

Implementation of Syntactic Analysis

M → * | /

Parse-M ():switch t do

case T Star dot := next-token()

case T Slash dot := next-token()

otherwise do error()

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 21 / 54

Implementation of Syntactic Analysis

F → 〈ident〉| 〈num 〉| (E )

| -F

Parse-F ():switch t do

case T Ident dot := next-token()

case T Number dot := next-token()

case T LParen dot := next-token()Parse-E()if t 6= T RParen then error()t := next-token()

case T Minus dot := next-token()Parse-F()

otherwise do error()

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 22 / 54

Implementation of Syntactic Analysis

If a function ends with a recursive call of itself, as for examplefunction Parse-G(), it is possible to replace this recursion with aniteration.

Functions Parse-E() and Parse-G() can be merged into onefunction.

Similarly, it is possible to replace a recursion with an iteration infunction Parse-U(), and functions Parse-T() and Parse-U() canbe merged into one function.

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 23 / 54

E → T G

G → ε | AT G

Parse-E ():Parse-T()

while t ∈ {T Plus,T Minus} doParse-A()

Parse-T()

T → F U

U → ε | M F U

Parse-T ():Parse-F()while t ∈ {T Star,T Slash} do

Parse-M()

Parse-F()

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 24 / 54

Implementation of Syntactic Analysis

The implementation described above just finds out whether the giveninput corresponds to some word that can be generated by the givengrammar.

If this is the case, it reads whole input and finishes successfully.

If it is not the case, function error() is called.

In real implementation, it is useful to provide function error() witherror messages describing the kind of error together with theinformation about a position in the input where the error occurred(e.g., this line and column where the currently processed token starts).

Function error() can use this information to create error messagesthat are displayed to a user.

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 25 / 54

Implementation of Syntactic Analysis

Typically, we do not want to use syntactic analysis just to check thatthe input is correct but also to create abstract syntax tree or toperform some other types of actions connected with individual rules ofthe grammar.

The previously presented code can be used as a base that can beextended with other actions such as construction of an abstractsyntax tree, modifications of read expressions, and possibly someother types of computation.

When the functions that correspond to nonterminals should createthe corresponding abstract syntax tree, they can return theconstructed subtree, corresponding to the part of the expressiongenerated from the given nonterminal, as a return value.

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 26 / 54

Implementation of Syntactic Analysis

Construction of an abstract syntax tree:

An enumerated type representing binary arithmetic operations:enum Bin op { Add, Sub, Mul, Div }

An enumerated type representing unary arithmetic operations:enum Un op { Un minus }

Functions for creation of different kinds of nodes of an abstractsyntax tree:

mk-var(ident) — creates a leaf representing a variable

mk-num(num) — creates a leaf representing a number constant

mk-unary(op, e) — creates a node with one child e, on whicha unary operation op (of type Un op) is applied

mk-binary(op, e1, e2) — creates a node with two children e1 and e2,on which a binary operation op (of type Bin op) is applied

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 27 / 54

Implementation of Syntactic Analysis

S → E 〈eof 〉

Parse ():init-scanner()t := next-token()e := Parse-E()if t 6= T EOF then error()return e

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 28 / 54

Implementation of Syntactic Analysis

E → T G

G → ε | AT G

Parse-E ():e1 := Parse-T()

while t ∈ {T Plus,T Minus} doop := Parse-A()

e2 := Parse-T()

e1 := mk-binary(op, e1, e2)

return e1

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 29 / 54

Implementation of Syntactic Analysis

A → + | -

Parse-A ():switch t do

case T Plus dot := next-token()return Add

case T Minus dot := next-token()return Sub

otherwise do error()

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 30 / 54

Implementation of Syntactic Analysis

T → F U

U → ε | M F U

Parse-T ():e1 := Parse-F()while t ∈ {T Star,T Slash} do

op := Parse-M()

e2 := Parse-F()e1 := mk-binary(op, e1, e2)

return e1

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 31 / 54

Implementation of Syntactic Analysis

M → * | /

Parse-M ():switch t do

case T Star dot := next-token()return Mul

case T Slash dot := next-token()return Div

otherwise do error()

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 32 / 54

F → 〈ident〉| 〈num 〉| (E )

| -F

Parse-F ():switch t do

case T Ident doe := mk-var(last-ident)t := next-token()return e

case T Number doe := mk-num(last-num)

t := next-token()return e

case T LParen dot := next-token()e := Parse-E()if t 6= T RParen then error()t := next-token()return e

case T Minus dot := next-token()e := Parse-F()return mk-unary(Un minus, e)

otherwise do error()

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 33 / 54

Reduction of a Context-Free Grammar

Definition

A context-free grammar G = (Π,Σ, S ,P) is reduced if for every A ∈ Π:

there are some u, v ∈ Σ∗ such that S ⇒∗ uAv , and

there is some w ∈ Σ∗ such that A ⇒∗ w .

Remark: Obviously, if S ⇒∗ uAv and A ⇒

∗ w where u, v ,w ∈ Σ∗, thenS ⇒

∗ uwv , and so A is used in some derivation of a word from Σ∗.

On the other hand, if A is used in some derivation S ⇒∗ z of

a word z ∈ Σ∗, then z can be divided into parts u, v ,w such that z = uwv

and S ⇒∗ uAv and A ⇒

∗ w .

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 34 / 54

Reduction of a Context-Free Grammar

Obviously, every A ∈ Π with the property that

there are no u, v ∈ Σ∗ such that S ⇒∗ uAv , or

there is no w ∈ Σ∗ such that A ⇒∗ w ,

can be safely removed from the grammar (together with all rules where itoccurs) without affecting the generated language.

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 35 / 54

Reduction of a Context-Free Grammar

An algorithm that for a given CFG G contructs an equivalent reducedgrammar:

1 Construct the set T of all nonterminals that can generate a terminalword:

T = {A ∈ Π | (∃w ∈ Σ∗)(A ⇒∗ w) }

2 Remove from G all nonterminals from the set Π− T together with allrules where they occur.Denote the rusulting grammar G ′ = (Π ′, Σ, S ,P ′).

3 Construct the set D of all nonterminals that can be “reached” fromthe initial nonterminal S :

D = {A ∈ Π ′ | (∃α,β ∈ (Π ′ ∪ Σ)∗)(S ⇒∗ αAβ) }

4 Remove from G ′ all nonterminals from the set Π ′ −D together withall rules where they occur.The rusulting grammar G ′′ is the result of the whole algorithm.

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 36 / 54

Reduction of a Context-Free Grammar

Example:

S → AC | B

A → aC | AbA

B → Ba | BbA | DB

C → aa | aBC

D → aA | ε

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 37 / 54

Reduction of a Context-Free Grammar

Example:

S → AC | B

A → aC | AbA

B → Ba | BbA | DB

C → aa | aBC

D → aA | ε

T0 = {C ,D}

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 37 / 54

Reduction of a Context-Free Grammar

Example:

S → AC | B

A → aC | AbA

B → Ba | BbA | DB

C → aa | aBC

D → aA | ε

T0 = {C ,D}

T1 = {C ,D,A}

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 37 / 54

Reduction of a Context-Free Grammar

Example:

S → AC | B

A → aC | AbA

B → Ba | BbA | DB

C → aa | aBC

D → aA | ε

T0 = {C ,D}

T1 = {C ,D,A}

T2 = {C ,D,A, S}

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 37 / 54

Reduction of a Context-Free Grammar

Example:

S → AC | B

A → aC | AbA

B → Ba | BbA | DB

C → aa | aBC

D → aA | ε

T0 = {C ,D}

T1 = {C ,D,A}

T2 = {C ,D,A, S}

T = {C ,D,A, S}

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 37 / 54

Reduction of a Context-Free Grammar

Example:

S → AC | B

A → aC | AbA

B → Ba | BbA | DB

C → aa | aBC

D → aA | ε

T0 = {C ,D}

T1 = {C ,D,A}

T2 = {C ,D,A, S}

T = {C ,D,A, S}

S → AC

A → aC | AbA

C → aa

D → aA | ε

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 37 / 54

Reduction of a Context-Free Grammar

Example:

S → AC | B

A → aC | AbA

B → Ba | BbA | DB

C → aa | aBC

D → aA | ε

T0 = {C ,D}

T1 = {C ,D,A}

T2 = {C ,D,A, S}

T = {C ,D,A, S}

S → AC

A → aC | AbA

C → aa

D → aA | ε

D0 = {S}

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 37 / 54

Reduction of a Context-Free Grammar

Example:

S → AC | B

A → aC | AbA

B → Ba | BbA | DB

C → aa | aBC

D → aA | ε

T0 = {C ,D}

T1 = {C ,D,A}

T2 = {C ,D,A, S}

T = {C ,D,A, S}

S → AC

A → aC | AbA

C → aa

D → aA | ε

D0 = {S}

D1 = {S ,A,C }

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 37 / 54

Reduction of a Context-Free Grammar

Example:

S → AC | B

A → aC | AbA

B → Ba | BbA | DB

C → aa | aBC

D → aA | ε

T0 = {C ,D}

T1 = {C ,D,A}

T2 = {C ,D,A, S}

T = {C ,D,A, S}

S → AC

A → aC | AbA

C → aa

D → aA | ε

D0 = {S}

D1 = {S ,A,C }

D = {S ,A,C }

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 37 / 54

Reduction of a Context-Free Grammar

Example:

S → AC | B

A → aC | AbA

B → Ba | BbA | DB

C → aa | aBC

D → aA | ε

T0 = {C ,D}

T1 = {C ,D,A}

T2 = {C ,D,A, S}

T = {C ,D,A, S}

S → AC

A → aC | AbA

C → aa

D → aA | ε

D0 = {S}

D1 = {S ,A,C }

D = {S ,A,C }

S → AC

A → aC | AbA

C → aa

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 37 / 54

Some Properties of Context-free Grammars

Let us assume we have a context-free grammar G = (Π,Σ, S ,P).

We can easily construct algorithms for the following problems dealing withsome properties of context-free grammar G:

To find out for given α ∈ (Π ∪ Σ)∗ whether α ⇒∗ ε.

To find, for given α ∈ (Π ∪ Σ)∗, the set first(α), where

first(α) = { a ∈ Σ | α ⇒∗ aβ for some β ∈ (Π ∪ Σ)∗ }

To find, for given α ∈ (Π ∪ Σ)∗, the set last(α), where

last(α) = { a ∈ Σ | α ⇒∗ βa for some β ∈ (Π ∪ Σ)∗ }

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 38 / 54

Some Properties of Context-free Grammars

To find, for given nonterminal A ∈ Π, the set follow(A), where

follow(A) = { a ∈ Σ | S ⇒∗ β1Aaβ2 for some β1, β2 ∈ (Π ∪ Σ)∗ }

To find all nonterminals A ∈ Π, for which grammar G contains theleft recursion, i.e., those for which

A ⇒+ Aα for some α ∈ (Π ∪ Σ)∗

To find all nonterminals A ∈ Π, for which grammar G contains theright recursion, i.e., those for which

A ⇒+ αA for some α ∈ (Π ∪ Σ)∗

Remark: Notation α ⇒+ β, where α,β ∈ (Π ∪ Σ)∗, denotes that α can

be rewritten to β (i.e., α ⇒∗ β) by a derivation with a nonzero number of

steps.

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 39 / 54

Some Properties of Context-free Grammars

To be able to use a given context-free grammar G for a straightforwardimplementation of recursive descent, it must have some particularproperties:

It must not contain left recursion.

For each nonterminal A ∈ Π and all rules with A on the left-handside, i.e.,

A → α1 | α2 | · · · | αn

the sets first(α1), first(α2), . . . , first(αn) must be pairwise disjoint.

For every nonterminal A ∈ Π and all rules A → α1 | α2 | · · · | anthere can be at most one right-hand side αi such that αi ⇒

∗ ε.

If there is such right-hand side (and so A ⇒∗ ε), the sets first(α1),

first(α2), . . . , first(αn) must be disjoint with the set follow(A).

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 40 / 54

Removing Epsilon-rules

Rules of the form A → ε are called epsilon-rules (ε-rules).

Proposition

For every context-free grammar G there is a context-free grammar G ′

without ε-rules such that L(G ′) = L(G) − {ε}.

Proof: Construct the set E of all nonterminals that can be rewritten to ε,i.e.,

E = {A ∈ Π | A ⇒∗ ε }

Remove all ε-rules and replace every other rule A → α with a set of rulesobtained by all possible rules of the form A → α ′ where α ′ is obtainedfrom α by possible ommitting of (some) occurrences of nonterminalsfrom E .

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 41 / 54

Removing Epsilon-rules

Example:

S → ASA | aBC | b

A → BD | aAB

B → bB | ε

C → AaA | b

D → AD | BBB | a

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 42 / 54

Removing Epsilon-rules

Example:

S → ASA | aBC | b

A → BD | aAB

B → bB | ε

C → AaA | b

D → AD | BBB | a

E0 = {B}

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 42 / 54

Removing Epsilon-rules

Example:

S → ASA | aBC | b

A → BD | aAB

B → bB | ε

C → AaA | b

D → AD | BBB | a

E0 = {B}

E1 = {B ,D}

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 42 / 54

Removing Epsilon-rules

Example:

S → ASA | aBC | b

A → BD | aAB

B → bB | ε

C → AaA | b

D → AD | BBB | a

E0 = {B}

E1 = {B ,D}

E2 = {B ,D,A}

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 42 / 54

Removing Epsilon-rules

Example:

S → ASA | aBC | b

A → BD | aAB

B → bB | ε

C → AaA | b

D → AD | BBB | a

E0 = {B}

E1 = {B ,D}

E2 = {B ,D,A}

E = {B ,D,A}

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 42 / 54

Removing Epsilon-rules

Example:

S → ASA | aBC | b

A → BD | aAB

B → bB | ε

C → AaA | b

D → AD | BBB | a

E0 = {B}

E1 = {B ,D}

E2 = {B ,D,A}

E = {B ,D,A}

S → ASA | SA | AS | S | aBC | aC | b

A → BD | B | D | aAB | aB | aA | a

B → bB | b

C → AaA | aA | Aa | a | b

D → AD | D | A | BBB | BB | B | a

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 42 / 54

Removing Epsilon-rules

For every context-free grammar G = (Π,Σ, S ,P) there is a context-freegrammar G ′ = (Π ′, Σ, S ′,P ′) such that L(G ′) = L(G) and either:

G ′ does not contain ε-rules, or

the only ε-rule in G ′ is the rule S ′→ ε and S ′ does not occur on the

right-hand side of any rule in G ′.

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 43 / 54

Removing Unit-rules

Rules of the form A → B where A,B ∈ Π are called unit rules.

Proposition

For every context-free grammar G there is a context-free grammar G ′

without ε-rules and without unit rules such that L(G ′) = L(G) − {ε}.

Proof: Assume G = (Π,Σ, S ,P) does not contain ε-rules.

For each A ∈ Π compute the set NA of all nonterminals that can beobtained from A by using only unit rules, i.e.,

NA = {B ∈ Π | A ⇒∗ B }

Construct CFG G ′ = (Π,Σ, S ,P ′) where P ′ consist of rules of the formA → β where A ∈ Π, β is not a single nonterminal, and (B → β) ∈ P forsome B ∈ NA.

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 44 / 54

Removing Unit-rules

Example:

S → AB | C

A → a | bA

B → C | b

C → D | AA | AaA

D → B | ABb

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 45 / 54

Removing Unit-rules

Example:

S → AB | C

A → a | bA

B → C | b

C → D | AA | AaA

D → B | ABb

N 0S = {S}

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 45 / 54

Removing Unit-rules

Example:

S → AB | C

A → a | bA

B → C | b

C → D | AA | AaA

D → B | ABb

N 0S = {S}

N 1S = {S ,C }

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 45 / 54

Removing Unit-rules

Example:

S → AB | C

A → a | bA

B → C | b

C → D | AA | AaA

D → B | ABb

N 0S = {S}

N 1S = {S ,C }

N 2S = {S ,C ,D}

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 45 / 54

Removing Unit-rules

Example:

S → AB | C

A → a | bA

B → C | b

C → D | AA | AaA

D → B | ABb

N 0S = {S}

N 1S = {S ,C }

N 2S = {S ,C ,D}

N 3S = {S ,C ,D,B}

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 45 / 54

Removing Unit-rules

Example:

S → AB | C

A → a | bA

B → C | b

C → D | AA | AaA

D → B | ABb

N 0S = {S}

N 1S = {S ,C }

N 2S = {S ,C ,D}

N 3S = {S ,C ,D,B}

N 0A = {A}

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 45 / 54

Removing Unit-rules

Example:

S → AB | C

A → a | bA

B → C | b

C → D | AA | AaA

D → B | ABb

N 0S = {S}

N 1S = {S ,C }

N 2S = {S ,C ,D}

N 3S = {S ,C ,D,B}

N 0A = {A}

N 0B = {B}

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 45 / 54

Removing Unit-rules

Example:

S → AB | C

A → a | bA

B → C | b

C → D | AA | AaA

D → B | ABb

N 0S = {S}

N 1S = {S ,C }

N 2S = {S ,C ,D}

N 3S = {S ,C ,D,B}

N 0A = {A}

N 0B = {B}

N 1B = {B ,C }

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 45 / 54

Removing Unit-rules

Example:

S → AB | C

A → a | bA

B → C | b

C → D | AA | AaA

D → B | ABb

N 0S = {S}

N 1S = {S ,C }

N 2S = {S ,C ,D}

N 3S = {S ,C ,D,B}

N 0A = {A}

N 0B = {B}

N 1B = {B ,C }

N 2B = {B ,C ,D}

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 45 / 54

Removing Unit-rules

Example:

S → AB | C

A → a | bA

B → C | b

C → D | AA | AaA

D → B | ABb

N 0S = {S}

N 1S = {S ,C }

N 2S = {S ,C ,D}

N 3S = {S ,C ,D,B}

N 0A = {A}

N 0B = {B}

N 1B = {B ,C }

N 2B = {B ,C ,D}

N 0C = {C }

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 45 / 54

Removing Unit-rules

Example:

S → AB | C

A → a | bA

B → C | b

C → D | AA | AaA

D → B | ABb

N 0S = {S}

N 1S = {S ,C }

N 2S = {S ,C ,D}

N 3S = {S ,C ,D,B}

N 0A = {A}

N 0B = {B}

N 1B = {B ,C }

N 2B = {B ,C ,D}

N 0C = {C }

N 1C = {C ,D}

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 45 / 54

Removing Unit-rules

Example:

S → AB | C

A → a | bA

B → C | b

C → D | AA | AaA

D → B | ABb

N 0S = {S}

N 1S = {S ,C }

N 2S = {S ,C ,D}

N 3S = {S ,C ,D,B}

N 0A = {A}

N 0B = {B}

N 1B = {B ,C }

N 2B = {B ,C ,D}

N 0C = {C }

N 1C = {C ,D}

N 2C = {C ,D,B}

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 45 / 54

Removing Unit-rules

Example:

S → AB | C

A → a | bA

B → C | b

C → D | AA | AaA

D → B | ABb

N 0S = {S}

N 1S = {S ,C }

N 2S = {S ,C ,D}

N 3S = {S ,C ,D,B}

N 0A = {A}

N 0B = {B}

N 1B = {B ,C }

N 2B = {B ,C ,D}

N 0C = {C }

N 1C = {C ,D}

N 2C = {C ,D,B}

N 0D = {D}

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 45 / 54

Removing Unit-rules

Example:

S → AB | C

A → a | bA

B → C | b

C → D | AA | AaA

D → B | ABb

N 0S = {S}

N 1S = {S ,C }

N 2S = {S ,C ,D}

N 3S = {S ,C ,D,B}

N 0A = {A}

N 0B = {B}

N 1B = {B ,C }

N 2B = {B ,C ,D}

N 0C = {C }

N 1C = {C ,D}

N 2C = {C ,D,B}

N 0D = {D}

N 1D = {D,B}

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 45 / 54

Removing Unit-rules

Example:

S → AB | C

A → a | bA

B → C | b

C → D | AA | AaA

D → B | ABb

N 0S = {S}

N 1S = {S ,C }

N 2S = {S ,C ,D}

N 3S = {S ,C ,D,B}

N 0A = {A}

N 0B = {B}

N 1B = {B ,C }

N 2B = {B ,C ,D}

N 0C = {C }

N 1C = {C ,D}

N 2C = {C ,D,B}

N 0D = {D}

N 1D = {D,B}

N 2D = {D,B ,C }

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 45 / 54

Removing Unit-rules

Example:

S → AB | C

A → a | bA

B → C | b

C → D | AA | AaA

D → B | ABb

N 0S = {S}

N 1S = {S ,C }

N 2S = {S ,C ,D}

N 3S = {S ,C ,D,B}

N 0A = {A}

N 0B = {B}

N 1B = {B ,C }

N 2B = {B ,C ,D}

N 0C = {C }

N 1C = {C ,D}

N 2C = {C ,D,B}

N 0D = {D}

N 1D = {D,B}

N 2D = {D,B ,C }

NS = {S ,C ,D,B}

NA = {A}

NB = {B ,C ,D}

NC = {C ,D,B}

ND = {D,B ,C }

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 45 / 54

Removing Unit-rules

Example:

S → AB | C

A → a | bA

B → C | b

C → D | AA | AaA

D → B | ABb

N 0S = {S}

N 1S = {S ,C }

N 2S = {S ,C ,D}

N 3S = {S ,C ,D,B}

N 0A = {A}

N 0B = {B}

N 1B = {B ,C }

N 2B = {B ,C ,D}

N 0C = {C }

N 1C = {C ,D}

N 2C = {C ,D,B}

N 0D = {D}

N 1D = {D,B}

N 2D = {D,B ,C }

NS = {S ,C ,D,B}

NA = {A}

NB = {B ,C ,D}

NC = {C ,D,B}

ND = {D,B ,C }

S → AB | AA | AaA | ABb | b

A → a | bA

B → b | AA | AaA | ABb

C → AA | AaA | ABb | b

D → ABb | b | AA | AaA

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 45 / 54

Chomsky Normal Form

Definition

A context-free grammar is in Chomsky normal form if every rule is of onof the following forms:

A → BC

A → a

where a is any terminal and A, B ,and C are any nonterminals.

In addition we permit the rule S → ε, where S the initial nonterminal. Inthat case, nonterminal S cannot occur on the right-hand side of any rule.

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 46 / 54

Chomsky Normal Form

Proposition

For every context-free grammar G there is an equivalent context-freegrammar G ′ in Chomsky normal form.

Proof: Perform the following transformations on G:

1 Decompose each rule A → α where |α| ≥ 3 into a sequence of ruleswhere each right-hand size has length 2.

2 Remove ε-rules.

3 Remove unit rules.

4 For each terminal a occurring on the right-hand size of some ruleA → α where |α| = 2 introduce a new nonterminal Na, replaceoccurrences of a on such right-hand sides with Na, and add Na → a

as a new rule.

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 47 / 54

Chomsky Normal Form

Example:

S → ASA | aB

A → B | S

B → b | ε

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 48 / 54

Chomsky Normal Form

Example:

S → ASA | aB

A → B | S

B → b | ε

Step 1:

S → AZ | aB

Z → SA

A → B | S

B → b | ε

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 48 / 54

Chomsky Normal Form

Example:

S → ASA | aB

A → B | S

B → b | ε

Step 1:

S → AZ | aB

Z → SA

A → B | S

B → b | ε

Step 2:

E = {B ,A}

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 48 / 54

Chomsky Normal Form

Example:

S → ASA | aB

A → B | S

B → b | ε

Step 1:

S → AZ | aB

Z → SA

A → B | S

B → b | ε

Step 2:

E = {B ,A}

S0 → S

S → AZ | Z | aB | a

Z → SA | S

A → B | S

B → b

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 48 / 54

Chomsky Normal Form

Example:

S → ASA | aB

A → B | S

B → b | ε

Step 1:

S → AZ | aB

Z → SA

A → B | S

B → b | ε

Step 2:

E = {B ,A}

S0 → S

S → AZ | Z | aB | a

Z → SA | S

A → B | S

B → b

Step 3:

NS0= {S0,S ,Z }

NS = {S ,Z }

NZ = {Z ,S}

NA = {A,B ,S ,Z }

NB = {B}

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 48 / 54

Chomsky Normal Form

Example:

S → ASA | aB

A → B | S

B → b | ε

Step 1:

S → AZ | aB

Z → SA

A → B | S

B → b | ε

Step 2:

E = {B ,A}

S0 → S

S → AZ | Z | aB | a

Z → SA | S

A → B | S

B → b

Step 3:

NS0= {S0,S ,Z }

NS = {S ,Z }

NZ = {Z ,S}

NA = {A,B ,S ,Z }

NB = {B}

S0 → AZ | aB | a | SA

S → AZ | aB | a | SA

Z → SA | AZ | aB | a

A → b | AZ | aB | a | SA

B → b

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 48 / 54

Chomsky Normal Form

Example:

S → ASA | aB

A → B | S

B → b | ε

Step 1:

S → AZ | aB

Z → SA

A → B | S

B → b | ε

Step 2:

E = {B ,A}

S0 → S

S → AZ | Z | aB | a

Z → SA | S

A → B | S

B → b

Step 3:

NS0= {S0,S ,Z }

NS = {S ,Z }

NZ = {Z ,S}

NA = {A,B ,S ,Z }

NB = {B}

S0 → AZ | aB | a | SA

S → AZ | aB | a | SA

Z → SA | AZ | aB | a

A → b | AZ | aB | a | SA

B → b

Step 4:

S0 → AZ | YB | a | SA

S → AZ | YB | a | SA

Z → SA | AZ | YB | a

A → b | AZ | YB | a | SA

B → b

Y → a

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 48 / 54

Chomsky Normal Form

Grammar G = (Π,Σ, S ,P)in Chomsky normal form has some propertiesthat allow to determine whether w ∈ Σ∗ belongs to the languagegenerated by grammar G (i.e., if w ∈ L(G)):

Let us assume that w ∈ L(G) (and so S ⇒∗ w)and that |w | = n,

where n ≥ 1. Then for (every) derivation S ⇒∗ w holds:

The rules of the form A → a (i.e., a nonterminal is rewritten toexactly one terminal) are used in exactly n steps of the derivation.

The rules of the form A → BC (i.e., a nonterminal is rewritten toa pair of nonterminals) are used in exactly n− 1 steps of the derivation.

So every derivation S ⇒∗ w , where |w | = n, has 2n − 1 steps, where n of

these steps are of the form A → a and n − 1 of the form A → BC .

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 49 / 54

Chomsky Normal Form

To find out whether S ⇒∗ w , it is sufficient to try by brute force all

possible derivations of length 2n − 1.

Such algorithm has exponential time complexity with respect to the lengthof w .

Such systematic trying of all possibilities can be implemented by using socalled dynamic programming in a way that is much more efficient thana straightforward algorithm that generates all derivations of the givenlength.

Cocke-Younger-Kasami algorithm, with time complexity O(n3), is basedon this idea. (Assuming a fixed grammar G.)

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 50 / 54

Cocke-Younger-Kasami Algorithm

The question if S ⇒∗ w is a special case of the question if

A ⇒∗ w ,

where A ∈ Π is an arbitrary nonterminal and w ∈ Σ∗ is an arbitrary wordconsisting of terminals.

It is obvious that:

If |w | = 1: Then A ⇒∗ w iff there is a rule A → b in P where w = b.

If |w | > 1: Then A ⇒∗ w iff there is a rule A → BC in P where for

some words u and v such that w = uv , |u| ≥ 1 and |v | ≥ 1, it holdsthat B ⇒

∗ u and C ⇒∗ v .

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 51 / 54

Cocke-Younger-Kasami Algorithm

Let us assume that a word w ∈ Σ∗ with |w | = n where n ≥ 1 and

w = a1a2 · · · an .

Instead of solving the original question whether S ⇒∗ w , we will solve the

following more general problem for all nonempty subwords v of the word w :

To find the set of all nonterminals A from the set Π such thatA ⇒

∗ v .

Let us denote the set of all nonterminals generating subword v of length i

and starting on position j as F [i ][j ], i.e., for each A ∈ Π it holds that

A ∈ F [i ][j ] ⇐⇒ A ⇒∗ ajaj+1 . . . aj+(i−1)

To find out whether S ⇒∗ w , is therefore the same problem as to find out

whether S ∈ F [n][1].Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 52 / 54

Cocke-Younger-Kasami Algorithm

The algorithm computes values F [i ][j ] at first for subwords oflength 1 (i.e., i = 1), then for subwords of length 2 (i.e., i = 2), thenfor subwords of length 3, length 4, etc.

Values F [i ][j ] are stored in a twodimensional array F , where1 ≤ i ≤ n a 1 ≤ j ≤ n − i + 1, where the elements of this array aresubsets of nonterminals from the set Π.

In the computation of the value F [i ][j ] the previously computedvalues F [i ′][j ′], where i ′ < i , are used.

Let us assume that at the beginning all elements of array F areinitialized to ∅.

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 53 / 54

Algoritmus Cocke-Younger-Kasami

for j := 1 to n dofor each (A → b) ∈ P do

if b = aj thenadd A to F [1][j ]

for i := 2 to n dofor j := 1 to n − i + 1 do

for k := 1 to i − 1 dofor each (A → BC ) ∈ P do

if B ∈ F [k ][j ] and C ∈ F [i − k ][j + k ] thenadd A to F [i ][j ]

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 54 / 54