
Lexical and Syntactic Analysis — an example

Example: We would like to recognize a language of arithmetic expressions containing expressions such as:

34
x+1
-x * 2 + 128 * (y - z / 3)

The expressions can contain number constants: sequences of digits 0, 1, ..., 9.

The expressions can contain names of variables: sequences consisting of letters, digits, and the symbol “_” that do not start with a digit.

The expressions can contain basic arithmetic operations: “+”, “-”, “*”, “/”, and unary “-”.

It is possible to use parentheses, “(” and “)”, and the standard priority of arithmetic operations applies.

Z. Sawa (TU Ostrava) Theoretical Computer Science November 13, 2019 1 / 54


The problem we want to solve:

Input: a sequence of characters (e.g., a string, a text file, etc.)

Output: an abstract syntax tree representing the structure of the given expression, or information about a syntax error in the expression



It is convenient to decompose this problem into several parts:

Lexical analysis: recognizing lexical elements (so-called tokens) such as identifiers, number constants, operators, etc.

Syntactic analysis: determining whether a given sequence of tokens corresponds to an allowed structure of expressions; basically, it means finding a corresponding derivation (or derivation tree) for a given word in a context-free grammar representing the given language (in our case, the language of all well-formed expressions).

Construction of an abstract syntax tree: this phase is usually connected with the syntactic analysis. The result actually produced by the program is typically not a derivation tree itself but rather some kind of abstract syntax tree, or the performing of some actions connected with the rules of the given grammar.



Terminals for the grammar representing well-formed expressions:

〈ident〉: identifier, e.g. “x”, “q3”, “count_r12”
〈num〉: number constant, e.g. “5”, “42”, “65535”
“(”: left parenthesis
“)”: right parenthesis
“+”: plus
“-”: minus
“*”: star
“/”: slash

Remark: Recognizing the sequences of symbols that correspond to individual terminals is the goal of lexical analysis.



Example: The expression -x * 2 + 128 * (y - z / 3) is represented by the following sequence of symbols:

- x * 2 + 1 2 8 * ( y - z / 3 )

The following sequence of tokens corresponds to this sequence of symbols; these tokens are terminal symbols of the given context-free grammar:

- 〈ident〉 * 〈num〉 + 〈num〉 * ( 〈ident〉 - 〈ident〉 / 〈num〉 )




The context-free grammar for the given language — the first try:

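E → 〈ident〉 | 〈num〉 | (E) | -E | E + E | E - E | E * E | E / E

This grammar is ambiguous: for example, x - y - z has two different derivation trees, one grouping it as (x - y) - z and one as x - (y - z).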



The context-free grammar for the given language — the second try:

E → T | T + E | T - E

T → F | F * T | F / T

F → 〈ident〉 | 〈num〉 | (E) | -F

Different levels of priority are represented by different nonterminals:

E — expression

T — term

F — factor

This grammar is unambiguous.



The context-free grammar for the given language — the fourth try:

S → E 〈eof〉

E → T | T A E

A → + | -

T → F | F M T

M → * | /

F → 〈ident〉 | 〈num〉 | (E) | -F

It is useful to introduce a special terminal 〈eof〉 representing the end of input.

Moreover, in this grammar the initial nonterminal S does not occur on the right-hand side of any rule.


Implementation of Lexical Analysis

Enumerated type Token_kind representing different kinds of tokens:

T_EOF: the end of input
T_Ident: identifier
T_Number: number constant
T_LParen: “(”
T_RParen: “)”
T_Plus: “+”
T_Minus: “-”
T_Star: “*”
T_Slash: “/”



Variable c: the currently processed character (or a special value 〈eof〉 representing the end of input):

at the beginning, the first character of the input is read into variable c

function next-char() returns the next character from the input

Some helper functions:

error(): outputs information about a syntax error and aborts the processing of the expression

is-ident-start-char(c): tests whether c is a character that can occur at the beginning of an identifier

is-ident-normal-char(c): tests whether c is a character that can occur in an identifier (at any position other than the beginning)

is-digit(c): tests whether c is a digit



Some other helper functions:

create-ident(s) — creates an identifier from a given string s

create-number(s) — creates a number from a given string s

Auxiliary variables:

last-ident — the last processed identifier

last-num — the last processed number constant

Function next-token(): the main part of the lexical analyser; it returns the next token from the input



next-token ():
    while c ∈ {“ ”, “\t”} do
        c := next-char()
    if c == 〈eof〉 then return T_EOF
    else switch c do
        case “(”: do c := next-char(); return T_LParen
        case “)”: do c := next-char(); return T_RParen
        case “+”: do c := next-char(); return T_Plus
        case “-”: do c := next-char(); return T_Minus
        case “*”: do c := next-char(); return T_Star
        case “/”: do c := next-char(); return T_Slash
        otherwise do
            if is-ident-start-char(c) then return scan-ident()
            else if is-digit(c) then return scan-number()
            else error()



scan-ident ():
    s := c
    c := next-char()
    while is-ident-normal-char(c) do
        s := s · c
        c := next-char()
    last-ident := create-ident(s)
    return T_Ident



scan-number ():
    s := c
    c := next-char()
    while is-digit(c) do
        s := s · c
        c := next-char()
    last-num := create-number(s)
    return T_Number
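The scanner described above can be sketched in Python as follows. This is only an illustrative sketch: the names TokenKind and tokenize are hypothetical, the sketch collects all tokens into a list instead of keeping the global variables c, last-ident and last-num, and it assumes that identifiers use ASCII letters, digits and the underscore.

```python
from enum import Enum, auto


class TokenKind(Enum):
    EOF = auto()
    IDENT = auto()
    NUMBER = auto()
    LPAREN = auto()
    RPAREN = auto()
    PLUS = auto()
    MINUS = auto()
    STAR = auto()
    SLASH = auto()


# Tokens that consist of exactly one character.
SINGLE_CHAR = {
    "(": TokenKind.LPAREN, ")": TokenKind.RPAREN,
    "+": TokenKind.PLUS, "-": TokenKind.MINUS,
    "*": TokenKind.STAR, "/": TokenKind.SLASH,
}


def is_ident_start_char(c):
    return c.isascii() and (c.isalpha() or c == "_")


def is_ident_normal_char(c):
    return c.isascii() and (c.isalnum() or c == "_")


def is_digit(c):
    return c.isascii() and c.isdigit()


def tokenize(text):
    """Return a list of (kind, value) pairs ending with (EOF, None)."""
    tokens = []
    pos = 0
    while True:
        # Skip blanks and tabs, as next-token() does.
        while pos < len(text) and text[pos] in " \t":
            pos += 1
        if pos == len(text):
            tokens.append((TokenKind.EOF, None))
            return tokens
        c = text[pos]
        if c in SINGLE_CHAR:
            tokens.append((SINGLE_CHAR[c], None))
            pos += 1
        elif is_ident_start_char(c):
            start = pos  # corresponds to scan-ident()
            pos += 1
            while pos < len(text) and is_ident_normal_char(text[pos]):
                pos += 1
            tokens.append((TokenKind.IDENT, text[start:pos]))
        elif is_digit(c):
            start = pos  # corresponds to scan-number()
            pos += 1
            while pos < len(text) and is_digit(text[pos]):
                pos += 1
            tokens.append((TokenKind.NUMBER, int(text[start:pos])))
        else:
            raise SyntaxError(f"unexpected character {c!r} at position {pos}")
```

Calling tokenize("-x * 2 + 128 * (y - z / 3)") yields the token kinds of the sequence shown earlier, followed by the EOF token.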


Implementation of Syntactic Analysis

Variable t :

the last processed token

A helper function:

init-scanner():

initializes the lexical analyser

reads the first character from the input into variable c, so that it is ready for subsequent calls of function next-token()

Reading a next token:

next-token():

this is the previously described main function of the lexical analyser

by repeatedly calling this function we read the tokens

variable c always contains the symbol that has been read last



One of the often used methods of syntactic analysis is recursive descent:

For each nonterminal there is a corresponding function: the function corresponding to nonterminal A implements all rules with nonterminal A on the left-hand side.

In a given function, the next token is used to select between the corresponding rules.

Instructions in the body of a function correspond to the processing of the right-hand sides of the rules:

an occurrence of nonterminal B: the function corresponding to nonterminal B is called

an occurrence of terminal a: it is checked that the current token corresponds to terminal a; if it does, the next token is read, otherwise an error is reported



The previously described grammar is not very suitable for recursive descent, because for nonterminals E and T it is not possible to choose deterministically between the two rules by looking at just one following symbol:

S → E 〈eof〉

E → T | T A E

A → + | -

T → F | F M T

M → * | /

F → 〈ident〉 | 〈num〉 | (E) | -F

For example, if we want to rewrite nonterminal T and we know that the next terminal in the input is 〈num〉, this terminal can be generated by either of the rules

T → F
T → F M T

The functions that follow therefore use an equivalent grammar in which the common prefix of these rules is factored out (E → T G with G → ε | A T G, and similarly T → F U with U → ε | M F U).



Parse ():
    init-scanner()
    t := next-token()
    Parse-S()

S → E 〈eof〉

Parse-S ():
    Parse-E()
    if t ≠ T_EOF then error()



E → T G

Parse-E ():
    Parse-T()
    Parse-G()

G → ε | A T G

Parse-G ():
    if t ∈ {T_Plus, T_Minus} then
        Parse-A()
        Parse-T()
        Parse-G()



T → F U

Parse-T ():
    Parse-F()
    Parse-U()

U → ε | M F U

Parse-U ():
    if t ∈ {T_Star, T_Slash} then
        Parse-M()
        Parse-F()
        Parse-U()



A → + | -

Parse-A ():
    switch t do
        case T_Plus do t := next-token()
        case T_Minus do t := next-token()
        otherwise do error()



M → * | /

Parse-M ():
    switch t do
        case T_Star do t := next-token()
        case T_Slash do t := next-token()
        otherwise do error()



F → 〈ident〉 | 〈num〉 | (E) | -F

Parse-F ():
    switch t do
        case T_Ident do
            t := next-token()
        case T_Number do
            t := next-token()
        case T_LParen do
            t := next-token()
            Parse-E()
            if t ≠ T_RParen then error()
            t := next-token()
        case T_Minus do
            t := next-token()
            Parse-F()
        otherwise do error()




If a function ends with a recursive call of itself, as for example function Parse-G(), it is possible to replace this recursion with an iteration.

Functions Parse-E() and Parse-G() can be merged into one function.

Similarly, it is possible to replace the recursion with an iteration in function Parse-U(), and functions Parse-T() and Parse-U() can be merged into one function.


E → T G

G → ε | A T G

Parse-E ():
    Parse-T()
    while t ∈ {T_Plus, T_Minus} do
        Parse-A()
        Parse-T()

T → F U

U → ε | M F U

Parse-T ():
    Parse-F()
    while t ∈ {T_Star, T_Slash} do
        Parse-M()
        Parse-F()
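The iterative recursive-descent recognizer can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the names tokenize, Recognizer and accepts are hypothetical, tokens are represented as plain strings ("ident", "num", an operator character, or "eof"), and a regular expression stands in for the hand-written scanner.

```python
import re

# One token per match: a number, an identifier, or a single-character operator.
TOKEN_RE = re.compile(
    r"[ \t]*(?:(?P<num>[0-9]+)|(?P<ident>[A-Za-z_][A-Za-z0-9_]*)|(?P<op>[-+*/()]))"
)


def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip(" \t")
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        # Operators are represented by the operator character itself.
        tokens.append(m.group("op") if m.lastgroup == "op" else m.lastgroup)
        pos = m.end()
    tokens.append("eof")
    return tokens


class Recognizer:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.i = 0

    @property
    def t(self):
        """The current token (variable t in the pseudocode)."""
        return self.tokens[self.i]

    def advance(self):
        self.i += 1

    def parse(self):  # S -> E <eof>
        self.parse_e()
        if self.t != "eof":
            raise SyntaxError(f"unexpected token {self.t!r}")

    def parse_e(self):  # E -> T G, G -> eps | A T G, written as a loop
        self.parse_t()
        while self.t in ("+", "-"):
            self.advance()
            self.parse_t()

    def parse_t(self):  # T -> F U, U -> eps | M F U, written as a loop
        self.parse_f()
        while self.t in ("*", "/"):
            self.advance()
            self.parse_f()

    def parse_f(self):  # F -> <ident> | <num> | ( E ) | - F
        if self.t in ("ident", "num"):
            self.advance()
        elif self.t == "(":
            self.advance()
            self.parse_e()
            if self.t != ")":
                raise SyntaxError("expected ')'")
            self.advance()
        elif self.t == "-":
            self.advance()
            self.parse_f()
        else:
            raise SyntaxError(f"unexpected token {self.t!r}")


def accepts(text):
    """True iff text is a well-formed expression."""
    try:
        Recognizer(text).parse()
        return True
    except SyntaxError:
        return False
```

For instance, accepts("-x * 2 + 128 * (y - z / 3)") holds, while accepts("x +") and accepts("(x") do not.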



The implementation described above just finds out whether the given input corresponds to some word that can be generated by the given grammar.

If this is the case, it reads the whole input and finishes successfully.

If it is not the case, function error() is called.

In a real implementation, it is useful to provide function error() with an error message describing the kind of error together with information about the position in the input where the error occurred (e.g., the line and column where the currently processed token starts).

Function error() can use this information to create error messages that are displayed to the user.



Typically, we do not want to use syntactic analysis just to check that the input is correct but also to create an abstract syntax tree or to perform some other actions connected with the individual rules of the grammar.

The previously presented code can be used as a base that can be extended with other actions such as construction of an abstract syntax tree, modification of the expressions that have been read, and possibly some other types of computation.

When the functions that correspond to nonterminals should create the corresponding abstract syntax tree, they can return the constructed subtree, corresponding to the part of the expression generated from the given nonterminal, as a return value.



Construction of an abstract syntax tree:

An enumerated type representing binary arithmetic operations:
enum Bin_op { Add, Sub, Mul, Div }

An enumerated type representing unary arithmetic operations:
enum Un_op { Un_minus }

Functions for creating different kinds of nodes of an abstract syntax tree:

mk-var(ident): creates a leaf representing a variable

mk-num(num): creates a leaf representing a number constant

mk-unary(op, e): creates a node with one child e, to which a unary operation op (of type Un_op) is applied

mk-binary(op, e1, e2): creates a node with two children e1 and e2, to which a binary operation op (of type Bin_op) is applied



S → E 〈eof〉

Parse ():
    init-scanner()
    t := next-token()
    e := Parse-E()
    if t ≠ T_EOF then error()
    return e



E → T G

G → ε | A T G

Parse-E ():
    e1 := Parse-T()
    while t ∈ {T_Plus, T_Minus} do
        op := Parse-A()
        e2 := Parse-T()
        e1 := mk-binary(op, e1, e2)
    return e1



A → + | -

Parse-A ():
    switch t do
        case T_Plus do
            t := next-token()
            return Add
        case T_Minus do
            t := next-token()
            return Sub




T → F U

U → ε | M F U

Parse-T ():
    e1 := Parse-F()
    while t ∈ {T_Star, T_Slash} do
        op := Parse-M()
        e2 := Parse-F()
        e1 := mk-binary(op, e1, e2)
    return e1



M → * | /

Parse-M ():
    switch t do
        case T_Star do
            t := next-token()
            return Mul
        case T_Slash do
            t := next-token()
            return Div



F → 〈ident〉 | 〈num〉 | (E) | -F

Parse-F ():
    switch t do
        case T_Ident do
            e := mk-var(last-ident)
            t := next-token()
            return e
        case T_Number do
            e := mk-num(last-num)
            t := next-token()
            return e
        case T_LParen do
            t := next-token()
            e := Parse-E()
            if t ≠ T_RParen then error()
            t := next-token()
            return e
        case T_Minus do
            t := next-token()
            e := Parse-F()
            return mk-unary(Un_minus, e)
        otherwise do error()
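The tree-building parser can be sketched in Python as follows. This is a sketch under assumptions, not the original implementation: the names tokenize and parse are hypothetical, AST nodes are represented as plain tuples instead of calls to mk-var, mk-num, mk-unary and mk-binary, and a regular expression stands in for the hand-written scanner.

```python
import re

TOKEN_RE = re.compile(
    r"[ \t]*(?:(?P<num>[0-9]+)|(?P<ident>[A-Za-z_][A-Za-z0-9_]*)|(?P<op>[-+*/()]))"
)
BIN_OPS = {"+": "Add", "-": "Sub", "*": "Mul", "/": "Div"}


def tokenize(text):
    """Return (kind, value) pairs; kind is 'num', 'ident', an operator or 'eof'."""
    tokens, pos, text = [], 0, text.rstrip(" \t")
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        value = m.group(m.lastgroup)
        kind = value if m.lastgroup == "op" else m.lastgroup
        tokens.append((kind, value))
        pos = m.end()
    return tokens + [("eof", None)]


def parse(text):
    """Return the abstract syntax tree of text as nested tuples."""
    tokens = tokenize(text)
    pos = 0

    def kind():
        return tokens[pos][0]

    def take():
        # Return the value of the current token and move to the next one.
        nonlocal pos
        value = tokens[pos][1]
        pos += 1
        return value

    def parse_e():  # E -> T G, G -> eps | A T G
        e1 = parse_t()
        while kind() in ("+", "-"):
            op = BIN_OPS[take()]
            e2 = parse_t()
            e1 = ("binary", op, e1, e2)
        return e1

    def parse_t():  # T -> F U, U -> eps | M F U
        e1 = parse_f()
        while kind() in ("*", "/"):
            op = BIN_OPS[take()]
            e2 = parse_f()
            e1 = ("binary", op, e1, e2)
        return e1

    def parse_f():  # F -> <ident> | <num> | ( E ) | - F
        if kind() == "ident":
            return ("var", take())
        if kind() == "num":
            return ("num", int(take()))
        if kind() == "(":
            take()
            e = parse_e()
            if kind() != ")":
                raise SyntaxError("expected ')'")
            take()
            return e
        if kind() == "-":
            take()
            return ("unary", "Un_minus", parse_f())
        raise SyntaxError(f"unexpected token {kind()!r}")

    e = parse_e()
    if kind() != "eof":
        raise SyntaxError(f"unexpected token {kind()!r}")
    return e
```

Because the loops in parse_e and parse_t fold the operands from the left, an input such as "a - b - c" produces a tree that groups as (a - b) - c.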



Reduction of a Context-Free Grammar

Definition

A context-free grammar G = (Π, Σ, S, P) is reduced if for every A ∈ Π:

there are some u, v ∈ Σ∗ such that S ⇒∗ uAv, and

there is some w ∈ Σ∗ such that A ⇒∗ w.

Remark: Obviously, if S ⇒∗ uAv and A ⇒∗ w where u, v, w ∈ Σ∗, then S ⇒∗ uwv, and so A is used in some derivation of a word from Σ∗.

On the other hand, if A is used in some derivation S ⇒∗ z of a word z ∈ Σ∗, then z can be divided into parts u, w, v such that z = uwv, S ⇒∗ uAv, and A ⇒∗ w.



Obviously, every A ∈ Π with the property that

there are no u, v ∈ Σ∗ such that S ⇒∗ uAv, or

there is no w ∈ Σ∗ such that A ⇒∗ w,

can be safely removed from the grammar (together with all rules where it occurs) without affecting the generated language.



An algorithm that, for a given CFG G, constructs an equivalent reduced grammar:

1 Construct the set T of all nonterminals that can generate a terminal word:

T = { A ∈ Π | (∃w ∈ Σ∗)(A ⇒∗ w) }

2 Remove from G all nonterminals from the set Π − T together with all rules where they occur. Denote the resulting grammar G′ = (Π′, Σ, S, P′).

3 Construct the set D of all nonterminals that can be “reached” from the initial nonterminal S:

D = { A ∈ Π′ | (∃α, β ∈ (Π′ ∪ Σ)∗)(S ⇒∗ αAβ) }

4 Remove from G′ all nonterminals from the set Π′ − D together with all rules where they occur. The resulting grammar G′′ is the result of the whole algorithm.
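The four steps above can be sketched in Python as follows. The representation is an assumption made for this sketch: a grammar is a dictionary mapping each nonterminal (an uppercase letter) to a list of right-hand sides written as strings, with the empty string standing for ε; the function names are hypothetical.

```python
def generating(rules):
    """Step 1: the set T of nonterminals that generate some terminal word."""
    t = set()
    changed = True
    while changed:
        changed = False
        for a, rhss in rules.items():
            if a in t:
                continue
            # A is generating if some right-hand side consists only of
            # terminals and nonterminals already known to be generating.
            if any(all(not s.isupper() or s in t for s in rhs) for rhs in rhss):
                t.add(a)
                changed = True
    return t


def reachable(rules, start):
    """Step 3: the set D of nonterminals reachable from the start symbol."""
    d, todo = {start}, [start]
    while todo:
        a = todo.pop()
        for rhs in rules.get(a, []):
            for s in rhs:
                if s.isupper() and s not in d:
                    d.add(s)
                    todo.append(s)
    return d


def restrict(rules, keep):
    """Steps 2 and 4: drop nonterminals outside keep and all rules using them."""
    return {
        a: [rhs for rhs in rhss if all(not s.isupper() or s in keep for s in rhs)]
        for a, rhss in rules.items()
        if a in keep
    }


def reduce_grammar(rules, start):
    g1 = restrict(rules, generating(rules))
    return restrict(g1, reachable(g1, start))


example = {
    "S": ["AC", "B"],
    "A": ["aC", "AbA"],
    "B": ["Ba", "BbA", "DB"],
    "C": ["aa", "aBC"],
    "D": ["aA", ""],
}
reduced = reduce_grammar(example, "S")
```

The order of the two phases matters: removing non-generating nonterminals first can make further nonterminals unreachable, which the second phase then removes.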



Example:

S → AC | B

A → aC | AbA

B → Ba | BbA | DB

C → aa | aBC

D → aA | ε

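T0 = {C, D}

T1 = {C, D, A}

T2 = {C, D, A, S}

T = {C, D, A, S}

After removing B, which is not in T (step 2):

S → AC

A → aC | AbA

C → aa

D → aA | ε

D0 = {S}

D1 = {S, A, C}

D = {S, A, C}

After removing D, which is not reachable from S (step 4), the reduced grammar is:

S → AC

A → aC | AbA

C → aa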


Some Properties of Context-free Grammars

Let us assume we have a context-free grammar G = (Π, Σ, S, P).

We can easily construct algorithms for the following problems dealing with properties of the context-free grammar G:

To find out, for a given α ∈ (Π ∪ Σ)∗, whether α ⇒∗ ε.

To find, for a given α ∈ (Π ∪ Σ)∗, the set first(α), where

first(α) = { a ∈ Σ | α ⇒∗ aβ for some β ∈ (Π ∪ Σ)∗ }

To find, for a given α ∈ (Π ∪ Σ)∗, the set last(α), where

last(α) = { a ∈ Σ | α ⇒∗ βa for some β ∈ (Π ∪ Σ)∗ }



To find, for a given nonterminal A ∈ Π, the set follow(A), where

follow(A) = { a ∈ Σ | S ⇒∗ β1Aaβ2 for some β1, β2 ∈ (Π ∪ Σ)∗ }

To find all nonterminals A ∈ Π for which grammar G contains left recursion, i.e., those for which

A ⇒+ Aα for some α ∈ (Π ∪ Σ)∗

To find all nonterminals A ∈ Π for which grammar G contains right recursion, i.e., those for which

A ⇒+ αA for some α ∈ (Π ∪ Σ)∗

Remark: The notation α ⇒+ β, where α, β ∈ (Π ∪ Σ)∗, denotes that α can be rewritten to β (i.e., α ⇒∗ β) by a derivation with a nonzero number of steps.
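As an illustration of the first two problems, the following Python sketch computes the nonterminals that can be rewritten to ε and the sets first(α) by fixed-point iteration. The representation is an assumption of this sketch: a grammar is a dictionary mapping each nonterminal to a list of right-hand sides, each a list of symbols, and a symbol counts as a nonterminal exactly when it is a key of the dictionary. The function names are hypothetical.

```python
def nullable_set(rules):
    """All nonterminals A with A =>* eps."""
    e = set()
    changed = True
    while changed:
        changed = False
        for a, rhss in rules.items():
            # A is nullable if all symbols of some right-hand side are nullable.
            if a not in e and any(all(s in e for s in rhs) for rhs in rhss):
                e.add(a)
                changed = True
    return e


def first_sets(rules):
    """first(A) for every nonterminal A."""
    nullable = nullable_set(rules)
    first = {a: set() for a in rules}
    changed = True
    while changed:
        changed = False
        for a, rhss in rules.items():
            for rhs in rhss:
                for s in rhs:
                    new = first[s] if s in rules else {s}
                    if not new <= first[a]:
                        first[a] |= new
                        changed = True
                    if s not in nullable:
                        break  # later symbols cannot come first
    return first


def first_of(alpha, rules, first, nullable):
    """first(alpha) for a sequence alpha of terminals and nonterminals."""
    result = set()
    for s in alpha:
        result |= first[s] if s in rules else {s}
        if s not in nullable:
            break
    return result


# The expression grammar with the rules E -> T G, G -> eps | A T G, etc.
expr = {
    "E": [["T", "G"]],
    "G": [[], ["A", "T", "G"]],
    "A": [["+"], ["-"]],
    "T": [["F", "U"]],
    "U": [[], ["M", "F", "U"]],
    "M": [["*"], ["/"]],
    "F": [["ident"], ["num"], ["(", "E", ")"], ["-", "F"]],
}
nullable = nullable_set(expr)
first = first_sets(expr)
```

For this grammar only G and U can be rewritten to ε, and first(E) consists of the four tokens with which a factor can start.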



To be able to use a given context-free grammar G for a straightforward implementation of recursive descent, it must have some particular properties:

It must not contain left recursion.

For each nonterminal A ∈ Π and all rules with A on the left-hand side, i.e.,

A → α1 | α2 | · · · | αn

the sets first(α1), first(α2), ..., first(αn) must be pairwise disjoint.

For every nonterminal A ∈ Π and all rules A → α1 | α2 | · · · | αn there can be at most one right-hand side αi such that αi ⇒∗ ε.

If there is such a right-hand side (and so A ⇒∗ ε), the sets first(α1), first(α2), ..., first(αn) must be disjoint from the set follow(A).


Removing Epsilon-rules

Rules of the form A → ε are called epsilon-rules (ε-rules).

Proposition

For every context-free grammar G there is a context-free grammar G′ without ε-rules such that L(G′) = L(G) − {ε}.

Proof: Construct the set E of all nonterminals that can be rewritten to ε, i.e.,

E = { A ∈ Π | A ⇒∗ ε }

Remove all ε-rules and replace every other rule A → α with the set of all rules of the form A → α′, where α′ ≠ ε is obtained from α by omitting some (possibly none, possibly all) occurrences of nonterminals from E.



Example:

S → ASA | aBC | b

A → BD | aAB

B → bB | ε

C → AaA | b

D → AD | BBB | a

E0 = {B}

E1 = {B, D}

E2 = {B, D, A}

E = {B, D, A}

S → ASA | SA | AS | S | aBC | aC | b

A → BD | B | D | aAB | aB | aA | a

B → bB | b

C → AaA | aA | Aa | a | b

D → AD | D | A | BBB | BB | B | a
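The construction from the proof can be sketched in Python and checked against this example. The representation is an assumption of the sketch: right-hand sides are strings of one-character symbols, the empty string stands for ε, and the function names are hypothetical.

```python
from itertools import product


def nullable_set(rules):
    """The set E of all nonterminals A with A =>* eps."""
    e = set()
    changed = True
    while changed:
        changed = False
        for a, rhss in rules.items():
            if a not in e and any(all(s in e for s in rhs) for rhs in rhss):
                e.add(a)
                changed = True
    return e


def remove_epsilon_rules(rules):
    e = nullable_set(rules)
    result = {}
    for a, rhss in rules.items():
        new_rhss = []
        for rhs in rhss:
            # Each occurrence of a nonterminal from E is either kept or omitted.
            choices = [(s, "") if s in e else (s,) for s in rhs]
            for combo in product(*choices):
                candidate = "".join(combo)
                # Skip eps itself and right-hand sides produced more than once.
                if candidate and candidate not in new_rhss:
                    new_rhss.append(candidate)
        result[a] = new_rhss
    return result


example = {
    "S": ["ASA", "aBC", "b"],
    "A": ["BD", "aAB"],
    "B": ["bB", ""],
    "C": ["AaA", "b"],
    "D": ["AD", "BBB", "a"],
}
without_eps = remove_epsilon_rules(example)
```

Note that the construction can create rules such as S → S; they are harmless here and disappear when unit rules are removed.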



For every context-free grammar G = (Π, Σ, S, P) there is a context-free grammar G′ = (Π′, Σ, S′, P′) such that L(G′) = L(G) and either:

G′ does not contain ε-rules, or

the only ε-rule in G′ is the rule S′ → ε and S′ does not occur on the right-hand side of any rule in G′.


Removing Unit-rules

Rules of the form A → B where A,B ∈ Π are called unit rules.

Proposition

For every context-free grammar G there is a context-free grammar G′ without ε-rules and without unit rules such that L(G′) = L(G) − {ε}.

Proof: Assume G = (Π, Σ, S, P) does not contain ε-rules.

For each A ∈ Π compute the set N_A of all nonterminals that can be obtained from A by using only unit rules, i.e.,

N_A = { B ∈ Π | A ⇒∗ B }

Construct the CFG G′ = (Π, Σ, S, P′) where P′ consists of the rules of the form A → β where A ∈ Π, β is not a single nonterminal, and (B → β) ∈ P for some B ∈ N_A.


Example:

S → AB | C

A → a | bA

B → C | b

C → D | AA | AaA

D → B | ABb

N_S^0 = {S}
N_S^1 = {S, C}
N_S^2 = {S, C, D}
N_S^3 = {S, C, D, B}

N_A^0 = {A}

N_B^0 = {B}
N_B^1 = {B, C}
N_B^2 = {B, C, D}

N_C^0 = {C}
N_C^1 = {C, D}
N_C^2 = {C, D, B}

N_D^0 = {D}
N_D^1 = {D, B}
N_D^2 = {D, B, C}

N_S = {S, C, D, B}

N_A = {A}

N_B = {B, C, D}

N_C = {C, D, B}

N_D = {D, B, C}

S → AB | AA | AaA | ABb | b

A → a | bA

B → b | AA | AaA | ABb

C → AA | AaA | ABb | b

D → ABb | b | AA | AaA
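The construction can be sketched in Python and checked against this example. The representation is an assumption of the sketch: every nonterminal is a key of the dictionary, right-hand sides are strings, and a right-hand side is a unit rule exactly when it is itself a key; the function names are hypothetical.

```python
def unit_closure(rules, a):
    """The set N_A of nonterminals reachable from a by unit rules only."""
    n, todo = {a}, [a]
    while todo:
        b = todo.pop()
        for rhs in rules[b]:
            if rhs in rules and rhs not in n:  # rhs is a single nonterminal
                n.add(rhs)
                todo.append(rhs)
    return n


def remove_unit_rules(rules):
    result = {}
    for a in rules:
        new_rhss = []
        for b in sorted(unit_closure(rules, a)):
            for rhs in rules[b]:
                # Copy every non-unit rule B -> rhs with B in N_A to A.
                if rhs not in rules and rhs not in new_rhss:
                    new_rhss.append(rhs)
        result[a] = new_rhss
    return result


example = {
    "S": ["AB", "C"],
    "A": ["a", "bA"],
    "B": ["C", "b"],
    "C": ["D", "AA", "AaA"],
    "D": ["B", "ABb"],
}
without_units = remove_unit_rules(example)
```

In the result, B, C and D receive the same right-hand sides, because each of them can reach the other two by unit rules.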


Chomsky Normal Form

Definition

A context-free grammar is in Chomsky normal form if every rule is of one of the following forms:

A → BC

A → a

where a is any terminal and A, B, and C are any nonterminals.

In addition we permit the rule S → ε, where S is the initial nonterminal. In that case, nonterminal S cannot occur on the right-hand side of any rule.



Proposition

For every context-free grammar G there is an equivalent context-free grammar G′ in Chomsky normal form.

Proof: Perform the following transformations on G:

1 Decompose each rule A → α where |α| ≥ 3 into a sequence of rules where each right-hand side has length 2.

2 Remove ε-rules.

3 Remove unit rules.

4 For each terminal a occurring on the right-hand side of some rule A → α where |α| = 2, introduce a new nonterminal N_a, replace the occurrences of a on such right-hand sides with N_a, and add N_a → a as a new rule.


Example:

S → ASA | aB

A → B | S

B → b | ε

Step 1:

S → AZ | aB

Z → SA

A → B | S

B → b | ε

Step 2:

E = {B ,A}

S0 → S

S → AZ | Z | aB | a

Z → SA | S

A → B | S

B → b

Step 3:

N_S0 = {S0, S, Z}

N_S = {S, Z}

N_Z = {Z, S}

N_A = {A, B, S, Z}

N_B = {B}

S0 → AZ | aB | a | SA

S → AZ | aB | a | SA

Z → SA | AZ | aB | a

A → b | AZ | aB | a | SA

B → b

Step 4:

S0 → AZ | YB | a | SA

S → AZ | YB | a | SA

Z → SA | AZ | YB | a

A → b | AZ | YB | a | SA

B → b

Y → a



A grammar G = (Π, Σ, S, P) in Chomsky normal form has properties that make it possible to decide whether a word w ∈ Σ∗ belongs to the language generated by G (i.e., whether w ∈ L(G)):

Let us assume that w ∈ L(G) (and so S ⇒∗ w) and that |w| = n, where n ≥ 1. Then for every derivation S ⇒∗ w the following holds:

The rules of the form A → a (i.e., a nonterminal is rewritten to exactly one terminal) are used in exactly n steps of the derivation.

The rules of the form A → BC (i.e., a nonterminal is rewritten to a pair of nonterminals) are used in exactly n − 1 steps of the derivation.

So every derivation S ⇒∗ w, where |w| = n, has 2n − 1 steps, where n of these steps use rules of the form A → a and n − 1 use rules of the form A → BC.



To find out whether S ⇒∗ w, it is sufficient to try by brute force all possible derivations of length 2n − 1.

Such an algorithm has exponential time complexity with respect to the length of w.

This systematic trying of all possibilities can be implemented using so-called dynamic programming in a way that is much more efficient than a straightforward algorithm that generates all derivations of the given length.

The Cocke-Younger-Kasami algorithm, with time complexity O(n^3), is based on this idea (assuming a fixed grammar G).


Cocke-Younger-Kasami Algorithm

The question whether S ⇒∗ w is a special case of the question whether

A ⇒∗ w,

where A ∈ Π is an arbitrary nonterminal and w ∈ Σ∗ is an arbitrary word consisting of terminals.

It is obvious that:

If |w| = 1: Then A ⇒∗ w iff there is a rule A → b in P where w = b.

If |w| > 1: Then A ⇒∗ w iff there is a rule A → BC in P where for some words u and v such that w = uv, |u| ≥ 1 and |v| ≥ 1, it holds that B ⇒∗ u and C ⇒∗ v.



Let us assume that we have a word w ∈ Σ∗ with |w| = n, where n ≥ 1 and

w = a_1 a_2 · · · a_n.

Instead of solving the original question whether S ⇒∗ w, we will solve the following more general problem for all nonempty subwords v of the word w:

To find the set of all nonterminals A from the set Π such that A ⇒∗ v.

Let us denote the set of all nonterminals generating the subword v of length i starting at position j as F[i][j], i.e., for each A ∈ Π it holds that

A ∈ F[i][j] ⇐⇒ A ⇒∗ a_j a_{j+1} ... a_{j+(i−1)}

To find out whether S ⇒∗ w is therefore the same problem as to find out whether S ∈ F[n][1].


The algorithm computes the values F[i][j] first for subwords of length 1 (i.e., i = 1), then for subwords of length 2 (i.e., i = 2), then for subwords of length 3, length 4, etc.

The values F[i][j] are stored in a two-dimensional array F, where 1 ≤ i ≤ n and 1 ≤ j ≤ n − i + 1, and where the elements of this array are subsets of the set of nonterminals Π.

In the computation of the value F[i][j], the previously computed values F[i′][j′], where i′ < i, are used.

Let us assume that at the beginning all elements of array F are initialized to ∅.


Cocke-Younger-Kasami Algorithm

for j := 1 to n do
    for each (A → b) ∈ P do
        if b = a_j then
            add A to F[1][j]

for i := 2 to n do
    for j := 1 to n − i + 1 do
        for k := 1 to i − 1 do
            for each (A → BC) ∈ P do
                if B ∈ F[k][j] and C ∈ F[i − k][j + k] then
                    add A to F[i][j]
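The pseudocode can be transcribed into Python almost line by line. The sketch below makes some assumptions: the function name cyk and the split of the rules into the lists unary (pairs for rules A → b) and binary (triples for rules A → BC) are hypothetical, and the empty word is rejected outright, so a rule S → ε would have to be handled separately. The grammar used for illustration is the Chomsky normal form grammar obtained in Step 4 of the earlier example.

```python
def cyk(word, unary, binary, start):
    """Decide whether start =>* word for a grammar in Chomsky normal form."""
    n = len(word)
    if n == 0:
        return False  # a rule S -> eps would need separate handling
    # f[i][j]: nonterminals generating the subword of length i that starts
    # at position j (positions are 1-based, as in the pseudocode).
    f = [[set() for _ in range(n + 2)] for _ in range(n + 1)]
    for j in range(1, n + 1):
        for a, b in unary:
            if b == word[j - 1]:
                f[1][j].add(a)
    for i in range(2, n + 1):
        for j in range(1, n - i + 2):
            for k in range(1, i):
                for a, b, c in binary:
                    if b in f[k][j] and c in f[i - k][j + k]:
                        f[i][j].add(a)
    return start in f[n][1]


# Rules A -> b as pairs (A, b).
unary = [("S0", "a"), ("S", "a"), ("Z", "a"), ("A", "a"),
         ("A", "b"), ("B", "b"), ("Y", "a")]
# Rules A -> BC as triples (A, B, C).
binary = [
    ("S0", "A", "Z"), ("S0", "Y", "B"), ("S0", "S", "A"),
    ("S", "A", "Z"), ("S", "Y", "B"), ("S", "S", "A"),
    ("Z", "S", "A"), ("Z", "A", "Z"), ("Z", "Y", "B"),
    ("A", "A", "Z"), ("A", "Y", "B"), ("A", "S", "A"),
]
```

The three nested loops over i, j and k give the O(n^3) bound for a fixed grammar; the innermost loop over the rules contributes only a constant factor.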

