lexical and syntax analysis · lexical and syntax analysis top-down parsing . data structure easy...

Lexical and Syntax Analysis

Top-Down Parsing

[Diagram: String of characters (easy for humans to write and understand) → Lexemes identified → String of tokens → Data structure (easy for programs to transform)]

Syntax

A syntax is a set of rules defining the valid strings of a language, often specified by a context-free grammar.

For example, a grammar E for arithmetic expressions:

e → x | y | e + e | e – e | e * e | ( e )

Derivations

A derivation is a proof that some string conforms to a grammar.

A leftmost derivation:

e ⇒ e + e ⇒ x + e ⇒ x + ( e ) ⇒ x + ( e * e ) ⇒ x + ( y * e ) ⇒ x + ( y * x )

Derivations

A rightmost derivation:

e ⇒ e + e ⇒ e + ( e ) ⇒ e + ( e * e ) ⇒ e + ( e * x ) ⇒ e + ( y * x ) ⇒ x + ( y * x )

Many ways to derive the same string: many ways to write the same proof.

Parse tree: motivation

Also a proof that a given input is valid according to the grammar. But a parse tree:

is more concise: we don’t write out the sentence every time a non-terminal is expanded.

abstracts over the order in which rules are applied.

Parse tree: intuition

If non-terminal n has a production

n → X Y Z

where X, Y, and Z are terminals or non-terminals, then a parse tree may have an interior node labelled n with three children labelled X, Y, and Z.

n

X Y Z

Parse tree: definition

A parse tree is a tree in which:

the root is labelled by the start symbol;

each leaf is labelled by a terminal symbol, or 𝜀;

each interior node is labelled by a non-terminal;

if n is a non-terminal labelling an interior node whose children are X1, X2, ⋯, Xk then there must exist a production n → X1 X2 ⋯ Xk.

Example 1

Example input string:

x + y * x

A resulting parse tree according to grammar E:

      e
    / | \
   e  +  e
   |   / | \
   x  e  *  e
      |     |
      y     x

Example 2

The following is not a parse tree according to grammar E.

      e
    / | \
   x  +  e
       / | \
      e  *  e
      |     |
      y     x

Why? Because e → x + e is not a production in grammar E.

Grammar notation

Non-terminals are underlined.

Rather than writing

e → x
e → e + e

we may write:

e → x | e + e

(Also, symbols → and ::= will be used interchangeably.)

Syntax Analysis

String of symbols

Parse tree

A parse tree is:

1. A proof that a given input is valid according to the grammar;

2. A data structure that is convenient for compilers to process.

(Syntax analysis may also report that the input string is invalid.)

Ambiguity

If some string has more than one parse tree then the grammar is ambiguous. For example, the string x+y*x has two parse trees:

      e
    / | \
   e  +  e
   |   / | \
   x  e  *  e
      |     |
      y     x

         e
       / | \
      e  *  e
    / | \   |
   e  +  e  x
   |     |
   x     y

Operator precedence

Different parse trees often have different meanings, so we usually want unambiguous grammars.

Conventionally, * has a higher precedence (binds tighter) than +, so there is only one interpretation of x+y*x, namely x+(y*x).

Operator associativity

Binary operators are either:

left-associative;

right-associative;

non-associative.

Even with precedence rules, ambiguity remains, e.g. x-x-x-x.

Conventionally, - is left-associative, so there is only one interpretation of x-x-x-x, namely ((x-x)-x)-x.

Ambiguity removal

Example input:

e → x | y | e + e | e – e | e * e | ( e )

All operators are left-associative, and * binds tighter than + and –.

Ambiguity removal

Example output:

e → e + e1
  | e – e1
  | e1

e1 → e1 * e2
   | e2

e2 → ( e ) | x | y

Note: ignoring bracketed expressions, e1 disallows + and –, and e2 disallows +, – and *.

Disallowed parse trees

After disambiguation, there are no parse trees corresponding to the following originals:

         e
       / | \
      e  *  e
    / | \   |
   e  +  e  x
   |     |
   x     y

LHS of * cannot contain a +.

      e
    / | \
   e  +  e
   |   / | \
   x  e  -  e
      |     |
      y     x

RHS of + cannot contain a -.

Ambiguity removal: step-by-step

Given a non-terminal e which involves operators at n levels of precedence:

Step 1: introduce n+1 new non-terminals, e0 ⋯ en.

Let op denote an operator with precedence i.

Step 2a: replace each production

e → e op e

with

ei → ei op ei+1
   | ei+1

if op is left-associative, or

ei → ei+1 op ei
   | ei+1

if op is right-associative.

Step 2b: replace each production

e → op e

with

ei → op ei

| ei+1

Step 2c: replace each production

e → e op

with

ei → ei op

| ei+1

Construct the precedence table:

Operator   Precedence
+, -       0
*          1

Grammar E after step 2 becomes:

e0 → e0 + e1
   | e0 – e1
   | e1

e1 → e1 * e2
   | e2

e → ( e ) | x | y

Step 3: replace each production

e → ⋯

with

en → ⋯

After step 3:

e0 → e0 + e1
   | e0 – e1
   | e1

e1 → e1 * e2
   | e2

e2 → ( e ) | x | y

Step 4: replace all occurrences of e0 with e.

After step 4:

e → e + e1
  | e – e1
  | e1

e1 → e1 * e2
   | e2

e2 → ( e ) | x | y

Exercise 1

Consider the following ambiguous grammar for logical propositions.

p → 0        (Zero)
  | 1        (One)
  | ~ p      (Negation)
  | p + p    (Disjunction)
  | p * p    (Conjunction)

Now let + and * be right-associative, and let the operators in increasing order of binding strength be: +, *, ~.

Give an unambiguous grammar for logical propositions.

Exercise 2

Which of the following grammars are ambiguous?

s → if b then s | if b then s else s | skip

e → + e e | – e e | x

b → 0 b 1 | 0 1

Homework exercise

Consider the following ambiguous grammar G.

s → if b then s | if b then s else s | skip

Give an unambiguous grammar that accepts the same language as G.

Summary so far

The syntax of a language is often specified by a context-free grammar.

Derivations and parse trees are proofs.

Parse trees lead to a concise definition of ambiguity.

Construction of unambiguous grammars using rules of precedence and associativity.

PART 2: TOP-DOWN PARSING

• Recursive-Descent

• Backtracking

• Left-Factoring

• Predictive Parsing

• Left-Recursion Removal

• First and Follow Sets

• Parsing tables and LL(1)

Top-down parsing

Top-down: begin with the start symbol and expand non-terminals, succeeding when the input string is matched.

A good strategy for writing parsers:

1. Implement a syntax checker to accept or refute input strings.

2. Modify the checker to construct a parse tree – straightforward.

RECURSIVE DESCENT

A popular top-down parsing technique.

Recursive descent

A recursive descent parser consists of a set of functions, one for each non-terminal.

The function for non-terminal n returns true if some prefix of the input string can be derived from n, and false otherwise.

Consuming the input

We assume a global variable next points to the input string.

char* next;

Consume c from input if possible.

int eat(char c) {
  if (*next == c) { next++; return 1; }
  return 0;
}

Recursive descent

For each non-terminal N, introduce:

int N() {
  char* save = next;

  for each N → X1 X2 ⋯ Xn
    if (parse(X1) && parse(X2) && ⋯ && parse(Xn)) return 1;
    else next = save;   /* backtrack */

  return 0;
}

Let parse(X) denote

X() if X is a non-terminal

eat(X) if X is a terminal

Exercise 4

Consider the following grammar G with start symbol e.

e → ( e + e ) | ( e * e ) | v
v → x | y

Using recursive descent, write a syntax checker for grammar G.

Answer (part 1)

int e() {
  char* save = next;

  if (eat('(') && e() && eat('+') && e() && eat(')')) return 1;
  else next = save;

  if (eat('(') && e() && eat('*') && e() && eat(')')) return 1;
  else next = save;

  if (v()) return 1;
  else next = save;

  return 0;
}

Answer (part 2)

int v() {
  char* save = next;

  if (eat('x')) return 1;
  else next = save;

  if (eat('y')) return 1;
  else next = save;

  return 0;
}

Exercise 5

How many function calls are made by the recursive descent parser to parse the following strings?

(x*x)

((x*x)*x)

(((x*x)*x)*x)

(See animation of backtracking.)

Answer

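Input string     Length   Calls
(x*x)            5        21
((x*x)*x)        9        53
(((x*x)*x)*x)    13       117

Number of calls is exponential in the nesting depth, and hence in the length, of these input strings: each extra level of nesting roughly doubles the count (21, 53, 117).

Lesson: backtracking is expensive!

[Plot: function calls against string length]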

LEFT FACTORING

Reducing backtracking!

Left factoring

When two productions for a non-terminal share a common prefix, expensive backtracking can be avoided by left-factoring the grammar.

Idea: Introduce a new non-terminal that accepts each of the different suffixes.

Example 3

Left-factoring grammar G by introducing non-terminal r:

e → ( e r | v
r → + e ) | * e )
v → x | y

The common prefix is ( e, and the different suffixes are + e ) and * e ).

Effect of left-factoring

Input string     Length   Calls
(x*x)            5        13
((x*x)*x)        9        22
(((x*x)*x)*x)    13       31

Number of calls is now linear in the length of the input string.

Lesson: left-factoring a grammar reduces backtracking.

[Plot: function calls against string length]

PREDICTIVE PARSING

Eliminating backtracking!

Predictive parsing

Idea: know which production of a non-terminal to choose based solely on the next input symbol.

Advantage: very efficient since it eliminates all backtracking.

Disadvantage: not all grammars can be parsed in this way. (But many useful ones can.)

Running example

The following grammar H will be used as a running example to demonstrate predictive parsing.

e → e + e | e * e | ( e ) | x | y

Example:

x+y*(y+x)

Removing ambiguity

Since + and * are left-associative and * binds tighter than +, we can derive an unambiguous variant of H.

e → e + t | t
t → t * f | f
f → ( e ) | x | y

Left recursion

Problem: left-recursive grammars cause recursive descent parsers to loop forever.

int e() {
  char* save = next;
  if (e() && eat('+') && t()) return 1;   /* call to self without consuming any input */
  next = save;
  if (t()) return 1;
  next = save;
  return 0;
}

Eliminating left recursion

Let 𝛼 denote any sequence of grammar symbols.

Rule 1: n → n 𝛼  ⟹  n' → 𝛼 n'

Rule 2: n → 𝛼  ⟹  n → 𝛼 n'  (where 𝛼 does not begin with n)

Rule 3: Introduce new production n' → 𝜀

Eliminating left recursion

Example before:

e → e + v | v
v → x | y

and after:

e → v e'
e' → 𝜀 | + v e'
v → x | y

Example 4

Running example, after eliminating left-recursion.

e → t e' e' → + t e' | 𝜀

t → f t' t' → * f t' | 𝜀

f → ( e ) | x | y

FIRST AND FOLLOW SETS

Predictive parsers are built using the first and follow sets of each non-terminal in a grammar.

Definition of first sets

Let 𝛼 denote any sequence of grammar symbols.

If 𝛼 can derive a string beginning with terminal a then a ∊ first(𝛼).

If 𝛼 can derive 𝜀 then 𝜀 ∊ first(𝛼).

Computing first sets

If a is a terminal then a ∊ first(a 𝛼).

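If X1X2⋯Xn is a sequence of grammar symbols, a is a terminal,

and ∃i · a ∊ first(Xi)

and ∀j < i · 𝜀 ∊ first(Xj)

then a ∊ first(X1X2⋯Xn).

If 𝜀 ∊ first(Xi) for every i then 𝜀 ∊ first(X1X2⋯Xn).

The empty string 𝜀 ∊ first(𝜀).

If n → 𝛼 is a production then first(𝛼) ⊆ first(n); first(n) is the union of first(𝛼) over all productions n → 𝛼.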

Exercise 6

Give all members of the sets:

first( v )

first( e )

first( v e )

for the grammar:

e → ( e + e ) | ( e * e ) | v
v → x | 𝜀

Exercise 7

What are the first sets for each non-terminal in the following grammar.

e → t e' e' → + t e' | 𝜀

t → f t' t' → * f t' | 𝜀

f → ( e ) | x | y

Answer

first( f ) = { '(', 'x', 'y' }
first( t' ) = { '*', 𝜀 }
first( t ) = { '(', 'x', 'y' }
first( e' ) = { '+', 𝜀 }
first( e ) = { '(', 'x', 'y' }

Definition of follow sets

Let 𝛼 and 𝛽 denote any sequence of grammar symbols.

Terminal a ∊ follow(n) if the start symbol of the grammar can derive a string of grammar symbols in which a immediately follows n.

The set follow(n) never contains 𝜀.

End markers

In predictive parsing, it is useful to mark the end of the input string with a $ symbol.

((x*x)*x)$

$ is equivalent to '\0' in C.

Computing follow sets

If s is the start symbol of the grammar then $ ∊ follow(s).

If n → 𝛼 x 𝛽, where x is a non-terminal, then everything in first(𝛽) except 𝜀 is in follow(x).

If n → 𝛼 x

or n → 𝛼 x 𝛽 and 𝜀 ∊ first(𝛽)

then everything in follow(n) is in follow(x).

Exercise

Give all members of the sets:

follow( e )

follow( v )

for the grammar:

e → ( e + e ) | ( e * e ) | v
v → x | 𝜀

Exercise 8

What are the follow sets for each non-terminal in the following grammar.

e → t e' e' → + t e' | 𝜀

t → f t' t' → * f t' | 𝜀

f → ( e ) | x | y

Answer

follow( e' ) = { $, ')' }
follow( e ) = { $, ')' }
follow( t' ) = { '+', $, ')' }
follow( t ) = { '+', $, ')' }
follow( f ) = { '*', '+', ')', $ }

Predictive parsing table

For each non-terminal n, a parse table T defines which production of n should be chosen, based on the next input symbol a.

                Terminals (next input symbol)
Non-terminal    (              +              ...
e               e → ( e r
r                              r → + e )
v

Each cell holds the production to choose.

Predictive parsing table

for each production n → 𝛼
  for each terminal a ∊ first(𝛼)
    add n → 𝛼 to T[n , a]
  if 𝜀 ∊ first(𝛼) then
    for each b ∊ follow(n)
      add n → 𝛼 to T[n , b]

Exercise 9

Construct a predictive parsing table for the following grammar.

e → t e' e' → + t e' | 𝜀

t → f t' t' → * f t' | 𝜀

f → ( e ) | x | y

LL(1) grammars

If each cell in the parse table contains at most one entry then a non-backtracking parser can be constructed and the grammar is said to be LL(1).

First L: left-to-right scanning of the input.

Second L: a leftmost derivation is constructed.

The (1): using one input symbol of look-ahead to decide which grammar production to choose.

Exercise 10

Write a syntax checker for the grammar of Exercise 9, utilising the predictive parsing table.

int e() { ... }

It should return a non-zero value if some prefix of the string pointed to by next conforms to the grammar, otherwise it should return zero.

Answer (part 1)

int e() {
  if (*next == 'x') return t() && e1();
  if (*next == 'y') return t() && e1();
  if (*next == '(') return t() && e1();
  return 0;
}

int e1() {
  if (*next == '+') return eat('+') && t() && e1();
  if (*next == ')') return 1;
  if (*next == '\0') return 1;
  return 0;
}

Answer (part 2)

int t() {
  if (*next == 'x') return f() && t1();
  if (*next == 'y') return f() && t1();
  if (*next == '(') return f() && t1();
  return 0;
}

int t1() {
  if (*next == '+') return 1;
  if (*next == '*') return eat('*') && f() && t1();
  if (*next == ')') return 1;
  if (*next == '\0') return 1;
  return 0;
}

Answer (part 3)

int f() {
  if (*next == 'x') return eat('x');
  if (*next == 'y') return eat('y');
  if (*next == '(') return eat('(') && e() && eat(')');
  return 0;
}

(Notice how backtracking is not required.)

Predictive parsing algorithm

Let s be a stack, initially containing the end marker $ with the start symbol of the grammar on top, and let next point to the input string (terminated by $).

while (top(s) != $)
  if (top(s) is a terminal) {
    if (top(s) == *next) { pop(s); next++; }
    else error();
  }
  else if (T[top(s), *next] == X → Y1⋯ Yn) {
    pop(s);
    push(s, Yn⋯ Y1) /* Y1 on top */
  }
  else error();   /* empty table cell */

Exercise 11

Give the steps that a predictive parser takes to parse the following input.

x + x * y

For each step (loop iteration), show the input stream, the stack, and the parser action.

Acknowledgements

Plus Stanford University lecture notes by Maggie Johnson and Julie Zelenski.

APPENDIX

Context-free grammars

Have four components:

1. A set of terminal symbols.

2. A set of non-terminal symbols.

3. A set of productions (or rules) of the form:

n → X1⋯ Xk

where n is a non-terminal and X1⋯Xk is any sequence of terminals, non-terminals, and 𝜀.

4. The start symbol (one of the non-terminals).

Notation

Non-terminals are underlined.

Rather than writing

e → x
e → e + e

we may write:

e → x | e + e

(Also, symbols → and ::= will be used interchangeably.)

Why context-free?

[Diagram: nested classes of grammar: Regular ⊂ Context-Free ⊂ Context-Sensitive ⊂ Unrestricted]

Nice balance between expressive power and efficiency of parsing.

Chomsky hierarchy

Grammar             Valid productions
Unrestricted        𝛼 → 𝛽
Context-Sensitive   𝛼 x γ → 𝛼 𝛽 γ
Context-Free        x → 𝛽
Regular             x → t,  x → t z,  x → 𝜀

Let t range over terminals, x and z over non-terminals, and 𝛼, 𝛽 and γ over sequences of terminals, non-terminals, and 𝜀.

Backus-Naur Form

BNF is a standard ASCII notation for specification of context-free grammars whose terminals are ASCII characters. For example:

<exp> ::= <exp> "+" <exp> | <exp> "-" <exp> | <var>
<var> ::= "x" | "y"

The BNF notation can itself be specified in BNF.
